Word sense disambiguation, a task in computational linguistics, determines which of a word's possible meanings is used in a text. It is useful for information retrieval, machine translation, text mining, and computational lexicography. Despite recent improvements driven by general advances in natural language processing and artificial intelligence, word sense disambiguation remains an open research problem. Until recently, the development of models for Slovene was limited by the lack of semantically annotated datasets. This has changed with a new Slovene dataset, as well as with multilingual models that enable cross-lingual transfer.
This thesis comprises an interdisciplinary overview of semantic disambiguation and the development of a disambiguation prediction model for Slovene. The review highlights conceptual differences in how ambiguity and disambiguation are understood across several disciplines. Compared with the standard procedures and datasets of natural language processing, psycholinguistics provides a richer and more precise typology of polysemy, while linguistic pragmatics questions the assumption that disambiguation is primarily a semantic process.
The learning task was to predict, for a target lemma, whether it is used in the same sense in a pair of sentences. Using sense-annotated Slovene and English datasets, we constructed seven training sets that differed in size (the number of examples included per sense) and in the languages included. On each training set we fine-tuned the multilingual CroSloEngual BERT transformer model.
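As an illustration of the pairwise formulation, sense-annotated examples for one lemma can be combined into sentence pairs labeled by sense equivalence. The sketch below is a minimal toy version with an invented sense inventory and English glosses for the Slovene polyseme "list" (leaf / sheet of paper); in the thesis the resulting pairs are classified by the fine-tuned model, not labeled directly as here.

```python
from itertools import combinations

# Hypothetical sense-annotated examples for one target lemma:
# (lemma, sense_id, sentence). The actual datasets are Slovene and English
# sense-annotated corpora; these rows are illustrative only.
examples = [
    ("list", "leaf",  "Jesenski list je padel z drevesa."),
    ("list", "leaf",  "Zelen list je plaval po ribniku."),
    ("list", "sheet", "Zapisala je opombo na prazen list."),
]

def make_pairs(examples):
    """Build sentence pairs with a binary same-sense label (1 = same sense)."""
    pairs = []
    for (_, sense1, sent1), (_, sense2, sent2) in combinations(examples, 2):
        pairs.append((sent1, sent2, int(sense1 == sense2)))
    return pairs

pairs = make_pairs(examples)
# Three examples yield three pairs; exactly one pair shares a sense.
```

Varying how many examples per sense enter this construction is one way to produce training sets of different sizes, mirroring the seven training sets described above.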
The highest F1 score on the test set, 81.6, was achieved with the combined English-Slovene training set. Alternative evaluations revealed that the final classification architecture played an important role in the model's success, as other models achieved higher or comparable results using transformer layers directly. Additional out-of-vocabulary testing demonstrated a negative relationship, measured by the Matthews correlation coefficient, between the number of training examples included and successful match prediction on new vocabulary. This held for all models; only the model trained on multilingual data achieved both a high F1 score on the test set and a high correlation coefficient on the out-of-vocabulary set.
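The Matthews correlation coefficient used for the out-of-vocabulary evaluation can be computed directly from the binary confusion matrix; the counts below are illustrative, not the thesis results.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 to 1; 0 corresponds to chance-level prediction,
    which makes it a stricter summary than F1 on imbalanced sets.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0

# Illustrative counts only (not the reported out-of-vocabulary results).
score = mcc(tp=40, tn=35, fp=10, fn=15)
```

A perfect classifier (no false positives or negatives) yields a coefficient of 1.0, while the example counts above give a moderate positive association.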