Throughout history, linguists have repeatedly tried to determine the key properties of language, each in accordance with a particular linguistic tradition and the distinctive characteristics of the language under scrutiny. Nevertheless, most languages still base their word class categories on ancient Greek and Latin traditions. As a result, word class systems and fundamental linguistic properties reflect historically established concepts rather than solely the structure and organization of language as documented in actual use, which can be viewed as problematic for many languages, including Slovene. The present study groups words into clusters based on their similarity in real-life language use. To keep established word categorizations from influencing the outcomes, a Slovene language corpus is analysed with unsupervised machine learning: the system receives no additional linguistic knowledge about parts of speech and groups words solely by their similarity within the corpus. Word similarity, computed from morphological, distributional-syntactic, and semantic features of individual words in the corpus, serves as the input data; different combinations of these criteria and different clustering algorithms are tested. The resulting word clusters are interpreted and compared with traditional word class categorizations. The experiments indicate that partitional clustering (i.e., k-means and k-medoids) and agglomerative hierarchical clustering using Ward’s method are the most suitable for grouping words into clusters, while DBSCAN is less appropriate. In addition, distributional-syntactic and semantic criteria prove relevant for identifying word similarity, whereas morphological criteria seem less important.
Nevertheless, the results are considered unsatisfactory, as both the optimal number of clusters and the resulting word sets are rather idiosyncratic when compared with historically established word categorizations in Slovene. This makes it difficult to explain adequately which words are similar in Slovene and which established categorization best matches groupings based on actual word use in the corpus. Although the study neither provides meaningful conclusions about word groupings nor resolves the inconsistencies in Slovene word categorizations, it nonetheless offers valuable insights that can serve as guidelines for future research.
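
As an illustration of the general approach only, the following minimal sketch clusters words by distributional similarity without any part-of-speech labels. It is not the pipeline used in the study: the toy English corpus, the neighbour-count features, the function names `context_vectors` and `kmeans`, the farthest-point initialization, and the choice of three clusters are all assumptions made for the example, and only the distributional criterion is represented (no morphological or semantic features).

```python
from collections import Counter, defaultdict
import math

# Hypothetical toy corpus (English for readability; the study uses a Slovene corpus).
CORPUS = [
    "the dog runs fast",
    "the cat runs fast",
    "the dog sleeps here",
    "the cat sleeps here",
    "a dog eats slowly",
    "a cat eats slowly",
]


def context_vectors(sentences):
    """Represent each word by length-normalized counts of its left/right neighbours."""
    contexts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            contexts[tokens[i]]["L:" + tokens[i - 1]] += 1
            contexts[tokens[i]]["R:" + tokens[i + 1]] += 1
    features = sorted({f for counts in contexts.values() for f in counts})
    vectors = {}
    for word, counts in contexts.items():
        raw = [counts[f] for f in features]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors[word] = [x / norm for x in raw]
    return vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain k-means with a deterministic farthest-point initialization."""
    words = sorted(vectors)
    centroids = [vectors[words[0]]]
    while len(centroids) < k:
        # Pick the word whose nearest existing centroid is farthest away.
        farthest = max(
            words,
            key=lambda w: min(squared_distance(vectors[w], c) for c in centroids),
        )
        centroids.append(vectors[farthest])

    assignment = {}
    for _ in range(iterations):
        new_assignment = {
            w: min(range(k), key=lambda j: squared_distance(vectors[w], centroids[j]))
            for w in words
        }
        if new_assignment == assignment:
            break  # converged: no word changed cluster
        assignment = new_assignment
        for j in range(k):
            members = [vectors[w] for w in words if assignment[w] == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


if __name__ == "__main__":
    clusters = kmeans(context_vectors(CORPUS), k=3)
    groups = defaultdict(list)
    for word, label in sorted(clusters.items()):
        groups[label].append(word)
    for label, members in sorted(groups.items()):
        print(label, members)
```

In this toy setting, words that occur in identical contexts (here "dog" and "cat") necessarily fall into the same cluster, while the grouping of the remaining words depends on the chosen number of clusters, which mirrors the sensitivity to cluster count noted above.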