Paraphrase categorization is the systematic classification of linguistic transformations that preserve the meaning of a text while changing its surface form. Automatic paraphrase categorization contributes to a better understanding of linguistic structures and improves the interpretability of natural language processing systems. In this work, we developed the first systematic approach to ontology-driven paraphrase categorization for Slovenian, based on large language models. Because Slovenian is a less-resourced language, we tested both the specialized GaMS-1B model and the multilingual LLaMA-3.1-8B model. Both are based on the transformer architecture, which currently dominates the field of natural language processing. We drew examples from four English paraphrase corpora and translated them into Slovenian, creating a training dataset of 372 annotated paraphrase pairs. The dataset can support further research and the development of models for Slovenian paraphrase categorization. We developed a two-level ontological schema with four main categories and twelve subcategories to categorize the training examples and guide model adaptation. Using this schema, we evaluated the models quantitatively with similarity and performance metrics and qualitatively through human judgment. The GaMS model achieved better results on syntactic and pragmatic paraphrasing, while LLaMA performed better on lexical and semantic paraphrasing. By analyzing the results, we identified suitable training set sizes and showed that large language models require only 6-8 examples per category for successful categorization.