We present an implementation of a Slovenian annotation pipeline in spaCy,
one of the most popular libraries for natural language processing.
We outline some of the existing tools, models and corpora. We describe spaCy
and its low-level pipeline for language annotations in detail. We implement
new models for lemmatization, part-of-speech tagging, dependency
parsing and named entity recognition for Slovenian. We generate static word
embeddings from existing and publicly available corpora. The models are
built using neural networks and the open-source library Thinc. We describe
the configuration and training of the models on two public corpora, ssj500k
(for standard Slovenian) and Janes-Tag (for nonstandard Slovenian). The
models are evaluated and compared to existing tools.