The thesis compares tools and libraries for natural language processing in the Python programming language. In addition to NLTK, the most popular natural language processing library, we thoroughly examined less well-known libraries, namely spaCy, PyNLPl, Pattern and TextBlob, and compared them against various criteria and practical tasks: tokenization, lemmatization, stemming, part-of-speech tagging, dependency tree building, searching for patterns in text, word frequency counting and n-gram building. The libraries were compared by functionality, and their methods and tools for natural language processing were analysed and described in detail. Special attention was paid to each library's ability to access different text corpora and to process the Slovenian language. The most common natural language processing methods, such as tokenization, lemmatization and part-of-speech tagging, were compared by speed and accuracy.

We then focused on sentiment analysis of Slovenian tweets and internet comments using machine learning, testing classifiers from the different libraries and comparing them by speed and accuracy. We tried to improve classifier accuracy with several methods: part-of-speech filtering, TF-IDF-based filtering, inclusion of n-grams, and removal of URLs and character repetitions in words. The machine learning approach to sentiment analysis was also compared with a lexicon-based approach.

We concluded that the PyNLPl library lacks basic natural language processing methods. NLTK is slow but well suited for learning thanks to its extensive documentation and large set of alternative methods; it is also very flexible, which makes it the only library that partially supports processing of the Slovenian language. The spaCy and TextBlob libraries are very fast and accurate but lack flexibility and functionality.
The Pattern library turned out to be average in terms of speed and accuracy, but it is quite flexible, offers many features and is easy to use.
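Two of the preprocessing steps mentioned above — removal of URLs and character repetitions, and n-gram building — can be sketched in plain Python. This is a minimal illustration of the techniques, not code from any of the compared libraries; the function names are our own.

```python
import re


def normalize_tweet(text):
    """Strip URLs and collapse runs of three or more repeated
    characters to two (e.g. 'supeeeer' -> 'supeer')."""
    text = re.sub(r"https?://\S+", "", text)    # remove URLs
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)  # collapse repetitions
    return text.strip()


def ngrams(tokens, n):
    """Build the list of n-grams from a sequence of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


cleaned = normalize_tweet("supeeeer dan http://t.co/x")
bigrams = ngrams(cleaned.split(), 2)
```

Such normalized tokens and n-grams would then be fed to a classifier as features, which is how the n-gram inclusion experiments described above operate at a high level.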