This thesis focuses on methods for processing and analysing natural language in texts. It starts with a brief description of the Matlab \emph{Text Analytics Toolbox} and then presents the basic Matlab objects for text processing. It describes the structural components of natural language and how they can be processed into a form that a computer can work with. Methods such as tokenisation, lemmatisation and stemming are presented, and their application in Matlab is demonstrated on the public domain book \emph{Crime and Punishment} by Dostoyevsky. Zipf's law and some of the main statistical properties of text corpora are then presented; their implications are discussed and illustrated by an example.

A part of the thesis focuses on geometric models. In this part, different representations of text collections are defined: the term-document matrix and the \emph{tf-idf} matrix, as well as the word co-occurrence matrix and the overlap matrix. Since the term-document matrix represents words and documents in a vector space, the basic features of representing documents in vector spaces are briefly described, together with methods for measuring the similarity between documents.

Chapter 5 analyses co-occurrences of words in paragraphs and provides the theoretical background on Shannon entropy and mutual information. Based on this theory, the pointwise mutual information for word pairs in \emph{Crime and Punishment} is calculated, and the conclusions drawn from this calculation are presented and discussed. Pearson's $\chi^2$-test for independence is also described and its usefulness in the analysis of word collocations is demonstrated.

Chapter 6 presents $n$-gram models: unigram, bigram and trigram. These models are based on the Markov assumption. The chapter further explains the estimation of parameters in $n$-gram language models and then describes discounting methods, which remove some probability mass from events observed in the training corpus and redistribute it among unseen events. The last chapter describes techniques for dimension reduction, such as vocabulary pruning and merging and dimension reduction based on the singular value decomposition.