In this paper, a language model based on the probabilities of character n-grams is applied to the collected works of two leading Slovenian twentieth-century writers, Ciril Kosmač and Ivan Cankar. During the construction of each model, a Huffman tree is generated from all the n-grams (n = 1 to 20, with a frequency of 2 or more) of each text corpus (0.4 and 2.0 million running words; 45,889,000 and 223,553,000 n-grams; 26,274,000 and 116,588,000 distinct n-grams), and the appropriate Huffman code is computed for every leaf of the tree. To apply the model to an arbitrary text, the text is cut into n-grams (n = 1 to 20) in such a way that the sum of the lengths of the model's Huffman codes over all the obtained n-grams is minimal. If the model is applied to the text from which it was generated, the resulting entropy is minimal; this entropy is also a measure of the information content of the text from the standpoint of information theory. When the model of Ciril Kosmač was applied to his texts, an entropy of 2.26 bits per character was obtained; for the model and texts of Ivan Cankar, the figure was 2.27 bits per character.
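The following is a minimal Python sketch of the procedure described above, not the authors' implementation: it builds Huffman code lengths from n-gram frequencies and then segments a text by dynamic programming so that the total code length is minimal. The helper names (build_model, entropy_per_char), the cutoff parameters, and the decision to keep all unigrams regardless of frequency (so that any text remains codable) are assumptions made for the sake of a self-contained example.

    import heapq
    from collections import Counter
    from itertools import count

    def huffman_code_lengths(freqs):
        """Return {symbol: Huffman code length in bits} for the given frequencies."""
        if len(freqs) == 1:
            return {s: 1 for s in freqs}
        tiebreak = count()  # unique keys so the heap never compares the dict payloads
        heap = [(f, next(tiebreak), {s: 0}) for s, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, d1 = heapq.heappop(heap)  # merge the two least frequent subtrees;
            f2, _, d2 = heapq.heappop(heap)  # every symbol in them gains one bit of depth
            merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
            heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
        return heap[0][2]

    def build_model(corpus, max_n=20, min_freq=2):
        """Count all character n-grams (n = 1..max_n); keep those seen >= min_freq times."""
        freqs = Counter(
            corpus[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(corpus) - n + 1)
        )
        # Assumption: keep every unigram even below min_freq, so any text stays codable.
        kept = {g: f for g, f in freqs.items() if f >= min_freq or len(g) == 1}
        return huffman_code_lengths(kept)

    def entropy_per_char(text, code_len, max_n=20):
        """Dynamic program: best[i] = minimal total code length of text[:i] over all
        segmentations into model n-grams of length 1..max_n; result in bits/char.
        Assumes every character of text occurs in the model (true for the training text)."""
        INF = float("inf")
        best = [0] + [INF] * len(text)
        for i in range(1, len(text) + 1):
            for n in range(1, min(max_n, i) + 1):
                gram = text[i - n:i]
                if gram in code_len and best[i - n] + code_len[gram] < best[i]:
                    best[i] = best[i - n] + code_len[gram]
        return best[-1] / len(text)

    corpus = "the cat sat on the mat. the cat ate the rat."
    model = build_model(corpus)
    print(entropy_per_char(corpus, model))  # entropy of the training text itself

Scoring the training corpus with its own model, as in the last line, corresponds to the minimal-entropy case mentioned above; applying the model to another author's text would generally yield a higher value.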