
Nizkoentropijski jezikovni model na besedilih Cirila Kosmača in Ivana Cankarja (A low-entropy language model applied to the texts of Ciril Kosmač and Ivan Cankar)
Jakopin, Primož (Author)

In this paper, a language model based on the frequencies of character n-grams (strings of n characters, i.e. letters, spaces, digits and punctuation marks) was applied to the texts of the collected works of Ciril Kosmač and Ivan Cankar. For each model, a Huffman tree must first be built from all n-grams (n = 1 to 20, frequency at least 2) of the respective text collection (400,000 and 2 million words, 45,889,000 and 223,553,000 n-grams, 26,274,000 and 116,588,000 distinct n-grams, respectively), and the corresponding Huffman code computed for every leaf of both trees. To apply a model to a given text, the text is cut into n-grams (n = 1 to 20) so that the sum of the lengths of the model's Huffman codes over that text is minimal. When a model is applied to the text from which it was built, the lowest entropy of that text is obtained, which is also a measure of its information content. The entropy of Ciril Kosmač's texts under his own model was 2.26 bits per character, and that of Ivan Cankar's texts under his model was 2.27 bits per character.
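The model-building step described above can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names `ngram_counts` and `huffman_code_lengths` are hypothetical, and the sketch computes only Huffman code lengths (which is all the entropy figure needs), not the code words themselves. It also keeps every n-gram in memory, which would not scale to the corpus sizes quoted in the abstract.

```python
import heapq
from collections import Counter
from itertools import count


def ngram_counts(text, max_n=20, min_freq=2):
    """Count character n-grams of length 1..max_n; keep those seen at least min_freq times."""
    counts = Counter(
        text[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(text) - n + 1)
    )
    return {gram: c for gram, c in counts.items() if c >= min_freq}


def huffman_code_lengths(freqs):
    """Return {symbol: Huffman code length in bits} for the given frequency table."""
    if len(freqs) == 1:
        # A lone symbol still needs one bit to be written down.
        return {next(iter(freqs)): 1}
    tiebreak = count()  # unique counter so heap tuples never compare the symbol lists
    heap = [(f, next(tiebreak), [sym]) for sym, f in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf below them one level deeper.
        for sym in syms1 + syms2:
            lengths[sym] += 1
        heapq.heappush(heap, (f1 + f2, next(tiebreak), syms1 + syms2))
    return lengths


# Toy usage on a short string instead of a full corpus.
model = huffman_code_lengths(ngram_counts("abracadabra abracadabra", max_n=4))
```

Frequent n-grams end up near the root of the tree and receive short codes, which is what lets long, recurring character strings be encoded cheaply.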

Keywords: entropy, information theory, language model, Slovene language, literary fiction, texts, statistical description of text, part-of-speech tagging, quantitative linguistics, Ciril Kosmač, Ivan Cankar
Typology: 1.06 - Published Scientific Conference Contribution (invited lecture)
Organization: FF - Faculty of Arts
Number of pages: pp. 421-428
PID: 20.500.12556/RUL-164980
UDC: 821.163.6.09 Kosmač C.:004.6
COBISS.SI-ID: 21472045
Publication date in RUL: 19.11.2024
Record is a part of a monograph

Title: Slovenski roman
Editors: Miran Hladnik, Gregor Kocijan
Place of publishing: Ljubljana
Publisher: Center za slovenščino kot drugi/tuji jezik pri Oddelku za slovenistiko Filozofske fakultete
COBISS.SI-ID: 125833472
Collection title: Obdobja
Collection numbering: 21
Collection ISSN: 1408-211X

Secondary language

In the paper, a language model based on probabilities of character n-grams has been applied to the collected works of two leading Slovenian twentieth-century writers, Ciril Kosmač and Ivan Cankar. During the construction of each model, a Huffman tree is generated from all the n-grams (n=1 to 20, frequency of 2 or more) of each text corpus (0.4 and 2.0 million running words, 45,889,000 and 223,553,000 n-grams, 26,274,000 and 116,588,000 different n-grams), and appropriate Huffman codes are computed for every leaf in the tree. To apply the model to an arbitrary text, the text is cut into n-grams (1–20) in such a way that the sum of the lengths of the model Huffman codes for all the obtained n-grams of the new text is minimal. If the model is applied to the text from which it was generated, the resulting entropy is minimal; this entropy is also a measure of the information content of the text from the standpoint of information theory. When the model of Ciril Kosmač was applied to his texts, an entropy of 2.26 bits per character was obtained; for the model and texts of Ivan Cankar, the entropy was 2.27 bits per character.
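The segmentation step, cutting a text into n-grams so that the total Huffman code length is minimal, can be sketched as a shortest-path dynamic program over character positions. This is only one plausible way to do it, not necessarily the method used in the paper. The function name `min_code_length` and the toy table `code_len` are hypothetical; in practice `code_len` would hold the code length in bits of every n-gram in the model.

```python
import math


def min_code_length(text, code_len, max_n=20):
    """Cut text into known n-grams (length 1..max_n) so the summed code length is minimal.

    Returns (total_bits, list_of_ngrams).
    """
    # best[i] is the fewest bits needed to encode text[:i];
    # back[i] is the length of the last n-gram in that optimal cut.
    best = [0] + [math.inf] * len(text)
    back = [0] * (len(text) + 1)
    for i in range(1, len(text) + 1):
        for n in range(1, min(max_n, i) + 1):
            piece = text[i - n:i]
            if piece in code_len and best[i - n] + code_len[piece] < best[i]:
                best[i] = best[i - n] + code_len[piece]
                back[i] = n
    if math.isinf(best[-1]):
        raise ValueError("text contains a stretch the model cannot encode")
    # Walk the back-pointers from the end to recover the chosen n-grams.
    pieces = []
    i = len(text)
    while i > 0:
        pieces.append(text[i - back[i]:i])
        i -= back[i]
    pieces.reverse()
    return best[-1], pieces


# Toy code lengths in bits (hypothetical values, not taken from the paper).
code_len = {"a": 3, "b": 3, "ab": 2, "ba": 4}
text = "abab"
bits, pieces = min_code_length(text, code_len)
bits_per_char = bits / len(text)  # the per-character figure the abstract reports
```

Dividing the minimal total by the number of characters gives a bits-per-character value of the kind quoted for the two authors. A greedy longest-match cut would not be guaranteed to reach the minimum, which is why the sketch examines every n-gram ending at each position.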

Keywords: entropy, information theory, linguistic model, Slovene language, literary texts, statistical description of text, part-of-speech tagging, quantitative linguistics, Ciril Kosmač, Ivan Cankar

