Nizkoentropijski jezikovni model na besedilih Cirila Kosmača in Ivana Cankarja

Jakopin, Primož

Repository of the University of Ljubljana

Details

Jakopin, Primož (Author)

	PDF - Presentation file (50.92 KB), MD5: 52E6D5E08F93A88015D0B824D2F077CF
	URL - Source URL: https://centerslo.si/simpozij-obdobja/zborniki/obdobja-21/

Abstract

In this paper, a language model based on the frequencies of character n-grams (strings of n characters, i.e. letters, spaces, digits and punctuation marks) was applied to the collected works of Ciril Kosmač and Ivan Cankar. For each model, a Huffman tree must first be built from all n-grams (n = 1 to 20, frequency at least 2) of the respective text collection (400,000 and 2 million words, 45,889,000 and 223,553,000 n-grams, 26,274,000 and 116,588,000 distinct n-grams, respectively), and the corresponding Huffman codes computed for every leaf of both trees. To apply a model to a given text, the text is cut into n-grams (lengths 1 to 20) in such a way that the sum of the lengths of the model's Huffman codes over that text is minimal. If a model is applied to the text from which it was built, the result is also the lowest entropy of that text, which at the same time measures its information content. The entropy of Ciril Kosmač's texts with respect to his model was 2.26 bits per character, and the entropy of Ivan Cankar's texts with his model was 2.27 bits per character.
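The model-building step described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names `ngram_counts` and `huffman_code_lengths` are hypothetical, and the sketch computes only the code lengths (the depth of each leaf in the Huffman tree), since only the lengths are needed to measure the size of an encoded text. It is meant for toy inputs, not for corpora of the size reported above.

```python
import heapq
from collections import Counter
from itertools import count


def ngram_counts(text, max_n=20, min_freq=2):
    """Count every character n-gram of length 1..max_n and keep those
    that occur at least min_freq times (the thresholds follow the abstract)."""
    counts = Counter(
        text[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(text) - n + 1)
    )
    return {gram: c for gram, c in counts.items() if c >= min_freq}


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code built on freqs."""
    if len(freqs) == 1:
        # A single symbol still needs one bit to be written down.
        return {sym: 1 for sym in freqs}
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(f, next(tiebreak), [sym]) for sym, f in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf in them one level deeper.
        for sym in syms1 + syms2:
            lengths[sym] += 1
        heapq.heappush(heap, (f1 + f2, next(tiebreak), syms1 + syms2))
    return lengths
```

Calling `huffman_code_lengths(ngram_counts(corpus_text))` on a small string gives a table of code lengths in which frequent n-grams receive short codes and rare ones long codes.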

Language:	Slovenian
Keywords:	entropy, information theory, language model, Slovenian language, fiction, texts, statistical description of text, morphological tagging, quantitative linguistics, Ciril Kosmač, Ivan Cankar
Typology:	1.06 - Published Scientific Conference Contribution (invited lecture)
Organization:	FF - Faculty of Arts
Year:	2003
Number of pages:	pp. 421-428
PID:	20.500.12556/RUL-164980
UDC:	821.163.6.09 Kosmač C.:004.6
COBISS.SI-ID:	21472045
Publication date in RUL:	19.11.2024

Record is a part of a monograph

Title:	Slovenski roman
Editors:	Miran Hladnik, Gregor Kocijan
Place of publishing:	Ljubljana
Publisher:	Center za slovenščino kot drugi/tuji jezik pri Oddelku za slovenistiko Filozofske fakultete
Year:	2003
ISBN:	961-237-058-3
COBISS.SI-ID:	125833472
Collection title:	Obdobja
Collection numbering:	21
Collection ISSN:	1408-211X

Secondary language

Abstract:
Language:	English
In the paper a language model based on probabilities of character n-grams has been applied to texts of collected works of two leading Slovenian twentieth-century writers, i.e., Ciril Kosmač and Ivan Cankar. During the construction of each model, a Huffman tree is generated from all the n-grams (n=1 to 20, frequency of 2 or more) of each text corpus (0.4 and 2.0 million running words, 45,889,000 and 223,553,000 n-grams, 26,274,000 and 116,588,000 different n-grams), and appropriate Huffman codes are computed for every leaf in the tree. To apply the model to an arbitrary text, the text is cut into n-grams (1–20) in such a way that the sum of the lengths of the model Huffman codes for all the obtained n-grams of the new text is minimal. If the model is applied to the text from which it was generated, the resulting entropy is minimal; this entropy is also a measure of the information content of the text from the standpoint of information theory. When the model of Ciril Kosmač was applied to his texts, an entropy of 2.26 bits per character was obtained; for the model and texts of Ivan Cankar, the entropy was 2.27 bits per character.
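The segmentation step, cutting a text into n-grams so that the total Huffman code length is minimal, can be sketched as a shortest-path style dynamic programme. This is an illustration under stated assumptions rather than the procedure used in the paper: the function name `min_code_length` and the toy code-length table in the usage note are hypothetical, and dividing the total by the number of characters is assumed to correspond to the bits-per-character figures quoted above.

```python
import math


def min_code_length(text, code_len, max_n=20):
    """Cut text into n-grams (length 1..max_n) present in code_len so that
    the summed code length is minimal. Returns (total_bits, segments)."""
    n = len(text)
    best = [math.inf] * (n + 1)  # best[i]: fewest bits to encode text[:i]
    back = [0] * (n + 1)         # back[i]: start index of the last n-gram
    best[0] = 0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_n), end):
            bits = code_len.get(text[start:end])
            if bits is not None and best[start] + bits < best[end]:
                best[end] = best[start] + bits
                back[end] = start
    if math.isinf(best[n]):
        raise ValueError("text cannot be segmented with this model")
    # Walk the back-pointers from the end to recover the chosen n-grams.
    segments = []
    pos = n
    while pos > 0:
        segments.append(text[back[pos]:pos])
        pos = back[pos]
    segments.reverse()
    return best[n], segments


def bits_per_char(text, code_len, max_n=20):
    """Total minimal code length divided by the number of characters."""
    total, _ = min_code_length(text, code_len, max_n)
    return total / len(text)
```

With a made-up table such as `{"a": 3, "b": 3, "n": 3, "an": 2, "ana": 4, "ban": 4}`, the text `"banana"` is cut into `"ban"` and `"ana"` for a total of 8 bits. A text containing a character that the model has no code for cannot be segmented at all, so the sketch raises an error in that case; how the original model treats such characters is not stated in the abstract.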
Keywords:	entropy, information theory, linguistic model, Slovene language, literary texts, statistical description of text, part-of-speech tagging, quantitative linguistics, Ciril Kosmač, Ivan Cankar
