This thesis describes a prototype system that evaluates the readability of Slovene texts. We estimate readability with two methods: regression and classification. The regression method returns a numerical estimate of a text's readability, expressed in years of education, while the classification method assigns the input to one of two classes, more readable or less readable. We used the Šolar corpus as the training set and first estimated readability using statistical measures. We then trained several machine learning algorithms on features extracted from the texts. To assess the quality of our prototypes, we used newspapers and magazines from the ccGigafida corpus as the test set.