This thesis addresses automatic summarization of Slovene documents. Large numbers of documents exist in digital form, and summaries would make them more accessible to readers. Summarizing them manually is not feasible, so we automate the process.
Our system uses a parser for the Slovene language to find triplets consisting of a subject, a predicate (or verb), and an object. We build a graph from the words in the triplets and weight its connections. We rank the nodes with the P-PR algorithm, which assesses the importance of the words in the triplets. We weight the P-PR values of the words in the triplets with the measures TF-IDF, Okapi BM25, and word frequency. We choose the best triplets and use them to generate summaries. The generated summaries are evaluated with the ROUGE-N and ROUGE-S measures. Evaluation is performed on a corpus built from Wikipedia and also with manually created summaries. The results show that humans create significantly better summaries. The best computer-generated summaries are obtained when graph connections are weighted by the number of bigram occurrences and P-PR values are weighted by the frequency of word occurrence in the triplets.
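The ranking step described above can be illustrated with a minimal sketch. It is not the implementation used in the thesis and rests on several assumptions: P-PR is read here as personalized PageRank, triplets are assumed to be already extracted by the parser, the edge weight counts adjacent word pairs inside triplets as a simple stand-in for bigram occurrences, and the function names (`build_graph`, `personalized_pagerank`, `rank_triplets`), the damping factor, and the iteration count are all hypothetical.

```python
from collections import Counter, defaultdict


def build_graph(triplets):
    """Build an undirected word graph from (subject, predicate, object) triplets.

    Assumption: an edge weight counts how often two words are adjacent
    inside a triplet, a stand-in for the bigram counts named in the text.
    """
    weights = defaultdict(float)
    for subj, pred, obj in triplets:
        for a, b in ((subj, pred), (pred, obj)):
            weights[(a, b)] += 1.0
            weights[(b, a)] += 1.0
    return weights


def personalized_pagerank(weights, start, damping=0.85, iterations=50):
    """Power iteration for PageRank with restarts to the `start` nodes."""
    nodes = sorted({node for edge in weights for node in edge})
    restart_nodes = [n for n in nodes if n in start]
    if not restart_nodes:
        raise ValueError("no start node occurs in the graph")
    restart = {n: 0.0 for n in nodes}
    for n in restart_nodes:
        restart[n] = 1.0 / len(restart_nodes)

    # Total outgoing weight per node, used to normalise transitions.
    out_sum = defaultdict(float)
    for (a, _), w in weights.items():
        out_sum[a] += w

    rank = dict(restart)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * restart[n] for n in nodes}
        for (a, b), w in weights.items():
            new_rank[b] += damping * rank[a] * w / out_sum[a]
        rank = new_rank
    return rank


def rank_triplets(triplets, start):
    """Score each triplet by the frequency-weighted ranks of its words."""
    rank = personalized_pagerank(build_graph(triplets), start)
    freq = Counter(word for triplet in triplets for word in triplet)
    scored = [(sum(rank[w] * freq[w] for w in t), t) for t in triplets]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Under these assumptions, a summary would be assembled from the highest-scoring triplets, for example `rank_triplets(triplets, {"cat"})[:k]` for a chosen `k`. Replacing the word-frequency factor with TF-IDF or Okapi BM25 weights would change only the `freq` lookup.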