Abstractive summarization of documents in the Slovene language (Abstraktivno povzemanje dokumentov v slovenskem jeziku)

JUGOVIC, ANDREJ (Author), Robnik Šikonja, Marko (Mentor)

PDF - Presentation file (700,23 KB)
MD5: E5F74ED54DA1FD37F4C3E7155889448D
PID: 20.500.12556/rul/6ca8e989-dd41-4b3d-8e0a-0c64be99a320

Abstract

This thesis addresses automatic summarization of Slovene documents. We live in a time when large numbers of documents are available in electronic form, and we want to condense them into shorter texts. Not all of them can be summarized by hand, so the process has to be automated. Using a parser for the Slovene language, we extracted triplets consisting of a subject, a predicate, and an object from each document. From the words in these triplets we built a graph and weighted its edges in several ways. We scored the graph nodes with the P-PR algorithm, which served as a baseline estimate of the importance of the words in the triplets. We then weighted the P-PR values of the words with TF-IDF, Okapi BM-25, and word frequency. Using these scores, we selected the best triplets and generated summaries from them. We evaluated the resulting summaries with the ROUGE-N and ROUGE-S measures. The evaluation was carried out on a corpus that we built from Wikipedia and on manually summarized texts. The results showed that humans produce considerably better summaries. The best-performing system was the one in which graph edges were weighted by the number of bigram occurrences and the P-PR values by the frequency of the word in the triplets.
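The ranking step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the toy triplets, the rule that links neighbouring words inside a triplet, the co-occurrence edge weights, the seed word, the damping factor of 0.85, and the triplet score (sum of P-PR value times word frequency in the triplets) are all assumptions made for illustration. The parser that produces the triplets is not reproduced here.

```python
from collections import Counter, defaultdict

# Hypothetical triplets (subject, predicate, object); a real system would
# obtain them from a dependency parser for Slovene.
TRIPLETS = [
    ("Ljubljana", "is", "capital"),
    ("capital", "hosts", "university"),
    ("university", "teaches", "students"),
    ("river", "crosses", "Ljubljana"),
]


def build_graph(triplets):
    """Undirected weighted graph: neighbouring words in a triplet are linked,
    and every repeated pair adds 1 to the edge weight (assumed weighting)."""
    graph = defaultdict(Counter)
    for subj, pred, obj in triplets:
        for a, b in ((subj, pred), (pred, obj)):
            if a != b:
                graph[a][b] += 1
                graph[b][a] += 1
    return graph


def personalized_pagerank(graph, seeds, damping=0.85, iterations=100):
    """Power iteration in which the random walk restarts only at seed nodes."""
    nodes = list(graph)
    # Keep only seeds present in the graph; fall back to all nodes otherwise.
    seeds = [n for n in nodes if n in seeds] or nodes
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * restart[n] for n in nodes}
        for node in nodes:
            total = sum(graph[node].values())
            for neighbour, weight in graph[node].items():
                new_rank[neighbour] += damping * rank[node] * weight / total
        rank = new_rank
    return rank


def rank_triplets(triplets, seeds):
    """Score each triplet as the sum of P-PR value times word frequency."""
    rank = personalized_pagerank(build_graph(triplets), seeds)
    freq = Counter(word for triplet in triplets for word in triplet)
    scored = [
        (sum(rank.get(word, 0.0) * freq[word] for word in triplet), triplet)
        for triplet in triplets
    ]
    return sorted(scored, reverse=True)


ranked = rank_triplets(TRIPLETS, seeds={"Ljubljana"})
for score, triplet in ranked:
    print(f"{score:.4f}", " ".join(triplet))
```

Swapping the frequency factor for a TF-IDF or Okapi BM-25 weight would only change the `freq` lookup in `rank_triplets`; the graph construction and the P-PR iteration stay the same.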

Language:	Slovenian
Keywords:	natural language processing, document summarization, P-PR (personalized PageRank) algorithm for ranking nodes in a graph, ROUGE measure, automatic document summarization
Work type:	Bachelor thesis/paper
Organization:	FRI - Faculty of Computer and Information Science
Year:	2016
PID:	20.500.12556/RUL-85512
Publication date in RUL:	15.09.2016
Views:	1425
Downloads:	430

Secondary language

Abstract:
Language:	English
Title:	Abstractive summarization for Slovene language
The thesis focuses on automatic summarization of Slovene documents. Large numbers of documents are available in digital form, and we want to summarize them to make them more accessible to readers. This cannot be done manually, so we want to automate the process. Our system uses a parser for the Slovene language to find triplets consisting of a subject, a predicate (or verb), and an object. We build a graph from the words in the triplets and weight its connections. We rank the nodes with the P-PR algorithm, which assesses the importance of the words in the triplets. We weight the P-PR values of the words in the triplets with TF-IDF, Okapi BM-25, and word frequency. We choose the best triplets and use them to generate summaries. The generated summaries are evaluated with the ROUGE-N and ROUGE-S measures. Evaluation is performed on a corpus built from Wikipedia and on manually created summaries. The results show that humans create significantly better summaries. The best computer-generated summaries are obtained when graph connections are weighted by the number of bigram occurrences and P-PR values are weighted by the frequency of word occurrence in the triplets.
Keywords:	natural language processing, document summarization, personalized PageRank algorithm, ROUGE measure, weighted links, automatic document summarization
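
The ROUGE-N and ROUGE-S measures named in the abstract can be sketched as follows. This is a simplified, recall-only version for a single reference summary, assuming whitespace tokenization and no limit on the skip distance; the function names and the example sentences are illustrative and are not taken from the thesis.

```python
from collections import Counter
from itertools import combinations


def ngrams(tokens, n):
    """Multiset of contiguous n-grams."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def skip_bigrams(tokens):
    """Multiset of ordered word pairs with any gap between them."""
    return Counter(combinations(tokens, 2))


def recall(candidate_units, reference_units):
    """Share of reference units that also occur in the candidate (clipped)."""
    total = sum(reference_units.values())
    if total == 0:
        return 0.0
    overlap = sum((candidate_units & reference_units).values())
    return overlap / total


def rouge_n(candidate, reference, n=2):
    """ROUGE-N recall on whitespace tokens."""
    return recall(ngrams(candidate.split(), n), ngrams(reference.split(), n))


def rouge_s(candidate, reference):
    """ROUGE-S recall based on skip-bigram overlap."""
    return recall(skip_bigrams(candidate.split()), skip_bigrams(reference.split()))


reference = "police killed the gunman"
candidate = "police kill the gunman"
print("ROUGE-1:", rouge_n(candidate, reference, n=1))
print("ROUGE-2:", rouge_n(candidate, reference, n=2))
print("ROUGE-S:", rouge_s(candidate, reference))
```

ROUGE-S rewards word pairs that appear in the same order even when other words separate them, so it is more tolerant of rephrasing than ROUGE-2, which only counts adjacent pairs.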
