In this thesis, we used latent semantic analysis (LSA) for automatic multi-document summarization. LSA analyzes the relationships between words and documents by producing a set of concepts that describe these relationships. In the preprocessing stage, all words were lemmatized using a Slovenian lexicon. Our work drew on Slovenian academic contributions in the form of Slovenian digital lexicons. The result of the LSA analysis is a list of paragraphs ranked by relevance, and the highest-ranked paragraphs are candidates for the summary. To map the lemmatized paragraphs back to the original text, we performed a syntactic analysis of the source text during preprocessing. The resulting extract was converted into an abstractive summary using semantic analysis of sentences and lexical chaining. For this purpose we used a Slovenian morphological lexicon. The quality of the obtained summaries was evaluated using the ROUGE metric. We compared these summaries with abstracts produced by archetypal analysis and with human-written summaries. To carry out the summarization, we developed a stand-alone web application named SimpleX, which was deployed in a server environment to support the database. Experimental results suggest that the proposed semantic approach is a step toward summarizing large collections of documents.
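The evaluation step mentioned above can be illustrated with a minimal sketch of ROUGE-N recall, which measures the share of n-grams in a reference summary that also occur in a candidate summary. The abstract does not state which ROUGE variant was used, so the recall variant, the function names `ngrams` and `rouge_n_recall`, and the naive whitespace tokenization are all assumptions made for illustration; in particular, the sketch does not lemmatize the text.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: overlapping n-grams divided by n-grams in the reference.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace. No lemmatization is applied in this sketch.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        # An empty (or too short) reference has no n-grams to recall.
        return 0.0
    # Clipped overlap: a reference n-gram is matched at most as many
    # times as it appears in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


if __name__ == "__main__":
    system_summary = "the cat sat on the mat"
    human_summary = "the cat lay on the mat"
    print(rouge_n_recall(system_summary, human_summary, n=1))
    print(rouge_n_recall(system_summary, human_summary, n=2))
```

A score of 1.0 means every n-gram of the reference also appears in the candidate. Because the measure is recall-oriented, longer candidates tend to score higher, which is why candidate summaries are usually compared at a fixed length.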