Keyword extraction and named entity recognition on Reddit submissions

Hudobivnik, Rok

Hudobivnik, Rok (Author), Helic, Denis (Mentor), Bosnić, Zoran (Co-mentor)

PDF - Presentation file (2.13 MB)
MD5: 35F573ED0AF38E1E62B088BAD2C7D76D

Abstract

The goal of this thesis was to create a pipeline for extracting valuable information from short natural language texts, specifically Reddit submissions. The two main areas of research were keyword extraction and named entity recognition, the latter applied to recognizing actors and movie titles in the texts. We implemented and evaluated four approaches to keyword extraction (RAKE, TextRank, LSTM and biLSTM networks) and three approaches to named entity recognition (spaCy library models, Stanford NER and fine-tuned BERT models). The analysis showed that the best results were achieved with a three-layer biLSTM network for keyword extraction, an uncased BERT model fine-tuned on the MIT movie corpus dataset for the recognition of actors, and a BERT model fine-tuned on the OntoNotes 5 dataset for the recognition of movie titles.

Language:	English
Keywords:	Deep learning, named entity recognition, keyword extraction, analysis
Work type:	Master's thesis/paper
Typology:	2.09 - Master's Thesis
Organization:	FRI - Faculty of Computer and Information Science
Year:	2020
PID:	20.500.12556/RUL-117614
COBISS.SI-ID:	17020419
Publication date in RUL:	17.07.2020
Views:	1146
Downloads:	200

Secondary language

Language:	Slovenian
Title:	Luščenje ključnih besed in razpoznavanje entitet v besedilih s portala Reddit
Abstract:	The goal of this thesis was to construct a procedure for extracting important information from short natural language texts, specifically submissions from the Reddit web portal. The two main areas of our research were keyword extraction and named entity recognition. For the purposes of the thesis we implemented and analysed four algorithms for keyword extraction (RAKE, TextRank, LSTM and biLSTM neural networks) and three algorithms for named entity recognition (spaCy library models, Stanford NER and fine-tuned BERT models). The analysis of the algorithms showed that the best results are achieved with a neural network with three biLSTM layers for keyword extraction, an uncased BERT model fine-tuned on the MIT movie corpus dataset for recognizing actor names, and a model fine-tuned on the OntoNotes 5 dataset for recognizing movie titles.
Keywords:	Deep learning, named entity recognition, keyword extraction, analysis
