Mining patterns for neurodegenerative diseases from biomedical scientific literature
Atanasoski, Radoslav (Author), Žitnik, Slavko (Mentor), Eftimov, Tome (Comentor)

PDF - Presentation file (1.06 MB)
MD5: 803F4AB4B758DA8E902274468CC57669

Abstract
A vast amount of biomedical knowledge is published in scientific papers every day. Keeping up with it is challenging and time-consuming, all the more so when searching for relevant papers that contain specific information. To help medical professionals stay up to date and find papers related to their search topics, in this thesis we create an Information Retrieval (IR) pipeline that first identifies which neurodegenerative diseases a paper relates to and then provides an analysis of the most frequent patterns that are researched and published. For the modeling, we explored several state-of-the-art text representation learning models: BERT, RoBERTa and BioBERT. After fine-tuning each model, we chose BioBERT for the IR pipeline, as it achieved a cross-validation classification accuracy (CA) of 94%. We also compare this model with a more traditional and commonly used model, Random Forest. For the analysis of frequent patterns, the abstracts for the diseases involved were annotated, and chemical and genetic concepts were extracted using a Named Entity Recognition (NER) model. All entities were then normalized by applying Named Entity Linking (NEL). Association rule mining was applied to the extracted entities to find the most frequently researched patterns for each disease, which are displayed using several visualization techniques. These results will help medical professionals stay up to date and also point to gaps that are not well researched for a given disease. The data used in this study was obtained from a publicly available database, PubMed.
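The association rule mining step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the entity names, the thresholds and the function `mine_pair_rules` are hypothetical, the toy data is invented for illustration, and only rules between pairs of entities are considered. Each "transaction" stands for the set of normalized entities found in one abstract.

```python
from collections import Counter
from itertools import combinations

# Toy data (hypothetical, NOT taken from the thesis): each set holds the
# normalized entities extracted from one abstract.
abstracts = [
    {"CHEM_X", "GENE_A", "GENE_B"},
    {"CHEM_X", "GENE_B"},
    {"CHEM_X", "GENE_A"},
    {"CHEM_Y", "GENE_C"},
]


def mine_pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence) rules over entity pairs."""
    n = len(transactions)
    # Number of abstracts mentioning each single entity.
    item_counts = Counter(item for t in transactions for item in t)
    # Number of abstracts mentioning each unordered pair of entities.
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of abstracts containing both entities
        if support < min_support:
            continue
        # Each frequent pair yields two candidate rules: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]  # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


rules = mine_pair_rules(abstracts)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}  support={support:.2f}  confidence={confidence:.2f}")
```

Support measures how often two entities are mentioned together across all abstracts, while confidence measures how often the consequent appears given the antecedent. A full miner such as Apriori would also consider itemsets larger than pairs.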

Language: English
Keywords: data mining, text representation learning, association rule mining
Work type: Bachelor thesis/paper
Typology: 2.11 - Undergraduate Thesis
Organization: FRI - Faculty of Computer and Information Science
Year: 2022
PID: 20.500.12556/RUL-139475
COBISS.SI-ID: 120508931
Publication date in RUL: 02.09.2022
Views: 813
Downloads: 131

Secondary language

Language: Slovenian
Title: Odkrivanje vzorcev za nevrodegenerativne bolezni iz biomedicinske znanstvene literature
Abstract:
An enormous amount of biomedical knowledge arrives every day through published scientific articles. Keeping track of it is demanding and takes too much time, even more so when searching for relevant documents that contain the required information. To help healthcare professionals stay up to date and find articles related to their search topics, in this bachelor thesis we create an Information Retrieval (IR) pipeline that first determines which neurodegenerative diseases the articles relate to and also provides an analysis of the most frequent patterns that are researched and published. For the modeling, we explored several state-of-the-art text representation learning models, such as BERT, RoBERTa and BioBERT. After fine-tuning each model, BioBERT was chosen as the model for the IR pipeline, as it achieved a cross-validation classification accuracy (CA) of 94%. We also compare our state-of-the-art model with a more traditional and commonly used model, Random Forest. For the analysis of frequent patterns, the abstracts for the diseases involved were annotated, and concepts of chemical and genetic compounds were extracted using a Named Entity Recognition (NER) model. All entities were then normalized by applying Named Entity Linking (NEL). Association rule mining was applied to the extracted entities to find the most frequently researched patterns for each disease, which are further displayed using several visualization techniques. These results will help healthcare professionals stay up to date and will also point to gaps that are not well researched for a given disease. The data included in this study was obtained from the publicly available database PubMed.

Keywords: data mining, text representation learning, association rule learning
