In this master's thesis, we developed a model that represents life-science
texts in a vector form suitable for machine learning. Our corpus
consisted of abstracts from the MEDLINE collection, where abstracts are labeled
with annotations from the MeSH ontology. The developed model uses
a deep neural network to predict MeSH annotations from a text. As
the vector representation of a text, we used the penultimate layer of the
network, which has 1000 neurons. The model was compared to multinomial logistic
regression that predicts MeSH annotations from vector representations of
texts obtained with doc2vec. In the task of predicting MeSH annotations
on the test dataset, our model achieved higher accuracy than the logistic
regression baseline. In addition, the vector representations of texts obtained
with our model produced better point-based visualizations with the t-SNE
method than the representations obtained with doc2vec.