Iterativno pridobivanje semantičnih podatkov iz nestrukturiranih besedilnih virov : doktorska disertacija

Žitnik, Slavko

Žitnik, Slavko (Author), Bajec, Marko (Mentor)

URL - Presentation file: http://eprints.fri.uni-lj.si/2889/

Abstract

We live in a time when we create enormous amounts of data, most of which is unstructured. Every minute, Internet users publish more than 200,000 text documents and together write more than 200 million e-mail messages. We would like to access this data in a structured form, so this dissertation deals with information extraction from text. Information extraction is a type of information retrieval whose main tasks are named entity recognition, relationship extraction, and coreference resolution. The dissertation consists of four core chapters, each presenting its own extraction task, which we finally combine with an iterative method into an end-to-end information extraction system. First we present the task of coreference resolution, whose goal is to find all mentions of a given entity and merge them. As our own solution we propose the SkipCor system, which transforms the problem into sequence labeling, to which we apply first-order probabilistic models. To detect coreferent mentions over longer distances, we propose an innovative transformation into sequences with skipped mentions and achieve results comparable to or better than other known approaches. We use a similar approach for relationship extraction. There, different types of labels together with rules enable the recognition of hierarchical relationships. With the proposed approach we achieve the best result at a relationship extraction challenge aimed at discovering a gene regulatory network. Lastly we present the oldest and most researched task, named entity recognition. The task deals with labeling one or more words that represent a particular entity type, for example a person. In the dissertation we adapt standard sequence labeling procedures and achieve seventh place at a challenge on recognizing chemical compounds and drugs.
We solve all the tasks with linear-chain conditional random field models and combine them in an iterative method that takes unstructured text as input and returns the extracted entities with the relationships between them. These are labeled according to a system ontology, which ensures better interoperability in the future. The field of information extraction is still largely unexplored for the Slovenian language, so the appendix also includes a list of translations of selected terms from English into Slovenian.

Language:	Slovenian
Keywords:	information extraction, coreference resolution, relationship extraction, named entity recognition, computer science, dissertations
Work type:	Doctoral dissertation
Typology:	2.08 - Doctoral Dissertation
Organization:	FRI - Faculty of Computer and Information Science
Publisher:	[S. Žitnik]
Year:	2014
Number of pages:	IX, 134 pp.
PID:	20.500.12556/RUL-69131
UDC:	004(043.3)
COBISS.SI-ID:	1536169155
Publication date in RUL:	10.07.2015
Views:	1867
Downloads:	312

Secondary language

Language:	English
Title:	Iterative semantic information extraction from unstructured text sources
Abstract:
Nowadays we generate an enormous amount of data, most of which is unstructured. Internet users post more than 200,000 text documents and together write more than 200 million e-mails every single minute. We would like to access this data in a structured form, which is why this dissertation deals with information extraction from text sources. Information extraction is a type of information retrieval whose main tasks are named entity recognition, relationship extraction, and coreference resolution. The dissertation consists of four main chapters: each task is presented in its own chapter, and the last chapter combines all three tasks into an iterative method within an end-to-end information extraction system. First we introduce the task of coreference resolution, whose goal is to merge all mentions that refer to a specific entity. We propose the SkipCor system, which casts the task as a sequence tagging problem to which first-order probabilistic models can be applied. To enable the detection of distant coreferent mentions, we propose an innovative transformation into skip-mention sequences and achieve results comparable to or better than other known approaches. We use a similar transformation for relationship extraction, where different tags and rules enable the extraction of hierarchical relationships. The proposed solution achieves the best result at a challenge on extracting relationships between genes that form a gene regulation network. Lastly we present the oldest and most thoroughly researched task, named entity recognition. The task deals with tagging one or more words that represent a specific entity type, for example a person. In the dissertation we adapt standard sequence tagging procedures and achieve seventh place at the chemical compound and drug name recognition challenge.
We solve all three problems using linear-chain conditional random field models. We combine the tasks in an iterative method that accepts unstructured text as input and returns the extracted entities along with the relationships between them. The output is represented according to a system ontology, which provides better data interoperability. Information extraction for the Slovene language is not yet well researched, which is why we also include a list of translations of selected terms from English to Slovene.
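The skip-mention idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the SkipCor implementation: the function names, the "C"/"O" label scheme, and the toy mentions are assumptions made for illustration only. The sketch shows why taking every (skip+1)-th mention helps a first-order sequence model: mentions that are far apart in the document become neighbours in some sequence.

```python
from typing import List, Sequence


def skip_mention_sequences(mentions: Sequence[str], skip: int) -> List[List[str]]:
    """Split mentions into up to skip+1 sequences, each taking every (skip+1)-th mention.

    Hypothetical helper: skip=0 returns the original order as a single sequence.
    """
    if skip < 0:
        raise ValueError("skip must be non-negative")
    step = skip + 1
    # One sequence per starting offset; offsets beyond the input are dropped.
    return [list(mentions[offset::step]) for offset in range(step) if offset < len(mentions)]


def coreference_labels(entity_ids: Sequence[int]) -> List[str]:
    """Label a mention 'C' if it refers to the same entity as its predecessor, else 'O'.

    The 'C'/'O' scheme is an assumption used here to show a sequence tagging target.
    """
    return [
        "C" if i > 0 and entity_ids[i] == entity_ids[i - 1] else "O"
        for i in range(len(entity_ids))
    ]


# Toy document: two entities whose mentions alternate (assumed gold entity ids).
mentions = ["John", "Mary", "he", "she", "him"]
entities = [1, 2, 1, 2, 1]

# With skip=0 no two neighbouring mentions are coreferent.
zero_skip_labels = coreference_labels(entities)

# With skip=1 the coreferent mentions become neighbours in their own sequences.
one_skip_mentions = skip_mention_sequences(mentions, 1)
one_skip_labels = [coreference_labels(ids) for ids in skip_mention_sequences(entities, 1)]
```

In this toy case the zero-skip sequence contains no adjacent coreferent pair, while the two one-skip sequences group the mentions of each entity next to each other, so a model that only looks at neighbouring labels could link them. How the per-sequence predictions are merged into entity clusters is not covered by this sketch.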
Keywords:	information extraction, coreference resolution, relationship extraction, named entity recognition, doctoral dissertations, theses
