Similarity of arbitrarily long legal documents

Vranješ, Luka (Author), Robnik Šikonja, Marko (Mentor)

PDF - Presentation file (1.22 MB)
MD5: D6AD17D9AD32AAC209B64BC36F58F805

Abstract

Modern language technologies are needed in the legal industry to cope with the large volume of text it produces. Search is a core feature that lets users work better and faster. Modern context-aware approaches can improve many search-related features by better quantifying the similarity between texts. We propose a transformer-based model that creates document embeddings using two interlaced encoders. We train three models with different levels of interlacing, and additionally inform one model of the relative location of each segment within the document. As no differences were detected between the models in the training stage, the most feature-rich model was selected and compared, in a human evaluation, with a baseline doc2vec model on the task of recommending similar documents. In this evaluation, doc2vec proved to be the better and more suitable model for the selected task. The testing revealed key problems with the proposed model's concept of similarity, which does not match the requirements of legal document recommendation.
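The recommendation task described in the abstract can be illustrated with a minimal sketch: given one embedding per document (from doc2vec, the proposed model, or any other source), rank the remaining documents by similarity to a query document. The abstract does not say which similarity measure the thesis uses; cosine similarity is assumed here, and the function names `cosine_similarity` and `recommend_similar`, the document IDs, and the toy vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def recommend_similar(query_id, embeddings, top_k=3):
    """Return up to top_k (doc_id, score) pairs most similar to query_id."""
    query = embeddings[query_id]
    scored = [
        (doc_id, cosine_similarity(query, vec))
        for doc_id, vec in embeddings.items()
        if doc_id != query_id  # never recommend the query document itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy, made-up embeddings; real ones would come from a trained model.
embeddings = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
recommendations = recommend_similar("doc_a", embeddings, top_k=2)
```

Because the ranking step depends only on the embeddings, two embedding models can be compared on the same recommendation task by swapping the `embeddings` dictionary and leaving the rest unchanged.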

Language:	English
Keywords:	document similarity, document recommendation, legal documents, long documents, natural language processing, transformer neural networks
Work type:	Master's thesis/paper
Typology:	2.09 - Master's Thesis
Organization:	FRI - Faculty of Computer and Information Science
Year:	2022
PID:	20.500.12556/RUL-141628
COBISS.SI-ID:	125574147
Publication date in RUL:	03.10.2022
Views:	695
Downloads:	99

Secondary language

Language:	Slovenian
Title:	Podobnost poljubno dolgih pravnih besedil
Abstract:	The use of modern language technologies in the legal industry is necessary for it to cope more easily with the large amounts of text it produces. Efficient search is one of the key solutions, allowing users to do their work better and faster. With better awareness of context, modern approaches can improve many search-related features. As a solution, we propose an architecture based on the transformer neural network that creates a document representation using two interlaced encoders. We tested three models with different levels of interlacing and inform one model of the relative location of each segment within the document. We detected no differences between them on the validation set, so we used the most elaborate model for manual testing. In manual testing on the task of recommending similar documents, we compare our selected model with the doc2vec model. The results show that the doc2vec model is more suitable for the tested problem. The testing revealed shortcomings of the proposed model, especially in its representation of similarity, which does not match what is required in the context of recommending similar legal texts.
Keywords:	document similarity, document recommendation, legal documents, long documents, natural language processing, transformer neural networks
