
Automated construction of training corpora with the help of large language models
Petkovšek, Gal (Author), Žitnik, Slavko (Mentor), Justin, Tadej (Co-mentor)

PDF - Presentation file (2.14 MB)
MD5: 7C5B7909C705A55F83F353C60EF67D18

Abstract
Collecting and labeling data is expensive and time-consuming. In this work we present a framework that leverages the power of large language models to generate synthetic data. We tested it on three text classification tasks and used it to improve on the baseline results. We presented several methods for assessing the quality of synthetic datasets and showed how these findings can be used to develop new approaches for generating synthetic examples. Several synthetic generation techniques were developed and tested; the most notable is adding frequent words to the prompt, which substantially improves results when both a small set of labeled and a large set of unlabeled examples are available. The best results were achieved by combining synthetically generated data with LLM-labeled examples drawn from the large unlabeled set. The main contributions of the thesis include the implemented framework and the developed generation strategies, which we evaluated with various metrics across several scenarios.
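
As a rough illustration of the frequent-word prompting strategy mentioned in the abstract, the sketch below builds a generation prompt from the most common words of a small labeled sample. The function names, prompt wording, and example texts are hypothetical assumptions for illustration only; they are not taken from the thesis.

from collections import Counter
import re


def top_frequent_words(texts, k=10):
    """Return the k most frequent word tokens in a small labeled sample."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return [word for word, _ in counts.most_common(k)]


def build_generation_prompt(label, frequent_words, n_examples=5):
    """Compose a prompt that steers the LLM toward the vocabulary of the target class."""
    return (
        f"Write {n_examples} short, distinct texts belonging to the class '{label}'. "
        "Where it fits naturally, use some of these typical words: "
        + ", ".join(frequent_words)
    )


# Hypothetical labeled seed sample for a "positive review" class.
seed_texts = [
    "Great product, works exactly as advertised.",
    "Excellent quality and fast delivery, very happy with it.",
]
prompt = build_generation_prompt("positive review", top_frequent_words(seed_texts))
print(prompt)  # the prompt would then be sent to an LLM to generate synthetic examples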

Language: Slovenian
Keywords: large language models, synthetic data, natural language processing, text classification, datasets
Work type: Master's thesis/paper
Typology: 2.09 - Master's Thesis
Organization: FRI - Faculty of Computer and Information Science
Year: 2024
PID: 20.500.12556/RUL-162603
COBISS.SI-ID: 210392323
Publication date in RUL: 25.09.2024
Views: 101
Downloads: 637

Secondary language

Language: English
Title: Automated construction of training corpora with the help of large language models
Abstract:
Collecting and labeling data is costly and time-consuming. In this work, we present a framework that leverages the power of large language models to artificially generate synthetic data. We tested it on three text classification tasks and achieved improvements over baseline results. We introduced several methods for evaluating the quality of artificial datasets and demonstrated how these insights can be used to develop new generation approaches for synthetic data. Several artificial generation techniques were developed and tested, with the most notable being the addition of frequent words in the prompt, which significantly improves results in scenarios with both a small labeled set and a large unlabeled set available. The highest performance was achieved by combining artificially generated data with LLM-labeled samples from a large set of unlabeled examples. The main contributions of this work are the implemented framework and the developed generation strategies, which we evaluated using multiple metrics across various scenarios.
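
The abstract reports that the best results came from combining synthetic data with LLM-labeled examples taken from a large unlabeled pool. The sketch below shows one minimal way such a combination step could look, assuming a generic query_llm callable and a hypothetical two-class label set; neither the function names nor the prompt text come from the thesis.

from typing import Callable, List, Tuple

LABELS = ["positive", "negative"]  # hypothetical label set


def llm_label(unlabeled: List[str], query_llm: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Ask an LLM to assign one of the known labels to each unlabeled text."""
    labeled = []
    for text in unlabeled:
        answer = query_llm(
            f"Classify the following text as one of {LABELS}. "
            f"Answer with the label only.\n\nText: {text}"
        ).strip().lower()
        if answer in LABELS:  # discard answers that do not map to a valid label
            labeled.append((text, answer))
    return labeled


def build_training_set(synthetic: List[Tuple[str, str]],
                       unlabeled: List[str],
                       query_llm: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Merge synthetic (text, label) pairs with LLM-labeled examples from the unlabeled pool."""
    return synthetic + llm_label(unlabeled, query_llm)

In this sketch, answers that do not match a known label are simply discarded; the thesis itself assesses the quality of the resulting datasets with several dedicated metrics.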

Keywords: large language models, synthetic data, natural language processing, text classification, datasets
