Development of an open-source library for the generation of artificial categorical datasets

Malenšek, Miha

Malenšek, Miha (Author), Demšar, Jure (Mentor), Škrlj, Blaž (Comentor), Mramor, Blaž (Comentor)

PDF - Presentation file (1.16 MB)
MD5: 59D51797AB11B1CE6E98D7BB53E422CC

Abstract

Synthetic datasets are widely used to test and evaluate machine learning models. For quick tests there are simple, ready-made datasets, such as Iris or Wine from the Scikit-learn library, whereas competitions and industrial testing rely on cleaned versions of real datasets. The need for synthetic datasets is growing because of personal data protection, data availability, and model explainability. Most machine learning libraries already support generating basic synthetic datasets, but mainly for continuous data, while the literature shows a lack of tools for generating synthetic categorical datasets. We therefore developed, tested, and published a freely available library for generating synthetic datasets with categorical features. The library can generate both simple and complex datasets with full control over the process. We present its use in three examples: basic operation, simulation of real datasets, and an experimental setting that compares the DeepFM and logistic regression models on sparse datasets with different feature interactions.

Language:	Slovenian
Keywords:	data generation, datasets, categorical data, synthetic datasets
Work type:	Master's thesis
Typology:	2.09 - Master's Thesis
Organization:	FRI - Faculty of Computer and Information Science; FE - Faculty of Electrical Engineering
Year:	2024
PID:	20.500.12556/RUL-160696
COBISS.SI-ID:	210076163
Publication date in RUL:	03.09.2024
Views:	176
Downloads:	68

Secondary language

Language:	English
Title:	Development of an open-source library for the generation of artificial categorical datasets
Abstract:	Synthetic datasets are often used for testing and evaluating machine learning models. For quick testing, there are simple, pre-prepared datasets such as Iris or Wine from the Scikit-learn library. For competitions and industrial testing, refined versions of real datasets are used. Due to data privacy, data accessibility, and model explainability, the demand for synthetic datasets is growing. Most machine learning libraries support generating basic synthetic datasets, but mainly for continuous data, and the literature indicates a lack of tools for generating synthetic categorical datasets. Therefore, we developed, tested, and released an open-source library for generating synthetic datasets with categorical features. Our framework allows for the generation of simple and complex datasets with full control over the generative process. We demonstrated its use in three use cases: the first showcases basic functionality, the second simulates a real dataset, and the third compares DeepFM and logistic regression models on sparse data with various feature interactions.
Keywords:	data generation, datasets, categorical data, synthetic datasets
