In my Master's thesis I researched various approaches to variable selection and prediction for high-dimensional and unbalanced data, that is, data with many variables (dimensions) and unequal class frequencies (in the classification setting). Such a scenario arises, for example, with gene expression data, where a very small sample size is also common. The goal is to identify the important variables, i.e. those truly associated with the outcome, and to assign class membership to new samples whose class is unknown.
In my work I focused on random forests, more specifically on decision trees combined with bootstrapping or boosting. Because random forests do not perform variable selection by default, I proposed several approaches for achieving it within the existing models. Additionally, I examined whether refitting a model using only the selected variables improves the predictive ability of the final model; a sketch of this general workflow follows below.
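The thesis methods themselves are more involved, but as a minimal sketch of the general idea (assuming scikit-learn; the importance threshold and all parameter values below are illustrative choices, not the thesis settings), one can rank variables by random-forest importance, keep a subset, and refit:

```python
# Minimal sketch (not the thesis procedure): select variables by
# random-forest importance, then refit on the selected subset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative high-dimensional, unbalanced data: 100 samples,
# 1000 variables, only 10 informative, ~10% minority class.
X, y = make_classification(n_samples=100, n_features=1000,
                           n_informative=10, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X_tr, y_tr)

# Keep variables whose importance exceeds the mean importance
# (an arbitrary illustrative threshold).
selected = np.flatnonzero(rf.feature_importances_ >
                          rf.feature_importances_.mean())

# Refit using only the selected variables and compare on the test set.
rf_sel = RandomForestClassifier(n_estimators=500, random_state=0)
rf_sel.fit(X_tr[:, selected], y_tr)
print(len(selected), rf_sel.score(X_te[:, selected], y_te))
```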
I also investigated two main approaches to predicting new cases from unbalanced data. Most methods perform poorly on such data because they rely on commonly used improper scoring rules and unadjusted loss functions. I presented several proper scoring rules and demonstrated their advantages, then chose one suited to such data and compared its results against methods based on improper scoring rules.
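The specific rule chosen in the thesis is not restated here, but as a generic illustration of why this matters, the sketch below (assuming scikit-learn) contrasts accuracy, an improper rule, with the Brier score, one well-known proper scoring rule, on an unbalanced problem:

```python
# Illustrative sketch: on unbalanced data, accuracy rewards a classifier
# that always predicts the majority class, while a proper scoring rule
# such as the Brier score exposes its uninformative probabilities.
import numpy as np
from sklearn.metrics import accuracy_score, brier_score_loss

y_true = np.array([0] * 95 + [1] * 5)        # 5% minority class

# "Majority" classifier: near-zero probability of class 1 for everyone.
p_majority = np.full(100, 0.01)
# An informed classifier: higher probability for the true positives.
p_informed = np.where(y_true == 1, 0.7, 0.05)

for name, p in [("majority", p_majority), ("informed", p_informed)]:
    acc = accuracy_score(y_true, p > 0.5)    # threshold at 0.5
    brier = brier_score_loss(y_true, p)      # lower is better
    print(f"{name}: accuracy={acc:.2f}, Brier={brier:.4f}")

# Accuracy credits the uninformative majority classifier with 0.95,
# while its Brier score is roughly seven times worse than the
# informed classifier's.
```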
To evaluate all the approaches and models, I performed a simulation in which I generated high-dimensional unbalanced data with a structure similar to gene expression data. Owing to the very large computational burden, I chose only one parameter combination for the simulation, one that was sufficiently difficult for the class-prediction task. The data were generated with high between-variable correlations in a block structure, along the lines of the sketch below. All models were also evaluated on real data from DNA microarrays.
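The exact simulation parameters are given in the thesis; as a hedged sketch of the general idea (all values below are illustrative assumptions), variables with a block correlation structure can be drawn from a multivariate normal with a block-diagonal correlation matrix:

```python
# Sketch of generating unbalanced data with block-correlated variables
# (sample size, block size, correlation, and effect sizes are
# illustrative, not the thesis settings).
import numpy as np

rng = np.random.default_rng(0)
n, block_size, n_blocks, rho = 50, 20, 10, 0.8
p = block_size * n_blocks

# One correlation block: rho off the diagonal, 1 on it.
block = np.full((block_size, block_size), rho)
np.fill_diagonal(block, 1.0)

# Sample each block independently via its Cholesky factor,
# yielding a block-diagonal overall correlation structure.
L = np.linalg.cholesky(block)
X = np.hstack([rng.standard_normal((n, block_size)) @ L.T
               for _ in range(n_blocks)])

# Unbalanced classes; shift a few variables in the minority class
# so that only those variables are informative.
y = (rng.random(n) < 0.2).astype(int)   # ~20% minority class
X[y == 1, :5] += 1.0                    # first 5 variables informative

print(X.shape, y.mean())
```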
The results show that my proposed variable-selection methods select far fewer uninformative variables than the competing methods, and also far fewer variables in total. Using proper scoring rules and an adjusted loss function, I was able to improve on the results of most other methods.