Detection of linked accounts in a large data set (Odkrivanje povezanih računov v veliki množici podatkov)

Novak, Benjamin

Novak, Benjamin (Author), Sadikov, Aleksander (Mentor)


Abstract

We live in an era in which we leave a trail of our data whenever we use the World Wide Web. Companies that store and analyze such data face challenges of time and space complexity because of its sheer volume. In this master's thesis we address one such challenge: finding the pairs of most similar accounts in large data sets. We analyze the time efficiency and computational performance of methods for finding pairs of examples with a high degree of similarity. The experiments were carried out on two data sets. We describe how the data were transformed and represented as a sparse matrix. We then used this representation in experiments in which we searched for the pairs of accounts with the highest cosine similarity using the exact All Pairs method, the LSH method, and bisecting k-means clustering. Our goal was to assess which of these methods gives the best results in practice. We found that the All Pairs method is unsuitable for practical use because of its time inefficiency, while the performance of the approximate methods depends on the choice of parameters. The LSH method found links above 80% similarity in less time, whereas bisecting k-means clustering is the more time-efficient choice for lower similarity thresholds.
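As an illustration of the exact search the abstract refers to, the following is a minimal sketch of an all-pairs cosine similarity search over sparse vectors. It is not the thesis implementation: the dictionary-based sparse representation, the function names `normalize` and `all_pairs_cosine`, and the toy feature names are assumptions made for this example, and the pruning bounds of the published All Pairs algorithm are omitted.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse vector (dict: feature -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {f: w / norm for f, w in vec.items()} if norm else {}


def all_pairs_cosine(accounts, threshold):
    """Return {(a, b): similarity} for every pair at or above the threshold.

    `accounts` maps a (hypothetical) account name to its sparse vector.
    An inverted index ensures that only accounts sharing at least one
    feature are ever compared.
    """
    unit = {name: normalize(vec) for name, vec in accounts.items()}
    index = defaultdict(list)  # feature -> [(account, weight), ...]
    result = {}
    for name in sorted(unit):
        scores = defaultdict(float)
        for feature, weight in unit[name].items():
            # Accumulate dot products with accounts indexed earlier.
            for other, other_weight in index[feature]:
                scores[other] += weight * other_weight
            index[feature].append((name, weight))
        for other, score in scores.items():
            if score >= threshold:
                result[(other, name)] = score
    return result


# Toy data: features are hypothetical account attributes.
accounts = {
    "a": {"ip1": 1.0, "dev1": 1.0},
    "b": {"ip1": 1.0, "dev1": 1.0},
    "c": {"ip2": 1.0},
    "d": {"ip1": 1.0, "ip2": 1.0},
}
pairs = all_pairs_cosine(accounts, threshold=0.8)
```

Even with the inverted index, the worst case remains quadratic in the number of accounts, which is consistent with the abstract's observation that the exact method is too slow for practical use.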

Language:	Slovenian
Keywords:	clustering, approximation methods, time efficiency, similarity measure
Work type:	Master's thesis/paper
Organization:	FRI - Faculty of Computer and Information Science
Year:	2019
PID:	20.500.12556/RUL-111446
COBISS.SI-ID:	1538377155
Publication date in RUL:	01.10.2019

Secondary language

Abstract:
Language:	English
Title:	Detection of linked accounts in a large data set
We live in an era in which we leave traces of our personal data whenever we use the World Wide Web. Companies that store and analyze such data face challenges of time and space complexity because of its sheer volume. In our master's thesis, we tried to solve one of these challenges by identifying linked accounts in large data sets. We analyzed the time efficiency and computational performance of methods for finding pairs of highly similar accounts. The experiments were carried out on two data sets. We describe how the data were transformed and represented as a sparse matrix. We then searched for pairs of accounts with cosine similarity above a threshold using the exact All Pairs method, Locality-Sensitive Hashing (LSH), and Bisecting K-Means. Our goal was to evaluate which of these methods yields the best performance with acceptable processing time. We found that the All Pairs method is inadequate for practical use because of its time inefficiency, and that the performance of the approximation methods depends on the choice of parameters. The LSH method finds pairs with similarity over 80% in the shortest time, whereas Bisecting K-Means is more time-efficient for lower similarity thresholds.
Keywords:	clustering, approximation methods, time complexity, similarity measure
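To make the approximate approach more concrete, here is a minimal sketch of locality-sensitive hashing for cosine similarity using random hyperplanes with banding. The abstract does not say which LSH variant or parameters the thesis used, so the hashing scheme, the `bands`, `rows`, and `seed` parameters, the function names, and the toy data are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sparse vectors (dict: feature -> weight)."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def signature(vec, planes):
    """One bit per hyperplane: the sign of the projection onto it."""
    return tuple(
        sum(w * plane[f] for f, w in vec.items()) >= 0 for plane in planes
    )


def lsh_pairs(accounts, threshold, bands=8, rows=4, seed=0):
    """Approximate search: verify only pairs that collide in some band."""
    rng = random.Random(seed)
    features = sorted({f for vec in accounts.values() for f in vec})
    planes = [
        {f: rng.gauss(0.0, 1.0) for f in features}
        for _ in range(bands * rows)
    ]
    buckets = defaultdict(list)
    for name in sorted(accounts):
        sig = signature(accounts[name], planes)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(name)
    candidates = set()
    for members in buckets.values():
        candidates.update(combinations(members, 2))
    result = {}
    for a, b in candidates:
        # Exact verification removes false positives; pairs that never
        # collide are missed, which is what makes the method approximate.
        sim = cosine(accounts[a], accounts[b])
        if sim >= threshold:
            result[(a, b)] = sim
    return result


# Toy data: features are hypothetical account attributes.
accounts = {
    "a": {"ip1": 1.0, "dev1": 1.0},
    "b": {"ip1": 1.0, "dev1": 1.0},
    "c": {"ip2": 1.0},
    "d": {"ip1": 1.0, "ip2": 1.0},
}
found = lsh_pairs(accounts, threshold=0.8)
```

More rows per band make collisions rarer and the search faster but less complete, while more bands recover more pairs at a higher cost. This trade-off is one way in which the results of an approximate method can depend on the choice of parameters, as the abstract notes.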
