Stability of Hierarchical Clustering
Turanjanin, Aleksandra (Author), Zupan, Blaž (Mentor)

PDF - Presentation file (947,69 KB)
MD5: 0DDA009505A19A7A9BEE33AE00C63BA9

Abstract
Hierarchical clustering is an unsupervised data mining technique that infers a set of nested, hierarchically organised clusters. Even slight perturbations in the data can change the clustering structure. Ideally, we should be interested only in the stable part of the clustering hierarchy, so it is essential to assess the stability of the nodes in the hierarchy. In this thesis, we review approaches for determining the stability and statistical significance of clusters. While all the reviewed methods use resampling, their results can differ substantially because of details in implementation and stability scoring. The approach called pvclust has recently been the most widely used in practical applications. Its R implementation is slow and offers limited visualisation of results. We have implemented pvclust in Python, yielding an implementation that is almost an order of magnitude faster than the R version. Our implementation is currently the only open-source Python implementation of stability analysis for hierarchical clustering. To visualise the results and enable interactive exploratory data analysis, we have also incorporated our implementation into the Orange data mining toolbox.
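
The resampling idea that the reviewed methods share can be illustrated with a minimal sketch. It is not the thesis implementation and not the pvclust algorithm itself (pvclust additionally uses multiscale bootstrap to compute approximately unbiased p-values); it only computes a plain bootstrap proportion for each cluster. The function names `clusters_of` and `bootstrap_stability`, the choice to resample features, and the default linkage and metric are all assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist


def clusters_of(Z, n_items):
    """Collect every cluster formed by a merge in linkage matrix Z,
    each represented as a frozenset of original item indices."""
    members = {i: frozenset([i]) for i in range(n_items)}
    found = set()
    for step, (a, b, _dist, _size) in enumerate(Z):
        merged = members[int(a)] | members[int(b)]
        members[n_items + step] = merged  # SciPy numbers new clusters n, n+1, ...
        found.add(merged)
    return found


def bootstrap_stability(X, n_boot=200, method="average",
                        metric="euclidean", seed=0):
    """Hypothetical helper: X has one row per item and one column per feature.
    Features are resampled with replacement, the items are re-clustered, and
    for each cluster of the original dendrogram we return the fraction of
    bootstrap dendrograms in which exactly the same cluster reappears."""
    rng = np.random.default_rng(seed)
    n_items, n_features = X.shape
    reference = clusters_of(linkage(pdist(X, metric), method), n_items)
    counts = dict.fromkeys(reference, 0)
    for _ in range(n_boot):
        cols = rng.integers(0, n_features, size=n_features)
        Zb = linkage(pdist(X[:, cols], metric), method)
        for cluster in clusters_of(Zb, n_items) & reference:
            counts[cluster] += 1
    return {cluster: k / n_boot for cluster, k in counts.items()}
```

A score close to 1 suggests that the corresponding node of the dendrogram survives resampling, while a low score marks a node that depends on the particular sample. The root cluster, which contains all items, always receives a score of 1 and carries no information.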

Language:English
Keywords:hierarchical clustering, stability, dendrogram, unsupervised learning
Work type:Master's thesis/paper
Typology:2.09 - Master's Thesis
Organization:FRI - Faculty of Computer and Information Science
Year:2020
PID:20.500.12556/RUL-121038
COBISS.SI-ID:32934403
Publication date in RUL:29.09.2020
Views:1923
Downloads:236

Secondary language

Language:Slovenian
Title:Stabilnost hierarhičnega razvrščanja v skupine
Abstract:
Hierarchical clustering is an unsupervised learning method that searches for nested, hierarchically organised groups in data. Its weakness is sensitivity to small perturbations in the data, which can cause large changes in the clustering structure. Ideally, we are interested only in the stable part of the hierarchy, which requires assessing the stability of its nodes. In this thesis, we reviewed approaches for determining the stability and statistical significance of clusters. Although all the reviewed methods use resampling, their results can differ substantially because of details in implementation and stability scoring. The method called pvclust has recently been the most frequently used in practical applications. Its R implementation is slow, and its visualisation of the results is poor. We implemented the pvclust method in Python, and our implementation is almost an order of magnitude faster than the R version. Our implementation is currently the only open-source Python implementation for stability analysis of hierarchical clustering. To visualise the results and enable interactive exploratory data analysis, we incorporated the implementation into the Orange data mining toolbox.

Keywords:hierarchical clustering, stability, dendrogram, unsupervised learning
