In this master’s thesis, we developed and evaluated a new clustering method, which we call the log-leaders method. The name reflects its origin in the so-called leaders method, an umbrella term for a family of clustering procedures that includes the well-known k-means algorithm. Our idea was to extend this approach by replacing the Euclidean distance with a distance based on the log-likelihood, which serves as a measure of cohesion within the clusters. The Euclidean distance is not meant to be used on categorical data, whereas the proposed distance function can handle continuous, categorical and mixed types of variables.
The main purpose of this thesis was to develop a working version of the proposed method, implement it in the R programming language and evaluate whether the algorithm is competitive with its alternatives. The evaluation was based on simulations and on a demonstrative example using real data on the indicators of the human development index (continuous variables) and the political regime of individual countries of the world (categorical variable).
The simulations showed that, in the explored scenarios, the log-leaders method performed similarly to the alternative algorithms. In the simulation with numeric variables only, almost all methods gave similar results, so the recommended approach in this case could be k-means, which is closely related to our proposed algorithm and produces equivalent clusterings but runs much faster than our implementation. The same cannot be said for mixed-type variables, where the k-medoids (PAM) method, applied to a dissimilarity matrix based on the Gower distance, gave the best results, especially for a higher number of clusters. The effects of the non-fixed factors on the proposed method were as expected and much the same as on the other algorithms, with the quality of the results being similar across methods in most cases. The main problem with our prototype version of the log-leaders method was its high computational intensity, which resulted in slower execution and could be partially attributed to the speed of the R programming language itself. In the demonstration on real data, the clustering methods we investigated gave similar results, so in this real-life example our implementation of the log-leaders method is equivalent to its alternatives. Overall, the proposed method is comparable to already established algorithms, although we cannot argue that it performs any better. Nevertheless, the newly developed algorithm has an important advantage: it supports a version of the Bayesian information criterion that is fully compatible with the log-leaders method, which is helpful when deciding on the optimal number of clusters. The implementation and simulations also provide a basis for future research, in which the prototype version of the proposed method could be optimized and the conditions of the simulations extended to include more complex scenarios.