Rapid progress in digital data acquisition techniques has led to huge volumes of data. Approximately 80% of the world’s data is stored as unstructured text. Text mining has therefore become an active research field, as it seeks to discover valuable information in unstructured text. Clustering is one of the most important topics in text mining. This work presents one of the most popular document clustering algorithms, spherical $k$-means. First, the clustering problem and the representation of documents are described to make the method easier to understand. The main goal of this work is to derive the spherical $k$-means algorithm. To this end, the batch version of the algorithm is introduced first, together with its weaknesses and computational improvements. Next, the incremental version of the algorithm, which improves on the results of the batch version, is described. Finally, batch and incremental iterations are combined to obtain the spherical $k$-means algorithm. The work concludes with an example application of spherical $k$-means to the problem of the authorship of the “The Wizard of Oz” books, in which the algorithm assigns authors to the books based on word frequencies.