Clustering by Compression

R. Cilibrasi and P. M. B. Vitányi. Clustering by compression. IEEE Transactions on Information Theory, 51(4):1523–1545, April 2005. [url]

———————–

This paper presents a new method for clustering based on compression. The idea is first to determine a universal similarity distance, the normalized compression distance (NCD), computed from the lengths of compressed data files; second, the authors apply a hierarchical clustering method to these distances, reporting good results on different datasets.

According to the authors, two objects are deemed close if we can significantly “compress” one given the information in the other: the more similar two pieces are, the more succinctly we can describe one given the other.
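To make the idea concrete, here is a minimal sketch of the normalized compression distance in Python. With C(x) the compressed length of x, the paper defines NCD(x, y) = (C(xy) − min{C(x), C(y)}) / max{C(x), C(y)}. Using zlib as the compressor is my assumption for illustration (the approach works with any real-world compressor), and the helper names `compressed_size` and `ncd` as well as the sample strings are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (zlib is an assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation of both objects
    return (cxy - min(cx, cy)) / max(cx, cy)


# Illustrative inputs: two near-identical texts and one unrelated byte pattern.
a = b"the quick brown fox jumps over the lazy dog " * 20
b = b"the quick brown fox leaps over the lazy cat " * 20
c = bytes(range(256)) * 4

print("NCD(a, b) =", ncd(a, b))  # similar texts: expected to be comparatively small
print("NCD(a, c) =", ncd(a, c))  # unrelated data: expected to be closer to 1
```

Computing this value for every pair of objects gives a distance matrix, which is the kind of input a hierarchical clustering step needs.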

Other keywords: heterogeneous data analysis, hierarchical unsupervised clustering, Kolmogorov Complexity, normalized compression distance, parameter-free data mining, quartet tree method, universal dissimilarity distance

Tags: clustering

Mauro Cherubini

Professor at the University of Lausanne, Switzerland
