Data representations in low dimensions such as results from unsupervised dimensionality reduction methods are often visually interpreted to find clusters of observations. To identify clusters the result must be appreciably clustered. This property of a result may be called "clusteredness". When judged visually, the appreciation of clusteredness is highly subjective. In this paper we suggest an objective way to assess clusteredness in data representations. We provide a definition of clusteredness that captures important aspects of a clustered appearance. We characterize these aspects and define the extremes rigorously. For this characterization of clusteredness we suggest an index to assess the degree of clusteredness, coined the OPTICS Cordillera. It makes only weak assumptions and is a property of the result, invariant for different partitionings or cluster assignments. We
provide bounds and a normalization for the index, and prove that it represents the aspects of clusteredness. Our index is parsimonious with respect to mandatory parameters but also exible by allowing optional parameters to be tuned. The index can be used as a descriptive goodness-of-clusteredness statistic or to compare different results. For illustration we use a data set of handwritten digits which are very differently represented in two dimensions by various popular dimensionality reduction results. Empirically, observers had a hard time to visually judge the clusteredness in these representations but our index provides a clear and easy characterisation of the clusteredness of each result.
|Discussion Paper Series / Center for Empirical Research Methods
- 101018 Statistics
- 509013 Social statistics
- Discussion Paper Series / Center for Empirical Research Methods