Assessing and quantifying clusteredness: The OPTICS Cordillera

Thomas Rusch, Kurt Hornik, Patrick Mair

Publication: Working/Discussion Paper, WU Working Paper

Abstract

Data representations in low dimensions, such as results from unsupervised dimensionality reduction methods, are often visually interpreted to find clusters of observations. To identify clusters, the result must be appreciably clustered. This property of a result may be called "clusteredness". When judged visually, the appreciation of clusteredness is highly subjective. In this paper we suggest an objective way to assess clusteredness in data representations. We provide a definition of clusteredness that captures important aspects of a clustered appearance. We characterize these aspects and define the extremes rigorously. For this characterization of clusteredness we suggest an index to assess the degree of clusteredness, coined the OPTICS Cordillera. It makes only weak assumptions and is a property of the result, invariant to different partitionings or cluster assignments. We provide bounds and a normalization for the index, and prove that it represents the aspects of clusteredness. Our index is parsimonious with respect to mandatory parameters but also flexible, allowing optional parameters to be tuned. The index can be used as a descriptive goodness-of-clusteredness statistic or to compare different results. For illustration we use a data set of handwritten digits which are very differently represented in two dimensions by various popular dimensionality reduction results. Empirically, observers had difficulty visually judging the clusteredness in these representations, but our index provides a clear and easy characterization of the clusteredness of each result.
Original language: English
Publication status: Published - 2016

Publication series

Name: Discussion Paper Series / Center for Empirical Research Methods
No.: 2016/1

Austrian Classification of Fields of Science and Technology (ÖFOS)

  • 101018 Statistics
  • 501 not use (legacy)
  • 509013 Social statistics
  • 509 not use (legacy)

WU Working Paper Series

  • Discussion Paper Series / Center for Empirical Research Methods
