cluster.stats {fpc}    R Documentation

Cluster validation statistics

Description

Computes a number of distance-based statistics that can be used for cluster validation, for comparing clusterings, and for deciding on the number of clusters: cluster sizes, cluster diameters, average distances within and between clusters, cluster separation, average silhouette widths, the Calinski and Harabasz index (the best distance-based statistic for deciding on the number of clusters in the study of Milligan and Cooper (1985)), Hubert's gamma coefficient, the Dunn index, and two indexes that assess the similarity of two clusterings, namely the corrected Rand index and Meila's VI.

Usage

cluster.stats(d,clustering,alt.clustering=NULL,
                          silhouette=TRUE,G2=FALSE,G3=FALSE,
                          compareonly=FALSE)

Arguments

d

a distance object (as generated by dist) or a distance matrix between cases.

clustering

an integer vector whose length equals the number of cases, indicating a clustering. The clusters have to be numbered from 1 to the number of clusters.

alt.clustering

an integer vector such as for clustering, indicating an alternative clustering. If provided, the corrected Rand index and Meila's VI for clustering vs. alt.clustering are computed.

silhouette

logical. If TRUE, the silhouette statistics are computed, which requires package cluster.

G2

logical. If TRUE, Goodman and Kruskal's index G2 (cf. Gordon (1999), p. 62) is computed. This performs many sort operations and can be very slow (it has been improved by R. Francois - thanks!).

G3

logical. If TRUE, the index G3 (cf. Gordon (1999), p. 62) is computed. This executes sort on all distances and can be extremely slow.

compareonly

logical. If TRUE, only the corrected Rand index and Meila's VI are computed and returned (this requires alt.clustering to be specified).
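Because clustering and alt.clustering must use the labels 1 to the number of clusters, labels produced elsewhere may need recoding first. The following is a minimal sketch in base R; the vector raw is a hypothetical example, not part of fpc.

```r
# Hypothetical raw labels that use arbitrary codes instead of 1..k.
raw <- c(10, 10, 42, 7, 42, 7, 7)

# factor() maps the distinct values to sorted levels;
# as.integer() turns those levels into the codes 1..k.
clustering <- as.integer(factor(raw))

# Check that the labels are exactly 1..k with no gaps.
k <- length(unique(clustering))
labels.ok <- identical(sort(unique(clustering)), seq_len(k))
```

The recoded vector can then be passed as the clustering (or alt.clustering) argument.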

Value

cluster.stats returns a list containing the components n, cluster.number, cluster.size, diameter, average.distance, median.distance, separation, average.toother, separation.matrix, average.between, average.within, n.between, n.within, within.cluster.ss, clus.avg.silwidths, avg.silwidth, g2, g3, pearsongamma, dunn, entropy, wb.ratio, ch, corrected.rand, vi, except if compareonly=TRUE, in which case only the last two components are computed.

n

number of cases.

cluster.number

number of clusters.

cluster.size

vector of cluster sizes (number of points).

diameter

vector of cluster diameters (maximum within cluster distances).

average.distance

vector of clusterwise within cluster average distances.

median.distance

vector of clusterwise within cluster distance medians.

separation

vector of clusterwise minimum distances of a point in the cluster to a point of another cluster.

average.toother

vector of clusterwise average distances of a point in the cluster to the points of other clusters.

separation.matrix

matrix of separation values between all pairs of clusters.

average.between

average distance between clusters.

average.within

average distance within clusters.

n.between

number of distances between clusters.

n.within

number of distances within clusters.

within.cluster.ss

a generalisation of the within clusters sum of squares (k-means objective function), which is obtained if d is a Euclidean distance matrix. For general distance measures, this is half the sum of the within cluster squared dissimilarities divided by the cluster size.

clus.avg.silwidths

vector of cluster average silhouette widths. See silhouette.

avg.silwidth

average silhouette width. See silhouette.

g2

Goodman and Kruskal's Gamma coefficient. See Milligan and Cooper (1985), Gordon (1999, p. 62).

g3

G3 coefficient. See Gordon (1999, p. 62).

pearsongamma

correlation between distances and a 0-1-vector where 0 means same cluster, 1 means different clusters. "Normalized gamma" in Halkidi et al. (2001).

dunn

minimum separation / maximum diameter. Dunn index, see Halkidi et al. (2002).

entropy

entropy of the distribution of cluster memberships, see Meila (2007).

wb.ratio

average.within/average.between.

ch

Calinski and Harabasz index (Calinski and Harabasz 1974, optimal in Milligan and Cooper 1985; generalised for dissimilarities in Hennig and Liao 2010).

corrected.rand

corrected Rand index (if alt.clustering has been specified), see Gordon (1999, p. 198).

vi

variation of information (VI) index (if alt.clustering has been specified), see Meila (2007).
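The two comparison indexes can be illustrated by hand from the contingency table of two clusterings. The sketch below follows the standard textbook definitions (pair counting for the corrected Rand index; entropies and mutual information with natural logarithms for VI). It is not taken from the fpc source, and the vectors c1 and c2 as well as the helper names are hypothetical.

```r
# Two hypothetical labelings of the same six cases.
c1 <- c(1, 1, 1, 2, 2, 2)
c2 <- c(1, 1, 2, 2, 3, 3)

# Contingency table of the two clusterings, as a plain matrix.
tab <- unclass(table(c1, c2))
n <- sum(tab)
choose2 <- function(m) m * (m - 1) / 2   # number of pairs among m cases

# Corrected (adjusted) Rand index from pair counts.
sum.cells <- sum(choose2(tab))
sum.rows <- sum(choose2(rowSums(tab)))
sum.cols <- sum(choose2(colSums(tab)))
expected <- sum.rows * sum.cols / choose2(n)
corrected.rand <- (sum.cells - expected) /
  ((sum.rows + sum.cols) / 2 - expected)

# Variation of information: H(c1) + H(c2) - 2 * I(c1; c2).
p12 <- tab / n
p1 <- rowSums(p12)
p2 <- colSums(p12)
entropy <- function(p) -sum(p[p > 0] * log(p[p > 0]))
pos <- p12 > 0                            # skip empty cells (0 * log 0 = 0)
mutual.info <- sum(p12[pos] * log(p12[pos] / outer(p1, p2)[pos]))
vi <- entropy(p1) + entropy(p2) - 2 * mutual.info
```

For two identical clusterings these definitions give a corrected Rand index of 1 and a VI of 0.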

Author(s)

Christian Hennig chrish@stats.ucl.ac.uk http://www.homepages.ucl.ac.uk/~ucakche/

References

Calinski, T., and Harabasz, J. (1974) A Dendrite Method for Cluster Analysis, Communications in Statistics, 3, 1-27.

Gordon, A. D. (1999) Classification, 2nd ed. Chapman and Hall.

Halkidi, M., Batistakis, Y., Vazirgiannis, M. (2001) On Clustering Validation Techniques, Journal of Intelligent Information Systems, 17, 107-145.

Hennig, C. and Liao, T. (2010) Comparing latent class and dissimilarity based clustering for mixed type variables with application to social stratification. Research report no. 308, Department of Statistical Science, UCL. http://www.ucl.ac.uk/Stats/research/reports/psfiles/rr308.pdf

Meila, M. (2007) Comparing clusterings - an information based distance, Journal of Multivariate Analysis, 98, 873-895.

Milligan, G. W. and Cooper, M. C. (1985) An examination of procedures for determining the number of clusters. Psychometrika, 50, 159-179.

See Also

silhouette, dist, calinhara. clusterboot computes clusterwise stability statistics by resampling.

Examples

  
  set.seed(20000)
  face <- rFace(200,dMoNo=2,dNoEy=0,p=2)
  dface <- dist(face)
  complete3 <- cutree(hclust(dface),3)
  cluster.stats(dface,complete3,
                alt.clustering=as.integer(attr(face,"grouping")))
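
Several of the components described under Value can be reproduced by hand from a distance matrix, which may help when interpreting the output. The following is a sketch in base R on hypothetical toy data. It follows the verbal definitions given above and is not the fpc implementation; the natural logarithm in the entropy is an assumption.

```r
set.seed(1)
# Hypothetical toy data: two Gaussian groups in the plane.
x <- rbind(matrix(rnorm(40), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2))
clustering <- rep(1:2, each = 20)

dmat <- as.matrix(dist(x))
n <- nrow(dmat)
k <- max(clustering)

# Each unordered pair of cases once (lower triangle of the matrix).
low <- lower.tri(dmat)
dvec <- dmat[low]
same <- outer(clustering, clustering, "==")[low]

average.within <- mean(dvec[same])
average.between <- mean(dvec[!same])
wb.ratio <- average.within / average.between

# Correlation of distances with 0 = same cluster, 1 = different clusters.
pearsongamma <- cor(dvec, as.numeric(!same))

# Half the sum of squared within-cluster dissimilarities (over ordered
# pairs), divided by the cluster size, summed over clusters.
within.cluster.ss <- sum(sapply(seq_len(k), function(i) {
  idx <- clustering == i
  sum(dmat[idx, idx]^2) / (2 * sum(idx))
}))

# Dunn index: minimum separation / maximum diameter.
diameter <- sapply(seq_len(k), function(i)
  max(dmat[clustering == i, clustering == i]))
separation <- sapply(seq_len(k), function(i)
  min(dmat[clustering == i, clustering != i]))
dunn <- min(separation) / max(diameter)

# Entropy of the distribution of cluster memberships.
p <- as.numeric(table(clustering)) / n
entropy <- -sum(p * log(p))

# For Euclidean distances, within.cluster.ss should match the k-means
# objective computed directly from the coordinates.
direct.ss <- sum(sapply(seq_len(k), function(i) {
  xi <- x[clustering == i, , drop = FALSE]
  sum(sweep(xi, 2, colMeans(xi))^2)
}))
```

Comparing within.cluster.ss with direct.ss (for example via all.equal) illustrates why the statistic reduces to the k-means objective when d is a Euclidean distance matrix.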
  

[Package fpc version 2.0-3 Index]