prediction.strength {fpc}    R Documentation

Prediction strength for estimating number of clusters

Description

Computes the prediction strength of a clustering of a dataset into different numbers of components. The prediction strength is defined according to Tibshirani and Walther (2005), who recommend choosing as the optimal number of clusters the largest number of clusters that leads to a prediction strength above 0.8 or 0.9. See Details.

Usage

  prediction.strength(xdata, Gmin=2, Gmax=10, method="kmeans", M=50,
                      cutoff=0.8, ...)

Arguments

xdata

data (something that can be coerced into a matrix). Note that this cannot currently be a dissimilarity matrix.

Gmin

integer. Minimum number of clusters. Note that the prediction strength for 1 cluster is trivially 1, which is automatically included if Gmin>1. Therefore Gmin<2 is useless.

Gmax

integer. Maximum number of clusters.

method

one of "kmeans", "pam" or "clara", specifying the clustering method to be applied.

M

integer. Number of times the dataset is divided into two halves.

cutoff

numeric between 0 and 1. The optimal number of clusters is the maximum one with prediction strength above cutoff.

...

arguments to be passed on to the clustering method.

Details

The prediction strength for a certain number of clusters k under a random partition of the dataset into halves A and B is defined as follows. Both halves are clustered with k clusters. Then the points of A are classified to the clusters of B. This is done by assigning every observation in A to the closest cluster centroid in B (using the function knn1). A pair of points of A in the same A-cluster is defined to be correctly predicted if both points are classified into the same cluster on B. The same is done with the points of B relative to the clustering on A. The prediction strength for each of the clusterings is the minimum (taken over all clusters) relative frequency of correctly predicted pairs of points of that cluster. The final mean prediction strength statistic is the mean over all 2M clusterings.
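
The definition above can be illustrated for a single random split with a minimal base R sketch. This is not the code used by prediction.strength: the helper names nearest_centroid and cluster_min_agreement are hypothetical, the nearest centroid is computed directly instead of calling knn1, the data are simulated, and clusters with fewer than two points are skipped, which is an assumption of this sketch.

```r
# Sketch of the prediction strength for ONE random split (base R only).
# Helper names are hypothetical and not part of fpc.

# Assign each row of `points` to the nearest row of `centers`
# (squared Euclidean distance).
nearest_centroid <- function(points, centers) {
  points <- as.matrix(points)
  d <- sapply(seq_len(nrow(centers)), function(j) {
    rowSums(sweep(points, 2, centers[j, ], "-")^2)
  })
  d <- matrix(d, nrow = nrow(points))  # keep matrix shape for a single point
  max.col(-d, ties.method = "first")
}

# For each cluster in `own`, compute the fraction of point pairs that also
# share a predicted label; return the minimum over clusters.
# Assumption of this sketch: clusters with fewer than 2 points are skipped.
cluster_min_agreement <- function(own, predicted) {
  per_cluster <- sapply(split(predicted, own), function(p) {
    n <- length(p)
    if (n < 2) return(NA_real_)
    sum(choose(table(p), 2)) / choose(n, 2)
  })
  min(per_cluster, na.rm = TRUE)
}

# Simulated data: two well-separated groups in two dimensions.
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
k <- 2

# Random partition into halves A and B.
in_a <- sample(nrow(x)) <= nrow(x) / 2
A <- x[in_a, , drop = FALSE]
B <- x[!in_a, , drop = FALSE]

# Cluster both halves with k clusters.
fit_a <- kmeans(A, centers = k, nstart = 5)
fit_b <- kmeans(B, centers = k, nstart = 5)

# Classify A to the centroids of B, and B to the centroids of A.
ps_a <- cluster_min_agreement(fit_a$cluster, nearest_centroid(A, fit_b$centers))
ps_b <- cluster_min_agreement(fit_b$cluster, nearest_centroid(B, fit_a$centers))

# Averaging these two values over M random splits gives the mean over
# all 2M clusterings described above.
ps_split <- mean(c(ps_a, ps_b))
ps_split
```

Repeating the split M times and averaging all 2M values would give the mean prediction strength for this k.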

Value

List with components

predcorr

list of vectors of length M with relative frequencies of correct predictions (clusterwise minimum). Every list entry refers to a certain number of clusters.

mean.pred

means of predcorr for all numbers of clusters.

optimalk

optimal number of clusters.

cutoff

see above.
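
The relation between mean.pred, cutoff and optimalk can be sketched as follows. The values are made up for illustration, and the sketch assumes that entry k of the vector holds the mean prediction strength for k clusters, with entry 1 equal to 1.

```r
# Hypothetical mean prediction strengths; entry k refers to k clusters.
# Entry 1 is 1 because the prediction strength for 1 cluster is trivially 1.
mean_pred <- c(1, 0.92, 0.85, 0.61, 0.47)
cutoff <- 0.8

# Largest number of clusters whose mean prediction strength is above cutoff.
optimal_k <- max(which(mean_pred > cutoff))
optimal_k
```

Because the entry for 1 cluster is always 1, the maximum is taken over a non-empty set for any cutoff below 1.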

Author(s)

Christian Hennig chrish@stats.ucl.ac.uk http://www.homepages.ucl.ac.uk/~ucakche/

References

Tibshirani, R. and Walther, G. (2005) Cluster Validation by Prediction Strength, Journal of Computational and Graphical Statistics, 14, 511-528.

See Also

kmeans, pam, clara

Examples

  set.seed(98765)
  iriss <- iris[sample(150,20),-5]
  prediction.strength(iriss,2,3,M=3)
  prediction.strength(iriss,2,3,M=3,method="pam")
# The examples are fast, but of course M should really be larger.

[Package fpc version 2.0-3 Index]