clustindex {cclust} | R Documentation |
Cluster Indexes
Description
clres
is the result of a clustering algorithm, i.e., an object of a class
such as "cclust".
This function calculates the values of several clustering indexes.
The values of the indexes can be used independently to determine the
number of clusters in a data set.
Usage
clustindex ( clres, x, index = "all" )
Arguments
clres |
An object holding a clustering result, e.g., of class "cclust" |
x |
Data matrix |
index |
The indexes to be calculated: "calinski", "cindex", "db",
"hartigan", "ratkowsky", "scott", "marriot", "ball", "trcovw",
"tracew", "friedman", "rubin", "ssi", "likelihood", or "all" for all
the indexes. |
Details
The indexes fall into three groups, according to the statistics
mainly used to compute them.
The first group is based on the sums of squares within (SSW)
and between (SSB) the clusters. These statistics measure the
dispersion of the data points within a cluster and between the
clusters, respectively. These indexes are:
- calinski:
(SSB/(k-1))/(SSW/(n-k)), where n is the
number of data points and k is the number of clusters.
- hartigan: log(SSB/SSW).
- ratkowsky:
mean(sqrt(varSSB/varSST)), where varSSB stands for
the SSB for every variable and varSST for the total sum of
squares for every variable.
- ball: SSW/k, where k is the
number of clusters.
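The formulas of this first group can be sketched in base R as follows. This is only an illustration under stated assumptions: it uses stats::kmeans rather than cclust, the variable names are made up for the example, and the values need not match what clustindex returns, since the package may apply additional scaling.

```r
# Sketch only: SSW/SSB-based indexes computed from a base-R kmeans fit.
set.seed(1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
fit <- kmeans(x, centers = 2, nstart = 5)

n <- nrow(x)            # number of data points
k <- nrow(fit$centers)  # number of clusters

# Residuals of every point from its own cluster center
resid <- x - fit$centers[fit$cluster, , drop = FALSE]
# Deviations of every point from the overall mean
dev <- sweep(x, 2, colMeans(x))

ssw <- sum(resid^2)     # within-cluster sum of squares
sst <- sum(dev^2)       # total sum of squares
ssb <- sst - ssw        # between-cluster sum of squares

calinski <- (ssb / (k - 1)) / (ssw / (n - k))
hartigan <- log(ssb / ssw)
ball     <- ssw / k

# Ratkowsky as written above: per-variable SSB over per-variable SST
var_sst <- colSums(dev^2)
var_ssb <- var_sst - colSums(resid^2)
ratkowsky <- mean(sqrt(var_ssb / var_sst))

c(calinski = calinski, hartigan = hartigan,
  ratkowsky = ratkowsky, ball = ball)
```

Computing SSB as SST minus SSW relies on the centers being the cluster means, which holds for a converged k-means solution.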
The second group is based on the statistics of T, i.e., the
scatter matrix of the data points, and W, which is the sum of the
scatter matrices in every group. These indexes are:
- scott: n log(|T|/|W|), where n
is the number of data points and |·| stands for the
determinant of a matrix.
- marriot: k^2 |W|, where k is
the number of clusters.
- trcovw: Trace Cov W.
- tracew: Trace W.
- friedman: Trace(W^(-1) B), where
B is the scatter matrix of the cluster centers.
- rubin: |T|/|W|.
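A minimal base-R sketch of the scatter matrices and the indexes built on them is given below. It assumes a stats::kmeans fit in place of a cclust object, and it takes B as T - W, which equals the size-weighted scatter of the cluster centers when the centers are the cluster means. The names T_mat, W_mat and B_mat are illustrative. trcovw is left out because its formula is not spelled out above.

```r
# Sketch only: scatter-matrix indexes computed from a base-R kmeans fit.
set.seed(1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
fit <- kmeans(x, centers = 2, nstart = 5)

n <- nrow(x)            # number of data points
k <- nrow(fit$centers)  # number of clusters

# T: scatter matrix of all data points around the overall mean
T_mat <- crossprod(sweep(x, 2, colMeans(x)))

# W: sum of the scatter matrices of the individual clusters
resid <- x - fit$centers[fit$cluster, , drop = FALSE]
W_mat <- crossprod(resid)

# B: between-cluster scatter (assumption: T = W + B)
B_mat <- T_mat - W_mat

scott    <- n * log(det(T_mat) / det(W_mat))
marriot  <- k^2 * det(W_mat)
tracew   <- sum(diag(W_mat))
friedman <- sum(diag(solve(W_mat) %*% B_mat))
rubin    <- det(T_mat) / det(W_mat)

c(scott = scott, marriot = marriot, tracew = tracew,
  friedman = friedman, rubin = rubin)
```

Note that tracew coincides with SSW, and scott is n times the logarithm of rubin, so these indexes carry closely related information.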
The third group consists of four indexes that do not belong to the
previous groups and have nothing in common with each other.
- cindex: the C-Index is a cluster similarity measure which,
for a binary data set, is expressed as:
[d_w - min(d_w)]/[max(d_w) - min(d_w)], where
d_w
is the sum of all n_d within-cluster distances,
min(d_w) is the sum of the n_d smallest pairwise
distances in the data set, and max(d_w) is the sum of the
n_d largest pairwise distances. In order to compute the
C-Index, all pairwise distances in the data set have to be computed and
stored. In the case of binary data, storing the distances poses
no problem since there are only a few possible
distances. However, the computation of all distances can make this
index prohibitive for large data sets.
- db: R = (1/n) * sum(R_i),
where R_i stands for the maximum value of R_ij for
i ≠ j, and R_ij for
R_ij = (SSW_i + SSW_j)/DC_ij, where DC_ij is the
distance between the centers of the two clusters i and j.
- likelihood: under the assumption of
independence of the variables within a cluster, a cluster solution
can be regarded as a mixture model for the data, where the cluster
centers give the probabilities for each variable to be
1. Therefore, the negative log-likelihood can be computed and
used as a quality measure for a cluster solution. Note that the
assumptions for applying special penalty terms, as in AIC or BIC,
are not fulfilled in this model, and that such terms show no effect
for these data sets.
- ssi: this "Simple Structure Index"
combines three elements which influence the interpretability of a
solution, i.e., the maximum difference of each variable between the
clusters, the sizes of the most contrasting clusters and the
deviation of a variable in the cluster centers compared to its
overall mean. These three elements are multiplicatively combined and
normalized to give a value between 0 and 1.
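The C-Index formula from this group can be sketched in base R as shown below. This is an illustration, not the package's implementation: it assumes Euclidean distances on continuous data and a stats::kmeans partition, whereas the description above has binary data in mind. All variable names are made up for the example.

```r
# Sketch only: C-Index computed from all pairwise distances.
set.seed(1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
cl <- kmeans(x, centers = 2, nstart = 5)$cluster

# All pairwise distances, one entry per unordered pair of points
dm <- as.matrix(dist(x))
lt <- lower.tri(dm)
d_all <- dm[lt]

# TRUE for pairs whose two points lie in the same cluster
same <- outer(cl, cl, "==")[lt]

n_d <- sum(same)          # number of within-cluster pairs
d_w <- sum(d_all[same])   # sum of the within-cluster distances

# Sums of the n_d smallest and the n_d largest pairwise distances
d_sorted <- sort(d_all)
min_dw <- sum(d_sorted[seq_len(n_d)])
max_dw <- sum(rev(d_sorted)[seq_len(n_d)])

cindex <- (d_w - min_dw) / (max_dw - min_dw)
cindex
```

The result lies between 0 and 1, and values near 0 mean that the within-cluster pairs are among the closest pairs in the data set. The full distance matrix grows quadratically with the number of points, which is the cost the description warns about.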
Value
Returns a vector with the index values.
Author(s)
Evgenia Dimitriadou and Andreas Weingessel
References
Andreas Weingessel, Evgenia Dimitriadou and Sara Dolnicar,
An Examination Of Indexes For Determining The Number
Of Clusters In Binary Data Sets,
http://www.wu-wien.ac.at/am/wp99.htm#29
and the references therein.
See Also
cclust, kmeans
Examples
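library(cclust)

# a 2-dimensional example: two Gaussian clusters of 50 points each
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
# k-means with 2 centers and at most 20 iterations
cl <- cclust(x, 2, 20, verbose = TRUE, method = "kmeans")
# compute all available indexes for this cluster solution
resultindexes <- clustindex(cl, x, index = "all")
resultindexes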