seriate {seriation}R Documentation

Seriate objects in dissimilarity matrices, matrices or arrays

Description

Tries to find an linear order for objects using data in form of a dissimilarity matrix (two-way one mode data), a data matrix (two-way two-mode data) or a data array (k-way k-mode data).

Usage

## S3 method for class 'dist'
seriate(x, method = NULL, control = NULL, ...)
## S3 method for class 'matrix'
seriate(x, method = NULL, control = NULL, 
    margin = c(1,2), ...)
## S3 method for class 'array'
seriate(x, method = NULL, control = NULL, 
    margin = seq(length(dim(x))), ...)

Arguments

x

the data.

method

a character string with the name of the seriation method (default: varies by data type).

control

a list of control options passed on to the seriation algorithm.

margin

a vector giving the margins to be seriated. For matrix, 1 indicates rows, 2 indicates columns, c(1,2) indicates rows and columns. For array, margin gets a vector with the dimensions to seriate.

...

further arguments (unused).

Details

Two-way two-mode data has to be provided as a dist object (not as a symmetric matrix). Similarities have to be transformed in a suitable way into dissimilarities. Currently the following methods are implemented for dist:

"ARSA"

Anti-Robinson seriation by simulated annealing.

A heuristic developed by Brusco et al (2007).

"BBURCG"

Anti-Robinson seriation (unweighted)

A branch-and-bound implementation by Brusco and Stahl (2005).

"BBWRCG"

Anti-Robinson seriation (weighted)

A branch-and-bound implementation by Brusco and Stahl (2005).

"TSP"

Traveling salesperson problem solver.

A solver in TSP can be used (see solve_TSP in package TSP). The solver method can be passed on via the control argument, e.g. control = list(method = "insertion").

Since a tour returned by a TSP solver is a connected circle and we are looking for a path representing a linear order, we need to find the best cutting point. Climer and Zhang (2006) suggest to add a dummy city with equal distance to each other city before generating the tour. The place of this dummy city in an optimal tour with minimal length is the best cutting point (it lies between the most distant cities).

"Chen"

Rank-two ellipse seriation (Chen 2002).

This method starts with generating a sequence of correlation matrices R^1, R^2, …. R^1 is the correlation matrix of the original distance matrix D (supplied to the function as x), and

R^{n+1} = φ R^n,

where φ calculates the correlation matrix.

The rank of the matrix R^n falls with increasing n. The first R^n in the sequence which has a rank of 2 is found. Projecting all points in this matrix on the first two eigenvectors, all points fall on an ellipse. The order of the points on this ellipse is the resulting order.

The ellipse can be cut at the two interception points (top or bottom) of the vertical axis with the ellipse. In this implementation the top most cutting point is used.

"MDS"

Multidimensional scaling (MDS).

Use multidimensional scaling techniques to find an linear order. Note that unidimensional scaling would be more appropriate but is very hard to solve. Generally, MDS provides good results.

By default, metric MDS (cmdscale in stats) is used. In case of of general dissimilarities, non-metric MDS can be used. The choices are isoMDS and sammon from MASS. The method can be specified as the element method ("cmdscale", "isoMDS" or "sammon") in control.

"HC"

Hierarchical clustering.

Using the order of the leaf nodes in a dendrogram obtained by hierarchical clustering can be used as a very simple seriation technique. This method applies hierarchical clustering (hclust) to x. The clustering method can be given using a "method" element in the control list. If omitted, the default "complete" is used.

"GW", "OLO"

Hierarchical clustering with optional reordering.

Uses also the order of the leaf nodes in a dendrogram (see method "HC"), however, the leaf notes are reordered.

A dendrogram (binary tree) has 2^{n-1} internal nodes (subtrees) and the same number of leaf orderings. That is, at each internal node the left and right subtree (or leaves) can be swapped, or, in terms of a dendrogram, be flipped.

Method "GW" uses an algorithm developed by Gruvaeus and Wainer (1972) and implemented in package gclus (Hurley 2004). The clusters are ordered at each level so that the objects at the edge of each cluster are adjacent to that object outside the cluster to which it is nearest. The method produces an unique order.

Method "OLO" (Optimal leaf ordering, Bar-Joseph et al., 2001) produces an optimal leaf ordering with respect to the minimizing the sum of the distances along the (Hamiltonian) path connecting the leaves in the given order. The time complexity of the algorithm is O(n^3). Note that non-finite distance values are not allowed.

Both methods start with a dendrogram created by hclust. As the "method" element in the control list a clustering method (default "complete") can be specified. Alternatively, a hclust object can be supplied using an element named "hclust".

Two-way two mode data are general positive matrices. Currently the following methods are implemented for matrix:

"BEA"

Bond Energy Algorithm (BEA; McCormick 1972).

The algorithm tries to maximize the measure of effectiveness (see criterion) of a non-negative matrix. Due to the definition of this measure, the tasks of ordering rows and columns is separable and can be solved independently.

A row is arbitrarily placed; then rows are positioned one by one. When this is completed, the columns are treated similarly. The overall procedure amounts to two approximate traveling salesperson problems (TSP), one on the rows and one on the columns. The so-called 'best insertion strategy' is used: rows (or columns) are inserted into the current permuted list of rows (or columns). Several consecutive runs of the algorithm might improve the energy.

Note that Arabie and Hubert (1990) question its use with non-binary data if the objective is to find a seriation or one-dimensional orderings of rows and columns.

The BEA code used in this package was implemented by Fionn Murtagh.

In control as element "rep" the number of runs can be specified. The results of the best run will be returned.

"BEA_TSP"

Use a TSP to optimize the measure of effectiveness (Lenstra 1974).

Use a TSP solver to optimize ME.

In control as element "method" a TSP solver method can be specified (see package TSP).

"PCA"

Principal component analysis.

Uses the projection of the data on its first principal component to determine the order.

Note that for a distance matrix calculated from x with Euclidean distance, this methods minimizes the least square criterion.

For array no built-in methods are currently available.

Value

Returns an object of class ser_permutation.

References

P. Arabie and L.J. Hubert (1990): The bond energy algorithm revisited, IEEE Transactions on Systems, Man, and Cybernetics, 20(1), 268–274.

Z. Bar-Joseph, E. D. Demaine, D. K. Gifford, and T. Jaakkola. (2001): Fast Optimal Leaf Ordering for Hierarchical Clustering. Bioinformatics, 17(1), 22–29.

Brusco, M., Koehn, H.F., and Stahl, S. (2007): Heuristic Implementation of Dynamic Programming for Matrix Permutation Problems in Combinatorial Data Analysis. Psychometrika, conditionally accepted.

Brusco, M., and Stahl, S. (2005): Branch-and-Bound Applications in Combinatorial Data Analysis. New York: Springer.

Chen, C. H. (2002): Generalized Association Plots: Information Visualization via Iteratively Generated Correlation Matrices. Statistica Sinica, 12(1), 7–29.

Sharlee Climer, Weixiong Zhang (2006): Rearrangement Clustering: Pitfalls, Remedies, and Applications, Journal of Machine Learning Research, 7(Jun), 919–943.

Gruvaeus, G. and Wainer, H. (1972): Two Additions to Hierarchical Cluster Analysis, British Journal of Mathematical and Statistical Psychology, 25, 200–206.

Hurley, Catherine B. (2004): Clustering Visualizations of Multidimensional Data. Journal of Computational and Graphical Statistics, 13(4), 788–806.

J.K. Lenstra (1974): Clustering a Data Array and the Traveling-Salesman Problem, Operations Research, 22(2) 413–414.

W.T. McCormick, P.J. Schweitzer and T.W. White (1972): Problem decomposition and data reorganization by a clustering technique, Operations Research, 20(5), 993–1009.

See Also

criterion, solve_TSP in TSP, hclust in stats

Examples

##seriate dist
data("iris")
x <- as.matrix(iris[-5])
x <- x[sample(1:nrow(x)),]
d <- dist(x)

## default seriation
order <- seriate(d)
order

## plot
def.par <- par(no.readonly = TRUE)
layout(cbind(1,2), respect = TRUE)

pimage(d, main = "Random")
pimage(d, order, main = "Reordered")

par(def.par)

## compare quality
rbind(
        random = criterion(d),
        reordered = criterion(d, order)
     )


## seriate matrix
data("iris")
x <- as.matrix(iris[-5])

## to make the variables comparable, we scale the data
x <- scale(x, center = FALSE)

## try some methods
def.par <- par(no.readonly = TRUE)
layout(matrix(1:4, ncol = 2, byrow = TRUE), respect=TRUE)

pimage(x, main = "original data")
criterion(x)

order <- seriate(x, method = "BEA_TSP")
pimage(x, order, main = "TSP to optimize ME")
criterion(x, order)

order <- seriate(x, method="PCA")
pimage(x, order, main = "first principal component")
criterion(x, order)

## 2 TSPs
order <- c(
    seriate(dist(x), method = "TSP"),
    seriate(dist(t(x)), method = "TSP")
)
pimage(x, order, main = "2 TSPs")
criterion(x, order)

par(def.par)

[Package seriation version 1.0-6 Index]