
Source Code for Package nltk_lite.cluster

# Natural Language Toolkit: Clusterers
#
# Copyright (C) 2004-2007 University of Melbourne
# Author: Trevor Cohn <tacohn@cs.mu.oz.au>
# Porting: Steven Bird <sb@csse.unimelb.edu.au>
# URL: <http://nltk.sf.net>
# For license information, see LICENSE.TXT

  9  """ 
 10  This module contains a number of basic clustering algorithms. Clustering 
 11  describes the task of discovering groups of similar items with a large 
 12  collection. It is also describe as unsupervised machine learning, as the data 
 13  from which it learns is unannotated with class information, as is the case for 
 14  supervised learning.  Annotated data is difficult and expensive to obtain in 
 15  the quantities required for the majority of supervised learning algorithms. 
 16  This problem, the knowledge acquisition bottleneck, is common to most natural 
 17  language processing tasks, thus fueling the need for quality unsupervised 
 18  approaches. 

This module contains a k-means clusterer, E-M clusterer and a group average
agglomerative clusterer (GAAC). All these clusterers involve finding good
cluster groupings for a set of vectors in multi-dimensional space.

The K-means clusterer starts with k arbitrarily chosen means, then allocates
each vector to the cluster with the closest mean. It then recalculates the
means of each cluster as the centroid of the vectors in the cluster. This
process repeats until the cluster memberships stabilise. This is a
hill-climbing algorithm which may converge to a local maximum. Hence the
clustering is often repeated with random initial means and the most commonly
occurring output means are chosen.
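
For illustration, a single k-means iteration might be sketched as follows
(a sketch only; the names kmeans_iteration and distance are assumed here,
and the package's actual implementation lives in the kmeans module)::

    def kmeans_iteration(vectors, means, distance):
        # assignment step: allocate each vector to the closest mean
        clusters = []
        for vector in vectors:
            distances = [distance(vector, mean) for mean in means]
            clusters.append(distances.index(min(distances)))
        # update step: recalculate each mean as its cluster's centroid
        for i in range(len(means)):
            members = [v for v, c in zip(vectors, clusters) if c == i]
            if members:
                means[i] = numpy.mean(members, axis=0)
        return clusters, means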

The GAAC clusterer starts with each of the M{N} vectors as singleton
clusters. It then iteratively merges pairs of clusters which have the
closest centroids. This continues until there is only one cluster. The order
of merges gives rise to a dendrogram: a tree with the earlier merges lower
than later merges. The membership of a given number of clusters M{c},
M{1 <= c <= N}, can be found by cutting the dendrogram at depth M{c}.
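
The merge order can be recorded with the Dendogram class defined below (a
sketch; the GAAC clusterer performs these merges automatically based on
centroid distance)::

    d = Dendogram(['a', 'b', 'c', 'd'])
    d.merge(0, 1)      # first merge: {a, b}; later indices shift down
    d.merge(1, 2)      # second merge: {c, d}
    print d.groups(2)  # cut into two clusters: [['a', 'b'], ['c', 'd']]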

The Gaussian EM clusterer models the vectors as being produced by a mixture
of k Gaussian sources. The parameters of these sources (prior probability,
mean and covariance matrix) are then found to maximise the likelihood of the
given data. This is done with the expectation maximisation algorithm. It
starts with k arbitrarily chosen means, priors and covariance matrices. It
then calculates the membership probabilities for each vector in each of the
clusters - this is the 'E' step. The cluster parameters are then updated in
the 'M' step using the maximum likelihood estimate from the cluster
membership probabilities. This process continues until the likelihood of the
data does not significantly increase.
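
The 'E' step amounts to computing normalised membership probabilities, as in
this sketch (the helper gaussian(), standing for the multivariate Gaussian
density, is assumed; the package's implementation is in the em module)::

    def e_step(vector, priors, means, covariances):
        # unnormalised membership: prior times Gaussian density
        p = [priors[j] * gaussian(vector, means[j], covariances[j])
             for j in range(len(priors))]
        total = sum(p)
        return [p_j / total for p_j in p]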

They all extend the ClusterI interface, which defines the common operations
available with each clusterer. These operations include:
   - cluster: clusters a sequence of vectors
   - classify: assign a vector to a cluster
   - classification_probdist: give the probability distribution over cluster
     memberships (see the sketch after this list)
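
Using the clusterer from the usage example further below, the distribution
can be queried as follows (a sketch; note that with a hard clusterer such as
k-means the distribution is degenerate, placing all mass on one cluster)::

    pdist = clusterer.classification_probdist(array([3, 3]))
    for name in clusterer.cluster_names():
        print name, pdist.prob(name)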

The existing clusterers also extend cluster.VectorSpace, an abstract class
which allows for singular value decomposition (SVD) and vector
normalisation. SVD is used to reduce the dimensionality of the vector space
in such a manner as to preserve as much of the variation as possible, by
reparameterising the axes in order of variability and discarding all bar the
first d dimensions. Normalisation ensures that vectors fall in the unit
hypersphere.
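
The reduction step can be pictured as follows (a sketch mirroring
VectorSpace.cluster() below, where svd_dimensions is the constructor
parameter)::

    u, d, vt = linalg.svd(numpy.transpose(array(vectors)))
    S  = d[:svd_dimensions] * numpy.identity(svd_dimensions)
    Dt = vt[:svd_dimensions, :]
    reduced = numpy.transpose(numpy.dot(S, Dt))  # one row per input vector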

Usage example (see also demo())::
    vectors = [array(f) for f in [[3, 3], [1, 2], [4, 2], [4, 0]]]

    # initialise the clusterer (will also assign the vectors to clusters)
    clusterer = cluster.KMeans(2, euclidean_distance)
    clusterer.cluster(vectors, True)

    # classify a new vector
    print clusterer.classify(array([3, 3]))

Note that the vectors must use numpy array-like
objects. nltk_contrib.unimelb.tacohn.SparseArrays may be used for
efficiency when required.
"""

from nltk_lite.probability import DictionaryProbDist
import copy, numpy, math, random, sys, types
from numpy import array, linalg

#======================================================================
# Generic interfaces
#======================================================================

class ClusterI:
    """
    Interface covering basic clustering functionality.
    """

    def cluster(self, vectors, assign_clusters=False):
        """
        Assigns the vectors to clusters, learning the clustering parameters
        from the data. Returns a cluster identifier for each vector.
        """
        raise AssertionError()

    def classify(self, token):
        """
        Classifies the token into a cluster, setting the token's CLUSTER
        parameter to that cluster identifier.
        """
        raise AssertionError()

    def likelihood(self, vector, label):
        """
        Returns the likelihood (a float) of the token belonging to the
        given cluster.
        """
        if self.classify(vector) == label:
            return 1.0
        else:
            return 0.0

    def classification_probdist(self, vector):
        """
        Classifies the token into a cluster, returning
        a probability distribution over the cluster identifiers.
        """
        likelihoods = {}
        sum = 0.0
        for cluster in self.cluster_names():
            likelihoods[cluster] = self.likelihood(vector, cluster)
            sum += likelihoods[cluster]
        for cluster in self.cluster_names():
            likelihoods[cluster] /= sum
        return DictionaryProbDist(likelihoods)

    def num_clusters(self):
        """
        Returns the number of clusters.
        """
        raise AssertionError()

    def cluster_names(self):
        """
        Returns the names of the clusters.
        """
        return range(self.num_clusters())

    def cluster_name(self, index):
        """
        Returns the name of the cluster at index.
        """
        return index

class VectorSpace(ClusterI):
    """
    Abstract clusterer which takes tokens and maps them into a vector space.
    Optionally performs singular value decomposition to reduce the
    dimensionality.
    """

    def __init__(self, normalise=False, svd_dimensions=None):
        """
        @param normalise: should vectors be normalised to length 1
        @type normalise: boolean
        @param svd_dimensions: number of dimensions to use in reducing vector
               dimensionality with SVD
        @type svd_dimensions: int
        """
        self._Tt = None
        self._should_normalise = normalise
        self._svd_dimensions = svd_dimensions

    def cluster(self, vectors, assign_clusters=False, trace=False):
        assert len(vectors) > 0

        # normalise the vectors
        if self._should_normalise:
            vectors = map(self._normalise, vectors)

        # use SVD to reduce the dimensionality
        if self._svd_dimensions and self._svd_dimensions < len(vectors[0]):
            [u, d, vt] = linalg.svd(numpy.transpose(array(vectors)))
            S = d[:self._svd_dimensions] * \
                numpy.identity(self._svd_dimensions, numpy.float64)
            T = u[:,:self._svd_dimensions]
            Dt = vt[:self._svd_dimensions,:]
            vectors = numpy.transpose(numpy.dot(S, Dt))
            self._Tt = numpy.transpose(T)

        # call abstract method to cluster the vectors
        self.cluster_vectorspace(vectors, trace)

        # assign the vectors to clusters
        if assign_clusters:
            return [self.classify(vector) for vector in vectors]

    def cluster_vectorspace(self, vectors, trace):
        """
        Finds the clusters using the given set of vectors.
        """
        raise AssertionError()

    def classify(self, vector):
        if self._should_normalise:
            vector = self._normalise(vector)
        if self._Tt is not None:
            vector = numpy.dot(self._Tt, vector)
        cluster = self.classify_vectorspace(vector)
        return self.cluster_name(cluster)

    def classify_vectorspace(self, vector):
        """
        Returns the index of the appropriate cluster for the vector.
        """
        raise AssertionError()

    def likelihood(self, vector, label):
        if self._should_normalise:
            vector = self._normalise(vector)
        if self._Tt is not None:
            vector = numpy.dot(self._Tt, vector)
        return self.likelihood_vectorspace(vector, label)

    def likelihood_vectorspace(self, vector, cluster):
        """
        Returns the likelihood of the vector belonging to the cluster.
        """
        predicted = self.classify_vectorspace(vector)
        if cluster == predicted: return 1.0
        else: return 0.0

    def vector(self, vector):
        """
        Returns the vector after normalisation and dimensionality reduction.
        """
        if self._should_normalise:
            vector = self._normalise(vector)
        if self._Tt is not None:
            vector = numpy.dot(self._Tt, vector)
        return vector

    def _normalise(self, vector):
        """
        Normalises the vector to unit length.
        """
        return vector / math.sqrt(numpy.dot(vector, vector))

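# Illustrative sketch (not part of this package): a concrete VectorSpace
# subclass need only implement cluster_vectorspace() and
# classify_vectorspace(); normalisation, SVD and cluster naming are
# inherited. For example:
#
#     class TrivialClusterer(VectorSpace):
#         def num_clusters(self):
#             return 1
#         def cluster_vectorspace(self, vectors, trace):
#             pass               # nothing to learn
#         def classify_vectorspace(self, vector):
#             return 0           # every vector is assigned to cluster 0
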
class _DendogramNode:
    """ Tree node of a dendrogram. """

    def __init__(self, value, *children):
        self._value = value
        self._children = children

    def leaves(self, values=True):
        if self._children:
            leaves = []
            for child in self._children:
                leaves.extend(child.leaves(values))
            return leaves
        elif values:
            return [self._value]
        else:
            return [self]

    def groups(self, n):
        queue = [(self._value, self)]

        while len(queue) < n:
            priority, node = queue.pop()
            if not node._children:
                queue.append((priority, node))
                break
            for child in node._children:
                if child._children:
                    queue.append((child._value, child))
                else:
                    queue.append((0, child))
            # makes the earliest merges at the start, latest at the end
            queue.sort()

        groups = []
        for priority, node in queue:
            groups.append(node.leaves())
        return groups

class Dendogram:
    """
    Represents a dendrogram, a tree with a specified branching order. This
    must be initialised with the leaf items; merge is then called iteratively
    for each branch. This class constructs a tree representing the order of
    calls to the merge function.
    """

    def __init__(self, items=[]):
        """
        @param items: the items at the leaves of the dendrogram
        @type items: sequence of (any)
        """
        self._items = [_DendogramNode(item) for item in items]
        self._original_items = copy.copy(self._items)
        self._merge = 1

    def merge(self, *indices):
        """
        Merges nodes at given indices in the dendrogram. The combined node
        replaces the first node specified; all other nodes involved in the
        merge are removed.

        @param indices: indices of the items to merge (at least two)
        @type indices: seq of int
        """
        assert len(indices) >= 2
        node = _DendogramNode(self._merge, *[self._items[i] for i in indices])
        self._merge += 1
        self._items[indices[0]] = node
        # delete from the highest index down so earlier deletions do not
        # shift the positions of nodes yet to be deleted
        for i in sorted(indices[1:], reverse=True):
            del self._items[i]

    def groups(self, n):
        """
        Finds the n-groups of items (leaves) reachable from a cut at depth n.
        @param n: number of groups
        @type n: int
        """
        if len(self._items) > 1:
            root = _DendogramNode(self._merge, *self._items)
        else:
            root = self._items[0]
        return root.groups(n)

    def show(self):
        """
        Print the dendrogram in ASCII art to standard out.
        """

        # ASCII rendering characters
        JOIN, HLINK, VLINK = '+', '-', '|'

        # find the root (or create one)
        if len(self._items) > 1:
            root = _DendogramNode(self._merge, *self._items)
        else:
            root = self._items[0]
        leaves = self._original_items

        # find the bottom row and the best cell width
        last_row = [str(leaf._value) for leaf in leaves]
        width = max(map(len, last_row)) + 1
        lhalf = width / 2
        rhalf = width - lhalf - 1

        # display functions
        def format(centre, left=' ', right=' '):
            return '%s%s%s' % (lhalf*left, centre, right*rhalf)
        def display(str):
            sys.stdout.write(str)

        # for each merge, top down
        queue = [(root._value, root)]
        verticals = [ format(' ') for leaf in leaves ]
        while queue:
            priority, node = queue.pop()
            child_left_leaf = map(lambda c: c.leaves(False)[0], node._children)
            indices = map(leaves.index, child_left_leaf)
            if child_left_leaf:
                min_idx = min(indices)
                max_idx = max(indices)
                for i in range(len(leaves)):
                    if leaves[i] in child_left_leaf:
                        if i == min_idx:   display(format(JOIN, ' ', HLINK))
                        elif i == max_idx: display(format(JOIN, HLINK, ' '))
                        else:              display(format(JOIN, HLINK, HLINK))
                        verticals[i] = format(VLINK)
                    elif min_idx <= i <= max_idx:
                        display(format(HLINK, HLINK, HLINK))
                    else:
                        display(verticals[i])
                display('\n')
            for child in node._children:
                if child._children:
                    queue.append((child._value, child))
            queue.sort()

            for vertical in verticals:
                display(vertical)
            display('\n')

        # finally, display the last line
        display(''.join(item.center(width) for item in last_row))
        display('\n')

    def __repr__(self):
        if len(self._items) > 1:
            root = _DendogramNode(self._merge, *self._items)
        else:
            root = self._items[0]
        leaves = root.leaves(False)
        return '<Dendogram with %d leaves>' % len(leaves)

########################################################################

from kmeans import *
from gaac import *
from em import *