Home | Trees | Indices | Help |
|
---|
|
1 # Natural Language Toolkit: Tree Transformations 2 # 3 # Copyright (C) 2005-2007 Oregon Graduate Institute 4 # Author: Nathan Bodenstab <bodenstab@cslu.ogi.edu> 5 # URL: <http://nltk.sf.net> 6 # For license information, see LICENSE.TXT 7 8 """ 9 A collection of methods for tree (grammar) transformations used 10 in parsing natural language. 11 12 Although many of these methods are technically grammar transformations 13 (ie. Chomsky Norm Form), when working with treebanks it is much more 14 natural to visualize these modifications in a tree structure. Hence, 15 we will do all transformation directly to the tree itself. 16 Transforming the tree directly also allows us to do parent annotation. 17 A grammar can then be simply induced from the modified tree. 18 19 The following is a short tutorial on the available transformations. 20 21 1) Chomsky Normal Form (binarization) 22 23 It is well known that any grammar has a Chomsky Normal Form (CNF) 24 equivalent grammar where CNF is defined by every production having 25 either two non-terminals or one terminal on its right hand side. 26 When we have hierarchically structured data (ie. a treebank), it is 27 natural to view this in terms of productions where the root of every 28 subtree is the head (left hand side) of the production and all of 29 its children are the right hand side constituents. In order to 30 convert a tree into CNF, we simply need to ensure that every subtree 31 has either two subtrees as children (binarization), or one leaf node 32 (non-terminal). In order to binarize a subtree with more than two 33 children, we must introduce artificial nodes. 34 35 There are two popular methods to convert a tree into CNF: left 36 factoring and right factoring. The following example demonstrates 37 the difference between them. 38 39 | Original Right-Factored Left-Factored 40 | 41 | Example: A A A 42 | / | \ / \ / \ 43 | B C D ==> B A|<C-D> OR A|<B-C> D 44 | / \ / \ 45 | C D B C 46 47 2) Parent Annotation 48 49 In addition to binarizing the tree, there are two standard 50 modifications to node labels we can do in the same traversal: parent 51 annotation and Markov order-N smoothing (or sibling smoothing). 52 53 The purpose of parent annotation is to refine the probabilities of 54 productions by adding a small amount of context. With this simple 55 addition, a CYK (inside-outside, dynamic programming chart parse) 56 can improve from 74% to 79% accuracy. A natural generalization from 57 parent annotation is to grandparent annotation and beyond. The 58 tradeoff becomes accuracy gain vs. computational complexity. We 59 must also keep in mind data sparcity issues. 60 61 | Original Parent Annotation 62 | 63 | Example: A A^<?> 64 | / | \ / \ 65 | B C D ==> B^<A> A|<C-D>^<?> where ? is the parent of A 66 | / \ 67 | C^<A> D^<A> 68 69 70 3) Markov order-N smoothing 71 72 Markov smoothing combats data sparcity issues as well as decreasing 73 computational requirements by limiting the number of children 74 included in artificial nodes. In practice, most people use an order 75 2 grammar. 76 77 | Original No Smoothing Markov order 1 Markov order 2 etc... 78 | 79 | Example: A A A A 80 | / / | \ \ / \ / \ / \ 81 | B C D E F ==> B A|<C-D-E-F> ==> B A|<C> ==> B A|<C-D> 82 | / \ / \ / \ 83 | C ... C ... C ... 84 85 Annotation decisions can be thought about in the vertical direction 86 (parent, grandparent, etc) and the horizontal direction (number of 87 siblings to keep). Parameters to the following functions specify 88 these values. For more information see: 89 90 Dan Klein and Chris Manning (2003) "Accurate Unlexicalized Parsing", ACL-03. 91 http://www.aclweb.org/anthology/P03-1054 92 93 4) Unary Collapsing 94 95 Collapse unary productions (ie. subtrees with a single child) into a 96 new non-terminal (Tree node). This is useful when working with 97 algorithms that do not allow unary productions, yet you do not wish 98 to lose the parent information. 99 100 | A 101 | | 102 | Example: B ==> A+B 103 | / \ / \ 104 | C D C D 105 106 """ 107 108 from nltk_lite.parse import * 109110 -def chomskyNormalForm(tree, factor = "right", horzMarkov = None, vertMarkov = 0, childChar = "|", parentChar = "^"):111 """ 112 This method can modify a tree in three ways: 113 1. Convert a tree into its Chomsky Normal Form (CNF) equivalent -- Every subtree 114 has either two non-terminals or one terminal as its children. This process 115 requires the creation of more "artificial" non-terminal nodes. 116 2. Markov (vertical) smoothing of children in new artificial nodes 117 3. Horizontal (parent) annotation of nodes 118 (see documentation in code for more information) 119 120 @param tree: The Tree to be modified 121 @type tree: C{Tree} 122 @param factor: Right or left factoring method (default = "right") 123 @type factor: C{string} = [left|right] 124 @param horzMarkov: Markov order for sibling smoothing in artificial nodes (None (default) = include all siblings) 125 @type horzMarkov: C{int} | None 126 @param vertMarkov: Markov order for parent smoothing (0 (default) = no vertical annotation) 127 @type vertMarkov: C{int} | None 128 @param childChar: A string used in construction of the artificial nodes, separating the head of the 129 original subtree from the child nodes that have yet to be expanded (default = "|") 130 @type childChar: C{string} 131 @param parentChar: A string used to separate the node representation from its vertical annotation 132 @type parentChar: C{string} 133 """ 134 135 # assume all subtrees have homogeneous children 136 # assume all terminals have no siblings 137 138 # A semi-hack to have elegant looking code below. As a result, 139 # any subtree with a branching factor greater than 999 will be incorrectly truncated. 140 if horzMarkov == None: horzMarkov = 999 141 142 # Traverse the tree depth-first keeping a list of ancestor nodes to the root. 143 # I chose not to use the tree.treepositions() method since it requires 144 # two traversals of the tree (one to get the positions, one to iterate 145 # over them) and node access time is proportional to the height of the node. 146 # This method is 7x faster which helps when parsing 40,000 sentences. 147 148 nodeList = [(tree, [tree.node])] 149 while nodeList != []: 150 node, parent = nodeList.pop() 151 if isinstance(node,Tree): 152 153 # parent annotation 154 parentString = "" 155 originalNode = node.node 156 if vertMarkov != 0 and node != tree and isinstance(node[0],Tree): 157 parentString = "%s<%s>" % (parentChar, "-".join(parent)) 158 node.node += parentString 159 parent = [originalNode] + parent[:vertMarkov - 1] 160 161 # add children to the agenda before we mess with them 162 for child in node: 163 nodeList.append((child, parent)) 164 165 # chomsky normal form factorization 166 if len(node) > 2: 167 childNodes = [child.node for child in node] 168 nodeCopy = node.copy() 169 node[0:] = [] # delete the children 170 171 curNode = node 172 numChildren = len(nodeCopy) 173 for i in range(1,numChildren - 1): 174 if factor == "right": 175 newHead = "%s%s<%s>%s" % (originalNode, childChar, "-".join(childNodes[i:min([i+horzMarkov,numChildren])]),parentString) # create new head 176 newNode = Tree(newHead, []) 177 curNode[0:] = [nodeCopy.pop(0), newNode] 178 else: 179 newHead = "%s%s<%s>%s" % (originalNode, childChar, "-".join(childNodes[max([numChildren-i-horzMarkov,0]):-i]),parentString) 180 newNode = Tree(newHead, []) 181 curNode[0:] = [newNode, nodeCopy.pop()] 182 183 curNode = newNode 184 185 curNode[0:] = [child for child in nodeCopy]186 187188 -def unChomskyNormalForm(tree, expandUnary = True, childChar = "|", parentChar = "^", unaryChar = "+"):189 """ 190 This method modifies the tree in three ways: 191 1. Transforms a tree in Chomsky Normal Form back to its original structure (branching greater than two) 192 2. Removes any parent annotation (if it exists) 193 3. (optional) expands unary subtrees (if previously collapsed with collapseUnary(...) ) 194 195 @param tree: The Tree to be modified 196 @type tree: C{Tree} 197 @param expandUnary: Flag to expand unary or not (default = True) 198 @type expandUnary: C{boolean} 199 @param childChar: A string separating the head node from its children in an artificial node (default = "|") 200 @type childChar: C{string} 201 @param parentChar: A sting separating the node label from its parent annotation (default = "^") 202 @type parentChar: C{string} 203 @param unaryChar: A string joining two non-terminals in a unary production (default = "+") 204 @type unaryChar: C{string} 205 """ 206 207 # Traverse the tree-depth first keeping a pointer to the parent for modification purposes. 208 nodeList = [(tree,[])] 209 while nodeList != []: 210 node,parent = nodeList.pop() 211 if isinstance(node,Tree): 212 # if the node contains the 'childChar' character it means that 213 # it is an artificial node and can be removed, although we still need 214 # to move its children to its parent 215 childIndex = node.node.find(childChar) 216 if childIndex != -1: 217 nodeIndex = parent.index(node) 218 parent.remove(parent[nodeIndex]) 219 # Generated node was on the left if the nodeIndex is 0 which 220 # means the grammar was left factored. We must insert the children 221 # at the beginning of the parent's children 222 if nodeIndex == 0: 223 parent.insert(0,node[0]) 224 parent.insert(1,node[1]) 225 else: 226 parent.extend([node[0],node[1]]) 227 228 # parent is now the current node so the children of parent will be added to the agenda 229 node = parent 230 else: 231 parentIndex = node.node.find(parentChar) 232 if parentIndex != -1: 233 # strip the node name of the parent annotation 234 node.node = node.node[:parentIndex] 235 236 # expand collapsed unary productions 237 if expandUnary == True: 238 unaryIndex = node.node.find(unaryChar) 239 if unaryIndex != -1: 240 newNode = Tree(node.node[unaryIndex + 1:], [i for i in node]) 241 node.node = node.node[:unaryIndex] 242 node[0:] = [newNode] 243 244 for child in node: 245 nodeList.append((child,node))246 247249 """ 250 Collapse subtrees with a single child (ie. unary productions) 251 into a new non-terminal (Tree node) joined by 'joinChar'. 252 This is useful when working with algorithms that do not allow 253 unary productions, and completely removing the unary productions 254 would require loss of useful information. The Tree is modified 255 directly (since it is passed by reference) and no value is returned. 256 257 @param tree: The Tree to be collapsed 258 @type tree: C{Tree} 259 @param collapsePOS: 'False' (default) will not collapse the parent of leaf nodes (ie. 260 Part-of-Speech tags) since they are always unary productions 261 @type collapsePOS: C{boolean} 262 @param collapseRoot: 'False' (default) will not modify the root production 263 if it is unary. For the Penn WSJ treebank corpus, this corresponds 264 to the TOP -> productions. 265 @type collapseRoot: C{boolean} 266 @param joinChar: A string used to connect collapsed node values (default = "+") 267 @type joinChar: C{string} 268 """ 269 270 if collapseRoot == False and isinstance(tree, Tree) and len(tree) == 1: 271 nodeList = [tree[0]] 272 else: 273 nodeList = [tree] 274 275 # depth-first traversal of tree 276 while nodeList != []: 277 node = nodeList.pop() 278 if isinstance(node,Tree): 279 if len(node) == 1 and isinstance(node[0], Tree) and (collapsePOS == True or isinstance(node[0,0], Tree)): 280 node.node += joinChar + node[0].node 281 node[0:] = [child for child in node[0]] 282 # since we assigned the child's children to the current node, 283 # evaluate the current node again 284 nodeList.append(node) 285 else: 286 for child in node: 287 nodeList.append(child)288 289291 """ 292 Convert a tree into its treebank-style bracketed equivalent. 293 """ 294 return _toTreebank(tree).strip()295297 s = " (%s" % tree.node 298 for child in tree: 299 if isinstance(child,Tree): 300 s += _toTreebank(child) 301 else: 302 s += " " + child 303 return s + ")"304 305 306 ################################################################# 307 # Demonstration 308 ################################################################# 309311 """ 312 A demonstration showing how each tree transform can be used. 313 """ 314 315 from nltk_lite.draw.tree import draw_trees 316 from nltk_lite.parse import bracket_parse 317 from nltk_lite.parse import treetransforms 318 from copy import deepcopy 319 320 # original tree from WSJ bracketed text 321 sentence = "(TOP (S (S (VP (VBN Turned) (ADVP (RB loose)) (PP (IN in) (NP (NP (NNP Shane) (NNP Longman) (POS 's)) (NN trading) (NN room))))) (, ,) (NP (DT the) (NN yuppie) (NNS dealers)) (VP (AUX do) (NP (NP (RB little)) (ADJP (RB right)))) (. .)))" 322 tree = bracket_parse(sentence) 323 324 # collapse subtrees with only one child 325 collapsedTree = deepcopy(tree) 326 treetransforms.collapseUnary(collapsedTree) 327 328 # convert the tree to CNF 329 cnfTree = deepcopy(collapsedTree) 330 treetransforms.chomskyNormalForm(cnfTree) 331 332 # convert the tree to CNF with parent annotation (one level) and horizontal smoothing of order two 333 parentTree = deepcopy(collapsedTree) 334 treetransforms.chomskyNormalForm(parentTree, horzMarkov=2, vertMarkov=1) 335 336 # convert the tree back to its original form (used to make CYK results comparable) 337 original = deepcopy(parentTree) 338 treetransforms.unChomskyNormalForm(original) 339 340 # convert tree back to bracketed text 341 sentence2 = treetransforms.toTreebank(original) 342 print "Sentences the same? ", sentence == sentence2 343 344 draw_trees(tree, collapsedTree, cnfTree, parentTree, original)345 346 if __name__ == '__main__': demo() 347
Home | Trees | Indices | Help |
|
---|
Generated by Epydoc 3.0beta1 on Wed May 16 22:47:52 2007 | http://epydoc.sourceforge.net |