Amara XML Toolkit manual

Introduction

Amara XML Tools is a collection of Pythonic tools for XML data binding. Not just tools that happen to be written in Python, but tools built from the ground up to use Python idioms and take advantage of the many advantages of Python over other programming languages.

Amara builds on 4Suite, but whereas 4Suite focuses more on literal implementation of XML standards in Python, Amara adds a much more Pythonic face to these capabilities.

Amara provides tools you can trust to conform with XML standards without losing the familiar Python feel.

The components of Amara are:

Amara Bindery: XML as easy as py

The following example shows how to create a binding from a simple XML file, monty.xml, which you can find in the demo directory

from amara import binderytools
doc = binderytools.bind_file('monty.xml')

doc is the data binding result, an object representing the XML. Since I fed the binder a full XML document, I get back an object representing the document itself which then has a member representing the top-level element.

It's that simple. You can write

doc.monty.python.spam

In order to get the value "eggs" (as a Python Unicode object) or

doc.monty.python[1]

In order to get the value "But I was looking for argument".

There are also methods binderytools.bind_string, binderytools.bind_stream, and binderytools.bind_uri for parsing XML in atrings, file-like objects and at URIs, respectively.

More complex example

The following example shows how to create a binding from an XBEL file (a popular XML format for bookmarks that was developed by Python's very own XML-SIG). There is a sample XBEL file you can use in the demo directory

from amara import binderytools
doc = binderytools.bind_file('xbel.xml')

The following sample code prints the first and second bookmark titles:

print doc.xbel.folder.bookmark.title
print doc.xbel.folder.bookmark[1].title

Note that Bindery tries to make things as natural as possible. You can access child elements by just using their name, usually. If there is more than one with the same name it grabs the first one. You can use list indices to specify one of multiple child elements with the same name. Naturally, a more explicit way of getting the first bookmark's title is:

print doc.xbel.folder.bookmark[0].title

Bindery also naturally returns child text when you access an element as above. If you prefer to be more specific, you can use the method "xml_text_content()". The following six lines are all roughly equivalent:

print doc.xbel.folder.bookmark.title
print doc.xbel.folder.bookmark[0].title
print unicode(doc.xbel.folder.bookmark.title)
print unicode(doc.xbel.folder.bookmark[0].title)
print doc.xbel.folder.bookmark.title.xml_text_content()
print doc.xbel.folder.bookmark.title[0].xml_text_content()

The following snippet prints out all bookmark URLs in the file:

def all_titles_in_folder(folder):
    #Warning: folder.bookmark will raise an AttributeError if there are no bookmarks
    for bookmark in folder.bookmark:
            print bookmark.href
    if hasattr(folder, "folder"):
        #There are sub-folders
        for folder in folder.folder:
            all_titles_in_folder(folder)
    return

for folder in doc.xbel.folder:
    all_titles_in_folder(folder)

Note: for now if you want to count the number of elements of the same name, you have to explicitly convert to list:

len(list(doc.xbel.folder))

Gives the number of top-level folders. Plain old len(doc.xbel.folder) does not work. I hope to eliminate this restriction soon.

The workings of Bindery

In the default binding XML elements turn into specialized objects. For each generic identifier (element name) a class is generated that is derived from bindery.element_base. Attributes become simple data members whose value is a Unicode object containing the attribute value.

Element objects are specially constructed so they can be treated as single objects (in which case the first child element of the corresponding name is selected, or one can use list item access notation or even iterate.

Going back to the example, binding.xbel.folder.bookmark is the same as binding.xbel.folder.bookmark\[0]. Both return the first bookmark in the first folder. To get the second bookmark in the first folder: binding.xbel.folder.bookmark\[1]

Simple element text

To extract the text content of an element, one can "cast" to Unicode. Thus to get a bookmark's title:

unicode(bookmark.title)

You can also use the equivalent xml_text_content method which returns a:

bookmark.title.xml_text_content()

If you feel you must, you can also cast to string, although it is always dangerous to carelessly treat text relating to XML as strings rather than Unicode objects. Unicode is the foundation of XML. If you choose to ignore this warning you can do:

print bookmark.title

If you want to use print without careless casting, use Python's Unicode codecs, for example:

print unicode(bookmark.title).encode('utf-8')

Complex element children

Bindery by default preserves ordering information from the source XML. You can access this through the children list of element and document objects:

folder.xml_children

Within the children list, child elements are represented using the corresponding binding objects, child text becomes simple Unicode objects. Notice that in the default binding text children are normalized, meaning that you'll never have two text nodes next to each other in xml_children.

Writing XML back out

The preservation of ordering information means that Bindery does a decent job of allowing you to render binding objects back to XML form. Use the xml() method for this.

print doc.xml()

The xml method returns encoded text, not Unicode, so the above print is safe. The default encoding is UTF-8. You can also serialize a portion of the document:

print doc.xbel.folder.xml() #Just the first folder

You can pass in a stream for the output:

doc.xml(sys.stdout)

You can control such matters as the output encoding, whether the output is pretty-printed, whether there is an output XML declaration, etc. by passing in a 4Suite output handler object. These objects provide all the XML output fine-tuning specified for XSLT 1.0:

from Ft.Xml.Xslt.XmlWriter import XmlWriter
from Ft.Xml.Xslt.OutputParameters import OutputParameters

oparams = OutputParameters()
oparams.indent = 'yes'
oparams.encoding = 'iso-8859-1'
writer = XmlWriter(oparams, sys.stdout)

#If you're serializing anything but a full document, you must
#explicitly call writer.startDocument() ... writer.endDocument()
doc.xml(writer=writer)

XPath

Bindery supports an XPath subset, mostly limited to support for element navigation. Luckily, this does means that Bindery can handle one of the most common XPath tasks: locating elements to process in a document.

The folowing example retrieves all top-level folders:

tl_folders = doc.xbel.xml_xpath(u'folder')
for folder in tl_folders:
    print folder.title        #Beware implicit cast to string

You invoke the xml_xpath method on the object you wish to serve as the context for the XPath query. To get the first element child (regardless of node name) of the first bookmark of the first folder:

doc.xbel.folder.bookmark.xml_xpath(u'*[1]')

or

doc.xbel.xml_xpath(u'folder[1]/bookmark[1]/*[1]')

or

doc.xbel.xml_xpath(u'/folder[1]/bookmark[1]/*[1]')

or

doc.xml_xpath(u'xbel/folder[1]/bookmark[1]/*[1]')

etc.

Remember: in Python, lists indices start with 0 while they start with 1 in XPath.

Notice: this XPath returns a node set, rendered in Python as a list of nodes. It happens to be a list of one node, but you still have to extract it with \[0].

The following example prints out all bookmark URLs in the file, but is terser than the similar code earlier in this document:

bookmarks = doc.xml_xpath(u'//bookmark')
for bookmark in bookmarks:
    print bookmark.href

Attribute queries also work. The following just returns all hrefs wherevere they appear in the document:

hrefs = doc.xml_xpath(u'//@href')
for href in hrefs:
    print unicode(href)

The following prints the title of the bookmark for the 4Suite project:

url = u"http://4suite.org/"
title_elements = doc.xml_xpath('//bookmark[@href="%s"]/title'%url)
#XPath node set expression always returns a list
print unicode(title_elements[0])

Namespaces

Bindery supports documents with namespaces. The following example displays a summary of the contents of an RSS 1.0 feed:

from amara import binderytools

#Set up customary namespace bindings for RSS
#These are used in XPath query and XPattern rules
RSS10_NSS = {
    u'rdf': u'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    u'dc': u'http://purl.org/dc/elements/1.1/',
    u'rss': u'http://purl.org/rss/1.0/',
    }

doc = binderytools.bind_file('rss10.rdf', prefixes=RSS10_NSS)

#Create a dictionary of RSS items
items = {}
item_list = doc.xml_xpath(u'//rss:item')
items = dict( [ ( item.about, item) for item in item_list ] )
print items

for channel in doc.RDF.channel:
    print "Channel:", channel.about
    print "Title:", channel.title
    print "Items:"
    for item_ref in channel.items.Seq.li:
        item = items[item_ref.resource]
        print "\t", item.link
        print "\t", item.title

The following illustrates how namespace details are maintained in bindery objects.

#Show the namespace particulars of the rdf:RDF element
print doc.RDF.namespaceURI
print doc.RDF.localName
print doc.RDF.prefix

Namespaces work naturally with XPath as well:

#Get the RSS item with a given URL
item_url = u'http://www.oreillynet.com/cs/weblog/view/wlg/532'
matching_items = doc.RDF.xml_xpath(u'//rss:item[@rdf:about="%s"]'%item_url)
print matching_items
assert matching_items[0].about == item_url

Push binding

If you're dealing with a large XML file, you may not want the entire data binding in memory at the same time. You may want to instantiate it bit by bit. If you have a clear pattern for how you want to break up the document, you can use the function binderytools.pushbind. See the following example:

from amara import binderytools

for folder in binderytools.pushbind('/xbel/folder', source='xbel.xml'):
    title = folder.title
    bm_count = len(list(folder.bookmark))
    print "Folder", title, "has", bm_count, "top level bookmarks"

The neat thing is that this probram never has the entire binding inn memory at time. Just enough to represent each top-level folder element. binderytools.pushbind is a generator that yields each little subtree each time it's invoked.

Modification

Modifying a binding is pretty straightforward. You can create or modify an attribute by simple assignment:

from amara import binderytools
doc = binderytools.bind_file('monty.xml')
doc.monty.foo = u'bar'

You can empty out all children of an element:

doc.monty.python.xml_clear()

And add new elements and text:

doc.monty.python.xml_append(doc.xml_element(None, u'new'))
doc.monty.python.new.xml_children.append(u'New Content')

You can anlso delete a specific element or text child. First of all it is useful to know how to get an index for a given element on its parent:

ix = doc.monty.python.xml_index_on_parent()

Which assigns ix 1 since the first python element is the second child of monty

ix = doc.monty.python[1].xml_index_on_parent()

Which assigns ix 3 since the second python element is the fourth child of monty. Once you have an index, you can specify that index in order to delete a specific child using the xml_remove_child method:

doc.monty.xml_remove_child(3)

Removes the second python element, using the index determined above. This works for text as well:

doc.monty.xml_remove_child(0)

Removes the first text node.

Note: If you're wondering why you cannot use "del doc.monty.python\[1]", this would require specialization of __del__, which I avoided because it could complicate garbage collection.

Customizing the binding

Bindery works by iterating over XML nodes and firing off a set of rules triggered by the node type and other details. The default binding is the result of the default rules that are registered for each node type, but Bindery makes this easy to tweak by letting you register your own rules.

Bindery comes bundled with 3 easy rule frameworks to handle some common binding needs.

Treating some elements as simple string values

The title elements in XBEL are always simple text, and creating full Python objects for them is overkill in most cases. They could be just as easily simple data members with the Unicode value of the element's content. To make this adjustment in Bindery, register an instance of the simple_string_element_rule rule. This rule takes an list of XPattern expressions which indicate which elements are to be simplified. So to simplify all title elements:

#Specify (using XPatterns) elements to be treated similarly to attributes
rules = [
    binderytools.simple_string_element_rule(u'title')
    ]
#Execute the binding
doc = binderytools.bind_file('xbel.xml', rules=rules)

#title is now simple unicode
print doc.xbel.folder.bookmark.title.__class__
Omitting certain elements entirely

Perhaps you want to focus on only part of a document, and to save memory and hassle, you want to omit certain elements that are not of interest in the binding. You can use the omit_element_rule in this case.

The following example does not create bindings for folder titles at all (but bookmark titles are preserved):

#Specify (using XPatterns) elements to be ignored
rules = [
    binderytools.omit_element_rule(u'folder/title')
    ]
#Execute the binding
doc = binderytools.bind_file('xbel.xml', rules=rules)

#Following would now raise an exception:
#print doc.xbel.folder.title
Stripping whitespace

A common need is to strip out pure whitespace nodes so that they don't clutter up "children" lists. Bindery bundles the ws_strip_element_rule rule for this purpose. It uses XPatterns to determine elements whose children are to be stripped of whitespace.

#Specify (using XPatterns) elements to be omitted
#In this case select all top-level elements for stripping
rules = [
    binderytools.ws_strip_element_rule(u'/*')
    ]
#Execute the binding
doc = binderytools.bind_file('xbel.xml', rules=rules)

You can combine rules, such as stripping white space while still omitting certain elements.

The type inferencer

The basic idea behind data binding is translating XML into native data types. Amara provides a rule that looks at each XML node to see if it can infer a native Python data type for the value, in particular int, float or datetime.

TYPE_MIX = """\
<?xml version="1.0" encoding="utf-8"?>
<a a1="1">
  <b b1="2.1"/>
  <c c1="2005-01-31">
    <d>5</d>
  <e>2003-01-30T17:48:07.848769Z</e>
  </c>
  <g>good</g>
</a>"""

rules=[binderytools.type_inference()]
doc = binderytools.bind_string(TYPE_MIX, rules=rules)
doc.a.a1 == 1     #type int
doc.a.b.b1 == 2.1 #type float
doc.a.c.c1 == datetime.datetime(2005, 1, 31) #type datetime.
Using namespaces in custom rules

The built-in custom rules use XPattern, which uses prefixes to specify namespaces. You have to let the binder tools know what namespace bindings are in effect:

#Set up customary namespace bindings for RSS
#These are used in XPath query and XPattern rules
RSS10_NSS = {
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'dc': 'http://purl.org/dc/elements/1.1/',
    'rss': 'http://purl.org/rss/1.0/',
    }

rules = [
    binderytools.simple_string_element_rule(u'title')
    ]
#Execute the binding
doc = binderytools.bind_file('rss10.rdf', prefixes=RSS10_NSS, rules=rules)

Using custom binding classes

If you need more sophisticated tweaking, you proably want to register your own customized binding class. The following example gives bookmark elements a method, retrieve(), which retrieves the body of the Web page:

import urllib
from xml.dom import Node
from amara import binderytools
from amara import bindery
from Ft.Xml import InputSource
from Ft.Lib import Uri

#Subclass from the default binding class
#We're adding a specialized method for accessing a bookmark on the net
class specialized_bookmark(bindery.element_base):
    def retrieve(self):
        try:
            stream = urllib.urlopen(self.href)
            content = stream.read()
            stream.close()
            return content
        except IOError:
            import sys; sys.stderr.write("Unable to access %s\n"%self.href)

#Explicitly create a binder instance, in order to customize rules
binder = bindery.binder()

#associate specialized_bookmark class with elements not in an XML
#namespace and having a GI of "bookmark"
binder.set_binding_class(None, "bookmark", specialized_bookmark)

#Execute the binding
doc = binderytools.bind_file('xbel.xml', binderobj=binder)

#Show specialized instance
print doc.xbel.folder.bookmark.__class__

#Exercise the custom method
print "Content of first bookmark:"
print doc.xbel.folder.bookmark.retrieve()

Focus on the line:

binder.set_binding_class(None, "bookmark", specialized_bookmark)

When you register classes to use in binding a given elements type you do so by specifying namespace URI and local name of the element. If you know that the element is not in a namespace, as in the XBEL example, you use None None is the Right Way to signal "not in a namespace" in most Python/XML tools, and not the empty string "".

General warning about customized bindings

Bindery tries to manage things so that writing back the XML from a binding makes sense, and that XPath gives expected results, but it is easy to bring about odd results if you customize the binding.

As an exampe, if you use simple_string_element_rule and then reserialize using xml(), the elements that were simplified will be written back out as XML attributes rather than child elements. If you do run into such artifacts after customizing a binding the usual remedy is to write a custoized xml() method or add specialized XPath wrapper code (see xpath_wrapper_mixin for the default XPath wrappering).

The detailed API

The detailed steps for data binding are:

1. Get either a DOM node or a 4Suite input source for XML content

2. Register any rules you wish to use to guide the binding

3. Generate the binding

The convenience functions in the binderytools module hide a lot of step 1 for you.

Step 2 is entirely optional: Bindery has a very serviceable set of default binding rules.

The following example shows a more explicit procedure for creating a binding from an XBEL file.

from amara import bindery
from Ft.Xml import InputSource
from Ft.Lib import Uri

#Create an input source for the XML
isrc_factory = InputSource.DefaultFactory
#Create a URI from a filename the right way
file_uri = Uri.OsPathToUri('xbel.xml', attemptAbsolute=1)
isrc = isrc_factory.fromUri(file_uri)

#Now bind from the XML given in the input source
binder = bindery.binder()
binding = binder.read_xml(isrc)

The process of generating an input source object (isrc) is straight 4Suite API. It may seem like several steps, but the API is designed to ensure that the user is very careful about what is happening.

Since you are loading from an OS path, the Uri.OsPathToUri translates to a proper, absolutized file:/// URL. If you are truly loading from a URL you can omit that line.

Bindery extension guide

Bindery is designed to be extensible, but this is not a simple proposition given the huge flexibility of XML expression, and the many different ways developpers might want to generate resulting Python objects (and vice versa). You can pretty much do whatever you need to by writing Bindery extensions, but in order to keep things basically manageable, there are some ground rules.

Bindery laws:

1) Binding objects corresponding to an XML document have a single root object from which all other objects can be reached through navigating attributes (no, fancy method calls don't count)

\[TODO: more on this section to come. If you try tweaking bindery extensions and have some useful notes, please pitch in by sending them along.]

Scimitar: the most flexible schema language for the most flexible programming language

Scimitar is an implementation of ISO Schematron that compiles a Schematron schema into a Python validator script.

Scimitar supports all of the draft ISO Schematron specification. See the TODO file for known gaps in Scimitar convenience.

You typically use scimitar in two phases. Say you have a schematron schema schema1.stron and you want to validate multiple XML files against it, instance1.xml, instance2.xml, instance3.xml.

First you run schema1.stron through the scimitar compiler script, scimitar.py:

scimitar.py schema1.stron

A file, schema1.py (same file stem with the "py" extension sunstituted), is generated in the current working directory. If you'd prefer a different location or file name, use the "-o" option. The generated file is a validator script in Python. It checks the schematron rules specified in schema1.stron.

You now run the generated validator script on each XML file you wish to validate:

python schema1.py instance1.xml

The validation report is generated on standard output by default, or you can use the "-o" option to redirect it to a file.

The validation report is an XML external parsed entity, in other words a file that is much like a well-formed XML document, but with some restrictions loosened so that it's effectively text with possible embedded tags.

To elaborate using the example from the schematron 1.5 specification:

$ cat simple1.stron
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron" version="ISO">
 <sch:title>Example Schematron Schema</sch:title>
 <sch:pattern>
   <sch:rule context="dog">
    <sch:assert test="count(ear) = 2"
    >A 'dog' element should contain two 'ear' elements.</sch:assert>
    <sch:report test="bone"
    >This dog has a bone.</sch:report>
   </sch:rule>
  </sch:pattern>
</sch:schema>

$ scimitar.py simple1.stron
$ ls simple*.py
simple1-stron.py
$ cat instance2.xml
<dog><ear/></dog>

$ python simple1-stron.py instance2.xml
<?xml version="1.0" encoding="UTF-8"?>
Processing schema: Example Schematron Schema

Processing pattern: [unnamed]

Assertion failure:
A 'dog' element should contain two 'ear' elements.

Amara DOM Tools: giving DOM a more Pythonic face

DOM came from the Java world, and hardly the most Pythonic API possible (see the Bindery above for a good step forward). Some DOM-like implementations such as 4Suite's Domlettes mix in some Pythonic idiom. Amara DOM Tools goes even further in this regard.

The pushdom

You're probably familiar with xml.dom.pulldom, which offers a nice hybrid between SAX and DOM, allowing you to efficiently isolate imporant parts of a document in a SAX-like manner, and then using DOM for finer-grained manipulation. Amara's pushdom makes this process even more convenient You give it a set of XPatterns, and it provides a generator yielding a series of DOM chunks according to the patterns.

In this way you can process huge files with very little memory usage, but most of the convenience of DOM.

for docfrag in domtools.pushdom('/labels/label', source='demo/labels.xml'):
    label = docfrag.firstChild
    name = label.xpath('string(name)')
    city = label.xpath('string(address/city)')
    if name.lower().find('eliot') != -1:
        print city.encode('utf-8')

Prints "Stamford"

See also this XML-DEV message

Generator tools

For more on the generator tools see the article "Generating DOM Magic"

Getting an XPath for a given node

domtools.abs_path allows you to get the absolute path for a node. The following code:

from amara import domtools
from Ft.Xml.Domlette import NonvalidatingReader
from Ft.Lib import Uri
file_uri = Uri.OsPathToUri('labels.xml', attemptAbsolute=1)
doc = NonvalidatingReader.parseUri(file_uri)

print domtools.abs_path(doc)
print domtools.abs_path(doc.documentElement)
for node in doc.documentElement.childNodes:
    print domtools.abs_path(node)

Displays:

/
/labels[1]
/labels[1]/text()[1]
/labels[1]/label[1]
/labels[1]/text()[2]
/labels[1]/label[2]
/labels[1]/text()[3]
/labels[1]/label[3]
/labels[1]/text()[4]
/labels[1]/label[4]
/labels[1]/text()[5]

For more on abs_path tools see the article "Location, Location, Location"

Amara SAX Tools: SAX without the brain explosion

Tenorsax (amara.saxtools.tenorsax) is a framework for "linerarizing" SAX logic so that it flows a bit more naturally, and needs a lot less state machine wizardry.

I haven't yet had time to document it heavily, but see test/saxtools/xhtmlsummary.py for an example.

See also this XML-DEV message

Flextyper: user-defined datatypes in Python for XML processing

Flextyper is an implementation of Jeni Tennison's Data Type Library Language (DTLL) (on track to become part 5 of ISO Document Schema Definition Languages (DSDL). You can use Flextyper to generate Python modules containing data types classes that can be used with 4Suite's RELAX NG library

Flextyper is currently experimental. It won't come into its full usefulness until the next release of 4Suite, although you can use it with current CVS releases of 4Suite.

Flextyper compiles a DTLL file into a collection of Python modules implementing the contained data types. Run it as follows:

flextyper.py dtll.xml

A set of files, one per data type namespace defined in the DTLL, is created. By default the output file names are based on the input, e.g. dtll-datatypes1.py, dtll-datatypes2.py, etc.

You can now register these data types modules with a processor instance of 4Suite's RELAX NG implementation.