This document describes (yet) another API for processing XML documents. It focuses on simplifying the kind of processing where a largely linear document is being converted to another largely linear document, such as where a paper or article is being translated to HTML or to print form.
There are lots of XML parser APIs around, most notably the SAX API and the W3C's DOM API. Any new API must provide significant and needed new functionality, must significantly simplify processing, or must be the basis for explaining otherwise difficult concepts.
The current paper and its accompanying implementation, for use with Python, describe a new XML parser API that's based generally on the commonly-used SAX API, but that seems, at least for a large number of applications, to have advantages:
It's simple.
It merges some of the advantages of the DOM API and declaration-based processing (such as in XSLT) into a serial model of XML parsing.
It displays some of the advantages of some of the new (version 2.2 and 2.3) Python language features, and demonstrates how general features can improve specific processing models, most notably in the text and markup language processing area.
The XML parser API implementation that comes with this paper should be considered a prototype, and an alpha implementation at that. It's work in progress. It has been used to successfully convert an XML document to HTML, but that's it.
An XML parser is encapsulated by the anXMLParser class. Creating and initializing an XML parser is done by invoking the anXMLParser's creator method. It has three optional arguments:
documentEntity: A string, Unicode string or input-file-like object which is, or which is the source of, the document to be parsed. If it's a string or Unicode string, that's the document's text. If not specified, then the parser will return a markupTokenTheBegining token later on, requesting a source for the document's text.
include: A sequence of letters, indicating which of the optionally returned tokens should be returned. The letters can be in uppercase or lowercase. The currently supported options are:
"S": Return markupTokenStartDocument and markupTokenEndDocument tokens.
"I": Return markupTokenIgnorableWhitespace tokens.
"P": Return markupTokenProcessingInstruction tokens.
"L": Return markupTokenSetDocumentLocation tokens.
"D": Return markupTokenNotationDecl and markupTokenUnparsedEntityDecl tokens.
"W": Return markupTokenException tokens for warnings.
By default none of the above tokens are returned.
entityResolver: A SAX entity resolving call-back function. If entityResolver is provided, the markupTokenEntity token will not be returned to the application.
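As a sketch -- and assuming the three arguments are taken positionally in the order listed above -- a parser that also reports document-boundary and processing-instruction tokens, and that resolves entities through a call-back, might be created like this (the "resolveEntity" function is purely illustrative):

def resolveEntity (publicId, systemId):
    # A SAX-style entity-resolving call-back: return an input source
    # for the entity.  Treating the system identifier as a local file
    # name is just for illustration.
    return open (systemId)

parser = anXMLParser (open ("mydocument.xml"), "SP", resolveEntity)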
An anXMLParser object has two useful properties and two useful methods:
.currentToken: The most recently returned token from the XML Parser.
.openElements: A list of markupTokenStartElement objects corresponding to the currently opened elements. The most recently opened element is last.
The current element is parser.openElements [-1]. If there are no open elements, the value of len(parser.openElements) is zero.
.setEntity: A method of one argument, providing the XML Parser with a string, Unicode string or input-file-like object that is the text of a previously requested entity. This method, or the correspondingly-named method of the returned markup token should be called whenever a markupTokenTheBegining or markupTokenEntity token is returned to the application.
Invoking an anXMLParser object as a function returns a generator of the parser's markup tokens. How this works is described in more detail in Retrieving Tokens From The XML Parser.
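As a small sketch of how these might be used together while iterating (the token methods are described in the sections that follow):

for token in parser ():
    if token.isCharacters () and len (parser.openElements) > 0:
        # report which element the character data appears in, and how
        # deeply it's nested
        print "characters in", parser.openElements [-1].name, \
              "at depth", len (parser.openElements)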
The primary functionality of an anXMLParser object is to return a stream of XML markup tokens. For those familiar with the SAX API, there's one object corresponding to each of SAX's call-back functions/methods, plus a few others for convenience.
You create an anXMLParser object, something like this:
parser = anXMLParser (open ("mydocument.xml"))
In this case, just the input XML document is provided. You can then start getting the markup tokens from the parser by invoking the parser object as if it were a function (this is a Python thing, but it works well in some cases such as this):
for token in parser ():
    if token.isStartElement ():
        print "Starting element", token.name
    elif token.isEndElement ():
        print "Ending element", token.name
    elif token.isCharacters ():
        print "Characters:", token.characters
(The methods and properties defined for the tokens are described in a later section.)
The first example simply displays the start- and end-tags and any characters found. But a typical application doesn't treat all elements equally -- where an element appears, the context in which it appears, determines how it is to be used. So here's a more realistic example:
for token in parser ():
    if token.isStartElement ("section"):
        for token in token.children ():

            def outputParaContent (token):
                for token in token.children ("W"):
                    if token.isCharacters ():
                        out.write (token.characters)

            if token.isStartElement ("title"):
                out.write ("<H2>")
                outputParaContent (token)
                out.write ("</H2>\n")
            elif token.isStartElement ("para"):
                out.write ("<P>")
                outputParaContent (token)
                out.write ("</P>")
Here, once the start of a "section" element is encountered, the content of the element is asked for using a nested invocation of the parser -- that's what "token.children ()" does. In the example:
"parser ()" generates the content of the document as a whole.
The first "token.children ()" generates the content of the "section" element.
The second "token.children ()", in the "outputParaContent" function, generates the content of a "title" or "para" element.
The nice thing about this way of processing is that:
There's no fiddling around with extending parser classes, or defining call-back functions as there is with the SAX API.
There's no fiddling around with data structures as there is with the DOM API.
Unlike the case with the SAX API, elements aren't all fed to a single call-back function. Elements are encountered in context, so often no context testing is required to know what to do with them.
Any caching or passing around of data can be done locally -- there's no need to add extended functionality to the parser, for example, just to pass information from a sub-element to its parent.
The layout or "shape" of a Python program processing XML in this way is very similar to that in declarative processing languages like XSLT. Functionality can be written "in-line", or common processing can be factored out, as in the "outputParaContent" function.
You can just ignore markup tokens you don't want to deal with. No explicit handling is required for returned XML markup tokens.
Or to put it another way, it's simpler than either the SAX API or DOM API, it makes available significant DOM-like functionality to SAX-like applications, and it adds some advantages of using declarative XML processing to a main-stream-like programming language.
There is a down-side to this XML processing model: random reordering of the document is not as well supported as by the DOM model. For many applications this difficulty can be dealt with by having a facility for reordering the output as described in Patched Output.
Once you've created an anXMLParser object, you can call the object, and what you'll get is a generator that returns markup tokens from the parser until it runs out (or, for a nested call, until it encounters the end of the current element's content). The following subsections describe the different kinds of tokens that you might receive. Each kind of token has a test method that tells you what kind of token you've got, plus a few other properties.
All tokens support the ".__str__" method, so that you can pass a token to Python's "str" function or use it in any context in which a string coercion is forced. What's returned is a representation of the token more-or-less as it appears in an XML document.
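For example, a sketch of a minimal "echo" loop that writes each token back out more-or-less as it appeared (how faithful the copy is depends on the include options and whitespace handling chosen):

out = open ("copy.xml", "w")
for token in parser ():
    out.write (str (token))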
token.isCharacters ()
Returns True only if the token is a markupTokenCharacters object.
token.characters
The characters returned from the XML Parser.
token.isEndDocument ()
Returns True only if the token is a markupTokenEndDocument object.
Only returned if the parser is created with the "S" include option.
token.isEndElement (*names)
Returns True only if the token is a markupTokenEndElement object. If one or more arguments are given, it only returns True if the name of the element being ended is one of those given. If no argument is given, any element will do.
token.name
The name of the element being ended.
token.elementDepth
The nesting depth of the element being ended. The root element is of depth 1.
token.usedAsEnd
A markupTokenEndElement object is not returned to the application if "parser ()" or "token.children ()" is used to generate the content of an element. However, if such a generator is not used, a markupTokenEndElement object will be returned. There are two such cases:
If an element is recognized by the XML parser as the end of an overlapped markup structure -- because the application created a generator for the overlapped markup -- and if the ending element's content was itself not generated -- which would suppress that element's end tag -- then the element's markupTokenEndElement object is returned and its ".usedAsEnd" field will be True.
In all other cases of a markupTokenEndElement object being returned -- that is, when no generator was created for its content -- ".usedAsEnd" will be False.
token.isIgnorableWhitespace ()
Returns True only if the token is a markupTokenIgnorableWhitespace object.
token.characters
The ignorable whitespace characters returned from the XML Parser.
Only returned if the parser is created with the "I" include option.
token.isProcessingInstruction (*targets)
Returns True only if the token is a markupTokenProcessingInstruction object. If one or more arguments are given, it only returns True if the target of the processing instruction (name at the start of the processing instruction), if any, is one of those given. If no argument is given, any processing instruction will do. If one or more arguments are given, and there is no target, False will be returned.
token.target
The target of the processing instruction, if any. If there isn't one, the value is None.
token.data
The data of the processing instruction (the part of the processing instruction following the target and its terminating space character), if any. If there isn't one, the value is None.
Only returned if the parser is created with the "P" include option.
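A sketch of picking out a particular processing instruction, assuming the parser was created with the "P" include option (the "xml-stylesheet" target is purely illustrative):

for token in parser ():
    if token.isProcessingInstruction ("xml-stylesheet"):
        print "stylesheet processing instruction:", token.data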
token.isSetDocumentLocator ()
Returns True only if the token is a markupTokenSetDocumentLocator object. This object type is an artifact of the SAX API.
token.columnNumber
The column number of the input when the token is returned.
token.lineNumber
The line number of the input when the token is returned.
token.publicId
The public identifier of the input being parsed. Its value is None if no public identifier is available.
token.systemId
The system identifier of the input being parsed. Its value is None if no system identifier is available.
Only returned if the parser is created with the "L" include option.
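A sketch of using these tokens, given the "L" include option, to keep a rough note of where in the input the parser is:

lastLine = 0
for token in parser ():
    if token.isSetDocumentLocator ():
        lastLine = token.lineNumber        # remember the most recent position
    elif token.isStartElement ():
        print "element", token.name, "near line", lastLine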
token.isSkippedEntity (*names)
Returns True only if the token is a markupTokenSkippedEntity object. If one or more arguments are given, it only returns True if the name of the entity is one of those given. If no argument is given, any entity will do.
token.name
The name of the entity.
token.isStartDocument ()
Returns True only if the token is a markupTokenStartDocument object.
Only returned if the parser is created with the "S" include option.
token.isStartElement (*names)
Returns True only if the token is a markupTokenStartElement object. If one or more arguments are given, it only returns True if the name of the element being started is one of those given. If no argument is given, any element will do.
token.name
The name of the element being started.
token.attrs
The attributes, if any, specified or defaulted for the element being started. token.attrs is a list of markupAttribute objects, each of which has the following properties:
.name: The name of the attribute.
.type: The type declared or implied for the attribute. (For example "CDATA".)
.value: The value of the attribute, if any. None if no value is specified.
token.elementDepth
The nesting depth of the element being started. The root element is of depth 1.
token.usedAsEnd
True if the end of this element will serve as the end of a generated sequence of tokens for overlapped markup. Otherwise it's False.
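A sketch of picking up an attribute value from a start tag, using the dictionary-like access to ".attrs" shown in the later examples (the "link" element and "href" attribute names are purely illustrative):

for token in parser ():
    if token.isStartElement ("link") and token.attrs.has_key ("href"):
        print "link to", token.attrs ["href"].value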
token.isException (*severities)
Returns True only if the token is a markupTokenException object. If one or more arguments are given, it only returns True if the severity code of the exception is one of those given. If no argument is given, any exception will do.
token.severity
The severity of the exception:
0: warning,
1: "recoverable" error,
2: "fatal" error, or
3: system error, such as failure to resolve an entity.
token.message
The text of the error message.
token.columnNumber
The column number of the input when the token is returned.
token.lineNumber
The line number of the input when the token is returned.
token.publicId
The public identifier of the input being parsed. Its value is None if no public identifier is available.
token.systemId
The system identifier of the input being parsed. Its value is None if no system identifier is available.
Only returned for warnings (token.severity == 0) if the parser is created with the "W" include option.
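A sketch of reporting exception tokens as they are returned (warnings only appear if the parser was created with the "W" include option):

severityNames = ["warning", "error", "fatal error", "system error"]

for token in parser ():
    if token.isException ():
        print "%s at line %s: %s" % (severityNames [token.severity],
                                     token.lineNumber, token.message)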
token.isEntity (*names)
Returns True only if the token is a markupTokenEntity object. If one or more arguments are given, it only returns True if the name of the entity that needs to be resolved by the application is one of those given. If no argument is given, any entity will do.
token.name
The name of the entity that needs resolving.
token.publicId
The public identifier of the entity that needs resolving. Its value is None if no public identifier is available.
token.systemId
The system identifier of the entity that needs resolving. Its value is None if no system identifier is available.
token.setEntity (inputFile)
A method of one argument, providing the XML Parser with a string, Unicode string or input-file-like object that is the text of the resolvable entity described by the token. This method, or the correspondingly-named method of the generating anXMLParser object, should be called whenever this token is returned to the application.
Only returned if the "entityResolver" argument wasn't specified when the parser was created.
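A sketch of resolving entities in-line rather than through an "entityResolver" call-back (treating the system identifier as a local file name is just for illustration):

for token in parser ():
    if token.isEntity ():
        # supply the entity's text to the parser
        token.setEntity (open (token.systemId))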
token.isNotationDecl (*names)
Returns True only if the token is a markupTokenNotationDecl object. If one or more arguments are given, it only returns True if the name of the notation being declared is one of those given. If no argument is given, any notation will do.
token.name
The name of the notation being declared.
token.publicId
The public identifier of the notation being declared. Its value is None if no public identifier is available.
token.systemId
The system identifier of the notation being declared. Its value is None if no system identifier is available.
Only returned if the parser is created with the "D" include option.
token.isUnparsedEntityDecl (*names)
Returns True only if the token is a markupTokenUnparsedEntityDecl object. If one or more arguments are given, it only returns True if the name of the entity being declared is one of those given. If no argument is given, any entity will do.
token.name
The name of the entity being declared.
token.publicId
The public identifier of the entity being declared. Its value is None if no public identifier is available.
token.systemId
The system identifier of the entity being declared. Its value is None if no system identifier is available.
Only returned if the parser is created with the "D" include option.
token.ndata
The value of the SAX "ndata" argument if the declaration is for a NDATA entity. None if not.
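A sketch of collecting these declarations, given the "D" include option, into a couple of dictionaries for later use:

notations = {}
unparsedEntities = {}

for token in parser ():
    if token.isNotationDecl ():
        notations [token.name] = token.systemId
    elif token.isUnparsedEntityDecl ():
        unparsedEntities [token.name] = (token.systemId, token.ndata)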
This token type is an artifact of this parsing model. It is only returned if the parser needs to ask the application for a document entity. If returned, it is the very first token returned.
token.isTheBegining ()
Returns True only if the token is a markupTokenTheBegining object.
token.setEntity (inputFile)
A method of one argument, providing the XML Parser with a string, Unicode string or input-file-like object that is the text of the document entity to be parsed. This method, or the correspondingly-named method of the generating anXMLParser object, should be called whenever this token is returned to the application.
Not returned if the "documentEntity" argument is specified when creating the anXMLParser object.
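A sketch of supplying the document entity on request, when no "documentEntity" argument was given to the creator:

parser = anXMLParser ()

for token in parser ():
    if token.isTheBegining ():
        # the parser is asking for the document's text
        token.setEntity (open ("mydocument.xml"))
    elif token.isCharacters ():
        print token.characters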
This token type is an artifact of this parsing model. It is returned on "normal" termination of the parser, following all other tokens for a parse. It won't be returned in some cases of a very severe exception (when the exception's token.severity == 3). This token can typically be ignored.
token.isTheEnd ()
Returns True only if the token is a markupTokenTheEnd object.
There are a number of ways in which tokens can be retrieved from the XML parser, either one at a time or through a generator. A generator can:
simply return all the tokens generated,
generate only those for the current element, or
generate tokens delimited based on boundaries identified by an "overlapped markup" model.
The simplest way to invoke the XML parser as an XML token generator is to "call" it:
for token in parser ():
When called like this, an XML parser object generates all the XML tokens from the document being parsed.
When called a second time, when tokens are already being returned from the parser, the tokens returned from the call are just the children of the currently opened element, as in:
for token in parser ():
    if token.isStartElement ("section"):
        for token in parser ():

            def outputParaContent ():
                for token in parser ("W"):
                    if token.isCharacters ():
                        out.write (token.characters)

            if token.isStartElement ("title"):
                out.write ("<H2>")
                outputParaContent ()
                out.write ("</H2>\n")
            elif token.isStartElement ("para"):
                out.write ("<P>")
                outputParaContent ()
                out.write ("</P>")
An invocation of the parser when there is one or more opened elements will return all the tokens up to but not including that for the end tag of the most deeply nested currently opened element. The end tag is used by the token generator to indicate that it's at the end of what it has to generate. It's not returned to the application. So in the example, there are three uses of "for token in parser ()":
The outer one for processing the content of the document as a whole.
The next one for processing the content of the "section" element.
The innermost one, in the "outputParaContent" function, that's used to process the content of a "title" or "para" element.
Invoking the ".children" method of a token returned by another XML token generator is equivalent to invoking the parser object with a call. So in the above example, the nested uses of "parser ()" can be equivalently replaced by "token.children ()" -- so long as "token" is available.
There are other sets of XML tokens that can be generated by an XML parser: see Overlapped Markup.
When invoking an anXMLParser object to generate its XML tokens, or invoking the ".children" method of a token, options can be specified as the first argument, as in the innermost generator in the previous example (the one within outputParaContent):
for token in parser ("W"):
One specifiable option is the letter "W", upper- or lower-case. By default, invoking a parser with no option does not return character tokens consisting only of whitespace. If "W" is specified, all character data is returned.
The "W" option is especially useful when parsing without the help of a DTD or schema, or where the application has needs in addition to those specified in the DTD or schema.
The first argument of the XML token generator can be specified as "" (as in parser ("")) if no options are wanted.
There are three methods available for all returned tokens.
token.next ()
Return the next token from the XML token generator that returned this token. This token is "consumed" and will not be returned again.
If there are no tokens available (at the end of the document entity or at the end of a generator for an element's content) this method raises the Python StopIteration exception, as is normal for a generator.
token.peek ([count])
Return the next token from the XML token generator that returned this token. This token is not "consumed" and will be returned again.
If the "count" argument is specified, then skip that many as-yet unreturned tokens and return the one following. token.peek (0) returns the next token (just as does token.peek ()), and token.peek (1) returns the token after that.
If there are not enough tokens available (at the end of the document entity or at the end of a generator for an element's content) this method returns a markupToken object that refuses to be recognized as any particular token type -- all of the usual ".is" methods will return False. This makes things easy if you're looking for a particular kind of token up ahead.
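A sketch of using ".peek" to look ahead without consuming anything (the "section" and "title" element names are illustrative):

for token in parser ():
    if token.isStartElement ("section"):
        # does the section begin with a "title" element?
        if token.peek ().isStartElement ("title"):
            print "section with a title"
        else:
            print "untitled section"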
token.children ()
As described earlier, the ".children" method of any returned token can be used to generate tokens from the XML parser. The allowed arguments of ".children" are the same as for invoking the parser, as described in Invoking The XML Parser As A Token Generator, The Whitespace Option and Overlapped Markup.
(Not recommended outside of XML applications.)
Not all components of a marked-up document are of interest. The end tag of an element, for example, marks the end of the element's content, but typically doesn't serve any other purpose -- which is why a simple XML token generator in this implementation doesn't bother to return it to the user, and by default skips comments and other non-structure markup and skips character tokens consisting entirely of whitespace.
Applications will want to ignore some elements, especially in overlapped markup applications. There are a variety of ways of doing this, based on application needs:
Not doing anything with a generated token ignores it. One can unconditionally skip the next XML token using the ".next" method:
token.next ()
or one can leave it out of the cases being processed:
for token in token.children ():
    if token.isStartElement ("myelement"):
        pass                # process it
    # otherwise do nothing with it
All the content of an element can be ignored most easily and most safely by iterating over them and doing nothing with them:
for token in token.children ():
    pass
Even for an empty element, which should have no content, this technique is preferable to gobbling the next XML token on the assumption that it's an end tag token. If a non-validating XML parser is being used and an element has whitespace in its content, a token may be returned for that whitespace:
<marker> </marker>
There are currently a variety of experiments going on attempting to come up with XML encodings for data, especially text, with overlapped structure. An example of overlapped structure is the plays of Shakespeare, wherein a line of spoken verse is split between two speakers, and both the speaker structure and the verse structure need to be captured.
A number of techniques for marking up overlapped text have been devised. A number of these techniques use empty elements to mark the start and end of alternative structures. Identification of start/end pairs uses both element names and attribute values:
specified start and end elements, or the same element name for start and end, and
attributes with the same "label" value, with attribute names being the same on the start and end tags, or different.
Element names can do the job alone, or in combination with labeling attribute values.
There are two ways in which this kind of technique can be used:
all of the parallel structures can have their boundaries marked by empty elements, or
one of the nested structures can be marked up in the "usual" way -- with start and end tags -- and all others can have their boundaries marked with empty elements.
In all these techniques, once the starting boundary of an overlapping structure has been recognized, the identity of its ending boundary is known: the name of the ending element and, if used, the name and required value of the labeling attribute. The labeling attribute value is typically taken from a same-named or other attribute of the starting boundary, and the element names and attribute names may be the same or different on the two boundaries. A single model describes all these techniques.
An extension to this way of doing things is to allow the ending element to not be an empty element -- to allow it to have content. Doing this doesn't seem to introduce any difficulty in recognizing the end boundary of a component of overlapped markup -- in effect, the overlapped structure ends at the end of the ending element.
One kind of processing possible using documents with overlapped markup as described above is to simply extract one of the structures, or at least to make one dominate (contain) the other. This kind of processing can make use of generators of the sort described earlier for element content: the ending condition for the generator is the ending boundary of one overlapped structure. The parser and ".children" generators can recognize such boundaries as follows:
If the second argument of the generator creator is specified, it must be a string or a list of strings. If specified, the name of the element that ends the subsequence of XML tokens generated will be one of those specified.
If the third argument of the generator creator is specified, it must be a dictionary with string keys and string values. If specified, a candidate end element is examined for having an attribute name/attribute value pair that have the same values as a key/value pair in the dictionary. If there's such a match, and if the element name test, if any, succeeds, the element is identified as the end boundary of the overlapped structure.
A second argument need not be specified if an end boundary is to be identified solely on the basis of an attribute value. Alternatively, a third argument need not be specified if the element name is sufficient identification. In any case, a first argument of "" can be used if no options are to be specified.
Unlike the case for generating the content of an element, the end boundary element is returned by the generator: its start element token, at least, can have attributes of interest. In addition, the end boundary element can have content -- it's the end of the end boundary element that is deemed to end the overlapped structure. The ending element is returned as if it were content of the overlapped structure. Its markupTokenEndElement object is returned unless a generator is used for the end boundary element's content, which suppresses the end tag token.
To help clients recognize an overlapped structure end boundary, both the markupTokenStartElement and markupTokenEndElement objects have a boolean ".usedAsEnd" property, that is only True if the element has been recognized as an overlapped structure end boundary.
This form of processing overlapped markup isn't intended to deal with all processing issues for overlapped markup. Just as the DOM model of processing XML documents is often superior to the sequential processing model, other processing models for overlapped markup are often more appropriate than that described here. On the other hand, there are many cases where sequential processing works -- producing a published form of the document, for instance -- so this approach is worth consideration.
This is where things become a bit tricky. Skip to the examples for cases if this subsection gets a bit heavy.
In XML documents, things are nested within other things -- they each occur at some level of nesting. The issue is at what level of nesting is the end markup to be recognized.
In "normal", well formed markup, it's expected that the end tag will appear at the same level of markup as the corresponding start tag. For overlapped markup there are a number of possibilities:
The end boundary tag is expected to be at the same well formed nesting level as the corresponding start boundary tag. This is a good clean way of doing things.
The end boundary tag can occur at any level.
This situation is exacerbated by the fact that the XML parser is going to validate the well formed markup, but isn't necessarily going to validate the matching of overlapped markup start and end boundary tags. So the model we use has to allow for missing and extra end boundary markers.
To support the possible alternatives, the XML token generator supports three options in addition to the "W" option:
"e": End on exiting current nesting depth.
This is the default when neither the second nor the third argument of the generator is specified; the end of the element that was current when the generator was created marks the end of the content.
It may need to be specified explicitly for overlapped markup, depending on the condition that starts an overlapped structure.
"o": End on exiting the nesting level one out from the current nesting depth. . This is the default when either the second or third arguments of the generator are specified, indicating an overlapped markup end boundary is to be looked for. The reason for this choice of default is that in the overlapping markup models described above:
it is having encountered the start tag token that one recognizes that an overlapped structure is to be started, and
when you've received that start tag token, you're one level of nesting deeper than the surrounding markup.
"z": Ignore markup nesting level in determing where to end a generator.
This is the default when there are no opened elements, and overlapped markup isn't being asked for. Mostly because there's no nesting to depend on.
The following examples explain most of the useful cases of XML token generation. However they should not be considered exhaustive.
It should be noted that non-overlapped and overlapped token generation can be combined in many ways. They are in no way mutually exclusive. It's quite possible for a document's structure to be primarily "well formed", with overlapped markup used in a few key places.
First, a simple example of non-overlapped markup. Assuming that you've got a well-formed element like the following:
<name>...</name>
The recognition of the <name> element and the processing of its children will look like the following:
if token.isStartElement ("name"): for token in token.children (): # process children
Where "process children" occurs you'll put the recognition and processing logic for the <name> element's children.
As noted earlier, the </name> token doesn't need to be dealt with -- it's used by the XML token generator to signal the end of token generation.
Empty elements are supposed to have no content:
<name/>
It would be nice if one just got one XML token for this. However, there are a few reasons why the XML parser doesn't always do this for you:
Non-validating XML parsers like expat don't know which elements are empty and which ones are not.
The W3C XML specification allows empty elements to be entered with both start and end tags, possibly with whitespace between them, e.g.:
<name></name>
So one has to deal with the content of empty elements, mostly just ignoring the end tag token. One can safely, as described in Ignoring Children, do this:
if token.isStartElement ("name"): for token in token.children (): pass
So now for overlapped markup. The simplest way of using XML elements to mark overlapped boundaries is to have matching start and end elements:
<start/>...<end/>
Telling the generator what element terminates the current content does the job:
if token.isStartElement ("start"): for token in token.children ("", "end"): ...
That's all that's needed. (Note that the first argument needs to be specified in this case.)
By default, when you specify an end element name, the XML token generator looks for it in the content of the parent element of where the generator was invoked. So if the parent element terminates without there being an end element, the generator will terminate without returning a token for the end element.
There are two other ways in which you can deal with this situation. Firstly, you can exit the start element yourself, and then specify that the generator terminates at the end of what then is the currently opened element. This is equivalent to the default ending condition without specifying an option:
if token.isStartElement ("start"): :red:for token in token.children (): pass # skip any content and the end token for <start/> for token in token.children (:red:"e", "end"): ...
Alternatively, you can specify that well formed element boundaries are to be ignored and that you want that end element:
if token.isStartElement ("start"): for token in token.children (:red:"z", "end"): ...
In this latter case the end of the input will terminate the generator too.
A more usable way of marking structure boundaries is to identify them with labels:
<start id="thisstuff"/>...<end id="thisstuff"/>
In this case the third argument of the generator can be used to specify the required label value for the end boundary:
if token.isStartElement ("start"): for token in token.children ("", "end", {"id": token.attrs ["id"].value}): ...
As an alternative to using different element names for the start and end boundaries of an overlapped structure, you can use the same element name with different attribute names for the labels:
<q sId="thisstuff"/>...<q eId="thisstuff"/>
The generation of this kind of overlapped content is the same as when using different element names, except that the element and attribute names change:
if token.isStartElement ("q") and token.attrs.has_key ("sId"): for token in token.children ("", "q", {"eId": token.attrs ["sId"].value}): ...
One advantage of using the same name for start and end boundaries is that you can combine the processing of different elements. In the following case, it's assumed that there are a variety of "quote" structures, all with the same or similar processing. The idea is that whatever the start element's name, the end element must have the same name:
if token.isStartElement ("q", "q1", "q2", "speech") and \ token.attrs.has_key ("sId"): for token in token.children ("", token.name, {"eId": token.attrs ["sId"].value}): ...
You can even dynamically maintain a list of currently interesting overlapping element boundaries, and recognize them by "flattening" the list into multiple arguments for the ".isStartElement" method:
boundaryElements = ["q", "q1", "q2", "speech"] if token.isStartElement (:red:*boundaryElements) and token.attrs.has_key ("sId"): for token in token.children ("", token.name, {"eId": token.attrs ["sId"].value}): ...
You can also just use labels, and ignore the element names. Note the absence of a specified element name when using the ".isStartElement" method and the use of "None" for the second argument, indicating in both cases that the element name is not interesting:
if token.isStartElement () and token.attrs.has_key ("sId"):
    for token in token.children ("", None, {"eId": token.attrs ["sId"].value}):
        ...
The ending boundary element can have content, used as an "annotation" of the overlapped content in some models.
<start id="thisstuff"/>...<end id="thisstuff">...</end>
The ".usedAsEnd" property is useful in identifying the annotation. Python's "else" option for "for" loops is also useful, as it is on performed if the loop "dropped through" and was not exited by a "break":
if token.isStartElement ("start"): for token in token.children ("", "end", {"id": token.attrs ["id"].value}): if token.usedAsEnd: # process the "annotation" element used as the end boundary for token in token.children (): # annotation content break # processing for the "content" elements else: # you'll only get here if there was no end boundary element
An anXMLParser object can be terminated simply by deleting it (using Python's "del" statement) or letting all references to it disappear (go out of scope). Doing either of these things terminates the coroutine in which the parser is running.
At present, some kinds of errors in the user's program can cause the program to terminate without terminating the XML parser. In this case it's possible that the program as a whole hangs and needs "terminating" using the operating system's facilities.
Processing XML documents often requires that data be reordered: the order in which things appear in the input is not necessarily that in which they are to appear in the output. There are different ways of achieving this reordering.
The DOM, Document Object Model, approach is to load the whole of the XML document into memory and to provide tools (including XSLT) for randomly accessing the loaded document. The client application can then access the document in the order required to produce the output.
The SAX, Simple API for XML, approach is to provide the components to the client application as they become available from the parsing process. All the issues of reordering are put in the client application's hands.
Both the serial XML processing model described in this document and SAX have the difficulty of not having the whole document available during client processing. To a large extent this difficulty can be offset by tools that allow the client to write output in semi-random order. This approach works, in part, because:
Very often data in the output is largely in the same order as in the parsed XML document.
Ways of "patching" output data are well understood -- having been used in programming language and program loader application since the early days of computer software.
There exist techniques for efficiently resorting output data.
The accompanying module, "patchedoutput.py", is one tool for dealing with the reordering problem. A client application writes data and "labels" to a patchedoutput in the application-appropriate order, and defines the values associated with the labels when the information becomes available, in any order. The patchedoutput then writes the final form of the output in the order in which the data and labels are written, replacing the labels with their associated values.
There are a number of ways in which the data and labels can be buffered:
All data and labels can be buffered until the patchedoutput is closed, doing all the writing to the final destination at that point. Doing so requires buffering the whole of the output, doing the patching at the end of writing to the patched output. It turns out that a good implementation of this technique has as good performance as any.
Data and data associated with labels can be written as they become available. The amount of data varies depending on how close to labels being written their associated values are made available. Depending on the data, very little or very much can be buffered. This is the approach taken by "patchedoutput.py".
Only one definition per label can be enforced, either by using only the first definition, or using the last definition. Or definitions and redefinitions can be used as they become available. The latter technique means that the same label may be replaced by different values at different times -- it's the technique used by "patchedoutput.py".
"patchedoutput.py" is simple enough that programmers can modify it to suite their own requirements.
To use this parser, you need to have Python version 2.3 or later installed, and a Python SAX parser available -- there's one in the Python library. You also need to be using Christian Tismer's Stackless Python. It can be found at www.stackless.com.
The new Python API is available as a ZIP file: xmlparser.zip. This ZIP file includes:
xmlparser.html: this document.
AnXMLParserForPython.txt: this document as a text file. This document was converted to HTML using the "format.py" program available at this site in patternmatching.zip.
xmlparser.py: the XML parser API implementation.
xml2html.py: An example of using the XML parser. The sample input file is "EML2004wilm1115.xml", the paper I presented at the Extreme Markup Conference in Montreal in 2004.
xmlpatterns.py: a pattern matching model of XML processing based on this XML parser implementation. It requires patternmatching.zip.
xml2htmlp.py: An example of pattern matching XML tokens. The sample input file is again "EML2004wilm1115.xml". This isn't a very convincing example -- "xml2html.py" is much more straightforward. Consider it a failed experiment. I've included it here to answer the "what if?" question.
EML2004wilm1115.xml: Sample input for xml2html.py and xml2htmlp.py.
EML2004wilm1115.html: Sample output from xml2html.py. The output from xml2htmlp.py is just about the same.
olutest01.py and olutest02.py: Two small samples that use a simple overlapped markup scheme to extract different views from a common input: olutest.txt.
patchedoutput.py: The implementation of the patchedoutput class described earlier.
There are other approaches to serial XML parsing. The most important alternative to the model presented here is the one in which an XML element or other structure invokes a rule or method (depending on the language), and in which the user can indicate where the structure's content is to be processed. Two examples of this alternative are:
The OmniMark programming language. OmniMark provides "rules", which are performed when an XML structure is encountered.
The Java-based XML implementation described at gxparse.sourceforge.net invokes a method whose name is based on the encountered element or other structure. I've been informed that this technique has been applied in Python as well.
The following updates have been made since this document was first posted:
13 August 2004:
A variety of minor organizational changes and clarifications have been made. As well:
The ".children" method has been added for XML markup tokens. See Generating Content.
Support for overlapped markup has been added. See Overlapped Markup in particular, and the olutest*.* files.
Ignoring Children has been added.
The "str" function is now supported for all markup tokens returned from the XML parser.
All classes in the implementation have been made to be Python's "new style" classes.
And last but by no means least, I'm now using Christian Tismer's "tasklets" instead of a thread-based implementation. Tasklets allow for programming directly in a coroutine style, which is what's needed by this application.
22 July 2004:
The xml2htmlp.py program has been removed. I'll put it back when I've time to put a bit more work into it.
13 July 2004:
The section titled "Patched Output" and the accompanying patchedoutput.py module has been added.
© copyright 2004 by Sam Wilmott, All Rights Reserved
Thu Sep 09 20:36:58 2004