Using the Simple API for XML (SAX) Parser
Russell Bateman
Introduction

SAX is the Simple API for XML, a de facto standard. I first used it in C circa 2008 and, much more recently, many times in Java and Python. The SAX parser is an event-driven algorithm for parsing XML documents and an alternative to the parsing style provided by the Document Object Model (DOM). It works by offering a number of interfaces that you implement and pass to the generic parser for it to call.

Note in passing that SAX needs very little memory to do its work (especially as compared to DOM, which holds the whole document tree) and is generally faster. Because SAX is event-driven and keeps little-to-no state around, you can build as much structure as you like, on your own and in the way you like. You can also use it on streaming XML documents: you do not have to have the whole enchilada in memory. Think of SAX as going down a staircase with one hand on the rail: only what's under your hand is what's in memory.
The essence of parsing using SAX is given in the following methods, startDocument, endDocument, but especially, startElement, endElement and characters.
int           attributesLength = attributes.getLength();
StringBuilder sb               = new StringBuilder();
boolean       firstAttribute   = true;

for( int attr = 0; attr < attributesLength; attr++ )
{
  String attrUri       = attributes.getURI( attr );
  String attrLocalName = attributes.getLocalName( attr );
  String attrQName     = attributes.getQName( attr );
  String attrValue     = attributes.getValue( attr );

  // prefer the qualified name; fall back to uri:localName when there is none...
  String attrName = ( attrQName == null || attrQName.length() < 1 )
                      ? attrUri + ':' + attrLocalName
                      : attrQName;

  if( firstAttribute )
    firstAttribute = false;
  else
    sb.append( ',' );

  sb.append( ' ' ).append( attrName ).append( '=' ).append( '"' ).append( attrValue ).append( '"' );
}

System.out.println( sb.toString() );
Create your handler thus with, at first, the five methods listed above. Because DefaultHandler supplies empty default implementations of every callback, you need only override the methods you care about (had you implemented the ContentHandler interface directly, you would have to provide at least stubs for all of its methods). This example rudimentarily prints out what the SAX parser sees when you run it on an XML file.
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class MySaxParserHandler extends DefaultHandler
{
  private String elementContents;   // text gathered for the current element

  public void startElement( String uri, String localName, String elementName, Attributes attributes )
      throws SAXException
  {
    elementContents = "";
    System.out.print( "<" + elementName + ">" );
  }

  public void endElement( String uri, String localName, String elementName ) throws SAXException
  {
    System.out.print( " " + elementContents );
    System.out.print( "</" + elementName + ">" );   // print the closing tag
  }

  // characters() may be called more than once per element, hence the +=...
  public void characters( char ch[], int start, int length ) throws SAXException
  {
    String contents = new String( ch, start, length ).trim();
    elementContents += contents;
  }

  public void startDocument() throws SAXException
  {
    System.out.print( "Begin" );
  }

  public void endDocument() throws SAXException
  {
    System.out.print( "Done!" );
  }
}
Now create your application's main entry point, set up the parser to consume your handler, then call it (the call to parser.parse() below). We'll pass the sample XML filename as the first command-line argument.
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.SAXException;

public class MySaxParser
{
  public static void main( String[] args )
  {
    MySaxParserHandler handler = new MySaxParserHandler();
    SAXParserFactory   factory = SAXParserFactory.newInstance();
    SAXParser          parser  = null;

    try
    {
      parser = factory.newSAXParser();
    }
    catch( ParserConfigurationException e )
    {
      System.err.println( "Parser-configuration error: " + e.getMessage() );
      System.exit( -1 );
    }
    catch( SAXException e )
    {
      System.err.println( "SAX parser error: " + e.getMessage() );
      System.exit( -1 );
    }

    try
    {
      // args[ 0 ] is the first command-line argument, the XML file to parse...
      parser.parse( args[ 0 ], handler );
    }
    catch( Exception e )
    {
      System.err.println( "parser handler error: " + e.getMessage() );
      System.exit( -1 );
    }
  }
}
Throughout the examples in these notes, the instantiation of a parser, the definition of a handler and the calling of the parser have required at least two separate classes. This is an example of integrating all three operations into one.
Simply use the constructor of the handler class, MappingSaxHandler(), to instantiate the SAX parser, creating a ready instance, and a special (non-handler) method, run(), to invoke the parser later when desired.
I won't flesh out utilities or dependencies such as:
package com.windofkeltia.processor; import java.io.IOException; import java.io.InputStream; import java.util.ArrayList; import java.util.List; import javax.xml.parsers.ParserConfigurationException; import javax.xml.parsers.SAXParser; import javax.xml.parsers.SAXParserFactory; import org.slf4j.Logger; import org.slf4j.LoggerFactory; import org.xml.sax.Attributes; import org.xml.sax.Locator; import org.xml.sax.SAXException; import org.xml.sax.XMLReader; import org.xml.sax.helpers.DefaultHandler; import com.windofkeltia.exceptions.MappingException; import com.windofkeltia.pojos.Concept; import com.windofkeltia.pojos.ObjectStack; import com.windofkeltia.utilities.StringUtilities; public class MappingSaxHandler extends DefaultHandler { private static final Logger logger = LoggerFactory.getLogger( MappingSaxHandler.class ); private Locator locator; private final ObjectStack< Concept > stack = new ObjectStack<>(); private final List< Concept > concepts = new ArrayList<>(); private final ParserHelp parserHelp = new ParserHelp(); private boolean INACTIVE = false; private final MappingSaxHandler handler = this; private final SAXParserFactory factory = SAXParserFactory.newInstance(); private SAXParser parser = null; private XMLReader reader = null; public MappingSaxHandler() throws MappingException { try { parser = factory.newSAXParser(); reader = parser.getXMLReader(); } catch( ParserConfigurationException e ) { throw new MappingException( "SAX parser configuration error: " + e.getMessage() ); } catch( SAXException e ) { throw new MappingException( "SAX parser creation error: " + e.getMessage() ); } } public void run( final InputStream inputStream ) throws MappingException { try { parser.parse( inputStream, handler ); } catch( IllegalArgumentException e ) { throw new MappingException( "Parse input argument exception: " + e.getMessage() ); } catch( IOException e ) { throw new MappingException( "SAX parser I/O error: " + e.getMessage() ); } catch( SAXException e ) { throw new 
MappingException( "SAX parser exception: " + e.getMessage() ); } } public void startDocument() { if( VERBOSE ) SaxHandlerUtilities.startDocument(); } public void startElement( String uri, String localName, String elementName, Attributes saxAttributes ) { if( !INACTIVE && elementName.equals( "document" ) ) { INACTIVE = true; logger.warn( " Parser is now inactive (skipping <document>)..." ); return; } if( INACTIVE ) return; if( VERBOSE ) SaxHandlerUtilities.startElement( locator, uri, localName, elementName, saxAttributes ); Concept concept = new Concept( elementName, locator.getLineNumber(), saxAttributes ); stack.push( concept ); concepts.add( concept ); parserHelp.switchConcept( elementName, concept.attributes.get( "type" ) ); parserHelp.reservingConcept( elementName, concept ); } public void characters( char[] ch, int start, int length ) { if( INACTIVE ) return; String characters = new String( ch, start, length ).trim(); if( VERBOSE ) SaxHandlerUtilities.characters( locator, characters ); Concept concept = stack.peek(); if( !StringUtilities.isEmpty( characters ) ) concept.content.append( characters ); } public void endElement( String uri, String localName, String elementName ) { if( INACTIVE && elementName.equals( "document" ) ) { logger.warn( " Parser is now active (skipped <document>)..." ); INACTIVE = false; return; } if( VERBOSE ) SaxHandlerUtilities.endElement( locator, elementName ); Concept concept = stack.pop(); parserHelp.switchConcept( elementName ); parserHelp.assimilateConcept( elementName, concept ); } public void endDocument() { if( VERBOSE ) SaxHandlerUtilities.endDocument(); } public void setDocumentLocator( Locator locator ) { this.locator = locator; } public List< Concept > getConcepts() { return concepts; } public List< Concept > getConcepts() { return parserHelp.getConceptList(); } public static boolean VERBOSE = false; }
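Since the utilities and dependencies of the class above are not fleshed out here, the following is a stripped-down sketch of the same pattern that can be compiled on its own. The class name SelfParsingHandler, its names list and getNames() are hypothetical, made up for illustration; only the idea (constructor creates the parser, run() invokes it with the handler itself) comes from the example above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical, dependency-free variant: the handler owns its parser. */
public class SelfParsingHandler extends DefaultHandler
{
  private final SAXParser      parser;
  private final List< String > names = new ArrayList<>();

  /** The constructor creates the parser so that run() can be called later. */
  public SelfParsingHandler()
  {
    try
    {
      parser = SAXParserFactory.newInstance().newSAXParser();
    }
    catch( ParserConfigurationException | SAXException e )
    {
      throw new IllegalStateException( "Unable to create SAX parser", e );
    }
  }

  /** Not a SAX callback: parses the stream using this very object as the handler. */
  public void run( InputStream inputStream )
  {
    try
    {
      parser.parse( inputStream, this );
    }
    catch( SAXException | IOException e )
    {
      throw new IllegalStateException( "Parse failed", e );
    }
  }

  /** SAX callback: remember each element name in order of appearance. */
  @Override
  public void startElement( String uri, String localName, String qName, Attributes attributes )
  {
    names.add( qName );
  }

  public List< String > getNames() { return names; }

  public static void main( String[] args )
  {
    SelfParsingHandler handler = new SelfParsingHandler();
    String             xml     = "<document><title/><body/></document>";

    handler.run( new ByteArrayInputStream( xml.getBytes( StandardCharsets.UTF_8 ) ) );
    System.out.println( handler.getNames() );
  }
}
```

The caller only ever sees one class: construct it, hand run() a stream, then ask it for what it gathered.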
SAX offers additional goodies beyond the DefaultHandler, everything needed to parse even the most complex XML file. Here are some of my favorites:
Nota bene: if you do not set the property, "http://xml.org/sax/properties/lexical-handler" as shown in code below, even though you're implementing LexicalHandler, processing will never reach your comment() method.
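A minimal sketch of that registration, assuming the JDK's built-in parser and going through XMLReader directly. The class name CommentCollector and the helper collectComments() are hypothetical; the property URI is the one quoted above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that only cares about XML comments. */
public class CommentCollector extends DefaultHandler implements LexicalHandler
{
  private final List< String > comments = new ArrayList<>();

  /** Called only if this object was registered as the lexical handler. */
  @Override
  public void comment( char[] ch, int start, int length )
  {
    comments.add( new String( ch, start, length ) );
  }

  // The rest of LexicalHandler must be present, but can be stubs...
  @Override public void startDTD( String name, String publicId, String systemId ) { }
  @Override public void endDTD() { }
  @Override public void startEntity( String name ) { }
  @Override public void endEntity( String name ) { }
  @Override public void startCDATA() { }
  @Override public void endCDATA() { }

  public static List< String > collectComments( String xml )
  {
    CommentCollector handler = new CommentCollector();

    try
    {
      XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();

      reader.setContentHandler( handler );
      // without this property, comment() above is never reached...
      reader.setProperty( "http://xml.org/sax/properties/lexical-handler", handler );
      reader.parse( new InputSource( new StringReader( xml ) ) );
    }
    catch( ParserConfigurationException | SAXException | IOException e )
    {
      throw new IllegalStateException( "Parse failed", e );
    }

    return handler.comments;
  }

  public static void main( String[] args )
  {
    System.out.println( collectComments( "<doc><!-- first --><item/><!-- second --></doc>" ) );
  }
}
```

Commenting out the setProperty() line is a quick way to see the effect described: the parse still succeeds, but the list stays empty.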
The Locator tells you where in the document the parser currently is. You get an instance of it, via the method setDocumentLocator(), pretty much before anything else, and the parser maintains the line number and column throughout its parsing. Very useful for error reporting in XML validation and other activities.
To get this to work, add the locator field and the setDocumentLocator() method to your handler as in the code below. The SAX parser calls setDocumentLocator() as it starts up and thereafter keeps the Locator you've saved up to date such that, later, you can call into it, here via getLineNumber() (there is also getColumnNumber()), to get this information.
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class MySaxHandler extends DefaultHandler
{
  private Locator locator;

  // called by the parser before any other callback...
  public void setDocumentLocator( Locator locator )
  {
    this.locator = locator;
  }

  ...

  public void startElement( String uri, String localName, String elementName, Attributes attributes )
      throws SAXException
  {
    System.out.print( locator.getLineNumber() + " <" + elementName + ">" );
  }

  ...
}
Back in the years when getLineNumber() always returned -1 because it wasn't implemented, I tried tracking line numbers in my parser handler by counting newlines and painstaking bookkeeping. The Locator works much better.
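For something that can be run as-is, here is a small sketch that records the line on which each element starts. The class name LineNumberDemo, the helper elementsWithLines() and the "name@line" format are hypothetical; the Locator calls are the ones described above, and the null check is only a precaution in case a parser never supplies a locator.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler noting "name@line" for every start tag. */
public class LineNumberDemo extends DefaultHandler
{
  private Locator              locator;
  private final List< String > seen = new ArrayList<>();

  /** The parser hands over its locator before any other event. */
  @Override
  public void setDocumentLocator( Locator locator )
  {
    this.locator = locator;
  }

  @Override
  public void startElement( String uri, String localName, String qName, Attributes attributes )
  {
    // fall back to -1 if the parser never supplied a locator...
    int line = ( locator != null ) ? locator.getLineNumber() : -1;
    seen.add( qName + "@" + line );
  }

  public static List< String > elementsWithLines( String xml )
  {
    LineNumberDemo handler = new LineNumberDemo();

    try
    {
      SAXParserFactory.newInstance().newSAXParser()
                      .parse( new InputSource( new StringReader( xml ) ), handler );
    }
    catch( ParserConfigurationException | SAXException | IOException e )
    {
      throw new IllegalStateException( "Parse failed", e );
    }

    return handler.seen;
  }

  public static void main( String[] args )
  {
    System.out.println( elementsWithLines( "<a>\n  <b/>\n  <c/>\n</a>" ) );
  }
}
```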
This isn't strictly true, but if you choose the method parse( String ), you will find you get an exception
java.net.MalformedURLException: no protocol:
...which seems bizarre since you don't think you're dealing with a URL. In fact, you are because the parser assumes parsing something like
<?xml version='1.0'?> <!DOCTYPE Something SYSTEM "http://www.something.com/Something.dtd">
...when, in fact, that's not in what you're asking to be parsed at all. Whatever the case may be, translating the string to an input stream seems to solve the problem. In code below, there is some of this going on.
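A sketch of that translation, assuming the XML is already in a String and that UTF-8 is an acceptable encoding for it. The class name ParseFromString and the helper countElements() are hypothetical and exist only to give the parse something to do.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ParseFromString
{
  /** Parses XML held in a string by way of an InputStream; returns the element count. */
  public static int countElements( String xml )
  {
    final int[] count = { 0 };

    DefaultHandler handler = new DefaultHandler()
    {
      @Override
      public void startElement( String uri, String localName, String qName, Attributes attributes )
      {
        count[ 0 ]++;
      }
    };

    try
    {
      // parse( String, ... ) would treat the string as a URI; a stream is taken as content...
      InputStream stream = new ByteArrayInputStream( xml.getBytes( StandardCharsets.UTF_8 ) );
      SAXParserFactory.newInstance().newSAXParser().parse( stream, handler );
    }
    catch( ParserConfigurationException | SAXException | IOException e )
    {
      throw new IllegalStateException( "Parse failed", e );
    }

    return count[ 0 ];
  }

  public static void main( String[] args )
  {
    System.out.println( countElements( "<document><subdocument>Hi!</subdocument></document>" ) );
  }
}
```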
Here's a table comparing two additional SAX parser handlers you can add to your parser. For example, to add the LexicalHandler, recreate the handler above thus:
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.DefaultHandler;

public class MySaxParserHandler extends DefaultHandler implements LexicalHandler
{
  . . .
Here's the comparison table. LexicalHandler adds, for example, special methods to allow you to handle XML comments, which are otherwise completely ignored (you won't ever be notified about them), as well as CDATA, DTDs, etc.
| DefaultHandler | ContentHandler | LexicalHandler |
|---|---|---|
| characters | characters | |
| | | comment |
| | | endCDATA |
| | | endDTD |
| endDocument | endDocument | |
| endElement | endElement | |
| | | endEntity |
| endPrefixMapping | endPrefixMapping | |
| error | | |
| fatalError | | |
| ignorableWhitespace | ignorableWhitespace | |
| notationDecl | | |
| processingInstruction | processingInstruction | |
| resolveEntity | | |
| setDocumentLocator | setDocumentLocator | |
| skippedEntity | skippedEntity | |
| | | startCDATA |
| | | startDTD |
| startDocument | startDocument | |
| startElement | startElement | |
| | | startEntity |
| startPrefixMapping | startPrefixMapping | |
| unparsedEntityDecl | | |
| warning | | |
I first wrote up these comparisons in an effort to understand why, in some code I was reading, ContentHandler was being used instead of DefaultHandler. Then I realized that it was because the original author did not want to handle errors.
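To make the error-handling difference concrete, here is a sketch of a DefaultHandler that overrides the three error callbacks. The class name ErrorReportingHandler, the helper problemsIn() and the message format are hypothetical; a handler written against ContentHandler alone has no such methods to override.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that records what the parser complains about. */
public class ErrorReportingHandler extends DefaultHandler
{
  private final List< String > problems = new ArrayList<>();

  @Override
  public void warning( SAXParseException e )
  {
    problems.add( "warning, line " + e.getLineNumber() + ": " + e.getMessage() );
  }

  @Override
  public void error( SAXParseException e )
  {
    problems.add( "error, line " + e.getLineNumber() + ": " + e.getMessage() );
  }

  @Override
  public void fatalError( SAXParseException e ) throws SAXException
  {
    problems.add( "fatal, line " + e.getLineNumber() + ": " + e.getMessage() );
    throw e;    // parsing cannot continue after a fatal error
  }

  public static List< String > problemsIn( String xml )
  {
    ErrorReportingHandler handler = new ErrorReportingHandler();
    SAXParser             parser;

    try
    {
      parser = SAXParserFactory.newInstance().newSAXParser();
    }
    catch( ParserConfigurationException | SAXException e )
    {
      throw new IllegalStateException( "Unable to create SAX parser", e );
    }

    try
    {
      parser.parse( new InputSource( new StringReader( xml ) ), handler );
    }
    catch( SAXException e )
    {
      // nothing more to do: fatalError() has already recorded the problem
    }
    catch( IOException e )
    {
      throw new IllegalStateException( "Unable to read input", e );
    }

    return handler.problems;
  }

  public static void main( String[] args )
  {
    // <b> is never closed, so this document is not well formed...
    problemsIn( "<a>\n<b>\n</a>" ).forEach( System.out::println );
  }
}
```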
...at your disposal with terse comments on what they do. I have not myself done everything possible with SAX and there's much I don't know.
| Method | DefaultHandler | ContentHandler | LexicalHandler | Purpose |
|---|---|---|---|---|
| characters | ✗ | ✗ | | plain text between opening and closing tags |
| comment | | | ✗ | like characters() for comments |
| endCDATA | | | ✗ | reports end of CDATA |
| endDTD | | | ✗ | reports end of DOCTYPE section |
| endDocument | ✗ | ✗ | | reports end of (whole) document |
| endElement | ✗ | ✗ | | reports end of XML element (tag) |
| endEntity | | | ✗ | reports end of XML entities |
| endPrefixMapping | ✗ | ✗ | | reports end of namespace mapping |
| error | ✗ | | | reports (recoverable) parser error |
| fatalError | ✗ | | | reports fatal (unrecoverable) parser error |
| ignorableWhitespace | ✗ | ✗ | | invoked in combination with a DTD |
| notationDecl | ✗ | | | |
| processingInstruction | ✗ | ✗ | | reports processing instructions like <?target data?> |
| resolveEntity | ✗ | | | |
| setDocumentLocator | ✗ | ✗ | | yields an instance of Locator for later use |
| skippedEntity | ✗ | ✗ | | reports skipped entity (could otherwise have been gotten via LexicalHandler) |
| startCDATA | | | ✗ | reports beginning of CDATA, starting with <![CDATA[ |
| startDTD | | | ✗ | reports beginning of DOCTYPE section |
| startDocument | ✗ | ✗ | | reports beginning of (whole) document |
| startElement | ✗ | ✗ | | reports beginning of XML element (tag) |
| startEntity | | | ✗ | reports beginning of XML entities like DTD |
| startPrefixMapping | ✗ | ✗ | | reports beginning of namespace mapping; a namespace prefix is like X in <X:tag ...> |
| unparsedEntityDecl | ✗ | | | |
| warning | ✗ | | | reports parser warning |
SAX parsing is done via a sort of "call-back" mechanism. You write a handler that you supply to SAX. The handler implements (or, when extending DefaultHandler, overrides) a number of methods that SAX calls as it encounters XML elements (tags), attributes, etc. Formally, these are:
In Java, these are the following (and are public in your handler):
void startDocument()
void endDocument()
void startElement( String uri, String localName, String qName, Attributes attributes )
void endElement( String uri, String localName, String qName )
void characters( char ch[], int start, int length )

When implementing LexicalHandler...

void startDTD( String name, String publicId, String systemId )
void endDTD()
void startEntity( String name )
void endEntity( String name )
void startCDATA()
void endCDATA()
void comment( char[] ch, int start, int length )
Similarly, in Python, you implement:
def startDocument( self )
def endDocument( self )
def startElement( self, tag, attributes=None )
def endElement( self, tag )
def characters( self, content )

Note: xml.sax.sax2lib.LexicalHandler is not supported; the standard library only gained a LexicalHandler (xml.sax.handler.LexicalHandler) in Python 3.10.
package com.etretatlogiciels.cda.analysis;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class CdaAnalysis
{
  private SAXParserFactory factory;
  private SAXParser        parser;
  private XMLReader        reader;
  private DefaultHandler   handler;

  public CdaAnalysis( String outputFilepath )
  {
    factory = SAXParserFactory.newInstance();

    try
    {
      parser  = factory.newSAXParser();
      handler = new CdaHandler( outputFilepath, false );
      reader  = parser.getXMLReader();
      // without this property, the handler's LexicalHandler methods are never called...
      reader.setProperty( "http://xml.org/sax/properties/lexical-handler", handler );
    }
    catch( ParserConfigurationException e )
    {
      e.printStackTrace();
    }
    catch( SAXException e )
    {
      e.printStackTrace();
    }
  }

  public void parse( String pathname )
  {
    try
    {
      parser.parse( pathname, handler );
    }
    catch( Exception e )
    {
      e.printStackTrace();
    }
  }
}
package com.etretatlogiciels.cda.analysis; import java.util.Collections; import java.util.List; import org.xml.sax.Attributes; import org.xml.sax.SAXException; import org.xml.sax.ext.LexicalHandler; import org.xml.sax.helpers.DefaultHandler; import com.etretatlogiciels.utilities.RedirectSystemOutStream; import com.etretatlogiciels.utilities.OccurrencesTable; import com.etretatlogiciels.utilities.XPathTable; import com.etretatlogiciels.utilities.SymbolTableException; import com.etretatlogiciels.utilities.XmlElementStack; /** * We want to analyse elements, modifying our analysis to make a growing number of * observations about CDA and CCD documents. * * History * * We already gather out <text> ... </text> elements from some CDA * elements and put them into our CXML. What we want to do is identify other, import * content to gather out and extend our work to CCDs, which adhere to the CDA format * standard. * * About this file... * * The SAX handler is the cookie jar with all the cookies. It's the code that says * what we're looking for and how it's supposed to show up. In this case, however, * we don't want to make many decisions about what's going to be in the CDA (or * CCD) because we're just learning. And, anyway, we likely would never be able to * predict with complete accuracy what's in there. We're just observers trying to * find a way to disembowel the document to get content out. * * About lineNumber: the starting value for lineNumber must account for the number of * lines in the XML file being parsed that occur prior to the principal/outside * element even if this is one or more comments. 
* * @author Russell Bateman * @since May 2015 */ public class CdaHandler extends DefaultHandler implements LexicalHandler { private int lineNumber = 2; // (because all documents start with <?xml version="1.0"?>) private String elementName; private String characters; private OccurrencesTable occurrences; private XPathTable xpath; private XmlElementStack stack; private static final String DOCUMENT = "<document-top>"; /** * Just output to the console whatever's found. */ public CdaHandler() { super(); occurrences = new OccurrencesTable(); xpath = new XPathTable(); stack = new XmlElementStack(); } /** * Hereafter, all output goes to System.out which will take it to a file and/or * the console. * @param outputFilepath to which results will go. * @param hush shut the console up (no output to console). */ public CdaHandler( String outputFilepath, boolean hush ) { this(); RedirectSystemOutStream.redirectSystemOutToFile( outputFilepath, hush ); } /** * If processing multiple documents, this would be an opportunity to initialize * processing for the next document (as opposed to a previous one). */ public void startDocument() throws SAXException { stack.push( DOCUMENT ); System.out.println( "Elements in order of appearance and their line numbers..." ); } /** * If processing multiple documents, this is an opportunity to tie off the one * that's been processing and is finished. */ public void endDocument() throws SAXException { List< String > elements = occurrences.keys(); int widest = CdaAnalysisUtilities.getMaximumKeyWidth( elements ); System.out.println( "-----------------------------------------------------------------------" ); System.out.println( "Elements sorted alphabetically and their frequency..." ); // Sort alphabetically by key... 
Collections.sort( elements ); CdaAnalysisUtilities.printPaddedList( occurrences, elements, widest ); System.out.println( "-----------------------------------------------------------------------" ); System.out.println( "Elements sorted by occurrence from most to least frequent and their frequency..." ); CdaAnalysisUtilities.sortAndPrintPaddedByValue( occurrences, elements, widest ); System.out.println( "-----------------------------------------------------------------------" ); System.out.println( "Elements with their XPath and line number on which they appear..." ); CdaAnalysisUtilities.sortAndPrintXPaths( xpath, elements, widest ); if( stack.peek().equals( DOCUMENT ) ) System.out.println( "Document is well formed" ); } /** * —what identifies and handles the opening element tag. These can be * intercepted and used for analysis. */ public void startElement( String uri, String localName, String qName, Attributes attributes ) throws SAXException { String elementName = qName; stack.push( elementName ); if( !occurrences.contains( elementName ) ) { try { occurrences.put( elementName, 1 ); } catch( SymbolTableException e ) { e.printStackTrace(); } } else { try { Integer count = occurrences.get( elementName ); count++; occurrences.delete( elementName ); occurrences.put( elementName, count ); } catch( SymbolTableException e ) { e.printStackTrace(); } } try { String path = stack.renderStackAsXPath(); xpath.put( elementName, lineNumber, path ); } catch( SymbolTableException e ) { e.printStackTrace(); } System.out.println( lineNumber + " <" + qName + ">" ); } /** * —what identifies the closing element tag, an opportunity to wrap up * the greater tag, whatever we want to do with it. It's at this point that * we know the tag is finished and what we've got in characters is its content. */ public void endElement( String uri, String localName, String qName ) throws SAXException { stack.pop(); } /** * —what gathers the plain text content between the opening and closing tags. 
* Of course, some elements are comprehensive and contain additional elements without * also having plain text. * @param ch array holding the characters. * @param start starting position in the array. * @param length number of characters to use from the array. */ public void characters( char ch[], int start, int length ) throws SAXException { characters = new String( ch, start, length ).trim(); for( int i = start; i < start+length; i++ ) { if( ch[ i ] == '\n' ) lineNumber++; } } /** * Maybe this can be used to create line numbers? No, it's only invoked in combination * with a DTD. If the parsed XML file doesn't preconize a DTD, it won't happen. */ public void ignorableWhitespace( char ch[], int start, int length ) throws SAXException { // String whitespace = new String( ch, start, length ).trim(); } public void startDTD( String name, String publicId, String systemId ) throws SAXException { } public void endDTD() throws SAXException { } public void startEntity( String name ) throws SAXException { } public void endEntity( String name ) throws SAXException { } /** * Reports beginning of CDATA, starting with <![CDATA[ and ending with ]]>, for example, * * "<![CDATA[<p>The pug snoring on the couch next to me is <em>extremely</em> cute</p>]]>" */ public void startCDATA() throws SAXException { } public void endCDATA() throws SAXException { } /** * Reports any and all XML comments anywhere in (internal) document. * @param ch array holding the characters in the comment. * @param start starting position in the array. * @param length number of characters to use from the array. */ public void comment( char[] ch, int start, int length ) throws SAXException { for( int i = start; i < start+length; i++ ) { if( ch[ i ] == '\n' ) lineNumber++; } } }
There is a useful SAX dummy handler sample that you can play with; use it as a blank or as a place to start fresh.
import xml.sax
from io import StringIO

def getResult():
    # httpclient and uri are defined elsewhere...
    result  = httpclient.get( uri )
    payload = result.read()
    resultIdParser = CqPostResponsePayload()
    try:
        xml.sax.parseString( payload, resultIdParser )
    except Exception as e:
        print( e )
    return resultIdParser.getResultId()

class CqPostResponsePayload( xml.sax.ContentHandler ):
    '''
    Parse response payload, looks something like:
        <result_id>fWkcTS1a</result_id>
    '''
    def __init__( self ):
        xml.sax.ContentHandler.__init__( self )
        self.result = StringIO()
        self.resultIdCharacters = ''

    def getResultId( self ):
        return self.result.getvalue().strip()

    def startElement( self, tag, attributes=None ):
        if tag == 'result_id':
            self.resultIdCharacters = ''
        else:
            pass

    def endElement( self, tag ):
        if tag == 'result_id':
            # tie off the result_id...
            self.result.write( self.resultIdCharacters + '\n' )
        else:
            pass

    def characters( self, content ):
        self.resultIdCharacters += content
A different example...
def getValueOfTotalAttribute( line ):
    ''' Just the attributes. '''
    parser = HitsTagElementParser()
    try:
        xml.sax.parseString( line, parser )
    except Exception as e:
        print( e )
        return 0
    attributes = parser.getAttributes()
    return attributes

class HitsTagElementParser( xml.sax.ContentHandler ):
    def __init__( self ):
        xml.sax.ContentHandler.__init__( self )
        self.attributes = {}

    def getAttributes( self ):
        return self.attributes

    def startElement( self, tag, attributes=None ):
        if tag != 'our-tag':
            return
        self.attributes = attributes

    def endElement( self, tag ):
        ''' We'll never hit this! '''
        pass

    def characters( self, content ):
        ''' We're uninterested in this. '''
        pass
This example parses XML from a string (see XmlSchemaTest.java), but it could be a file as shown in the earlier Java samples. The goal is to create an instance of XmlElement that recursively contains the rest of the XML content, whether that content comes from a string or a file.
Among other things, this exercise demonstrates how to use stacks to keep track of where one is to gather text out of intermingled subelements for the owning element.
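Before the full classes, the stack idea on its own can be sketched in a few lines, under the assumption that one text buffer per open element is enough. The class name TextGatherer and the helper gather() are hypothetical, and keying the result by element name is a simplification (a repeated name would overwrite the earlier entry); it is not the XmlElement design used in the rest of this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: one text buffer per open element, kept on a stack. */
public class TextGatherer extends DefaultHandler
{
  private final Deque< StringBuilder > buffers = new ArrayDeque<>();
  private final Map< String, String >  texts   = new LinkedHashMap<>();

  /** A new element starts: it gets its own, empty buffer on top of the stack. */
  @Override
  public void startElement( String uri, String localName, String qName, Attributes attributes )
  {
    buffers.push( new StringBuilder() );
  }

  /** Text always belongs to the innermost open element, the top of the stack. */
  @Override
  public void characters( char[] ch, int start, int length )
  {
    if( !buffers.isEmpty() )
      buffers.peek().append( ch, start, length );
  }

  /** The element is finished: pop its buffer, which uncovers the parent's again. */
  @Override
  public void endElement( String uri, String localName, String qName )
  {
    texts.put( qName, buffers.pop().toString().trim() );
  }

  public static Map< String, String > gather( String xml )
  {
    TextGatherer handler = new TextGatherer();

    try
    {
      SAXParserFactory.newInstance().newSAXParser()
                      .parse( new InputSource( new StringReader( xml ) ), handler );
    }
    catch( ParserConfigurationException | SAXException | IOException e )
    {
      throw new IllegalStateException( "Parse failed", e );
    }

    return handler.texts;
  }

  public static void main( String[] args )
  {
    System.out.println( gather( "<doc>Hello <em>big</em> world</doc>" ) );
  }
}
```

Because the child's text goes into the child's buffer, the parent's text comes out as just its own pieces joined together, with the subelement's text left out.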
This test ensures (visually only; no assertions are made, so these are not real test cases) that our parser can go a couple of levels deep and a couple abreast, for the purpose of demonstrating the idea only.
package com.etretatlogiciels.xml; import org.junit.Before; import org.junit.Rule; import org.junit.Test; import org.junit.rules.TestName; import org.xml.sax.SAXException; import org.xml.sax.helpers.DefaultHandler; import javax.xml.parsers.ParserConfigurationException; import javax.xml.parsers.SAXParser; import javax.xml.parsers.SAXParserFactory; import java.io.ByteArrayInputStream; import java.io.InputStream; import java.nio.charset.Charset; import com.etretatlogicials.xml.XmlElement; import com.etretatlogicials.xml.XmlSchemaHandler; public class XmlSchemaTest { @Rule public TestName name = new TestName(); @Before public void setUp() throws Exception { String testName = name.getMethodName(); int PAD = 100; int nameWidth = testName.length(); System.out.print( "Test: " + testName + " " ); PAD -= nameWidth; while( PAD-- > 0 ) System.out.print( "-" ); System.out.println(); } private SAXParser createParser( DefaultHandler handler ) { SAXParser parser = null; try { parser = SAXParserFactory.newInstance().newSAXParser(); } catch( ParserConfigurationException e ) { System.err.println( "Parser-configuration error:" ); e.printStackTrace(); } catch( SAXException e ) { System.err.println( "SAX ruleparser error:" ); e.printStackTrace(); } return parser; } private InputStream toInputStream( String string ) { return new ByteArrayInputStream( string.getBytes( Charset.forName( "UTF-8" ) ) ); } /** * Expected: * { * document, * This is a test. * } */ @Test public void testOne() { final String XML_CONTENT = "<?xml version=\"1.0\"?><document>This is a test.</document>"; DefaultHandler handler = new XmlSchemaHandler(); SAXParser parser = createParser( handler ); try { parser.parse( toInputStream( XML_CONTENT ), handler ); } catch( Exception e ) { e.printStackTrace(); } XmlElement finishedProduct = ( ( XmlSchemaHandler ) handler ).getFinishedProduct(); System.out.println( finishedProduct.toStringEnchilada( 0, " " ) ); } /** * Expected: * { * document, * This is a test. 
 * { * subdocument, * Hi! * } * } */ @Test public void testTwo() { final String XML_CONTENT = "<?xml version=\"1.0\"?>\n" + "<document>\n" + " This is a test.\n" + " <subdocument>\n" + " Hi!\n" + " </subdocument>\n" + "</document>"; DefaultHandler handler = new XmlSchemaHandler(); SAXParser parser = createParser( handler ); try { parser.parse( toInputStream( XML_CONTENT ), handler ); } catch( Exception e ) { e.printStackTrace(); } XmlElement finishedProduct = ( ( XmlSchemaHandler ) handler ).getFinishedProduct(); System.out.println( finishedProduct.toStringEnchilada( 0, " " ) ); } /** * Expected: * { * document, * This is a test. * { * subdocument, * Hi! * } * } */ @Test public void testThree() { final String XML_CONTENT = "<?xml version=\"1.0\"?>\n" + "<document attribute=\"fun and games\">\n" + " This is a test.\n" + " <subdocument another=\"more fun and games\">\n" + " Hi!\n" + " </subdocument>\n" + "</document>"; DefaultHandler handler = new XmlSchemaHandler(); SAXParser parser = createParser( handler ); try { parser.parse( toInputStream( XML_CONTENT ), handler ); } catch( Exception e ) { e.printStackTrace(); } XmlElement finishedProduct = ( ( XmlSchemaHandler ) handler ).getFinishedProduct(); System.out.println( finishedProduct.toStringEnchilada( 0, " " ) ); } /** * Expected: * { * document, * This is a test. * { * subdocument1, * Hi! * } * { * subdocument2, * Hi, again! 
* } * } */ @Test public void testAbreast() { final String XML_CONTENT = "<?xml version=\"1.0\"?>\n" + "<document attribute=\"fun and games\">\n" + " This is a test.\n" + " <subdocument1 another=\"more fun and games\">\n" + " Hi!\n" + " </subdocument1>\n" + " <subdocument2 another=\"still more fun and games\">\n" + " Hi, again!\n" + " </subdocument2>\n" + "</document>"; DefaultHandler handler = new XmlSchemaHandler(); SAXParser parser = createParser( handler ); try { parser.parse( toInputStream( XML_CONTENT ), handler ); } catch( Exception e ) { e.printStackTrace(); } XmlElement finishedProduct = ( ( XmlSchemaHandler ) handler ).getFinishedProduct(); System.out.println( finishedProduct.toStringEnchilada( 0, " " ) ); } }
This is the brass ring at the end of the run. The entire XML content, as a tree, is in this object (and recursive instances below it).
package com.etretatlogicials.xml; import org.xml.sax.Attributes; import java.util.ArrayList; import java.util.List; public class XmlElement { private String uri; private String localName; private String qname; private String text; private Attributes attributes; private List< XmlElement > children = new ArrayList<>(); public XmlElement( String uri, String localName, String qname, Attributes attributes ) { this.uri = uri; this.localName = localName; this.qname = qname; this.attributes = attributes; } public String getUri() { return uri; } public String getLocalName() { return localName; } public String getQname() { return qname; } public String getText() { return text; } public Attributes getAttributes() { return attributes; } public List< XmlElement > getChildren() { return children; } public void setUri( String uri ) { this.uri = uri; } public void setLocalName( String localName ) { this.localName = localName; } public void setQname( String qname ) { this.qname = qname; } public void setText( String text ) { this.text = text; } public void setAttributes( Attributes attributes ) { this.attributes = attributes; } public void addChild( XmlElement child ) { this.children.add( child); } public String toString() { StringBuilder sb = new StringBuilder(); sb.append( "{ " ); if( localName != null && localName.length() > 0 ) sb.append( qname ); else sb.append( qname ); sb.append( ", " ).append( text ).append( " }" ); return sb.toString(); } private String indentToLevel( int level, String tab ) { if( level < 1 ) return ""; StringBuilder sb = new StringBuilder(); while( level-- > 0 ) sb.append( tab ); return sb.toString(); } public String toStringEnchilada( int level, String tab ) { StringBuilder sb = new StringBuilder(); sb.append( indentToLevel( level, tab ) ).append( "{\n" ); level++; if( localName != null && localName.length() > 0 ) sb.append( indentToLevel( level, tab ) ).append( localName ); else sb.append( indentToLevel( level, tab ) ).append( qname ); if( text != null 
&& text.length() > 0 ) sb.append( ",\n" ).append( indentToLevel( level, tab ) ).append( text ); if( children.size() > 0 ) { sb.append( '\n' ); for( XmlElement element : children ) sb.append( element.toStringEnchilada( level, tab ) ); } else { sb.append( '\n' ); } level--; sb.append( indentToLevel( level, tab ) ).append( "}\n" ); return sb.toString(); } }
Works in concert with internal class ContentBuffer to gather element text. This is necessary because the text underneath a given element may have more XML elements tossed in among it; XML allows this, inelegant though it is.
package com.etretatlogicials.xml; import java.util.ArrayList; import java.util.List; public class SaxCharactersStack { private List< StringBuffer > stack; public SaxCharactersStack() { stack = new ArrayList<>(); } public void push( StringBuffer element ) { stack.add( element ); } public StringBuffer pop() { return stack.remove( stack.size() - 1 ); } public String toString() { return stack.toString(); } }
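To see why element text has to be gathered piecemeal, here is a minimal sketch that records the SAX events for a snippet of mixed content. The class name MixedContentDemo and the method events() are made up for illustration and are not part of the code on this page; the sketch assumes only the SAX parser that ships with the JDK.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original code.
public class MixedContentDemo
{
  /** Returns the SAX events seen, in order, as readable strings. */
  public static List< String > events( String xml ) throws Exception
  {
    final List< String > seen = new ArrayList<>();

    DefaultHandler handler = new DefaultHandler()
    {
      @Override
      public void startElement( String uri, String localName, String qName, Attributes attributes )
      {
        seen.add( "start:" + qName );
      }

      @Override
      public void characters( char[] ch, int start, int length )
      {
        // SAX may deliver one run of text in several calls; record each non-blank piece
        String text = new String( ch, start, length ).trim();
        if( !text.isEmpty() )
          seen.add( "text:" + text );
      }

      @Override
      public void endElement( String uri, String localName, String qName )
      {
        seen.add( "end:" + qName );
      }
    };

    SAXParserFactory.newInstance().newSAXParser()
        .parse( new ByteArrayInputStream( xml.getBytes( StandardCharsets.UTF_8 ) ), handler );
    return seen;
  }

  public static void main( String[] args ) throws Exception
  {
    for( String event : events( "<p>before <b>bold</b> after</p>" ) )
      System.out.println( event );
  }
}
```

The text belonging to p arrives in at least two separate characters() calls, one on either side of the b element, which is why a buffer per open element has to be set aside and picked up again later.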
We keep the elements on a stack as we encounter them so that we know where to attach the element text, which we don't finish gathering until the closing tag arrives.
package com.etretatlogicials.xml; import java.util.Iterator; import java.util.LinkedList; public class XmlElementStack { // @formatter:off private LinkedList< XmlElement > stack = new LinkedList<>(); public void push( XmlElement element ) { stack.addFirst( element ); } public XmlElement pop() { return stack.removeFirst(); } public XmlElement peek() { return stack.getFirst(); } public int depth() { return stack.size(); } public boolean isEmpty() { return stack.isEmpty(); } public String renderStackAsXPath() { Iterator< XmlElement > iterator = stack.descendingIterator(); StringBuilder path = new StringBuilder(); while( iterator.hasNext() ) { XmlElement element = iterator.next(); if( element.getLocalName().equals( "<document-top>" ) ) continue; path.append( element ); path.append( '/' ); } String xpath = path.toString(); return xpath.substring( 0, xpath.length() - 1 ); } }
This is where the heavy lifting is done. As described elsewhere on this page, SAX calls this handler at key points in the parsing process.
package com.etretatlogicials.xml; import org.xml.sax.Attributes; import org.xml.sax.SAXException; import org.xml.sax.helpers.DefaultHandler; import java.util.NoSuchElementException; public class XmlSchemaHandler extends DefaultHandler { private XmlElementStack elementStack; private SaxCharactersStack characterStack; private XmlElement finishedProduct; private ContentBuffer contentBuffer; public void startDocument() throws SAXException { elementStack = new XmlElementStack(); characterStack = new SaxCharactersStack(); contentBuffer = new ContentBuffer(); } public void startElement( String uri, String localName, String qName, Attributes attributes ) throws SAXException { XmlElement element = new XmlElement( uri, localName, qName, attributes ); contentBuffer.stack.push( contentBuffer.buffer ); contentBuffer.buffer = new StringBuffer(); try { XmlElement parent = elementStack.peek(); parent.addChild( element ); } catch( NoSuchElementException e ) { ; } elementStack.push( element ); } public void characters( char[] ch, int start, int length ) throws SAXException { String characters = new String( ch, start, length ); if( contentBuffer.buffer != null ) contentBuffer.buffer.append( characters.trim() ); } public void endElement( String uri, String localName, String qName ) throws SAXException { XmlElement element = elementStack.pop(); if( contentBuffer.buffer != null ) { String text = ( contentBuffer.buffer.length() > 0 ) ? contentBuffer.buffer.toString() : ""; element.setText( text ); try { contentBuffer.buffer = contentBuffer.stack.pop(); } catch( Exception e ) { /* This typically happens on the last element, so we protect against * crashing thus. */ ; } } // on the last element, push it back so it can be returned by endDocument()... 
if( elementStack.isEmpty() ) elementStack.push( element ); } public void endDocument() throws SAXException { finishedProduct = elementStack.pop(); } public void ignorableWhitespace( char ch[], int start, int length ) throws SAXException { } static void startBufferContent( ContentBuffer contentBuffer, String elementName ) { contentBuffer.stack.push( contentBuffer.buffer ); contentBuffer.buffer = new StringBuffer(); } public static class ContentBuffer { // special case for public StringBuffer buffer; // element text storage public SaxCharactersStack stack = new SaxCharactersStack(); // help composing element text } public XmlElement getFinishedProduct() { return finishedProduct; } }
This is excerpted from a parser in a servlet used to process in-coming requests in XML. The guts of what to do with the parsed XML aren't shown, only the place where that happens.
Sets up the parser.
import java.io.IOException;
import java.io.InputStream;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

public class ParseIncoming
{
  private final IncomingDataHandler handler;
  private XMLReader                 xmlReader;
  private SAXParser                 parser;

  // use this to create new parser instances...
  static SAXParserFactory factory = SAXParserFactory.newInstance();

  /**
   * Set up an instance of a SAX parser with handler initialized to zilch.
   * @param state used by the handler.
   * @throws Exception should any error occur.
   */
  public ParseIncoming( ParsingState state ) throws Exception
  {
    handler = new IncomingDataHandler( state );

    try
    {
      parser    = factory.newSAXParser();
      xmlReader = parser.getXMLReader();
    }
    catch( ParserConfigurationException e )
    {
      throw new Exception( "Parser-configuration error: " + e.getMessage() );
    }
    catch( SAXException e )
    {
      throw new Exception( "Parser handler error: " + e.getMessage() );
    }
  }

  /**
   * Parse the in-coming data.
   * @param in in-coming data as a stream.
   * @throws Exception should any error occur.
   */
  public void parse( final InputStream in ) throws Exception
  {
    try
    {
      parser.parse( in, handler );
    }
    catch( IOException e )
    {
      throw new Exception( "I/O exception: " + e.getMessage() );
    }
    catch( SAXException e )
    {
      throw new Exception( "SAX parser handler error: " + e.getMessage() );
    }
    catch( Exception e )
    {
      throw new Exception( "Unknown exception: " + e.toString() );
    }
  }
}
In order to do something cogent with what's parsed, this is the stack that keeps the data together and useful.
import java.util.HashMap; import java.util.Iterator; import java.util.LinkedList; import java.util.Map; import org.xml.sax.Attributes; /** * Implements a stack that remembers everything important to parsing, to wit: * * - the element name (the stack can be rendered to produce an XPath), * - a map of any attributes ( qualified name as key plus value), and * - text content associated with the element, if any. * * We give up ElementStack.Element intentionally, but the content * field is the only one we consider mutable because of how late in the game it * can be created. */ public class ElementStack { public static final String DOCUMENT_TOP = "<document-top>"; public static class Element { protected final String elementName; // immutable protected final Map< String, String > attributes; // immutable protected StringBuilder content; // intentionally mutable protected Element( final String name, Attributes saxAttributes ) { elementName = name; content = new StringBuilder(); attributes = new HashMap<>(); if( saxAttributes != null ) { int attrLength = saxAttributes.getLength(); for( int attr = 0; attr < attrLength; attr++ ) { String attribute = saxAttributes.getQName( attr ); String value = saxAttributes.getValue( attr ); attributes.put( attribute, value ); } } } } private final LinkedList< Element > stack = new LinkedList<>(); public void push( final String elementName, Attributes attributes ) { Element element = new Element( elementName, attributes ); stack.addFirst( element ); } public Element pop() { return stack.removeFirst(); } public Element peek() { return stack.getFirst(); } public int depth() { return stack.size(); } public boolean isEmpty() { return stack.isEmpty(); } /** Pop the element and return (only) the element name. */ public String popName() { Element element = stack.removeFirst(); return element.elementName; } /** Peek at (only) the element name. 
*/ public String peekName() { Element element = stack.getFirst(); return element.elementName; } /** Render the entire stack (of element names) as an XPath. */ public String renderStackAsXPath() { Iterator< Element > iterator = stack.descendingIterator(); StringBuilder path = new StringBuilder(); while( iterator.hasNext() ) { Element element = iterator.next(); String name = element.elementName; if( name.equals( DOCUMENT_TOP ) ) continue; path.append( name ); path.append( '/' ); } String xpath = path.toString(); return xpath.substring( 0, xpath.length() - 1 ); } public StringBuilder peekContent() { Element element = stack.peek(); assert element != null; return element.content; } /** * To accomplish this, we have to peek at the top element (to * get it), then add the content to the buffer. * @param content to append to what's there. */ public void appendContent( String content ) { Element element = stack.peek(); assert element != null; element.content.append( content ); } /** Format the attributes as a list. */ public static String formatAttributes( ElementStack.Element element ) { StringBuilder attributes = new StringBuilder(); int count = element.attributes.size(); if( count > 0 ) { for( Map.Entry< String, String > attribute : element.attributes.entrySet() ) { count--; attributes.append( attribute.getKey() ); attributes.append( "=\"" ); attributes.append( attribute.getValue() ); attributes.append( "\"" ); if( count > 0 ) attributes.append( ", " ); } } return attributes.toString(); } }
Because this is reentrant code, potentially called by many threads, this state is instantiated separately for each. There are some debugging aids kept, but mostly it's just a stack of information that's filled out as the SAX parsing proceeds.
public class ParsingState { protected ElementStack stack = new ElementStack(); // debugging aids in case we want them... protected boolean VERBOSE; // whether or not to see debugging stuff protected int indentLevel = 1; // as if generative--which we aren't, but if we want to print protected int lineNumber = 2; // of original source document public ParsingState() { } public ParsingState( boolean verbose ) { this.VERBOSE = verbose; } }
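As a rough illustration of the one-state-per-thread idea, the sketch below gives every parse its own state object, handler and parser, so nothing mutable is shared between threads. PerThreadParsing and CountingState are hypothetical names; CountingState is a much-reduced stand-in for ParsingState, invented only for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original code.
public class PerThreadParsing
{
  /** Hypothetical, much-reduced stand-in for ParsingState. */
  static class CountingState
  {
    int elements;
  }

  /** One parse gets one state, one handler and one parser; nothing is shared. */
  public static int countElements( String xml ) throws Exception
  {
    final CountingState state = new CountingState();

    DefaultHandler handler = new DefaultHandler()
    {
      @Override
      public void startElement( String uri, String localName, String qName, Attributes attributes )
      {
        state.elements++;
      }
    };

    SAXParserFactory.newInstance().newSAXParser()
        .parse( new InputSource( new StringReader( xml ) ), handler );
    return state.elements;
  }

  /** Parse each document on its own thread; expects at least one document. */
  public static int[] parseConcurrently( String... documents ) throws Exception
  {
    ExecutorService pool = Executors.newFixedThreadPool( documents.length );
    try
    {
      List< Future< Integer > > futures = new ArrayList<>();
      for( final String document : documents )
      {
        Callable< Integer > task = () -> countElements( document );
        futures.add( pool.submit( task ) );
      }

      int[] counts = new int[ documents.length ];
      for( int i = 0; i < counts.length; i++ )
        counts[ i ] = futures.get( i ).get();
      return counts;
    }
    finally
    {
      pool.shutdown();
    }
  }

  public static void main( String[] args ) throws Exception
  {
    for( int count : parseConcurrently( "<a/>", "<a><b/><c/></a>" ) )
      System.out.println( count );
  }
}
```

Because the handler only ever touches the state it was constructed with, the same handler class can serve any number of simultaneous requests.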
Here's the meat of the parser. It's in method endElement() that the data parsed will be used to call code that does something with the data.
import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

import com.windofkeltia.utilities.StringUtilities;

import static com.windofkeltia.sax.ElementStack.DOCUMENT_TOP;

public class IncomingDataHandler extends DefaultHandler
{
  private final ParsingState state;

  public IncomingDataHandler( ParsingState state )
  {
    super();
    this.state = state;
  }

  /**
   * Identifies and handles each opening element tag. Here we're concerned with
   *
   * a) element names (and namespaces),
   * b) attributes,
   * c) saving our place for composing element text, and
   * d) building XPaths.
   *
   * @param uri namespace URI, where the prefix is from; we ignore this.
   * @param localName local name (without namespace prefix).
   * @param elementName SAX calls this the qname; it's both the namespace prefix and local name.
   * @param attributes any attributes on the element.
   */
  public void startElement( String uri, String localName, String elementName, Attributes attributes )
  {
    // debugging: maintain indentation
    if( state.VERBOSE )
      maintainIndentation();

    state.stack.push( elementName, attributes ); // save element and attributes for endElement()...
  }

  /**
   * Gathers the plain text content between the opening and closing tags.
   * Of course, some elements are comprehensive and contain additional elements
   * without also having plain text.
   * @param ch array holding the characters.
   * @param start starting position in the array.
   * @param length number of characters to use from the array.
   */
  public void characters( char ch[], int start, int length )
  {
    String content = getTextContentFromSaxCharacters( ch, start, length );

    // append content to the owning (current) element...
    state.stack.appendContent( content );
  }

  /**
   * Identifies the closing element tag, an opportunity to wrap up the greater
   * tag, whatever we want to do with it. It's at this point that we know the
   * tag is finished and what we've got in character content.
   */
  public void endElement( String uri, String localName, String elementName )
  {
    String                xpath      = state.stack.renderStackAsXPath();
    ElementStack.Element  element    = state.stack.pop(); // (and, pop this one off the stack)
    Map< String, String > attributes = element.attributes;
    String                content    = element.content.toString();
    String                text       = state.stack.peekContent().toString(); // text gathered so far by the parent

    // =====================================================================
    // This is where the majority of business is conducted using
    // - element name
    // - associated attributes
    // - content text
    // (Do stuff here...)

    if( state.VERBOSE )
    {
      String progress = state.lineNumber + ": " + xpath;

      if( element.attributes.size() > 0 )
        progress += "( " + ElementStack.formatAttributes( element ) + " )";

      int length = Math.min( content.length(), 40 );

      if( content.length() > 1 )
        progress += ": " + content.substring( 0, length ).trim();

      System.out.println( progress );
      maintainIndentation();
    }
  }

  /** This is the beginning of the document. Mark it on the stack. */
  public void startDocument()
  {
    state.stack.push( DOCUMENT_TOP, null );

    if( state.VERBOSE )
      System.out.println( "Parsed:" );
  }

  /** This is the end of the document. Pop the original beginning mark off the stack. */
  public void endDocument()
  {
    state.stack.pop();
  }

  /** If we care (debugging is turned on), maintain indentation, indeed, create left margin tabs. */
  private String maintainIndentation()
  {
    state.indentLevel++;
    return makeTabForIndentation( state.indentLevel );
  }

  private String makeTabForIndentation( int indentationLevel )
  {
    String tab = "";

    while( indentationLevel-- > 0 )
      //noinspection StringConcatenationInLoop
      tab += " ";

    return tab;
  }

  /**
   * Gather text content from SAX.
   * @return content bereft of multiple newlines from end leaving at most one.
   */
  private String getTextContentFromSaxCharacters( char ch[], int start, int length )
  {
    String        characters = new String( ch, start, length );
    String[]      lines      = characters.split( "\n" );
    StringBuilder sb         = new StringBuilder();

    for( String line : lines )
      sb.append( line.trim() ).append( '\n' );

    // debugging: bump line count by number of newlines
    if( state.VERBOSE )
      state.lineNumber += StringUtilities.countCharactersInString( characters, '\n' );

    // remove multiple newlines from end of content to leave just one...
    return StringUtilities.chomp( sb.toString() );
  }
}
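The pattern of gathering in characters() and acting in endElement() can be boiled down to the sketch below. It is a simplified illustration rather than the servlet's actual logic: EndElementDemo and leafText() are hypothetical names, and "doing something with the data" is reduced to storing each element's own text in a map.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original code.
public class EndElementDemo
{
  /** Maps element name to its own (trimmed) text, skipping elements without any. */
  public static Map< String, String > leafText( String xml ) throws Exception
  {
    final Map< String, String >  result  = new LinkedHashMap<>();
    final Deque< StringBuilder > buffers = new ArrayDeque<>();

    DefaultHandler handler = new DefaultHandler()
    {
      @Override
      public void startElement( String uri, String localName, String qName, Attributes attributes )
      {
        buffers.push( new StringBuilder() );          // a fresh buffer for this element
      }

      @Override
      public void characters( char[] ch, int start, int length )
      {
        if( !buffers.isEmpty() )
          buffers.peek().append( ch, start, length ); // text belongs to the innermost open element
      }

      @Override
      public void endElement( String uri, String localName, String qName )
      {
        // only now is the element's text known to be complete
        String text = buffers.pop().toString().trim();
        if( !text.isEmpty() )
          result.put( qName, text );
      }
    };

    SAXParserFactory.newInstance().newSAXParser()
        .parse( new InputSource( new StringReader( xml ) ), handler );
    return result;
  }

  public static void main( String[] args ) throws Exception
  {
    System.out.println( leafText( "<lab><test>Bilirubin</test><result>0.1</result></lab>" ) );
  }
}
```

A repeated element name would overwrite its earlier entry in this sketch; keeping the XPath as the key, as the stack on this page allows, is one way around that.<result>0.1</result></lab>" );
assert m.get( "test" ).equals( "Bilirubin" );
assert m.get( "result" ).equals( "0.1" );
assert !m.containsKey( "lab" );
assert m.size() == 2;
</test>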
Here's how to set up the parser and call it with some data.
import org.junit.Before; import org.junit.Rule; import org.junit.Test; import org.junit.rules.TestName; import com.windofkeltia.utilities.TestUtilities; import java.io.ByteArrayInputStream; import java.io.IOException; import java.io.InputStream; import java.nio.charset.StandardCharsets; public class SaxParserTest { @Rule public TestName name = new TestName(); @Before public void setUp() { TestUtilities.setUp( name ); } private static final boolean VERBOSE = TestUtilities.VERBOSE; private InputStream openStreamOnFile( final String filename ) throws IOException { try { final String PATH = TestUtilities.TEST_FODDER + filename; final String CONTENT = TestUtilities.getLinesInFile( PATH ); return new ByteArrayInputStream( CONTENT.getBytes( StandardCharsets.UTF_8 ) ); } catch( IOException e ) { if( VERBOSE ) System.out.println( "Failed to open or read file " + filename + " containing test data: " + e.getMessage() ); throw e; } } @Test public void test() throws Exception { ParsingState state = new ParsingState( true ); // turn on debugging ParseIncoming parser = new ParseIncoming( state ); InputStream input = openStreamOnFile( "xml-data.xml" ); parser.parse( input ); } }
A strict SAX parser chokes on unescaped markup characters in text, such as a bare < in front of a number, so if your XML has embedded HTML (it happens to me in the medical documents I deal with), you can delimit the HTML and pass it off to a subparser written using TagSoup. Meanwhile, here's how easy it is to set up and use TagSoup in a very simple example.
The highlighted lines are supposed to draw your attention to the important SAX or TagSoup bits of the code. A version of calling the parser with string- rather than file input is provided for gee whiz value.
package com.etretatlogiciels.html.tagsoup; import java.io.File; import org.xml.sax.SAXException; public class Example { public static void main( String[] args ) { TagSoupParser parser = new TagSoupParser(); try { File file = new File( "test/fodder/Sample.html" ); parser.run( file ); } catch( SAXException e ) { e.printStackTrace(); } } }
package com.etretatlogiciels.html.tagsoup;

import java.io.File;
import java.io.IOException;
import java.io.StringReader;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

import org.ccil.cowan.tagsoup.jaxp.SAXParserImpl;

public class TagSoupParser
{
  /** Parse the HTML held in a file. */
  public void run( File file ) throws SAXException
  {
    TagSoupHandler handler = new TagSoupHandler();
    SAXParserImpl  parser  = SAXParserImpl.newInstance( null );

    try
    {
      parser.parse( file, handler );
    }
    catch( IOException e )
    {
      e.printStackTrace();
    }
  }

  /** Parse the HTML held in a string. */
  public void run( String content ) throws SAXException
  {
    TagSoupHandler handler = new TagSoupHandler();
    SAXParserImpl  parser  = SAXParserImpl.newInstance( null );

    try
    {
      // SAXParser.parse( String, ... ) treats its string as a URI, not as content,
      // so wrap the content in an InputSource instead
      parser.parse( new InputSource( new StringReader( content ) ), handler );
    }
    catch( IOException e )
    {
      e.printStackTrace();
    }
  }
}
package com.etretatlogiciels.html.tagsoup; import org.xml.sax.Attributes; import org.xml.sax.SAXException; import org.xml.sax.helpers.DefaultHandler; public class TagSoupHandler extends DefaultHandler { int tab = 0; public void startElement( String uri, String localName, String qName, Attributes attributes ) throws SAXException { System.out.println( getTabs( tab ) + "<" + qName + ">" ); tab++; } public void characters( char ch[], int start, int length ) throws SAXException { String text = new String( ch, start, length ).trim(); if( text.length() > 0 ) System.out.println( getTabs( tab+1 ) + text ); } public void endElement( String uri, String localName, String qName ) throws SAXException { tab--; System.out.println( getTabs( tab ) + "</" + qName + ">" ); } private static String getTabs( int level ) { String spaces = ""; while( level-- > 0 ) spaces += " "; return spaces; } }
What's passed in for parsing...
<html> <head> </head> <body> <table width="100%" border="1"> <thead> <tr> <th> Date </th> <th> Type (Loinc) </th> <th> Result </th> <th> Ref Range </th> </tr> </thead> <tbody> <tr> <td> 6/2/2015 </td> <td> Bilirubin Direct(1968-7) </td> <td> <0.1 </td> <!-- the whole purpose is to see TagSoup handle this <! --> <td> 0.0-0.2 </td> </tr> </tbody> </table> </body> </html>
...and what comes out when Example.main() is run.
<html> <head> </head> <body> <table> <thead> <tr> <th> Date </th> <th> Type (Loinc) </th> <th> Result </th> <th> Ref Range </th> </tr> </thead> <tbody> <tr> <td> 6/2/2015 </td> <td> Bilirubin Direct(1968-7) </td> <td> <0.1 </td> <td> 0.0-0.2 </td> </tr> </tbody> </table> </body> </html>
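For contrast, here is a small sketch of what the JDK's own strict SAX parser makes of the troublesome table cell from the sample. StrictSaxDemo and isWellFormed() are hypothetical names used only for this illustration; no TagSoup is involved.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original code.
public class StrictSaxDemo
{
  /** Returns true if the JDK's SAX parser accepts the text as well-formed XML. */
  public static boolean isWellFormed( String xml ) throws Exception
  {
    try
    {
      // DefaultHandler rethrows fatal errors, so malformed input surfaces as an exception
      SAXParserFactory.newInstance().newSAXParser()
          .parse( new InputSource( new StringReader( xml ) ), new DefaultHandler() );
      return true;
    }
    catch( SAXParseException e )
    {
      return false;
    }
  }

  public static void main( String[] args ) throws Exception
  {
    System.out.println( isWellFormed( "<td> <0.1 </td>" ) );    // bare < in the text
    System.out.println( isWellFormed( "<td> &lt;0.1 </td>" ) ); // same text, escaped
  }
}
```

The first call reports false because a bare < cannot begin character data in XML; the escaped form is accepted. The JDK parser may also print a fatal-error line to standard error for the first case.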
<properties>
  <sax.version>2.0.1</sax.version>
  <tagsoup.version>1.2</tagsoup.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.ccil.cowan.tagsoup</groupId>
    <artifactId>tagsoup</artifactId>
    <version>${tagsoup.version}</version>
  </dependency>
  <dependency>
    <groupId>sax</groupId>
    <artifactId>sax</artifactId>
    <version>${sax.version}</version>
  </dependency>
</dependencies>