Using the Simple API for XML (SAX) Parser

Russell Bateman
July 2015


Table of Contents

Meat and potatoes
How to create a handler
How to register a handler with the parser
SAX goodies
DefaultHandler methods
ContentHandler methods
LexicalHandler methods
Locator
Comparison of handler interfaces
Events
Java samples
Python samples
Java code for creating schema from XML
HTML parsing with TagSoup

Introduction

SAX is the Simple API for XML, a de facto standard. I first used it in C circa 2008, and I've used it in Java and Python many times since. The SAX parser works by offering a number of interfaces that you implement and pass to the generic parser for it to use.

SAX is an event-driven approach to parsing XML documents. It's an alternative to the parsing style provided by the Document Object Model (DOM).

Note in passing that SAX requires very little memory (especially as compared to DOM, which builds the whole document tree) to accomplish its work, and it's generally faster. Because SAX is event-driven and keeps little to no state around, you can build as much structure as you like, on your own and in the way you like. You can also use it on streaming XML documents: you do not have to have the whole enchilada in memory. Think of SAX as going down a staircase with one hand on the rail: only what's under your hand is in memory.

Meat and potatoes

The essence of parsing with SAX lies in five methods: startDocument and endDocument, but especially startElement, endElement and characters.

  1. startElement(), called when an element (tag) is encountered, <element ...>, gives you the element and a list of all (any) attributes, allows you to erect element state, probably on a stack.
  2. endElement(), called when element's end is reached, </element>. There may have been intervening elements, so your element's state is on a stack and would get popped off at this point.
  3. characters(), called to give text between element's beginning and end. You must be prepared to accumulate these across multiple calls because it may be called more than one time.
  4. startDocument(), called at beginning of XML document, allows you to set up document state.
  5. endDocument(), called at end of document.
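The stack idea in the list above can be sketched as follows. This is only a minimal illustration, not the handler developed later on this page: the class name ElementTextDemo and its collect() helper are made up for the example. Each startElement() pushes a fresh buffer, characters() appends to whatever buffer is on top (however many times it is called), and endElement() pops the finished buffer.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// ElementTextDemo is a hypothetical name used only for this illustration.
class ElementTextDemo extends DefaultHandler
{
  private final Deque< StringBuilder > stack   = new ArrayDeque<>();
  private final List< String >         results = new ArrayList<>();

  @Override
  public void startElement( String uri, String localName, String qName, Attributes attributes )
  {
    stack.push( new StringBuilder() );            // erect state for this element
  }

  @Override
  public void characters( char[] ch, int start, int length )
  {
    if( !stack.isEmpty() )
      stack.peek().append( ch, start, length );   // may be called several times per element
  }

  @Override
  public void endElement( String uri, String localName, String qName )
  {
    String text = stack.pop().toString().trim();  // the element is finished: pop its state
    results.add( qName + "=" + text );
  }

  // Parses the XML string and returns "name=text" for each element, in closing order.
  static List< String > collect( String xml )
  {
    ElementTextDemo handler = new ElementTextDemo();

    try
    {
      SAXParserFactory.newInstance().newSAXParser()
          .parse( new ByteArrayInputStream( xml.getBytes( StandardCharsets.UTF_8 ) ), handler );
    }
    catch( Exception e )
    {
      throw new RuntimeException( e );
    }

    return handler.results;
  }

  public static void main( String[] args )
  {
    // prints [b=two, a=onethree]
    System.out.println( collect( "<a>one<b>two</b>three</a>" ) );
  }
}
```

Note that the text of a is gathered on both sides of the intervening b element, which is exactly why the state has to live on a stack.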

How to create a handler

Create your handler thus with, at first, the five methods listed above. DefaultHandler already supplies empty (do-nothing) implementations of all its methods, so you need only override the ones you care about. This rudimentary example prints out what the SAX parser sees when you run it on an XML file.

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class MySaxParserHandler extends DefaultHandler
{
  private String elementContents;

  public void startElement( String uri, String localName, String elementName, Attributes attributes )
    throws SAXException
  {
    elementContents = "";
    System.out.print( "<" + elementName + ">" );
  }

  public void endElement( String uri, String localName, String elementName ) throws SAXException
  {
    // print the accumulated text, then the closing tag...
    System.out.print( "  " + elementContents );
    System.out.println( "</" + elementName + ">" );
  }

  public void characters( char ch[], int start, int length ) throws SAXException
  {
    String contents = new String( ch, start, length ).trim();

    elementContents += contents;
  }

  public void startDocument() throws SAXException
  {
    System.out.print( "Begin" );
  }

  public void endDocument() throws SAXException
  {
    System.out.print( "Done!" );
  }
}

How to register a handler with the parser

Now create your application's main entry point, set up the parser to consume your handler, then call it (the parser.parse() line below). We'll pass the sample XML filename as the first command-line argument.

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

public class MySaxParser
{
  public static void main( String[] args )
  {
    MySaxParserHandler handler = new MySaxParserHandler();
    SAXParserFactory   factory = SAXParserFactory.newInstance();
    SAXParser          parser  = null;

    try
    {
      parser = factory.newSAXParser();
    }
    catch( ParserConfigurationException e )
    {
      System.err.println( "Parser-configuration error: " + e.getMessage() );
      System.exit( -1 );
    }
    catch( SAXException e )
    {
      System.err.println( "SAX parser error: " + e.getMessage() );
      System.exit( -1 );
    }

    try
    {
      // not needed for a simple parse; it's what you'd use to register extra handlers...
      XMLReader reader = parser.getXMLReader();

      // args[ 0 ] is the first command-line argument, the XML filename...
      parser.parse( args[ 0 ], handler );
    }
    catch( Exception e )
    {
      System.err.println( "parser handler error: " + e.getMessage() );
      System.exit( -1 );
    }
  }
}

SAX goodies

SAX gives additional goodies to the DefaultHandler, everything needed to parse even the most complex XML file. Here are some of my favorites:

DefaultHandler methods

ContentHandler methods

LexicalHandler methods

Locator, given to setDocumentLocator()

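A Locator tells you where the parser currently is in the document: the line number and column number (along with the document's public and system identifiers). You get an instance of it, via setDocumentLocator(), pretty much before anything else, and the parser maintains the line number and column throughout its parsing. Very useful for error reporting in XML validation and other activities.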
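Here is a small sketch of the Locator in use, under the assumption that the parser supplies one (SAX only encourages parsers to do so; the one bundled with the JDK does). The class name LocatorDemo and the helper linesOf() are invented for the illustration. The handler keeps the Locator it is handed and asks it for the line number each time an element starts.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// LocatorDemo is a hypothetical name used only for this illustration.
class LocatorDemo extends DefaultHandler
{
  private Locator              locator;
  private final List< String > seen = new ArrayList<>();

  @Override
  public void setDocumentLocator( Locator locator )
  {
    this.locator = locator;                       // handed over before startDocument()
  }

  @Override
  public void startElement( String uri, String localName, String qName, Attributes attributes )
  {
    // the locator reports the position just after the start tag...
    int line = ( locator != null ) ? locator.getLineNumber() : -1;
    seen.add( qName + "@" + line );
  }

  // Parses the XML string and returns "name@line" for each element start.
  static List< String > linesOf( String xml )
  {
    LocatorDemo handler = new LocatorDemo();

    try
    {
      SAXParserFactory.newInstance().newSAXParser()
          .parse( new ByteArrayInputStream( xml.getBytes( StandardCharsets.UTF_8 ) ), handler );
    }
    catch( Exception e )
    {
      throw new RuntimeException( e );
    }

    return handler.seen;
  }

  public static void main( String[] args )
  {
    // prints [a@1, b@2]
    System.out.println( linesOf( "<a>\n  <b>text</b>\n</a>" ) );
  }
}
```

This spares you from counting newlines yourself in characters() and comment().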

Comparison of handler interfaces

Here's a table comparing two additional SAX parser handlers you can add to your parser. For example, to add the LexicalHandler, recreate the handler above thus:

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.DefaultHandler;

public class MySaxParserHandler extends DefaultHandler implements LexicalHandler
{
  .
  .
  .

Here's the comparison table. LexicalHandler adds, for example, special methods to allow you to handle XML comments, which are otherwise completely ignored (you won't ever be notified about them), as well as CDATA, DTDs, etc.

DefaultHandler         ContentHandler         LexicalHandler
characters             characters             -
-                      -                      comment
-                      -                      endCDATA
-                      -                      endDTD
endDocument            endDocument            -
endElement             endElement             -
-                      -                      endEntity
endPrefixMapping       endPrefixMapping       -
error                  -                      -
fatalError             -                      -
ignorableWhitespace    ignorableWhitespace    -
notationDecl           -                      -
processingInstruction  processingInstruction  -
resolveEntity          -                      -
setDocumentLocator     setDocumentLocator     -
skippedEntity          skippedEntity          -
-                      -                      startCDATA
-                      -                      startDTD
startDocument          startDocument          -
startElement           startElement           -
-                      -                      startEntity
startPrefixMapping     startPrefixMapping     -
unparsedEntityDecl     -                      -
warning                -                      -

I first wrote up these comparisons in an effort to understand why, in some code I was reading, ContentHandler was being used instead of DefaultHandler. Then I realized that it was because the original author did not want to handle errors.
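Implementing LexicalHandler is not enough on its own: the handler must also be registered through the lexical-handler property, or comment() and friends are never called. The sketch below shows this under a few assumptions: the class name CommentDemo and the helper commentsIn() are invented for the example, and it relies on the parser supporting that property, as the one bundled with the JDK does.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.DefaultHandler;

// CommentDemo is a hypothetical name used only for this illustration.
class CommentDemo extends DefaultHandler implements LexicalHandler
{
  private final List< String > comments = new ArrayList<>();

  // the one LexicalHandler method this sketch cares about...
  public void comment( char[] ch, int start, int length )
  {
    comments.add( new String( ch, start, length ).trim() );
  }

  // the rest of the LexicalHandler contract, as stubs...
  public void startDTD( String name, String publicId, String systemId ) { }
  public void endDTD()                                                  { }
  public void startEntity( String name )                                { }
  public void endEntity( String name )                                  { }
  public void startCDATA()                                              { }
  public void endCDATA()                                                { }

  // Parses the XML string and returns the text of every comment found.
  static List< String > commentsIn( String xml )
  {
    CommentDemo handler = new CommentDemo();

    try
    {
      XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();

      reader.setContentHandler( handler );
      // without this registration, comment() is never invoked...
      reader.setProperty( "http://xml.org/sax/properties/lexical-handler", handler );
      reader.parse( new InputSource( new StringReader( xml ) ) );
    }
    catch( Exception e )
    {
      throw new RuntimeException( e );
    }

    return handler.comments;
  }

  public static void main( String[] args )
  {
    // prints [first, second]
    System.out.println( commentsIn( "<doc><!-- first --><item/><!-- second --></doc>" ) );
  }
}
```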

All methods...

...at your disposal with terse comments on what they do. I have not myself done everything possible with SAX and there's much I don't know.

Method                 DefaultHandler  ContentHandler  LexicalHandler  Purpose
characters             x               x               -               gathers plain text between opening and closing tags
comment                -               -               x               like characters() for comments
endCDATA               -               -               x               reports end of CDATA
endDTD                 -               -               x               reports end of DOCTYPE section
endDocument            x               x               -               reports end of (whole) document
endElement             x               x               -               reports end of XML element (tag)
endEntity              -               -               x               reports end of XML entities
endPrefixMapping       x               x               -               reports end of namespace mapping
error                  x               -               -               reports (recoverable) parser error
fatalError             x               -               -               reports fatal (unrecoverable) parser error
ignorableWhitespace    x               x               -               invoked in combination with a DTD
notationDecl           x               -               -
processingInstruction  x               x               -               reports processing instructions like <?target data?>
resolveEntity          x               -               -
setDocumentLocator     x               x               -               yields an instance of Locator for later use
skippedEntity          x               x               -               reports skipped entity (could have been getting via LexicalHandler)
startCDATA             -               -               x               reports beginning of CDATA, starting with <![CDATA[
startDTD               -               -               x               reports beginning of DOCTYPE section
startDocument          x               x               -               reports beginning of (whole) document
startElement           x               x               -               reports beginning of XML element (tag)
startEntity            -               -               x               reports beginning of XML entities like DTD
startPrefixMapping     x               x               -               reports beginning of namespace mapping; a namespace is like X in <X:tag ...>
unparsedEntityDecl     x               -               -
warning                x               -               -               reports parser warning
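The three error-reporting methods in the table (warning, error and fatalError) are what DefaultHandler has over a bare ContentHandler. Here is a rough sketch of overriding them, with a made-up class name (ErrorReportingDemo) and helper (problemsIn()). It records each problem with the line number carried by the SAXParseException, and rethrows on a fatal error since parsing cannot usefully continue.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// ErrorReportingDemo is a hypothetical name used only for this illustration.
class ErrorReportingDemo extends DefaultHandler
{
  private final List< String > problems = new ArrayList<>();

  @Override
  public void warning( SAXParseException e )
  {
    problems.add( "warning at line " + e.getLineNumber() );
  }

  @Override
  public void error( SAXParseException e )
  {
    problems.add( "error at line " + e.getLineNumber() );
  }

  @Override
  public void fatalError( SAXParseException e ) throws SAXException
  {
    problems.add( "fatal error at line " + e.getLineNumber() );
    throw e;                                      // the document is not well formed: stop
  }

  // Parses the XML string and returns whatever problems the parser reported.
  static List< String > problemsIn( String xml )
  {
    ErrorReportingDemo handler = new ErrorReportingDemo();

    try
    {
      SAXParserFactory.newInstance().newSAXParser()
          .parse( new ByteArrayInputStream( xml.getBytes( StandardCharsets.UTF_8 ) ), handler );
    }
    catch( SAXException e )
    {
      // already recorded by fatalError() above
    }
    catch( Exception e )
    {
      throw new RuntimeException( e );
    }

    return handler.problems;
  }

  public static void main( String[] args )
  {
    System.out.println( problemsIn( "<a><b/></a>" ) );   // well formed: nothing to report
    System.out.println( problemsIn( "<a><b></a>" ) );    // <b> is never closed: a fatal error
  }
}
```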

Events

SAX parsing is done via a sort of "call-back" mechanism. You write a handler that you supply to SAX. The handler overrides the DefaultHandler methods that SAX calls as it encounters XML elements (tags), attributes, text, etc.

In Java, these are the following (and are public in your handler):

void startDocument()
void endDocument()
void startElement( String uri, String localName, String qName, Attributes attributes )
void endElement( String uri, String localName, String qName )
void characters( char ch[], int start, int length )

—when implementing LexicalHandler...
void startDTD( String name, String publicId, String systemId )
void endDTD()
void startEntity( String name )
void endEntity( String name )
void startCDATA()
void endCDATA()
void comment( char[] ch, int start, int length )

Similarly, in Python, you implement:

def startDocument( self )
def endDocument( self )
def startElement( self, tag, attributes=None )
def endElement( self, tag )
def characters( self, content )

—xml.sax.sax2lib.LexicalHandler is not supported until Python 3
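A word on the uri, localName and qName parameters in the Java signatures above: what they contain depends on whether the parser factory was made namespace-aware, which it is not by default. The sketch below (class name NamespaceDemo and helper namesIn() are invented for the example) parses the same document both ways so you can compare. With the parser bundled in the JDK, I would expect uri and localName to be empty strings unless namespace awareness is switched on, which is why code elsewhere on this page falls back on qName.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// NamespaceDemo is a hypothetical name used only for this illustration.
class NamespaceDemo extends DefaultHandler
{
  private final List< String > names = new ArrayList<>();

  @Override
  public void startElement( String uri, String localName, String qName, Attributes attributes )
  {
    names.add( "uri=" + uri + " localName=" + localName + " qName=" + qName );
  }

  // Parses the XML string with or without namespace awareness and
  // returns one line per element showing the three name parameters.
  static List< String > namesIn( String xml, boolean namespaceAware )
  {
    NamespaceDemo handler = new NamespaceDemo();

    try
    {
      SAXParserFactory factory = SAXParserFactory.newInstance();

      factory.setNamespaceAware( namespaceAware );
      factory.newSAXParser()
          .parse( new ByteArrayInputStream( xml.getBytes( StandardCharsets.UTF_8 ) ), handler );
    }
    catch( Exception e )
    {
      throw new RuntimeException( e );
    }

    return handler.names;
  }

  public static void main( String[] args )
  {
    String xml = "<x:tag xmlns:x=\"urn:example\"/>";

    System.out.println( namesIn( xml, false ) );
    System.out.println( namesIn( xml, true ) );
  }
}
```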

Java samples

CdaAnalysis.java:
package com.etretatlogiciels.cda.analysis;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class CdaAnalysis
{
  private SAXParserFactory factory;
  private SAXParser        parser;
  private XMLReader        reader;
  private DefaultHandler   handler;

  public CdaAnalysis( String outputFilepath )
  {
    factory = SAXParserFactory.newInstance();

    try
    {
      parser  = factory.newSAXParser();
      handler = new CdaHandler( outputFilepath, false );
      reader  = parser.getXMLReader();
      reader.setProperty( "http://xml.org/sax/properties/lexical-handler", handler );
    }
    catch( ParserConfigurationException e )
    {
      e.printStackTrace();
    }
    catch( SAXException e )
    {
      e.printStackTrace();
    }
  }

  public void parse( String pathname )
  {
    try
    {
      parser.parse( pathname, handler );
    }
    catch( Exception e )
    {
      e.printStackTrace();
    }
  }
}
CdaHandler.java:
package com.etretatlogiciels.cda.analysis;

import java.util.Collections;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.DefaultHandler;

import com.etretatlogiciels.utilities.RedirectSystemOutStream;
import com.etretatlogiciels.utilities.OccurrencesTable;
import com.etretatlogiciels.utilities.XPathTable;
import com.etretatlogiciels.utilities.SymbolTableException;
import com.etretatlogiciels.utilities.XmlElementStack;

/**
 * We want to analyse elements, modifying our analysis to make a growing number of
 * observations about CDA and CCD documents.
 *
 * History
 *
 * We already gather out <text> ... </text> elements from some CDA
 * elements and put them into our CXML. What we want to do is identify other, important
 * content to gather out and extend our work to CCDs, which adhere to the CDA format
 * standard.
 *
 * About this file...
 *
 * The SAX handler is the cookie jar with all the cookies. It's the code that says
 * what we're looking for and how it's supposed to show up. In this case, however,
 * we don't want to make many decisions about what's going to be in the CDA (or
 * CCD) because we're just learning. And, anyway, we likely would never be able to
 * predict with complete accuracy what's in there. We're just observers trying to
 * find a way to disembowel the document to get content out.
 *
 * @author Russell Bateman
 * @since May 2015
 */
public class CdaHandler extends DefaultHandler implements LexicalHandler
{
  private int              lineNumber = 2; // (because all documents start with <?xml version="1.0"?>)
  private String           elementName;
  private String           characters;
  private OccurrencesTable occurrences;
  private XPathTable       xpath;
  private XmlElementStack  stack;

  private static final String DOCUMENT = "<document-top>";

  /**
   * Just output to the console whatever's found.
   */
  public CdaHandler()
  {
    super();
    occurrences = new OccurrencesTable();
    xpath       = new XPathTable();
    stack       = new XmlElementStack();
  }

  /**
   * Hereafter, all output goes to System.out which will take it to a file and/or
   * the console.
   * @param outputFilepath to which results will go.
   * @param hush shut the console up (no output to console).
   */
  public CdaHandler( String outputFilepath, boolean hush )
  {
    this();
    RedirectSystemOutStream.redirectSystemOutToFile( outputFilepath, hush );
  }

  /**
   * If processing multiple documents, this would be an opportunity to initialize
   * processing for the next document (as opposed to a previous one).
   */
  public void startDocument() throws SAXException
  {
    stack.push( DOCUMENT );
    System.out.println( "Elements in order of appearance and their line numbers..." );
  }

  /**
   * If processing multiple documents, this is an opportunity to tie off the one
   * that's been processing and is finished.
   */
  public void endDocument() throws SAXException
  {
    List< String > elements = occurrences.keys();
    int            widest   = CdaAnalysisUtilities.getMaximumKeyWidth( elements );

    System.out.println( "-----------------------------------------------------------------------" );
    System.out.println( "Elements sorted alphabetically and their frequency..." );
    // Sort alphabetically by key...
    Collections.sort( elements );
    CdaAnalysisUtilities.printPaddedList( occurrences, elements, widest );

    System.out.println( "-----------------------------------------------------------------------" );
    System.out.println( "Elements sorted by occurrence from most to least frequent and their frequency..." );
    CdaAnalysisUtilities.sortAndPrintPaddedByValue( occurrences, elements, widest );

    System.out.println( "-----------------------------------------------------------------------" );
    System.out.println( "Elements with their XPath and line number on which they appear..." );
    CdaAnalysisUtilities.sortAndPrintXPaths( xpath, elements, widest );

    if( stack.peek().equals( DOCUMENT ) )
        System.out.println( "Document is well formed" );
  }

  /**
   * —what identifies and handles the opening element tag. These can be
   * intercepted and used for analysis.
   */
  public void startElement( String uri, String localName, String qName, Attributes attributes ) throws SAXException
  {
    String elementName = qName;

    stack.push( elementName );

    if( !occurrences.contains( elementName ) )
    {
      try
      {
        occurrences.put( elementName, 1 );
      }
      catch( SymbolTableException e )
      {
        e.printStackTrace();
      }
    }
    else
    {
      try
      {
        Integer count = occurrences.get( elementName );
        count++;
        occurrences.delete( elementName );
        occurrences.put( elementName, count );
      }
      catch( SymbolTableException e )
      {
        e.printStackTrace();
      }
    }

    try
    {
      String path = stack.renderStackAsXPath();
      xpath.put( elementName, lineNumber, path );
    }
    catch( SymbolTableException e )
    {
      e.printStackTrace();
    }

    System.out.println( lineNumber + "  <" + qName + ">" );
  }

  /**
   * —what identifies the closing element tag, an opportunity to wrap up
   * the greater tag, whatever we want to do with it. It's at this point that
   * we know the tag is finished and what we've got in characters is its content.
   */
  public void endElement( String uri, String localName, String qName ) throws SAXException
  {
    stack.pop();
  }

  /**
   * —what gathers the plain text content between the opening and closing tags.
   * Of course, some elements are comprehensive and contain additional elements without
   * also having plain text.
   * @param ch     array holding the characters.
   * @param start  starting position in the array.
   * @param length number of characters to use from the array.
   */
  public void characters( char ch[], int start, int length ) throws SAXException
  {
    characters = new String( ch, start, length ).trim();

    for( int i = start; i < start+length; i++ )
    {
      if( ch[ i ] == '\n' )
        lineNumber++;
    }
  }

  /**
   * Maybe this can be used to create line numbers? No, it's only invoked in combination
   * with a DTD. If the parsed XML file doesn't declare a DTD, it won't happen.
   */
  public void ignorableWhitespace( char ch[], int start, int length ) throws SAXException
  {
//    String whitespace = new String( ch, start, length ).trim();
  }

  public void startDTD( String name, String publicId, String systemId ) throws SAXException
  {
  }

  public void endDTD() throws SAXException
  {
  }

  public void startEntity( String name ) throws SAXException
  {
  }

  public void endEntity( String name ) throws SAXException
  {
  }

  /**
   * Reports beginning of CDATA, starting with <![CDATA[ and ending with ]]>, for example,
   *
   * "<![CDATA[<p>The pug snoring on the couch next to me is <em>extremely</em> cute</p>]]>"
   */
  public void startCDATA() throws SAXException
  {
  }

  public void endCDATA() throws SAXException
  {
  }

  /**
   * Reports any and all XML comments anywhere in (internal) document.
   * @param ch     array holding the characters in the comment.
   * @param start  starting position in the array.
   * @param length number of characters to use from the array.
   */
  public void comment( char[] ch, int start, int length ) throws SAXException
  {
    for( int i = start; i < start+length; i++ )
    {
      if( ch[ i ] == '\n' )
        lineNumber++;
    }
  }
}

Python samples

import xml.sax
from io import StringIO

# httpclient and uri come from the surrounding application and are not shown here.
def getResult():
  result         = httpclient.get( uri )
  payload        = result.read()
  resultIdParser = CqPostResponsePayload()
  try:
    xml.sax.parseString( payload, resultIdParser )
  except Exception as e:
    print( e )
  return resultIdParser.getResultId()

class CqPostResponsePayload( xml.sax.ContentHandler ):
  '''
  Parse response payload, looks something like:
  <result_id>fWkcTS1a</result_id>
  '''
  def __init__( self ):
    xml.sax.ContentHandler.__init__( self )
    self.result = StringIO()
    self.resultIdCharacters = ''
  def getResultId( self ):
    return self.result.getvalue().strip()
  def startElement( self, tag, attributes=None ):
    if tag == 'result_id':
      self.resultIdCharacters = ''
  def endElement( self, tag ):
    if tag == 'result_id':
      # tie off the result_id...
      self.result.write( self.resultIdCharacters + '\n' )
  def characters( self, content ):
    # may be called more than once per element, so accumulate...
    self.resultIdCharacters += content

A different example...

def getValueOfTotalAttribute( line ):
    ''' Just the attributes. '''

    parser = HitsTagElementParser()
    try:
        xml.sax.parseString( line, parser )
    except Exception as e:
        print( e )
        return 0
    attributes = parser.getAttributes()
    return attributes

class HitsTagElementParser( xml.sax.ContentHandler ):
  def __init__( self ):
    self.attributes = {}
  def getAttributes( self ):
    return self.attributes
  def startElement( self, tag, attributes=None ):
    if tag != 'our-tag':
      return
    self.attributes = attributes
  def endElement( self, tag ):
    ''' We'll never hit this! '''
    pass
  def characters( self, content ):
    ''' We're uninterested in this. '''
    pass

Java code for creating schema from XML

This example parses XML from a string (see XmlSchemaTest.java), but it could be a file as shown in the earlier Java samples. The goal is to create an instance of XmlElement that recursively contains the rest of the XML content (whether it comes from a string or a file).

Among other things, this exercise demonstrates how to use stacks to keep track of where one is to gather text out of intermingled subelements for the owning element.

XmlSchemaTest.java

This test ensures (visually only; no assertions are made, so these are not real test cases) that our parser can go a couple of levels deep and a couple abreast, for the purpose of demonstrating the idea only.

package com.etretatlogiciels.xml;

import org.junit.Before;
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.TestName;

import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;

import com.etretatlogicials.xml.XmlElement;
import com.etretatlogicials.xml.XmlSchemaHandler;

public class XmlSchemaTest
{
  @Rule
  public TestName name = new TestName();

  @Before
  public void setUp() throws Exception
  {
    String testName  = name.getMethodName();
    int    PAD       = 100;
    int    nameWidth = testName.length();
    System.out.print( "Test: " + testName + " " );
    PAD -= nameWidth;
    while( PAD-- > 0 )
      System.out.print( "-" );
    System.out.println();
  }

  private SAXParser createParser( DefaultHandler handler )
  {
    SAXParser parser = null;

    try
    {
      parser = SAXParserFactory.newInstance().newSAXParser();
    }
    catch( ParserConfigurationException e )
    {
      System.err.println( "Parser-configuration error:" );
      e.printStackTrace();
    }
    catch( SAXException e )
    {
      System.err.println( "SAX ruleparser error:" );
      e.printStackTrace();
    }

    return parser;
  }

  private InputStream toInputStream( String string )
  {
    return new ByteArrayInputStream( string.getBytes( Charset.forName( "UTF-8" ) ) );
  }

  /**
   * Expected:
   * {
   *   document,
   *   This is a test.
   * }
   */
  @Test
  public void testOne()
  {
    final String   XML_CONTENT = "<?xml version=\"1.0\"?><document>This is a test.</document>";
    DefaultHandler handler     = new XmlSchemaHandler();
    SAXParser      parser      = createParser( handler );

    try
    {
      parser.parse( toInputStream( XML_CONTENT ), handler );
    }
    catch( Exception e )
    {
      e.printStackTrace();
    }

    XmlElement finishedProduct = ( ( XmlSchemaHandler ) handler ).getFinishedProduct();
    System.out.println( finishedProduct.toStringEnchilada( 0, "  " ) );
  }

  /**
   * Expected:
   * {
   *   document,
   *   This is a test.
   *   {
   *     subdocument,
   *     Hi!
   *   }
   * }
   */
  @Test
  public void testTwo()
  {
    final String   XML_CONTENT = "<?xml version=\"1.0\"?>\n"
                               + "<document>\n"
                               + "  This is a test.\n"
                               + "    <subdocument>\n"
                               + "      Hi!\n"
                               + "    </subdocument>\n"
                               + "  </document>";
    DefaultHandler handler     = new XmlSchemaHandler();
    SAXParser      parser      = createParser( handler );

    try
    {
      parser.parse( toInputStream( XML_CONTENT ), handler );
    }
    catch( Exception e )
    {
      e.printStackTrace();
    }

    XmlElement finishedProduct = ( ( XmlSchemaHandler ) handler ).getFinishedProduct();
    System.out.println( finishedProduct.toStringEnchilada( 0, "  " ) );
  }

  /**
   * Expected:
   * {
   *   document,
   *   This is a test.
   *   {
   *     subdocument,
   *     Hi!
   *   }
   * }
   */
  @Test
  public void testThree()
  {
    final String   XML_CONTENT = "<?xml version=\"1.0\"?>\n"
                               + "<document attribute=\"fun and games\">\n"
                               + "  This is a test.\n"
                               + "  <subdocument another=\"more fun and games\">\n"
                               + "    Hi!\n"
                               + "  </subdocument>\n"
                               + "</document>";
    DefaultHandler handler     = new XmlSchemaHandler();
    SAXParser      parser      = createParser( handler );

    try
    {
      parser.parse( toInputStream( XML_CONTENT ), handler );
    }
    catch( Exception e )
    {
      e.printStackTrace();
    }

    XmlElement finishedProduct = ( ( XmlSchemaHandler ) handler ).getFinishedProduct();
    System.out.println( finishedProduct.toStringEnchilada( 0, "  " ) );
  }

  /**
   * Expected:
   * {
   *   document,
   *   This is a test.
   *   {
   *     subdocument1,
   *     Hi!
   *   }
   *   {
   *     subdocument2,
   *     Hi, again!
   *   }
   * }
   */
  @Test
  public void testAbreast()
  {
    final String   XML_CONTENT = "<?xml version=\"1.0\"?>\n"
                               + "<document attribute=\"fun and games\">\n"
                               + "  This is a test.\n"
                               + "  <subdocument1 another=\"more fun and games\">\n"
                               + "    Hi!\n"
                               + "  </subdocument1>\n"
                               + "  <subdocument2 another=\"still more fun and games\">\n"
                               + "    Hi, again!\n"
                               + "  </subdocument2>\n"
                               + "</document>";
    DefaultHandler handler     = new XmlSchemaHandler();
    SAXParser      parser      = createParser( handler );

    try
    {
      parser.parse( toInputStream( XML_CONTENT ), handler );
    }
    catch( Exception e )
    {
      e.printStackTrace();
    }

    XmlElement finishedProduct = ( ( XmlSchemaHandler ) handler ).getFinishedProduct();
    System.out.println( finishedProduct.toStringEnchilada( 0, "  " ) );
  }
}
XmlElement.java

This is the brass ring at the end of the run. The entire XML content, as a tree, is in this object (and recursive instances below it).

package com.etretatlogicials.xml;

import org.xml.sax.Attributes;

import java.util.ArrayList;
import java.util.List;

public class XmlElement
{
  private String             uri;
  private String             localName;
  private String             qname;
  private String             text;
  private Attributes         attributes;
  private List< XmlElement > children = new ArrayList<>();

  public XmlElement( String uri, String localName, String qname, Attributes attributes )
  {
    this.uri        = uri;
    this.localName  = localName;
    this.qname      = qname;
    this.attributes = attributes;
  }

  public String getUri()                  { return uri; }
  public String getLocalName()            { return localName; }
  public String getQname()                { return qname; }
  public String getText()                 { return text; }
  public Attributes getAttributes()       { return attributes; }
  public List< XmlElement > getChildren() { return children; }

  public void setUri( String uri )                   { this.uri = uri; }
  public void setLocalName( String localName )       { this.localName = localName; }
  public void setQname( String qname )               { this.qname = qname; }
  public void setText( String text )                 { this.text = text; }
  public void setAttributes( Attributes attributes ) { this.attributes = attributes; }
  public void addChild( XmlElement child )           { this.children.add( child); }

  public String toString()
  {
    StringBuilder sb = new StringBuilder();

    sb.append( "{ " );

    if( localName != null && localName.length() > 0 )
      sb.append( localName );
    else
      sb.append( qname );
    sb.append( ", " ).append( text ).append( " }" );
    return sb.toString();
  }

  private String indentToLevel( int level, String tab )
  {
    if( level < 1 )
      return "";

    StringBuilder sb = new StringBuilder();
    while( level-- > 0 )
      sb.append( tab );
    return sb.toString();
  }

  public String toStringEnchilada( int level, String tab )
  {
    StringBuilder sb = new StringBuilder();

    sb.append( indentToLevel( level, tab ) ).append( "{\n" );

    level++;

    if( localName != null && localName.length() > 0 )
      sb.append( indentToLevel( level, tab ) ).append( localName );
    else
      sb.append( indentToLevel( level, tab ) ).append( qname );

    if( text != null && text.length() > 0 )
      sb.append( ",\n" ).append( indentToLevel( level, tab ) ).append( text );

    if( children.size() > 0 )
    {
      sb.append( '\n' );
      for( XmlElement element : children )
        sb.append( element.toStringEnchilada( level, tab ) );
    }
    else
    {
      sb.append( '\n' );
    }

    level--;

    sb.append( indentToLevel( level, tab ) ).append( "}\n" );
    return sb.toString();
  }
}
SaxCharactersStack.java

Works in concert with the internal class ContentBuffer (in XmlSchemaHandler) to gather element text. This is necessary because an element's text may be interrupted by child elements: XML allows elements to be tossed in among the text, inelegant though that is.

package com.etretatlogicials.xml;

import java.util.ArrayList;
import java.util.List;

public class SaxCharactersStack
{
  private List< StringBuffer > stack;

  public SaxCharactersStack()              { stack = new ArrayList<>(); }
  public void push( StringBuffer element ) { stack.add( element ); }
  public StringBuffer pop()                { return stack.remove( stack.size() - 1 ); }
  public String toString()                 { return stack.toString(); }
}
XmlElementStack.java

We keep the elements in a stack as we encounter them in order to know where to go to attach the element text we don't finish gathering until much later.

package com.etretatlogicials.xml;

import java.util.Iterator;
import java.util.LinkedList;

public class XmlElementStack
{
  // @formatter:off
  private LinkedList< XmlElement > stack = new LinkedList<>();

  public void       push( XmlElement element ) { stack.addFirst( element ); }
  public XmlElement pop()                      { return stack.removeFirst(); }
  public XmlElement peek()                     { return stack.getFirst(); }
  public int        depth()                    { return stack.size(); }
  public boolean    isEmpty()                  { return stack.isEmpty(); }

  public String renderStackAsXPath()
  {
    Iterator< XmlElement > iterator = stack.descendingIterator();
    StringBuilder          path     = new StringBuilder();

    while( iterator.hasNext() )
    {
      XmlElement element = iterator.next();

      if( element.getLocalName().equals( "<document-top>" ) )
        continue;

      // append the element's name rather than its toString() rendering...
      path.append( element.getQname() );
      path.append( '/' );
    }

    String xpath = path.toString();

    return xpath.substring( 0, xpath.length() - 1 );
  }
}
XmlSchemaHandler.java

This is where the heavy lifting is done. As described elsewhere on this page, SAX calls this handler at key points in the parsing process.

package com.etretatlogicials.xml;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import java.util.NoSuchElementException;

public class XmlSchemaHandler extends DefaultHandler
{
  private XmlElementStack    elementStack;
  private SaxCharactersStack characterStack;
  private XmlElement         finishedProduct;
  private ContentBuffer      contentBuffer;

  public void startDocument() throws SAXException
  {
    elementStack   = new XmlElementStack();
    characterStack = new SaxCharactersStack();
    contentBuffer  = new ContentBuffer();
  }

  public void startElement( String uri, String localName, String qName, Attributes attributes )
      throws SAXException
  {
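    // SAX only guarantees the Attributes object during this call, so keep a copy...
    XmlElement element = new XmlElement( uri, localName, qName,
                                         new org.xml.sax.helpers.AttributesImpl( attributes ) );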

    contentBuffer.stack.push( contentBuffer.buffer );
    contentBuffer.buffer = new StringBuffer();

    try
    {
      XmlElement parent = elementStack.peek();
      parent.addChild( element );
    }
    catch( NoSuchElementException e )
    {
      // empty stack: this is the root element, so there's no parent to attach it to
    }

    elementStack.push( element );
  }

  public void characters( char[] ch, int start, int length ) throws SAXException
  {
    String characters = new String( ch, start, length );

    if( contentBuffer.buffer != null )
      contentBuffer.buffer.append( characters.trim() );
  }

  public void endElement( String uri, String localName, String qName ) throws SAXException
  {
    XmlElement element = elementStack.pop();

    if( contentBuffer.buffer != null )
    {
      String text = ( contentBuffer.buffer.length() > 0 ) ? contentBuffer.buffer.toString() : "";

      element.setText( text );

      try
      {
        contentBuffer.buffer = contentBuffer.stack.pop();
      }
      catch( Exception e )
      {
        /* This typically happens on the last element, so we protect against
         * crashing thus.
         */
      }
    }

    // on the last element, push it back so it can be returned by endDocument()...
    if( elementStack.isEmpty() )
      elementStack.push( element );
  }

  public void endDocument() throws SAXException
  {
    finishedProduct = elementStack.pop();
  }

  public void ignorableWhitespace( char ch[], int start, int length ) throws SAXException { }

  static void startBufferContent( ContentBuffer contentBuffer, String elementName )
  {
    contentBuffer.stack.push( contentBuffer.buffer );
    contentBuffer.buffer = new StringBuffer();
  }

  public static class ContentBuffer
  {                                                              // special case for
    public StringBuffer       buffer;                            // element text storage
    public SaxCharactersStack stack = new SaxCharactersStack();  // help composing element text
  }

  public XmlElement getFinishedProduct() { return finishedProduct; }
}
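
The buffer-per-element idea in the handler above is easier to see in a stripped-down form. What follows is only a sketch, not the schema handler itself: TextPerElementDemo, Collector and collect() are names made up for the illustration, and the standard Deque and StringBuilder stand in for SaxCharactersStack and ContentBuffer. It uses the JDK's own SAX parser, so nothing else is needed on the classpath.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TextPerElementDemo
{
  /** Hypothetical stand-in handler: records "name=text" for each element as it closes. */
  public static class Collector extends DefaultHandler
  {
    final Deque< StringBuilder > buffers = new ArrayDeque<>();
    final List< String >         results = new ArrayList<>();

    @Override
    public void startElement( String uri, String localName, String qName, Attributes attributes )
    {
      buffers.push( new StringBuilder() );          // fresh buffer for this element's own text
    }

    @Override
    public void characters( char[] ch, int start, int length )
    {
      if( !buffers.isEmpty() )
        buffers.peek().append( ch, start, length ); // SAX may call this several times per element
    }

    @Override
    public void endElement( String uri, String localName, String qName )
    {
      // popping leaves the parent's buffer on top again
      results.add( qName + "=" + buffers.pop().toString().trim() );
    }
  }

  public static List< String > collect( String xml ) throws Exception
  {
    Collector handler = new Collector();

    SAXParserFactory.newInstance().newSAXParser()
        .parse( new InputSource( new StringReader( xml ) ), handler );
    return handler.results;
  }

  public static void main( String[] args ) throws Exception
  {
    System.out.println( collect( "<a>one<b>two</b>three</a>" ) ); // prints [b=two, a=onethree]
  }
}
```

Elements are reported in the order they close, so the child shows up before its parent, and the parent's text is whatever surrounded the child.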

HTML parsing with TagSoup

SAX can choke on embedded HTML, stuff like &lt; and &gt; in markup that isn't well-formed XML, so if your XML has embedded HTML (it happens to me in the medical documents I deal with), you can delimit the HTML and pass it off to a subparser written using TagSoup. Meanwhile, here's how easy it is to set up and use TagSoup in a very simple example.
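
To see the sort of choking in question without TagSoup in the picture, here's a rough sketch that uses only the JDK's own SAX parser. StrictSaxDemo and parses() are names made up for the illustration, and the two fragments are invented: the second has an unclosed <br>, ordinary in HTML, but not well-formed XML.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class StrictSaxDemo
{
  /** Returns true if the JDK's SAX parser accepts the fragment as well-formed XML. */
  public static boolean parses( String fragment ) throws Exception
  {
    try
    {
      SAXParserFactory.newInstance().newSAXParser()
          .parse( new InputSource( new StringReader( fragment ) ), new DefaultHandler() );
      return true;
    }
    catch( SAXParseException e )
    {
      return false;   // the strict parser gave up on the fragment
    }
  }

  public static void main( String[] args ) throws Exception
  {
    System.out.println( parses( "<td>0.0-0.2</td>" ) );      // prints true
    System.out.println( parses( "<td>0.0<br>0.2</td>" ) );   // prints false: <br> is never closed
  }
}
```

The bundled parser may also write a [Fatal Error] line to stderr for the second fragment. A tolerant parser like TagSoup is meant to get through input of that kind and still hand your handler ordinary SAX events.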

The important SAX or TagSoup bits of the code are the handler, the call to SAXParserImpl.newInstance() and the call to parse(). A version of calling the parser with string input rather than file input is provided for gee-whiz value.

TagSoupExample.java:

package com.etretatlogiciels.html.tagsoup;

import java.io.File;
import org.xml.sax.SAXException;

public class TagSoupExample
{
  public static void main( String[] args )
  {
    TagSoupParser parser = new TagSoupParser();

    try
    {
      File file = new File( "test/fodder/Sample.html" );

      parser.run( file );
    }
    catch( SAXException e )
    {
      e.printStackTrace();
    }
  }
}
TagSoupParser.java:
package com.etretatlogiciels.html.tagsoup;

import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.ccil.cowan.tagsoup.jaxp.SAXParserImpl;

public class TagSoupParser
{
  public void run( File file ) throws SAXException
  {
    TagSoupHandler handler = new TagSoupHandler();
    SAXParserImpl  parser  = SAXParserImpl.newInstance( null );

    try
    {
      parser.parse( file, handler );
    }
    catch( IOException e )
    {
      e.printStackTrace();
    }
  }

  public void run( String content ) throws SAXException
  {
    TagSoupHandler handler = new TagSoupHandler();
    SAXParserImpl  parser  = SAXParserImpl.newInstance( null );

    try
    {
      // parse( String, ... ) would take the string for a URI, so wrap the content in an InputSource
      parser.parse( new InputSource( new StringReader( content ) ), handler );
    }
    catch( IOException e )
    {
      e.printStackTrace();
    }
  }
}
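
A word on string input: in JAXP's SAXParser, the parse() overload that takes a String reads that string as a URI to go fetch, not as markup. To parse markup already in memory, wrap it in an InputSource over a StringReader. Here's a minimal sketch of that, using the JDK's own parser rather than TagSoup so that it runs without extra jars; StringInputDemo and countElements() are names made up for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class StringInputDemo
{
  /** Parses markup held in memory and returns the number of elements seen. */
  public static int countElements( String content ) throws Exception
  {
    final int[] count = { 0 };

    DefaultHandler handler = new DefaultHandler()
    {
      @Override
      public void startElement( String uri, String localName, String qName, Attributes attributes )
      {
        count[ 0 ]++;
      }
    };

    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();

    // wrap the string so it's read as content and not taken for a URI
    parser.parse( new InputSource( new StringReader( content ) ), handler );
    return count[ 0 ];
  }

  public static void main( String[] args ) throws Exception
  {
    System.out.println( countElements( "<tr><td>6/2/2015</td><td>0.0-0.2</td></tr>" ) ); // prints 3
  }
}
```
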
TagSoupHandler.java:
package com.etretatlogiciels.html.tagsoup;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class TagSoupHandler extends DefaultHandler
{
  int tab = 0;

  public void startElement( String uri, String localName, String qName, Attributes attributes )
      throws SAXException
  {
    System.out.println( getTabs( tab ) + "<" + qName + ">" );
    tab++;
  }

  public void characters( char ch[], int start, int length ) throws SAXException
  {
    String text = new String( ch, start, length ).trim();
    if( text.length() > 0 )
      System.out.println( getTabs( tab+1 ) + text );
  }

  public void endElement( String uri, String localName, String qName ) throws SAXException
  {
    tab--;
    System.out.println( getTabs( tab ) + "</" + qName + ">" );
  }

  private static String getTabs( int level )
  {
    String spaces = "";
    while( level-- > 0 )
      spaces += "  ";
    return spaces;
  }
}
Sample.html:

What's passed in for parsing...

<html>
<head>
</head>
<body>
  <table width="100%" border="1">
    <thead>
    <tr>
      <th> Date         </th>
      <th> Type (Loinc) </th>
      <th> Result       </th>
      <th> Ref Range    </th>
    </tr>
    </thead>
    <tbody>
    <tr>
      <td> 6/2/2015                 </td>
      <td> Bilirubin Direct(1968-7) </td>
      <td> &lt;0.1                  </td> <!-- the whole purpose is to see TagSoup handle this <! -->
      <td> 0.0-0.2                  </td>
    </tr>
    </tbody>
  </table>
</body>
</html>

Output

...when TagSoupExample.main() is run.

<html>
  <head>
  </head>
  <body>
    <table>
      <thead>
        <tr>
          <th>
              Date
          </th>
          <th>
              Type (Loinc)
          </th>
          <th>
              Result
          </th>
          <th>
              Ref Range
          </th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>
              6/2/2015
          </td>
          <td>
              Bilirubin Direct(1968-7)
          </td>
          <td>
              <0.1
          </td>
          <td>
              0.0-0.2
          </td>
        </tr>
      </tbody>
    </table>
  </body>
</html>
pom.xml particulars:
  <properties>
    <sax.version>2.0.1</sax.version>
    <tagsoup.version>1.2</tagsoup.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.ccil.cowan.tagsoup</groupId>
      <artifactId>tagsoup</artifactId>
      <version>${tagsoup.version}</version>
    </dependency>
    <dependency>
      <groupId>sax</groupId>
      <artifactId>sax</artifactId>
      <version>${sax.version}</version>
    </dependency>
  </dependencies>