Using the Simple API for XML (SAX) Parser

Russell Bateman
July 2015


Table of Contents

Meat and potatoes
Basic handler methods
Handling SAX Attributes
How to create a handler
How to register a handler with the parser
Integrated instantiation and handling
SAX goodies
DefaultHandler methods
ContentHandler methods
LexicalHandler methods
Locator
Using Locator for line numbers
SAX only parses streams not strings
Comparison of handler interfaces
List of all interfaces and notes
Events
Java samples
SAX dummy handler sample
Python samples
Java code for creating schema from XML
Straightforward example of SAX parser handler
HTML parsing with TagSoup

Introduction

SAX is the Simple API for XML, a de facto standard. I first used it in C circa 2008 and I've used it in Java and Python many times since. The SAX parser works by offering a number of interfaces you implement and pass to the generic parser for it to use.

SAX is an event-driven way of parsing XML documents. It's an alternative to the parsing style provided by the Document Object Model (DOM).

Note in passing that SAX requires the smallest possible amount of memory (especially as compared to DOM) in accomplishing its work. It's generally faster. You can build as much structure as you like, on your own, and in the way you like because of SAX' event-driven nature and how it keeps little-to-no state around. It is possible to use it on streaming XML documents: you do not have to have the whole enchilada in memory. This is another benefit. Think of SAX as going down a staircase with one hand on the rail: only what's under your hand is what's in memory.
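To make the staircase image concrete, here is a minimal sketch, not a definitive implementation. The class name ElementCounter and its helper countElements() are made up for illustration. It counts the elements in a document of any size while keeping nothing around but one integer.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: counts elements as they stream past, keeping none of them.
class ElementCounter extends DefaultHandler
{
  private int count = 0;

  @Override
  public void startElement( String uri, String localName, String qName, Attributes attributes )
  {
    count++;   // the only state kept, however large the document is
  }

  static int countElements( InputStream xml )
  {
    ElementCounter handler = new ElementCounter();

    try
    {
      SAXParserFactory.newInstance().newSAXParser().parse( xml, handler );
    }
    catch( Exception e )
    {
      throw new IllegalStateException( "Unable to parse: " + e.getMessage(), e );
    }

    return handler.count;
  }

  public static void main( String[] args )
  {
    String xml = "<library><book/><book/><book/></library>";

    // prints 4 (the three books plus the library itself)
    System.out.println( countElements( new ByteArrayInputStream( xml.getBytes( StandardCharsets.UTF_8 ) ) ) );
  }
}
```

The input could just as well be a socket or a multi-gigabyte file stream; the handler never sees more than the current event.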

Meat and potatoes

The essence of parsing using SAX is given in the following methods, startDocument, endDocument, but especially, startElement, endElement and characters.

  1. startDocument(), called at beginning of the XML document, allowing you to set up document state. This is not usually of much interest.
  2. startElement(), called when an element (tag) is encountered, <element ...>. It gives you the element and a list of all (any) attributes, allows you to erect element state, probably on a stack. Most of any heavy lifting you'll want to do in your parser handler will happen here or in endElement(). Here are the arguments:
    1. uri, the namespace URI bound to the element's prefix, usually by an xmlns declaration on an early (header) line of the XML being parsed. For instance, if the element is <xsi:document>, and xsi is declared (as noted), then uri will be the URI given in that declaration, not "xsi". It is empty if the element is in no namespace or if the parser is not namespace-aware, which is the default for a parser obtained from SAXParserFactory.
    2. localName, the short name of the XML element, minus any namespace prefix: "document" for <xsi:document>. It is only filled in when the parser is namespace-aware, so, by default, you should expect localName to be empty.
    3. qName, the qualified name exactly as written in the XML, prefix included: for <xsi:document>, you get "xsi:document", not just "document". When no namespaces are in play, this is simply what you consider to be the basic element name, and it's the argument you can always count on.
    4. attributes in a list formatted in a peculiarly non-Java way (i.e.: not using a Collection). This small code sample demonstrates how to reconstruct the XML code incorporating the attributes:
      int attributesLength = attributes.getLength();
      
      StringBuilder sb = new StringBuilder();
      
      for( int attr = 0; attr < attributesLength; attr++ )
      {
        String attrUri       = attributes.getURI( attr );
        String attrLocalName = attributes.getLocalName( attr );
        String attrQName     = attributes.getQName( attr );
        String attrValue     = attributes.getValue( attr );
        String attrName      = ( attrQName == null || attrQName.length() < 1 )
                            ? attrUri + ':' + attrLocalName
                            : attrQName;

        // XML separates attributes with whitespace, not commas
        sb.append( ' ' ).append( attrName ).append( '=' ).append( '"' ).append( attrValue ).append( '"' );
      }
      
      System.out.println( sb.toString() );
      
  3. endElement(), called when element's end is reached, i.e.: </document>. There may have been intervening elements, so your element's state must be kept on a stack to be popped off or peeked at in order to complete processing (that is, depending on what you're doing).
  4. characters(), called to give text between element's beginning and end. You must be prepared to accumulate these across multiple calls because it may be called more than one time. ch, what's passed (plain text between opening and closing tags), is a buffer; start is the relevant offset inside it (not necessarily the first character) and length is the number of characters, counting from that offset, that are germane to the call. This is also where you will find any whitespace (like newlines) between otherwise "textless" XML elements, which is how you can count line numbers.
  5. endDocument(), called at end of document, after the last call to endElement().
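The difference between uri, localName and qName is easiest to see by running the same document through a parser with and without namespace awareness. What follows is only a sketch: the class NamespaceDemo, its helper namesSeen() and the namespace URI urn:example:x are all made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: records what startElement() is handed for each element.
class NamespaceDemo
{
  // Made-up sample document; urn:example:x is not a real namespace.
  static final String XML = "<x:document xmlns:x=\"urn:example:x\"/>";

  static List< String > namesSeen( boolean namespaceAware )
  {
    final List< String > seen = new ArrayList<>();

    DefaultHandler handler = new DefaultHandler()
    {
      @Override
      public void startElement( String uri, String localName, String qName, Attributes attributes )
      {
        seen.add( "uri=" + uri + " localName=" + localName + " qName=" + qName );
      }
    };

    try
    {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware( namespaceAware );   // false is the factory default
      factory.newSAXParser().parse( new InputSource( new StringReader( XML ) ), handler );
    }
    catch( Exception e )
    {
      throw new IllegalStateException( "Unable to parse: " + e.getMessage(), e );
    }

    return seen;
  }

  public static void main( String[] args )
  {
    System.out.println( "not namespace-aware: " + namesSeen( false ) );
    System.out.println( "namespace-aware:     " + namesSeen( true ) );
  }
}
```

With awareness switched on, uri carries the declared URI and localName the bare element name. With it off, expect to rely on qName alone.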

How to create a handler

Create your handler thus with, at first, the five methods listed above. There is no need to implement anything else: DefaultHandler supplies do-nothing stubs for every method in its contract, so you override only the ones that interest you. This example rudimentarily prints out what the SAX parser sees if you run the parser on an XML file.

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class MySaxParserHandler extends DefaultHandler
{
  private String elementContents;

  public void startElement( String uri, String localName, String elementName, Attributes attributes )
    throws SAXException
  {
    elementContents = "";
    System.out.print( "<" + elementName + ">" );
  }

  public void endElement( String uri, String localName, String elementName ) throws SAXException
  {
    System.out.print( "  " + elementContents );
    System.out.print( "</" + elementName + ">" );
  }

  public void characters( char ch[], int start, int length ) throws SAXException
  {
    String contents = new String( ch, start, length ).trim();

    elementContents += contents;
  }

  public void startDocument() throws SAXException
  {
    System.out.print( "Begin" );
  }

  public void endDocument() throws SAXException
  {
    System.out.print( "Done!" );
  }
}

How to register a handler with the parser

Now create your application's main entry point, set up the parser to consume your handler, then call it (the parser.parse() line below). We'll pass the sample XML filename as the argument.

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

public class MySaxParser
{
  public static void main( String[] args )
  {
    MySaxParserHandler handler = new MySaxParserHandler();
    SAXParserFactory   factory = SAXParserFactory.newInstance();
    SAXParser          parser  = null;

    try
    {
      parser = factory.newSAXParser();
    }
    catch( ParserConfigurationException e )
    {
      System.err.println( "Parser-configuration error:" );
      System.exit( -1 );
    }
    catch( SAXException e )
    {
      System.err.println( "SAX parser error:" );
      System.exit( -1 );
    }

    try
    {
      XMLReader reader = parser.getXMLReader();

      parser.parse( args[ 0 ], handler );   // the first command-line argument is the XML filename
    }
    catch( Exception e )
    {
      System.err.println( "parser handler error:" );
      System.exit( -1 );
    }
  }
}

Integrated instantiation and handling

Throughout the examples in these notes, instantiating a parser, defining a handler and calling the parser have required at least two separate classes. This is an example of integrating all three operations into one.

Simply use the constructor of the handler class, MappingSaxHandler(), to instantiate the SAX parser, creating a ready instance, and a special (non-handler) method, run(), to invoke the parser later when desired.

I won't flesh out utilities or dependencies such as ParserHelp, SaxHandlerUtilities, Concept or ObjectStack:

package com.windofkeltia.processor;

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

import com.windofkeltia.exceptions.MappingException;
import com.windofkeltia.pojos.Concept;
import com.windofkeltia.pojos.ObjectStack;
import com.windofkeltia.utilities.StringUtilities;

public class MappingSaxHandler extends DefaultHandler
{
  private static final Logger logger = LoggerFactory.getLogger( MappingSaxHandler.class );

  private       Locator                locator;
  private final ObjectStack< Concept > stack      = new ObjectStack<>();
  private final List< Concept >        concepts   = new ArrayList<>();
  private final ParserHelp             parserHelp = new ParserHelp();
  private       boolean                INACTIVE   = false;

  private final MappingSaxHandler handler = this;
  private final SAXParserFactory  factory = SAXParserFactory.newInstance();
  private       SAXParser         parser  = null;
  private       XMLReader         reader  = null;

  public MappingSaxHandler() throws MappingException
  {
    try
    {
      parser = factory.newSAXParser();
      reader = parser.getXMLReader();
    }
    catch( ParserConfigurationException e )
    {
      throw new MappingException( "SAX parser configuration error: " + e.getMessage() );
    }
    catch( SAXException e )
    {
      throw new MappingException( "SAX parser creation error: " + e.getMessage() );
    }
  }

  public void run( final InputStream inputStream ) throws MappingException
  {
    try
    {
      parser.parse( inputStream, handler );
    }
    catch( IllegalArgumentException e )
    {
      throw new MappingException( "Parse input argument exception: " + e.getMessage() );
    }
    catch( IOException e )
    {
      throw new MappingException( "SAX parser I/O error: " + e.getMessage() );
    }
    catch( SAXException e )
    {
      throw new MappingException( "SAX parser exception: " + e.getMessage() );
    }
  }

  public void startDocument()
  {
    if( VERBOSE )
      SaxHandlerUtilities.startDocument();
  }

  public void startElement( String uri, String localName, String elementName, Attributes saxAttributes )
  {
    if( !INACTIVE && elementName.equals( "document" ) )
    {
      INACTIVE = true;
      logger.warn( "     Parser is now inactive (skipping <document>)..." );
      return;
    }

    if( INACTIVE )
      return;

    if( VERBOSE )
      SaxHandlerUtilities.startElement( locator, uri, localName, elementName, saxAttributes );

    Concept concept = new Concept( elementName, locator.getLineNumber(), saxAttributes );
    stack.push( concept );
    concepts.add( concept );

    parserHelp.switchConcept( elementName, concept.attributes.get( "type" ) );
    parserHelp.reservingConcept( elementName, concept );
  }

  public void characters( char[] ch, int start, int length )
  {
    if( INACTIVE )
      return;

    String characters = new String( ch, start, length ).trim();

    if( VERBOSE )
      SaxHandlerUtilities.characters( locator, characters );

    Concept concept = stack.peek();
    if( !StringUtilities.isEmpty( characters ) )
      concept.content.append( characters );
  }

  public void endElement( String uri, String localName, String elementName )
  {
    if( INACTIVE && elementName.equals( "document" ) )
    {
      logger.warn( "       Parser is now active (skipped <document>)..." );
      INACTIVE = false;
      return;
    }

    if( VERBOSE )
      SaxHandlerUtilities.endElement( locator, elementName );

    Concept concept = stack.pop();
    parserHelp.switchConcept( elementName );
    parserHelp.assimilateConcept( elementName, concept );
  }

  public void endDocument()
  {
    if( VERBOSE )
      SaxHandlerUtilities.endDocument();
  }

  public void setDocumentLocator( Locator locator )
  {
    this.locator = locator;
  }

  public List< Concept > getConcepts()
  {
    return concepts;
  }

  // concepts as assimilated by ParserHelp (the method name is illustrative)
  public List< Concept > getAssimilatedConcepts()
  {
    return parserHelp.getConceptList();
  }

  public static boolean VERBOSE = false;
}

SAX goodies

SAX gives additional goodies to the DefaultHandler, everything needed to parse even the most complex XML file. Here are some of my favorites:

DefaultHandler methods

ContentHandler methods

LexicalHandler methods

Nota bene: if you do not set the property, "http://xml.org/sax/properties/lexical-handler" as shown in code below, even though you're implementing LexicalHandler, processing will never reach your comment() method.
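A small sketch makes the effect of that property visible. The class CommentCollector and its helper collect() are made up for illustration. To keep the listing short, it extends org.xml.sax.ext.DefaultHandler2, which already implements LexicalHandler with empty stubs, instead of declaring all seven LexicalHandler methods itself.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical example class: gathers the text of every XML comment it is told about.
class CommentCollector extends DefaultHandler2
{
  private final List< String > comments = new ArrayList<>();

  @Override
  public void comment( char[] ch, int start, int length )
  {
    comments.add( new String( ch, start, length ).trim() );
  }

  static List< String > collect( String xml, boolean registerLexicalHandler )
  {
    CommentCollector handler = new CommentCollector();

    try
    {
      XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      reader.setContentHandler( handler );

      // without this property, comment() is never called
      if( registerLexicalHandler )
        reader.setProperty( "http://xml.org/sax/properties/lexical-handler", handler );

      reader.parse( new InputSource( new StringReader( xml ) ) );
    }
    catch( Exception e )
    {
      throw new IllegalStateException( "Unable to parse: " + e.getMessage(), e );
    }

    return handler.comments;
  }

  public static void main( String[] args )
  {
    String xml = "<a><!-- hello --><b/></a>";

    System.out.println( collect( xml, true ) );    // prints [hello]
    System.out.println( collect( xml, false ) );   // prints []
  }
}
```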

Locator, given to setDocumentLocator()

Locator tells you where the parser currently is in the document: line number, column number and the public and system identifiers. You get an instance of it, via the method, pretty much before anything else happens and the parser maintains the line number and column throughout its parsing. Very useful for error reporting in XML validation and other activities.

To get this to work, you code the locator field and the setDocumentLocator() method below into your handler. The SAX parser calls setDocumentLocator() as it starts up and maintains the information in the locator you've kept such that, later, you can call into it, here via method getLineNumber(), but there is also getColumnNumber(), to get this information.

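import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;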

public class MySaxHandler extends DefaultHandler
{
  private Locator locator;

  public void setDocumentLocator( Locator locator )
  {
    this.locator = locator;
  }

  ...

  public void startElement( String uri, String localName, String elementName, Attributes attributes )
    throws SAXException
  {
    System.out.print( locator.getLineNumber() + "  <" + elementName + ">" );
  }

  ...
}

Back in the years when getLineNumber() always returned -1 because it wasn't implemented, I tried tracking line numbers in my parser handler by counting newlines and painstaking bookkeeping. The Locator works much better.

Perplexity: SAX only parses streams not strings?

This isn't strictly true, but if you hand your XML text to the method parse( String ), you will find you get an exception:

java.net.MalformedURLException: no protocol:

...which seems bizarre since you don't think you're dealing with a URL. In fact, you are: parse( String ) takes its argument to be a URI telling the parser where to fetch the document, much as it treats the system identifier in

<?xml version='1.0'?>
<!DOCTYPE Something SYSTEM "http://www.something.com/Something.dtd">

...and not as the XML text itself. Translating the string to an input stream solves the problem. In code below, there is some of this going on.
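Here is one way to parse XML held in a string, sketched under the assumption that wrapping it in an InputSource over a StringReader is acceptable. A ByteArrayInputStream over the string's bytes works too. The class StringParsing and its helper rootElementOf() are made up for illustration.

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: parses XML text held in memory rather than in a file.
class StringParsing
{
  static String rootElementOf( String xml )
  {
    final StringBuilder root = new StringBuilder();

    DefaultHandler handler = new DefaultHandler()
    {
      @Override
      public void startElement( String uri, String localName, String qName, Attributes attributes )
      {
        if( root.length() == 0 )   // remember only the first element encountered
          root.append( qName );
      }
    };

    try
    {
      SAXParser parser = SAXParserFactory.newInstance().newSAXParser();

      // parser.parse( xml, handler ) would take xml to be a URI; wrap the text instead
      parser.parse( new InputSource( new StringReader( xml ) ), handler );
    }
    catch( Exception e )
    {
      throw new IllegalStateException( "Unable to parse: " + e.getMessage(), e );
    }

    return root.toString();
  }

  public static void main( String[] args )
  {
    // prints document
    System.out.println( rootElementOf( "<?xml version=\"1.0\"?><document>This is a test.</document>" ) );
  }
}
```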

Comparison of handler interfaces

Here's a table comparing two additional SAX parser handlers you can add to your parser. For example, to add the LexicalHandler, recreate the handler above thus:

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.DefaultHandler;

public class MySaxParserHandler extends DefaultHandler implements LexicalHandler
{
  .
  .
  .

Here's the comparison table. LexicalHandler adds, for example, special methods to allow you to handle XML comments, which are otherwise completely ignored (you won't ever be notified about them), as well as CDATA, DTDs, etc.

DefaultHandler         ContentHandler         LexicalHandler
characters             characters             -
-                      -                      comment
-                      -                      endCDATA
-                      -                      endDTD
endDocument            endDocument            -
endElement             endElement             -
-                      -                      endEntity
endPrefixMapping       endPrefixMapping       -
error                  -                      -
fatalError             -                      -
ignorableWhitespace    ignorableWhitespace    -
notationDecl           -                      -
processingInstruction  processingInstruction  -
resolveEntity          -                      -
setDocumentLocator     setDocumentLocator     -
skippedEntity          skippedEntity          -
-                      -                      startCDATA
-                      -                      startDTD
startDocument          startDocument          -
startElement           startElement           -
-                      -                      startEntity
startPrefixMapping     startPrefixMapping     -
unparsedEntityDecl     -                      -
warning                -                      -

I first wrote up these comparisons in an effort to understand why, in some code I was reading, ContentHandler was being used instead of DefaultHandler. Then I realized that it was because the original author did not want to handle errors.
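For contrast, this is roughly what handling errors through DefaultHandler can look like. It is a sketch only: the class ErrorReportingHandler and its helper check() are made up for illustration, and the wording of the messages depends on the parser implementation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: records every problem the parser reports, with its line number.
class ErrorReportingHandler extends DefaultHandler
{
  private final List< String > problems = new ArrayList<>();

  @Override
  public void warning( SAXParseException e )
  {
    problems.add( describe( "warning", e ) );
  }

  @Override
  public void error( SAXParseException e )
  {
    problems.add( describe( "error", e ) );
  }

  @Override
  public void fatalError( SAXParseException e ) throws SAXException
  {
    problems.add( describe( "fatal", e ) );
    throw e;   // parsing cannot usefully continue
  }

  private static String describe( String level, SAXParseException e )
  {
    return level + " at line " + e.getLineNumber() + ": " + e.getMessage();
  }

  static List< String > check( String xml )
  {
    ErrorReportingHandler handler = new ErrorReportingHandler();

    try
    {
      SAXParserFactory.newInstance().newSAXParser()
                      .parse( new InputSource( new StringReader( xml ) ), handler );
    }
    catch( SAXException e )
    {
      // already recorded by fatalError() above
    }
    catch( Exception e )
    {
      throw new IllegalStateException( "Unable to parse: " + e.getMessage(), e );
    }

    return handler.problems;
  }

  public static void main( String[] args )
  {
    System.out.println( check( "<a><b></a>" ) );   // <b> is never closed
  }
}
```

A handler that implements only ContentHandler has none of these three methods, which fits the observation above about not wanting to handle errors.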

List of all methods/interfaces and comments...

...at your disposal with terse comments on what they do. I have not myself done everything possible with SAX and there's much I don't know.

Method                 DefaultHandler  ContentHandler  LexicalHandler  Purpose
characters             x               x               -               plain text between opening and closing tags
comment                -               -               x               like characters() for comments
endCDATA               -               -               x               reports end of CDATA
endDTD                 -               -               x               reports end of DOCTYPE section
endDocument            x               x               -               reports end of (whole) document
endElement             x               x               -               reports end of XML element (tag)
endEntity              -               -               x               reports end of XML entities
endPrefixMapping       x               x               -               reports end of namespace mapping
error                  x               -               -               reports (recoverable) parser error
fatalError             x               -               -               reports fatal (unrecoverable) parser error
ignorableWhitespace    x               x               -               invoked in combination with a DTD
notationDecl           x               -               -
processingInstruction  x               x               -               reports processing instructions like <?target data?>
resolveEntity          x               -               -
setDocumentLocator     x               x               -               yields an instance of Locator for later use
skippedEntity          x               x               -               reports skipped entity (one you could have been getting via LexicalHandler)
startCDATA             -               -               x               reports beginning of CDATA, starting with <![CDATA[
startDTD               -               -               x               reports beginning of DOCTYPE section
startDocument          x               x               -               reports beginning of (whole) document
startElement           x               x               -               reports beginning of XML element (tag)
startEntity            -               -               x               reports beginning of XML entities like DTD
startPrefixMapping     x               x               -               reports beginning of namespace mapping; a namespace prefix is like X in <X:tag ...>
unparsedEntityDecl     x               -               -
warning                x               -               -               reports parser warning

Events

SAX parsing is done via a sort of "call-back" mechanism. You write a handler that you supply to SAX. In it, you override whichever DefaultHandler methods you need; SAX calls them as it encounters XML elements (tags), attributes, etc.

In Java, these are the following (and are public in your handler):

void startDocument()
void endDocument()
void startElement( String uri, String localName, String qName, Attributes attributes )
void endElement( String uri, String localName, String qName )
void characters( char ch[], int start, int length )

—when implementing LexicalHandler...
void startDTD( String name, String publicId, String systemId )
void endDTD()
void startEntity( String name )
void endEntity( String name )
void startCDATA()
void endCDATA()
void comment( char[] ch, int start, int length )

Similarly, in Python, you implement:

def startDocument( self )
def endDocument( self )
def startElement( self, tag, attributes=None )
def endElement( self, tag )
def characters( self, content )

(LexicalHandler is not available in the standard library until Python 3.10, as xml.sax.handler.LexicalHandler.)

Java samples

CdaAnalysis.java:
package com.etretatlogiciels.cda.analysis;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class CdaAnalysis
{
  private SAXParserFactory factory;
  private SAXParser        parser;
  private XMLReader        reader;
  private DefaultHandler   handler;

  public CdaAnalysis( String outputFilepath )
  {
    factory = SAXParserFactory.newInstance();

    try
    {
      parser  = factory.newSAXParser();
      handler = new CdaHandler( outputFilepath, false );
      reader  = parser.getXMLReader();
      reader.setProperty( "http://xml.org/sax/properties/lexical-handler", handler );
    }
    catch( ParserConfigurationException e )
    {
      e.printStackTrace();
    }
    catch( SAXException e )
    {
      e.printStackTrace();
    }
  }

  public void parse( String pathname )
  {
    try
    {
      parser.parse( pathname, handler );
    }
    catch( Exception e )
    {
      e.printStackTrace();
    }
  }
}
CdaHandler.java:
package com.etretatlogiciels.cda.analysis;

import java.util.Collections;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.DefaultHandler;

import com.etretatlogiciels.utilities.RedirectSystemOutStream;
import com.etretatlogiciels.utilities.OccurrencesTable;
import com.etretatlogiciels.utilities.XPathTable;
import com.etretatlogiciels.utilities.SymbolTableException;
import com.etretatlogiciels.utilities.XmlElementStack;

/**
 * We want to analyse elements, modifying our analysis to make a growing number of
 * observations about CDA and CCD documents.
 *
 * History
 *
 * We already gather out <text> ... </text> elements from some CDA
 * elements and put them into our CXML. What we want to do is identify other, important
 * content to gather out and extend our work to CCDs, which adhere to the CDA format
 * standard.
 *
 * About this file...
 *
 * The SAX handler is the cookie jar with all the cookies. It's the code that says
 * what we're looking for and how it's supposed to show up. In this case, however,
 * we don't want to make many decisions about what's going to be in the CDA (or
 * CCD) because we're just learning. And, anyway, we likely would never be able to
 * predict with complete accuracy what's in there. We're just observers trying to
 * find a way to disembowel the document to get content out.
 *
 * About lineNumber: the starting value for lineNumber must account for the number of
 * lines in the XML file being parsed that occur prior to the principal/outside
 * element even if this is one or more comments.
 *
 * @author Russell Bateman
 * @since May 2015
 */
public class CdaHandler extends DefaultHandler implements LexicalHandler
{
  private int              lineNumber = 2; // (because all documents start with <?xml version="1.0"?>)
  private String           elementName;
  private String           characters;
  private OccurrencesTable occurrences;
  private XPathTable       xpath;
  private XmlElementStack  stack;

  private static final String DOCUMENT = "<document-top>";

  /**
   * Just output to the console whatever's found.
   */
  public CdaHandler()
  {
    super();
    occurrences = new OccurrencesTable();
    xpath       = new XPathTable();
    stack       = new XmlElementStack();
  }

  /**
   * Hereafter, all output goes to System.out which will take it to a file and/or
   * the console.
   * @param outputFilepath to which results will go.
   * @param hush shut the console up (no output to console).
   */
  public CdaHandler( String outputFilepath, boolean hush )
  {
    this();
    RedirectSystemOutStream.redirectSystemOutToFile( outputFilepath, hush );
  }

  /**
   * If processing multiple documents, this would be an opportunity to initialize
   * processing for the next document (as opposed to a previous one).
   */
  public void startDocument() throws SAXException
  {
    stack.push( DOCUMENT );
    System.out.println( "Elements in order of appearance and their line numbers..." );
  }

  /**
   * If processing multiple documents, this is an opportunity to tie off the one
   * that's been processing and is finished.
   */
  public void endDocument() throws SAXException
  {
    List< String > elements = occurrences.keys();
    int            widest   = CdaAnalysisUtilities.getMaximumKeyWidth( elements );

    System.out.println( "-----------------------------------------------------------------------" );
    System.out.println( "Elements sorted alphabetically and their frequency..." );
    // Sort alphabetically by key...
    Collections.sort( elements );
    CdaAnalysisUtilities.printPaddedList( occurrences, elements, widest );

    System.out.println( "-----------------------------------------------------------------------" );
    System.out.println( "Elements sorted by occurrence from most to least frequent and their frequency..." );
    CdaAnalysisUtilities.sortAndPrintPaddedByValue( occurrences, elements, widest );

    System.out.println( "-----------------------------------------------------------------------" );
    System.out.println( "Elements with their XPath and line number on which they appear..." );
    CdaAnalysisUtilities.sortAndPrintXPaths( xpath, elements, widest );

    if( stack.peek().equals( DOCUMENT ) )
        System.out.println( "Document is well formed" );
  }

  /**
   * —what identifies and handles the opening element tag. These can be
   * intercepted and used for analysis.
   */
  public void startElement( String uri, String localName, String qName, Attributes attributes ) throws SAXException
  {
    String elementName = qName;

    stack.push( elementName );

    if( !occurrences.contains( elementName ) )
    {
      try
      {
        occurrences.put( elementName, 1 );
      }
      catch( SymbolTableException e )
      {
        e.printStackTrace();
      }
    }
    else
    {
      try
      {
        Integer count = occurrences.get( elementName );
        count++;
        occurrences.delete( elementName );
        occurrences.put( elementName, count );
      }
      catch( SymbolTableException e )
      {
        e.printStackTrace();
      }
    }

    try
    {
      String path = stack.renderStackAsXPath();
      xpath.put( elementName, lineNumber, path );
    }
    catch( SymbolTableException e )
    {
      e.printStackTrace();
    }

    System.out.println( lineNumber + "  <" + qName + ">" );
  }

  /**
   * —what identifies the closing element tag, an opportunity to wrap up
   * the greater tag, whatever we want to do with it. It's at this point that
   * we know the tag is finished and what we've got in characters is its content.
   */
  public void endElement( String uri, String localName, String qName ) throws SAXException
  {
    stack.pop();
  }

  /**
   * —what gathers the plain text content between the opening and closing tags.
   * Of course, some elements are comprehensive and contain additional elements without
   * also having plain text.
   * @param ch     array holding the characters.
   * @param start  starting position in the array.
   * @param length number of characters to use from the array.
   */
  public void characters( char ch[], int start, int length ) throws SAXException
  {
    characters = new String( ch, start, length ).trim();

    for( int i = start; i < start+length; i++ )
    {
      if( ch[ i ] == '\n' )
        lineNumber++;
    }
  }

  /**
   * Maybe this can be used to create line numbers? No, it's only invoked in combination
   * with a DTD. If the parsed XML file doesn't preconize a DTD, it won't happen.
   */
  public void ignorableWhitespace( char ch[], int start, int length ) throws SAXException
  {
//    String whitespace = new String( ch, start, length ).trim();
  }

  public void startDTD( String name, String publicId, String systemId ) throws SAXException
  {
  }

  public void endDTD() throws SAXException
  {
  }

  public void startEntity( String name ) throws SAXException
  {
  }

  public void endEntity( String name ) throws SAXException
  {
  }

  /**
   * Reports beginning of CDATA, starting with <![CDATA[ and ending with ]]>, for example,
   *
   * "<![CDATA[<p>The pug snoring on the couch next to me is <em>extremely</em> cute</p>]]>"
   */
  public void startCDATA() throws SAXException
  {
  }

  public void endCDATA() throws SAXException
  {
  }

  /**
   * Reports any and all XML comments anywhere in (internal) document.
   * @param ch     array holding the characters in the comment.
   * @param start  starting position in the array.
   * @param length number of characters to use from the array.
   */
  public void comment( char[] ch, int start, int length ) throws SAXException
  {
    for( int i = start; i < start+length; i++ )
    {
      if( ch[ i ] == '\n' )
        lineNumber++;
    }
  }
}

SAX dummy handler sample in Java

There is a useful SAX dummy handler sample that you can play with; use it as a blank or a place to start fresh.

Python samples

import xml.sax
from io import StringIO

def getResult():
  # httpclient and uri come from the surrounding application
  result         = httpclient.get( uri )
  payload        = result.read()
  resultIdParser = CqPostResponsePayload()
  try:
    xml.sax.parseString( payload, resultIdParser )
  except Exception as e:
    print( e )
  return resultIdParser.getResultId()

class CqPostResponsePayload( xml.sax.ContentHandler ):
  '''
  Parse response payload, looks something like:
  <result_id>fWkcTS1a</result_id>
  '''
  def __init__( self ):
    xml.sax.ContentHandler.__init__( self )
    self.result = StringIO()
    self.resultIdCharacters = ''
  def getResultId( self ):
    return self.result.getvalue().strip()
  def startElement( self, tag, attributes=None ):
    if tag == 'result_id':
      self.resultIdCharacters = ''
  def endElement( self, tag ):
    if tag == 'result_id':
      # tie off the result_id...
      self.result.write( self.resultIdCharacters )
  def characters( self, content ):
    # may be called several times for one element, so accumulate
    self.resultIdCharacters += content

A different example...

def getValueOfTotalAttribute( line ):
    ''' Just the attributes. '''

    parser = HitsTagElementParser()
    try:
        xml.sax.parseString( line, parser )
    except Exception as e:
        print( e )
        return 0
    attributes = parser.getAttributes()
    return attributes

class HitsTagElementParser( xml.sax.ContentHandler ):
  def __init__( self ):
    self.attributes = {}
  def getAttributes( self ):
    return self.attributes
  def startElement( self, tag, attributes=None ):
    if tag != 'our-tag':
      return
    self.attributes = attributes
  def endElement( self, tag ):
    ''' We'll never hit this! '''
    pass
  def characters( self, content ):
    ''' We're uninterested in this. '''
    pass

Java code for creating schema from XML

This example parses XML from a string (see XmlSchemaTest.java), but it could be a file as shown in the earlier Java samples. The goal is to create an instance of XmlElement that recursively contains the rest of the XML content, whether in a string or a file.

Among other things, this exercise demonstrates how to use stacks to keep track of where one is to gather text out of intermingled subelements for the owning element.
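The stack technique itself can be sketched independently of XmlElement and XmlSchemaHandler. The class TextGatherer and its helper gather() below are made up for illustration and are not the handler used in these notes. Each open element gets its own buffer on the stack, so text is credited to the element that owns it even when subelements intervene.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: maps each element name to the text it directly owns.
class TextGatherer extends DefaultHandler
{
  private final Deque< StringBuilder > stack = new ArrayDeque<>();
  private final Map< String, String >  text  = new LinkedHashMap<>();

  @Override
  public void startElement( String uri, String localName, String qName, Attributes attributes )
  {
    stack.push( new StringBuilder() );   // a fresh buffer for the element just opened
  }

  @Override
  public void characters( char[] ch, int start, int length )
  {
    if( !stack.isEmpty() )
      stack.peek().append( ch, start, length );   // text belongs to the innermost open element
  }

  @Override
  public void endElement( String uri, String localName, String qName )
  {
    // a repeated element name overwrites the earlier entry; good enough for a sketch
    text.put( qName, stack.pop().toString().trim() );
  }

  static Map< String, String > gather( String xml )
  {
    TextGatherer handler = new TextGatherer();

    try
    {
      SAXParserFactory.newInstance().newSAXParser()
                      .parse( new InputSource( new StringReader( xml ) ), handler );
    }
    catch( Exception e )
    {
      throw new IllegalStateException( "Unable to parse: " + e.getMessage(), e );
    }

    return handler.text;
  }

  public static void main( String[] args )
  {
    // prints {subdocument=Hi!, document=This is a test.}
    System.out.println( gather( "<document>This is a test.<subdocument>Hi!</subdocument></document>" ) );
  }
}
```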

XmlSchemaTest.java

This test ensures (visually only; no assertions are made, so these are not real test cases) that our parser can go a couple of levels deep and a couple across, for the purpose of demonstrating the idea only.

package com.etretatlogiciels.xml;

import org.junit.Before;
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.TestName;

import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;

import com.etretatlogiciels.xml.XmlElement;
import com.etretatlogiciels.xml.XmlSchemaHandler;

public class XmlSchemaTest
{
  @Rule
  public TestName name = new TestName();

  @Before
  public void setUp() throws Exception
  {
    String testName  = name.getMethodName();
    int    PAD       = 100;
    int    nameWidth = testName.length();
    System.out.print( "Test: " + testName + " " );
    PAD -= nameWidth;
    while( PAD-- > 0 )
      System.out.print( "-" );
    System.out.println();
  }

  private SAXParser createParser( DefaultHandler handler )
  {
    SAXParser parser = null;

    try
    {
      parser = SAXParserFactory.newInstance().newSAXParser();
    }
    catch( ParserConfigurationException e )
    {
      System.err.println( "Parser-configuration error:" );
      e.printStackTrace();
    }
    catch( SAXException e )
    {
      System.err.println( "SAX parser error:" );
      e.printStackTrace();
    }

    return parser;
  }

  private InputStream toInputStream( String string )
  {
    return new ByteArrayInputStream( string.getBytes( Charset.forName( "UTF-8" ) ) );
  }

  /**
   * Expected:
   * {
   *   document,
   *   This is a test.
   * }
   */
  @Test
  public void testOne()
  {
    final String   XML_CONTENT = "<?xml version=\"1.0\"?><document>This is a test.</document>";
    DefaultHandler handler     = new XmlSchemaHandler();
    SAXParser      parser      = createParser( handler );

    try
    {
      parser.parse( toInputStream( XML_CONTENT ), handler );
    }
    catch( Exception e )
    {
      e.printStackTrace();
    }

    XmlElement finishedProduct = ( ( XmlSchemaHandler ) handler ).getFinishedProduct();
    System.out.println( finishedProduct.toStringEnchilada( 0, "  " ) );
  }

  /**
   * Expected:
   * {
   *   document,
   *   This is a test.
   *   {
   *     subdocument,
   *     Hi!
   *   }
   * }
   */
  @Test
  public void testTwo()
  {
    final String   XML_CONTENT = "<?xml version=\"1.0\"?>\n"
                               + "<document>\n"
                               + "  This is a test.\n"
                               + "  <subdocument>\n"
                               + "    Hi!\n"
                               + "  </subdocument>\n"
                               + "</document>";
    DefaultHandler handler     = new XmlSchemaHandler();
    SAXParser      parser      = createParser( handler );

    try
    {
      parser.parse( toInputStream( XML_CONTENT ), handler );
    }
    catch( Exception e )
    {
      e.printStackTrace();
    }

    XmlElement finishedProduct = ( ( XmlSchemaHandler ) handler ).getFinishedProduct();
    System.out.println( finishedProduct.toStringEnchilada( 0, "  " ) );
  }

  /**
   * Expected:
   * {
   *   document,
   *   This is a test.
   *   {
   *     subdocument,
   *     Hi!
   *   }
   * }
   */
  @Test
  public void testThree()
  {
    final String   XML_CONTENT = "<?xml version=\"1.0\"?>\n"
                               + "<document attribute=\"fun and games\">\n"
                               + "  This is a test.\n"
                               + "  <subdocument another=\"more fun and games\">\n"
                               + "    Hi!\n"
                               + "  </subdocument>\n"
                               + "</document>";
    DefaultHandler handler     = new XmlSchemaHandler();
    SAXParser      parser      = createParser( handler );

    try
    {
      parser.parse( toInputStream( XML_CONTENT ), handler );
    }
    catch( Exception e )
    {
      e.printStackTrace();
    }

    XmlElement finishedProduct = ( ( XmlSchemaHandler ) handler ).getFinishedProduct();
    System.out.println( finishedProduct.toStringEnchilada( 0, "  " ) );
  }

  /**
   * Expected:
   * {
   *   document,
   *   This is a test.
   *   {
   *     subdocument1,
   *     Hi!
   *   }
   *   {
   *     subdocument2,
   *     Hi, again!
   *   }
   * }
   */
  @Test
  public void testAbreast()
  {
    final String   XML_CONTENT = "<?xml version=\"1.0\"?>\n"
                               + "<document attribute=\"fun and games\">\n"
                               + "  This is a test.\n"
                               + "  <subdocument1 another=\"more fun and games\">\n"
                               + "    Hi!\n"
                               + "  </subdocument1>\n"
                               + "  <subdocument2 another=\"still more fun and games\">\n"
                               + "    Hi, again!\n"
                               + "  </subdocument2>\n"
                               + "</document>";
    DefaultHandler handler     = new XmlSchemaHandler();
    SAXParser      parser      = createParser( handler );

    try
    {
      parser.parse( toInputStream( XML_CONTENT ), handler );
    }
    catch( Exception e )
    {
      e.printStackTrace();
    }

    XmlElement finishedProduct = ( ( XmlSchemaHandler ) handler ).getFinishedProduct();
    System.out.println( finishedProduct.toStringEnchilada( 0, "  " ) );
  }
}
XmlElement.java

This is the brass ring at the end of the run. The entire XML content, as a tree, is in this object (and recursive instances below it).

package com.etretatlogicials.xml;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

import java.util.ArrayList;
import java.util.List;

public class XmlElement
{
  private String             uri;
  private String             localName;
  private String             qname;
  private String             text;
  private Attributes         attributes;
  private List< XmlElement > children = new ArrayList<>();

  public XmlElement( String uri, String localName, String qname, Attributes attributes )
  {
    this.uri        = uri;
    this.localName  = localName;
    this.qname      = qname;
    // SAX only guarantees the Attributes object during startElement(), so keep a copy
    this.attributes = ( attributes != null ) ? new AttributesImpl( attributes ) : null;
  }

  public String getUri()                  { return uri; }
  public String getLocalName()            { return localName; }
  public String getQname()                { return qname; }
  public String getText()                 { return text; }
  public Attributes getAttributes()       { return attributes; }
  public List< XmlElement > getChildren() { return children; }

  public void setUri( String uri )                   { this.uri = uri; }
  public void setLocalName( String localName )       { this.localName = localName; }
  public void setQname( String qname )               { this.qname = qname; }
  public void setText( String text )                 { this.text = text; }
  public void setAttributes( Attributes attributes ) { this.attributes = attributes; }
  public void addChild( XmlElement child )           { this.children.add( child); }

  public String toString()
  {
    StringBuilder sb = new StringBuilder();

    sb.append( "{ " );

    if( localName != null && localName.length() > 0 )
      sb.append( localName );
    else
      sb.append( qname );
    sb.append( ", " ).append( text ).append( " }" );
    return sb.toString();
  }

  private String indentToLevel( int level, String tab )
  {
    if( level < 1 )
      return "";

    StringBuilder sb = new StringBuilder();
    while( level-- > 0 )
      sb.append( tab );
    return sb.toString();
  }

  public String toStringEnchilada( int level, String tab )
  {
    StringBuilder sb = new StringBuilder();

    sb.append( indentToLevel( level, tab ) ).append( "{\n" );

    level++;

    if( localName != null && localName.length() > 0 )
      sb.append( indentToLevel( level, tab ) ).append( localName );
    else
      sb.append( indentToLevel( level, tab ) ).append( qname );

    if( text != null && text.length() > 0 )
      sb.append( ",\n" ).append( indentToLevel( level, tab ) ).append( text );

    if( children.size() > 0 )
    {
      sb.append( '\n' );
      for( XmlElement element : children )
        sb.append( element.toStringEnchilada( level, tab ) );
    }
    else
    {
      sb.append( '\n' );
    }

    level--;

    sb.append( indentToLevel( level, tab ) ).append( "}\n" );
    return sb.toString();
  }
}
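
XmlElement falls back on the qname whenever localName is empty. A small sketch of why that fallback matters: a parser that isn't namespace-aware (the JAXP factory default) hands startElement() an empty localName. The class NameSketch and its names() helper are hypothetical, made up for this illustration.

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; not one of the classes listed in this section.
class NameSketch extends DefaultHandler
{
  private final StringBuilder seen = new StringBuilder();

  @Override
  public void startElement( String uri, String localName, String qName, Attributes attributes )
  {
    // record both names exactly as the parser hands them over
    seen.append( '[' ).append( localName ).append( '|' ).append( qName ).append( ']' );
  }

  /** Parse the XML and return "[localName|qName]" for each element encountered. */
  static String names( String xml, boolean namespaceAware ) throws Exception
  {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware( namespaceAware );   // false is the JAXP default

    NameSketch handler = new NameSketch();
    factory.newSAXParser().parse( new InputSource( new StringReader( xml ) ), handler );
    return handler.seen.toString();
  }

  public static void main( String[] args ) throws Exception
  {
    String xml = "<x:doc xmlns:x=\"urn:example\"/>";

    System.out.println( "default:         " + names( xml, false ) );
    System.out.println( "namespace-aware: " + names( xml, true ) );
  }
}
```

With the parser bundled in the JDK, the first line shows an empty local name beside x:doc and the second shows doc beside x:doc. Other SAX implementations may differ in what they report for the qname when namespace processing is on.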
SaxCharactersStack.java

Works in concert with the internal class ContentBuffer (see XmlSchemaHandler.java) to gather element text. This is necessary because an element's text can be interrupted: XML says nothing against more elements tossed in among the text, inelegant though that is, so the text gathered so far must be set aside while a subelement is handled.

package com.etretatlogicials.xml;

import java.util.ArrayList;
import java.util.List;

public class SaxCharactersStack
{
  private List< StringBuffer > stack;

  public SaxCharactersStack()              { stack = new ArrayList<>(); }
  public void push( StringBuffer element ) { stack.add( element ); }
  public StringBuffer pop()                { return stack.remove( stack.size() - 1 ); }
  public String toString()                 { return stack.toString(); }
}
XmlElementStack.java

We keep the elements in a stack as we encounter them so that we know which element to attach text to, since we don't finish gathering that text until much later.

package com.etretatlogicials.xml;

import java.util.Iterator;
import java.util.LinkedList;

public class XmlElementStack
{
  // @formatter:off
  private LinkedList< XmlElement > stack = new LinkedList<>();

  public void       push( XmlElement element ) { stack.addFirst( element ); }
  public XmlElement pop()                      { return stack.removeFirst(); }
  public XmlElement peek()                     { return stack.getFirst(); }
  public int        depth()                    { return stack.size(); }
  public boolean    isEmpty()                  { return stack.isEmpty(); }

  public String renderStackAsXPath()
  {
    Iterator< XmlElement > iterator = stack.descendingIterator();
    StringBuilder          path     = new StringBuilder();

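    while( iterator.hasNext() )
    {
      XmlElement element = iterator.next();
      String     name    = element.getLocalName();

      // localName is empty unless the parser is namespace-aware; fall back to the qname
      if( name == null || name.length() < 1 )
        name = element.getQname();

      if( "<document-top>".equals( name ) )
        continue;

      path.append( name );    // the element's name only
      path.append( '/' );
    }

    // drop the trailing slash; an empty stack renders as an empty path
    return ( path.length() > 0 ) ? path.substring( 0, path.length() - 1 ) : "";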
  }
}
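
The XPath rendering can be tried in isolation. This is a sketch only, assuming a stack of plain element names rather than XmlElement instances; PathSketch and render() are hypothetical names. It walks the stack from the bottom (the document root) to the top, as descendingIterator() does above, and lets String.join() place the slashes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration; not one of the classes listed in this section.
class PathSketch
{
  /** Render a stack of element names, innermost on top, as a slash-separated path. */
  static String render( Deque< String > stack )
  {
    List< String >     names    = new ArrayList<>();
    Iterator< String > iterator = stack.descendingIterator();  // oldest entry (the root) comes first

    while( iterator.hasNext() )
      names.add( iterator.next() );

    return String.join( "/", names );  // separators go between names only; nothing to trim
  }

  public static void main( String[] args )
  {
    Deque< String > stack = new ArrayDeque<>();

    stack.push( "document" );      // as startElement() would for the root
    stack.push( "subdocument" );   // ...and again for its child
    System.out.println( render( stack ) );   // prints document/subdocument
  }
}
```

An empty stack simply renders as an empty string, so no special case is needed for it.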
XmlSchemaHandler.java

This is where the heavy lifting is done. As described elsewhere on this page, SAX calls this handler at key points in the parsing process.

package com.etretatlogicials.xml;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import java.util.NoSuchElementException;

public class XmlSchemaHandler extends DefaultHandler
{
  private XmlElementStack    elementStack;
  private SaxCharactersStack characterStack;
  private XmlElement         finishedProduct;
  private ContentBuffer      contentBuffer;

  public void startDocument() throws SAXException
  {
    elementStack   = new XmlElementStack();
    characterStack = new SaxCharactersStack();
    contentBuffer  = new ContentBuffer();
  }

  public void startElement( String uri, String localName, String qName, Attributes attributes )
      throws SAXException
  {
    XmlElement element = new XmlElement( uri, localName, qName, attributes );

    contentBuffer.stack.push( contentBuffer.buffer );
    contentBuffer.buffer = new StringBuffer();

    try
    {
      XmlElement parent = elementStack.peek();
      parent.addChild( element );
    }
    catch( NoSuchElementException e )
    {
      ;
    }

    elementStack.push( element );
  }

  public void characters( char[] ch, int start, int length ) throws SAXException
  {
    String characters = new String( ch, start, length );

    if( contentBuffer.buffer != null )
      contentBuffer.buffer.append( characters.trim() );
  }

  public void endElement( String uri, String localName, String qName ) throws SAXException
  {
    XmlElement element = elementStack.pop();

    if( contentBuffer.buffer != null )
    {
      String text = ( contentBuffer.buffer.length() > 0 ) ? contentBuffer.buffer.toString() : "";

      element.setText( text );

      try
      {
        contentBuffer.buffer = contentBuffer.stack.pop();
      }
      catch( Exception e )
      {
        /* This typically happens on the last element, so we protect against
         * crashing thus.
         */
        ;
      }
    }

    // on the last element, push it back so it can be returned by endDocument()...
    if( elementStack.isEmpty() )
      elementStack.push( element );
  }

  public void endDocument() throws SAXException
  {
    finishedProduct = elementStack.pop();
  }

  public void ignorableWhitespace( char ch[], int start, int length ) throws SAXException { }

  static void startBufferContent( ContentBuffer contentBuffer, String elementName )
  {
      contentBuffer.stack.push( contentBuffer.buffer );
      contentBuffer.buffer = new StringBuffer();
  }

  public static class ContentBuffer
  {                                                              // special case for
    public StringBuffer       buffer;                            // element text storage
    public SaxCharactersStack stack = new SaxCharactersStack();  // help composing element text
  }

  public XmlElement getFinishedProduct() { return finishedProduct; }
}

Straightforward example of SAX parser handler

This is excerpted from a parser in a servlet used to process incoming requests in XML. What's done with the parsed XML isn't shown, only where it happens.

ParseIncomingData.java:

Sets up the parser.

import java.io.IOException;
import java.io.InputStream;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

public class ParseIncomingData
{
  private final IncomingDataHandler handler;
  private       XMLReader           xmlReader;
  private       SAXParser           parser;

  // use this to create new parser instances...
  static SAXParserFactory factory = SAXParserFactory.newInstance();

  /**
   * Set up an instance of a SAX parser with handler initialized to zilch.
   * @param state used by the handler.
   * @throws Exception should any error occur.
   */
  public ParseIncomingData( ParsingState state ) throws Exception
  {
    handler = new IncomingDataHandler( state );

    try
    {
      parser    = factory.newSAXParser();
      xmlReader = parser.getXMLReader();
    }
    catch( ParserConfigurationException e )
    {
      throw new Exception( "Parser-configuration error: " + e.getMessage() );
    }
    catch( SAXException e )
    {
      throw new Exception( "Parser handler error: " + e.getMessage() );
    }
  }

  /**
   * Parse the in-coming data.
   * @param in in-coming data as a stream.
   * @throws Exception should any error occur.
   */
  public void parse( final InputStream in ) throws Exception
  {
    try
    {
      parser.parse( in, handler );
    }
    catch( IOException e )
    {
      throw new Exception( "I/O exception: " + e.getMessage() );
    }
    catch( SAXException e )
    {
      throw new Exception( "SAX parser handler error: " + e.getMessage() );
    }
    catch( Exception e )
    {
      throw new Exception( "Unknown exception: " + e.toString() );
    }
  }
}
ElementStack.java:

In order to do something cogent with what's parsed, this is the stack that keeps the data together and useful.

import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Map;

import org.xml.sax.Attributes;

/**
 * Implements a stack that remembers everything important to parsing, to wit:
 *
 * - the element name (the stack can be rendered to produce an XPath),
 * - a map of any attributes (qualified name as key plus value), and
 * - text content associated with the element, if any.
 *
 * We expose ElementStack.Element intentionally, but the content
 * field is the only one we consider mutable because of how late in the game it
 * can be created.
 */
public class ElementStack
{
  public static final String DOCUMENT_TOP = "<document-top>";

  public static class Element
  {
    protected final String                elementName;  // immutable
    protected final Map< String, String > attributes;   // immutable
    protected       StringBuilder         content;      // intentionally mutable

    protected Element( final String name, Attributes saxAttributes )
    {
      elementName = name;
      content     = new StringBuilder();
      attributes  = new HashMap<>();

      if( saxAttributes != null )
      {
        int attrLength = saxAttributes.getLength();

        for( int attr = 0; attr < attrLength; attr++ )
        {
          String attribute = saxAttributes.getQName( attr );
          String value     = saxAttributes.getValue( attr );
          attributes.put( attribute, value );
        }
      }
    }
  }

  private final LinkedList< Element > stack = new LinkedList<>();

  public void push( final String elementName, Attributes attributes )
  {
    Element element = new Element( elementName, attributes );
    stack.addFirst( element );
  }

  public Element pop()     { return stack.removeFirst(); }
  public Element peek()    { return stack.getFirst(); }
  public int     depth()   { return stack.size(); }
  public boolean isEmpty() { return stack.isEmpty(); }

  /** Pop the element and return (only) the element name. */
  public String popName()
  {
    Element element = stack.removeFirst();
    return element.elementName;
  }

  /** Peek at (only) the element name. */
  public String peekName()
  {
    Element element = stack.getFirst();
    return element.elementName;
  }

  /** Render the entire stack (of element names) as an XPath. */
  public String renderStackAsXPath()
  {
    Iterator< Element > iterator = stack.descendingIterator();
    StringBuilder      path     = new StringBuilder();

    while( iterator.hasNext() )
    {
      Element element = iterator.next();
      String  name    = element.elementName;

      if( name.equals( DOCUMENT_TOP ) )
        continue;

      path.append( name );
      path.append( '/' );
    }

    // drop the trailing slash; the path is empty if only the document-top marker is on the stack
    return ( path.length() > 0 ) ? path.substring( 0, path.length() - 1 ) : "";
  }

  public StringBuilder peekContent()
  {
    Element element = stack.peek();
    assert element != null;
    return element.content;
  }

  /**
   * To accomplish this, we have to peek at the top element (to
   * get it), then add the content to the buffer.
   * @param content to append to what's there.
   */
  public void appendContent( String content )
  {
    Element element = stack.peek();
    assert element != null;
    element.content.append( content );
  }

  /** Format the attributes as a list. */
  public static String formatAttributes( ElementStack.Element element )
  {
    StringBuilder attributes = new StringBuilder();
    int           count      = element.attributes.size();

    if( count > 0 )
    {
      for( Map.Entry< String, String > attribute : element.attributes.entrySet() )
      {
        count--;
        attributes.append( attribute.getKey() );
        attributes.append( "=\"" );
        attributes.append( attribute.getValue() );
        attributes.append( "\"" );
        if( count > 0 )
          attributes.append( ", " );
      }
    }
    return attributes.toString();
  }
}
ParsingState.java:

Because this is reentrant code, potentially called by many threads, this is the state instantiated separately for each. There are some debugging aids kept, but mostly it's just a stack of information that's filled out as the SAX parsing proceeds.

public class ParsingState
{
  protected ElementStack stack = new ElementStack();

  // debugging aids in case we want them...
  protected boolean VERBOSE;            // whether or not to see debugging stuff
  protected int     indentLevel = 1;    // as if generative--which we aren't, but if we want to print
  protected int     lineNumber  = 2;    // of original source document

  public ParsingState() { }
  public ParsingState( boolean verbose ) { this.VERBOSE = verbose; }
}
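
A sketch of the per-thread idea, using nothing from the servlet itself: each call builds its own parser and its own handler, so nothing mutable is shared between threads. The names PerCallState, CountingHandler, countElements() and countAll() are hypothetical, and the thread pool merely stands in for a servlet container's request threads.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; these names do not come from the servlet code.
class PerCallState
{
  static class CountingHandler extends DefaultHandler
  {
    int elements;   // state that belongs to exactly one parse

    @Override
    public void startElement( String uri, String localName, String qName, Attributes attributes )
    {
      elements++;
    }
  }

  /** Count elements using a parser and a handler created for this call alone. */
  static int countElements( String xml ) throws Exception
  {
    CountingHandler handler = new CountingHandler();
    SAXParser       parser  = SAXParserFactory.newInstance().newSAXParser();

    parser.parse( new InputSource( new StringReader( xml ) ), handler );
    return handler.elements;
  }

  /** Parse every document on a small thread pool; results keep the input order. */
  static List< Integer > countAll( List< String > documents ) throws Exception
  {
    ExecutorService           pool    = Executors.newFixedThreadPool( 2 );
    List< Future< Integer > > futures = new ArrayList<>();
    List< Integer >           counts  = new ArrayList<>();

    try
    {
      for( String xml : documents )
      {
        Callable< Integer > task = () -> countElements( xml );
        futures.add( pool.submit( task ) );
      }

      for( Future< Integer > future : futures )
        counts.add( future.get() );   // waits for that task to finish
    }
    finally
    {
      pool.shutdown();
    }

    return counts;
  }

  public static void main( String[] args ) throws Exception
  {
    List< String > documents = new ArrayList<>();

    documents.add( "<a><b/><c/></a>" );
    documents.add( "<only/>" );
    System.out.println( countAll( documents ) );   // prints [3, 1]
  }
}
```

Creating a factory per call is the cautious choice made here; whether a shared factory is safe depends on the implementation.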
IncomingDataHandler.java:

Here's the meat of the parser. It's in method endElement() that the data parsed will be used to call code that does something with the data.

import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

import com.windofkeltia.utilities.StringUtilities;

import static com.windofkeltia.sax.ElementStack.DOCUMENT_TOP;

public class IncomingDataHandler extends DefaultHandler
{
  private final ParsingState state;

  public IncomingDataHandler( ParsingState state )
  {
    super();
    this.state = state;
  }

  /**
   * Identifies and handles each opening element tag. Here we're
   * concerned with
   *
   * a) element names (and namespaces),
   * b) attributes,
   * c) saving our place for composing element text, and
   * d) building XPaths.
   *
   * @param uri namespace URI, where the prefix is from; we ignore this.
   * @param localName local name (without namespace prefix)
   * @param elementName SAX calls this the qname; it's both the namespace prefix and local name.
   * @param attributes any attributes on the element.
   */
  public void startElement( String uri, String localName, String elementName, Attributes attributes )
  {
    // debugging: maintain indentation
    if( state.VERBOSE )
      maintainIndentation();

    state.stack.push( elementName, attributes );  // save element and attributes for endElement()...
  }

  /**
   * Gathers the plain-text content between the opening and closing tags.
   * Of course, some elements are containers only and hold additional elements without
   * also having plain text.
   * @param ch     array holding the characters.
   * @param start  starting position in the array.
   * @param length number of characters to use from the array.
   */
  public void characters( char ch[], int start, int length )
  {
    String content = getTextContentFromSaxCharacters( ch, start, length );

    // append content to the owning (current) element...
    state.stack.appendContent( content );
  }

  /**
   * Identifies the closing element tag, an opportunity to wrap up
   * the element and do whatever we want with it. It's at this point that
   * we know the element is finished and what we've got in character content.
   */
  public void endElement( String uri, String localName, String elementName )
  {
    String                xpath      = state.stack.renderStackAsXPath();
    ElementStack.Element  element    = state.stack.pop();   // (and, pop this one off the stack)
    Map< String, String > attributes = element.attributes;
    String                content    = element.content.toString();

    String text = state.stack.peekContent().toString();

    // =====================================================================
    // This is where the majority of business is conducted using
    //   - element name
    //   - associated attributes
    //   - content text

    // (Do stuff here...)

    if( state.VERBOSE )
    {
      String progress = state.lineNumber + ": " + xpath;
      if( element.attributes.size() > 0 )
        progress += "( " + ElementStack.formatAttributes( element ) + " )";
      int length  = Math.min( content.length(), 40 );
      if( content.length() > 1 )
        progress += ":   " + content.substring( 0, length ).trim();
      System.out.println( progress );
      maintainIndentation();
    }
  }

  /** This is the beginning of the document. Mark it on the stack. */
  public void startDocument()
  {
    state.stack.push( DOCUMENT_TOP, null );

    if( state.VERBOSE )
      System.out.println( "Parsed:" );
  }

  /** This is the end of the document. Pop the original beginning mark off the stack. */
  public void endDocument() { state.stack.pop(); }

  /** If we care (debugging is turned on), maintain indentation, indeed, create left margin tabs. */
  private String maintainIndentation()
  {
    state.indentLevel++;
    return makeTabForIndentation( state.indentLevel );
  }

  private String makeTabForIndentation( int indentationLevel )
  {
    String tab = "";
    while( indentationLevel-- > 0 )
      //noinspection StringConcatenationInLoop
      tab += "  ";
    return tab;
  }

  /**
   * Gather text content from SAX.
   * @return content bereft of multiple newlines from end leaving at most one.
   */
  private String getTextContentFromSaxCharacters( char ch[], int start, int length )
  {
    String        characters = new String( ch, start, length );
    String[]      lines      = characters.split( "\n" );
    StringBuilder sb         = new StringBuilder();

    for( String line : lines )
    {
      sb.append( line.trim() )
        .append( '\n' );
    }

    // debugging: bump line count by number of newlines
    if( state.VERBOSE )
      state.lineNumber += StringUtilities.countCharactersInString( characters, '\n' );

    // remove multiple newlines from end of content to leave just one...
    return StringUtilities.chomp( sb.toString() );
  }
}
SaxParserTest.java:

Here's how to set up the parser and call it with some data.

import org.junit.Before;
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.TestName;

import com.windofkeltia.utilities.TestUtilities;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class SaxParserTest
{
  @Rule public TestName name = new TestName();
  @Before public void setUp() { TestUtilities.setUp( name ); }

  private static final boolean VERBOSE = TestUtilities.VERBOSE;

  private InputStream openStreamOnFile( final String filename ) throws IOException
  {
    try
    {
      final String PATH    = TestUtilities.TEST_FODDER + filename;
      final String CONTENT = TestUtilities.getLinesInFile( PATH );

      return new ByteArrayInputStream( CONTENT.getBytes( StandardCharsets.UTF_8 ) );
    }
    catch( IOException e )
    {
      if( VERBOSE )
        System.out.println( "Failed to open or read file " + filename + " containing test data: " + e.getMessage() );
      throw e;
    }
  }

  @Test
  public void test() throws Exception
  {
    ParsingState      state  = new ParsingState( true );  // turn on debugging
    ParseIncomingData parser = new ParseIncomingData( state );
    InputStream       input  = openStreamOnFile( "xml-data.xml" );
    parser.parse( input );
  }
}

HTML parsing with TagSoup

SAX chokes on stuff like &lt; and &gt; so if your XML has embedded HTML (it happens to me in the medical documents I deal with), you can delimit the HTML and pass it off to a subparser written using TagSoup. Meanwhile, here's how easy it is to set up and use TagSoup in a very simple example.

The highlighted lines are supposed to draw your attention to the important SAX or TagSoup bits of the code. A version of calling the parser with string- rather than file input is provided for gee whiz value.
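
To see why a forgiving parser is wanted for HTML at all, here's a small sketch that needs only the JDK. It feeds three fragments to the stock SAX parser and reports whether each is accepted. The class StrictnessSketch and its parses() helper are hypothetical names; TagSoup itself isn't involved.

```java
import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration using the JDK's parser only.
class StrictnessSketch
{
  /** @return true if the JDK's SAX parser accepts the markup, false if it rejects it. */
  static boolean parses( String markup ) throws ParserConfigurationException, IOException
  {
    try
    {
      SAXParserFactory.newInstance().newSAXParser()
                      .parse( new InputSource( new StringReader( markup ) ), new DefaultHandler() );
      return true;
    }
    catch( SAXException e )
    {
      return false;   // not well-formed XML as far as this parser is concerned
    }
  }

  public static void main( String[] args ) throws Exception
  {
    System.out.println( parses( "<td>0.0-0.2</td>" ) );       // well-formed
    System.out.println( parses( "<p>one<br>two</p>" ) );      // <br> is never closed
    System.out.println( parses( "<td>0.1&nbsp;mg</td>" ) );   // &nbsp; is not declared in XML
  }
}
```

Only the first fragment gets through. The other two are ordinary HTML that a browser accepts, which is the sort of input TagSoup is meant to cope with.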

TagSoupExample.java:

package com.etretatlogiciels.html.tagsoup;

import java.io.File;
import org.xml.sax.SAXException;

public class TagSoupExample
{
  public static void main( String[] args )
  {
    TagSoupParser parser = new TagSoupParser();

    try
    {
      File file = new File( "test/fodder/Sample.html" );

      parser.run( file );
    }
    catch( SAXException e )
    {
      e.printStackTrace();
    }
  }
}
TagSoupParser.java:
package com.etretatlogiciels.html.tagsoup;

import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.ccil.cowan.tagsoup.jaxp.SAXParserImpl;

public class TagSoupParser
{
  public void run( File file ) throws SAXException
  {
    TagSoupHandler handler = new TagSoupHandler();
    SAXParserImpl  parser  = SAXParserImpl.newInstance( null );

    try
    {
      parser.parse( file, handler );
    }
    catch( IOException e )
    {
      e.printStackTrace();
    }
  }

  public void run( String content ) throws SAXException
  {
    TagSoupHandler handler = new TagSoupHandler();
    SAXParserImpl  parser  = SAXParserImpl.newInstance( null );

    try
    {
      // parse( String, ... ) would treat the string as a URI, so wrap the content in a reader
      parser.parse( new InputSource( new StringReader( content ) ), handler );
    }
    catch( IOException e )
    {
      e.printStackTrace();
    }
  }
}
TagSoupHandler.java:
package com.etretatlogiciels.html.tagsoup;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class TagSoupHandler extends DefaultHandler
{
  int tab = 0;

  public void startElement( String uri, String localName, String qName, Attributes attributes )
      throws SAXException
  {
    System.out.println( getTabs( tab ) + "<" + qName + ">" );
    tab++;
  }

  public void characters( char ch[], int start, int length ) throws SAXException
  {
    String text = new String( ch, start, length ).trim();
    if( text.length() > 0 )
      System.out.println( getTabs( tab+1 ) + text );
  }

  public void endElement( String uri, String localName, String qName ) throws SAXException
  {
    tab--;
    System.out.println( getTabs( tab ) + "</" + qName + ">" );
  }

  private static String getTabs( int level )
  {
    String spaces = "";
    while( level-- > 0 )
      spaces += "  ";
    return spaces;
  }
}
Sample.html:

What's passed in for parsing...

<html>
<head>
</head>
<body>
  <table width="100%" border="1">
    <thead>
    <tr>
      <th> Date         </th>
      <th> Type (Loinc) </th>
      <th> Result       </th>
      <th> Ref Range    </th>
    </tr>
    </thead>
    <tbody>
    <tr>
      <td> 6/2/2015                 </td>
      <td> Bilirubin Direct(1968-7) </td>
      <td> &lt;0.1                  </td> <!-- the whole purpose is to see TagSoup handle this <! -->
      <td> 0.0-0.2                  </td>
    </tr>
    </tbody>
  </table>
</body>
</html>

Output

...when TagSoupExample.main() is run.

<html>
  <head>
  </head>
  <body>
    <table>
      <thead>
        <tr>
          <th> Date </th>
          <th> Type (Loinc) </th>
          <th> Result </th>
          <th> Ref Range </th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td> 6/2/2015 </td>
          <td> Bilirubin Direct(1968-7) </td>
          <td> <0.1 </td>
          <td> 0.0-0.2 </td>
        </tr>
      </tbody>
    </table>
  </body>
</html>
pom.xml particulars:
  <properties>
    <sax.version>2.0.1</sax.version>
    <tagsoup.version>1.2</tagsoup.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.ccil.cowan.tagsoup</groupId>
      <artifactId>tagsoup</artifactId>
      <version>${tagsoup.version}</version>
    </dependency>
    <dependency>
      <groupId>sax</groupId>
      <artifactId>sax</artifactId>
      <version>${sax.version}</version>
    </dependency>
  </dependencies>