Dummy SAX Parser (demonstrated)

Russell Bateman




You can play with this code or use it as a blank from which to start fresh. Unlike the examples in Using the SAX Parser, it contains little beyond System.out.println() statements, so there is nothing in it to confuse you.

The JUnit test

When you run this, you can follow the behavior of the handler to see how SAX parsing goes down. There is important SAX code in this test too: it shows how to set up and fire off the parser using the handler. The lines that create the factory, the handler and the parser, and the call to parse(), are the ones that will be in your greater, production code.

  @Test
  public void test() throws ParserConfigurationException, SAXException, IOException
  {
    final String INPUT = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<extended_data>\n"
        + "<!-- This is a test comment -->\n"
        + "  <concept type=\"full_name\"\n"
        + "           location=\"/xml/document[1]/ns:div[1]/ns:div[1]/ns:div[1]/ns:div[1]\"\n"
        + "           analyzed_by=\"MedicalStructureBuilder\">\n"
        + "    <text>Jack the Ripper</text>\n"
        + "    <ns:xml location=\"/xml/document[1]/ns:div[1]/ns:div[1]/ns:div[1]/ns:div[1]\" />\n"
        + "    <data type=\"full_name\" value=\"Jack the Ripper\" />\n"
        + "  </concept>\n"
        + "</extended_data>\n";
    System.out.println( "Input:\n  " + INPUT.replaceAll( "\n", "\n  " ) );

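    // encode explicitly as UTF-8 to match the XML declaration (java.nio.charset.StandardCharsets)
    InputStream       input     = new ByteArrayInputStream( INPUT.getBytes( StandardCharsets.UTF_8 ) );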
    SAXParserFactory  factory   = SAXParserFactory.newInstance();
    DummySaxHandler   handler   = new DummySaxHandler();
    SAXParser         parser    = factory.newSAXParser();
    XMLReader         xmlReader = parser.getXMLReader();
    parser.parse( input, handler );
  }

The handler

This is a bare-bones SAX parser handler to start from. It doesn't "stack" attributes and text content for use in endElement(), for example, but you can find examples of doing that here.

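I included the LexicalHandler so that XML comments would get parsed, but comment() is never called. What I left out is registering the handler as the parser's lexical handler: parser.parse( input, handler ) wires up only the DefaultHandler callbacks, and the xmlReader obtained in the test is never used. A LexicalHandler must be registered separately through the SAX property http://xml.org/sax/properties/lexical-handler. I'll come back and fix that at some point; it's well covered in my larger document on SAX parsing.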
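Here is a minimal sketch of registering a lexical handler so that comment() gets called. The class name LexicalHandlerSketch and the method collectComments() are made up for illustration. DefaultHandler2 is used only because it already supplies empty LexicalHandler methods. Driving the XMLReader directly is one way to do this and not necessarily the only one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

public class LexicalHandlerSketch
{
  /** Property name that SAX2 defines for registering a LexicalHandler. */
  static final String LEXICAL_HANDLER = "http://xml.org/sax/properties/lexical-handler";

  /** Hypothetical helper: returns the trimmed text of every comment in the XML. */
  static List< String > collectComments( String xml ) throws Exception
  {
    final List< String > comments = new ArrayList<>();

    DefaultHandler2 handler = new DefaultHandler2()
    {
      @Override
      public void comment( char[] ch, int start, int length )
      {
        comments.add( new String( ch, start, length ).trim() );
      }
    };

    XMLReader xmlReader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    xmlReader.setContentHandler( handler );             // the usual element callbacks
    xmlReader.setProperty( LEXICAL_HANDLER, handler );  // the step that enables comment()
    xmlReader.parse( new InputSource( new StringReader( xml ) ) );
    return comments;
  }

  public static void main( String[] args ) throws Exception
  {
    String xml = "<extended_data><!-- This is a test comment --></extended_data>";
    System.out.println( collectComments( xml ) );
  }
}
```

In the test above, the equivalent change would presumably be a single setProperty() call on the xmlReader (or on the parser) before parsing.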

import java.util.HashMap;
import java.util.Map;

import org.xml.sax.Attributes;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Dummy SAX parser handler for in-coming data.
 *
 * @author Russell Bateman
 */
public class DummySaxHandler extends DefaultHandler implements LexicalHandler
{
  public DummySaxHandler()
  {
    super();
    System.out.println( "DummySaxHandler():" );
  }

  public void startDocument()
  {
    System.out.println( "  startDocument():" );
  }

  public void startElement( String uri, String localName, String elementName, Attributes attributes )
  {
    Map< String, String > javaAttributes = getAttributesAsJavaMap( attributes );
    System.out.println( "   startElement(): uri=\"" + uri
                 + "\", localName=\"" + localName
                 + "\", qName=\"" + elementName
                 + "\", attributes: "
                 + javaAttributesAsString( javaAttributes ) );
  }

  public void endElement( String uri, String localName, String elementName )
  {
    System.out.println( "     endElement(): uri=\"" + uri
                 + "\", localName=\"" + localName
                 + "\", qName=\"" + elementName + "\"" );
  }

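  public void characters( char[] ch, int start, int length )
  {
    // SAX may deliver one run of text in several calls; only the slice
    // ch[ start ] through ch[ start + length - 1 ] belongs to this call
    String characters = new String( ch, start, length );
    System.out.println( "     characters(): \"" + characters.trim() + "\"" );
  }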

  public void comment( char[] ch, int start, int length )
  {
    String comment = new String( ch, start, length );
    System.out.println( "        comment(): \"" + comment.trim() + "\"" );
  }

  public void endDocument()
  {
    System.out.println( "    endDocument():" );
  }

  @Override public void startDTD( String name, String publicId, String systemId )  { }
  @Override public void endDTD()  { }
  @Override public void startEntity( String name ) { }
  @Override public void endEntity( String name )  { }
  @Override public void startCDATA()  { }
  @Override public void endCDATA()  { }

  /**
   * Here's how to make SAX's attributes "Java-useful". If the parser were namespace-aware
   * and the XML declared namespaces, we'd have to get a lot more serious about how to use
   * uri, localName and qName.
   */
  private Map< String, String > getAttributesAsJavaMap( Attributes saxAttributes )
  {
    int                   attrLength = saxAttributes.getLength();
    Map< String, String > javaAttributes = new HashMap<>( attrLength );

    for( int attr = 0; attr < attrLength; attr++ )
    {
      String attribute = saxAttributes.getQName( attr );
      String value     = saxAttributes.getValue( attr );
      javaAttributes.put( attribute, value );
    }

    return javaAttributes;
  }

  private String javaAttributesAsString( Map< String, String > javaAttributes )
  {
    if( javaAttributes.isEmpty() )
      return "";

    // put each attribute on its own line, indented to sit under the element's output
    StringBuilder sb = new StringBuilder();
    for( Map.Entry< String, String > attribute : javaAttributes.entrySet() )
      sb.append( "\n                   " )
        .append( attribute.getKey() )
        .append( "=\"" )
        .append( attribute.getValue() )
        .append( "\"" );

    return sb.toString();
  }
}
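
The handler above does not stack text content for use in endElement(). As a rough sketch of what that could look like, the following keeps one StringBuilder per open element on a stack and reads it back when the element closes. The class TextStackingSketch and the method textByElement() are invented for this illustration. Attributes could be stacked the same way but are left out to keep it short.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TextStackingSketch
{
  /** Hypothetical helper: maps each element name to its own (trimmed, non-empty) text. */
  static Map< String, String > textByElement( String xml ) throws Exception
  {
    final Map< String, String >   result = new LinkedHashMap<>();
    final Deque< StringBuilder >  stack  = new ArrayDeque<>();

    DefaultHandler handler = new DefaultHandler()
    {
      @Override
      public void startElement( String uri, String localName, String qName, Attributes attributes )
      {
        stack.push( new StringBuilder() );        // fresh buffer for this element
      }

      @Override
      public void characters( char[] ch, int start, int length )
      {
        if( !stack.isEmpty() )
          stack.peek().append( ch, start, length );   // may be called more than once per text run
      }

      @Override
      public void endElement( String uri, String localName, String qName )
      {
        String text = stack.pop().toString().trim();
        if( !text.isEmpty() )
          result.put( qName, text );              // the text is complete only here
      }
    };

    SAXParserFactory.newInstance().newSAXParser()
                    .parse( new InputSource( new StringReader( xml ) ), handler );
    return result;
  }

  public static void main( String[] args ) throws Exception
  {
    String xml = "<concept><text>Jack the Ripper</text><data value=\"x\" /></concept>";
    System.out.println( textByElement( xml ) );
  }
}
```

Appending in characters() and reading in endElement() matters because the parser is free to split one run of text across several characters() calls.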

The output

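Most of the characters() calls appear empty because I used .trim() to get rid of newlines and shorten the output. Before trimming, they held the newlines and indentation that sit between elements: all whitespace in text content, newlines included, finds its way into calls to characters(). You could count those newlines to track line numbers (if you wanted to report errors, for example), bumping your line-count variable once for every newline whether or not you keep the newline in whatever you decide is output. The count is only approximate, though: newlines inside a start tag (between attributes), inside comments and in the prolog never reach characters(). SAX's Locator, which the parser hands to the handler through setDocumentLocator(), reports line numbers directly.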
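
As an alternative to counting newlines by hand, here is a small sketch that asks the parser for line numbers through org.xml.sax.Locator. The class LocatorSketch and the method elementLines() are hypothetical names. The line reported is where the parser stands when the event ends, so for a start tag spread over several lines it is the line of the closing angle bracket.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

public class LocatorSketch
{
  /** Hypothetical helper: maps each element name to the line its start tag was reported on. */
  static Map< String, Integer > elementLines( String xml ) throws Exception
  {
    final Map< String, Integer > lines = new LinkedHashMap<>();

    DefaultHandler handler = new DefaultHandler()
    {
      private Locator locator;

      @Override
      public void setDocumentLocator( Locator locator )
      {
        this.locator = locator;                   // called by the parser before startDocument()
      }

      @Override
      public void startElement( String uri, String localName, String qName, Attributes attributes )
      {
        if( locator != null )                     // a parser is not obliged to supply one
          lines.put( qName, locator.getLineNumber() );
      }
    };

    SAXParserFactory.newInstance().newSAXParser()
                    .parse( new InputSource( new StringReader( xml ) ), handler );
    return lines;
  }

  public static void main( String[] args ) throws Exception
  {
    System.out.println( elementLines( "<a>\n  <b/>\n</a>" ) );
  }
}
```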

The order below is strictly the order in which SAX calls our handler:

extended_data startElement()
    concept startElement()
        text startElement()
        text endElement()
        ns:xml startElement()
        ns:xml endElement()
        data startElement()
        data endElement()
    concept endElement()
extended_data endElement()

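* The reason ns (namespace) doesn't come out in uri is that a parser obtained from SAXParserFactory is not namespace-aware by default, so uri and localName stay empty and the prefixed name appears only in qName. A DTD plays no part in this. Note too that our XML never declares the ns prefix (there is no xmlns:ns attribute), so a namespace-aware parser would reject this input outright.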
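
To see uri and localName filled in, the factory has to be made namespace-aware and the XML has to declare the prefix. The sketch below does both. The class NamespaceAwareSketch, the method expandedName() and the namespace urn:example are all invented for this illustration and do not come from the test above.

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class NamespaceAwareSketch
{
  /** Hypothetical helper: returns the root element's name as "{uri}localName". */
  static String expandedName( String xml ) throws Exception
  {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware( true );            // off by default

    final StringBuilder name = new StringBuilder();

    DefaultHandler handler = new DefaultHandler()
    {
      @Override
      public void startElement( String uri, String localName, String qName, Attributes attributes )
      {
        if( name.length() == 0 )                  // remember only the first (root) element
          name.append( '{' ).append( uri ).append( '}' ).append( localName );
      }
    };

    factory.newSAXParser().parse( new InputSource( new StringReader( xml ) ), handler );
    return name.toString();
  }

  public static void main( String[] args ) throws Exception
  {
    // urn:example is a made-up namespace; the xmlns:ns declaration is what binds the prefix
    System.out.println( expandedName( "<ns:xml xmlns:ns=\"urn:example\" />" ) );
  }
}
```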

Input:
  <?xml version="1.0" encoding="UTF-8"?>
  <extended_data>
  <!-- This is a test comment -->
    <concept type="full_name"
             location="/xml/document[1]/ns:div[1]/ns:div[1]/ns:div[1]/ns:div[1]"
             analyzed_by="MedicalStructureBuilder">
      <text>Jack the Ripper</text>
      <ns:xml location="/xml/document[1]/ns:div[1]/ns:div[1]/ns:div[1]/ns:div[1]" />
      <data type="full_name" value="Jack the Ripper" />
    </concept>
  </extended_data>

DummySaxHandler():
  startDocument():
   startElement(): uri="", localName="", qName="extended_data"
     characters(): ""
     characters(): ""
   startElement(): uri="", localName="", qName="concept"
                   attributes:
                   type="full_name"
                   analyzed_by="MedicalStructureBuilder"
                   location="/xml/document[1]/ns:div[1]/ns:div[1]/ns:div[1]/ns:div[1]"
     characters(): ""
   startElement(): uri="", localName="", qName="text"
     characters(): "Jack the Ripper"
     endElement(): uri="", localName="", qName="text"
     characters(): ""
   startElement(): uri="", localName="", qName="ns:xml"
                   attributes:
                   location="/xml/document[1]/ns:div[1]/ns:div[1]/ns:div[1]/ns:div[1]"
     endElement(): uri="", localName="", qName="ns:xml"
     characters(): ""
   startElement(): uri="", localName="", qName="data"
                   attributes:
                   type="full_name"
                   value="Jack the Ripper"
     endElement(): uri="", localName="", qName="data"
     characters(): ""
     endElement(): uri="", localName="", qName="concept"
     characters(): ""
     endElement(): uri="", localName="", qName="extended_data"
    endDocument():