The SAX Parser locator Facility

Russell Bateman
last update:




There is a locator in SAX, maintained and working at least for Java, that keeps track of the parser's current position. However, it has a peculiarity: when accessed from startElement(), it points to the end of the element's opening tag rather than to its beginning.

So, it's rather useful in playing horseshoes or detonating large explosive devices, but not for microsurgery or billiards. 😉

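It's important to note that the SAX parser hands locator to the handler by calling method setDocumentLocator(), once and before any other callback, as long as the handler provides this method (just as it provides other methods such as startElement(), characters(), etc.). The parser then keeps that same Locator object up to date as parsing proceeds.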
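
Here's a minimal sketch of that hand-off, assuming the JDK's built-in parser (SAX encourages, but does not oblige, a parser to supply a locator). LocatorHandOffDemo and its members are names I made up for this illustration; they are not part of the handler presented further down. It counts the calls to setDocumentLocator() and records the line the locator reports at each startElement().

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo class; not part of the handler presented on this page. */
class LocatorHandOffDemo extends DefaultHandler
{
  private Locator       locator;
  int                   locatorCalls = 0;                 // times setDocumentLocator() was called
  final List< Integer > startLines   = new ArrayList<>(); // line reported at each startElement()

  @Override
  public void setDocumentLocator( Locator locator )
  {
    this.locator = locator;
    locatorCalls++;
  }

  @Override
  public void startElement( String uri, String localName, String qName, Attributes attributes )
  {
    // a parser is not obliged to supply a locator, so guard against null
    if( locator != null )
      startLines.add( locator.getLineNumber() );
  }

  /** Parse XML held in a string and return the populated handler. */
  static LocatorHandOffDemo parse( String xml )
  {
    try
    {
      LocatorHandOffDemo handler = new LocatorHandOffDemo();
      SAXParserFactory.newInstance().newSAXParser()
                      .parse( new InputSource( new StringReader( xml ) ), handler );
      return handler;
    }
    catch( Exception e )
    {
      throw new IllegalStateException( e );
    }
  }

  public static void main( String[] args )
  {
    LocatorHandOffDemo handler = parse( "<a>\n  <b />\n</a>" );
    System.out.println( "setDocumentLocator() calls: " + handler.locatorCalls );
    System.out.println( "startElement() lines:       " + handler.startLines );
  }
}
```

The line numbers change from one element to the next even though the handler was given the locator only once: the handler holds a reference, and the parser updates what's behind it.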

Caveat

This said, we read in the SAX documentation for Locator.getColumnNumber() that...

The return value from the method is intended only as an approximation for the sake of diagnostics; it is not intended to provide sufficient information to edit the character content of the original XML document. For example, when lines contain combining character sequences, wide characters, surrogate pairs, or bi-directional text, the value may not correspond to the column in a text editor's display.

It may be possible to increase this facility's accuracy. Here's a bare-bones SAX parser handler with a wrapper to do just that.

Javadoc

This is the Javadoc for the DefaultHandler below. I don't leave much Javadoc in the code sample since Alex Gorbatchev's SyntaxHighlighter cannot handle the embedded HTML from which much of this Javadoc benefits enormously.

SAX-handler methods characters(), ignorableWhitespace() and endElement() contribute significantly to tracking the current position relative to the starting point of the XML element.

When inspected, locator in...

As long as the ending position is carefully tracked in each of these three methods, the starting point for the current element can be found. The ending position of the current element can already be obtained from locator in endElement() without using this tracking code (Position). Therefore, both the starting and ending points of any XML element should be trackable.

Individual method Javadoc:


In startElement(), ...

<a> ← locator points here
  <b />
</a>
public void startElement( String uri, String localName, String qName, Attributes attributes )

When endElement() is called, locator points at the closing tag, not at the opening one.

<a>
  <b />
</a> ← locator is pointing here when endElement() is called
public void endElement( String uri, String localName, String qName )

Imagine element text consisting of "some other words":

<a>
  some other words ← locator points here
  <b />
</a>
public void characters( char[] ch, int start, int length )

Just as in characters() above, ...

<a>
        ← locator points here
  <b />
</a>
public void ignorableWhitespace( char[] ch, int start, int length )

SAX handler source code

By Russell Bateman, based on work by a Wenston Chen on Stack Overflow. This source code proposes an internal wrapper, Position (at the bottom of the source code), for SAX's Locator, to make the latter more accurate:

package com.windofkeltia.sax;

import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

public class SampleHandler extends DefaultHandler
{
  private Locator  locator;
  private Position position = new Position(); // starting element position we maintain

  /** The SAX parser calls this once, before any other callback, to hand us its locator. */
  @Override
  public void setDocumentLocator( Locator location ) { locator = location; }

  /** (See Javadoc above.) */
  @Override
  public void startElement( String uri, String localName, String qName, Attributes attributes )
  {
    printer.startElement( locator, qName, attributes );

    // do startElement() work here...
  }

  /** (See Javadoc above.) */
  @Override
  public void endElement( String uri, String localName, String qName )
  {
    printer.endElement( locator, qName );

    // We could get our source position here, at the end
    Position start = position;
    Position end   = new Position( locator.getLineNumber(), locator.getColumnNumber() );

    // do endElement() work here

    // update the starting point for the next element
    updateElementPoint( locator );
  }

  /** (See Javadoc above.) */
  @Override
  public void characters( char[] ch, int start, int length )
  {
    // the printer ignores empty strings, so whitespace-only text prints nothing
    printer.characters( locator, new String( ch, start, length ).trim() );

    updateElementPoint( locator );  // now update the starting point
  }

  /** (See Javadoc above.) */
  @Override
  public void ignorableWhitespace( char[] ch, int start, int length )
  {
    updateElementPoint( locator ); // now update the starting point
  }

  private void updateElementPoint( Locator locator )
  {
    Position location = new Position( locator.getLineNumber(), locator.getColumnNumber() );
    if( position.compareTo( location ) < 0 )
      position = location;
  }

  /** Wrap and maintain the SAX locator to make it more accurate. */
  static class Position
  {
    private int line;
    private int column;

    public Position()                       { this.line   = 1;    this.column = 1; }
    public Position( int line, int column ) { this.line   = line; this.column = column; }
    public void setLine( int line )         { this.line   = line; }
    public void setColumn( int column )     { this.column = column; }

    public int getLine()                    { return line; }
    public int getColumn()                  { return column; }

    public int compareTo( Position position )
    {
      // if our location is past recorded line...
      if( position.getLine() > getLine() )
        return -1;
      // if on recorded line, but past recorded column...
      else if( position.getLine() == getLine() && position.getColumn() > getColumn() )
        return -1;
      // if on recorded line and also at recorded column...
      else if( position.getLine() == getLine() && position.getColumn() == getColumn() )
        return 0;
      // we're before current line and/or current column...
      else
        return 1;
    }
  }

  private final ParserHandlerPrinter printer = new ParserHandlerPrinter( printerVerbosity );

  private static int printerVerbosity = 0;

  /** Use this from JUnit tests to set the level of debug verbosity. If not done, printer will be quiet. */
  public static void setPrinterVerbosity( int verbosity ) { printerVerbosity = verbosity; }
}
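
To make the idea concrete, here's a self-contained sketch of one way the tracked position might be put to work: remember where each element began and, when it closes, pair that with what the locator reports. This is my own variation, not the handler above. ElementSpanDemo and Span are made-up names, it tracks line numbers only (columns could be tracked the same way), it also records the end of the opening tag in startElement(), which SampleHandler does not do, and it assumes the parser supplies a locator, as the JDK's does.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo: a variation on SampleHandler that tracks line numbers only. */
class ElementSpanDemo extends DefaultHandler
{
  /** A finished element: its name, the line where it began and the line where it ended. */
  static class Span
  {
    final String name;
    final int    startLine;
    final int    endLine;

    Span( String name, int startLine, int endLine )
    {
      this.name      = name;
      this.startLine = startLine;
      this.endLine   = endLine;
    }

    @Override
    public String toString() { return name + " [" + startLine + ".." + endLine + "]"; }
  }

  private Locator                locator;
  private int                    lastLine   = 1;                  // line on which the previous event ended
  private final Deque< Integer > startLines = new ArrayDeque<>(); // one entry per element still open
  final List< Span >             spans      = new ArrayList<>();  // filled in as elements close

  @Override
  public void setDocumentLocator( Locator locator ) { this.locator = locator; }

  @Override
  public void startElement( String uri, String localName, String qName, Attributes attributes )
  {
    startLines.push( lastLine );         // the opening tag begins where the previous event ended
    lastLine = locator.getLineNumber();  // my addition: also remember where the opening tag ended
  }

  @Override
  public void characters( char[] ch, int start, int length ) { lastLine = locator.getLineNumber(); }

  @Override
  public void ignorableWhitespace( char[] ch, int start, int length ) { lastLine = locator.getLineNumber(); }

  @Override
  public void endElement( String uri, String localName, String qName )
  {
    int endLine = locator.getLineNumber();  // locator is already at the end of the closing tag
    spans.add( new Span( qName, startLines.pop(), endLine ) );
    lastLine = endLine;
  }

  /** Parse XML held in a string and return the spans in the order the elements closed. */
  static List< Span > parse( String xml )
  {
    try
    {
      ElementSpanDemo handler = new ElementSpanDemo();
      SAXParserFactory.newInstance().newSAXParser()
                      .parse( new InputSource( new StringReader( xml ) ), handler );
      return handler.spans;
    }
    catch( Exception e )
    {
      throw new IllegalStateException( e );
    }
  }

  public static void main( String[] args )
  {
    String xml = "<list>\n  <item>one</item>\n  <item>\n    two\n  </item>\n</list>";
    for( Span span : parse( xml ) )
      System.out.println( span );
  }
}
```

The stack is there because elements nest: an inner element opens and closes while the outer one is still waiting for its closing tag, so a single remembered position isn't enough.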

The debugger/printer facility

It would be frustrating not to include a ready-made way of demonstrating both the success and behavior of what we're doing on this page. Of course, you'll want to engage this at level 3 to get line numbers out.

ParserHandlerPrinter.java:
package com.windofkeltia.sax;

import java.util.HashMap;
import java.util.Map;

import org.xml.sax.Attributes;
import org.xml.sax.Locator;

import com.windofkeltia.utilities.StringUtilities;

/**
 * Conceived as JUnit-only. Use static setPrinterVerbosity()
 * in whatever SAX parser handler consumes this to set the level
 * before instantiating the parser.
 */
public class ParserHandlerPrinter
{
  private final int verbosity;

  /** Quiet version of this utility--utters nothing. */
  public ParserHandlerPrinter() { verbosity = 0; }

  /**
   * Enable this utility at any of several levels:
   *
   * 0 - none (quiet mode)
   * 1 - minimal output
   * 2 - verbose output without line numbers
   * 3 - verbose output including line numbers
   */
  public ParserHandlerPrinter( int verbosity ) { this.verbosity = verbosity; }

  public void startElement( Locator locator, final String elementName, Attributes attributes )
  {
    StringBuilder sb = new StringBuilder();
    switch( verbosity )
    {
      case 3 :
        sb.append( getLineNumber( locator ) );
      case 2 :
      case 1 :
        sb.append( " <" ).append( elementName );
        if( attributes.getLength() > 0 )
          sb.append( javaAttributesAsString( attributes ) );
        sb.append( ">" );
        System.out.println( sb );
      case 0 :
        break;
    }
  }

  public void characters( Locator locator, final String characters )
  {
    if( characters.length() < 1 )
      return;

    StringBuilder sb = new StringBuilder();
    switch( verbosity )
    {
      case 3 :
        sb.append( getLineNumber( locator ) );
      case 2 :
      case 1 :
        sb.append( LINE_NUMBER_INDENT ).append( characters );
        System.out.println( sb );
      case 0 :
        break;
    }
  }

  public void endElement( Locator locator, final String elementName )
  {
    StringBuilder sb = new StringBuilder();
    switch( verbosity )
    {
      case 3 :
        sb.append( getLineNumber( locator ) );
      case 2 :
      case 1 :
        sb.append( " </" ).append( elementName ).append( ">" );
        System.out.println( sb );
      case 0 :
        break;
    }
  }

  public void endElement( Locator locator, final String elementName, StringBuilder text )
  {
    //noinspection DuplicatedCode
    StringBuilder sb = new StringBuilder();
    switch( verbosity )
    {
      case 3 :
        sb.append( getLineNumber( locator ) );
      case 2 :
      case 1 :
        sb.append( " </" ).append( elementName ).append( ">" );
        System.out.println( sb );
      case 0 :
        break;
    }
  }

  public void startDocument() { if( verbosity > 2 ) System.out.println( "[start of document]" ); }
  public void endDocument()   { if( verbosity > 2 ) System.out.println( "[end of document]" );   }

  private static int    LINE_NUMBER_PLACES = 3;
  private static String LINE_NUMBER_INDENT = "   ";

  public static void setLineNumberPlaces( int places ) { LINE_NUMBER_PLACES = places; }
  public static void setLineNumberIndent( int indent )
  {
    StringBuilder sb = new StringBuilder();
    while( indent-- > 0 )
      sb.append( ' ' );
    LINE_NUMBER_INDENT = sb.toString();
  }

  private static String getLineNumber( Locator locator )
  {
    return StringUtilities.padStringLeft( locator.getLineNumber()+"", LINE_NUMBER_PLACES );
  }

  private static String javaAttributesAsString( Attributes attributes )
  {
    Map< String, String > javaAttributes = getAttributesAsJavaMap( attributes );

    if( javaAttributes.size() == 0 )
      return "";

    StringBuilder sb = new StringBuilder();

    for( Map.Entry< String, String > attribute : javaAttributes.entrySet() )
      sb.append( " " ).append( attribute.getKey() ).append( "=\"" ).append( attribute.getValue() ).append( "\"" );

    return sb.toString();
  }

  private static Map< String, String > getAttributesAsJavaMap( Attributes attributes )
  {
    int                   attributeLength = attributes.getLength();
    Map< String, String > javaAttributes  = new HashMap<>( attributeLength );

    for( int count = 0; count < attributeLength; count++ )
    {
      String attribute = attributes.getQName( count );
      String value     = attributes.getValue( count );
      javaAttributes.put( attribute, value );
    }

    return javaAttributes;
  }
}
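
The printer leans on StringUtilities.padStringLeft(), which isn't shown on this page. If you don't have it, here's a hypothetical stand-in. I'm assuming the method right-justifies its argument by padding it on the left with spaces up to the given width and never truncates; that's a guess from the name and from how getLineNumber() uses it. To satisfy the import above, you'd declare the class and the method public inside package com.windofkeltia.utilities.

```java
/** Hypothetical stand-in for StringUtilities; the real class is not shown on this page. */
final class StringUtilities
{
  private StringUtilities() { }

  /**
   * Assumed behavior: pad with spaces on the left until the result is
   * at least width characters long; longer strings are returned unchanged.
   */
  static String padStringLeft( String string, int width )
  {
    StringBuilder sb = new StringBuilder();
    for( int pad = width - string.length(); pad > 0; pad-- )
      sb.append( ' ' );
    return sb.append( string ).toString();
  }

  public static void main( String[] args )
  {
    System.out.println( "[" + padStringLeft( "7", 3 ) + "]" );
  }
}
```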

JUnit test

It would be just as frustrating not to include a test that drives the handler and its printer.

SampleHandlerTest.java:
package com.windofkeltia.sax;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.junit.Test;

public class SampleHandlerTest
{
  private static final String CONTENT_PATH = TestUtilities.TEST_FODDER + "sample.xml";

  @Test
  public void test() throws Exception
  {
    // set this before instantiating the handler, which creates its printer at construction
    SampleHandler.setPrinterVerbosity( 3 );

    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser        parser  = factory.newSAXParser();
    SampleHandler    handler = new SampleHandler();

    parser.parse( CONTENT_PATH, handler );
  }
}

A parting note...

You'll observe that, if you have XML like the following where the element spans several lines with its element name plus attribute names and values,

116 <medication startdate="202205180900"
117             enddate="202205180900"
118             dose="50"
119             unit="mg">
120   This was the patient's Vicodin.
121 </medication>

...the medication element opening line number will be 119 and not 116 (in case that's what you expected). The closing element line number will be 121 as expected. The text content's line number will be 120.
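
If you'd like to see this without the rest of the code on this page, here's a small sketch that parses the same element as a document of its own, so that its lines are numbered 1 through 6 instead of 116 through 121. MultiLineTagDemo and its fields are made-up names, and I'm assuming the JDK's built-in parser. It only records what the raw locator says in startElement() and endElement().

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo: what the raw locator reports for an opening tag spanning several lines. */
class MultiLineTagDemo extends DefaultHandler
{
  static final String SAMPLE =
        "<medication startdate=\"202205180900\"\n"
      + "            enddate=\"202205180900\"\n"
      + "            dose=\"50\"\n"
      + "            unit=\"mg\">\n"
      + "  This was the patient's Vicodin.\n"
      + "</medication>\n";

  private Locator locator;
  int             openingLine = -1;  // line reported in startElement()
  int             closingLine = -1;  // line reported in endElement()

  @Override
  public void setDocumentLocator( Locator locator ) { this.locator = locator; }

  @Override
  public void startElement( String uri, String localName, String qName, Attributes attributes )
  {
    openingLine = locator.getLineNumber();
  }

  @Override
  public void endElement( String uri, String localName, String qName )
  {
    closingLine = locator.getLineNumber();
  }

  /** Parse XML held in a string and return the populated handler. */
  static MultiLineTagDemo parse( String xml )
  {
    try
    {
      MultiLineTagDemo handler = new MultiLineTagDemo();
      SAXParserFactory.newInstance().newSAXParser()
                      .parse( new InputSource( new StringReader( xml ) ), handler );
      return handler;
    }
    catch( Exception e )
    {
      throw new IllegalStateException( e );
    }
  }

  public static void main( String[] args )
  {
    MultiLineTagDemo handler = parse( SAMPLE );
    System.out.println( "opening tag reported on line " + handler.openingLine );
    System.out.println( "closing tag reported on line " + handler.closingLine );
  }
}
```

The opening line should come out as the last line of the opening tag, the one carrying the closing angle bracket, which is the same offset as 119 against 116 above.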