The SAX Parser locator Facility
Russell Bateman
There is a locator in SAX, maintained and working at least for Java, that keeps track of the parser's current position. It has one peculiarity, however: when accessed from startElement(), it points to the end of that element's opening tag rather than to its beginning.
So, it's rather useful in playing horseshoes or detonating large explosive devices, but not for microsurgery or billiards. 😉
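It's important to note that the SAX parser hands locator to the handler by calling method setDocumentLocator(), as long as the handler provides this method (just as it provides other methods such as startElement(), characters(), etc.). The parser makes this call before any other callback; afterward, the same locator object reports the current position whenever a callback consults it.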
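To make the hand-off concrete, here is a minimal sketch, assuming the JDK's built-in parser (which does supply a locator, although SAX does not strictly require one). The names LocatorDemo, RecordingHandler and parse() are made up for this illustration. It records the line that locator reports inside startElement() for a start tag spread over three lines:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

public class LocatorDemo
{
  /** Hypothetical handler: keeps the locator and notes the line reported for each start tag. */
  static class RecordingHandler extends DefaultHandler
  {
    private Locator      locator;
    final List< String > seen = new ArrayList<>();

    /** Called by the parser to give us its locator. */
    @Override
    public void setDocumentLocator( Locator locator ) { this.locator = locator; }

    @Override
    public void startElement( String uri, String localName, String qName, Attributes attributes )
    {
      // guard against a parser that supplied no locator
      int line = ( locator != null ) ? locator.getLineNumber() : -1;
      seen.add( qName + "@" + line );
    }
  }

  /** Parse XML held in a string; return "name@line" for every start tag seen. */
  static List< String > parse( String xml )
  {
    RecordingHandler handler = new RecordingHandler();
    try
    {
      SAXParserFactory.newInstance().newSAXParser()
                      .parse( new InputSource( new StringReader( xml ) ), handler );
    }
    catch( Exception e )
    {
      throw new IllegalStateException( e );
    }
    return handler.seen;
  }

  public static void main( String[] args )
  {
    // the start tag opens on line 1 and closes on line 3
    String xml = "<a\n  x=\"1\"\n  y=\"2\">text</a>";
    System.out.println( parse( xml ) );
  }
}
```

With the JDK's bundled parser, this should report line 3 for element a, that is, the line on which the start tag closes, not the one on which it opens.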
This said, we read in SAX documentation that...
The return value from the method is intended only as an approximation for the sake of diagnostics; it is not intended to provide sufficient information to edit the character content of the original XML document. For example, when lines contain combining character sequences, wide characters, surrogate pairs, or bi-directional text, the value may not correspond to the column in a text editor's display.
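A small illustration of why such a column can disagree with what an editor shows. This sketch does not claim to know which unit any given parser counts; it only shows, with plain Java strings, that "how many characters" has more than one answer. The class name ColumnCaveat and its two helper methods are made up for the example.

```java
public class ColumnCaveat
{
  /** Number of UTF-16 code units (Java chars) in the text. */
  static int utf16Units( String text ) { return text.length(); }

  /** Number of Unicode code points in the text. */
  static int codePoints( String text ) { return text.codePointCount( 0, text.length() ); }

  public static void main( String[] args )
  {
    String wink     = "\uD83D\uDE09";   // U+1F609, one code point stored as a surrogate pair
    String accented = "e\u0301";        // 'e' followed by a combining acute accent

    // one visible symbol, but two chars and one code point
    System.out.println( utf16Units( wink ) + " chars, " + codePoints( wink ) + " code point(s)" );

    // one visible symbol, but two chars and two code points
    System.out.println( utf16Units( accented ) + " chars, " + codePoints( accented ) + " code point(s)" );
  }
}
```

Either string occupies a single column in most editors, yet neither count is 1 for both, which is the kind of mismatch the SAX documentation warns about.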
It may be possible to increase this facility's accuracy. Here's a bare-bones SAX parser handler with a wrapper to do just that.
What follows is Javadoc for the DefaultHandler below. I don't leave much Javadoc in the code sample since the Gorbatchev SyntaxHighlighter cannot handle the embedded HTML from which much of this Javadoc benefits enormously.
SAX-handler methods characters(), ignorableWhitespace() and endElement() contribute significantly to tracking the current position relative to the starting point of the XML element.
When inspected, locator in...
As long as the ending position is carefully tracked in each of these three methods, the starting point for the current element can be found. The ending position of the current element can already be obtained from locator in endElement() without this tracking code (Position). Therefore, both the starting and ending points of any XML element should be trackable.
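The tracking idea can be sketched in a few lines. This is a variation for illustration, not the handler presented further down: the class StartTagTracker, its mark() helper and its track() method are hypothetical names, and it assumes the parser supplies a locator, as the JDK's does. It treats the place where the previous event ended as the approximate place where the next start tag begins, and it also records the position in startElement(), which the three methods named above do not cover, so that directly nested tags such as `<a><b>` are handled.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

public class StartTagTracker extends DefaultHandler
{
  private Locator      locator;
  private int          lastLine = 1;            // line on which the previous event ended
  final List< String > spans    = new ArrayList<>();

  @Override
  public void setDocumentLocator( Locator locator ) { this.locator = locator; }

  /** Remember where the event that has just been reported ended. */
  private void mark() { lastLine = locator.getLineNumber(); }

  @Override
  public void startElement( String uri, String localName, String qName, Attributes attributes )
  {
    // the tag begins (approximately) where the previous event ended; locator is now at its end
    spans.add( qName + ": tag starts on line " + lastLine + ", ends on line " + locator.getLineNumber() );
    mark();   // assumption of this sketch: also mark here to cover directly nested tags
  }

  @Override
  public void endElement( String uri, String localName, String qName ) { mark(); }

  @Override
  public void characters( char[] ch, int start, int length ) { mark(); }

  @Override
  public void ignorableWhitespace( char[] ch, int start, int length ) { mark(); }

  /** Parse XML held in a string; return one description per start tag. */
  static List< String > track( String xml )
  {
    StartTagTracker handler = new StartTagTracker();
    try
    {
      SAXParserFactory.newInstance().newSAXParser()
                      .parse( new InputSource( new StringReader( xml ) ), handler );
    }
    catch( Exception e )
    {
      throw new IllegalStateException( e );
    }
    return handler.spans;
  }

  public static void main( String[] args )
  {
    track( "<root>\n  <item\n    id=\"1\">x</item>\n</root>" ).forEach( System.out::println );
  }
}
```

For the sample in main(), the JDK's parser should report that the item tag starts on line 2 and ends on line 3. Comments and processing instructions are not tracked here, so a tag that follows one of those would get a stale starting line.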
Individual method Javadoc:
In startElement(), ...
    <a>           ← locator points here
      <b>
    </a>

    public void startElement( String uri, String localName, String qName, Attributes attributes )
When endElement() is called, locator does not point at the opening element.
    <a>
      <b>
    </a>          ← locator is pointing here when endElement() is called

    public void endElement( String uri, String localName, String qName )
Imagine element text consisting of "some other words":
    <a>
      some other words   ← locator points here
      <b />
    </a>

    public void characters( char[] ch, int start, int length )
Just as in characters() above, ...
    <a>           ← locator points here
      <b />
    </a>

    public void ignorableWhitespace( char[] ch, int start, int length )
By Russell Bateman, based on work by a Wenston Chen on Stack Overflow. This source code proposes an internal wrapper, Position (at the bottom of the source code), for SAX's Locator, to make the latter more accurate:
    package com.windofkeltia.sax;

    import org.xml.sax.Attributes;
    import org.xml.sax.Locator;
    import org.xml.sax.helpers.DefaultHandler;

    public class SampleHandler extends DefaultHandler
    {
      private Locator  locator;
      private Position position = new Position();  // starting element position we maintain

      /** The SAX parser will call this to update locator as needed. */
      public void setDocumentLocator( Locator location ) { locator = location; }

      /** (See Javadoc above.) */
      public void startElement( String uri, String localName, String qName, Attributes attributes )
      {
        // do startElement() work here...
      }

      /** (See Javadoc above.) */
      public void endElement( String uri, String localName, String qName )
      {
        // We could get our source position--at the end
        Position start = position;
        Position end   = new Position( locator.getLineNumber(), locator.getColumnNumber() );

        // do endElement() work here

        // update the starting point for the next element
        updateElementPoint( locator );
      }

      /** (See Javadoc above.) */
      public void characters( char[] ch, int start, int length )
      {
        updateElementPoint( locator );  // now update the starting point
      }

      /** (See Javadoc above.) */
      public void ignorableWhitespace( char[] ch, int start, int length )
      {
        updateElementPoint( locator );  // now update the starting point
      }

      private void updateElementPoint( Locator locator )
      {
        Position location = new Position( locator.getLineNumber(), locator.getColumnNumber() );
        if( position.compareTo( location ) < 0 )
          position = location;
      }

      /** Wrap and maintain the SAX locator to make it more accurate. */
      static class Position
      {
        private int line;
        private int column;

        public Position()                        { this.line = 1; this.column = 1; }
        public Position( int line, int column )  { this.line = line; this.column = column; }
        public void setLine( int line )          { this.line = line; }
        public void setColumn( int column )      { this.column = column; }

        public int getLine()   { return line; }
        public int getColumn() { return column; }

        public int compareTo( Position position )
        {
          // if our location is past recorded line...
          if( position.getLine() > getLine() )
            return -1;
          // if on recorded line, but past recorded column...
          else if( position.getLine() == getLine() && position.getColumn() > getColumn() )
            return -1;
          // if on recorded line and also at recorded column...
          else if( position.getLine() == getLine() && position.getColumn() == getColumn() )
            return 0;
          // we're before current line and/or current column...
          else
            return 1;
        }
      }

      private final ParserHandlerPrinter printer = new ParserHandlerPrinter( printerVerbosity );

      private static int printerVerbosity = 0;

      /** Use this from JUnit tests to set the level of debug verbosity. If not done, printer will be quiet. */
      public static void setPrinterVerbosity( int verbosity ) { printerVerbosity = verbosity; }
    }
It would be frustrating not to include a ready-made way of demonstrating both the success and behavior of what we're doing on this page. Of course, you'll want to engage this at level 3 to get line numbers out.
    package com.windofkeltia.sax;

    import java.util.HashMap;
    import java.util.Map;

    import org.xml.sax.Attributes;
    import org.xml.sax.Locator;

    import com.windofkeltia.utilities.StringUtilities;

    /**
     * Conceived as JUnit-only. Use static setPrinterVerbosity()
     * in whatever SAX parser handler consumes this to set the level
     * before instantiating the parser.
     */
    public class ParserHandlerPrinter
    {
      private final int verbosity;

      /** Quiet version of this utility--utters nothing. */
      public ParserHandlerPrinter() { verbosity = 0; }

      /**
       * Enable this utility at any of several levels:
       *
       *   0 - none (quiet mode)
       *   1 - minimal output
       *   2 - verbose output without line numbers
       *   3 - verbose output including line numbers
       */
      public ParserHandlerPrinter( int verbosity ) { this.verbosity = verbosity; }

      public void startElement( Locator locator, final String elementName, Attributes attributes )
      {
        StringBuilder sb = new StringBuilder();
        switch( verbosity )
        {
          case 3 :
            sb.append( getLineNumber( locator ) );
          case 2 :
          case 1 :
            sb.append( " <" ).append( elementName );
            if( attributes.getLength() > 0 )
              sb.append( javaAttributesAsString( attributes ) );
            sb.append( ">" );
            System.out.println( sb );
          case 0 :
            break;
        }
      }

      public void characters( Locator locator, final String characters )
      {
        if( characters.length() < 1 )
          return;

        StringBuilder sb = new StringBuilder();
        switch( verbosity )
        {
          case 3 :
            sb.append( getLineNumber( locator ) );
          case 2 :
          case 1 :
            sb.append( LINE_NUMBER_INDENT ).append( characters );
            System.out.println( sb );
          case 0 :
            break;
        }
      }

      public void endElement( Locator locator, final String elementName )
      {
        StringBuilder sb = new StringBuilder();
        switch( verbosity )
        {
          case 3 :
            sb.append( getLineNumber( locator ) );
          case 2 :
          case 1 :
            sb.append( " </" ).append( elementName ).append( ">" );
            System.out.println( sb );
          case 0 :
            break;
        }
      }

      public void endElement( Locator locator, final String elementName, StringBuilder text )
      {
        //noinspection DuplicatedCode
        StringBuilder sb = new StringBuilder();
        switch( verbosity )
        {
          case 3 :
            sb.append( getLineNumber( locator ) );
          case 2 :
          case 1 :
            sb.append( " </" ).append( elementName ).append( ">" );
            System.out.println( sb );
          case 0 :
            break;
        }
      }

      public void startDocument() { if( verbosity > 2 ) System.out.println( "[start of document]" ); }
      public void endDocument()   { if( verbosity > 2 ) System.out.println( "[end of document]" ); }

      private static int    LINE_NUMBER_PLACES = 3;
      private static String LINE_NUMBER_INDENT = " ";

      public static void setLineNumberPlaces( int places ) { LINE_NUMBER_PLACES = places; }
      public static void setLineNumberIndent( int indent )
      {
        StringBuilder sb = new StringBuilder();
        while( indent-- > 0 )
          sb.append( ' ' );
        LINE_NUMBER_INDENT = sb.toString();
      }

      private static String getLineNumber( Locator locator )
      {
        return StringUtilities.padStringLeft( locator.getLineNumber()+"", LINE_NUMBER_PLACES );
      }

      private static String javaAttributesAsString( Attributes attributes )
      {
        Map< String, String > javaAttributes = getAttributesAsJavaMap( attributes );

        if( javaAttributes.size() == 0 )
          return "";

        StringBuilder sb = new StringBuilder();

        for( Map.Entry< String, String > attribute : javaAttributes.entrySet() )
          sb.append( " " ).append( attribute.getKey() ).append( "=\"" ).append( attribute.getValue() ).append( "\"" );

        return sb.toString();
      }

      private static Map< String, String > getAttributesAsJavaMap( Attributes attributes )
      {
        int                   attributeLength = attributes.getLength();
        Map< String, String > javaAttributes  = new HashMap<>( attributeLength );

        for( int count = 0; count < attributeLength; count++ )
        {
          String attribute = attributes.getQName( count );
          String value     = attributes.getValue( count );
          javaAttributes.put( attribute, value );
        }

        return javaAttributes;
      }
    }
It would be just as frustrating not to include a JUnit test that drives the handler and the printer above:
    package com.windofkeltia.sax;

    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;

    import org.xml.sax.XMLReader;

    import org.junit.After;
    import org.junit.Before;
    import org.junit.Rule;
    import org.junit.Test;
    import org.junit.rules.TestName;

    public class SampleHandlerTest
    {
      private static final String CONTENT_PATH = TestUtilities.TEST_FODDER + "sample.xml";

      @Test
      public void test() throws Exception
      {
        SampleHandler.setPrinterVerbosity( 3 );

        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser        parser  = factory.newSAXParser();
        XMLReader        reader  = parser.getXMLReader();
        SampleHandler    handler = new SampleHandler();

        parser.parse( CONTENT_PATH, handler );
      }
    }
You'll observe that, if you have XML like the following where the element spans several lines with its element name plus attribute names and values,
    116 <medication startdate="202205180900"
    117             enddate="202205180900"
    118             dose="50"
    119             unit="mg">
    120   This was the patient's Vicodin.
    121 </medication>
...the medication element opening line number will be 119 and not 116 (in case that's what you expected). The closing element line number will be 121 as expected. The text content's line number will be 120.