XQuery, Saxon and BaseX Notes

Note: to pretty-print any XML that lacks newlines and indentation, try these commands:

Ken's notes

XQuery is an extremely well-defined, mature and powerful W3C standard that has many implementations and does exactly this kind of thing. You can do some really sophisticated stuff, but for simple XML extractions, the XQuery scripts are correspondingly simple.

XQuery uses standard XPath expressions to drill into XML structure, and it supports various boolean tests and string functions to get just what you need. You can define your own functions. You can output XML (with or without a header), partial XML-like content, or even plain text. I've used XQuery happily to "down-translate" selected information from XML dictionaries directly into non-XML source files that can be compiled by the Xerox lexc program, for example. And I'd use it again to down-translate from XML dictionaries into Kleene source code, though I haven't gotten around to it yet.

One implementation of XQuery is available in the Saxon library (which has free and commercial versions, for Java and .NET). XQuery is also built into every serious XML editor (I think), like oXygen. I'll redo your example using Saxon.

First, I have defined a little bash alias on OS X that allows me to invoke XQuery easily from the command line:

As you see, it invokes the net.sf.saxon.Query class in saxon9he.jar, the free "home edition" of the Saxon library, which you can easily download from sourceforge.net/projects/saxon/files/.

Here's a slightly simplified version of a patient database (residing in filename.xml):

<?xml version="1.0" encoding="UTF-8"?>
<ClinicalDocument>
  <recordTarget>
    <patientRole>
      <patient>
        <name use="L">
          <given>Amelia</given>
          <given>M</given>
          <family>Earhart</family>
        </name>
      </patient>
      <patient>
        <name use="L">
          <given>Mortimer</given>
          <given>J</given>
          <family>Snurd</family>
        </name>
      </patient>
      <patient>
        <name use="L">
          <given>Boone</given>
          <family>Pickens</family>
        </name>
      </patient>
      <patient>
        <name>
          <given>Ronald</given>
          <given>M</given>
          <family>Reagan</family>
        </name>
      </patient>
      <patient>
        <name use="L">
          <given>Alexander</given>
          <given>John</given>
          <family>Ellis</family>
        </name>
      </patient>
    </patientRole>
  </recordTarget>
</ClinicalDocument>

And here's a trivial XQuery script that opens the XML file, iterates through the <patient> elements (restricted to those with a <name> child element that has the use="L" attribute, using standard XPath notation) and prints out the full_names as elements. The script is called script.xq.

script.xq:

for $p in doc("filename.xml")/ClinicalDocument/recordTarget/patientRole/patient[name/@use="L"]
  return <full_name>{concat(data($p/name/family), ", ", data($p/name/given[1]), " ",
  data($p/name/given[2]))}</full_name>

There are, of course, many ways to express such an extraction in XQuery; this one is rather abbreviated. A simpler, but perhaps less efficient, version uses // to find <patient> elements at any depth, on the assumption that they always appear in the same context:

for $p in doc("filename.xml")//patient[name/@use="L"]
  return <full_name>{concat(data($p/name/family), ", ", data($p/name/given[1]), " ",
  data($p/name/given[2]))}</full_name>

<?xml version="1.0" encoding="UTF-8"?>
<full_name>Earhart, Amelia M</full_name>
<full_name>Snurd, Mortimer J</full_name>
<full_name>Pickens, Boone </full_name>
<full_name>Ellis, Alexander John</full_name>
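
As one more illustration of those many ways, here is a sketch (not one of Ken's scripts) that spells the same extraction out as a full FLWOR expression with let, where and order by clauses. The variable $n and the sort on the family name are additions for illustration only; because of the order by clause, the results should come out sorted by family name rather than in document order.

```xquery
(: Sketch only: same data and filter as script.xq, written as a full FLWOR.
   $n and the order by clause are illustrative additions. :)
for $p in doc("filename.xml")//patient
let $n := $p/name
where $n/@use = "L"
order by $n/family
return <full_name>{concat(data($n/family), ", ", data($n/given[1]), " ",
  data($n/given[2]))}</full_name>
```

The where clause does the same job as the [name/@use="L"] predicate in the path; which form reads better is largely a matter of taste.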

If you want a wrapper element to appear around the <full_name> elements, to make it at least well-formed XML, you can modify the script to something like this:

script.xq (enhanced):

<full_names>{
  for $p in doc("filename.xml")/ClinicalDocument/recordTarget/patientRole/patient[name/@use="L"]
    return <full_name>{concat(data($p/name/family), ", ", data($p/name/given[1]), " ",
    data($p/name/given[2]))}</full_name>
}</full_names>

<?xml version="1.0" encoding="UTF-8"?>
<full_names>
  <full_name>Earhart, Amelia M</full_name>
  <full_name>Snurd, Mortimer J</full_name>
  <full_name>Pickens, Boone </full_name>
  <full_name>Ellis, Alexander John</full_name>
</full_names>

You can also specify the XML file to process on the command line (Saxon's -s: option makes it the context item), rather than wiring it into the XQuery script, e.g. with the following script (script2.xq), which also suppresses the XML declaration and indents (pretty-prints) the output.

script2.xq:

declare option saxon:output "omit-xml-declaration=yes" ;
declare option saxon:output "indent=yes" ;

<full_names>{
  for $p in ./ClinicalDocument/recordTarget/patientRole/patient[name/@use="L"]
    return <full_name>{concat(data($p/name/family), ", ", data($p/name/given[1]), " ",
    data($p/name/given[2]))}</full_name>
}</full_names>

<full_names>
   <full_name>Earhart, Amelia M</full_name>
   <full_name>Snurd, Mortimer J</full_name>
   <full_name>Pickens, Boone </full_name>
   <full_name>Ellis, Alexander John</full_name>
</full_names>

There are many other options, but, as you can see, simple extractions need only correspondingly simple XQuery scripts.
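
Two of those options, sketched here under assumptions rather than taken from the original scripts, are a user-defined function and plain-text output. The helper local:full-name is a hypothetical name; it joins however many <given> elements exist, so a patient with a single given name gets no trailing space. The saxon:output option with method=text follows the same pattern as the options in script2.xq, and the outer string-join puts one name per line.

```xquery
(: Sketch only: emit plain text instead of XML. :)
declare option saxon:output "method=text";

(: Hypothetical helper, not part of the original scripts: formats a <name>
   element as "Family, Given Given ...", joining all <given> children. :)
declare function local:full-name($name as element(name)) as xs:string {
  concat($name/family, ", ", string-join($name/given, " "))
};

(: Build one string with a newline between names, so the serializer
   does not insert its own separators between items. :)
string-join(
  for $p in doc("filename.xml")//patient[name/@use="L"]
  return local:full-name($p/name),
  "&#10;")
```

With the sample filename.xml, this should print one line per matching patient in the form "Family, Given Given", with no XML markup.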

How XQuery works

It really started out as a database tool, an attempt to replace SQL with an XML-oriented query language. It has apparently never escaped that purpose or mindset. Therefore, somewhat surprisingly...

This is all in comparison with SAX, which is very fast (3), light (1) and free (2).

Useful XQuery links

Survey of XQuery tools

BaseX

One tutorial, http://www.learndb.com/databases/basex-tutorial-for-using-an-xml-native-database-management-system, called the BaseX software a DBMS.

In http://docs.basex.org/wiki/GUI, the XML file one is analyzing seems to be called "the database." I believe this is because a) XQuery was developed to query databases that speak XML and b) to keep with that paradigm, BaseX treats any XML content as if it were a database.

In the BaseX GUI, the place to type XPath expressions is in the upper left-hand window whose tab appears to be "file." This appears to be called the Editor window.
