There are numerous ways to parse XML in Java, but they are all based on one of
two technologies:

  • DOM
  • SAX

I’m not going to explain exactly what these two APIs do, since there are plenty
of articles on the subject. In a nutshell, DOM gives you a tree view of your
XML document, which you can then navigate by moving from one node to another,
while SAX is event-driven and calls your code whenever it encounters a tag.

Over the years, I have come to develop a strong liking for SAX despite its
apparent limitations, to the point where I haven’t needed to resort to DOM for
a long time. Here is why.

The thing I like most about SAX is that it lets you ignore all the portions of
your XML document that you don’t care about. This makes it trivial to pick out
only the information you are interested in, and it also makes it easier to
migrate your schema over time, should you decide to do so.

Consider the following XML document:

<person>
  <first-name value="Cedric"/>
  <last-name value="Beust"/>
</person>

Extracting the first and last names is straightforward:

  public void startElement(String uri, String localName, String qName,
      Attributes attributes)
    throws SAXException
  {
    String name = attributes.getValue("value");
    if ("first-name".equals(qName)) {
      System.out.println("First name:" + name);
    }
    else if ("last-name".equals(qName)) {
      System.out.println("Last name:" + name);
    }
  }

Note that the code above completely ignores the <person> tag and focuses
exclusively on the content we are interested in.  By the time this code runs
(it lives in a ContentHandler), the parser has already checked that everything
it has read so far is well-formed, and valid too if validation is turned on.
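
For readers who want to run this, here is a minimal sketch of how such a handler can be plugged into the JDK's SAX parser. The class name PersonNames and its parse method are made up for this illustration, the person document is inlined as a well-formed string, and the lines are collected in a list rather than printed from the callback so they are easy to inspect.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name, used only for this sketch.
class PersonNames {

  // Parses the XML string and returns one line per name tag found.
  static List<String> parse(String xml) throws Exception {
    List<String> lines = new ArrayList<>();

    // DefaultHandler implements ContentHandler with empty callbacks,
    // so we only override the one we care about.
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void startElement(String uri, String localName, String qName,
                               Attributes attributes) {
        String name = attributes.getValue("value");
        if ("first-name".equals(qName)) {
          lines.add("First name:" + name);
        } else if ("last-name".equals(qName)) {
          lines.add("Last name:" + name);
        }
        // Any other tag, <person> included, is silently ignored.
      }
    };

    SAXParserFactory.newInstance().newSAXParser()
        .parse(new InputSource(new StringReader(xml)), handler);
    return lines;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<person>"
        + "<first-name value=\"Cedric\"/>"
        + "<last-name value=\"Beust\"/>"
        + "</person>";
    parse(xml).forEach(System.out::println);
  }
}
```

The factory is left in its default, non-namespace-aware mode, which is why comparing against qName works here.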

Of course, this code won’t work if the same tags appear several times in the
document:

<project name="TestNG">
  <members>
    <person>
      <first-name value="Cedric"/>
      <last-name value="Beust"/>
    </person>
    <person>
      <first-name value="Alexandru"/>
      <last-name value="Popescu"/>
    </person>
  </members>
</project>

or, even more tricky, if these tags have different parents:

<project name="TestNG">
  <members>
    <vampire-slayer>
      <first-name value="Buffy"/>
      <last-name value="Sommers"/>
    </vampire-slayer>
    <vampire>
      <first-name value="Angel"/>
      <last-name value="Angelus"/>
    </vampire>
  </members>
</project>

A typical way to solve this is to keep track of the parent tag:

private VampireSlayer m_vampireSlayer = null;
private Vampire m_vampire = null;

  public void startElement(String uri, String localName, String qName,
      Attributes attributes)
    throws SAXException
  {
    String name = attributes.getValue("value");
    if ("vampire-slayer".equals(qName)) {
      m_vampireSlayer = new VampireSlayer();
    }
    else if ("first-name".equals(qName)) {
      if (null != m_vampireSlayer) {
        m_vampireSlayer.setFirstName(name);
      }
      else if (null != m_vampire) {
        m_vampire.setFirstName(name);
      }
    }
// …

Don’t forget to "pop out the context" when you exit the tag:

  public void endElement(String uri, String localName, String qName)
    throws SAXException
  {
    if ("vampire".equals(qName)) {
      // store the vampire somewhere, then
      m_vampire = null;
    }
    else if ("vampire-slayer".equals(qName)) {
      // store the vampire slayer somewhere, then
      m_vampireSlayer = null;
    }
  }

However, the problem with this approach is that the business logic attached
to a given tag is now scattered across two different places, which makes the
code hard to maintain.  So I have adopted the following rule:  whenever I
need to run code both at the start and at the end of a tag, I move the business
logic into a method that takes a boolean indicating whether we are opening or
closing the tag:

  public void startElement(String uri, String localName, String qName,
      Attributes attributes)
    throws SAXException
  {
    String name = attributes.getValue("value");
    if ("vampire-slayer".equals(qName)) {
      xmlVampireSlayer(true /* start */);
    }
// …

  public void endElement(String uri, String localName, String qName)
    throws SAXException
  {
    if("vampire-slayer".equals(qName)) {
      xmlVampireSlayer(false /* start */);
    }
// …

  /**
   * @param start If true, we are looking at an opening tag (e.g. <foo>),
   * otherwise, we are looking at a closing tag (</foo>)
   */
  private void xmlVampireSlayer(boolean start) {
    if (start) {
      m_vampireSlayer = new VampireSlayer();
    }
    else {
      // store the vampire slayer somewhere, then
      m_vampireSlayer = null;
    }
  }

And now we have the best of both worlds: code that is not only easier to read
but also quite robust in the face of schema changes.
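
To see the pieces working together, here is a minimal, self-contained sketch of a handler built around this rule. Several things in it are assumptions made for the sake of a runnable example: the class name SlayerHandler, the slayers list used as the "somewhere" to store results, the slayerFirstNames helper, and a stripped-down VampireSlayer with a single field.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler name, used only for this sketch.
class SlayerHandler extends DefaultHandler {

  // Stand-in for the VampireSlayer class, reduced to one field.
  static class VampireSlayer {
    String firstName;
  }

  // Assumed storage for completed slayers.
  final List<VampireSlayer> slayers = new ArrayList<>();

  // Non-null only while we are inside a <vampire-slayer> tag.
  private VampireSlayer m_vampireSlayer = null;

  @Override
  public void startElement(String uri, String localName, String qName,
                           Attributes attributes) {
    if ("vampire-slayer".equals(qName)) {
      xmlVampireSlayer(true /* start */);
    } else if ("first-name".equals(qName) && m_vampireSlayer != null) {
      m_vampireSlayer.firstName = attributes.getValue("value");
    }
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    if ("vampire-slayer".equals(qName)) {
      xmlVampireSlayer(false /* start */);
    }
  }

  // All the logic for <vampire-slayer> lives here, for both open and close.
  private void xmlVampireSlayer(boolean start) {
    if (start) {
      m_vampireSlayer = new VampireSlayer();
    } else {
      slayers.add(m_vampireSlayer);
      m_vampireSlayer = null;
    }
  }

  // Convenience helper: parse a string and return the slayers' first names.
  static List<String> slayerFirstNames(String xml) throws Exception {
    SlayerHandler handler = new SlayerHandler();
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new InputSource(new StringReader(xml)), handler);
    List<String> names = new ArrayList<>();
    for (VampireSlayer slayer : handler.slayers) {
      names.add(slayer.firstName);
    }
    return names;
  }
}
```

Fed a document containing both a <vampire-slayer> and a <vampire>, this handler only records the slayer: the vampire's <first-name> arrives while m_vampireSlayer is null, so it is skipped.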

Now, imagine a more complex situation where your XML file can have tags nested
six or seven levels deep.  One day, you need to add a new tag.  With DOM, you
would have to locate the code that is walking this particular area of the tree,
and even with typed tree-based solutions such as XMLBeans, locating and
modifying code is never easy.

With SAX, all you need to do is two things:

  • See if the name of this tag is unique within your file (if not, you will
    need to disambiguate it with the context approach shown above).
  • Implement the method xmlTagName(boolean start) and put the handling for
    that tag inside it.
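
When a tag name is not unique, one variation on the context approach is to keep a stack of the currently open tags instead of one field per parent. This is not the technique used earlier in this post, just a sketch of a possible generalization; the class name ParentAwareHandler, the found list, and the "parent:value" output format are all invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler name, used only for this sketch.
class ParentAwareHandler extends DefaultHandler {

  // Names of the tags that are currently open, innermost on top.
  private final Deque<String> m_openTags = new ArrayDeque<>();

  // Collected results, formatted as "parent:value".
  final List<String> found = new ArrayList<>();

  @Override
  public void startElement(String uri, String localName, String qName,
                           Attributes attributes) {
    // peek() returns null when the stack is empty, i.e. for the root tag.
    String parent = m_openTags.peek();
    if ("first-name".equals(qName)) {
      found.add(parent + ":" + attributes.getValue("value"));
    }
    m_openTags.push(qName);
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    // Pop out the context when we exit the tag.
    m_openTags.pop();
  }

  static List<String> parse(String xml) throws Exception {
    ParentAwareHandler handler = new ParentAwareHandler();
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new InputSource(new StringReader(xml)), handler);
    return handler.found;
  }
}
```

The trade-off is that the parent is now known only as a string, so this suits cases where you need to tell identical tags apart more than cases where you are building an object per parent.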

How about you?  Do you prefer DOM over SAX?  Have you encountered
situations where DOM was a much better fit than SAX?