Streaming Parsers in XML

The DOM parser reads an XML document in its entirety into a tree data structure. For most practical applications, DOM works fine. However, it can be inefficient if the document is large and if your processing algorithm is simple enough that you can analyze nodes on the fly, without having to see all of the tree structure. In these cases, you should use a streaming parser.

In the following sections, we discuss the streaming parsers supplied by the Java library: the venerable SAX parser and the more modern StAX parser that was added in Java 6. The SAX parser uses event callbacks, and the StAX parser provides an iterator through the parsing events. The latter is usually a bit more convenient.

1. Using the SAX Parser

The SAX parser reports events as it parses the components of the XML input, but it does not store the document in any way—it is up to the event handlers to build a data structure. In fact, the DOM parser is built on top of the SAX parser. It builds the DOM tree as it receives the parser events.

Whenever you use a SAX parser, you need a handler that defines the event actions for the various parse events. The ContentHandler interface defines several callback methods that the parser executes as it parses the document. Here are the most important ones:

  • startElement and endElement are called each time a start tag or end tag is encountered.
  • characters is called whenever character data are encountered.
  • startDocument and endDocument are called once each, at the start and the end of the document.

For example, when parsing the fragment

<font>
   <name>Helvetica</name>
   <size units="pt">36</size>
</font>

the parser makes the following callbacks:

  1. startElement, element name: font
  2. startElement, element name: name
  3. characters, content: Helvetica
  4. endElement, element name: name
  5. startElement, element name: size, attributes: units="pt"
  6. characters, content: 36
  7. endElement, element name: size
  8. endElement, element name: font
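The callback sequence above can be reproduced with a small handler that records every event it receives. This is only a sketch: the class name SaxTrace and the method trace are made up for illustration, and the handler deliberately skips whitespace-only character data so that the log matches the list above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original text
public class SaxTrace {
    /** Parses the XML string and returns one log line per callback. */
    public static List<String> trace(String xml) throws Exception {
        List<String> log = new ArrayList<>();
        var handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String lname, String qname,
                    Attributes attrs) {
                StringBuilder line = new StringBuilder("startElement " + qname);
                for (int i = 0; i < attrs.getLength(); i++)
                    line.append(" ").append(attrs.getQName(i))
                        .append("=").append(attrs.getValue(i));
                log.add(line.toString());
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Ignore whitespace-only runs between tags. A parser may also
                // split longer text across several calls; this sketch does
                // not merge such pieces.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("characters " + text);
            }

            @Override
            public void endElement(String uri, String lname, String qname) {
                log.add("endElement " + qname);
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return log;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<font>\n  <name>Helvetica</name>\n"
            + "  <size units=\"pt\">36</size>\n</font>";
        trace(xml).forEach(System.out::println);
    }
}
```

Without the whitespace filter in characters, the log would also contain entries for the line breaks and indentation between the tags.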

Your handler needs to override these methods and have them carry out whatever action you want as you parse the file. The program at the end of this section prints all links <a href="..."> in an HTML file. It simply overrides the startElement method of the handler to check for elements with name a that have an attribute with name href. This is potentially useful for implementing a “web crawler,” a program that reaches more and more web pages by following links.

The sample program is a good example for the use of SAX. We don’t care at all in which context the a elements occur, and there is no need to store a tree structure.

Here is how you get a SAX parser:

SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();

You can now process a document:

parser.parse(source, handler);

Here, source can be a file, URL string, or input stream. The handler belongs to a subclass of DefaultHandler. The DefaultHandler class defines do-nothing methods for the four interfaces:

ContentHandler

DTDHandler

EntityResolver

ErrorHandler

The example program defines a handler that overrides the startElement method of the ContentHandler interface to watch out for a elements with an href attribute:

var handler = new DefaultHandler()
   {
      public void startElement(String namespaceURI, String lname, String qname,
            Attributes attrs) throws SAXException
      {
         if (lname.equalsIgnoreCase("a") && attrs != null)
         {
            for (int i = 0; i < attrs.getLength(); i++)
            {
               String aname = attrs.getLocalName(i);
               if (aname.equalsIgnoreCase("href"))
                  System.out.println(attrs.getValue(i));
            }
         }
      }
   };

The startElement method has three parameters that describe the element name. The qname parameter reports the qualified name of the form prefix:localname. If namespace processing is turned on, then the namespaceURI and lname parameters provide the namespace and local (unqualified) name.

As with the DOM parser, namespace processing is turned off by default. To activate namespace processing, call the setNamespaceAware method of the factory class:

SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);
SAXParser saxParser = factory.newSAXParser();
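To see what the three name parameters contain in each mode, the following sketch parses the same document twice, once with and once without namespace processing, and reports the values passed to startElement for the root element. The class NamespaceDemo and the method describeRoot are hypothetical names chosen for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original text
public class NamespaceDemo {
    /** Returns "namespaceURI|lname|qname" as reported for the root element. */
    public static String describeRoot(String xml, boolean namespaceAware)
            throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(namespaceAware);
        String[] result = new String[1];
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override
                public void startElement(String namespaceURI, String lname,
                        String qname, Attributes attrs) {
                    // Record only the first element, which is the root
                    if (result[0] == null)
                        result[0] = namespaceURI + "|" + lname + "|" + qname;
                }
            });
        return result[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<x:a xmlns:x=\"http://www.w3.org/1999/xhtml\"/>";
        System.out.println("aware:     " + describeRoot(xml, true));
        System.out.println("not aware: " + describeRoot(xml, false));
    }
}
```

With namespace processing turned on, all three values are filled in. With it turned off, only the qualified name is reliable, which is why a handler that tests lname (as the one above does) depends on the factory being namespace-aware.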

In this program, we cope with another common issue. An XHTML file starts with a DOCTYPE declaration that contains a DTD reference, and the parser will want to load it. Understandably, the W3C isn’t too happy to serve billions of copies of files such as www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd. At one point, they refused altogether, but at the time of this writing, they serve the DTD at a glacial pace. If you don’t need to validate the document, just call

factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
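Putting the pieces together, here is a minimal sketch of a link collector that combines namespace processing, the feature setting above, and a startElement override. It is not the book's listing: the class SaxLinks, the method links, and the sample document are assumptions made for illustration. The feature URI belongs to the Apache Xerces parser, which is also the basis of the parser bundled with the JDK; other SAX implementations may reject it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not the book's listing
public class SaxLinks {
    /** Returns the href values of all a elements, in document order. */
    public static List<String> links(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        // Do not fetch the DTD named in the DOCTYPE declaration
        factory.setFeature(
            "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        List<String> hrefs = new ArrayList<>();
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)),
            new DefaultHandler() {
                @Override
                public void startElement(String namespaceURI, String lname,
                        String qname, Attributes attrs) {
                    if (lname.equalsIgnoreCase("a")) {
                        String href = attrs.getValue("href"); // null if absent
                        if (href != null) hrefs.add(href);
                    }
                }
            });
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        String page = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><p>"
            + "<a href=\"https://example.com/a\">A</a> <a name=\"x\">anchor</a> "
            + "<a href=\"/b\">B</a></p></body></html>";
        links(page).forEach(System.out::println);
    }
}
```

Because the DTD is not loaded, entity references that it would define (such as &nbsp;) are not available, so this approach works only for documents that do not rely on them.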

Listing 3.7 contains the code for the web crawler program. Later in this chapter, you will see another interesting use of SAX. An easy way of turning a non-XML data source into XML is to report the SAX events that an XML parser would report. See Section 3.9, “XSL Transformations,” on p. 216 for details.

2. Using the StAX Parser

The StAX parser is a “pull parser.” Instead of installing an event handler, you simply iterate through the events, using this basic loop:

InputStream in = url.openStream();
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLStreamReader parser = factory.createXMLStreamReader(in);
while (parser.hasNext())
{
   int event = parser.next();
   // Call parser methods to obtain event details
}

For example, when parsing the fragment

<font>
   <name>Helvetica</name>
   <size units="pt">36</size>
</font>

the parser yields the following events:

  1. START_ELEMENT, element name: font
  2. CHARACTERS, content: white space
  3. START_ELEMENT, element name: name
  4. CHARACTERS, content: Helvetica
  5. END_ELEMENT, element name: name
  6. CHARACTERS, content: white space
  7. START_ELEMENT, element name: size
  8. CHARACTERS, content: 36
  9. END_ELEMENT, element name: size
  10. CHARACTERS, content: white space
  11. END_ELEMENT, element name: font
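The event sequence above can be checked with a short program that runs the basic loop and records each event. This is a sketch under a few assumptions: the class StaxTrace and the method trace are made-up names, and the factory is set to coalesce adjacent character data so that each text run arrives as a single CHARACTERS event.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of the original text
public class StaxTrace {
    /** Pulls all events from the XML string and returns one line per event. */
    public static List<String> trace(String xml) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Merge adjacent character data into a single CHARACTERS event
        factory.setProperty(XMLInputFactory.IS_COALESCING, true);
        XMLStreamReader parser =
            factory.createXMLStreamReader(new StringReader(xml));
        List<String> log = new ArrayList<>();
        while (parser.hasNext()) {
            switch (parser.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    log.add("START_ELEMENT " + parser.getLocalName());
                    break;
                case XMLStreamConstants.CHARACTERS:
                    log.add(parser.isWhiteSpace()
                        ? "CHARACTERS white space"
                        : "CHARACTERS " + parser.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    log.add("END_ELEMENT " + parser.getLocalName());
                    break;
                default:
                    break; // other events, such as END_DOCUMENT, are ignored
            }
        }
        parser.close();
        return log;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<font>\n  <name>Helvetica</name>\n"
            + "  <size units=\"pt\">36</size>\n</font>";
        trace(xml).forEach(System.out::println);
    }
}
```

Unlike a SAX handler, the loop decides which events to look at, so skipping the white space events is simply a matter of not handling them.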

To analyze the attribute values, call the appropriate methods of the XMLStreamReader class. For example,

String units = parser.getAttributeValue(null, "units");

gets the units attribute of the current element.

By default, namespace processing is enabled. You can deactivate it by modifying the factory:

XMLInputFactory factory = XMLInputFactory.newInstance();
factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, false);

Listing 3.8 contains the code for the web crawler program implemented with the StAX parser. As you can see, the code is simpler than the equivalent SAX code because you don’t have to worry about event handling.
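As a rough idea of what such a program looks like, the following sketch collects the href attributes of all a elements with a pull loop. It is not the book's listing: the class StaxLinks and the method links are hypothetical, and the input is read from a string rather than a URL. The sample document has no DOCTYPE declaration, so the DTD loading issue discussed for SAX does not arise here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not the book's listing
public class StaxLinks {
    /** Returns the href values of all a elements, in document order. */
    public static List<String> links(String xml) throws Exception {
        XMLStreamReader parser = XMLInputFactory.newInstance()
            .createXMLStreamReader(new StringReader(xml));
        List<String> hrefs = new ArrayList<>();
        while (parser.hasNext()) {
            int event = parser.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && parser.getLocalName().equalsIgnoreCase("a")) {
                // A null namespace URI means the namespace is not checked
                String href = parser.getAttributeValue(null, "href");
                if (href != null) hrefs.add(href);
            }
        }
        parser.close();
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
            + "<a href=\"one.html\">1</a><a name=\"top\"/>"
            + "<a href=\"two.html\">2</a></body></html>";
        links(page).forEach(System.out::println);
    }
}
```

The whole task fits in a single loop with no handler class, which is the main convenience of the pull model.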

Source: Horstmann Cay S. (2019), Core Java. Volume II – Advanced Features, Pearson; 11th edition.
