Parsing an XML Document in Java

To process an XML document, you need to parse it. A parser is a program that reads a file, confirms that the file has the correct format, breaks it up into the constituent elements, and lets a programmer access those elements. The Java library supplies two kinds of XML parsers:

Tree parsers, such as a Document Object Model (DOM) parser, that read an XML document into a tree structure.
Streaming parsers, such as a Simple API for XML (SAX) parser, that generate events as they read an XML document.

The DOM parser interface is standardized by the World Wide Web Consortium (W3C). The org.w3c.dom package contains the definitions of interface types such as Document and Element. Different suppliers, such as the Apache Organization and IBM, have written DOM parsers whose classes implement these interfaces. The Java API for XML Processing (JAXP) library actually makes it possible to plug in any of these parsers. But the JDK also comes with a DOM parser that is derived from the Apache parser.

To read an XML document, you need a DocumentBuilder object that you get from a DocumentBuilderFactory like this:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();

You can now read a document from a file:

File f = . . .;
Document doc = builder.parse(f);

Alternatively, you can use a URL:

URL u = . . .;
Document doc = builder.parse(u.toString());

You can even specify an arbitrary input stream:

InputStream in = . . .;
Document doc = builder.parse(in);
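The three ways of supplying a document can be tried side by side. The following is a minimal sketch, not code from the book: the class name ParseSourcesDemo, the sample document, and the temporary file are assumptions made for illustration. It parses the same content from a File, from a URI string, and from an InputStream, and reports the root tag name each time.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

class ParseSourcesDemo {
    // Hypothetical sample document, used only for this illustration.
    static final String XML =
        "<?xml version=\"1.0\"?><font><name>Helvetica</name></font>";

    /** Parses the same document from a file, a URI string, and a stream. */
    static String[] rootNames() {
        byte[] bytes = XML.getBytes(StandardCharsets.UTF_8);
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Path path = Files.createTempFile("font", ".xml");
            try {
                Files.write(path, bytes);
                File f = path.toFile();

                Document fromFile = builder.parse(f);                    // parse(File)
                Document fromUri = builder.parse(f.toURI().toString());  // parse(String uri)
                Document fromStream;
                try (InputStream in = new ByteArrayInputStream(bytes)) {
                    fromStream = builder.parse(in);                      // parse(InputStream)
                }
                return new String[] {
                    fromFile.getDocumentElement().getTagName(),
                    fromUri.getDocumentElement().getTagName(),
                    fromStream.getDocumentElement().getTagName()
                };
            } finally {
                Files.deleteIfExists(path); // clean up the temporary file
            }
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        for (String name : rootNames()) {
            System.out.println(name);
        }
    }
}
```

All three calls should yield the same tree, so the choice mostly depends on where the XML already lives.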

A Document object is an in-memory representation of the tree structure of an XML document. It is composed of objects whose classes implement the Node interface and its various subinterfaces. Figure 3.1 shows the inheritance hierarchy of the subinterfaces.

Start analyzing the contents of a document by calling the getDocumentElement method. It returns the root element.

Element root = doc.getDocumentElement();

For example, if you are processing a document

<?xml version="1.0"?>
<font>
   …
</font>

then calling getDocumentElement returns the font element.

The getTagName method returns the tag name of an element. In the preceding example, root.getTagName() returns the string “font”.

To get an element’s children (which may be subelements, text, comments, or other nodes), use the getChildNodes method. That method returns a collection of type NodeList. That type was standardized before the standard Java collections, so it has a different access protocol. The item method gets the item with a given index, and the getLength method gives the total count of the items. You can enumerate all children like this:

NodeList children = root.getChildNodes();
for (int i = 0; i < children.getLength(); i++)
{
   Node child = children.item(i);
   …
}

Be careful when analyzing children. Suppose, for example, that you are processing the document

<font>
   <name>Helvetica</name>
   <size>36</size>
</font>

You would expect the font element to have two children, but the parser reports five:

The whitespace between <font> and <name>
The name element
The whitespace between </name> and <size>
The size element
The whitespace between </size> and </font>

Figure 3.2 shows the DOM tree.
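To see the five children for yourself, you can list them. The sketch below is an illustration under stated assumptions: the class name ChildCountDemo is hypothetical, and the sample document is a font element with name and size subelements on separate lines, matching the five children described above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ChildCountDemo {
    // Assumed sample document: two subelements separated by line breaks.
    static final String XML =
        "<font>\n"
        + "   <name>Helvetica</name>\n"
        + "   <size>36</size>\n"
        + "</font>";

    /** Returns one short description per child of the root element. */
    static List<String> describeChildren(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            List<String> result = new ArrayList<>();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child instanceof Element) {
                    result.add("element " + ((Element) child).getTagName());
                } else if (child instanceof Text
                        && ((Text) child).getData().trim().isEmpty()) {
                    result.add("whitespace");
                } else {
                    result.add(child.getNodeName());
                }
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Prints [whitespace, element name, whitespace, element size, whitespace]
        System.out.println(describeChildren(XML));
    }
}
```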

If you expect only subelements, you can ignore the whitespace:

for (int i = 0; i < children.getLength(); i++)
{
   Node child = children.item(i);
   if (child instanceof Element)
   {
      var childElement = (Element) child;
      …
   }
}

Now you look at only two elements, with tag names name and size.

As you will see in the next section, you can do even better if your document has a DTD. Then the parser knows which elements don’t have text nodes as children, and it can suppress the whitespace for you.
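As a preview of that idea, here is a hedged sketch. It assumes an internal DTD subset written for this example (the element declarations are hypothetical) and turns on validation together with the factory's setIgnoringElementContentWhitespace setting. The class name IgnoreWhitespaceDemo is likewise an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class IgnoreWhitespaceDemo {
    // Assumed sample: the DTD says that font contains only elements.
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE font [\n"
        + "   <!ELEMENT font (name,size)>\n"
        + "   <!ELEMENT name (#PCDATA)>\n"
        + "   <!ELEMENT size (#PCDATA)>\n"
        + "]>\n"
        + "<font>\n"
        + "   <name>Helvetica</name>\n"
        + "   <size>36</size>\n"
        + "</font>";

    /** Counts the root's children, with or without DTD-based whitespace removal. */
    static int countChildren(boolean useDtd) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Whitespace suppression relies on the content model,
            // so validation is switched on together with it.
            factory.setValidating(useDtd);
            factory.setIgnoringElementContentWhitespace(useDtd);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(XML)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Without DTD support: " + countChildren(false));
        System.out.println("With DTD support: " + countChildren(true));
    }
}
```

With the JDK's built-in parser, the first count should include the whitespace nodes and the second should cover only the name and size elements.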

When analyzing the name and size elements, you want to retrieve the text strings that they contain. Those text strings are themselves contained in child nodes of type Text. You know that these Text nodes are the only children, so you can use the getFirstChild method without having to traverse another NodeList. Then, use the getData method to retrieve the string stored in a Text node:

for (int i = 0; i < children.getLength(); i++)
{
   Node child = children.item(i);
   if (child instanceof Element)
   {
      var childElement = (Element) child;
      var textNode = (Text) childElement.getFirstChild();
      String text = textNode.getData().trim();
      if (childElement.getTagName().equals("name"))
         name = text;
      else if (childElement.getTagName().equals("size"))
         size = Integer.parseInt(text);
   }
}
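The fragment above leaves out the declarations of name and size and the parsing step. A complete version might look like the following sketch, in which the class name FontReaderDemo and the sample document are assumptions for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class FontReaderDemo {
    // Assumed sample document with a name and a size subelement.
    static final String XML =
        "<font>\n"
        + "   <name>Helvetica</name>\n"
        + "   <size>36</size>\n"
        + "</font>";

    /** Reads the name and size subelements and returns them as "name size". */
    static String describeFont(String xml) {
        String name = null;
        int size = 0;
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child instanceof Element) {   // skips whitespace text nodes
                    Element childElement = (Element) child;
                    // Assumes that the only child of the element is a Text node.
                    Text textNode = (Text) childElement.getFirstChild();
                    String text = textNode.getData().trim();
                    if (childElement.getTagName().equals("name")) {
                        name = text;
                    } else if (childElement.getTagName().equals("size")) {
                        size = Integer.parseInt(text);
                    }
                }
            }
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return name + " " + size;
    }

    public static void main(String[] args) {
        System.out.println(describeFont(XML)); // prints "Helvetica 36"
    }
}
```

Note that getFirstChild returns null for an empty element such as <name/>, so code that cannot rule out empty elements should check for null before calling getData.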

You can also get the last child with the getLastChild method, and the next sibling of a node with getNextSibling. Therefore, another way of traversing a node's children is

for (Node childNode = element.getFirstChild();
   childNode != null;
   childNode = childNode.getNextSibling())
{
   …
}
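Here is the sibling-based traversal as a runnable sketch. The class name SiblingWalkDemo and the sample document are assumptions for illustration. The loop collects the tag names of all element children and skips everything else.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class SiblingWalkDemo {
    // Assumed sample document.
    static final String XML =
        "<font>\n"
        + "   <name>Helvetica</name>\n"
        + "   <size>36</size>\n"
        + "</font>";

    /** Collects the tag names of the root's element children. */
    static List<String> elementNames(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            Element element = doc.getDocumentElement();
            List<String> names = new ArrayList<>();
            // Walk from the first child to the last by following sibling links.
            for (Node childNode = element.getFirstChild();
                    childNode != null;
                    childNode = childNode.getNextSibling()) {
                if (childNode instanceof Element) {
                    names.add(((Element) childNode).getTagName());
                }
            }
            return names;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(elementNames(XML)); // prints [name, size]
    }
}
```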

To enumerate the attributes of a node, call the getAttributes method. It returns a NamedNodeMap object that contains Node objects describing the attributes. You can traverse the nodes in a NamedNodeMap the same way as in a NodeList. Then, call the getNodeName and getNodeValue methods to get the attribute names and values.

NamedNodeMap attributes = element.getAttributes();
for (int i = 0; i < attributes.getLength(); i++)
{
   Node attribute = attributes.item(i);
   String name = attribute.getNodeName();
   String value = attribute.getNodeValue();
   …
}

Alternatively, if you know the name of an attribute, you can retrieve the corresponding value directly:

String unit = element.getAttribute("unit");
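Both approaches are shown together in the following sketch. The class name AttributeDemo, the sample element, and its scale attribute are assumptions for illustration. Since a NamedNodeMap does not promise any particular order, the sketch copies the attributes into a sorted map before printing them.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class AttributeDemo {
    // Assumed sample element with two attributes.
    static final String XML = "<size unit=\"pt\" scale=\"1.5\">36</size>";

    static Element root(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Enumerates all attributes of the root element, sorted by name. */
    static Map<String, String> attributes(String xml) {
        NamedNodeMap attributes = root(xml).getAttributes();
        Map<String, String> result = new TreeMap<>();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node attribute = attributes.item(i);
            result.put(attribute.getNodeName(), attribute.getNodeValue());
        }
        return result;
    }

    /** Looks up a single attribute of the root element by name. */
    static String attribute(String xml, String name) {
        return root(xml).getAttribute(name);
    }

    public static void main(String[] args) {
        System.out.println(attributes(XML));        // prints {scale=1.5, unit=pt}
        System.out.println(attribute(XML, "unit")); // prints pt
    }
}
```

If the element has no attribute with the given name and no default value is declared, getAttribute returns an empty string rather than null.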

You have now seen how to analyze a DOM tree. The program in Listing 3.1 puts these techniques to work by converting an XML document to JSON format.

The tree display shows how child elements are surrounded by text containing whitespace and by comments. Newline and return characters appear as \n.

You don’t have to be familiar with JSON to understand how the program works with the DOM tree. Simply observe the following:

We use a DocumentBuilder to read a Document from a file.
For each element, we print the tag name, attributes, and elements.
For character data, we produce a string with the data. If the data comes from a comment, we add a “Comment: ” prefix.
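The three observations above can be sketched as a much smaller walk over the DOM tree. This is not Listing 3.1, and it does not produce JSON. It is a simplified illustration with a hypothetical class name (DomDumpDemo) and sample document, and it writes one indented line per node instead.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.CharacterData;
import org.w3c.dom.Comment;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DomDumpDemo {
    // Assumed sample document with a comment and one subelement.
    static final String XML = "<font><!--demo--><name>Helvetica</name></font>";

    /** Parses the XML and returns an indented, line-per-node description. */
    static String dump(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            dump(doc.getDocumentElement(), "", out);
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    private static void dump(Node node, String indent, StringBuilder out) {
        if (node instanceof Element) {
            // Elements: tag name, then attributes, then children (recursively).
            Element element = (Element) node;
            out.append(indent).append(element.getTagName());
            NamedNodeMap attributes = element.getAttributes();
            for (int i = 0; i < attributes.getLength(); i++) {
                Node attribute = attributes.item(i);
                out.append(' ').append(attribute.getNodeName())
                   .append('=').append(attribute.getNodeValue());
            }
            out.append('\n');
            for (Node child = element.getFirstChild();
                    child != null;
                    child = child.getNextSibling()) {
                dump(child, indent + "  ", out);
            }
        } else if (node instanceof CharacterData) {
            // Text and comments: quoted data with line breaks made visible.
            String data = ((CharacterData) node).getData()
                .replace("\n", "\\n").replace("\r", "\\r");
            String prefix = node instanceof Comment ? "Comment: " : "";
            out.append(indent).append('"').append(prefix).append(data)
               .append('"').append('\n');
        }
    }

    public static void main(String[] args) {
        System.out.print(dump(XML));
    }
}
```

Producing real JSON would additionally require escaping quotation marks and other special characters in the data, which this sketch leaves out.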

Source: Horstmann Cay S. (2019), Core Java. Volume II – Advanced Features, Pearson; 11th edition.
