Validating XML Documents in Java

In the previous section, you saw how to traverse the tree structure of a DOM document. However, with that approach, you’ll have to do quite a bit of tedious programming and error checking. It’s not just having to deal with whitespace between elements; you will also need to check whether the document contains the nodes that you expect. For example, suppose you are reading this element:

<font>
   <name>Helvetica</name>
   <size>36</size>
</font>

You get the first child. Oops . . . it is a text node containing whitespace “\n “. You skip text nodes and find the first element node. Then, you need to check that its tag name is “name” and that it has one child node of type Text. You move on to the next nonwhitespace child and make the same check. What if the author of the document switched the order of the children or added another child element? It is tedious to code all this error checking—but reckless to skip the checks.
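The manual checking described above might look like the following sketch. The class name `ManualFontReader` and the in-memory test document are assumptions made for illustration; they do not come from the original text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

class ManualFontReader {
    /** Returns {name, size}, checking the structure of the font element by hand. */
    public static String[] readFont(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        String[] expected = { "name", "size" };
        String[] result = new String[expected.length];
        int found = 0;
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (!(child instanceof Element)) continue; // skip whitespace text nodes and comments
            Element e = (Element) child;
            // The children must appear in the expected order, with no extras
            if (found >= expected.length || !e.getTagName().equals(expected[found]))
                throw new IllegalArgumentException("Unexpected element: " + e.getTagName());
            // Each child must contain exactly one Text node
            if (e.getChildNodes().getLength() != 1 || !(e.getFirstChild() instanceof Text))
                throw new IllegalArgumentException(e.getTagName() + " must contain only text");
            result[found++] = ((Text) e.getFirstChild()).getData().trim();
        }
        if (found != expected.length)
            throw new IllegalArgumentException("Missing child element");
        return result;
    }

    public static void main(String[] args) throws Exception {
        String[] font = readFont("<font>\n  <name>Helvetica</name>\n  <size>36</size>\n</font>");
        System.out.println(font[0] + " " + font[1]);
    }
}
```

Even for two children, most of the method is structure checking rather than data extraction.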

Fortunately, one of the major benefits of an XML parser is that it can automatically verify that a document has the correct structure. That makes parsing much simpler. For example, if you know that the font fragment has passed validation, you can simply get the two grandchildren, cast them as Text nodes, and get the text data, without any further checking.

To specify the document structure, you can supply a DTD or an XML Schema definition. A DTD or schema contains rules that explain how a document should be formed, by specifying the legal child elements and attributes for each element. For example, a DTD might contain a rule:

<!ELEMENT font (name,size)>

This rule expresses that a font element must always have two children, which are name and size elements. The XML Schema language expresses the same constraint as

<xsd:element name="font">
   <xsd:complexType>
      <xsd:sequence>
         <xsd:element name="name" type="xsd:string"/>
         <xsd:element name="size" type="xsd:int"/>
      </xsd:sequence>
   </xsd:complexType>
</xsd:element>

XML Schema can express more sophisticated validation conditions (such as the fact that the size element must contain an integer) than can DTDs. Unlike the DTD syntax, the XML Schema syntax itself uses XML, which is a benefit if you need to process schema files.

In the next section, we will discuss DTDs in detail, then briefly cover the basics of XML Schema support. Finally, we will present a complete application that demonstrates how validation simplifies XML programming.

1. Document Type Definitions

There are several methods for supplying a DTD. You can include a DTD in an XML document like this:

<?xml version="1.0"?>
<!DOCTYPE config [
   <!ELEMENT config . . .>
   more rules
]>
<config>
</config>

As you can see, the rules are included inside a DOCTYPE declaration, in a block delimited by [. . .]. The document type must match the name of the root element, such as config in our example.
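As a small illustration, the sketch below parses a document with an internal DTD and prints the document type next to the root element name. The `entry` rules and the class name `InternalDtdDemo` are hypothetical; only the `config` root comes from the text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.xml.sax.InputSource;

class InternalDtdDemo {
    // Hypothetical document whose rules live in the internal DTD subset
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE config [\n"
        + "  <!ELEMENT config (entry)*>\n"
        + "  <!ELEMENT entry (#PCDATA)>\n"
        + "]>\n"
        + "<config><entry>one</entry></config>";

    public static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XML);
        DocumentType doctype = doc.getDoctype();
        System.out.println("Document type: " + doctype.getName());
        System.out.println("Root element:  " + doc.getDocumentElement().getTagName());
        // The text between [ and ] is available as the internal subset
        System.out.println("Internal subset:\n" + doctype.getInternalSubset());
    }
}
```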

Supplying a DTD inside an XML document is somewhat uncommon because DTDs can grow lengthy. It makes more sense to store the DTD externally. The SYSTEM declaration can be used for that purpose. Specify a URL that contains the DTD, for example:

<!DOCTYPE config SYSTEM "config.dtd">

or

<!DOCTYPE config SYSTEM "http://myserver.com/config.dtd">

The mechanism for identifying well-known DTDs by a public identifier has its origin in SGML. Here is an example:

<!DOCTYPE web-app
   PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.2//EN"
   "http://java.sun.com/j2ee/dtds/web-app_2_2.dtd">

If an XML processor knows how to locate the DTD with the public identifier, it need not go to the URL.
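One way to tell a Java parser where to find a DTD locally is to install an `EntityResolver` on the document builder. The sketch below is an assumption-laden illustration, not the author's code: the class name `LocalDtdResolver`, the in-memory DTD text, and the `entry` element are all hypothetical. The resolver answers any request for a system identifier ending in `config.dtd` from a string, so the URL is never contacted.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class LocalDtdResolver {
    // Hypothetical local copy of the DTD that documents refer to by URL
    static final String CONFIG_DTD =
        "<!ELEMENT config (entry)*>\n"
        + "<!ELEMENT entry (#PCDATA)>\n";

    // System identifiers that were answered locally, kept for inspection
    static final List<String> resolved = new ArrayList<>();

    public static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver((publicId, systemId) -> {
            if (systemId != null && systemId.endsWith("config.dtd")) {
                resolved.add(systemId);
                return new InputSource(new StringReader(CONFIG_DTD));
            }
            return null; // let the parser use its default lookup
        });
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE config SYSTEM \"http://myserver.com/config.dtd\">\n"
            + "<config><entry>one</entry></config>";
        Document doc = parse(xml);
        System.out.println("Parsed root: " + doc.getDocumentElement().getTagName());
        System.out.println("Resolved locally: " + resolved);
    }
}
```

A resolver could match on the public identifier in the same way; matching on the system identifier is simply the shorter choice for this example.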

Now that you have seen how the parser locates the DTD, let us consider the various kinds of rules.

The ELEMENT rule specifies what children an element can have. Use a regular expression, made up of the components shown in Table 3.1.

Here are several simple but typical examples. The following rule states that a menu element contains 0 or more item elements:

<!ELEMENT menu (item)*>

This set of rules states that a font is described by a name followed by a size, each of which contains text:

<!ELEMENT font (name,size)>

<!ELEMENT name (#PCDATA)>

<!ELEMENT size (#PCDATA)>

The abbreviation PCDATA denotes parsed character data. It is “parsed” because the parser interprets the text string, looking for < characters that denote the start of a new tag, or & characters that denote the start of an entity.

An element specification can contain regular expressions that are nested and complex. For example, here is a rule that describes the makeup of a chapter in a book:

<!ELEMENT chapter (intro,(heading,(para|image|table|note)+)+)>

Each chapter starts with an introduction, which is followed by one or more sections consisting of a heading and one or more paragraphs, images, tables, or notes.

However, in one common case you can’t define the rules to be as flexible as you might like. Whenever an element can contain text, there are only two valid cases. Either the element contains nothing but text, such as

<!ELEMENT name (#PCDATA)>

or the element contains any combination of text and tags in any order, such as

<!ELEMENT para (#PCDATA|em|strong|code)*>

It is not legal to specify any other types of rules that contain #PCDATA. For example, the following is illegal:

<!ELEMENT captionedImage (image,#PCDATA)>

You have to rewrite such a rule, either by introducing another caption element or by allowing any combination of image elements and text.

This restriction simplifies the job of the XML parser when parsing mixed content (a mixture of tags and text). Since you lose some control by allowing mixed content, it is best to design DTDs so that all elements contain either other elements or nothing but text.

You can also specify rules to describe the legal attributes of elements. The general syntax is

<!ATTLIST element attribute type default>

Table 3.2 shows the legal attribute types, and Table 3.3 shows the syntax for the defaults.

Here are two typical attribute specifications:

<!ATTLIST font style (plain|bold|italic|bold-italic) "plain">
<!ATTLIST size unit CDATA #IMPLIED>

The first specification describes the style attribute of a font element. There are four legal attribute values, and the default value is plain. The second specification expresses that the unit attribute of the size element can contain any character data sequence.
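To see the effect of these two declarations from Java, the following sketch parses a font element that specifies neither attribute. The class name `AttributeDefaults` and the helper method are hypothetical. The parser is expected to supply the declared default for style, while the #IMPLIED unit attribute stays absent.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttributeDefaults {
    static final String XML =
        "<!DOCTYPE font [\n"
        + "  <!ELEMENT font (name,size)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT size (#PCDATA)>\n"
        + "  <!ATTLIST font style (plain|bold|italic|bold-italic) \"plain\">\n"
        + "  <!ATTLIST size unit CDATA #IMPLIED>\n"
        + "]>\n"
        + "<font><name>Helvetica</name><size>36</size></font>";

    /** Returns the value of the attribute on the first element with the given tag. */
    public static String attribute(String xml, String tag, String attr) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        Element element = (Element) doc.getElementsByTagName(tag).item(0);
        return element.getAttribute(attr); // empty string if the attribute is absent
    }

    public static void main(String[] args) throws Exception {
        System.out.println("style = \"" + attribute(XML, "font", "style") + "\"");
        System.out.println("unit  = \"" + attribute(XML, "size", "unit") + "\"");
    }
}
```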

The handling of a CDATA attribute value is subtly different from the processing of #PCDATA that you have seen before, and quite unrelated to the <![CDATA[. . .]]> sections. The attribute value is first normalized—that is, the parser processes character and entity references (such as &#233; or &lt;) and replaces whitespace with spaces.

An NMTOKEN (or name token) is similar to CDATA, but most nonalphanumeric characters and internal whitespace are disallowed, and the parser removes leading and trailing whitespace. NMTOKENS is a whitespace-separated list of name tokens.

The ID construct is quite useful. An ID is a name token that must be unique in the document; the parser checks the uniqueness. You will see an application in the next sample program. An IDREF is a reference to an ID that exists in the same document, which the parser also checks. IDREFS is a whitespace-separated list of ID references.
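Once an attribute is declared as an ID, a DOM document can look up elements by that attribute. The sketch below assumes a hypothetical `config` document with `entry` elements and a made-up class name `IdLookup`; it relies on the standard `Document.getElementById` method.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class IdLookup {
    static final String XML =
        "<!DOCTYPE config [\n"
        + "  <!ELEMENT config (entry)*>\n"
        + "  <!ELEMENT entry (#PCDATA)>\n"
        + "  <!ATTLIST entry id ID #REQUIRED>\n"
        + "]>\n"
        + "<config>"
        + "<entry id=\"background\">blue</entry>"
        + "<entry id=\"foreground\">white</entry>"
        + "</config>";

    /** Returns the text of the entry with the given ID, or null if there is none. */
    public static String entryText(String xml, String id) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        // Works because the DTD declares the id attribute with type ID
        Element entry = doc.getElementById(id);
        return entry == null ? null : entry.getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("background -> " + entryText(XML, "background"));
        System.out.println("missing    -> " + entryText(XML, "missing"));
    }
}
```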

An ENTITY attribute value refers to an “unparsed external entity.” That is a holdover from SGML that is rarely used in practice. The annotated XML specification at www.xml.com/axml/axml.html has an example.

A DTD can also define entities, or abbreviations that are replaced during parsing. You can find a good example for the use of entities in the user interface descriptions of the Firefox browser. Those descriptions are formatted in XML and contain entity definitions such as

<!ENTITY back.label "Back">

Elsewhere, text can contain an entity reference, for example:

<menuitem label="&back.label;"/>

The parser replaces the entity reference with the replacement string. To internationalize the application, only the string in the entity definition needs to be changed. Other uses of entities are more complex and less common; look at the XML specification for details.
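The following sketch shows the replacement from the Java side. The surrounding `menu` element, the class name `EntityLabels`, and the French label are assumptions for illustration; only the entity name and the `menuitem` element come from the text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class EntityLabels {
    /** Builds a menu document whose back.label entity has the given replacement text. */
    static String menuDocument(String backLabel) {
        return "<!DOCTYPE menu [\n"
            + "  <!ELEMENT menu (menuitem)*>\n"
            + "  <!ELEMENT menuitem EMPTY>\n"
            + "  <!ATTLIST menuitem label CDATA #REQUIRED>\n"
            + "  <!ENTITY back.label \"" + backLabel + "\">\n"
            + "]>\n"
            + "<menu><menuitem label=\"&back.label;\"/></menu>";
    }

    /** Returns the label attribute as the program sees it after parsing. */
    public static String menuLabel(String backLabel) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(menuDocument(backLabel))));
        Element item = (Element) doc.getElementsByTagName("menuitem").item(0);
        return item.getAttribute("label"); // the entity reference is already expanded
    }

    public static void main(String[] args) throws Exception {
        System.out.println(menuLabel("Back"));
        System.out.println(menuLabel("Retour")); // only the entity definition changed
    }
}
```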

This concludes the introduction to DTDs. Now that you have seen how to use DTDs, you can configure your parser to take advantage of them.

First, tell the document builder factory to turn on validation:

factory.setValidating(true);

All builders produced by this factory validate their input against a DTD. The most useful benefit of validation is ignoring whitespace in element content. For example, consider the XML fragment

<font>
   <name>Helvetica</name>
   <size>36</size>
</font>

A nonvalidating parser reports the whitespace between the font, name, and size elements because it has no way of knowing if the children of font are

(name,size)

(#PCDATA,name,size)*

or perhaps

ANY

Once the DTD specifies that the children are (name,size), the parser knows that the whitespace between them is not text. Call

factory.setIgnoringElementContentWhitespace(true);

and the builder will stop reporting the whitespace in text nodes. That means you can now rely on the fact that a font node has two children. You no longer need to program a tedious loop:

for (int i = 0; i < children.getLength(); i++)
{
   Node child = children.item(i);
   if (child instanceof Element)
   {
      var childElement = (Element) child;
      if (childElement.getTagName().equals("name")) . . .;
      else if (childElement.getTagName().equals("size")) . . .;
   }
}

Instead, you can simply access the first and second child:

var nameElement = (Element) children.item(0);

var sizeElement = (Element) children.item(1);
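Putting the two factory settings together, a complete version might look like the sketch below. The class name `ValidatedFontReader`, the in-memory DTD, and the choice to throw on any validation error are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ValidatedFontReader {
    static final String DOCTYPE =
        "<!DOCTYPE font [\n"
        + "  <!ELEMENT font (name,size)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT size (#PCDATA)>\n"
        + "]>\n";

    /** Returns {name, size}; throws SAXException if the font element is invalid. */
    public static String[] readFont(String fontElement) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        factory.setIgnoringElementContentWhitespace(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Treat every validation error as fatal so that invalid input never reaches the code below
        builder.setErrorHandler(new DefaultHandler() {
            @Override public void error(SAXParseException e) throws SAXException { throw e; }
        });
        Document doc = builder.parse(new InputSource(new StringReader(DOCTYPE + fontElement)));
        NodeList children = doc.getDocumentElement().getChildNodes();
        // No loop and no instanceof tests: validation guarantees exactly these two children
        Element nameElement = (Element) children.item(0);
        Element sizeElement = (Element) children.item(1);
        return new String[] { nameElement.getTextContent(), sizeElement.getTextContent() };
    }

    public static void main(String[] args) throws Exception {
        String[] font = readFont("<font>\n  <name>Helvetica</name>\n  <size>36</size>\n</font>");
        System.out.println(font[0] + " " + font[1]);
    }
}
```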

That is why DTDs are so useful. You don’t overload your program with rule-checking code, because the parser has already done that work by the time you get the document.

When the parser reports an error, your application will want to do something about it—log it, show it to the user, or throw an exception to abandon the parsing. Therefore, you should install an error handler whenever you use validation. Supply an object that implements the ErrorHandler interface. That interface has three methods:

void warning(SAXParseException exception)

void error(SAXParseException exception)

void fatalError(SAXParseException exception)

Install the error handler with the setErrorHandler method of the DocumentBuilder class:

builder.setErrorHandler(handler);
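A handler that collects problems instead of printing them could be sketched as follows. The names `ValidationReport` and `CollectingHandler` are hypothetical, and the policy shown (record warnings and errors, rethrow fatal errors) is just one reasonable choice.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class ValidationReport {
    /** Records warnings and errors; rethrows fatal errors to abandon the parse. */
    static class CollectingHandler implements ErrorHandler {
        final List<String> problems = new ArrayList<>();

        @Override public void warning(SAXParseException e) { record("warning", e); }
        @Override public void error(SAXParseException e) { record("error", e); }
        @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }

        private void record(String kind, SAXParseException e) {
            problems.add(kind + " at line " + e.getLineNumber() + ": " + e.getMessage());
        }
    }

    /** Parses with DTD validation turned on and returns the recorded problems. */
    public static List<String> validate(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        CollectingHandler handler = new CollectingHandler();
        builder.setErrorHandler(handler);
        builder.parse(new InputSource(new StringReader(xml)));
        return handler.problems;
    }

    public static void main(String[] args) throws Exception {
        String doctype = "<!DOCTYPE font [\n"
            + "  <!ELEMENT font (name,size)>\n"
            + "  <!ELEMENT name (#PCDATA)>\n"
            + "  <!ELEMENT size (#PCDATA)>\n"
            + "]>\n";
        // The children are in the wrong order, so the parser should report a problem
        for (String problem : validate(doctype + "<font><size>36</size><name>Helvetica</name></font>"))
            System.out.println(problem);
    }
}
```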

 

2. XML Schema

XML Schema is quite a bit more complex than the DTD syntax, so we will only cover the basics. For more information, we recommend the tutorial at www.w3.org/TR/xmlschema-0.

To reference a schema file in a document, add attributes to the root element, for example:

<?xml version="1.0"?>
<config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="config.xsd">
</config>

This declaration states that the schema file config.xsd should be used to validate the document. If your document uses namespaces, the syntax is a bit more complex; see the XML Schema tutorial for details. (The prefix xsi is a namespace alias; see Section 3.6, “Using Namespaces,” for more information.)

A schema defines a type for each element and attribute. A simple type is a string, perhaps with restrictions on its contents. Everything else is a complex type. An element with a simple type can have no attributes and no child elements. Otherwise, it must have a complex type. Conversely, attributes always have a simple type.

Some simple types are built into XML Schema, including

xsd:string

xsd:int

xsd:boolean

You can define your own simple types. For example, here is an enumerated type:

<xsd:simpleType name="StyleType">
   <xsd:restriction base="xsd:string">
      <xsd:enumeration value="PLAIN" />
      <xsd:enumeration value="BOLD" />
      <xsd:enumeration value="ITALIC" />
      <xsd:enumeration value="BOLD_ITALIC" />
   </xsd:restriction>
</xsd:simpleType>

When you define an element, you specify its type:

<xsd:element name="name" type="xsd:string"/>
<xsd:element name="size" type="xsd:int"/>
<xsd:element name="style" type="StyleType"/>

The type constrains the element content. For example, the elements

<size>10</size>
<style>PLAIN</style>

will validate correctly, but the elements

<size>default</size>
<style>SLANTED</style>

will be rejected by the parser.
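To try these four elements in isolation, the sketch below uses the `javax.xml.validation` API, which this section does not otherwise discuss; it is swapped in here only because it can check a single element against an in-memory schema in a few lines. The class name `SimpleTypeCheck` and the wrapping schema are assumptions.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class SimpleTypeCheck {
    // Schema with the size and style elements described in the text
    static final String XSD =
        "<xsd:schema xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\">\n"
        + "  <xsd:simpleType name=\"StyleType\">\n"
        + "    <xsd:restriction base=\"xsd:string\">\n"
        + "      <xsd:enumeration value=\"PLAIN\"/>\n"
        + "      <xsd:enumeration value=\"BOLD\"/>\n"
        + "      <xsd:enumeration value=\"ITALIC\"/>\n"
        + "      <xsd:enumeration value=\"BOLD_ITALIC\"/>\n"
        + "    </xsd:restriction>\n"
        + "  </xsd:simpleType>\n"
        + "  <xsd:element name=\"size\" type=\"xsd:int\"/>\n"
        + "  <xsd:element name=\"style\" type=\"StyleType\"/>\n"
        + "</xsd:schema>";

    /** Returns true if the one-element document conforms to the schema above. */
    public static boolean isValid(String xml) throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
        Validator validator = schema.newValidator();
        try {
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the content violated its declared type
        }
    }

    public static void main(String[] args) throws Exception {
        String[] samples = { "<size>10</size>", "<style>PLAIN</style>",
                             "<size>default</size>", "<style>SLANTED</style>" };
        for (String sample : samples)
            System.out.println(sample + " valid? " + isValid(sample));
    }
}
```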

You can compose types into complex types, for example:

<xsd:complexType name="FontType">
   <xsd:sequence>
      <xsd:element ref="name"/>
      <xsd:element ref="size"/>
      <xsd:element ref="style"/>
   </xsd:sequence>
</xsd:complexType>

A FontType is a sequence of name, size, and style elements. In this type definition, we use the ref attribute and refer to definitions that are located elsewhere in the schema. You can also nest definitions, like this:

<xsd:complexType name="FontType">
   <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
      <xsd:element name="size" type="xsd:int"/>
      <xsd:element name="style">
         <xsd:simpleType>
            <xsd:restriction base="xsd:string">
               <xsd:enumeration value="PLAIN" />
               <xsd:enumeration value="BOLD" />
               <xsd:enumeration value="ITALIC" />
               <xsd:enumeration value="BOLD_ITALIC" />
            </xsd:restriction>
         </xsd:simpleType>
      </xsd:element>
   </xsd:sequence>
</xsd:complexType>

Note the anonymous type definition of the style element.

The xsd:sequence construct is the equivalent of the concatenation notation in DTDs. The xsd:choice construct is the equivalent of the | operator. For example,

<xsd:complexType name="contactinfo">
   <xsd:choice>
      <xsd:element ref="email"/>
      <xsd:element ref="phone"/>
   </xsd:choice>
</xsd:complexType>

This is the equivalent of the DTD type email|phone.

To allow repeated elements, use the minOccurs and maxOccurs attributes. For example, the equivalent of the DTD type item* is

<xsd:element name="item" type=". . ." minOccurs="0" maxOccurs="unbounded"/>

To specify attributes, add xsd:attribute elements to complexType definitions:

<xsd:element name="size">
   <xsd:complexType>
      <xsd:attribute name="unit" type="xsd:string" use="optional" default="cm"/>
   </xsd:complexType>
</xsd:element>

This is the equivalent of the DTD statement

<!ATTLIST size unit CDATA "cm">

Enclose element and type definitions of your schema inside an xsd:schema element:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
</xsd:schema>

Parsing an XML file with a schema is similar to parsing a file with a DTD, but with two differences:

  1. You need to turn on support for namespaces, even if you don’t use them in your XML files.

factory.setNamespaceAware(true);

  2. You need to prepare the factory for handling schemas, with the following magic incantation:

final String JAXP_SCHEMA_LANGUAGE =
   "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
final String W3C_XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";
factory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
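The two steps can be combined into a small self-contained sketch. Two things here go beyond the text and should be read as assumptions: the schema is supplied in memory through the JAXP `schemaSource` property (instead of an `xsi:noNamespaceSchemaLocation` attribute in the document), and the class name `SchemaAwareParser` and the font schema are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class SchemaAwareParser {
    static final String JAXP_SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String JAXP_SCHEMA_SOURCE =
        "http://java.sun.com/xml/jaxp/properties/schemaSource";
    static final String W3C_XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";

    // Hypothetical schema for the font element used throughout this section
    static final String FONT_XSD =
        "<xsd:schema xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\">\n"
        + "  <xsd:element name=\"font\">\n"
        + "    <xsd:complexType>\n"
        + "      <xsd:sequence>\n"
        + "        <xsd:element name=\"name\" type=\"xsd:string\"/>\n"
        + "        <xsd:element name=\"size\" type=\"xsd:int\"/>\n"
        + "      </xsd:sequence>\n"
        + "    </xsd:complexType>\n"
        + "  </xsd:element>\n"
        + "</xsd:schema>";

    /** Parses with schema validation and returns the validation error messages. */
    public static List<String> validationErrors(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);   // difference 1
        factory.setValidating(true);
        factory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);   // difference 2
        // Assumption: supply the schema from memory; set after the language attribute
        factory.setAttribute(JAXP_SCHEMA_SOURCE,
            new ByteArrayInputStream(FONT_XSD.getBytes(StandardCharsets.UTF_8)));
        DocumentBuilder builder = factory.newDocumentBuilder();
        List<String> errors = new ArrayList<>();
        builder.setErrorHandler(new DefaultHandler() {
            @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(validationErrors("<font><name>Helvetica</name><size>36</size></font>"));
        System.out.println(validationErrors("<font><name>Helvetica</name><size>large</size></font>"));
    }
}
```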

3. A Practical Example

In this section, we work through a practical example that shows the use of XML in a realistic setting.

Suppose an application needs configuration data that specifies arbitrary objects, not just text strings. We provide two mechanisms for instantiating the object: with a constructor, and with a factory method. Here is how to make a Color object using a constructor:

<construct class="java.awt.Color">
   <int>55</int>
   <int>200</int>
   <int>100</int>
</construct>

Here is an example with a factory method:

<factory class="java.util.logging.Logger" method="getLogger">
   <string>com.horstmann.corejava</string>
</factory>

If the factory method name is omitted, it defaults to getInstance.

As you can see, there are elements for describing strings and integers. We also support the boolean type, and other primitive types can be added in the same way.

Just to show off the syntax, there is a second mechanism for primitive types:

<value type="int">30</value>

A configuration is a sequence of entries. Each entry has an ID and an object:

<config>
   <entry id="background">
      <construct class="java.awt.Color">
         <value type="int">55</value>
         <value type="int">200</value>
         <value type="int">100</value>
      </construct>
   </entry>
</config>

The parser checks that IDs are unique.
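To give a feel for how a program might turn a construct element into an object, here is a much reduced sketch; it is not the program from the listings. It handles only int and string arguments, does no validation, and uses the hypothetical class name `ConstructSketch`. The demonstration constructs an `AtomicInteger` and a `StringBuilder` rather than a `Color` so that it runs without the AWT classes.

```java
import java.io.StringReader;
import java.lang.reflect.Constructor;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ConstructSketch {
    /** Builds the object described by a construct element with int or string children. */
    public static Object construct(Element e) throws ReflectiveOperationException {
        List<Class<?>> types = new ArrayList<>();
        List<Object> values = new ArrayList<>();
        NodeList children = e.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (!(child instanceof Element)) continue; // no validation here, so skip whitespace by hand
            String text = child.getTextContent().trim();
            String tag = ((Element) child).getTagName();
            switch (tag) {
                case "int":
                    types.add(int.class);
                    values.add(Integer.valueOf(text));
                    break;
                case "string":
                    types.add(String.class);
                    values.add(text);
                    break;
                default:
                    throw new IllegalArgumentException("Unsupported argument element: " + tag);
            }
        }
        // Look up the constructor whose parameter types match the argument elements
        Class<?> cl = Class.forName(e.getAttribute("class"));
        Constructor<?> constructor = cl.getConstructor(types.toArray(new Class<?>[0]));
        return constructor.newInstance(values.toArray());
    }

    public static Object constructFromXml(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        return construct(doc.getDocumentElement());
    }

    public static void main(String[] args) throws Exception {
        System.out.println(constructFromXml(
            "<construct class=\"java.util.concurrent.atomic.AtomicInteger\">\n"
            + "  <int>55</int>\n"
            + "</construct>"));
        System.out.println(constructFromXml(
            "<construct class=\"java.lang.StringBuilder\"><string>hello</string></construct>"));
    }
}
```

With a validated document and whitespace ignoring turned on, the `instanceof` test in the loop would be unnecessary.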

The DTD, shown in Listing 3.4, is straightforward.

Listing 3.5 contains the equivalent schema. In the schema, we can provide additional checking: an int or boolean element can only contain integer or boolean content. Note the use of the xsd:group construct to define parts of complex types that are used repeatedly.

The program in Listing 3.2 shows how to parse a configuration file. A sample configuration is defined in Listing 3.3.

The program uses the schema instead of the DTD if you choose a file whose name contains the string -schema.

This example is a typical use of XML. The XML format is robust enough to express complex relationships. The XML parser adds value by taking over the routine job of validity checking and supplying defaults.

Source: Horstmann Cay S. (2019), Core Java. Volume II – Advanced Features, Pearson; 11th edition.
