SQL and XML: XML and Metadata

One of the most powerful qualities of the relational model is its very rigid support for data types and data structure, implemented by the definitions of tables, columns, primary keys, foreign keys, and constraints. In addition, as shown in Chapter 16, the system catalog of a relational database contains metadata, or “data about the data” in the database. By querying the system catalog, you can discover the structure of the database, including the data types of its columns, the columns that comprise its tables, and intratable relationships.

In contrast, XML documents by themselves provide only very limited metadata. They impose a hierarchical element structure on their data, but the only real data about the structure is the names of the elements and attributes. An XML document can be well formed and still have quite an irregular structure. For example, there is nothing to prevent a well-formed XML document from having a named element that contains text data in one instance and subelements in another instance, or a named attribute that has an integer value for one element and a date value for another. Clearly, a document with this structure, while it may be well formed, does not represent data that is easily transformed to and from a relational database. When using XML for data processing documents, stronger support for data types and rigid structure is needed.

XML standards and products have addressed this need in multiple ways during the short history of XML technologies. These include:

  • Document Type Definition (DTD). A part of the original XML 1.0 specification, Document Type Definitions provided a way to specify and restrict the structure of a document. XML parsers can examine an XML document in the context of a DTD and determine whether it is a valid document (i.e., whether it conforms to the DTD restrictions).
  • XML-Data. Submitted to the W3C in 1998, XML-Data was an early attempt to address some of the deficiencies in the DTD scheme. It never received W3C endorsement, but many of its ideas have carried forward into the XML Schema specification. Microsoft adapted its own form of XML-Data, called XML-Data Reduced (XDR), and implemented it as part of its BizTalk integration server and Internet Explorer 5.0 browser. The energy around the XML-Data proposal shifted in late 1999 and 2000 to the XML Schema proposal.
  • XML Schema. A stand-alone specification that became a W3C recommendation in May 2001, XML Schema built on and extended the ideas in XML-Data. XML Schema provides much more rigorous data type support, and has the advantage that the schema definition (the document metadata) is itself expressed as an XML document, in much the same way that relational database metadata is provided via a standard relational table structure.
  • Industry group standards. Various industry groups have banded together to define XML standards for specific types of documents that are important for data exchange within their industry. For example, financial services firms are working on standards to describe financial instruments (stocks, bonds, etc.) and market data. Manufacturing firms are working on standards to describe purchase order documents, order confirmations, and the like. These standards for specific industry-oriented documents are usually built on generic standards, such as DTD and XML Schema.

The area of XML metadata and document type standards is evolving rapidly. The W3C consortium provides a frequently updated web site at http://www.w3.org, which provides access to the various XML-related standards and information about their status. You can find information about industry-specific standards at http://www.xml.org, a site organized and hosted by the Organization for the Advancement of Structured Information Systems (OASIS). The site contains a registry of XML-based standards, classified by industry.

1. Document Type Definitions (DTDs)

The earliest attempt to standardize XML metadata was contained in the Document Type Definition (DTD) capability of the original XML 1.0 specification. DTD’s are used to specify the form and structure of a particular type of document (such as a purchase order document or a transfer-of-funds document). Figure 25-6 shows a DTD that might be used for a simple purchase order document in Figure 25-5. This DTD demonstrates only a fraction of the full capabilities of DTDs, but it illustrates the key components of a typical DTD.

The !ELEMENT entries in the DTD define the element hierarchy that gives the document its basic form. DTDs provide for these different types of elements:

  • Text-only element. The element contains only a text string, which can represent a data value from a single column of database data.
  • Element-only element. The element’s contents are other elements (subelements); it is the parent in a local parent/child hierarchy of elements. This type of element can be used to represent a row of a table, with subelements representing the columns.
  • Mixed-content element. The element can contain a mixture of interspersed text contents and subelements. This type is not typically used for database contents, because this mix of subelements and data doesn’t naturally appear in the row/column structure of tables.
  • Empty-content element. The element has no content—neither subelements nor text content—but it may have attributes. This type of element can represent a row of a table when its attributes are used to represent individual column values.
  • Any-content element. The element has unrestricted content. The content may be empty, or may contain a mix of subelements and/or text. Like the mixed-content element, this type is typically not useful for XML documents used in database processing.

In the purchase order DTD of Figure 25-6, the top-level purchaseOrder element and the orderltem element have the element-only type. Their declarations list the subelements that they contain. The customerNumber and OrderDate elements are text-only elements, indicated by the #PCDATA definitions. The TERMS element is empty; it only has attributes. Both attributes have values that are character data (indicated by the CDATA type); one is required, and the other is optional, as indicated. Note that this DTD combines a data-as-elements style (for the customer information) and a data-as-attributes style (for the order terms) only for illustrative purposes. In practice, you would choose one style or the other of data representation and use it consistently, to simplify processing.

Document Type Definitions are critical to make XML actually useful in practice for representing structured documents for data exchange. They allow you to define the essential elements of a transactional document, such as a purchase order or an employee personnel action form or a request-for-quote form. With a DTD for such a document in place, it is straightforward to validate that a document that originates somewhere else within a company, or even outside a company, is a valid document of the specific type and can be processed. Any XML parser, whether based on the DOM API or the SAX API, is capable of validating an XML document against a supplied DTD. In addition, it’s possible to explicitly declare the DTD to which an XML document should conform within the document itself.

Document Type Definitions have some drawbacks, however. They lack the strong data typing typically found in relational databases. There is no way to specify that an element must contain an integer or a date, for example. DTDs also lack good support for user-defined (or corporate-defined) types or subdocument structures. For example, it’s possible that the orderItem element in Figure 25-6 will appear not only in a purchase order document, but also in a change order document, an order cancellation document, a backorder document, and an order acknowledgement document. It would be convenient to define the orderItem substructure once, give it a name, and then refer to it in these other document definitions, but DTDs don’t provide this capability.

DTDs are also somewhat restrictive in the types of content structures that they allow, although in practice, they are usually rich enough to support the kinds of transactional documents needed for hybrid database/XML applications. Finally, the expressions used by DTDs to define document structure are an extended form of Backus Naur Form (BNF). (An example of this is the asterisk that appears after the orderItem declaration within the purchaseOrder element list in Figure 25-6, which means, “This element may be repeated zero or more times.”)

While familiar to computer science students who deal with computer languages, this format is unfamiliar to people who approach XML from the document markup world of HTML. All of these deficiencies became visible soon after the adoption of XML 1.0, and work began to define a stronger metadata capability for XML documents. Eventually, these efforts resulted in the XML Schema specification, described in the next section.

2. XML Schema

XML Schema 1.0 became an official W3C recommendation in May 2001, and support for it is rapidly growing in commercial XML-related products. DTD’s are still widely supported for backward compatibility, but XML Schema offers some compelling advantages, and addresses most of the shortcomings of DTD. Figure 25-7 shows the document schema for the purchase order document in Figure 25-5, this time defined using an XML Schema. It’s useful to compare the XML Schema declaration in Figure 25-7 with the DTD declaration in Figure 25-6. Even this simple example shows the strong data type support in XML Schema; elements and attributes have data types that look very much like SQL data types. Also, the schema in Figure 25-7 is itself an XML document, so it is more readable for someone who is familiar with XML basics than the DTD in Figure 25-6.

3. Data Types in XML Schema

From a database point of view, XML Schema’s strong support for data types and data structures is one of its major advantages. XML Schema defines over 20 built-in data types, which correspond fairly closely to the defined SQL data types. Table 25-1 lists the most important XML Schema built-in data types for database processing.

Like the SQL2 and SQL3 standards, XML Schema supports user-defined data types that are derived from these built-in types or from other user-defined types. You can specify a derived data type as a restriction on another XML type. For example, here is a definition for a derived repNumType type that restricts legal employee numbers to a range from 101 to 199:

<simpleType name=”repNumType”>

<restriction base=”integer”>

<minInclusive value=”101″ />

<maxExclusive value=”200″ />

</restriction>

</simpleType>

With this data type defined, you can declare entities or attributes in a schema as having a data type of repNumType and the restriction is automatically implemented. XML Schema provides a rich set of data type characteristics (called facets) on which you can build restrictions. These include data length (for strings and binary data), inclusive and exclusive data ranges, number of digits and fractional digits (for numeric data), and explicit lists of permitted values. There is even a built-in pattern-matching capability, where data values can be restricted using a regular expression syntax like that used in the Perl scripting language.

XML Schema also gives you the ability to define complex data types, which are user-defined structures. For example, here is a definition for a complex custAddrType type that is comprised of familiar subelements:

You can also create a user-defined data type that is a list of data items with another type. For example, here is a definition for a complex repListType type, which is a list of salesperson employee numbers:

<simpleType name=repListType”>

<list itemType=”repNumType” />

</simpleType>

XML Schema also allows you to overload a user-defined data type, allowing it to take on one of several different underlying data types depending on the specific need. For example, in the preceding custAddrType definition, the postal code portion of the address is defined as an integer. This works for US-style five-digit zip codes (except that it doesn’t preserve the leading zero), but not for Canadian six-digit postal codes, which include letters and digits. A more international approach is to declare the U.S. and Canadian versions, and then a more universal postal code, which may be any of the types:

<simpleType name=”usZip5Type”>

<restriction base=”integer”>

<totalDigits value=5 />

</restriction>

</simpleType>

<simpleType name=”canPost6Type”>

<restriction base=”string”>

<length value=6 />

</restriction>

</simpleType>

<simpleType name=”intlPostType”>

<union memberTypes=”usZip5Type canPost6Type” />

</simpleType>

With user-defined data type definitions in place, you can more easily define larger, more complex structures. For example, here is part of the purchase order document in Figure 25-7, expanded to include a bill-to and ship-to address, and to permit a list of sales representatives:

<complexType name=”purchaseOrderType”>

… other element declarations go here …

<element name=”billAddr” type=”custAddrType” />

<element name=”shipAddr” type=”custAddrType” />

<element name=”repNums” type=”repListType” />

… other element declarations continue…

4. Elements and Attributes in XML Schema

Building on its support for a rich data type structure, XML Schema also provides a rich vocabulary for specifying the legal structure of a document type and the permitted elements and attributes that comprise it. XML Schema supports the same basic element types defined in the DTD model:

  • Simple content. The element contains only text content (although as explained in the preceding section, the text can be restricted to data of a specific type like a date or a numeric value). Content of this type is defined using a simpleContent element.
  • Element-only content. The element contains only subelements. Content of this type is defined using a complexType element.
  • Mixed content. The element contains a mix of subelements and its own text content. Unlike the DTD mixed-content model, XML Schema requires that the sequence of elements and text content be rigidly defined, and valid documents must conform to the defined sequence. Content of this type is defined using a mixed attribute on a complexType element.
  • Empty content. The element contains only attributes, and no text content of its own. XML Schema treats this as a special case of element-only content, with no declared elements.
  • Any content. The element contains any mix of content and subelements, in any order. Content of this type is defined using the XML Schema data type anyType as the data type of the element.

These basic element types can appear individually in the declarations of elements within a schema. You can also specify that an element may occur multiple times within a document, and optionally, specify a minimum and a maximum number of occurrences. XML Schema also supports SQL-style NULL values for elements, to indicate that element contents are unknown. The XML terminology is nil values, but the concept is the same. This capability simplifies mapping of data between XML document elements and database columns that can contain NULL values.

XML Schema lets you define a logical group of elements that are typically used together, and give the element group a name. Subsequent element declarations can then include the entire named group of elements as a unit. Grouped elements also provide additional flexibility for element structure. The group can specify a sequence of elements, which must all be present, in the specified order. Alternatively, it can specify a choice of elements, indicating that only one of a set of defined element types must appear.

XML Schema provides similar control over attributes. You can specify an individual attribute as optional or required. You can specify a default attribute value, to be used if an explicit value is not provided in the document instance, or a fixed attribute value, which forces the attribute to always have the specified value in an instance document. Attribute groups allow you to define and name a group of attributes that are typically used together. The entire group of attributes can be declared for an element in a schema simply by naming the attribute group.

Finally, XML Schema provides extensive support for XML namespaces, which are used to store and manage different XML vocabularies—that is, different collections of data type definitions and data structure declarations that are used for different purposes. In a large organization, it will be useful to define standardized XML representations for common basic business objects, such as an address, a product number, or a customer-id, and collect these in a common repository. Higher-level XML declarations for documents such as purchase orders, vacation time requests, payment authorization forms, and the like will also be useful, but should typically be collected together in groups based on shared usage.

XML namespaces support these capabilities by allowing you to collect together related XML definitions and declarations, store them in a file, and identify them by name. An XML schema for a new type of document can then draw its basic data definitions and structures from one or more namespaces by referencing the namespaces in the schema header. In fact, the standard XML vocabulary and many of the built-in data types are defined in a namespace maintained on the W3C organization web site. An Internet-style URL identifies the source file for an XML namespace.

If an XML Schema declaration incorporates definitions from more than one XML namespace, the potential for name conflict exists. The same name could easily have been chosen by the developers of two different namespaces to represent two quite different XML structures or data types. To remove the potential ambiguity, XML data types and structure definitions can be specified using qualified names, using a technique that closely parallels the use of qualified column names in SQL. Each namespace that is identified in a schema header can be assigned a prefix name, which is then used to qualify references to items within that namespace. For clarity, the prefix names have been omitted from the schema examples in this chapter. Here is a more typical schema header and excerpt from a schema body that uses prefix names and qualification to reference the main XML Schema namespace (maintained by W3C) and a corporate namespace:

<schema xmlns:xsd=”http://www.w3.org/2001/XMLSchema”

xmlns:corp=”http://www.mycompany.com/schemas/purchasing”>

<complexType name=”purchaseOrderType”>

… other element declarations go here…

<element name-=”orderDate” type=”xsd:date” />

<element name=”billAddr” type=”corp:custAddrType” />

<element name=”shipAddr” type=”corp:custAddrType” />

<element name=”repNums” type=”corp:repListType” nillable=”true” />

… other element declarations continue…

In this example, the corporate XML namespace is identified by the prefix corp and the main XML Schema namespace by the prefix xsd. All of the data type references are qualified by one of these prefixes, and as a result, they are unambiguous. Because qualified references can become quite cumbersome, it’s also possible to specify default namespaces that minimize the need for prefixes. The complete XML Schema naming system is quite a bit more sophisticated than the capabilities outlined here, but the capabilities are clearly directed toward supporting the creation of very complex document type specifications by large groups of people.

As with DTDs, the power of XML Schema is that it allows you to specify well-defined document types against which individual document instances can be validated. All of the popular XML parsers, whether they implement the SAX API or the DOM API, provide XML Schema-based validation. You can specify the schema to which an XML document claims conformance within the XML document itself, but you can also ask a parser to validate an arbitrary XML document against an arbitrary schema.

Source: Liang Y. Daniel (2013), Introduction to programming with SQL, Pearson; 3rd edition.

Leave a Reply

Your email address will not be published. Required fields are marked *