SQL and XML: Using XML with Databases

With the rapidly growing popularity of XML, database product vendors have moved quickly to offer XML support in their products. The form of XML support varies, but tends to fall into one or more of these categories:

  • XML output. An XML document can easily represent the data in one or more rows of query results. With this support, the DBMS generates an XML document as its response to a SQL query instead of the usual row/column query results.
  • XML input. An XML document can easily represent the data to be inserted as one or more new rows of a table. It can also represent data to update a row of a table, or the identification of a row to be deleted. With this support, the DBMS accepts an XML document as input instead of a SQL request.
  • XML data exchange. XML is a natural way to express data that is to be exchanged between different DBMS systems or among DBMS servers. Data from the source database is transformed into an XML document and shipped to the destination database, where it is transformed back into a database format. This same style of data exchange is useful for moving data between relational databases and non-DBMS applications, such as corporate Enterprise Resource Planning (ERP) or Enterprise Application Integration (EAI) systems.
  • XML storage. A relational database can easily accept an XML document (which is a string of text characters) as a piece of variable-length character string (VARCHAR) or character large object (CLOB) data. At the most basic level of XML support, an entire XML document becomes the content of one column in one row of the database. Slightly stronger XML support may be possible if the DBMS allows the column to be declared with an explicit “XML-data” data type.
  • XML data integration. A more sophisticated level of integrated XML storage is possible if the DBMS can parse an XML document, decompose it into its component elements, and store the individual elements in individual columns. Ordinary SQL can then be used to search those columns, providing search support for elements within the XML document. In response to a query, the DBMS can recompose the XML document from its stored component elements.

1. XML Output

One of the most straightforward combinations of XML and database technology is to use XML as a format for SQL query results. Query results have a structured tabular format that can easily be translated into an XML representation. Consider this simple query from the sample database:

If the DBMS is instructed to output the query results in XML format instead, here is the output that might result:

SELECT ORDER_NUM, MFR, PRODUCT, QTY, AMOUNT

  FROM ORDERS

 WHERE CUST = 2103;

<?xml version=”1.0″?>

<queryResults>

<row>

<order_num>112963</order_num>

<mfr>ACI</mfr>

<product>41004</product>

<qty>28</qty>

<amount>3276.00</amount>

</row>

<row>

<order_num>112983</order_num>

<mfr>ACI</mfr>

<product>41004</product>

<qty>3</qty>

<amount>702.00</amount>

</row>

<row>

<order_num>113027</order_num>

<mfr>ACI</mfr>

<product>41002</product>

<qty>54</qty>

<amount>4104.00</amount>

</row>

<row>

<order_num>112987</order_num>

<mfr>ACI</mfr>

<product>4100Y</product>

<qty>11</qty>

<amount>27500.00</amount>

</row>

</queryResults>

This is typical of the output you could actually receive from some of the popular DBMS products that currently support XML output. The query results are a well-formed, self-contained XML document. If you submit the results to an XML parser (parsers are described in the “Large Objects and Parsers” section later in this chapter), the parser will correctly interpret them as having:

  • One root element, queryResults
  • Four row subelements beneath the root
  • Five subelements beneath each row element, and in this case, all five subelements appearing for every row element, and in the same order

Having XML-formatted query output can be a significant advantage. The output can be sent directly to programs that accept XML documents as input, for further processing. The output can be sent across a network to another system, and because of its XML format, its elements are self-describing—every receiving system or application will interpret the query results in the same way—as four rows of five elements each. Because the output is in pure text format, it won’t be misinterpreted because of differences in binary data representations between sending and receiving systems. Finally, if the XML is transmitted over an HTTP link using the Simple Object Access Protocol (SOAP) protocol standards, the XML-formatted message can typically move through corporate firewalls and link an originating application in one company with a receiving application in a different company.

The XML-formatted output also has some disadvantages. One is the raw size of the data. There are about four times as many characters in the XML-formatted results as in the tabular format. If the XML form is being stored on disk, it requires four times the storage. If it’s being sent to another computer system over a network, it will take four times as long to transmit, or it will require a network with four times the bandwidth to preserve the same transmission time. These aren’t serious problems for the small amount of data in the example, but they can be very significant for results with thousands or tens of thousands of rows, multiplied by hundreds of applications in an enterprise data center.

This simple XML output format also loses some information about the data. The currency symbol that appeared in the tabular display has disappeared, so it’s impossible to determine, from the XML content itself, that the data has a currency type and what kind of currency it is. The XML Schema capability provides a way to gain back this information, as described later in the “XML Schema” section, but at the expense of still more increase in the size of the query results text.

2. XML Input

Just as XML can be used to represent a row of query results that is output from a database, XML can easily be used to represent a row of data to be inserted into a database. To process the XML data, the DBMS must analyze the XML document containing the data to be inserted and identify the individual data items (represented as either elements or attributes). The DBMS must then match (usually using column names) or translate (using a DBMS-specific scheme) the corresponding element or attribute names to columns in the target table that is to receive the new data. Conceptually, this simple INSERT statement:

 INSERT INTO OFFICES (OFFICE, CITY, REGION, SALES)

VALUES (23,’San Francisco’,’Western’,0.00)

can be easily translated into an equivalent hybrid SQL/XML statement like this one:

INSERT INTO OFFICES (OFFICE, CITY, REGION, SALES)

VALUES <?xml version=”1.0″?>

<row>

<office>23</office>

<city>San Francisco</city>

<region>Western</region>

<sales>0.00</sales>

</row>

Updates to the database can be similarly handled. This simple UPDATE statement:

UPDATE OFFICES

   SET TARGET = 200000.00, MGR = 108

 WHERE OFFICE = 23

can be translated into this equivalent hybrid SQL/XML statement:

UPDATE OFFICES

WHERE OFFICE = 23

<?xml version=”1.0″?>

<update_info>

<values>

<target>200000.00</target>

<mgr>108</mgr>

</values>

<where>office = 23</where>

</update_info>

and a DELETE statement requires only the specification of the WHERE clause, using the same conventions.

While several SQL DBMS brands have added the capability to process XML-based INSERT, UPDATE, and DELETE operations using this type of approach, the specific techniques for representing table and column names and data values in the XML text, and for mapping them to corresponding database structures, are DBMS-specific. There are no standards (yet) for the type of hybrid SQL/XML syntax in these examples.

Although representing input and update values as small XML documents is conceptually straightforward, it represents some significant DBMS processing issues. For example, the column list in a SQL INSERT statement appears to be redundant if the XML document containing the data values to be inserted also contains the column names as either element or attribute names. Why not simply drop the column list and let the XML documents specify which columns to insert? For interactive SQL, there is no problem in doing this, but the XML format is unlikely to be used for an interactive SQL session. For programmatic use of SQL, the problem is that the XML document and the data values that it contains will be supplied to the DBMS at runtime. If the column names (or even the table name) are also supplied only in the XML document, then the DBMS cannot know, until runtime, which tables and columns are affected. In this situation, the DBMS must use dynamic SQL to handle the processing, as described in Chapter 18, with all of its associated performance penalties.

Similar problems arise with the WHERE clause in an UPDATE or DELETE statement, and the SET clause of the UPDATE statement. To get the performance and efficiency of static SQL, the DBMS must know in advance (when the program is compiled) which search conditions will be used and which columns will be updated. One approach to this problem is to use the parameterized form of these statements. Here is the same UPDATE example, using this approach:

UPDATE OFFICES

   SET TARGET = ?, MGR = ?

 WHERE OFFICE = ?

<?xml version=”1.0″?>

<update_info>

<param>200000.00</param>

<param>108</param>

<param>23</param>

</update_info>

With this style, the XML text and the SQL text are actually quite separate. The SQL text is self-contained, and can be processed at compile-time. The XML text is self-contained, and the DBMS can match its parameter values to the needed statement parameters at runtime. This example follows the usual SQL style of specifying parameters by position, but the XML document loses a lot of its self-describing qualities as a result. Depending on the DBMS, it may be possible to use named elements within the XML document and match them to named statement parameters at runtime.

3. XML Data Exchange

A DBMS can support XML data exchange in a simple form merely by supporting XML output for query results and XML input for INSERT operations. However, this requires the user or programmer to carefully construct the format of the generated query results in the source database to match the expected format for the INSERT operations in the destination database. XML data exchange is more useful if the DBMS provides more explicit built-in support.

Several commercial DBMS products now offer the ability to perform a bulk export of a table (or in a more sophisticated operation, the results of a query) into an external file, formatted as an XML document. At the destination end, these products offer the same ability to do a bulk import from this same type of file into a DBMS table. With this scheme, the XML document file becomes a standard way of representing table contents for the exchange.

Note that once XML-based table import/export capabilities are offered, their use is not restricted to database-to-database exchanges. The source of the XML document in the data exchange file could well be an enterprise application, such as a Supply Chain Management (SCM) system. The destination similarly could be an enterprise application. In addition, many EAI systems now support XML document files. These systems provide further processing and integration capabilities, such as eliminating duplicated data and combining data from multiple input files.

4. XML Storage and Integration

XML input, output, and data exchange capabilities offer a very effective way to integrate existing relational databases with the emerging world of XML. With these approaches, XML is used in the external world to represent structured data, but the data within the database itself retains its row/column, tabular, binary structure. As XML documents proliferate, a natural next step is to consider storing XML documents themselves within a database.

4.1. Simple XML Storage with Large Objects

Any SQL-based DBMS that supports large objects automatically contains basic support for XML document storage and retrieval. The section titled “Large Object Support” in Chapter 24 described how several commercial databases store and retrieve large text documents through character large object (CLOB) or binary large object (BLOB) data types. Many commercial products support documents of up to 4 gigabytes for CLOB or BLOB data, which is adequate for the vast majority of XML documents.

To store XML documents using BLOBs or CLOBs, you would typically define a table that contains one BLOB or CLOB column to contain the document text, and some auxiliary columns (using standard data types) that contain attributes that identify the document. For example, if a table is to store purchase order documents, you might define auxiliary columns to hold the customer number, order date, and purchase order number using INTEGER, VARCHAR, or DATE data, in addition to the CLOB column for the XML document. You can search the table of purchase orders based on customer numbers, order dates, or P.O. numbers, and use the CLOB processing techniques described in Chapter 24 to retrieve or store the XML document.

An advantage of this approach is that it is relatively simple to implement. It also maintains a clean separation between the SQL operations (such as query processing) and the XML operations. A disadvantage is that the level of XML/DBMS integration is fairly weak. In the simplest implementations, a stored XML document is completely opaque to the DBMS; the DBMS knows nothing about its contents. You cannot search for a document based on one of its attributes or its element values, unless that particular attribute or element has been extracted from the XML document and is also represented as a separate column in the table. If you can anticipate in advance which types of searches are likely, this is not a large restriction.

Some object-relational databases provide a more advanced search capability for CLOBs by extending the SQL WHERE clause with full-text search capability. These products allow you to search CLOB columns as text, using the type of text search capabilities typically found in word processors. This provides an expanded, but typically still limited, capability for searching XML documents stored as CLOB columns. Using full-text search, you could, for example, locate every purchase order where the phrase “Type 4 Widgets” occurred. However, it will be difficult or impossible to search for only those XML documents where “Type 4 Widgets” applies in an order item description element. Because the search software doesn’t explicitly know about the structure of XML documents, it will probably also return rows where “Type 4 Widgets” occurred in a comments element or some other element.

4.2. Large Objects and Parsers

When exchanged between applications or stored in a file or in a DBMS CLOB column, XML documents are always in text form. This makes the contents very portable, but unwieldy for computer programs to handle. An XML parser is a piece of computer software that translates XML documents from their text form to a more program-friendly, internal representation. Any SQL-based DBMS that supports XML will have an XML parser as part of its software, for its own use in processing XML. If the DBMS brand supports CLOBs, it can provide further integration with XML by allowing an XML parser to operate directly on the CLOB column contents.

There are two popular types of XML parsers, which support two styles of XML processing:

  • Document Object Model (DOM). DOM parsers transform an XML document into a hierarchical tree structure within a computer’s main memory. A program can than make calls to the DOM API to navigate through the tree, moving up and down or sequentially through the element hierarchy. The DOM API makes the element structure of an XML document easily accessible to programmers and simplifies random access to portions of the document.
  • Simple XPI for XML (SAX). SAX parsers transform an XML document into a series of callbacks to a program, which inform the program of each part of the XML document as it is encountered. A program can be structured to take certain actions when the beginning of a document section is encountered, or when a particular attribute is encountered. The SAX API imposes a more sequential style of processing on a program using it. The API’s callback style matches well with an event-driven program structure.

Either type of XML parser will validate that an XML document is well formed, and can also validate an XML document against a schema, as described in the “XML Schema” section later in this chapter. A DOM parser is practical when the size of the stored XML document is fairly small; it will require double the memory space of the text XML document, because it generates a second, tree-structured representation of the entire document. For very large documents, a SAX parser makes it easy to process documents in small, discrete pieces. However, the fact that the entire document is not available at one time may require a program to make multiple passes through it, if the program needs to process various sections of the document out of sequential order.

4.3. XML Marshalling

Storing XML documents as large objects within a database is an excellent solution for some types of SQL/XML integration. If the XML documents are, for example, text-oriented business documents, or if they are text components of web pages, then there is really very little need for the DBMS to “understand” the internals of the XML documents themselves. Each document can probably be identified by one or more keywords or attributes, which can easily be extracted and stored as conventional columns for searching.

If the XML documents to be processed are really data processing records, however, the simple integration provided by large objects may be too primitive. You will probably want to process and access individual elements, and search based on their contents and attributes. The DBMS already provides these capabilities for its native row/column data. Why can’t the DBMS automatically decompose an incoming XML document, transforming its element contents and attribute values into a corresponding set of internal row/column data for processing? On the outbound side, we have already seen how this approach can work to transform row/column query results into an XML document. The same technique could be used to recompose an XML document if it were once again needed in its external text form.

The challenge of transforming XML documents, which are an excellent external data representation, to and from internal data representations more useful for programs is not unique to database systems. The same problems occur, for example, in Java processing of XML, where it is very desirable to transform an XML document to and from a set of Java class instances for internal processing. The process of decomposing an XML document into its component elements and attributes in some internal, binary representation is called unmarshalling in the XML literature. Conversely, the process of reassembling these individual element and attribute representations into a complete text XML document is called marshalling.

For very simple XML documents, the marshalling and unmarshalling process is straightforward, and commercial DBMS products are moving to support it. Consider once again the simple purchase order document in Figure 25-3. Its elements map directly, one to one, onto individual columns of the ORDERS table. In the simplest case, the names of the elements (or attributes) will be identical to the names of the corresponding columns. The DBMS can receive an inbound XML document like the one in the figure, automatically turn its elements (or attributes, depending on the style used) into column values, using the element names (or attribute names) to drive the process. Reconstituting the XML document from a row of the table is also no problem at all.

The DBMS must do slightly more work if the element names in the XML document don’t precisely match column names. In this case, some kind of mapping between element names (or attribute names) and column names must be specified. It’s relatively straightforward to put such a mapping into the DBMS system catalog.

Many useful real-world XML documents do not map neatly into single rows of a table. Figure 25-4 shows a simple extension of the purchase order XML document from Figure 25-3, which supports the typical real-world requirement that a purchase order may contain multiple line items. How should this XML document be unmarshalled into the sample database? One solution is to make each line item from the purchase order into a separate row of the ORDERS table. (Ignore for the moment that each row in the ORDERS table must contain a unique order number because the order number is the primary key.) This would result in some duplication of data, since the same order number, order date, customer number, and salesperson number will appear in several rows. It would also make marshalling the data to reconstitute the document more complex—the DBMS would have to know that all of the rows with the same order number should be marshaled into one purchase order XML document with multiple line items. Clearly, the marshalling/unmarshalling of even this simple document requires a more complex mapping.

The multiline purchase order merely scratches the surface of marshalling and unmarshalling XML documents. The more general situation is shown in Figure 25-5, where the DBMS must unmarshal an XML document into multiple rows of multiple, interrelated tables. To marshal the document, the DBMS must exercise the relationships between the tables to find the related rows and recompose the XML hierarchy. The underlying reason for this complexity is the mismatch between XML’s natural hierarchical structure and the flat, normalized, row/column structure of a relational database.

Marshalling and unmarshalling is both simplified and made more complex if a DBMS supports object-relational extensions, such as structured data types. The translation to and from XML can be simpler because individual columns of a table can now have their own hierarchical structure. A higher-level XML element (such as a billing address composed of street, city, state, country, and postal code elements) can be mapped into a corresponding column with an abstract ADDRESS data type, with its own internal hierarchy. However, the translation to and from XML now involves more decisions in the database design, trading off the marshalling/unmarshalling simplicity of structured data types against the flexibility of a flattened row/column approach.

Several commercial products are beginning to offer marshalling/unmarshalling capabilities, or have announced plans to provide this capability in future releases. The performance overhead of this translation can be very substantial, and it remains to be seen how popular these capabilities will be in practice. However, if an application is handling external data in XML form, the translation between XML and SQL data must occur at some point, and translation within the DBMS itself may be the most efficient approach.

Source: Liang Y. Daniel (2013), Introduction to programming with SQL, Pearson; 3rd edition.

Leave a Reply

Your email address will not be published. Required fields are marked *