Distributed Database Access with SQL

Over the last several years, research into fully distributed database access has slowly but surely found its way into commercial products. Today, many of the mainstream enterprise database products offer at least some level of transparent distributed database access. As noted earlier in the “Remote Data Transparency” section, the performance implications of distributed database access and updates can be very substantial. Two very similar-looking queries can create massively different amounts of network traffic and overhead. A single query, carried out in a brute force method or an optimized method, can create the same differences, depending on the quality of the optimization done by the DBMS.

Because of these challenges, all of the vendors have taken a step-by-step approach to delivering distributed database access. When IBM first announced its blueprint for distributed data management in its SQL products, it defined a four-stage approach. IBM’s four stages, shown in Table 23-1, provide an excellent framework for understanding distributed data management capabilities and their implications.

The IBM scheme provides a simple model for defining the distributed data access problem: a user of one computer system needs to access data stored on one or more other computer systems. The sophistication of the distributed access increases at each stage. Thus, the capabilities provided by a given DBMS can be described in terms of which stage it has reached. In addition, within each stage, a distinction can be made between read-only access (with the SELECT statement) and update access (with the INSERT, DELETE, and UPDATE statements). A DBMS product often provides read-only capability for a given stage before full update capability is provided.

1. Remote Requests

The first stage of distributed data access, as defined by IBM, is a remote request, shown in Figure 23-9. In this stage, the PC user may issue a SQL statement that queries or updates data in a single remote database. Each individual SQL statement operates as its own transaction, similar to the autocommit mode provided by many interactive SQL programs. The user can issue a sequence of SQL statements for various databases, but the DBMS doesn’t support multistatement transactions.

Remote requests are very useful when a PC user needs to query corporate data. Usually, the required data is located within a single database, such as a database of order-processing or manufacturing data. Using a remote request, the PC program can retrieve the remote data for processing by a PC spreadsheet, graphics program, or desktop publishing package.

The remote request capability is not powerful enough for most transaction­processing applications. For example, consider a PC-based order entry application that accesses a corporate database. To process a new order, the PC program must check inventory levels, add the order to the database, decrease the inventory totals, and adjust customer and sales totals, involving perhaps half a dozen different SQL statements. As explained in Chapter 11, database integrity can be corrupted if these statements do not execute as a single transaction. However, the remote request stage does not support multistatement transactions, so it cannot support this application.

2. Remote Transactions

The second stage of distributed data access, as defined by IBM, is a remote transaction (called a remote unit of work by IBM), shown in Figure 23-10. Remote transactions extend the remote request stage to include multistatement transaction support. The PC user can issue a series of SQL statements that query or update data in a remote database and then commit or roll back the entire series of statements as a single transaction. The DBMS guarantees that the entire transaction will succeed or fail as a unit, as it does for transactions on a local database. However, all of the SQL statements that make up the transaction must reference a single remote database.

Remote transactions open the door for distributed transaction-processing applications. For example, in an order-processing application, a PC-based order entry program can now perform a sequence of queries, updates, and inserts in the inventory database to process a new order. The program ends the statement sequence with a COMMIT or ROLLBACK for the transaction.

Remote transaction capability typically requires a DBMS (or at least transaction­processing logic) on the PC as well as the system where the database is located. The transaction logic of the DBMS must be extended across the network to ensure that the local and remote systems always have the same opinion about whether a transaction has been committed. However, the actual responsibility for maintaining database integrity remains with the remote DBMS.

Remote transaction capability is often the highest level of distributed database access provided by database gateways that link one vendor’s DBMS to other DBMS brands. For example, most of the independent enterprise database vendors (Sybase, Oracle, Informix) provide gateways from their UNIX-based DBMS systems to IBM’s mainframe DB2 implementation. Some gateway products go beyond the bounds of remote transactions, allowing a user to join, in a single query, tables from a local database with tables from a remote database managed by a different brand of DBMS. However, these gateways do not (and cannot, without support from the remote DBMS) provide the underlying transaction logic required to support the higher stages of distributed access as defined by IBM. The gateway can ensure the integrity of the local and remote databases individually, but it cannot guarantee that a transaction will not be committed in one and rolled back in the other.

3. Distributed Transactions

The third stage of distributed data access, as defined by IBM, is a distributed transaction (a distributed unit of work in IBM parlance), shown in Figure 23-11. At this stage, each individual SQL statement still queries or updates a single database on a single remote computer system. However, the sequence of SQL statements within a transaction may access two or more databases located on different systems. When the transaction is committed or rolled back, the DBMS guarantees that all parts of the transaction on all of the systems involved in the transaction will be committed or rolled back. The DBMS specifically guarantees that there will not be a partial transaction, where the transaction is committed on one system and rolled back on another.

Distributed transactions support the development of very sophisticated transaction­processing applications. For example, in the corporate network of Figure 23-1, a PC order-processing application can query the inventory databases on two or three different distribution center servers to check the inventory of a scarce product and then update the databases to commit inventory from multiple locations to a customer’s order. The DBMS ensures that other concurrent orders do not interfere with the remote access of the first transaction.

Distributed transactions are much more difficult to provide than the first two stages of distributed data access. It’s impossible to provide distributed transactions without the active cooperation of the individual DBMS systems involved in the transaction. For this reason, the DBMS brands that support distributed transactions almost always support them only for a homogeneous network of databases, all managed by the same DBMS brand (that is, an all-Oracle or all-Sybase network). A special transaction protocol, called the two-phase commit protocol, is used to implement distributed transactions and ensure that they provide the all-or-nothing requirement of a SQL transaction. The details of this protocol are described later in the section “The Two-Phase Commit Protocol.”

4. Distributed Requests

The final stage of distributed data access in the IBM model is a distributed request, shown in Figure 23-12. At this stage, a single SQL statement may reference tables from two or more databases located on different computer systems. The DBMS is responsible for automatically carrying out the statement across the network. A sequence of distributed request statements can be grouped together as a transaction. As in the previous distributed transaction stage, the DBMS must guarantee the integrity of the distributed transaction on all systems that are involved.

The distributed request stage doesn’t make any new demands on the DBMS transaction-processing logic, because the DBMS already had to support transactions across system boundaries at the previous distributed transaction stage. However, distributed requests pose major new challenges for the DBMS optimization logic. The optimizer must now consider network speed when it evaluates alternate methods for carrying out a SQL statement. If the local DBMS must repeatedly access part of a remote table (for example, when making a join), it may be faster to copy part of the table across the network in one large bulk transfer rather than repeatedly retrieving individual rows across the network.

The relative sizes of the tables on the local and remote system are also relevant optimization factors, as well as the selectivity of any search conditions in the SELECT clause. For some queries, the search conditions may select only one or a few rows on the local system and hundreds of rows on the remote system, so they should be applied locally first. For other queries involving the same tables, the relative selectivity may be reversed, and the remote search condition should be applied. For still other queries, the join condition itself may limit the rows that participate in both the local and remote systems, and it may be most efficient to apply it first. In each case, the cost of the query is not just the cost of the database access but also the cost of shipping the results of intermediate query execution steps back and forth across the network.

The optimizer must also decide which copy of the DBMS should handle statement execution. If most of the tables are on a remote system, it may be a good idea for the remote DBMS on that system to execute the statement. However, that may be a bad choice if the remote system is heavily loaded. Thus, the optimizer’s task is both more complex and much more important in a distributed request.

Ultimately, the goal of the distributed request stage is to make the entire distributed database look like one large database to the user. Ideally, the user would have full access to any table in the distributed database and could use SQL transactions without knowing anything about the physical location of the data. Unfortunately, this ideal scenario would quickly prove impractical in real networks. In a network of any size, the number of tables in the distributed database would quickly become very large, and users would find it impossible to find data of interest. The user-ids of every database in the organization would have to be coordinated to make sure that a given user-id uniquely identified a user in all databases. Database administration would also be very difficult.

In practice, therefore, distributed requests must be implemented selectively. Database administrators must decide which remote tables are to be made visible to local users and which will remain hidden. The cooperating DBMS copies must translate user-ids from one system to another, allowing each database to be administered autonomously while providing security for remote data access. Distributed requests that would consume too many DBMS or network resources must be detected and prohibited before they impact overall DBMS performance.

Because of their complexity, distributed requests are not fully supported by any commercial SQL-based DBMS today, and it will be some time before even a majority of their features are available. One major step toward distributed processing across database brands has been the standardization of a distributed transaction protocol. The XA protocol, originally developed to coordinate among multiple transaction monitors, is being actively applied to distributed database transactions as well. A Java version of the same capability, called Java Transaction Protocol (JTP), provides a distributed transaction interface for Java-based applications and application servers. Today, most commercial DBMS products designed to be used in a network environment support XA and JTA interfaces.

Source: Liang Y. Daniel (2013), Introduction to programming with SQL, Pearson; 3rd edition.

Leave a Reply

Your email address will not be published. Required fields are marked *