The Challenge of Distributed Data Management with SQL

When the foundations of relational database management and the SQL language were being laid in the 1970s, almost all commercial data processing happened on large, centralized computer systems. The company’s data was stored on mass storage attached to the central system. The business programs that processed transactions and generated reports ran on the central system and accessed the data. Much of the workload of the central system was batch processing. Online users accessed the central system through “dumb” computer terminals with no processing power of their own. The central system formatted information to be displayed for the online user and accepted data typed by the user for processing.

In this environment, the roles of a relational database system and its SQL language were clear and well contained. The DBMS had responsibility for accepting, storing, and retrieving data based on requests expressed in the SQL language. The business-processing logic resided outside the database and was the responsibility of the business programs developed and maintained by the information systems staff. The programs and the DBMS software executed on the same centralized system where the data was stored, so the performance of the system was not affected by external factors like network traffic or outside system failures.

Commercial data processing in a modern corporation has evolved a long way from the centralized environment of the 1970s. Figure 23-1 shows a portion of a computer network that you might find in a manufacturing company, a financial services firm, or a distribution company today. Data is stored on a variety of computer systems in the network:

  • Mainframes. The company’s core data processing applications, such as accounting and payroll, run on an IBM mainframe. The oldest applications, developed and maintained over the last 20 or 30 years, still store their data in hierarchical IMS databases. The company has a strategy to migrate these applications to DB2 over time, and all new mainframe applications development uses DB2 as its database manager.
  • Workstations and UNIX-based servers. The company’s engineering organization uses UNIX-based workstations and servers (from Sun Microsystems) for engineering design, testing, and support. Engineering test results and specifications are stored in an Oracle database. The company also uses Oracle databases running on UNIX-based servers from Hewlett-Packard located in its six distribution centers to manage inventory and to process orders.
  • LAN servers. All of the company’s departments have individual PC LANs to share printers and files. Some of the departments also have local databases to support their work. For example, the personnel department has purchased a human resources management system software package, and it uses SQL Server on Windows NT to store its data. In the financial planning department, the data processing staff has built a custom-written corporate planning application, which uses Informix Universal Server.
  • Desktop personal computers. All of the company’s office workers use personal computers. Many of the administrative assistants and some of the senior managers maintain personal databases using Excel spreadsheets, Microsoft Access, or one of the lightweight DBMS products, such as Oracle Light. In a few cases, the databases are shared with other users, using LAN versions of these products.
  • Mobile laptop PCs. The company recently purchased a sales force automation software package and equipped every salesperson with a laptop PC. The laptop runs sales presentations, sends and receives e-mail, and also holds a local lightweight database (SQL Anywhere from Sybase) with recent product pricing and availability data. The database also captures orders entered by the salesperson. At night, the laptop PC connects to the corporate network over a dialup connection, transmits its orders, and receives updated information for its local copy of the products database.
  • Handheld devices. The company’s management team has widely adopted handheld personal digital assistants (PDAs). In addition to the personal calendar and address-book functions, applications running on the PDA can use wireless network connections to check prices and enter customer orders. The wireless network can also be used to alert users, via their PDAs, of important database changes, such as price updates or product shortages.
  • Internet connections. The company has an Internet web site where customers, dealers, and distributors can find out the latest information about its products and services. At first, this was an information-only web site, but competitors have recently begun accepting customer orders directly via the Internet. One of the corporate IS department’s highest priorities is to respond to this competitive challenge by supporting e-commerce transactions on the company’s web site.

With data spread over many different systems, it's easy to imagine requests that span more than one database, as well as the possibility of conflicting data among the databases:

  • An engineer needs to combine lab test results (on an engineering workstation) with production forecasts (on the mainframe) to choose among three alternative technologies.
  • A financial planner needs to link financial forecasts (in an Informix database) to historical financial data (on the mainframe).
  • A product manager needs to know how much inventory of a particular product is in each distribution center (data stored on six UNIX servers) to plan product obsolescence; a query of this kind is sketched just after this list.
  • Current pricing data needs to be downloaded daily from the mainframe to the distribution center servers, and also to all of the sales force’s laptop computers.
  • Orders need to be uploaded daily from the laptop systems and parceled out to the distribution centers; aggregate order data from the distribution centers must be uploaded to the mainframe so that the manufacturing plan can be adjusted.
  • Salespeople may accept customer orders and make shipment date estimates for popular products based on their local databases, without knowing that other salespeople have made similar commitments. Orders must be reconciled and prioritized, and revised shipment estimates provided to customers.
  • Engineering changes made in the workstation databases may affect product costs and pricing. These changes must be propagated through the mainframe systems and out to the web site, the distribution centers, and the sales force laptops.
  • Managers throughout the company want to query the various shared databases using the PCs on their desktops.
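
The product manager's inventory question gives a feel for what such a request looks like in SQL. The sketch below is illustrative only: it assumes Oracle-style database links (a real Oracle mechanism for naming remote databases), with invented link names and a hypothetical INVENTORY table on each distribution-center server.

```sql
-- Hypothetical inventory roll-up across the six distribution centers;
-- inventory@<link> reaches the INVENTORY table on one remote server.
SELECT 'CHICAGO' AS center, qty_on_hand
  FROM inventory@chicago_dc WHERE product_id = :prod
UNION ALL
SELECT 'DALLAS' AS center, qty_on_hand
  FROM inventory@dallas_dc WHERE product_id = :prod
UNION ALL
SELECT 'ATLANTA' AS center, qty_on_hand
  FROM inventory@atlanta_dc WHERE product_id = :prod;
-- ...and likewise for the three remaining centers.
```

Even this single statement quietly demands connection management, authentication, data conversion, and result merging across seven systems.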

As these examples suggest, effective ways of distributing data, managing distributed data, and providing access to distributed data have become critical as data processing has moved to a distributed computing model. The leading DBMS vendors are committed to delivering distributed database management, and currently offer a variety of products that solve some of the distributed data problems illustrated by these examples. Distributed data management has also been the focus of extensive university and corporate research, and many technical articles have been published about the theory of distributed data management and the trade-offs involved. There is general agreement among the researchers about the ideal characteristics that should be provided by a scheme to manage distributed databases:

  • Location transparency. The user shouldn’t have to worry about where the data is physically located. The DBMS should present all data as if it were local and be responsible for maintaining that illusion.
  • Heterogeneous systems. The DBMS should support data stored on different systems, with different architectures and performance levels, including PCs, workstations, LAN servers, minicomputers, and mainframes.
  • Network transparency. Except for differences in performance, the DBMS should work the same way over different networks, from high-speed LANs to low-speed telephone links.
  • Distributed queries. The user should be able to join data from any of the tables in the (distributed) database, even if the tables are located on different physical systems.
  • Distributed updates. The user should be able to update data in any table for which the user has the necessary privileges, whether that table is on the local system or on a remote system.
  • Distributed transactions. The DBMS should support distributed transactions (using COMMIT and ROLLBACK) across system boundaries, maintaining the integrity of the (distributed) database even in the face of network failures and failures of individual systems. A sketch of such a transaction follows this list.
  • Security. The DBMS must provide a security scheme adequate to protect the entire (distributed) database from unauthorized forms of access.
  • Universal access. The DBMS should provide universal, uniform access to all of the organization’s data.
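
Taken together, the first several of these ideals describe what a distributed transaction should look like from the user's seat. The sketch below assumes the same hypothetical Oracle-style database link, with invented table, order, and product identifiers; under full location transparency, even the @dallas_dc qualifier would disappear.

```sql
-- One transaction spanning two systems: a local ORDERS table and a
-- remote INVENTORY table at a distribution center.
UPDATE orders
   SET status = 'ALLOCATED'
 WHERE order_num = 112963;           -- local update

UPDATE inventory@dallas_dc
   SET qty_on_hand = qty_on_hand - 25
 WHERE product_id = 'ACI-41004';     -- remote update, same transaction

COMMIT;   -- both changes become permanent together, or neither does
```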

No current distributed DBMS product even comes close to meeting this ideal, and it’s unlikely that any product ever will. In practice, formidable obstacles make it difficult to provide even simple forms of distributed database management. These obstacles include:

  • Performance. In a centralized database, the path from the DBMS to the data has an access speed of a few milliseconds and a data transfer rate of millions of characters per second. Even on a fast local area network, access speeds lengthen to hundredths or tenths of a second, and transfer rates can fall to 100,000 characters per second or less. On a modem link, data access may take seconds or minutes, and a few thousand characters per second may be the maximum effective throughput. This vast difference in speeds can dramatically slow the performance of remote data access.
  • Integrity. Distributed transactions require active cooperation by two or more independent copies of the DBMS software, running on different computer systems, if the transactions are to remain all-or-nothing propositions. Special two-phase commit transaction protocols must be used. These protocols generate a great deal of network traffic and lock parts of the participating databases for long periods of time; a sketch of how the two phases can surface in SQL appears after this list.
  • Static SQL. A static embedded SQL statement is compiled and stored in the database as an application plan. When a query combines data from two or more databases, where should its application plan be stored? Must there be two or more cooperating plans? If the structure of one database changes, how are the application plans stored in the other databases invalidated and rebuilt? Using dynamic SQL to sidestep these problems in a networked database environment almost always leads to unacceptably slow application performance, due to network overhead and delays.
  • Optimization. When data is accessed across a network, the normal rules for SQL optimization don’t apply. For example, it may be more efficient to sequentially scan an entire local table than to use an index search on a remote table. The optimization software must know about the network(s) and their speeds. Generally speaking, optimization becomes both more critical and more difficult.
  • Data compatibility. Different computer systems support different data types, and even when two systems offer the same data type, they often use different formats. For example, a Windows PC and a Macintosh store 16-bit integers differently. IBM mainframes store EBCDIC character codes while UNIX-based servers and PCs use ASCII. A distributed DBMS must mask these differences.
  • System catalogs. As a DBMS carries out its tasks, it makes very frequent access to its system catalogs. Where should the catalog be kept in a distributed database? If it is centralized on one system, remote access to it will be slow, bogging down the DBMS. If it is distributed across many different systems, changes must be propagated around the network and synchronized.
  • Mixed-vendor environment. It's highly unlikely that all of an organization's data will be managed by a single brand of DBMS, so distributed database access will cross DBMS brand boundaries. This requires active cooperation between DBMS products from highly competitive vendors, which is an unlikely prospect. As those vendors scramble to extend their products with new features, capabilities, and data types, sustaining a cross-vendor standard becomes even less likely.
  • Distributed deadlocks. When transactions on two different systems each try to access locked data on the other system, a deadlock can occur in the distributed database, even though the deadlock is not visible on either of the two systems. The DBMS must provide global deadlock detection for a distributed database; the scenario is sketched after this list. Again, this requires coordination of processing across a network and will typically lead to unacceptably slow application performance.
  • Recovery. If one of the systems running a distributed DBMS fails, the operator of that system must be able to run its recovery procedures independently of the other systems in the network, and the recovered state of the database must be consistent with that of the other systems.
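
The integrity obstacle is easiest to appreciate where the two phases of the commit protocol actually surface in SQL. PostgreSQL, to pick one concrete example, exposes them through its PREPARE TRANSACTION and COMMIT PREPARED statements; most distributed DBMSs run the same protocol internally. The table, identifiers, and transaction name below are hypothetical:

```sql
-- Issued on EACH system participating in the distributed transaction:
BEGIN;
UPDATE inventory
   SET qty_on_hand = qty_on_hand - 25
 WHERE product_id = 'ACI-41004';
PREPARE TRANSACTION 'order-112963';  -- phase 1: vote to commit; locks stay held

-- Issued by the coordinator only after EVERY participant has prepared:
COMMIT PREPARED 'order-112963';      -- phase 2: make the changes permanent

-- If any participant fails to prepare, the coordinator instead issues:
-- ROLLBACK PREPARED 'order-112963';
```

Note how long the locks taken in phase 1 must be held: until the slowest participant has answered over the network, which is precisely the performance cost described above.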

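The distributed deadlock obstacle can likewise be reproduced in miniature. In this sketch the table names, key values, and database links are again hypothetical; the two sessions run concurrently on different systems:

```sql
-- Step 1, session 1 on system A:
UPDATE parts SET qty_on_hand = qty_on_hand - 1
 WHERE part_id = 101;                 -- locks the row on system A

-- Step 2, session 2 on system B:
UPDATE orders SET quantity = quantity + 1
 WHERE order_num = 112963;            -- locks the row on system B

-- Step 3, session 1 reaches across to system B:
UPDATE orders@system_b SET quantity = quantity + 1
 WHERE order_num = 112963;            -- blocks, waiting for session 2

-- Step 4, session 2 reaches across to system A:
UPDATE parts@system_a SET qty_on_hand = qty_on_hand - 1
 WHERE part_id = 101;                 -- blocks, waiting for session 1
```

Each system sees exactly one waiting transaction and no cycle in its local wait-for graph; only a global wait-for graph spanning both systems reveals the deadlock.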
