Distributing SQL Data: Practical Approaches

Because of the formidable obstacles to realizing the ideal distributed database, DBMS vendors have taken a step-by-step approach to databases and networking. They have focused on specific forms of network database access, data distribution, and distributed data management that are appropriate for particular application scenarios. For example, a DBMS vendor may first provide tools to rapidly extract a subset of data from a master database and send it across a network for loading into a slave database. Later, the vendor may enhance the tool to track updates to the master database since the last extract, and to extract and transmit only the changes to the master database.

A subsequent version of the tool may automate the entire process, providing a graphical user interface for specifying the data to be extracted and scripts to automate the periodic extract process. Similarly, a DBMS may provide initial support for distributed queries by allowing a user on one system to query a database located on another system. In subsequent releases, the DBMS may allow the remote query as a subquery within a query that accesses local database tables. Still later, the DBMS may allow distributed queries that more freely intermix data from local and remote databases.

1. Remote Database Access

One of the simplest approaches to managing data stored in multiple locations is remote data access. With this capability, a user of one database is given the ability to reach out across a network and retrieve information from a different database. In its simplest form, this may involve carrying out a single query against the remote database, as shown in Figure 23-2. It may also involve performing an INSERT, UPDATE, or DELETE statement to modify the remote database contents. This type of requirement often arises when the local database is a satellite database (such as a database in a local sales office or distribution center) and the remote database is a central, corporate database.

In addition to the remote data access request, Figure 23-2 also shows a client/server request to the remote database from a (different) PC user. Note that, from the standpoint of the remote database, there is very little difference between processing the request from the PC client and processing the remote database access request. In both cases, a SQL request arrives across the network and the remote database determines that the user making the request has appropriate privileges and then carries it out. And, in both cases, the status of the SQL processing is reported back across the network.

The local database in Figure 23-2, however, must do work quite different from its normal processing of local database requests. There are several complications for the local DBMS:

  • It must determine which remote database the user wants to access, and how it can be accessed on the network.
  • It must establish a connection to the remote database for carrying out remote requests.
  • It must determine how the local user authentication and privilege scheme maps to the remote database. That is, does it simply pass the user name/password supplied for local database access to the remote database, or is a different remote user name/password supplied, or should some kind of automatic mapping be performed?

In effect, the local DBMS becomes an agent for the user making the remote access request. It becomes a client in a client/server connection to the remote DBMS.

Several of the leading enterprise DBMS vendors offer the kind of remote database access capability shown in Figure 23-2. They differ in the specific way that remote access is presented to the user and to the database administrator. In some cases, they involve extensions to the SQL language accepted by the DBMS. In others, the extra mechanisms for establishing remote access are mostly external to the SQL language.

Sybase offers a simple, entry-level remote database access capability. While connected to a local Sybase installation, the user can issue a SQL CONNECT TO statement, naming a remote server that is known to the local server. For example, if a remote server named CENTRALHOST contains a copy of the sample database, then this statement:

CONNECT TO CENTRALHOST

makes that remote server the current server for the session. The local server in effect enters a passthrough mode, sending all SQL statements to the remote server. The remote database can now be processed directly over the connection, with standard, unmodified queries and data manipulation statements:

Get the names and sales numbers of all salespeople who are already over quota.

SELECT NAME, QUOTA, SALES
  FROM SALESREPS
 WHERE SALES > QUOTA

When the remote access is completed, a companion SQL statement:

DISCONNECT

ends the passthrough mode, and the local server once again becomes the current server. Except for the CONNECT/DISCONNECT statement pair, the mechanism for managing remote access is external to the SQL language. The database administrator tells the local database about the existence, locations, and names of remote servers through the sp_addserver() and sp_dropserver() system stored procedures.

The current local user name and password are used by default for access to the remote server. Alternatively, the database administrator can specify a proxy user name/password that is used for remote server access, again through system stored procedures. Sybase offers other, more complex distributed database capabilities, but this basic capability has the advantage of maximum simplicity.
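For illustration, the administrator's setup behind the CONNECT TO example above might look like this (a sketch only; procedure parameters vary by Sybase version, and the proxy login and password shown are illustrative assumptions):

-- Tell the local server about the remote server named CENTRALHOST
EXEC sp_addserver CENTRALHOST

-- Map the local login JOE to a proxy login on CENTRALHOST
EXEC sp_addexternlogin CENTRALHOST, JOE, REMOTE_JOE, 'secret'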

Oracle takes a somewhat different approach to remote database access, but one that is similar to the capabilities provided by other DBMS brands. It requires that Oracle’s SQL*Net networking software be installed along with the Oracle DBMS on both the local and the remote system. The database administrator is responsible for establishing one or more named database links from the local database to remote databases. Each database link specifies:

  • Network location of the target remote computer system
  • Communications protocol to use
  • Name of the Oracle database on the remote server
  • Remote database user name and password

All remote database access occurs via a database link and is governed by the privileges of the supplied user name in the remote system. The database link thus embodies the answers to the “which database,” “how to communicate,” and “what privileges” questions raised earlier in this section. The database administrator assigns the database link a name. Links can be private (created for use by a specific user of the local system) or public (available for use by multiple users of the local system).
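For concreteness, here is a sketch of how the CENTRALHOST link used in the examples below might be created by the database administrator (the remote user name, password, and connect string are illustrative assumptions; the PUBLIC keyword makes the link available to all local users):

CREATE PUBLIC DATABASE LINK CENTRALHOST
   CONNECT TO JOE IDENTIFIED BY secret
   USING 'central_db'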

To access a remote database over a database link, a user on the local system writes standard SQL statements, appending the database link name to remote table and view names after an at sign (@). For example, assume you are on a local computer system that is connected to a copy of the sample database on a remote system over a database link called CENTRALHOST. This SQL statement retrieves information from the remote SALESREPS table:

Get the names and sales numbers of all salespeople who are already over quota.

SELECT NAME, QUOTA, SALES
  FROM SALESREPS@CENTRALHOST
 WHERE SALES > QUOTA

For remote databases, Oracle supports nearly all of the query capabilities that it provides for the local database. The only restriction is that every remote database entity (table, view, and so on) must be suffixed with the database link name. Here is a two-table join, executed on the remote Oracle database:

Get the names and office cities of all salespeople who are already over quota.

SELECT NAME, CITY, QUOTA, SALES
  FROM SALESREPS@CENTRALHOST, OFFICES@CENTRALHOST
 WHERE SALES > QUOTA
   AND REP_OFFICE = OFFICE

Oracle also supports data definition and database update operations carried out in the remote database. Here is an example:

Create a new remote table of high-credit-limit customer info in the remote database and populate it with data from the CUSTOMERS table.

CREATE TABLE HIGHCUST@CENTRALHOST
   (CUST_NUM INTEGER NOT NULL,
    COMPANY VARCHAR(20) NOT NULL,
    REP_NAME VARCHAR(15))

INSERT INTO HIGHCUST@CENTRALHOST
    SELECT CUST_NUM, COMPANY, NAME
      FROM CUSTOMERS@CENTRALHOST, SALESREPS@CENTRALHOST
     WHERE CREDIT_LIMIT > 50000.00
       AND CUST_REP = EMPL_NUM

Informix Universal Server provides capabilities similar to those offered by Oracle, but uses a different mechanism for identifying remote databases and a different SQL syntax extension. The Informix architecture differentiates between a remote database server and a remote database managed by that server, because an Informix server typically manages multiple, named databases. Suppose an Informix copy of the sample database is called SAMPLE and resides on a remote database server called CENTRALHOST. Then this query is equivalent to the previous Oracle and Sybase examples:

Get the names and sales numbers of all salespeople who are already over quota.

SELECT NAME, QUOTA, SALES
  FROM SAMPLE@CENTRALHOST:SALESREPS
 WHERE SALES > QUOTA

The database name appears at the beginning of the table reference, separated from the table name by a colon. If the database is remote, the server name follows the database name after an at sign (@).

2. Remote Data Transparency

With any of the remote database-naming conventions that extend the usual SQL table and view names, the additional qualifiers can quickly become annoying or confusing. For example, if two tables in the remote database have columns with the same names, any query involving both tables must use qualified column names—and the table name qualifiers now have the remote database qualification as well. Here’s a qualified Informix column name for the NAME column in the remote SALESREPS table owned by the user JOE in a remote database named SAMPLE on the remote Informix server CENTRALHOST:

SAMPLE@CENTRALHOST:JOE.SALESREPS.NAME

A single column reference has grown to half a line of SQL text! For this reason, table aliases are frequently used in SQL statements involving remote database access.

Synonyms and aliases (described in Chapter 16) are also very useful for providing more transparent access to remote databases. Here’s an Informix synonym definition that could be established by a user or a database administrator:

CREATE SYNONYM REMOTE_REPS FOR SAMPLE@CENTRALHOST:JOE.SALESREPS

The equivalent Oracle synonym definition is:

CREATE SYNONYM REMOTE_REPS FOR JOE.SALESREPS@CENTRALHOST

With this synonym in place, the preceding qualified column name becomes simply:

REMOTE_REPS.NAME

Any query referencing the REMOTE_REPS table and its columns is actually a remote database query, but that fact is transparent to the user. In practice, most database installations with frequently accessed remote tables will have a set of synonyms defined for them. Most of the DBMS brands support both public synonyms (available to all users) and private synonyms that are created for a specific user or group of users. With this structure, synonyms can become an additional part of the remote access security mechanism, limited to only those users with a real need for remote access.
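In Oracle, for example, making such a synonym public requires only the PUBLIC keyword (a sketch; creating public synonyms typically requires administrator privileges):

CREATE PUBLIC SYNONYM REMOTE_REPS FOR JOE.SALESREPS@CENTRALHOST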

Several DBMS brands take the synonym capability for transparent database access one step further and permit views in the local database that are defined in terms of remote database tables. Here is an Oracle view definition that creates a view called EAST_REPS in the local database. The view is a subset of information from the remote sample database:

Create a local view defined in terms of two remote tables.

CREATE VIEW EAST_REPS AS
    SELECT EMPL_NUM, NAME, AGE, CITY
      FROM SALESREPS@CENTRALHOST, OFFICES@CENTRALHOST
     WHERE REP_OFFICE = OFFICE
       AND REP_OFFICE BETWEEN 11 AND 19

After this view has been defined, a user can pose queries in terms of the EAST_REPS view, without worrying about database links or remote table names. The view not only provides transparent remote access, but also hides from the user the remote join operation between the OFFICES and SALESREPS tables.
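For example, this query is processed as a remote two-table join, although it looks like a simple single-table query to the user:

Get the names and cities of eastern-region salespeople over 40 years old.

SELECT NAME, CITY
  FROM EAST_REPS
 WHERE AGE > 40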

Transparent access to remote data, provided by views and synonyms, is usually considered a very desirable characteristic. It does have one drawback, however. Because the remote aspect of the database access is now hidden, the network overhead created by the access is also hidden. Therefore, the possibility of a user or programmer inadvertently creating a great deal of network traffic through very large queries is increased. The database administrator must make this trade-off when deciding whether to permit remote transparent synonyms and views.

Transparent remote access also inevitably raises one additional question: since the remote tables now appear as if they are local, can the user pose queries that involve both remote and local tables? That is, can a join cross the database server boundaries and relate information from the remote database and the local database? Even more serious questions are posed when the SQL transaction scheme is considered. If a database permits transparent access to a remote database, then is a user allowed to update a row in the local database and insert a row in the remote database, and then decide to roll back the transaction? Since the remote resources have been made to appear as if they are local, it seems that the answer should be: “Of course—the local and remote databases together should appear as if they were just one local, integrated database.”

In fact, supporting such distributed queries and transactions adds a major new level of complexity (and potentially huge network data transmission overhead) to the remote access. Because of this, although several commercial DBMS systems support distributed queries and transactions, they are not heavily used in practice. These capabilities, and their overhead implications, are more fully discussed later, in the “Distributed Database Access” section. The next section discusses a practical alternative—duplicating data, or database replication—that is much more frequently used in practice.

3. Table Extracts

Remote database access is very convenient for small remote queries and occasional remote database access. If an application requires heavy and frequent access to a remote database, however, the communications overhead of remote database access can become large. Once remote access grows beyond a certain point, it is often more efficient to maintain a local copy of the remote data in the local database. Many of the DBMS vendors provide tools to simplify the process of data extraction and distribution. In its simplest form, the process extracts the contents of a table in a master database, sends it across a network to another system, and loads it into a corresponding replica table in a slave database, as shown in Figure 23-3. In practice, the extract is performed periodically and during off-peak times of database activity.

This approach is very appropriate when the data in the replicated table changes slowly, or when changes to the table naturally occur in a batch. For example, suppose some tables of the sample database, located on a remote central computer system, are to be replicated in a local database. The contents of the OFFICES table hardly ever change. It would be an excellent candidate for replication onto distribution center or sales force automation databases. Once the initial (local) replica tables are set up and populated, they might need to be updated only once per month, or when a new sales office is opened.

The PRODUCTS table is also a good candidate for replication. Product price changes occur more frequently than office changes, but in most companies, they happen in batches, perhaps once a week or once a day. With this natural processing cycle, it would be very effective to extract a table of product price data just after each batch of updates, and send it to the distribution center databases and the sales force automation central database. The price data in these databases does not need to be tightly linked to the mainframe database to ensure that it is fresh. A weekly or daily extract/update cycle will make the data just as current, with substantially less processing workload.

It’s possible to implement this type of replicated-table strategy without any support from the DBMS. You could write an application program that uses SQL on the mainframe to extract the product pricing data into a file. A file transfer program could transmit the file to the distribution centers, where another application program could read its contents and generate the appropriate DROP TABLE, CREATE TABLE, and INSERT statements to populate the replicated table.
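Here is a sketch of the reload script such a program might generate at each distribution center (the table name, column types, and sample row values are all illustrative):

DROP TABLE PRODPRICE

CREATE TABLE PRODPRICE
   (MFR_ID CHAR(3) NOT NULL,
    PRODUCT_ID CHAR(5) NOT NULL,
    PRICE DECIMAL(9,2) NOT NULL)

-- One INSERT statement generated per row of the transferred file
INSERT INTO PRODPRICE VALUES ('ACI', '41001', 55.00)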

The first step toward automating this strategy was the development of high-speed data extract and data loading programs. These utility programs, offered by the DBMS vendors, typically use proprietary, lower-level database access techniques to extract the data and load the data much more rapidly than is possible through SQL SELECT and INSERT statements. More recently, software companies have targeted this area as an opportunity for stand-alone software packages, independent of the DBMS vendors.

This category of software, called Enterprise Application Integration (EAI) software, focuses on linking disparate computer systems, software packages, database systems, and file formats. Linking different DBMS systems is a small part of the total solution offered by these systems, which are extensively customized to meet an individual company’s needs when they are installed. The EAI systems typically offer a graphical user interface for specifying the data extraction, an array of tools for reformatting data between the source and destination systems, a messaging capability for transmitting the data, perhaps a store-and-forward capability for staging extracted data before and after transmission, and utilities for managing and monitoring the overall process.

4. Table Replication

Several DBMS vendors have moved beyond their extract and load utility programs to offer support for table extraction within the DBMS itself. Oracle, for example, offers a snapshot facility to automatically create a local copy of a remote table. In its simplest form, the local table is a read-only replica of the remote master table, which is automatically refreshed by the Oracle DBMS periodically. Here is an Oracle SQL statement to create a local copy of product pricing data, assuming that the remote master database includes a PRODUCTS table like the one in the sample database:

Create a local replica of pricing information from the remote PRODUCTS table.

CREATE SNAPSHOT PRODPRICE
    AS SELECT MFR_ID, PRODUCT_ID, PRICE
         FROM PRODUCTS@REMOTE_LINK

This statement effectively creates a local Oracle table named PRODPRICE. It contains three columns, specified by the SELECT statement against the remote (master) database. The at sign and name REMOTE_LINK in the statement tell Oracle that the PRODUCTS table from which the data is to be replicated is a remote table, accessible via the Oracle database link named REMOTE_LINK. The Oracle database administrator sets up these remote database links as part of the distributed Oracle capabilities that are required to use the snapshot feature. Finally, the CREATE SNAPSHOT statement will actually cause the local PRODPRICE snapshot table to be populated with data from the remote PRODUCTS table.

With this type of read-only snapshot, users are not allowed to change the snapshot table with INSERT, UPDATE, or DELETE statements. All database updates occur in the master (remote) table and are propagated to the replicated (snapshot) table by Oracle. The database administrator can manually refresh the snapshot table as desired. The CREATE SNAPSHOT statement also includes rather comprehensive facilities for specifying automatic refreshes. Here are some examples:

Create a local replica of pricing information from the remote PRODUCTS table. Refresh the data once per week, with a complete reload of the data.

CREATE SNAPSHOT PRODPRICE
    REFRESH COMPLETE START WITH SYSDATE NEXT SYSDATE+7
    AS SELECT MFR_ID, PRODUCT_ID, PRICE
         FROM PRODUCTS@REMOTE_LINK

Create a local replica of pricing information from the remote PRODUCTS table. Refresh the data once per day, sending only changes from the master table.

CREATE SNAPSHOT PRODPRICE
    REFRESH FAST START WITH SYSDATE NEXT SYSDATE+1
    AS SELECT MFR_ID, PRODUCT_ID, PRICE
         FROM PRODUCTS@REMOTE_LINK

In the latter example, the snapshot is refreshed by transmitting only changes from the remote PRODUCTS table. Oracle implements this capability by maintaining a log of changes (a snapshot log) on the remote system and updating the log every time an update to the PRODUCTS table would affect the snapshot replica. When the time for a refresh arrives, the information in the snapshot log is used to propagate only the changed rows to the snapshot.

For applications like this one, where product price changes probably affect only a small percentage of the overall table, this strategy is effective. The additional overhead of maintaining the log for the master table is more than offset by the reduced network traffic of transmitting only changed data. In other applications, where a large percentage of the rows in the master table will be modified between refreshes, it may be more efficient to simply do a complete refresh and eliminate the overhead of maintaining the snapshot log.

By default, Oracle identifies rows (to determine whether they have changed) based on their primary key. If the primary key is not part of the replicated data, Oracle instead uses an internal row-id number to identify the modified rows for refreshes to the snapshot.

The SELECT statement that defines the snapshot table offers a very general capability for data extraction. It can include a WHERE clause to extract only selected rows of the master table:

Create a local replica of pricing information for high-priced products from the remote PRODUCTS table. Refresh the data once per day, sending only changes from the master table.

CREATE SNAPSHOT PRODPRICE
    REFRESH FAST START WITH SYSDATE NEXT SYSDATE+1
    AS SELECT MFR_ID, PRODUCT_ID, PRICE
         FROM PRODUCTS@REMOTE_LINK
        WHERE PRICE > 1000.00

Note that this makes maintaining the snapshot log more complex. Oracle does not need to log every update to the PRODUCTS table, only those that modify rows meeting the search criterion. The snapshot can also be created as a joined table, extracting its data from two or more master tables in the remote database:

Create a local replica of salesperson data, refreshed weekly.

CREATE SNAPSHOT SALESTEAM
    REFRESH FAST START WITH SYSDATE NEXT SYSDATE+7
    AS SELECT NAME, QUOTA, SALES, CITY
         FROM SALESREPS@REMOTE, OFFICES@REMOTE
        WHERE REP_OFFICE = OFFICE

Adding to the complexity, the snapshot can be defined by a grouped query:

Create a local summary of customer order volume, refreshed daily.

CREATE SNAPSHOT CUSTORD
    REFRESH FAST START WITH SYSDATE NEXT SYSDATE+1
    AS SELECT COMPANY, SUM(AMOUNT)
         FROM CUSTOMERS@REMOTE, ORDERS@REMOTE
        WHERE CUST = CUST_NUM
        GROUP BY COMPANY

Of course, with each level of additional complexity, the overhead of managing the snapshot and the replication process increases. Regardless of how simple or complex the definition of the snapshot, however, the overall principles remain the same. Instead of having queries against the replicated data travel across the network to the remote database, the remote data is brought down into the snapshot. The refreshes to the snapshot still generate network traffic, but the day-to-day queries against the snapshot data are carried out locally and do not generate network traffic. For situations where the query workload is much higher than the overhead of maintaining the snapshot, this can be an effective way to improve overall database performance.

5. Updateable Replicas

In the simplest implementations, a table and its replicas have a strict master/slave relationship, as shown in Figure 23-3. The central/master copy contains the real data. It is always up to date, and all updates to the table must occur on this copy of the table. The other, slave copies are populated by periodic updates, managed by the DBMS. Between updates, they may become slightly out of date, but this is an acceptable price to pay for the advantage of having a local copy of the data. Updates to the slave copies are not permitted; if one is attempted, the DBMS returns an error condition.

By default, the Oracle CREATE SNAPSHOT statement creates this type of slave replica of a table. The master/slave relationship is implicit in the Microsoft SQL Server structure for replication. The SQL Server architecture defines the master as the publisher of the data and the slaves as subscribers to the data. In the default configuration, there is a single (updateable) publisher, and there may be multiple (read-only) subscribers. The SQL Server architecture carries this analogy one step further, supporting both push updates (the publisher actively sends the update data to the subscribers) and pull updates (the subscribers have primary responsibility for getting updates from the publisher).
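In SQL Server, this publish/subscribe configuration is established through system stored procedures rather than SQL language extensions. The following is a heavily simplified sketch (the publication, subscriber, and database names are illustrative, and many required parameters are omitted):

-- On the publisher: create a publication and add the PRODUCTS table to it
EXEC sp_addpublication @publication = 'ProductPrices'
EXEC sp_addarticle @publication = 'ProductPrices',
                   @article = 'PRODUCTS',
                   @source_object = 'PRODUCTS'

-- Register a push subscription for one subscriber database
EXEC sp_addsubscription @publication = 'ProductPrices',
                        @subscriber = 'SALESDB1',
                        @destination_db = 'SALES',
                        @subscription_type = 'push'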

There are some applications for which table replication is an excellent technique, but where the master/slave relationship does not apply. For example, applications that demand high availability use replicated tables to maintain identical copies of data on two different computer systems. If one system fails, the other contains current data and can carry on processing. An Internet application may demand very high database access rates, and achieve this scalability by replicating a table many times on different computer systems and then spreading out the workload across the systems. A sales force automation application will probably contain one central CUSTOMER table and hundreds of replicas on laptop systems, and individual salespeople should be able to enter new customers or change customer contact information on the laptop replicas. In these configurations (and others), the most efficient use of the computer resources is achieved if all of the replicas can accept updates to the table, as shown in Figure 23-4.

A replicated table where multiple copies can accept updates creates a new set of data integrity issues. What happens if the same row of the table is updated in two or more replicas? When the DBMS tries to synchronize the replicas, which of the updates should apply, or should neither apply, or both? What happens if a row is deleted from one copy of the table, but updated in another copy?

In DBMS systems that support updateable replicas, these issues are addressed by creating a set of conflict resolution rules that are applied by the replication system. For example, when replication is set up between a central CUSTOMER table and laptop versions of the table, the replication rule may say that changes to the central customer database always win over changes entered on a laptop system. Alternatively, the replication rule might say that the most recent update always wins. In addition to the built-in rules provided by the DBMS itself, the replication definition may include the capability to pass conflicts to a user-written procedure (such as a stored procedure within the database) for selection of the winner and loser replicas.
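As a vendor-neutral sketch of the "most recent update wins" rule, suppose each replica row carries a LAST_MODIFIED timestamp (an assumed column; commercial replication engines track this information internally). An incoming replicated change is applied only if it is newer than the local version of the row:

-- Apply an incoming replicated change only if it is newer than the local row
UPDATE CUSTOMERS
   SET COMPANY = :NEW_COMPANY,
       LAST_MODIFIED = :NEW_MODIFIED
 WHERE CUST_NUM = :NEW_CUST_NUM
   AND LAST_MODIFIED < :NEW_MODIFIED

If the UPDATE affects no rows, the local row is newer (or has been deleted), and the incoming change loses the conflict.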

6. Replication Trade-Offs

Practical replication strategies always involve a trade-off between the desire to keep data as current as possible and the desire to keep network traffic down to a practical level and provide adequate performance. These trade-offs usually involve not just technical considerations, but business practices and policies as well. For example, consider an order-processing application using the sample database, and assume that order processing is distributed across five different call centers that are geographically distributed around the world. Each call center has its own computer system and database. Incoming orders are checked against the PRODUCTS table to be certain that enough inventory is on hand to fill the order. The PRODUCTS table keeps track of product-on-hand quantities for all of the company’s warehouses, worldwide.

Suppose the company’s policy is that the order-processing clerk must be able to absolutely guarantee a customer that products can be shipped within 24 hours of the time an order is accepted. In this case, the PRODUCTS table must contain absolutely up-to-the-minute data, reflecting the inventory impact of orders taken just seconds earlier. There are only two possible designs for the database in this case. There could be a single, central copy of the PRODUCTS table, shared by all users at all five order-processing sites. Alternatively, there could be a fully mirrored copy of the PRODUCTS table at each of the five sites. The fully mirrored solution is almost certainly impractical because the frequent updates to the PRODUCTS table as each order is taken will cause excessive network traffic to keep the five copies of the table in perfect synchronization.

But suppose the company believes it can still maintain adequate customer satisfaction with a policy that is slightly less strict—for example, it promises to notify any customer within 24 hours if the order cannot be filled immediately and give the customer an opportunity to cancel the order. In this case, a replicated PRODUCTS table becomes an excellent solution. Once a day, updates to the PRODUCTS table can be downloaded to the replicated copy at each of the five sites. During the day, orders are verified against the local copy of the PRODUCTS table, but only the local PRODUCTS table is updated. This prevents the company from taking an order for which there was not adequate stock on hand at the beginning of the day, but it does not prevent orders taken at two or three different sites from exceeding the available stock. The next night, when data communications costs are lower than they are during the day, the orders from each site are transmitted to a central site, which processes them against a central copy of the PRODUCTS table. Orders that cannot be filled from inventory are flagged, and a report of them is generated. When processing is complete, the updated PRODUCTS table, along with the problem orders report, is transmitted back to each of the five sites to prepare for the next day’s processing.
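Here is a sketch of the nightly central processing, assuming each site's orders have been gathered into a central staging table named DAILY_ORDERS with the same structure as the ORDERS table (the staging table name is an assumption):

-- Deduct the day's orders from all five sites from central inventory
UPDATE PRODUCTS
   SET QTY_ON_HAND = QTY_ON_HAND -
       (SELECT SUM(QTY)
          FROM DAILY_ORDERS
         WHERE DAILY_ORDERS.MFR = PRODUCTS.MFR_ID
           AND DAILY_ORDERS.PRODUCT = PRODUCTS.PRODUCT_ID)
 WHERE EXISTS (SELECT *
                 FROM DAILY_ORDERS
                WHERE DAILY_ORDERS.MFR = PRODUCTS.MFR_ID
                  AND DAILY_ORDERS.PRODUCT = PRODUCTS.PRODUCT_ID)

Rows driven negative by this update identify the products that cannot be shipped from inventory; they would be flagged for the problem orders report.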

Which is the correct architecture for supporting the operation of this global business? As the example shows, it is not so much a database architecture question as a business policy question. The interdependence of computer systems architectures and business operations is one of the reasons why decisions about replication and data distribution inevitably make certain types of business operations easier and others harder.

7. Typical Replication Architectures

In many cases, it’s possible to structure an application that involves replicated data so that conflicts between replica updates are avoided or greatly minimized. The DBMS conflict resolution rules are then applied as a last resort, when a conflict arises despite the design of the application. The next few sections describe some typical replicated table scenarios and the application structure that is often used in each scenario to minimize replication conflicts.

7.1. Horizontal Table Subsets

One efficient way to replicate parts of a table across a network is to divide the table horizontally, placing different rows of the table on different systems. Figure 23-5 shows a simple example where a horizontal table split is useful. In this application, a company operates three distribution centers, each with its own computer system and DBMS to manage an inventory database and order processing. A central database is also maintained for production-planning purposes.

To support this environment, the PRODUCTS table is split horizontally into three parts and expanded to include a LOCATION column that tells where the inventory is located. The central copy of the table contains all of the rows. The rows of the table that describe inventory in each distribution center are replicated in the local database managed by that center’s DBMS.

In this case, most updates to the PRODUCTS table take place at the distribution center itself, as it processes orders. Because distribution center replicas are mutually exclusive (that is, a row from the PRODUCTS table appears in only one distribution center replica), update conflicts are avoided. The replicas in the distribution center can periodically transmit updates to the central database to keep it up to date.
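Expressed in SQL, the defining predicate for each distribution center's slice is a simple restriction on the new LOCATION column (the location value shown is illustrative):

-- The rows replicated to the Chicago distribution center
SELECT MFR_ID, PRODUCT_ID, DESCRIPTION, PRICE, QTY_ON_HAND, LOCATION
  FROM PRODUCTS
 WHERE LOCATION = 'CHICAGO'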

7.2. Vertical Table Subsets

Another efficient way to manage table replication is to divide the table vertically, replicating different columns of the table on different systems. Figure 23-6 shows a simple example of a vertical table split. The SALESREPS table has been expanded to include new columns of personnel information (phone number, marital status, and so on), and its information is needed in two databases—one in the order-processing department and the other in the personnel department. Most of the activity in each department focuses on one or two columns of the table, but many queries and reports use both personnel-related and order-related columns.

To accommodate this application, the SALESREPS table is replicated on both systems, but conceptually, it is split vertically into two parts. The columns of the table that store personnel data (NAME, AGE, HIRE_DATE, PHONE, MARRIED) are owned by the personnel system. It wins any conflicts related to updates on these columns. The other columns (EMPL_NUM, QUOTA, SALES, REP_OFFICE) are owned by the order-processing system. It wins update conflicts related to these columns. Because the entire table is replicated on both systems, either system can be used to generate reports and handle ad hoc inquiries, and all of these can be processed locally. Only updates involve the replication mechanism, generate network traffic, and potentially require conflict resolution.
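Expressed in SQL, the two vertical slices are simple column projections sharing the EMPL_NUM key (PHONE and MARRIED are the expanded personnel columns described above):

-- Columns owned by the personnel system
SELECT EMPL_NUM, NAME, AGE, HIRE_DATE, PHONE, MARRIED
  FROM SALESREPS

-- Columns owned by the order-processing system
SELECT EMPL_NUM, QUOTA, SALES, REP_OFFICE
  FROM SALESREPS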

7.3. Mirrored Tables

When table replication is used to achieve high availability (that is, resistance to computer or database failure), the entire table is typically mirrored, as shown in Figure 23-7. The easiest way to implement this configuration is if one system is the active system and another is a hot standby. In this scheme, all database access normally flows to the active system (System A), which replicates any updates to the standby system (System B). Only in the event of system failure does the database access switch over to the standby system, but it has fresh data because of the replicated table. The disadvantage of this scheme is that it wastes the standby computer system under normal operation. The system must be paid for and maintained, but it doesn’t add any data processing capacity.

For this reason, high-availability systems are often designed to also provide load balancing, as shown in Figure 23-8. In this configuration, some front-end software intercepts DBMS access requests and evenly distributes them between the two (or more) computer systems. Under normal operation, both (all) systems contribute data processing power; none is wasted. Furthermore, it’s conceptually easy to grow the data processing power, simply by adding more computer systems with a copy of the replicated table.

This type of mirrored table approach can be highly effective if the ratio of database queries to database updates is very high (for example, 95 percent read access/5 percent update access). If the percentage of updates is higher, the potential for conflicts and the replication overhead can diminish the effectiveness and scalability of the overall configuration. Efficiency also decreases with each increase in the number of replicated systems, since the replication overhead rises.

One common way to get more efficiency out of a mirrored table configuration like the one in Figure 23-8 is to divide updates to the table based on some rule. For example, if the mirrored table is a customer table, the primary key may be the customer name. The front-end load-balancing software can then be written so that updates for customer names starting with A through M are routed to one system and updates for customer names starting with N through Z are routed to the other system. This eliminates the possibility of update conflicts. Because the table remains fully replicated under this scenario, read access requests can still be distributed randomly between the two systems to balance the workload. This type of approach can be quite effective in achieving scalable database performance with replicated tables. It can be fairly easily extended from a two-way scheme to an N-way scheme, where updates are split among three or more database servers.

Source: Liang Y. Daniel (2013), Introduction to programming with SQL, Pearson; 3rd edition.
