SQL Data Integrity: Referential Integrity

Chapter 4 discussed primary keys, foreign keys, and the parent/child relationships that they create between tables. Figure 11-1 shows the SALESREPS and OFFICES tables and illustrates once again how foreign keys and primary keys work. The OFFICE column is the primary key for the OFFICES table, and it uniquely identifies each row. The REP_OFFICE column, in the SALESREPS table, is a foreign key for the OFFICES table. It identifies the office where each salesperson is assigned.

The REP_OFFICE and OFFICE columns create a parent/child relationship between the OFFICES and SALESREPS rows. Each OFFICES (parent) row has zero or more SALESREPS (child) rows with matching office numbers. Similarly, each SALESREPS (child) row has exactly one OFFICES (parent) row with a matching office number.

Suppose you tried to insert a new row into the SALESREPS table that contained an invalid office number, as in this example:

 INSERT INTO SALESREPS (EMPL_NUM, NAME, REP_OFFICE, AGE,

HIRE_DATE, SALES)

VALUES (115, ‘George Smith’, 31, 37,’01-APR-90′, 0.00)

On the surface, there’s nothing wrong with this INSERT statement. In fact, many SQL implementations will successfully add the row. The database will show that George Smith works in office number 31, even though no office number 31 is listed in the OFFICES table. The newly inserted row clearly breaks the parent/child relationship between the OFFICES and SALESREPS tables. In fact, the office number in the INSERT statement is probably an error—the user may have intended office number 11, 21, or 13.

It seems clear that every legal value in the REP_OFFICE column should be forced to match some value that appears in the OFFICE column. This rule is known as a referential integrity constraint. It ensures the integrity of the parent/child relationships created by foreign keys and primary keys.

Referential integrity has been a key part of the relational model since it was first proposed by Codd. However, referential integrity constraints were not included in IBM’s prototype System/R DBMS, nor in early releases of DB2 or SQL/DS. IBM added referential integrity support to DB2 in 1989, and referential integrity was added to the SQL1 standard after its initial release. Most DBMS vendors today support referential integrity constraints.

1. Referential Integrity Problems

Four types of database updates can corrupt the referential integrity of the parent/child relationships in a database. Using the OFFICES and SALESREPS tables in Figure 11-1 as illustrations, these four update situations are the following:

  • Inserting a new child row. When an INSERT statement adds a new row to the child (SALESREPS) table, its foreign key (REP_OFFICE) value must match one of the primary key (OFFICE) values in the parent table (OFFICES). If the foreign key value does not match any primary key, inserting the row will corrupt the database, because there will be a child without a parent (an orphan). Note that inserting a row in the parent table never poses a problem; it simply becomes a parent without any children.
  • Updating the foreign key in a child row. This is a different form of the previous problem. If the foreign key (REP_OFFICE) is modified by an UPDATE statement, the new value must match a primary key (OFFICE) value in the parent (OFFICES) table. Otherwise, the updated row will be an orphan.
  • Deleting a parent row. If a row of the parent table (OFFICES) that has one or more children (in the SALESREPS table) is deleted, the child rows will become orphans. The foreign key (REP_OFFICE) values in these rows will no longer match any primary key (OFFICE) value in the parent table. Note that deleting a row from the child table never poses a problem; the parent of this row simply has one less child after the deletion.
  • Updating the primary key in a parent row. This is a different form of the previous problem. If the primary key (OFFICE) of a row in the parent table (OFFICES) is modified, all of the current children of that row become orphans because their foreign keys no longer match a primary key value.

The referential integrity features of the ANSI/ISO SQL standard handle each of these four situations. The first problem (INSERT into the child table) is handled by checking the values of the foreign key columns before the INSERT statement is permitted. If they don’t match a primary key value, the INSERT statement is rejected with an error message. In Figure 11-1, this means that before a new salesperson can be added to the SALESREPS table, the office to which the salesperson is assigned must already be in the OFFICES table. As you can see, this restriction makes sense in the sample database.

The second problem (UPDATE of the child table) is similarly handled by checking the updated foreign key value. If there is no matching primary key value, the UPDATE statement is rejected with an error message. In Figure 11-1, this means that before a salesperson can be reassigned to a different office, that office must already be in the OFFICES table. Again, this restriction makes sense in the sample database.

The third problem (DELETE of a parent row) is more complex. For example, suppose you closed the Los Angeles office and wanted to delete the corresponding row from the OFFICES table in Figure 11-1. What should happen to the two child rows in the SALESREPS table that represent the salespeople assigned to the Los Angeles office? Depending on the situation, you might want to:

  • Prevent the office from being deleted until the salespeople are reassigned.
  • Automatically delete the two salespeople from the SALESREPS table as well.
  • Set the REP_OFFICE column for the two salespeople to NULL, indicating that their office assignment is unknown.
  • Set the REP_OFFICE column for the two salespeople to some default value, such as the office number for the headquarters office in New York, indicating that the salespeople are automatically reassigned to that office.

The fourth problem (UPDATE of the primary key in the parent table) has similar complexity. For example, suppose for some reason you wanted to change the number of the Los Angeles office from 21 to 23. As with the previous example, the question is
what should happen to the two child rows in the SALESREPS table that represent salespeople from the Los Angeles office. Again, there are four logical possibilities:

  • Prevent the office number from being changed until the salespeople are reassigned. In this case, you should first add a new row to the OFFICES table with the new office number for Los Angeles, then update the SALESREPS table, and finally delete the old OFFICES row for Los Angeles.
  • Automatically update the office number for the two salespeople in the SALESREPS table, so that their rows are still linked to the Los Angeles row in the OFFICES table, via its new office number.
  • Set the REP_OFFICE column for the two salespeople to NULL, indicating that their office assignment is unknown.
  • Set the REP_OFFICE column for the two salespeople to some default value, such as the office number for the headquarters office in New York, indicating that the salespeople are automatically reassigned to that office.

Although some of these alternatives may seem more logical than others in this particular example, it’s relatively easy to come up with examples where any one of the four possibilities is the right thing to do, if you want the database to accurately model the real-world situation. The SQL1 standard provided only the first possibility for the preceding examples—it prohibited the modification of a primary key value that was in use and prohibited the deletion of a row containing such a primary key. DB2, however, permitted other options through its concept of delete rules. The SQL2 standard has expanded these delete rules into delete and update rules that cover both deleting of parent rows and updating of primary keys.

2. Delete and Update Rules *

For each parent/child relationship created by a foreign key in a database, the SQL2 standard allows you to specify an associated delete rule and an associated update rule. The delete rule tells the DBMS what to do when a user tries to delete a row of the parent table. These four delete rules can be specified:

  • RESTRICT delete rule. The RESTRICT delete rule prevents you from deleting a row from the parent table if the row has any children. A DELETE statement that attempts to delete such a parent row is rejected with an error message. Deletions from the parent table are thus restricted to rows without any children. Applied to Figure 11-1, this rule can be summarized as “You can’t delete an office if any salespeople are assigned to it.”
  • CASCADE delete rule. The CASCADE delete rule tells the DBMS that when a parent row is deleted, all of its child rows should also automatically be deleted from the child table. For Figure 11-1, this rule can be summarized as “Deleting an office automatically deletes all the salespeople assigned to that office.”
  • SET NULL delete rule. The SET NULL delete rule tells the DBMS that when a parent row is deleted, the foreign key values in all of its child rows should automatically be set to NULL. Deletions from the parent table thus cause a “set to NULL” update on selected columns of the child table. For the tables in Figure 11-1, this rule can be summarized as “If an office is deleted, indicate that the current office assignment of its salespeople is unknown.”
  • SET DEFAULT delete rule. The SET DEFAULT delete rule tells the DBMS that when a parent row is deleted, the foreign key values in all of its child rows should automatically be set to the default value for that particular column. Deletions from the parent table thus cause a “set to DEFAULT” update on selected columns of the child table. For the tables in Figure 11-1, this rule can be summarized as “If an office is deleted, indicate that the current office assignment of its salespeople is the default office specified in the definition of the SALESREPS ”

There are some slight differences between the SQL2 and DB2 implementations of the delete rules. The current DB2 implementation does not support the SET DEFAULT rule; it is specified only by the SQL2 standard. The SQL2 standard actually calls the previously described RESTRICT rule, NO ACTION. The SQL2 naming is somewhat confusing. It means “If you try to delete a parent row that still has children, the DBMS will take no action on the row.” The DBMS will, however, generate an error code. Intuitively, the DB2 name for the rule, restrict, seems a better description of the situation—the DBMS will restrict the DELETE operation from taking place and generate an error code.

Recent releases of DB2 support both a RESTRICT and a NO ACTION delete rule. The difference between them is the timing of the enforcement of the rule. The RESTRICT rule is enforced before any other constraints; the NO ACTION rule is enforced after other referential constraints. Under almost all circumstances, the two rules operate identically.

Just as the delete rule tells the DBMS what to do when a user tries to delete a row of the parent table, the update rule tells the DBMS what to do when a user tries to update the value of one of the primary key columns in the parent table. Again, there are four possibilities, paralleling those available for delete rules:

  • RESTRICT update rule. The RESTRICT update rule prevents you from updating the primary key of a row in the parent table if that row has any children. An UPDATE statement that attempts to modify the primary key of such a parent row is rejected with an error message. Changes to primary keys in the parent table are thus restricted to rows without any children. Applied to Figure 11-1, this rule can be summarized as “You can’t change an office number if salespeople are assigned to the office.”
  • CASCADE update rule. The CASCADE update rule tells the DBMS that when a primary key value is changed in a parent row, the corresponding foreign key value in all of its child rows should also automatically be changed in the child table, to match the new primary key. For Figure 11-1, this rule can be summarized as “Changing an office number automatically changes the office number for all the salespeople assigned to that office.”
  • SET NULL update rule. The SET NULL update rule tells the DBMS that when a primary key value in a parent row is updated, the foreign key values in all of its child rows should automatically be set to NULL. Primary key changes in the parent table thus cause a “set to NULL” update on selected columns of the child table. For the tables in Figure 11-1, this rule can be summarized as “If an office number is changed, indicate that the current office assignment of its salespeople is unknown.”
  • SET DEFAULT update rule. The SET DEFAULT update rule tells the DBMS that when a primary key value in a parent row is updated, the foreign key values in all of its child rows should automatically be set to the default value for that particular column. Primary key changes in the parent table thus cause a “set to DEFAULT” update on selected columns of the child table. For the tables in Figure 11-1, this rule can be summarized as “If an office number is changed, automatically change the office assignment of its salespeople to the default office specified in the definition of the SALESREPS ”

The same differences between DB2 and the SQL2 standard described for the delete rules apply to the update rules. The SET DEFAULT update rule is present only in the standard, not in the current DB2 implementation. The RESTRICT update rule is a DB2 naming convention; the SQL2 standard again calls this update rule NO ACTION.

You can specify two different rules as the delete rule and the update rule for a parent/child relationship, although in most cases, the two rules will be the same. If you do not specify a rule, the RESTRICT rule is the default, because it has the least potential for accidental destruction or modification of data. Each of the rules is appropriate in different situations. Usually, the real-world behavior modeled by the database will indicate which rule is appropriate. In the sample database, the ORDERS table contains three foreign key/primary key relationships, as shown in Figure 11-2. These three relationships link each order to:

  • The product that was ordered
  • The customer who placed the order
  • The salesperson who took the order

For each of these relationships, different rules seem appropriate:

  • The relationship between an order and the product that is ordered should probably use the RESTRICT rule for delete and update. It shouldn’t be possible to delete product information from the database if there are still current orders for that product, or to change the product number.
  • The relationship between an order and the customer who placed it should probably use the CASCADE rule for delete and update. You probably will delete a customer row from the database only if the customer is inactive or ends the customer’s relationship with the company. In this case, when you delete the customer, any current orders for that customer should also be deleted. Similarly, changes in a customer number should automatically propagate to orders for that customer.
  • The relationship between an order and the salesperson who took it should probably use the SET NULL rule. If the salesperson leaves the company, any orders taken by that salesperson become the responsibility of an unknown salesperson until they are reassigned. Alternatively, the SET DEFAULT rule could be used to automatically assign these orders to the sales vice president. This relationship should probably use the CASCADE update rule, so that employee number changes automatically propagate to the ORDERS table.

3. Cascaded Deletes and Updates *

The RESTRICT rule for deletes and updates is a single-level rule—it affects only the parent table in a relationship. The CASCADE rule, on the other hand, can be a multilevel rule, as shown in Figure 11-3.

Assume for this discussion that the OFFICES/SALESREPS and SALESREPS/ ORDERS relationships shown in the figure both have CASCADE rules. What happens when you delete Los Angeles from the OFFICES table? The CASCADE rule for the OFFICES/SALESREPS relationship tells the DBMS to automatically delete all of the SALESREPS rows that refer to the Los Angeles office (office number 21) as well. But deleting the SALESREPS row for Sue Smith brings into play the CASCADE rule for the SALESREPS/ORDERS relationship. This rule tells the dBms to automatically delete all of the ORDERS rows that refer to Sue (employee number 102). Deleting an office thus causes cascaded deletion of salespeople, which causes cascaded deletion of orders.

As the example shows, CASCADE delete rules must be specified with care because they can cause widespread automatic deletion of data if they’re used incorrectly. Cascaded update rules can cause similar multilevel updates if the foreign key in the child table is also its primary key. In practice, this is not very common, so cascaded updates typically have less far-reaching effects than cascaded deletes.

The SET NULL and SET DEFAULT update and delete rules are both two-level rules; their impact stops with the child table. Figure 11-4 shows the OFFICES, SALESREPS, and ORDERS tables again, with a SET NULL delete rule for the OFFICES/SALESREPS relationship. This time, when the Los Angeles office is deleted, the SET NULL delete rule tells the DBMS to set the REP_OFFICE column to NULL in the SALESREPS rows that refer to office number 21. The rows remain in the SALESREPS table, however, and the impact of the delete operation extends only to the child table.

4. Referential Cycles *

In the sample database, the SALESREPS table contains the REP_OFFICE column, a foreign key for the OFFICES table. The OFFICES table contains the MGR column, a foreign key for the SALESREPS table. As shown in Figure 11-5, these two relationships form a referential cycle. Any given row of the SALESREPS table refers to a row of the OFFICES table, which refers to a row of the SALESREPS table, and so on. This cycle includes only two tables, but it’s also possible to construct cycles of three or more tables.

Regardless of the number of tables that they involve, referential cycles pose special problems for referential integrity constraints. For example, suppose that NULL values were not allowed in the primary or foreign keys of the two tables in Figure 11-5. (This is not, in fact, the way the sample database is actually defined, for reasons that will become obvious in a moment.) Consider this database update request and the INSERT statements that attempt to implement it:

You have just hired a new salesperson, Ben Adams (employee number 115), who is the manager of a new sales office in Detroit (office number 14).

INSERT INTO SALESREPS (EMPL_NUM, NAME, REP_OFFICE,

HIRE_DATE, SALES)

VALUES (115,’Ben Adams’, 14, ’01-APR-90′, 0.00)

INSERT INTO OFFICES (OFFICE, CITY, REGION, MGR, TARGET, SALES)

VALUES (14,’Detroit’, ‘Eastern’, 115, 0.00, 0.00)

Unfortunately, the first INSERT statement (for Ben Adams) will fail. Why? Because the new row refers to office number 14, which is not yet in the database! Of course, reversing the order of the INSERT statements doesn’t help:

INSERT INTO OFFICES (OFFICE, CITY, REGION, MGR, TARGET, SALES)

VALUES (14,’Detroit’, ‘Eastern’, 115, 0.00, 0.00)

INSERT INTO SALESREPS (EMPL_NUM, NAME, REP_OFFICE,

HIRE_DATE, SALES)

VALUES (115,’Ben Adams’, 14, ’01-APR-90′, 0.00)

The first INSERT statement (for Detroit this time) will still fail, because the new row refers to employee number 115 as the office manager, and Ben Adams is not yet in the database! To prevent this insertion deadlock, at least one of the foreign keys in a referential cycle must permit NULL values. In the actual definition of the sample database, the MGR column does not permit NULLs, but the REP_OFFICE does. The two-row insertion can then be accomplished with two INSERTs and an UPDATE, as shown here:

INSERT INTO SALESREPS (EMPL_NUM, NAME, REP_OFFICE,

HIRE_DATE, SALES)

VALUES (115,’Ben Adams’, NULL, ’01-APR-90′, 0.00)

INSERT INTO OFFICES (OFFICE, CITY, REGION, MGR, TARGET, SALES)

VALUES (14,’Detroit’, ‘Eastern’, 115, 0.00, 0.00)

UPDATE SALESREPS

   SET REP_OFFICE = 14

 WHERE EMPL_NUM = 115

As the example shows, there are times when it would be convenient if the referential integrity constraint were not checked until after a series of interrelated updates are performed. Unfortunately, this type of complex deferred checking is not provided by most current SQL implementations. Some deferred checking capabilities are specified by the SQL2 standard, as described later in the “Deferred Constraint Checking” section.

Referential cycles also restrict the delete and update rules that can be specified for the relationships that form the cycle. Consider the three tables in the referential cycle shown in Figure 11-6. The PETS table shows three pets and the boys they like, the GIRLS table shows three girls and the pets they like, and the BOYS table shows four boys and the girls they like, forming a referential cycle. All three of the relationships in the cycle specify the RESTRICT delete rule. Note that George’s row is the only row you can delete from the three tables. Every other row is the parent in some relationship and is therefore protected from deletion by the RESTRICT rule. Because of this anomaly, you should not specify the RESTRICT rule for all of the relationships in a referential cycle.

The CASCADE rule presents a similar problem, as shown in Figure 11-7. This figure contains exactly the same data as in Figure 11-6, but all three delete rules have been changed to CASCADE. Suppose you try to delete Bob from the BOYS table. The delete rules force the DBMS to delete Rover (who likes Bob) from the PETS table, which forces you to delete Betty (who likes Rover) from the GIRLS table, which forces you to delete Sam (who likes Betty), and so on, until all of the rows in all three tables have been deleted. For these small tables this might be practical, but for a production database with thousands of rows, it would quickly become impossible to keep track of the cascaded deletions and retain the integrity of the database. For this reason, DB2 enforces a rule that prevents referential cycles of two or more tables where all of the delete rules are CASCADE. At least one relationship in the cycle must have a RESTRICT or SET NULL delete rule to break the cycle of cascaded deletions.

5. Foreign Keys and null Values *

Unlike primary keys, foreign keys in a relational database are allowed to contain NULL values. In the sample database, the foreign key REP_OFFICE in the SALESREPS table permits NULL values. In fact, this column does contain a NULL value in Tom Snyder’s row, because Tom has not yet been assigned to an office. But the NULL value poses an interesting question about the referential integrity constraint created by the primary key/foreign key relationship. Does the NULL value match one of the primary key values, or doesn’t it? The answer is “maybe”—it depends on the real value of the missing or unknown data.

The ANSI/ISO SQL1 standard automatically assumes that a foreign key that contains a NULL value satisfies the referential integrity constraint. In other words, it gives the row the benefit of the doubt and allows it to be part of the child table, even though its foreign key value doesn’t match any row in the parent table. Interestingly, the referential integrity constraint is assumed to be satisfied if any part of the foreign key has a NULL value. This can produce unexpected and unintuitive behavior for compound foreign keys, such as the one that links the ORDERS table to the PRODUCTS table.

Suppose for a moment that the ORDERS table in the sample database permitted NULL values for the PRODUCT column, and that the PRODUCTS/ORDERS relationship had a SET NULL delete rule. (This is not the actual structure of the sample database, for the reasons illustrated by this example.) An order for a product with a manufacturer ID (MFR) of ABC and a NULL product ID (PRODUCT) can be successfully inserted into the ORDERS table because of the NULL value in the PRODUCT column. DB2 and the ANSI/ISO standard assume that the row meets the referential integrity constraint for ORDERS and PRODUCTS, even though no product in the PRODUCTS table has a manufacturer ID of ABC.

The SET NULL delete rule can produce a similar effect. Deleting a row from the PRODUCTS table will cause the foreign key value in all of its child rows in the ORDERS table to be set to NULL. Actually, only those columns of the foreign key that accept NULL values are set to NULL. If there were a single row in the PRODUCTS table for manufacturer DEF, deleting that row would cause its child rows in the ORDERS table to have their PRODUCT column set to NULL, but their MFR column would continue to have the value DEF. As a result, the rows would have an MFR value that did not match any row in the PRODUCTS table.

To avoid creating this situation, you should be very careful with NULL values in compound foreign keys. An application that enters or updates data in the table that contains the foreign key should usually enforce an “all NULLs or no NULLs” rule on the columns of the foreign key. Foreign keys that are partially NULL and partially non-NULL can easily create problems.

The SQL2 standard addresses this problem by giving the database administrator more control over the handling of NULL values in foreign keys for integrity constraints. The integrity constraint in the CREATE TABLE statement provides two options:

  • MATCH FULL option. The MATCH FULL option requires that foreign keys in a child table fully match a primary key in the parent table. With this option, no
    part of the foreign key can contain a NULL value, so the issue of NULL value handling in delete and update rules does not arise.
  • MATCH PARTIAL option. The MATCH PARTIAL option allows NULL values in parts of a foreign key, so long as the non-NULL values match the corresponding parts of some primary key in the parent table. With this option, NULL value handling in delete and update rules proceeds as previously described.

Source: Liang Y. Daniel (2013), Introduction to programming with SQL, Pearson; 3rd edition.

Leave a Reply

Your email address will not be published. Required fields are marked *