Multitable Queries in SQL: The Structure of a Join

For simple joins, it’s fairly easy to write the correct SELECT statement based on an English-language request or to look at a SELECT statement and figure out what it does. When many tables are joined or when the search conditions become complex, however, it becomes very difficult just to look at a SELECT statement and figure out what it means. For this reason, it’s important to define more carefully and just a bit more formally what a join is, what query results are produced by a given SELECT statement, and just a little bit of the theory of relational database operation that underlies joins.

1. Table Multiplication

A join is a special case of a more general combination of data from two tables, known as the Cartesian product (or just the product) of two tables. The product of two tables is another table (the product table), which consists of all possible pairs of rows from the two tables. The columns of the product table are all the columns of the first table, followed by all the columns of the second table. Figure 7-11 shows two small sample tables and their product.

If you specify a two-table query without a WHERE clause, SQL produces the product of the two tables as the query result. For example, this query:

Show all possible combinations of salespeople and cities.

SELECT NAME, CITY

FROM SALESREPS, OFFICES

would produce the product of the SALESREPS and OFFICES tables, showing all possible salesperson/city pairs. There would be 50 rows of query results (5 offices * 10 salespeople = 50 combinations). Notice that the SELECT statement is exactly the same one you would use to join the two tables, without the WHERE clause that compares the matching columns, as follows:

Show all salespeople and the cities where they work.

SELECT NAME, CITY

FROM SALESREPS, OFFICES

WHERE REP_OFFICE = OFFICE

These two queries point out an important relationship between joins and products:

A join between two tables is just the product of the two tables with some of the rows removed. The removed rows are precisely those that do not meet the matching column condition for the join.

Products are important because they are part of the formal definition of how SQL processes a multitable query, described in the next section.

2. Rules for Multitable Query Processing

The steps following the code below restate the rules for SQL query processing originally introduced in Figure 6-14 and expands them to include multitable queries. The rules define the meaning of any multitable SELECT statement by specifying a procedure that always generates the correct set of query results. To see how the procedure works, consider this query:

List the company name and all orders for customer number 2103.

To generate the query results for a SELECT statement:

  1. If the statement is a UNION of SELECT statements, apply Steps 2 through 5 to each of the statements to generate their individual query results.
  2. Form the product of the tables named in the FROM If the FROM clause names a single table, the product is that table.
  3. If there is a WHERE clause, apply its search condition to each row of the product table, retaining those rows for which the search condition is TRUE (and discarding those for which it is FALSE or NULL).
  4. For each remaining row, calculate the value of each item in the select list to produce a single row of query results. For each column reference, use the value of the column in the current row.
  5. If SELECT DISTINCT is specified, eliminate any duplicate rows of query results that were produced.
  6. If the statement is a UNION of SELECT statements, merge the query results for the individual statements into a single table of query results. Eliminate duplicate rows unless UNION ALL is specified.
  7. If there is an ORDER BY clause, sort the query results as specified.

The rows generated by this procedure comprise the query results.

Following the previous steps:

  1. The FROM clause generates all possible combinations of rows from the CUSTOMERS table (21 rows) and the ORDERS table (30 rows), producing a product table of 630 rows.
  2. The WHERE clause selects only those rows of the product table where the customer numbers match (CUST_NUM = CUST) and the customer number is the one specified (CUST_NUM = 2103). Only four rows are selected; the other 626 rows are eliminated.
  3. The SELECT clause extracts the three requested columns (COMPANY, ORDER_ NUM, and ORD_AMOUNT) from each remaining row of the product table to generate four rows of detailed query results.
  4. The ORDER BY clause sorts the four rows on the ORDER_NUM column to generate the final query results.

Obviously no SQL-based DBMS would actually carry out the query this way, but the purpose of the previous definition is not to describe how the query is carried out by a DBMS. Instead, it constitutes a definition of how to figure out exactly what a particular multitable query “means”—that is, the set of query results that it should produce.

Source: Liang Y. Daniel (2013), Introduction to programming with SQL, Pearson; 3rd edition.

Leave a Reply

Your email address will not be published. Required fields are marked *