SQL Server Queries: Join Operator

The previous sections of this chapter demonstrated the use of the SELECT statement to query rows from one table of a database. If the Transact-SQL language supported only such simple SELECT statements, the attachment of two or more tables to retrieve data would not be possible. Consequently, all data of a database would have to be stored in one table. Although the storage of all the data of a database inside one table is possible, it has one main disadvantage— the stored data are highly redundant.

Transact-SQL provides the join operator, which retrieves data from more than one table. This operator is probably the most important operator for relational database systems, because it allows data to be spread over many tables and thus achieves a vital property of database systems—nonredundant data.

NOTE The UNION operator also attaches two or more tables. However, the UNION operator always attaches two or more SELECT statements, while the join operator “joins” two or more tables using just one SELECT. Further, the UNION operator attaches rows of tables, while, as you will see later in this section, the join operator “joins” columns of tables.

The join operator is applied to base tables and views. This chapter discusses joins between base tables, while Chapter 11 discusses joins concerning views.

There are several different forms of the join operator. This section discusses the following fundamental types:

  • Natural join
  • Cartesian product (cross join)
  • Outer join
  • Theta join, self-join, and semi-join

Before explaining different join forms, this section describes the different syntax forms of the join operator.

1. Two Syntax Forms to Implement Joins

To join tables, you can use two different forms:

  • Explicit join syntax (ANSI SQL:1992 join syntax)
  • Implicit join syntax (old-style join syntax)

The ANSI SQL:1992 join syntax was introduced in the SQL92 standard and defines join operations explicitly—that is, using the corresponding name for each type of join operation.

The keywords concerning the explicit definition of join are

  • CROSS JOIN
  • [INNER] JOIN
  • LEFT [OUTER] JOIN
  • RIGHT [OUTER] JOIN
  • FULL [OUTER] JOIN

CROSS JOIN specifies the Cartesian product of two tables. INNER JOIN defines the natural join of two tables, while LEFT OUTER JOIN and RIGHT OUTER JOIN characterize the join operations of the same names, respectively. Finally, FULL OUTER JOIN specifies the union of the right and left outer joins. (All these different join operations are explained in the following sections.)

The implicit join syntax is “old-style” syntax, where each join operation is defined implicitly via the WHERE clause, using the so-called join columns (see the second statement in Example 6.57).

NOTE Use of the explicit join syntax is recommended. This syntax enhances the readability of queries. For this reason, all examples in this chapter concerning the join operation are solved using the explicit syntax forms. In a few introductory examples, you will see the old-style syntax, too.

2. Natural Join

Natural join is best explained through the use of an example, so check out Example 6.57.

NOTE The phrases “natural join” and “equi-join” are often used as synonyms, but there is a slight difference between them. The equi-join operation always has one or more pairs of columns that have identical values in every row. The operation that eliminates such columns from the equi-join is called a natural join.

Example 6.57

Get full details of each employee; that is, besides the employee’s number, first and last names, and corresponding department number, also get the name of his or her department and its location, with duplicate columns displayed.

The following is the explicit join syntax:

USE sample;

SELECT employee.*, department.*

FROM employee INNER JOIN department

ON employee.dept_no = department.dept_no;

The SELECT list in Example 6.57 includes all columns of the employee and department tables. The FROM clause in the SELECT statement specifies the tables that are joined as well as the explicit name of the join form (INNER JOIN). The ON clause is also part of the FROM clause; it specifies the join columns from both tables. The condition employee.dept_no = department.dept_no specifies a join condition, and both columns are said to be join columns. The equivalent solution with the old-style, implicit join syntax is as follows:

USE sample;

SELECT employee.*, department.*

FROM employee, department

WHERE employee.dept_no = department.dept_no;

This syntax has two significant differences from the explicit join syntax: the FROM clause of the query contains the list of tables that are joined, and the corresponding join condition is specified in the WHERE clause using join columns.

The result is

Example 6.57 can be used to show how a join operation works. Note that this is just an illustration of how you can think about the join process; the Database Engine actually has several strategies from which it chooses to implement the join operator. Imagine each row of the employee table combined with each row of the department table. The result of this combination is a table with 7 columns (4 from the table employee and 3 from the table department) and 21 rows (see Table 6-1).

In the second step, all rows from Table 6-1 that do not satisfy the join condition employee. dept_no = department.dept_no are removed. These rows are prefixed in Table 6-1 with the * sign. The rest of the rows represent the result of Example 6.57.

The semantics of the corresponding join columns must be identical. This means both columns must have the same logical meaning. It is not required that the corresponding join columns have the same name (or even an identical type), although this will often be the case.

NOTE It is not possible for a database system to check the logical meaning of a column. (For instance, project number and employee number have nothing in common, although both columns are defined as integers.) Therefore, database systems can check only the data type and the length of string data types. The Database Engine requires that the corresponding join columns have compatible data types, such as INT and SMALLINT.

The sample database contains three pairs of columns in which each column of the pair has the same logical meaning (and they have the same names as well). The employee and department tables can be joined using the columns employee.dept_no and department .dept_no. The join columns of the employee and works_on tables are the columns employee .emp_no and works_on.emp_no. Finally, the project and works_on tables can be joined using the join columns project.project_no and works_on.project_no.

The names of columns in a SELECT statement can be qualified. “Qualifying” a column name means that, to avoid any possible ambiguity about which table the column belongs to, the column name is preceded by its table name (or the alias of the table), separated by a period: table_name.column_name.

In most SELECT statements a column name does not need any qualification, although the use of qualified names is generally recommended for readability. If column names within a SELECT statement are ambiguous (like the columns employee.dept_no and department .dept_no in Example 6.57), the qualified names for the columns must be used.

In a SELECT statement with a join, the WHERE clause can include other conditions in addition to the join condition, as shown in Example 6.58.

Example 6.58

Get full details of all employees who work on the project Gemini.

Explicit join syntax:

USE sample;

SELECT emp_no, project.project_no, job, enter_date, project_name, budget

FROM works_on JOIN project

ON project.project_no = works_on.project_no

WHERE project_name = ‘Gemini’;

Old-style join syntax:

USE sample;

SELECT emp_no, project.project_no, job, enter_date, project_name, budget

FROM works_on, project

WHERE project.project_no = works_on.project_no

AND project_name = ‘Gemini’;

The result is

From this point forward, all examples will be implemented using the explicit join syntax only. Example 6.59 shows another use of the inner join.

Example 6.59

Get the department number for all employees who entered their projects on October 15, 2017:

USE sample;

SELECT dept_no

FROM employee JOIN works_on

ON employee.emp_no = works_on.emp_no

WHERE enter_date = ‘10.15.2017’;

The result is

3. Joining More Than Two Tables

Theoretically, there is no upper limit on the number of tables that can be joined using a SELECT statement. (One join condition always combines two tables!) However, the Database Engine has an implementation restriction: the maximum number of tables that can be joined in a SELECT statement is 64.

Example 6.60 joins three tables of the sample database.

Example 6.60

Get the first and last names of all analysts whose department is located in Seattle:

USE sample;

SELECT emp_fname, emp_lname

FROM works_on JOIN employee ON works_on.emp_no=employee.emp_no

   JOIN department ON employee.dept_no=department.dept_no

AND location = ‘Seattle’

AND job = ‘analyst’;

The result is

The result in Example 6.60 can be obtained only if you join at least three tables: works_on, employee, and department. These tables can be joined using two pairs of join columns:

(works_on.emp_no, employee.emp_no)

(employee.dept_no, department.dept_no)

Example 6.61 uses all four tables from the sample database to obtain the result set.

Example 6.61

Get the names of projects (with redundant duplicates eliminated) being worked on by employees in the Accounting department:

USE sample;

SELECT DISTINCT project_name

FROM project JOIN works_on

ON project.project_no = works_on.project_no

JOIN employee ON works_on.emp_no = employee.emp_no

JOIN department ON employee.dept_no = department.dept_no

WHERE dept_name = ‘Accounting’;

The result is

Notice that when joining three tables, you use two join conditions (linking two tables each) to achieve a natural join. When you join four tables, you use three such join conditions. In general, if you join n tables, you need n – 1 join conditions to avoid a Cartesian product. Of course, using more than n – 1 join conditions, as well as other conditions, is certainly permissible to further reduce the result set.

4. Cartesian Product

The previous section illustrated a possible method of producing a natural join. In the first step of this process, each row of the employee table is combined with each row of the department table. This intermediate result was made by the operation called Cartesian product. Example 6.62 shows the Cartesian product of the tables employee and department.

Example 6.62

USE sample;

SELECT employee.*, department.*

FROM employee CROSS JOIN department;

The result of Example 6.62 is shown in Table 6-1. A Cartesian product combines each row of the first table with each row of the second table. In general, the Cartesian product of two tables such that the first table has n rows and the second table has m rows will produce a result with n times m rows (or n*m). Thus, the result set in Example 6.62 contains 7*3 = 21 rows.

In practice, the use of a Cartesian product is highly unusual. Sometimes users generate the Cartesian product of two tables when they forget to include the join condition in the WHERE clause of the old-style join syntax. In this case, the output does not correspond to the expected result because it contains too many rows. (The existence of many and unexpected rows in the result is a hint that a Cartesian product of two tables, rather than the intended natural join, has been produced.)

5. Outer Join

In the previous examples of natural join, the result set included only rows from one table that have corresponding rows in the other table. Sometimes it is necessary to retrieve, in addition to the matching rows, the unmatched rows from one or both of the tables. Such an operation is called an outer join.

Examples 6.63 and 6.64 show the difference between a natural join and the corresponding outer join. (All examples in this section use the employee_enh table.)

Example 6.63

Get full details of all employees, including the location of their department, who live and work in the same city:

USE sample;

SELECT employee_enh.*, department.location

FROM employee_enh JOIN department

ON domicile = location;

The result is

Example 6.63 uses a natural join to display the result set of rows. If you would like to know all other existing living places of employees, you have to use the (left) outer join. This is called a left outer join because all rows from the table on the left side of the operator are returned, whether or not they have a matching row in the table on the right. In other words, if there are no matching rows in the table on the right side, the outer join will still return a row from the table on the left side, with NULL in each column of the other table (see Example 6.64). The Database Engine uses the operator LEFT OUTER JOIN to specify the left outer join.

A right outer join is similar, but it returns all rows of the table on the right of the symbol. The Database Engine uses the operator RIGHT OUTER JOIN to specify the right outer join.

Example 6.64

Get full details of all employees, including the location of their department, for all cities that are either the living place only or both the living and working place of employees:

USE sample;

SELECT employee_enh.*, department.location

FROM employee_enh LEFT OUTER JOIN department

ON domicile = location;

As you can see, when there is no matching row in the table on the right side (department, in this case), the left outer join still returns the rows from the table on the left side (employee_ enh), and the columns of the other table are populated by NULL values. Example 6.65 shows the use of the right outer join operation.

Example 6.65

Get full details of all departments, as well as all living places of their employees, for all cities that are either the location of a department or the living and working place of an employee:

USE sample;

SELECT employee_enh.domicile, department.*

FROM employee_enh RIGHT OUTER JOIN department

ON domicile =location;

The result is

In addition to the left and right outer joins, there is also the full outer join, which is defined as the union of the left and right outer joins. In other words, all rows from both tables are represented in the result set. If there is no corresponding row in one of the tables, its columns are returned with NULL values. This operation is specified using the FULL OUTER JOIN operator.

Every outer join operation can be simulated using the UNION operator plus the NOT EXISTS function. Example 6.66 is equivalent to the example with the left outer join (Example 6.64).

Example 6.66

Get full details of all employees, including the location of their department, for all cities that are either the living place only or both the living and working place of employees:

USE sample;

SELECT employee_enh.*, department.location

FROM employee_enh JOIN department

ON domicile = location

UNION

SELECT employee_enh.*, ‘NULL’

FROM employee_enh

WHERE NOT EXISTS

(SELECT *

FROM department

WHERE location = domicile);

The first SELECT statement in the union specifies the natural join of the tables employee_enh and department with the join columns domicile and location. This SELECT statement retrieves all cities that are at the same time the living places and working places of each employee. The second SELECT statement in the union retrieves, additionally, all rows from the employee_enh table that do not match the condition in the natural join.

6. Further Forms of Join Operations

The preceding sections discussed the most important join forms. This section shows you three other forms:

  • Theta join
  • Self-join
  • Semi-join

The following subsections describe these forms.

6.1. Theta Join

Join columns need not be compared using the equality sign. A join operation using a general join condition is called a theta join. Example 6.67, which uses the employee_enh table, shows the theta join operation.

Example 6.67

Get all the combinations of employee information and department information where the domicile of an employee alphabetically precedes any location of departments.

USE sample;

SELECT emp_fname, emp_lname, domicile, location

FROM employee_enh JOIN department

ON domicile < location;

The result is

In Example 6.67, the corresponding values of columns domicile and location are compared. In every resulting row, the value of the domicile column is ordered alphabetically before the corresponding value of the location column.

6.2. Self-Join, or Joining a Table with Itself

In addition to joining two or more different tables, a natural join operation can also be applied to a single table. In this case, the table is joined with itself, whereby a single column of the table is compared with itself. The comparison of a column with itself means that the table name appears twice in the FROM clause of a SELECT statement. Therefore, you need to be able to reference the name of the same table twice. This can be accomplished using at least one alias name. The same is true for the column names in the join condition of a SELECT statement. In order to distinguish both column names, you use the qualified names. Example 6.68 joins the department table with itself.

Example 6.68

Get full details of all departments located at the same location as at least one other department:

USE sample;

SELECT t1.dept_no, t1.dept_name, tl.location

FROM department t1 JOIN department t2

ON t1.location = t2.location

WHERE t1.dept_no <> t2.dept_no;

The result is

The FROM clause in Example 6.68 contains two aliases for the department table: t1 and t2. The first condition in the WHERE clause specifies the join columns, while the second condition eliminates unnecessary duplicates by making certain that each department is compared with different departments.

6.3. Semi-Join

The semi-join is similar to the natural join, but the result of the semi-join is only the set of all rows from one table where one or more matches are found in the second table. Example 6.69 shows the semi-join operation.

Example 6.69

USE sample;

SELECT emp_no, emp_lname, e.dept_no

FROM employee e JOIN department d

ON e.dept_no = d.dept_no

WHERE location = ‘Dallas’;

The result is

As can be seen from Example 6.69, the SELECT list of the semi-join contains only columns from the employee table. This is exactly what characterizes the semi-join operation. This operation is usually used in distributed query processing to minimize data transfer. The Database Engine uses the semi-join operation to implement the feature called star join (see Chapter 26).

Source: Petkovic Dusan (2020), Microsoft SQL Server 2019: A Beginner’s Guide, Seventh Edition-McGraw-Hill Education.

Leave a Reply

Your email address will not be published. Required fields are marked *