Summary Queries in SQL: Grouped Queries (GROUP BY Clause)

The summary queries described thus far are like the totals at the bottom of a report. They condense all of the detailed data in the report into a single, summary row of data. Just as subtotals are useful in printed reports, it’s often convenient to summarize query results at a “subtotal” level. The GROUP BY clause of the SELECT statement provides this capability.

The function of the GROUP BY clause is most easily understood by example. Consider these two queries:

What is the average order size?

SELECT AVG(AMOUNT)

FROM ORDERS

AVG(AMOUNT)

————

\$8,256.37

What is the average order size for each salesperson?

SELECT REP, AVG(AMOUNT)

FROM ORDERS GROUP BY REP

REP   AVG(AMOUNT)

—————-

101   \$8,876.00
102   \$5,694.00

103   \$1,350.00
105   \$7,865.40
106   \$16,479.00
107   \$11,477.33
108   \$8,376.14
109   \$3,552.50
110   \$11,566.00

The first query is a simple summary query like the previous examples in this chapter. The second query produces several summary rows—one row for each group, summarizing the orders taken by a single salesperson. Figure 8-3 shows how the second query works. Conceptually, SQL carries out the query as follows:

1. SQL divides the orders into groups of orders, with one group for each salesperson. Within each group, all of the orders have the same value in the REP
2. For each group, SQL computes the average value of the AMOUNT column for all of the rows in the group and generates a single, summary row of query results. The row contains the value of the REP column for the group and the calculated average order size.

A query that includes the GROUP BY clause is called a grouped query because it groups the data from its source tables and produces a single summary row for each row group. The columns named in the GROUP BY clause are called the grouping columns of the query, because they determine how the rows are divided into groups. Here are some additional examples of grouped queries:

What is the range of assigned quotas in each office?

SELECT REP_OFFICE, MIN(QUOTA), MAX(QUOTA)

FROM SALESREPS

GROUP BY REP_OFFICE

REP_OFFICE MIN(QUOTA) MAX(QUOTA)

———————————-

NULL        NULL          NULL
11     \$275,000.00   \$300,000.00
12     \$200,000.00   \$300,000.00
13     \$350,000.00   \$350,000.00
21     \$350,000.00   \$350,000.00
22     \$300,000.00   \$300,000.00

How many salespeople are assigned to each office?

SELECT REP_OFFICE, COUNT(*)

FROM SALESREPS

GROUP BY REP_OFFICE

REP_OFFICE   COUNT(*)
———– ———
NULL      1
11        2
12        3
13        1
21        2
22        1

How many different customers are served by each salesperson?

SELECT COUNT(DISTINCT CUST_NUM), ‘customers for salesrep’, CUST_REP

FROM CUSTOMERS

GROUP BY CUST_REP

There is an intimate link between the SQL column functions and the GROUP BY clause. Remember that the column functions take a column of data values and produce a single result. When the GROUP BY clause is present, it tells SQL to divide the detailed query results into groups and to apply the column function separately to each group, producing a single result for each group. The following steps show the rules for SQL query processing, expanded once again for grouped queries.

To generate the query results for a SELECT statement:

1. If the statement is a UNION of SELECT statements, apply Steps 2 through 7 to each of the statements to generate their individual query results.
2. Form the product of the tables named in the FROM If the FROM clause names a single table, the product is that table.
3. If there is a WHERE clause, apply its search condition to each row of the product table, retaining those rows for which the search condition is TRUE (and discarding those for which it is FALSE or NULL).
4. If there is a GROUP BY clause, arrange the remaining rows of the product table into row groups, so that the rows in each group have identical values in all of the grouping columns.
5. If there is a HAVING clause, apply its search condition to each row group, retaining those groups for which the search condition is TRUE (and discarding those for which it is FALSE or NULL).
6. For each remaining row (or row group), calculate the value of each item in the select list to produce a single row of query results. For a simple column reference, use the value of the column in the current row (or row group). For a column function, use the current row group as its argument if GROUP BY is specified; otherwise, use the entire set of rows.
7. If SELECT DISTINCT is specified, eliminate any duplicate rows of query results that were produced.
8. If the statement is a UNION of SELECT statements, merge the query results for the individual statements into a single table of query results. Eliminate duplicate rows unless UNION ALL is specified.
9. If there is an ORDER BY clause, sort the query results as specified.

The rows generated by this procedure comprise the query results.

1. Multiple Grouping Columns

SQL can group query results based on the contents of two or more columns. For example, suppose you want to group the orders by salesperson and by customer. This query groups the data based on both criteria:

Calculate the total orders for each customer of each salesperson.

SELECT REP, CUST, SUM(AMOUNT)

FROM ORDERS

GROUP BY REP, CUST

Even with multiple grouping columns, SQL provides only a single level of grouping. The query produces a separate summary row for each salesperson/customer pair. It’s impossible to create groups and subgroups with two levels of subtotals in SQL. The best you can do is sort the data so that the rows of query results appear in the appropriate order. In many SQL implementations, the GROUP BY clause will automatically have the side effect of sorting the data, but you can override this sort with an ORDER BY clause, as shown next:

Calculate the total orders for each customer of each salesperson, sorted by customer, and within each customer by salesperson.

SELECT CUST, REP, SUM(AMOUNT)
FROM ORDERS
GROUP BY CUST, REP
ORDER BY CUST, REP

Note that it’s also impossible to get both detailed and summary query results from a single query. To get detailed query results with subtotals or to get multilevel subtotals, you must write an application program using programmatic SQL and compute the subtotals within the program logic. The original developers of SQL Server addressed this limitation of standard SQL by adding an optional COMPUTE clause to the end of the SELECT statement. The COMPUTE clause calculates subtotals and sub-subtotals, as shown in this example:

Calculate the total orders for each customer of each salesperson, sorted by salesperson, and within each salesperson by customer.

SELECT REP, CUST, AMOUNT

FROM ORDERS

ORDER BY REP, CUST

COMPUTE SUM(AMOUNT) BY REP, CUST

COMPUTE SUM(AMOUNT), AVG(AMOUNT) BY REP

The query produces one row of detailed query results for each row of the ORDERS table, sorted by CUST within REP. In addition, it computes the sum of the orders for each customer/salesperson pair (a low-level subtotal) and computes the sum of the orders and average order size for each salesperson (a high-level subtotal). The query results thus contain a mixture of detail rows and summary rows, which include both subtotals and sub-subtotals.

The COMPUTE clause is very nonstandard, and in fact, it is unique to the Transact- SQL dialect used by SQL Server. Furthermore, it violates the basic principles of relational queries because the results of the SELECT statement are not a table, but a strange combination of different types of rows. Nonetheless, as the example shows, it can be very useful.

2. Restrictions on Grouped Queries

Grouped queries are subject to some rather strict limitations. The grouping columns must be actual columns of the tables named in the FROM clause of the query. You cannot group the rows based on the value of a calculated expression.

There are also restrictions on the items that can appear in the select list of a grouped query. All of the items in the select list must have a single value for each group of rows. Basically, this means that a select item in a grouped query can be:

• A constant
• A column function, which produces a single value summarizing the rows in the group
• A grouping column, which by definition has the same value in every row of the group
• An expression involving combinations of these

In practice, a grouped query will always include both a grouping column and a column function in its select list. If no column function appears, the query can be expressed more simply using SELECT DISTINCT, without GROUP BY. Conversely, if you don’t include a grouping column in the query results, you won’t be able to tell which row of query results came from which group!

Another limitation of grouped queries is that SQL ignores information about primary keys and foreign keys when analyzing the validity of a grouped query. Consider this query:

Calculate the total orders for each salesperson.

SELECT EMPL_NUM, NAME, SUM(AMOUNT)

FROM ORDERS, SALESREPS

WHERE REP = EMPL_NUM

GROUP BY EMPL_NUM

Error: “NAME” not a GROUP BY expression

Given the nature of the data, the query makes perfectly good sense because grouping on the salesperson’s employee number is in effect the same as grouping on the salesperson’s name. More precisely, EMPL_NUM, the grouping column, is the primary key of the SALESREPS table, so the NAME column must be single-valued for each group. Nonetheless, SQL reports an error because the NAME column is not explicitly specified as a grouping column. To correct the problem, you simply include the NAME column as a second (redundant) grouping column:

Calculate the total orders for each salesperson.

SELECT EMPL_NUM, NAME, SUM(AMOUNT)

FROM ORDERS, SALESREPS

WHERE REP = EMPL_NUM

GROUP BY EMPL_NUM, NAME

EMPL_NUM NAME            SUM(AMOUNT)
——— ————– ————

101 Dan Roberts     \$26,628.00
102 Sue Smith       \$22,776.00
103 Paul Cruz       \$2,700.00
106 Sam Clark       \$32,958.00

107 Nancy Angelli   \$34,432.00
108 Larry Fitch     \$58,633.00
109 Mary Jones      \$7,105.00
110 Tom Snyder      \$23,132.00

Of course, if the salesperson’s employee number is not needed in the query results, you can eliminate it entirely from the select list, giving:

Calculate the total orders for each salesperson.

SELECT NAME, SUM(AMOUNT)

FROM ORDERS, SALESREPS

WHERE REP = EMPL_NUM

GROUP BY NAME

NAME           SUM(AMOUNT)
————– ————
Dan Roberts    \$26,628.00
Larry Fitch    \$58,633.00
Mary Jones     \$7,105.00
Nancy Angelli  \$34,432.00
Paul Cruz      \$2,700.00
Sam Clark      \$32,958.00
Sue Smith      \$22,776.00
Tom Snyder     \$23,132.00

3. Null Values in Grouping Columns

A NULL value poses a special problem when it occurs in a grouping column. If the value of the column is unknown, into which group should the row be placed? In the WHERE clause, when two different NULL values are compared, the result is NULL (not TRUE), that is, the two NULL values are not considered to be equal. Applying the same convention to the GROUP BY clause would force SQL to place each row with a NULL grouping column into a separate group by itself.

In practice, this rule proves too unwieldy. Instead, the ANSI/ISO SQL standard considers two NULL values to be equal for purposes of the GROUP BY clause. If two rows have NULLs in the same grouping columns and identical values in all of their non-NULL grouping columns, they are grouped together into the same row group. The
small sample table in Figure 8-4 illustrates the ANSI/ISO handling of NULL values by the GROUP BY clause, as shown in this query:

SELECT HAIR, EYES, COUNT(*)

FROM PEOPLE

GROUP BY HAIR, EYES

Although this behavior of NULLs in grouping columns is clearly specified in the ANSI/ISO standard, it is not implemented in all SQL dialects. It’s a good idea to build a small test table and check the behavior of your DBMS brand before counting on a specific behavior.

Source: Liang Y. Daniel (2013), Introduction to programming with SQL, Pearson; 3rd edition.