Summary Queries in SQL: Column Functions

SQL lets you summarize data from the database through a set of column functions. A SQL column function takes an entire column of data as its argument and produces a single data item that summarizes the column. For example, theAVG() column function takes a column of data and computes its average. Here is a query that uses the AVG() column function to compute the average value of two columns from the SALESREPS table:

What are the average quota and average sales of our salespeople?

SELECT AVG(QUOTA), AVG(SALES)

FROM SALESREPS

AVG(QUOTA)     AVG(SALES)

————————

$300,000.00   $289,353.20

Figure 8-1 graphically shows how the query results are produced. The first column function in the query takes values in the QUOTA column and computes their average; the second one averages the values in the SALES column. The query produces a single row of query results summarizing the data in the SALESREPS table.

SQL offers six different column functions, as shown in Figure 8-2. The column functions offer different kinds of summary data:

  • SUM() computes the total of a column.
  • AVG() computes the average value in a column.
  • MIN() finds the smallest value in a column.
  • MAX() finds the largest value in a column.
  • COUNT() counts the number of values in a column.
  • COUNT(*) counts rows of query results.

The argument to a column function can be a simple column name, as in the previous example, or it can be a SQL expression, as shown here:

What is the average quota performance of our salespeople?

SELECT AVG(100 * (SALES/QUOTA))

FROM SALESREPS

AVG(100*(SALES/QUOTA))

————————

102.60

To process this query, SQL constructs a temporary column containing the value of the expression (10 0 * (SALES/QUOTA)) for each row of the SALESREPS table and then computes the averages of the temporary column.

1. Computing a Column Total (sum)

The SUM() column function computes the sum of a column of data values. The data in the column must have a numeric type (integer, decimal, floating point, or money). The result of the SUM() function has the same basic data type as the data in the column, but the result may have a higher precision. For example, if you apply the SUM () function to a column of 16-bit integers, it may produce a 32-bit integer as its result.

Here are some examples that use the SUM() column function:

What are the total quotas and sales for all salespeople?

SELECT SUM(QUOTA), SUM(SALES)

FROM SALESREPS

SUM(QUOTA)         SUM(SALES)

——————————

$2,700,000.00    $2,893,532.00

What is the total of the orders taken by Bill Adams?

SELECT SUM(AMOUNT)

FROM ORDERS, SALESREPS

WHERE NAME = ‘Bill Adams

AND REP = EMPL_NUM

SUM(AMOUNT)

———–

$39,327.00

2. Computing a Column Average (avg)

The AVG() column function computes the average of a column of data values. As with the SUM() function, the data in the column must have a numeric type. Because the AVG() function adds the values in the column and then divides by the number of values, its result may have a different data type than that of the values in the column. For example, if you apply the AVG () function to a column of integers, the result will be either a decimal or a floating point number, depending on the brand of DBMS you are using.

Here are some examples of the AVG() column function:

Calculate the average price of products from manufacturer ACI.

SELECT AVG(PRICE)

FROM PRODUCTS

WHERE MFR_ID = ‘ACI’

AVG(PRICE)

———-

$804.29

Calculate the average size of an order placed by Acme Mfg. (customer number 2103).

SELECT AVG(AMOUNT)

FROM ORDERS WHERE CUST = 2103

AVG(AMOUNT)

———

$8,895.50

3. Finding Extreme Values (min and max)

The MIN () and MAX () column functions find the smallest and largest values in a column, respectively. The data in the column can contain numeric, string, or date/time information.

The result of the MIN() or MAX() function has exactly the same data type as the data in the column.

Here are some examples that show the use of these column functions:

What are the smallest and largest assigned quotas?

SELECT MIN(QUOTA), MAX(QUOTA)

FROM SALESREPS

MIN(QUOTA)     MAX(QUOTA)

————————

$200,000.00   $350,000.00

What is the earliest order date in the database?

SELECT MIN(ORDER_DATE)

FROM ORDERS

MIN(ORDER_DATE)

————-

04-JAN-89

What is the best sales performance of any salesperson?

SELECT MAX(100 * (SALES/QUOTA))

FROM SALESREPS

MAX(100*(SALES/QUOTA))

———————-

135.44

When the MIN() and MAX() column functions are applied to numeric data, SQL compares the numbers in algebraic order (large negative numbers are less than small negative numbers, which are less than zero, which is less than all positive numbers). Dates are compared sequentially. (Earlier dates are smaller than later ones.) Durations are compared based on their length. (Shorter durations are smaller than longer ones.)

When using MIN() and MAX() with string data, the comparison of two strings depends on the character set being used. On a personal computer or minicomputer, both of which use the ASCII character set, digits come before the letters in the sorting sequence, and all of the uppercase characters come before all of the lowercase characters. On IBM mainframes, which use the EBCDIC character set, the lowercase characters precede the uppercase characters, and digits come after the letters. Here is a comparison of the ASCII and EBCDIC collating sequences of a list of strings, from smallest to largest:

The difference in the collating sequences means that a query with an ORDER BY clause can produce different results on two different systems.

International characters pose additional problems (for example, accented characters in French, German, Spanish, or Italian, or the Cyrillic alphabet letters used in Greek or Russian, or the Kanji symbols used in Japanese). Some brands of DBMS use special international sorting algorithms to sort these characters into their correct position for each language. Others simply sort them according to the numeric value of the code assigned to the character. To address these issues, the SQL2 standard includes elaborate support for national character sets, user-defined character sets, and alternate collating sequences. Unfortunately, support for these SQL2 features varies widely among popular DBMS products. If your application involves international text, you will want to experiment with your particular DBMS to find out how it handles these characters.

4. Counting Data Values (count)

The COUNT() column function counts the number of data values in a column. The data in the column can be of any type. The COUNT() function always returns an integer, regardless of the data type of the column. Here are some examples of queries that use the COUNT() column function:

How many customers are there?

SELECT COUNT(CUST_NUM)

FROM CUSTOMERS

COUNT(CUST NUM)

—————

21

How many salespeople are over quota?

SELECT COUNT(NAME)

FROM SALESREPS

WHERE SALES > QUOTA

COUNT(NAME)

———–

7

How many orders for more than $25,000 are on the books?

SELECT COUNT(AMOUNT)

FROM ORDERS

WHERE AMOUNT > 25000.00

COUNT(AMOUNT)

————–

   4

Note that the COUNT() function ignores the values of the data items in the column; it simply counts how many data items there are. As a result, it doesn’t really matter which column you specify as the argument of the COUNT() function. The last example could just as well have been written this way:

 SELECT COUNT(ORDER_NUM)

FROM ORDERS

WHERE AMOUNT > 25000.00

COUNT(ORDER NUM)

—————

4

In fact, it’s awkward to think of the query as “counting how many order amounts” or “counting how many order numbers”; it’s much easier to think about “counting how many orders.” For this reason, SQL supports a special COUNT(*) column function, which counts rows rather than data values. Here is the same query, rewritten once again to use the COUNT(*) function:

 SELECT COUNT(*)

FROM ORDERS

WHERE AMOUNT > 25000.00

COUNT(*)

——–

  4

If you think of the COUNT(*) function as a “rowcount” function, it makes the query easier to read. In practice, the COUNT(*) function is almost always used instead of the COUNT() function to count rows.

5. Column Functions in the Select List

Simple queries with a column function in their select list are fairly easy to understand. However, when the select list includes several column functions, or when the argument to a column function is a complex expression, the query can be harder to read and understand. The following steps show the rules for SQL query processing expanded once more to describe how column functions are handled. As before, the rules are intended to provide a precise definition of what a query means, not a description of how the DBMS actually goes about producing the query results.

To generate the query results for a SELECT statement:

  1. If the statement is a UNION of SELECT statements, apply Steps 2 through 5 to each of the statements to generate their individual query results.
  2. Form the product of the tables named in the FROM If the FROM clause names a single table, the product is that table.
  3. If there is a WHERE clause, apply its search condition to each row of the product table, retaining those rows for which the search condition is TRUE (and discarding those for which it is FALSE or NULL).
  4. For each remaining row, calculate the value of each item in the select list to produce a single row of query results. For a simple column reference, use the value of the column in the current row. For a column function, use the entire set of rows as its argument.
  5. If SELECT DISTINCT is specified, eliminate any duplicate rows of query results that were produced.
  6. If the statement is a UNION of SELECT statements, merge the query results for the individual statements into a single table of query results. Eliminate duplicate rows unless UNION ALL is specified.
  7. If there is an ORDER BY clause, sort the query results as specified.

The rows generated by this procedure comprise the query results.

One of the best ways to think about summary queries and column functions is to imagine the query processing broken down into two steps. First, you should imagine how the query would work without the column functions, producing many rows of detailed query results. Then you should imagine SQL applying the column functions to the detailed query results, producing a single summary row. For example, consider the following complex query:

Find the average order amount, total order amount, average order amount as a percentage of the customer’s credit limit, and average order amount as a percentage of the salesperson’s quota.

SELECT AVG(AMOUNT), SUM(AMOUNT), (100 * AVG(AMOUNT/CREDIT_LIMIT)),

(100 * AVG(AMOUNT/QUOTA))

FROM ORDERS, CUSTOMERS, SALESREPS

WHERE CUST = CUST_NUM

AND REP = EMPL_NUM

AVG(AMOUNT) SUM(AMOUNT) (100*AVG(AMOUNT/CREDIT_LIMIT)) (100*AVG(AMOUNT/QUOTA))

—————————————————————————

$8,256.37   $247,691.00                        24.45                     2.51

Without the column functions, it would look like this:

SELECT AMOUNT, AMOUNT, AMOUNT/CREDIT_LIMIT,AMOUNT/QUOTA

FROM ORDERS, CUSTOMERS, SALESREPS

WHERE CUST = CUST_NUM AND

AND REP = EMPL_NUM

and would produce one row of detailed query results for each order. The column functions use the columns of this detailed query results table to generate a single-row table of summary query results.

A column function can appear in the select list anywhere that a column name can appear. It can, for example, be part of an expression that adds or subtracts the values of two column functions. However, the argument of a column function cannot contain another column function, because the resulting expression doesn’t make sense. This rule is sometimes summarized as “it’s illegal to nest column functions.”

It’s also illegal to mix column functions and ordinary column names in a select list, again because the resulting query doesn’t make sense. For example, consider this query:

 SELECT NAME, SUM(SALES)

FROM SALESREPS

The first select item asks SQL to generate a ten-row table of detailed query results—one row for each salesperson. The second select item asks SQL to generate a one-row column of summary query results containing the total of the SALES column. The two SELECT items contradict one another, producing an error. For this reason, either all column references in the select list must appear within the argument of a column function (producing a summary query), or the select list must not contain any column functions (producing a detailed query). Actually, the rule is slightly more complex when grouped queries and subqueries are considered. The necessary refinements are described later in the “Group Search Conditions” section.

6. Null Values and Column Functions

The SUM(), AVG(), MIN(), MAX(), and COUNT() column functions each take a column of data values as their argument and produce a single data value as a result. What happens if one or more of the data values in the column is a NULL value? The ANSI/ISO SQL standard specifies that NULL values in the column are ignored by the column functions.

This query shows how the COUNT() column function ignores any NULL values in a column:

SELECT COUNT(*), COUNT(SALES), COUNT(QUOTA)

FROM SALESREPS

COUNT(*) COUNT(SALES) COUNT(QUOTA)

———————————-

10             10              9

The SALESREPS table contains ten rows, so COUNT(*) returns a count of ten. The SALES column contains ten non-NULL values, so the function COUNT(SALES) also returns a count of ten. The QUOTA column is NULL for the newest salesperson. The COUNT(QUOTA) function ignores this NULL value and returns a count of nine. Because of these anomalies, the COUNT (*) function is almost always used instead of the COUNT () function, unless you specifically want to exclude NULL values in a particular column from the total.

Ignoring NULL values has little impact on the MIN() and MAX() column functions. However, it can cause subtle problems for the SUM() and AVG() column functions, as illustrated by this query:

SELECT SUM(SALES), SUM(QUOTA), (SUM(SALES) – SUM(QUOTA)), SUM(SALES-QUOTA)

FROM SALESREPS

SUM(SALES)       SUM(QUOTA) (SUM(SALES)-SUM(QUOTA)) SUM(SALES-QUOTA)

——————————————————————–

$2,893,532.00   $2,700,000.00        $193,532.00      $117,547.00

You would expect the two expressions:

 (SUM(SALES) – SUM(QUOTA)) and SUM(SALES-QUOTA)

in the select list to produce identical results, but the example shows that they do not. The salesperson with a NULL value in the QUOTA column is again the reason. The expression:

SUM(SALES)

totals the sales for all ten salespeople, while the expression:

SUM(QUOTA)

totals only the nine non-NULL quota values. The expression:

SUM(SALES) – SUM(QUOTA)

computes the difference of these two amounts. However, the column function:

SUM(SALES-QUOTA)

has a non-NULL argument value for only nine of the ten salespeople. In the row with a NULL quota value, the subtraction produces a NULL, which is ignored by the SUM() function. Thus, the sales for the salesperson without a quota, which are included in the previous calculation, are excluded from this calculation.

Which is the “correct” answer? Both are! The first expression calculates exactly what it says: “the sum of SALES, less the sum of QUOTA.” The second expression also calculates exactly what it says: “the sum of (SALES – QUOTA).” When NULL values occur, however, the two calculations are not quite the same.

The ANSI/ISO standard specifies these precise rules for handling NULL values in column functions:

  • If any of the data values in a column are NULL, they are ignored for the purpose of computing the column function’s value.
  • If every data item in the column is NULL, then the SUM(), AVG(), MIN(), and MAX() column functions return a NULL value; the COUNT() function returns a value of zero.
  • If no data items are in the column (that is, the column is empty), then the SUM(), AVG(), MIN(), and MAX() column functions return a NULL value; the COUNT() function returns a value of zero.
  • The COUNT(*) counts rows and does not depend on the presence or absence of NULL values in a column. If there are no rows, it returns a value of zero.

Although the standard is very clear in this area, commercial SQL products may produce results different from the standard, especially if all of the data values in a column are NULL or when a column function is applied to an empty table. Before assuming the behavior specified by the standard, you should test your particular DBMS.

7. Duplicate Row Elimination (distinct)

Recall from Chapter 6 that you can specify the DISTINCT keyword at the beginning of the select list to eliminate duplicate rows of query results. You can also ask SQL to eliminate duplicate values from a column before applying a column function to it. To eliminate duplicate values, the keyword DISTINCT is included before the column function argument, immediately after the opening parenthesis.

Here are two queries that illustrate duplicate row elimination for column functions:

How many different titles are held by salespeople?

SELECT COUNT(DISTINCT TITLE)

FROM SALESREPS

COUNT(DISTINCT TITLE)

——————

    3

How many sales offices have salespeople who are over quota?

SELECT COUNT(DISTINCT REP_OFFICE)

FROM SALESREPS

WHERE SALES > QUOTA

COUNT(DISTINCT REP OFFICE)

————————–

4

The SQL1 standard specified that when the DISTINCT keyword is used, the argument to the column function must be a simple column name; it cannot be an expression. The standard allows the DISTINCT keyword for the SUM () and AVG () column functions. The standard does not permit use of the DISTINCT keyword with the MIN() and MAX() column functions because it has no impact on their results, but many SQL implementations allow it anyway. The standard also requires the DISTINCT keyword for the COUNT() column function, but many SQL implementations permit the use of the COUNT () function without it. DISTINCT cannot be specified for the COUNT ( * ) function, because it doesn’t deal with a column of data values at all—it simply counts rows. The SQL2 standard relaxed these restrictions, allowing DISTINCT to be applied for any of the column functions and permitting expressions as arguments for any of the functions as well.

In addition, the DISTINCT keyword can be specified only once in a query. If it appears in the argument of one column function, it can’t appear in any others. If it is specified before the select list, it can’t appear in any column functions. The only exception is that DISTINCT may be specified a second time inside a subquery (contained within the query). Subqueries are described in Chapter 9.

Source: Liang Y. Daniel (2013), Introduction to programming with SQL, Pearson; 3rd edition.

Leave a Reply

Your email address will not be published. Required fields are marked *