Business Intelligence and Transact-SQL: Standard and Nonstandard Analytic Functions

The Database Engine contains the following standard and nonstandard analytic functions:

  • TOP
  • OFFSET/FETCH
  • NTILE
  • PIVOT and UNPIVOT
  • STRING_AGG
  • APPROX_COUNT_DISTINCT

The following sections describe these analytic functions and operators.

1. TOP Clause

The TOP clause specifies the first n rows of the query result that are to be retrieved. This clause should always be used with the ORDER BY clause, because the result of such a query is always well defined and can be used in table expressions. (A table expression specifies a sample of a grouped result table.) A query with TOP but without the ORDER BY clause is nondeterministic, meaning that multiple executions of the query with the same data may not always display the same result set.

Example 24.19 shows the use of this clause.

Example 24.19

Retrieve the four projects with the highest budgets:

USE sample;

SELECT TOP (4) dept_name, budget

FROM project_dept

ORDER BY budget DESC;

The result is

As you can see from Example 24.19, the TOP clause is part of the SELECT list and is written in front of all column names in the list.

NOTE You should write the input value of TOP inside parentheses, because the system supports any self-contained expression as input.

The TOP clause is a nonstandard ANSI SQL implementation used to display the ranking of the top n rows from a table. A query equivalent to Example 24.19 that uses the window construct and the standardized RANK function is shown in Example 24.20.

Example 24.20

Retrieve the four projects with the highest budgets:

USE sample;

SELECT dept_name, budget

FROM (SELECT dept_name, budget,

RANK() OVER (ORDER BY budget DESC) AS rank_budget

FROM project_dept) part_dept

WHERE rank_budget <= 4;

The TOP clause can also be used with the additional PERCENT option. In that case, the first n percent of rows are retrieved from the result set. The additional option WITH TIES specifies that additional rows will be retrieved from the query result if they have the same value in the ORDER BY column(s) as the last row that belongs to the displayed set. Example 24.21 shows the use of the PERCENT and WITH TIES options.

Example 24.21

Retrieve the top 25 percent of rows with the smallest number of employees:

USE sample;

SELECT TOP (25) PERCENT WITH TIES emp_cnt, budget

FROM project_dept

ORDER BY emp_cnt ASC;

The result is

The result of Example 24.21 contains five rows, because there are three projects with five employees.

You can also use the TOP clause with UPDATE, DELETE, and INSERT statements. Example 24.22 shows the use of this clause with the UPDATE statement.

Example 24.22

Find the three projects with the highest budget amounts and reduce them by 10 percent:

USE sample;

UPDATE TOP (3) project_dept

SET budget = budget * 0.9

WHERE budget in (SELECT TOP (3) budget

FROM project_dept

ORDER BY budget desc);

Example 24.23 shows the use of the TOP clause with the DELETE statement.

Example 24.23

Delete the four projects with the smallest budget amounts:

USE sample;

DELETE TOP (4)

FROM project_dept

WHERE budget IN

(SELECT TOP (4) budget FROM project_dept

ORDER BY budget ASC);

In Example 24.23, the TOP clause is used first in the subquery, to find the four projects with the smallest budget amounts, and then in the DELETE statement, to delete these projects.

2. OFFSET/FETCH

Chapter 6 showed how OFFSET/FETCH can be used for server-side paging. This application of OFFSET/FETCH is only one of many. Generally, OFFSET/FETCH allows you to filter several rows according to the given order. Additionally, you can specify how many rows of the result set should be skipped and how many of them should be returned. For this reason, OFFSET/ FETCH is similar to the TOP clause. However, there are certain differences:

  • OFFSET/FETCH is a standardized way to filter data, while the TOP clause is an extension of Transact-SQL. For this reason, it is possible that OFFSET/FETCH will replace the TOP clause in the future.
  • OFFSET/FETCH is more flexible than TOP insofar as it allows skipping of rows using the OFFSET clause. (The Database Engine doesn’t allow you to use the FETCH clause without OFFSET. In other words, even when no rows are skipped, you have to set OFFSET to 0.)
  • The TOP clause is more flexible than OFFSET/FETCH insofar as it can be used in DML statements INSERT, UPDATE, and DELETE (see Examples 24.22 and 24.23).

Examples 24.24 and 24.25 show how you can use OFFSET/FETCH with the ROW_ NUMBER ranking function. (Before you execute the following examples, repopulate the project_dept table. First delete all the rows, and then execute the INSERT statements from Example 24.2.)

Example 24.24

USE sample;

SELECT date_month, budget, ROW_NUMBER()

OVER (ORDER BY date_month DESC, budget DESC) as row_no

FROM project_dept

ORDER BY date_month DESC, budget DESC

OFFSET 5 ROWS FETCH NEXT 4 ROWS ONLY;

The result is

Example 24.24 displays the rows of the project_dept table in relation to the date_month and budget columns. (The first five rows of the result set are skipped and the next four are displayed.) Additionally, the row number of these rows is returned. The row number of the first row in the result set starts with 6 because row numbers are assigned to the result set before the filtering. (OFFSET/FETCH is part of the ORDER BY clause and therefore is executed after the SELECT list, which includes the ROW_NUMBER function. In other words, the values of ROW_NUMBER are determined before OFFSET/FETCH is applied.)

If you want to get the row numbers starting with 1, you need to modify the SELECT statement. Example 24.25 shows the necessary modification.

Example 24.25

USE sample;

SELECT *, ROW_NUMBER()

OVER (ORDER BY date_month DESC, budget DESC) as row_no

FROM (SELECT date_month, budget FROM project_dept

ORDER BY date_month DESC, budget DESC

OFFSET 5 ROWS FETCH NEXT 4 ROWS ONLY) c;

The result of Example 24.25 is identical to the result of Example 24.24 except that row numbers start with 1. The reason is that in Example 24.25 the query with OFFSET/FETCH is written as a table expression inside the outer query with the ROW_NUMBER function in the SELECT list. That way, the values of ROW_NUMBER are determined before OFFSET/FETCH is executed.

3. NTILE Function

The NTILE function belongs to the ranking functions. It distributes the rows in a partition into a specified number of groups. For each row, the NTILE function returns the number of the group to which the row belongs. For this reason, this function is usually used to arrange rows into groups. Example 24.26 shows the use of the NTILE function.

NOTE The NTILE function breaks down the data based only on the count of values.

4. Pivoting Data

Pivoting data is a method that is used to transform data from a state of rows to a state of columns. Additionally, some values from the source table can be aggregated before the target table is created.

There are two operators for pivoting data:

  • PIVOT
  • UNPIVOT

The following subsections describe these operators in detail.

4.1. PIVOT Operator

PIVOT is a nonstandard relational operator that is supported by Transact-SQL. You can use it to manipulate a table-valued expression into another table. PIVOT transforms such an expression by turning the unique values from one column in the expression into multiple columns in the output, and it performs aggregations on any remaining column values that are desired in the final output.

To demonstrate how the PIVOT operator works, let’s use a table called project_dept_pivot, which is derived from the project_dept table specified at the beginning of this chapter.

The new table contains the budget column from the source table and two additional columns: month and year. The year column of the project_dept_pivot table contains the years 2017 and 2018, which appear in the date_month column of the project_dept table. Also, the month columns of the project_dept_pivot table (January, february, and july) contain the summaries of budgets corresponding to these months in the project_dept table.

Example 24.27 creates the project_dept_pivot table.

Example 24.27

USE sample;

SELECT budget, month(date_month) as month, year(date_month) as year

INTO project_dept_pivot

FROM project_dept;

The content of the new table is given in Table 24-2.

Suppose that you get a task to return a row for each year, a column for each month, and the budget value for each year and month intersection. Table 24-3 shows the desired result.

Example 24.28 demonstrates how you can solve this problem using the standard SQL language.

Example 24.28

USE sample;

SELECT year,

SUM(CASE WHEN month = 1 THEN budget END ) AS January,

SUM(CASE WHEN month = 2 THEN budget END ) AS February,

SUM(CASE WHEN month = 7 THEN budget END ) AS July

FROM project_dept_pivot

GROUP BY year;

The process of pivoting data can be divided into three steps:

  1. Group the data. Generate one row in the result set for each distinct “on rows” element. In Example 24.28, the “on rows” element is the year column and it appears in the GROUP BY clause of the SELECT statement.
  2. Manipulate the data. Spread the values that will be aggregated to the columns of the target table. In Example 24.28, the columns of the target table are all distinct values of the month To implement this step, you have to apply a CASE expression for each of the different values of the month column: 1 (January), 2 (February), and 7 (July).
  1. Aggregate the data. Aggregate the data values in each column of the target table. Example 24.28 uses the SUM function for this step.

Example 24.29 solves the same problem as Example 24.28 using the PIVOT operator.

Example 24.29

USE sample;

SELECT year, [1] as January, [2] as February, [7] July FROM

(SELECT budget, year, month from project_dept_pivot) p2

PIVOT (SUM(budget) FOR month IN ([1],[2],[7])) AS P;

The SELECT statement in Example 24.29 contains an inner query, which is embedded in the FROM clause of the outer query. The PIVOT clause is part of the inner query. It starts with the specification of the aggregation function: SUM (of budgets). The second part specifies the pivot column (month) and the values from that column to be used as column headings— the first, second, and seventh months of the year. The value for a particular column in a row is calculated using the specified aggregate function over the rows that match the column heading.

The most important advantage of using the PIVOT operator in relation to the standard solution is its simplicity in the case in which the target table has many columns. In this case, the standard solution is verbose because you have to write one CASE expression for each column in the target table.

4.2. UNPIVOT Operator

The UNPIVOT operator performs the reverse operation of PIVOT, by rotating columns into rows. Example 24.30 shows the use of this operator.

Example 24.30

USE sample;

CREATE TABLE project_dept_pvt (year int, January float, February float, July float);

INSERT INTO project_dept_pvt VALUES (2017, 80000, 110000, 75000);

INSERT INTO project_dept_pvt VALUES (2018, 50000, 80000, 30000);

–UNPIVOT the table

SELECT year, month, budget

FROM

(SELECT year, January, February, July

FROM project_dept_pvt) p

UNPIVOT (budget FOR month IN (January, February, July)

)AS unpvt;

The result is

Example 24.30 uses the project_dept_pvt table to demonstrate the UNPIVOT relational operator. UNPIVOT’s first input is the column name (budget), which holds the normalized values. After that, the FOR option is used to determine the target column name (month). Finally, as part of the IN option, the selected values of the target column name are specified.

NOTE UNPIVOT is not the exact reverse of PIVOT, because any NULL values in the table being transformed cannot be used as column values in the output.

5. STRING_AGG Function

The STRING_AGG function concatenates rows of alphanumerical values into a single string and places a separator between the values. The separator is not added at the end of the string. The syntax of the STRING_AGG function is as follows:

string_agg ( input_string, separator )   [ order_clause ]

input_string is of any type that can be converted to the VARCHAR or NVARCHAR data types. separator is an expression of the NVARCHAR (VARCHAR) data type that is used as a separator for concatenated strings. It can be a literal or a variable. order_clause is optional and specifies the sort order of the concatenated strings using the WITHIN GROUP clause in the following way:

WITHIN GROUP (ORDER BY expression [ ASC | DESC] )

The following two examples show the use of the STRING_AGG function.

Example 24.31 uses the STRING_AGG function to generate lists of e-mail addresses of persons from the person.person table of the AdventureWorks database with “Michael” as the first name, and grouped according to the last name. In other words, e-mail addresses of all Michaels with the same last name are aggregated, separated by a semicolon (;), and displayed in one row.

Example 24.31

USE AdventureWorks;

SELECT p.LastName, STRING_AGG(e.EmailAddress,’;’) Email_list

FROM person.person p JOIN person.EmailAddress e

ON p.BusinessEntityID = e.BusinessEntityID

WHERE p.FirstName = ‘Michael’

GROUP BY LastName;

The first six rows of the result set are as follows:

Example 24.32 is the same as Example 24.31 but with the addition of the WITHIN GROUP clause.

Example 24.32

USE AdventureWorks;

SELECT p.LastName, STRING_AGG(e.EmailAddress,’;’)

WITHIN GROUP (ORDER BY EmailAddress) Email_list

FROM person.person p JOIN person.EmailAddress e

ON p.BusinessEntityID = e.BusinessEntityID

WHERE p.FirstName = ‘Michael’

GROUP BY LastName;

The only difference between Examples 24.31 and 24.32 is that each row with concatenated e-mail addresses in Example 24.32 will be additionally sorted in ascending sort order, which is the default sort order.

6. APPROX_COUNT_DISTINCT Function

As you already know, COUNT DISTINCT can be used to get the unique number of non-null values from a table. Using this function can be very time consuming, especially for larger tables with millions of rows. Starting with SQL Server 2019, Microsoft allows you to avoid this performance bottleneck by using the APPROX_COUNT_DISTINCT function. The function evaluates an expression for each row in a group, and returns the approximate number of unique non-null values in the group. This function is designed to provide aggregations across large data sets where minimizing the response time is important.

The APPROX_COUNT_DISTINCT function is one of the features of the topic called intelligent query processing. For this reason, this function and its performance advantages in relation to COUNT DISTINCT will be described in Chapter 28.

Source: Petkovic Dusan (2020), Microsoft SQL Server 2019: A Beginner’s Guide, Seventh Edition-McGraw-Hill Education.

Leave a Reply

Your email address will not be published. Required fields are marked *