SQL Server R Services

SQL Server R Services is an add-on component of SQL Server Machine Learning Services and is used for executing R scripts on SQL Server. The compiled R scripts run isolated from Database Engine processes but can access relational data using either stored procedures or SQL scripts containing R statements. That way, the data stored persistently in SQL Server can be manipulated using R functions.

R Services includes a base distribution of R, overlaid with enterprise R packages from Microsoft so that you can load and process a large amount of data in parallel and aggregate it into a single consolidated output. The main application areas of Microsoft’s R functions and algorithms are data visualization, predictive analytics, and statistical modeling.

1. R Language: An Introduction

R was created in the 1990s as an open source alternative to the proprietary language called S. Since then, the R language has been used by scientists, statisticians, and (more recently) data scientists as a convenient environment for exploratory data analysis. One of the most important properties of the R language is that it enables you to create and package entire scripts as libraries. That way, you can get more consistent and reliable results than when working with systems that require a lot of manual interaction with a graphical user interface. (Note that R packages can include datasets, too.)

R is best used to manipulate moderately sized datasets, to do statistical analysis, and to produce data-centric output that can be visualized using different forms of GUIs. Its popularity is based on its expressiveness and the huge collection of third-party libraries created for it.

On the other hand, R is not considered as a general-purpose programming language, because it does not have all the features typically offered in programming languages, such as control structures. R is usually compared to specialized statistical systems such as SPSS and SAS.

2. Getting Started with R in SQL Server

To run R scripts in the SQL Server environment, you have to use the sp_execute_external_ script system stored procedure. This procedure allows you to execute any provided script at an external location. The syntax of this procedure is as follows:

sp_execute_external_script

@language = N’language’,

@script = N’script’

[ , @input_data_1 = N’SelectStatement’ ]

[ , @input_data_1_name = N’input_data_1_name’ ]

[ , @output_data_1_name = N’output_data_1_name’ ]

[ , @parallel = 0 | 1 ]

[ , @params = N’@parameter_name data_type [ OUT | OUTPUT ] [ ,…n ]’ ]

[ , @parameter1 = ‘value1’ [ OUT | OUTPUT ] [ ,…n ] ]

Before we examine the most important parameters of the sp_execute_external_script system stored procedure, let’s take a look at how a programming language can be integrated into SQL Server. Generally, when writing a script using a programming language other than Transact-SQL, three issues must be addressed:

  • Specification of the particular programming language
  • Implementation of the external language script as a string
  • Implementation of the internal (Transact-SQL) language script

To address these issues, the sp_execute_external_script system stored procedure supports three parameters, @language, @script, and @input_data_1, respectively. In SQL Server 2019, the @language parameter can have the following values: R (for the R language) and Python. The @script parameter defines the external language script as a string. The @input_data_1 parameter specifies the input dataset, which is generated using a Transact-SQL query. (The description of all other parameters can be found in Microsoft Docs.)

NOTE To be able to use the sp_execute_external_script system stored procedure, you need to enable it first by running the following batch:

EXEC sp_configure ‘external scripts enabled’, 1

GO

RECONFIGURE WITH OVERRIDE

GO

Example 32.1 is a very simple demonstration of how you can use the sp_execute_external script system stored procedure to display values of the project table.

NOTE In contrast to SQL, R is a case-sensitive language.

Example 32.1

USE sample;

GO

execute sp_execute_external_script

@language = N’R’

, @script = N’OutputDataSet <- InputDataSet;’

, @input_data_1 = N’SELECT project_no, budget FROM project;

WITH RESULT SETS ((Name CHAR(20), Balanced_budget INT));

Example 32.1 uses the project table of the sample database to generate input data. The value R in the @language parameter of the sp_execute_external_script system stored procedure specifies to use the R language for the implementation of the script. As you already know, the @input_data_1 parameter specifies the input dataset. Therefore,

@input_data_1 = N’SELECT project_no, budget FROM project;’

specifies that the input data will be selected using the SELECT statement specified.

Two important parameters are related to the @script variable: @input_data_1_name and @output_data_1_name. The first parameter specifies the name of the variable used to represent the input data; that is, the result of the SELECT statement defined with the @input_data_1 parameter. The default value is InputDataSet. As you can see from the query in Example 32.1, the input data contains all values of the columns project_no and budget of the project table.

The second parameter, @output_data_1_name, specifies the name of the variable in the external script that contains the data to be returned to the database system after the execution of the script. The default value is OutputDataSet. Thus, the expression

N’OutputDataSet <- InputDataSet;’

in Example 32.1 passes the input data contained in InputDataSet to R and then back to the result set without any modifications.

The last line of the code in Example 32.1 uses WITH RESULT SETS. By default, a result set returned by an R script is displayed as a table with unnamed columns (as shown in upcoming Example 32.2). In other words, names used within a script are local to the scripting environment and are not reflected in the corresponding result set. The WITH RESULT SETS clause allows you to name result set columns as you wish. Therefore,

WITH RESULT SETS ((Name CHAR(20), Balanced_budget INT))

assigns the names Name and Balanced_budget to the output of the columns project_no and budget, respectively.

The result of Example 32.1 is

The script in Example 32.1 does not perform any manipulation of the input data. (I use this example just to demonstrate how to run the sp_execute_external_script system stored procedure and to explain the most important parameters of it.) The most important advantage of R Services is the capability to use data stored persistently in SQL Server, manipulate it using R functions, and return it to SQL Server. Example 32.2 shows this capability in action.

Example 32.2

USE sample;

DECLARE @rscript NVARCHAR(MAX);

SET @rscript = N’OutputDataSet <- SqlData;

OutputDataSet[,2] <- round(SqlData$budget/7, 2);’;

EXEC sp_execute_external_script

@language = N’R’,

@script = @rscript,

@input_data_1 = N’SELECT project_no, budget FROM project’,

@input_data_1_name = N’SqlData’;

Example 32.1 used the default name InputDataSet for the input dataset. In Example 32.2,

@input_data_1_name = N’SqlData’;

assigns another name, SqlData, to the input dataset. (For the output data, we use the default name.) The following line of code uses the R function called round to round the obtained result:

OutputDataSet[,2] <- round(SqlData$budget/7, 2);’;

Generally, to modify values of one of the selected columns of the corresponding SELECT statement, you have to reference the input dataset (SqlData) first, add a dollar sign ($), and name that column (in our case, budget). Each value of the selected column is then divided by 7 and rounded, up to two decimal digits. (The term OutputDataSet[,2] specifies that the obtained value will be assigned to the second column in the SELECT list.)

The result is

NOTE Because Example 32.2 does not use the WITH RESULT SETS statement, the columns in the result set will not have explicit names.

3. R Data Frames

A key concept when working with R is the data frame, because it allows you to pass data from a database system, modify the data using R, and send it back to the database system to be stored persistently. The structure of a data frame is similar to a relational table, consisting of rows and columns. Because you are already familiar with the relational model, you should not have any problems understanding and using data frames.

NOTE All subsequent examples use Microsoft’s sample database AdventureWorks.

Example 32.3 shows the use of data frames in the R language.

Example 32.3

USE AdventureWorks;

GO

DECLARE @rscript NVARCHAR(MAX);

SET @rscript = N’

purchase <- InputDataSet

cl <- levels(purchase$Units)

print(cl)’;

DECLARE @select NVARCHAR(MAX);

SET @select = N’

SELECT h.subtotal AS Total, v.UnitMeasureCode AS Units

FROM Purchasing.PurchaseOrderHeader h

INNER JOIN Purchasing.PurchaseOrderDetail d

ON h.PurchaseOrderID = d.PurchaseOrderID

INNER JOIN Purchasing.ProductVendor v

ON d.ProductID = v.ProductID’;

EXEC sp_execute_external_script

@language = N’R’,

@script = @rscript,

@input_data_1 = @select;

I will explain only the most important parts of the code in Example 32.3, starting with the following three lines of code:

purchase <- InputDataSet

c1 <- levels(purchase$Units)

print(c1)

The first line assigns InputDataSet to the purchase variable. That way, the input data is passed to the purchase data frame, which is a two-dimensional dataset. Once you have the data, you can start working with it. To understand which data is passed to the variable, let’s look at the query assigned to the @select variable. The SQL statement refers to the AdventureWorks database to display the total purchase per unit measure code. To extract data, we join three tables: PurchaseOrderHeader, PurchaseOrderDetail, and ProductVendor. (All these tables belong to the Purchasing schema of the AdventureWorks database.) The first column in the SELECT list has the alias Total and specifies aggregated purchase amounts, while the second column is called Units and specifies the corresponding unit measure codes.

Moving on the second line, generally, the levels function of the R language provides access to the distinct values of a variable. In Example 32.3, this function retrieves a list of distinct values from the UnitMeasureCode column of the ProductVendor table. As you already know, when referencing a column, you must first specify the dataset (in this case, purchase), followed by a dollar sign ($), followed by the column name or its alias (Units). The second line also assigns the output of the levels function to the c1 variable, which gives us what we need to retrieve the result set. The last statement, print(c1), displays the retrieved values.

To display the output, you have to click the column Messages in the lower part of the right pane. The output of Example 32.3 is

[1] “CAN” “CS ” “CTN” “DZ ” “EA ” “GAL” “PAK”

Although the SELECT statement in Example 32.3 returns aggregated purchase amounts for each unit measure code, the script in that example uses the values of only one “dimension” of the data frame—the values of the UnitMeasureCode column.

Example 32.4 goes a step further and uses the same data frame to get all data from the query in Example 32.3. In other words, Example 32.3 returns “row” values as well as “column” values of the data frame and displays them in the table form.

Example 32.4

USE AdventureWorks;

GO

DECLARE @rscript NVARCHAR(MAX);

SET @rscript = N’

purchase <- InputDataSet

c1 <- levels(purchase$Units)

c2 <- round(tapply(purchase$total, purchase$Units, sum))

purchase <- data.frame(c1, c2)

print(c2)’;

DECLARE @select NVARCHAR(MAX);

SET @select = N’

SELECT h.subtotal AS total, v.UnitMeasureCode AS Units

FROM Purchasing.PurchaseOrderHeader h

INNER JOIN Purchasing.PurchaseOrderDetail d

ON h.PurchaseOrderID = d.PurchaseOrderID

INNER JOIN Purchasing.ProductVendor v

ON d.ProductID = v.ProductID’;

EXEC sp_execute_external_script

@language = N’R’,

@script = @rscript,

@input_data_1 = @select;

The part of the program in Example 32.4 in relation to the cl variable is identical to the corresponding part in Example 32.3. For this reason, I will discuss only the following two lines of code:

c2 <- round(tapply(purchase$total, purchase$Units, sum))

purchase <- data.frame(c1, c2)

In the first line of code, the tapply function is used. Generally, this function uses an operator to apply it on array elements, broken down by a given variable. In our case, the operator is sum, and it is specified as the third argument of the function. The array elements mentioned are the distinct values of the UnitMeasureCode column, and are specified in the second argument of the function. The first argument of the tapply function defines the subset (subtotal) to which the operator will be applied. Therefore, in Example 32.4, the aggregated sums of values in the subtotal column are calculated for each different value of the UnitMeasureCode column.

The result is

CAN         CS     CTN      DZ      EA      GAL     PAK

9609759   584963 4408405 958018 267092933 7526088 206097

Source: Petkovic Dusan (2020), Microsoft SQL Server 2019: A Beginner’s Guide, Seventh Edition-McGraw-Hill Education.

Leave a Reply

Your email address will not be published. Required fields are marked *