Python is a programming language that can be used for many purposes, including machine learning. The most important advantages of the language are
- Simple syntax
- High readability
- A large collection of libraries
Python aims to be simple in the design of its syntax, encapsulated in the slogan “There should be one obvious way to do it.” Because of its simplicity, Python is known as a beginner’s level programming language.
If you ask Python programmers what they like most about Python, they often cite its high readability. One reason for the high readability of Python code is its complete set of code style guidelines. Because of its simplicity, Python has attracted many developers to create new libraries for machine learning. Also, because of the existence of these libraries, Python is becoming very popular among machine learning experts.
NOTE To be able to use the sp_execute_external_script system stored procedure, you need to enable it first by running the following batch:
EXEC sp_configure ‘external scripts enabled’, 1
GO
RECONFIGURE WITH OVERRIDE
GO
1. Getting Started with Python
If you are new to the Python programming language, it is important to know that Python code blocks are specified by their indentation. In other words, indentation of lines of code is a requirement and not a matter of style as in many other programming languages. Example 33.1 is a very simple example that shows how you can use the sp_execute_external_script system stored procedure to display values of the project table using Python. (For the syntax and description of this system procedure, see the section “Getting Started with R in SQL Server” toward the beginning of Chapter 32.)
Example 33.1
USE sample;
DECLARE @pscript NVARCHAR(MAX);
SET @pscript = N’ df = InputDataSet OutputDataSet = df’;
DECLARE @select NVARCHAR(MAX);
SET @select = N’
SELECT project_no, budget
FROM project’;
EXEC sp_execute_external_script
@language = N’Python’,
@script = @pscript,
@input_data_1 = @select;
GO
As you can see from Example 33.1, the sp_execute_external_script system stored procedure has, among others, the following three parameters: @language, @script, and @input_data_1. The @language parameter specifies the name of the language integrated in SQL Server. The value “Python” specifies that Python is used for implementation. The @script parameter defines the external language script as a string. You can specify it using a literal or a variable. The @input_data_1 parameter defines the input data used by the external script in the form of a Transact-SQL query. In other words, the query generates a result set that is used as input data. The data type of this parameter is NVARCHAR(max), meaning that the query is passed as a string.
Two important parameters are related to the @script variable: @input_data_1_name and @output_data_1_name. The first parameter specifies the name of the variable used to represent the input dataset (i.e., the dataset defined by @input_data_1). You can omit the use of @input_data_1_name, in which case the default value InputDataSet is used. The second parameter, @output_data_1_name, specifies the name of the variable used to represent the output dataset (i.e., the result of the script). If you do not assign a value to @output_data_1_ name, the default name OutputDataSet is used.
As you can see from the SELECT statement in Example 33.1, the input data contains all values of the columns project_no and budget of the project table. These values are assigned to the variable called df. After that, the input data is passed to the output data (OutputDataSet = df) without any modification. (As you can see from the following result, neither column has an explicit name. Example 33.2 shows how you can assign particular names to the columns in a result set.)
The result is
I present the script in Example 33.1 only to demonstrate how to run the sp_execute_ external_script system stored procedure and to explain the meaning of some parameters of the procedure. In other words, Example 33.1 simply assigns query results to the Python script, which returns the same results as that of the SELECT statement. This is something you can do without using a Python script, but being able to assign your query results to the Python script means you can then use the analytical power built into Python to apply it to that data.
The most important advantage of the integration of Python in SQL Server is the capability to use data stored persistently in the database system, manipulate the data using Python functions, and return it to the Database Engine. Example 33.2 shows this.
Example 33.2
USE sample;
DECLARE @pscript NVARCHAR(MAX);
SET @pscript = N’
df1 = InputDataSet
OutputDataSet = round(df1/7, 2)’;
DECLARE @select NVARCHAR(MAX);
SET @select = N’
SELECT budget AS Balanced_budget
FROM project’;
EXEC sp_execute_external_script
@language = N’Python’,
@script = @pscript,
@input_data_1 = @select
WITH RESULT SETS ((Balanced_budget FLOAT));
GO
First, note the following two lines of code from Example 33.2:
df1 = InputDataSet
OutputDataSet = round(df1/7, 2)’;
The SELECT statement assigned to @input_data_1 using the @select variable generates the input dataset. The dataset contains values of the budget column of the project table. The second line applies a Python function to the df1 variable, to which the input data is assigned. Precisely, each input value is then divided by 7 and rounded using the round function. After that, the modified values of the budget column are assigned to the result set, which is sent to the database system.
The last line of the code in Example 33.2 uses the WITH RESULT SETS clause to name a column of the result set. (By default, a result set returned by a Python script is displayed as a table with unnamed columns.)
The result is
2. Python Data Frames
A data frame is used to store input data sent from the Database Engine, pass the data to Python, modify it using Python functions, and send it back to the database system. The structure of a data frame is “two-dimensional,” meaning that each frame is made up of rows and columns.
In case of Python, input data is passed to a script and converted to a DataFrame object. Also, the data returned by a Python script is passed to the output variable as a DataFrame object. All DataFrame objects belong to the class with the same name, which is a part of the pandas library. (This library provides data structures designed to allow you to work with “relational” data—data provided in the table form.)
Example 33.3 shows the use of data frames.
Example 33.3
USE AdventureWorks;
DECLARE @pscript NVARCHAR(MAX);
SET @pscript = N’
df1 = InputDataSet
OutputDataSet = df1.groupby( “Units”, as_index=False).max()’;
DECLARE @select NVARCHAR(MAX);
SET @select = N’
SELECT v.UnitMeasureCode AS Units
FROM Purchasing.PurchaseOrderHeader h
INNER JOIN Purchasing.PurchaseOrderDetail d
ON h.PurchaseOrderID = d.PurchaseOrderID
INNER JOIN Purchasing.ProductVendor v
ON d.ProductID=v.ProductID’;
EXEC sp_execute_external_script
@language = N’Python’,
@script = @pscript,
@input_data_1 = @select;
The input data of the script in Example 33.3 is generated from the tables of the AdventureWorks sample database. The SELECT statement displays the total purchase per unit measure code. To retrieve data, we join three tables: PurchaseOrderHeader, PurchaseOrderDetail, and ProductVendor. The single column in the SELECT list has the alias Units. The results of this query are all distinct unit measure codes.
The line of code following the assignment of the input data to the df1 variable needs some explanation:
OutputDataSet = df1.groupby(“Units”, as_index=False).max()’;
The groupby function is applied to the values stored in the df1 variable. This function allows you to group records using distinct values. The first argument of the function specifies the object whose values will be grouped. In this case, the grouping is done on values of the UnitMeasureCode column of the ProductVendor table. (The alias of this column is Units.)
To get one value from each group, the max function is applied to the result. (While all values in each group are identical, the function returns just one of them.)
The second argument of the groupby function is as_index. When using this function, the as_index parameter can be set to either True or False, depending on whether you want the grouping column to be the index of the output or not, respectively. When as_index=False is used, the key(s) you use in the groupby function is generated by the system and added as an additional column of the output data. (Taking a look at the following result, you can see that, besides the values of the UnitMeasureCode column, there is an additional column with index
NOTE The use of as_index=True is discussed shortly in relation to Example 33.5.
The result is
As you already know, data frames provide data in table form. Example 33.3 uses only column data of the data frame for calculation. Example 33.4 uses the same input data as Example 33.3 but displays data in table form, with both row values and column values of the data frame.
Example 33.4
USE AdventureWorks;
DECLARE @pscript NVARCHAR(MAX);
SET @pscript = N’
df1 = InputDataSet
df2 = df1.groupby(“Units”, as_index=False).sum()
OutputDataSet = df2′;
DECLARE @select NVARCHAR(MAX);
SET @select = N’
SELECT CAST(h.subtotal AS FLOAT) AS Total, v.UnitMeasureCode AS Units
FROM Purchasing.PurchaseOrderHeader h
INNER JOIN Purchasing.PurchaseOrderDetail d
ON h.PurchaseOrderID = d.PurchaseOrderID
INNER JOIN Purchasing.ProductVendor v
ON d.ProductID=v.ProductID’;
EXEC sp_execute_external_script
@language = N’Python’,
@script = @pscript,
@input_data_1 = @select
WITH RESULT SETS((UnitCodes NVARCHAR(50), TotalSales MONEY));
First, the SELECT list of the query, assigned to the @select variable, contains two columns with the aliases Total and Units, respectively. In the following code,
df1 = InputDataSet
df2 = df1.groupby(“Units”, as_index=False).sum()
OutputDataSet = df2
the first line passes the input dataset to the df1 data frame. In the second line, the groupby function is applied to the values of the unit measure codes and, for each distinct unit, the sum of all subtotals (the values of the subtotal column of the PurchaseOrderHeader table) is calculated. Finally, the content of the df2 data frame is passed to the output data.
The result is
One more feature of the query in Example 33.4 requires further explanation. Note that the SELECT list of the query uses the CAST operator to select values from the subtotal column of the PurchaseOrderHeader table and to convert them into the values with the data type FLOAT. This is necessary, because if you do not use casting, the system displays an error with a message similar to the following: “Unsupported input data type in column ‘total. Supported types: bit, tinyint, smallint, int, bigint, uniqueidentifier, real, float, char, varchar, nchar, nvarchar, varbinary.”
The reason for this error message is that the set of standard data types in Python is significantly smaller than the corresponding set for Transact-SQL. For this reason, the Python system cannot convert each Transact-SQL data type into an appropriate Python type. One example of such a data type is MONEY, which is the data type of the subtotal column. In such cases, you have to use the CAST operator and to explicitly convert the column’s data type into a data type that is supported by Python.
Source: Petkovic Dusan (2020), Microsoft SQL Server 2019: A Beginner’s Guide, Seventh Edition-McGraw-Hill Education.