SQL Server Machine Learning Services: Predictive Modeling with Python in SQL Server

As briefly introduced in Chapter 32, predictive analytics comprises a variety of statistical techniques from machine learning that analyze current and historical data to make predictions about future events. In the business sector, for example, predictive models analyze patterns found in historical and transactional data to identify possible risks and/or opportunities. To do this, a predictive model examines relationships among several parameters. After the analysis, the predictive model presents the various scenarios for evaluation.

1. Solving Linear Regression Problems Using Python

Linear regression is one of the simplest techniques used in predictive modeling. For this reason, this section presents an example of using linear regression to explain predictive modeling with Python.

NOTE See the Chapter 32 section “Solving Linear Regression Problems with R” for an explanation of the two forms of regression analysis, linear and nonlinear.

For the following example, assume that for each given value, the corresponding value is measured. Measurements in real life, such as this one, usually do not have perfect linear relationships between the values of the independent variable and the values of the dependent variable. For this reason, the goal of linear regression is to find the straight line that best fits the given values (called the best-fit line). In other words, linear regression identifies the equation that produces the smallest difference between all the observed values and their fitted values.

The table created in Example 33.6 will be used to show how linear regression can be calculated using Python.

(6,2));

Example 33.6

USE sample;

CREATE TABLE Measures (x_value INT, y_value DEC (6,2)); 

INSERT INTO Measures VALUES (1, 33.5);

INSERT INTO Measures VALUES (2, 35.9);

INSERT INTO Measures VALUES (3, 37.9);

INSERT INTO Measures VALUES (4, 39.8);

INSERT INTO Measures VALUES (5, 41.6);

INSERT INTO Measures VALUES (6, 45.4);

INSERT INTO Measures VALUES (7, 44.6);

INSERT INTO Measures VALUES (8, 47.4);

INSERT INTO Measures VALUES (9, 48.2);

INSERT INTO Measures VALUES (10, 50.3);

The ten measured values are stored in the Measures table created at the beginning of Example 33.6. Our task is to find linear regression—the formula for the corresponding straight line to which all the given points are at a minimum distance. Example 33.7 calculates the line and displays its formula, together with some other parameters.

Example 33.7

USE sample;

EXEC sp_execute_external_script

@language = N ‘Python’ , @script = N’

from revoscalepy import rx_lin_mod, rx_predict

linearmodel=rx_lin_mod(formula = “Y_Value~X_Value”,data=InputDataSet);

print(linearmodel.summary())’,

@input_data_1 = N’SELECT x_value AS X_Value,

CAST (y_value AS FLOAT) AS Y_Value FROM Measures’

Let’s take a look at the following three lines of code in Example 33.7:

from revoscalepy import rx_lin_mod, rx_predict

linearmodel=rx_lin_mod(formula=”Y_Value~X_Value”,data=InputDataSet);

print(linearmodel.summary())

The first line imports the rx_lin_mod and rx_predict functions from the revoscalepy module. (The revoscalepy module is a collection of Python functions that you can use for statistics-related tasks, such as linear models, regression, and classification.) The second line incorporates the rx_lin_mod function, which is used to fit linear models on small or large data sets. The two most important parameters of the function are formula and data. The former specifies which statistical model is used. The formula

“Y_Value ~ X_Value”

defines that the alphanumerical string on the right side of the ~ operator is the name of the independent variable, while the string on the left side is the name of the dependent variable. The data argument specifies the input data. The last line of the code prints the summary for the generated model.

After execution of this script, the result looks similar to the following (several lines of the result have been omitted):

Linear regression Results for: Y_Value ~ X_Value

Dependent variable(s): [‘Y_Value’]

Total independent variables: 2

Number of valid observations: 10

Number of missing observations: 0

(Intercept)   (Intercept) 32.360000

 X_Value        X_Value    1.836364

Residual standard error: 0.8482 on 8.0 degrees of freedom

Multiple R-squared:0.9797

As you already know, when applying regression analysis, the program calculates the straight line that is generated so that all the data points are at a minimum distance from the line. Generally, the formula for the straight line is y = a + bx, where a is the y-intercept and b is the slope of the particular straight line. Therefore, by applying the algorithm for linear regression, we derive the values of the coefficients a and b. As you can see from the result of Example 33.7, the y-intercept of the straight line is 32.36 and the corresponding slope is 1.83.

Multiple R-squared (the last line of output) evaluates the scatter of the data points around the fitted regression line. It is also called the coefficient of determination. For the same data set, higher R-squared (aka R2) values represent smaller differences between the observed data and the fitted values. R-squared is always between 0 percent and 100 percent. (The larger the value of multiple R-squared, the better the regression model fits your observations.) Therefore, our straight line is a very good choice, because the corresponding R2 value is almost 98 percent.

Example 33.8 uses the measurements from the Measures table to show how data is plotted.

Example 33.8

USE sample;

EXEC sp_execute_external_script

@language = N’Python’

, @script = N’

#Importing Packages

import matplotlib

matplotlib.use(“PDF”)

from revoscalepy import rx_lin_mod, rx_predict

import matplotlib.pyplot as plt

import pandas as pd

linearmodel = rx_lin_mod(formula = “Y_Value ~ X_Value”, data =

InputDataSet);

df = InputDataSet

plt.scatter(df.X_Value,df.Y_Value)

plt.xlabel(“Values of Independent Variable “)

plt.ylabel(“Values of Dependent Variable”)

#plt.title(“Graphical Output of Example 33.8”)

plt.plot()

plt.savefig(“C:\\temp\\Figure33_2.png”) ‘,

@input_data_1 = N’SELECT x_value AS X_Value,

CAST (y_value AS FLOAT) AS Y_Value FROM dbo.Measures’

Three lines of code in Example 33.8 merit further explanation:

plt.scatter(df.X_Value,df.Y_Value)

plt.plot()

plt.savefig(“C:\\temp\\Figure33_2.png”) ‘,

The scatter function of the matplotlib.pyplot framework generates a scatter plot. This function has several parameters, but only the first two have to be specified. These two parameters specify the X and Y coordinates of data points, respectively. The plot function in the second line creates the corresponding scatter plot, shown in Figure 33-2.

Finally, the savefig function saves the figure. Note that the use function of the matplotlib library in Example 33.8 specifies the .pdf format as the format of the output file. This value can be modified afterwards, using the savefig function. As you can see in the example, the format of the scatter plot is modified and stored as a .png file.

Source: Petkovic Dusan (2020), Microsoft SQL Server 2019: A Beginner’s Guide, Seventh Edition-McGraw-Hill Education.

Leave a Reply

Your email address will not be published. Required fields are marked *