Predictive analytics comprises a variety of statistical techniques from machine learning that analyze current and historical data to make predictions about future events. In the business sector, for example, predictive models analyze patterns found in historical and transactional data to identify possible risks and/or opportunities. To do this, a predictive model examines relationships among several parameters. After the analysis, the predictive model presents the various scenarios for evaluation.
1. Solving Linear Regression Problems with R
Regression analysis is a statistical methodology that is most often used for numeric prediction. Generally, regression analysis can be used to model the relationship between one or more independent variables and a dependent variable. (An independent variable is also called a predictor variable, while a dependent variable is also called a response variable.) Therefore, the values of the independent variables are known and the goal of the regression analysis is to predict the dependent variable.
There are two forms of regression analysis:
- Linear regression analysis A linear approach to modeling the relationship between a dependent variable and one or more independent variables. The simplest form of regression is straight-line regression analysis. In this case, the dependent variable is modeled as a linear function of independent variable(s).
- Nonlinear regression analysis Existing data is modeled by a function that is a nonlinear combination of the model parameters and depends on one or more independent variables. In other words, the output of nonlinear regression cannot be represented with a straight line in the coordinate system. (Generally, problems based on nonlinear regression are usually converted to a linear approach.)
Linear regression is probably the simplest technique used in prediction analytics. I will show an R script that solves a linear regression problem.
As you already know, besides functions, R packages include datasets, too. A list of datasets in an R package can be displayed by calling the data function:
One of the datasets available in the ggplot2 package is named mpg and contains fuel economy data for several models of car. While the ggplot2 package has already been loaded, you can use the head command to view the first few rows. Example 32.7 shows the corresponding R script.
@language = N’R’ ,
@script = N’
cl <- head(mpg)
The result is
Suppose you want to perform a linear regression to determine the effect how a measurement of engine size (the displ columns) influences gas mileage in the city (the cty column). In this case, the displ column is the independent variable, and the cty column is the dependent variable. The R script in Example 32.8 calculates the effect.
NOTE When executing Example 32.8, you might get an error similar to this:
Msg 39004, Level 16, State 20, Line 448 A ‘R’ script error occurred during execution of ‘sp_execute_external_script’ with HRESULT.
This error is related to the SQL Server Launchpad service. This service is used to start Advanced Analytics Extensions processes, which are necessary for integration of the R system and Python with the Database Engine. The program in Example 32.8 writes the file in the C:\temp directory, which is by default not accessible by the Launchpad service. You can either grant access to this directory, or use another directory; for instance, C:\RScripts. To grant access to a directory, right-click that directory, select Grant Access To, and click Add Everyone.
DECLARE @rscript NVARCHAR(MAX);
SET @rscript = N’
# Import R packages
# Specify report file
file <- “C:\\Temp\\Figure32_6.tif”
tiff(filename=file, width=1000, height=600)
# Generate plot
fit <- lm(cty ~ displ, mpg);
chart <- abline(fit) ;
@language = N’R’,
@script = @rscript ;
I will explain only the part of Example 32.8 that is added to the code in relation to Example 32.5. The following three lines of code define how to perform a linear regression and generate a regression line expressing the trend in the dataset:
fit <- lm(cty ~displ, mpg);
chart <- abline(fit) ;
The plot function generates graphs according to the type of the parameters specified. This function has many different forms. The form of the plot function given here is one of the simplest: the parameters specify the X and Y coordinates of the graphic, respectively. All points shown in the upcoming Figure 32-6 are generated after the plot function is executed.
The lm function is a generic function that can be used to display a large group of data mining techniques. The first parameter of this function specifies the particular technique. (The value lm means “linear model” and therefore specifies the linear regression model). The second parameter of the lm function specifies the independent variable (displ) on the right side and the dependent variable (cty) on the left side of the ~ operator. The last parameter, mpg, defines the dataset used to build the corresponding data mining model.
The abline function is used to add (vertical, horizontal, or regression) lines to a graph. It takes as the input argument the result of the lm function, produces the corresponding regression line according to the model, and assigns the result to the chart variable.
The last two lines of code print the graphics and close the specified plot explicitly:
Figure 32-6 shows the points generated using the plot function, as well as the regression line generated by the abline function.
Source: Petkovic Dusan (2020), Microsoft SQL Server 2019: A Beginner’s Guide, Seventh Edition-McGraw-Hill Education.