Linear Regression
Updated: June 13, 2017
Creates a linear regression model
Category: Machine Learning / Initialize Model / Regression
This article describes how to use the Linear Regression module in Azure Machine Learning to create a linear regression model for use in an experiment.
Regression is a machine learning methodology used to predict a numeric outcome. Linear regression attempts to establish a linear relationship between one or more independent variables and an outcome, or dependent variable.
After you have configured the model, you must train the model using a labeled dataset and the Train Model module. If you use the online gradient descent method, you can also train the model using Tune Model Hyperparameters to automatically optimize the model parameters.
The trained model can then be used to make predictions. Alternatively, the untrained model can be passed to Cross-Validate Model for cross-validation against a labeled data set.
There are many different types of regression. In the most basic sense, regression means predicting a numeric target. However, for years statisticians have been developing increasingly advanced methods for regression. Linear regression is still a good choice when you want a very simple model for a basic predictive task. Linear regression also tends to work well on high-dimensional, sparse data sets lacking complexity.
In Azure Machine Learning Studio, you can use linear regression to solve these regression problems:
The classic regression problem involves a single independent variable and a dependent variable. This is called simple regression.
Multiple linear regression involves two or more independent variables that contribute to a single dependent variable. The Linear Regression module can solve such problems, in which multiple inputs are used to predict a single numeric outcome, also called multivariate linear regression.
The task of predicting multiple dependent variables within a single model is called multi-label regression. For example, in multi-label logistic regression, a sample can be assigned to multiple different labels. This type of regression is not supported in Azure Machine Learning; instead, you must create a separate learner for each output that you wish to predict.
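The three cases above differ mainly in the shape of the data. The following is a minimal sketch outside of Studio, assuming NumPy and made-up data; the `fit_linear` helper is hypothetical and is not part of the Linear Regression module. It fits a simple regression, a multiple regression, and then handles two outputs by training one separate learner per output.

```python
import numpy as np

def fit_linear(X, y):
    """Fit y ~ X @ w + b by least squares and return (w, b)."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)  # a single independent variable becomes one column
    # Append a column of ones so the last coefficient acts as the intercept.
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
    return coef[:-1], coef[-1]

# Simple regression: one independent variable (illustrative data, y = 2x + 1).
x = np.array([0.0, 1.0, 2.0, 3.0])
w_simple, b_simple = fit_linear(x, 2.0 * x + 1.0)

# Multiple regression: two independent variables, one dependent variable.
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 3.0]])
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + 4.0
w_multi, b_multi = fit_linear(X, y)

# Two outputs: train one separate learner for each output column.
Y = np.column_stack([y, 2.0 * y])
models = [fit_linear(X, Y[:, j]) for j in range(Y.shape[1])]
```

Each entry in `models` is an independent `(w, b)` pair, which mirrors the advice to create a separate learner for each output that you wish to predict.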
The Linear Regression module in Azure Machine Learning can use two methods to fit the linear model:
Online gradient descent
Gradient descent is a method that minimizes the amount of error at each step of the model training process. There are many variations on gradient descent and its optimization for various learning problems has been extensively studied.
If you choose this option for Solution method, you can set a variety of parameters to control the step size, learning rate, and so forth. This option also supports use of an integrated parameter sweep.
Ordinary least squares
Least squares linear regression is one of the most commonly used techniques in predictive analytics. This method assumes that there is a fairly strong linear relationship between the inputs and the dependent variable. Ordinary least squares refers to the loss function, which computes error as the sum of the square of distance from the actual value to the predicted line, and fits the model by minimizing the squared error.
Least squares is also the method that is used in the Analysis Toolpak for Microsoft Excel.
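To make the difference between the two fitting methods concrete, here is a rough sketch that fits the same noise-free data both ways. It assumes NumPy; the function names, the data, and the update rule are illustrative assumptions and do not reproduce the module's internal implementation.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares: minimize the sum of squared errors directly."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])  # last column models the intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]

def fit_online_gd(X, y, learning_rate=0.1, epochs=500, seed=0):
    """Online gradient descent: adjust the weights after every example."""
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):      # visit examples in random order
            error = X[i] @ w + b - y[i]        # signed prediction error
            w -= learning_rate * error * X[i]  # gradient of 0.5 * error**2 w.r.t. w
            b -= learning_rate * error         # gradient of 0.5 * error**2 w.r.t. b
    return w, b

# Illustrative data on the line y = 3x + 1.
X = np.array([[0.0], [0.25], [0.5], [0.75], [1.0]])
y = 3.0 * X[:, 0] + 1.0

w_ols, b_ols = fit_ols(X, y)
w_gd, b_gd = fit_online_gd(X, y)
```

On data like this, both approaches should arrive at nearly the same line. Least squares gets there in one solve, while gradient descent approaches it step by step and depends on the learning rate and the number of epochs.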
The options available in this module change depending on the method you select for fitting the regression line:
For small datasets, it is best to select Ordinary Least Squares. This should give very similar results to Excel.
Gradient descent is a better fitting method for models that are more complex, or that have too little training data given the number of variables.
Create a regression model using Ordinary Least Squares
Add the Linear Regression Model module to your experiment. You can find the module in Azure Machine Learning Studio in the Machine Learning category. Expand the Initialize Model category, expand Regression, and then drag the Linear Regression Model module to your experiment.
In the Properties pane, in the Solution method dropdown list, select Ordinary Least Squares. This option specifies the computation method used to find the regression line.
In L2 regularization weight, type the value to use as the weight for L2 regularization. We recommend that you use a non-zero value to avoid overfitting.
To learn more about how regularization affects model fitting, see this article: L1 and L2 Regularization for Machine Learning
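As a rough illustration of what an L2 regularization weight does, the sketch below solves a least squares problem with a penalty on the size of the coefficients. It assumes NumPy and synthetic data. The `ridge_fit` helper is hypothetical, and how the module itself scales the weight or treats the intercept is not stated here, so leaving the intercept unpenalized is an assumption.

```python
import numpy as np

def ridge_fit(X, y, l2_weight=0.001):
    """Minimize ||y - X @ w - b||^2 + l2_weight * ||w||^2.

    Assumption: the intercept b is not penalized.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # centering removes the intercept from the solve
    # Normal equations with the L2 weight added to the diagonal.
    w = np.linalg.solve(Xc.T @ Xc + l2_weight * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = X @ np.array([2.0, -1.0]) + 0.5  # synthetic, noise-free target

w_none, b_none = ridge_fit(X, y, l2_weight=0.0)        # plain least squares
w_strong, b_strong = ridge_fit(X, y, l2_weight=100.0)  # heavily shrunk coefficients
```

A larger weight pulls the coefficients toward zero, which trades some fit on the training data for a simpler model.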
Select the option, Include intercept term, if you want to view the term for the intercept.
Deselect this option if you don't need to review the regression formula.
For Random number seed, you can optionally type a value to seed the random number generator used by the model.
Using a seed value is useful if you want to maintain the same results across different runs of the same experiment.
Deselect the option, Allow unknown categorical levels, if you want missing values to raise an error.
If this option is selected, an additional level will be created for each categorical column, and any levels in the test dataset not available in the training dataset are mapped to this additional level.
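The mapping described above can be sketched in a few lines of plain Python. The helper names and the integer encoding are assumptions made for illustration; the module's actual encoding is not described in this article.

```python
def build_level_index(training_values, allow_unknown=True):
    """Assign an integer code to each level seen in training.

    When allow_unknown is True, one additional code is reserved for
    levels that appear only in later data.
    """
    levels = sorted(set(training_values))
    index = {level: code for code, level in enumerate(levels)}
    unknown_code = len(levels) if allow_unknown else None
    return index, unknown_code

def encode(value, index, unknown_code):
    """Return the code for a value, falling back to the additional level."""
    if value in index:
        return index[value]
    if unknown_code is None:
        raise ValueError(f"Unknown categorical level: {value!r}")
    return unknown_code

# Levels seen in training: blue, green, red. "purple" appears only at test time.
index, unknown_code = build_level_index(["red", "green", "blue", "green"])
codes = [encode(v, index, unknown_code) for v in ["green", "purple"]]
```

With `allow_unknown=False`, the same call to `encode` raises an error for `"purple"` instead of mapping it to the additional level.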
Run the experiment.
After the model has been trained, you can view the model's parameters by right-clicking the trainer output and selecting Visualize.
You also can connect the trained model to the Score Model module to make predictions, or pass the untrained model to Cross-Validate Model for cross-validation against a labeled data set.
Create a regression model using Online Gradient Descent
Add the Linear Regression Model module to your experiment. You can find the module in Azure Machine Learning Studio in the Machine Learning category. Expand the Initialize Model category, expand Regression, and then drag the Linear Regression Model module to your experiment.
In the Properties pane, in the Solution method dropdown list, choose Online Gradient Descent as the computation method used to find the regression line.
For Create trainer mode, indicate whether you want to train the model with a predefined set of parameters, or if you want to optimize the model by using a parameter sweep.
Single Parameter
If you know how you want to configure the linear regression model, you can provide a specific set of values as arguments.
You must then train the model by using a tagged dataset and the Train Model module.
Parameter Range
If you want the algorithm to find the best parameters, set the Create trainer mode option to Parameter Range and specify multiple values.
You should then train the model using Tune Model Hyperparameters and a tagged dataset. The algorithm will determine the optimal parameters for you. You can save the model trained using those parameters, or you can make a note of the parameter settings to use when configuring a learner.
If you configure the model with specific values using the Single Parameter option and then switch to the Parameter Range option, the model will be trained using the minimum value in the range for each parameter.
Conversely, if you configure specific settings when you create the model but select the Parameter Range option and use the Tune Model Hyperparameters module, the model will be trained using the default values for the learner as the range of values to sweep over.
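The idea behind a parameter sweep can be sketched as a small grid search. This is a stand-in written for illustration, assuming NumPy and synthetic data; it is not how Tune Model Hyperparameters is implemented, and the `train_sgd` and `sweep` helpers are hypothetical.

```python
import itertools
import numpy as np

def train_sgd(X, y, learning_rate, epochs):
    """Train a linear model with per-example gradient updates."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            error = xi @ w + b - yi
            w -= learning_rate * error * xi
            b -= learning_rate * error
    return w, b

def sweep(X_train, y_train, X_val, y_val, grid):
    """Try every combination in the grid and keep the lowest validation RMSE."""
    best = None
    for lr, epochs in itertools.product(grid["learning_rate"], grid["epochs"]):
        w, b = train_sgd(X_train, y_train, lr, epochs)
        rmse = float(np.sqrt(np.mean((X_val @ w + b - y_val) ** 2)))
        if best is None or rmse < best["rmse"]:
            best = {"learning_rate": lr, "epochs": epochs, "rmse": rmse}
    return best

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(60, 2))
y = X @ np.array([2.0, -1.0]) + 0.5  # synthetic, noise-free target

grid = {"learning_rate": [0.001, 0.01, 0.1], "epochs": [1, 10, 50]}
best = sweep(X[:40], y[:40], X[40:], y[40:], grid)
```

The settings in `best` play the same role as the parameter settings you would note down after a sweep and reuse when configuring a learner in Single Parameter mode.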
For Learning rate, specify the initial learning rate for the stochastic gradient descent optimizer.
For Number of training epochs, type a value that indicates how many times the algorithm should iterate through examples.
For datasets with a small number of examples, this number should be large to reach convergence.
If you have already normalized the numeric data used to train the model, deselect the option, Normalize features. Otherwise the module will by default normalize all numeric inputs to a range between 0 and 1.
Note that when you pass new data to the model for scoring, you should use the same normalization method.
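The point about reusing the same normalization at scoring time can be illustrated with a small min-max scaler. This sketch assumes NumPy; the `MinMaxScaler01` class is hypothetical and only approximates the behavior described above, in which numeric inputs are scaled to a range between 0 and 1.

```python
import numpy as np

class MinMaxScaler01:
    """Scale each column to [0, 1] using statistics from the training data."""

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.min_ = X.min(axis=0)
        span = X.max(axis=0) - self.min_
        # Avoid division by zero for constant columns.
        self.span_ = np.where(span == 0.0, 1.0, span)
        return self

    def transform(self, X):
        # Always reuse the training minimum and span, including for new data.
        return (np.asarray(X, dtype=float) - self.min_) / self.span_

train = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 300.0]])
scaler = MinMaxScaler01().fit(train)

train_scaled = scaler.transform(train)
new_scaled = scaler.transform(np.array([[25.0, 250.0]]))  # scored with training stats
```

Fitting a fresh scaler on the new data instead would shift the inputs relative to what the model saw during training.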
In L2 regularization weight, type the value to use as the weight for L2 regularization. We recommend that you use a non-zero value to avoid overfitting.
To learn more about how regularization affects model fitting, see this article: L1 and L2 Regularization for Machine Learning
Select the option, Normalize features, to indicate that instances should be normalized.
Select the option, Average final hypothesis, to average the final hypothesis.
In regression models, hypothesis testing means using some statistic to evaluate the probability of the null hypothesis, which states that there is no linear correlation between a dependent and independent variable.
In many regression problems, you must test a hypothesis involving more than one variable. This option, which is selected by default, tests a combination of the parameters where two or more parameters are involved.
Select the option, Decrease learning rate, if you want the learning rate to decrease as iterations progress.
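The schedule that the module uses to decrease the learning rate is not stated in this article. The sketch below shows one common choice, an inverse square root decay, purely as an assumption for illustration; the `decayed_learning_rate` function is hypothetical.

```python
def decayed_learning_rate(initial_rate, iteration, decrease=True):
    """Return the step size for a 0-based iteration.

    Assumed schedule: initial_rate / sqrt(1 + iteration).
    """
    if not decrease:
        return initial_rate  # constant step size
    return initial_rate / (1.0 + iteration) ** 0.5

# Step sizes at iterations 0, 3, and 99, starting from a rate of 0.1.
rates = [decayed_learning_rate(0.1, t) for t in (0, 3, 99)]
constant = decayed_learning_rate(0.1, 99, decrease=False)
```

Under this assumed schedule, early iterations take larger steps and later iterations take smaller ones, which tends to steady the final weights.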
For Random number seed, you can optionally type a value to seed the random number generator used by the model.
Using a seed value is useful if you want to maintain the same results across different runs of the same experiment.
Deselect the option, Allow unknown categorical levels, if you want missing values to raise an error.
When this option is selected, an additional level will be created for each categorical column, and any levels in the test dataset not available in the training dataset are mapped to this additional level.
Run the experiment.
After the model has been trained, you can connect it to the Score Model module to make predictions.
Alternatively, the untrained model can be passed to Cross-Validate Model for cross-validation against a labeled data set.
For examples of regression models, see these sample experiments in the Model Gallery:
The Compare Regressors sample contrasts several different kinds of regression models.
The Cross Validation for Regression sample demonstrates linear regression using ordinary least squares.
The Twitter sentiment analysis sample uses several different regression models to generate predicted ratings.
Many tools support creation of linear regression models, ranging from simple to complex. For example, you can easily perform linear regression in Excel, using the Analysis Toolpak, or you can code your own regression algorithm, using R, Python, or C#.
However, because linear regression is a well-established technique that is supported by many different tools, there are many different interpretations and implementations. Not all types of models are supported equally by all tools. There are also some differences in nomenclature you should be aware of.
Regression methods are often categorized by the number of response variables. For example, multiple linear regression means a model that uses multiple predictor variables to predict a single response variable.
In Matlab, multivariate regression refers to a model that has multiple response variables.
In Azure Machine Learning, regression models support a single response variable.
In the R language, the features provided for linear regression depend on the package you are using. For example, the glm function will give you the ability to create a logistic regression model with multiple independent variables. In general, Azure Machine Learning Studio provides the same functionality as the R glm function.
We recommend that you use this module, Linear Regression, for typical regression problems. In contrast, if you are using multiple variables to predict a class value, we recommend the Two-Class Logistic Regression or Multiclass Logistic Regression modules. If you want to use other linear regression packages that are available for the R language, we recommend that you use the Execute R Script module and call the lm or glm functions, which are included in the runtime environment of Azure Machine Learning Studio.
| Name | Range | Type | Default | Description |
|---|---|---|---|---|
| Normalize features | any | Boolean | true | Indicate whether instances should be normalized |
| Average final hypothesis | any | Boolean | true | Indicate whether the final hypothesis should be averaged |
| Learning rate | >=double.Epsilon | Float | 0.1 | Specify the initial learning rate for the stochastic gradient descent optimizer |
| Number of training epochs | >=0 | Integer | 10 | Specify how many times the algorithm should iterate through examples. For datasets with a small number of examples, this number should be large to reach convergence. |
| Decrease learning rate | Any | Boolean | true | Indicate whether the learning rate should decrease as iterations progress |
| L2 regularization weight | >=0.0 | Float | 0.001 | Specify the weight for L2 regularization. Use a non-zero value to avoid overfitting. |
| Random number seed | any | Integer | | Specify a value to seed the random number generator used by the model. Leave blank for default. |
| Allow unknown categorical levels | any | Boolean | true | Indicate whether an additional level should be created for each categorical column. Any levels in the test dataset not available in the training dataset are mapped to this additional level. |
| Include intercept term | Any | Boolean | True | Indicate whether an additional term should be added for the intercept |
| Name | Type | Description |
|---|---|---|
| Untrained model | ILearner interface | An untrained regression model |