Principal Component Analysis
Updated: April 11, 2016
Computes a set of features with reduced dimensionality for more efficient learning
Category: Data Transformation / Scale and Reduce
You can use the Principal Component Analysis module to analyze your data and create a reduced feature set that captures most of the information contained in the larger dataset.
The module also creates a transformation that you can apply to new data, to achieve a similar reduction in dimensionality and compression of features, without requiring additional training.
Principal Component Analysis (PCA) is based on the fact that many types of vector-space data are compressible, and that compression can be achieved most efficiently by sampling. For this reason, PCA is a popular technique in machine learning for dimensionality reduction. Added benefits are that PCA can improve data visualization and reduce the resources consumed by the learning algorithm.
The Principal Component Analysis module transforms a set of feature columns in the provided dataset into a projection of the feature space that has lower dimensionality. The transformed data captures as much of the variance in the original data as possible, while reducing the effect of noise and lowering the risk of overfitting. The algorithm uses randomization techniques to identify a feature subspace that captures most of the information in the complete feature matrix.
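As a rough illustration of what such a projection does, the following Python sketch uses scikit-learn's PCA with its randomized SVD solver. This is an analogous technique, not the module's actual implementation; the dataset, its size, and the choice of three components are invented for the example.

```python
# Illustration only: scikit-learn's randomized PCA is an analogous
# technique, not the exact algorithm used by this module.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 50))      # 1,000 rows, 50 feature columns

# Project onto 3 principal components using a randomized SVD solver.
pca = PCA(n_components=3, svd_solver="randomized", random_state=0)
X_reduced = pca.fit_transform(X)     # shape: (1000, 3)

# Fraction of the original variance captured by each new dimension.
print(pca.explained_variance_ratio_)
```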
For general information about principal component analysis (PCA), see this Wikipedia article. For information about the PCA approaches used in this module, see these articles:
Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. Halko, Martinsson, and Tropp, 2010.
Combining Structured and Unstructured Randomness in Large Scale PCA. Karampatziakis and Mineiro, 2013.
How to configure Principal Component Analysis

1. Add the dataset you want to analyze, and connect it to the input of the module.

   If it is not already clear which columns are features and which are labels, we recommend that you use the Edit Metadata module to mark the columns in advance.

2. Use the Column Selector to choose the columns you want to evaluate.

3. In Number of dimensions to reduce to, type the number of columns to include in the output dataset. Each column represents a dimension that captures some part of the information in the input columns.

   Tip: The algorithm used by this module is optimized for cases where the number of reduced dimensions is much smaller than the number of original dimensions.

   For example, if the source dataset has eight columns and you type 3, three new columns are returned that together capture the information of the eight selected columns.

   In that case, the columns are named Col1, Col2, and Col3, and should be considered an approximation of the contents of columns 1-8 as a whole, rather than being directly derived from particular source columns.

4. Select the Normalize dense dataset to zero mean option if the dataset is dense (contains few missing values) and you want to normalize the values in the columns to a mean of zero before further processing.

   For sparse datasets, this option is ignored even if selected.

5. Run the experiment.
The module outputs a reduced set of columns that you can use to create a model. You can save the output as a new dataset or use it in your experiment.
Optionally, you can save the analysis as a transformation, to apply to other datasets by using Apply Transformation. Note that any dataset you apply the transformation to must have the same schema as the original dataset.
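For readers who want a code analogue of the steps above, here is a minimal Python sketch using scikit-learn and pandas. The column names and data are hypothetical, and scikit-learn's PCA only approximates the module's behavior (it always mean-centers its input, which corresponds to selecting the normalize option).

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
feature_cols = [f"f{i}" for i in range(8)]   # hypothetical feature columns
train = pd.DataFrame(rng.normal(size=(200, 8)), columns=feature_cols)

# Steps 2-3: choose the columns to evaluate and the number of dimensions.
pca = PCA(n_components=3).fit(train[feature_cols])

# The reduced output: three new columns approximating all eight inputs.
reduced = pd.DataFrame(pca.transform(train[feature_cols]),
                       columns=["Col1", "Col2", "Col3"])

# Reusing the fitted transformation on new data with the same schema,
# analogous to saving the transform and using Apply Transformation.
new_data = pd.DataFrame(rng.normal(size=(50, 8)), columns=feature_cols)
new_reduced = pca.transform(new_data[feature_cols])
```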
For examples of how Principal Component Analysis is used in machine learning, see these experiments in the Model Gallery:
The Clustering: Find Similar Companies sample uses Principal Component Analysis to reduce the number of values generated by text mining to a manageable number of features.
Although in this sample PCA is applied using a custom R script, it illustrates how PCA is typically used.
Technical notes

Computation of the lower-dimensional components has two stages. The first stage constructs a low-dimensional subspace that captures the action of the matrix. The second stage restricts the matrix to that subspace and then computes a standard factorization of the reduced matrix.
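The sketch below illustrates those two stages in NumPy, following the randomized approach of Halko, Martinsson, and Tropp cited above. It is a simplified teaching version, not the module's implementation; the oversampling amount and test data are arbitrary.

```python
import numpy as np

def randomized_pca(A, k, oversample=10, seed=0):
    """Project the rows of A onto k approximate principal components."""
    rng = np.random.RandomState(seed)
    A = A - A.mean(axis=0)                    # mean-center the data

    # Stage 1: build a low-dimensional subspace that captures the
    # action of the matrix, via a random projection and QR.
    omega = rng.normal(size=(A.shape[1], k + oversample))
    Q, _ = np.linalg.qr(A @ omega)            # orthonormal basis for range(A @ omega)

    # Stage 2: restrict the matrix to the subspace and compute a
    # standard factorization (SVD) of the much smaller matrix.
    B = Q.T @ A                               # (k + oversample) x n_features
    _, _, Vt = np.linalg.svd(B, full_matrices=False)
    components = Vt[:k]                       # top-k principal directions
    return A @ components.T                   # reduced data, shape (n_rows, k)

X = np.random.RandomState(2).normal(size=(500, 40))
X_reduced = randomized_pca(X, k=3)            # shape: (500, 3)
```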
Expected inputs

| Name | Type | Description |
|---|---|---|
| Dataset | Data Table | Dataset whose dimensions are to be reduced |
Module parameters

| Name | Type | Range | Optional | Description | Default |
|---|---|---|---|---|---|
| Selected columns | ColumnSelection | | Required | Selected columns to apply PCA to | |
| Number of dimensions to reduce to | Integer | >=1 | Required | The number of desired dimensions in the reduced dataset | |
| Normalize dense dataset to zero mean | Boolean | | Required | Indicates whether the input columns are mean-normalized for dense datasets (for sparse data, this parameter is ignored) | true |
Outputs

| Name | Type | Description |
|---|---|---|
| Results dataset | Data Table | Dataset with reduced dimensions |
| PCA Transformation | ITransform interface | Transformation that, when applied to a dataset, gives a new dataset with reduced dimensions |
For a list of all exceptions, see Machine Learning REST API Error Codes.
| Exception | Description |
|---|---|
| Error 0001 | Exception occurs if one or more specified columns of the dataset couldn't be found. |
| Error 0003 | Exception occurs if one or more of the inputs are null or empty. |
| Error 0004 | Exception occurs if a parameter is less than or equal to a specific value. |