Data Transformation - Manipulation
Updated: October 5, 2017
This article describes the modules in Azure Machine Learning Studio that are provided for basic data manipulation. While Studio supports other tasks that are very specific to machine learning, such as normalization or feature selection, the modules in this group are intended for more general tasks.
You can now use Azure Machine Learning Workbench to perform more sophisticated data cleanup and preparation tasks, using "learn by example" functions. See this blog from the Machine Learning team for examples.
The modules in this group are intended to support core data management tasks that might need to be performed in Studio, such as:
Combining two datasets, either by using joins, or by merging columns or rows
Creating new categories to use in grouping data
Modifying column headings, changing column data types, or flagging columns as features or labels
Checking for missing values and replacing them with appropriate values
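The tasks listed above are performed in Studio by configuring modules, not by writing code. Purely as an illustration of what the operations do to the data, the following is a minimal sketch in plain Python. The sample rows, the column names (`id`, `plan`, `age`, `total`), and the helper functions are all hypothetical and are not part of Studio.

```python
from statistics import mean

# Hypothetical sample data: two small "datasets" as lists of dicts.
customers = [
    {"id": 1, "plan": "free", "age": 34},
    {"id": 2, "plan": "pro", "age": None},      # missing value
    {"id": 3, "plan": "enterprise", "age": 52},
]
orders = [
    {"id": 1, "total": 20.0},
    {"id": 3, "total": 35.5},
    {"id": 3, "total": 12.0},
]

def fill_missing_with_mean(rows, column):
    """Replace None in a numeric column with the mean of the known values."""
    known = [r[column] for r in rows if r[column] is not None]
    fill = mean(known)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

def group_categories(rows, column, mapping, new_column):
    """Map several existing categories onto a smaller set of new categories."""
    return [{**r, new_column: mapping[r[column]]} for r in rows]

def inner_join(left, right, key):
    """Combine rows from two datasets that share the same key value."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

# 1. Replace missing ages, 2. create a coarser category, 3. join with orders.
cleaned = fill_missing_with_mean(customers, "age")
grouped = group_categories(
    cleaned, "plan",
    {"free": "unpaid", "pro": "paid", "enterprise": "paid"},
    "plan_group",
)
joined = inner_join(grouped, orders, "id")

for row in joined:
    print(row)
```

In this sketch the customer with no orders drops out of the result, because the join keeps only rows whose key appears in both datasets.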
Related Tasks
To perform sampling or divide a dataset into training and testing sets, see Sample and Split.
To scale numbers, normalize data, or put numerical values into bins, use the modules in Scale and Reduce.
To perform calculations on numeric data fields or generate commonly used statistics, use the tools in Statistical Functions.
For examples of how to work with complex data in machine learning experiments, see these samples in the Model Gallery:
The Data Processing and Analysis sample demonstrates key tools and processes.
The Breast cancer detection sample illustrates how to partition datasets and apply special processing to each partition.
For additional information about the process of preparing data for predictive analytics, see these resources:
The Data Transformation/Manipulation category includes the following modules:
| Module | Description |
|---|---|
| Add Columns | Adds a set of columns from one dataset to another |
| Add Rows | Appends a set of rows from an input dataset to the end of another dataset |
| Apply SQL Transformation | Runs a SQLite query on input datasets to transform the data |
| Clean Missing Data | Specifies how to handle values missing from a dataset. This module replaces the deprecated Missing Values Scrubber module. |
| Convert to Indicator Values | Converts categorical values in columns to indicator values |
| Edit Metadata | Edits metadata associated with columns in a dataset |
| Group Categorical Values | Groups data from multiple categories into a new category |
| Join Data | Joins two datasets |
| Remove Duplicate Rows | Removes the duplicate rows from a dataset |
| Select Columns in Dataset | Selects columns to include or exclude from a dataset in an operation |
| Select Columns Transform | Creates a transformation that selects the same subset of columns as in the given dataset |
| SMOTE | Increases the number of low incidence examples in a dataset using synthetic minority oversampling |
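Because Apply SQL Transformation runs a SQLite query, the kind of statement it accepts can be tried outside Studio with Python's built-in `sqlite3` module. The sketch below is an assumption-laden illustration, not the module's implementation: the table name `t1`, its columns, and the sample rows are placeholders chosen here. The query removes duplicate rows and replaces missing scores with a default value.

```python
import sqlite3

# In-memory SQLite database holding one hypothetical input table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER, label TEXT, score REAL)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?, ?)",
    [
        (1, "a", 0.5),
        (2, "b", None),   # missing score
        (2, "b", None),   # exact duplicate of the previous row
        (3, "a", 1.5),
    ],
)

# DISTINCT drops duplicate rows; COALESCE substitutes 0.0 for NULL scores.
query = """
SELECT DISTINCT id, label, COALESCE(score, 0.0) AS score
FROM t1
ORDER BY id
"""
rows = conn.execute(query).fetchall()
conn.close()

for row in rows:
    print(row)
```

The same effect can be reached in Studio with dedicated modules such as Remove Duplicate Rows and Clean Missing Data. A single query is mainly convenient when several simple transformations need to be combined in one step.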