Data Transformation - Manipulation

 

Updated: October 5, 2017

This article describes the modules in Azure Machine Learning Studio that are provided for basic data manipulation. While Studio supports other tasks that are very specific to machine learning, such as normalization or feature selection, the modules in this group are intended for more general tasks.

Tip

You can now use Azure Machine Learning Workbench to perform more sophisticated data cleanup and preparation tasks, using "learn by example" functions. See this blog from the Machine Learning team for examples.

The modules in this group are intended to support core data management tasks that might need to be performed in Studio, such as:

  • Combining two datasets, either by using joins, or by merging columns or rows

  • Creating new categories to use in grouping data

  • Modifying column headings, changing column data types, or flagging columns as features or labels

  • Checking for missing values and replacing them with appropriate values
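As a rough illustration of what these tasks do to the data, the sketch below performs a join, a category regrouping, and a missing-value replacement in plain Python. It is not how the Studio modules are implemented; the datasets, column names, and the choice of mean replacement are all hypothetical.

```python
# Hypothetical toy data; none of these names come from Studio.
customers = [
    {"id": 1, "region": "north", "age": 34},
    {"id": 2, "region": "south", "age": None},  # missing value
    {"id": 3, "region": "east", "age": 51},
]
orders = [
    {"id": 1, "total": 20.0},
    {"id": 3, "total": 75.5},
]

# Combine two datasets with an inner join on "id".
totals_by_id = {row["id"]: row["total"] for row in orders}
joined = [
    {**c, "total": totals_by_id[c["id"]]}
    for c in customers
    if c["id"] in totals_by_id
]

# Create a new category for grouping: fold several regions into one label.
region_groups = {"north": "inland", "east": "inland", "south": "coastal"}
for c in customers:
    c["region_group"] = region_groups.get(c["region"], "other")

# Replace missing ages with the mean of the observed ages.
known_ages = [c["age"] for c in customers if c["age"] is not None]
mean_age = sum(known_ages) / len(known_ages)
cleaned = [
    {**c, "age": c["age"] if c["age"] is not None else mean_age}
    for c in customers
]
```

In Studio, each of these steps would typically be a separate module connected in an experiment rather than inline code.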

Related Tasks

  • To perform sampling or divide a dataset into training and testing sets, see Sample and Split.

  • To scale numbers, normalize data, or put numerical values into bins, use the modules in Scale and Reduce.

  • To perform calculations on numeric data fields or generate commonly used statistics, use the tools in Statistical Functions.

For examples of how to work with complex data in machine learning experiments, see these samples in the Model Gallery:

For additional information about the process of preparing data for predictive analytics, see these resources:

The Data Transformation/Manipulation category includes the following modules:

Module                       Description
Add Columns                  Adds a set of columns from one dataset to another
Add Rows                     Appends a set of rows from an input dataset to the end of another dataset
Apply SQL Transformation     Runs a SQLite query on input datasets to transform the data
Clean Missing Data           Specifies how to handle the values missing from a dataset. This module replaces Missing Values Scrubber, which has been deprecated.
Convert to Indicator Values  Converts categorical values in columns to indicator values
Edit Metadata                Edits metadata associated with columns in a dataset
Group Categorical Values     Groups data from multiple categories into a new category
Join Data                    Joins two datasets
Remove Duplicate Rows        Removes the duplicate rows from a dataset
Select Columns in Dataset    Selects columns to include or exclude from a dataset in an operation
Select Columns Transform     Creates a transformation that selects the same subset of columns as in the given dataset
SMOTE                        Increases the number of low incidence examples in a dataset using synthetic minority oversampling
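To give a sense of the kind of query Apply SQL Transformation runs, the following sketch uses Python's built-in sqlite3 module to join and aggregate two small tables with a SQLite query. The table names, columns, and data are assumptions made for illustration, and running SQLite from Python only approximates what the module does inside an experiment.

```python
import sqlite3

# In-memory SQLite database standing in for two input datasets.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER, label TEXT)")
conn.execute("CREATE TABLE t2 (id INTEGER, score REAL)")
conn.executemany("INSERT INTO t1 VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "a")])
conn.executemany("INSERT INTO t2 VALUES (?, ?)", [(1, 0.5), (3, 1.5)])

# Join the two tables on id, then aggregate the scores per label.
query = """
SELECT t1.label, COUNT(*) AS n, AVG(t2.score) AS mean_score
FROM t1 JOIN t2 ON t1.id = t2.id
GROUP BY t1.label
ORDER BY t1.label
"""
rows = conn.execute(query).fetchall()
conn.close()
```

Because the join is an inner join, the row with id 2 has no match in the second table and does not contribute to the result.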

Data Transformation
Module Categories and Descriptions
A-Z Module List
