Data Transformation / Manipulation

 

Updated: April 11, 2016

The modules in this section provide tools that are used to prepare data for machine learning.

Azure Machine Learning Studio supports most of the core data management tasks needed to prepare data for a machine learning model, including:

  • Merging datasets

  • Grouping and summarizing data

  • Converting values to another type

  • Checking for missing values and replacing them with appropriate values

  • Flagging columns by role (for example, as features or labels)
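To make these tasks concrete, here is a minimal sketch of the same steps in plain Python, outside of Studio. It is an analogy only and does not show how the Studio modules are implemented. The datasets and the column names (`id`, `region`, `amount`) are hypothetical and invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy datasets; all column names are assumptions for this sketch.
customers = [
    {"id": 1, "region": "east"},
    {"id": 2, "region": "west"},
    {"id": 3, "region": "east"},
]
orders = [
    {"id": 1, "amount": "10.5"},
    {"id": 2, "amount": None},  # missing value
    {"id": 3, "amount": "4.5"},
]

# Merging datasets: inner join on the shared "id" column.
region_by_id = {row["id"]: row["region"] for row in customers}
merged = [
    {**order, "region": region_by_id[order["id"]]}
    for order in orders
    if order["id"] in region_by_id
]

# Converting values to another type: string amounts become floats.
for row in merged:
    if row["amount"] is not None:
        row["amount"] = float(row["amount"])

# Checking for missing values and replacing them with the column mean.
known = [row["amount"] for row in merged if row["amount"] is not None]
fill_value = mean(known)
for row in merged:
    if row["amount"] is None:
        row["amount"] = fill_value

# Grouping and summarizing: total amount per region.
totals = defaultdict(float)
for row in merged:
    totals[row["region"]] += row["amount"]
totals = dict(totals)
```

Replacing a missing value with the column mean is only one possible strategy, chosen here to keep the sketch short. The right replacement depends on the data and the model.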

Most of the modules in this section are designed to work with discrete or categorical data. If you need to scale numbers, normalize data, or put numerical values into bins, use the tools in the Data Transformation / Scale and Reduce section.

If you need to perform calculations on numeric data fields or generate commonly used statistics, see the modules in the Statistical Functions section.

For examples of how to work with complex data in machine learning experiments, see these samples in the Model Gallery:

For additional information about the process of preparing data for predictive analytics, see these resources:

The Data Transformation/Manipulation category includes the following modules:

| Module | Description |
| --- | --- |
| Add Columns | Adds a set of columns from one dataset to another |
| Add Rows | Appends a set of rows from an input dataset to the end of another dataset |
| Apply SQL Transformation | Runs a SQLite query on input datasets to transform the data |
| Clean Missing Data | Specifies how to handle the values missing from a dataset. This module replaces the deprecated Missing Values Scrubber module. |
| Convert to Indicator Values | Converts categorical values in columns to indicator values |
| Edit Metadata | Edits metadata associated with columns in a dataset |
| Group Categorical Values | Groups data from multiple categories into a new category |
| Join Data | Joins two datasets |
| Remove Duplicate Rows | Removes the duplicate rows from a dataset |
| Select Columns in Dataset | Selects columns to include or exclude from a dataset in an operation |
| Select Columns Transform | Creates a transformation that selects the same subset of columns as in the given dataset |
| SMOTE | Increases the number of low incidence examples in a dataset using synthetic minority oversampling |
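Because Apply SQL Transformation runs a SQLite query, the kind of query it accepts can be tried locally with Python's built-in `sqlite3` module. The following is a rough sketch, not the module itself. The table names `t1` and `t2`, the columns, and the data are assumptions made for illustration. The query removes duplicate rows, joins the two tables, and summarizes by group, combining several tasks from the list above in one statement.

```python
import sqlite3

# In-memory database standing in for two input datasets (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, color TEXT);
    CREATE TABLE t2 (id INTEGER, price REAL);
    INSERT INTO t1 VALUES (1, 'red'), (2, 'blue'), (2, 'blue'), (3, 'red');
    INSERT INTO t2 VALUES (1, 9.0), (2, 5.0), (3, 1.0);
""")

# Deduplicate t1, join it to t2 on id, then count rows and sum prices per color.
query = """
    SELECT d.color, COUNT(*) AS n, SUM(t2.price) AS total
    FROM (SELECT DISTINCT id, color FROM t1) AS d
    JOIN t2 ON t2.id = d.id
    GROUP BY d.color
    ORDER BY d.color
"""
rows = conn.execute(query).fetchall()
conn.close()

for color, n, total in rows:
    print(color, n, total)
```

Testing a query this way can catch syntax errors early, although behavior inside Studio may differ in details such as how column types are mapped.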
