Data Transformation / Manipulation
Updated: April 11, 2016
The modules in this section provide tools for preparing data for machine learning.
Azure Machine Learning Studio supports most of the core data management tasks needed to prepare data for a machine learning model, including:
Merging datasets
Grouping and summarizing data
Converting values to another type
Checking for missing values and replacing them with appropriate values
Flagging columns by role (for example, as features or labels)
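
The tasks listed above can be sketched in plain Python to show what each step does to the data. This is only an illustration using the standard library, not how the Studio modules are implemented; the sample rows and the helper names (`merge_on_key`, `convert_column`, `fill_missing_with_mean`, `summarize_by`) are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample data; None marks a missing value.
customers = [
    {"id": 1, "region": "east"},
    {"id": 2, "region": "west"},
    {"id": 3, "region": "east"},
]
orders = [
    {"id": 1, "amount": "10.5"},
    {"id": 2, "amount": None},
    {"id": 3, "amount": "4.5"},
]


def merge_on_key(left, right, key):
    """Merge two lists of row dicts on a shared key (inner join)."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


def convert_column(rows, column, to_type):
    """Convert the non-missing values of one column to another type."""
    return [
        {**row, column: None if row[column] is None else to_type(row[column])}
        for row in rows
    ]


def fill_missing_with_mean(rows, column):
    """Replace missing values with the mean of the observed values."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = mean(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]


def summarize_by(rows, group_column, value_column):
    """Group rows by one column and sum another column per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_column]].append(row[value_column])
    return {group: sum(values) for group, values in groups.items()}


merged = merge_on_key(customers, orders, "id")       # merging datasets
typed = convert_column(merged, "amount", float)      # converting values
cleaned = fill_missing_with_mean(typed, "amount")    # replacing missing values
totals = summarize_by(cleaned, "region", "amount")   # grouping and summarizing
print(totals)
```

Replacing missing values with the mean is only one possible strategy; which replacement is appropriate depends on the data.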
Most of the modules in this section are designed to work with discrete or categorical data. If you need to scale numbers, normalize data, or put numerical values into bins, use the tools in the Data Transformation / Scale and Reduce section.
If you need to perform calculations on numeric data fields or generate commonly used statistics, see the modules in the Statistical Functions section.
For examples of how to work with complex data in machine learning experiments, see these samples in the Model Gallery:
The Data Processing and Analysis sample demonstrates key tools and processes.
The Breast cancer detection sample illustrates how to partition datasets and apply special processing to each partition.
For additional information about the process of preparing data for predictive analytics, see these resources:
The Data Transformation/Manipulation category includes the following modules:
| Module | Description |
|---|---|
| | Adds a set of columns from one dataset to another |
| | Appends a set of rows from an input dataset to the end of another dataset |
| | Runs a SQLite query on input datasets to transform the data |
| | Specifies how to handle the values missing from a dataset. This module replaces the deprecated Missing Values Scrubber module. |
| | Converts categorical values in columns to indicator values |
| | Edits metadata associated with columns in a dataset |
| | Groups data from multiple categories into a new category |
| | Joins two datasets |
| | Removes the duplicate rows from a dataset |
| | Selects columns to include or exclude from a dataset in an operation |
| | Creates a transformation that selects the same subset of columns as in the given dataset |
| | Increases the number of low-incidence examples in a dataset using synthetic minority oversampling |
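
To give a feel for the SQLite-based transformation described in the table, the following sketch uses Python's built-in `sqlite3` module to remove duplicate rows and to turn a categorical column into indicator values. It runs outside Studio; the table name `t1`, the column names, and the sample rows are assumptions made for illustration.

```python
import sqlite3

# Hypothetical input dataset with one duplicated row.
rows = [("a", "red"), ("b", "blue"), ("a", "red"), ("c", "red")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (item TEXT, color TEXT)")
conn.executemany("INSERT INTO t1 VALUES (?, ?)", rows)

# Remove duplicate rows.
distinct_rows = conn.execute(
    "SELECT DISTINCT item, color FROM t1 ORDER BY item"
).fetchall()

# Convert the categorical 'color' column to indicator (0/1) columns.
# In SQLite, a comparison such as color = 'red' evaluates to 1 or 0.
indicator_rows = conn.execute(
    """
    SELECT DISTINCT item,
           color = 'red'  AS color_red,
           color = 'blue' AS color_blue
    FROM t1
    ORDER BY item
    """
).fetchall()

conn.close()

print(distinct_rows)
print(indicator_rows)
```

Listing the categories by hand in the query works for a small, known set of values; a column with many categories would call for generating the indicator columns programmatically.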