Data Transformation - Sample and Split

 

Updated: February 13, 2017

Partitioning and sampling data are important tasks in machine learning. For example, it is a common practice to divide data into training and testing sets, so that you can evaluate a model on a holdout data set. Sampling is also increasingly important in the era of big data, to ensure that there is a fair distribution of classes in your training data, and that you are not processing more data than is needed.

The modules in this category, Partition and Sample and Split Data, are highly customizable, to meet the needs of machine learning scenarios such as these:

  • Filter my training data based on some attribute in the data
  • Perform stratified sampling to distribute the values of the class variable evenly across n groups
  • Divide source data into a training and testing data set with a custom ratio
  • Apply regular expressions to the data to filter out invalid values

Although the names sound similar, the two modules are designed to provide complementary functionality, and you might find yourself combining them in an experiment to get the right mix and right amount of data.

  • Sampling: Use Partition and Sample. This module provides multiple, customizable sampling methods, including several options for stratified sampling.

  • Divide data into two groups: Use the Split Data module. It produces exactly two splits of the data. You can specify the condition on which the data is split, and the proportion of the data to put into each subset. Split Data always keeps the subset of data that doesn't meet the condition, rather than discarding it.

  • Return only a subset of the data: Use the Partition and Sample module. It gives you the specified subset on the primary output. The remaining data is available on a secondary output.

  • Get only the top 2000 rows of a dataset: Use Partition and Sample and select the Head option. This is particularly handy when you are testing a new experiment and want to run short trials of a workflow.
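
Both modules are configured in the experiment rather than written as code, so the following is only a plain-Python analogue of two of the operations listed above: a random two-way split with a custom ratio, and taking the first 2000 rows. The helper names `split_rows` and `head` are hypothetical, and the 70/30 ratio and fixed seed are arbitrary assumptions made for illustration.

```python
import random


def split_rows(rows, fraction, seed=0):
    """Shuffle a copy of rows and cut it into two lists at the given fraction."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    # Nothing is discarded: the second list holds every row not in the first.
    return shuffled[:cut], shuffled[cut:]


def head(rows, n):
    """Return the first n rows in their original order."""
    return list(rows)[:n]


rows = list(range(10_000))           # stand-in for an imported dataset
train, test = split_rows(rows, 0.7)  # 70% training, 30% testing
preview = head(rows, 2000)           # small slice for a quick trial run
```

As in the module descriptions, the split keeps both parts, while the head operation only returns the leading rows for a short trial.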

Examples of operations with the Split Data module

Suppose you imported a very large dataset from a CSV file that contains customer demographics. You want to create different models for customers in different countries, so you decide to split the data by the value of the Country/Region column.

  1. Add the Split Data module, and specify an expression on the Country/Region column that selects one country.
  2. Add another instance of the Split Data module, and use as its input the secondary output of the first instance, which contains the remainder of the data.
  3. Repeat, specifying a different country in the expression each time.

The Split Data module supports both regular expressions, for text data, and relative expressions, for numeric data.
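
The chained splits described above can be sketched in plain Python. This is not the expression syntax that Split Data uses; it is a minimal analogue that assumes rows are dictionaries, and the helper `split_by`, the sample rows, and the `Age` column are all hypothetical. A regular expression stands in for the text condition and a numeric comparison stands in for the relative expression.

```python
import re


def split_by(rows, predicate):
    """Return (matching rows, remaining rows) without dropping anything."""
    matched, rest = [], []
    for row in rows:
        (matched if predicate(row) else rest).append(row)
    return matched, rest


customers = [
    {"Country/Region": "France", "Age": 34},
    {"Country/Region": "Germany", "Age": 51},
    {"Country/Region": "France", "Age": 27},
    {"Country/Region": "Japan", "Age": 45},
]

# First split: a regular expression on a text column.
france, remainder = split_by(
    customers, lambda r: re.search(r"^France$", r["Country/Region"]) is not None
)

# Second split: applied to the remainder of the first, with another country.
germany, remainder = split_by(
    remainder, lambda r: re.search(r"^Germany$", r["Country/Region"]) is not None
)

# A relative (comparison) condition on a numeric column.
over_40, at_most_40 = split_by(customers, lambda r: r["Age"] > 40)
```

Each call returns both halves, so the remainder can be passed on to the next split, which mirrors feeding the secondary output into another instance of the module.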

The Split Data module also provides sophisticated features for dividing the specialized datasets used for creating recommendation models, and for generating predictions.

Examples of operations with the Partition and Sample module

The Partition and Sample module can generate multiple partitions of the data, not just two. At the same time, it can perform various sampling operations.

For example, say you need to get just 5 percent of your data, but you want to ensure that the distribution of the target attribute is the same as in the source data.

  1. Add the Partition and Sample module.
  2. Choose the Sampling mode, and specify 5%.
  3. Select the stratified sampling option, and pick the column that contains the target attribute.
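
To illustrate what the stratified option achieves, here is a minimal sketch in plain Python. It is an analogue written under stated assumptions, not the module's implementation: the function `stratified_sample`, the `label` column, and the synthetic rows are hypothetical. The idea is to sample the same fraction from each value of the target column, so the class proportions in the sample match those in the source.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, rate, seed=0):
    """Sample the same fraction of rows from each distinct value of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    rng = random.Random(seed)
    sample = []
    for members in groups.values():
        # Rounding means a very small stratum can contribute zero rows.
        k = round(len(members) * rate)
        sample.extend(rng.sample(members, k))
    return sample


# Synthetic data: 20% of rows are labelled "yes", 80% "no".
rows = [{"id": i, "label": "yes" if i % 5 == 0 else "no"} for i in range(2000)]

sample = stratified_sample(rows, "label", 0.05)
share_yes = sum(r["label"] == "yes" for r in sample) / len(sample)
```

With these inputs the sample keeps the 20/80 class balance of the source, which a simple random 5% sample would only approximate.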

Whenever you don't need to keep all the data, use Partition and Sample. The remaining data is still present in the workspace but doesn't need to be processed further as part of the experiment.

We also recommend these modules for feature analysis and dimensionality reduction:

  • SMOTE: Increases the number of rare cases in a sample, or rebalances the cases for some target value.

  • Principal Component Analysis: Performs dimensionality reduction by finding the combination of features that best represents the data space.

  • Learning with Counts: Creates compact features based on an analysis of features and counts.

The Sample and Split category includes the following modules:

Module                  Description
Partition and Sample    Creates multiple partitions of a dataset based on sampling
Split Data              Partitions the rows of a dataset into two distinct sets

