Split Data using Relative Expression

 

Updated: June 2, 2017

This topic describes how to use the Relative Expression Split option in the Split Data module of Azure Machine Learning. Dividing datasets used for training and testing, either randomly or by some criteria, is an important task in many machine learning workflows.

A relative expression split lets you divide a dataset by choosing a single numeric column in your data and then creating an expression that acts as a filter on that column. The relative expression must include the column name, a value, and a comparison operator such as greater than, less than, equal to, or not equal to.

For general information about data partitioning for machine learning experiments, see Split Data and Partition and Sample.

  1. Add the Split Data module to your experiment, and connect the dataset you want to split as its input.

  2. For Splitting mode, select Relative Expression Split.

  3. In the Relational expression text box, type an expression that performs a numeric comparison operation.

    The expression divides the dataset into two sets of rows: rows with values that meet the condition, and all remaining rows. The expression can be applied only to the specified column.

  4. Run the experiment, or right-click the module and select Run selected.
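The effect of these steps can be illustrated outside the module with a minimal Python sketch. It is not the module's implementation: the function name relative_split, the sample rows, and the list-of-dictionaries layout are all assumptions made for illustration. It only shows the behavior described above, where rows that meet the condition go to one output and all remaining rows go to the other.

```python
import operator

# Comparison operators mapped to Python functions (illustrative only).
OPS = {
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
}


def relative_split(rows, column, op, value):
    """Return (matching_rows, remaining_rows) for one comparison on one column.

    Hypothetical helper: it mimics the two outputs of a relative
    expression split, not the module's actual code.
    """
    compare = OPS[op]
    matched, remaining = [], []
    for row in rows:
        (matched if compare(row[column], value) else remaining).append(row)
    return matched, remaining


# Made-up sample data.
rows = [{"Year": 2008}, {"Year": 2010}, {"Year": 2012}, {"Year": 2015}]

# Rows with Year greater than 2010 go left; everything else goes right.
left, right = relative_split(rows, "Year", ">", 2010)
```

Every input row lands in exactly one of the two outputs, so the outputs together always contain the full dataset.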

Notes

The following restrictions apply to relative expressions on a dataset:

  • Relative expressions can be applied to any columns that contain a numeric data type. Date/time data types are also supported.

  • Relative expressions can reference a maximum of one column name.

  • In relative expressions, use the ampersand character (&) for the AND operation and use the pipe character (|) for the OR operation.

  • The following operators are allowed for relative expressions: <, >, <=, >=, ==, !=

  • You cannot group operations by using "(" and ")".
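To make the syntax rules above concrete, here is a rough Python sketch of a parser for expressions of this shape. It is an illustration only, under several assumptions: the function name parse_relative_expression is hypothetical, only the quoted column-name form and plain numeric values are handled (no column indexes, dates, or time spans), and & is assumed to bind more tightly than |, which the restrictions above do not state.

```python
import operator
import re

OPS = {
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
}

# Column reference: a backslash, then the column name in double quotes.
HEADER = re.compile(r'\s*\\"([^"]+)"(.*)', re.S)
# One comparison: an operator followed by a numeric literal, e.g. <=30.
TERM = re.compile(r"\s*(<=|>=|==|!=|<|>)\s*(-?\d+(?:\.\d+)?)\s*")


def parse_relative_expression(expr):
    """Parse an expression into (column_name, predicate). Hypothetical helper."""
    header = HEADER.fullmatch(expr)
    if header is None:
        raise ValueError('expected a column reference such as \\"Year"')
    column, rest = header.groups()

    # Assumption: & binds tighter than |, so split on | first, then on &.
    or_groups = []
    for or_part in rest.split("|"):
        and_terms = []
        for term in or_part.split("&"):
            match = TERM.fullmatch(term)
            if match is None:
                raise ValueError(f"cannot parse comparison: {term!r}")
            and_terms.append((OPS[match.group(1)], float(match.group(2))))
        or_groups.append(and_terms)

    def predicate(value):
        # True if every comparison in at least one &-group holds.
        return any(all(op(value, bound) for op, bound in group)
                   for group in or_groups)

    return column, predicate


column, keep = parse_relative_expression(r'\"Age" <=30 & !=20')
```

Because the expression names only one column, the parser reads the column once and applies every comparison that follows to it. An input containing parentheses fails to parse here, since the sketch has no grouping support.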

The following examples demonstrate how to divide a dataset using Relative Expression mode:

  • A common scenario is to divide a dataset by year. The following expression selects all rows where the value in the Year column is greater than 2010.

    \"Year" > 2010  
    
    

    Note that the date expression must account for all date parts that are included in the data column, and that the format of dates in the data column is expected to be consistent. For example, in a date column using the format mmddyyyy, the expression must be something like this:

    \"Date" > 1/1/2010
    
    
  • The following expression demonstrates how you can use the column index to select all rows in which the value in the first column of the dataset is less than or equal to 30, but not equal to 20.

    (\0)<=30 & !=20  
    
    
  • Suppose you want to split a table of log data to group queries that run too long. You could use the following relative expression to put the queries that ran longer than one minute in one dataset, and then use another instance of Split Data on the right-hand output to extract the queries with response times under one minute but over 30 seconds.

    \"Elapsed" >00:01:00

    \"Elapsed" <00:01:00 & >00:00:30
    
    
  • The following relative expression specifies that the dataset should be divided by using the date values in the column dt1. Rows with a date greater than 10-08-2015 are added to the first (left) output dataset. Rows with a date of 10-08-2015 or earlier are added to the second (right) output dataset.

    \"dt1" > 10-08-2015  
    
    

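The chained split from the log-data example above can be sketched in Python as follows. This is a sketch only: the split helper, the id field, and the sample durations are made up for illustration, and Python timedelta values stand in for the Elapsed column.

```python
from datetime import timedelta


def split(rows, keep):
    """Return (rows where keep(row) is true, all other rows). Hypothetical helper."""
    left = [r for r in rows if keep(r)]
    right = [r for r in rows if not keep(r)]
    return left, right


# Made-up log rows; Elapsed holds a query's running time.
queries = [
    {"id": 1, "Elapsed": timedelta(seconds=95)},
    {"id": 2, "Elapsed": timedelta(seconds=45)},
    {"id": 3, "Elapsed": timedelta(seconds=10)},
    {"id": 4, "Elapsed": timedelta(seconds=60)},
]

one_minute = timedelta(minutes=1)
half_minute = timedelta(seconds=30)

# First split: queries that ran longer than one minute go to the left output.
slow, rest = split(queries, lambda r: r["Elapsed"] > one_minute)

# Second split, applied to the right-hand output of the first:
# under one minute but over 30 seconds.
medium, fast = split(rest, lambda r: half_minute < r["Elapsed"] < one_minute)
```

A query that ran for exactly one minute meets neither condition, so in this sketch it ends up in the last remaining output along with the fast queries.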
Split Data using Regular Expression
Split Data using Split Rows
Split Data using Recommender Split

Sample and Split
Partition and Sample
A-Z Module List
