Select Columns in Dataset

 

Updated: August 22, 2017

Selects columns to include or exclude from a dataset in an operation

Category: Data Transformation / Manipulation

You can use the Select Columns in Dataset module to choose a subset of columns to use in downstream operations. This can be useful if you need to limit the columns available for a downstream operation, or if you want to reduce the size of the dataset by removing unneeded columns. Some common selection scenarios are described in the Examples section.

Using Select Columns in Dataset does not actually remove the columns from the source dataset; it only creates a view or subset of columns.

In general, the columns in the dataset are output in the same order as in the original data, even if you specify them in a different order.

This module has no parameters. You use the column selector to choose the columns to include or exclude.

Choose by name

There are multiple methods for choosing columns by name:

  • Filter and search

    Click the BY NAME option and then filter columns by type or search for columns by name.

    If you have connected a dataset that is already populated, a list of available columns should appear. If no columns appear, you might need to run upstream modules to view the column list.

    To filter the list, type in the search box. For example, if you type the letter "w" in the search box, the list is filtered to show the column names that contain the letter w.

    Select columns and click the right arrow button to move the selected columns to the list in the right-hand pane. Press Shift + Click to select a continuous range of column names. Press Ctrl + Click to add individual columns to the selection.

    Click the checkmark button to save and close.

  • Use names in combination with other rules

    Click the WITH RULES option and then select columns one-by-one.

  • Type or paste a comma-separated list of column names

    If your dataset is very wide, it might be easier to use indexes or generated lists of names, rather than selecting columns individually. Assuming you have prepared the list in advance:

    1. Click the WITH RULES option.
    2. Select No columns, select Include, and then click inside the text box with the red exclamation mark.
    3. Paste in or type a comma-separated list of previously validated column names. You will not be able to save the module if any column has an invalid names, so be sure to check the names beforehand.

    You can also use this method to specify a list of column indices. See the Examples for some tips on how to find column indices.

Choose by type

If you use the WITH RULES option, you can apply multiple conditions on the column selections. For example, you might need to get only feature columns of a numeric data type.

The Begin With option determines your starting point and is very important for understanding the results.

  • If you select the ALL COLUMNS option, all columns are added to the list, and then you must use the Exclude option to remove columns that meet certain conditions. For example, you might start with all columns and then remove columns by name, or by type.

  • If you select the NO COLUMNS option, the list of columns starts out empty, and you specify conditions that define the columns you want to add to the list. If you apply multiple rules, each condition is additive.

    For example, say you start with no columns, and then add a rule to get all numeric columns. In the Automobile price dataset, that results in 16 columns. Then, you click the + sign to add a new condition, and select Include all features. The resulting dataset includes all the numeric columns, plus all the feature columns, including some string feature columns.

Choose by column index

The column index refers to the position of the column within the original dataset. Columns are numbered sequentially starting at 1. To get a range of columns, use a hyphen. Open-ended specifications such as 1- are not allowed.

For example, the following examples all return multiple non-contiguous columns:

  • 8,1-4,6
  • 1,3-8
  • 1,3-6,4 -- No error, but returns the duplicated column only once

For additional tips on working with column indices, see the Examples section.

Change order of columns

The option Allow duplicates and preserve column order in selection starts with an empty list, and adds columns that you specify by name or by index. Unlike other options, which always return columns in their "natural order", this option outputs the columns in the order that you name or list them.

For example, in a dataset with the columns Col1, Col2, Col3, and Col4, you could reverse the order of the columns and leave one column out by specifying a list as follows:

  • Col4, Col3, Col1
  • 4,3,1

Duplicate column names and indexes are not allowed, and might result in an error.

For examples of how to use Select Columns in Dataset, see these sample experiments in the Model Gallery:

Common scenarios for column selection

The following examples describe some typical ways that users apply Select Columns in Dataset in machine learning, and the easiest way to select the columns:

  • I want to remove text columns from the dataset so I can apply a math operation to all numeric columns.

    Many operations require that only numeric columns be present in the dataset. You can temporarily remove columns that would cause an error, by excluding text and excluding categorical columns (numbers that represent discrete categories).

    1. Click Launch column selector.

    2. For Begin With, select All columns.

    3. Select the Exclude option, select column type, and then select String.

    4. Click the plus sign (+) to add a new condition.

    5. Select the Exclude option, select column type, and then select Categorical.

  • I need to apply feature selection to only the categorical feature columns.

    If you need to separate columns of a similar type, you can apply multiple conditions. For example, features can be either categorical or numeric, but some feature selection modules do not allow non-numeric fields, so you first have to get features, and then add a condition to get just the numeric features.

    1. Click Launch column selector.

    2. For Begin With, select No columns.

    3. Select the Include option, and select all features.

    4. Click the plus sign (+) to add a new condition.

    5. Select the Include option, select column type, and then select Categorical.

  • I need to apply a different normalization operation to different numeric columns.

    Before applying mathematical operations, you might need to separate integers from floating point numbers, and so forth. To do this use the data types and apply multiple conditions.

    1. Click Launch column selector.

    2. For Begin With, select No columns.

    3. Select the Include option, select column type, and then select Numeric.

    4. Click the plus sign (+) to add a new condition.

    5. Select the Include option, select column type, and then select the numeric type that is incompatible with the downstream operation.

  • There are too many columns to choose using the selector.

    Often, after importing a dataset, you find that it has a lot of columns that aren't needed for modeling. However, you want to preserve them for output later, or for identifying cases. You can do this by splitting the dataset into two parts (metadata, and columns used for modeling) and later recombine columns as needed, by using Add Columns.

    1. Click Launch column selector.

    2. For Begin With, select No columns.

    3. Select the Include option, select column type, and then select Feature.

    4. Click the plus sign (+) to add a new condition.

    5. Select the Include option, select column type, and then select Label.

    6. Repeat these steps, but start with all columns, and then exclude feature and label columns to create a dataset of only the metadata.

  • I don't know the index values for the columns I need.

    If there are just a few columns in your dataset, you can use the Visualize option to see the first 100 rows and then figure out which column is index 1, 2, and so forth.

    • The indexes in Azure Machine Learning start at 1, so the first column is always 1.

    • To get the index of the last column, look at the two lists of columns in the Column Selector: AVAILABLE COLUMNS and SELECTED COLUMNS. The gray bar beneath the column list displays the count of columns in each list. Thus, if 24 columns are available and two columns are selected, there are a total of 26 columns, and the index of the final column is 26.

    Another option for extracting the schema of your dataset is to use the Execute R Script module to get the column names with index numbers.

    1. Connect your dataset to the Execute R Script module.

    2. In the module, type a script like the following to output the column names. The line starting with myindex generates a sequence that represents the indexes in order.

      dataset1 <- maml.mapInputPort(1) # class: data.frame
      mycolnames <-names(dataset1);
      myindex <- seq(from = 1, to = length(mycolnames), by=1);
      outdata <- as.data.frame(cbind(myindex, mycolnames));
      maml.mapOutputPort("outdata"); 
      
      

    Results on Automobile price dataset

    myindexmycolnames
    1symboling
    2normalized-losses
    3make

If you are familiar with relational databases, this module creates a projection of the data; hence the original name, Project Columns). In database terms, a projection is a function, such as a Transact-SQL or LINQ statement, that takes a data in tabular format as input and produces a related output.

In relational algebra, a projection is a unary operation, which is written as a set of attribute names. The result of a projection is the set of those attributes, with other attributes discarded.

NameTypeDescription
DatasetData TableInput dataset
NameRangeTypeDefaultDescription
Select columnsanyColumnSelectionSelect columns to keep in the projected dataset.
NameTypeDescription
Results datasetData TableOutput dataset

For a list of all exceptions, see Module Error Codes.

ExceptionDescription
Error 0001An exception occurs if one or more specified columns of the dataset couldn't be found.
Error 0003An exception occurs if one or more input datasets are null or empty.

Manipulation
A-Z Module List

Show: