Convert to Dataset

 

Updated: March 2, 2017

Converts data input to the internal Dataset format used by Microsoft Azure Machine Learning.

Category: Data Format Conversions

You can use the Convert to Dataset module to convert any data that you might need for an experiment to the internal format used by Studio.

Conversion is not strictly required, because Azure Machine Learning implicitly converts data to its native dataset format when any operation is performed on the data. However, saving data to the dataset format is recommended if you have performed some kind of normalization or cleaning on a set of data, and you want to ensure that the changes are used in further experiments.

Note

Convert to Dataset changes only the format of the data; it does not save a new copy of the data in the workspace. To save the dataset, right-click the output port, select Save as Dataset, and type a new name.

We recommend that you use the Edit Metadata module to prepare the dataset before using Convert to Dataset. You can add or change column names, adjust data types, and so forth.

  1. Add the Convert to Dataset module to your experiment. You can find this module in the Data Format Conversions group in the experiment items list in Azure Machine Learning Studio.

  2. Connect it to any module that outputs a dataset.

    As long as the data is tabular, you can convert it to a dataset. This includes data loaded using Import Data, data created by using Enter Data Manually, data generated by code in custom modules, datasets transformed by using Apply Transformation, or datasets that were generated or modified by using Apply SQL Transformation.

  3. In the Action dropdown list, indicate whether you want to do any cleanup on the data before saving the dataset:

    • None

      Use the data as is.

    • SetMissingValue

      Specify a placeholder that will be inserted in the dataset wherever there is a missing value. The default placeholder is the question mark character (?), but you can use the Custom missing value option to type a different value.

    • ReplaceValues

      Use this option to specify a single exact value to be replaced with any other exact value.

      For example, assuming your data contains the string obs used as a placeholder for missing values, you could specify a custom replacement operation using these options:

      • Set Replace to Custom

      • For Custom value, type the value you want to find. In this case, you would type obs.

      • For New value, type the new value to replace the original string with. In this case, you might type ?

      Note that the ReplaceValues operation applies only to exact matches. For example, these strings would not be affected: obs., obsolete.

    • SparseOutput

      Indicates that the dataset is sparse. By creating a sparse data vector, you can ensure that missing values do not affect a sparse data distribution.

      After choosing this option, you must indicate how missing values and zero values should be handled. To remove a value other than zero, click the Remove option and type the single value to remove. You can remove missing values, or specify a custom value to delete from the vector. Only exact matches are removed. For example, if you type x in the Remove value text box, the value xx is not affected.

      By default, the option Remove zeroes is set to True, meaning that all zero values are removed when the sparse column is created.

  4. Run the experiment, or right-click the Convert to Dataset module and select Run selected.

    You can save the resulting dataset with a new name by right-clicking the output of Convert to Dataset and selecting Save as Dataset.
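The SetMissingValue and ReplaceValues actions described above can be sketched in a few lines of Python. This is only an illustration of the exact-match semantics, not the module's actual implementation: the function names `set_missing_value` and `replace_values` are hypothetical, and representing a missing value as `None` is an assumption made for the example.

```python
from typing import Any, List

Table = List[List[Any]]


def set_missing_value(rows: Table, placeholder: Any = "?") -> Table:
    """Insert a placeholder wherever a cell is missing.

    Assumption: a missing cell is represented as None.
    """
    return [[placeholder if cell is None else cell for cell in row]
            for row in rows]


def replace_values(rows: Table, old: Any, new: Any) -> Table:
    """Replace cells that equal `old` exactly.

    Partial matches such as "obs." or "obsolete" are left untouched.
    """
    return [[new if cell == old else cell for cell in row]
            for row in rows]


# Hypothetical sample data using "obs" as a stand-in for missing values.
rows = [["obs", "obs.", 3],
        ["obsolete", None, 35]]

cleaned = replace_values(rows, "obs", "?")   # only the exact "obs" changes
filled = set_missing_value(rows)             # only the None cell changes
renumbered = replace_values(rows, 3, -1)     # 3 changes, 35 does not
```

Because the comparison is made against the whole cell value, the 3 inside 35 is never considered a match, which mirrors the behavior the ReplaceValues option describes.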

You can see examples of how the Convert to Dataset module is used by exploring these sample experiments in the Model Gallery:

  • The CRM sample reads from a shared dataset and saves a copy of the dataset in the local workspace.

  • The Flight Delay example saves a dataset that has been cleaned by replacing missing values so that you can use it for future experiments.

The following technical notes apply to the Convert to Dataset module:

  • Any module that takes a dataset as input can also take data in the CSV, TSV, or ARFF formats. Before any module code is executed, the inputs are preprocessed in a way that is equivalent to running the Convert to Dataset module on the input.

  • You cannot convert from the SVMLight format to dataset.

  • When specifying a custom replace operation, the search and replace operation applies to complete values; partial matches are not allowed. For example, you can replace a 3 with a -1 or with 33, but you cannot replace a 3 in a two-digit number such as 35.

  • For custom replace operations, the replacement fails silently if the new value does not conform to the current data type of the column.

  • If you need to save numerical data that is sparse and has missing values, note that Studio supports sparse arrays internally by using SparseVector, a class in the Math.NET Numerics library. Prepare data that uses zeros and has missing values, and then use Convert to Dataset with the SparseOutput action and Remove Zeros = TRUE.
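To make the SparseOutput behavior more concrete, the following Python sketch stores only the cells that survive removal, keyed by their position. It is a simplified stand-in and does not use or reproduce the Math.NET SparseVector class: the function `to_sparse` and its parameters are hypothetical, and `None` is assumed to mark a missing value.

```python
from typing import Any, Dict, List, Optional


def to_sparse(values: List[Any],
              remove_zeros: bool = True,
              remove_missing: bool = True,
              remove_value: Optional[Any] = None) -> Dict[int, Any]:
    """Return a {index: value} mapping that omits removed cells.

    Assumptions: None marks a missing value, and `remove_value` is a
    single custom value that is dropped only on an exact match.
    """
    sparse: Dict[int, Any] = {}
    for index, value in enumerate(values):
        if value is None:
            if remove_missing:
                continue                      # drop missing values
        elif remove_zeros and value == 0:
            continue                          # drop zeros
        elif remove_value is not None and value == remove_value:
            continue                          # drop exact custom matches
        sparse[index] = value
    return sparse


# Hypothetical column with zeros and one missing value.
column = [0, 1.5, None, 0, 2.0]
compact = to_sparse(column)                        # zeros and None dropped
with_zeros = to_sparse(column, remove_zeros=False)  # zeros kept
exact_only = to_sparse(["x", "xx", 0], remove_value="x")
```

Keeping the original index as the key means the position of each retained value is preserved even though the removed cells take up no storage.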

Name       Type         Description
Dataset    Data Table   Input dataset

Name     Range   Type            Default   Description
Action   List    Action Method   None      Action to apply to input dataset

Name              Type         Description
Results dataset   Data Table   Output dataset

Data Format Conversions
A-Z Module List
