Unpack Zipped Datasets

 

Updated: September 20, 2017

Unpacks datasets from a zip package in user storage

Category: Data Input and Output

This article describes how to use the Unpack Zipped Datasets module in Azure Machine Learning Studio to get compressed data files and unzip them for use in an experiment.

The module takes as input a dataset in your workspace that was uploaded in a compressed format, decompresses it, and adds the data to your workspace. This is useful when you work with very large datasets, because saving and uploading your data files in a compressed format reduces data transfer times.

This section describes the process of preparing your data for zipping and unzipping in Azure Machine Learning Studio. Generally, zipping files is a good option when your dataset is so large that you want to use compression for the upload, to minimize upload time and associated costs.

Step 1. Prepare files

Before uploading your file, check that the file can be unzipped in Azure Machine Learning:

  • Ensure that the data in the file uses UTF-8 encoding.

    If the file is small enough, you can open it in Notepad and then save the file in the desired encoding. Many other text editors offer similar functionality. For CSV files, you can use Excel's Save As or Export commands to specify a file format and encoding.

  • Verify that the data files use a supported format, such as CSV, TSV, ARFF, or SVMLight.

  • Compress the data by adding the data file to a .ZIP or .GZ format archive file.

  • If any of the files or the compressed folder itself has been encrypted or password-protected, you must unlock or decrypt the file before you upload it. The module cannot detect encrypted data types and does not support dialog boxes for password entry from arbitrary clients.
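The checks in this step can also be scripted before upload. The following is a minimal sketch using only the Python standard library, not part of the module itself; the function names `is_utf8` and `zip_datasets` are hypothetical helpers introduced here for illustration. It verifies that each file decodes as UTF-8 and then adds the files to a .zip archive under their bare file names.

```python
import codecs
import zipfile
from pathlib import Path


def is_utf8(path, chunk_size=1 << 20):
    """Return True if the file at `path` decodes cleanly as UTF-8.

    The file is read in chunks so that large datasets need not fit in memory.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                decoder.decode(chunk)
            # Flush the decoder: fails if the file ends mid-character.
            decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True


def zip_datasets(paths, archive_path):
    """Write the files in `paths` to a new .zip archive (hypothetical helper).

    Every file is validated first, so no archive is written if any file
    is not UTF-8 encoded.
    """
    paths = [Path(p) for p in paths]
    for path in paths:
        if not is_utf8(path):
            raise ValueError(f"{path.name} is not UTF-8 encoded")
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in paths:
            # Store only the file name, without any folder prefix.
            archive.write(path, arcname=path.name)
    return archive_path
```

For example, `zip_datasets(["Users.txt"], "users.zip")` would produce an archive whose single entry is named Users.txt. The sketch does not check for encryption or for a supported data format; those still need to be verified separately.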

Step 2. Upload dataset to your workspace

Next, you upload the zipped dataset to your experiment workspace.

  1. Click NEW, select DATASET, and select FROM LOCAL FILE.

  2. Locate the zipped file to upload. When you select the file, the type should automatically be set to Zip file (.zip).

Step 3. Add zipped dataset to experiment

After the dataset has uploaded completely, you add it to your experiment in zipped format.

  1. In the left-hand navigation pane of Azure Machine Learning Studio, select Saved Datasets, and then expand My Datasets.

  2. Locate the zipped dataset that you just uploaded, and drag it to the experiment canvas.

Step 4. Unpack dataset

The final step is to unpack the dataset.

  1. Connect the zipped dataset to the input of the Unpack Zipped Datasets module.

  2. In Dataset to Unpack, type the name of a single dataset to unpack.

    • If you saved a worksheet with the name Sheet1 as an Excel CSV file named Test.csv, the name of the dataset would be Test.csv, not Sheet1.

    • The name that you type in the Dataset to Unpack text box must be exactly the same as the name of the original file before it was compressed, including the file name extension. For example, if you want to unpack a dataset based on the text file Users.txt, type Users.txt, not Users.

    • If you put multiple files into one compressed folder, you must unpack one dataset at a time.

    Tip

    If you leave the property blank, the module will get the file name from the zipped file, assuming the compressed archive file contains only one source file. If the compressed archive contains multiple files, you will get a run-time error.

  3. For Dataset file format, specify the original format of the dataset -- that is, the format before it was zipped.

    You can upload and unzip datasets that were created using any of these formats: CSV, TSV, ARFF, or SVMLight.

    If this property is left empty, the module will identify the dataset using the source file name.

  4. Select the File has header row option if the original dataset had a header row. Otherwise, leave it deselected.

    This option applies only to .CSV and .TSV files.

    Note

    If you change the format of the file, this option will be reset.

  5. If the file is compressed, use the Compression file format option to specify the algorithm that was used to compress or expand the file.

    Currently, the .ZIP and .GZ (Gzip) formats are supported.

  6. Run the experiment.
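Because the name typed in Dataset to Unpack must match the original file name exactly, it can help to list the contents of the archive before configuring the module. The following is a minimal sketch using Python's standard `zipfile` module; the helper names `dataset_names` and `can_leave_name_blank` are hypothetical and are not part of Azure Machine Learning Studio.

```python
import zipfile


def dataset_names(archive_path):
    """Return the names of the file entries in a .zip archive.

    Folder entries are skipped; each name is shown as stored in the archive.
    """
    with zipfile.ZipFile(archive_path) as archive:
        return [info.filename for info in archive.infolist() if not info.is_dir()]


def can_leave_name_blank(archive_path):
    """Return True if the archive holds exactly one file.

    This mirrors the tip above: the property can be left blank only
    when the archive contains a single source file.
    """
    return len(dataset_names(archive_path)) == 1
```

If `dataset_names` returns more than one entry, each file needs its own instance of the module, with one of the listed names in Dataset to Unpack. This article does not say how the module treats entries stored inside subfolders, so this sketch simply reports the names as they are stored.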

Results

  • To verify that the data was imported correctly, right-click the Unpack Zipped Datasets module, and select Visualize.

  • To change the name of the dataset, right-click the Unpack Zipped Datasets module, and select Save as Dataset. At this point you can type a different name.

    This option is handy if you are unpacking multiple datasets from a single ZIP file.

To demonstrate how this module works, we created a sample .ZIP file containing four different CSV files. All files were saved from Excel.

File name          Description
names-uni.csv      Unicode file with column headings
names-utf8.csv     UTF-8 file with column headings
nonames-uni.csv    Unicode file with no column headings
nonames-utf8.csv   UTF-8 file with no column headings

The entire zipped file was uploaded, and then the Unpack Zipped Datasets module was run four times to extract each of the four files, using these settings:

  1. Dataset to unpack = names-uni.csv, File has header row = TRUE
  2. Dataset to unpack = names-utf8.csv, File has header row = TRUE
  3. Dataset to unpack = nonames-uni.csv, File has header row = FALSE
  4. Dataset to unpack = nonames-utf8.csv, File has header row = FALSE

The results were as expected:

File name          Upload result
names-uni.csv      Error 0049: Error while parsing the file. File is not Unicode (UTF-8) encoded
names-utf8.csv     Success. Uses original column names from source file.
nonames-uni.csv    Error 0049: Error while parsing the file. File is not Unicode (UTF-8) encoded
nonames-utf8.csv   Success. Column names Col1, col2, ...coln are automatically added to the dataset.

Note

If you use the option, File has header row = TRUE, and the source file actually does not have a column heading, the first row of data is used as the column heading.
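The effect of the File has header row option can be illustrated outside the module. The following is a minimal sketch using Python's standard `csv` module that imitates the behavior described above; it is not the module's implementation, and the function name `read_table` is hypothetical.

```python
import csv
import io


def read_table(text, has_header):
    """Split CSV text into (column_names, data_rows).

    With has_header=True the first row becomes the column names.
    With has_header=False, names Col1, Col2, ... are generated and
    every row is kept as data.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return [], []
    if has_header:
        return rows[0], rows[1:]
    names = [f"Col{i}" for i in range(1, len(rows[0]) + 1)]
    return names, rows


# A file without a header row, read with the option wrongly set to TRUE:
# the first data row is consumed as the column headings.
names, data = read_table("Ann,Oslo\nBob,Lima\n", has_header=True)
assert names == ["Ann", "Oslo"]
assert data == [["Bob", "Lima"]]
```

Reading the same text with `has_header=False` keeps both rows as data and labels the columns Col1 and Col2.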

You cannot use this module to unpack zipped R packages into your workspace. R packages must be uploaded and consumed as zipped files.

For more information about how to work with zipped R packages, see Execute R Script.

Note

Confused about the difference between UTF-8 and Unicode? See this Wikipedia article: What is UTF-8
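If a file fails with Error 0049, re-encoding it as UTF-8 before zipping is one way to fix it. The following is a minimal sketch that assumes the "Unicode" file is UTF-16, which is the usual meaning of that label in Windows tools; the article does not state this, and the function name `convert_to_utf8` is hypothetical. Pass a different `source_encoding` if your file uses another encoding.

```python
def convert_to_utf8(source_path, target_path, source_encoding="utf-16"):
    """Copy a text file, re-encoding it from `source_encoding` to UTF-8.

    newline="" preserves the original line endings unchanged.
    """
    with open(source_path, "r", encoding=source_encoding, newline="") as source, \
            open(target_path, "w", encoding="utf-8", newline="") as target:
        for line in source:
            target.write(line)


# The same character is stored as different bytes in the two encodings,
# which is why a UTF-16 file cannot be parsed as UTF-8.
assert "é".encode("utf-8") == b"\xc3\xa9"
assert "é".encode("utf-16-le") == b"\xe9\x00"
```

For example, `convert_to_utf8("names-uni.csv", "names-utf8.csv")` would write a UTF-8 copy that can then be added to the archive.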

Module parameters

Name                      Range                      Type               Default   Description
Compression file format   Zip, Gzip                  compression rule   Zip       Compression algorithm used to compress or expand the file.
Dataset to Unpack         Any                        String             none      Name of dataset to register with Azure ML Studio. If the name of a dataset is not specified, the name is obtained from the file name in the zipped file.
Dataset file format       CSV, TSV, ARFF, SVMLIGHT   File format        CSV       File format of the dataset in the zipped file
File has header row       TRUE/FALSE                 Boolean            False     Set to True only if the CSV/TSV file has a header row

Input

Name      Type   Description
Dataset   Zip    Zipped file containing datasets

Output

Name              Type         Description
Results dataset   Data Table   Output dataset

Data Input and Output
