Import from Web URL via HTTP
Published: June 1, 2016
Updated: September 20, 2017
This article describes how to use the Import Data module in Azure Machine Learning Studio, to read data from a public Web page for use in a machine learning experiment.
In general, the following restrictions apply to data published on a web page:
- Data must be in one of the supported formats: CSV, TSV, ARFF, or SvmLight.
- Typically, no authentication is required, but data must be publicly available.
Use the Data Import Wizard
The module features a new wizard to help you choose a storage option, select from among existing subscriptions and accounts, and quickly configure all options.
Add the Import Data module to your experiment. You can find the module under Data Input and Output.
Click Launch Import Data Wizard and follow the prompts.
When configuration is complete, to actually copy the data into your experiment, right-click the module, and select Run Selected.
If you need to edit an existing data connection, the wizard loads all previous configuration details so that you don't have to start again from scratch.
Manually set properties in the Import Data module
The following steps describe how to manually configure the import source.
Add the Import Data module to your experiment. You can find this module in the Data Input and Output group in the experiment items list in Azure Machine Learning Studio.
For Data source, select Web URL via HTTP.
For URL, type or paste the full URL of the page that contains the data you want to load. The URL should include the site URL and the full path, with file name and extension, to the page that contains the data to load.
For example, the following page contains the iris data set, in CSV format, from the University of California, Irvine machine learning repository:
http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
For Data format, select one of the supported data formats from the list.
If you are not sure of the format, be sure to review the page in advance. The supported data formats are CSV, TSV, ARFF, and SvmLight.
If the data is in CSV or TSV format, you can specify that it includes a header row. The header row is used to assign column names.
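Conceptually, the header-row option changes only how the first line of the file is parsed. As an illustration outside of Studio, the same distinction can be sketched with pandas (a stand-in for the module's own CSV parser; the payload below simulates the downloaded file rather than fetching it over HTTP):

```python
import io

import pandas as pd

# A small CSV payload standing in for data fetched from a web URL.
# (The iris file itself ships without a header row.)
payload = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"

# "CSV or TSV has header row" unchecked: every line is data,
# and columns get default positional names.
no_header = pd.read_csv(io.StringIO(payload), header=None)

# "CSV or TSV has header row" checked: the first line is consumed
# as column names, leaving one fewer data row.
with_header = pd.read_csv(io.StringIO(payload))

print(len(no_header), len(with_header))  # 2 1
```

If the setting doesn't match the file, you either lose your first data row or get a row of data promoted to column names, so check the page in a browser before importing.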
Select the Use cached results options if you don't expect the data to change much, or if you want to avoid reloading the data each time you run the experiment.
When this is selected, if there are no other changes to module parameters, the experiment will load the data the first time the module is run, and thereafter use a cached version of the dataset.
If you want to re-load the dataset on each run of the experiment, deselect the Use cached results option. Results are also re-loaded if there are any changes to the parameters of Import Data.
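The caching behavior can be sketched as a lookup keyed on the URL plus every module parameter: a first run downloads, and later runs with an identical key reuse the stored result. The function names and cache layout below are hypothetical illustrations, not the module's actual implementation:

```python
import hashlib

# Hypothetical sketch of "Use cached results": re-download only when
# the source URL or any import option changes.
_cache = {}

def fetch(url, data_format, has_header, downloader):
    # The cache key covers the URL and every module parameter, so any
    # parameter change invalidates the cached copy.
    key = hashlib.sha256(f"{url}|{data_format}|{has_header}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = downloader(url)   # first run: hit the source
    return _cache[key]                  # later runs: reuse the cached dataset

calls = []
def fake_download(url):
    calls.append(url)
    return "dataset-bytes"

fetch("http://example.com/iris.data", "CSV", False, fake_download)
fetch("http://example.com/iris.data", "CSV", False, fake_download)
print(len(calls))  # 1 -- the second call is served from the cache
```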
Run the experiment.
Results
When complete, click the output dataset and select Visualize to see if the data was imported successfully.
For examples of machine learning experiments that get data from public web sites, see these samples in the Cortana Intelligence Gallery:
The Letter Recognition sample gets a training dataset from the public machine learning repository hosted by UC Irvine.
The Download UCI Dataset sample reads a dataset in the CSV format.
This section contains advanced configuration options and answers to commonly asked questions.
How can I filter data as it is being read from the source?
The Import Data module does not support filtering as data is being read.
However, you can filter data after reading it into Azure Machine Learning Studio:
Use a custom R script to get only the data you want.
Use the Split Data module with a relative expression or a regular expression to isolate the data you want, and then save it as a dataset.
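Outside of Studio, post-import filtering of this kind (analogous to a relative expression in Split Data, or a few lines in an Execute R Script module) might look like the following Python sketch, using pandas and a simulated imported dataset:

```python
import io

import pandas as pd

# Simulated imported dataset; Import Data itself cannot filter during the read.
csv_text = (
    "sepal_length,species\n"
    "5.1,Iris-setosa\n"
    "6.3,Iris-virginica\n"
    "5.8,Iris-virginica\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# Filter after loading, keeping only the rows of interest.
subset = df[df["species"] == "Iris-virginica"]
print(len(subset))  # 2 matching rows
```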
Note: If you find that you have loaded more data than you need, you can overwrite the cached dataset by reading a new dataset and saving it with the same name as the older, larger dataset.
How can I avoid re-loading the same data unnecessarily?
If your source data changes, you can refresh the dataset and add new data by re-running Import Data. However, if you don't want to re-read from the source each time you run the experiment, select the Use cached results option. When this option is selected, the module checks whether the experiment has previously run using the same source and the same input options. If a previous run is found, the data in the cache is used instead of re-loading the data from the source.
Why does Import Data add an extra row at the end of my dataset when it finds a trailing new line?
If the Import Data module encounters a row of data that is followed by an empty line or a trailing new line character, an extra row is added at the end of the table. This new row contains missing values.
The reason for interpreting a trailing new line as a new row is that Import Data cannot determine the difference between an actual empty line and an empty line that is created by the user pressing ENTER at the end of a file.
Because some machine learning algorithms support missing data and will actually treat this line as a case (which in turn would affect the results), you should use Clean Missing Data to check for missing values (particularly rows that are completely empty), and remove them as needed.
Before you check for empty rows, you might also want to divide the dataset by using Split Data. This separates rows with partial missing values, which represent actual missing values in the source data, from the trailing all-empty row. Use the Select head N rows option to read the first part of the dataset into a separate container from the last line.
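The behavior described above can be reproduced with Python's standard csv reader: a blank trailing line parses as an empty record, and removing all-empty rows is the Clean Missing Data step in miniature. This is an illustrative sketch, not the module's own code:

```python
import csv
import io

# A file whose last data line is followed by a blank trailing line,
# which Import Data turns into an all-missing row.
payload = "5.1,3.5,Iris-setosa\n4.9,3.0,Iris-setosa\n\n"

rows = list(csv.reader(io.StringIO(payload)))
print(rows[-1])  # [] -- the trailing newline became an empty record

# The Clean Missing Data step, in miniature: drop rows that are entirely empty.
cleaned = [r for r in rows if any(field.strip() for field in r)]
print(len(cleaned))  # 2
```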
Why are some characters in my source file not displayed correctly?
Azure Machine Learning supports the UTF-8 encoding. If your source file used another type of encoding, the characters might not be imported correctly.
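For example, in Python, bytes written with a different encoding (Latin-1 here) do not survive a UTF-8 decode intact; the non-ASCII character is lost and shows up as a replacement symbol, which is the kind of corruption you would see in the imported dataset:

```python
# "café" encoded as Latin-1 bytes, then decoded as UTF-8
# (the encoding Studio assumes for imported text).
raw = "café".encode("latin-1")
text = raw.decode("utf-8", errors="replace")
print(text)  # the é is replaced by U+FFFD
```

Re-saving the source file as UTF-8 before publishing it is the straightforward fix.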
Why was the format of my data changed?
If the source data is in ARFF or SVMLight format, it will be converted to a tabular (dataset) format and therefore cannot be directly used for models that require ARFF or SVMLight formats.
For example, a case with a label and feature-value pairs would be converted to a series of columns containing values as follows:
0 1:-3.639 2:0.418 3:-0.67 4:1.779
| Col1 | Col2 | Col3 | Col4 | Labels |
|---|---|---|---|---|
| -3.639 | 0.418 | -0.67 | 1.779 | 0 |
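The conversion can be sketched by hand. The helper below is a hypothetical illustration of how a 1-based SVMLight record maps onto dense feature columns plus a label column, matching the example above:

```python
# Hand-rolled sketch of the SVMLight-to-tabular conversion described above.
def svmlight_to_row(line, n_features):
    parts = line.split()
    label = parts[0]
    values = [0.0] * n_features            # absent indices become zeros
    for pair in parts[1:]:
        idx, val = pair.split(":")
        values[int(idx) - 1] = float(val)  # SVMLight indices are 1-based
    return values + [label]

row = svmlight_to_row("0 1:-3.639 2:0.418 3:-0.67 4:1.779", 4)
print(row)  # [-3.639, 0.418, -0.67, 1.779, '0']
```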
Module parameters
| Name | Range | Type | Default | Description |
|---|---|---|---|---|
| Data source | List | Data Source Or Sink | Azure Blob Storage | Data source can be HTTP, FTP, anonymous HTTPS or FTPS, a file in Azure BLOB storage, an Azure table, an Azure SQL Database, an on-premises SQL Server database, a Hive table, or an OData endpoint. |
| URL | any | String | none | URL for HTTP |
| Data format | CSV TSV ARFF SvmLight | Data Format | CSV | File type of HTTP source |
| CSV or TSV has header row | TRUE/FALSE | Boolean | false | Indicates if CSV or TSV file has a header row |
| Use cached results | TRUE/FALSE | Boolean | FALSE | Module executes only if valid cache does not exist. Otherwise, cached data from previous execution is used. |
Outputs
| Name | Type | Description |
|---|---|---|
| Results dataset | Data Table | Dataset with downloaded data |
Exceptions
| Exception | Description |
|---|---|
| Error 0027 | An exception occurs when two objects have to be the same size, but they are not. |
| Error 0003 | An exception occurs if one or more inputs are null or empty. |
| Error 0029 | An exception occurs when an invalid URI is passed. |
| Error 0030 | An exception occurs when it is not possible to download a file. |
| Error 0002 | An exception occurs if one or more parameters could not be parsed or converted from the specified type to the type required by the target method. |
| Error 0048 | An exception occurs when it is not possible to open a file. |
| Error 0046 | An exception occurs when it is not possible to create a directory on specified path. |
| Error 0049 | An exception occurs when it is not possible to parse a file. |
Import Data
Export Data
Import from Hive Query
Import from Azure SQL Database
Import from Azure Table
Import from Azure Blob Storage
Import from Data Feed Providers
Import from On-Premises SQL Server Database