Import from Hive Query
Published: June 1, 2016
Updated: September 20, 2017
This article describes how to use the Import Data module in Azure Machine Learning Studio to get data from Hadoop clusters and HDInsight distributed storage.
Importing data from Hive is particularly useful for loading large datasets, or if you want to pre-process the data using a MapReduce job before loading the data into a machine learning experiment.
Use the wizard
The module features a new wizard to help you choose a storage option, select from among existing subscriptions and accounts, and quickly configure all options.
Add the Import Data module to your experiment. You can find the module under Data Input and Output.
Click Launch Import Data Wizard and follow the prompts.
When configuration is complete, to actually copy the data into your experiment, right-click the module, and select Run Selected.
If you need to edit an existing data connection, the wizard loads all previous configuration details so that you don't have to start again from scratch.
Manually set import properties
The following steps describe how to manually configure the import source.
Add the Import Data module to your experiment. You can find this module in the Data Input and Output group in the experiment items list in Azure Machine Learning Studio.
For Data source, select Hive Query.
In the Hive database query text box, specify the data you want to read by using HiveQL.
HiveQL is a SQL-like query language that can also be used to aggregate data and perform data filtering before you add the data to Machine Learning Studio.
However, the Hive query must return the data in a tabular format. For example:
```
SELECT <column list> FROM table WHERE <expression>;
```
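As a slightly fuller sketch, a query can filter and aggregate records before they reach the experiment; the table and column names here are hypothetical:

```
-- average temperature per station, restricted to one year of data
SELECT station_id, AVG(temperature) AS avg_temp
FROM sensor_readings
WHERE reading_date >= '2017-01-01'
GROUP BY station_id;
```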
Click the HCatalog server URI text box, and then type the fully qualified name of your cluster.
For example, if you created a cluster with the name mycluster001, use this format:
```
https://mycluster001.azurehdinsight.net
```
Click the Hadoop user account name text box, and paste in the Hadoop user account that you used when you provisioned the cluster.
Click the Hadoop user account password text box, and type the credentials that you used when you provisioned the cluster.
For more information about cluster naming and authentication for Hadoop, see Provision Hadoop clusters in HDInsight.
For Location of output data, select the option that indicates where the data is stored. If the data is in the Hadoop distributed file system (HDFS), it must be accessible via the same account and password that you just entered. If the data is in Azure, you need to provide the location and credentials of the storage account.
HDFS
Type or paste the HDFS server URI. Be sure to use the HDInsight cluster name without the HTTPS:// prefix.
Azure
For Azure storage account name, type the name of the Azure account. For example, if the full URL of the storage account is
http://myshared.blob.core.windows.net, you would type myshared.
For Azure storage key, copy and paste the key that is provided for accessing the storage account.
If you don’t know the access key, see the section “View, copy, and regenerate storage access keys” in the article About Azure Storage Accounts.
For Azure container name, specify the default container for the cluster.
If you create your cluster by accepting all defaults, a container with the same name as the cluster is created at the same time as the cluster. That container is the default container for the cluster.
If you choose the CUSTOM CREATE option when creating a cluster, you are given two options for selecting the default container. The first option is to select an existing container. When you do so, that container becomes the default storage container for the cluster. The second option is Create default container. When you use this option, the default container has the same name as the cluster.
Select the Use cached results option if you don't expect the data to change much, or if you want to avoid reloading the data each time you run the experiment.
When this is selected, if there are no other changes to module parameters, the experiment will load the data the first time the module is run, and thereafter use a cached version of the dataset.
If you want to re-load the dataset on each run of the experiment, deselect the Use cached results option. Results are also re-loaded if there are any changes to the parameters of Import Data.
Run the experiment.
Results
When complete, click the output dataset and select Visualize to see if the data was imported successfully.
If you get errors, check your data for missing values, additional empty columns, or incompatible data types.
For examples of how to configure an HDInsight cluster and use Hive queries in machine learning experiments, see these resources:
This article provides a detailed walkthrough of how to create a cluster, upload data, and call the data from Studio using Hive: Advanced Analytics Process and Technology in Action: Using HDInsight Hadoop clusters.
This blog post by MVP Vesa Tikkanen describes some issues and workarounds when reading very large files (distributed queries) from an HDInsight cluster on Linux: Reading Linux HDInsight Hive from Azure ML.
Although Hive is superior to Azure ML for many kinds of data clean-up and pre-processing, after import you might find these tools useful for preparing the data for modeling:
Use the Edit Metadata and other modules to change column names, specify which columns contain labels and features, and specify the column data type. For examples, see Dataset Processing.
Post-process text data using Python, to remove punctuation, flag parts of speech, and much more. For examples, see Text Classification.
Combine multiple tables from different sources into a single table of training data. For examples, see Predictive maintenance.
Technical notes
This section contains some advanced configuration options and answers to commonly asked questions.
How can I avoid out of memory problems when using Hive to pre-process data?
When using Hive queries to extract records from big data sources, sometimes the default configuration of the Hadoop cluster is too limited to support running the MapReduce job. For example, in these Release Notes for HDInsight, the default settings are defined as a four-node cluster.
If the requirements of the MapReduce job exceed available capacity, the Hive queries might return an Out of Memory error message, which causes the Import Data operation to fail. If this happens, you can change the default memory allocation for Hive queries in the Import Data module, as shown here:
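The memory values in this sketch are illustrative; choose values appropriate for your cluster, up to the maximum it allows:

```
-- raise the memory available to map and reduce tasks before the query runs
set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;
SELECT <column list> FROM table WHERE <expression>;
```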

In this example, the commands set mapreduce.map.memory.mb and set mapreduce.reduce.memory.mb are used to increase the amount of memory, up to the maximum allowed in the cluster.
How do I call Python scripts from a Hive query?
You can use the Import Data module to run Hive queries that call Python UDFs to process records.
For more information, see Use Python with Hive and Pig in HDInsight.
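A minimal sketch of the streaming pattern, assuming a hypothetical script clean_text.py uploaded to the cluster's default storage (the table and column names are also illustrative):

```
-- make the hypothetical script available on the cluster nodes
add file wasb:///scripts/clean_text.py;

-- stream each row through the script; it reads tab-delimited rows
-- from stdin and writes tab-delimited rows to stdout
SELECT TRANSFORM (id, raw_text)
  USING 'python clean_text.py'
  AS (id string, clean_text string)
FROM documents;
```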
How can I avoid re-loading the same data unnecessarily?
If your source data changes, you can refresh the dataset and add new data by re-running Import Data. However, if you don't want to re-read from the source each time you run the experiment, set the Use cached results option to TRUE. When this option is set to TRUE, the module checks whether the experiment has run previously using the same source and the same input options; if a previous run is found, the data in the cache is used instead of re-loading the data from the source.
How can I filter data as it is being read from the source?
The Import Data module does not support filtering as data is being read. To filter data before reading it into Azure Machine Learning Studio, use a Hive query or a MapReduce job to aggregate and transform the data.
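For instance, a WHERE clause in the Hive query keeps unwanted rows from ever reaching the experiment (the table and column names below are hypothetical):

```
-- return only recent, non-null readings instead of the full table
SELECT station_id, temperature, reading_date
FROM sensor_readings
WHERE temperature IS NOT NULL
  AND reading_date >= '2017-01-01';
```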
There are also multiple options for filtering data after it has been loaded into Azure Machine Learning Studio:
Use a custom R script to get only the data you want.
Use the Split Data module with a relative expression or a regular expression to isolate the data you want, and then save it as a dataset.
Note: If you find that you have loaded more data than you need, you can overwrite the cached dataset by reading a new dataset and saving it with the same name as the older, larger dataset.
Module parameters
| Name | Range | Type | Default | Description |
|---|---|---|---|---|
| Data source | List | Data Source Or Sink | Azure Blob Storage | Data source can be HTTP, FTP, anonymous HTTPS or FTPS, a file in Azure BLOB storage, an Azure table, an Azure SQL Database, an on-premises SQL Server database, a Hive table, or an OData endpoint. |
| Hive database query | any | StreamReader | | HQL query |
| HCatalog server URI | any | String | | Templeton endpoint |
| Hadoop user account name | any | String | | Hadoop HDFS/HDInsight username |
| Hadoop user account password | any | SecureString | | Hadoop HDFS/HDInsight password |
| Location of output data | any | DataLocation | HDFS | Specify HDFS or Azure for outputDir |
| HDFS server URI | any | String | | HDFS rest endpoint |
| Azure storage account name | any | String | | Azure storage account name |
| Azure storage key | any | SecureString | | Azure storage key |
| Azure container name | any | String | | Azure container name |
| Data content type | List (subset) | Url Contents | OData | Data format type |
| Source URL | any | String | | URL for Power Query data source |
| Use cached results | TRUE/FALSE | Boolean | FALSE | If TRUE, the module reuses cached results on subsequent runs instead of re-reading the data from the source |
Outputs
| Name | Type | Description |
|---|---|---|
| Results dataset | Data Table | Dataset with downloaded data |
Exceptions
| Exception | Description |
|---|---|
| Error 0027 | An exception occurs when two objects have to be the same size, but they are not. |
| Error 0003 | An exception occurs if one or more inputs are null or empty. |
| Error 0029 | An exception occurs when an invalid URI is passed. |
| Error 0030 | An exception occurs when it is not possible to download a file. |
| Error 0002 | An exception occurs if one or more parameters could not be parsed or converted from the specified type to the type required by the target method. |
| Error 0009 | An exception occurs if the Azure storage account name or the container name is specified incorrectly. |
| Error 0048 | An exception occurs when it is not possible to open a file. |
| Error 0015 | An exception occurs if the database connection has failed. |
| Error 0046 | An exception occurs when it is not possible to create a directory on the specified path. |
| Error 0049 | An exception occurs when it is not possible to parse a file. |
See also
Import Data
Export Data
Import from Web URL via HTTP
Import from Azure SQL Database
Import from Azure Table
Import from Azure Blob Storage
Import from Data Feed Providers
Import from On-Premises SQL Server Database