Import from Hive Query

 

Published: June 1, 2016

Updated: September 20, 2017

This article describes how to use the Import Data module in Azure Machine Learning Studio to get data from Hadoop clusters and HDInsight distributed storage.

Importing data from Hive is particularly useful for loading large datasets, or when you want to pre-process the data with a MapReduce job before loading it into a machine learning experiment.

Use the wizard

The module features a new wizard to help you choose a storage option, select from among existing subscriptions and accounts, and quickly configure all options.

  1. Add the Import Data module to your experiment. You can find the module under Data Input and Output.

  2. Click Launch Import Data Wizard and follow the prompts.

  3. When configuration is complete, to actually copy the data into your experiment, right-click the module, and select Run Selected.

If you need to edit an existing data connection, the wizard loads all previous configuration details so that you don't have to start again from scratch.

Manually set import properties

The following steps describe how to manually configure the import source.

  1. Add the Import Data module to your experiment. You can find this module in the Data Input and Output group in the experiment items list in Azure Machine Learning Studio.

  2. For Data source, select Hive Query.

  3. In the Hive database query text box, specify the data you want to read by using HiveQL.

    HiveQL is a SQL-like query language that can also be used to aggregate data and perform data filtering before you add the data to Machine Learning Studio.

    However, the Hive query must return the data in a tabular format. For example:

    SELECT <column list> FROM table WHERE <expression>;

    A fuller query that filters and aggregates data before import is sketched after this procedure.

  4. Click the HCatalog server URI text box, and then type the fully qualified name of your cluster.

    For example, if you created a cluster with the name mycluster001, use this format:

    https://mycluster001.azurehdinsight.net

  5. Click the Hadoop user account name text box, and paste in the Hadoop user account that you used when you provisioned the cluster.

  6. Click the Hadoop user account password text box, and type the credentials that you used when you provisioned the cluster.

    For more information about cluster naming and authentication for Hadoop, see Provision Hadoop clusters in HDInsight.

  7. For Location of output data, select the option that indicates where the data is stored. If the data is in the Hadoop distributed file system (HDFS), it must be accessible via the same account and password that you just entered. If the data is in Azure, you need to provide the location and credentials of the storage account.

    • HDFS

      Type or paste the HDFS server URI. Be sure to use the HDInsight cluster name without the HTTPS:// prefix.

    • Azure

      • For Azure storage account name, type the name of the Azure account. For example, if the full URL of the storage account is http://myshared.blob.core.windows.net, you would type myshared.

      • For Azure storage key, copy and paste the key that is provided for accessing the storage account.

        If you don’t know the access key, see the section, “View, copy and regenerate storage access keys” in this article: About Azure Storage Accounts.

      • For Azure container name, specify the default container for the cluster.

        If you created your cluster by accepting all defaults, a container with the same name as the cluster was created at the same time as the cluster. That container is the default container for the cluster.

        If you choose the CUSTOM CREATE option when creating a cluster, you are given two options for selecting the default container. The first option is to select an existing container. When you do so, that container becomes the default storage container for the cluster. The second option is Create default container. When you use this option, the default container has the same name as the cluster.

  8. Select the Use cached results option if you don't expect the data to change much, or if you want to avoid reloading the data each time you run the experiment.

    When this is selected, if there are no other changes to module parameters, the experiment will load the data the first time the module is run, and thereafter use a cached version of the dataset.

    If you want to re-load the dataset each time the experiment runs, deselect the Use cached results option. Results are also re-loaded if there are any changes to the parameters of Import Data.

  9. Run the experiment.
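The query you enter in step 3 must return flat rows and columns, and HiveQL can filter and aggregate the data before it reaches the experiment. The following sketch is only an illustration: it runs against the hivesampletable sample table that ships with HDInsight clusters, and the column names and the 'United Kingdom' filter are placeholders to replace with your own.

    SELECT
        devicemake,
        COUNT(*)            AS query_count,
        AVG(querydwelltime) AS avg_dwell_time
    FROM hivesampletable
    WHERE country = 'United Kingdom'
    GROUP BY devicemake;

Because the result is a simple table, it can be loaded directly as a dataset, and the aggregation keeps the amount of data transferred into the experiment small.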

Results

When complete, click the output dataset and select Visualize to see if the data was imported successfully.

If you get errors, check your data for missing values, additional empty columns, or incompatible data types.

For examples of how to configure an HDInsight cluster and use Hive queries in machine learning experiments, see these resources:

Although Hive is superior to Azure Machine Learning Studio for many kinds of data clean-up and pre-processing, after import you might find these tools useful for preparing the data for modeling:

  • Use the Edit Metadata and other modules to change column names, specify which columns contain labels and features, and specify the column data type. For examples, see Dataset Processing.

  • Post-process text data using Python, to remove punctuation, flag parts of speech, and much more. For examples, see Text Classification.

  • Combine multiple tables from different sources into a single table of training data. For examples, see Predictive maintenance.

Technical notes

This section contains some advanced configuration options and answers to commonly asked questions.

  • How can I avoid out of memory problems when using Hive to pre-process data?

    When using Hive queries to extract records from big data sources, sometimes the default configuration of the Hadoop cluster is too limited to support running the MapReduce job. For example, the release notes for HDInsight define the default configuration as a four-node cluster.

    If the requirements of the MapReduce job exceed the available capacity, the Hive queries might return an Out of Memory error message, which causes the Import Data operation to fail. If this happens, you can change the default memory allocation for Hive queries in the Import Data module by adding set statements at the beginning of the query, as in the sketch after these notes.

    The commands set mapreduce.map.memory.mb and set mapreduce.reduce.memory.mb increase the amount of memory, up to the maximum allowed in the cluster.

  • How do I call Python scripts from a Hive query?

    You can use the Import Data module to run Hive queries that call Python UDFs to process records. A sketch of a query that streams rows through a Python script appears after these notes.

    For more information, see Use Python with Hive and Pig in HDInsight.

  • How can I avoid re-loading the same data unnecessarily?

    If your source data changes, you can refresh the dataset and add new data by re-running Import Data. However, if you don't want to re-read from the source each time you run the experiment, set the Use cached results option to TRUE. When this option is set to TRUE, the module checks whether the experiment has previously run using the same source and the same input options. If a previous run is found, the data in the cache is used instead of re-loading the data from the source.

  • How can I filter data as it is being read from the source?

    The Import Data module does not support filtering as data is being read. To filter data before reading it into Azure Machine Learning Studio, use a Hive query or a MapReduce job to aggregate and transform the data.

    There are also multiple options for filtering data after it has been loaded into Azure Machine Learning Studio:

    • Use a custom R script to get only the data you want.

    • Use the Split Data module with a relative expression or a regular expression to isolate the data you want, and then save it as a dataset.

    Note

    If you find that you have loaded more data than you need, you can overwrite the cached dataset by reading a new dataset and saving it with the same name as the older, larger dataset.
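As a sketch of the memory settings mentioned above, the set statements go ahead of the query in the Hive database query text box. The 7168 MB values are purely illustrative; the maximum you can request depends on the VM sizes used by your cluster nodes.

    -- Illustrative values only; adjust to what your cluster allows.
    set mapreduce.map.memory.mb=7168;
    set mapreduce.reduce.memory.mb=7168;
    SELECT <column list> FROM table WHERE <expression>;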
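For the Python UDF question above, the usual pattern is Hive streaming: register the script with ADD FILE and pass rows through it with TRANSFORM ... USING. The script name, storage path, and columns below are hypothetical; see Use Python with Hive and Pig in HDInsight for the supported details.

    -- clean_text.py is a hypothetical script stored in the cluster's default container.
    ADD FILE wasb:///scripts/clean_text.py;
    SELECT TRANSFORM (clientid, devicemodel)
        USING 'python clean_text.py'
        AS (clientid string, devicemodel_cleaned string)
    FROM hivesampletable;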

Module parameters

Name | Range | Type | Default | Description
Data source | List | Data Source Or Sink | Azure Blob Storage | Data source can be HTTP, FTP, anonymous HTTPS or FTPS, a file in Azure BLOB storage, an Azure table, an Azure SQL Database, an on-premises SQL Server database, a Hive table, or an OData endpoint.
Hive database query | any | StreamReader | | HQL query
HCatalog server URI | any | String | | Templeton endpoint
Hadoop user account name | any | String | | Hadoop HDFS/HDInsight username
Hadoop user account password | any | SecureString | | Hadoop HDFS/HDInsight password
Location of output data | any | DataLocation | HDFS | Specify HDFS or Azure for outputDir
HDFS server URI | any | String | | HDFS rest endpoint
Azure storage account name | any | String | | Azure storage account name
Azure storage key | any | SecureString | | Azure storage key
Azure container name | any | String | | Azure container name
Data content type | List (subset) | Url Contents | OData | Data format type
Source URL | any | String | | URL for Power Query data source
Use cached results | TRUE/FALSE | Boolean | FALSE | Indicates whether results should be cached between runs so that data is not re-loaded from the source each time
Outputs

Name | Type | Description
Results dataset | Data Table | Dataset with downloaded data
Exceptions

Exception | Description
Error 0027 | An exception occurs when two objects have to be the same size, but they are not.
Error 0003 | An exception occurs if one or more of the inputs are null or empty.
Error 0029 | An exception occurs when an invalid URI is passed.
Error 0030 | An exception occurs when it is not possible to download a file.
Error 0002 | An exception occurs if one or more parameters could not be parsed or converted from the specified type to the type required by the target method.
Error 0009 | An exception occurs if the Azure storage account name or the container name is specified incorrectly.
Error 0048 | An exception occurs when it is not possible to open a file.
Error 0015 | An exception occurs if the database connection has failed.
Error 0046 | An exception occurs when it is not possible to create a directory on the specified path.
Error 0049 | An exception occurs when it is not possible to parse a file.

See also

Import Data
Export Data
Import from Web URL via HTTP
Import from Azure SQL Database
Import from Azure Table
Import from Azure Blob Storage
Import from Data Feed Providers
Import from On-Premises SQL Server Database
