March 2015

Volume 30 Number 3


Microsoft Azure - Enhance Data Exploration with Azure Search

By Bruno Terkaly

The fields of data exploration, real-time analytics and machine learning are being applied in many creative ways. Companies are building interesting architectures around a variety of open source software packages, and Azure Search is one such piece of a larger architecture. Azure Search brings a powerful new search experience to your Web sites and applications, even if you're not a search expert.

Azure Search is a fully managed, cloud-based service that uses a simple REST API. It includes type-ahead suggestions, suggested results based on near matches, multi-faceted navigation and the ability to adjust capacity based on need. Azure Search provides full-text search with weighting, ranking and your own search behaviors based on a schema defined by field-attribute combinations. Data is immediately indexed, which minimizes search delays.

The need for effective search is growing along with the massive size of data stores. On Facebook alone, users spend hundreds of billions of minutes searching every month. To efficiently search Wikipedia, you'd have to index 17 million entries. Twitter boasts more than 600 million users, generating more than 50,000 tweets per day. Performing full-text search at this scale requires some creative engineering. Indexing and curating all this information is not for the faint of heart.

Many companies for which search is an integral part of their business are starting to work with Azure G-Series Virtual Machines, which offer 32 cores, 448GB of RAM and 6.5TB of solid-state drive (SSD) storage. Some engineers write custom assembly and C code to optimize cache coherency for both data and instruction caches. They use the caches to reduce the time the CPU spends waiting for memory requests to be fulfilled, and to reduce the overall amount of data that must be transferred. One challenge within these huge, multi-core machines is that many threads compete for the memory bus; serving data from the L2 and L3 caches instead of main memory yields a performance boost of orders of magnitude. All this matters because getting lots of data quickly into your full-text engine is critical.

Full-Text Search Done Correctly

Azure Search offers many advantages. It reduces the complexity of setting up and managing your own search index. It’s a fully managed service that helps you avoid the hassle of dealing with index corruption, service availability, scaling and service updates. One of the big advantages is that Azure Search supports rich, fine-tuned ranking models. This lets you tie search results to business goals. It also offers multi-language search, spelling-mistake correction and type-ahead suggestions. If search results are weak, Azure Search can suggest queries based on near matches. You can follow a quick tutorial on provisioning a new Azure Search instance at bit.ly/1wYb8L8.
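As a sketch of how the type-ahead suggestions feature is reached over the REST API (the service name, index name and key below are hypothetical placeholders; the api-version matches the preview version used later in this article), the request options can be assembled like this:

```javascript
// Sketch: build the request options for a type-ahead (suggestions) call.
// The service name, index name and api-key are hypothetical placeholders.
function buildSuggestOptions(service, index, apiKey, term) {
  return {
    url: 'https://' + service + '.search.windows.net/indexes/' + index +
         '/docs/suggest?search=' + encodeURIComponent(term) +
         '&api-version=2014-07-31-Preview',
    method: 'GET',
    json: true,
    headers: { 'api-key': apiKey }
  };
}

// As the user types "mot", the partial term is sent to the suggest endpoint.
var opts = buildSuggestOptions('contoso', 'hotels', 'YOUR-API-KEY', 'mot');
console.log(opts.url);
```

An object shaped like this can be handed directly to an HTTP client such as the request package shown later in this article.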

The tutorial guides you through getting started and the steps you need to perform at the portal. You'll provision Azure Search in the Azure Management Portal (portal.azure.com), which gives you the two things you need: the URL and the API-KEY. The URL is the endpoint in the cloud running your Azure Search service, to which your client app will talk. The API-KEY must be protected carefully, because it grants full access to your service; no client should be able to reach your Azure Search service without being authenticated with this key.

There are limits and constraints of which you should be aware, such as the number of indexes, maximum fields per index, maximum document counts and so on. One important point is that there are no quotas or maximum limits on queries themselves. Queries-per-second (QPS) is variable, depending on available bandwidth and competition for system resources.

With the free Azure Search service, the Azure compute and storage resources backing your service are shared by multiple subscribers, so QPS for your solution will vary depending on how many workloads are running at the same time. For dedicated (Standard SKU) services, resources are dedicated to a single customer and not shared.

Once you have your URL and API-KEY, you’re ready to use the service. The easiest way to do that is Fiddler because it lets you compose your own HTTP requests. (Get a free copy of Fiddler at bit.ly/1jKA1UJ.) Azure Search uses simple HTTP, so it’s trivial to insert and query data with Fiddler (see Figure 1). Later in this article, you’ll see how to use Node.js to talk to Azure Search.

Figure 1 Execute an Insert into Azure Search Using Fiddler
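The raw request you compose in Fiddler follows the same pattern as Figure 1; a minimal insert might look like the following (the service name and api-key are placeholders):

```http
POST https://yourservice.search.windows.net/indexes/hotels/docs/index?api-version=2014-07-31-Preview HTTP/1.1
api-key: YOUR-API-KEY
Content-Type: application/json
Host: yourservice.search.windows.net

{ "value": [ { "@search.action": "upload", "hotelId": "1", "hotelName": "Fancy Stay" } ] }
```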

As you can see, there are four things with which to concern yourself:

  • HTTP Verb
  • URL
  • Request Header
  • Request Body

The HTTP verb (PUT, POST, GET or DELETE) maps to a different operation, such as schema definition or data insertion. For example, PUT maps to a schema definition and POST maps to data insertion. The URL is available from the Azure Portal and may change depending on your query parameters. The API-KEY is sent in the request header. The request body is always a JSON representation of a schema or of the data being inserted.

Azure Search can play a key role in a larger architecture. Figure 2 demonstrates an architecture that leverages Azure Search. Begin with the essential authentication layer, where you have some options. For example, you can use the Azure Active Directory Graph API using OAuth2 with Node.js. Speaking of Node.js, you can use it as a proxy to the URL endpoint of Azure Search, providing some structure and control to your service.
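A minimal sketch of that proxy idea, assuming a hypothetical service name and key: the Node.js layer accepts a client's search term, attaches the api-key server-side and builds the back-end request, so the key is never exposed to the browser.

```javascript
// Sketch of a Node.js proxy layer in front of Azure Search. The service
// endpoint and api-key are hypothetical; both stay on the server, so the
// key never reaches the browser.
var SEARCH_ENDPOINT = 'https://yourservice.search.windows.net';
var API_KEY = 'YOUR-API-KEY';

// Translate an incoming client query into back-end request options,
// suitable for passing to an HTTP client such as the request package.
function buildProxyOptions(indexName, searchTerm) {
  return {
    url: SEARCH_ENDPOINT + '/indexes/' + encodeURIComponent(indexName) +
         '/docs?search=' + encodeURIComponent(searchTerm) +
         '&api-version=2014-07-31-Preview',
    method: 'GET',
    json: true,
    headers: { 'api-key': API_KEY }
  };
}

console.log(buildProxyOptions('hotels', 'motel').url);
```

Keeping the key in one server-side module also gives you a single place to add authentication, throttling or logging in front of the search service.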

Figure 2 Azure Search in Larger Architectures

One of the great features of Azure Search is you can index and search almost any structured data—except photos, images and videos. Figure 2 illustrates some potential data sources for Azure Search. Because relational databases aren’t well suited to performing full-text searches, many startups are taking their relational data and exporting key data to Azure Search. This is also beneficial because it offloads the burden of full-text search in relational databases.
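A sketch of that export pattern, assuming a hypothetical row shape: each relational row is mapped to a JSON document with an @search.action, producing a batch ready to POST to the index.

```javascript
// Sketch: map relational rows (hypothetical column names) to an Azure
// Search upload batch. Each row becomes one document with an
// @search.action, matching the document format used later in Figure 3.
function rowsToUploadBatch(rows) {
  return {
    value: rows.map(function (row) {
      return {
        '@search.action': 'upload',
        hotelId: String(row.id),   // keys must be strings in the index
        hotelName: row.name,
        baseRate: row.rate
      };
    })
  };
}

var batch = rowsToUploadBatch([{ id: 7, name: 'Sea View', rate: 120.0 }]);
console.log(JSON.stringify(batch));
```

A periodic job that runs this mapping over recently changed rows is one simple way to keep the search index in sync with the relational store.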

This doesn’t mean Azure Search is the be-all and end-all. You still need to mine data using map/reduce technologies like Hadoop or HDInsight and apply machine learning algorithms such as clustering, which groups text documents into topically related sets.

Imagine analyzing a tweet and giving it a score for how likely it belongs in some category. For example, you may have a category called Rants for emotionally charged criticism and another called Raves for positive opinions. Linear classifier algorithms are often used for such insight. Azure Search isn’t capable of this, but imagine indexing documents in Azure Search based on the way you’ve categorized and analyzed the data.
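To make the idea concrete, here's a toy linear classifier sketch (this is not an Azure Search feature; the vocabulary and weights are invented for illustration, where a real system would learn them from labeled training data):

```javascript
// Toy linear classifier: score text against per-category keyword weights
// and pick the highest-scoring category. Weights are invented for
// illustration only.
var weights = {
  Rants: { terrible: 2, awful: 2, hate: 3, worst: 2 },
  Raves: { great: 2, love: 3, amazing: 2, best: 2 }
};

function classify(text) {
  var tokens = text.toLowerCase().split(/\W+/);
  var best = null, bestScore = -Infinity;
  Object.keys(weights).forEach(function (category) {
    // The score is a weighted sum over the tokens in the text.
    var score = tokens.reduce(function (sum, token) {
      return sum + (weights[category][token] || 0);
    }, 0);
    if (score > bestScore) { bestScore = score; best = category; }
  });
  return best;
}

console.log(classify('I love this, best stay ever'));  // → Raves
```

The category chosen here could then feed into how you tag and index the document in Azure Search.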

Node.js

Now I’ll turn my attention to writing a Node.js front end, which I’ll use as a proxy layer in front of Azure Search.

  • Step 1 requires you to complete the getting started tutorial noted earlier in this article so that Azure Search is provisioned at the portal. Recall that you’ll need the URL and the API-KEY.
  • Step 2 requires you to download and install the Node.js runtime locally on your development computer. You can find this at nodejs.org/download. My install ended up in the folder c:\program files\nodejs. It’s recommended you get a basic “hello world” running in Node.js before proceeding further.
  • Step 3 requires you to confirm the Node Package Manager (NPM) is properly installed and configured. NPM lets you install Node.js applications (JavaScript) available on the NPM registry. 
  • Step 4 involves installing the request package (npm install request), which simplifies writing the HTTP code that communicates with Azure Search.

Once you’ve completed these steps, you’re ready to return to the command prompt, navigate to whatever directory you like and start writing code. If you encounter some errors with NPM, you may need to validate some environment variables:

C:\node>set nodejs=C:\Program Files\nodejs\
C:\node>set node_path=C:\Program Files\nodejs\node_modules\*
C:\node>set npm=C:\Program Files\nodejs\

Build Out the Node.js Solution

Now you’re ready to develop some Node.js code to execute on your local system to illustrate communicating with Azure Search. Node.js makes it easy to insert and query data in Azure Search. Assume you have an Azure Search URL of terkaly.search.windows.net. You’d get this from the Azure Management Portal. You’ll also need your API-KEY, which for this example is B7D12B8CA3D018EC09C754F95CA552D2.

There’s more than one way to develop Node.js applications on your local computer. If you love the debugger in Visual Studio, then you’ll want to use the Node.js Tools for Visual Studio plug-in (nodejstools.codeplex.com). If you like the command line, check out Nodejs.org. Once you install Node.js, it’s important to integrate the NPM. This lets you install Node.js applications available on the NPM registry. The core package used here is called request.

The code in Figure 3 is straightforward. It does the same thing as described in the tutorial at bit.ly/1Ilh6vB, the only difference being that this code is implemented in Node.js using the request package. The code covers some of the more general use cases, such as creating an index, inserting data and, of course, performing queries. There are a number of callbacks here that define a schema, insert data and query data.

Figure 3 Node.js Code That Shows How to Create an Index, Insert Data and Query Data

var request = require('request');
//////////////////////////////////////////////////
// OPTIONS FOR HTTP PUT
// Purpose:    Used to create an index called hotels
//////////////////////////////////////////////////
var optionsPUT = {
  url: 'https://terkaly.search.windows.net/indexes/hotels?api-version=2014-07-31-Preview',
  method: 'PUT',
  json: true,
  headers: {
    'api-key': 'B7D12B8CA3D018EC09C754F95CA552D2',
    'Content-Type': 'application/json'
  },
  body: {
    "name": "hotels",
    "fields": [
      { "name": "hotelId", "type": "Edm.String", "key": true, "searchable": false },
      { "name": "baseRate", "type": "Edm.Double" },
      { "name": "description", "type": "Edm.String", "filterable": false, 
        "sortable": false,
        "facetable": false, "suggestions": true },
      { "name": "hotelName", "type": "Edm.String", "suggestions": true },
      { "name": "category", "type": "Edm.String" },
      { "name": "tags", "type": "Collection(Edm.String)" },
      { "name": "parkingIncluded", "type": "Edm.Boolean" },
      { "name": "smokingAllowed", "type": "Edm.Boolean" },
      { "name": "lastRenovationDate", "type": "Edm.DateTimeOffset" },
      { "name": "rating", "type": "Edm.Int32" },
      { "name": "location", "type": "Edm.GeographyPoint" }
    ]
  }
};
//////////////////////////////////////////////////
// OPTIONS FOR HTTP POST
// Purpose: Used to insert data  
//////////////////////////////////////////////////
var optionsPOST = {
  url: 'https://terkaly.search.windows.net/indexes/hotels/docs/index' +
    '?api-version=2014-07-31-Preview',
  method: 'POST',
  json: true,
  headers: {
    'api-key': 'B7D12B8CA3D018EC09C754F95CA552D2',
    'Content-Type': 'application/json'
  },
  body: {
    "value": [
    {
      "@search.action": "upload",
      "hotelId": "1",
      "baseRate": 199.0,
      "description": "Best hotel in town",
      "hotelName": "Fancy Stay",
      "category": "Luxury",
      "tags": ["pool", "view", "wifi", "concierge"],
      "parkingIncluded": false,
      "smokingAllowed": false,
      "lastRenovationDate": "2010-06-27T00:00:00Z",
      "rating": 5,
      "location": { "type": "Point", "coordinates": [-122.131577, 47.678581] }
    },
    {
      "@search.action": "upload",
      "hotelId": "2",
      "baseRate": 79.99,
      "description": "Cheapest hotel in town",
      "hotelName": "Roach Motel",
      "category": "Budget",
      "tags": ["motel", "budget"],
      "parkingIncluded": true,
      "smokingAllowed": true,
      "lastRenovationDate": "1982-04-28T00:00:00Z",
      "rating": 1,
      "location": { "type": "Point", "coordinates": [-122.131577, 49.678581] }
    },
    {
      "@search.action": "upload",
      "hotelId": "3",
      "baseRate": 279.99,
      "description": "Surprisingly expensive",
      "hotelName": "Dew Drop Inn",
      "category": "Bed and Breakfast",
      "tags": ["charming", "quaint"],
      "parkingIncluded": true,
      "smokingAllowed": false,
      "lastRenovationDate": null,
      "rating": 4,
      "location": { "type": "Point", "coordinates": [-122.33207, 47.60621] }
    },
    {
      "@search.action": "upload",
      "hotelId": "4",
      "baseRate": 220.00,
      "description": "This could be the one",
      "hotelName": "A Hotel for Everyone",
      "category": "Basic hotel",
      "tags": ["pool", "wifi"],
      "parkingIncluded": true,
      "smokingAllowed": false,
      "lastRenovationDate": null,
      "rating": 4,
      "location": { "type": "Point", "coordinates": [-122.12151, 47.67399] }
    }
    ]
  }
};
//////////////////////////////////////////////////
// OPTIONS FOR HTTP GET
// Purpose:    Used to perform a query
//////////////////////////////////////////////////
var optionsGET = {
  url: 'https://terkaly.search.windows.net/indexes/hotels/docs' +
    '?search=motel&facet=category&facet=rating,values:1|2|3|4|5' +
    '&api-version=2014-07-31-Preview',
  method: 'GET',
  json: true,
  headers: {
    'api-key': 'B7D12B8CA3D018EC09C754F95CA552D2',
    'Content-Type': 'application/json'
  },
  body: {
  }
};
request(optionsPUT, callbackPUT);
//////////////////////////////////////////////////
// Purpose:    Used to create an index
// Http Verb:  PUT
// End Result: Defines an index using the fields
// that make up the index definition.
//////////////////////////////////////////////////
function callbackPUT(error, response, body) {
  if (!error) {
    try {
      // 201 Created for a new index; 204 No Content when updating an existing one
      if (response.statusCode === 201 || response.statusCode === 204) {
          console.log('***success in callbackPUT***');
          request(optionsPOST, callbackPOST);
      }
    } catch (error2) {
      console.log('***Error encountered***');
      console.log(error2);
    }
  } else {
    console.log('error');
    console.log(error);
  }
}
//////////////////////////////////////////////////
// Purpose:    Used to insert data
// End Result: Inserts a document
//////////////////////////////////////////////////
function callbackPOST(error, response, body) {
  if (!error) {
    try {
      var result = response.statusCode;
      if (result === 200) {
          console.log('***success in callbackPOST***');
          console.log("The statusCode = " + result);
        // Perform a query
        request(optionsGET, callbackGET);
      }
    } catch (error2) {
      console.log('***Error encountered***');
      console.log(error2);
    }
  } else {
    console.log('error');
    console.log(error);
  }
}
//////////////////////////////////////////////////////////////
// Purpose:    Used to retrieve information
// Http Verb:  GET
// End Result: Query searches on the term "motel" and retrieves
// facet categories for ratings.
//////////////////////////////////////////////////////////////
function callbackGET(error, response, body) {
  if (!error) {
    try {
      var result = response.statusCode;
      if (result === 200) {
          result = body.value[0];
          console.log('description = ' + result.description);
          console.log('hotel name = ' + result.hotelName);
          console.log('hotel rate = ' + result.baseRate);
      }
      console.log('***success***');
    } catch (error2) {
      console.log('***Error encountered***');
      console.log(error2);
    }
  } else {
    console.log('error');
    console.log(error);
  }
}

The callback chain is straightforward, as well. It starts with a PUT, then moves to a POST and finally a GET (with the query). It demonstrates the core operations you’d use with Azure Search. First, create a schema for the documents you’ll add later, using the PUT HTTP verb to define it. Next, use a POST to insert data. Finally, use a GET to query the data.

Wrapping Up

The goal of this article was to expose some of the exciting things happening in the world of startups, specifically in the area of social networking analytics. You can use Azure Search as one piece of a larger solution wherever you need a sophisticated, powerful search experience integrated with your Web sites and applications. It lets you use fine-tuned ranking models to tie search results to business goals, and it delivers reliable throughput and storage.


Bruno Terkaly is a principal software engineer at Microsoft with the objective of enabling development of industry-leading applications and services across devices. He’s responsible for driving the top cloud and mobile opportunities across the United States and beyond from a technology-enablement perspective. He helps partners bring their applications to market by providing architectural guidance and deep technical engagement during the ISV’s evaluation, development and deployment. He also works closely with the cloud and mobile engineering groups, providing feedback and influencing the roadmap.

Thanks to the following Microsoft technical experts for reviewing this article: Liam Cavanagh, Simon Gurevich, Govind Kanshi, Raj Krishnan, Venugopal Latchupatula, Eugene Shvets