April 2015

Volume 30 Number 4


Azure Insider - Event Hubs for Analytics and Visualization

By Bruno Terkaly

There are countless examples of companies that are building solutions based on analytics and visualization. There are many scenarios requiring large-scale data ingestion and analytics. Nobody can deny the huge growth in the social networking space, where tweets from Twitter, posts from Facebook, and Web content from blogs are analyzed in real time and used to gauge a company’s brand awareness. This is the first part of a multi-part series on real-time analytics and visualization, focusing on hyper-scale data flows.

This level of data analysis is big business for companies such as Networked Insights, which helps brands make faster, smarter and more audience-centric decisions. Its marketing solution analyzes and organizes real-time consumer data from the social Web to produce strategic, actionable insights that inform better audience segmentation, content strategy, media investment and brand health. What differentiates its approach is the company’s ability to simplify the use of social media data by proactively classifying it across 15,000 consumer-interest dimensions in real time, leveraging computational linguistics, machine learning and other techniques. Those dimensions include 46 emotional classifiers that go beyond basic positive and negative sentiment into the nuances of hate, love, desire and fear, so marketers understand not only how consumers feel about a product, but what to do next.

For this level of analysis, the first thing you need is large-scale data ingestion. Using the social media example again, consider that Twitter has hundreds of millions of active users and hundreds of millions of tweets per day. Facebook has an active user base of 890 million. Building the capability to ingest this amount of data in real time is a daunting task.

Once the data is ingested, there’s still more work to do. You need to have the data parsed and placed in a permanent data store. In many scenarios, the incoming data flows through a Complex Event Processing (CEP) system, such as Azure Stream Analytics or Storm. That type of system does stream analytics on the incoming data by running standing queries. This might generate alerts or transform the data to a different format. You may then opt to have the data (or part of it) persisted to a permanent store or have it discarded.

That permanent data store is often a JSON-based document database or even a relational database like SQL Server. Prior to being placed into permanent storage, data is often aggregated, coalesced or tagged with additional attributes. In more sophisticated scenarios, the data might be processed by machine learning algorithms to facilitate making predictions.

Ultimately, the data must be visualized so users can get a deeper understanding of context and meaning. Visualization often takes place on a Web-based dashboard or in a mobile application. This typically introduces the need for a middle tier or Web service, which can expose data from permanent storage and make it available to these dashboards and mobile applications.

This article, and subsequent installments, will work through a more straightforward example. Imagine you have thousands of devices in the field measuring rainfall in cities across the United States. The goal is to visualize rainfall data from a mobile device. Figure 1 demonstrates some of the technologies that might bring this solution to life.

Overall Architecture of the Rainfall Analysis System
Figure 1 Overall Architecture of the Rainfall Analysis System

There are several components to this architecture. First is the event producer, which could be Raspberry Pi devices with attached rainfall sensors. These devices can use a lightweight protocol such as Advanced Message Queuing Protocol (AMQP) to send high volumes of rainfall data into Microsoft Azure Event Hubs. Once Azure has ingested that data, the next tier in the architecture involves reading the events from Azure Event Hubs and persisting them to a permanent data store, such as SQL Server or DocumentDB (or some other JSON-based store). It might also mean aggregating events or messages. Another layer in the architecture typically involves a Web tier, which makes permanently stored data available to Web-based clients, such as a mobile device. The mobile application or even a Web dashboard then provides the data visualization. This article will focus on the first component.

Event Producer

Because I’m sure most of you typically don’t own thousands of Internet of Things (IoT) devices, I’ve made some simplifications to the example relating to the event producer in Figure 1. In this article, I’ll simulate the Raspberry Pi devices with Linux-based virtual machines (VMs) running in Azure.

I need a C program running on a Linux machine, specifically Ubuntu 14.04. I also need to use AMQP, which was developed in the financial industry to overcome some of the inefficiencies found in HTTP. You can still use HTTP as your transport protocol, but AMQP is much more efficient in terms of latency and throughput. It’s a protocol built on TCP and it performs extremely well on Linux.

The whole point of simulating the Raspberry Pi on a Linux VM is that you can copy most, if not all, of the code to the Raspberry Pi, which supports many Linux distributions. So instead of reading data from an attached rainfall sensor, the event producer will read from a text file that contains 12 months of rainfall data for a few hundred cities in the United States.

The core technology for high-scale data ingestion in Azure is called Event Hubs. Azure Event Hubs provide hyper-scalable stream ingestion, which lets tens of thousands of Raspberry Pi devices (event producers) send continuous streams of data without interruption. You scale Azure Event Hubs by defining throughput units (TUs), where each throughput unit can handle write operations of 1,000 events per second or 1MB per second (2MB/s for read operations). Azure Event Hubs can handle up to 1 million producers exceeding 1GB/s aggregate throughput.
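As a rough sizing example, if 10,000 simulated devices each sent one 1KB reading per second, that would be 10,000 events and about 10MB per second, so roughly 10 throughput units would cover the ingest side under both the event-count and bandwidth limits.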

Sending Events

Sending events to Azure Event Hubs is simple. An entity that sends an event, as you can see in Figure 1, is called an event publisher. You can use HTTPS or AMQP for transfer. I’ll use Shared Access Signatures (SAS) to provide authentication. An SAS is a time-stamped, unique identity that grants rights, such as Send, to event publishers.
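In practice, the SAS credentials travel inside the AMQP address itself. The address bundles the key name, the URL-encoded key, the Service Bus namespace and the Event Hub name into a single URI; the placeholder names below are mine, not official syntax:

amqps://{keyName}:{urlEncodedKey}@{namespace}.servicebus.windows.net/{eventHubName}

You’ll see a concrete instance of this pattern in the address string in Figure 3.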

To send the data in C#, you create an instance of EventData and send it via the Send method; for higher throughput, you can use the SendBatch method. In C, the AMQP library provides a similar capability through methods such as pn_messenger_put and pn_messenger_send, as you’ll see in Figure 3.

Partition Keys

One of the core concepts in Azure Event Hubs is partitions. Partition keys are values used to map incoming event data to specific partitions. A partition is simply an ordered sequence of events held in an Event Hub; as new events are sent, they’re added to the end of that sequence. You can think of each partition as something like a commit log in the relational database world; it works in a similar fashion.

By default, each Azure Event Hub contains eight partitions, and you can configure up to 32 when you create the hub. You can go beyond 32 partitions, but that requires a few extra steps. You’d only need to do so based on the degree of downstream parallelism required for consuming applications. When publishing events, you can target specific partitions, but that doesn’t scale well and introduces coupling into the architecture.

A better approach is to let Event Hubs map events to partitions for you. If you specify a PartitionKey, an internal hashing function assigns it to a partition, and the same PartitionKey always maps to the same partition. When you specify no PartitionKey at all, events are distributed across partitions in round-robin fashion.
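If you do want to supply a PartitionKey from a C event producer, AMQP carries it as a message annotation. Here’s a minimal sketch against the Proton Messenger API used in Figure 3; the annotation name x-opt-partition-key is the one Event Hubs recognizes, but treat the details as assumptions to verify against the current documentation:

#include <string.h>
#include "proton/codec.h"
#include "proton/message.h"

// Sketch: attach an Event Hubs partition key to a Proton message
// via the AMQP message-annotations section (assumes the Proton 0.8 API)
static void set_partition_key(pn_message_t *message, const char *key)
{
  pn_data_t *annotations = pn_message_annotations(message);
  pn_data_put_map(annotations);
  pn_data_enter(annotations);
  pn_data_put_symbol(annotations,
    pn_bytes(strlen("x-opt-partition-key"), (char *) "x-opt-partition-key"));
  pn_data_put_string(annotations, pn_bytes(strlen(key), (char *) key));
  pn_data_exit(annotations);
}

Calling set_partition_key(message, fields[0]) from sendMessage in Figure 3, for example, would route all of a given city’s readings to the same partition.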

Get Started

A great place to start is to go through the tutorial at bit.ly/1F2gp9H. You can choose your programming language: C, Java or C#. The code for this article is based on C. Remember, I’m simulating the Raspberry Pi devices by reading from a text file that contains rainfall data. That said, it should be easy to port all this code to a Raspberry Pi device.

Your first concrete step is to provision Azure Event Hubs in the portal. Click App Services | Service Bus | Event Hub | Quick Create. Part of the provisioning process involves obtaining a shared access signature, which is the security mechanism your C program uses to gain permission to write events into Azure Event Hubs.

Once you provision Azure Event Hubs, you’ll also get a URL you’ll use to point to your specific instance of Azure Event Hubs inside the Azure datacenter. Because the SAS key contains special characters, you’ll need to URL-encode it, as directed at bit.ly/1z82c9j.
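If you’d rather encode the key in code than with an online tool, a simple percent-encoder is enough. The following helper is my own sketch, not part of any library; it encodes everything except the RFC 3986 unreserved characters:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

// Percent-encode src into dst; dst must hold at least 3 * strlen(src) + 1 bytes
void url_encode(const char *src, char *dst)
{
  for (; *src; ++src)
  {
    if (isalnum((unsigned char) *src) || strchr("-._~", *src))
      *dst++ = *src;
    else
      dst += sprintf(dst, "%%%02X", (unsigned char) *src);
  }
  *dst = '\0';
}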

Provision an Azure VM and Install AMQP

Because this exercise simulates the IoT scenario by provisioning a Linux-based VM in Azure, it makes sense to do most of your work on a full-featured VM in the cloud; the development and testing environment on a Raspberry Pi is limited. Once you get your code up and running in the VM, you can simply repeat this process and copy the binaries to the Raspberry Pi device. You can find additional guidance on provisioning the Linux-based VM on Azure at bit.ly/1o6mrST. The code base I used is based on an Ubuntu Linux image.

Now that you’ve provisioned Azure Event Hubs and a VM hosting Linux, you’re ready to install the AMQP binaries. As I mentioned earlier, I’m using the AMQP Messenger Library (Apache Qpid Proton), a high-performance, lightweight messaging library that supports a wide range of messaging applications, such as brokers, client libraries, routers, bridges and proxies. To install it, you’ll need to remote into the Ubuntu Linux machine and install the library from bit.ly/1BudbhA.
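The library ships as source, so you’ll build it on the VM. The exact steps depend on the version you download, but a typical CMake-based build of qpid-proton-0.8 on Ubuntu looks roughly like this (the package list is my assumption; check the README that ships with the source):

sudo apt-get install build-essential cmake uuid-dev libssl-dev
cd /home/azureuser/dev/qpid-proton-0.8
mkdir build
cd build
cmake ..
sudo make install

The libssl-dev package matters because the sender connects over amqps, the TLS-secured flavor of AMQP.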

Because I do most of my work from a Windows machine, I use PuTTY to remote into the Ubuntu image running in Azure. You can download PuTTY at putty.org. To use PuTTY, you’ll need to get your VM’s URL from the Azure Management Portal. Developers using OS X can simply use SSH to remote in from the Mac terminal.

You might find it easier to install AMQP on CentOS instead of Ubuntu, as directed at bit.ly/1F2k47z. Running Linux-based workloads in Azure is a popular approach, so if you’re accustomed to Windows development, it’s worth getting familiar with programming in the Linux world. Notice in Figure 2 that I’m pointing to my Ubuntu VM (vmeventsender.cloudapp.net).

Use PuTTY to Remote in to the Ubuntu Linux VM
Figure 2 Use PuTTY to Remote in to the Ubuntu Linux VM

Send.c

Once you’ve installed AMQP on your Ubuntu VM, you can take advantage of the example provided by the AMQP installation process. On my particular deployment, here’s where you’ll find the send.c file: /home/azureuser/dev/qpid-proton-0.8/examples/messenger/c/send.c. (The code download for this article includes my edited version of send.c.)

You’ll essentially replace the default example installed by AMQP with my revised version. You should be able to run it as is, except for the URL pointing to Azure Event Hubs and the encoded shared access signature. The modifications I made to send.c include reading the rainfall data from weatherdata.csv, which is also included in the code download for this article. The code in Figure 3 is fairly straightforward: the main entry method starts by opening the text file, reading one line at a time and parsing the rainfall data into 12 separate pieces, one for each month.

Figure 3 Source Code in C to Send Message to Azure Event Hubs

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
#include "proton/message.h"
#include "proton/messenger.h"

// Globals the example relies on; the array sizes here are assumptions --
// each CSV line holds a city name plus 12 monthly rainfall values
char fields[13][64];
char msgbuffer[256];

int sendMessage(pn_messenger_t * messenger, int month, char *f1, char *f2);

int main(int argc, char** argv)
{
  FILE * fp;
  char line[512];
  int i = 0;
  int curr_field = 0;
  int trg_col = 0;
  pn_messenger_t *messenger = pn_messenger(NULL);
  printf("Press Ctrl-C to stop the sender process\n");
  pn_messenger_set_outgoing_window(messenger, 1);
  pn_messenger_start(messenger);
  fp = fopen("weatherdata.csv", "r");
  if (fp == NULL)
    exit(EXIT_FAILURE);
  while (fgets(line, 512, fp) != NULL)
  {
    // Strip the trailing newline so it doesn't end up in the last field
    line[strcspn(line, "\r\n")] = '\0';
    // Split the comma-separated line into fields
    for (i = 0; line[i] != '\0'; i++)
    {
      if (line[i] == ',')
      {
        fields[curr_field][trg_col] = '\0';
        trg_col = 0;
        curr_field += 1;
      }
      else
      {
        fields[curr_field][trg_col] = line[i];
        trg_col += 1;
      }
    }
    fields[curr_field][trg_col] = '\0';  // Terminate the final field
    trg_col = 0;
    curr_field = 0;
    // Field 0 is the city; fields 1-12 are the monthly rainfall values
    for (i = 1; i < 13; i++)
    {
      sendMessage(messenger, i, fields[0], fields[i]);
      printf("%s -> %s\n", fields[0], fields[i]);
    }
    printf("\n");
  }
  fclose(fp);
  // Release messenger resources
  pn_messenger_stop(messenger);
  pn_messenger_free(messenger);
  return 0;
}

int sendMessage(pn_messenger_t * messenger, int month, char *f1, char *f2)
{
  // The address bundles the SAS key name, the URL-encoded key, the
  // namespace and the Event Hub name ([secret key] is elided here)
  char * address = (char *) "amqps://SendRule:"
    "[secret key]@temperatureeventhub-ns.servicebus.windows.net/temperatureeventhub";
  pn_message_t * message;
  pn_data_t * body;
  sprintf(msgbuffer, "%s,%d,%s", f1, month, f2);
  message = pn_message();
  pn_message_set_address(message, address);
  pn_message_set_content_type(message, (char*) "application/octet-stream");
  pn_message_set_inferred(message, true);
  body = pn_message_body(message);
  pn_data_put_binary(body, pn_bytes(strlen(msgbuffer), msgbuffer));
  pn_messenger_put(messenger, message);
  check(messenger);  // check() is the error helper from the stock Proton send.c example
  pn_messenger_send(messenger, 1);
  check(messenger);
  pn_message_free(message);
  return 0;
}

Figure 4 gives you a partial view of the text file with the rainfall data. The C code will connect to the Azure Event Hubs instance you provisioned earlier, using the URL and the encoded shared access signature. Once connected, the rainfall data will load into the message data structure and be sent to Azure Event Hubs using pn_messenger_send. You can get a complete description of these methods at bit.ly/1DzYuud.

Partial View of Text File with Rainfall Data
Figure 4 Partial View of Text File with Rainfall Data

There are only two steps remaining at this point. The first is to compile the code, which simply involves changing to the examples build directory and issuing the make install command. The final step is to run the application you just created (see Figure 5):

// Part 1 – Compiling the code
cd /home/azureuser/dev/qpid-proton-0.8/build/examples/messenger/c
make install
// Part 2 – Running the code
cd /home/azureuser/dev/qpid-proton-0.8/build/examples/messenger/c
./send

Output from the Running Code
Figure 5 Output from the Running Code

Wrapping Up

In the next installment, I’ll show what it takes to consume the events from Azure Event Hubs with Azure Stream Analytics, and store the data in a SQL database and a JSON-based document store, Azure DocumentDB. Subsequent articles will delve into exposing this data to mobile applications. In the final article of this series, I’ll build a mobile application that provides a visualization of the rainfall data.


Bruno Terkaly is a principal software engineer at Microsoft with the objective of enabling development of industry-leading applications and services across devices. He’s responsible for driving the top cloud and mobile opportunities across the United States and beyond from a technology-enablement perspective. He helps partners bring their applications to market by providing architectural guidance and deep technical engagement during the ISV’s evaluation, development and deployment. Terkaly also works closely with the cloud and mobile engineering groups, providing feedback and influencing the roadmap.

Thanks to the following Microsoft technical experts for reviewing this article: James Birdsall, Pradeep Chellappan, Juan Perez, Dan Rosanova