
Big Compute Solutions in Azure: Overview

Updated: June 20, 2014

This topic is an overview of Big Compute solutions in Azure for the technical decision-maker or IT manager.

Azure offers organizations scalable, on-demand compute capabilities and services, enabling them to solve compute-intensive problems and make decisions with the resources, scale, and schedule they require. Organizations can leverage existing investments in on-premises compute clusters by creating hybrid clusters in Azure, or build cloud-only cluster solutions by using Azure platform and infrastructure services. Azure also offers an environment to develop special-purpose Big Compute solutions, and organizations can use vendor-created solutions or build their own.

For related information about the benefits of Big Compute solutions in Azure, see Big Compute on the Azure web site.


Big Compute refers to a wide range of tools and approaches to run large-scale applications for business, science, and engineering by using a large amount of CPU and memory resources in a coordinated way. For example, typical Big Compute applications perform complex modeling, simulation, and analysis, sometimes for a period of many hours or days, and they might run on a cluster of on-premises computers, on compute resources in the cloud, or on a combination of the two.

The essence of Big Compute is distributing application logic to run on many computers or virtual machines at the same time, in parallel, to solve problems faster than a single computer can. A Big Compute solution provides the necessary compute resources, infrastructure, management and scheduling tools, and workflow to run Big Compute applications efficiently.

A common question about Big Compute is how it differs from Big Data. The dividing lines are not always clear, and some applications may have characteristics of both. Both involve running large-scale computations, usually on clusters of computers that run on-premises, in the cloud, or both.

Big Compute typically involves applications that rely on CPU power and memory, such as engineering simulations, financial risk modeling, and digital rendering. The clusters that power a Big Compute solution might include computers with specialized multicore processors to perform raw computation, and specialized, high-speed networking hardware to connect the computers.

In contrast, Big Data solves data analysis problems that involve an amount of data that cannot be managed by a single computer or database management system, such as large volumes of web logs or other business intelligence data. Big Data tends to rely more on disk capacity and I/O performance than on CPU power, and a Big Data solution often includes specialized tools such as Apache Hadoop to manage the cluster and partition the data. (For information about Azure HDInsight, which is a Big Data solution powered by Apache Hadoop, see Big Data.)

Unlike web applications and many line-of-business applications, Big Compute applications have a defined beginning and end, and they can run on a schedule or on demand. Big Compute applications include traditional high performance computing (HPC) applications such as fluid dynamics simulations as well as a wide range of commercially available and custom applications in fields including financial services, engineering design, oil and gas exploration, life sciences, and digital content creation. Other applications include media transcoding, image analysis, file processing, and software testing.

Most Big Compute applications are batch or large-scale processes that fall into two main categories: intrinsically parallel (sometimes called “embarrassingly parallel”, because the problems they solve lend themselves to running in parallel on multiple computers) and tightly coupled. The following table summarizes the characteristics of these application types.

Note
When it runs in a Big Compute solution, an instance of an application is typically called a job, and each job may be divided into one or more tasks. The clustered compute resources for the application are usually called compute nodes.

 

Application type: Intrinsically parallel

Characteristics:

  • Individual compute resources run application logic independently, on different sets of input data

  • Additional compute resources enable the application to scale readily and decrease computation time

  • Application consists of separate executable files, or is divided into a group of services invoked by a client (a service-oriented architecture, or SOA, application)

Examples:

  • Financial risk modeling

  • Image rendering and image processing

  • Media encoding and transcoding

  • Monte Carlo simulations

  • Software testing

Application type: Tightly coupled

Characteristics:

  • The application requires compute resources to interact or to exchange intermediate results

  • Compute nodes typically communicate by using the Message Passing Interface (MPI), a common communications protocol for parallel computing

  • The application may be sensitive to network latency and bandwidth

  • Performance can be improved on computers that support high-speed networking technologies such as InfiniBand and Windows remote direct memory access (RDMA)

Examples:

  • Many legacy HPC applications

  • Oil and gas reservoir modeling

  • Engineering design and analysis, such as computational fluid dynamics

  • Physical simulations such as car crashes and nuclear reactions

  • Weather forecasting
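The intrinsically parallel pattern lends itself to a simple map-and-combine structure. The sketch below is an illustration only, not Azure-specific code: it estimates pi with independent Monte Carlo tasks whose partial results are combined at the end. On a real cluster, each task would run on a separate compute node.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def monte_carlo_task(seed, samples):
    """One independent task: count random points that land inside the unit circle."""
    rng = random.Random(seed)  # per-task RNG, so tasks share no state
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def estimate_pi(tasks=4, samples_per_task=100_000):
    # Each task is fully independent, so all of them can run in parallel
    # (here on local threads; on a cluster, on separate nodes).
    with ThreadPoolExecutor(max_workers=tasks) as pool:
        partials = list(pool.map(monte_carlo_task, range(tasks),
                                 [samples_per_task] * tasks))
    total = tasks * samples_per_task
    return 4.0 * sum(partials) / total
```

Adding more tasks (nodes) shrinks wall-clock time without any communication between tasks, which is the defining property of the intrinsically parallel category.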

While many applications that are designed to run in on-premises clusters can be readily migrated to a cloud environment such as Azure, or to a hybrid environment, there may be some limitations or considerations. These include the following:

  • Availability of cloud resources   Depending on the type of cloud compute resources selected for the solution, you may not be able to rely on continuous machine availability for the duration of a job's execution. Failures and state handling should take this into account.

  • Data access   Data access techniques commonly available within an enterprise network cluster, such as SMB or NFS, may require special configuration in the cloud.

  • Data movement   For applications that process large amounts of data, strategies are needed to move the data to appropriate cloud storage and compute resources. Also consider legal, regulatory, or policy limitations for storing or accessing that data.

  • Licensing   Check with the vendor of any commercial application for licensing or other restrictions for running in the cloud. Not all vendors offer pay-as-you-go licensing.

A Big Compute solution often includes a cluster manager and a job scheduler to help manage clustered compute resources and run the applications. These functions may be accomplished by separate tools, or an integrated tool.

  • Cluster manager   Provisions, releases, and administers compute resources (or compute nodes). Depending on the tool, a cluster manager might automate installation of operating system images and applications on compute nodes, scale compute resources according to demands, and monitor the performance of the nodes.

  • Job scheduler   Specifies the resources (such as processors or memory) an application instance needs, and the conditions when it will run. A job scheduler typically maintains a queue of jobs and allocates resources to them based on an assigned priority or other characteristics.
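To make the job scheduler's role concrete, the sketch below (illustrative only, not the API of any particular scheduler) shows its core mechanism: jobs declare the cores they need, wait in a priority queue, and are started when enough resources are free.

```python
import heapq

class JobScheduler:
    """Minimal illustrative job scheduler: a priority queue plus a core budget."""

    def __init__(self, total_cores):
        self.free_cores = total_cores
        self._queue = []   # heap of (priority, order, (name, cores))
        self._order = 0    # tie-breaker that preserves submission order

    def submit(self, name, cores, priority=0):
        # Lower priority value = runs sooner, as in many schedulers.
        heapq.heappush(self._queue, (priority, self._order, (name, cores)))
        self._order += 1

    def dispatch(self):
        """Start queued jobs while resources allow; return the jobs started.

        Stops at the head job if it does not fit (no backfilling, for simplicity).
        """
        started = []
        while self._queue and self._queue[0][2][1] <= self.free_cores:
            _, _, (name, cores) = heapq.heappop(self._queue)
            self.free_cores -= cores
            started.append(name)
        return started
```

A real scheduler adds much more (task dependencies, preemption, node affinity), but the queue-plus-resource-budget shape is the common core.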

Clustering and job scheduling tools are commercially available for Windows-based and Linux-based clusters, or they can be developed independently. For example, HPC Pack is Microsoft's compute cluster solution for Windows Server and Windows-based computers. You install HPC Pack on a computer called the head node, which acts as the cluster manager and job scheduler. You also install HPC Pack on networked computers to create compute resources for the cluster, and on workstations that are used to submit jobs to the cluster.

Cluster management and job scheduling tools help accomplish a Big Compute workflow. The following is a list of the typical steps in the workflow. Depending on the solution, some steps may be omitted or may need manual intervention or additional, custom tools.

  1. Provision   Prepare the compute environment with the necessary infrastructure, compute resources, and storage to run applications

  2. Stage   Move input data and applications into the compute environment

  3. Schedule   Configure jobs and tasks and allocate compute resources to them when they are available

  4. Monitor   Provide status information about job and task progress; handle errors or exceptions

  5. Finish   Return results and post-process output data if needed
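The five workflow steps above can be sketched as a single driver function. Every function in the sketch is a hypothetical stand-in for calls to real provisioning, storage, and scheduling services; the point is the shape of the workflow, not any particular API.

```python
def provision(node_count):
    # 1. Provision: prepare compute resources (here, just node names).
    return [f"node-{i}" for i in range(node_count)]

def stage(inputs, nodes):
    # 2. Stage: move each input onto a node (round-robin for illustration).
    return [(nodes[i % len(nodes)], data) for i, data in enumerate(inputs)]

def schedule_and_run(placements, task):
    # 3./4. Schedule and monitor: run the task for each placement, recording
    #       per-task status so failures could be handled individually.
    results = []
    for node, data in placements:
        try:
            results.append((node, "ok", task(data)))
        except Exception as err:
            results.append((node, "failed", err))
    return results

def finish(results):
    # 5. Finish: post-process by combining the outputs of successful tasks.
    return sum(value for _, status, value in results if status == "ok")

def run_workflow(inputs, task, node_count=2):
    nodes = provision(node_count)
    placements = stage(inputs, nodes)
    results = schedule_and_run(placements, task)
    return finish(results)
```

In a production solution, each step is typically a long-running, fault-tolerant operation rather than a local function call, which is why some workflows need manual intervention or custom tooling at individual steps.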

This section introduces the compute, data, and related services in Azure that commonly support Big Compute solutions and workflows. For more information, tutorials, and in-depth guidance on each of these services, see the documentation at the Azure.microsoft.com web site. See Solution scenarios later in this topic for examples of how these services can be combined.

Note
Because of the breadth of services and the pace of service additions to the Azure platform, you might be able to take advantage of additional capabilities in Azure that are useful for your scenario. If you have questions, contact an Azure partner.

Compute services in Azure provide compute resources to a Big Compute solution. The following Azure compute services are used most frequently.

 

Cloud Services

  • Can run Big Compute applications in worker role instances, which are virtual machines running a version of Windows Server and are managed entirely by the cloud service

  • Enable scalable, reliable applications with low administrative overhead, running in a platform as a service (PaaS) model

  • May require additional tools to integrate with existing on-premises cluster solutions

  • Continuously monitor virtual machine health, and move a virtual machine to a new host in case of failures

Azure Virtual Machines

  • Provide compute infrastructure as a service (IaaS) using Microsoft Hyper-V technology

  • Enable you to flexibly provision and manage persistent virtual machines from standard Windows Server or Linux images, or images and data disks you supply

  • Extend on-premises compute cluster tools and applications readily to the cloud

For more background on the different capabilities of Cloud Services and Azure Virtual Machines, see the “Execution Models / Compute” section of Introducing Azure.

Whether you use an Azure cloud service or Azure virtual machines in your solution, your application runs on virtual machine-based compute instances in Azure using Windows Server Hyper-V technology. These instances can be scaled up to thousands of cores, depending on your scenario, and then scaled down when you need fewer resources. Azure provides a range of instance sizes with different configurations of CPU cores, memory, disk capacity, and other characteristics to meet the requirements of different applications. For more information about the sizes and options for the virtual machine-based compute resources you can use in your solution, see Virtual Machine and Cloud Service Sizes for Azure.

Compute intensive instances   The recently introduced A8 and A9 compute intensive instances in Azure are designed for compute intensive workloads generally, and parallel MPI applications in particular, because they have high speed, multicore CPUs and large amounts of memory. Additionally, when deployed in configurations such as an HPC Pack cluster to run MPI applications, the compute intensive instances use a second network adapter with remote direct memory access (RDMA) capability to connect to a low-latency and high-throughput application network in Azure. This network is used exclusively for MPI process communication. For some workloads, these capabilities enable MPI application performance in the cloud that is comparable to performance in on-premises clusters with dedicated application networks. For more information, see About the A8 and A9 Compute Intensive Instances.
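In a tightly coupled application, the nodes must exchange intermediate results between iterations; on an A8/A9 cluster, that exchange is MPI traffic over the RDMA application network. The sketch below only simulates the communication pattern in-process, using a one-dimensional halo exchange, to show why latency matters: every iteration blocks until neighboring partitions' boundary data arrives.

```python
def halo_exchange_step(partitions):
    """One iteration of a 1-D stencil across partitioned data.

    Each partition averages neighboring values, which requires the boundary
    ("halo") value from the adjacent partition. In a real MPI program each
    partition lives on a different node and the halos travel over the
    interconnect (MPI send/receive); here the exchange is simulated in-process.
    """
    # Exchange halos: each partition learns its neighbors' edge values.
    halos = []
    for i, part in enumerate(partitions):
        left = partitions[i - 1][-1] if i > 0 else part[0]
        right = partitions[i + 1][0] if i < len(partitions) - 1 else part[-1]
        halos.append((left, right))

    # Compute locally, using the received halo values at the edges.
    new_partitions = []
    for (left, right), part in zip(halos, partitions):
        padded = [left] + part + [right]
        new_partitions.append(
            [(padded[j - 1] + padded[j] + padded[j + 1]) / 3.0
             for j in range(1, len(padded) - 1)])
    return new_partitions
```

Because every iteration repeats this exchange, total run time is dominated by the slowest message each round, which is why low-latency interconnects such as InfiniBand and RDMA improve performance for this class of application.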

A Big Compute solution often operates on a set of input data, and generates data for its results. This section gives a brief description of the data services that are most likely to be used in Big Compute solutions. For more information, see Data Management.

 

Azure Storage

Blobs

  • Store large amounts of unstructured text or binary data

  • Can be accessed from anywhere in the world via HTTP or HTTPS, using a variety of tools and APIs

  • Can be mounted by Azure compute instances as virtual disk drives

  • Are well suited to store and share technical data sets

Tables

  • Provide fast access to typed data but do not require complex SQL queries

  • Use a NoSQL approach called a key-value store. The basic data item stored in a table is called an entity.

Queues

  • Store messages for workflow and communication

  • Provide asynchronous communication: messages placed on the queue are stored until a recipient retrieves them

SQL Database

  • Provides the key features of a relational database management system: atomic transactions, concurrent data access by multiple users with data integrity, and ANSI SQL queries

  • Uses a PaaS model to manage the hardware infrastructure and automatically keep the database and operating system software up to date

  • Provides a federation option that can distribute data across multiple servers, for large amounts of data or to spread access requests to improve performance

Azure File (Preview)

  • Exposes network file shares in Azure using the standard SMB protocol

  • Enables sharing of settings, data, and diagnostic files for distributed applications

  • Accessible through standard file system APIs and a REST interface
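The queue-based decoupling described in the table above is a common way to feed work to compute nodes: a client enqueues one message per task, and workers retrieve and process messages independently. The sketch below imitates that pattern with Python's in-process queue; it is illustrative only, since an Azure Storage queue would be accessed through its REST API or an SDK.

```python
import queue

def enqueue_tasks(work_queue, task_ids):
    # Client side: put one message per task; messages are stored
    # until a recipient retrieves them.
    for task_id in task_ids:
        work_queue.put(task_id)

def worker_loop(work_queue, process):
    # Worker side: drain the queue, processing one message at a time.
    # With a real storage queue, a worker would instead poll for a message,
    # process it, and then explicitly delete it to mark it complete.
    results = []
    while True:
        try:
            task_id = work_queue.get_nowait()
        except queue.Empty:
            break
        results.append(process(task_id))
    return results
```

The client never needs to know how many workers exist or when they run, which is what makes the queue pattern attractive for scaling compute nodes up and down independently of the submitting application.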

Depending on your scenario, additional Azure services might be useful in a Big Compute solution. Examples include:

 

Azure Virtual Network

  • Creates a logically isolated section in Azure for your compute resources

  • Can connect Azure resources to your on-premises datacenter or a single client machine by using IPsec

  • Enables Big Compute applications to access on-premises data, Active Directory services, and license servers

Service Bus

  • Allows applications to communicate or exchange data, no matter where they are located: on Azure, on another cloud platform, or in a data center.

  • Provides a message queuing service (different from queues in Azure Storage): both one-to-one queues and a publish-and-subscribe mechanism, for one-to-many communications

  • Also allows direct communication through a relay service, which provides a secure way to interact through a NAT or firewall

Media Services

  • Enables developers to build solutions for ingesting, processing, managing, and delivering media content

  • Supports Microsoft and non-Microsoft components for ingestion, encoding, and other parts of a media workflow

For more information, see Scenario 3. Big Compute services later in this topic.

This section outlines common scenarios to create or run a Big Compute solution in Azure.

An organization that already has an on-premises compute cluster (made up of computers that run either a Windows or a Linux operating system) may find that it sometimes needs extra compute power. Instead of purchasing additional hardware that may be idle most of the time, the organization can use Azure for on-demand compute power. The same situation may be true of organizations that are outgrowing their current premises but have no more room to expand. Instead of moving to another data center or adding equipment to one that is already too crowded, they can expand to Azure.

For example, you can use HPC Pack 2012 R2 to create an on-premises cluster, and then add extra compute resources in the form of Azure worker role instances running in a cloud service. Other on-premises cluster management tools may have comparable capabilities to expand compute capacity by “bursting” to the cloud. See the following figure.

Hybrid cluster

This solution leverages an existing investment in an on-premises cluster, but allows the on-premises cluster to be scaled for typical, not peak, workloads. If you need to access an on-premises license server or data store, you can set up an Azure virtual network to connect the on-premises cluster and the cloud.

Note
If you want to reduce the on-premises footprint, you can create a hybrid cluster that has just an on-premises HPC Pack head node, and then add all compute resources on demand in Azure. For a tutorial that steps through this scenario, see Set up a hybrid compute cluster with Microsoft HPC Pack.

To a cluster user, submitting a compute job that runs partly on-premises and partly in the cloud looks just as it always does—there’s no difference from submitting one that runs entirely on-premises. For the cluster administrator, however, some extra work is required to make this possible. For example, with a cluster based on HPC Pack, the administrator still uses the familiar cluster management tools, but must also set up an Azure account and configure the Azure nodes. Moving applications and input data to the appropriate on-premises and cloud stores may require special planning. The administrator also specifies which jobs are allowed to use which compute resources, whether those resources are compute nodes in the on-premises cluster or compute nodes in the cloud.

For more information and step-by-step instructions for this scenario, see Burst to Azure with Microsoft HPC Pack.

Another option is to use Azure infrastructure and compute services to deploy a compute cluster entirely in Azure – for example, on Azure virtual machines that are connected with an Azure virtual network. Because Azure virtual machines can run a Windows Server or Linux operating system, you can bring a variety of on-premises cluster solutions to the Azure public cloud.

Note
Check with the vendor of your on-premises cluster solution and applications for additional requirements and best practices for running in a public cloud providing infrastructure as a service (IaaS).

For example, if you want to create a Windows Server based cluster in Azure, you can create a head node by installing HPC Pack 2012 R2 on an Azure virtual machine running Windows Server, and then add compute resources in one of two ways:

  1. Add worker role instances running in an Azure cloud service as compute resources. This is the same way that you can “burst” to Azure from an on-premises cluster.

  2. Add compute nodes as Azure virtual machines. This is similar to the way that you add on-premises compute nodes to a Windows HPC cluster.

For more information about this scenario, see Microsoft HPC Pack in Azure VMs.

The following figure illustrates the first of these two options with HPC Pack.

Cloud-based cluster

A cluster user can submit a job securely to the cloud cluster through standard HPC Pack job submission tools. Just as with the hybrid cluster example, a user who submits a Windows HPC job sees no difference between running computations on premises or in the cloud. The job scheduler transparently spreads the application logic across the Azure nodes.

Putting an entire compute cluster in the cloud can have clear benefits.

  • An organization can run HPC jobs without buying and managing its own cluster hardware or supporting infrastructure. The administrative burden of the cluster is reduced compared with an on-premises or hybrid cluster solution. Because you only pay for the resources you use, Azure is a good choice for organizations whose use of an on-premises cluster is too infrequent to justify the expense and management overhead.

  • Compared with a hybrid deployment, running an application entirely in the cloud can make data access easier. Rather than dividing the data between cloud and on-premises installations, or making one part of the application access data remotely, all of the application data can be stored in the cloud.

  • Some software vendors design applications to run in cloud-based clusters. For example, by deploying the MATLAB Distributed Computation Server, from MathWorks, on an HPC Pack cluster in Azure virtual machines, you can run parallel MATLAB jobs entirely with cloud-based compute resources. For information, see MATLAB Distributed Computing Server with HPC Cluster in Microsoft Azure.

Depending on your computation needs, you might be able to use an existing Big Compute service in Azure, hosted by Microsoft or another service vendor, to simplify management of both the infrastructure and the application for your solution. These services typically host specific applications and serve the compute needs of customers in selected industries. Some services complement on-premises applications, enabling a hybrid solution. Others are dedicated cloud services.

Hosters and solutions vendors in a range of industries offer cloud-enabled Big Compute services. One example is Towers Watson, a global professional services organization, which offers its insurance risk and financial modeling application MoSes as an on-premises cluster application, a hybrid solution, or an Azure-based service.

In the Azure platform Microsoft offers some special-purpose Big Compute services. One example is Azure Media Services. Media Services provides programmatic access to an end-to-end workflow that uploads large volumes of media content to Azure storage, encodes that content with any of a variety of media encodings and formats, and flexibly delivers or streams the content to devices or networks. To perform media encoding, which is usually the most compute-intensive part of a customer’s scenario, Azure Media Services automatically allocates the necessary compute resources and schedules the compute-intensive work, removing much of the burden of building and maintaining a dedicated compute cluster. For more information, see Media Services.

For organizations with compute-intensive workloads that are not suited to standard compute clusters or commercially available Big Compute services, the Azure platform offers developers and partners a full set of compute capabilities, services, architecture choices, and development tools. Azure can support custom Big Compute workflows that scale to thousands of compute cores and involve specialized data workflows and job and task scheduling patterns.

For example, technical organizations that build applications using Python can take advantage of Python development tools that integrate with Azure services, and run their programs on Azure compute services. For information, see An Introduction to Using Python with Microsoft Azure and Building Scalable Services in Microsoft Azure and Python.

Although developing custom solutions is beyond the scope of this topic, see the following resources to get started:

© 2014 Microsoft