Windows Azure Insider - Meter and Autoscale Multi-Tenant Applications in Windows Azure
In our previous column (msdn.microsoft.com/magazine/dn201743), we introduced concepts relating to creating multi-tenant apps, covering two of the four pillars that should be considered when building this type of system: identity and security, and data isolation and segregation. This month, we concentrate on two other important areas that are deeply intertwined: metering and autoscaling. Metering allows companies to gather information about the different components that are being shared among all the tenants; autoscaling guarantees that the end-user experience isn’t affected during periods of high traffic, and that servers are deprovisioned when resource demand is lower.
Gathering information on resource usage is common when troubleshooting applications, especially during the development and testing processes. By doing this, thresholds and hardware requirements that will guarantee optimal performance of the solution can be set, and minimum hardware requirements can be recommended. In Windows, this task is accomplished by using performance counters that help determine system bottlenecks and error conditions.
Metering becomes particularly important when running multi-tenant solutions in the cloud, and not only during the development stages. Supporting multiple users sharing common resources presents specific challenges, such as how to enforce quotas for tenants, identify any users who might be consuming excessive resources, or decide if the pricing tiers need to be redefined. Keep in mind that metering multi-tenant solutions is not only about determining or validating the usage bill from your cloud provider—in this case Windows Azure—but also about optimizing resources in your deployment that guarantee the level of service tenants expect, typically expressed in a service-level agreement (SLA).
In this article, we’ll concentrate on metering and autoscaling the compute portion of a multi-tenant solution, which is the area most affected by variations in the number of users accessing the application. Now that Windows Azure supports multiple cloud deployment models (Cloud Services, Virtual Machines and Web Sites), it’s important to understand the different logging and diagnostics options each one offers, and how each can be autoscaled based on this information. If you need to better understand the basic differences, benefits and limitations of these deployment models, you’ll find a good guide at bit.ly/Z7YwX0.
Collecting Data from Windows Azure Cloud Services
Cloud Services (based on the Platform as a Service concept) collects data via the Windows Azure Diagnostics (WAD) infrastructure, which is built on the Event Tracing for Windows (ETW) framework. Because Cloud Services is based on stateless virtual machines (VMs), WAD lets you save data locally and, on a schedule, transfer it to a central repository in Windows Azure storage using blobs and tables. Once the diagnostics data has been collected from the multiple instances in the role, it can be analyzed and used for multiple purposes. Figure 1 shows how this process works.
Figure 1 Windows Azure Diagnostics for Cloud Services
To enable diagnostics for Cloud Services, the corresponding module should be imported into the role deployment (via the ServiceDefinition.csdef file), and then enabled via the WAD configuration file (diagnostics.wadcfg). Another approach is to programmatically configure diagnostics inside the OnStart method for the role, but using the configuration file is preferred because it’s loaded first and errors related to startup tasks can be caught and logged. Also, changes to the configuration don’t require the code to be rebuilt. Figure 2 shows the most basic version of the service definition and diagnostics configuration files.
<?xml version="1.0" encoding="utf-8"?>
<ServiceDefinition name="MyHostedService"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition"
    schemaVersion="2012-10.1.8">
  <WebRole name="WebRole1">
    <!-- <Sites> ... </Sites> -->
    <!-- <Endpoints> ... </Endpoints> -->
    <Imports>
      <Import moduleName="Diagnostics" />
    </Imports>
  </WebRole>
</ServiceDefinition>

<?xml version="1.0" encoding="utf-8" ?>
<DiagnosticMonitorConfiguration
    xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration"
    configurationChangePollInterval="PT1M"
    overallQuotaInMB="4096">
  <Directories bufferQuotaInMB="0" scheduledTransferPeriod="PT30M">
    <IISLogs container="wad-iis" directoryQuotaInMB="0" />
  </Directories>
</DiagnosticMonitorConfiguration>
The configurationChangePollInterval attribute defines how often an instance checks for configuration changes, while scheduledTransferPeriod specifies the interval at which local log files are transferred to Windows Azure storage (in the example shown in Figure 2, to the “wad-iis” blob container). One minute (PT1M) is the minimum value for the scheduled transfer period, but transferring that frequently is overkill for most scenarios. The overallQuotaInMB attribute defines the total amount of file system storage allocated for logging buffers. The bufferQuotaInMB attribute for each data source can either be left at the default of zero, which means the buffer is limited only by the overall quota, or it can be explicitly set, in which case the sum of all the bufferQuotaInMB values must be less than overallQuotaInMB.
Even though different data sources can be used to collect diagnostics information for cloud services, using them to determine which specific tenants are consuming most of the compute resources isn’t easy. The closest metric that can be used for this purpose is provided by the IIS World Wide Web Consortium (W3C) logs, assuming traffic from the different users and tenants is tracked via URL parameters or specific virtual directories. To activate IIS logging, add the IISLogs XML node to the diagnostics configuration file (also included in Figure 2), but be warned that these logs can grow large quickly. Keep in mind that diagnostics information is stored in a Windows Azure storage account, and that configuration changes can be made on deployed and running services.
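Because the W3C fields are plain text, per-tenant usage can be tallied with a few lines of script once the logs have been pulled from blob storage. The following Python sketch assumes a hypothetical convention in which each request carries a tenant URL parameter; the field layout is illustrative (real IIS logs declare their columns in a #Fields directive):

```python
from collections import Counter
from urllib.parse import parse_qs

# Illustrative W3C-style log lines:
# date time cs-method cs-uri-stem cs-uri-query sc-status
SAMPLE_LINES = [
    "#Fields: date time cs-method cs-uri-stem cs-uri-query sc-status",
    "2013-06-01 10:00:01 GET /api/orders tenant=contoso 200",
    "2013-06-01 10:00:02 GET /api/orders tenant=fabrikam 200",
    "2013-06-01 10:00:03 GET /api/invoices tenant=contoso 200",
]

def requests_per_tenant(lines):
    """Count requests per tenant, assuming the tenant id travels in cs-uri-query."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):  # skip W3C header directives
            continue
        fields = line.split()
        query = fields[4]  # cs-uri-query position in this sample layout
        tenant = parse_qs(query).get("tenant", ["unknown"])[0]
        counts[tenant] += 1
    return counts

print(requests_per_tenant(SAMPLE_LINES))
# Counter({'contoso': 2, 'fabrikam': 1})
```

The same idea applies if tenants are segregated by virtual directory instead; you would key the counter on the cs-uri-stem prefix rather than a query parameter.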
To learn more about other types of data sources via the configuration file—including Windows event logs and performance counters, among others—you can review the Windows Azure documentation at bit.ly/GTXAvo. Also, starting with version 2.0 of the Windows Azure SDK, the process of configuring diagnostics in Visual Studio has been greatly improved. The diagnostics section now offers a Custom plan that can be modified to include one or more data sources to be logged and transferred to the specified Windows Azure storage account (Figure 3).
Figure 3 New Diagnostics Configuration Options in Windows Azure SDK 2.0
By clicking on the Edit button, you can define specific data to be collected, including Windows performance counters, event logs and log directories (Figure 4). In this example, diagnostics information for percentage of processor time being used, available megabytes of memory, and number of requests per second will be collected and transferred to the Windows Azure storage account every 30 minutes (the transfer period setting). This new interface simplifies the process of configuring diagnostics for cloud services and obtaining metering data for later use.
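For reference, a hand-written diagnostics.wadcfg roughly equivalent to those Figure 4 settings would look like the following sketch (the counter paths are standard Windows counters; the ASP.NET requests counter and the one-minute sample rates are assumptions):

```xml
<?xml version="1.0" encoding="utf-8" ?>
<DiagnosticMonitorConfiguration
    xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration"
    configurationChangePollInterval="PT1M"
    overallQuotaInMB="4096">
  <PerformanceCounters bufferQuotaInMB="0" scheduledTransferPeriod="PT30M">
    <PerformanceCounterConfiguration
        counterSpecifier="\Processor(_Total)\% Processor Time" sampleRate="PT1M" />
    <PerformanceCounterConfiguration
        counterSpecifier="\Memory\Available MBytes" sampleRate="PT1M" />
    <PerformanceCounterConfiguration
        counterSpecifier="\ASP.NET Applications(__Total__)\Requests/Sec" sampleRate="PT1M" />
  </PerformanceCounters>
</DiagnosticMonitorConfiguration>
```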
Figure 4 Configuring Performance Counters in Visual Studio with the Windows Azure SDK 2.0
In addition to the Cloud Services monitoring options provided by the Windows Azure platform, you might want to take a look at the Cloud Ninja Metering Block released by the Windows Azure Incubation team, which encompasses many of these features in an easy-to-use library. It’s available at cnmb.codeplex.com.
Collecting Data from Windows Azure Virtual Machines
Virtual Machines are stateful instances running on the Windows Azure platform that can be deployed individually or connected to Cloud Services via virtual networks. Because these instances run full versions of Windows or Linux, gathering diagnostics information is similar to the process for on-premises machines, using performance counters and persisting the results to local storage. The process of extracting this information varies, but it’s usually accomplished by installing local agents that transfer the data to external services.
Collecting Data from Windows Azure Web Sites
Now let’s turn our attention to Windows Azure Web Sites. Gathering diagnostics information from Web Sites is a simple process that can be enabled directly in the management portal. For the purpose of monitoring multi-tenant applications, Web Server Logging (the W3C extended log file format) should be activated, and log files downloaded via FTP. Here are the steps to follow:
- Access manage.windowsazure.com.
- Select Web Sites, and then the specific site that needs to be configured.
- Click on Configure and scroll down to the “site diagnostics” section. Turn on Web Server Logging.
- You can download the logs from /LogFiles/http/RawLogs. Log Parser 2.2, available from the Microsoft Download Center (bit.ly/119mefJ), can be used to parse and query IIS logs.
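Once downloaded, those raw logs can be summarized per tenant. As a sketch, assuming each tenant maps to a virtual directory, a Log Parser 2.2 query such as the following would rank URL stems by hit count (the *.log file mask is a placeholder for the downloaded RawLogs files):

```
LogParser "SELECT cs-uri-stem, COUNT(*) AS Hits FROM *.log GROUP BY cs-uri-stem ORDER BY Hits DESC" -i:IISW3C
```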
As with Windows Azure Cloud Services, information from the log files can be used to determine the usage of resources by different tenants, by tracking either URL parameters or individual virtual directories.
Metering as a Service
In addition to the diagnostics options natively provided by Windows Azure, a few companies offer metering services for Windows Azure. For example, Dell Inc. has created a product called Foglight that delivers real-time data on the health of applications and ties it back to the UX. It also includes a notification service that alerts developers of critical problems. Today, Foglight supports Cloud Services and Windows Azure SQL Database, based on the WAD infrastructure, as shown in Figure 5.
Figure 5 The Dell Foglight Monitoring Portal
Once the metering and performance counter data has been collected, it can be used to determine the level of provisioning that’s needed to meet the performance requirements of the application. Autoscaling in Windows Azure refers to the act of adding or subtracting instances from a specific deployment (scaling out), with the idea of keeping solutions up and running for the lowest possible cost. Even though it’s possible to scale up (increase the resources for a single machine), this usually implies application downtime, which is never desirable. There are basically three ways to autoscale a Windows Azure deployment.
Use an Autoscaling Block One approach to autoscaling a Windows Azure deployment, which specifically applies to Windows Azure Cloud Services, is to add an autoscaling application block to the solution. There are a couple of ready-to-be-used libraries for this purpose. One library is part of the Enterprise Integration Pack for Windows Azure, and it uses a collection of user-defined rules, setting limits for the minimum and maximum number of role instances in the deployment based on counters or measurements collected by WAD. This approach has been extensively documented by the Microsoft patterns & practices team, and can be found at bit.ly/18cr5mD. Figure 6 shows a basic multi-tenant architecture with an autoscaling block added to the solution.
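As an illustration, a rules file for the autoscaling application block might look roughly like the following sketch; the role name, instance bounds and CPU threshold are assumptions, and the full schema is covered in the patterns & practices documentation referenced earlier:

```xml
<rules xmlns="http://schemas.microsoft.com/practices/2011/entlib/autoscaling/rules">
  <constraintRules>
    <!-- Hard floor and ceiling for the number of WebRole1 instances -->
    <rule name="Default" enabled="true" rank="1">
      <actions>
        <range min="2" max="8" target="WebRole1" />
      </actions>
    </rule>
  </constraintRules>
  <reactiveRules>
    <!-- Add an instance whenever average CPU stays above 75 percent -->
    <rule name="ScaleOutOnHighCpu" enabled="true" rank="10">
      <when>
        <greater operand="AvgCpu" than="75" />
      </when>
      <actions>
        <scale target="WebRole1" by="1" />
      </actions>
    </rule>
  </reactiveRules>
  <operands>
    <performanceCounter alias="AvgCpu" source="WebRole1"
        performanceCounterName="\Processor(_Total)\% Processor Time"
        aggregate="Average" timespan="00:10:00" />
  </operands>
</rules>
```

The constraint rule always wins over the reactive rules, so the deployment can never scale outside the two-to-eight instance range regardless of load.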
Figure 6 Using an Autoscaling Application Block Approach for Cloud Services
Use an External Service There are some scaling-out services available for Windows Azure deployments that act as external autoscaling application blocks. Microsoft recently acquired MetricsHub (metricshub.com), which provides a free monitoring and autoscaling service for Windows Azure subscribers. The logic for scaling out is based on sustained averages, leading indicators, trailing data and specific schedules. You can add the service directly from the management portal in the Add-Ons section (Windows Azure Store). MetricsHub supports both Windows Azure Cloud Services and Windows Azure Virtual Machines, based on an architecture that extracts information from WAD and receives information from agents installed on single stateful instances (see Figure 7).
Figure 7 MetricsHub Architecture
Once the service has been set up, the MetricsHub portal offers different thresholds for maintaining a healthy cloud environment, based on parameters such as target CPU range and number of messages in a queue. It also provides a cost forecast before and after applying the autoscaling options, truly automating the provisioning process in the smartest way possible, balancing cost with performance (see Figure 8).
Figure 8 The MetricsHub Autoscaling Portal
Use Automated Windows PowerShell Scripts The third method is based on Windows PowerShell scripts that are created manually and executed directly against the Windows Azure Management API. This approach provides a high level of control and flexibility, because the scripts can be used inside custom applications or continuous integration frameworks. Moreover, the Windows PowerShell cmdlets for Windows Azure support all three deployment models, including automation of the provisioning process for Windows Azure Web Sites. For example, changing the number of instances for a specific deployment takes a single cmdlet call.
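As a sketch, assuming the Windows Azure PowerShell module is installed and your subscription credentials have been imported, scaling the hypothetical WebRole1 role from the Figure 2 service out to three instances would look roughly like this:

```powershell
# One-time setup: download and import subscription credentials
#   Get-AzurePublishSettingsFile
#   Import-AzurePublishSettingsFile .\MySubscription.publishsettings

# Scale the WebRole1 role of MyHostedService out to three instances
Set-AzureRole -ServiceName "MyHostedService" -Slot "Production" `
              -RoleName "WebRole1" -Count 3
```

A similar knob exists for Web Sites through the Set-AzureWebsite cmdlet, so the same scripted approach covers that deployment model as well.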
This article concludes our two-part series on building multi-tenant solutions in Windows Azure. Having covered identity and data isolation in the first article, here we introduced the process of configuring and extracting performance information from each of the Windows Azure deployment models: Cloud Services, Virtual Machines and Web Sites. We also analyzed three different ways of autoscaling deployments via internal and external components and services. By taking advantage of the cloud economic model, based on usage cost and pools of shared resources, more companies can release solutions that efficiently adapt to their needs.
Bruno Terkaly is a developer evangelist for Microsoft. His depth of knowledge comes from years of experience in the field, writing code using a multitude of platforms, languages, frameworks, SDKs, libraries and APIs. He spends time writing code, blogging and giving live presentations on building cloud-based applications, specifically using the Windows Azure platform. You can read his blog at blogs.msdn.com/b/brunoterkaly.
Ricardo Villalobos is a seasoned software architect with more than 15 years of experience designing and creating applications for companies in the supply chain management industry. Holding different technical certifications, as well as a master’s degree in business administration from the University of Dallas, he works as a cloud architect in the Windows Azure CSV incubation group for Microsoft. You can read his blog at blog.ricardovillalobos.com.
Terkaly and Villalobos jointly present at large industry conferences. They encourage readers of Windows Azure Insider to contact them for availability. Terkaly can be reached at firstname.lastname@example.org and Villalobos can be reached at Ricardo.Villalobos@microsoft.com.
Thanks to the following technical expert for reviewing this article: Trent Swanson (Full Scale 180)
Trent Swanson is a software architect and principal working with cloud and big data technologies at Full Scale 180. He has been working with Windows Azure since the very beginning, helping clients around the world build, deploy and manage their cloud solutions on Windows Azure. Whether it’s moving an existing application to the cloud or building new ones, he enjoys the entire lifecycle of delivering scalable, reliable and manageable cloud solutions on Windows Azure. When not working on interesting and challenging projects, he enjoys spending time in the gym, mixed martial arts, supporting local small businesses, and being active in his church.