Mass-Hosting High-Availability Architectures
by Shaun Hirschman, Mathew Baldwin, Tito Leverette, and Michael Johnson
Summary: Scalability and high availability are essential but competing properties for hosting infrastructures. Whether you are an open-source engineer, commercial-solution consumer, or Microsoft IT engineer, no "silver-bullet" solution appears to exist. This article will concentrate on the required components needed to build an ultimate environment that is scalable, reliable, secure, and easy to maintain—all while providing high availability (HA). (9 printed pages)
History of the Hosting Industry and Current Problems
Considerations when Planning Mass-Hosting High-Availability Architectures
Scaling Microsoft IIS for Mass-Hosting
Trending: Isolation vs. Scale
Shared Application-Pool Scenario
Planning Your Application Pools for High Availability
Scalability and high availability are essential but competing properties for hosting infrastructures. Whether you are an open-source engineer, commercial-solution consumer, or Microsoft IT engineer, no "silver-bullet" solution appears to exist.
In exploring different aspects of a hosting infrastructure, we might find existing technologies that we can bring together to create the "silver-bullet" solution. This process will also help expose the gaps that we must bridge. The "silver-bullet" platform must deliver an infrastructure that can allow for density in number of customer accounts, as well as being scalable and highly redundant.
This article will concentrate on the required components needed to build an ultimate environment that is scalable, reliable, secure, and easy to maintain—all while providing high availability (HA). Solitary application providers, all the way to high-density shared hosters, would benefit from such a solution. But, does it exist? If not, what are the challenges that block us today, and how close can we get to it?
To understand fully the complexities of a Web hosting platform, we must first understand how the hosting market started and how it evolved into its current state. Before traditional hosting began to take shape, simple static Web sites were popular and relatively new to the public at large. The infrastructure built to support this movement was equally basic, focused more on bringing in as many customers as possible than on providing higher-end services, such as interactive applications and high availability.
Let's start by describing the classic architecture that Microsoft-based hosters have employed in the past: a single, stand-alone Web server with FTP services, running IIS with content hosted locally on the server. Each stand-alone system has a certain cost to it: hardware, software licenses, network, power, rack space, and others. Some hosters have even taken this to the extreme by hosting all services on a single server for x number of customers. For example, they have servers with Microsoft IIS, SQL, third-party mail-server software, and local storage for hosted content.
These boxes are easy to image and quick to deploy, especially if you are selling budget packages to customers who just want to host a few pages and nothing more complex.
As this type of platform grew, numerous issues began to rear their ugly heads: backup and restore requirements, consumption of costly data-center space, power per server, high-load customers, and general management. In addition to platform issues, there was the increasing demand from consumers and the advances in new Web technology. Technologies such as PHP, ColdFusion, ASP, JSP, and ASP.NET emerged, providing more complex platforms driven by consumers' desire for richer functionality and by new business opportunities. This, in turn, seeded new applications requiring data stores like SQL and other related technologies. As new applications were created, the business potential around these applications became more evident and in demand.
Hosters, in an attempt to maximize the bottom line and simplify management, continued to operate all services required to support their customers on a stand-alone server. This created an increased load on a single server, encumbering it and reducing the number of Web sites that it could serve. Consumer demand rose more rapidly than hardware technology could support. Hosters now had to scale out, separating and isolating services across multiple servers, instead of scaling up with faster hardware.
The hosting industry continued to get saturated with competitive companies in response to the high demand for Web-based services such as dynamic Web sites and e-commerce shopping carts. This blooming market forced hosters to differentiate their services from others by creating service-level agreements (SLAs). Not only must hosters provide services at low cost, but those services must always be available. This is driven by consumers' and businesses' growing dependency upon the Web and their demand for more interactive, complex applications. Producing highly available services usually translates into more servers to support redundancy, as well as new performance, security, and management requirements. How can hosters scale and support these services with the given software and hardware architecture?
In light of these industry changes, software technology companies providing the foundation for these services realized that their current operating systems were not capable of meeting the needs of hosters. A high level of system-administrator contact was still required to keep operations running as smoothly as possible. Independent service vendors and industry engineers became conscious of the gaps in service-related software and hardware and began developing technologies to fill those gaps while profiting from them.
It is at this point that hosters started to look at building platforms that are more scalable and can achieve a higher per-box density. They must also be easier to manage. A good first step in building this type of architecture is to look at the various services that the hoster must support and manage.
When the hoster sits down and begins to think about the technologies they can put together and how to build a platform to support many Web sites and services, a number of key requirements come to mind. These requirements range from user or application density, to the types of features and services that should be supported and their impact on the underlying system—ASP.NET, PHP, SQL, and more—and finally to the cost per hosted site or application. The primary goal when designing a hosted platform is to optimize scalability, availability, density, and price parity while adhering to as granular a level of security as possible, with customers isolated from each other.
We separate these services into broad segments, the principal pieces of the architecture being: IIS cluster(s); SQL cluster(s); support infrastructure services, such as Active Directory and System Center Operations Manager; centralized storage on either a storage area network or network-attached storage; SSL-offload clusters; FTP clusters; and other similar clusters.
Separating these services into multiple segments allows the hoster to scale the architecture in different ways. A hoster could scale a SQL cluster differently from a Web cluster, which brings to the table a different set of architectural issues.
Another factor that informs the architecture is support for legacy requirements, the best example of which is the Microsoft FrontPage Server Extensions (FPSE). Commonly used by mom-and-pop shops to author simple sites, FPSE are still in use by thousands of customers and are necessary if the mass-hosting platform hopes to attract them. Development tools such as Microsoft Visual Studio and Microsoft Visual Web Developer still rely on them for HTTP-uploading functionality, even though Microsoft discourages their use. Legacy components such as FPSE cannot simply be removed by large hosts without a loss of customer base.
Now, let's dive into a few of these clusters within the architecture. The biggest piece is the Web cluster, with the second being the SQL cluster. It's important to remember that one key differentiator between hosters and other enterprises, such as internal IT departments, is that hosters will attempt to put as many sites or databases on a cluster as possible. Because of this, certain enterprise solutions do not work for them. Additionally, hosters do not always control what types of applications are placed on a server, and so they cannot define capacity in the same way that typical enterprises can.
Figure 1. Application load-distribution model
Because there are multiple Web front-ends, the hosting architect has to consider many options for distributing the configuration across all Web servers. This configuration is dependent upon the type of load-distribution model that is chosen. There are several models for distributing load across multiple Web front ends. We will discuss two that are common to hosting scenarios. These are application load distribution and aggregate load distribution.
Application Load Distribution
Application load distribution describes a model where load is distributed across multiple Web front-end nodes based upon each server's function. This model is typically request-based and leverages the application-layer routing capabilities that many network load balancers now support. It allows hosters to split the Web farm up based on server workloads. In a typical implementation of this model, static content is served by servers that are separate from those serving dynamic content such as ASP.NET or PHP (see Figure 1).
Even more granularity can be added to this configuration by further splitting the dynamic servers by their specific function. This means creating smaller sub-farms for each type of application. All ASP.NET traffic would be routed to an ASP.NET sub-farm, and all PHP content would be routed to another sub-farm. Because dynamic-content servers normally require more resources, this design allows the hoster to use a different class of hardware for those sites than would be required for static content.
Most hosters have to consider cost when designing their platform; therefore, the application load-distribution model might not always be feasible, simply because it increases the number of servers involved. Application load distribution also increases the complexity of managing the servers and forces a heavy reliance on networking equipment.
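The request-based routing decision described above can be sketched in a few lines. The following Python sketch is illustrative only: the sub-farm names, file extensions, and routing rule are invented stand-ins for whatever a real application-layer load balancer would be configured with.

```python
# Hypothetical sketch of application-layer routing: map each request to a
# sub-farm based on the kind of content it asks for. Sub-farm names and
# extension lists are illustrative, not taken from any specific load balancer.
from urllib.parse import urlparse

# Illustrative sub-farms keyed by the file extensions they serve.
SUB_FARMS = {
    "aspnet": {".aspx", ".ashx", ".asmx"},
    "php": {".php"},
}
STATIC_FARM = "static"  # everything else: HTML, images, CSS, ...

def route_request(url: str) -> str:
    """Return the name of the sub-farm that should handle this request."""
    path = urlparse(url).path.lower()
    dot = path.rfind(".")
    ext = path[dot:] if dot != -1 else ""
    for farm, extensions in SUB_FARMS.items():
        if ext in extensions:
            return farm
    return STATIC_FARM

print(route_request("http://example.com/shop/cart.aspx"))   # aspnet
print(route_request("http://example.com/blog/index.php"))   # php
print(route_request("http://example.com/images/logo.png"))  # static
```

In practice this logic lives in the load balancer, not in application code; the point is only that the routing key is derived from the request itself, which is what makes the model request-based.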
Figure 2. One-to-one application-pool scenarios
Aggregate Load Distribution
To keep cost at a minimum and also to make managing the platform less complex, a hoster might choose to implement an aggregate load-distribution model. We refer to an aggregate load-distribution model as one where all servers share the exact same configuration. Each server in the Web farm is also capable of servicing both static and dynamic content. On systems running IIS, application-pool design is vital to maximizing scale in this type of model.
Within a mass-hosting company that focuses on the Microsoft platform, there is a constant push for higher density on each server in the farm. With a complex HA architecture, the hoster increases density; however, performance bottlenecks still occur, particularly with application pools.
In IIS, application-pool design is vital to achieving maximum density. Failing to take the time to plan application-pool design properly might lead to unexpected performance and stability issues. Application pools are truly about isolation, and there is a direct correlation between the level of isolation you choose and the number of applications you can place on a server. When designing application pools, you must weigh your need for isolation and security against your desired scale. Hosters must choose between two application-pool scenarios: a one-to-one scenario or a one-to-many shared application-pool scenario. Notice that in Figure 2, one-to-one application-pool scenarios tend to trend toward isolation, but away from scale. The shared application-pool scenario, on the other hand, trends toward higher levels of scale, while trending away from isolation. Ideally, the hoster would choose a solution that allows them to maximize scale without sacrificing isolation and security.
A one-to-one isolation scenario is defined by having an application pool assigned to a single application or, in a shared Web-hosting scenario, to a single Web site. It allows a hoster to achieve a high level of isolation, because each Web site or application runs within a single process and does not share resources with others on the server. This is an optimal solution for a hoster or independent software vendor who must assure customers that others on the same server will not have access to their important data. However, this approach is limited in mass-hosting scenarios. Although it provides the desired level of isolation and security, its memory requirements cause it to miss the mark in providing hosters the scale they desire. As each application pool runs, it consumes memory, and eventually a bottleneck is reached.
Introducing dynamic code into the platform adds a whole new level of complexity. For example, ASP.NET applications increase the amount of memory needed for the application pool. This becomes a problem for the hoster, because it limits the number of dynamic Web sites that can run on a server: they begin to see that they can scale only into the hundreds of sites, instead of thousands. Improvements in hardware technology help here; specifically, the introduction of the 64-bit architecture has afforded many hosters the luxury of adding immense amounts of memory to their servers. While this enables them to move beyond potential bottlenecks, it might also uncover other issues.
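The memory arithmetic behind "hundreds, not thousands" can be made concrete with a back-of-the-envelope sketch. All figures below are invented for illustration; real per-process overheads depend on the runtime, the applications, and the IIS version.

```python
# Back-of-the-envelope density estimate with made-up numbers: how many sites
# fit on one web server under one-to-one pools versus a shared pool, given a
# per-worker-process memory overhead. Every figure here is illustrative.

def max_sites_one_to_one(ram_mb: int, per_pool_mb: int) -> int:
    """One pool (one worker process) per site: RAM bounds the site count."""
    return ram_mb // per_pool_mb

def max_sites_shared(ram_mb: int, per_pool_mb: int, per_site_mb: int) -> int:
    """One shared pool: pay the process overhead once, then a much smaller
    incremental cost per site (cached content, session state, and so on)."""
    return (ram_mb - per_pool_mb) // per_site_mb

ram = 4096          # 4 GB of usable RAM (illustrative)
pool_overhead = 30  # MB per worker process hosting dynamic code (illustrative)
site_cost = 2       # MB of incremental cost per site in a shared pool (illustrative)

print(max_sites_one_to_one(ram, pool_overhead))      # 136 sites
print(max_sites_shared(ram, pool_overhead, site_cost))  # 2033 sites
```

Under these assumed numbers the one-to-one model tops out in the low hundreds of dynamic sites while the shared model reaches the thousands, which is the shape of the trade-off the article describes.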
A shared application-pool scenario describes a situation where more than one application or Web site resides in the same application pool. There are two different shared application-pool configurations. The first is one-to-N where the hosting provider dedicates a single application pool to a predefined number of Web sites. The second is a one-to-all configuration where the host places all Web sites in a single application pool. A shared application-pool scenario allows for better scale, because it does not subject the system to the memory constraints imposed by a one-to-one application-pool design.
There is some concern that Web sites and applications in a shared application pool might potentially have access to one another's data. Certain mitigations must be taken to ensure that these applications and Web sites are secured. For example, each Web site must have a unique anonymous user, and access-control lists should be applied to the Web files. Also, ASP.NET applications should be configured with code access security and set to a Medium trust level. These steps help ensure that applications and Web sites on the server remain secure even in a shared application pool.
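The ASP.NET piece of these mitigations can be expressed in configuration. The fragment below is a sketch of a per-site web.config applying code access security at Medium trust; where the setting is applied (and whether it is locked down centrally so customers cannot override it) varies by deployment.

```xml
<!-- Sketch of a per-site web.config enforcing partial trust. -->
<configuration>
  <system.web>
    <!-- Run the site's ASP.NET code under code access security at Medium
         trust, which restricts file access to the application's own tree. -->
    <trust level="Medium" />
  </system.web>
</configuration>
```

In a shared-hosting deployment, hosts typically enforce this centrally rather than per site, so that a customer cannot simply edit it back to Full trust.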
Because all applications run under the same process, shared application pools lack the isolation many hosting providers require. This can be a concern in HA scenarios, because a problem application affects all other Web sites and applications in the same pool. For instance, an application could cause the application pool to recycle or—worse—shut down completely. It is also more difficult for system administrators to identify a problem when there are multiple sites and applications in a single pool. Although there are tools available that allow hosts to isolate problems within an application pool, this task is more easily accomplished when using a one-to-one application-pool-to-Web-site model.
Hosters are faced with striking a balance between achieving high levels of isolation while maximizing scale. Because of this, many are forced to use a hybrid of the aforementioned application-pool scenarios. A typical scenario would include multiple application pools, each serving a specific purpose. For instance, Web sites that only contain static content might all be placed in a single shared application pool. This is possible because static content is not associated with the security and performance concerns that come with dynamic content. All other application pools would be dedicated to sites that contain dynamic content. This allows the hoster to allot greater resources to these sites. This configuration is more prevalent in shared hosting environments where a single platform must service customers who have both static and dynamic content.
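A hybrid policy of this kind reduces to a simple assignment rule. The sketch below illustrates one such rule in Python; the site records, pool-naming scheme, and the static/dynamic test are all invented for illustration, not taken from any provisioning system.

```python
# Sketch of a hybrid pool-assignment policy along the lines described above:
# purely static sites share one pool, while dynamic sites each get a pool of
# their own. Pool names and inputs are invented for illustration.

def assign_pool(site_name: str, has_dynamic_content: bool) -> str:
    """Return the application-pool name a site should be placed in."""
    if has_dynamic_content:
        # Dynamic content: dedicate a pool so a faulty application cannot
        # take down unrelated customers.
        return f"pool_{site_name}"
    # Static-only sites carry no code, so they can safely share one pool.
    return "pool_shared_static"

print(assign_pool("contoso", has_dynamic_content=True))    # pool_contoso
print(assign_pool("fabrikam", has_dynamic_content=False))  # pool_shared_static
```

Real policies are usually richer (for example, one-to-N pools for low-traffic dynamic sites), but the decision is made at provisioning time in the same way.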
Another fundamental requirement for mass-hosting, HA environments, is maintaining state and configuration across all Web front ends in the farm. Although there are other configurations that must exist on each front end, the most important is that of the Web server. Some Web servers have support for a centralized configuration store. For those that do not, some software solution must be implemented to replicate the configuration across all servers.
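For Web servers without a centralized configuration store, the replication software amounts to detecting drift and pushing the master copy out. The following is a minimal Python sketch of that idea, assuming the configuration is a single file and the front ends' copies are reachable as file paths; real solutions must also handle locking, ordering, and service reloads.

```python
# Minimal sketch of configuration replication across web front ends: hash the
# master copy and overwrite any replica whose contents differ. Paths are
# hypothetical; real deployments would target UNC paths on each front end.
import hashlib
import shutil
from pathlib import Path

def file_digest(path: Path) -> str:
    """SHA-256 of a file's contents, or '' if the file does not exist."""
    if not path.exists():
        return ""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def replicate(master: Path, replicas: list[Path]) -> list[Path]:
    """Copy the master config over every out-of-date replica; return the
    list of replicas that were updated."""
    want = file_digest(master)
    updated = []
    for replica in replicas:
        if file_digest(replica) != want:
            shutil.copy2(master, replica)
            updated.append(replica)
    return updated
```

Comparing digests rather than timestamps makes the push idempotent: running it twice in a row copies nothing the second time, which matters when a scheduler fires it every few minutes across a large farm.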
One of the central decisions architects face in a highly scalable, mass-hosting platform is determining where the hosted customer content will live. In most cases, you are dealing with content that lives either within SQL or on disk. Because the front end of the architecture consists of clusters configured with thousands of sites, it is not practical to localize the content to onboard, directly attached disks. The sections below break down the different storage architectures that are common among hosting companies.
Direct Attached Storage (DAS)
DAS is one of the most common and classic storage methods used by Web hosting companies: stand-alone Web servers where content is stored locally. The primary benefit is that if a single stand-alone server goes down, the entire customer base is not offline. One of the major downfalls is that customers are susceptible to any type of hardware failure, not just a disk-subsystem failure. Additionally, this configuration introduces issues such as density limits, migration problems, and lack of high availability for the Web sites and load distribution.
Network Attached Storage (NAS)
Most hosts that have gone with a highly scalable platform have opted to use NAS as their centralized remote storage for all of their customers. In a highly available Microsoft Windows environment, the NAS is accessed via Common Internet File System (CIFS), usually across a dedicated storage network. Customer Web content is stored centrally on the NAS with the path to the NAS translated into a logical path using Distributed File System (DFS). The combination of NAS and DFS allows a Windows-based hoster to spread customers across multiple NAS storage subsystems, limiting the impact of a global issue to those customers.
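The DFS translation described above is, at its core, a longest-prefix lookup from a logical namespace to a physical UNC target. The sketch below illustrates that mapping in Python; the namespace, share names, and NAS host names are all made up for illustration.

```python
# Sketch of the DFS idea: customers and web servers see one stable logical
# namespace, while the physical NAS target behind each folder can change.
# All server and share names below are invented.

# Logical DFS folder -> physical UNC target on one of several NAS heads.
DFS_LINKS = {
    r"\\hosting\content\customers-a-m": r"\\nas01\webcontent",
    r"\\hosting\content\customers-n-z": r"\\nas02\webcontent",
}

def resolve(logical_path: str) -> str:
    """Translate a logical content path into its physical NAS path."""
    for link, target in DFS_LINKS.items():
        if logical_path.lower().startswith(link.lower()):
            return target + logical_path[len(link):]
    raise KeyError(f"no DFS link covers {logical_path}")

print(resolve(r"\\hosting\content\customers-a-m\contoso\wwwroot"))
# \\nas01\webcontent\contoso\wwwroot
```

Because sites reference only the logical path, a hoster can retarget a link at a different NAS subsystem (for example, during a migration) without touching thousands of site configurations, which is the operational benefit the article describes.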
Optimizing CIFS, TCP/IP, and the NAS plays heavily into how many simultaneous customer sites the NAS can serve content for. With poor optimization of the NAS, hosts can introduce issues that affect the entire customer base. However, hosts mitigate this by using multiple NAS subsystems for different segments of customers and by dedicating storage networks and interfaces to this type of traffic.
Many NAS subsystems have mirroring and snapshot capabilities. These technologies are leveraged by the hoster to provide a solid disaster-recovery process—especially considering that the storage system could be holding content for tens of thousands of customers.
One issue with a pure NAS storage subsystem is that technologies such as SQL do not support storing their databases remotely over CIFS, which limits the types of services the hoster can offer.
Storage Area Network (SAN)
Many enterprise storage systems are capable of operating as both a SAN and a NAS, with the key difference being the method used to connect to them. In the case of a SAN, the principal methods are Fibre Channel (FC) and iSCSI.
Hosting companies are capable of building highly available, scalable SQL and mail systems, such as Microsoft Exchange, by leveraging the capabilities of a SAN as the centralized storage for these types of clusters. Additionally, the more advanced the SAN, the more options the hosting company has to perform tasks such as snapshot and recovery management within SQL or Exchange.
One of the major drawbacks to an enterprise storage system is the cost the hoster incurs. Hosts are careful to ensure that the product they are about to deploy will have a high enough return on investment (ROI) to warrant the cost of a SAN.
SQL is a major service that many hosts offer to a majority of their customers. However, it is also one of the key areas that many hosts do not implement as a cluster. There are a variety of reasons for this, with the chief reasons being cost and licensing. However, the hosts that do go with a highly available SQL cluster have to design their architecture so that they select the right type of clustering methodology that will support numerous databases.
Unlike other enterprises in which the SQL cluster consists of a relatively small number of databases, hosting companies will deploy hundreds, if not thousands, of databases to a single database cluster. This cluster must be resilient in both performance and uptime. Because hosting companies have no control over how their customers write their applications, some unique issues are presented when designing a cluster for mass-hosting. In a hosting environment, each customer has 1 – n databases assigned to them. These databases can be stored on a single cluster or distributed across multiple clusters. The most common cluster a hoster would build for mass SQL hosting is the standard active-passive SQL cluster. However, as hosts move into software-as-a-service (SaaS) hosting, the requirements on the SQL cluster transform from node redundancy to redundancy of data. This introduces more issues, because these same systems will still host numerous databases.
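When databases are distributed across multiple clusters, the hoster needs a deterministic placement rule. One simple policy, sketched below in Python, hashes the database name to pick a cluster; the cluster names are invented for illustration, and the hash-based rule is just one option among many.

```python
# Sketch of one simple placement policy: hash each customer database name to
# choose which SQL cluster hosts it, so databases spread roughly evenly and
# the lookup is deterministic. Cluster names are invented for illustration.
import hashlib

CLUSTERS = ["sqlcluster01", "sqlcluster02", "sqlcluster03"]

def place_database(db_name: str) -> str:
    """Deterministically map a database name to a cluster."""
    digest = hashlib.md5(db_name.lower().encode()).hexdigest()
    return CLUSTERS[int(digest, 16) % len(CLUSTERS)]

print(place_database("contoso_shop"))
```

A static hash complicates moving an individual hot database later, so many hosts instead record the database-to-cluster mapping in a provisioning directory and use a rule like this only for the initial assignment.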
There really is no good, cost-effective way of building a scalable, highly available, and dense SQL-cluster platform. Each cluster topology has its disadvantages combined with the fact that hosts have no control over customer application design. The ideal SQL cluster would allow hosts to do load distribution along with database mirroring without having to deal with the problems of transaction collisions, while maintaining a very dense number of databases across the cluster(s).
Services providers must continually rise to the challenge to meet the growing customer demand for new sets of services that offer greater value to small- and medium-size businesses, developers, consumers, and designers. They must provide a high degree of service to maintain customer loyalty and achieve their business goals. Services providers seek ways to deliver the "silver-bullet" Web platform that helps reduce total cost of ownership (TCO) and deliver customer satisfaction effectively and efficiently.
When referring to a Web platform solution comprising Web serving, data storage, and database storage, a bottleneck inevitably occurs within one of these components. A solution is only as strong as its weakest link. Fundamentally, every component of a solution must scale outwardly and upwardly with redundancy and cost efficiency.
This article is not about what technologies (open-source or closed-source) can fill the gap or have already done so in some areas, but instead about the underlying architectural problems that are not "top of mind" when technologies are being developed. These days, it isn't sufficient to create technology for the sake of technology in an attempt to fill a niche. Only through strategy, planning, and development of large-scale solutions will we be able to support large-scale requirements. The world is growing, needs are mounting, and it only makes sense if infrastructures are, too.
About the authors
Shaun Hirschman has worked in the hosting space for over eight years and is very in-tune with the inherent business and technical challenges. Prior to joining Microsoft in 2006, Shaun started out in the hosting industry as technical support and rose to a senior Windows systems administrator position. He has considerable technical knowledge and experience with Microsoft-based systems. He mostly enjoys playing with the bleeding edge of software and hardware technology in an attempt to appease his thirst for knowledge.
Mathew Baldwin has been working within the hosted-services space for the last 10 years. In the past, he has held roles as architect, senior systems engineer, regular troubleshooter, and consultant for Microsoft-based platforms with multiple hosted-services companies. He was instrumental in bringing to market the first highly available, clustered Windows 2003 shared-hosting platform and has worked closely with Microsoft on a variety of initiatives. He currently works as a senior architect for Affinity Internet.
Tito Leverette has worked in the hosting space for more than eight years. He started his career with Interland, Inc. (now Web.com). Before coming to Microsoft, Tito was an architect on the Web platform team for SunTrust, Inc., the ninth largest bank in the U.S. Tito has years of experience in designing, building, and deploying Web infrastructures for large enterprises, as well as hosting clients. He is a graduate of the Georgia Institute of Technology (Georgia Tech).
Michael Johnson spent more than seven years employed in the Web services provider space, where he developed and architected highly scalable applications and solutions on Microsoft products. He is currently working as a Web architect evangelist for Microsoft.
This article was published in the Architecture Journal, a print and online publication produced by Microsoft. For more articles from this publication, please visit the Architecture Journal Web site.