Scaling Visual Studio Team Foundation Server 2010
Mario Rodriguez, Senior Program Manager, Microsoft Corporation
Built to Scale
Visual Studio Team Foundation Server 2010 introduces an evolved architecture that was built to scale to the most challenging teams and scenarios. This white paper describes a series of best practices that you can apply to help your deployment reach its full potential.
In software engineering, scalability is a desired property of a system, indicating its ability either to handle growing amounts of work gracefully or to be readily enlarged. A system whose performance improves proportionally to the capacity added when hardware is added is said to be a scalable system.
It is important to reiterate that capacity and performance are the main variables of a scalable system. The best practices outlined in this white paper walk you through scenarios where capacity increases and where, as an administrator, you are looking to add resources to the system in order to either improve or maintain the performance of the application.
The methods for adding more resources to a system fall into two categories: scaling up (vertically) and scaling out (horizontally). When you are “scaling up,” you are adding resources to a single node in the system, typically more CPU, memory, or disk space. The “scale out” model takes a different approach: instead of enlarging an existing node, you add a new node to the system to distribute load and achieve greater capacity. An example would be adding a new computer as an application-tier server in order to distribute the user request load. As computer prices drop and performance continues to increase, low-cost "commodity" systems can easily be leveraged in a grid or cluster to achieve large amounts of computing power.
Team Foundation Server 2010 and the best practices outlined in this white paper utilize both models in order to achieve the deployment's full potential.
Before we can start discussing scalability best practices, it is important to outline the elements that compose the evolved architecture of Team Foundation Server 2010. This section sets the groundwork by defining several of the new concepts introduced to support a true scale-out, multi-tenant system.
Team Project Collections
The first important concept to understand is what we call team project collections (TPCs). A team project collection is nothing more than a group of tightly related team projects. It helps to think of a collection as corresponding to a product, codebase, or application suite. For example, if your company makes four unique products that have almost no code sharing between them, it would be practical to create four team project collections. If, on the other hand, your company has several products that compose a solution or product suite with high code reuse and framework sharing, then you would create only one team project collection.
The concept is represented in the back end as a single SQL Server database, which allows us to provide complete encapsulation, greater mobility, and improved administration. Because of this encapsulation, team project collections are the key pillar of multi-tenancy, and hence enable server consolidation by allowing multiple groups within an organization to share the same deployment and infrastructure. For more information about the team project collection concept, see Organizing Your Server with Team Project Collections and Team Foundation Server 2010 Key Concepts.
The introduction of team project collections has brought with it changes to the organization of the Team Foundation Server databases. The most important change is the creation of a configuration database. This “root” database contains a centralized representation of our configuration data, including the list of all team project collections, identities, resources, and global application settings. Customers should treat this database as the core of the Team Foundation Server farm and hence configure it for high availability.
Background Job Agent
The Background Job Agent is an executable installed on the application-tier server and is responsible for processing all of the background tasks generated by the server components. Examples of these tasks include the warehouse adapters pumping data into the data warehouse, long-running asynchronous operations such as synchronizing identities from Active Directory, and maintenance tasks such as installing an update to team project collections. The agent contacts the configuration database, asks for tasks to execute from a queue, and processes those requests by leveraging its plug-ins. For more information about the agent, see Team Foundation Background Job Agent.
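The agent's poll-and-dispatch pattern can be sketched as follows. This is a minimal illustration, not the actual Team Foundation Server implementation: the job names, payloads, and handler registry are all hypothetical.

```python
import queue

# Hypothetical job queue standing in for the configuration database's job table.
job_queue = queue.Queue()
job_queue.put({"type": "sync_identities", "payload": "DOMAIN\\Developers"})
job_queue.put({"type": "warehouse_sync", "payload": "DefaultCollection"})

# Plug-in registry: each job type maps to a handler, mirroring how the
# Background Job Agent delegates work to its plug-ins.
handlers = {
    "sync_identities": lambda p: f"synchronized identities for {p}",
    "warehouse_sync": lambda p: f"pushed {p} data to the warehouse",
}

def process_pending_jobs():
    """Drain the queue, dispatching each job to its registered plug-in."""
    results = []
    while not job_queue.empty():
        job = job_queue.get()
        handler = handlers.get(job["type"])
        if handler is not None:
            results.append(handler(job["payload"]))
    return results

print(process_pending_jobs())
```

The key design point this mirrors is that the queue lives in the configuration database, so any application-tier server's agent can pick up pending work.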
Evolved Architecture Visualized
It is often said that a picture is worth a thousand words, so I have included an image outlining the evolved architecture and its key components. The image shows a scale-out configuration known as a Team Foundation Server farm.
#1 Start Simple
When planning your deployment to scale, it is necessary to start simple, even if you feel that business priorities and adoption plans are constantly changing. Don’t complicate your initial deployment with large-scale plans; instead, build the minimum configuration that meets your usage load and expand from there, one step at a time.
Starting simple means that you should focus only on three questions:
How many users do I need to support with this deployment within the first year?
Am I deploying SharePoint Products and SQL Server Analysis Services on the same computer or utilizing already existing instances?
What machine specs match my requirements?
During this exercise, your most important decision is to accurately assess the number of users who will be using the server. The second question reminds you that Team Foundation Server has integration points with other server products, and sharing resources with those other servers can significantly impact overall performance. After determining the user load, reference the section Recommended Hardware for Team Foundation Server Deployment in this document. You can also reference System Requirements for Team Foundation Server. With this data, you now have your initial hardware specs and are ready for a successful deployment.
On a related topic, unless you are a small/medium team with fewer than 100 total users, our recommendation will always be to install a dual-server configuration so that you can easily scale out your nodes (application tier, data tier) when needed.
#2 Optimize I/O Bandwidth
Team Foundation Server, like other application lifecycle management products, can be very I/O intensive due to the load generated by subsystems such as version control, work item tracking, and build generation. For a subset of our customers, the most common performance issue, and the biggest impediment to achieving higher scale, is directly related to disk I/O bandwidth. Small customers that deploy on machines with specs similar to development boxes are not affected initially, but when you start scaling to more than 100 users and utilizing more features, disk I/O can quickly become a problem.
Our best practice is for administrators to monitor the disk I/O of both the application-tier and data-tier servers, the latter being the most important, and to implement mitigation plans when disk I/O becomes a limiting factor.
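As a rough sketch of what that monitoring could look like, the following classifies sampled disk latency values (for example, the "Avg. Disk sec/Transfer" counter from Performance Monitor) against thresholds. The sample values and threshold numbers here are illustrative assumptions, not official product guidance.

```python
# Illustrative thresholds in seconds; sustained latency above ~20 ms is
# commonly treated as a sign the disk subsystem is becoming a bottleneck.
HEALTHY = 0.010
CRITICAL = 0.020

def classify_disk_latency(samples):
    """Classify averaged latency samples taken from the data-tier server."""
    avg = sum(samples) / len(samples)
    if avg <= HEALTHY:
        return "healthy"
    if avg <= CRITICAL:
        return "watch"
    return "bottleneck: plan a disk array, NAS, or SAN upgrade"

# Hypothetical samples collected every 15 seconds during peak load.
print(classify_disk_latency([0.008, 0.031, 0.045, 0.038]))
```

Classifying averages over a sustained window, rather than alerting on single spikes, avoids overreacting to transient load such as a large check-in.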
There are three technologies you can deploy in order to increase your capacity: Disk Arrays, Network-Attached Storage (NAS) and Storage Area Network (SAN). Your IT infrastructure and budget will dictate which of these technologies would be best for your company.
#3 Use a Load Balancer or Application Delivery device
Our third best practice focuses on increasing the scale and reliability of the application tier. Loosely defined, a Team Foundation Server application tier is the node that hosts the web application for the Team Foundation web services. In Team Foundation Server 2010, you can install an additional application-tier server and have it join an existing server deployment or create a new one altogether by also deploying a configuration database with it. Scaling out the application tier refers to the former option where you are configuring a new server by installing the feature in that machine without deploying any database components.
We recommend that customers configure a load balancer when adding a second or third application-tier server to the deployment. This load balancer would be configured to sit in front of these application tiers and be in charge of effectively balancing the load across them. At the core, there are two solutions you can choose from: software (Network Load Balancing is a good example of this) and hardware.
Hardware solutions, although usually more expensive, provide the most features, configuration flexibility, and best performance. These hardware devices are known in today’s market as Application Delivery devices due to their versatility as they do much more than balance load requests (e.g. routing, HTTPS, and content acceleration).
In summary, the benefits of having a load balancer are:
High-availability solution by routing requests to the active/hot nodes
Automatically balances load across nodes so users don’t have to selectively connect to individual application-tier servers
Allows seamless scaling of capacity up or down by provisioning new nodes and adding them to, or removing them from, the load balancer’s configured node list
#4 Leverage Team Foundation Server Proxy
Within the internal deployment of Team Foundation Server at Microsoft, the single most-called method is the file download. This is expected, as we have thousands of developers coding features on a daily basis and hundreds of build machines delivering builds to test.
All of this traffic tends to overwhelm the application tier, impacting its ability to handle user requests in a responsive manner. The requested files are usually in the application-tier cache but not loaded in memory (they would be too large), so they have to be constantly fetched, compressed, and transferred to end users.
This scenario is where Team Foundation Server Proxy can be leveraged to reduce application-tier load and effectively solve performance issues. Team Foundation Server Proxy caches these versioned files and is optimized for exactly this function. With low hardware requirements (except for disk speed) and ease of deployment and configuration (a quick install and registration), it is the perfect resource for keeping the application tier focused on handling requests rather than delivering content.
The two most popular deployment scenarios are Offshore and Build Labs. For the Offshore scenario, you should have one or more servers running Team Foundation Server Proxy on the local network of the offshore team. The following illustration depicts that deployment.
In the Build Lab scenario, your goal is to transfer load from the application tier to a proxy machine, all within the same local area network. In this deployment, you set up one or more proxies and have all download-intensive and read-intensive applications register a proxy for their use. The following illustration depicts that deployment.
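The caching idea behind both scenarios can be sketched as follows. This is a simplified illustration of proxy-style caching, not the proxy's actual implementation; the fetch callback and cache keying are assumptions for the sketch.

```python
class VersionedFileCache:
    """Sketch of proxy-style caching: serve version-control files locally,
    contacting the application tier only on a cache miss."""
    def __init__(self, fetch_from_application_tier):
        self._fetch = fetch_from_application_tier
        self._cache = {}
        self.misses = 0

    def download(self, path, version):
        # A specific version of a file never changes, so (path, version)
        # is a safe cache key that never needs invalidation.
        key = (path, version)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._fetch(path, version)
        return self._cache[key]

# Simulated application-tier fetch (hypothetical content).
proxy = VersionedFileCache(lambda path, version: f"<contents of {path}@{version}>")
for _ in range(3):
    proxy.download("$/Suite/Framework/Core.cs", 42)
print(proxy.misses)  # only the first request reached the application tier
```

This is why a build lab full of machines syncing the same sources benefits so much: after the first download, every subsequent request is served from the proxy's local disk.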
For more information about how to install Team Foundation Server Proxy, see Scenario: Installing Team Foundation Server Proxy.
#5 Utilize the new Scale-Out Option
You have successfully deployed Team Foundation Server, and adoption has increased very sharply in the last three months. As this is occurring, user complaints about performance are increasing in frequency, and it is time to act.
In previous releases, the decision was very easy: scale up. That meant either buying higher-end machines or improving the memory and CPU specs of the current deployment. In Team Foundation Server 2010, the decision changes as you now have the new option to scale out.
There are two core questions for scaling out:
Which variables should be factored into the decision?
What is the best topology and machine specs?
Technically, you should scale out when 1) you need to increase throughput and it is more cost-effective to distribute the load by moving resources (e.g., team project collections) to other nodes, or 2) you need to add more capacity to a deployment configuration that cannot be scaled up.
One of the most important elements is the load distribution of your team project collections. As part of this decision, you are trying to optimize around three axes: CPU, memory, and disk I/O (we are assuming disk space is not an issue). Collections with a high number of requests will need more CPU, while collections with “normal” request levels but large data sizes will need more memory.
As you collect the load breakdown for each of the collections, try to match their capacity needs to the adequate hardware configuration. We recommend the use of our administration tools together with Windows diagnostic tools (e.g. performance counters) to effectively reach those decisions.
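One simple way to turn that load breakdown into a placement decision is a greedy distribution that always assigns the next-heaviest collection to the least-loaded node. The collection names and single-number load scores below are hypothetical; in practice you would derive scores from performance counter data per collection.

```python
def distribute_collections(collections, node_count):
    """Greedily assign (name, load score) collections to the least-loaded node."""
    nodes = [{"total": 0.0, "collections": []} for _ in range(node_count)]
    # Place the heaviest collections first so they anchor separate nodes.
    for name, load in sorted(collections, key=lambda c: c[1], reverse=True):
        target = min(nodes, key=lambda n: n["total"])
        target["total"] += load
        target["collections"].append(name)
    return nodes

# Hypothetical load scores (e.g. a weighted mix of CPU and disk I/O counters).
collections = [("Suite", 70), ("Tools", 40), ("Website", 30), ("Labs", 10)]
for node in distribute_collections(collections, 2):
    print(node["collections"], node["total"])
```

A single score is a deliberate simplification; if one collection is CPU-bound and another memory-bound, you would balance each axis separately rather than collapsing them into one number.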
Grant Holiday, a program manager in our team, has a good blog post detailing how to gather and analyze this data. Reference it if you are not familiar with our tools.
In a scale-out deployment, the rule of thumb is to keep the machine specs as close to “commodity hardware” as possible. Our recommended topology is a set of application-tier and data-tier servers with the following specs:
Application Tier: 1 processor, Dual Core @2.13 GHz and 4GB RAM
Data Tier: 1 processor, Quad Core @2.33 GHz and 8GB RAM
Each of these machines will be able to service the load of 2,000-plus users.
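Given that each such pair services 2,000-plus users, a first-pass capacity estimate is simple arithmetic. The sketch below only restates that figure; real planning should also account for the limits discussed in the next section.

```python
import math

USERS_PER_PAIR = 2000  # the 2,000-plus figure from the specs above

def pairs_needed(users):
    """Rough number of application-tier/data-tier pairs for a user count."""
    return max(1, math.ceil(users / USERS_PER_PAIR))

print(pairs_needed(4500))  # → 3
```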
#6 Stay within Team Foundation Server limits
Although Team Foundation Server is highly scalable, there are limits you should be aware of, since they directly impact scale-up and scale-out decisions. Most of the limits are not enforced by the product; rather, they are recommendations from the product team intended to maintain a certain level of performance. Our planning guidance outlines other areas of the product that are important during the planning phases. The following limits should be closely monitored by the administrator, as they have the potential to incur the most impact on performance.
Team Projects
The team project limit for a deployment of Team Foundation Server can be expressed as a limit on the size of the work item tracking metadata cache. As the number of team projects grows, the main resources that influence performance are available memory in the client computers and the network speed between the client and server. A server that handles many team projects that were created over time does not show significant performance degradation when it updates client computers that connect to that server regularly.
However, client computers that connect to a server for the first time or for the first time after many team projects have been created require the server to download more data and use a large amount of memory to process the data. For a client computer that has little available RAM or a slow connection to the server, the result can be a one-time delay that can last for minutes as Team Explorer tries to expand the Team Project node. The server can also become a bottleneck during the initial connection period if it must add a large number of new client computers at the same time.
Team Project Collections
The team project collection limits apply to the data-tier component of Team Foundation Server, and they depend on the memory allocated to the SQL Server instance that holds the collections. As the number of team project collections grows within one instance of SQL Server, more memory is used for the plan cache of the stored procedures that handle user requests. The plan cache has a limit of 3.4GB for the first 4GB of RAM, plus 10% of each additional 1GB thereafter. Once this limit is reached, plans are evicted from the cache, and CPU utilization increases as stored procedures are recompiled. Customers will experience slower performance on many operations, while the administrator will see CPU utilization reaching 100% for prolonged periods of time.
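Applying the rule as stated above, the plan cache ceiling at a few RAM levels works out as follows. This only restates the arithmetic from the rule; the exact cap computed by SQL Server varies by version and configuration.

```python
def plan_cache_cap_gb(sql_ram_gb):
    """Plan cache ceiling per the stated rule: 3.4 GB for the first 4 GB
    of RAM, plus 10% of each additional GB thereafter."""
    return 3.4 + 0.10 * max(0, sql_ram_gb - 4)

for ram_gb in (8, 16, 32, 64):
    print(f"{ram_gb} GB RAM -> ~{plan_cache_cap_gb(ram_gb):.1f} GB plan cache cap")
```

Note how steeply the curve flattens: going from 8 GB to 64 GB of RAM only raises the cap from roughly 3.8 GB to 9.4 GB, which is why adding RAM alone has diminishing returns for very large numbers of active collections.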
We should note that the limit really targets active collections: those that are utilized daily by users. Collections that are not being used will have few or no plans in the cache and will not consume significant resources. This is why a SQL Server instance can host thousands of team project collections without degrading performance, as long as the recommended number of active collections is not exceeded.
Identifying the plan cache consumed by Team Foundation Server is a very technical and in-depth operation, so administrators are better served by following the ranges outlined below and monitoring performance to determine their own deployment’s limit. The ranges provide guidance correlated with the number of Team Foundation Server features in use: the more features teams leverage, the fewer active collections each instance of SQL Server can support.
RAM for SQL Server    Active Team Project Collections per SQL Server
…                     30 - 75
…                     35 - 90
…                     50 - 125
…                     75 - 195
Always keep in mind that server limits are important and should be treated with as much priority as machine configuration or branch management strategies.
Recommended Hardware for Team Foundation Server Deployment
Every Team Foundation Server deployment will have its own unique properties that help determine the physical hardware and software requirements. Team Foundation Server itself has some minimum hardware requirements. It also has limits on the number of team projects, the number of work items, the size of the version control repository, and other factors that you must consider when choosing the hardware for your specific business needs.
The guidelines presented here are exactly that: guidelines, not requirements. They are appropriate for most Team Foundation Server deployments and have been tested and validated specifically in environments that simulate the load profiles of our customers. Note: The server should have a reliable network connection with a minimum bandwidth of 1-10 Mbps (the range varies with team size) and a maximum latency of 350 ms.
Configuration                              Disk                                   Active Team Project Collections per SQL Server
1P Single Core @2.13 GHz                   1x7.2K RPM (125 GB)                    1 - 5
1P Dual Core @2.13 GHz                     1x10K RPM (300 GB)                     20 - 40
AT: 1P Dual Core Intel Xeon @2.13 GHz      AT: 1x7.2K RPM (500 GB)                20 - 60
DT: 1P Quad Core Intel Xeon @2.33 GHz      DT: 10K RPM SAS Disk Arrays (2 TB)
AT: 1P Dual Core Intel Xeon @2.13 GHz      AT: 1x7.2K RPM (500 GB)                30 - 75
DT: 2P Quad Core Intel Xeon @2.33 GHz      DT: 10K RPM SAS Disk Arrays (3 TB)
The theme of this set of best practices is to utilize product features and physical resources to direct the load into the most appropriate and effective hardware. The end goal is to allow growth of both users and data without impacting the overall performance of the system as perceived by the end users.
Managing the scale of complex applications like Team Foundation Server with its many integration points and dependencies is always a challenge. We hope that this white paper has given you the necessary information to tackle that task. We remain committed to scaling the server to thousands or millions of users and continuing our legacy of Built to Scale!