Considerations for Designing Distributed Systems
Summary: This article discusses the distributed-systems momentum, spotlighting its remaining challenges and considerations.
We’re at an exciting crossroads for distributed computing. Historically considered something of a black art, distributed computing is now being brought to the masses. Emerging programming models empower mere mortals to develop systems that are not only distributed, but concurrent—allowing for the ability to both scale up to use all the cores in a single machine and out across a farm.
The emergence of these programming models is happening in tandem with the entry of pay-as-you-go cloud offerings for services, compute, and storage. No longer do you need to make significant up-front investments in physical hardware, a specialized environment in which to house it, or a full-time staff who will maintain it. The promise of cloud computing is that anyone with an idea, a credit card, and some coding skills has the opportunity to develop scalable distributed systems.
With the major barriers to entry being lowered, there will without a doubt be a large influx of individuals and corporations that will either introduce new offerings or expand on existing ones—be it with their own first-party services, adding value on top of the services of others in new composite services, or distributed systems that use both categories of services as building blocks.
In the recent past, “distributed systems” in practice were often distributed across a number of nodes within a single enterprise or a close group of business partners. As we move to this new model, they become Distributed with a capital D. They may be distributed across tenants on the same host, across multiple external hosts, or across a mix of internal and externally hosted services.
One of the much-heralded opportunities in this brave new world is the creation of composite commercial services. Specifically, the ability for an individual or a corporation to add incremental value to third-party services and then resell those compositions is cited as both an accelerator for innovation and a revenue generation opportunity where individuals can participate at various levels.
This is indeed a very powerful opportunity, but looking below the surface, there are a number of considerations for architecting this next generation of systems. This article aims to highlight some of these considerations and initiate discussion in the community.
When talking about composite services, it’s always easier if you have an example to reference. In this article, we’ll refer back to a scenario involving a simple composite service that includes four parties: Duwamish Delivery, Contoso Mobile, Fabrikam, and AdventureWorks (see Figure 1).
Figure 1. Simple composite-service scenario
Duwamish Delivery is an international company that delivers packages. It offers a metered Web service that allows its consumers to determine the current status of a package in transit.
Contoso Mobile is a mobile-phone company that provides a metered service that allows a service consumer to send text messages to mobile phones.
Fabrikam provides a composite service that consumes both the Duwamish Delivery Service and the Contoso Mobile Service. The composite service is a long-running workflow that monitors the status of a package, and the day before the package is due to be delivered, sends a text-message notification to the recipient.
AdventureWorks is a retail Web site that sells bicycles and related equipment. Fabrikam’s service is consumed by the AdventureWorks ordering system, and is called each time a new order is shipped.
A delta between supply and demand, whether the product is a gadget, a toy, or a ticket, is not uncommon. Oftentimes, there are multiple vendors of these products with varying supply and price.
A. Datum Corporation sees this as an opportunity for a composite service and will build a product locator service that consumes first-party services from popular product-related service providers such as Amazon, eBay, Yahoo!, CNET, and others (see Figure 2).
Figure 2. Popular product-related service providers
The service allows consumers to specify a product name or UPC code and returns a consolidated list of matching products, prices, and availability from the data provided by the first-party services.
In addition, A. Datum Corporation’s service allows consumers to specify both a timeframe and a price range. The composite service monitors the price of the specified product during the timeframe identified, and when the price and/or availability changes within that timeframe, a notification is sent to the consumer.
As the supply of high-demand items can deplete quickly, this service should interact with consumers and provide these notifications across different touch points (e-mail, SMS text message, and so on). A. Datum will utilize its own e-mail servers, but for text messaging will leverage Contoso Mobile’s text-messaging service.
Referring to Customers in a World of Composite Services
If composite services reach their potential, an interesting tree of customers will emerge. Services will be composited into new composite services, which themselves will be composited. When identifying the customers of a service, one must have a flexible means of referencing where a customer sits in the hierarchy and be able to reflect their position as a customer in relation to the various services in the tree. A common approach is to refer to customers with a designation of Cx, where x identifies the distance of the customer from the service in question.
We will use this approach when referring to the participants in our reference scenarios.
In our first reference scenario, Fabrikam is a C1 of Duwamish Delivery and Contoso Mobile. AdventureWorks is a C1 of Fabrikam. AdventureWorks would be seen as a C2 of Duwamish Delivery and Contoso Mobile.
In our second scenario, A. Datum would be the C1 of Amazon, eBay, Yahoo!, CNET, and others. Consumers of the composite service would be seen as a C1 of A. Datum and a C2 of the first-party service providers.
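The Cx designation can be computed mechanically if the consumption relationships are recorded. The following is a minimal sketch, assuming a hypothetical in-memory map (`CONSUMERS`) from each service to its direct consumers; the function name `customer_level` is illustrative and not part of any existing API.

```python
from collections import deque

# Hypothetical map: each service -> the parties that consume it directly (its C1s).
CONSUMERS = {
    "Duwamish Delivery": ["Fabrikam"],
    "Contoso Mobile": ["Fabrikam"],
    "Fabrikam": ["AdventureWorks"],
    "AdventureWorks": [],
}


def customer_level(service, customer, consumers=CONSUMERS):
    """Return x such that `customer` is a Cx of `service`, or None if it is not a customer."""
    queue = deque([(service, 0)])
    seen = {service}
    while queue:
        node, depth = queue.popleft()
        if node == customer and depth > 0:
            return depth
        for consumer in consumers.get(node, []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append((consumer, depth + 1))
    return None
```

With the first reference scenario loaded, `customer_level("Duwamish Delivery", "AdventureWorks")` evaluates to 2, matching the C2 relationship described above.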
Now that we have a lingua franca for discussing the customers in the composite-services n-level hierarchy, let’s look at considerations for the services themselves.
More Services Will Become Long-Running and Stateful
It is common that services are designed as stateless, and typically invoked with either a request/response or a one-way pattern. As you move to a world of cloud services and distributed systems, you’ll see more and more services become long running and stateful, with “services” oftentimes now being an interface to a workflow or business process.
Both of our reference scenarios reflect this. In the first scenario, a package-tracking service monitors the state of a package until such point as it’s delivered. Depending on the delivery option chosen, this process runs for a minimum of a day, but could run for multiple days, even weeks, as the package makes its way to its destination. In our second reference scenario, the elapsed time could be even longer: weeks, months, even years.
When moving the focus from short-lived, stateless service interactions to these stateful, longer running services, there are a number of important things to keep in mind when designing your systems.
Implementing a Publish-and-Subscribe Mechanism to Manage Resources and Costs Effectively
One of these considerations is the importance of the ability to support subscriptions and publications.
Many long-running, stateful services need to relay information to their service consumers at some point in their lifetime. You’ll want to design your services and systems to minimize frequent polling of your service, which unnecessarily places a higher load on your server(s) and increases your bandwidth costs.
By allowing consumers to submit a subscription for notifications, you can publish acknowledgments, status messages, updates, or requests for action as appropriate. In addition to bandwidth considerations, as you start building composites from metered services, the costs associated with not implementing a publish-and-subscribe mechanism rise.
In reference scenario 2, the service is querying back-end ecommerce sites for the availability of an item. If a consumer queries this service for a DVD that won’t be released until next year, it’s a waste of resources for the consumer to query this service (and consequently the service querying the back-end providers) regularly for the next 365 days.
Scenario 2 also highlights that if your service communicates via mechanisms affiliated with consumer devices (that is, SMS text-message or voice applications), the value of publishing only to your C1 is minimal; you’ll want to allow your C1 customers to subscribe to notifications not just for themselves but also for their customers.
Designing your services to support publications and subscriptions is important in developing long-running services.
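To make the idea concrete, here is a minimal in-memory sketch of a publish-and-subscribe hub. The class name `NotificationHub` and the topic string are assumptions made for illustration; a production service would persist subscriptions and deliver over a network transport rather than calling local functions.

```python
from collections import defaultdict


class NotificationHub:
    """In-memory publish-and-subscribe sketch; not a durable or networked implementation."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        """Register a callable that receives every message published to `topic`."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver `message` to each subscriber of `topic`; return the delivery count."""
        callbacks = list(self._subscribers.get(topic, []))
        for callback in callbacks:
            callback(message)
        return len(callbacks)


hub = NotificationHub()
received = []
# Hypothetical topic name for the DVD example in reference scenario 2.
hub.subscribe("product:dvd-123:availability", received.append)
# The provider pushes once when state changes, instead of being polled every day.
hub.publish("product:dvd-123:availability", {"available": True, "price": 19.99})
```

The consumer in this sketch makes one subscription call and receives one notification, rather than issuing a metered query each day until the product appears.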
Identifying How You Will Communicate to Consumers
One of the great things about moving processes to services in the cloud is that the processes can run 24 hours a day, 7 days a week, for years at a time. They can be constantly monitoring, interacting, and participating on your behalf.
Eventually, the service will need to interact with a consumer, be it to return an immediate response, an update, or a request to interact with the underlying workflow of a service.
The reality is that while your services may be running 24/7 in the same place, your consumer may not be. After you’ve created a publish-and-subscribe mechanism, you should consider how and when to communicate. Will you interact only when an end consumer’s laptop is open? Will you send alerts via e-mail, SMS text message, RSS feed, a voice interface?
Price, Convenience, Control: Being Aware of Your Audience’s Sensitivities
When designing your service, you should do so with the understanding that when you transition to the pay-as-you-go cloud, your customers may range from a couple of guys in a garage to the upper echelons of the Fortune 500. The former may be interested in your service, but priced out of using it due to the associated costs for communication services like SMS or voice interfaces. The latter group may not want to pay for your premium services, and instead utilize their own vendors. And there is a large group of organizations in between, which would be happy to pay for the convenience.
When designing your solutions, be sure to consider the broad potential audience for your offering. If you provide basic and premium versions of your service, you’ll be able to readily adapt to audiences’ sensitivity, be it price, convenience, or control.
White-Labeling Communications: Identifying Who Will Be Communicating
Per Wikipedia, “A white label product or service is a product or service produced by one company (the producer) that other companies (the marketers) rebrand to make it appear as if they made it.”
While perhaps the writer of the entry did not have Web services in mind when writing the article, it is definitely applicable. In a world of composite services, your service consumer may be your C1 but the end consumer is really a C9 at the end of a long chain of service compositions. For those that do choose to have your service perform communications on their behalf, white labeling should be a consideration.
In our first reference scenario, the service providing the communication mechanism is Contoso Mobile. That service is consumed by Fabrikam, which offers a composite service to the AdventureWorks Web site. Should Contoso Mobile, Fabrikam, AdventureWorks, or a combination brand the messaging?
When communicating to the end consumer, you will need to decide whether to architect your service to include branding or other customization by your C1. This decision initiates at the communication service provider, and if offered, is subsequently provided as the service is consumed and composited downstream.
Is the “Hotmail Model” Appropriate for Your Service?
When looking at whether or not to white label a service, you may want to consider a hybrid model where you allow for some customization by your C1 but still include your own branding in the outbound communications. In our first reference scenario, this could take the form of each text message ending with “Delivered by Contoso Mobile.”
I refer to this as the “Hotmail model.” When the free e-mail service Hotmail was introduced, it included a message at the bottom of each e-mail that identified that the message was sent via Hotmail. There is definitely a category of customers willing to allow this messaging for a discounted version of a premium service.
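A hybrid branding rule of this kind is simple to express in code. The sketch below assumes a hypothetical `white_label` flag on the C1’s account and treats 160 characters (the single-segment length for GSM 7-bit text) as a hard cap; both the flag and the truncation policy are illustrative choices, not a description of any real carrier API.

```python
SMS_LIMIT = 160  # single-segment length for GSM 7-bit text, used here as a hard cap
FOOTER = " Delivered by Contoso Mobile."


def compose_sms(body, white_label):
    """Return the outbound text; non-white-label messages carry the provider footer."""
    if white_label:
        return body[:SMS_LIMIT]
    room = SMS_LIMIT - len(FOOTER)  # reserve space so the footer is never cut off
    return body[:room] + FOOTER
```

A premium (white-label) C1 gets the full message length, while a discounted C1 gives up a few characters of each message to the provider’s branding.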
Communicating When Appropriate by Using Business Rules
After you’ve decided how these services will communicate and who will have control over the messaging, you need to determine when the communication will occur and which communication mode is appropriate at a given point of time.
Interaction no longer ends when a desktop client or a Web browser is closed, so you’ll want to allow your C1 to specify which modes of communication should be used and when. This can take the form of a simple rule or a set of nested rules that define an escalation path, such as:
communicate via e-mail
if no response for 15 minutes
    communicate via SMS
    if no response for 15 minutes
        communicate via voice
Rules for communication are important. Services run in data centers, and there’s no guarantee that those data centers are in the same time zone as the consumer. In our global economy, servers could be on the other side of the world, and you’ll want a set of rules that identifies whether it is never appropriate to initiate a phone call at 3 a.m. or whether it is permitted under certain circumstances.
If you provide communication mechanisms as part of your service, it will be important to design it such that your C1 can specify business rules that identify the appropriate conditions.
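The escalation path and the time-of-day concern can be combined into one small rule evaluator. This is a sketch under stated assumptions: the `Step` fields, the fixed UTC offset for the consumer, and the fallback to an earlier permitted channel are all hypothetical design choices rather than anything prescribed by this article.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta, timezone


@dataclass
class Step:
    channel: str                       # "email", "sms", or "voice"
    after_minutes: int                 # minutes without a response before this step is due
    allowed_from: time = time(0, 0)    # window is evaluated in the consumer's local time
    allowed_until: time = time(23, 59)


def choose_channel(steps, minutes_without_response, now_utc, consumer_utc_offset_hours):
    """Return the most escalated channel that is due and permitted now, or None."""
    consumer_tz = timezone(timedelta(hours=consumer_utc_offset_hours))
    local = now_utc.astimezone(consumer_tz).time()
    due = [s for s in steps if s.after_minutes <= minutes_without_response]
    # Assumption: if the latest step is outside its window, fall back to an earlier one.
    for step in sorted(due, key=lambda s: s.after_minutes, reverse=True):
        if step.allowed_from <= local <= step.allowed_until:
            return step.channel
    return None


# The escalation path from the text, with voice calls limited to 08:00-21:00 local time.
RULES = [
    Step("email", 0),
    Step("sms", 15),
    Step("voice", 30, time(8, 0), time(21, 0)),
]
```

Because the window is checked against the consumer’s clock rather than the data center’s, a server at 11:00 UTC will not place a voice call to a consumer for whom it is 3 a.m.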
Communicating System Events
As stated previously, long-running, stateful services could run for days, weeks, months, even years. This introduces some interesting considerations, specifically as it relates to communicating and coordinating system events.
When developing your service, consider how you will communicate information down the chain to the consumers of your service and up the chain to the services you consume.
In a scenario where you have a chain of five services, and the third service in the chain needs to shut down (either temporarily or permanently), it should communicate to its consumers that it will be offline for a period of time. You should also provide a mechanism for a consumer to request that you delay the shutdown of your service and notify them of your response.
Similarly, your service should communicate to the services it consumes. Your service should first inspect the state to determine if there is any activity currently ongoing with the service(s) that it consumes and, if so, communicate that it will be shut down and delay the relay of messages for a set period of time.
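One way to picture this exchange is as a small object that announces a planned shutdown and answers delay requests. The sketch below is illustrative only: `PlannedShutdown`, its `max_delay` policy, and the log tuple layout are hypothetical names and choices, and real notices would travel over whatever messaging channel the services share.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class PlannedShutdown:
    starts_at: datetime
    max_delay: timedelta                      # hypothetical policy: total postponement allowed
    log: list = field(default_factory=list)   # (party, event, shutdown time) records

    def announce(self, consumers):
        """Tell each downstream consumer when the service will go offline."""
        for consumer in consumers:
            self.log.append((consumer, "shutdown_notice", self.starts_at))

    def request_delay(self, consumer, delay):
        """Grant the delay if it fits the remaining allowance; record the response either way."""
        granted = delay <= self.max_delay
        if granted:
            self.starts_at += delay
            self.max_delay -= delay
        event = "delay_granted" if granted else "delay_denied"
        self.log.append((consumer, event, self.starts_at))
        return granted
```

The same record structure could be used for notices sent up the chain to the services being consumed.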
Design a Mechanism to Retrieve State Easily
There will be times when it is appropriate for an application to poll your service to retrieve status. When designing your service, you’ll want to provide an endpoint that serializes state relevant to a customer.
Two formats to consider for this are RSS and ATOM feeds. These provide a number of key benefits, as specified in Table 1.
Table 1. Benefits of RSS and ATOM feeds
Directly consumable by existing clients
A number of clients already have a means for consuming RSS or ATOM feeds. On the Windows platform, common applications including Microsoft Office Outlook, Internet Explorer, Vista Sidebar, and Sideshow have support to render RSS and ATOM feeds.
Both RSS and ATOM have elements to identify categories, which can be used to relay information relevant to both software and UI-based consumers. Using this context, these consumers can make assumptions on how to best utilize the state payload contained within the feed.
Easily filterable and aggregatable
State could be delivered as an item in the feed, and those items could be easily aggregated or filtered compositing data from multiple feeds.
Client libraries exist on most platforms
Because RSS and ATOM are commonly used on the Web, there are client-side libraries readily available on most platforms.
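As an illustration of such a state endpoint, the following sketch serializes a customer’s state as an Atom feed using only the Python standard library. The event dictionary keys, the `urn:example:` identifiers, and the author name are assumptions for the example, and the output has not been validated against the Atom specification.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM = "http://www.w3.org/2005/Atom"


def _child(parent, tag, text=None, **attrs):
    """Append an Atom-namespaced child element and return it."""
    element = ET.SubElement(parent, f"{{{ATOM}}}{tag}", attrs)
    element.text = text
    return element


def state_feed(customer_id, events, updated):
    """Serialize state as an Atom feed string.

    events: list of dicts with hypothetical keys "id", "title", "category", "detail".
    """
    ET.register_namespace("", ATOM)  # emit Atom as the default namespace
    feed = ET.Element(f"{{{ATOM}}}feed")
    _child(feed, "title", f"Status for {customer_id}")
    _child(feed, "id", f"urn:example:status:{customer_id}")
    _child(feed, "updated", updated.isoformat())
    _child(_child(feed, "author"), "name", "Example status service")
    for event in events:
        entry = _child(feed, "entry")
        _child(entry, "id", f"urn:example:event:{event['id']}")
        _child(entry, "title", event["title"])
        _child(entry, "updated", updated.isoformat())
        # The category element lets software and UI consumers decide how to use the item.
        _child(entry, "category", term=event["category"])
        _child(entry, "content", event["detail"], type="text")
    return ET.tostring(feed, encoding="unicode")
```

Each state change becomes one entry, so a consumer can filter on the category term or merge this feed with feeds from other services.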
Designing to Failover and Not Fall Over
As soon as you take a dependency on a third-party service, you introduce a point of failure that is directly out of your control. Unlike software libraries where there is a fixed binary, third-party services can and do change without your involvement or knowledge.
As no provider promises 100 percent uptime, if you’re building a composite service, you will want to identify comparable services for those that you consume and incorporate them into your architecture as a backup.
If, for example, Contoso Mobile’s SMS service were to go offline, it would be helpful to be able to reroute requests to a third-party service, say A. Datum Communications. While the interfaces to Contoso Mobile’s and A. Datum Communications’ services may be different, if they both use a common data contract to describe text messages, it should be fairly straightforward to include failover in the architecture.
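A minimal sketch of this arrangement follows. It assumes a shared `TextMessage` data contract and one small adapter function per provider; the adapter functions, the `ProviderUnavailable` exception, and the simulated outage are all hypothetical stand-ins for real provider interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextMessage:
    """Common data contract shared by every provider adapter."""
    to: str
    body: str


class ProviderUnavailable(Exception):
    """Raised by an adapter when its provider cannot accept the message."""


def send_with_failover(message, providers):
    """Try each (name, send) pair in order; return the name of the provider that accepted."""
    errors = []
    for name, send in providers:
        try:
            send(message)
            return name
        except ProviderUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderUnavailable("; ".join(errors))


sent_via_backup = []


def contoso_send(message):
    # Stand-in for the primary provider's interface; simulates an outage.
    raise ProviderUnavailable("service offline")


def adatum_send(message):
    # Stand-in for the backup provider's (different) interface.
    sent_via_backup.append(message)


used = send_with_failover(
    TextMessage(to="+15555550100", body="Your package arrives tomorrow."),
    [("Contoso Mobile", contoso_send), ("A. Datum Communications", adatum_send)],
)
```

Because both adapters accept the same `TextMessage`, the calling workflow does not change when the primary provider is replaced by the backup.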
Considering Generic Contracts
In the second reference scenario, for example, eBay and Amazon are both data sources. While each of these APIs has result sets that include a list of products, the data structures that represent these products are different. In some cases, elements with the same or similar names have both different descriptions and data types.
This poses an issue not just for the aggregation of result sets, but also for the ability of a service consumer to switch seamlessly from one vendor’s service to another’s. For the service consumer, it’s not just about vendor lock-in, but about the ability to confidently build a solution that can fail over readily to another provider. Whether the vendor is a startup such as Twitter or a Web 1.0 heavyweight such as Amazon, vendors’ services do go offline. And an offline service impacts the business’s bottom line.
Generic contracts, for both data and service definitions, provide benefits from a consumption and aggregation standpoint, and would provide value in both of our reference scenarios. Service providers may be hesitant to adopt generic contracts, as they make it easier for customers to leave their service. The reality is that while generic contracts do make it easier for a customer to switch, serious applications will want to have a secondary service as a backup regardless. For those new to market, this should be seen as an opportunity, and generic contracts should be incorporated into your design.
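The aggregation problem can be illustrated with two made-up vendor payloads mapped onto one generic product contract. The field names below (`title`, `qty`, `productName`, `priceCents`, and so on) are invented for this sketch and do not describe the actual eBay or Amazon APIs.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Product:
    """Generic data contract used by the composite service."""
    name: str
    price: Decimal
    currency: str
    in_stock: bool
    source: str


def from_vendor_a(item):
    # Hypothetical shape: {"title": str, "price": "19.99", "currency": "USD", "qty": int}
    return Product(item["title"], Decimal(item["price"]), item["currency"],
                   item["qty"] > 0, "vendor-a")


def from_vendor_b(item):
    # Hypothetical shape: {"productName": str, "priceCents": int, "available": bool}
    return Product(item["productName"], Decimal(item["priceCents"]) / 100,
                   item.get("currency", "USD"), bool(item["available"]), "vendor-b")


def consolidate(vendor_a_items, vendor_b_items):
    """Normalize both result sets to the generic contract, cheapest first."""
    products = [from_vendor_a(i) for i in vendor_a_items]
    products += [from_vendor_b(i) for i in vendor_b_items]
    return sorted(products, key=lambda p: p.price)
```

Note that the two payloads disagree on both the name and the data type of the price field, which is exactly the mismatch a generic contract removes for the consumer.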
Handling Service Failure
Many service providers today maintain a Web page that supplies service status and/or a separate service that allows service consumers to identify whether a service is online. This is a nice (read: no debugger required) sanity check that your consumers can use to determine whether your service is really offline or if the issue lies in their code.
If you’re offering composite services, you may want to provide a finer level of granularity, which provides transparency not just on your service, but also on the underlying services you consume.
There is also an opportunity to enhance the standard status service design to allow individuals to subscribe to alerts on the status of a service. This allows a status service to proactively notify a consumer when a service may go offline and again when it returns to service. This is particularly important to allow your consumers to proactively handle failovers, enabling some robust system-to-system interactions.
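A status service with both properties, per-dependency transparency and subscribable alerts, might look like the following sketch. The class name `StatusService`, the status strings, and the callback signature are assumptions for illustration.

```python
class StatusService:
    """Tracks a composite service and the services it consumes; alerts on changes."""

    def __init__(self, name, dependencies):
        self.name = name
        self.status = {name: "online"}
        self.status.update({dep: "online" for dep in dependencies})
        self._subscribers = []

    def subscribe(self, callback):
        """Register callback(service_name, new_status) for status transitions."""
        self._subscribers.append(callback)

    def report(self, service, new_status):
        """Record a status; notify subscribers only when it actually changes."""
        if self.status.get(service) == new_status:
            return False
        self.status[service] = new_status
        for callback in self._subscribers:
            callback(service, new_status)
        return True

    def overall(self):
        """Summarize: "online" only if the service and all dependencies are online."""
        if all(state == "online" for state in self.status.values()):
            return "online"
        return "degraded"


status = StatusService("Fabrikam", ["Duwamish Delivery", "Contoso Mobile"])
alerts = []
status.subscribe(lambda service, state: alerts.append((service, state)))
status.report("Contoso Mobile", "offline")
```

A consumer that receives the `("Contoso Mobile", "offline")` alert can start its own failover without first discovering the outage through failed calls.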
This can provide some significant nontechnical benefits as well, the biggest of which is a reduced number of support calls. This can also help mitigate damage to your brand; if your service depends on a third-party service—say, Contoso Mobile—your customers may be more understanding if they understand the issue is outside of your control.
Which Customers Do You Need to Know About?
For commercial services, you will, of course, want to identify your C1 consumers and provide a mechanism to authenticate them. Consider whether you need to capture the identities of customers beyond C1 to C2 to C3 to Cx.
Why would you want to do this? Two common scenarios: compliance and payment.
If your service is subject to government compliance laws, you may need to audit who is making use of your service. Do you need to identify the consumer of your service, the end-consumer, or the hierarchy of consumers in between?
With regard to payment, consider our first reference scenario. Duwamish and Contoso sell directly to Fabrikam, and Fabrikam sells directly to AdventureWorks. It seems straightforward, but what if Fabrikam is paid by AdventureWorks but decides not to pay Duwamish and Contoso? This is a simple composition; the picture becomes more complicated if you expand the tree with another company that creates a shopping cart service that consumes the Fabrikam service and is itself consumed by several e-commerce applications.
While you can easily shut off the service to a C1 customer, you may want to design your service to selectively shut down incoming traffic. The reality is that in a world of hosted services, the barriers to switching services should be lower. Shutting off a C1 customer may result in its C2 customers migrating to a new service provider out of necessity.
For example, if Fabrikam also sells its service to Amazon, Contoso Mobile and Duwamish may recognize that shutting off service to C2 customer Amazon could drive it to select a new C1, significantly cutting into projected future revenues for both Contoso Mobile and Duwamish. AdventureWorks, however, is a different story and doesn’t drive significant traffic. Both Contoso and Duwamish could decide to reject service calls from AdventureWorks but allow service calls from Amazon while they work through their financial dispute with Fabrikam.
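Selective rejection of this kind requires that each incoming call identify the chain of customers behind it. The sketch below assumes a hypothetical `chain` list carried with each request, where the first element is the C1 and the second is the C2; how that chain is captured and authenticated is outside the scope of the example.

```python
def should_accept(chain, suspended_c1, exempt_c2):
    """Decide whether to serve a call.

    chain: customer identities from nearest to farthest, e.g. ["Fabrikam", "Amazon"].
    suspended_c1: C1 customers whose traffic is blocked by default.
    exempt_c2: C2 customers still served even when their C1 is suspended.
    """
    if not chain:
        return False  # unidentified callers are rejected in this sketch
    if chain[0] not in suspended_c1:
        return True
    return len(chain) > 1 and chain[1] in exempt_c2


SUSPENDED = {"Fabrikam"}
EXEMPT = {"Amazon"}
```

With these sets, calls arriving through Fabrikam on behalf of Amazon are served while calls on behalf of AdventureWorks are rejected, mirroring the dispute described above.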
There is no doubt that now is an exciting time to be designing software, particularly in the space of distributed systems. This opens up a number of opportunities and also requires a series of considerations. This article is an introduction to some of these considerations, not an exhaustive list; the goal is to serve as a catalyst for further conversation. Be sure to visit the forums on the Architecture Journal site to continue the conversation.
About the author
Marc Mercuri is a principal architect on the Platform Architecture Team, whose recent projects include the public-facing Web sites Tafiti.com and RoboChamps.com. Marc has been in the software industry for 15 years and is the author of Beginning Information Cards and CardSpace: From Novice to Professional, and coauthored the books Windows Communication Foundation Unleashed! and Windows Communication Foundation: Hands On!
This article was published in the Architecture Journal, a print and online publication produced by Microsoft.