Appendix A - Replicating, Distributing, and Synchronizing Data
All but the most trivial of applications have a requirement to store and retrieve data. For many systems, this aspect employs a database to act as a reliable repository. Modern database management systems, such as Microsoft SQL Server, provide multiuser access capable of handling many thousands of concurrent connections if the appropriate hardware and network bandwidth is available. However, to support highly-scalable data storage that reduces the need to install and maintain expensive hardware within an organization's data center, solutions such as the SQL Azure™ technology platform provide a cloud-based database management system that implements many of the same features.
Using SQL Azure, you can deploy a database to the same datacenter hosting the cloud-based applications and services that use it, which helps to minimize the network latency frequently associated with remote database access. However, in a hybrid system that spans applications running in a variety of distributed locations, using a single instance of SQL Azure in isolation may not be sufficient to ensure a good response time. Instead, an organization might decide to maintain a copy of the database at each location. In this scenario, it is necessary to ensure that all instances of the database contain the same data. This can be a non-trivial task, especially if the data is volatile.
Additionally, you may decide to store data in a different repository; for example, you may choose to use a different database management system or implement a different form of data source. In these cases, you may need to implement your own custom strategies to synchronize the data that they contain.
This appendix examines the issues concerned with distributing and replicating data the between services running in the cloud and across the cloud/on-premises divide by using the technologies available on the Windows Azure™ technology platform. It describes some possible solutions that an organization can implement to keep the data sources synchronized.
Use Cases and Challenges
A primary motivation for replicating data in a hybrid cloud-based solution is to reduce the network latency of data access by keeping data close to the applications and services that use it, thereby improving the response time of these applications and services. As described previously, if a service running in a datacenter uses data held in a local database stored in the same datacenter then it has eliminated the latency and reliability issues associated with sending and receiving messages across the Internet. However, these benefits are not necessarily cost-free as you must copy data held in a local database to other datacenters, and ensure that any changes to this data are correctly synchronized across all datacenters.
You must also consider that replicating data can introduce inconsistencies. When you modify data, the same modification must be made to all other copies of that data and this process may take some time. Fully transactional systems implement procedures that lock all copies of a data item before changing them, and only releasing this lock when the update has been successfully applied across all instances. However, in a globally distributed system such an approach is impractical due to the inherent latency of the Internet, so most systems that implement replication update each site individually. After an update, different sites may see different data but the system becomes "eventually consistent" as the synchronization process ripples the data updates out across all sites.
Consequently, replication is best suited to situations where data changes relatively infrequently or the application logic can cope with out-of-date information as long as it is eventually updated, possibly minutes or even hours later. For example, in an application that enables customers to place orders for products, the application may display the current stock level of each product. The number displayed is likely to be inaccurate if a product is popular; other concurrent users may be placing orders for the same product. In this case, when a user actually submits an order, the stock level should be checked again, and if necessary the customer can be alerted that there may be a delay in shipping if there are none left.
Depending on the strategy you choose to implement, incorporating replication and managing eventual consistency is likely to introduce complexity into the design, implementation, and management of your system. When you are considering replicating data, there are two main issues that you need to focus on:
- Which replication topology should you use?
- Which synchronization strategy should you implement?
The selection of the replication topology depends on how and where the data is accessed, while the synchronization strategy is governed by the requirements for keeping data up-to-date across replicas. The following sections describe some common use cases for replicating, distributing, and synchronizing data and summarize the key challenges that each use case presents.
Replicating Data across Data Sources in the Cloud and On-Premises
Description: Data must be positioned close to the application logic that uses it, whether this logic is running at a datacenter in the cloud or on-premises.
Replication is the process of copying data, and the problems associated with replication are those of managing and maintaining multiple copies of the same information. Choosing an appropriate replication topology can have a major impact on how you address these problems.
In its simplest form, the implementation of this use case copies all data in a data source to all other instances of the same data source, whether these data sources are located in the cloud or on-premises. In this scenario, applications running on-premises and services running in the cloud may be able to query and modify any data. They connect to the most local instance of the data source, perform queries, and make any necessary updates. At some point, these updates must be transmitted to all other instances of the data source, and these updates must be applied in a consistent manner. Figure 1 illustrates this topology, referred to as "Topology A" throughout this appendix.
|In the diagrams in this section, the bold arrows indicate the synchronization paths between databases.|
If your principal reason for moving data services to the cloud is purely for scalability and availability you might conclude that the data sources should just be removed from your on-premises servers, relocated to the cloud, and duplicated across all datacenters. Such a strategy might be useful if the bulk of the application logic that accesses the data has previously been migrated to the cloud. The same concerns surrounding updating data and propagating these changes consistently, as described in Topology A still apply, the only difference is that there is no data source located on-premises. Figure 2 shows this scenario.
Although Topology A and Topology B are simple to understand, such blanket strategies might not always be appropriate, especially if the data naturally partitions itself according to the location of the services that most use it. For example, consider a stock control system for an organization that maintains several warehouses at different geographical locations. A service running at a datacenter in the same geographical region as a warehouse might be responsible for managing only the data in that warehouse. In this case, it may be sensible to replicate just the data pertaining to that warehouse to the datacenter hosting the corresponding service instance, while retaining a copy of the entire database on-premises.
When a service modifies data in its local database, it can arrange to make the same change to the on-premises database. If a service should need to access data held elsewhere, it can query the on-premises database for this information. The on-premises database effectively acts as a master repository for the entire system, while the databases running at each datacenter act as a cache holding just the local data for that datacenter.
|"Appendix E - Maximizing Scalability, Availability, and Performance" describes additional ways to implement a cache by using the Windows Azure Caching.|
This approach reduces the need to copy potentially large amounts of data that is rarely used by a datacenter at the cost of the developing the additional logic required in the code for the service to determine the location of the data. Additionally, if a service regularly requires query access to non-local data or makes a large number of updates then the advantages of adopting this strategy over simply accessing the on-premises data source for every request are reduced. This approach also assumes that each item in a data source is managed exclusively by one and only one datacenter, otherwise there is a risk of losing data updates (services running at two datacenters might attempt to update the same item).
In a variation on this scenario, services running in the cloud primarily query the data, and send all updates to an application running on-premises; if a service running in the cloud needs to modify data, it sends a request to the on-premises application, which is designed to listen for and handle such requests. The on-premises application can modify the information in the master data source, also hosted on-premises, and then arrange for the corresponding changes to be copied out to the appropriate databases running in the cloud. This approach keeps the logic for the services in the cloud relatively straightforward (compared to Topology C) at the expense of providing an on-premises application to manage updates. This topology, shown in Figure 4, also ensures that all data sources in the cloud are eventually consistent by virtue of the replication process from the master data source.
In a simple variant of Topology D, the application running on-premises updates the master database of its own volition rather than in response to requests from services in the cloud. The application may be performing some data maintenance tasks under the direct control of a user at the host organization. In this scenario, the updated data is simply replicated to each of the databases in the cloud.
A final use case concerns an organization spread across multiple sites not in the cloud. The organization retains the data and the applications that use it on-premises, but replicates this data between sites through the cloud. In this case, the cloud simply acts as a conduit for passing data between sites. You can apply this technique to situations such as remote branch offices which may require either a complete copy of the data, or just the subset that relates to the branch office. In either case, applications running at each branch office access the local data source hosted at the same site, and any updates are propagated through the cloud. A copy of the data source in the cloud through which all updates pass can act as the replication hub which gathers all updates and redistributes them, as well as performing the role of a master repository if a branch office requires access to non-local data.
Synchronizing Data across Data Sources
Description: Applications and services modify data, and these modifications must be propagated across all instances of the database.
Data changes, it is rarely entirely static. Applications inevitably insert, update, and delete records. In a replicated environment you must ensure that all such changes are propagated to all appropriate instances of a data source. Synchronizing data can be expensive in terms of network bandwidth requirements, and it may be necessary to implement the synchronization process as a periodic task that performs a batch of updates. Therefore you must be prepared to balance the requirements for data consistency against the costs of performing synchronization and ensure that your business logic is designed with eventual rather than absolute consistency in mind, as descried earlier in this appendix.
In determining your synchronization strategy you should consider the following questions:
- What data do you need to synchronize and how will you handle synchronization conflicts?
The answer to this question depends largely on when and where the data is updated, as described by the topologies listed in the previous section. For example, in Topology D the master data source is only modified by applications running on-premises, so synchronization will be a matter of copying the changes made on-premises to each datacenter in the cloud. This is a one-way operation (on-premises out to the cloud) with little or no possibility of synchronization conflicts; the on-premises master database holds the definitive copy of the data, overwriting data held by datacenters in the cloud.
Where data is modified by services in the cloud but not by applications running on-premises (Topology C), if the data is partitioned by datacenter again there is no possibility of conflicts (services at two different datacenters will not update the same data) and the synchronization process is effectively one-way, and instant. In this case, the datacenters in the cloud hold the definitive data and overwrite the data held on-premises.
Where applications and services may modify data located anywhere in the cloud or on-premises (Topologies A, B, and E) the synchronization process is multi-way as each database must be synchronized with every other database. The data is not partitioned, so there is a possibility that conflicting changes can occur, and you must define a strategy for handling this situation.
Jana Says: Full multi-way synchronization between replicated relational databases can be a resource-intensive operation and the latency associated with transmitting and applying large numbers of updates across a number of sites may mean that some of the data held in one or more databases is inconsistent until all the sites are synchronized. To minimize the time taken and resources required to synchronize databases, you should carefully consider which data your applications and services need to replicate, whether your applications and services can live with potentially stale data, and whether any data should be read-only at each site.
- What are the expected synchronization data volumes? If there is a large amount of volatile data then replicating the effects of every insert, update, and delete operation may generate a lot of network traffic and consume considerable processing power, impacting the performance of each data source, and possibly nullifying the reason for replicating data in the first place. For example, in Topology C, if each service running in the cloud performs a large number of updates then maintaining a replica in the cloud becomes an overhead rather than a performance asset.
- When do you need to synchronize data? Does every instance of the database have to be fully up-to-date all of the time; does your system depend on complete transactional integrity all of the time? If so, then replication might not be the most appropriate solution as synchronization will necessarily be a continuous process, circulating all changes immediately that they occur and locking resources in each data source while it does so to prevent inconsistency from occurring. In this case, using a single centralized data source is a better choice than implementing replication.
Effective data replication has a high level of dependency on the network in terms of security, reliability, and performance. System requirements and application design can also have a significant bearing on how well your chosen replication approach functions. The following sections provide a summary of the possible issues.
Data Access Security
Each data source, whether it is a SQL Azure database or some other repository, must protect the data that it contains to prevent unauthorized access. This requirement applies during the synchronization process as well as during the regular data access cycle. The network packets containing the data being replicated must also be protected as a security breach at this point could easily propagate corrupted information to multiple instances of your precious data.
Data Consistency and Application Responsiveness
Data consistency and application responsiveness are conflicting requirements that you must balance to address the needs of your users.
If you require a high level of consistency across all replicated databases, then you must take steps to prevent competing applications from accessing data that may be in a state of flux somewhere in the system. This approach depends on application logic locking data items and their replicas before changing them and then releasing the locks. While the data is locked, no other applications can access it, adversely affecting the responsiveness of the system from the users' perspective. As mentioned elsewhere in this appendix, in many distributed systems immediate and absolute consistency may not be as important as maintaining application responsiveness; users want to be able to use the system quickly, and as long as information is not lost and eventually becomes consistent then they should be satisfied.
Integrity and Reliability
Even in a solution where immediate data consistency is not a critical requirement the system must still update data in a reliable manner to preserve the integrity of the information the application presents. For example, in the Orders application cited earlier, if the system accepts an order from a customer then that order should be fulfilled; the data comprising the order should not be lost and the order process must be completed. Therefore, any solution that replicates data between databases must implement a reliable mechanism for transporting and processing this data. If some aspect of the system handling this synchronization process fails, it should be possible to restart this process without losing or duplicating any data.
|Don't forget that the network is a critical component that impacts reliability. A reliable solution is one that is resilient in the case of network failure.|
Windows Azure and Related Technologies
If you are implementing databases in the cloud using SQL Azure, you can configure replication and manage synchronization between these databases and SQL Server databases running on-premises by using SQL Azure Data Sync. This technology is a cloud-based synchronization service based on the Microsoft Sync Framework. Using the Windows Azure Management Portal you can quickly configure synchronization for the most common scenarios between your on-premises SQL Server databases and SQL Azure databases running in the cloud. Additionally, SQL Azure Data Sync is compatible with the Microsoft Sync Framework 2.1, so you can use the Sync Framework SDK to implement a custom synchronization strategy and incorporate additional validation logic if necessary.
|SQL Azure Data Sync is compatible with SQL Server 2005 Service Pack 2 and later.|
The Sync Framework SDK is also useful for scenarios where you need to implement a custom synchronization approach that you cannot configure easily by using the Management Portal. For example, you can build your own synchronization services if you need to synchronize data between databases located on-premises and mobile devices for roaming users.
Another approach is to implement a custom mechanism that passes updates between databases using messaging, with sender and listener applications applying the logic to publish updates and synchronize them with the exiting data. Service Bus topics and subscriptions provide an ideal infrastructure for implementing this scenario.
The following sections provide more information on how to use SQL Azure Data Sync, the Sync Framework SDK, and Service Bus topics and subscriptions for implementing replication in some common scenarios.
Replicating and Synchronizing Data Using SQL Azure Data Sync
Using SQL Azure Data Sync to implement SQL Server synchronization provides many benefits, including:
- Elastic scalability. The SQL Azure Data Sync service runs in the cloud and scales automatically as the amount of data and number of sites participating in the synchronization process increases.
- Simple configuration. You can use the Management Portal to define the synchronization topology. The portal provides wizards that step through the configuration process and enable you to specify the data to be synchronized. You can also indicate whether replication should be one-way or bidirectional. The portal provides a graphical view of your topology and its current health status through the Sync Group dashboard.
- Scheduled synchronization. You can specify how frequently the synchronization process occurs, and you can easily modify this frequency even after synchronization has been configured. Using the Management Portal you can also force an immediate synchronization.
- Preconfigured conflict handling policies. SQL Azure Data Sync enables you to select how to resolve any conflicts detected during synchronization by selecting from a set of built-in conflict resolution policies.
- Comprehensive logging features. SQL Azure Data Sync logs all events and operations. The Management Portal enables you to examine this information, and filter it in a variety of ways, enabling you to quickly determine the cause of any problems and take the necessary corrective action.
The following sections provide information about the way in which SQL Azure Data Sync operates, and include guidance on using SQL Azure Data Sync in a number of common scenarios.
Guidelines for Configuring SQL Azure Data Sync
When you configure SQL Azure Data Sync, you must make a number of decisions concerning the definition of the data that you want to replicate, and the location of the databases holding this data. This section provides guidance for defining the key elements of a synchronization architecture.
Defining a Sync Group and Sync Dataset
SQL Azure Data Sync organizes the synchronization process by defining a sync group. A sync group is a collection of member databases that need to be synchronized, together with a hub database that acts as a central synchronization point. All member databases participating in a topology synchronize through the hub; they send local updates to the hub and receive updates made by other databases from the hub.
When you define a sync group, you also define a sync dataset that specifies the tables, rows, and columns to synchronize. You do not have to select every table in a database, and you can define filters to restrict the rows that are synchronized. However, every table that participates in a sync dataset must have a primary key; otherwise synchronization will fail. Additionally, although you do not need to include every column in each participating table, you must include all columns that do not allow null values; again synchronization will fail otherwise.
|SQL Azure Data Sync creates triggers on each table in a sync group. These triggers track the changes made to the data in each table in the sync group. For more information about the triggers that SQL Azure Data Sync generates, see "Considerations for Using Azure Data Sync" on MSDN.|
It is important to understand that SQL Azure Data Sync imposes some constraints on the column types in the tables that participate in the synchronization process. These constraints are due to the Sync Framework on which SQL Azure Data Sync is based; the Sync Framework is designed to operate with a variety of database management systems, not just SQL Server, and so the types it supports are limited to those common across the major database management systems. For example, you cannot synchronize columns based on user-defined data types, spatial data types, or CLR types. For a full list of supported and unsupported types see "SQL Azure Data Sync – Supported SQL Azure Data Types" on MSDN.
|If you attempt to create a sync dataset that includes columns with unsupported types, these columns will be ignored and the data that they contain will not be replicated.|
|The same sync dataset applies globally across all member databases in the sync group. You define the sync dataset when you add the first member database to the sync group, and if necessary, the tables that underpin the sync dataset will be automatically added to subsequent member databases when they are enrolled in the sync group. However, once you have defined the sync dataset for a sync group you cannot modify the definition of this dataset; you must drop the sync group and build a new one with the new sync dataset.|
Implementing the Database Schema for Member Databases
In a typical scenario, the schema of the data that you want to replicate may already exist in an on-premises or SQL Azure database. When you deploy a sync group, if the necessary tables do not already exist in the other member databases or the hub, then SQL Azure Data Sync will automatically create them based on the definition of the sync dataset. In this case, the deployment process will only generate the columns specified by the sync dataset, and will add an index for the primary key of each table. While the deployment process does a reasonable job of replicating the schema for the sync dataset, it may not always be identical due to the differences between SQL Azure and SQL Server.
Additionally, any indexes other than that for the primary key will not be generated, and this may have an impact on the performance of queries performed against a replicated database. Therefore, to ensure complete accuracy and avoid any unexpected results, it is good practice to create a SQL script containing the commands necessary to create each table to be replicated, together with the appropriate indexes. You can also define any views and stored procedures that each member database may require as these cannot be replicated automatically. You can then run this script against each database in turn before provisioning replication.
|Use a tool such as Microsoft SQL Server Management Studio to generate and edit the SQL scripts that create the tables, views, and stored procedures for each member database. If you have the appropriate credentials, you can also connect to each member database using SQL Server Management Studio and run these scripts.|
Managing Synchronization Conflicts
During the synchronization process, SQL Azure Data Sync connects to each member database in turn to retrieve the updates performed in that database and applies them to the hub. Any updates previously applied to the hub from another member database are transmitted to the database and applied.
The hub is the focus for detecting and resolving conflicts. SQL Azure Data Sync enables you to select from two conflict resolution policies:
- Hub Wins. If the data at the hub has already been changed, then overwrite changes to this data made at the member database with the data at the hub. In effect, this means that the first member database to synchronize with the hub predominates.
- Client Wins. If the data has been changed at the member database, this change overwrites any previous changes to this data at the hub. In contrast to the Hub Wins policy, in this case the last member database to synchronize with the hub predominates.
|The synchronization process visits each member database in turn in a serial manner and applies the necessary updates to synchronize that member database and the hub. Databases visited earlier will not incorporate the changes resulting from the synchronization with databases visited later. For member databases to be fully synchronized with each other, you need to perform two synchronizations across the sync group.|
During synchronization, each batch of updates is applied as a transaction; either all the updates in a batch are applied successfully or they are rolled back. However, these batch transactions do not necessarily reflect the business transactions performed by your system. For example, a business transaction that modifies data in two tables may have these updates propagated by different batches when these changes are synchronized.
Additionally, each synchronization applies only the changes in effect at that time to each database. If a row undergoes several updates between synchronizations, only the final update will be replicated; SQL Azure Data Sync does not keep a log of every change made between synchronizations.
|You should give the conflict resolution policy careful thought as the same policy applies across all databases in a sync group. Additionally, you specify this policy when you first create the sync group and you cannot change it without dropping and recreating the sync group.|
Note that, although you can select the conflict resolution policy, you cannot currently influence the order in which databases are synchronized with the hub. Ideally, you should design your solution to minimize the chances of conflicts occurring; in a typical distributed scenario, applications running at different sites tend to manage their own subset of an organization's data so the chances of conflict are reduced. Remember that the primary purpose of replication is to propagate updates made at one site to all other sites so that they all have the same view of the data.
If you need to guarantee the effects of the conflict resolution policy, you can divide your replication topology into a series of sync groups with each sync group containing the hub and a single member database. The synchronization schedule for each sync group determines the order in which each member database is synchronized with the hub. The sync group for a high priority member database with updates that must always take precedence can select the Client Wins conflict resolution policy so that these changes are always replicated.
The policy for other sync groups can be set to Hub Wins, and in this way the changes made at the high priority database will always be replicated out to the other member databases. You can implement many variations on this topology. For example you can place several member databases into the Hub Wins sync group if none of these databases are likely to contain changes that conflict with each other.
|To avoid issues with conflicting primary key values, do not use columns with automatically generated key values in replicated tables. Instead use a value that is guaranteed to be unique, such as a GUID.|
Conflict is typically a result of bidirectional synchronization. To reduce the chances of a conflict occurring you can configure one-way replication and specify the synchronization direction for each member database in a sync group relative to the hub. For more information, see the section "Selecting the Synchronization Direction for a Database" later in this appendix.
Pay careful attention to the definition of the columns in replicated tables as this can have a significant impact on the likelihood of conflict. For example, if you define the primary key column of a replicated table with the SQL Server IDENTITY attribute, then SQL Server will automatically generate values for this column in a monotonic increasing sequence, typically starting at 1 and incrementing by 1 for each newly inserted row. If rows are added at multiple member databases in a sync group, several of these rows might be given the same primary key value and will collide when the tables are synchronized. Only one of these rows will win and the rest will be removed. The results could be disastrous if, for example, this data represented orders for different customers; you will have lost the details of all the orders except for the winner selected by the conflict resolution policy!
To avoid situations such as this, do not use columns with automatically generated key values in replicated tables, but instead use a value that is guaranteed to be unique, such as a GUID.
Locating and Sizing the Hub Database
The hub must be a SQL Azure database. After synchronizing with all the member databases, it holds the definitive and most up-to-date version of the data. The location of this database is key to maintaining the performance of the synchronization process; you should store it at a datacenter that is geographically closest to the most active member databases, whether these databases are located on-premises or in the cloud. This will help to reduce the network latency associated with transmitting potentially large amounts of data across the Internet. If your databases are distributed evenly around the world, and the volume of database updates and query traffic is roughly the same for each one, then you should position the hub at the datacenter closest to your highest priority sites.
SQL Azure Data Sync replicates and synchronizes data between your databases through the hub. You can provision a single instance of the SQL Azure Data Sync Server for each Windows Azure subscription that you own, and you can specify the region in which to run this server. Ideally, you should locate this server in the same region that you plan to use for hosting the hub database.
You create the hub database manually, and it should be at least as big as the largest of the member databases. SQL Azure does not currently support automatic growth for databases, so if you make the hub database too small synchronization could fail. You should also note that when you configure synchronization, SQL Azure Data Sync creates additional metadata tables in your databases to track the changes made, and you must take these tables into account when sizing the hub database.
Apart from acting as the focus around which the synchronization process revolves, the hub contains exactly the same data as any other SQL Azure member database in the sync group. You can insert, update, and delete data in this database and these changes will be replicated throughout the sync group. In some situations, you can elect to use one of the SQL Azure databases originally intended as a member of the sync group as the hub. For example, you may opt to designate the SQL Azure database for the most active site as the hub. This strategy can help to minimize the network latency and thereby improve the performance of the synchronization process.
However, every other member database in the sync group will periodically synchronize with this database, and the work involved in performing the synchronization operations may impact the performance of this database, especially if the tables in the sync dataset contain a complex collection of indexes. You must strike a balance between the overhead associated with a database being the hub of a sync group against the time required to synchronize this database with a hub located elsewhere.
Specifying the Synchronization Schedule for a Sync Group
Synchronization is a periodic process rather than a continuous operation; you can specify a simple synchronization schedule for a sync group, and you can also force synchronization to occur manually by using the Management Portal. If you set a synchronization schedule, you must select a synchronization frequency that is appropriate to your solution; if it is too long, then member databases may contain outdated information for an extended period, while if it is too short a previous synchronization might not have completed and the attempt will fail. As described previously, the time taken to complete the synchronization process depends on the location of the hub database.
It will also depend on the volume of data to be synchronized; the longer the interval between synchronizations the more data will need to be synchronized and transmitted to and from the hub. Additionally, as the synchronization period increases, it is more likely that conflicts will occur and the synchronization process will have to expend effort resolving these conflicts, which will increase the time taken still further. You may need to determine the optimal synchronization period based on observations, and tune it accordingly as you establish the data timeliness requirements of your applications and services.
|As with the conflict resolution policy, the synchronization schedule applies globally across all databases in the sync group. However, you can modify this schedule at any time, so you can observe the effects of synchronizing data at different intervals and then select the period most appropriate to your requirements.|
Finally, you should also consider that SQL Azure charges are applied to data that is moved in and out of SQL Azure datacenters; the more data you synchronize between datacenters and the more frequently you perform this synchronization, the higher the cost.
Selecting the Synchronization Direction for a Database
When you add a member database to a sync group, you specify the synchronization direction. Synchronization can be:
- Bidirectional. The member database can make changes and upload them to the hub, and it can also receive updates from the hub. This is likely to be the most common form of synchronization implemented by many organizations.
- To the hub. The member database can make changes and upload them to the hub, but it will not receive changes from the hub. This form of synchronization is useful for situations such as Topology D (on-premises master repository with one-way synchronization to the cloud) described earlier in this appendix. A service running in the cloud updates the local member database and also copies changes to the database running on-premises as they occur. The on-premises database can be configured to synchronize to the hub. When synchronization occurs, the changes made by each service in the cloud can be propagated out to the member databases for the other services via the hub. The on-premises database does not need to be synchronized as it already contains all the updates.
- From the hub. The member database can receive changes from the hub, but will not upload any local changes to the hub. Again, this form of synchronization is useful for implementing scenarios similar to Topology D. In this case, the member databases can be configured to synchronize from the hub; any changes made locally will have already been made to the on-premises database by the services running in the cloud, so the only changes that need to be replicated are those originating from other services located elsewhere that have also updated the on-premises database.
|The synchronization direction is an attribute of each member database; each database in a sync group can specify a different synchronization direction.|
Figure 6 depicts an updated version of Topology D with the hub database and Data Sync Service required by SQL Azure Data Sync.
|Although Figure 6 shows the hub as a separate database, except for the circumstances described ate the end of the section "Locating and Sizing the Hub Database" it is likely that one of the member databases in the cloud would perform this role. The examples in the section "Guidelines for Using SQL Azure Data Sync" illustrate this approach.|
Avoiding Sync Loops
A member database can participate in more than one sync group. However, such a configuration can result in a sync loop. A sync loop occurs when the synchronization process in one sync group results in synchronization being required in another sync group, and when this second synchronization occurs the configuration of this sync group results in synchronization being required again in the first group, which again may render synchronization necessary in the second group, and so on. Sync loops are self-perpetuating and can result in large amounts of data being repeatedly written and rewritten, resulting in degraded performance and increased costs.
When you define a sync group, you must be careful to ensure that sync loops cannot exist by evaluating the role of any databases that participate in multiple sync groups, selecting the appropriate conflict resolution policy for each sync group, using row filtering to prevent the same rows in a table participating in different sync groups, and by carefully setting the synchronization direction for each database. For a more detailed description of sync loops and the circumstances under which they can occur, see "Synchronization Loops" on MSDN.
Guidelines for Using SQL Azure Data Sync
You can use SQL Azure Data Sync to implement the replication topologies described earlier in this appendix. You can apply these topologies with SQL Azure Data Sync to many common scenarios, as described in the following list.
Applications running on-premises access a SQL Server database also held on-premises. Services running in the cloud use a copy of the same data. Any changes made at any site must eventually be propagated to all other sites, although these updates do not have to occur immediately.
This is possibly the most common scenario for using SQL Azure Data Sync, and describes the situation covered by Topology A (bidirectional synchronization across all databases on-premises and in the cloud). As an example of this scenario, consider a database holding customer and order information. An application running on-premises maintains customer information, while customers use a web application running in the cloud that creates new orders. The web application requires access to the customer information managed by the on-premises application, and the code running on-premises frequently queries the order details to update the status of orders when they are delivered and paid for.
In this example, response time is important, but neither the application running on-premises nor the web application running in the cloud requires access to completely up-to-date information. As long as the data is available at some near point in the future, that is good enough. Therefore, to minimize the effects of network latency and ensure that it remains responsive, the web application employs a SQL Azure database hosted in the same datacenter as the application and the on-premises application uses a SQL Server database also located on-premises.
SQL Azure Data Sync enables you to replicate and share the customer and order information between the on-premises application and the cloud by using bidirectional synchronization, as shown in Figure 7. Note that, in this diagram, the SQL Azure database in Datacenter A also acts as the synchronization hub running the Data Sync service.
You have relocated the logic for your business applications to services running in the cloud. The business applications previously used data held in a SQL Server database. The services have been distributed across different datacenters, and the SQL Server database has been migrated to SQL Azure. To minimize network latency each data center has a replica of the SQL Azure database.
This is the scenario that compares to Topology B (bidirectional synchronization only across databases in the cloud). In this example, the application logic that accesses the database has been completely relocated to the cloud, so the on-premises database has been eliminated. However, the cloud based applications all require access to the same data, and may modify this information, so each instance of the SQL Azure database must be periodically synchronized with the other instances. This replication will be bidirectional. Figure 8 shows the structure of a possible solution, with the SQL Azure database in Datacenter A also performing the role of the synchronization hub. In this example, any of the applications may query and modify any data. Consequently, the application logic might need to be amended to handle data that may be out of date until the next synchronization cycle.
You need to make your data held in an on-premises SQL Server database available to services running in the cloud. These services only query data and do not modify it; all modifications are performed by applications running on-premises.
This is the simple variant of Topology D (on-premises master repository with one-way synchronization to the cloud) described earlier. In this scenario, the services that query the data execute remotely from your on-premises database. To minimize response times, you can replicate the data to one or more SQL Azure databases hosted in the same datacenters as each of the services. Using SQL Azure Data Sync, you can publish the data held on premises and periodically synchronize any updates made by the on-premises applications with the databases in the cloud. Figure 9 shows an example. This configuration requires one-way replication, with the on-premises database synchronizing to a hub database in the cloud and each of the SQL Azure member databases synchronizing from the hub.
You have a number of applications and SQL Server databases running on-premises. However, you have migrated much of your business intelligence and reporting functionality to services running in the cloud. This functionality runs weekly, but to support your business operations it requires query access to your business data.
In this scenario, all the data modifications are performed against a number of SQL Server databases hosted within the organization by applications running on-premises. These applications may be independent from each other and operate by using completely different databases. However, assume that the business intelligence functionality performs operations that span all of these databases, querying data held across them all, and generating the appropriate reports to enable a business manager to make the appropriate business decisions for the organization. Some of these reports may involve performing intensive processing, which is why these features have been moved to the cloud.
You can use SQL Azure Data Sync to aggregate and consolidate data from the multiple on-premises databases into a single SQL Azure database in the cloud, possibly replicated to different datacenters as shown in Figure 10. The business intelligence service at each datacenter can then query this data locally. The synchronization process only needs to be one way, from the on-premises databases to the hub and then from the hub to each SQL Azure database; no data needs to be sent back to the on-premises database. Additionally, synchronization can be scheduled to occur weekly, starting a few hours before the business intelligence service needs to run (the exact schedule can be determined based on how much data is likely to be replicated and the time required for this replication to complete).
You can use the same approach to aggregate data from multiple offices to the cloud, the only difference being that the on-premises SQL Server databases are held at different locations.
You have a number of services running in the cloud at different datacenters. The services at each datacenter maintain a separate, distinct subset of your organization's data. However, each service may occasionally query any of the data, whether it is managed by services in that datacenter or any other datacenter. Additionally, applications running on-premises require access to all of your organization's data.
This situation occurs when the data is partitioned between different sites, as described in Topology C (on-premises master repository with one-way synchronization from the cloud). In this scenario, a SQL Server database running on-premises holds a copy of all the data for the organization, but each datacenter has a SQL Azure database holding just the subset of data required by the services running at that datacenter. This topology allows the services running at a datacenter to query and update just its subset of the data, and periodically synchronize this subset of the data with the on-premises database.
If a service needs to query data that it does not hold locally, it can retrieve this information from the on-premises database. As described earlier, this mechanism necessitates implementing logic in each service to determine the location of the data, but if the bulk of the queries are performed against the local database in the same datacenter then the service should be responsive and maintain performance.
Implementing this system through SQL Azure Data Sync requires defining a separate sync group for each SQL Azure database. This is because the sync dataset for each SQL Azure database will be different; the data will be partitioned by datacenter. The on-premises database will be a member common to each sync group. To simplify the structure, you can specify that the SQL Azure database for each datacenter should act as its own synchronization hub. Figure 11 shows an implementation of this solution.
Markus Says: Be careful to avoid introducing a sync loop if you follow this strategy; make sure that the different sync datasets do not overlap and include the same data.
Your organization comprises a head office and a number of remote branch offices. The applications running at head office require query access to all of the data managed by each of the branch offices, and may occasionally modify this data. Each branch office can query any data held in that branch office, any other branch office, or at head office, but can only modify data that relates to the branch office.
This is the scenario for Topology E (on-premises databases synchronized through the cloud). The data can be stored in a SQL Server database, and each branch office can retain a replica of this database. Other than the hub database, no data is stored in the cloud. The location of the hub should be close to the most active office (possibly the head office). Synchronization should be bidirectional, and can be scheduled to occur with a suitable frequency depending on the requirement for other branch offices to see the most recent data for any other branch. If each branch only stores its own local subset of the data, you will need to create a separate sync group for each branch database with the appropriate sync dataset, as described in the previous scenario. If the sync datasets for each branch do not overlap, it is safe to use the same SQL Azure database as the synchronization hub for each sync group. Figure 12 shows a possible structure for this solution.
Many of your services running in the cloud perform a large number of on-line transaction processing (OLTP) operations, and the performance of these operations is crucial to the success of your business. To maintain throughput and minimize network latency you store the information used by these services in a SQL Azure database at the same datacenter hosting these services. Other services at the same site generate reports and analyze the information in these databases. Some of this reporting functionality involves performing very complex queries. However, you have found that performing these queries causes conflict in the database that can severely impact the performance of the OLTP services.
In this scenario, the solution is to replicate the database supporting the OLTP operations to another database intended for use by the reporting and analytical services, implementing a read scale-out strategy. The synchronization only needs to be performed one-way, from the OLTP database to the reporting database, and the schedule can be set to synchronize data during off-peak hours. The OLTP database can perform the role of the hub. Additionally, the reporting database can be optimized for query purposes; the tables can be configured with indexes to speed the various data retrieval operations required by the analytical services, and the data can be denormalized to reduce the processing requirements of complex queries. In contrast, the number of indexes in the OLTP database should be minimized to avoid the overhead associated with maintaining them during update-intensive operations. Figure 13 shows this solution.
SQL Azure Data Sync Security Model
SQL Azure Data Sync uses a piece of software called the Data Sync client agent to communicate between your on-premises instances of SQL Server and the SQL Azure Data Sync Server in the cloud; you must download and install the Data Sync client agent on one of your on-premises servers. The communications between the SQL Azure Data Sync Server and the Data Sync client agent are encrypted. The Data Sync client agent uses an outbound HTTPS connection to communicate with the SQL Azure Data Sync Server. Additionally, all sensitive configuration information used by the Data Sync client agent and SQL Azure Data Sync Server are encrypted, including the credentials used to connect to each on-premises database and SQL Azure database. The agent key defined in the Management Portal is used by the Data Sync Service for authentication.
The Data Sync client agent software consists of two elements:
- A Windows service that connects to the on-premises databases, and
- A graphical utility for configuring the agent key and registering on-premises databases with this service.
The client agent service must be configured to run using a Windows account that has the appropriate rights to connect to each server hosting on-premises databases to be synchronized; these databases do not have to be located on the same server as the client agent service. When you register an on-premises database with the client agent you must provide the login details for accessing the SQL Server hosting the database, and this information is stored (encrypted) in the client agent configuration file. When the SQL Azure Data Sync Server synchronizes with an on-premises database, the client agent uses these details to connect to the database.
|For additional security, if the client agent is running on a different server from your databases, you can configure the client agent service to encrypt the connection with each on-premises database by using SSL. This requires that you have installed the appropriate SSL certificates installed in each instance of SQL Server. For more information, see "Encrypting Connections to SQL Server" on MSDN.|
For more information about the SQL Azure Data Sync security model, see "Data Security" on MSDN.
Implementing Custom Replication and Synchronization Using the Sync Framework SDK
The Management Portal enables you to replicate and synchronize SQL Server and SQL Azure databases without writing any code, and it is suitable for configuring many common replication scenarios. However, there may be occasions when you require more control over the synchronization process, for example a service may need to force a synchronization to occur at a specific time. For example, if you are caching data by using the Windows Azure Caching service, using SQL Azure Data Sync may render any cached data invalid after synchronization occurs. You may need to more closely coordinate the lifetime of cached data with the synchronization frequency, perhaps arranging to flush data from the cache when synchronization occurs, to reduce the likelihood of this possibility. You might also need to implement a different conflict resolution mechanism from the policies provided by SQL Azure Data Sync, or replicate data from a source other than SQL Server. You can implement a just such customized approach to synchronization in your applications by using the Sync Framework SDK.
|Chapter 6, "Maximizing Scalability, Availability, and Performance in the Orders Application" describes how Trey Research implemented caching. "Appendix E - Maximizing Scalability, Availability, and Performance" provides further details and guidance on using Windows Azure Caching.|
The Sync Framework 2.1 SDK includes support for building applications that can synchronize with SQL Azure. Using this version of the Sync Framework SDK, you can write your own custom synchronization code and control the replication process directly.
Using this approach, you can address a variety of scenarios that are not easy to implement by using SQL Azure Data Sync, such as building offline-capable/roaming applications. As an example, consider an application running on a mobile device such as a notebook computer used by a plumber or a building maintenance engineer. At the start of each day, he or she uses the application to connect to the local branch office and receive a work schedule with a list of customers' addresses and job details. As each job is completed, an application running on the mobile device is used to input the details, which are stored in a database on the mobile device. Between jobs, he or she can connect to the branch office again and upload the details of the work completed so far, and the application also downloads any amendments to the work schedule for that day. For example, he or she can be directed to attend an urgent job prior to moving on to the previously scheduled engagement. Every Friday afternoon, an administrator in the branch office generates a report detailing the jobs carried out by all workers reporting to that branch.
If the branch office database is implemented by using SQL Azure, the mobile application running on the mobile device can use the Sync Framework SDK to connect to the datacenter hosting this database and synchronize with the local database on the device. Figure 14 shows a simplified view of this architecture.
For more information about using the Sync Framework SDK with SQL Azure, see "SQL Server to SQL Azure Synchronization using Sync Framework 2.1" on MSDN.
Replicating and Synchronizing Data Using Service Bus Topics and Subscriptions
SQL Azure Data Sync provides an optimized mechanism for synchronizing SQL Server and SQL Azure databases, and is suitable for an environment where changes can be batched together, propagated as a group, and any conflicts resolved quickly. In a dynamic environment such batch processing this may be inappropriate; it may be necessary to replicate changes as they are made rather than batching them up and performing them at some regular interval. In these situations, you may need to implement a custom strategy. Fortunately, this is a well-researched area, and several common patterns are available. This section describes two generic scenarios that cover most situations, and summarizes how you might implement solutions for these scenarios by using Service Bus topics and subscriptions.
Guidelines for Using Service Bus Topics and Subscriptions
You can use Service Bus topics and subscriptions to implement a reliable infrastructure for routing messages between sender and receiver applications. You can exploit this infrastructure to provide a basis for constructing a highly customizable mechanism for synchronizing data updates made across a distributed collection of data sources, as described by the following scenarios:
Concurrent instances of applications or services running as part of your system require read and write access to a set of distributed resources. To maintain responsiveness, read operations should be performed quickly, and so the resources are replicated to reduce network latency. Write operations can occur at any time, so you must implement controlled, coordinated write access to each replica to reduce the possibility of any updates being lost.
As an example of this scenario, consider a distributed system that comprises multiple instances of an application accessing a database. The instances run in various datacenters in the cloud, and to minimize network latency a copy of the database is maintained at each datacenter. If an application instance at datacenter A needs to modify an item in the database at datacenter A, the same change must be propagated to all copies of the database residing at other datacenters. If this does not happen in a controlled manner, application instances running in different datacenters might update the local copy of the same data to different values, resulting in a conflict, as shown in Figure 15.
In the classic distributed transaction model, you can address this problem by implementing transaction managers coordinating with each other by using the Two-Phase Commit protocol (2PC). However, although 2PC guarantees consistency, it does not scale well. In a global environment based on networks that are not always fully reliable this could lead to data being locked for excessively long periods, reducing the responsiveness of applications that depend on this data. Therefore you must be prepared to make some compromises between consistency and availability.
One way to approach this problem is to implement update operations as BASE transactions. BASE is an acronym for Basic Availability, Soft-state, and Eventual consistency, and is an alternative viewpoint to traditional ACID (Atomic, Consistent, Isolated, and Durable) transactions. With BASE transactions, rather than requiring complete and total consistency all of the time, it is considered sufficient for the database to be consistent eventually, as long as no changes are lost in the meantime. What this means in practice, in the example shown in Figure 15, is that an application instance running at Datacenter A can update the database in the same datacenter, and this update must be performed in such a way that it is replicated in the database a Datacenter B. If an application instance running at Datacenter B updates the same data, it must likewise be propagated to Datacenter A. The key point is that after both of the updates are complete, the result should be consistent and both databases should reflect the most recent update.
There may be a period during which the modification made at Datacenter A has yet to be copied to Datacenter B, and during this time an application instance running at Datacenter B may see old, stale data, so the application has to be designed with this possibility in mind; consider the orders processing system displaying the stock levels of products to customers cited earlier in this appendix as an example.
Service Bus topics and subscriptions provide one solution to implementing controlled updates in this scenario. Application instances can query data in the local database directly, but all updates should be formatted as messages and posted to a Service Bus topic. A receiving application located in each datacenter has its own subscription for this topic with a filter that simply passes all messages through to this subscription. Each receiver therefore receives a copy of every message and it uses these messages to update the local database. Figure 16 shows the basic structure of this solution.
You can use the Enterprise Library Transient Fault Handling Block to provide a structure for posting messages reliably to a topic and handling any transient errors that may occur. As an additional safeguard, you should configure the topic with duplicate detection enabled, so if the same update message does get posted twice any duplicates will be discarded.
The receiver should retrieve messages by using the PeekLock receive mode. If the update operation succeeds, the receive process can be completed, otherwise it will be abandoned and the update message will reappear on the subscription. The receiver can retrieve the message again and retry the operation.
Instances of a long-running service executing in the cloud access a single remote data source, but to maintain response times each instance has a cache of the data that it uses. The volume of data is reasonably small and is held in-memory. The data acting as the source of the cache may be updated, and when this happens each service instance should be notified so that it can maintain its cached copy.
An example of this scenario is a job processor service. Client applications store information about business tasks that need to be performed in a database. The job processor service periodically polls the database looking for new tasks to perform, and then executes the appropriate actions when a task appears.
In this scenario, each instance of the job processor service is seeded with data from the database when it starts up. When a new job is added by a client application it is stored in the database. However, the client application can also post a message to a Service Bus topic with the details of the new job. Each instance of the job processor service has a subscription to this topic, and uses the messages posted to this topic to update its local copy of the cached data and perform the appropriate processing. The job processor service no longer needs to poll the database which now acts purely as an audit log of jobs.
This architecture optimizes the use of the database and also reduces the possibility of conflict occurring; if multiple instances of the job processor service have to poll the database, there is the possibility that they might all pick up the details of the same new job unless the database query logic includes a mechanism for exclusively locking data as it is retrieved, whereas a Service Bus subscription automatically prevents multiple instances from retrieving the same message. Additionally, this architecture can more easily spread the load evenly and is naturally scalable; you can partition the job messages to direct them to different subscriptions by defining an appropriate set of filters, and you can start and stop instances of the job processor service listening on each subscription as the queue length grows and shrinks. Figure 17 shows the structure of this solution.
The logic of the client application and job processor service can easily be extended if, for example, the client application wishes to cancel a job, or change some of the details for a job. In these cases, the client application can remove or update the job information in the database and post a job cancellation or job update message to the Service Bus topic.
All links in this book are accessible from the book's online bibliography available at: http://msdn.microsoft.com/en-us/library/hh871440.aspx.
- "Considerations for Using Azure Data Sync" at http://sqlcat.com/sqlcat/b/technicalnotes/archive/2011/12/21/considerations-when-using-data-sync.aspx.
- "SQL Azure Data Sync – Supported SQL Azure Data Types" at
- "Synchronization Loops" at http://msdn.microsoft.com/en-us/library/hh667312.aspx.
- "Encrypting Connections to SQL Server" at
- "Data Security" at http://msdn.microsoft.com/en-us/library/hh667329.aspx.
- "SQL Server to SQL Azure Synchronization using Sync Framework 2.1" at http://blogs.msdn.com/b/sync/archive/2010/08/31/sql-server-to-sql-azure-synchronization-using-sync-framework-2-1.aspx.
Last built: June 4, 2012