How to Develop an Efficient Architecture based on SQL Server CE Replication Requirements

 

Jim Wilson
JW Hedgehog, Inc.

February 2004

Applies to:
   Microsoft® Compact Framework 1.0
   Microsoft Visual Studio® .NET 2003
   Microsoft SQL Server™ CE
   Microsoft SQL Server 2000

Summary: In this article, we examine the SQL Server CE replication options available and how to choose the most efficient tool for each replication scenario. We also look at how best to work with each solution once the decision has been made. (15 printed pages)

Contents

Introduction
SQL Server CE Replication Options
Considerations of Merge Replication
Determining Replication Needs
Guidelines for Efficiency
Conclusion

Introduction

Smart device applications virtually never stand alone. In almost all cases they must interact with data held on enterprise servers. When working with the .NET Compact Framework, data required by the smart device is commonly stored using SQL Server CE, with some portion copied from or inserted into a SQL Server enterprise database. Although one can build custom solutions using Web Services or the SQL Server ADO.NET provider, taking advantage of the built-in replication facilities offered by SQL Server CE and SQL Server 2000 often provides an easy-to-use and effective solution.

SQL Server CE Replication Options

There are two "built-in" choices for replicating data between SQL Server and SQL Server CE: Merge Replication and Remote Data Access, better known as RDA. Both technologies support downloading data from the server, tracking changes made to the data on the client, and transmitting changes back to the server. Both require the installation of SQL Server CE Server Tools and use IIS to support firewall-friendly data communication. In fact, at first glance the two technologies appear very similar, and there is indeed some overlap in their capabilities; however, the details of the way each does its job vary tremendously, with each having very distinct strengths and limitations.

Note   You can download the latest copy of SQL Server CE Server Tools at SQL Server CE 2.0 Update.

Merge Replication

Merge Replication is a SQL Server technology originally designed to synchronize data between SQL Server instances; it was not intended specifically for smart device data replication. .NET Compact Framework applications using SQL Server CE are able to act as a replication subscriber through the SqlCeReplication class. Merge Replication is a powerful tool capable of full bidirectional synchronization between the enterprise server and one or more smart clients, complete with customizable conflict resolution.

Support for bidirectional synchronization and conflict resolution doesn't come cheap. The SQL Server instance requires special tables, stored procedures and triggers to manage the synchronization process. Participating in Merge Replication may even modify the structure and increase the size of the tables being replicated. As we discuss replication efficiency, we'll look at how the decisions we make affect these issues.

Note   The remainder of this article assumes a basic understanding of SQL Server Merge Replication. If you're not familiar with SQL Server Merge Replication or need a little refresher, check out How Merge Replication Works.

Remote Data Access (RDA)

RDA is a solution designed specifically for SQL Server CE to communicate with SQL Server. .NET Compact Framework applications use the SqlRemoteDataAccess class to communicate with SQL Server via RDA. Unlike Merge Replication, the SQL Server instance does not require any changes to the server data or database. As far as the SQL Server instance is concerned, the RDA client is just another database client issuing Select, Insert, Update, and Delete commands.

RDA's lightweight server impact often makes it an excellent replication choice. It does, however, tend to require more programming than Merge Replication. The reason is that RDA provides only unidirectional synchronization. That is, the smart client is able to pull down an initial data snapshot and to push locally made changes back to the server. These changes are applied without regard for conflicts with changes made on the server or by other clients; the smart client changes simply overwrite the server's current state. In the case of RDA, the burden of ensuring non-conflicting updates falls on the system designer/developer. There is a similar issue with regard to downloading new or modified records from the server. When RDA pulls data down from the server, it simply issues a SQL statement; there is no implicit concept of getting only new or modified records. Individual systems must be specifically designed to manage these concerns.

Note   The remainder of this article assumes a basic understanding of RDA. To become familiar with RDA or to get a refresher, check out Introducing RDA.

Making the Choice

The combination of Merge Replication and RDA offers smart client developers tremendous choice and power. Of course both choice and power bring the burden of responsibility. In the case of data replication to smart clients, the responsibility is to choose the most appropriate tool and use that tool efficiently within the required problem space. The remainder of this article discusses the factors involved in making the choice and provides some guidance as to which solutions to apply in various circumstances.

Considerations of Merge Replication

As we know, Merge Replication provides the ability to manage and synchronize updates between multiple clients, including the handling of conflict resolution. Providing these capabilities notably impacts both the server and clients. Before looking at the details of designing an efficient replication strategy, we need to understand these impacts.

Server Impacts of Merge Replication

Supporting Merge Replication is a full-time responsibility for the server. The server experiences increased load not only during the actual replication process but anytime a change is made to a replicated table. Once replication is activated, the server must carry additional data and perform additional processing with every modification.

Increased Data

Enabling Merge Replication increases the data stored with each record in the replicated table and adds new tables to the database containing the table. In order to track modifications to the table contents, each row must be uniquely identifiable. As you might expect, Merge Replication uses GUIDs to track uniqueness. Each tracked table has a 16-byte "rowguid" column added, along with a corresponding index on that column. Although 16 bytes doesn't sound like a lot, it can be significant overhead for smaller tables containing only a few columns. To put the resource cost in concrete terms, the rowguid increases the stored table size by 1 KB for every 32 records in the table ((16 bytes in the record + 16 bytes in the index) * 32 = 1024 bytes). The actual increase is slightly larger still due to index management overhead.

Note   If a published table happens to already contain a column of type uniqueidentifier then publishing that table does not result in an attempt to add a second uniqueidentifier column.

In addition to the overhead within the individual table, administrative tables are added to the database containing the replicated table. A database containing Merge Replication publications will have ten basic replication administration tables shared by all published tables within the database, plus one conflict tracking table for each published table. In general, the administrative tables are not extremely large, with the possible exception of the MSmerge_contents table. This table is responsible for tracking all modifications to the published tables and will have an entry for each record modified in a published table. Frequently modified publications can cause this table to grow quite large.

Increased Processing

Enabling Merge Replication also increases the processing cost of every modification made to a published table. Changes to the publication are tracked through the execution of Insert, Update, and Delete triggers. Each time a modification is made to a published table, the associated trigger fires to record the modification in the appropriate administrative tables. Although written efficiently, these triggers are non-trivial and notably increase the processing associated with performing table updates.

During the actual synchronization operation, the server is faced with the problem of resolving conflicting updates. This means that each update must be verified to not conflict with prior updates. If there is a conflict, the appropriate conflict resolution scheme must be applied.

Device Impacts of Merge Replication

The device impact of Merge Replication is similar to that of the server, although, as one might expect, on a smaller scale. When a table is replicated to the smart device, the table carries three extra columns: the rowguid uniqueidentifier created on the server to uniquely identify each row, a column to track the current generation of the row, and a column to track the current lineage. The lineage column can potentially grow large, as it is a varbinary(510), but is usually much smaller than the potential maximum of 510 bytes. SQL Server CE tracks the required replication data entirely within the replicated table and therefore does not create any administrative tables.

SQL Server CE does not have triggers but does perform trigger-like processing as it automatically modifies the generation and lineage columns when other modifications are made to the table.

How Does RDA's Impact Compare to Merge Replication?

Because RDA does not track conflicts, there is no need to individually track updates on the server. As a result, RDA has literally no impact on the server other than when an actual RDA Push or Pull operation is being performed. The database and tables are not affected in any way by using RDA. In fact, there is no special setup on the server other than the installation of SQL Server CE Server Tools (which is also required by Merge Replication). Even when modifications are being applied to the server, RDA simply performs standard Insert, Update, or Delete operations.

From the client standpoint, RDA's impact is very similar to, although slightly less than, that of Merge Replication. A table created by an RDA Pull has two extra columns added: one integer column to track the operations performed on the row and a variable-length binary column to track the details of the modification. These columns are automatically managed by SQL Server CE as part of any modification operations.

Determining Replication Needs

The common response of people first learning about the SQL Server replication options is that their application requires the bidirectional update and conflict resolution capabilities of Merge Replication. The feeling is that Merge Replication "automatically" handles any issues with regard to updates by different clients. In reality there is very little that is "automatic" about any replication solution. Designing a replication solution requires careful planning and an efficient solution is arrived at only by carefully evaluating replication requirements.

Asking the Right Questions

Arriving at the appropriate replication architecture is simply a matter of asking the right questions. Figure 1 shows the questions and corresponding decision path I recommend for designing a replication solution.

Figure 1. Choosing a replication technology

With so much to consider, how does one know where to start? Basically the solution comes down to attacking each of the points individually. Let's go through the questions in Figure 1 one-by-one and see how each affects the implementation of a replication strategy.

A Question of Efficiency

In all but the simplest scenarios, RDA will almost always require more programming than Merge Replication. On the other hand, Merge Replication has a greater impact on the server than a well-architected RDA solution in all but the most complex replication scenarios. This leads us to the first question.

Is increased efficiency worth extra programming?

This is a basic yet essential question. If your goal is only to share data between a server and one or more smart clients and you are not interested in achieving an efficient solution, then just go with a basic Merge Replication architecture.

I don't generally recommend choosing Merge Replication to avoid the work of planning a replication strategy, but sometimes one is just looking for the easiest solution to implement. This can be particularly useful for demos and proof-of-concept scenarios.

Efficiently Replicating Read-Only Data

Once you decide you are interested in producing an efficient solution, you must then determine whether the data will be modified by the smart client. Whether or not the smart client updates the data has more effect on replication design than any other point. Read-only data is almost always easier to replicate and manage than client-updatable data. In smart client read-only scenarios, RDA tends to provide a significantly more efficient solution than Merge Replication. Once the determination is made that the client requires read-only access to the data, the exact replication strategy can be determined.

Is the server data set small or infrequently modified?

Small or infrequently modified data sets are a perfect opportunity to take advantage of RDA. The change tracking overhead of Merge Replication provides little value and unnecessarily increases server data and processing overhead. The lightweight nature of an RDA Pull to download new or modified data provides an efficient and easy-to-implement architecture.

Applications interested in the data can simply issue a Pull to retrieve the initial data snapshot. Anytime the client requires a fresh copy of the data, the local table can be dropped and recreated with a fresh Pull.

The problem becomes more complex when server data changes frequently and the data set is too large to reasonably download in its entirety. This is especially true in cases where only a few records may have changed. This is where the question as to the types of modifications comes into play.

Are server modifications primarily additions and deletions?

For scenarios where the server is not expected to be modifying data in place but will simply add or delete records, RDA continues to provide an efficient option if you are willing to make some slight modifications to the replicated table.

Any table can easily be made to track inserted and deleted records by adding an identity column and a delete trigger. An identity column provides each row with a value that is unique within the table, with each newly added record receiving an identity value greater than all previously added records. For maximum lookup efficiency, be sure to index the identity column.

CREATE TABLE ExampleData (
   DataId int IDENTITY (1, 1) NOT NULL ,
   col1 nvarchar (50) NOT NULL ,
   col2 decimal(18, 0) NOT NULL ,
   PRIMARY KEY (DataId)
)

Note   For a more detailed discussion of using identity columns, check out the section "Make Identifying New and Changed Records Easy" in my previous article Architecting Server Resources for Efficient Smart Client Consumption.

With the identity column, smart clients can use RDA to simply Pull any newly added records by selecting those with an identity value larger than the largest identity value already on the smart client. The following stored procedure facilitates this.

CREATE PROCEDURE get_newrows_exampledata 
   @previous_id as int  = 0
AS
SELECT DataId, col1, col2
   FROM exampledata
   WHERE DataId > @previous_id
   ORDER BY DataId

To download the initial set of data, the smart client calls this stored procedure with no arguments, thereby receiving all records. To retrieve new records, the smart client simply passes the largest identity value received in the previous download. The smart client then merges the new records into the existing smart client data set.

Note   An RDA Pull always creates a new table on the smart client. The merge process requires that the smart client Pull the new records into a temporary intermediate table in SQL Server CE and then explicitly merge the new records into the original SQL Server CE table.
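
If the intermediate table is pulled with tracking turned off, the device-side merge the note describes can be sketched with two simple statements. The ExampleData_New table name here is purely illustrative:

-- Copy the newly pulled rows into the permanent local table,
-- then discard the intermediate table
INSERT INTO ExampleData (DataId, col1, col2)
   SELECT DataId, col1, col2 FROM ExampleData_New

DROP TABLE ExampleData_New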

To track deletes on the server, we can create a simple two-column table of deleted records and then put a delete trigger on the tracked table. The first column is an identity column for the delete table itself, uniquely identifying each delete; the second is the identity of the row deleted from the tracked table. Here's the SQL to create the deleted records table.

CREATE TABLE ExampleData_Deleted (
  DeleteId int IDENTITY (1, 1) NOT NULL ,
  DataId int NOT NULL ,
  PRIMARY KEY (DeleteId)
)

With the ExampleData_Deleted table in place, we can put a trigger on the ExampleData table to automatically insert the id of any deleted records.

CREATE TRIGGER TrackExampleDataDeletes ON ExampleData FOR DELETE AS
INSERT INTO ExampleData_Deleted (DataId)
SELECT DataId FROM deleted

With the delete table and trigger in place, identifying deleted records becomes easy.

CREATE PROCEDURE get_deletedrows_exampledata 
   @previous_delete_id as int  = 0
AS
SELECT DeleteId, DataId
   FROM ExampleData_Deleted
   WHERE DeleteId > @previous_delete_id
   ORDER BY DeleteId

The smart client simply calls get_deletedrows_exampledata to retrieve the list of deleted records. Because the ExampleData_Deleted table has an identity column, smart clients can request new deletes using the same technique used to retrieve new records from the ExampleData table: calling get_deletedrows_exampledata initially with no arguments to get the complete delete list, then calling it each successive time with the largest DeleteId value received on the previous sync. The smart client can then apply the deletes to its local SQL Server CE table.

Note   Clients downloading the list of deleted records for the first time may receive the id values of records that were deleted before the client's initial data download. For this reason, the smart client should be coded so that an attempt to delete a non-existent record is not treated as an error.

Although this solution requires the addition of special triggers and tables to the server database along with some extra smart client programming, it provides a much more efficient and lightweight synchronization solution than using Merge Replication and is well worth the effort in systems requiring maximum efficiency.

As server data modifications become more complex, involving frequent updates to existing records rather than simple inserts and deletes, solving the problem using the efficiencies of RDA becomes notably more difficult. If you are facing this problem, you need to ask one more question.

Is increased efficiency worth significantly increased programming?

In cases where the server frequently modifies existing records, the identity column and delete trigger aren't enough; updates to individual records must be tracked. Accurately managing this manually can be complicated, so in most cases I recommend using Merge Replication. In most systems, the server resource cost of Merge Replication is justified when compared to the developer effort required to adequately solve the problem by hand.

Since this is an article about efficient replication architectures, I'll touch on how to build a more lightweight solution. Again, I would encourage most readers to go with Merge Replication.

Any solution we build will need to track when a record is updated, so we add a datetime column that is updated each time a record changes.

CREATE TABLE ExampleData (
   DataId int IDENTITY (1, 1) NOT NULL ,
   col1 nvarchar (50) NOT NULL ,
   col2 decimal(18, 0) NOT NULL ,
   last_modified datetime DEFAULT GETDATE() ,
   PRIMARY KEY (DataId)
)

The DEFAULT GETDATE() clause automatically sets the last_modified value of any newly added record to the current date/time. The challenge is in keeping last_modified up to date when updates occur to the table. One relatively simple way to do this is to use a stored procedure to perform updates.

CREATE PROCEDURE update_exampletable 
                 @col1_value as nvarchar(50), 
                 @col2_value as decimal(18, 0), 
                 @idcolumn_value as int
AS
UPDATE ExampleData 
    SET col1=@col1_value,
        col2=@col2_value,
        last_modified=GETDATE()
    WHERE DataId = @idcolumn_value

This procedure automatically updates the last_modified value with each change. The client can then pull down the list of new and modified rows with the following stored procedure.

CREATE PROCEDURE get_modifiedrows_exampledata 
   @lasttime as DATETIME  = '19000101'
AS
SELECT DataId, col1, col2, last_modified
   FROM exampledata
   WHERE last_modified > @lasttime
   ORDER BY last_modified

Calling the procedure with no arguments downloads the complete table. Successive calls retrieve the list of new and updated records by passing the last_modified time of the most recent record previously downloaded.

Of course, the problem with maintaining the last_modified value through a stored procedure is that all updates must be made through that stored procedure. If a developer inadvertently issues a raw Update statement, or someone manually modifies a record through SQL Query Analyzer, the column will not be updated unless the statement specifically sets it.

A more reliable way to manage the last_modified value is to put an Update trigger on the table that automatically updates the last_modified column with each change, or that tracks each change in a "changed record" table.
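
As a sketch of the trigger approach (the trigger name is illustrative), the following updates last_modified for every row touched by an Update statement. It relies on the RECURSIVE_TRIGGERS database option being off, which is the SQL Server default, so the trigger's own update does not re-fire it:

-- Stamp the modification time on every updated row
CREATE TRIGGER TrackExampleDataUpdates ON ExampleData FOR UPDATE AS
UPDATE ExampleData
   SET last_modified = GETDATE()
   WHERE DataId IN (SELECT DataId FROM inserted)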

As mentioned at the start of this section, the manual solution will be more efficient in terms of server resources, but few developers should use it. The data storage is more compact and the triggers are far less resource intensive than those of Merge Replication, but the combined client and server programming effort required to implement this is substantial, as the smart client must manually apply all inserts, updates, and deletes.

Efficiently Replicating Client Modified Data

So far we've covered the choices for smart client read-only data. That was the easy part. As noted earlier, replicating client-updatable data tends to be much more involved than read-only data. This does not necessarily mean that client-modified data requires Merge Replication. When it comes to managing client-updatable data, the key to efficient design is minimizing data overlap, which brings us to our next question.

Is the data set unique?

I encourage readers to examine any smart client modified data to identify uniquely updatable sets for each client. More often than not, data can easily be grouped into unique sets for each client. The geography covered by a delivery driver, the personnel reporting to a manager, or the customers served by a salesperson are examples of data that are commonly unique to a particular user. Identifying these unique data sets provides the opportunity to greatly improve synchronization efficiency.

With each client operating on a unique data set, replication can be easily and efficiently handled using RDA. The client need only perform a Pull to download the initial data set into SQL Server CE. SQL Server CE automatically tracks any locally made changes, and a Push uploads those changes back to SQL Server.

The key point is that RDA applies the updates without verifying that the server data hasn't changed since the download, so it is essential that the data set be unique. If the data set is not inherently unique, we need to ask another question.

Are overlaps limited and predictable?

In some cases, there is no inherent uniqueness to the data modified by each smart client, and clients therefore must update overlapping records - in other words, two or more clients are likely to modify the same replicated record. This is a situation where Merge Replication does very well. Merge Replication is particularly useful if the updates affect a large percentage of the overall data set or are difficult to predict.

It is, however, unusual for multiple smart clients to have large segments of overlapping updatable data. Many factors, including geography and the responsibilities of individual users, tend to limit the scope of likely updates. These factors offer the opportunity to create a more efficient solution using RDA.

The key to success with RDA is non-overlapping updates. By understanding the data set on which an individual smart client operates, an efficient RDA solution can be applied through the addition of a smart client staging table in the SQL Server database. The purpose of the table is to contain a unique set of records for each smart client; it is simply a de-normalized version of the data to be replicated.

The staging table has the same columns as the original table, with one extra column added to identify the individual smart client by user name, device id, or similar. Data from the original table is copied to this new table, with one copy of each record for each smart client who may access or modify the record. Each record is uniquely identified by a compound primary key made up of the data's original primary key and the smart client identity. This results in a table containing data that is unique to each smart client and can therefore be safely modified by each.
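
Continuing with the ExampleData table, such a staging table might be sketched as follows. The table name and the ClientId column are illustrative choices, not requirements:

-- One copy of each ExampleData row per smart client,
-- keyed by the client identity plus the original primary key
CREATE TABLE ExampleData_Staging (
   ClientId nvarchar (20) NOT NULL ,
   DataId int NOT NULL ,
   col1 nvarchar (50) NOT NULL ,
   col2 decimal(18, 0) NOT NULL ,
   PRIMARY KEY (ClientId, DataId)
)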

Each smart client then performs a Pull against this new table, downloading only those records carrying the smart client's identifier. The smart client Pushes modifications back to this table, and those modifications are later rolled back into the original table, making them available to the rest of the system.

Note   My previous article Architecting Server Resources for Efficient Smart Client Consumption discusses this topic further in the section "Consider De-Normalizing Some Data". More importantly, that article contains a complete sample demonstrating how to implement such a solution.

The RDA solution provides a tremendously efficient mechanism for replicating this data. It does, however, require more programming than the Merge Replication solution. The key to success is that the overlapping client updates are limited in scope and predictable. If clients share more than a few records, or if it is difficult to know just which records a particular client will update, Merge Replication is the way to go. In such cases, the complexity of a home-grown solution is likely to cancel any potential efficiency and may significantly increase the likelihood of errors.

Guidelines for Efficiency

Choosing the right technology is the first and most important step in developing an efficient replication solution. However, once the appropriate technology is chosen, one must still be sure to use that technology efficiently. Let's look at the efficiency considerations of each.

Merge Replication

As noted throughout, there are several scenarios that justify the overhead of Merge Replication. Merge Replication is a very powerful tool providing an out-of-the-box solution to many difficult replication problems. Deciding to use Merge Replication doesn't mean that we are completely at its mercy. There are several things one can do to create a more efficient Merge Replication solution.

Take Only What You Need

The most important thing one can do when designing a Merge Replication solution is to identify exactly which tables are required by the smart client. Avoid the urge to duplicate the entire server database on the smart clients. This is an important point for many reasons. Smart device clients have limited storage and CPU capacity and are not capable of effectively managing data sets on the scale of enterprise servers. The limitations of smart devices prevent SQL Server CE from employing many of the optimizations that are central to SQL Server's large-scale performance. And the low bandwidth of wide-area wireless communications makes moving large blocks of data time consuming and prone to failure.

Apply Filters

This is really a continuation of the previous point of taking only what one needs. Filters apply criteria that limit the data downloaded by a specific publication. An effective smart client replication solution demands that the time be spent to minimize the data as much as possible.

There are a number of different kinds of filters, all of which help reduce the data included in a publication. Row filters apply a where clause to limit which records are downloaded. Column filters are useful in cases where only certain columns are needed by the smart client. Dynamic filters are useful for producing different publication content for individual smart clients by executing a where clause containing a function. Although filters increase the server's cost of generating the publication, the reduced data volume is almost always more important when replicating to a smart client.
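
As a sketch of a dynamic filter, the article below publishes only the rows tagged with the subscriber's own login. The publication name and the SalesPerson column are illustrative; the filter clause itself is just a where-clause fragment evaluated per subscriber:

-- Each subscriber receives only its own rows
exec sp_addmergearticle 
   @publication = N'ExampleDataPub',
   @article = N'ExampleData',
   @source_object = N'ExampleData',
   @subset_filterclause = N'SalesPerson = SUSER_SNAME()'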

When creating a Row or Dynamic filter, indexing is essential. These filters operate by effectively applying a where clause against one or more columns in the replicated table. Failure to provide indexes on the filtered columns substantially increases the overhead of replicating the data.

Seeing filters mentioned in a replication efficiency article may seem strange to some; it's not uncommon to see Merge Replication efficiency recommendations that specifically advise against applying filters because filters increase the server processing required by replication. The difference is that articles recommending against filters assume that replication occurs between servers over a high-speed wired network rather than to a smart device on a relatively low-bandwidth wireless network. When it comes to smart client communications, few things are more important than minimizing data transmission.

Note   For more information on filters see Row Filters, Column Filters, and Dynamic Filters.

Enable Network Optimizations

When applying filters, Merge Replication offers an option to "Optimize Synchronization". This option causes the server to store extra data in order to reduce the volume of network communications required to perform replication. In most cases this option notably improves replication performance. Figure 2 shows the Optimize Synchronization wizard screen.

Note   For more information on optimizing synchronization, see Optimizing Synchronization.

Figure 2. Optimize Synchronization wizard

Consider Copying an Existing Subscription

When performing the initial subscription to a large data set, the time required to synchronize the smart client may be substantial. Interestingly, the time required to register the subscription on the server is relatively small; most of the time is spent downloading the initial subscription data.

To improve the efficiency of deploying large subscriptions to several smart client devices, SQL Server CE allows a subscription database to be copied to other devices. The subscription database is initially created using the SqlCeReplication class as normal. The database can then be copied to other smart devices. Once copied, the SqlCeReplication class can register the database with SQL Server as a subscriber without incurring the overhead of re-downloading the data.

Note   For more information on copying existing subscriber databases, see Creating a Subscription.

RDA

Unlike Merge Replication, RDA is relatively simple in nature, so there aren't many efficiency concerns to address. Basically, RDA just executes simple SQL statements; therefore the guidelines for using RDA efficiently are the same as those for any database query: focus on writing concise, well-structured SQL and be sure to index frequently searched columns.
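
For example, if smart clients routinely Pull ExampleData filtered on col1, a server-side index along these lines (the name is illustrative) keeps the query behind the Pull inexpensive:

-- Index the column the Pull's where clause searches
CREATE INDEX IX_ExampleData_col1 ON ExampleData (col1)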

The only other concern is data volume. As is the case with all smart device communications, do not download any more data than necessary. Tightly filter data so that only the records and columns of interest are downloaded.

Conclusion

Data replication is central to many smart client applications. Choosing an inappropriate solution not only affects the smart client application but in many cases will negatively impact the overall performance of the enterprise database. When developing a replication strategy, be sure to consider all of the issues, and don't be afraid to do a little extra work up front to improve the performance and efficiency of your replication solution.
