Large Data Strategies
November 7, 2001
As I mentioned in a previous guest appearance on this column, I've been working on a file storage XML Web Service called ColdStorage, along with Marcelo Uemura and David Willson. We released our first version of that sample service a few weeks ago. One of the core issues we ran into during its design was how to handle large data transactions using SOAP. We discussed a few options and eventually settled on a manual chunking scheme. This is a topic that I think bears some examination. As Web Services become more prevalent, the complexity of the tasks they perform is bound to increase. Correspondingly, the amounts of data they work with will increase as well. How you choose to handle the transfer of that data can have a direct impact on the usability of your Web Service.
Definitions and Design Issues
By large data, I am referring to multi-megabyte to potentially multi-gigabyte streams of data. The source is irrelevant. We might be talking about video clips, legal documents, or medical images. The data doesn't even necessarily need to be in a file form. Maybe you are processing satellite radio data searching for E.T., or modeling drug interactions with cancer cells in a distributed environment. Volume of data is the single characteristic that ties together these disparate sources.
Designing for high volumes of data means that you'll need to consider some issues more closely than you might have otherwise. Here is a short list, in no particular order:
Performance and Usage Scenarios
By their very nature, large data objects magnify overhead costs that might have been overlooked before. For example, encoding a 20 KB array parameter creates little more than a minor hiccup on a performance test, whereas encoding a 20 MB array parameter is definitely noticeable. Delivering large data on a secure channel means encryption costs in addition to encoding.
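To put a rough number on the encoding cost: SOAP toolkits commonly serialize a byte-array parameter as base64 text, which produces four characters for every three bytes of input. The sketch below uses Python purely for illustration (it is not ColdStorage code), and the function name is invented for this example.

```python
import base64

def encoded_size(raw: bytes) -> int:
    """Return the number of characters needed to carry raw as base64 text."""
    return len(base64.b64encode(raw))

small = bytes(20 * 1024)          # 20 KB payload of zero bytes
large = bytes(20 * 1024 * 1024)   # 20 MB payload of zero bytes

small_encoded = encoded_size(small)
large_encoded = encoded_size(large)

# Base64 output is 4 * ceil(n / 3) characters for n input bytes,
# so both payloads grow by roughly one third before any XML wrapping.
print(small_encoded, large_encoded)
```

The relative growth is the same in both cases. What changes is the absolute cost: a few extra kilobytes in one case, several extra megabytes to encode, transmit, and decode in the other.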
How you respond to those performance issues is really dependent on how you expect customers to use your Web Service. A well-planned design will consider the usage scenarios up front. For example, will the Web Service be used well behind the scenes in a B2B scenario, or will it be tied to a user interface? Obviously, this type of data transfer is a candidate for an asynchronous design on the client side. If there is a user interface driving the Web Service, as was the case with the ColdStorage sample, few customers will want to sit idly by while a data transfer ties up their machine for hours.
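One way to picture an asynchronous client is to hand the transfer to a worker thread and let the calling thread carry on. This is a minimal Python sketch under stated assumptions: the upload function only simulates a slow transfer, and its name, the chunk size, and the sleep-based latency are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def upload(data: bytes, chunk_size: int, progress: list) -> int:
    """Stand-in for a slow transfer: 'sends' data piece by piece."""
    sent = 0
    for start in range(0, len(data), chunk_size):
        piece = data[start:start + chunk_size]
        time.sleep(0.01)        # simulated network latency (hypothetical)
        sent += len(piece)
        progress.append(sent)   # a real UI would update a progress bar here
    return sent

progress = []
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(upload, bytes(10_000), 1_000, progress)
    # The calling thread is free to keep the user interface responsive here.
    total = future.result()     # block only at the point the result is needed
```

The same shape applies whether the worker is a thread, a callback-based proxy, or a background process. The point is that the user interface never waits on the transfer directly.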
Is timing a significant factor for your Web Service? For example, does the Web Service method call need to be completed in real time, or can the delivery of results be delayed and delivered at a later time?
Catastrophic Failure and Error Recovery
Timeouts and broken connections can be particularly painful if the data is sent as one large block. If a timeout or connection failure occurs during the transfer of large data, what is the best mechanism for reestablishing the connection and picking up where you left off, rather than starting over from scratch?
Capacity planning with large data is difficult. Uploading or downloading an extremely large file will keep a TCP/IP connection open for the duration of the transfer. Do that across a large number of users, and scalability falls through the floor. Modeling real-world performance when data sizes range from a few kilobytes to multiple megabytes can give you results that are not indicative of reality. You will need some firm statistics showing what the true average sizes of your incoming and outgoing data are.
In general, you have to assume that the transport mechanism that underlies SOAP can handle the transfer of large data. After all, your test team has enough on its hands verifying that the Web Service itself is functioning properly. Issues still remain for testing, however. For example, say your Web Service routinely handles data that is multiple megabytes in size. Does that mean your test team must test with files of the same size, or is it okay if they work with smaller pieces of data to return faster test results?
The answers to those issues are dependent on what your Web Service does with those large blocks of data. At a minimum, while scheduling the development effort of the service, pad your estimated test times to allow for long delays.
Is an XML Web Service the Best Tool for This Job?
There is no point in using a screwdriver to pound nails. It takes forever, and it messes up your screwdriver. If all you want to do is transfer raw bits up to a server, there are lower-level protocols that can achieve this much more efficiently—FTP or WebDAV, for example. This may change in the future, but for now, there is no denying the reality: SOAP over HTTP introduces overhead.
One point to consider: What happens to the data once it gets on the server? Do you want to attach attributes to the data and archive it in a SQL database? If you do, it means additional server-side code. If you plan on using some other transfer method, how will your customers know where to connect to, what account to use, and where to drop the files?
As questions of this sort mount up, it becomes more apparent that there are other costs associated with the fastest upload schemes. Granted, tools are available to handle the routing of large data files to the appropriate destination (forgive me while I insert a shameless Microsoft® BizTalk™ Server plug here). The point to take away here is, when you send a large data object directly to a Web Service, you are hand-delivering the data to the code logic that knows explicitly what to do with it.
So, let's assume you've made peace with the design issues related to large data objects, and have decided that an XML Web Service is the appropriate way to send or receive large blocks of data. Where do we go from here?
There are a couple of basic approaches to large data transfer. You can pass it as a buffer of bytes, or you can offload the transfer to a more efficient mechanism.
The buffer approach boils down to passing raw data in the form of an array of bytes as one of the parameters of a Web Service method. Typically, one of the other parameters will provide information about how large the buffer is.
This is the easiest approach, but is not without hazards. From a Web Service consumer perspective, you don't really know the maximum size of the block being sent. Unless you establish some arbitrary maximum size that you will accept, you run the risk of running out of memory or driving system performance into the ground as the virtual memory on your machine is consumed.
From the server perspective, the risk of running out of memory is not as strong, but you still need to establish an arbitrary cap on the size of the file. If you don't, you open yourself to a potential denial-of-service attack that uses excessively large files. The other danger comes from customers overstating the size of the data buffer they are providing, causing potential crashes or security problems if you are not careful.
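Both server-side checks can be sketched in a few lines. This is an illustrative Python sketch, not the ColdStorage implementation: the cap value, the exception type, and the function name are assumptions made for the example.

```python
MAX_UPLOAD_BYTES = 4 * 1024 * 1024   # hypothetical cap chosen by the service

class UploadRejected(Exception):
    """Raised when an upload fails validation."""

def validate_upload(buffer: bytes, declared_size: int) -> bytes:
    """Reject oversized uploads and never trust the caller's stated size."""
    if declared_size < 0 or declared_size > MAX_UPLOAD_BYTES:
        raise UploadRejected("declared size is outside the accepted range")
    if len(buffer) > MAX_UPLOAD_BYTES:
        raise UploadRejected("buffer exceeds the service cap")
    if declared_size != len(buffer):
        # An overstated (or understated) size is treated as an error
        # rather than being used to index into the buffer.
        raise UploadRejected("declared size does not match buffer length")
    return buffer
```

A call such as `validate_upload(b"abc", 3)` returns the buffer unchanged, while `validate_upload(b"abc", 10)` raises `UploadRejected` instead of letting later code read past the data that actually arrived.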
Speaking of arbitrary caps on parameter size, here's a tip for those using the .NET Framework. Under the version of the .NET Framework that shipped with Visual Studio .NET Beta 2, the maximum SOAP message size for a Web Service was capped at 4 MB. You can increase that cap by changing the maxRequestLength setting in the httpRuntime section of the Web.Config file for your Web Service. In the example below, the maxRequestLength has been set to 8096 KB:
<configuration>
  <system.web>
    <httpRuntime maxRequestLength="8096"
                 useFullyQualifiedRedirectUrl="true"
                 executionTimeout="45"/>
  </system.web>
</configuration>
There are variations on the "one big buffer" scenario. One strategy to consider here is the approach we took with ColdStorage: break the large data blob up into smaller chunks of a fixed size for upload and download. Here are some benefits of this strategy:
- Each method call uses a fixed maximum amount of memory.
- System resources are not hogged for an extreme period of time by any single call. Instead, breaking one large request into smaller pieces allows other smaller requests to be serviced. Overall throughput in terms of total requests serviced should increase.
- It offers a mechanism for error recovery. If any individual chunk fails during delivery, you can resend just that chunk.
- It should provide better scalability in a server farm scenario. Any available server can service incoming requests, potentially taking better advantage of under-utilized hardware.
Here are the downsides to this approach:
- Additional processing time is required to establish connections to the server. In the one big buffer approach, the connection can be held open for the whole transfer, rather than being closed and reestablished for each chunk.
- The consumer of the Web Service needs to be a little more intelligent to manage the upload and download.
- Very large transfers, occurring amid many smaller transfers, can potentially suffer long delays. This scenario is not as bad when the majority of the transfers involve data of roughly the same volume. In such cases, all of the transfers will be delayed uniformly.
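The chunking idea, including resending only the chunk that failed, can be sketched as follows. This Python sketch is illustrative only and does not reproduce the ColdStorage interface: the chunk size, the in-memory `FakeService`, and the `put_chunk` and `assemble` method names are all hypothetical.

```python
CHUNK_SIZE = 64 * 1024   # hypothetical fixed chunk size

class FakeService:
    """In-memory stand-in for the Web Service; drops one chunk on first try."""

    def __init__(self, fail_once_on: int):
        self.chunks = {}
        self.fail_once_on = fail_once_on
        self.calls = 0

    def put_chunk(self, index: int, data: bytes) -> None:
        self.calls += 1
        if index == self.fail_once_on:
            self.fail_once_on = -1   # fail only the first attempt
            raise ConnectionError("simulated dropped connection")
        self.chunks[index] = data

    def assemble(self) -> bytes:
        return b"".join(self.chunks[i] for i in sorted(self.chunks))

def upload_in_chunks(service, data: bytes, max_attempts: int = 3) -> int:
    """Send data in fixed-size chunks, resending only a chunk that fails."""
    count = 0
    for index, start in enumerate(range(0, len(data), CHUNK_SIZE)):
        chunk = data[start:start + CHUNK_SIZE]
        for attempt in range(max_attempts):
            try:
                service.put_chunk(index, chunk)
                break
            except ConnectionError:
                if attempt == max_attempts - 1:
                    raise            # give up after the last attempt
        count += 1
    return count

payload = bytes(range(256)) * 1000   # 256,000 bytes of sample data
service = FakeService(fail_once_on=2)
chunk_count = upload_in_chunks(service, payload)
```

Because every chunk carries its own index, the sketch also hints at why any server in a farm could accept any chunk, provided the chunks land in shared storage. The extra call made for the retried chunk is the connection overhead listed among the downsides.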
Another option you might want to consider is to compress the data before transmitting it up to the server or down to the consumer of the Web Service. Naturally, this strategy limits the usefulness of your Web Service; consumers of the service will need to know the scheme used to compress the data and have access to the software required to compress and uncompress the files.
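As a small illustration of that dependency, here is a Python sketch in which both sides agree on zlib's deflate format. The choice of zlib, the compression level, and the `pack` and `unpack` names are assumptions for this example, not something the ColdStorage sample prescribes.

```python
import zlib

def pack(data: bytes) -> bytes:
    """Compress data before it is placed in the outgoing message."""
    return zlib.compress(data, 6)

def unpack(blob: bytes) -> bytes:
    """Reverse pack(); only works if the sender used the same scheme."""
    return zlib.decompress(blob)

original = b"<record>large data</record>" * 10_000
packed = pack(original)
restored = unpack(packed)

print(len(original), len(packed))
```

Highly repetitive data like this sample shrinks dramatically. Data that is already compressed, such as most video and many image formats, will see little benefit, so the saving has to be weighed against the added requirement on consumers.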
Offloading the Transfer
This approach is a variation on one of the concepts addressed in the W3C SOAP with Attachments specification. That is, rather than sending the actual bytes of the file in the SOAP message, you can send a URL to the file to be downloaded. The recipient of the URL can bind to it separately, outside the world of SOAP, and download it.
This approach works for downloads, but uploading data this way is not possible unless you are calling the Web Service from an HTTP server on the Internet. Otherwise, the Web Service will be unable to resolve the URL when attempting to bind to it.
Another upload possibility is to do a two-stage transfer along these lines:
- The client calls the Web Service requesting to upload a file.
- The Web Service responds with an upload location, information about the protocol to use, and logon credentials.
- Outside the context of the Web Service, you log on to that location using the provided credentials, and complete the transfer using the protocol provided.
- When the transfer is complete, the Web Service routes the file appropriately. To tell when transfer is complete, the service may monitor activity in that location, or could require you to call an UploadComplete method.
This approach could work well if, say, you were directed to log on to a specific FTP server and deposit your file in a location to which you had restricted access.
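The two-stage exchange above can be modeled in a few lines. This Python sketch simulates everything in memory under stated assumptions: the `TransferService` class, the `UploadTicket` fields, the generated paths and credentials, and the dictionary standing in for the FTP drop location are all hypothetical.

```python
import secrets
from dataclasses import dataclass

@dataclass
class UploadTicket:
    location: str   # where to deposit the file
    protocol: str   # protocol to use for the out-of-band transfer
    username: str
    password: str

class TransferService:
    """In-memory model of the Web Service side of a two-stage upload."""

    def __init__(self):
        self.drop_area = {}   # stands in for the restricted drop location
        self.tickets = {}
        self.archive = {}     # stands in for wherever files get routed

    def request_upload(self, filename: str) -> UploadTicket:
        ticket = UploadTicket(
            location=f"/incoming/{secrets.token_hex(8)}/{filename}",
            protocol="ftp",
            username="upload-" + secrets.token_hex(4),
            password=secrets.token_urlsafe(16),
        )
        self.tickets[ticket.location] = ticket
        return ticket

    def upload_complete(self, location: str) -> bool:
        """Route the deposited file; refuse unknown or empty locations."""
        if location not in self.tickets or location not in self.drop_area:
            return False
        self.archive[location] = self.drop_area.pop(location)
        del self.tickets[location]
        return True

service = TransferService()
ticket = service.request_upload("scan.img")
# Out-of-band step: a real client would log on with the ticket's
# credentials and transfer the file using ticket.protocol.
service.drop_area[ticket.location] = b"file bytes"
routed = service.upload_complete(ticket.location)
```

Issuing a fresh location and credentials per request keeps each client confined to its own drop spot, and the explicit completion call spares the service from having to guess when a transfer has finished.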
Other Hybrid Approaches
There are other variations on the theme that don't comply with all of the existing SOAP standards. For example, I've seen methods mentioned on chat groups that would have you send the Web Service a TCP/IP address and a socket to connect to. Transfer of data is then done at the TCP/IP level for dramatically improved performance. The downside to this approach is that it ties up resources on the Web Service and requires trust on the part of the client. Since the primary transfer is done outside the HTTP protocol, it won't work easily through a firewall without extra configuration.
Here's another variant: If you are working in a server-to-server environment, you can potentially pass an IStream pointer as one of the parameters in the Web Service call. The recipient of the pointer can call methods to read data from or write data to the stream—rather than trying to transfer one large blob of data at a time.
The difficulty is that this will not work in a server-to-generic-client model. For this type of communication, the client needs to be able to accept SOAP requests representing the Web Service's calls on that IStream pointer.
In general, performance for this hybrid will not really improve dramatically. You will still end up encoding all the bytes in the file as they are marshaled from one server to the other, but at least the volume of data received will be controllable.
Finally, there is another solution available—using the .NET Framework. If you have control over both the Web Service and the client, you can use the .NET Remoting and Channels Services provided by the framework. Communication between client and server can happen as low as the TCP level, without having to open additional ports in a firewall. Using .NET Remoting, you can achieve higher performance and have access to other advanced features. The trade-off is that this form of data transfer does not use an open standard. It is proprietary to the .NET Framework.
Large data handling in XML Web Services raises a lot of questions, not all of which can be easily answered. As SOAP support grows, some of the performance issues related to handling large amounts of data will hopefully be addressed. In the meantime, there are a number of different approaches you can try. The number of options grows if you are willing to deviate from a pure SOAP implementation. In a perfect world, there would be a standardized way to negotiate the most effective upload and download mechanism between a Web Service and its clients. For now, there are really only two choices: You can accept some of the performance hit for transferring large amounts of data in a SOAP-standard way, or you can blaze your own trail, and try something like one of the options discussed in this article.
At Your Service