Download the Code Sample
Azure Table storage causes a lot of head scratching among developers. Most of their experience with data storage is with relational databases that have various tables, each containing a predefined set of columns, one or more of which are typically designated as identity keys. Tables use these keys to define relationships among one another.

Azure stores information a few ways, but the two that focus on persisting structured data are SQL Azure and Azure Table storage. The first is a relational database and aligns fairly closely with SQL Server. It has tables with defined schema, keys, relationships and other constraints, and you connect to it using a connection string just as you do with SQL Server and other databases.
Azure Table storage, on the other hand, seems a bit mysterious to those of us who are so used to working with relational databases. While you’ll find many excellent walk-throughs for creating apps that use Azure Table storage, many developers still find themselves forced to make leaps of faith without truly understanding what it’s all about.
This column will help those stuck in relational mode bridge that leap of faith with solid ground by explaining some core concepts of Azure Table storage from the perspective of relational thinking. Also, I’ll touch on some of the important strategies for designing the tables, depending on how you expect to query and update your data.
By design, Azure Table services provide the potential to store enormous amounts of data, while enabling efficient access and persistence. The services simplify storage, saving you from jumping through all the hoops required to work with a relational database—constraints, views, indices, relationships and stored procedures. You just deal with data, data, data. Azure Tables use keys that enable efficient querying, and you can employ one—the PartitionKey—for load balancing when the table service decides it’s time to spread your table over multiple servers. A table doesn’t have a specified schema. It’s simply a structured container of rows (or entities) that doesn’t care what a row looks like. You can have a table that stores one particular type, but you can also store rows with varying structures in a single table, as shown in Figure 1.
Figure 1 A Single Azure Table Can Contain Rows Representing Similar or Different Entities
Our typical development procedure with databases is to create them, define tables in them and then, for every table, define a particular structure—specific columns, each with a specified data type—as well as relationships to other tables. Our applications then push data into and pull data out of the tables.
With Azure Table services, though, you don’t design a database, just your classes. You define your classes and a container (table) that one or more classes belong to, then you can save instantiated objects back to the store as rows.
In addition to the properties you need in your classes, each class must have three properties that are critical in determining how Azure Table services do their job: PartitionKey, RowKey and Timestamp. PartitionKey and RowKey are both strings, and there’s an art (or perhaps a science) to defining them so you get the best balance of query and transaction efficiency along with scalability at run time. For a good understanding of how to define PartitionKeys and RowKeys for the most benefit, I highly recommend the PDC09 session “Azure Tables and Queues Deep Dive,” presented by Jai Haridas, which you can watch at microsoftpdc.com/sessions/svc09.
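To make this concrete, here’s a minimal sketch of an entity class. It assumes the Windows Azure StorageClient library, where deriving from TableServiceEntity supplies the required PartitionKey, RowKey and Timestamp properties; the Food type and its properties are hypothetical names chosen to match the Food table example later in this column:

```csharp
using System;
using Microsoft.WindowsAzure.StorageClient;

// A sketch of a table entity. TableServiceEntity provides the
// PartitionKey, RowKey and Timestamp properties the service requires.
public class Food : TableServiceEntity
{
    public string Name { get; set; }
    public string Color { get; set; }

    public Food() { }  // a parameterless constructor is required

    public Food(string foodType, string name)
    {
        PartitionKey = foodType;  // e.g., "Vegetable_Squash"
        RowKey = name;            // must be unique within the partition
        Name = name;
    }
}
```

Any other public read/write properties you add (Name, Color and so on) simply become additional columns on the row.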
Many developers are used to a system of primary keys, foreign keys and constraints between the two. With Azure Table storage, you have to let go of these concepts or you’ll have difficulty grasping its system of keys.
In Azure Tables, the string PartitionKey and RowKey properties work together as an index for your table, so when defining them, you must consider how your data is queried. Together, the properties also provide for uniqueness, acting as a primary key for the row. Each entity in a table must have a unique PartitionKey/RowKey combination.
But you need to consider more than querying when defining a PartitionKey, because it’s also used for physically partitioning the tables, which provides for load balancing and scalability. For example, consider a table that contains information about food and has PartitionKeys that correspond to the food types, such as Vegetable, Fruit and Grain. In the summer, the rows in the Vegetable partition might be very busy (becoming a so-called “hot” partition). The service can load balance the Food table by moving the Vegetable partition to a different server to better handle the many requests made to the partition.
If you anticipate more activity on that partition than a single server can handle, you should consider creating more-granular partitions such as Vegetable_Root and Vegetable_Squash. This is because the unit of granularity for load balancing is the PartitionKey. All the rows with the same PartitionKey value are kept together when load balancing. You could even design your table so that every single entity in the table has a different partition.
Notice that when I suggested fine-tuning the Vegetable PartitionKeys, I placed Vegetable at the beginning of the key, not the end. That’s another mechanism for enabling more efficient queries. Queries to Azure Tables from the Microsoft .NET Framework use LINQ to REST and a context that derives from the WCF Data Services System.Data.Services.Client.DataServiceContext. If you want to find any green squash, you can search in the Vegetable_Squash partition without wasting resources to search the entire table:
var query = _serviceContext.FoodTable.AsTableServiceQuery()
    .Where(c => c.PartitionKey == "Vegetable_Squash" && c.Color == "Green");
A big difference between querying OData (returned by WCF Data Services) and querying against Azure Tables is that string functions are not supported by the table service. If you want to match part of a string, you must use String.CompareTo to inspect its leading characters. If you want to query the entire Vegetable category, for example, you can use the CompareTo method to do a prefix search over the start of the PartitionKey:
var query = _serviceContext.FoodTable.AsTableServiceQuery()
    .Where(c => c.PartitionKey.CompareTo("Vegetable") >= 0
        && c.PartitionKey.CompareTo("Vegetablf") < 0
        && c.Color == "Green");
This would limit the search to only partitions that begin with Vegetable—nothing less and nothing more. (Using Vegetablf rather than Vegetable in the second predicate defines the upper bound, preventing foods in partitions such as Yogurt or VegetableLike from being returned.) In the code sample accompanying this article, you’ll see how I’ve done this replacement dynamically.
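One possible way to compute that upper bound dynamically (a sketch, not necessarily how the accompanying code sample does it) is to increment the last character of the prefix:

```csharp
// Sketch: derive the exclusive upper bound for a prefix search by
// incrementing the prefix's final character, so "Vegetable" becomes
// "Vegetablf". Any PartitionKey p satisfying
// prefix <= p < NextPrefix(prefix) starts with the prefix.
static string NextPrefix(string prefix)
{
    char last = prefix[prefix.Length - 1];
    return prefix.Substring(0, prefix.Length - 1) + (char)(last + 1);
}
```

You’d then use NextPrefix("Vegetable") as the string in the second CompareTo predicate.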
What if you were searching for all green food, regardless of type? Azure would have to scan the entire table. If it’s a large table, Azure throws in another wrench: a query can return only 1,000 rows at a time (or run for only five seconds of processing). When that limit is hit, Azure returns the partial results along with a continuation token, and you have to go back for more. This can be a tedious synchronous process.
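If you do need such a scan, the StorageClient library can handle the continuation tokens for you. A sketch, assuming the same _serviceContext and FoodTable as earlier (verify the details against your SDK version):

```csharp
// Sketch: AsTableServiceQuery() wraps the LINQ query in a
// CloudTableQuery<T>, whose Execute() transparently follows
// continuation tokens while you enumerate the results.
var greenFoods = _serviceContext.FoodTable
    .Where(c => c.Color == "Green")
    .AsTableServiceQuery()
    .Execute();  // enumeration issues follow-up requests as needed

foreach (var food in greenFoods)
    Console.WriteLine(food.Name);
```

This is convenient, but each continuation still costs a round-trip to the service, which is why the parallel approach below can be faster.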
Instead you could execute a number of queries, perhaps iterating through a known list of categories, then building each query:
.Where(c => c.PartitionKey == _category && c.Color == "Green");
Then you can send off all the queries to run in parallel.
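Here’s one hedged sketch of that parallel approach, assuming .NET 4’s Task Parallel Library and a known category list (note that each task should use the query pattern shown earlier; the thread-safety of your context is an assumption you should verify):

```csharp
// Sketch: issue one query per known category and run them in parallel.
var categories = new[] { "Vegetable", "Fruit", "Grain" };

var tasks = categories.Select(category =>
    Task.Factory.StartNew(() =>
        _serviceContext.FoodTable
            .Where(c => c.PartitionKey == category && c.Color == "Green")
            .AsTableServiceQuery()
            .Execute()
            .ToList()))
    .ToArray();

Task.WaitAll(tasks);                                  // wait for all queries
var allGreen = tasks.SelectMany(t => t.Result).ToList();  // merge results
```

Because each query targets a single partition, each one can be served efficiently, possibly by different servers.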
The RowKey property serves a number of purposes. In combination with PartitionKey, it can define uniqueness within a table for each row. For example, I know another Julie Lerman (truly I do). So the RowKey will be critical in differentiating us when we share a PartitionKey of lerman_julie. You can also use RowKey to help with sorting, because it acts as part of an index. So then, what would be useful RowKeys for Julie Lerman the elder (that’s me) and Julie Lerman the younger? A GUID will certainly do the trick for identity, but it does nothing for searches or sorting. In this case, a combination of values would probably be best.
What else differentiates us? We live on opposite ends of the United States, but locations can change so that’s not useful for a key. Certainly our date of birth is different (by more than 20 years) and that’s a static value. But there’s always the chance that another Julie Lerman with my birth date exists somewhere in the world and could land in my database—highly implausible, but not impossible. After all of the deliberation I might go through, birth date may still not be a value on which my application is searching or sorting. So in this case, RowKey might not be part of queries, and a plain-old GUID would suffice. You’ll have to make these kinds of decisions for all of your Azure Tables.
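For illustration, both options from that deliberation might look like this (the birth date is a made-up placeholder, and the composite format is just one possible convention):

```csharp
// Sketch: two RowKey strategies for entities sharing the
// PartitionKey "lerman_julie".
DateTime birthDate = new DateTime(1960, 1, 1);  // placeholder value

// Option 1: sortable prefix plus a GUID suffix for guaranteed uniqueness.
string sortableRowKey = string.Format("{0:yyyyMMdd}_{1}",
    birthDate, Guid.NewGuid());

// Option 2: a plain GUID -- unique, but useless for searching or sorting.
string plainRowKey = Guid.NewGuid().ToString();
```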
There’s much more to learn about defining keys, and factors such as retrieving data, storing data, scalability and load balancing all come into play.
In a relational database, we rely on foreign keys and constraints to define relationships. We certainly could define a foreign key property in a class that refers to another class, but there’s nothing in Azure Table storage to enforce relationships. Your code will still be responsible for that.
This impacts how you perform queries and updates (including inserts and deletes) from tables.
When querying, you can’t perform joins across tables. And when persisting data, you can’t have transacted commands that span partitions or tables. There is, however, a mechanism for working with data in graphs, which is something I pointed out at the beginning of this column—you can store rows of varying schemas in a single table.
If your application requires that users work with contacts and addresses together, you could store the addresses in the same table as the contacts. It would be critical to ensure that the addresses have the same PartitionKey—for example, “lerman_julie.” Also, the RowKey should contain a value that specifies the type or kind of entity, such as “address_12345,” so you can easily differentiate between contact types and address types when querying.
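A sketch of what those two entity types might look like, again assuming the StorageClient’s TableServiceEntity base class (the class and property names are hypothetical):

```csharp
using Microsoft.WindowsAzure.StorageClient;

// Two entity types stored in the same table, tied together by a shared
// PartitionKey; the RowKey prefix distinguishes the kind of row.
public class Contact : TableServiceEntity
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
}

public class Address : TableServiceEntity
{
    public string Street { get; set; }
    public string City { get; set; }
}

// Both rows land in the "lerman_julie" partition:
// new Contact { PartitionKey = "lerman_julie", RowKey = "contact_00001" }
// new Address { PartitionKey = "lerman_julie", RowKey = "address_12345" }
```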
The common PartitionKey ensures that the rows will always stay together to take advantage of a feature called Entity Group Transactions (EGT). This allows a single transaction to carry out operations atomically across multiple entities as long as all the entities have the same PartitionKey value. One of the benefits of EGT with respect to related data is that you can perform a transacted update on all the entities in a single transaction.
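With a TableServiceContext, an EGT is triggered by passing SaveChangesOptions.Batch to SaveChanges. A sketch, where contactRow and addressRow stand for any pending entities that share a PartitionKey such as "lerman_julie" (the table name is hypothetical):

```csharp
// Sketch: batching pending operations into a single Entity Group
// Transaction. All entities in the batch must share one PartitionKey.
_serviceContext.AddObject("ContactsTable", contactRow);
_serviceContext.AddObject("ContactsTable", addressRow);
_serviceContext.SaveChanges(SaveChangesOptions.Batch);  // all-or-nothing
```

If any operation in the batch fails, the entire transaction is rolled back.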
Azure Tables live in the cloud, but for me they began in a fog. I had a lot of trouble getting my head wrapped around them because of my preconceived understanding of relational databases. I did a lot of work (and pestered a lot of people) to enable myself to let go of the RDBMS anchors so I could embrace and truly appreciate the beauty of Azure Tables. I hope my journey will make yours shorter.
There’s so much more to learn about Azure Table services. The team at Microsoft has some great guidance in place on MSDN. In addition to the PDC09 video mentioned earlier, check the resource page on the Azure Storage team blog at blogs.msdn.com/windowsazurestorage/archive/2010/03/28/windows-azure-storage-resources. The team continues to add detailed, informative posts to the blog, and I know that in time, or even by the time this column is published, I’ll find answers to my myriad questions. I’m looking forward to providing some concrete examples in a future Data Points column.
Julie Lerman is a Microsoft MVP, .NET mentor and consultant who lives in the hills of Vermont. You can find her presenting on data access and other Microsoft .NET topics at user groups and conferences around the world. Lerman blogs at thedatafarm.com/blog and is the author of the highly acclaimed book, “Programming Entity Framework” (O’Reilly Media, 2009). Follow her on Twitter.com: julielerman.
Thanks to the following technical experts for reviewing this article: Brad Calder and Jai Haridas
If anyone comes to this article in the future with the same goal of trying to work out whether to go 'Tables and Blobs' or SQL Server, and not feeling comfortable about the whole thing, just a note from me to future readers that I am now doing web apps quite happily and comfortably using this technology, and with no real need for the Azure SQL dbms at all. It takes a while to get into, and the feeling initially is that it is all strange territory, but so far I haven't really come across a real problem that can't be solved using tables.
If you're looking for Jai Haridas' video, see http://channel9.msdn.com/events/PDC/PDC09/SVC09
Thank you so much for the information. I've started a new project and this article was useful for learning about Azure Tables. Thank you again.
After much experimenting with table storage, it's really disappointing because I can't perform any of the aggregate queries, there's no way to bulk export the data, and there's no support on Hadoop. So I'm really not sure what's the use of Azure Table storage.
Julie: I wanted to take a moment to thank you. This article was well written and really helped to quench my thirst for information on the subject. Actually, I think in a way it made me even more thirsty (that's a good thing). I really enjoyed it and I wanted to let you know before I followed the links away from your page. Thank you Julie. Joe Moniz
There is a lot to this topic of data management. I don't see in this article how to have multiple indexes on the same "data block". Does this solution need data redundancy for different needs?
Thank you, very instructive and educational!
It's very informative, Julie Lerman! Thanks a lot...
Thanks for this article about Windows Azure Table Storage!!! This is a first step to understand Azure from a C# dev perspective. @+ rv.
Pat: To answer why use Table storage: For one reason really: scalability. SQL Server just isn't designed to scale like Table storage is. Table storage (if anything) is the closest to NoSQL, Big Data and Hadoop. Of course with SQL Server you can have a disk array SAN with Active/Active to n nodes (not sure if SQL Server Clustering is supported in Windows Azure) but the bottleneck will be the SAN - eventually. This will only ever be an issue if you have heavy data needs, whether that is high traffic or simply a high number of data reads in your applications. However, there is now SQL Server Federation, which uses the sharding pattern, but this will not be suitable for all applications. Simon
It would seem that a hybrid solution is necessary. In my xBase days (long ago!) we maintained external indexes that were just "little" tables to keep track of how to look things up. You might have a color table that just keeps track of the PartitionKeys and RowKeys where you put green stuff. Keeping in sync would be the ole fashioned way... programmatically, with the occasional 'rebuild indexes' method. It also seems that you could likely query attributes in SQL and just return the PartitionKey/RowKeys. A fair question is WHY? Is it cheaper for storage cost using tables? Is that it? Pat NH USA PS Hi Julie...
I really appreciated the content and hope to see your examples soon.
Thanks so much for the detailed explanation of the different keys and how to best use them, its been quite frustrating finding good resources and explanations. Also, great work with the EF !!
Thanks for the article. It's an interesting topic. I think of SQL Azure as a solution in search of a problem. I still can't see how this solves anything. But thanks for the effort.