Data Points - Microsoft Azure Table Storage – Not Your Father’s Database

Article
08/12/2015

July 2010

Volume 25 Number 07

Data Points - Microsoft Azure Table Storage – Not Your Father’s Database

By Julie Lerman | July 2010

Julie Lerman Azure Table storage causes a lot of head scratching among developers. Most of their experience with data storage is with relational databases that have various tables, each containing a predefined set of columns, one or more of which are typically designated as identity keys. Tables use these keys to define relationships among one another.Azure stores information a few ways, but the two that focus on persisting structured data are SQL Azure and Azure Table storage. The first is a relational database and aligns fairly closely with SQL Server. It has tables with defined schema, keys, relationships and other constraints, and you connect to it using a connection string just as you do with SQL Server and other databases.

Azure Table storage, on the other hand, seems a bit mysterious to those of us who are so used to working with relational databases. While you’ll find many excellent walk-throughs for creating apps that use Azure Table storage, many developers still find themselves forced to make leaps of faith without truly understanding what it’s all about.

This column will help those stuck in relational mode bridge that leap of faith with solid ground by explaining some core concepts of Azure Table storage from the perspective of relational thinking. Also, I’ll touch on some of the important strategies for designing the tables, depending on how you expect to query and update your data.

Storing Data for Efficient Retrieval and Persistence

By design, Azure Table services provides the potential to store enormous amounts of data, while enabling efficient access and persistence. The services simplify storage, saving you from jumping through all the hoops required to work with a relational database—constraints, views, indices, relationships and stored procedures. You just deal with data, data, data. Azure Tables use keys that enable efficient querying, and you can employ one—the PartitionKey—for load balancing when the table service decides it’s time to spread your table over multiple servers. A table doesn’t have a specified schema. It’s simply a structured container of rows (or entities) that doesn’t care what a row looks like. You can have a table that stores one particular type, but you can also store rows with varying structures in a single table, as shown in Figure 1.

image: A Single Azure Table Can Contain Rows Representing Similar or Different Entities

Figure 1 A Single Azure Table Can Contain Rows Representing Similar or Different Entities

It All Begins with Your Domain Classes

Our typical development procedure with databases is to create them, define tables in them and then, for every table, define a particular structure—specific columns, each with a specified data type—as well as relationships to other tables. Our applications then push data into and pull data out of the tables.

With Azure Table services, though, you don’t design a database, just your classes. You define your classes and a container (table) that one or more classes belong to, then you can save instantiated objects back to the store as rows.

In addition to the properties you need in your classes, each class must have three properties that are critical in determining how Azure Table services do their job: PartitionKey, RowKey and TimeStamp. PartitionKey and RowKey are both strings, and there’s an art (or perhaps a science) to defining them so you get the best balance of query and transaction efficiency along with scalability at run time. For a good understanding of how to define PartitionKeys and RowKeys for the most benefit, I highly recommend the PDC09 session “Azure Tables and Queues Deep Dive,” presented by Jai Haridas, which you can watch at microsoftpdc.com/sessions/svc09.

PartitionKeys and RowKeys Drive Performance and Scalability

Many developers are used to a system of primary keys, foreign keys and constraints between the two. With Azure Table storage, you have to let go of these concepts or you’ll have difficulty grasping its system of keys.

In Azure Tables, the string PartitionKey and RowKey properties work together as an index for your table, so when defining them, you must consider how your data is queried. Together, the properties also provide for uniqueness, acting as a primary key for the row. Each entity in a table must have a unique PartitionKey/RowKey combination.

But you need to consider more than querying when defining a PartitionKey, because it’s also used for physically partitioning the tables, which provides for load balancing and scalability. For example, consider a table that contains information about food and has PartitionKeys that correspond to the food types, such as Vegetable, Fruit and Grain. In the summer, the rows in theVegetable partition might be very busy (becoming a so-called “hot” partition). The service can load balance the Food table by moving the Vegetable partition to a different server to  better handle the many requests made to the partition.

If you anticipate more activity on that partition than a single server can handle, you should consider creating more-granular partitions such as Vegetable_Root and Vegetable_Squash. This is because the unit of granularity for load balancing is the PartitionKey. All the rows with the same PartitionKey value are kept together when load balancing. You could even design your table so that every single entity in the table has a different partition.

Digging Deeper into PartitionKeys and Querying

Notice that when I suggested fine-tuning the Vegetable PartitionKeys, I placed Vegetable at the beginning of the key, not the end. That’s another mechanism for enabling more efficient queries. Queries to Azure Tables from the Microsoft .NET Framework use LINQ to REST and a context that derives from the WCF Data Services System.Data.Services.Client.DataServiceContext. If you want to find any green squash, you can search in the Vegetable_Squash partition without wasting resources to search the entire table:

var query = _serviceContext.FoodTable.AsTableServiceQuery()
.Where(c => c.PartitionKey=="Vegetable_Squash"&& c.Color == "Green");

A big difference between querying OData (returned by WCF Data Services) and querying against Azure Tables is that string functions are not supported. If you want to search part of a string, you must use String.CompareTo to inspect the beginning characters of the string. If you want to query the entire Vegetable category, however, you can use the CompareTo method to do a prefix search over the start of the PartitionKey:

var query = _serviceContext.FoodTable.AsTableServiceQuery()
            .Where(c => c.PartitionKey.CompareTo("Vegetable")>=0
            && c.PartitionKey.CompareTo("Vegetablf")<0
            && c.Color == "Green");

This would limit the search to only partitions that begin with Vegetable—nothing less and nothing more. (Using Vegetablf rather than Vegetable in the second predicate defines the upper bound, preventing foods in partitions such as Yogurt or VegetableLike from being returned.) In the code sample accompanying this article, you’ll see how I’ve done this replacement dynamically.

Parallel Querying for Full Table Scans

What if you were searching for all green food, regardless of type? Azure would have to scan through the entire table. If it’s a large table, Azure throws in another wrench: It can return only 1,000 rows at a time (or process for 5 seconds). Azure will return those results along with a continuation key, then go back for more. This can be a tedious synchronous process.

Instead you could execute a number of queries, perhaps iterating through a known list of categories, then building each query:

_serviceContext.FoodTable.AsTableServiceQuery()
.Where(c => c.PartitionKey == _category && c.Color == "Green");

Then you can send off all the queries to run in parallel.

More Design Considerations for Querying

The RowKey property serves a number of purposes. In combination with PartitionKey, it can define uniqueness within a table for each row. For example, I know another Julie Lerman (truly I do). So the RowKey will be critical in differentiating us when we share a PartitionKey of lerman_julie. You can also use RowKey to help with sorting, because it acts as part of an index. So then, what would be useful RowKeys for Julie Lerman the elder (that’s me) and Julie Lerman the younger? A GUID will certainly do the trick for identity, but it does nothing for searches or sorting. In this case, a combination of values would probably be best.

What else differentiates us? We live on opposite ends of the United States, but locations can change so that’s not useful for a key. Certainly our date of birth is different (by more than 20 years) and that’s a static value. But there’s always the chance that another Julie Lerman with my birth date exists somewhere in the world and could land in my database—highly implausible, but not impossible. After all of the deliberation I might go through, birth date may still not be a value on which my application is searching or sorting. So in this case, RowKey might not be part of queries, and a plain-old GUID would suffice. You’ll have to make these kinds of decisions for all of your Azure Tables.

There’s much more to learn about defining keys, and factors such as retrieving data, storing data, scalability and load balancing all come into play.

Rethinking Relationships

In a relational database, we rely on foreign keys and constraints to define relationships. We certainly could define a foreign key property in a class that refers to another class, but there’s nothing in Azure Table storage to enforce relationships. Your code will still be responsible for that.

This impacts how you perform queries and updates (including inserts and deletes) from tables.

When querying, you can’t perform joins across tables. And when persisting data, you can’t have transacted commands that span partitions or tables. There is, however, a mechanism for working with data in graphs, which is something I pointed out at the beginning of this column—you can store rows of varying schemas in a single table.

If your application requires that users work with contacts and addresses together, you could store the addresses in the same table as the contacts. It would be critical to ensure that the addresses have the same PartitionKey—for example, “lerman_julie.” Also, the RowKey should contain a value that specifies the type or kind of entity, such as “address_12345,” so you can easily differentiate between contact types and address types when querying.

The common PartitionKey ensures that the rows will always stay together to take advantage of a feature called Entity Group Transactions (EGT). This allows a single transaction to carry out operations atomically across multiple entities as long as all the entities have the same PartitionKey value. One of the benefits of EGT with respect to related data is that you can perform a transacted update on all the entities in a single transaction.

A Base of Understanding from Which to Learn More

Azure Tables live in the cloud, but for me they began in a fog. I had a lot of trouble getting my head wrapped around them because of my preconceived understanding of relational databases. I did a lot of work (and pestered a lot of people) to enable myself to let go of the RDBMS anchors so I could embrace and truly appreciate the beauty of Azure Tables. I hope my journey will make yours shorter.

There’s so much more to learn about Azure Table services. The team at Microsoft has some great guidance in place on MSDN. In addition to the PDC09 video mentioned earlier, check this resource page on Azure Storage team blog at blogs.msdn.com/windowsazurestorage/archive/2010/03/28/windows-azure-storage-resources. The team continues to add detailed, informative posts to the blog, and I know that in time, or even by the time this column is published, I’ll find answers to my myriad questions. I’m looking forward to providing some concrete examples in a future Data Points column.

Julie Lerman is a Microsoft MVP, .NET mentor and consultant who lives in the hills of Vermont. You can find her presenting on data access and other Microsoft .NET topics at user groups and conferences around the world. Lerman blogs at thedatafarm.com/blog and is the author of the highly acclaimed book, “Programming Entity Framework” (O’Reilly Media, 2009). Follow her on Twitter.com: julielerman.

Thanks to the following technical experts for reviewing this article: Brad Calder and Jai Haridas

Data Points - Microsoft Azure Table Storage – Not Your Father’s Database

Storing Data for Efficient Retrieval and Persistence

It All Begins with Your Domain Classes

PartitionKeys and RowKeys Drive Performance and Scalability

Digging Deeper into PartitionKeys and Querying

Parallel Querying for Full Table Scans

More Design Considerations for Querying

Rethinking Relationships

A Base of Understanding from Which to Learn More

Additional resources