5 - Implementing a Document Database

A key/value store provides arguably the fastest means of storing data. Additionally, as long as you only need to retrieve data by using the key, it is also the quickest mechanism amongst NoSQL databases. Together, these properties are ideally suited for applications that simply need to save and fetch large amounts of data extremely rapidly. However, for situations where an application needs to query or filter data based on attributes other than the key, then a key/value store becomes less useful. To support scenarios such as this, an application can save information to a document database instead.

At the most abstract level, a document database is similar in concept to a key/value store because it holds values that an application can search for by using a key. Indeed, you can use many document databases in exactly the same manner as a key/value store. However, the distinguishing feature of a document database compared to a key/value store is the transparency of the data that the database holds. You can define fields in documents, and applications can query the documents in a database by using these fields. Additionally, most document databases can create indexes over fields, and these indexes can help to optimize queries that reference these fields. This mechanism makes a document database more suitable than a key/value store for applications that need to retrieve data based on criteria more complex than the value of the document key. A document database can also support applications that require simple ad hoc querying capabilities, although the query features provided by most document databases are not as generalized or powerful as those available to relational databases through SQL.

Dn313284.note(en-us,PandP.10).gifPoe says:
Poe It is sometimes difficult to determine whether a specific NoSQL database is a key/value store or a document database. For example, some key/value stores enable you to include metadata with values to enable the database to retrieve data by using attributes other than the key. The important point is not to worry unduly about the classification of a NoSQL database, rather you should ask yourself whether it provides the features that you require for your application and if so, how can you design a database that uses these features to their best effect.

This chapter describes the primary features of common document databases, and summarizes how you can design documents that take best advantage of these features to store and retrieve structured information quickly and efficiently. This chapter also discusses how Adventure Works used a document store to implement the product catalog and order history information for the Shopping application.

What is a Document?

A document in a document database is simply an entity that contains a collection of named fields. Don't confuse them with documents created by using Microsoft Word, or PDF files.

Dn313284.note(en-us,PandP.10).gifNote:
Different document databases use varying terms to describe the structure of the data that they hold. This chapter uses the term database to refer to a document store; some vendors call them domains, or even buckets. Similarly, this chapter uses the term collection to refer to a group of documents holding similar data. For example, a database may hold a collection of orders documents and a separate collection of products documents. Finally, this chapter uses the term field rather than attribute or key to refer to elements within a document.

If you are familiar with relational databases, you probably understand that before you can store data in a database you must first create the tables that will hold this data. These tables define the schema, and specify the fields (columns) that each data item (entity) contains. A document database is different. As described in Chapter 1, "Data Storage for Modern High-Performance Business Applications," most NoSQL databases are schema-less. What this means in practice is that the database simply acts as a data repository but does not impose any structure on that data because that is the responsibility of applications that store and maintain the data. This approach provides applications with an enormous amount of freedom because they can effectively store whatever they like. It also requires a great deal of discipline and consistency by application developers because these same applications need to be able to parse and process the data that they retrieve from the database, and therefore they need to understand the structure of the data. For this reason, it is common to group the documents into collections that have some semblance of commonality. Documents in a collection do not have to contain the same fields, but it is helpful if they model similar types of entities such as customers, orders, or products.

Dn313284.note(en-us,PandP.10).gifPoe says:
Poe The ability to define the structure of documents dynamically is a very flexible feature that enables you to adapt quickly to changing business requirements. In a relational database, you define the schemas for your data in advance (often expending considerable time and effort in the process). If the schema needs to change it is not always easy to apply these changes retrospectively to exist data in the database. In a document database, you simply include the fields that you require when you store a document.

A document database requires a document to be self-describing, and most document databases store information in a well understood and portable format such as JSON, BSON, or XML. A few document databases restrict the fields in a document to simple string values. However, most support other common types such as numbers, dates, and Booleans, and many allow you to define complex structures in fields, enabling you to create nested documents.

Designing a Document Database

The structure of the data in a document database can have an important bearing on the performance of applications that use the database, as well as the throughput and scalability of the database itself. In particular, when you design document structures, you should consider the following factors:

  • How should you divide your data into documents and collections, and how far should you normalize or denormalize the data in a document?
  • How are applications going query documents, and what predicates and filters are applications likely to use?
  • How dynamic is the data; are your applications going to perform a large number of create, delete, and update operations?

The following sections discuss these factors in more detail and describe the choices that you can make.

Modeling Data as Documents

When you save data in a document database, initially it may seem tempting to simply serialize all of the objects that your application uses into the format expected by the database (JSON, BSON, XML, and so on) and then write them to the database. In a few scenarios where your applications process a very limited set of objects that are roughly all of the same type, this strategy might be appropriate. However, in the vast majority of cases this approach is likely to be unsatisfactory as it can lead to a vast amount of data duplication.

Dn313284.note(en-us,PandP.10).gifMarkus says:
Markus Just because you can serialize almost any object as a document, it doesn't mean that you should.

To make the best use of a document database, you should plan the structure of documents to optimize the queries that your applications perform and you should not treat the document database as a document dump.

Dn313284.note(en-us,PandP.10).gifPoe says:
Poe Do not use a document database as a heap for unstructured documents because you will gain little benefit from doing so. If your applications simply need to save and retrieve data based on keys, then use a key/value store instead.

Denormalizing Documents and Embedding Data

You should avoid attempting to design documents by following the relational database approach because although this strategy can eliminate duplicate data it will inevitably lead to lots of small documents. In this case, to reconstruct the data for an entity, your applications may need to retrieve data from multiple documents and combine them together. Relational databases are specifically designed to support and optimize generalized queries that can join data across tables whereas document databases are not. Most document databases do not provide the capability to join documents together in queries, so any processing of this type must be performed by the applications themselves. This may involve the application issuing multiple queries to the database.

The general approach to designing documents is to examine how your application retrieves and processes data; use the business requirements of the application to mold the structure of the data and not vice versa. Ideally, the majority of queries that an application performs should be satisfied by a single visit to the database. This implies that you should denormalize your data, so that a document contains all the information required to display or manage the details of an entity, as described in Chapter 1. For example, if your application stores and retrieves information about customers, including their telephone numbers and addresses, you could use the conceptual structure shown in Figure 1.

Figure 1 - The conceptual structure of a document collection containing customer information

Figure 1 - The conceptual structure of a document collection containing customer information

In JSON terms, the document structure for these two customers looks like the following code example. Notice that the Address field is a nested document, and the Telephone field contains a list of telephone numbers:

{"CustomerID":99,
 "Title":"Mr",
 "FirstName":"Mark",
 "LastName":"Hanson",
 "Address":{
    "StreetAddress":"999 500th Ave",
    "City":"Bellevue",
    "State":"WA","ZipCode":"12345"
 },
 "Telephone":["111-2223334","222-1112223","333-2221114"]}

{"CustomerID":100,
 "Title":"Ms",
 "FirstName":"Lisa",
 "LastName":"Andrews",
 "Address":{
    "StreetAddress":"888 W. Front St",
    "City":"Boise",
    "State":"ID",
    "ZipCode":"54321"
},
"Telephone":["555-4443332","444-5552223","333-5554442"]}
Dn313284.note(en-us,PandP.10).gifNote:
Several document databases can generate a unique key value automatically when you create a new document and store it in the database. Others enable you to specify your own value for the key.

This structure is highly suitable for retrieving data based on the customer ID, and you can also perform queries that retrieve data based on other criteria such as the customer name. Remember that you can add indexes over fields in a collection to improve the performance of queries. For more information, see the section "Indexing Documents" later in this chapter.

Dn313284.note(en-us,PandP.10).gifMarkus says:
MarkusBe consistent in your application code. Make sure that you save documents by using the same names and data types for common fields, otherwise you will find it difficult to query the data later. For example, don't save the document for one customer where the customer ID is stored in a field called ID, and the document for another customer where the customer ID is stored in a field called CustomerID.


Dn313284.note(en-us,PandP.10).gifNote:
If your applications frequently need to retrieve data based on information stored in a field that contains a nested document, you might find it more beneficial to use a column-family database rather than a document database. For more information, see Chapter 6, "Implementing a Column-Family Database."

You should avoid making your documents too large and cumbersome; many document databases place limits on the sizes of documents, and hefty documents with a multitude of fields can be difficult to read and maintain. In the previous example, storing addresses and telephone numbers with customers is a natural fit, and the quantity of addresses and telephone numbers that a customer has at any one time are likely to be small.

Additionally, many document databases work best when the size of a document does not change (within reasonable parameters), but if repeated updates cause a document to grow, the resulting overhead of managing the space required can cause the performance of the database to decline. In the customers example, a single customer can have a variable number of telephone numbers, but this list is likely to be reasonably stable and not change very often for any given customer. Therefore it is practical to store this data as part of the customers' documents.

In many cases, you can consider a document to be an historical record where the data should not change after the document has been created. The sales order example containing a list of order items and the customer's shipping address described in Chapter 1 is one such scenario. Figure 2 shows the conceptual structure of a pair of sales order documents.

Figure 2 - The conceptual structure of a document collection containing sales order information

Figure 2 - The conceptual structure of a document collection containing sales order information

The JSON definition of order 1001 in Figure 2 might look like this:

{"OrderID":1001,
 "OrderDate":"\/Date(1370473200000+0100)\/",
 "OrderItems":[{
      "ProductID":2010,
      "Quantity":2,
      "Cost":520
     },
     {
      "ProductID":4365,
      "Quantity":1,
      "Cost":18
     }],
 "OrderTotal":1058,
 "CustomerID":99,
 "ShippingAddress":{
     "StreetAddress":"999 500th Ave",
     "City":"Bellevue",
     "State":"WA",
     "ZipCode":"12345"}}

In this example, the shipping address will be duplicated if the same customer places several orders, but the shipping address should indicate the destination of the order regardless of whether the customer subsequently moves to a different location, so this duplication is necessary to ensure the historical accuracy of each sales order document.

Normalizing Documents and Referencing Data

Denormalizing data can help to optimize queries by reducing the number of documents that an application has to query to fetch data. However, it can have some detrimental side effects, and in some circumstances it may be more appropriate to normalize data.

Figure 2 illustrates an example of the normalized approach. Each sales order specifies the ID of the customer that placed the order, but does not include any detailed information about the customer. You could argue that if the applications frequently retrieve information about orders and the details of the customers that placed them, then the customer details should be embedded directly into the sales order documents. However, if the application regularly queries the details of customers and not their orders, the logic for performing these queries can become contorted and it does not feel natural that an application should have to iterate through sales orders to obtain a list of customers. Additionally, it is very likely that a single customer will (hopefully) place many orders, leading to a high degree of duplication and making any updates to customer details difficult and time consuming to perform (unlike shipping addresses, the details of customers may be dynamic). Therefore, in this example, it is better to maintain sales orders and customers as separate collections. The same logic applies to products; each order item contains the product ID rather than an embedded document that describes the product.

Dn313284.note(en-us,PandP.10).gifNote:
The sales order example references other documents (for customers and products) by using the keys for these items. A few document databases implement reference types that provide a direct link to a related document, enabling them to be retrieved more quickly by applications which do not have to perform a search for the related document.
Other document databases may cache document keys in memory to provide fast access to documents, although this approach may require a large amount of RAM if your database contains tens or hundreds of millions of documents all of which are being accessed frequently (this situation is not uncommon in vary large systems).

Normalizing your documents can have some beneficial effects on performance if you have mutating data. As described in the previous section, you should avoid creating documents that can grow indefinitely, even if the rate of growth is small. This is because many document databases will have to move the data around on disk (or in memory) to make space as the document is updated, reducing the performance of the database. Some document databases enable you to pre-allocate additional space to documents, enabling them to grow without being relocated, but once you exhaust this space the document will have to be moved.

Dn313284.note(en-us,PandP.10).gifPoe says:
PoeMany document databases support in-place updates as long as the document does not increase in size, but if a modification causes a document to grow the database might need to rewrite it to a different location. If possible, avoid creating documents that exhibit this behavior, and especially avoid creating documents that contain unbounded, growing, mutable arrays.

As an example, your application might need to store information about the products that your organization sells. If the application additionally needs to maintain a price history for each, you might define product documents with the structure shown in Figure 3, where the price history is maintained as a field within the product document.

Figure 3 - Storing price history information

Figure 3 - Storing price history information

As prices change, each product document will increase in size. In many business scenarios, price changes tend to occur across an entire product range at the same time (for example, in the Adventure Works scenario, the prices of all bicycles might increase if the cost of manufacturing bicycles rises, or decrease if the company decides to have a sale.) The effects of a large number of documents being written and relocated rather than updated in place could have a detrimental effect on performance, although in this case these updates are likely to happen relatively infrequently.

In a different scenario that implements a similar structure, if you are writing applications that capture data measured by a series of scientific instruments and logging the measurements for a single instrument with information about that instrument in the same document, the cost of rewriting the document for each data point could become prohibitive, especially if your application captures measurements from a large number of different instruments. Figure 4 shows the information that such a document might contain if the instrument is a thermometer capturing the ambient temperature of the environment at 1 minute intervals.

Figure 4 - Appending measurements to an array in a document

Figure 4 - Appending measurements to an array in a document

In this case, you should implement scientific measurements as separate documents that reference the instrument document, as shown in Figure 5. Each of these measurements is a fixed size and will not change once they have been saved. Most document databases are optimized to handle precisely this type of behavior.

Dn313284.note(en-us,PandP.10).gifNote:
Alternatively, you could store the information in a column-family database where the time indicates the name of a column and the temperature is the value of that column. Chapter 6, "Implementing a Column-Family Database" provides more information.

Figure 5 - Storing measurements as documents in a separate collection

Figure 5 - Storing measurements as documents in a separate collection
Dn313284.note(en-us,PandP.10).gifPoe says:
PoeIn a normalized relational database, there is often no distinction made between the structure of the various one-to-many (1:n) relationships that can occur, and you nearly always implement the entities that participate in these relationships as separate tables. When you design a document database, you need to understand that there are different types of 1:n relationships. If n is small and does not vary much, then you can consider denormalizing and combining both entities into the same document. If n is large or can change considerably over time, then implement both entities as separate documents.

Handling Complex Relationships

You can model complex relationships in a document database, such as hierarchical data and tree structures, by using references. Figure 6 shows an example that records which employees in an organization report to which managers.

Figure 6 - Modeling hierarchical data in a collection of documents

Figure 6 - Modeling hierarchical data in a collection of documents

This collection is not much different from a relational table that implements the same functionality, and searching data to find a list of employees that directly or indirectly report to a specific manager requires an application to perform several queries over the same collection. If a hierarchy is relatively static, you could consider including materialized paths in the data. A materialized path records the path from a document to the top of the hierarchy. Figure 7 shows the employee data from Figure 6 with the Manager field replaced by the ManagerPath field that contains a materialized path that specifies the manager, the manager's manager, and so on. An application searching for the employees that report to a specific manager can query the ManagerPath field to find this information.

Dn313284.note(en-us,PandP.10).gifNote:
The order of data in a materialized path is significant, and you may need to store the materialized path as a string rather than a list of employee IDs. This is because some document databases do not preserve the order of data in list fields.

Figure 7 - Storing hierarchy information by using materialized paths

Figure 7 - Storing hierarchy information by using materialized paths

You can follow a similar approach for other complex relationships. However, maintaining materialized paths can be resource intensive and time consuming, so only adopt this approach if the relationships do not change frequently.

Dn313284.note(en-us,PandP.10).gifNote:
If you need to store information about dynamic relationships, consider using a graph database. If necessary, you can integrate a graph database and a document database together. Use the graph database to store information about relationships between documents, and add references to documents in the document database to the nodes in the graph database. For more information, see Chapter 8, "Building a Polyglot Solution."

Retrieving Data and Indexing Documents

The main reason for using a document database is to enable applications to quickly find data based on the values held in one or more fields. The fastest form of data retrieval is to use the document key. However, in many cases, simply retrieving data by using the key will not be sufficient (if it is, you should use a key/value store rather than a document database). For this reason, document databases enable you to specify query criteria that identify documents based on the non-key fields in your documents.

The actual query features available depend on the specific document database that you are using. All document databases enable an application to filter data by testing whether the data in a field is equal to a specified value, and some document databases allow an application to specify a variety of comparison operators such as less than, greater than, and so on. In some cases, you can combine criteria together with and and or operators. The majority of document databases support projections, enabling an application to specify which fields a query should retrieve, but some document databases can only retrieve the entire document. Additionally, while the various document databases currently available provide their own APIs and support applications written in a variety of different programming languages, most of them also support the REST model of working, enabling your applications to submit queries as REST requests, and allowing your applications to page through data if a query returns a large number of documents.

Summarizing Data by Using Materialized Views

Some document databases support materialized views, enabling an application to quickly retrieve summary information from documents or pose queries that require calculating values across a collection of documents. For example, if your database stores information about sales orders as shown in Figure 2 earlier in this chapter, and your application needs to calculate the average value of each order that includes a specific product, then using a materialized view could be more efficient than searching for all sales orders that contain the product, retrieving these orders across the network, and then performing the calculation by using logic in the application. Figure 8 illustrates the concepts behind a materialized view.

Figure 8 - Using a materialized view to calculate and store summary data

Figure 8 - Using a materialized view to calculate and store summary data

Most document databases implement materialized views by using a map/reduce framework. They maintain the information held by the view as documents are added, updated, and deleted. Therefore, you should assess the cost of keeping a materialized view up to date against the performance benefits gained when you query this data.

Dn313284.note(en-us,PandP.10).gifPoe says:
PoeMaintaining data in a materialized view will have an impact on the performance of insert, update, and delete operations. You need to balance the frequency of queries that benefit from using a materialized view against the volume of inserts, updates, and deletes that can occur.

Indexing Documents

When an application retrieves documents by specifying criteria based on fields other than the key, the document database may need to scan through every document in a collection to find matches. If the collection is large, this process can take some time. Most document databases enable you to define indexes over fields to help speed this process up. An index contains a sorted list of values in a given field (or fields), together with direct references to documents that have these values in the specified fields. When an application queries a collection with criteria that include one or more indexed fields, the database management system can quickly find matching documents by looking them up in the appropriate indexes. Figure 9 shows an example index created over the City field in the nested Address document in the Customers document collection. An application that needs to find customers located in a specific city will benefit from using this index.

Dn313284.note(en-us,PandP.10).gifNote:
Not all document databases enable you to create indexes over fields in nested documents, or fields that contain lists.

Figure 9 - The Customers collection with an index over the City field

Figure 9 - The Customers collection with an index over the City field
Dn313284.note(en-us,PandP.10).gifNote:
A few document databases only support queries that either reference the key or indexed fields. In these databases, non-indexed fields cannot be specified as filter criteria.

The techniques that different document databases use to implement indexes varies. Some databases use balanced B-Tree structures while others implement indexes by creating materialized views with the data ordered by the index key. Some document databases provide support for pluggable third-party indexing libraries, enabling developers to select an indexing implementation that is most closely suited and optimized to the type of data that is being indexed. However they work, all indexes impose an overhead on the database because they have to be maintained as documents are added, modified, and deleted.

If the documents in a collection are subjected to a high proportion of queries compared to inserts, updates, and deletes, then you should consider creating indexes to optimize as many of these queries as possible. If a collection has a high proportion of write operations compared to queries, then you may need to reduce the number of indexes to a minimum. You should also bear in mind that indexes require space on disk (or in memory); the more indexes you create, the more space they will occupy in the database.

Dn313284.note(en-us,PandP.10).gifNote:
Some document databases maintain indexes by using a separate low-priority thread running in the background. This approach helps to minimize the overhead associated with an index, although it is possible that the data in an index may be out of date when a query that utilizes that index runs. This may mean that a query may fail to find a document that would be retrieved if the index was not used.

The primary use of indexes is to support fast searching and filtering of data, but if an application regularly needs to fetch data in a specific order, then indexes can help here as well. The fields in an index are sorted, so an application that requires documents to be ordered by a particular field can benefit if that field is indexed. In the example shown in Figure 9, an application that needs to retrieve and display all customers in order of the city in which they are located can make use of the City index.

Several document databases allow indexes to span multiple fields; you specify the relative significance of fields, and the index will be sorted by the most significant field first, then by the next most significant field, and so on. In some cases, if an index contains all the fields required by a query, then there is no need for the database management system to actually retrieve the document itself. This type of index is known as a covering index, and they can be very effective at speeding queries in document databases that support projections.

Dn313284.note(en-us,PandP.10).gifPoe says:
PoeIn a static or query-intensive collection, consider creating indexes to optimize as many queries that your applications perform as possible, and define covering indexes where appropriate. In a dynamic collection, only create the indexes that are absolutely essential, to reduce the overhead of maintaining them.

The decision of whether a query submitted by an application to a document databases uses an index or not can depend on many factors. In some cases, the application itself indicates the index to use, while in others the database management system makes the selection. Some document databases implement sophisticated query optimization strategies that rival those found in many relational database management systems. A popular document database even considers the cost of dynamically creating an index over a field if no such index exists and it would prove useful to the query being run. This index is temporary but remains in existence for a short time. If the same (or similar) query is repeated, it will use the temporary index, and if this temporary index is used enough within a given period it will be made permanent.

Creating, Updating, and Deleting Documents

For reasons of speed and scalability, most document databases provide only a very basic programmatic interface for creating, updating, and deleting documents. To create a new document you simply encode the data in the appropriate format (JSON, BSON, XML, or whatever the document database expects) and then add it to a collection in the database. There are no defined schemas in a document database, and as long as the document is well formed and has a valid structure the database will store it.

To delete a document, your application typically identifies the document either by its key value or by specifying other query criteria to match against fields in the document. The application then sends a request to the database to remove this document. Some document databases provide a search and delete API, enabling your application to combine these steps into a single atomic operation. Depending on the database, the document might not be physically removed at this point but might instead be marked as for deletion (it will not be retrieved by any subsequent queries, however.) At some point in the future, the space used by this document will be reclaimed.

It is important to remember that unlike relational systems, document databases tend not to implement any form of relational integrity. This means that your applications can easily delete a document that is referenced by another document, and it is the responsibility of the application code to ensure that any relationships between documents remain valid and intact.

Dn313284.note(en-us,PandP.10).gifMarkus says:
MarkusMany document databases can remove multiple documents in the same request if they all match the specified criteria. There is no "Are you sure?" confirmation, and it is the responsibility of your application code to make sure that only the intended documents are deleted.

Document databases frequently provide two types of update operation, as follows.

  • In the first type, the application identifies the document (or documents) to modify by following a similar process to that for deleting documents. The application also provides a collection of field and value pairs, and the data in these fields in the document are updated with the new values.
  • In the second type, the application submits an entire document, including the key, to the database. If a document with this key already exists it is replaced with the new document. If the document does not already exist it is added to the database. This is an upsert operation.
Dn313284.note(en-us,PandP.10).gifNote:
Remember that if a document increases in size, the database may need to relocate it. This can take a little time and will slow down the update operation.

Some document databases enable you to batch multiple updates together and send them to the database as a single request. However, most document databases do not support transactions in the same way as relational systems. In most cases you can assume that operations that write to a single document are atomic (the database will not insert half a document, for example), but operations that span multiple documents are not. This means that if, for example, your application submits a request to update 10 documents, you cannot guarantee that all 10 documents will be modified at the same time and some inconsistencies can creep in, especially if these documents are heavily used.

Dn313284.note(en-us,PandP.10).gifJana says:
JanaIn most document databases, only insert, update, and delete operations that affect a single document are atomic. This will have a bearing on the design of your documents and your applications.

To counter possible problems arising from inconsistency, some document databases enable an application to lock documents. The lock will prevent concurrent applications from modifying these documents, although they may be able to read locked data. It is the responsibility of the application to ensure that locks are released in a timely manner when the updates are complete.

Dn313284.note(en-us,PandP.10).gifNote:
Some document databases perform insert, update, and delete operations asynchronously. This means that you cannot guarantee when these operation will actually be completed, and this approach can lead to inconsistences occurring if developers are not aware that this is happening.

If this type of pessimistic locking is inappropriate to your solution, you can consider implementing an optimistic strategy instead. Some document databases provide check and set operations to help with this approach. In this case, when the data is updated, the database first verifies that it has not changed since the application retrieved it, passing back an error indication and if it has. The application should check for this error, and retrieve the latest version of the document before attempting to update it again.

Dn313284.note(en-us,PandP.10).gifMarkus says:
MarkusPessimistic locking can severely impact the performance of the system, possibly resulting in extended response times by concurrent applications sharing the same database. It is better to design your applications assuming that documents may change while your applications are using them than to lock documents for exclusive access.

How Adventure Works Used a Document Database to Store the Product Catalog and Order History Information

As described in Chapter 2, "The Adventure Works Scenario," when a customer starts the Shopping application it shows a list of product categories. The customer can click a category and the Shopping application then displays a list of subcategories within that category. If the customer clicks a subcategory, the Shopping application presents the products within that subcategory. The customer can then click a product to view its details. You can see from this description that the product catalog constitutes a three-level hierarchy, with categories at the top, subcategories in the middle, and products at the bottom. The application has to be able to quickly find and display a list of items at each level in this hierarchy. However, the category, subcategory, and product information rarely changes (with the possible exception of product pricing), and as far as the Shopping application is concerned this data is all read-only. The category, subcategory, and product information are good candidates for restructuring as product catalog in a document database.

When a customer places an order, the Shopping application stores the details of the order in a SQL Server database (Chapter 3, "Implementing a Relational Database" describes this part of the process in more detail). The warehousing and shipping systems inside Adventure Works retrieve and modify the status of an order as items are picked and dispatched (these systems are OLTP-based services and are outside the scope of the Shopping application). The Shopping application enables a customer to view the details of the orders that they have placed, and it would be possible to retrieve this information from the SQL Server database. However, once an order has been fulfilled it should be considered a historical record. If any details of an order change, the system should maintain a full audit trail to enable these changes to be tracked. The volume of historical order information in the database will increase as customers place orders, and it is anticipated that the database could easily be expected to hold hundreds of thousands of orders by the end of the first year of operation, increasing to millions of orders in the years after that. The database must be scalable to store this amount of data, and also be quick to find the historical details of the orders for any customer. Therefore, the developers chose to make a copy of each order as it is created and each time it is updated, and store these copies in a separate database, optimized for fast query access. The developers considered that a document database might provide better scalability and performance for holding this order history information than SQL Server.


Refer to Chapter 2, "The Adventure Works Scenario" for a more detailed description of the features of the Shopping application.

Designing the Documents for the Product Catalog

The developers at Adventure Works considered several options for structuring the documents for the product catalog. First, they thought about adopting a completely denormalized approach, where the category information includes a list of subcategories, which in turn contain a list of products. Figure 10 shows an example of a Category document. Note that subcategories are not shared across categories, and each product can only be in a single subcategory, so there is no duplication of data. Additionally, it is possible to create indexes over the SubcategoryID and ProductID fields, enabling an application to quickly locate the details of a document containing a specified subcategory or product. However, each document is potentially quite large (possibly exceeding the document size limitations for many document databases). Additionally, the unit of I/O for most document databases is the document, meaning that queries that needed to find the details of a single product may end up reading the entire document containing that product into memory. This is potentially a very expensive overhead.

Dn313284.note(en-us,PandP.10).gifNote:
Some document databases support projections, enabling a query to return only the specified fields from a document. However, the database server may still need to read the entire document from the database to extract this information.

Figure 10 - The structure of a denormalized document containing category, subcategory, and product information

Figure 10 - The structure of a denormalized document containing category, subcategory, and product information

A second option was to partially normalize the data; split category and subcategory information into separate document collections, and store product details with subcategories as shown in Figure 11. Again, although this approach reduces the size of the Category documents, the Subcategory documents could still exceed the document size limitations of many document databases and impose an unnecessary overhead on queries that retrieve the details of individual products so it was also discarded.

Figure 11 - Normalized category documents with denormalized documents containing subcategory and product information

Figure 11 - Normalized category documents with denormalized documents containing subcategory and product information

A further option was to completely normalize the data and create separate collections for Category documents, Subcategory documents, and Product documents. However, to optimize the retrieval of categories and subcategories by the Shopping application, the developers chose to combine these Category and Subcategory documents together, resulting in the document structure illustrated by Figure 12. The relatively small number of subcategories in a category makes it unlikely that the size of each document will exceed the capabilities of the document database (the Components category is the largest, with 14 subcategories). Additionally, creating an index over the SubcategoryID field in the Product documents can speed retrieval of all products in a given subcategory.

Figure 12 - Denormalized documents containing category and subcategory information, and normalized product documents

Figure 12 - Denormalized documents containing category and subcategory information, and normalized product documents

Having decided on the high-level structure of the documents, the developers looked more closely at the information held for each product and the possible queries that the application has to support. The complete data for a typical product is shown in Figure 13:

Figure 13 - The complete details of a product

Figure 13 - The complete details of a product

After due consideration, the developers uncovered two issues:

  • Although it is not a feature used by the initial version of the Shopping application, the developers felt that it was likely that in a future release, queries that retrieved product information would also need to fetch information about the category that contained the product. The product document structure defined in Figure 12 only lists the subcategory ID, so to retrieve category information an application has to find the category document that contains this subcategory. Therefore, the developers decided to extend the design of the product document structure to include category information as a nested document. The subcategory details are nested inside this document.
  • Products have size and weight properties, but these properties are actually pairs; the Size property is simply a number and the SizeUnitMeasureCode property specifies the units (different products have different units of size–centimeters for wheels, for example). Similarly, the value in the Weight property is only meaningful in conjunction with the WeightUnitMeasureCode property. The developers decided to implement these properties as small subdocuments that combine the information, to indicate that they are only meaningful when read together.

Figure 14 shows the refined structure of a product document (the subdocuments are highlighted):

Figure 14 - The structure of a product document

Figure 14 - The structure of a product document

The developers decided to use MongoDB to provide the document database. They created two MongoDB collections in the database; categories which holds the denomormalized category and subcategory documents, and products which contains the product documents.

Dn313284.note(en-us,PandP.10).gifNote:
Adventure Works could have chosen one of many different document databases. They just happened to select MongoDB because that is the document database with which they are most familiar. Another document database could work equally as well.

Retrieving Documents from the Product Catalog

The developers at Adventure Works created the classes shown in the following code example to represent category and subcategory information. These classes mirror the structure of the documents shown in Figure 12. A Category object contains a collection of Subcategory objects. The BSonId attribute identifies the field that MongoDB should use as the document key when it stores and retrieves documents:

public class Category
{
    [BSonId]
    public int Id { get; set; }
    public string Name { get; set; }
    public IEnumerable<Subcategory> Subcategories { get; set; }
}

public class Subcategory 
{
    [BSonId]
    public int Id { get; set; }
    public string Name { get; set; }
}

Product documents are implemented by using the Product class shown below. The SizeUnitofMeasure, WeightUnitOfMeasure, and ProductCategory types represent the nested documents in a product document.

public class Product 
{
    [BsonId]
    public int Id { get; set; }
    public string Name { get; set; }
    public string ProductNumber { get; set; }
    public string Color { get; set; }
    public decimal ListPrice { get; set; }
    public SizeUnitOfMeasure Size { get; set; }
    public WeightUnitOfMeasure Weight { get; set; }
    public string Class { get; set; }
    public string Style { get; set; }
    public ProductCategory Category { get; set; }
}

public class SizeUnitOfMeasure
{
    public string Value { get; set; }
    public string UnitOfMeasure { get; set; }
}

public class WeightUnitOfMeasure
{
    public decimal Value { get; set; }
    public string UnitOfMeasure { get; set; }
}

public class ProductCategory
{
    public int CategoryId { get; set; }
    public string Name { get; set; }
    public ProductSubcategory Subcategory { get; set; }
}

public class ProductSubcategory
{
    public int SubcategoryId { get; set; }
    public string Name { get; set; }
}

The developers used MongoDB to provide the document database, and followed the Repository pattern to isolate the document database-specific code from the rest of the Shopping application.

The CategoriesController class in the MvcWebApi web service retrieves category information from the document database by using the CategoryRepository class. The code that actually establishes the connection is located in the BaseRepository class from which the CategoryRepository class inherits. The address of the MongoDB server and the name of the database are passed as parameters to a constructor in this class that saves them locally. The GetDatabase method in the BaseRepository class creates a connection to the MongoDB database using this information (this method also checks to make sure that the database actually exists). The following code example shows the main elements of the BaseRepository class:

public class BaseRepository
{
    private readonly string databaseName;
    private readonly string hostNames;
    private readonly bool setWriteConcernToJournal;
    private readonly bool setWriteConcernToWMajority;
    ...

    public BaseRepository(string hostNames, 
                          string databaseName,
                          bool setWriteConcernToJournal, 
                          bool setWriteConcernToWMajorit)
    {
        if (string.IsNullOrWhiteSpace(hostNames))
        {
            throw new ArgumentNullException(hostNames);
        }

        if (string.IsNullOrWhiteSpace(databaseName))
        {
            throw new ArgumentNullException("databaseName");
        }

        this.hostNames = hostNames;
        this.databaseName = databaseName;
        this.setWriteConcernToJournal = setWriteConcernToJournal;
        this.setWriteConcernToWMajority = setWriteConcernToWMajority;
    }

    protected MongoDatabase GetDatabase()
    {
        string connectionString = "mongodb://" + hostNames;
        MongoClientSettings settings = 
            MongoClientSettings.FromUrl(new MongoUrl(connectionString));
        if(setWriteConcernToJournal)
        {
            settings.WriteConcern.Journal = true;
        }
        if(setWriteConcernToWMajority)
        {
            settings.WriteConcern = WriteConcern.WMajority;
        }
        var mongoClient = new MongoClient(settings);
        var mongoServer = mongoClient.GetServer();
        if (!mongoServer.DatabaseExists(this.databaseName))
        {
            throw new MongoException(string.Format(
                                     CultureInfo.CurrentCulture,
                                     Strings.DatabaseDoesNotExist,
                                     this.databaseName)) 
        }

        var mongoDb = mongoServer.GetDatabase(databaseName); 
        return mongoDb;
    }
}
Dn313284.note(en-us,PandP.10).gifNote:
The GetDatabase method also ensures that all writes to the database are recorded in the MongoDB journal (the equivalent of the database transaction log used by SQL Server). This feature guarantees that the database can survive an unexpected shutdown and recover any information that has not been flushed from memory to the MongoDB database.



Appendix A provides more information on how the Shopping application implements the Repository pattern.

The MongoDB-specific data access code is located in the GetAllCategories and GetSubcategories methods in the CategoryRepository class. The code for these methods use the C# APIs exposed by a .NET Framework assembly provided by MongoDB to communicate with the database.


You can find more information about the MongoDB APIs for C# and the .NET Framework online at the MongoDB website.

The following code example shows how the CategoryRepository class retrieves the documents that describe the product categories. As described earlier, in the product catalog database, a category document also contains a list of subcategories. Therefore, to save network bandwidth when transmitting data back to the client, this method only returns the category ID and name from each document:

Dn313284.note(en-us,PandP.10).gifNote:
In this code example, the DE namespace is an alias for the DataAccess.Domain.Catalog namespace that contains the domain entity classes used by the controllers.

public class CategoryRepository : BaseRepository, ICategoryRepository
{
    private const string mongoCollection = "categories";

    ...

    public ICollection<DE.Category> GetAllCategories()
    {
    ...
        // This method is ONLY used for the front page of the reference 
        // implementation.  As an optimization, instead of pulling
        // back the whole category document, we are only returning the _id 
        // and name of the category with no associated subcategories.
        var categories = GetDatabase().GetCollection<Category>(mongoCollection)
            .FindAll()
            .SetFields(Fields.Include("_id", "name"))
            .ToList();

        if (categories == null)
        {
            return null;
        }

        var result = new List<DE.Category>();

        Mapper.Map(categories, result);
        return result;
    }

    ...
}

To optimize the query performed by this method, the developers at Adventure Works decided to create a covering index that spans the customer ID and name fields over the documents in the categories collection. These documents are very static, so there is minimal overhead associated with maintaining this index.

Dn313284.note(en-us,PandP.10).gifNote:
In the sample application, the code that creates the indexes for the product catalog is located in the PowerShell scripts that create and populate the MongoDB database.

Note that although the method only returns the category ID and name, the database server still has to read the entire category document, including subcategory information, into memory. However, this action causes the documents to be cached in memory on the database server, and the GetSubcategories method that retrieves data from the same documents and which is typically invoked soon afterwards by the Shopping application, should run more quickly as a result. The following code example shows the GetSubcategories method. This method makes use of the GetMongoCategory method (also shown) that retrieves the details of a single category from the document database:

public class CategoryRepository : BaseRepository, ICategoryRepository
{
    private const string mongoCollection = "categories";
    
    ...

    private Category GetMongoCategory(int categoryId)
    {
        return GetDatabase().GetCollection<Category>(mongoCollection). 
            FindOneById(categoryId);
    }

    ...

    public ICollection<DE.Subcategory> GetSubcategories(int categoryId)
    {
        var category = GetMongoCategory(categoryId);

        if (category != null && category.Subcategories != null)
        {
            var subcategories = category.Subcategories.ToList();

            if (subcategories == null)
            {
                return null;
            }
            var result = new List<DE.Subcategory>();

            Mapper.Map(subcategories, result);
            return result.Select(s =>
                {
                    // Map the parent category.
                    Mapper.Map(category, s.Category);
                    return s;
                }).ToList();
        }
        else
        {
            return null;
        }
    }
}

MongoDB stores data in BSON format in the document database. The MongoDB APIs deserialize the data retrieved from the database into .NET Framework objects (or collections of objects). The GetAllCategories and GetSubcategories methods convert this data into database-neutral domain entities by using AutoMapper before returning them.


For more information on how the repository classes uses AutoMapper, see the section "How the Entity Framework and MongoDB Repository Classes use AutoMapper to Convert Data" in Appendix A.

The ProductRepository class, which is used by the ProductsController to fetch product information from the document database, follows a similar pattern except that it reads data from the products document collection. The GetProducts method in the ProductRepository class returns a list of products in a subcategory and the GetProduct method returns the details of a single product. The GetProducts method has to locate products by subcategory, so the developers created an index over the SubcategoryID field in the products document collection. The following code example shows how these methods are implemented:

public class ProductRepository : BaseRepository, IProductRepository
{
    private const string mongoCollection = "products";
    ...

    private IEnumerable<Product> GetMongoProducts(
        IDictionary<string, object> matchingFilter = null)
    {
        return (matchingFilter != null)
             ? GetDatabase().GetCollection<Product>(mongoCollection).
                 Find(new QueryDocument(matchingFilter))
             : GetDatabase().GetCollection<Product>(mongoCollection).FindAll();
    }

    private Product GetMongoProduct(int productId)
    {
        return GetDatabase().GetCollection<Product>(mongoCollection).
            FindOneById(productId);
    }

    ...

    public ICollection<DE.Product> GetProducts(int productSubcategoryId)
    {
    ...
        var products = GetMongoProducts(new Dictionary<string, object> { 
                { "category.subcategory.subcategoryId", productSubcategoryId } 
            }).ToList();

        if (products == null || products.Count == 0)
        {
            return null;
        }
            
        var result = new List<DE.Product>();

        Mapper.Map(products, result);
        return result;
    ...
    }

    public DE.Product GetProduct(int productId)
    {
    ...
        var product = GetMongoProduct(productId);
        if (product == null)
        {
            return null;
        }

        var result = new DE.Product();

        Mapper.Map(product, result);
        return result;
    ...
    }
}

The ProductExists method in the ProductRepository class returns a Boolean value indicating whether the specified product is available in the product catalog. The CartController uses this method to check that the customer has added a valid product to the shopping cart. The MongoDB query that this method runs reads the document key only (the product ID field), and this information is available in the index over this field. Therefore this query can run very quickly and does not actually need to retrieve any product documents from the database:

public class ProductRepository : BaseRepository, IProductRepository
{
    private const string mongoCollection = "products";

    ...

    public bool ProductExists(int productId)
    {
        var mongoProductId = GetDatabase().GetCollection(mongoCollection)
            .Find(Query.EQ("_id", productId))
            .SetLimit(1)
            .SetFields(Fields.Include("_id"))
            .FirstOrDefault();
        return mongoProductId != null;
    }

    ...
}

Designing Documents to Hold Order History Information

When a customer places an order, the Shopping application creates a normalized set of records that it stores in the SalesOrderHeader and SalesOrderDetails tables in the SQL Server database. The Shopping application also has to maintain a full audit trail of each order, including a complete history of any changes made to each order. The SalesOrderHeader and SalesOrderDetails tables in SQL Server are designed to support the OLTP requirements of the warehousing and shipping systems in Adventure Works, but they are not suitable for storing the change history of orders. Equally, they are not optimized for performing the queries required by customers viewing order history information, especially as the database could contain tens of millions of orders. Therefore, when a customer places a new order, the developers at Adventure Works decided to copy the important details of the order as an order history document to a document database. Each time an order is modified the system stores another copy of the order history that contains the changes in the document database. Each copy is date stamped.

To enable order history information to be retrieved quickly, the developers adopted a fully denormalized approach where each order history document includes the complete details of the billing address, shipping address, credit card, and the list of items in the order. Figure 15 shows the structure of a typical order history record as stored in the document database.

Figure 15 - An order history document showing the denormalized structure of the data

Figure 15 - An order history document showing the denormalized structure of the data

In this document, the key (order history ID) is a unique GUID, while the OrderCode field actually identifies the order; the same order can have more than one order history record. The ModifiedDate field contains the date and time that the order history record was created (this will be the time when the order itself was either placed or modified), and the Status field tracks the status of the order. The remaining fields contain the details of the order.

Dn313284.note(en-us,PandP.10).gifNote:
When an order is first placed the status will be marked as Pending, and when the order has been processed its status is changed to Completed. Therefore, there will usually be at least two order history documents for each order, comprising an audit trail of these changes in status.
In the real world, the warehousing and dispatch systems inside Adventure Works may cause the order to pass through additional states, and they will add further records that indicate these changes for each order to the document database. However, the warehousing and dispatch systems are outside the scope of the Shopping application, and this functionality is not implemented by the sample solution provided with this guide.

The order history documents are stored in the ordershistory collection in the database.

Storing and Retrieving Order History Information

Following the same pattern as the product catalog, the developers at Adventure Works created a series of classes to hold order history information. The following code example shows these classes:

public class OrderHistory
{
    ...
    [BSonId]
    public Guid Id { get; set; }
    public Address BillToAddress { get; set; }
    public CreditCard CreditCard { get; set; }
    public int CustomerId { get; set; }
    public DateTime DueDate { get; set; }
    public decimal Freight { get; set; }
    public IEnumerable<OrderItem> Items { get; set; }
    public DateTime ModifiedDate { get; set; }
    public Guid OrderCode { get; set; }
    public DateTime OrderDate { get; set; }
    public Address ShippingAddress { get; set; }
    public OrderStatus Status { get; set; }
}

public class OrderItem
{
    public short Quantity { get; set; }
    public decimal UnitPrice { get; set; }
    public int ProductId { get; set; }
    public string ProductName { get; set; }
}

public class Address
{
    public int Id { get; set; }
    public string AddressLine1 { get; set; }
    public string AddressLine2 { get; set; }
    public string City { get; set; }
    public string PostalCode { get; set; }
    public int StateProvinceId { get; set; }
}

public class CreditCard
{
    public string CardNumber { get; set; }
    public string CardType { get; set; }
    public byte ExpMonth { get; set; }
    public short ExpYear { get; set; }
}

When a customer places an order, the Post method of the OrdersController class creates an initial order history document containing the details of the order, and stores it in the ordershistory collection in the database with the status of Pending. When the order has been processed, another order history document is created for the same order and added to the ordershistory collection with the status set to Completed (the ModifiedDate field is also populated with the date and time that the status changed).

To enable a customer to display their order history, the developers at Adventure Works created the OrderHistoryRepository class that provides the following public methods:

  • GetOrderHistoryByTrackingId. This method retrieves the details of a specific order from the order history document database. The order is specified by the tracking ID passed as the parameter to this method. Note that there will be more than one order history document for a given order, and this method simply returns the first order history document that it finds that has the appropriate order code.
  • GetOrdersHistories. This method retrieves a list of OrderHistory documents for a specified customer. The customer ID is provided as the parameter to this method.
  • GetOrderHistoryByHistoryId. This method fetches the details of a specific OrderHistory document identified by order history ID passed as the parameter to this method.
  • IsOrderCompleted. This method determines whether the specified order has been successfully completed or not by examining the Status property of the order. If this field contains the value OrderStatus.Completed, the method returns true, otherwise it returns false. The method uses the unique tracking ID to identify the order.
  • GetPendingOrderByTrackingId. This method retrieves the Order document for the order that matches the tracking ID specified as the parameter. The method examines the Status field of the order, and only returns order information if this status is OrderStatus.Pending. If the status is different, or there is no order that matches the tracking ID, this method returns a null value.
  • SaveOrderHistory. This method writes an OrderHistory document to the database.

Internally, the code in the OrderHistoryRepository class operates along similar lines to the other MongoDB repository classes, and uses methods defined by the MongoDB API to retrieve and store documents in the database.

The order history documents in the database are automatically indexed by the order history ID (this is the Id field in the OrderHistory class). However, the OrderHistoryRepository class also retrieves documents by using the OrderCode and CustomerId fields. Therefore the developers defined indexes over these fields in the ordershistory collection.

Dn313284.note(en-us,PandP.10).gifNote:
The indexes are created by the PowerShell script that sets up the MongoDB database.

Implementing a Document Database to Maximize Scalability, Availability, and Consistency

Most document databases are intended to handle large volumes of data being accessed concurrently by millions of users. It is therefore vital that a document database is able to locate, retrieve, insert, update, and delete documents quickly and efficiently.

Maximizing Scalability

As with many other types of NoSQL databases, most scalable document databases enable you to spread a large database across a collection of database servers by using sharding, described in Chapter 1. This is very effective for applications that perform a large proportion of write operations as it can help to reduce disk contention.

In a document database, sharding is implemented at the document collection level, meaning that different collections have their own sets of partitions and shards. A collection of customer documents does not usually occupy the same shard as a collection of orders documents, for example.

Sharding strategies vary from simple hashing of the document key (in a manner similar to key/value stores), to schemes based on the values of one or more fields in a collection of documents. Some document database support range partitioning, enabling you to group documents with adjacent key values together in the same shard. In a few cases, you can customize the sharding process, specifying how the database management system should determine in which shard a given document should be placed, and where to store the document in that shard. The rationale behind customizing the sharding process is that a generic sharding mechanism may not always be suitable for the likely access patterns used required your applications. For example, most document databases do not support referential integrity and they treat each document as an independent entity. By default, if the database management system needs to relocate a document to a different shard it will not worry about keeping related documents together. If you know that certain groups of documents are always likely to be accessed as together, customizing the sharding process can enable you to retrieve these documents more efficiently.

Most document databases that provide sharding automatically rebalance data across the shards as documents are created, modified, and deleted. However, the shard key that you select can have a significant influence on the performance of sharding. The following list highlights some recommendations:

  • If possible, select a shard key that matches the most common queries performed by your applications.
  • Do not use a shard key that has very few possible values. This can result in a small number of very large shards, reducing the effectiveness of distributing data across servers.
  • Unless you are partitioning data into ranges, do not select a shard key that has low randomness. Highly random keys will enable the database management system to spread the data more evenly across shards. Less random keys may result in hotspots occurring in the database.
  • If you are inserting a large number of documents and you are not hashing the shard key, do not base the shard key on the date and time that each document is added to the database. This can result in poor write scalability as multiple documents may be written to the same shard in quick succession.

Ensuring Availability

Commercial document databases usually implement availability by using primary/secondary replication with automatic failover and recovery. A few document databases utilize peer-to-peer replication.

In a typical document database, each shard in a cluster will be replicated, and read requests will be directed to the most available replica. You may also be able to implement cross-cluster replication to improve the locality of data. For example, if you replicate a cluster to a data center located in San Francisco, a user based in the western United States can access the data more quickly than if the cluster was based in Tokyo.


Refer back to the section "Implementing High Availability" in Chapter 1 for more information on replication strategies.

Maximizing Consistency

As described in the preceding sections in this chapter, many document databases attempt to optimize operations by reducing the overhead associated with these operations, therefore it is uncommon for a document database to guarantee any form of cross-document transactional consistency.

Dn313284.note(en-us,PandP.10).gifNote:
To ensure good performance, document databases typically only provide eventual consistency when you create, modify, or delete documents. It is often the responsibility of applications to handle any data inconsistencies that may occur.

Document databases that implement replication often use read quorums (described in Chapter 1) to reduce the possibility of inconsistencies occurring when applications retrieve documents. However, read quorums usually only operate within a cluster. Cross-cluster replication will increase the latency of inserts, updates, and deletes. A change made to data in Tokyo has to propagate to the cluster in San Francisco before it is visible to a user in the western United States, possibly resulting in inconsistencies between clusters. In many cases, documents that are modified in Japan will most likely be queried by Japanese users, and documents that are modified in the western United States will most likely be queried by American users, so such inconsistencies might not occur very often. However, your applications should be prepared to handle this eventuality.

How Adventure Works Implemented Scalability and Availability for the Product Catalog and Order History

The product catalog for the Shopping application contains category, subcategory, and product details. As far as the Shopping application is concerned, this data is read-only (the data can change, but such changes are very infrequent and are outside the scope of the Shopping application). Additionally, the set of categories, subcategories, and products is very static. On the other hand, the set of orders generated by the Shopping application constitutes a rapidly increasing collection of documents, as do the documents in the order history collection. Because the product catalog and orders exhibit decidedly different behaviors, the developers at Adventure Works decided to consider their scalability and availability requirements separately.

Implementing Scalability and Availability for the Product Catalog

The section "How Adventure Works Used a Document Database to Store the Product Catalog and Orders" earlier in this chapter described how the developers at Adventure Works divided the product catalog into two document collections; the first collection holds category and subcategory information, and the second collection contains the details of products.

The number of documents in the category and subcategory information collection is very small (there are only four categories), so the developers at Adventure Works decided not to shard this collection. Similarly, although the products collection is considerably larger, it is still of a reasonably finite size and is static, so the developers chose not to shard this collection either.

Dn313284.note(en-us,PandP.10).gifNote:
Sharding a small collection adds an overhead that can reduce the responsiveness of queries that retrieve data from this collection. If you are using MongoDB, you should only consider sharding if the dataset is bigger than the storage capacity of a single node in the system. For example, if the Adventure Works product line consisted of millions of items, the developers could have decided to shard the products collection by using a hash of the Product ID as the shard key. The reason for this choice is that most of the queries performed by the Shopping application that retrieve product information do so by specifying the product ID. Product IDs are assigned in a monotonic increasing sequence, so the most recent products have the highest IDs. It is also possible that these are the most popular products that will be queried most often, so hashing the product ID ensures that the documents for the latest products are distributed evenly across the shards.

To ensure good read-scalability, the developers implemented replication by using MongoDB replica sets with 12 members (the maximum number of nodes that a MongoDB replica set can currently contain). The default configuration of a MongoDB cluster is to perform all read and write operations through the primary server in a cluster and the secondary nodes act as failover servers in case the primary server should crash. However, because the Shopping application does not modify information in the product catalog, the Shopping application was configured to enable read operations to be satisfied by all secondary servers as well.

To reduce network latency, the developers deployed the Shopping application to multiple Windows Azure data centers located in various regions around the world. They also deployed a copy of the product catalog cluster to a collection of virtual machines running at each data center. Figure 16 shows the architecture of the Shopping application and the product catalog at a single data center. Notice that because of the read-only nature of the product catalog, the developers did not configure any form of cross-cluster replication between data centers; they simply deployed a copy of the same cluster at each data center.


The section "How Adventure Works Implemented Scalability and Availability, and Maximized Consistency for Shopping Cart Information" in Chapter 4, "Implementing a Key/Value Store," describes how the developers at Adventure Works deployed the Shopping application to the various Windows Azure data centers around the world.

Figure 16 - The product catalog implemented as a MongoDB cluster by using virtual machines in a Windows Azure data center

Figure 16 - The product catalog implemented as a MongoDB cluster by using virtual machines in a Windows Azure data center

Implementing Scalability and Availability for Order History Information

The Shopping application needs to be able to store and retrieve order history documents from a large and rapidly increasing dataset. The indexes over the OrderCode and CustomerId fields in documents in the ordershistory collection enable the application to locate OrderHistory documents quickly. However, the system may also need to support many hundreds of write operations per second if a large number of customers are placing orders concurrently. To reduce the possible contention that could occur, the developers decided to shard the ordershistory collection and hash the CustomerId as the shard key. This approach helps to keep OrderHistory documents for a given customer close to each other (ideally in the same shard), but distributes the documents for other customers across the shard cluster. To maximize availability, each shard is actually a replica set. Each replica is implemented as a MongoDB database running on a separate virtual machine hosted using Windows Azure (these virtual machines are distinct from the virtual machines that contain the product catalog).

The Shopping application is hosted in multiple datacenters. As described above, each data center hosts its own set of replicas containing the product catalog. This approach is feasible for data that does not change. The situation with order history documents is somewhat different, and it is impractical to copy order histories to replica sets in different datacenters (at least in real time). The developers instead chose to store order history information for a customer only at the datacenter to which that customer connects. This means that each datacenter holds only its own non-overlapping collection of order history documents. The disadvantage of this approach is that if a customer connects to a different datacenter then their order history documents will not be available, but this is likely to be an infrequent occurrence (customers tend not to move between geographic regions very often).

Summary

This chapter has described strategies for using a document database, and the circumstances under which a document store is a better choice than a key/value store for holding information. A document is an entity that contains a collection of fields, and a document database enables an application to perform queries that retrieve documents based on the values in these fields. In this way, they provide a more flexible mechanism for locating and fetching data than the simple approach provided by a key/value store. You can also create indexes over fields to improve the speed of queries, but indexes can also have a detrimental effect on insert, update, and delete operations, so only define the indexes that your applications really require.

You should pay careful attention to how you design your documents. Do not simply follow the relational approach because this will inevitably lead to poor retrieval performance. Instead, you should examine the requirements of your applications and the data that they use, and then design your documents to optimize the operations that read data; denormalize your data where appropriate.

This chapter also described how Adventure Works used document stores to hold the product catalog and order history information for the Shopping application. It discussed the document structures that the developers decided would best meet the requirements of the application, and has shown how the developers implemented the databases to maximize scalability and availability by using replication and clustering.

More Information

All links in this book are accessible from the book's online bibliography on MSDN at: http://msdn.microsoft.com/en-us/library/dn320459.aspx.

Show: