1 - Data Storage for Modern High-Performance Business Applications
In the days of classic business systems analysis, the first tasks that designers typically had to perform were to understand exactly what information a proposed new system needed to store, and how the business was going to use this information. After much consideration, the result was usually a schema that described the structure of the data.
The schemas generated by many common data analysis tools consist of a collection of relational tables and views. There are many reasons for this, not least of which are the simplicity and elegance the relational model, and the fact that the relational model is well-understood by most database specialists.
However, although the relational model is a very powerful abstraction that helps to clarify and solve many data storage problems, it also has its limitations. This chapter describes some of the common challenges that organizations have encountered with the relational model and relational database technology, and discusses how NoSQL databases can help to address these challenges.
Issues with Implementing a Relational Database
Traditionally, businesses have invested a lot of time, effort, and money in modeling their processes and rationalizing the structure of the data that these processes use. The "business database" has become a key element of the IT infrastructure of many organizations, holding the data that supports the day-to-day operations. The core applications that a business depends upon often share this same database. This strategy ensures that the organization does not end up with a disparate set of schemas and duplicated data (possibly in different formats). In essence, the database is the central repository for all the key business systems. The database can also act as the point of integration between business systems, enabling them to work with each other and share information, as shown in Figure 1. In this example, the Sales application creates and manages orders for products that an organization sells, the Inventory application is used to monitor and control the stock levels for each product, and a Customer Relationship Management (CRM) application handles customer information and is used to coordinate marketing campaigns. The data access logic in each application, in conjunction with the database management system software, takes responsibility for coordinating access to the data shared by these applications, as well as ensuring data integrity, and maintaining the consistency of information.
This strategy looks fine in theory, but in practice many organizations have discovered several failings with this approach, including:
- It can take a long time to perform a complete and thorough data analysis, resulting in an extended lead time before developers could actually start writing any code. Business users want their applications up and running quickly. After all, they are the ones who need the system to help them make money for the organization.
- A relational database design consists of a set of tables and you access them by using a query language, most commonly SQL. Programmers prefer to write applications based on other abstractions, such as objects and methods. This mismatch between the database model and the programming model requires developers to either implement a mapping layer in their applications to convert between them, or learn how to access the database directly, possibly by embedding SQL statements directly in their code and using a vendor-specific API to send these requests to the database and convert the data returned into a collection of objects.
- Business requirements can change and new business requirements may appear. Data requirements are subject to change as a result. The longer it takes to construct a data model, the more likely it is to be out of date before the system is complete. A formal approach to database design does not lend itself to building agile business solutions. Organizations that are unable to adapt their systems to rapidly evolving requirements soon become extinct.
- Once a relational schema has been fixed it can become very difficult to modify, especially as more and more applications depend on it. If data requirements change, then any change in the structure of the schema will have an impact on every application that uses the database. If you revise the schema, then there is a real risk that key applications in your organization may stop working correctly unless you can take steps such as performing an impact analysis of every proposed change.
- Successful businesses grow, and the amount of data that they need to hold tends to grow accordingly. A database initially designed to hold many hundreds of records might not scale well if it has to hold millions, or even billions of items. This situation could result in the need to redesign the database to support vastly different volumes of data, with the corresponding impact on the systems that use the database.
- The business may have expanded across multiple, geographically dispersed sites. Additionally, it is now very commonplace for users to be sitting at a remote desktop, or running the application through a web browser connected across the Internet. A database designed to support applications running at a single location might not perform as well when applications access to it from all points of the globe. Network failure or extended network latency may cause database resources to become locked for an extended period, or unavailable, to an application running remotely.
- Sometimes the relational model is just not the best way to describe the data required by an application. The relational model is great at enabling you to implement the bulk of business requirements, but not all relationships can be easily described by the relational algebra that such a model is designed to support.
|Licensing costs can also be a significant factor. Many high-end relational database systems that include scalability and high-availability features are very expensive. This can be a massive barrier for many small to medium sized business that have to manage significant amounts of data but only have a limited budget. Many NoSQL offerings are open source, which makes them a very attractive option to these organizations.|
The common factors that transcend most of these concerns are agility, flexibility, performance, and scalability. For business systems that are critical to the day-to-day operations of an organization, database availability is also an important issue. The following sections summarize these concerns.
Agility and Programmability
The impedance mismatch between the structure of data in the database and the models used by most applications has an adverse effect on the productivity of developers. Developers can spend a significant amount of time developing a mapping layer that converts data between the object model used by the application and the tables stored in the database, rather than focusing on the business logic of the application itself. Using a commercial Object-Relational Mapping (ORM) layer such as the Microsoft ADO.NET Entity Framework can reduce some of the effort involved, but it is not always an ideal solution.
Chapter 3, "Implementing a Relational Database" describes the use of ORMs and the Entity Framework with a relational database.
In theory, the relational model is extremely flexible and can model almost any type of data and relationships. In practice, overfamiliarity with the relational model has led many developers to design solutions and user interfaces that overemphasize the tabular way in which the data is stored rather than implementing a data storage schema that meets the business requirements; a case of bending the solution to match the technology. The situation is not helped by the nature of SQL. As a query language it is brilliant for specifying from which tables a query should return data, and how to join, sort, and aggregate data. However, SQL is extremely focused on data as a set of rows and columns, and not all information can be shoehorned easily into this shape. Furthermore, the tendency of relational database designers to focus on a pure, normalized schema can lead to a database that contains a large set of narrow tables. The result is that most queries will need to perform join operations to reconstruct the entities that business applications need, and this process can be slow and resource intensive.
Chapter 3 summarizes strategies for reducing the overhead of multi-table joins by denormalizing the database schema.
Performance and Scalability
Many systems developed in the 1970s and 1980s made assumptions about the volume of requests that were likely to occur, the number of users making these requests, and the locations of these users. In many cases, it was also assumed that these statistics were unlikely to vary much over time. Relational database technology can work very well if it is tuned carefully to match these assumptions.
If the volume of traffic increased, many organizations elected to scale up the hardware on which it was running to handle the additional increased load. However, there comes a point at which scaling-up is no longer feasible, either through the limitations of the hardware or the expense of maintaining large servers. At this point you can scale out by partitioning the data into a set of smaller databases and running each database on a separate commodity (and cheaper) server.
|You can partition data vertically (splitting tables into subsets of columns and implementing each subset as a separate table) and horizontally (splitting tables into subsets of rows, and implementing each subset as a separate table where each table contains the same columns). You can place different partitions on different servers to spread the load and improve scalability.|
Figure 2 shows an example of horizontal partitioning across servers. In this example, the data for different rows is divided across tables running on different servers (each table has the same schema). When the various applications query this data, the requests are routed to the appropriate partition. This schema is also known as sharding. Many relational databases support sharding by implementing a technique known as federation; the different database servers maintain metadata that identifies which rows they contain, and an application can connect to any server that uses this metadata to route the request to the appropriate server. However, federation may require additional configuration, and can be expensive in terms of licensing costs. For this reason, it is common to implement sharding manually, by incorporating the routing strategy into the data access logic of the applications that use the data, as shown in Figure 2. This is also known as the Shared Nothing approach because the database servers do not share any information about the data that they hold with any other servers.
Chapter 3 discusses vertical and horizontal partitioning strategies for relational databases in more detail.
Sharding is also a key scalability strategy employed by many NoSQL databases. The section "Improving Scalability and Reducing Network Latency" later on in this chapter provides more information.
Partitioning enables you to handle more data and users, but can have an important impact on the performance of query operations that need to join data held in different partitions, so you need to design your partitions carefully to minimize these occurrences. Additionally, in a partitioned system, transactions that update data held in different partitions can have a significant effect on the throughput and responsiveness of your applications. Relational databases place great emphasis on ensuring that data is consistent, and they achieve this by implementing ACID (Atomic, Consistent, Isolated, and Durable) transactions. An ACID transaction that modifies data in multiple tables only completes successfully if every modification is successful; if any part of the transaction fails, then the entire transaction is rolled back and any work already completed is undone. To prevent concurrent users from seeing partial updates as a transaction progresses, most relational databases lock the data as it is changed, and then only unlock the data when the transaction completes successfully or it fails. In either case, the locks ensure that other users only see a consistent view of the data. If they try and access locked data they will be blocked until the lock is released. For short-lived transactions, concurrent users may only be delayed for a short while, but the longer a transaction takes, the more noticeable this delay gets.
If you scale out a relational database, transactions that previously encompassed data in a single database might now have the complication of spanning multiple databases, requiring careful coordination of locks across databases. In the early 1990s, the X/Open group defined a model for distributed transaction processing and published the XA specification. The XA specification describes how to use a protocol called the Two Phase Commit (2PC). The details are unimportant here except to state that this protocol enables an application to implement ACID transactions across more than one database. This consistency comes at a price however.
When the XA specification was first introduced, fast, private corporate networks had become commonplace and issues such as network latency were not considered too important. However, as the 1990s progressed, the increasing use of the Internet, and the World Wide Web in particular enabled many businesses to take advantage of the connectivity that was now available to build online systems. These systems had to support a widely varying number of concurrent users, generating an unpredictable volume of traffic, from sites located anywhere in the world. Additionally, the opportunities to capture and store more data meant that databases inevitably grew larger. Unfortunately, the emphasis that distributed transactions place on consistency can impact the performance of a system that spans a large public network with unpredictable connectivity and throughput characteristics; the more dispersed the databases that participate in a distributed transaction, and the slower the transaction becomes. A slow transaction locks data for longer, increasing the likelihood that a concurrent request for the same data will be blocked. As the traffic increases, the more likely such collisions are to occur, and the system can grind to a halt. You can take steps to reduce the scope for such slow-down, but it is very difficult to eliminate it altogether.
Chapter 3 provides more information on implementing efficient transactions in a relational database. You can also consider trading consistency for performance by implementing BASE (Basically Available, Soft state, Eventual consistency) transactions, although this strategy often requires additional infrastructure to work successfully. Appendix A, "Replicating, Distributing, and Synchronizing Data" in the guide "Building Hybrid Applications in the Cloud on Windows Azure" available from patterns & practices describes a possible implementation based on Windows Azure Topics and Subscriptions.
In more general terms, you can follow the Command Query Responsibility Segregation (CQRS) pattern to divide a distributed system up into functionality that is responsible for writing data and functionality that handles queries. For more information, download the patterns & practices guide "CQRS Journey" from MSDN.
The relational model does not address the issue of availability, although many relational database management systems offer high availability (for a price). In most cases, availability is implemented by maintaining a copy of the database. This copy might be hosted by a separate failover server that is brought online if the primary server hosting the database fails, or it might be sent to one or more additional servers that provide load-balancing for concurrent users (user requests may be transparently directed to whichever server is least-heavily loaded).
However, increased availability complicates consistency. If you maintain multiple copies of data, and that data changes, you need to ensure that each copy is modified consistently. The more changes that you need to make, spanning more databases, the slower transactions can become.
The Advent of NoSQL
A number of well-known organizations, building Internet-facing systems that capture huge amounts of data and that support massive numbers of concurrent requests, felt that relational technology did not provide the scalability, availability, and performance that they required. Instead, they decided to create their own non-relational solutions tailored to their requirements. Other organizations have followed a similar strategy leading to a selection of non-relational database systems, each designed to solve a specific set of problems. Collectively, this ever-expanding set of database systems is referred to as NoSQL.
|The term NoSQL is an historic anomaly, originally intended to convey the idea that applications do not use SQL to interact with the database. In a lot of literature on the subject NoSQL is defined as meaning "not only SQL," but this description would also fit many common relational database systems that implement custom extensions to the SQL language (such as Microsoft Transact-SQL), and these are not NoSQL databases in the manner that most NoSQL enthusiasts would recognize. A better term might have simply have been "non-relational," but the NoSQL tag has stuck and is now in common use. |
Throughout this book, we refer to a NoSQL database when we mean a non-relational database.
NoSQL databases come in a variety of shapes and functionality. Arguably, the only feature that unifies them is that they are not relational. Other than that, they can differ quite markedly from each other. However, the software industry has attempted to categorize NoSQL databases into a small set of functional areas: key/value stores, document databases, column-family databases, and graph databases. The following sections summarize the distinguishing features of each of these categories.
|Some NoSQL databases fit naturally into a single category, while others include functionality that spans categories. For example, some key/value stores can also exhibit some of the behavior of document databases. Therefore, you should not view these categories as a collection of definitive and constraining functional areas, but as a continuum of features.|
The article "Getting Acquainted with NoSQL on Windows Azure" available on the ISV Developer Community blog provides an overview of common NoSQL databases and describes some common implementations. You can also find more general information on the NoSQL-Database.org website.
A key/value store implements arguably the simplest of the NoSQL storage mechanisms, at least at a conceptual level. A key/value store is essentially a large hash table. You associate each data value with a unique key, and the key/value store uses this key to determine where to store the data in the database by using an appropriate hashing function. The hashing function is selected to provide an even distribution of hashed keys across data storage. The values deposited in a key/value store are opaque to the database management system software. Essentially, values are BLOBs and the key/value store simply retrieves or stores a value by using the key provided by the application. For this reason, most key/value stores only support simple query, insert, and delete operations. To modify a value an application must overwrite the existing data for the entire value. An application can store whatever data it likes, although some key/value stores impose limits on the maximum \size of values. In most implementations, reading or writing a single value is an atomic operation (if the value is large, writing may take some time).
Chapter 4, "Implementing a Key/Value Store" provides detailed information on creating and using a key/value store in a business application.
A document database is similar in concept to a key/value store except that the values stored are documents. A document is a collection of named fields and values, each of which could be simple scalar items or compound elements such as lists and child documents. The data in the fields in a document can be encoded in a variety of ways, including XML, YAML, JSON, BSON, or even stored as plain text.
|The term "document" in this context does not imply a free-form text structure, rather a document is the name given to a collection of the related data items that constitute an entity.|
The fields in the documents are exposed to the database management system, enabling an application to query and filter data by using the values in these fields. Figure 4 shows an example of a couple of documents holding sales order information in a document database. The items that comprise each order are held as a list with the order information.
The database in this example is designed to optimize query access to sales order information. In the application that uses this database, the most common operations performed on this data are queries that retrieve the details of each order. The details of the order are unlikely to change, so the denormalized structure imposes little overhead on updates while ensuring that queries can be satisfied with a single read operation.
A problem that a relational database specialist might have with this example concerns the shipping address that is embedded into the sales order. In a typical relational database, this information would be held in a separate table, enabling it to be modified without having to hunt down all the orders that reference it. However, the relational approach is not necessarily an advantage. A sales order is an historical document, and the shipping address should indicate where the order was actually sent. If the address of the customer changes at a later date, the shipping address for orders already fulfilled should not be changed (indeed, doing so may break the auditing requirements of the system). Implementing this functionality in a relational database would require maintaining version histories and timestamps for each shipping address, and each query that required the shipping address for an order would be more complicated as a result. The document database is more naturally tuned to this type of information, and queries are simpler as a result.
A typical document contains the entire data for an entity. The items that constitute an entity are application specific; for example, they could be the details of a customer, an order (as shown in Figure 4), or a combination of both. Additionally, a document store does not require that all documents have the same structure. This free-form approach confers great flexibility on applications, which can store different data in documents as business requirements change.
An application can query documents by using the document key (which is hashed, to help distribute data evenly), but a more common approach is to retrieve documents based on the value of one or more fields in a document. Some document databases support indexing to facilitate fast lookup of documents based on an indexed field. Additionally, many document databases support in-place updates, enabling an application to modify the values of specific fields in a document without rewriting the entire document. Read and write operations over multiple fields in a single document are usually atomic.
Chapter 5, "Implementing a Document Database" describes how to create and use a document database in more detail.
A column-family database organizes its data into rows and columns, and in its simplest form a column-family database can appear very similar to a relational database, at least conceptually. However, the real power of a column-family database lies in its denormalized approach to structuring sparse data.
|Column-family databases are also known as Wide Column databases or Extensible Record databases.|
For example, if you need to store information about customers and their addresses in a relational database (ignoring the need to maintain historical data as described in the previous section), you might design a schema similar to that shown in Figure 5. This diagram also shows some sample data. In this example, customer 1 and customer 3 share the same address, and the schema ensures that this address information is not duplicated. This is a standard way of implementing a one-to-many relationship.
The relational model supports a very generalized approach to implementing this type of relationship, but to find the address of any given customer an application needs to run a query that joins two tables. If this is the most common query performed by the application, then the overhead associated with performing this join operation can quickly become significant if there are a large number of requests and the tables themselves are large.
The purpose of a column-family database is to handle situations such as this very efficiently. You can think of a column-family database as holding tabular data comprising rows and columns, but you can divide the columns into groups known as column-families. Each column-family holds a set of columns that are logically related together. Figure 6 shows one way of structuring the same information as Figure 5 by using a column-family database to group the data into two column-families holding the customer name and address information. Other schemes are possible, but you should implement your column-families to optimize the most common queries that your application performs. In this case, queries that retrieve the addresses of customers can fetch the data with fewer reads than would be required in the corresponding relational database.
|A relational database purist will argue that this column-family schema is less optimal than the relational schema because of the data redundancy; the address details for customers 1 and 3 are repeated. However, the details of a physical building are unlikely to change very often (buildings don't move), so although this data occupies more space than it would in a relational database, maintaining this duplicate information is unlikely to have much overhead.|
The illustration in Figure 6 is conceptual rather than physical, and is intended to show the logical structure of the data rather than how it might be materially organized. Each row in a column-family database contains a key, and you can fetch the data for a row by using this key. Unlike a key/value store or a document database, most column-family databases store data in key order rather than by computing a hash to determine the location, so you should pay close attention to how applications are likely to query and sort data and select the row key appropriately.
Many implementations also allow you to create indexes over specific columns in a column-family, and this feature enables you to retrieve data by referencing the values stored in the individual columns rather than the row key. In the example shown above, if you created an index over the State column in the AddressInfo column-family, you could easily find all customers with a State value of WA.
|Column-family databases are designed to hold vast amounts of data (hundreds of millions of rows). Even if data is indexed, it can take a little time to find an item. Many column-family databases implement Bloom filters to help determine whether an item actually exists before attempting to retrieve it. In this way, the database does not have to spend time performing fruitless searches. For more information, see Chapter 6, "Implementing a Column-Family Database."|
Another feature of a typical column-family database is that the structure of the information in a column-family can vary from row to row; different rows can have different fields and the data does not confirm to a rigid layout. In the example shown in Figure 6, this variability is useful if the database needs to include the details of customers in other countries. For example, the United Kingdom includes county names rather than states in addresses, and uses post codes instead of zip codes. If Mark Hanson relocated to the UK, his address details might change as shown in Figure 7. Note that if you query a column-family database by referencing an indexed column that does not exist in every row, such as the State column described previously, the rows without this column will simply be omitted from the results (they will not be included in the index that spans this column). This feature enables a column-family database to hold sparse data very efficiently.
|In a column-family database, you can think of a column as a simple name/value pair and a column-family as a collection of columns. Unlike a relational database, the names of columns do not have to be static (they can actually be generated from data items).|
Although Figures 6 and 7 show the column-families for customers appended together to form a single logical entity they are most likely to be stored separately, with the CustomerInfo column-family on one disk and the AddressInfo column-family on another, in a simple form of vertical partitioning. You should really think of the structure in terms of column-families rather than rows. The data for a single entity that spans multiple column-families will have the same row key in each column-family. As an alternative to the tabular layout shown previously, you can visualize the data shown in Figure 7 as the following pair of structures.
In a column-family database, read and write operations for the part of a row in a single column-family are usually atomic, although some implementations provide atomicity across the entire row (spanning multiple column-families) as well.
Chapter 6, "Implementing a Column-Family Database" provides more information on designing and using a column-family database in a business application.
Just like the other categories of NoSQL databases, a graph database enables you to store entities, but the main focus is on the relationships that these entities have with each other. A graph database stores two types of information; nodes that you can think of as instances of entities, and edges which specify the relationships between nodes. Nodes and edges can both have properties that provide information about that node or edge (like columns in a table). Additionally, edges can have a direction indicating the nature of the relationship.
The purpose of a graph database is to enable an application to efficiently perform queries that traverse the network of nodes and edges, and to analyze the relationships between entities. Figure 9 shows an organization's personnel database structured as a graph. The entities are the employees and the departments in the organization, and the edges indicate reporting lines and the department in which employees work. In this graph, the arrows on the edges show the direction of the relationships.
A structure such as this makes it straightforward to conduct inquiries such as "Find all employees who directly or indirectly work for Sarah" or "Who works in the same department as John?" For large graphs with lots of entities and relationships, you can perform very complex analyses very quickly, and many graph databases provide a query language that you can use to traverse a network of relationships efficiently. You can often store the same information in a relational database, but the SQL required to query this information might require many expensive recursive join operations and nested subqueries.
Chapter 7, "Implementing a Graph Database" describes how to create and use a graph database in more detail.
Common Features of NoSQL Databases
Despite their differences, there are some common features that underpin most NoSQL databases. These features are the result of the common requirement for most NoSQL databases to provide high scalability and performance, enabling large numbers of requests to be performed very quickly and efficiently. However, to meet these goals, many NoSQL databases make tradeoffs between speed and consistency. This section describes the approach that many developers using NoSQL databases follow to achieve fast performance, while at the same time ensuring that the data that the databases hold remains meaningful.
Relational databases conceptually hold data as rows in tables, but key/value stores, document databases, and column-family databases all use their own non-relational structures to group related data together into aggregates, whether this data is an opaque BLOB, a document, or a collection of values held in a column-family associated with a given row key. The common features that these three NoSQL examples share is that an aggregate typically identifies a discrete set of data values, and a NoSQL database usually ensures that operations that affect a single aggregate are atomic.
|An aggregate in a NoSQL database is similar to a row in a table in a relational database.|
An aggregate is the unit of consistency in most NoSQL databases. You can consider operations that update multiple fields in a single aggregate as exhibiting ACID properties, although operations that modify data spread across more than one aggregate are unlikely to provide the same guarantees of consistency (some NoSQL databases do support ACID transactions that span aggregates, but these tend to be the exceptions rather than the rule). Note that in most column-family databases the unit of aggregation is the column-family and not the row, so a write operation that modifies data in several column-families for the same row key might not be atomic.
|The term "aggregation" comes from Domain-Driven Design (DDD), a collection of guidelines and patterns that you can use to design and implement elegant object-oriented systems. In a DDD approach, an aggregate is simply a collection of items that should be handled as a single atomic unit.|
|Graph databases tend not to use aggregates in the same manner as key/value stores, document databases, or column-family databases. This is due to their primary focus being the relationships between entities. The write operations performed on a single node or an edge are often atomic, and some graph databases implement ACID semantics for operations that define relationships spanning multiple nodes and edges.|
The aggregate is also the unit of I/O employed by many NoSQL databases. When an application retrieves a value from key/value storage, a document from a document database, or a set of values from a column-family, the entire aggregate (BLOB, document, or row from column-family) is fetched and passed to the application.
When you implement a NoSQL database, design it with aggregates in mind. Make sure that you understand how applications will query data, and how they are likely to process related information in operations. Design the database to optimize query access, store related data together in the same aggregate, and handle any denormalization issues when you write data back to the database.
|Currently, few NoSQL databases provide any design tools. Additionally, the schemaless nature of most NoSQL databases means that they do not provide the equivalent of the CREATE TABLE statements available in SQL (there are some exceptions). You typically write your own application code to create the aggregates for a NoSQL database, using the APIs provided by the NoSQL database vendor.|
Materializing Summary Data
One advantage that relational databases have over NoSQL solutions is their generality. If you can store information by using the relational model you can perform almost any query over this data, although the resulting SQL may contain complex joins and subqueries. To reduce the overhead associated with locating and retrieving the data for a complex query at runtime, relational databases provide views. You can think of a view as a precompiled, pre-optimized SQL statement that an application can query in the same way as a simple table. When an application queries a view, the underlying SQL statements that define the view are run.
Some relational database management systems take the notion of views a stage further. For data that is queried often but updated relatively infrequently, they support materialized views. A materialized view is a pre-computed dataset that represents the data exposed through a view; essentially it is a copy of the data that the SQL statement for a regular view would retrieve, and it is stored in a separate table. When an application queries a materialized view, it simply retrieves this copy of the data. Materialized views can become stale, and it is an application design decision to determine the frequency with which the data in a materialized view should be refreshed.
Although most NoSQL databases do not provide views in the same way as relational databases, many NoSQL databases enable you to adopt a similar approach to materialized views for applications that need to combine or summarize data. For example, if you store customer and address information in a column-family database such as that shown in Figures 6 and 7, but you regularly need to perform an analysis of customers based on their geographies such as "how many customers are located in WA?," you could define an additional column-family that contains summary information, as shown in Figure 10. In this case, the row keys are the states from the AddressInfo column-family.
For a relatively static dataset such as customers and address, the overhead of maintaining summary information might be small compared to the cost of computing this data every time an analysis is performed. However, for more dynamic data, such as customer orders, this overhead can become significant. In these situations, you should determine how accurate and up-to-date such summary information needs to be. If the data does not need to be completely fresh all of the time, you can perform the calculations on a periodic (daily, weekly, or even monthly) basis and update the summary information.
Many NoSQL databases provide map/reduce frameworks that enable you to implement incremental updates for summary data in a NoSQL database. Most map/reduce frameworks generate summary data on a per-aggregate basis, and then combine these per-aggregate summaries into totals. As an optimization mechanism, if the data in an aggregate is unchanged, the summary data for that aggregate does not have to be recomputed.
You can find detailed information about using map/reduce frameworks to analyze and summarize data at the mapreduce.org website.
|If you design aggregates carefully, you can optimize the way in which many map/reduce frameworks generate summary data.|
Implementing High Availability
Many relational database systems implement high availability through clustering. They maintain copies of data across several computers, and if one machine fails, requests can be transparently redirected to a working computer. However, the relational model, which places great emphasis on data consistency, was not designed with high-availability in mind, and the effort of maintaining redundant and consistent data across multiple databases while keeping performance at an acceptable level can consume considerable resources.
In contrast, most NoSQL databases are explicitly designed around high-availability, and their architecture is geared towards maintaining high performance and availability. Absolute consistency is a lower priority, as long as data is not lost and eventually becomes consistent.
|According to Brewer's CAP Theorem, a distributed system can only implement two out the following three features at any one time; consistency, availability, and partition tolerance. In a distributed environment, most relational database systems focus on consistency whereas most NoSQL databases are designed to maximize availability.|
NoSQL databases commonly implement one of two replication schemes for ensuring that data is available; primary/secondary replication (sometimes referred to as master/subordinate replication) and peer-to-peer replication, as described in the following sections.
In a primary/secondary replication cluster, a copy of the database is maintained at each participating server. One server is designated as the primary or master server, and all requests that write data to the database are routed through this server. The primary server updates its local copy of the database and then indicates that the write request is complete. The application issuing this request can continue in the knowledge that the change has been saved. At some point, the primary server propagates the same changes to each of the other servers, referred to as secondary servers or subordinates.
To ease the load on the primary server, many implementations of this scheme enable any server (primary or secondary) to respond to read requests.
|Read requests made by client applications should be spread across the servers to maximize throughput, possibly taking into account factors such as the location of the client application relative to each server to reduce latency. The routing logic for requests should be hidden from the client application, possibly inside a library that implements the connectivity to the database, often provided by the NoSQL vendor.|
Primary/secondary replication can help to maximize read performance at the cost of consistency. I It is possible for an application to save data to the primary server, and then attempt to read the same data back from a secondary server and get an older version of the data if the change has not yet been fully propagated. Equally, other applications may see different results when they query the same data, depending upon which servers they access. However, all changes will, at some point, find their way across all servers, and the system will eventually become consistent again.
|The section, "Improving Consistency" later in this chapter describes some mechanisms that many NoSQL databases implement to reduce the possibility of inconsistent data.|
Figure 11 illustrates the simplified structure of a primary/secondary replication cluster.
A simple primary/secondary cluster exposes three principal possible failure points:
- The primary server could fail. In this case, one of the secondary servers can be promoted to the role of the primary server while the original primary server is recovered. When the original primary server comes back online, it can act as a secondary server and be brought up to date with changes from the new primary server.
- A secondary server could fail. This is not necessarily a big problem because any read requests made by applications can be directed towards working server while the failed server is recovered and the changes it has missed in the meantime are applied.
- The network connecting the primary server to one or more secondary servers could fail. In this situation, one or more secondary servers will be isolated and will not receive changes from the primary server. The data stored by the secondary servers will become stale, and applications issuing read requests to these servers will be presented with out-of-date information. When connectivity is restored, the changes from the primary server can be applied.
Peer to Peer Replication
A primary/secondary cluster can help to balance the load in a system that is read-intensive, but the primary server can act as a bottleneck if the system performs a large number of write operations. In this situation peer-to-peer replication might be more beneficial.
In a peer-to-peer cluster all servers support read and write operations. An application connects any of the servers when it wants to write data. The server saves the changes locally and indicates to the application that the write operation is complete. The data is subsequently transmitted to all other servers.
|Many peer-to-peer systems implement the Gossip Protocol to exchange information about updates between replication servers. However, the more servers that a cluster contains, the longer it can take for updates to propagate, increasing the possibility of conflicting updates occurring at different servers.|
This approach spreads the load, but can result in conflicting changes if two concurrent applications change the same data simultaneously at different replication servers. In this situation, the system needs to resolve the conflicting data. The section "Improving Consistency" later in this chapter provides more information.
Figure 12 shows the structure of a peer-to-peer replication cluster.
Improving Scalability and Reducing Network Latency
Scalability is a primary objective of many NoSQL databases. The most common strategy used is sharding; data is evenly partitioned amongst separate databases, and each database is stored on a separate server. As the workload increases over time, you can add further servers and repartition the data across the new servers to handle the additional load.
|Sharding is not so useful for graph databases. The highly connected nature of nodes and edges in a typical graph database can make it difficult to partition the data effectively. Many graph databases do not provide facilities for edges to reference nodes in different databases. For these databases, scaling up rather than scaling out may be a better option.|
Different NoSQL databases provide varying implementations of sharding; a few adopt a Shared Nothing approach where each shard is an autonomous database and the sharding logic is implemented by the applications themselves. Many provide a more federated approach, also known as Auto Sharding in some NoSQL databases, where the database software itself takes on responsibility for directing data to the appropriate shard.
When you shard data, it is helpful to try and keep data that is accessed together in the same shard. In NoSQL terms, these means that you should design your aggregates with sharding in mind and understand the relationships that might occur between aggregated data to avoid splitting information that is frequently accessed as part of a single logical operation across different servers.
|It is always better to explicitly design a NoSQL database to take advantage of sharding rather than try and shard an existing database, otherwise you may lose the benefits of sharding by requiring that your applications retrieve data from multiple servers. You cannot simply bolt scalability onto a NoSQL database!|
Sharding can also help to decrease latency. If you have a widely dispersed set of users, you can host the shards containing the data that users are most likely to access at sites geographically close to those users. For example, to reduce the network latency that occurs when a user logs in and provides their credentials, aggregates holding the details of users located in Boston could be hosted at a server in New England, while aggregates containing the details of users who live in San Francisco could be stored on a server in California. Again, this strategy requires that you pay close attention to the design of your aggregates so that you can more easily partition data in this manner.
It is important to understand that while sharding can improve scalability and reduce network latency, it does not improve availability. For this reason, many NoSQL databases combine sharding with replication; each shard can be replicated following the primary/secondary or peer-to-peer replication topologies described in the previous section.
Sharding and replication are integral to many NoSQL solutions, but an increasing number of relational database systems also provide these features. However, the implementations in relational solutions tend to be more complex (and expensive) than the equivalent functionality in most NoSQL systems.
Chapter 3, "Implementing a Relational Database," discusses how to apply sharding to a relational database. Chapter 4, "Implementing a Key-Value Data Store," Chapter 5, "Implementing a Document Database," and Chapter 6, "Implementing a Column-Family Database" describe possible sharding strategies for NoSQL databases.
Replication and sharding are important strategies that can help to ensure good performance and availability, but at the increased risk of inconsistency between replicated data. As described throughout this chapter, most NoSQL databases provide eventual rather than immediate consistency, and you should design your applications based around this model of working. However, many NoSQL databases include features that help to reduce the likelihood of inconsistent data being presented to an application. These strategies include defining quorums, and using data versioning.
|Some NoSQL databases implement ACID transactions. In these cases, write consistency is guaranteed.|
Defining Read and Write Quorums
Inconsistencies occur if data is replicated across several servers, but not all of those servers agree on the current value of that data. Quorums help to reduce inconsistencies by ensuring that a significant subset of the servers in a replication cluster (a read quorum) agree on the value of a data item. When an application performs a query, the data values stored by the read quorum are returned. However, for this strategy to be successful, when an application writes data to a database it must be sure that this data has been successfully stored by a write quorum of the servers in the cluster.
Ideally, the use of quorums, and the number of servers that constitute a read or write quorum in a cluster, should be totally transparent to applications reading and writing data; they should simply be able to write data to a cluster and read from a cluster without requiring any knowledge of how many servers are being updated or interrogated. Behind the scenes, a typical read operation will retrieve data from as many servers as necessary until the number specified by the read quorum can agree on a value. Similarly, a write operation will store the data to as many servers as are specified by the write quorum, and will not complete until all of these servers indicate that the data has been written successfully, or that the data cannot be saved due to a conflict.
Many NoSQL databases support such transparent configurations, and in many cases an administrator can silently add or remove servers from a quorum without interrupting applications. However, it is important to understand that the more servers you include in a write quorum the more servers will need to be updated before the database can indicate a successful write operation to an application, and this can in turn have an impact on the performance of the database and the latency of your application. The same is true for read operations; the bigger the read quorum, the more servers have to respond with the same data before the read operation is considered successful.
The number of servers in a quorum can vary between read operations and write operations, and a quorum does not have to constitute a majority of servers in a replication cluster. For example, in a peer-to-peer replication cluster, if you judge that inconsistencies are unlikely to occur (the chances of two applications updating the exactly same data within a short period of each other is small), you could specify that as long as two servers in the cluster agree on the value of a data item then that value should be returned. You can apply a similar principle to write operations.
Versioning Data and Resolving Conflicts
Even in a system that implements ACID transactions, or that provides read and write quorums to maximize data consistency, an application can still end up viewing or updating inconsistent data. For example, application A could retrieve a data value from a database, modify it, and then store it back in the database. However, between retrieving the data and storing the modified version, application B may have read the same data and changed it to a different value. If application A saves its data, it will have overwritten the change made by application B (it is not even aware that application B had changed the data). The result is a lost update.
A lost update may not be a problem to an application, depending on the nature of the application and the context in which it is running. In other cases, it may be critical. Many relational database management systems reduce the possibility of lost updates by implementing pessimistic locking; when an application reads data, that data is locked and remains locked until the end of the transaction. Another application attempting to read the same data is blocked until the lock is released, whereupon it will see any changes made to that data by the first transaction. However, pessimistic locking schemes can severely impact performance and throughput of an application, so most applications prefer to use an optimistic locking scheme, as follows:
- When an application reads a data item it also retrieves a piece of version information.
- If the application attempts to update the data item in the database, it re-reads the version information from the database to make sure it has not changed.
- If the version information is unchanged, the application saves the modified data back to the database together with new version information.
- If the version information is different, the application retrieves the latest value of the data from the database (another application must have changed it since it was last read), and then returns to step 2.
This strategy still has a small window of opportunity to lose an update (the data could be changed by another application between steps 2 and 3). Some NoSQL databases provide atomic compare-and-set operations that can eliminate this possibility; they momentarily lock data during steps 2 and 3 in the expectation that these tasks will be performed quickly.
The way in which different NoSQL databases format version information can vary significantly, and some implementations (mainly column-family databases) enable you to store multiple versions of the same data in an aggregate together with a timestamp indicating when that version was created. This approach enables you to retain a history of the modifications made to a data item, and if necessary an application can use this information to help resolve conflicts in inconsistent data (if an application reads different values for the same data from different servers in a replication cluster, it can use the history information to determine which is the more recent version, and possibly update the outdated information).
Timestamping depends on different servers in a cluster remaining time-synchronized with each other, and this strategy may not work well if they drift slightly or the servers are remote from each other (it takes time to transmit data over a network). An approach to conflict detection commonly implemented by peer-to-peer replication clusters is to use vector clocks. Each server in the cluster can maintain a count of the number of times it has updated an item, and include this information when it synchronizes updates with other servers in the cluster. For example, in a two server cluster comprising servers A and B, if server A updates an item it can store a vector containing the counter "A:1" locally with this item. When the update is synchronized with server B it will save the same vector with its local copy of the data. If server A makes another modification to the same data, it increments its counter in the vector for this item to "A:2." Meanwhile, if server B modifies the same information (before synchronizing with server A), it appends its own counter to the vector stored locally in server B to "A:1, B:1." The next time servers A and B attempt to synchronize, they should observe that the vectors for this data are now different ("A:2" in server A, and "A:1, B:1" in server B). This enables the two servers to discover that a conflict has occurred. However, it does not necessarily indicate how to resolve this conflict; this decision is often deferred customizable logic in the database servers, or to the applications that access the data.
Schemas and Non-Uniformity
When you design a relational database, you decide on a set of tables, and for each table you specify the columns it contains, and for each column you select the data type and range for the information it holds together with any other attributes, such as whether it allows NULL values. This is the schema of the database. Developers building applications that use a relational database have to have at least a passing understanding of the database schema, so that they don't try and store data of the wrong type, or omit mandatory information from an entity. Using an ORM can help to abstract some of this information, but it is unlikely to be able to completely isolate the structure of the database from the applications that use it. Some dependencies or features of the database always seep through and somewhere inside the application or the ORM there will be one or more SQL statements that reference the database schema directly. This coupling of an application to a database can lead to brittle solutions. If the database schema needs to be modified due to changed or new business requirements, it can be difficult to make such modifications without impacting every application that references that database. Some organizations have used strategies such as implementing additional views, triggers, and stored procedures that provide a mapping from the old schema to the new, enabling existing applications to continue functioning unchanged, but each of these items adds an extra level of indirection and they will inevitably have an impact on the performance of applications that use them.
NoSQL databases are frequently described as being schemaless. When you create a NoSQL database, you should pay attention to the structure of the data and the aggregates that will hold this data, but in most cases it is the applications that actually create these structures rather than a separate set of database utilities or schema editors. The simplified APIs exposed by most NoSQL databases typically enable an application to store and retrieve data, but rarely impose any restrictions on what this data actually is. This approach essentially gives applications free reign over the database, and as business requirements change applications can modify the structure of the data that they store.
|Most column-families databases require that you explicitly create a column-family before you can reference it in a row. However, the columns in the column-family do not have to be defined, and can vary from row to row.|
However, this freedom also brings with it a set of responsibilities and problems. If an application is modified to change the structure of the data that it stores, any existing data already in the database will not change, and the database ends up holding a non-uniform set of data. Therefore, applications must be prepared to handle varying data formats or they might risk losing information (or crashing). This approach can complicate the data access logic in an application, and requires a very disciplined approach to managing application development.
|NoSQL databases are often described as being schemaless. What this really means is that the responsibility for managing the structure of data has moved from the databases to the applications that use them.|
Integrating NoSQL Databases into a Polyglot Solution
The code for numerous NoSQL databases has been made available for public download, and the IT departments in many organizations are increasingly looking at these databases to resolve their own data storage problems. The major task is integrating this disparate software into a cohesive system. In the relational world, the database was viewed as the integration point for applications, but the dissimilarity of the various data structures means that this approach is clearly not viable for NoSQL solutions. Instead, a common technique is to implement a web service as a façade; the web service provides an interface that exposes business operations and converts requests into operations that communicate with the appropriate database. The web service can receive and send requests in a standard format, such as REST, and serialize data in a well-understood and portable format, such as JSON. Additionally, the web service can handle cross-cutting concerns such as security and inter-database integrity. Following this strategy, you can incorporate any number of databases, or switch to a different implementation should one database no longer meet the requirements of your organization. If you retain the same interface to the web service, applications that connect to the web service should be unaffected by such a change. Figure 13 shows the structure of this solution:
However, although this model is conceptually simple, there are several challenges that you may need to resolve to construct a practical solution. For example, how do you maintain consistency and integrity across different databases if your application expects operations to be atomic? These issues are discussed in Chapter 8, "Building a Polyglot Solution."
Relational databases provide an excellent mechanism for storing and accessing data in a very generalized manner, but this generalization can also be a weakness. In some cases, applications need to perform very specific types of operations that require data to be accessed in certain ways, and the relational model might not always provide the most optimal way to meet these requirements. NoSQL databases are designed to handle these situations.
A variety of NoSQL databases are available, each intended to focus on a particular data storage and access strategy. While a typical NoSQL database might not be as comprehensive as a relational database, its focus on a well-defined range of tasks enables it to be highly optimized for those tasks. The key to success is to understand the features of different NoSQL databases, and then use these features to implement a repository that matches the specific requirements of your applications.
The white paper, "NoSQL and the Windows Azure Platform" provides a good comparison between NoSQL and relational databases, focusing on Microsoft technologies. You can download this white paper from the Microsoft Download Center.
The following table compares the common features and functionality of relational (SQL) and non-relational (NoSQL) databases, highlighting the strengths of the different approaches to structuring the data and the types of applications for which each technology is most appropriate.
Standardization and interoperability
Mature technology, well understood.
Subject to many ANSI and ISO standards, enabling interoperability between conforming implementations.
Standard APIs (ODBC, SQL/CLI, and so on) enable applications to operate with databases from different vendors.
New technologies with no common overarching model or standardization.
Each NoSQL database vendor defines their own APIs and data formats. Minimal interoperability between different vendors is possible at the database level.
Application code written for one NoSQL database is unlikely to work against a NoSQL database from a different vendor.
Storing complex data
Data for complex objects is often normalized into several entities spanning multiple tables to reduce data redundancy. It may require intricate SQL queries to reconstruct the data for a single complex object from these tables.
ORMs can abstract some of the details, but this extra layer can lead to inefficiencies.
Can store data for complex objects without splitting across aggregates, enabling a much simpler mapping between objects used by applications and the data stored in the database. This enables fast query access at the possible cost of additional complexity in application code when storing or updating denormalized data.
Relational model is very generalized. SQL supports ad-hoc queries by joining tables and using subqueries.
Very good at summarizing and grouping relational data.
Less good at handling complex non-relational queries.
Databases are designed to optimize specific queries, most commonly retrieving data from a single aggregate by using a key.
Most NoSQL databases do not support data retrieval from multiple aggregates within the same query.
Summarizing and grouping data may require implementing map/reduce functions to work efficiently.
Graph databases can be excellent for handling complex non-relational queries.
Most suited to scale-up scenarios rather than scale out due to the performance of distributed transactions and cross-database queries.
Some relational vendors support clustering and sharding as optional extensions.
Mostly designed with support for scaling out built-in. Many NoSQL databases provide seamless support for clustering and sharding data.
Performance with large datasets
Can require significant tuning to read from or write to large datasets efficiently.
Designed to be very efficient for handling large datasets.
Data consistency and transactions
Designed to be highly consistent (ACID), but at the cost of transactional performance. Transactional consistency can slow down operations in a distributed environment.
Designed to be eventually consistent (BASE) in a distributed environment.
Some support for ACID properties for updates within an aggregate, but cross-aggregate operations are not guaranteed to be atomic.
Careful use of quorums can help to reduce instances of inconsistency.
Relational databases can be easily shared between different applications. The database acts as the point of integration between applications.
Databases are usually designed specifically to support a single application. Application integration is usually handled by application code.
All links in this book are accessible from the book's online bibliography on MSDN at: http://msdn.microsoft.com/en-us/library/dn320459.aspx.
- You can download the guide "Building Hybrid Applications in the Cloud on Windows Azure" from MSDN.
- The guide "CQRS Journey" is available from MSDN.
- The article "Getting Acquainted with NoSQL on Windows Azure" is available on the MSDN blogs site.
- You can download the white paper "NoSQL and the Windows Azure Platform" from the MSDN blogs site.
- You can find more information about NoSQL databases at http://nosql-database.org/.
- You can find detailed information about using map/reduce techniques to summarize data at http://mapreduce.org/.