.gif)
by Michael Pizzo
Summary: Most applications deal with data
of some type or another, where the source of the data often resides in a
database. Yet for a variety of reasons the shape of the data in the database is
often different from the model that the application interacts with. This
article describes working with data in a database through a virtual "Conceptual
Model" more appropriate for the application. (12 printed pages)
Contents
Application Models vs. Storage Schemas
Database Schema Normalization
Representing Inheritance
Schema Changes
ADO.NET Entities
Enter the ADO.NET Entity Framework
An Application Model
Querying the Conceptual Model
Nested Results
Different Views for Different Applications
Modeling Results as Objects
Conclusions
Resources
Application Models vs. Storage Schemas
Modern applications, Web applications in particular, are
fundamentally about exposing and manipulating data of one type or another. The
data may be in the form of search results, inventory catalog, user profile,
account information, financial information, personnel information, map
coordinates, weather, and so on, but it's all data, and it's typically stored
in a database somewhere.
However, the data stored in the database is generally not in the
most appropriate form for the application to manipulate or expose to the user.
The relational data model of the database schema is typically (and rightly)
optimized around storage and integrity concerns, not for application usage.
As Dr. Peter Chen explains in his groundbreaking paper
introducing the Entity-Relationship Model, "The relational model…can
achieve a high degree of data independence, but it may lose some important
semantic information about the real world." His paper goes on to describe
an alternate Entity-Relationship Model that "...adopts the more natural
view that the real world consists of entities and relationships." (See
Resources.)
Simply put, today's object-oriented applications, not to mention
end-users, tend to reason about data in richer terms than the flat rows and
columns of a relational database. The "real world" includes a strong
notion of the type of the object, its identity, and its relationships with
other objects.
Putting aside the expressivity issues for a moment, even if all
of the application concepts could be represented through the relational model,
the application writer often doesn't have control over the database schema. Even
worse, the schema could change over time in order to optimize different usage
patterns, invalidating hard-coded access paths, mappings, and implicit assumptions.
Small applications typically start out directly embedding the
logic to map relational schema to application data objects. As the application
grows, or for applications that are built from the start as part of a larger enterprise
framework, the data-access logic is commonly broken out into a separate Data
Abstraction Layer, or DAL. Whether part of the application or a separate
component, hard-coding implicit assumptions about database schema,
relationships, usage conventions, and access patterns make it increasingly
difficult to maintain or extended this data-access code over time, especially
in light of changes to the underlying schema.
Let's take a look at some of these problems in a little more
detail.
Database Schema Normalization
Data in a database is generally laid out in a "normalized"
view as described by Dr. Codd, with separate, homogenous (rectangular) tables
containing columns of single scalar values. Redundancy is reduced in order to
improve integrity of the data through moving non-row specific values into a
separate table. Data from these individual tables is combined through joins,
based on implicit application knowledge about what different values within the
columns represent. Foreign keys may or may not be used between tables of
related information as a means of further enforcing data integrity, but do not
themselves define navigation paths or join conditions.
Let's take an example. Suppose you ran a marina that sold used
watercraft, and you wanted to track your inventory in a database, exposed to
customers through a Web application. The information you have to store for each
boat is: Registration, Make, Year, Length and Beam (width). For boats with
engines you want to store Make, Model, Year, Horsepower, type of fuel, and
SerialNumber of the engine(s). A fully normalized schema might break the engine
information into three separate tables, one containing the information for a
particular type of motor (Make, Model, Horsepower and fuel type, with Make and
Model comprising a Composite Key), one for each actual engine (SerialNumber,
Year, Make, and Model) and one that associated boats and engines. The resulting
schema might look something like Figure 1.
Figure 1. Fully normalized schema (Click on the picture for a
larger image)
In order to show a relatively simple inventory page containing
boat registration number, year, and make, along with associated engine information,
for all motor boats, your Web application may use the following query:
SELECT Boats.RegNum, Boats.Year AS Boat_Year,
Boats.Make AS Boat_Make, BoatEngine.SerialNumber,
BoatEngine.Year AS Engine_Year, BoatEngine.Make
AS Engine_Make, BoatEngine.Model AS Engine_Model,
BoatEngine.HP
FROM Boats
INNER JOIN (
SELECT EngineType.Make, BoatEngines.BoatID,
EngineTypes.HP
FROM EngineTypes
INNER JOIN (Engines
INNER JOIN BoatEngines
ON Engines.SerialNumber =
BoatEngines.EngineSerialNum)
ON EngineTypes.Model = Engines.Model
AND EngineTypes.Make = Engines.Make
) AS BoatEngine
ON Boats.RegNum = BoatEngine.BoatID
This query is not only fairly complex, but requires (and embeds)
an implicit understanding of how the tables are related; that the Make and
Model columns of the Engines table relate to the Make and Model columns of the EngineTypes
table, and that the EngineSerialNum and BoatID columns of the BoatEngines table
relate to the SerialNum and RegNum columns of the Engines and Boats tables,
respectively.
Also, the query attempts to return only motorboats (and not
sailboats) by performing an inner join between boats and motors. Of course,
this query would miss any motor boats being sold without an engine, and while
the query might exclude small sailboats, larger boats (including our 25-foot
Hunter sailboat) generally have motors. So, while the assumption may be valid
when the application is written, based on the schema and data at that time,
baking this type of implicit logic into queries within the application
introduces subtle dependencies on the data and schema that are hard to track
and maintain.
Clearly, although the schema is nicely normalized from a database
perspective, it is not in a very convenient form for the application. What's
worse, in order to retrieve the desired information (not to mention making
updates such as moving an engine to a different boat) implicit knowledge about
the relationships between the tables must be baked into the application. Changes
to the schema, for example combining the Engines and BoatEngines tables (a
reasonable degree of denormalization a DBA may consider in order to improve
performance) causes the application to break in ways that are hard to
anticipate and resolve.
Representing Inheritance
Now let's say you want to want to add additional information to
the schema. First you make explicit the difference between sailboats and
different types of motorboats by adding a "Style" column to the
table. Then, for sailboats, you add the type of Keel (Fixed, Swing, or Dagger)
and number of sails. For ski boats, you add whether it has a ski pylon and/or a
tower, and for MotorYachts, you want to add whether or not it has a flybridge. This
becomes challenging with the relational data model because each piece of
additional information only applies to a subset of the rows within the Boats
table. One way to represent this is by extending the base table ("Boats")
with members for each derived type, using Nulls for rows in which they do not
apply. For example, this information could be expressed in a single sparse "Boats"
table, as shown in Figure 2.
Figure 2. Representing hierarchy in a single sparse "Boats"
table (Click on the picture for a larger image)
As we see, the more properties we add for each derived type, the
more the schema of the overall table grows and the more we fill in non-relevant
fields with Null values. An alternate way to represent this same information
would be to break the additional information for Sailboats, Ski Boats, and
Motor Yachts into separate tables, as shown in Figure 3.
Figure 3. Storing extended information in separate tables (Click
on the picture for a larger image)
This layout avoids the need to add sparse columns to the base
table for each property of the derived type, but querying becomes even more
complex as each query has to join the additional tables in order to include
full information for each of the derived types. For example, to return boat
registration, year, make, and associated engine information for all motor boats
without a tower, we might write something like the following query:
SELECT Boats.RegNum, Boats.Year AS Boat_Year,
Boats.Make AS Boat_Make, BoatEngine.SerialNumber,
BoatEngine.Year AS Engine_Year, BoatEngine.Make
AS Engine_Make, BoatEngine.Model AS Engine_Model,
BoatEngine.HP
FROM (Boats
LEFT OUTER JOIN (
SELECT EngineTypes.Make, BoatEngines.BoatID,
EngineTypes.HP
FROM EngineTypes
INNER JOIN (Engines
INNER JOIN BoatEngines
ON Engines.SerialNumber =
BoatEngines.EngineSerialNum)
ON EngineTypes.Model = Engines.Model
AND EngineTypes.Make = Engines.Make
) AS BoatEngine
ON Boats.RegNum = BoatEngine.BoatID )
LEFT JOIN SkiBoats
ON Boats.RegNum = SkiBoats.RegNum
WHERE Boats.Style In ("Ski","Motor","PWC","Yacht")
AND (SkiBoats.Tower=False OR SkiBoats.Tower IS NULL)
Note that the Boats table must be joined with the SkiBoats table
in order to get the Tower information, and we must account for the NULL value
in our predicate for rows which are not Ski boats.
Schema Changes
As the tables grow, the DBA may decide to refactor the schema
such that all of the information for a particular type of boat is in one table,
as shown in Figure 4.
Figure 4. Completely separate tables for each boat type (Click
on the picture for a larger image)
This schema optimizes for queries against a single type of boat
at the expense of queries across types of boats. And, of course, it means that
the queries within your application have to change. Our query for boat and
engine information becomes the following:
SELECT Boats.RegNum, Boats.Year AS Boat_Year,
Boats.Make AS Boat_Make, BoatEngine.SerialNumber,
BoatEngine.Year AS Engine_Year, BoatEngine.Make
AS Engine_Make,BoatEngine.Model AS Engine_Model,
BoatEngine.HP
FROM (
(SELECT RegNum, Year, Make, Tower FROM SkiBoats
UNION ALL SELECT RegNum, Year, Make, NULL As
Tower FROM MotorBoats
UNION ALL SELECT RegNum, Year, Make, Null AS
Tower FROM PWC
UNION ALL SELECT RegNum, Year, Make, Null AS
Tower FROM MotorYachts
) AS Boats
LEFT OUTER JOIN (
SELECT EngineTypes.Make, BoatEngines.BoatID,
EngineTypes.HP
FROM EngineTypes
INNER JOIN (Engines
INNER JOIN BoatEngines
ON Engines.SerialNumber =
BoatEngines.EngineSerialNum)
ON EngineTypes.Model = Engines.Model AND
EngineTypes.Make = Engines.Make
) AS BoatEngine
ON Boats.RegNum = BoatEngine.BoatID )
WHERE (Boats.Tower=False OR Boats.Tower IS NULL)
Note that in order to query across all types of boats including
the Tower column specific to SkiBoats, we must explicitly project a Null value
for that field for the other tables within the UNION ALL.
ADO.NET Entities
Each of these examples highlights some of the challenges that application
developers face today when trying to expose, manipulate, and persist
interesting real-world models into a flat relational schema. The fact that the
schema may not be owned by the application developer, and may change over time,
helps explain why the data-access "goo" can occupy such a
disproportionate share of an application or framework in terms of code,
development, and maintenance cost.
Enter the ADO.NET Entity Framework
The ADO.NET Entity Framework will ship in the first half of 2008
as an extension to the .NET Framework that is part of Microsoft Visual Studio
code name "Orcas" and represents the first deliverable in a new
Microsoft Entity Data Platform for working with data in terms of a rich, common
Entity Data Model. The ADO.NET Entity Framework, is an implementation of Dr.
Chen's Entity-Relationship model on top of relational data . Instead of letting
storage representation dictate the application model, writing to a common
Conceptual Model allows applications greater expressivity in modeling data
using real-world concepts which can be flexibly mapped to a variety of storage
representations. (For more on the ADO.NET Entity Framework, see
Resources.)
The Entity Framework uses a Client View mechanism to expand
queries and updates written against the conceptual model into queries against
the storage schema. The expanded queries are evaluated entirely within the
database; there is no client-side query processing. These Client Views may be
compiled into your application for performance, or generated at runtime from
mapping metadata provided in terms of XML files, allowing deployed applications
to work against different or evolving storage schemas without recompilation.
An Application Model
Figure 5 shows a more appropriate application-oriented data model
for our Web application. Note that, although I use a class diagram to represent
the model, objects are just one way to expose the conceptual model to the
application in the Entity Framework. The same conceptual model could be
targeted directly using an extended SQL grammar and returned as polymorphic,
hierarchical records.
.gif)
Figure 5. An application-oriented conceptual model
The first thing we notice about this model is that this doesn't
look anything like any of the previous storage schemas. For example:
1.
The Engine class contains information from both
the EngineTypes table (Make, Horsepower, and Fuel) and the Engines table (Year
and SerialNumber).
2.
Instead of a BoatID property, the Engine class
contains a reference to a Boat.
3.
Boats contain a collection of zero or more
engines.
4.
Instead of exposing a Style property on Boat, or
having distinct tables for each, we've used the more natural inheritance
concept to distinguish the different types of boats.
The Entity Framework allows you to expose this conceptual model
to the application, using application concepts such as strong typing,
inheritance, and relationships, over any of the previously described database
schema. The fact that the mapping is done declaratively, outside of the application,
means that if the database schema evolves over time to optimize for different
access patterns, only the mapping has to change; the application can continue
using the same queries and retrieve the same results against the same
conceptual model.
Let's see how working with this conceptual model simplifies our
application patterns.
Querying the Conceptual Model
Given this conceptual model, querying a single polymorphic set of
boats becomes much simpler. For example, the following code uses this
conceptual schema to query the registration year, make, and engine information
for all motorboats without a tower.
SELECT boat.RegNum, boat.Year, boat.Make,
boat.Engines
FROM Boats AS boat
WHERE boat IS OF (Motorboat)
AND (Boat IS NOT OF (SkiBoat)
OR TREAT(boat AS SkiBoat).Tower = False)
Note that no joins are required in the query; entities are
strongly typed, relationships are traversed through properties, and collections
can be filtered according to types within the hierarchy.
Nested Results
In each of the first three queries written directly against the
database schema, the results would look something like Figure 6. Note that the
last boat (the 1996 BayLiner) appears twice. From looking at the data, we see
that the 1996 Bayliner is a MotorYacht with two Hino 310 engines. Because
relational data is flat, there is no good way to represent multiple engines in
a single row of the result; so, two rows are returned for the same boat—one for each engine.
Figure 6. Nested results returned as a rectangular table (Click
on the picture for a larger image)
The query against the conceptual model returns an "Engines"
column with a single row for each boat containing the collection of engines as
shown in Figure 7.
Figure 7. Results returned as a nested table (Click on the
picture for a larger image)
Alternatively, if only a subset of the data for each engine is
desired (for example, the Make and HP) that information can be projected out as
follows:
SELECT boat.RegNum, boat.Year, boat.Make,
SELECT engine.Make, engine.HP
FROM boat.Engines AS engine
FROM Boats AS boat
WHERE boat IS OF (Motorboat)
AND (Boat IS NOT OF (SkiBoat)
OR TREAT(boat AS SkiBoat).Tower = False)
Note that this query still does not require the developer to
write any joins; the relevant fields from the engine are projected out as a
nested column using boat.Engines as the source for the subquery.
Different Views for Different Applications
Web application frameworks in particular often expose different
views of the same data through different Web applications. For example, the
data you expose to unauthenticated Web clients may be a subset of the data
exposed to preferred members, which may be modeled differently than the data
exposed to internal administration and reporting applications. Similarly, the
schema of the data you work with within your application framework may differ
significantly from the schema of the data you exchange in Business to Business
transactions. The ADO.NET Entity Framework facilitates these types of scenarios
by allowing multiple conceptual models to be mapped to the same database
schema.
Modeling Results as Objects
The previous example shows returning results as records. In the
case of the ADO.NET Entity Framework, this means returning results as a DataReader,
which has been extended to support type information, polymorphism, nesting, and
complex values. With Entities, it is also possible to write queries against the
same conceptual model and return results as strongly typed business objects. When
modeling results as business objects, relationships may be navigated and
updated through typed properties on the objects rather than manipulating scalar
foreign key values. The business objects may optionally be identity resolved
and change tracked.
Use of the conceptual model through business objects is
illustrated by the following code example. This example shows querying against
the conceptual model to return boats as objects, navigate through properties to
the collection of engines, and remove any engines that are not the same year as
the boat. Changes are saved to the database through the call to SaveChanges().
// Specify query as an eSQL string
string eSql =
"SELECT VALUE boat FROM Boats AS boat " +
"WHERE EXISTS(" +
"SELECT engine from boat.Engines AS engine " +
"WHERE engine.Year != boat.Year)";
BoatInventory inventory = new BoatInventory();
ObjectQuery<Boat> motorizedBoats =
inventory.CreateQuery<Boat>(eSql);
// Include Engines in results motorizedBoats.Span.Include("Engines");
// Loop through Engines for each Boat
foreach(Boat boat in motorizedBoats) {
foreach(Engine engine in boat.Engines) {
if(engine.Year!= boat.Year)
boat.Engines.Remove(engine);
// alternatively
// engine.Boat = null;
}
}
inventory.SaveChanges();
Note that, as in the previous conceptual query examples, no joins
are required in the query. The object results are strongly typed and updatable,
and navigating and modifying the relationship between different types is done
through properties and methods, rather than updating scalar key values.
The same query could be written using the new Language Integrated
Query ("LINQ") extensions being introduced in the next version of
Microsoft Visual Studio and the .NET Framework (code-named "Orcas")
as shown here:
BoatInventory inventory = new BoatInventory();
// Include Engines in queries for Boats
inventory.Boats.Span.Include("Engines");
// Specify query through LINQ
var motorizedBoats =
from boat in inventory.Boats
where boat.Engines.Any(e => e.Year != boat.Year)
select boat;
// Loop through Engines for each Boat
foreach(Boat boat in motorizedBoats) {
foreach(Engine engine in boat.Engines) {
if(engine.Year != boat.Year)
boat.Engines.Remove(engine);
// alternatively
// engine.Boat = null;
}
}
inventory.SaveChanges();
Conclusions
In summary, there are at least six reasons why writing
applications directly to the database storage schema may be problematic:
1. You
may not have control over either the application object model or the storage
schema.
2. The
degree of database schema normalization may make it cumbersome to consume
directly from an application.
3. Certain
real-world modeling concepts can't be directly represented in a relational
schema.
4. The
application may be forced to embed implicit knowledge of how fields in the
database schema are used that is difficult to track and brittle to maintain.
5. Different
applications may want to expose different views of the same data.
6. The
database schema may change over time, breaking applications that directly write
to that schema.
The ADO.NET Entity Framework allows your applications to target a
conceptual model using application concepts such as strong typing, inheritance,
and relationships. This conceptual model can be mapped to a variety of storage
schemas. Because the mapping is done declaratively, outside of the application,
changes to the database schema over time to optimize for different access patterns
only requires that the mapping change; the application can continue using the
same queries, retrieve the same results, and make changes against the same
conceptual model.
Resources
"Next-Generation Data Access: Making
the Conceptual Level Real,"
J. Blakeley, D. Campbell, J. Gray, S. Muralidhar, and A. Nori (MSDN, June 2006).
"The ADO.NET Entity Framework
Overview" (MSDN, June 2006).
"The LINQ Project"
(MSDN, January 2007).
"Visual Studio Future Versions"
(MSDN, January 2007).
"Dr. Peter Chen: Entity Relationship
Model—Past, Present, and Future,"
April 2007.
"The Entity-Relationship Model: Toward a Unified View of
Data." Peter P.S. Chen. ACM Transactions on Database Systems (TODS), 1976.
"A Relational Model of Data for Large Shared Data Banks."
E.F. Codd. Communications of the ACM, 1970.
About the author
Michael Pizzo has worked for over 17
years in the design and delivery of data-access solutions and APIs at
Microsoft. Michael got started in data access as a program manager for
Microsoft Excel in 1989, and was involved in the design and delivery of ODBC,
along with the ODBC-based Microsoft Query Tool shipped with Microsoft Office.
He has been active in the standards organizations, sitting as Chair for the SQL
Access Group, working with X/Open on the CAE specification for "Data
Management: SQL-Call Level Interface (CLI)", serving as Microsoft's
representative to the ANSI X3H2 Database Committee, and as an elected ANSI
representative to the ISO committee meetings that defined and adopted Part 3 of
the ANSI/ISO SQL specification for a call-Level Interface (SQL/CLI). Michael
was a key designer and driver of Microsoft's OLE DB API, and later owned the
design and delivery of ADO.NET version 1.0. He is currently a software
architect in the Data Programmability Team at Microsoft, contributing to the
architecture and design of the next version of ADO.NET and core building block
for Microsoft's new Entity Data Platform, the ADO.NET Entity Framework.
This article was published in the Architecture Journal, a print
and online publication produced by Microsoft. For more articles from this
publication, please visit the Architecture Journal Web site.