Ted Neward
January 2008
It’s once again the monthly
meeting, and after introductions and the monthly entreaty to our Higher Power
for wisdom and guidance as we go about our architect lives, one of the
long-time members asks to address the group. Standing, he takes a deep breath,
and I notice the beads of sweat on his forehead as he begins.
“Hi, my name’s Kevin, and I’m
an architect.”
“Hi, Kevin.”
The simple greeting elicits a
grateful smile at the support implied therein. “Thanks. I’ve been buzzword-free
for almost a year now, and I’m finally realizing that you can be an architect
without spending all your time in PowerPoint.” Murmurs of respect ripple
through the group as he continues. “But I’m getting ready to roll on to a new
project, and I need to figure out how our developers will… access the
database.”
As the group
stirs, Kevin tries to explain. “I just don’t know if they should use an
object/relational mapper, or if they should use stored procedures, or…”
But it’s too
late. Like a wild racehorse, the group has the bit between its teeth, and the
conversation runs wild from there.
“Stored
procedures are the only way to go!”
“You’re out of your mind! Keep
it light on the database and just write queries from the middle tier!”
“Are you both nuts? That just
creates more work! You have to keep your storage implementation details
hidden!”
“Code generation!”
“Mobile objects!”
As I sit in the corner,
nibbling on my cookie and sipping my (hopefully heavily spiked) punch, I notice
that Kevin has returned to his seat to try and make sense of the ensuing chaos.
The thought flickers across my mind as he tries to listen to the arguments
flying back and forth: How did this subject of data access become such a
hot-button issue for us?
*
* *
No topic in software development seems to
raise as much excitement, interest, and argument as that of how developers
should fetch and store data to a relational database. Perhaps its ubiquity
serves as the reason for the heatedness of the debate—after all, accessing and
storing data to a database is one of those elements of software development
that permeates across all projects, and even across platforms and operating
systems. Perhaps the reason lies in the fact that there are many different ways
to think about storing and retrieving data. Perhaps developers just need
something to argue about, and this serves as the universally-agreed
industry-wide topic.
Regardless of the reason, no other subject
in architecture and design triggers as much fear and adrenaline in the
architect as the innocent project kickoff meeting question of “How should we
fetch data from the database?” No matter what answer the architect gives, there
will be someone in the meeting who will disagree—violently—with the approach
chosen. Logical reasoning will be attempted, then discarded. Performance
figures will be made up and cited. Scalability concerns will be tossed around
liberally without benefit of fact or verification. By the end of the
discussion, if it in fact ever ends, the architect will silently curse the Gods
of Computer Science, because whatever sins mankind committed surely did
not merit this kind of punishment.
*
* *
Data access (and storage, when you get
right down to it) comes in a variety of flavors, and at the risk of once again
rousing the sleeping bear, it serves the discussion well to go over some of the
principal players in this drama, along with their commonly attributed
advantages and disadvantages. However, to avoid any sort of hurt feelings or
implied endorsements, no products or specific implementations will be
discussed; instead, the focus will remain entirely on the concepts, which means
certain levels of detail will have to go unexplored. And, it should go without
saying, many of the ideas discussed here can be combined in various ways,
creating a nearly infinite range of possibilities; for purposes of simplicity,
just the core ideas will be raised here.
The simplest approach, in some ways, is the
call-level interface (CLI) approach, in which a library provides basic
SQL call-and-retrieval access to the database, usually through a standardized
API. In previous decades, this was a proprietary library provided by the
database vendor, but in the mid-'90s Microsoft created a standardized
“umbrella” API, Open Database Connectivity (ODBC), that became highly
successful because it encapsulated many of the details of working with a
particular database implementation. Later, as the “managed platforms” came into
fashion, similar CLI libraries emerged for both platforms (JDBC to the Java
community and ADO.NET for the common language runtime [CLR]) that
architecturally resembled ODBC, with the caveat that ODBC was designed for
procedural languages—that is, C and its siblings—and both JDBC and ADO.NET were
clearly “object-ish" in nature.
The benefit of the call-level interface
lies in its simplicity and direct control over SQL—the developer specifies the
SQL to execute against the database, hands it off to the library, and harvests
the results, usually by way of some kind of iterator across an
array-of-name/value-pairs. The drawback to this is similarly rooted in the
CLI’s simplicity and direct nature—for example, the SQL calls can, without some
kind of discipline to the contrary, end up scattered across the entirety of the
code, making it difficult to refactor or evolve the database schema. Similarly,
lack of any sort of “automated” support in writing the SQL or harvesting the
results means developers can easily get out of sync with the actual database
schema, with type mismatch errors at run time an all-too-common result. Just
writing the code to set up the query and iterate through the results can be
tedious and error-prone, and if developers are not aware of it, simple
concatenation-of-strings when constructing the query can create a security hole
(known in the security world as a SQL injection attack).
Most modern databases offer an ability to
perform non-trivial processing on the database itself, executable in a way
similar to a remote procedure call (RPC), known to most developers as stored
procedures. Stored procedures, or sprocs as they are sometimes
called, look alarmingly like traditional procedural programming (hence the
name), and are usually executed through a call-level interface as described earlier,
using standard SQL syntax for passing parameters and gathering the results of
the procedure call. Although the execution of sprocs is standard, their
definition is definitely not governed by the ANSI SQL committee; this leaves
stored procedure languages looking like whatever the database vendor believes
they should look like.
The principal advantage of a sproc lay is that
it is executed on the database server, typically inside the same process as the
query engine itself, meaning that if multi-step processing (“fetch data,
analyze it, perform some processing, store some data, repeat for a different
set of data,” and so on) is necessary, the queries are all executing against
local data files instead of having to move across the network. Additionally, because
the sproc language is typically a vendor-proprietary extension to SQL, vendors’
extensions and enhancements to SQL are available to simplify the development
tasks.
At the other end of the spectrum, if you
will, lies the object/relational mapper (O/R-M), generally a library or
code-generative tool that tries to “hide” the differences between the
object-oriented programming languages we favor and the relational database
preferred by the enterprise. So much has been written about how such toolkits
work and the relative benefits of the O/R-M that little benefit accrues from
discussing it in depth here. Readers unfamiliar with the O/R-M discussion are
encouraged to check out Martin Fowler’s Patterns of Enterprise Application
Architecture, which stands as the best discussion of O/R-M in one volume to
date.
Some O/R-M tools or utilities will create a
relational database schema from an object domain model, leaving the developers
almost completely isolated from the details of SQL, although others will go to
longer lengths to map developer-designed domain objects against a database
administrator (DBA)–designed schema, and still others will construct a domain
model from an existing schema. (Some will do all three, depending on which tool
or utility is run.) This distinction will be particularly important for those
environments in which the database model is not “owned” by the developers.
Problems with object/relational mapping in
the Java community (specifically, with Enterprise Java Beans [EJB] Entity
Beans) gave rise to a variation on the concept known as data transfer
objects (DTOs). With this idea, instead of domain objects taking care of
their own persistence, data transfer objects, whose sole intention is to carry
data from the application server to the presentation server (typically a Web
server), are transmitted together when requested. This has the advantage of
allowing developers to continue to work with objects, yet avoid some of the
more heinous network input/output (I/O) traps that an O/R-M can generate.
Unfortunately, DTOs are typically “dumb” objects because they are essentially
nothing more than carriers of data; they do not have the encapsulation and
modeling benefits that are typically associated with domain object models. This
sometimes gives rise to “parallel object trees,” in which domain objects are
defined in a parallel class hierarchy to the DTOs that are used to store and
fetch their internals. There is also the added disadvantage that DTOs must
usually be manually written because they often correspond more closely to how
the data is being sent back and forth instead of how the data conceptually
relates as objects. This makes code generation trickier.
A variation on this theme is the idea of mobile
objects, in which a software service accepts graphs of objects serialized
using the serialization capabilities built in to the managed platform and
executes methods on those objects after they are on the application server to
persist or fetch their data. This allows developers to use whatever data-access
tool they find most attractive while on the application server, and it generally
reduces the amount of network I/O traffic from the presentation server. It
generally avoids the “dual hierarchy” problem found in DTO systems, but again,
this does not provide a solution to the problem of mapping relational tuplesets
(rows) to objects.
One other option, increasingly popular with
stand-alone systems or applications that have some flexibility in how data is
to be stored is the object database, in which objects are stored in their
“native” form instead of in a relational database. This makes writing the
persistence code simpler, because there is no mapping that needs be written.
Typically, object database implementations also refactor more easily than
relational systems, because the object database can detect the difference in
the object definition (which serves as the database schema) and automatically
migrate the stored schema to match. Naturally, this is not perfect—certain
refactorings will confuse even the most sophisticated object-oriented database
management system (OODBMS) and require some hand-coding, but in general, this
is simpler for most developers than refactoring a relational schema.
However, the big drawback to the OODBMS is
that it is not a relational database and, therefore, the data is “inaccessible”
without some kind of adaptation. Certain OODBMS vendors are taking a variety of
approaches to adapt to this problem, such as providing relational export
services or providing some kind of relational “view” over the object data, but
this then presents the same impedance mismatch as providing an object “view”
over relational data and can lead to many of the same problems. And, certainly,
the object database can yield the same network I/O issues as an O/R-M, because navigating
object graphs can lead to multiple network round trips, depending on how the
OODBMS is configured.
*
* *
In many ways, the data access problem lies
at the nexus of a number of concerns that all combine to create a problem that
is on the surface simplistic, yet upon investigation is nearly bottomless.
The database traditionally lies on a
separate computer from the program that wants access to the data, thus
incurring the hideous temporal cost of traversing the network when data is
needed. The relational approach of storing data favored by IT departments
worldwide views associations in a fundamentally different way than objects do,
meaning that the traditional one-to-many association that both objects and
relations use everywhere creates a point of contention. Toss in the fact that
most object languages have no concept of “sets” or “tuples,” a core principle
behind the relational mindset, just to make things interesting.
Even the basic notion of what how a
particular entity—a customer, an asset, an order, or whatever particular
“things” the application is working with—is defined runs afoul of both the
database and the object system because both want definitions that may or may
not line up exactly. Even in the rare case where the two entity definitions do
line up perfectly, there arises an unhealthy tension over who “owns” these
entity definitions—the database administrator, whose perspective is typically
enterprise-wide, or the application developers, whose perspective is typically
focused on the particular application a department or customer needs. And, of
course, behind all of this stands the ugly spectre of the run-time performance
optimization benefits of being in full control of the queries sent to the
database, measured against the degree of sophistication required on the part of
the programmer using the data access tool in question.
It is, my friends, a tangled web we weave.
If all the concerns about data access were the
same—concerns about performance, scalability, or ownership—some kind of universal
resolution might be possible. Unfortunately, this is not the case for most
environments, which means the architect is left with a pile of expectations
from all corners of the company that will rarely ever all be met. (Which, of
course, leaves us with the tempting, yet rarely successful, strategy of “To
heck with ‘em all.”)
*
* *
Sir Winston Churchill once said, “Democracy
is the worst form of government known to man… except for all the rest that have
been tried so far.” With some simple word substitution, the former Prime
Minister could have been speaking just as accurately of data access. The fact
is, no matter which one an architect selects, there will always be a use case
or scenario in which that particular data access approach yields the worst
result on any axis chosen by the observer: performance, scalability,
extensibility, modularity, and so on. Seeking to find the “One True Path to
Data” is generally a failed quest even as it begins, because many
architects—who have led themselves down the dark path of “Oh, to heck with it,
I’ll just build it myself”—have discovered to their chagrin.
Some of the judgment around a data access
approach will be rooted in its particular implementation—a well-disciplined use
of a call-level interface will generally yield better results than a poorly implemented
use of O/R-M or stored procedures, and the same is true of a well-disciplined
use of O/R-M against chaotic use of call-level interfaces. And, of course, much
will depend on the particular culture in which the implementation is taking
place—a company that believes all database design and implementation activity should
remain firmly in the hands of the DBAs will quickly run into obstacles when
trying to implement a code-generation or an O/R-M approach that would not be
present when using stored procedures, for example. These types of cultural
concerns should factor into any sort of data access decision, at least as far
as can be discerned by the architect at the time of selection. (More than one
“right” choice has been upset by a later corporate reorganization effort that
changes the context in which data access is taking place; generally, these events
cannot be anticipated any more than any other natural disaster.)
Practically speaking, however, a few generalizations
are possible around these ideas and are offered up for consideration for the
next project that needs data storage and access capabilities.
A call-level interface offers an amazing
degree of flexibility to the system, including ad hoc query capabilities for
the user, at the expense of greater code complexity in gathering the results
out of a query. Converting the results of a call-level query into a domain
object is also non-trivial, typically requiring explicit conversion code,
unless the application is one of those that just “flows better” by avoiding the
creation of a domain model entirely. (It may come as a shock to those who were
raised with object-oriented languages as the “One True Way” to building
applications, but more than a few applications actually develop more easily if
the whole domain model concept is tossed out and a more traditional/legacy
“client/server” style is taken.) Concerns over the “scattering of SQL across
the codebase” can be mitigated by storing the actual queries in a text file or
resource, creating a single place to modify queries if/when the database schema
updates, though this breaks down when harvesting the results.
Note that the call-level interface is often
extended by way of a fairly simple object API to include disconnected support,
known as “recordsets” around the time of the Microsoft Visual Basic 6.0
development system, and carried forward into the managed platforms (RowSets in
JDBC, DataSets in .NET). This is essentially simple wrappers around the
call-level interface and is often backed by some kind of code generation to
make it easier to insert and extract data from what looks like a “relational
object.” The principal advantage to this approach is that the database updates
can be batched up until all the modifications can be sent over at once, thus
saving on network I/O.
Stored procedures offer the opportunity to
avoid network I/O (and the performance costs incurred because of it) by doing
significant processing on the database itself, though obviously this is only an
option if the database server is not CPU-bound in practice. Practically
speaking, most database servers in the modern enterprise are not CPU-bound, but
I/O-bound, usually because of all the fetch-process-store processing going on
in application servers. Additionally, certain data-centric business logic
arguably belongs in stored procedures, because this ensures that no matter how
the application code accesses the data—whether from C++, .NET, Java, Ruby, or
some other platform that has yet to make its presence felt—the business logic
surrounding correct access and modification of that data can be enforced. Of
course, stored procedures also have the disadvantage of requiring porting
efforts if the company decides to migrate from one database vendor to another,
but anecdotal evidence suggests that sometimes this porting effort pales in
comparison to porting the schema itself, or worse, the data. (This will, of
course, vary with the organization, its zealotry in adhering to the SQL
standard, and the various tools offered by vendors to aid in the porting
effort.)
The advantage to the OODBMS is both obvious
and profound: by storing objects in their “native” form, the architect avoids
any and all mappings to the relational database. The disadvantage to the OODBMS
is similarly both obvious and profound: by storing objects in their “native”
form, the ability to reach that data from other platforms, including those that
do not view objects in the same way as the host platform, is lost. Obviously,
this is a significant detraction in projects where the database is pre-existing
and immutable, but for those projects where data storage and access is intended
to remain purely an implementation detail, such as the modern SOA-based
enterprise is described, the OODBMS can significantly reduce application
development time.
In many ways, object/relational mapping
offers both the most tempting benefit—that of seeing persistence solely in
terms of the same domain object model that developers work with—as well as the
most dangerous drawbacks—that of lacking the visibility into how the data will
be persisted or, more critically, retrieved. Although the gains seem
significant, issues with O/R-Ms, if they arise, will typically appear toward
the end of a project, exactly when architectural refactoring is most difficult.
Because of this, developers working with O/R-Ms must be ready to “take over”
for particularly complicated queries, because even the most sophisticated O/R-M
systems will often fail to take into account the very optimizations that a
skilled SQL practitioner can apply, such as the relative populations of tables
or the nature of which columns are indexed (or even how they are indexed).
*
* *
It should be obvious by now, to even the
most strenuous supporter of one “style” of data access over the others, that
each has its strengths and weaknesses. It should be equally obvious that where
a solution might work well within one company or project, the same solution
implemented in a different company or project will yield far less satisfying
results. This is hardly a new situation for the practicing architect—similar
kinds of minefields are hidden in the areas of security, user interface,
layering and topology decisions, and so on. As with all architectural
decisions, the successful architect weighs all of these factors together,
carefully considers the available options, and then throws a dart at the
dartboard of buzzwords to figure out which one to use this time around.
OK, maybe not that last part. Even though
it may seem like it at times, purely random selection of technology does not
generally guarantee success. Fortunately, much of the time, when viewed with
this kind of open-minded decision-making, an appropriate solution suggests
itself after even just cursory consideration. Just remember to keep the various
pros and cons of each data access approach firmly in mind, and never let dogma
of any kind drive your decision making, for once you turn to the Dogmatic Side,
forever will it dominate your destiny.