Pragmatic Architecture: Data Access

Ted Neward

January 2008

          It’s once again the monthly meeting, and after introductions and the monthly entreaty to our Higher Power for wisdom and guidance as we go about our architect lives, one of the long-time members asks to address the group. Standing, he takes a deep breath, and I notice the beads of sweat on his forehead as he begins.

          “Hi, my name’s Kevin, and I’m an architect.”

          “Hi, Kevin.”

          The simple greeting elicits a grateful smile at the support implied therein. “Thanks. I’ve been buzzword-free for almost a year now, and I’m finally realizing that you can be an architect without spending all your time in PowerPoint.” Murmurs of respect ripple through the group as he continues. “But I’m getting ready to roll on to a new project, and I need to figure out how our developers will… access the database.”

As the group stirs, Kevin tries to explain. “I just don’t know if they should use an object/relational mapper, or if they should use stored procedures, or…”

But it’s too late. Like a wild racehorse, the group has the bit between its teeth, and the conversation runs wild from there.

“Stored procedures are the only way to go!”

          “You’re out of your mind! Keep it light on the database and just write queries from the middle tier!”

          “Are you both nuts? That just creates more work! You have to keep your storage implementation details hidden!”

          “Code generation!”

          “Mobile objects!”

          As I sit in the corner, nibbling on my cookie and sipping my (hopefully heavily spiked) punch, I notice that Kevin has returned to his seat to try and make sense of the ensuing chaos. The thought flickers across my mind as he tries to listen to the arguments flying back and forth: How did this subject of data access become such a hot-button issue for us?

*     *     *

No topic in software development seems to raise as much excitement, interest, and argument as that of how developers should fetch and store data to a relational database. Perhaps its ubiquity serves as the reason for the heatedness of the debate—after all, accessing and storing data to a database is one of those elements of software development that permeates across all projects, and even across platforms and operating systems. Perhaps the reason lies in the fact that there are many different ways to think about storing and retrieving data. Perhaps developers just need something to argue about, and this serves as the universally-agreed industry-wide topic.

Regardless of the reason, no other subject in architecture and design triggers as much fear and adrenaline in the architect as the innocent project kickoff meeting question of “How should we fetch data from the database?” No matter what answer the architect gives, there will be someone in the meeting who will disagree—violently—with the approach chosen. Logical reasoning will be attempted, then discarded. Performance figures will be made up and cited. Scalability concerns will be tossed around liberally without benefit of fact or verification. By the end of the discussion, if it in fact ever ends, the architect will silently curse the Gods of Computer Science, because whatever sins mankind committed surely did not merit this kind of punishment.

*     *     *

Data access (and storage, when you get right down to it) comes in a variety of flavors, and at the risk of once again rousing the sleeping bear, it serves the discussion well to go over some of the principal players in this drama, along with their commonly attributed advantages and disadvantages. However, to avoid any sort of hurt feelings or implied endorsements, no products or specific implementations will be discussed; instead, the focus will remain entirely on the concepts, which means certain levels of detail will have to go unexplored. And, it should go without saying, many of the ideas discussed here can be combined in various ways, creating a nearly infinite range of possibilities; for purposes of simplicity, just the core ideas will be raised here.

The simplest approach, in some ways, is the call-level interface (CLI) approach, in which a library provides basic SQL call-and-retrieval access to the database, usually through a standardized API. In previous decades, this was a proprietary library provided by the database vendor, but in the mid-'90s Microsoft created a standardized “umbrella” API, Open Database Connectivity (ODBC), that became highly successful because it encapsulated many of the details of working with a particular database implementation. Later, as the “managed platforms” came into fashion, similar CLI libraries emerged for both platforms (JDBC to the Java community and ADO.NET for the common language runtime [CLR]) that architecturally resembled ODBC, with the caveat that ODBC was designed for procedural languages—that is, C and its siblings—and both JDBC and ADO.NET were clearly “object-ish" in nature.

The benefit of the call-level interface lies in its simplicity and direct control over SQL—the developer specifies the SQL to execute against the database, hands it off to the library, and harvests the results, usually by way of some kind of iterator across an array-of-name/value-pairs. The drawback to this is similarly rooted in the CLI’s simplicity and direct nature—for example, the SQL calls can, without some kind of discipline to the contrary, end up scattered across the entirety of the code, making it difficult to refactor or evolve the database schema. Similarly, lack of any sort of “automated” support in writing the SQL or harvesting the results means developers can easily get out of sync with the actual database schema, with type mismatch errors at run time an all-too-common result. Just writing the code to set up the query and iterate through the results can be tedious and error-prone, and if developers are not aware of it, simple concatenation-of-strings when constructing the query can create a security hole (known in the security world as a SQL injection attack).

Most modern databases offer an ability to perform non-trivial processing on the database itself, executable in a way similar to a remote procedure call (RPC), known to most developers as stored procedures. Stored procedures, or sprocs as they are sometimes called, look alarmingly like traditional procedural programming (hence the name), and are usually executed through a call-level interface as described earlier, using standard SQL syntax for passing parameters and gathering the results of the procedure call. Although the execution of sprocs is standard, their definition is definitely not governed by the ANSI SQL committee; this leaves stored procedure languages looking like whatever the database vendor believes they should look like.

The principal advantage of a sproc lay is that it is executed on the database server, typically inside the same process as the query engine itself, meaning that if multi-step processing (“fetch data, analyze it, perform some processing, store some data, repeat for a different set of data,” and so on) is necessary, the queries are all executing against local data files instead of having to move across the network. Additionally, because the sproc language is typically a vendor-proprietary extension to SQL, vendors’ extensions and enhancements to SQL are available to simplify the development tasks.

At the other end of the spectrum, if you will, lies the object/relational mapper (O/R-M), generally a library or code-generative tool that tries to “hide” the differences between the object-oriented programming languages we favor and the relational database preferred by the enterprise. So much has been written about how such toolkits work and the relative benefits of the O/R-M that little benefit accrues from discussing it in depth here. Readers unfamiliar with the O/R-M discussion are encouraged to check out Martin Fowler’s Patterns of Enterprise Application Architecture, which stands as the best discussion of O/R-M in one volume to date.

Some O/R-M tools or utilities will create a relational database schema from an object domain model, leaving the developers almost completely isolated from the details of SQL, although others will go to longer lengths to map developer-designed domain objects against a database administrator (DBA)–designed schema, and still others will construct a domain model from an existing schema. (Some will do all three, depending on which tool or utility is run.) This distinction will be particularly important for those environments in which the database model is not “owned” by the developers.

Problems with object/relational mapping in the Java community (specifically, with Enterprise Java Beans [EJB] Entity Beans) gave rise to a variation on the concept known as data transfer objects (DTOs). With this idea, instead of domain objects taking care of their own persistence, data transfer objects, whose sole intention is to carry data from the application server to the presentation server (typically a Web server), are transmitted together when requested. This has the advantage of allowing developers to continue to work with objects, yet avoid some of the more heinous network input/output (I/O) traps that an O/R-M can generate. Unfortunately, DTOs are typically “dumb” objects because they are essentially nothing more than carriers of data; they do not have the encapsulation and modeling benefits that are typically associated with domain object models. This sometimes gives rise to “parallel object trees,” in which domain objects are defined in a parallel class hierarchy to the DTOs that are used to store and fetch their internals. There is also the added disadvantage that DTOs must usually be manually written because they often correspond more closely to how the data is being sent back and forth instead of how the data conceptually relates as objects. This makes code generation trickier.

A variation on this theme is the idea of mobile objects, in which a software service accepts graphs of objects serialized using the serialization capabilities built in to the managed platform and executes methods on those objects after they are on the application server to persist or fetch their data. This allows developers to use whatever data-access tool they find most attractive while on the application server, and it generally reduces the amount of network I/O traffic from the presentation server. It generally avoids the “dual hierarchy” problem found in DTO systems, but again, this does not provide a solution to the problem of mapping relational tuplesets (rows) to objects.

One other option, increasingly popular with stand-alone systems or applications that have some flexibility in how data is to be stored is the object database, in which objects are stored in their “native” form instead of in a relational database. This makes writing the persistence code simpler, because there is no mapping that needs be written. Typically, object database implementations also refactor more easily than relational systems, because the object database can detect the difference in the object definition (which serves as the database schema) and automatically migrate the stored schema to match. Naturally, this is not perfect—certain refactorings will confuse even the most sophisticated object-oriented database management system (OODBMS) and require some hand-coding, but in general, this is simpler for most developers than refactoring a relational schema.

However, the big drawback to the OODBMS is that it is not a relational database and, therefore, the data is “inaccessible” without some kind of adaptation. Certain OODBMS vendors are taking a variety of approaches to adapt to this problem, such as providing relational export services or providing some kind of relational “view” over the object data, but this then presents the same impedance mismatch as providing an object “view” over relational data and can lead to many of the same problems. And, certainly, the object database can yield the same network I/O issues as an O/R-M, because navigating object graphs can lead to multiple network round trips, depending on how the OODBMS is configured.

*     *     *

In many ways, the data access problem lies at the nexus of a number of concerns that all combine to create a problem that is on the surface simplistic, yet upon investigation is nearly bottomless.

The database traditionally lies on a separate computer from the program that wants access to the data, thus incurring the hideous temporal cost of traversing the network when data is needed. The relational approach of storing data favored by IT departments worldwide views associations in a fundamentally different way than objects do, meaning that the traditional one-to-many association that both objects and relations use everywhere creates a point of contention. Toss in the fact that most object languages have no concept of “sets” or “tuples,” a core principle behind the relational mindset, just to make things interesting.

Even the basic notion of what how a particular entity—a customer, an asset, an order, or whatever particular “things” the application is working with—is defined runs afoul of both the database and the object system because both want definitions that may or may not line up exactly. Even in the rare case where the two entity definitions do line up perfectly, there arises an unhealthy tension over who “owns” these entity definitions—the database administrator, whose perspective is typically enterprise-wide, or the application developers, whose perspective is typically focused on the particular application a department or customer needs. And, of course, behind all of this stands the ugly spectre of the run-time performance optimization benefits of being in full control of the queries sent to the database, measured against the degree of sophistication required on the part of the programmer using the data access tool in question.

It is, my friends, a tangled web we weave.

If all the concerns about data access were the same—concerns about performance, scalability, or ownership—some kind of universal resolution might be possible. Unfortunately, this is not the case for most environments, which means the architect is left with a pile of expectations from all corners of the company that will rarely ever all be met. (Which, of course, leaves us with the tempting, yet rarely successful, strategy of “To heck with ‘em all.”)

*     *     *

Sir Winston Churchill once said, “Democracy is the worst form of government known to man… except for all the rest that have been tried so far.” With some simple word substitution, the former Prime Minister could have been speaking just as accurately of data access. The fact is, no matter which one an architect selects, there will always be a use case or scenario in which that particular data access approach yields the worst result on any axis chosen by the observer: performance, scalability, extensibility, modularity, and so on. Seeking to find the “One True Path to Data” is generally a failed quest even as it begins, because many architects—who have led themselves down the dark path of “Oh, to heck with it, I’ll just build it myself”—have discovered to their chagrin.

Some of the judgment around a data access approach will be rooted in its particular implementation—a well-disciplined use of a call-level interface will generally yield better results than a poorly implemented use of O/R-M or stored procedures, and the same is true of a well-disciplined use of O/R-M against chaotic use of call-level interfaces. And, of course, much will depend on the particular culture in which the implementation is taking place—a company that believes all database design and implementation activity should remain firmly in the hands of the DBAs will quickly run into obstacles when trying to implement a code-generation or an O/R-M approach that would not be present when using stored procedures, for example. These types of cultural concerns should factor into any sort of data access decision, at least as far as can be discerned by the architect at the time of selection. (More than one “right” choice has been upset by a later corporate reorganization effort that changes the context in which data access is taking place; generally, these events cannot be anticipated any more than any other natural disaster.)

Practically speaking, however, a few generalizations are possible around these ideas and are offered up for consideration for the next project that needs data storage and access capabilities.

A call-level interface offers an amazing degree of flexibility to the system, including ad hoc query capabilities for the user, at the expense of greater code complexity in gathering the results out of a query. Converting the results of a call-level query into a domain object is also non-trivial, typically requiring explicit conversion code, unless the application is one of those that just “flows better” by avoiding the creation of a domain model entirely. (It may come as a shock to those who were raised with object-oriented languages as the “One True Way” to building applications, but more than a few applications actually develop more easily if the whole domain model concept is tossed out and a more traditional/legacy “client/server” style is taken.) Concerns over the “scattering of SQL across the codebase” can be mitigated by storing the actual queries in a text file or resource, creating a single place to modify queries if/when the database schema updates, though this breaks down when harvesting the results.

Note that the call-level interface is often extended by way of a fairly simple object API to include disconnected support, known as “recordsets” around the time of the Microsoft Visual Basic 6.0 development system, and carried forward into the managed platforms (RowSets in JDBC, DataSets in .NET). This is essentially simple wrappers around the call-level interface and is often backed by some kind of code generation to make it easier to insert and extract data from what looks like a “relational object.” The principal advantage to this approach is that the database updates can be batched up until all the modifications can be sent over at once, thus saving on network I/O.

Stored procedures offer the opportunity to avoid network I/O (and the performance costs incurred because of it) by doing significant processing on the database itself, though obviously this is only an option if the database server is not CPU-bound in practice. Practically speaking, most database servers in the modern enterprise are not CPU-bound, but I/O-bound, usually because of all the fetch-process-store processing going on in application servers. Additionally, certain data-centric business logic arguably belongs in stored procedures, because this ensures that no matter how the application code accesses the data—whether from C++, .NET, Java, Ruby, or some other platform that has yet to make its presence felt—the business logic surrounding correct access and modification of that data can be enforced. Of course, stored procedures also have the disadvantage of requiring porting efforts if the company decides to migrate from one database vendor to another, but anecdotal evidence suggests that sometimes this porting effort pales in comparison to porting the schema itself, or worse, the data. (This will, of course, vary with the organization, its zealotry in adhering to the SQL standard, and the various tools offered by vendors to aid in the porting effort.)

The advantage to the OODBMS is both obvious and profound: by storing objects in their “native” form, the architect avoids any and all mappings to the relational database. The disadvantage to the OODBMS is similarly both obvious and profound: by storing objects in their “native” form, the ability to reach that data from other platforms, including those that do not view objects in the same way as the host platform, is lost. Obviously, this is a significant detraction in projects where the database is pre-existing and immutable, but for those projects where data storage and access is intended to remain purely an implementation detail, such as the modern SOA-based enterprise is described, the OODBMS can significantly reduce application development time.

In many ways, object/relational mapping offers both the most tempting benefit—that of seeing persistence solely in terms of the same domain object model that developers work with—as well as the most dangerous drawbacks—that of lacking the visibility into how the data will be persisted or, more critically, retrieved. Although the gains seem significant, issues with O/R-Ms, if they arise, will typically appear toward the end of a project, exactly when architectural refactoring is most difficult. Because of this, developers working with O/R-Ms must be ready to “take over” for particularly complicated queries, because even the most sophisticated O/R-M systems will often fail to take into account the very optimizations that a skilled SQL practitioner can apply, such as the relative populations of tables or the nature of which columns are indexed (or even how they are indexed).

*     *     *

It should be obvious by now, to even the most strenuous supporter of one “style” of data access over the others, that each has its strengths and weaknesses. It should be equally obvious that where a solution might work well within one company or project, the same solution implemented in a different company or project will yield far less satisfying results. This is hardly a new situation for the practicing architect—similar kinds of minefields are hidden in the areas of security, user interface, layering and topology decisions, and so on. As with all architectural decisions, the successful architect weighs all of these factors together, carefully considers the available options, and then throws a dart at the dartboard of buzzwords to figure out which one to use this time around.

OK, maybe not that last part. Even though it may seem like it at times, purely random selection of technology does not generally guarantee success. Fortunately, much of the time, when viewed with this kind of open-minded decision-making, an appropriate solution suggests itself after even just cursory consideration. Just remember to keep the various pros and cons of each data access approach firmly in mind, and never let dogma of any kind drive your decision making, for once you turn to the Dogmatic Side, forever will it dominate your destiny.

 

 

Page view tracker