Tales from the Trenches: Digital Marketing
This case study is contributed by Scott Brown.
Refactoring an existing application has many challenges. Our story is about refactoring an existing application over an extended period of time while still delivering new features. We didn’t start with CQRS as the goal, which was a good thing. It became a good fit as we went along. Our product is composed of multiple pieces, of which our customer-facing portal (CFP) uses CQRS.
There are many aspects of the CFP that fit well with CQRS, but there were two main problems we were trying to solve: slow reads and bloated View Objects (VO).
The CFP has a very large dataset with many tables containing tens of millions of rows; at the extreme some tables have millions of rows for a single client. Generally, the best practice for this amount of data in SQL Server is highly normalized tables, and ours is no exception. A large portion of our value add is presenting structured and reporting data together, allowing clients to make the most informed decision when altering their structured data. The combination of structured and reporting data required many SQL joins, and some of our page load times were over 20 seconds. There was a lot of friction for users to make simple changes.
The combination of structured and reporting data also resulted in bloated View Objects. The CFP suffered from the same woes that many long lived applications do—lots of cooks in the kitchen but a limited set of ingredients. Our application has a very rich UI resulting in the common Get/Modify/Save pattern. A VO started out with a single purpose: we need data on screen A. A few months later we needed a similar screen B that had some of the same data. Fear not, we already had most of that, we just needed to show it on screen B too—after all we wouldn’t want to duplicate code. Fast forward a few months and our two screens have evolved independently even though they represented "basically the same data." Worse yet, our VO has been used in two more screens and one of them has already been deprecated. At this point we are lucky if the original developers still remember what values from the VO are used on which screens. Oh wait, it's a few years later and the original developers don’t even work here anymore! We would often find ourselves trying to persist a VO from the UI and unable to remember which magical group of properties must be set. It is very easy to violate the Single Responsibility Principle in the name of reuse. There are many solutions to these problems and CQRS is but one tool for making better software.
Before trying to make large architectural changes there are a few things we found to be very successful for the CFP: Dependency Injection (DI) and Bounded Contexts.
Make your objects injectable and go get a DI Container. Changing a legacy application to be injectable is a very large undertaking and will be painful and difficult. Often the hardest part is sorting out the object dependencies. But this was completely necessary later on. As the CFP became injectable it was possible to write unit tests allowing us to refactor with confidence. Now that our application was modular, injectable, and unit tested we could choose any architecture we wanted.
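A minimal sketch of what "injectable" means in practice, written in Python for brevity. Every name here (`ClientStore`, `GreetingService`, and so on) is hypothetical and not taken from the CFP; the point is only that a dependency passed in through the constructor can be swapped for an in-memory fake in a unit test.

```python
from dataclasses import dataclass
from typing import Dict, Protocol


class ClientStore(Protocol):
    """Hypothetical dependency: anything that can look up a client name."""

    def name_for(self, client_id: int) -> str: ...


class SqlClientStore:
    """Stand-in for a database-backed implementation (not used in tests)."""

    def name_for(self, client_id: int) -> str:
        raise RuntimeError("would query the database")


class FakeClientStore:
    """In-memory replacement that makes the service unit-testable."""

    def __init__(self, names: Dict[int, str]) -> None:
        self._names = names

    def name_for(self, client_id: int) -> str:
        return self._names[client_id]


@dataclass
class GreetingService:
    # The dependency is handed in rather than constructed internally,
    # so a container (or a test) decides which implementation is used.
    store: ClientStore

    def greet(self, client_id: int) -> str:
        return f"Welcome back, {self.store.name_for(client_id)}"


service = GreetingService(store=FakeClientStore({7: "Acme"}))
print(service.greet(7))
```

A DI container automates the last two lines: it builds the object graph so that application code never has to call a constructor for its own dependencies.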
Since we decided to stick with CQRS, it was a good time to think about bounded contexts. First we needed to figure out the major components of the overall product. The CFP is one bounded context and only a portion of our overall application. It is important to determine bounded contexts because CQRS is best applied within a bounded context and not as an integration strategy.
One of our challenges with CQRS has been physically separating our bounded contexts. Refactoring has to deal with an existing application and the previous decisions that were made. In order to split the CFP into its own bounded context we needed to vastly change the dependency graph. Code that handles cross cutting concerns was factored into reference assemblies; our preference has been NuGet packages built and hosted by TeamCity. All the remaining code that was shared between bounded contexts needed to be split into separate solutions. Long term we would recommend separate repositories to ensure that code is not referenced across the bounded contexts. For the CFP we had too much shared code to be able to completely separate the bounded contexts right away, but having done so would have spared much grief later on.
It is important to start thinking about how your bounded contexts will communicate with each other. Events and event sourcing are often associated with CQRS for good reason. The CFP uses events to keep an auditable change history which results in a very obvious integration strategy of eventing.
At this point the CFP is modular, injectable, testable (not necessarily fully tested), and beginning to be divided by bounded context but we have yet to talk about CQRS. All of this ground work is necessary to change the architecture of a large application—don’t be tempted to skip it.
The first piece of CQRS we started with was the commands and queries. This might seem obtusely obvious but I point it out because we did not start with eventing, event sourcing, caching, or even a bus. We created some commands and a bit of wiring to map them to command handlers. If you took our advice earlier and you are using an Inversion of Control (IoC) container, the mapping of command to command handler can be done in less than a day. Since the CFP is now modular and injectable our container can create the command handler dependencies with minimal effort which allowed us to wire our commands into our existing middleware code. Most applications already have a remoting or gateway layer that performs this function of translating UI calls into middleware / VO functions. In the CFP, the commands and queries replaced that layer.
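The "bit of wiring" that maps commands to command handlers can be quite small. The following is a sketch under assumptions, not the CFP's code: `RenameCampaign`, `CommandDispatcher`, and `RenameCampaignHandler` are hypothetical names, and a plain dictionary stands in for the lookups an IoC container would perform.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Type


@dataclass(frozen=True)
class RenameCampaign:
    """Hypothetical command: plain data, no behavior."""
    campaign_id: int
    new_name: str


class CommandDispatcher:
    """Maps each command type to exactly one handler."""

    def __init__(self) -> None:
        self._handlers: Dict[Type[Any], Callable[[Any], None]] = {}

    def register(self, command_type: Type[Any], handler: Callable[[Any], None]) -> None:
        if command_type in self._handlers:
            raise ValueError(f"{command_type.__name__} already has a handler")
        self._handlers[command_type] = handler

    def dispatch(self, command: Any) -> None:
        handler = self._handlers.get(type(command))
        if handler is None:
            raise LookupError(f"no handler for {type(command).__name__}")
        handler(command)


class RenameCampaignHandler:
    """Handler that delegates to existing code (here, a dict stands in for it)."""

    def __init__(self, names: Dict[int, str]) -> None:
        self._names = names

    def __call__(self, command: RenameCampaign) -> None:
        self._names[command.campaign_id] = command.new_name


names = {1: "Spring sale"}
dispatcher = CommandDispatcher()
dispatcher.register(RenameCampaign, RenameCampaignHandler(names))
dispatcher.dispatch(RenameCampaign(campaign_id=1, new_name="Summer sale"))
print(names)
```

Rejecting a second registration for the same command type keeps the rule "one command, one handler" explicit rather than leaving it to convention.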
One of our challenges has been to refactor an existing UI to a one-way command model. We have not been able to make a strict one-way contract, mainly due to database-side ID generation. We are working towards client-side ID generation, which will allow us to make commands fire-and-forget. One technique that has helped a bit was to wrap the one-way asynchronous bus in a blocking bus. This helped us to minimize the amount of code that depends on the blocking capability. Even with that, we have too much code that relies upon command responses simply because the functionality was available, so try not to do this if possible.
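The idea of wrapping a one-way bus in a blocking bus can be sketched as follows. This is an illustration under assumptions, not the CFP's implementation: the class names, the completion callback, and the timeout are all hypothetical. The command also carries a caller-generated UUID to show why client-side ID generation removes the need to wait for a response.

```python
import queue
import threading
import uuid
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

# Callback signature: (result, error); error is None on success.
Completion = Callable[[Any, Optional[BaseException]], None]


@dataclass(frozen=True)
class CreateCampaign:
    """Hypothetical command; the ID comes from the caller, not the database."""
    name: str
    campaign_id: uuid.UUID = field(default_factory=uuid.uuid4)


class AsyncBus:
    """One-way bus: send() returns immediately; a worker thread runs the handler."""

    def __init__(self, handler: Callable[[Any], Any]) -> None:
        self._handler = handler
        self._queue: queue.Queue = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, command: Any, on_complete: Optional[Completion] = None) -> None:
        self._queue.put((command, on_complete))

    def _run(self) -> None:
        while True:
            command, on_complete = self._queue.get()
            result, error = None, None
            try:
                result = self._handler(command)
            except Exception as exc:
                error = exc
            if on_complete is not None:
                on_complete(result, error)


class BlockingBus:
    """Wraps the one-way bus for the few callers that still need a response."""

    def __init__(self, inner: AsyncBus, timeout: float = 5.0) -> None:
        self._inner = inner
        self._timeout = timeout

    def send_and_wait(self, command: Any) -> Any:
        finished = threading.Event()
        outcome: dict = {}

        def on_complete(result: Any, error: Optional[BaseException]) -> None:
            outcome["result"], outcome["error"] = result, error
            finished.set()

        self._inner.send(command, on_complete)
        if not finished.wait(self._timeout):
            raise TimeoutError("command did not complete in time")
        if outcome["error"] is not None:
            raise outcome["error"]
        return outcome["result"]


created = {}


def handle(command: CreateCampaign) -> uuid.UUID:
    created[command.campaign_id] = command.name
    return command.campaign_id


command = CreateCampaign(name="Spring sale")
# The caller already knows command.campaign_id, so blocking is optional here.
returned_id = BlockingBus(AsyncBus(handle)).send_and_wait(command)
```

Keeping the blocking behavior in a separate wrapper means the dependency on responses is visible at each call site, which makes it easier to find and remove later.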
Unfortunately we could only do this for so long before we realized it is just the same application with a fancy new façade. The application was easier to work on, but that was more likely due to the DI changes than to the commands and queries. We ran into the problem of where to put certain types of logic. Commands and queries themselves should be very lightweight objects with no dependencies on VOs. There were a few occasions during a complicated refactor when we were tempted to use an existing VO as part of a query, but inevitably we found ourselves back down the path of bloated objects. We were also tempted to use complex properties (getters and setters with code) on commands and queries, but this resulted in hidden logic. Ultimately we found it better to put the logic in the command handler or, better yet, in the domain or command validator.
At this point we also began to run into difficulties accomplishing tasks. We were in the middle of a pattern switch and it was difficult to cleanly accomplish a goal. Should command handlers dispatch other commands? How else will they exercise any logic that is now embedded in a command handler? For that matter, what should be a command handler’s single responsibility?
We found that these questions could not be answered by writing more commands and queries but rather by fleshing out our CQRS implementation. The next logical choice was either the read or the write model. Starting with the cached read model felt like the best choice since it delivers tangible business value. We chose to use events to keep our read model up to date, but where do the events come from? It became obvious that we were forced to create our write model first.
Choose a strategy for the write model that makes sense in your bounded context. That is, after all, what CQRS allows: separating reads and writes to decouple the requirements of each. For the CFP we use domains that expose behavior. We do not practice DDD, but a domain model fits well with CQRS. Creating a domain model is very hard; we spent a lot of time talking about what our aggregate roots are. Do not underestimate how hard this will be.
When creating the write model we were very careful about introducing any dependencies to the domain assembly. This will allow the domain to outlive other application-specific technologies, but it was not without pain points. Our domain started out with a lot of validation that was eventually moved into command validators; dependencies required for validation were not available from within the domain. In the end, the domain simply translates behavior (methods) into events (class instances). Most of our pain points were centered on saving the events without taking dependencies into the domain assembly. The CFP does not use event sourcing; instead, we were able to translate the domain events into our existing SQL tables with objects we call Event Savers. This allows our domain to focus on translating behavior to events, and the command handler can publish and save the events. To prevent the command handler from doing too much, we use a repository pattern to get and save a domain. This allows us to switch to event sourcing in a later refactoring of the application, if desired, with minimal effect on the domain. The Event Savers are simple classes that map an event to a stored procedure call or table(s). We use RabbitMQ to publish the events after saving; this is not transactional, but that has been OK so far.
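The division of labor described above (domain produces events, Event Savers persist them, publishing follows saving) can be sketched like this. All names are hypothetical, a dictionary stands in for the existing SQL tables, and a list's `append` stands in for a RabbitMQ publish. Placing the publish call inside the repository is an assumption made for brevity.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class BudgetChanged:
    """Hypothetical domain event."""
    campaign_id: int
    new_budget: int


@dataclass
class Campaign:
    """Domain object: a behavior call is translated into an event instance."""
    campaign_id: int
    budget: int
    pending_events: List[object] = field(default_factory=list)

    def change_budget(self, new_budget: int) -> None:
        self.budget = new_budget
        self.pending_events.append(BudgetChanged(self.campaign_id, new_budget))


class BudgetChangedSaver:
    """Event Saver: maps one event type onto the existing table layout."""

    def __init__(self, budget_table: Dict[int, int]) -> None:
        self._table = budget_table  # dict standing in for a SQL table

    def save(self, event: BudgetChanged) -> None:
        self._table[event.campaign_id] = event.new_budget


class CampaignRepository:
    """Gets and saves a domain object; the domain never sees storage details."""

    def __init__(
        self,
        budget_table: Dict[int, int],
        savers: Dict[type, Any],
        publish: Callable[[object], None],
    ) -> None:
        self._table = budget_table
        self._savers = savers
        self._publish = publish

    def get(self, campaign_id: int) -> Campaign:
        return Campaign(campaign_id, self._table[campaign_id])

    def save(self, campaign: Campaign) -> None:
        events = list(campaign.pending_events)
        for event in events:
            self._savers[type(event)].save(event)
        # Publishing happens only after saving and is not transactional:
        # a failure here leaves saved but unpublished events.
        for event in events:
            self._publish(event)
        campaign.pending_events.clear()


table = {42: 100}
published: List[object] = []
repo = CampaignRepository(table, {BudgetChanged: BudgetChangedSaver(table)}, published.append)
campaign = repo.get(42)
campaign.change_budget(250)
repo.save(campaign)
```

Because the domain only appends events to a list, swapping the Event Savers for an event store later would change the repository but not the `Campaign` class.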
As events become more ubiquitous it is possible to keep a read model up to date. We have a separate service that subscribes to events and updates a Redis cache. By keeping this code separate we isolate the dependencies for Redis and make our caching solution more pluggable. The choice of caching technology is difficult and the best solution is likely to change over time. We needed the flexibility to test multiple options and compare the performance vs. maintainability.
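A sketch of such a subscriber, assuming a small `Cache` interface so that the caching technology stays pluggable. The names, the key format, and the in-memory cache are hypothetical; a Redis-backed class implementing the same two methods could be substituted without touching `ReadModelUpdater`.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


@dataclass(frozen=True)
class BudgetChanged:
    """Hypothetical domain event received from the bus."""
    campaign_id: int
    new_budget: int


class Cache(Protocol):
    """The only cache operations the updater depends on."""

    def set(self, key: str, value: str) -> None: ...

    def get(self, key: str) -> Optional[str]: ...


class InMemoryCache:
    """Test double; a Redis-backed class would expose the same two methods."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def set(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


class ReadModelUpdater:
    """Subscribes to events and keeps the cached read model current."""

    def __init__(self, cache: Cache) -> None:
        self._cache = cache

    def on_event(self, event: object) -> None:
        if isinstance(event, BudgetChanged):
            key = f"campaign:{event.campaign_id}:budget"
            self._cache.set(key, str(event.new_budget))
        # Events this read model does not care about are ignored.


cache = InMemoryCache()
updater = ReadModelUpdater(cache)
updater.on_event(BudgetChanged(campaign_id=42, new_budget=250))
print(cache.get("campaign:42:budget"))
```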
Once our cache was in place we discovered the oldest known theorem of caching: That which is cached becomes stale. Invalid cache results can occur many different ways; we found enough that a temporary measure was introduced to update items in the cache on a rolling schedule. The plan was (and still is) to find and eliminate all sources of inconsistency. Database integrations or people/departments that update the write model directly will need to be routed through the domain to prevent the cache from becoming incorrect. Our goal is total elimination of these discrepancies for complete confidence in cached results.
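One way to picture a rolling refresh is a job that reloads a fixed-size batch of keys on each run and remembers where it stopped. The function below is a hypothetical sketch of that idea, not the CFP's scheduler; `load` stands in for a query against the write model and a dictionary stands in for the cache.

```python
from typing import Callable, Dict, List


def refresh_next_batch(
    keys: List[str],
    cursor: int,
    batch_size: int,
    load: Callable[[str], str],
    cache: Dict[str, str],
) -> int:
    """Reload up to batch_size keys starting at cursor; return the next cursor."""
    if not keys:
        return 0
    count = min(batch_size, len(keys))
    for offset in range(count):
        # Wrap around so that every key is eventually refreshed.
        key = keys[(cursor + offset) % len(keys)]
        cache[key] = load(key)
    return (cursor + count) % len(keys)


source = {"a": "1", "b": "2", "c": "3"}           # stand-in for the write model
cache = {"a": "stale", "b": "stale", "c": "stale"}
keys = ["a", "b", "c"]

cursor = refresh_next_batch(keys, 0, 2, source.__getitem__, cache)
first_pass = dict(cache)
cursor = refresh_next_batch(keys, cursor, 2, source.__getitem__, cache)
```

A schedule like this bounds how long a stale entry can survive, but it only masks the inconsistency; the direct writes that bypass the domain still need to be removed.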
Single Responsibility of Objects
Definitions specific to our implementation:
- Command – carries data
- Command Authorizer – authorizes user to place a command on the bus
- Command Validator – validates a command can be placed on the bus
- Command Handler – maps command to domain call
- Repository Factory – retrieves a repository for specified domain type
- Repository – retrieves/persists domain instance by key
- Domain – maps behavior to domain event
- Domain Event Saver – called by the repository and saves domain events to existing database structure
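The first few responsibilities in the list above can be sketched as a pipeline in which a command must pass the authorizer and the validator before it reaches its handler. The names and the specific rules are hypothetical, and a dictionary stands in for the repository and domain that a real handler would call.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class ChangeBudget:
    """Command: carries data only."""
    user: str
    campaign_id: int
    new_budget: int


class NotAuthorized(Exception):
    pass


class InvalidCommand(Exception):
    pass


class BudgetAuthorizer:
    """Decides whether this user may place the command on the bus."""

    def __init__(self, editors: Set[str]) -> None:
        self._editors = editors

    def check(self, command: ChangeBudget) -> None:
        if command.user not in self._editors:
            raise NotAuthorized(command.user)


class BudgetValidator:
    """Decides whether the command's data is acceptable."""

    def check(self, command: ChangeBudget) -> None:
        if command.new_budget < 0:
            raise InvalidCommand("budget must not be negative")


class ChangeBudgetHandler:
    """Maps the command to a domain call (a dict stands in for the domain)."""

    def __init__(self, budgets: Dict[int, int]) -> None:
        self._budgets = budgets

    def handle(self, command: ChangeBudget) -> None:
        self._budgets[command.campaign_id] = command.new_budget


class Bus:
    """Runs authorizer, then validator, then handler; any failure stops the chain."""

    def __init__(self, authorizer: BudgetAuthorizer, validator: BudgetValidator,
                 handler: ChangeBudgetHandler) -> None:
        self._authorizer = authorizer
        self._validator = validator
        self._handler = handler

    def send(self, command: ChangeBudget) -> None:
        self._authorizer.check(command)
        self._validator.check(command)
        self._handler.handle(command)


budgets = {42: 100}
bus = Bus(BudgetAuthorizer({"alice"}), BudgetValidator(), ChangeBudgetHandler(budgets))
bus.send(ChangeBudget(user="alice", campaign_id=42, new_budget=250))
```

With authorization and validation handled before dispatch, the handler and the domain can assume they only ever see commands that are allowed and well formed.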