August 2012

Volume 27 Number 08

Data Points - Pitfalls and Pointers for a Base Logging Class in EF Models

By Julie Lerman | August 2012

Julie LermanI recently spent time with a client who was experiencing some occasional—but severe—performance issues with their Entity Framework-related code. Using a tool to profile the queries generated by Entity Framework, we discovered a 5,800-line SQL query hitting the database. (Learn about profiling tools in my December 2010 Data Points column, “Profiling Database Activity in the Entity Framework,” at msdn.microsoft.com/magazine/gg490349.) I gasped when I saw that the EDMX model contained an inheritance hierarchy I had taught friends, loved ones and developers to avoid. The model had a single base entity from which every other entity derived. The base entity was used to ensure each entity had properties to track logging data such as DateCreated and DateLastModified. Because this model was created using Model First, the inheritance was interpreted by Entity Framework as a Table per Type (TPT) model in which each entity mapped to its own table in the database. To the uninitiated, this looks innocent enough.

But TPT inheritance is notorious in Entity Framework for its generic query-building pattern that can result in sprawling, poorly performing SQL queries. You might be starting out with a new model where you can avoid TPT, or you might already be stuck with an existing model and TPT inheritance. Either way, this month’s column is dedicated to helping you comprehend the potential performance pitfalls of TPT in this scenario, and showing you a few tricks you can leverage to get around them if it’s too late to modify the model and database schema.

The Scary Model

Figure 1 shows an example of a model I’ve seen all too often. Notice the entity named TheBaseType. Every other entity derives from it in order to auto-matically inherit a DateCreated property. I understand why it’s tempting to design it this way, but Entity Framework schema rules also demand that the base type owns the key property of each derived entity. To me, that’s already a red flag signaling it’s not an appropriate use of inheritance.

All Classes Inherit from TheBaseType
Figure 1 All Classes Inherit from TheBaseType

It’s not that Entity Framework specifically created a problem with this design; it’s a design flaw in the model itself. In this case, inheritance says that Customer is a TheBaseType. What if we changed the name of that base entity to “LoggingInfo” and then repeated the statement “Customer is a LoggingInfo”? The fallacy of the statement becomes more obvious with the new class name. Compare that to Customer is a Person. Perhaps I’ve now convinced you to avoid doing this in your models. But if not, or if you’re already stuck with this model, let’s take it a little further.

By default, the Model First workflow defines a database schema with one-to-one relationships between the base table and all of the tables that represent the derived types. This is the TPT hierarchy mentioned earlier.

If you were to execute a few simple queries, you might not notice any problem—especially if, like me, you’re not a DBA or other flavor of database guru.

For example, this LINQ to Entities query retrieves the DateCreated property for a particular customer:

context.TheBaseTypes.OfType<Customer>()
  .Where(b => b.Id == 3)
  .Select(c => c.DateCreated)
  .FirstOrDefault();

The query results in the following TSQL executed in the database:

SELECT TOP (1)
[Extent1].[DateCreated] AS [DateCreated]
FROM  [dbo].[TheBaseTypes] AS [Extent1]
INNER JOIN [dbo].[TheBaseTypes_Customer] 
AS [Extent2] ON [Extent1].[Id] = [Extent2].[Id]
WHERE 3 = [Extent1].[Id]

It’s a perfectly good query.

A query to retrieve an entire entity is a little uglier because of the need to perform a nested query. The base query retrieves all of the fields that represent the join between TheBaseTypes table and the table containing the derived type. Then a query over those results projects the fields to be returned to Entity Framework to populate the type. For example, here’s a query to retrieve a single product:

 

context.TheBaseTypes.OfType<Product>().FirstOrDefault();

Figure 2 shows the TSQL executed on the server.

Figure 2 Partial Listing of a Nested TSQL Query When Specifying a Type

SELECT
[Limit1].[Id] AS [Id],
[Limit1].[C1] AS [C1],
[Limit1].[DateCreated] AS [DateCreated],
[Limit1].[ProductID] AS [ProductID],
[Limit1].[Name] AS [Name],
[...continued list of fields required for Product class...]
FROM ( SELECT TOP (1)
      [Extent1].[Id] AS [Id],
      [Extent1].[DateCreated] AS [DateCreated],
      [Extent2].[ProductID] AS [ProductID],
      [Extent2].[Name] AS [Name],
      [Extent2].[ProductNumber] AS [ProductNumber],
      [...continued list of fields from Products table aka "Extent2" ...],
      [Extent2].[ProductPhoto_Id] AS [ProductPhoto_Id],
      '0X0X' AS [C1]
      FROM  [dbo].[TheBaseTypes] AS [Extent1]
      INNER JOIN [dbo].[TheBaseTypes_Product] 
      AS [Extent2] ON [Extent1].[Id] = [Extent2].[Id]
)  AS [Limit1]

That’s still not such a bad query. If you were profiling your queries up to this point, you might not notice any issues caused by the model’s design.

But what about this next “simple” query, which wants to find all objects that were created today, regardless of type? The query could return Customers, Products, Orders, Employees or any other type in your model that derives from the base. Thanks to the model design, this seems like a reasonable request, and the model plus LINQ to Entities makes it easy to express (DateCreated is stored in the database as a date type, so I don’t have to be concerned about comparing to DateTime fields in my example queries):

var today= DateTime.Now.Date;
context.TheBaseTypes
  .Where(b => b.DateCreated == today)  .ToList();

Expressing this query in LINQ to Entities is short and sweet. But don’t be fooled. It’s a hefty request. You’re asking EF and your database to return instances of any type (be it Customer or Product or Employee) created today. Entity Framework must begin by querying each table that maps to the derived entities and joining each one to the single related TheBaseTypes table with the DateCreated field. In my environment, this creates a 3,200-line query (when that query is nicely formatted by EFProfiler), which can take Entity Framework some time to build and the database some time to execute.

In my experience, a query like this belongs in a business analysis tool anyway. But what if you’ve got the model and you want to get that info from within your app, perhaps for an administrative reporting area of an application you’re building? I’ve seen developers try to do this type of query in their applications, and I still say you need to think outside of the Entity Framework box. Build the logic into the database as a view or stored procedure and call that from Entity Framework, rather than asking EF to build the query for you. Even as a database procedure, this particular logic isn’t simple to build. But there are benefits. First, you have a greater chance of building a better-performing query. Second, EF won’t have to take the time to figure out the query. Third, your application won’t have to send a 3,300 (or more!)-line query across the pipe. But be warned that the more you dig into this problem and attempt to solve it from within the database or by using EF and .NET coding logic, the clearer it will become that the problem is not so much Entity Framework as the overall model design that’s getting in your way.

If you can avoid querying from the base type and query-specific types, your queries will be much simpler. Here’s an example that expresses the previous query to focus on a particular type:

context.TheBaseTypes.TypeOf<Product>()
  .Where(b => b.DateCreated == today)
  .ToList();

Because EF didn’t have to be prepared for every type in the model, the resulting TSQL is a simple 25-line query. With the DbContext API, you don’t even have to use TypeOf to query derived types. It’s possible to create DbSet properties for the derived types. So I could query even more simply:

context.Products
  .Where(b => b.DateCreated == today)
  .ToList();

In fact, for this model I’d completely remove the TheBaseTypes DbSet from my context class and prevent anyone from expressing queries directly from this base type.

Logging Without the Scary Model

I’ve focused so far on a hierarchy scenario I strongly advise devel-opers to avoid when building models with Entity Framework: using a mapped entity as a base from which every single entity in the model also derives. Sometimes I come upon scenarios where it’s just too late to change the model. But other times I’m able to help my clients avoid this path completely (or they’re early enough in development that we’re able to change the model).

So how to better achieve the goal—which is to provide commonly tracked data, such as logging data—for every type in your model?

Often, the first thought is to keep the inheritance but change the type of hierarchy. With Model First, TPT is the default but you can change that to Table per Hierarchy (TPH) using the Entity Designer Generation Power Pack (available in Visual Studio Gallery via Extension Manager). Code First defaults to TPH when you define inheritance in your classes. But you’ll quickly see that this is not a solution at all. Why? TPH means that the entire hierarchy is contained in a single table. In other words, your database would consist of just one table. I’m hoping no more explanation is necessary to convince you that this is not a good path.

As I said earlier (when I asked if a Customer is really a type of LoggingInfo), the specific scenario I’ve focused on, to solve the problem of tracking common data, begs that you just avoid inher-itance altogether for that goal. I’d recommend you consider an interface or complex types instead, which will embed the fields into each table. If you’re already stuck with the database that created a separate table, a relationship will do.

To demonstrate, I’ll switch to a model based on classes using Code First instead of an EDMX (although you can achieve these same patterns using an EDMX model and the designer).

In the first case I’ll use an interface:

public interface ITheBaseType
{
  DateTime DateCreated { get; set; }
}

Each class will implement the interface. It will have its own key property and contain a DateCreated property. For example, here’s the Product class:

public class Product : ITheBaseType
{
  public int ProductID { get; set; }
  // ...other properties...
  public DateTime DateCreated { get; set; }
}

In the database, each table has its own DateCreated property. Therefore, repeating the earlier query against the Products creates a straightforward query:

context.Products
.Where(b => b.DateCreated == today)
.ToList();

Because all of the fields are contained in this table, I no longer need a nested query:

SELECT TOP (1) [Extent1].[Id]                     AS [Id],
               [Extent1].[ProductID]              AS [ProductID],
               [Extent1].[Name]                   AS [Name],
               [Extent1].[ProductNumber]          AS [ProductNumber],
               ...more fields from Products table...
               [Extent1].[ProductPhoto_Id]        AS [ProductPhoto_Id],
               [Extent1].[DateCreated]            AS [DateCreated]
FROM   [dbo].[Products] AS [Extent1]
WHERE  [Extent1].[DateCreated] = '2012-05-25T00:00:00.00'

If you prefer to define a complex type and reuse that in each of the classes, your types might look like this:

public class Logging
{
  public DateTime DateCreated { get; set; }
}
public class Product{
  public int ProductID { get; set; }
  // ...other properties...
  public Logging Logging { get; set; } 
}

Note that the Logging class doesn’t have a key field (such as Id or LoggingId). Code First conventions will presume this to be a complex type and treat it as such when it’s used to define properties in other classes, as I’ve done with Product.

The Products table in the database has a column generated by Code First called Logging_DateCreated, and the Product.Logging.DateCreated property maps to that column. Adding the Logging property to the Customer class would have the same effect. The Customers table will also have its own Logging_DateCreated property, and it maps back to Customer.Logging.DateCreated.

In code, you’ll need to navigate through the logging property to reference that DateCreated field. Here’s the same query as before, rewritten to work with the new types:

context.Products.Where(b => b.Logging.DateCreated == DateTime.Now).ToList();

The resulting SQL is the same as the interface example except the field name is now Logging_DateCreated rather than DateCreated. It’s a short query that only queries the Products table.

One of the benefits of inheriting from the class in the original model is that it’s easy to code logic to automatically populate the fields from the base class—during SaveChanges, for example. But you can create logic just as easily for the complex type or for the interface, so I don’t see any disadvantage with these new patterns. Figure 3 shows a simple example of setting the DateCreated property for new entities during SaveChanges (you can learn more about this technique in the Second and DbContext editions of my “Programming Entity Framework” book series).

Figure 3 Setting the Interface’s DateCreated Property During SaveChanges

public override int SaveChanges()
{
  foreach (var entity in this.ChangeTracker.Entries()
    .Where(e =>
    e.State == EntityState.Added))
  {
    ApplyLoggingData(entity);
  }
  return base.SaveChanges();
}
private static void ApplyLoggingData(DbEntityEntry entityEntry)
{
  var logger = entityEntry.Entity as ITheBaseType;
  if (logger == null) return;
  logger.DateCreated = System.DateTime.Now;
}

Some Changes in EF 5

Entity Framework 5 does bring some improvements to queries that are generated from TPT hierarchies that help but do not alleviate the problems I demonstrated earlier. For example, rerunning my query that initially resulted in 3,300 lines of SQL on a machine that has the Microsoft .NET Framework 4.5 installed (with no other changes to the solution) generates a query that’s reduced to 2,100 lines of SQL. One of the biggest differences is that EF 5 doesn’t rely on UNIONs to build the query. I’m not a DBA, but my understanding is that such an improvement wouldn’t impact the performance of the query in the database. You can read more about this change to TPT queries in my blog post, “Entity Framework June 2011 CTP: TPT Inheritance Query Improvements,” at bit.ly/MDSQuB.

Not All Inheritance Is Evil

Having a single base type for all entities in your model is an extreme example of modeling and inheritance gone wrong. There are many good cases for having an inheritance hierarchy in your model—for example, when you do want to describe that a Customer is a Person. What’s also important is the lesson that LINQ to Entities is just one tool that’s available to you. In the scenario my client showed me, a clever database developer reconstructed the query against the base type fields as a database stored procedure, which brought a multisecond activity down to one that took just 9 milliseconds. And we all cheered. We’re still hoping they’ll be able to redesign the model and tweak the database for the next version of the software, though. In the meantime, they’re able to let Entity Framework continue to generate those queries that aren’t problematic and use the tools I left behind to tweak the application and database for some fantastic performance improvements.


Julie Lerman is a Microsoft MVP, .NET mentor and consultant who lives in the hills of Vermont. You can find her presenting on data access and other Microsoft .NET topics at user groups and conferences around the world. She blogs at thedatafarm.com/blog and is the author of “Programming Entity Framework” (2010) as well as a Code First edition (2011) and a DbContext edition (2012), all from O’Reilly Media. Follow her on Twitter at twitter.com/julielerman.

Thanks to the following technical expert for reviewing this article: Diego Vega