Under the Table
How Data Access Code Affects Database Performance
There's been a consistent debate over whether query tuning, and database application performance tuning in general, is the province of the database administrator, the application developer, or both. The database administrator usually has access to more tools than the developer. The DBA can look at the performance monitor counters and dynamic management views, run SQL Profiler, decide where to place the database, and create indexes to make queries perform better. The application developer usually writes the queries and stored procedures that access the database. The developer can use most of the same tools in a test system, and based on knowledge of the application's design and use cases, the developer can recommend useful indexes. But an often overlooked point is that the application developer writes the database API code that accesses the database. Code that accesses the database, such as ADO.NET, OLE DB, or ODBC, can have an effect on database performance. This is especially important when attempting to write a generalized data access framework or choosing an existing framework. In this article, we'll delve into some typical approaches to writing data access code and look at the effect they can have on performance.
Query Plan Reuse
Let's start by going over the lifetime of a query. When a query is submitted through the query processor, the query processor parses the text for syntactic correctness. Queries in stored procedures are syntax-checked during the CREATE PROCEDURE statement. Before the query or stored procedure is executed, the processor checks the plan cache for a query plan that matches the text. If the text matches, the query plan is reused; if no match occurs, a query plan is created for the query text. After the query is executed, the plan is returned to the cache to be reused. Query plan creation is an expensive operation, and query plan reuse is almost always a good idea. The query text that's being compared against text of the plan in the cache must match using a case-sensitive string comparison.
So the query
SELECT a.au_id, ta.title_id FROM authors a JOIN titleauthor ta ON a.au_id = ta.au_id WHERE au_fname = 'Innes';
will not match
SELECT a.au_id, ta.title_id FROM authors a JOIN titleauthor ta ON a.au_id = ta.au_id WHERE au_fname = 'Johnson';
Note that it also will not match this text
SELECT a.au_id, ta.title_id FROM authors a JOIN titleauthor ta ON a.au_id = ta.au_id WHERE au_fname = 'Innes';
This is because the line feed characters are in different places in the statement. To help with query plan reuse, the SQL Server query processor can perform a process known as autoparameterization. Autoparameterization will change a statement like
SELECT * FROM authors WHERE au_fname = 'Innes'
To a parameterized statement and a parameter declaration:
(@0 varchar(8000))SELECT * FROM authors WHERE au_fname = @0
These statements can be observed in the plan cache, using either sys.dm_exec_query_stats or sys.dm_exec_cache_plans, with a CROSS APPLY using sys.dm_exec_sql_text(handle) to correlate the text with the other information. Autoparameterization assists in query plan reuse, but it's not perfect.
For example, the statement
SELECT * FROM titles WHERE price > 9.99
is parameterized to
(@0 decimal(3,2))SELECT * FROM titles WHERE price > @0
SELECT * FROM titles WHERE price > 19.99
is parameterized to use a different data type
(@0 decimal(4,2))SELECT * FROM titles WHERE price > @0
SELECT * FROM titles WHERE price > $19.99
is parameterized to
(@0 money)SELECT * FROM titles WHERE price > @0
Having multiple query plans for similar queries that could use the same plan is known as plan cache pollution. Not only does it fill up the plan caches with redundant plans, but it also causes time (and CPU and I/O) to be consumed creating the redundant plans. Notice that in autoparameterization, the query processor must "guess" the parameter type based on the parameter value. Autoparameterization helps, but it does not completely eliminate plan cache pollution. In addition, the text of parameterized queries is normalized so that plans are reused even if the original text uses different formatting. Autoparameterization is used only for a subset of queries, based on the complexity of the query. Although a complete discussion of all the autoparameterization rules is beyond the scope of this article, realize that SQL Server uses one of two sets of rules: SIMPLE and FORCED parameterization. An example of the difference is that simple parameterization will not autoparameterize a multitable query, but forced parameterization will.
Stored Procedures and Parameterized Queries
A much better choice is to use parameterized queries or stored procedures. Not only do these help with query plan reuse, but, if you define your parameters properly, data type guessing is never done. Using stored procedures is the best choice, because the parameter data type is specified exactly in the stored procedure definition. Bear in mind that stored procedures are not perfect either. One difficulty is that database object name resolution is not done at CREATE PROCEDURE time; a table or column name that does not exist causes an execution-time error. Also, although a stored procedure's parameters constitute a "contract" between application code and procedure code, stored procedures can also return resultsets. No metadata definition, and therefore no contract, exists on the number of resultsets and the number and data types of resultset columns.
Stored procedures can be called in at least two ways in database API code. Here's an example using ADO.NET:
SqlCommand cmd = new SqlCommand("EXECUTE myproc 100", conn); int i = cmd.ExecuteNonQuery();
SqlCommand cmd = new SqlCommand("myproc", conn); cmd.CommandType = CommandType.StoredProcedure; cmd.Parameters.Add("@a", SqlDbType.Int); cmd.Parameters.Value = 100; int i = cmd.ExecuteNonQuery();
Executing a stored procedure as a command string (CommandType.Text) without using ADO.NET's ParameterCollection uses a SQL Server language statement, while using CommandType.StoredProcedure results in a lower-overhead remote procedure call (RPC). This difference can be observed in SQL Profiler. How you pass your parameters is also important because of when the query plans are created, but I'll get back to that later.
Parameterized queries use parameters in APIs the way stored procedures do, with a few important exceptions. ADO.NET's SqlParameter class contains properties not only for parameter name and value, but also for parameter data type, length, precision, and scale. It's important to avoid plan cache pollution by specifying the correct values for all relevant parameters in parameterized queries. Otherwise, because there isn't a parameter contract as there is with stored procedures, ADO.NET must guess at these properties. This is similar to the way that autoparameterization guesses, but the implementation is different in a few areas. The charts below, Figure 1 and Figure 2, show the current implementation in SQL Server 2008 of autoparameterization and ADO.NET 3.5 SP1's SqlParameterCollection's AddWithValue method.
|Figure 1 Parameter Data Types Produced by Autoparameterization|
|Literal Type||Parameter Produced|
|Whole Number||Smallest fit: TINYINT, SMALLINT, INT, or BIGINT|
|Fractional Number||DECIMAL(p,s) with precision and scale matching the literal|
|Number with Currency Symbol||MONEY|
|Figure 2 Parameter Data Types Produced by ADO.NET's AddWithValue and Parameterized Queries|
|Literal Type||Parameter Produced|
|String||NVARCHAR(x) where x is the length of the string|
|Whole Number||Smallest fit: INT or BIGINT|
|Fractional Number||FLOAT(53) Note: this is double-precision floating point|
When you're using parameterized queries, it's a bad idea to use Parameters.AddWithValue. In this case, ADO.NET must guess the data type, and there's a special hazard when using strings and AddWithValue. First of all, the .NET string class is a Unicode string, whereas in T-SQL, a string constant can either be specified as Unicode or non-Unicode. ADO.NET will pass in a Unicode string parameter (NVARCHAR in SQL). This can have negative repercussions on the query plan itself, if the column that's being used in the predicate is non-Unicode. For example, suppose you have a table with a VARCHAR column as the clustered primary key:
CREATE TABLE sample ( thekey varchar(7) primary key, name varchar(20) -- other columns omitted )
In my ADO.NET code, I want to do a lookup by primary key:
cmd.CommandText = "SELECT * FROM sample when thekey = @keyvalue;"
And I specify the parameter using this:
ADO.NET will decide on a parameter data type of NVARCHAR(7). Because the conversion from NVARCHAR to VARCHAR happens in the query execution step that retrieves the rows, the parameter value cannot be used as a search argument. Rather than perform a Clustered Index Seek of one row, the query will perform a Clustered Index Scan of the entire table. Now, imagine this with a table with 5 million rows. Since you've submitted a parameterized query, there's nothing that SQL Server autoparameterization can do, and nothing that a database administrator can change in the server to fix this behavior. Using the FORCESEEK query hint as a last resort fails to produce a plan at all. When the parameter type is specified as SqlDbType.VarChar rather than making ADO.NET guess the data type, the response of such a query drops from multiple seconds (at best) to milliseconds.
Another good habit to get into for string-based data types is to always specify the length of the parameter. This value should be the length of the field in the SQL predicate that uses the parameter, or the maximum string length (4,000 for NVARCHAR, 8,000 for VARCHAR), not the length of the string itself. SQL Server autoparameterization always assumes the maximum string length, but SqlParameterCollection.AddWithValue makes the parameter length equal to the length of the string. So, using the following calls produces different parameter data types and therefore different plans:
// produces an nvarchar(5) parameter cmd.Parameters.AddWithValue( "SELECT * FROM authors WHERE au_fname = @name", "@name", "Innes"); // produces an nvarchar(7) parameter cmd.Parameters.AddWithValue( "SELECT * FROM authors WHERE au_fname = @name", "@name", "Johnson");
By not specifying the length when you're using ParameterCollection.AddWithValue, you can have as many different queries in the plan cache as you have different string lengths. Now that's plan cache pollution in a big way. Although I mention ADO.NET in conjunction with this behavior, note that other database APIs share the problem of string parameter plan cache pollution. The current versions of both LINQ to SQL and ADO.NET Entity Framework exhibit a variant of this behavior. With vanilla ADO.NET, you have the option of specifying a string parameter's length; with the frameworks, the conversion to API calls is done by LINQ to SQL or Entity Framework itself, so you can't do anything about their string parameter plan cache pollution. Both LINQ to SQL and Entity Framework will address this problem in the upcoming .NET Framework 4release. So if you're using your own parameterized queries, don't forget to specify the proper SqlDbType, the length of string parameters, and the precision and scale of decimal parameters. Performance here is absolutely the realm of the programmer, and most DBAs won't check your ADO.NET code if they're concerned about performance. If you're using stored procedures, the explicit parameter contract will ensure that you always use the correct parameter type and length.
Although you should always use parameterized SQL statements inside and output stored procedures if possible, there are a few cases when parameterization cannot be used. You cannot parameterize the names of columns or names of tables in your SQL statements. This includes DDL (Data Definition Language statements) as well as DML (Data Manipulation Language statements). So although parameterization helps performance and is the best safeguard against SQL injection (using string concatenation rather than parameters can allow nefarious users to inject addition SQL into your code), it's not always possible to parameterize everything.
Where you set the value of your parameters is also significant. If you've used SQL Profiler to observe the SQL generated by ADO.NET when you use parameterized queries, you'll notice that it doesn't look like this:
(@0 VARCHAR(40))SELECT * FROM authors WHERE au_fname = @0
Instead you'll see:
sp_executesql N'SELECT * FROM authors WHERE au_fname = @name', N'@name VARCHAR(40)', 'Innes'
The procedure sp_executesql is a system stored procedure that executes a dynamically built SQL string that can include parameters. One reason why ADO.NET uses it to execute a parameterized query is that this results in use of the lower-overhead RPC call. Another important reason why sp_executesql is used is to enable a SQL Server query processor behavior known as "parameter sniffing." This results in the best performance, because the query processing knows the parameter values at plan-generation time and can make the best use of its statistics.
SQL Server Statistics
SQL Server uses statistics to help generate the best query plan for the job. The two main types of statistics are density statistics (how many unique values exist for a specify column) and cardinality statistics (a histogram of value distribution.) For more information about these statistics, reference the white paper "Statistics Used by the Query Optimizer in Microsoft SQL Server 2005," by Eric N. Hanson and Lubor Kollar. The key to understanding how SQL Server uses statistics is knowing that SQL Server creates query plans for all queries in a batch or stored procedure at the beginning of the batch or stored procedure. If the query processor knows the value of a parameter at plan-generation time, then the cardinality and density statistics can be used. If the value is unknown at plan-creation time, then only the density statistics can be used. For example, if the ADO.NET programmer uses parameters like the following, there's no way for the query processor to know you're looking for authors from California and use the cardinality statistics on the state column:
cmd.CommandText = "declare @a char(2); set @a = 'CA'; select * from authors where state = @a "; SqlDataReader rdr = cmd.ExecuteReader();
The plan is created before the first statement (before the DECLARE statement in this case), when the parameter value hasn't been assigned yet. That's why a parameterized query is translated into sp_executesql. In this case, the plan is created on entry to the sp_executesql stored procedure, and the query processor can sniff the value of the parameter at plan-generation time. The parameter values are specified in the call to sp_executesql. The same concept applies if you write a stored procedure. Say you have a query that retrieves the authors from California if the value passed in is NULL, otherwise the user must specify the state he wants, as follows:
CREATE PROCEDURE get_authors_by_state (@state CHAR(2)) AS BEGIN IF @state IS NULL THEN @state = 'CA'; SELECT * FROM authors WHERE state = @state; END
Now, in the most common case (no parameter is specified and the state is NULL), the query is optimized for the value NULL, not the value "CA." If CA is a common value of the column, then you'll be potentially getting the wrong plan. So, when you're using parameterized queries in ADO.NET, remember that this means using the SqlParameterCollection, and not specifying the parameter declaration and assignment in the SQL statement itself. If you're writing stored procedures, make sure you keep in mind that setting parameters in the code itself works against parameter sniffing. Note that you won't see different query plans in the example above that uses the authors table in the pubs database; it's too small. In larger tables, this can affect the physical JOIN type that is used and affect other parts of the query plan indirectly. For examples of how parameter sniffing can affect query plans, refer to the white paper "Batch Compilation, Recompilation, and Plan Caching Issues in SQL Server 2005," by Arun Marathe and Shu Scott.
Parameter sniffing is usually a good idea. The cardinality statistics are calculated to contain an approximately equal number of rows in each histogram step, so that any parameter value is representative of the cardinality as a whole. But, because there is a limited number of cardinality buckets (200 bucket maximum) and because some columns consist of mostly repeating values, this isn't always the case. Imagine that you have a table of customers. Because your business is based in California, most of your customers come from California. Let's imagine 200,000 customers from California and 10 customers from Oregon. A query plan joining your customers table with five other tables might be very different when you're looking for customers in California as opposed to customers in Oregon. If the first query is for customers in Oregon, your cached and reused plan for a California customer would also assume 10 California customers as opposed to the large number of California customers. In this case, using the cardinality statistics isn't a help, but a hindrance. The easiest (but most fragile) way out of this dilemma is to use conditional code—either in the application or in a separate stored procedure, to call two different stored procedures, one for states with many customers and one for states with few customers. SQL Server will not share query plans, even for exactly the same query, if the query occurs in two different stored procedures. The fragile part is determining what constitutes "a state with many customers," and realizing that your distribution of customers can change over time. SQL Server also provides some query hints that can help out. If you decide that having everyone use the plan for California customers is OK (because you have only a small number of rows to process in other states anyway), then you can use the query hint OPTION (OPTIMIZE FOR parameter_name=value). That ensures the plan in cache will always be the plan for states with many customers. As an alternative, you can use SQL Server 2008's new OPTION (OPTIMIZE FOR UNKNOWN) hint, which makes SQL Server ignore cardinality statistics and come up with an intermediate plan, perhaps not optimized for either a big or small state. In addition, if you have a query that uses many parameters, but only uses a value for them one at a time (imagine a system where someone can search from one to ten different conditions defined by parameters in the same query,) then your best bet might be to produce a different query plan each time on purpose. This is specified with an OPTION RECOMPILE query hint.
Right Plan for the Job
To summarize, using parameters guards against plan cache pollution and SQL injection. Always use parameterized queries or parameterized stored procedures. Always specifying the right data type, length, precision, and scale will ensure that you're not doing data type coercion at execution time. Making the values available at query plan creation time ensures that the optimizer can have access to all the statistics that it needs. And if parameter sniffing is a problem (too much caching,) don't go back to a plan for every query that pollutes the cache. Instead, use query hints or stored procedures to ensure that you get the right plan for the job. Remember that the data access and stored procedure code that you, the application programmer, write can make a big difference in performance.
Send your questions and comments for Bob to email@example.com.
Bob Beauchemin is a database-centric application practitioner and architect, course author and instructor, writer, and developer skills partner at SQLskills. He's written books and articles on SQL Server, data access and integration technologies, and database security. You can reach him at firstname.lastname@example.org.