Printer Friendly Version      Send     
Click to Rate and Give Feedback
MSDN
MSDN Library
SQL Server
SQL Server 2005
Technical Articles
SQL Server 2005
 SQL Server 2005 Beta 2 Transact-SQL...
Microsoft SQL Server 9.0 Technical Articles
SQL Server 2005 Beta 2 Transact-SQL Enhancements
 

Itzik Ben-Gan
Solid Quality Learning

December, 2004

Applies to:
   Transact-SQL
   Microsoft SQL Server 2005 Beta 2

Summary: This white paper introduces several of the new enhancements to Transact-SQL in Microsoft SQL Server 2005 Beta 2. These new features can increase your expressive power, the performance of your queries, and your error management capabilities. This paper focuses mainly on relational enhancements that are conceptually new and demonstrates these through practical examples. This paper does not cover every new Transact-SQL feature. (91 printed pages)

Contents

Introduction and Scope
Increasing the Expressive Power of Queries and DRI Support
Partitioning
Single-Parent Environment: Employees Organizational Chart
Multiparent Environment: Bill of Materials
Table-Valued Functions in Correlated Subqueries
Enhancements for Performance and Error Handling
Other SQL Server 2005 Beta 2 Capabilities That Affect Transact-SQL
Conclusion

Introduction and Scope

This white paper introduces several of the new enhancements to Transact-SQL in Microsoft SQL Server 2005 Beta 2. These new features can increase your expressive power, the performance of your queries, and your error management capabilities. This paper focuses mainly on relational enhancements that are conceptually new and demonstrates these through practical examples. This paper does not cover every new Transact-SQL feature.

Assumed knowledge: The target audience is expected to be skilled at using Transact-SQL for ad hoc queries and as components of applications in Microsoft SQL Server 2000.

Increasing the Expressive Power of Queries and DRI Support

This section introduces the following new relational features and enhancements:

  • New ranking functions
  • New recursive queries based on common table expressions (CTE)
  • New PIVOT and APPLY relational operators
  • Declarative referential integrity (DRI) enhancements

Ranking Functions

SQL Server 2005 introduces four new ranking functions: ROW_NUMBER, RANK, DENSE_RANK and NTILE. The new functions allow you to efficiently analyze data and provide ranking values to result rows of a query. Typical scenarios where you may find the new functions helpful include: assigning sequential integers to result rows for presentation purposes, paging, scoring, and histograms.

Speaker Statistics Scenario

The following Speaker Statistics scenario will be used to discuss and demonstrate the different functions and their clauses. A large computing conference included three tracks: database, development, and system administration. Eleven speakers spoke in the conference and got scores in the range 1 through 9 for their sessions. The results were summarized and stored in the following SpeakerStats table:

USE tempdb -- or your own test database
CREATE TABLE SpeakerStats
(
  speaker        VARCHAR(10) NOT NULL PRIMARY KEY,
  track          VARCHAR(10) NOT NULL,
  score          INT         NOT NULL,
  pctfilledevals INT         NOT NULL,
  numsessions    INT         NOT NULL
)

SET NOCOUNT ON
INSERT INTO SpeakerStats VALUES('Dan',     'Sys', 3, 22, 4)
INSERT INTO SpeakerStats VALUES('Ron',     'Dev', 9, 30, 3)
INSERT INTO SpeakerStats VALUES('Kathy',   'Sys', 8, 27, 2)
INSERT INTO SpeakerStats VALUES('Suzanne', 'DB',  9, 30, 3)
INSERT INTO SpeakerStats VALUES('Joe',     'Dev', 6, 20, 2)
INSERT INTO SpeakerStats VALUES('Robert',  'Dev', 6, 28, 2)
INSERT INTO SpeakerStats VALUES('Mike',    'DB',  8, 20, 3)
INSERT INTO SpeakerStats VALUES('Michele', 'Sys', 8, 31, 4)
INSERT INTO SpeakerStats VALUES('Jessica', 'Dev', 9, 19, 1)
INSERT INTO SpeakerStats VALUES('Brian',   'Sys', 7, 22, 3)
INSERT INTO SpeakerStats VALUES('Kevin',   'DB',  7, 25, 4)

Each speaker has one row in the table with the speaker's name, track, average score, percent of attendees that filled evaluations in respect to the number of attendees that attended the sessions, and the number of sessions delivered by the speaker. This section demonstrates how to analyze the speaker statistics data to generate useful information using the new ranking functions.

Semantics

All four ranking functions follow a similar syntax pattern:

Ranking Function

<function_name>() OVER(
  [PARTITION BY <partition_by_list>]
  ORDER BY <order_by_list>)

The function can be specified only in two clauses of a query—in the SELECT clause or in the ORDER BY clause. The following sections discuss the different functions in detail.

ROW_NUMBER

The ROW_NUMBER function allows you to provide sequential integer values to result rows of a query. For example, suppose you wanted to return the speaker, track, and score of all speakers, assigning sequential values from 1 and on to the result rows according to descending score order. The following query generates the desired results by using the ROW_NUMBER function, specifying OVER (ORDER BY score DESC):

SELECT ROW_NUMBER() OVER(ORDER BY score DESC) AS rownum, 
  speaker, track, score
FROM SpeakerStats
ORDER BY score DESC

Here is the result set:

rownum speaker    track      score
------ ---------- ---------- -----------
1      Jessica    Dev        9
2      Ron        Dev        9
3      Suzanne    DB         9
4      Kathy      Sys        8
5      Michele    Sys        8
6      Mike       DB         8
7      Kevin      DB         7
8      Brian      Sys        7
9      Joe        Dev        6
10     Robert     Dev        6
11     Dan        Sys        3

The speaker with the highest score got row number 1, and the speaker with the lowest score got row number 11. ROW_NUMBER always generates distinct row numbers to different rows according to the requested sort. Note that if the ORDER BY list specified within the OVER() option is not unique, the result is non-deterministic. This means that there's more than one correct result to the query; in different invocations of the query you might get different results. For example, in our case three different speakers got the same highest score (9): Jessica, Ron, and Suzanne. Since SQL Server has to assign different row numbers to the different speakers, you should assume that the values 1, 2, and 3 assigned Jessica, Ron, and Suzanne respectively were assigned in arbitrary order among those speakers. The result would have been just as correct if the values 1, 2, and 3 were assigned to Ron, Suzanne, and Jessica respectively.

If you specify a unique ORDER BY list, the result is always deterministic. For example, suppose that in the case of a tie between speakers based on score you want to use the highest pctfilledevals value as the tiebreaker. If there's still a tie, use the highest numsessions value as the tiebreaker. Finally, if there's still a tie, use the lowest dictionary-order speaker name as a tiebreaker. Since the ORDER BY list—score, pctfilledevals, numsessions, and speaker—is unique, the result is deterministic:

SELECT ROW_NUMBER() OVER(ORDER BY score DESC, pctfilledevals DESC,
                           numsessions DESC, speaker) AS rownum, 
  speaker, track, score, pctfilledevals, numsessions
FROM SpeakerStats
ORDER BY score DESC, pctfilledevals DESC, numsessions DESC, speaker

Here is the result set:

rownum speaker    track      score       pctfilledevals numsessions
------ ---------- ---------- ----------- -------------- -----------
1      Ron        Dev        9           30             3
2      Suzanne    DB         9           30             3
3      Jessica    Dev        9           19             1
4      Michele    Sys        8           31             4
5      Kathy      Sys        8           27             2
6      Mike       DB         8           20             3
7      Kevin      DB         7           25             4
8      Brian      Sys        7           22             3
9      Robert     Dev        6           28             2
10     Joe        Dev        6           20             2
11     Dan        Sys        3           22             4

One of the important benefits of the new ranking functions is their efficiency. SQL Server's optimizer needs to scan the data only once in order to calculate the values. It does this either by using an ordered scan of an index placed on the sort columns or by scanning the data once and sorting it if an appropriate index was not created.

Another benefit is the simplicity of the syntax. To give you a sense of how difficult and inefficient it is to calculate ranking values by using the set-based approach used in earlier releases of SQL Server, consider the following SQL Server 2000 query, which returns the same results as the previous query:

SELECT
  (SELECT COUNT(*)
   FROM SpeakerStats AS S2
   WHERE S2.score > S1.score
     OR (S2.score = S1.score
         AND S2.pctfilledevals > S1.pctfilledevals)
     OR (S2.score = S1.score
         AND S2.pctfilledevals = S1.pctfilledevals
         AND S2.numsessions > S1.numsessions)
     OR (S2.score = S1.score
         AND S2.pctfilledevals = S1.pctfilledevals
         AND S2.numsessions = S1.numsessions
         AND S2.speaker < S1.speaker)) + 1 AS rownum,
  speaker, track, score, pctfilledevals, numsessions
FROM SpeakerStats AS S1
ORDER BY score DESC, pctfilledevals DESC, numsessions DESC, speaker

This query is obviously much more complex than the SQL Server 2005 query. Furthermore, for each base row in the SpeakerStats table, SQL Server has to scan all matching rows in another instance of the table. On average, about half (at minimum) of the table's rows need to be scanned per each row in the base table. Performance degradation of the SQL Server 2005 query is linear, while performance degradation of the SQL Server 2000 query is exponential. Even in fairly small tables the performance difference is significant. For example, test the performance of the following queries which query the SalesOrderHeader table in the AdventureWorks database to calculate row numbers for sales orders according to SalesOrderID order. The SalesOrderHeader table has 31,465 rows. The first query uses the SQL Server 2005 ROW_NUMBER function, while the second query uses the SQL Server 2000 subquery technique:

-- SQL Server 2005 query
SELECT SalesOrderID,
  ROW_NUMBER() OVER(ORDER BY SalesOrderID) AS rownum
FROM Sales.SalesOrderHeader
-- SQL Server 2000 query
SELECT SalesOrderID,
  (SELECT COUNT(*)
   FROM Sales.SalesOrderHeader AS S2
   WHERE S2.SalesOrderID <= S1.SalesOrderID) AS rownum
FROM Sales.SalesOrderHeader AS S1

I ran this test on my laptop (Compaq Presario X1020U, CPU: Centrino 1.4 GH, RAM: 1GB, local HD). The SQL Server 2005 query finished in only 1 second while the SQL Server 2000 query finished in about 12 minutes.

A typical application for row numbers is to page through the results of a query. Given a page size in terms of number of rows, and a page number, you need to return the rows that belong to the given page. For example, suppose you want to return the second page of rows from the SpeakerStats table, assuming a page size of three rows, according to score DESC, speaker order. The following query first calculates row numbers according to the specified sort in the derived table D, then it filters only rows with the row numbers 4 through 6, which belong to the second page:

SELECT *
FROM (SELECT ROW_NUMBER() OVER(ORDER BY score DESC, speaker) AS rownum, 
        speaker, track, score
      FROM SpeakerStats) AS D
WHERE rownum BETWEEN 4 AND 6
ORDER BY score DESC, speaker

Here is the result set:

rownum speaker    track      score
------ ---------- ---------- -----------
4      Kathy      Sys        8
5      Michele    Sys        8
6      Mike       DB         8

In more generic terms, given a page number in the @pagenum variable, and a page size in the @pagesize variable, the following query returns the rows that belong to the desired page:

DECLARE @pagenum AS INT, @pagesize AS INT
SET @pagenum = 2
SET @pagesize = 3
SELECT *
FROM (SELECT ROW_NUMBER() OVER(ORDER BY score DESC, speaker) AS rownum, 
        speaker, track, score
      FROM SpeakerStats) AS D
WHERE rownum BETWEEN (@pagenum-1)*@pagesize+1 AND @pagenum*@pagesize
ORDER BY score DESC, speaker

The above approach is adequate for ad hoc requests when you're only interested in one specific page of the rows. However, this approach is not adequate when the user issues multiple requests because each invocation of the query costs you a complete scan of the table in order to calculate the row numbers. For more efficient paging when the user might repeatedly request different pages, first populate a temporary table with all of the base table rows including calculated row numbers and index the column containing the row numbers:

SELECT ROW_NUMBER() OVER(ORDER BY score DESC, speaker) AS rownum, *
INTO #SpeakerStatsRN
FROM SpeakerStats
CREATE UNIQUE CLUSTERED INDEX idx_uc_rownum ON #SpeakerStatsRN(rownum)

And then for each requested page issue the following query:

SELECT rownum, speaker, track, score
FROM #SpeakerStatsRN
WHERE rownum BETWEEN (@pagenum-1)*@pagesize+1 AND @pagenum*@pagesize
ORDER BY score DESC, speaker

Only the rows that belong to the desired page will be scanned.

Partitioning

Ranking values can be calculated within groups of rows independently as opposed to being calculated for all table rows as one group. To do this, use the PARTITION BY clause and specify a list of expressions that identify the groups of rows for which ranking values should be calculated independently. For example, the following query assigns row numbers within each track separately according to score DESC, speaker order:

SELECT track, 
  ROW_NUMBER() OVER(
    PARTITION BY track 
    ORDER BY score DESC, speaker) AS pos, 
  speaker, score
FROM SpeakerStats
ORDER BY track, score DESC, speaker

Here is the result set:

track      pos speaker    score
---------- --- ---------- -----------
DB         1   Suzanne    9
DB         2   Mike       8
DB         3   Kevin      7
Dev        1   Jessica    9
Dev        2   Ron        9
Dev        3   Joe        6
Dev        4   Robert     6
Sys        1   Kathy      8
Sys        2   Michele    8
Sys        3   Brian      7
Sys        4   Dan        3

Specifying the track column in the PARTITION BY clause causes row numbers to be calculated independently for each group of rows with the same track.

RANK, DENSE_RANK

The RANK and DENSE_RANK functions are very similar to the ROW_NUMBER function in the sense that they also provide ranking values according to a specified sort, optionally within groups (partitions) of rows. However, unlike ROW_NUMBER, RANK and DENSE_RANK assign the same ranks to rows with the same values in the sort columns. RANK and DENSE_RANK are useful when you don't want to assign different ranks for rows with the same values in the ORDER BY list when the ORDER BY list is not unique. The purpose of RANK and DENSE_RANK and the difference between the two is best explained with an example. The following query calculates row number, rank, and dense rank values to the different speakers according to score DESC order:

SELECT speaker, track, score,
  ROW_NUMBER() OVER(ORDER BY score DESC) AS rownum,
  RANK() OVER(ORDER BY score DESC) AS rnk,
  DENSE_RANK() OVER(ORDER BY score DESC) AS drnk
FROM SpeakerStats
ORDER BY score DESC

Here is the result set:

speaker    track      score       rownum rnk drnk
---------- ---------- ----------- ------ --- ----
Jessica    Dev        9           1      1   1
Ron        Dev        9           2      1   1
Suzanne    DB         9           3      1   1
Kathy      Sys        8           4      4   2
Michele    Sys        8           5      4   2
Mike       DB         8           6      4   2
Kevin      DB         7           7      7   3
Brian      Sys        7           8      7   3
Joe        Dev        6           9      9   4
Robert     Dev        6           10     9   4
Dan        Sys        3           11     11  5

As discussed earlier, the score column is not unique, so different speakers might have the same score. Row numbers do represent a descending score order, but speakers with the same score still get different row numbers. However, notice in the result that all speakers with the same scores get the same rank and dense rank values. In other words, ROW_NUMBER is not deterministic when the ORDER BY list is not unique, while RANK and DENSE_RANK are always deterministic. The difference between the rank and dense rank values is that rank stands for: the number of rows with a higher score plus one, while dense rank stands for: the number of distinct higher scores plus one. From what you've learned so far you can deduce that ROW_NUMBER, RANK, and DENSE_RANK produce the exact same values when the ORDER BY list is unique.

NTILE

NTILE allows you to separate the result rows of a query into a specified number of groups (tiles) according to a specified order. Each group of rows gets a different number starting with 1 for the first group, 2 for the second, and so on. You specify the requested number of groups in the parentheses following the function's name, and the requested sort in the ORDER BY clause of the OVER option. The number of rows in a group is calculated as total_num_rows / num_groups. If there's a remainder n, the first n groups get an additional row. So all groups might not get an equal number of rows, but the group sizes might differ at most by one row. For example, the following query assigns three group numbers to the different speaker rows according to descending score order:

SELECT speaker, track, score,
  ROW_NUMBER() OVER(ORDER BY score DESC) AS rownum,
  NTILE(3) OVER(ORDER BY score DESC) AS tile
FROM SpeakerStats
ORDER BY score DESC

Here is the result set:

speaker    track      score       rownum tile
---------- ---------- ----------- ------ ----
Jessica    Dev        9           1      1
Ron        Dev        9           2      1
Suzanne    DB         9           3      1
Kathy      Sys        8           4      1
Michele    Sys        8           5      2
Mike       DB         8           6      2
Kevin      DB         7           7      2
Brian      Sys        7           8      2
Joe        Dev        6           9      3
Robert     Dev        6           10     3
Dan        Sys        3           11     3

There are 11 speakers in the SpeakerStats table. Dividing 11 by 3 gives you a group size of 3 with a remainder of 2, meaning that the first 2 groups will get an additional row (4 rows in each group), and the third group won't (3 rows in the group). The group number (tile number) 1 is assigned to rows 1 through 4, group number 2 is assigned to rows 5 through 8, and group number 3 is assigned to rows 9 through 11. This information allows you to generate a histogram with an even distribution of items for each step. In our case, the first step represents the speakers with the highest scores, the second represents the speakers with the medium scores, and the third represents the speakers with the lowest scores. You can use a CASE expression to provide descriptive meaningful alternatives to the group numbers:

SELECT speaker, track, score,
  CASE NTILE(3) OVER(ORDER BY score DESC)
    WHEN 1 THEN 'High'
    WHEN 2 THEN 'Medium'
    WHEN 3 THEN 'Low'
  END AS scorecategory
FROM SpeakerStats
ORDER BY track, speaker

Here is the result set:

speaker    track      score       scorecategory
---------- ---------- ----------- -------------
Kevin      DB         7           Medium
Mike       DB         8           Medium
Suzanne    DB         9           High
Jessica    Dev        9           High
Joe        Dev        6           Low
Robert     Dev        6           Low
Ron        Dev        9           High
Brian      Sys        7           Medium
Dan        Sys        3           Low
Kathy      Sys        8           High
Michele    Sys        8           Medium

Recursive Queries and Common Table Expressions

This section explores the subtleties of recursive CTE expressions and applies them as solutions to common problems in a way that greatly simplifies traditional approaches.

Common Table Expressions

A common table expression (CTE) is a temporary named result set that can be referred to by a defining statement. In their simple form, you can think of CTEs as an improved version of derived tables that more closely resemble a nonpersistent type of view. You refer to a CTE in the FROM clause of a query similar to the way you refer to derived tables and views. You define the CTE only once and can refer to it several times in your query. In the definition of a CTE, you can refer to variables that are defined in the same batch. You can even use CTEs in INSERT, UPDATE, DELETE, and CREATE VIEW statements, similar to the way you use views. The real power of CTEs, however, is in their recursive capabilities, in which CTEs contain references to themselves. In this paper, CTEs are described first in their simple form and later in their recursive form. This paper covers SELECT queries with CTEs.

You use derived tables when you want to refer to a query result as if it were a table, but you do not want to create a persistent view in the database. Derived tables, however, have a limitation not found in CTEs: You cannot define a derived table once in your query and use it several times. Instead, you must define several derived tables with the same query. You can, however, define a CTE once and use it several times in a query without persisting it in the database.

Before providing a practical example of CTEs, the basic syntax of CTEs is compared to derived tables and views. The following is a general form of a query within a view, derived table, and CTE:

View

CREATE VIEW <view_name>(<column_aliases>)
AS
<view_query>
GO
SELECT *
FROM <view_name>

Derived Table

SELECT *
FROM (<derived_table_query>) AS <derived_table_alias>(<column_aliases>)

CTE

WITH <cte_alias>(<column_aliases>)
AS
(

  <cte_query>
)
SELECT *
FROM <cte_alias>

You provide the CTE with an alias and an optional list of aliases for its result columns following the keyword WITH; write the body of the CTE; and refer to it from the outer query.

Note that if the WITH clause for a CTE is not the first statement in the batch, you should delimit it from the preceding statement by placing a semicolon (;) in front of it. The semicolon is used to avoid ambiguity with other uses of the WITH clause (for example, for table hints). Although you may find that including a semicolon is not necessary in all cases, it is recommended that you use it consistently.

As a practical example, consider the HumanResources.Employee and Purchasing.PurchaseOrderHeader tables in the AdventureWorks database. Each employee reports to a manager specified in the ManagerID column. Each employee in the Employee table might have related orders in the PurchaseOrderHeader table. Suppose you want to return, for each employee, their count of orders and last order date and, in the same row, similar details for the manager. The following example shows how you can implement a solution using views, derived tables, and CTEs:

View

CREATE VIEW VEmpOrders(EmployeeID, NumOrders, MaxDate)
AS
SELECT EmployeeID, COUNT(*), MAX(OrderDate)
FROM Purchasing.PurchaseOrderHeader
GROUP BY EmployeeID
GO
SELECT E.EmployeeID, OE.NumOrders, OE.MaxDate,
  E.ManagerID, OM.NumOrders, OM.MaxDate
FROM HumanResources.Employee AS E
  JOIN VEmpOrders AS OE
    ON E.EmployeeID = OE.EmployeeID
  LEFT OUTER JOIN VEmpOrders AS OM
    ON E.ManagerID = OM.EmployeeID

Derived Tables

SELECT E.EmployeeID, OE.NumOrders, OE.MaxDate,
  E.ManagerID, OM.NumOrders, OM.MaxDate
FROM HumanResources.Employee AS E

  JOIN (SELECT EmployeeID, COUNT(*), MAX(OrderDate)
        FROM Purchasing.PurchaseOrderHeader
        GROUP BY EmployeeID) AS OE(EmployeeID, NumOrders, MaxDate)
    ON E.EmployeeID = OE.EmployeeID
  LEFT OUTER JOIN
       (SELECT EmployeeID, COUNT(*), MAX(OrderDate)
        FROM Purchasing.PurchaseOrderHeader
        GROUP BY EmployeeID) AS OM(EmployeeID, NumOrders, MaxDate)
    ON E.ManagerID = OM.EmployeeID

CTE

WITH EmpOrdersCTE(EmployeeID, NumOrders, MaxDate)
AS
(
  SELECT EmployeeID, COUNT(*), MAX(OrderDate)
  FROM Purchasing.PurchaseOrderHeader
  GROUP BY EmployeeID
)
SELECT E.EmployeeID, OE.NumOrders, OE.MaxDate,
  E.ManagerID, OM.NumOrders, OM.MaxDate
FROM HumanResources.Employee AS E
  JOIN EmpOrdersCTE AS OE
    ON E.EmployeeID = OE.EmployeeID
  LEFT OUTER JOIN EmpOrdersCTE AS OM
    ON E.ManagerID = OM.EmployeeID
The CTE's definition must be followed by an outer query, which may or may not refer to it. You cannot refer to the CTE later in the batch after other intervening statements. 

You can define several CTEs in the same WITH clause, each referring to previously defined CTEs. Commas are used to delimit the CTEs. For example, suppose you wanted to calculate the minimum, maximum, and difference of counts of employee orders:

WITH
EmpOrdersCTE(EmployeeID, Cnt)
AS
(
  SELECT EmployeeID, COUNT(*)
  FROM Purchasing.PurchaseOrderHeader

  GROUP BY EmployeeID
),
MinMaxCTE(MN, MX, Diff)
AS
(
  SELECT MIN(Cnt), MAX(Cnt), MAX(Cnt)-MIN(Cnt)
  FROM EmpOrdersCTE
)
SELECT * FROM MinMaxCTE

Here is the result set:

MN          MX          Diff       
----------- ----------- -----------
160         400         240        

In EmpOrdersCTE, you calculate the number of orders from each employee. In MinMaxCTE, you refer to EmpOrdersCTE to calculate the minimum, maximum and difference of counts.

Note   Within a CTE, you are not limited to referring only to the CTE defined directly before it; rather you can refer to all previously defined CTEs. Note that forward references are not allowed: A CTE can refer to CTEs defined before it and to itself (see Recursive Queries later in this paper) but not to CTEs defined after it. For example, if you define CTEs C1, C2, C3 in the same WITH statement, C2 can refer to C1 and C2 but not to C3.

In another example, the following code generates a histogram that calculates the number of employees that fall within each of four ranges of order counts between the minimum and maximum. If the calculations appear complicated to you, do not spend time trying to figure them out. The purpose of this example is to use a practical scenario to demonstrate the declaration of multiple CTEs in the same WITH statement, where each might refer to previous CTEs.

WITH
EmpOrdersCTE(EmployeeID, Cnt)
AS
(
  SELECT EmployeeID, COUNT(*)
  FROM Purchasing.PurchaseOrderHeader
  GROUP BY EmployeeID
),

MinMaxCTE(MN, MX, Diff)
AS
(
  SELECT MIN(Cnt), MAX(Cnt), MAX(Cnt)-MIN(Cnt)
  FROM EmpOrdersCTE
),
NumsCTE(Num)
AS
(
  SELECT 1 AS Num
  UNION ALL SELECT 2
  UNION ALL SELECT 3
  UNION ALL SELECT 4
),
StepsCTE(Step, Fromval, Toval)
AS
(
  SELECT
    Num,
    CAST(MN + ROUND((Num-1)*((Diff+1)/4.), 0) AS INT),
    CAST(MN + ROUND((Num)*((Diff+1)/4.), 0) - 1 AS INT)
  FROM MinMaxCTE CROSS JOIN NumsCTE
),
HistogramCTE(Step, Fromval, Toval, Samples)
AS
(
  SELECT S.Step, S.Fromval, S.Toval, COUNT(EmployeeID)
  FROM StepsCTE AS S
    LEFT OUTER JOIN EmpOrdersCTE AS OE
      ON OE.Cnt BETWEEN S.Fromval AND S.Toval
  GROUP BY S.Step, S.Fromval, S.Toval
)
SELECT * FROM HistogramCTE

Here is the result set:

Step        Fromval     Toval       Samples
----------- ----------- ----------- -----------
1           160         219         2
2           220         280         0
3           281         340         0
4           341         400         10

Notice that the second CTE (MinMaxCTE) refers to the first (EmpOrdersCTE); the third (NumsCTE) does not refer to any CTE. The fourth (StepsCTE) refers to the second and third CTEs, and the fifth (HistogramCTE) refers to the first and fourth CTEs.

Recursive Queries

Nonrecursive CTEs increase your expressive power. But for each piece of code that uses nonrecursive CTEs, you can usually write shorter code that achieves the same results by using other Transact-SQL constructs, such as derived tables. The case is different with recursive CTEs. This section describes the semantics of recursive queries and provides practical implementations for a hierarchy of employees in an organizational chart, and for a bill of materials (BOM) scenario.

Semantics

When a CTE refers to itself, it is considered to be recursive. Recursive CTEs are constructed from at least two queries (or members in recursive query parlance). One is a nonrecursive query, also referred to as the anchor member (AM). The other is the recursive query, also referred to as the recursive member (RM). The queries are separated by a UNION ALL operator. The following example shows a simplified generic form of a recursive CTE:

WITH RecursiveCTE(<column_list>)
AS
(
  -- Anchor Member:
  -- SELECT query that does not refer to RecursiveCTE
  SELECT ... 
  FROM <some_table(s)>
  ...
  UNION ALL
  -- Recursive Member
  -- SELECT query that refers to RecursiveCTE
  SELECT ...
  FROM <some_table(s)>
    JOIN RecursiveCTE

  ...
)
-- Outer Query
SELECT ...
FROM RecursiveCTE
...

Logically you can think of the algorithm implementing the recursive CTE as:

  1. The anchor member is activated. Set R0 (R for Results) is generated.
  2. The recursive member is activated, getting set Ri (i = step number) as input when referring to RecursiveCTE. Set Ri + 1 is generated.
  3. The logic of Step 2 is run repeatedly (incrementing the step number in each iteration) until an empty set is returned.
  4. The outer query is executed, getting the cumulative (UNION ALL) result of all of the previous steps when referring to RecursiveCTE.

You can have more than two members in the CTE, but only a UNION ALL operator is allowed between a recursive member and another member (recursive or nonrecursive). Other operators, such as UNION, are allowed only between nonrecursive members. Unlike regular UNION and UNION ALL operators that support implicit conversion, recursive CTEs require an exact match of the columns in all members, including the same data type, length, and precision.

There are similarities between recursive CTEs and classic recursive routines (not necessarily specific to SQL Server). Recursive routines usually consist of three important elements—the first invocation of the routine, a recursive termination check, and a recursive call to the same routine. The anchor member in a recursive CTE corresponds to the first invocation of the routine in a classic recursive routine. The recursive member corresponds to the recursive invocation of the routine. The termination check, which is usually explicit in recursive routines (for example, by means of an IF statement), is implicit in a recursive CTE—recursion stops when no rows are returned from the previous invocation.

The following sections present practical examples and uses of recursive CTEs in single-parent and multiparent environments.

Single-Parent Environment: Employees Organizational Chart

For a single-parent hierarchy scenario, an employees organizational chart is used.

Note   The examples in this section use a table called Employees that has a structure that is different from the HumanResources.Employee table in AdventureWorks. You should run the code in your own test database or in tempdb, not in AdventureWorks.

The following code generates the Employees table and populates it with sample data:

USE tempdb -- or your own test database
CREATE TABLE Employees
(
  empid   int         NOT NULL,
  mgrid   int         NULL,
  empname varchar(25) NOT NULL,
  salary  money       NOT NULL,
  CONSTRAINT PK_Employees PRIMARY KEY(empid),
  CONSTRAINT FK_Employees_mgrid_empid
    FOREIGN KEY(mgrid)
    REFERENCES Employees(empid)
)
CREATE INDEX idx_nci_mgrid ON Employees(mgrid)
SET NOCOUNT ON
INSERT INTO Employees VALUES(1 , NULL, 'Nancy'   , $10000.00)
INSERT INTO Employees VALUES(2 , 1   , 'Andrew'  , $5000.00)
INSERT INTO Employees VALUES(3 , 1   , 'Janet'   , $5000.00)
INSERT INTO Employees VALUES(4 , 1   , 'Margaret', $5000.00) 
INSERT INTO Employees VALUES(5 , 2   , 'Steven'  , $2500.00)
INSERT INTO Employees VALUES(6 , 2   , 'Michael' , $2500.00)
INSERT INTO Employees VALUES(7 , 3   , 'Robert'  , $2500.00)
INSERT INTO Employees VALUES(8 , 3   , 'Laura'   , $2500.00)
INSERT INTO Employees VALUES(9 , 3   , 'Ann'     , $2500.00)
INSERT INTO Employees VALUES(10, 4   , 'Ina'     , $2500.00)
INSERT INTO Employees VALUES(11, 7   , 'David'   , $2000.00)
INSERT INTO Employees VALUES(12, 7   , 'Ron'     , $2000.00)
INSERT INTO Employees VALUES(13, 7   , 'Dan'     , $2000.00)
INSERT INTO Employees VALUES(14, 11  , 'James'   , $1500.00)

Each employee reports to a manager whose ID is stored in the mgrid column. A foreign key is defined on the mgrid column referencing the empid column, meaning that a manager ID must either correspond to a valid employee ID within the table or be NULL. Nancy, the boss, has NULL in the mgrid column. Manager-employee relationships are shown in Figure 1.

ms345144.sql2k5b2_tsqlenhance_01(en-US,SQL.90).gif

Figure 1. Employees organizational chart

The following are some common requests that might be run on the Employees table:

  • Show me details about Robert (empid=7) and all of his subordinates in all levels.
  • Show me details about all employees that are two levels under Janet (empid=3).
  • Show me the chain of management leading to James (empid=14).
  • Show me how many employees report to each manager directly or indirectly.
  • Show me all of the employees in such a way that it will be easy to see their hierarchical dependencies.

Recursive CTEs provide the means to deal with these requests, which are recursive in nature, without the need to maintain additional information in the database about the hierarchy.

The first request is probably the most common one: returning an employee (for example, Robert whose empid=7) and his/her subordinates in all levels. The following CTE provides a solution to this request:

WITH EmpCTE(empid, empname, mgrid, lvl)
AS
( 

  -- Anchor Member (AM)
  SELECT empid, empname, mgrid, 0
  FROM Employees
  WHERE empid = 7
  UNION ALL
  
  -- Recursive Member (RM)
  SELECT E.empid, E.empname, E.mgrid, M.lvl+1
  FROM Employees AS E
    JOIN EmpCTE AS M
      ON E.mgrid = M.empid
)
SELECT * FROM EmpCTE

Here is the result set:

empid       empname                   mgrid       lvl        
----------- ------------------------- ----------- -----------
7           Robert                    3           0          
11          David                     7           1          
12          Ron                       7           1          
13          Dan                       7           1          
14          James                     11          2          

Following the recursive CTE logic described previously, this CTE is processed as follows:

The anchor member is activated, returning Robert's row from the Employees table. Notice the constant 0 that is returned in the lvl result column.
  1. The recursive member is activated repeatedly, returning direct subordinates of the previous result by means of a join operation between Employees and EmpCTE. Employees represents the subordinates, and EmpCTE (which contains the result from the previous invocation) represents managers:
    • First, Robert's subordinates are returned: David, Ron, and James.
    • And then David's, Ron's, and Dan's subordinates are returned: only James.
    • Finally, James' subordinates are returned: none, in which case, recursion is terminated.
  2. The outer query returns all rows from EmpCTE.

Notice that the lvl value is repeatedly incremented with each recursive invocation.

Using this level counter you can limit the number of iterations in the recursion. For example, the following CTE is used to return all employees who are two levels below Janet:

WITH EmpCTEJanet(empid, empname, mgrid, lvl)
AS
( 
  SELECT empid, empname, mgrid, 0
  FROM Employees
  WHERE empid = 3
  UNION ALL
  
  SELECT E.empid, E.empname, E.mgrid, M.lvl+1
  FROM Employees as E
    JOIN EmpCTEJanet as M
      ON E.mgrid = M.empid
  WHERE lvl < 2
)
SELECT empid, empname
FROM EmpCTEJanet
WHERE lvl = 2

Here is the result set:

empid       empname                  
----------- -------------------------
11          David                    
12          Ron                      
13          Dan                      

The additions in this code example compared to the previous one are shown in bold. The filter WHERE lvl < 2 in the recursive member is used as a recursion termination check—no rows are returned when lvl = 2, thus recursion stops. The filter WHERE lvl = 2 in the outer query is used to remove all levels up to level 2. Note that logically the filter in the outer query (lvl = 2) is sufficient by itself to return only the desired rows. The filter in the recursive member (lvl < 2) is added for performance reasons—to stop the recursion early, as soon as two levels below Janet are returned.

As mentioned earlier, CTEs can refer to local variables that are defined within the same batch. For example, to make the query more generic, you can use variables instead of constants for employee ID and level:

DECLARE @empid AS INT, @lvl AS INT
SET @empid = 3 -- Janet
SET @lvl   = 2 -- two levels
WITH EmpCTE(empid, empname, mgrid, lvl)
AS
( 
  SELECT empid, empname, mgrid, 0
  FROM Employees
  WHERE empid = @empid
  UNION ALL
  
  SELECT E.empid, E.empname, E.mgrid, M.lvl+1
  FROM Employees as E
    JOIN EmpCTE as M
      ON E.mgrid = M.empid
  WHERE lvl < @lvl
)
SELECT empid, empname
FROM EmpCTE
WHERE lvl = @lvl

You can use a hint to force termination of the query after a certain number of recursive iterations have been invoked. You do that by adding OPTION(MAXRECURSION value) at the end of the outer query, as shown in the following example:

WITH EmpCTE(empid, empname, mgrid, lvl)
AS
( 
  SELECT empid, empname, mgrid, 0
  FROM Employees
  WHERE empid = 1
  UNION ALL

  SELECT E.empid, E.empname, E.mgrid, M.lvl+1
  FROM Employees as E
    JOIN EmpCTE as M
      ON E.mgrid = M.empid
)
SELECT * FROM EmpCTE
OPTION (MAXRECURSION 2)

Here is the result set:

empid       empname                   mgrid       lvl        
----------- ------------------------- ----------- -----------
1           Nancy                     NULL        0          
2           Andrew                    1           1          
3           Janet                     1           1          
4           Margaret                  1           1          
10          Ina                       4           2          
7           Robert                    3           2          
8           Laura                     3           2          
9           Ann                       3           2          
.Net SqlClient Data Provider: Msg 530, Level 16, State 1, Line 1
Statement terminated. Maximum recursion 2 has been exhausted before statement completion

Results generated thus far might be returned (but are not guaranteed to be), and error 530 is generated. You might think of using the MAXRECURSION option to implement the request to return employees who are two levels below Janet using the MAXRECURSION hint instead of the filter in the recursive member:

WITH EmpCTE(empid, empname, mgrid, lvl)
AS
( 
  SELECT empid, empname, mgrid, 0
  FROM Employees
  WHERE empid = 3
  UNION ALL
  
  SELECT E.empid, E.empname, E.mgrid, M.lvl+1
  FROM Employees as E
    JOIN EmpCTE as M
      ON E.mgrid = M.empid

)
SELECT empid, empname
FROM EmpCTE
WHERE lvl = 2
OPTION (MAXRECURSION 2)

Here is the result set:

empid       empname                  
----------- -------------------------
11          David                    
12          Ron                      
13          Dan                      
.Net SqlClient Data Provider: Msg 530, Level 16, State 1, Line 1
Statement terminated. Maximum recursion 2 has been exhausted before statement completion

But keep in mind that, besides the fact that there is no guarantee that results will be returned, your client will get an error. It is not good programming practice to use code that returns errors in valid situations. It is recommended that you use the filter presented earlier and, if you want, the MAXRECURSION hint as a safeguard against infinite loops.

When this hint is not specified, SQL Server defaults to a value of 100. This value can be used as a safeguard when you suspect cyclic recursive calls. If you do not want to limit the number of recursive calls, set MAXRECURSION to 0 in the hint.

As an example of a cyclic relationship, suppose you had a bug in your data and Nancy's manager was accidentally changed to James (instead of no manager):

UPDATE Employees SET mgrid = 14 WHERE empid = 1

The following cycle is introduced: 1->3->7->11->14->1. If you try running code that returns Nancy and her direct and indirect subordinates at all levels, you get an error indicating that the default maximum recursion of 100 was exhausted before the statement completed:

WITH EmpCTE(empid, empname, mgrid, lvl)
AS
( 
  SELECT empid, empname, mgrid, 0
  FROM Employees
  WHERE empid = 1
  UNION ALL
  
  SELECT E.empid, E.empname, E.mgrid, M.lvl+1
  FROM Employees AS E

    JOIN EmpCTE AS M
      ON E.mgrid = M.empid
)
SELECT * FROM EmpCTE
Msg 530, Level 16, State 1, Line 1
Statement terminated. Maximum recursion 100 has been exhausted before statement completion

Of course it is good to have a safety measure that prevents infinite recursive calls, but MAXRECURSION doesn't help you much in isolating the cycle and resolving the bug in the data. In order to isolate the cycle, you can use a CTE that will construct, for each employee, an enumerated path of all of the employee IDs leading to the employee. Call this result column path. In the recursive member, use a CASE expression to check whether the current employee ID already appears in the manager's path using a LIKE predicate. If it does, this means you found a cycle. If a cycle is found, return 1 in a result column called cycle, otherwise return 0. Also, add a filter to the recursive member that ensures that only subordinates of managers for which a cycle was not detected will be returned. Finally, add a filter to the outer query that returns only employees for which a cycle was found (cycle = 1):

WITH EmpCTE(empid, path, cycle)
AS
( 
  SELECT empid,
    CAST('.' + CAST(empid AS VARCHAR(10)) + '.' AS VARCHAR(900)),
    0
  FROM Employees
  WHERE empid = 1
  UNION ALL
  
  SELECT E.empid,
    CAST(M.path + CAST(E.empid AS VARCHAR(10)) + '.' AS VARCHAR(900)),
    CASE
      WHEN M.path LIKE '%.' + CAST(E.empid AS VARCHAR(10)) + '.%' THEN 1
      ELSE 0
    END
  FROM Employees AS E
    JOIN EmpCTE AS M
      ON E.mgrid = M.empid
  WHERE M.cycle = 0

)
SELECT path FROM EmpCTE
WHERE cycle = 1
path
---------------
.1.3.7.11.14.1.

Note that corresponding columns in both the anchor member and the recursive member must have the same data type, length, and precision. That's why the expression generating the path value is converted to varbinary(900) in both members. Once the cycle is detected, you can fix the bug in your data by changing Nancy's manager back to no manager:

UPDATE Employees SET mgrid = NULL WHERE empid = 1

The recursive examples provided up to this point have an anchor member that is a manager and a recursive member that retrieves subordinates. Some requests require the opposite; for example, a request to return James' management path (James and all of his managers at all levels). The following code provides an answer for this request:

WITH EmpCTE(empid, empname, mgrid, lvl)
AS
( 
  SELECT empid, empname, mgrid, 0
  FROM Employees
  WHERE empid = 14
  UNION ALL
  
  SELECT M.empid, M.empname, M.mgrid, E.lvl+1
  FROM Employees as M
    JOIN EmpCTE as E
      ON M.empid = E.mgrid
)
SELECT * FROM EmpCTE

Here is the result set:

empid       empname                   mgrid       lvl        
----------- ------------------------- ----------- -----------
14          James                     11          0          
11          David                     7           1          
7           Robert                    3           2          
3           Janet                     1           3          
1           Nancy                     NULL        4          

The anchor member returns James' row. The recursive member returns the managers of the previously returned employees or manager in singular, because a single-parent hierarchy is used here and the request starts with a single employee.

You can also use recursive queries to calculate aggregations, such as the number of subordinates that report to each manager directly or indirectly:

WITH MgrCTE(mgrid, lvl)
AS
(
  SELECT mgrid, 0
  FROM Employees
  WHERE mgrid IS NOT NULL
  UNION ALL
  SELECT M.mgrid, lvl + 1
  FROM Employees AS M
    JOIN MgrCTE AS E
      ON E.mgrid = M.empid
  WHERE M.mgrid IS NOT NULL
)
SELECT mgrid, COUNT(*) AS cnt
FROM MgrCTE
GROUP BY mgrid

Here is the result set:

mgrid       cnt        
----------- -----------
1           13         
2           2          
3           7          
4           1          
7           4          
11          1          

The anchor member returns a row with the manager ID for each employee. NULL in the manager ID column is excluded because it represents no specific manager. The recursive member returns the manager IDs of the managers of the previously returned managers, again NULLs are excluded. Eventually, the CTE contains, for each manager, as many occurrences as their direct or indirect number of subordinates. The outer query is left with the tasks of grouping the result by manager ID and returning the count of occurrences.

As another example of a request against a single-parent hierarchy, suppose you want to return Nancy's subordinates sorted and indented according to hierarchical dependencies. The following code does just that, sorting siblings according to their employee IDs:

WITH EmpCTE(empid, empname, mgrid, lvl, sortcol)
AS
( 
  SELECT empid, empname, mgrid, 0,
    CAST(empid AS VARBINARY(900))
  FROM Employees
  WHERE empid = 1
  UNION ALL
  SELECT E.empid, E.empname, E.mgrid, M.lvl+1,
    CAST(sortcol + CAST(E.empid AS BINARY(4)) AS VARBINARY(900))
  FROM Employees AS E
    JOIN EmpCTE AS M
      ON E.mgrid = M.empid
)
SELECT
  REPLICATE(' | ', lvl)
    + '(' + (CAST(empid AS VARCHAR(10))) + ') '
    + empname AS empname
FROM EmpCTE
ORDER BY sortcol
(1) Nancy
 | (2) Andrew
 |  | (5) Steven
 |  | (6) Michael
 | (3) Janet
 |  | (7) Robert
 |  |  | (11) David
 |  |  |  | (14) James
 |  |  | (12) Ron
 |  |  | (13) Dan
 |  | (8) Laura
 |  | (9) Ann

 | (4) Margaret
 |  | (10) Ina

To sort siblings according to the empid value, form a binary string called sortcol for each employee. The string is made of concatenated employee IDs in the chain of management leading to each employee, converted to binary values. The anchor member is the starting point. It generates a binary value with the empid of the root employee. In each iteration, the recursive member appends the current employee ID converted to a binary value to the manager's sortcol. The outer query then sorts the result by sortcol. Remember that corresponding columns in both the anchor member and the recursive member must have the same data type, length, and precision. That's why the expression generating the sortcol value is converted to varbinary(900), even though an integer requires 4 bytes in its binary representation: 900 bytes cover 225 levels, which seems more than a reasonable limitation. If you want support for more levels, you can increase this length but make sure you do so in both members; otherwise, you will get an error.

Hierarchical indentation is achieved by replicating a character string (' | ' in this case) as many times as the number levels of the employee. To that, the employee ID itself is appended within parentheses, and finally the employee name is also appended.

A similar technique can be used to sort siblings by other attributes that can be converted to small fixed-length binary values; for example, the employee's hire date stored in a smalldatetime column. If you want to sort siblings by attributes that are not convertible to small fixed-sized binary values, such as by employee name, you can first produce integer row numbers (for details on row numbers, see the section "Ranking Functions" earlier in this paper) partitioned by manager ID representing the desired sort like so:

SELECT empid, empname, mgrid,
  ROW_NUMBER() OVER(PARTITION BY mgrid ORDER BY empname) AS pos
FROM Employees

And instead of concatenating employee ID's converted to binary values, concatenate the employee positions converted to binary values:

WITH EmpPos(empid, empname, mgrid, pos)
AS
(
  SELECT empid, empname, mgrid,
    ROW_NUMBER() OVER(PARTITION BY mgrid ORDER BY empname) AS pos
  FROM Employees
),
EmpCTE(empid, empname, mgrid, lvl, sortcol)
AS
( 
  SELECT empid, empname, mgrid, 0,

    CAST(pos AS VARBINARY(900))
  FROM EmpPos
  WHERE empid = 1
  UNION ALL
  SELECT E.empid, E.empname, E.mgrid, M.lvl+1,
    CAST(sortcol + CAST(E.pos AS BINARY(4)) AS VARBINARY(900))
  FROM EmpPos AS E
    JOIN EmpCTE AS M
      ON E.mgrid = M.empid
)
SELECT
  REPLICATE(' | ', lvl)
    + '(' + (CAST(empid AS VARCHAR(10))) + ') '
    + empname AS empname
FROM EmpCTE
ORDER BY sortcol
(1) Nancy
 | (2) Andrew
 |  | (6) Michael
 |  | (5) Steven
 | (3) Janet
 |  | (9) Ann
 |  | (8) Laura
 |  | (7) Robert
 |  |  | (13) Dan
 |  |  | (11) David
 |  |  |  | (14) James
 |  |  | (12) Ron
 | (4) Margaret
 |  | (10) Ina

To sort siblings by any other attribute or combination of attributes, simply specify the desired attributes in the ORDER BY list of the ROW_NUMBER function's OVER option instead of empname.

Multiparent Environment: Bill of Materials

In the previous section, CTEs are used to handle hierarchies in which each node in the tree has only a single parent. A more complex scenario of relationships is a graph in which each node might have more than one parent. This section describes the use of CTEs in a bill of materials (BOM) scenario. The BOM is an Acyclic Directed Graph, meaning that each node can have more than one parent; a node cannot be a parent of itself, directly or indirectly; the relationship between two nodes is not dual (for example, A contains C, but C does not contain A). Figure 2 shows the relationships between items in a BOM scenario.

ms345144.sql2k5b2_tsqlenhance_02(en-US,SQL.90).gif

Figure 2. Multiparent environment

Item A contains items D, B and C; item C contains B and E; item B is contained in items A and C, and so on. The following code creates the Items and BOM tables and populates them with sample data:

CREATE TABLE Items
(
  itemid   VARCHAR(5)  NOT NULL PRIMARY KEY,
  itemname VARCHAR(25) NOT NULL,
  /* other columns, e.g., unit_price, measurement_unit */
)
CREATE TABLE BOM
(
  itemid     VARCHAR(5) NOT NULL REFERENCES Items,
  containsid VARCHAR(5) NOT NULL REFERENCES Items,
  qty        INT        NOT NULL
  /* other columns, e.g., quantity */
  PRIMARY KEY(itemid, containsid),

  CHECK (itemid <> containsid)
)
SET NOCOUNT ON
INSERT INTO Items(itemid, itemname) VALUES('A', 'Item A')
INSERT INTO Items(itemid, itemname) VALUES('B', 'Item B')
INSERT INTO Items(itemid, itemname) VALUES('C', 'Item C')
INSERT INTO Items(itemid, itemname) VALUES('D', 'Item D')
INSERT INTO Items(itemid, itemname) VALUES('E', 'Item E')
INSERT INTO Items(itemid, itemname) VALUES('F', 'Item F')
INSERT INTO Items(itemid, itemname) VALUES('G', 'Item G')
INSERT INTO Items(itemid, itemname) VALUES('H', 'Item H')
INSERT INTO Items(itemid, itemname) VALUES('I', 'Item I')
INSERT INTO Items(itemid, itemname) VALUES('J', 'Item J')
INSERT INTO Items(itemid, itemname) VALUES('K', 'Item K')
INSERT INTO BOM(itemid, containsid, qty) VALUES('E', 'J', 1)
INSERT INTO BOM(itemid, containsid, qty) VALUES('C', 'E', 3)
INSERT INTO BOM(itemid, containsid, qty) VALUES('A', 'C', 2)
INSERT INTO BOM(itemid, containsid, qty) VALUES('H', 'C', 4)
INSERT INTO BOM(itemid, containsid, qty) VALUES('C', 'B', 2)
INSERT INTO BOM(itemid, containsid, qty) VALUES('B', 'F', 1)
INSERT INTO BOM(itemid, containsid, qty) VALUES('B', 'G', 3)
INSERT INTO BOM(itemid, containsid, qty) VALUES('A', 'B', 2)
INSERT INTO BOM(itemid, containsid, qty) VALUES('A', 'D', 2)
INSERT INTO BOM(itemid, containsid, qty) VALUES('H', 'I', 1)

The Items table contains a row for each item. The BOM table contains the relationships between the nodes in the graph. Each relationship is made up of the parent item ID (itemid), the child item ID (containsid), and the quantity of containsid within itemid (qty).

A common request in a BOM scenario is to "explode" an item: that is, to traverse the graph, starting with the given item and return all the items that it contains, directly or indirectly. This might sound familiar because it's similar to returning a subtree out of a tree as in the Employees Organizational Chart. In a directed graph, however, the request is a bit more complex conceptually, because one contained item can be reached from several different containing items through different paths. For example, suppose you want to explode item A. Notice that two different paths lead from it to item B: A->B and A->C->B. This means that item B would be reached twice, which means that all of the items that B contains (F and G) would be reached twice. Fortunately, with CTEs, the implementation of such a request is as simple as implementing a request to get a subtree out of a tree:

WITH BOMCTE
AS
(
  SELECT *
  FROM BOM
  WHERE itemid = 'A'
  UNION ALL
  SELECT BOM.*
  FROM BOM
    JOIN BOMCTE
      ON BOM.itemid = BOMCTE.containsid
)
SELECT * FROM BOMCTE

Here is the result set:

itemid containsid qty        
------ ---------- -----------
A      B          2          
A      C          2          
A      D          2          
C      B          2          
C      E          3          
E      J          1          
B      F          1          
B      G          3          
B      F          1          
B      G          3          

The anchor member returns all the direct items that A contains from BOM. For each contained item returned by the previous iteration of the CTE, the recursive member returns the items that it contains by joining BOM to BOMCTE. Logically, (not necessarily the order in the output) (A, B), (A, C), (A, D) are returned first; then (B, F), (B, G), (C, B), (C, E); and lastly (B, F), (B, G), (E, J). Note that most requests from a BOM do not require you to show an item more then once in the final results. You can use the DISTINCT clause to eliminate duplicates if you want to show only "which" items were involved in the explosion:

WITH BOMCTE
AS
(
  SELECT *
  FROM BOM
  WHERE itemid = 'A'
  UNION ALL
  SELECT BOM.*
  FROM BOM
    JOIN BOMCTE
      ON BOM.itemid = BOMCTE.containsid
)
SELECT DISTINCT containsid FROM BOMCTE

Here is the result set:

containsid
----------
B         
C         
D         
E         
F         
G         
J         

To help you understand the process of parts explosion, visualize its intermediate result as a tree in which all items are expanded to their contained items. Figure 3 shows the trees formed by exploding parts A and H along with the item quantities.

ms345144.sql2k5b2_tsqlenhance_03(en-US,SQL.90).gif

Figure 3. Parts explosion

Taking the original request a step further, you might be more interested in getting the cumulative quantities of each item rather than getting the items themselves. For example, A contains 2 units of C. C contains 3 units of E. E contains one unit of J. The total number of units of J required for A is a product of the quantities along the path leading from A to J: 2*3*1 = 6. Figure 4 shows the cumulative quantities of each item that makes A before aggregating the items.

ms345144.sql2k5b2_tsqlenhance_04(en-US,SQL.90).gif

Figure 4. Parts explosioncalculated quantities

The following CTE calculates the cumulative product of quantities:

WITH BOMCTE(itemid, containsid, qty, cumulativeqty)
AS
(
  SELECT *, qty
  FROM BOM
  WHERE itemid = 'A'
  UNION ALL
  SELECT BOM.*, BOM.qty * BOMCTE.cumulativeqty
  FROM BOM
    JOIN BOMCTE
      ON BOM.itemid = BOMCTE.containsid
)
SELECT * FROM BOMCTE

Here is the result set:

itemid containsid qty         cumulativeqty
------ ---------- ----------- -------------
A      B          2           2            
A      C          2           2            
A      D          2           2            
C      B          2           4            
C      E          3           6            
E      J          1           6            
B      F          1           4            
B      G          3           12           
B      F          1           2            
B      G          3           6            

This CTE adds the cumulativeqty column to the previous CTE. The anchor member returns the quantities of the contained items as cumulativeqty. For each of the next level's contained item, the recursive member multiplies its quantity by its containing item's cumulative quantity. Note that items that were reached from multiple paths appear multiple times in the results, each with the cumulative quantities for each path. Such an output is not very meaningful by itself, but it is helpful to understand the intermediate step to the final results in which each item appears only once. To get the total quantities of each item in A, have the outer query group the result by containsid:

WITH BOMCTE(itemid, containsid, qty, cumulativeqty)
AS
(

  SELECT *, qty
  FROM BOM
  WHERE itemid = 'A'
  UNION ALL
  SELECT BOM.*, BOM.qty * BOMCTE.cumulativeqty
  FROM BOM
    JOIN BOMCTE
      ON BOM.itemid = BOMCTE.containsid
)
SELECT containsid AS itemid, SUM(cumulativeqty) AS totalqty
FROM BOMCTE
GROUP BY containsid

Here is the result set:

itemid totalqty   
------ -----------
B      6          
C      2          
D      2          
E      6          
F      6          
G      18         
J      6          

PIVOT and UNPIVOT

PIVOT and UNPIVOT are new relational operators that you specify in the FROM clause of a query. They perform some manipulation on an input table-valued expression and produce an output table as a result. The PIVOT operator rotates rows into columns, possibly performing aggregations along the way. It widens the input table expression based on a given pivot column, generating an output table with a column for each unique value in the pivot column. The UNPIVOT operator performs the opposite operation of that performed by the PIVOT operator; it rotates columns into rows. It narrows the input table expression based on a pivot column.

PIVOT

The PIVOT operator is useful for handling open-schema scenarios and for generating crosstab reports.

In an open-schema scenario, you maintain entities with sets of attributes that are either not known ahead or different for each entity type. The users of your application define the attributes dynamically. Instead of predefining many columns and storing many null values in your tables, you split the attributes into different rows and store only the relevant attributes for each entity instance.

PIVOT allows you to generate crosstab reports for open-schema and other scenarios in which you rotate rows into columns, possibly calculating aggregations along the way and presenting the data in a useful form.

An example of an open-schema scenario is a database that keeps track of items put up for auction. Some attributes are relevant for all auction items, such as the item type, when it was made, and its initial price. Only the attributes that are relevant for all items are stored in the AuctionItems table:

CREATE TABLE AuctionItems
(
  itemid       INT          NOT NULL PRIMARY KEY NONCLUSTERED,
  itemtype     NVARCHAR(30) NOT NULL,
  whenmade     INT          NOT NULL,
  initialprice MONEY        NOT NULL,
  /* other columns */
)
CREATE UNIQUE CLUSTERED INDEX idx_uc_itemtype_itemid
  ON AuctionItems(itemtype, itemid)
INSERT INTO AuctionItems VALUES(1, N'Wine',     1822,      3000)
INSERT INTO AuctionItems VALUES(2, N'Wine',     1807,       500)
INSERT INTO AuctionItems VALUES(3, N'Chair',    1753,    800000)
INSERT INTO AuctionItems VALUES(4, N'Ring',     -501,   1000000)
INSERT INTO AuctionItems VALUES(5, N'Painting', 1873,   8000000)
INSERT INTO AuctionItems VALUES(6, N'Painting', 1889,   8000000)

Other attributes are specific to the item type, and new items of different types are continuously being added. Such attributes can be stored in a different ItemAttributes table in which each item attribute is stored in a different row. Each row contains the item ID, attribute name, and attribute value:

CREATE TABLE ItemAttributes
(
  itemid    INT          NOT NULL REFERENCES AuctionItems,

  attribute NVARCHAR(30) NOT NULL,
  value     SQL_VARIANT  NOT NULL, 
  PRIMARY KEY (itemid, attribute)
)
INSERT INTO ItemAttributes
  VALUES(1, N'manufacturer', CAST(N'ABC'              AS NVARCHAR(30)))
INSERT INTO ItemAttributes
  VALUES(1, N'type',         CAST(N'Pinot Noir'       AS NVARCHAR(15)))
INSERT INTO ItemAttributes
  VALUES(1, N'color',        CAST(N'Red'              AS NVARCHAR(15)))
INSERT INTO ItemAttributes
  VALUES(2, N'manufacturer', CAST(N'XYZ'              AS NVARCHAR(30)))
INSERT INTO ItemAttributes
  VALUES(2, N'type',         CAST(N'Porto'            AS NVARCHAR(15)))
INSERT INTO ItemAttributes
  VALUES(2, N'color',        CAST(N'Red'              AS NVARCHAR(15)))
INSERT INTO ItemAttributes
  VALUES(3, N'material',     CAST(N'Wood'             AS NVARCHAR(15)))
INSERT INTO ItemAttributes
  VALUES(3, N'padding',      CAST(N'Silk'             AS NVARCHAR(15)))
INSERT INTO ItemAttributes
  VALUES(4, N'material',     CAST(N'Gold'             AS NVARCHAR(15)))
INSERT INTO ItemAttributes
  VALUES(4, N'inscription',  CAST(N'One ring ...'     AS NVARCHAR(50)))
INSERT INTO ItemAttributes
  VALUES(4, N'size',         CAST(10                  AS INT))
INSERT INTO ItemAttributes
  VALUES(5, N'artist',       CAST(N'Claude Monet'     AS NVARCHAR(30)))
INSERT INTO ItemAttributes
  VALUES(5, N'name',         CAST(N'Field of Poppies' AS NVARCHAR(30)))
INSERT INTO ItemAttributes
  VALUES(5, N'type',         CAST(N'Oil'              AS NVARCHAR(30)))
INSERT INTO ItemAttributes
  VALUES(5, N'height',       CAST(19.625              AS NUMERIC(9,3)))
INSERT INTO ItemAttributes
  VALUES(5, N'width',        CAST(25.625              AS NUMERIC(9,3)))
INSERT INTO ItemAttributes
  VALUES(6, N'artist',       CAST(N'Vincent Van Gogh' AS NVARCHAR(30)))
INSERT INTO ItemAttributes
  VALUES(6, N'name',         CAST(N'The Starry Night' AS NVARCHAR(30)))
INSERT INTO ItemAttributes
  VALUES(6, N'type',         CAST(N'Oil'              AS NVARCHAR(30)))
INSERT INTO ItemAttributes
  VALUES(6, N'height',       CAST(28.75               AS NUMERIC(9,3)))
INSERT INTO ItemAttributes
  VALUES(6, N'width',        CAST(36.25               AS NUMERIC(9,3)))

Note that the sql_variant data type is used for the value column because different attribute values may be of different data types. For example, the size attribute stores an integer attribute value, and a name attribute stores a character string attribute value.

Suppose you want to present the data from the ItemAttributes table with a row for each item that is a painting (items 5,6) and a column for each attribute. Without the PIVOT operator, you must write a query such as:

SELECT
  itemid,
  MAX(CASE WHEN attribute = 'artist'  THEN value END) AS [artist],
  MAX(CASE WHEN attribute = 'name'    THEN value END) AS [name],
  MAX(CASE WHEN attribute = 'type'    THEN value END) AS [type],
  MAX(CASE WHEN attribute = 'height'  THEN value END) AS [height],
  MAX(CASE WHEN attribute = 'width'   THEN value END) AS [width]
FROM ItemAttributes AS ATR
WHERE itemid IN(5,6)
GROUP BY itemid

Here is the result set:

itemid artist           name             type       height width
------ ---------------- ---------------- ---------- ------ ------
5      Claude Monet     Field of Poppies Oil        19.625 25.625
6      Vincent Van Gogh The Starry Night Oil        28.750 36.250

The PIVOT operator allows you to maintain shorter and more readable code to achieve the same results:

SELECT *
FROM ItemAttributes AS ATR
  PIVOT

  (
    MAX(value)
    FOR attribute IN([artist], [name], [type], [height], [width])
  ) AS PVT
WHERE itemid IN(5,6)

As with most new features, understanding the PIVOT operator comes with experimentation and use. Some of the elements in the PIVOT syntax are apparent and only require that you figure out the relationship of these elements to the query that does not use the new operator. Others are hidden.

You may find the following terms helpful in understanding the semantics of the PIVOT operator:

table_expression

The virtual table on which the PIVOT operator works (the part in the query between the FROM clause and the PIVOT operator): ItemAttributes AS ATR in this case.

pivot_column

The column from table_expression whose values you want to rotate into result columns: attribute in this case.

column_list

The list of values from pivot_column that you want to present as result columns (in the parentheses followed by the IN clause). These must be expressed as legal identifiers: [artist], [name], [type], [height], [width] in this case.

aggregate_function

The aggregate function that you use to generate the data or column values in the result: MAX() in this case.

value_column

The column from table_expression that you use as the argument for aggregate_function: value in this case.

group_by_list

The hidden part—ALL columns from table_expression excluding pivot_column and value_column which are used to group the result: itemid in this case.

select_list

The list of columns following the SELECT clause that might include any column(s) from group_by_list and column_list. Aliases can be used to change the name of the result columns: * in this case returns all columns from group_by_list and column_list.

The PIVOT operator returns one row for each unique value in group_by_list as if you had a query with a GROUP BY clause and specified those columns. Notice that group_by_list is implied; it is not specified explicitly anywhere in the query. It contains all columns from table_expression excluding pivot_column and value_column. Understanding this is probably the key to understanding why queries you write with the PIVOT operator work as they do, and why you might get errors in some cases.

Possible result columns include values from group_by_list and <column_list>. If you specify an asterisk (*), the query returns both lists. The data part of the result columns or the result column values are calculated by aggregate_function with value_column as the argument.

The following color-highlighted code illustrates the different elements in the query using the PIVOT operator:

SELECT * -- itemid, [artist], [name], [type], [height], [width]
FROM ItemAttributes AS ATR
  PIVOT
  (
    MAX(value)
    FOR attribute IN([artist], [name], [type], [height], [width])
  ) AS PVT
WHERE itemid IN(5,6)

And the following relates the different elements to the query that does not use the PIVOT operator:

SELECT
  itemid,
  MAX(CASE WHEN attribute = 'artist'  THEN value END) AS [artist],
  MAX(CASE WHEN attribute = 'name'    THEN value END) AS [name],
  MAX(CASE WHEN attribute = 'type'    THEN value END) AS [type],
  MAX(CASE WHEN attribute = 'height'  THEN value END) AS [height],
  MAX(CASE WHEN attribute = 'width'   THEN value END) AS [width]
FROM ItemAttributes AS ATR
WHERE itemid IN(5,6)
GROUP BY itemid

Note that you must explicitly specify the values in <column_list>. The PIVOT operator does not provide an option to derive those dynamically from pivot_column in a static query. You can use dynamic SQL to construct the query string yourself to achieve this.

Taking the previous PIVOT query a step further, suppose you want to return, for each auction item, all attributes relevant to paintings. You want to include those attributes that appear in AuctionItems and those that appear in ItemAttributes. You might try the following query, which returns an error:

SELECT *
FROM AuctionItems AS ITM
  JOIN ItemAttributes AS ATR
    ON ITM.itemid = ATR.itemid
  PIVOT

  (
    MAX(value)
    FOR attribute IN([artist], [name], [type], [height], [width])
  ) AS PVT
WHERE itemtype = 'Painting'

Here is the error message:

.Net SqlClient Data Provider: Msg 8156, Level 16, State 1, Line 1

The column 'itemid' was specified multiple times for 'PVT'.

Remember that PIVOT works on table_expression, which is the virtual table returned by the section in the query between the FROM clause and the PIVOT clause. In this query, the virtual table contains two instances of the itemid column—one originating from AuctionItems and the other from ItemAttributes. You might be tempted to revise the query as follows, but you will also get an error:

SELECT ITM.itemid, itemtype, whenmade, initialprice, 
  [artist], [name], [type], [height], [width]
FROM AuctionItems AS ITM
  JOIN ItemAttributes AS ATR
    ON ITM.itemid = ATR.itemid
  PIVOT
  (
    MAX(value)
    FOR attribute IN([artist], [name], [type], [height], [width])
  ) AS PVT
WHERE itemtype = 'Painting'

Here is the error message:

.Net SqlClient Data Provider: Msg 8156, Level 16, State 1, Line 1
The column 'itemid' was specified multiple times for 'PVT'.
.Net SqlClient Data Provider: Msg 107, Level 15, State 1, Line 1
The column prefix 'ITM' does not match with a table name or alias name used in the query.

As mentioned earlier, the PIVOT operator works on the virtual table returned by table_expression and not on the columns in the select_list. The select_list is evaluated after the PIVOT operator performs its manipulations and can refer only to group_by_list and column_list. That is why the ITM alias is no longer recognized in the select_list. If you understand this, you realize that you should provide PIVOT with a table_expression that contains only the columns you want to work on. This includes the grouping columns (only one occurrence of itemid plus itemtype, whenmade and initialprice), the pivot column (attribute), and the value column (value). You can achieve this by using CTEs or derived tables. Here is an example using a CTE:

WITH PNT
AS
(
  SELECT ITM.*, ATR.attribute, ATR.value
  FROM AuctionItems AS ITM
    JOIN ItemAttributes AS ATR
      ON ITM.itemid = ATR.itemid
  WHERE ITM.itemtype = 'Painting'
)
SELECT * FROM PNT
  PIVOT
  (
    MAX(value)
    FOR attribute IN([artist], [name], [type], [height], [width])
  ) AS PVT

Here is the result set:

itemid itemtype whenmade ini