This article may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist. To maintain the flow of the article, we've left these URLs in the text, but disabled the links.
Trees in SQL Server
R.L. Parker
Many applications—both commercial and homegrown—have to deal with
hierarchical, tree-structured data. However, because of their recursive nature,
trees can be intimidating to model, program, and display. While SQL Server
doesn't provide native support for hierarchical data (like Oracle does via its
START WITH…CONNECT BY clause, for example), it's not that hard to add. R.L.
Parker explains.
If you're not exactly sure what I mean by hierarchical, tree-structured data,
consider the Northwind database's Employees table. In it, a Northwind employee's
boss is identified by a value in the ReportsTo column. That makes it easy to
find out who reports to a given manager. If Mary Manager's EmployeeID is
"1," then her direct reports are identified by:
SELECT FirstName, LastName, EmployeeID
FROM Employee,
WHERE ReportsTo = 1
But Mary might be a mid-level manager. Suppose you also want to know who
reports to the people who report to her? And the people who report to them, and
so on, recursively, until you get to the bottom of the tree where Dilbert lives?
Hold that thought—we'll come back to it.
Figure 1 shows part of the logical model for the
Northwind database. The Employees table has a recursive relationship to itself
because the ReportsTo column has a foreign-key relationship to the EmployeeID
column. Some people call this kind of relationship in a model a
"fishhook."
Another way of describing this situation is that each row in Employees might
have many related child rows (the row's direct reports) in the Employees table.
And, of course, one of the child rows might have, in turn, many related children
of its own. This kind of recursive, hierarchical data is often depicted as a
tree structure, as shown in the org chart in
Figure 2.
In this article, we're going to develop a simple application that shows how
to model, program, and display this kind of data. But first, here's a quick
review of the terminology and some rules about data trees:
- An element of data in the tree is called a node.
- A node has either one parent or none.
- If a node doesn't have a parent, it's called a root node.
- If a node has a parent, the node is referred to as the child of the
parent node.
- A node can have an arbitrary number of children.
- A node can't be the parent of itself.
One of the things we'll need to do in our database design is make sure that
these rules are observed.
The application
The application we're going to implement is designed to track and
display a list of tasks. Each task has a name, a description, a type, and a
status. A given task can have zero or more subtasks.
The display requirement is that, for a given task, the application must be
able to show the task and its related task tree.
Figure 3
shows a possible implementation of the solution, implemented as an ASP.NET
application, and
Figure 4 shows a logical model for a
table that will support our Task application. It should look familiar, as it
follows the same pattern that the Employees table does.
Listing 1 includes the script for
the Task table, including its constraints.
Listing 1. The Task table.
CREATE TABLE [dbo].[Task] (
[ID] [int] IDENTITY (1000, 1) NOT NULL,
[Name] [varchar] (16)
COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[Description] [varchar] (128)
COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
[Type] [tinyint] NOT NULL,
[Status] [tinyint] NOT NULL,
[ParentID] [int] NULL --line 9
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[Task] WITH NOCHECK ADD
CONSTRAINT [DF_Task_Description] DEFAULT ('None')
FOR [Description],
CONSTRAINT [DF_Task_Type] DEFAULT (0) FOR [Type],
CONSTRAINT [DF_Task_Status] DEFAULT (0) FOR [Status],
CONSTRAINT [PK_Task] PRIMARY KEY CLUSTERED
([ID]) ON [PRIMARY] ,
CONSTRAINT [CK_Task] CHECK ([ID] <> [ParentID]),
CONSTRAINT [CK_Task_1] CHECK ([Type] >= 0
and [Type] <= 5),
CONSTRAINT [CK_Task_2] CHECK ([Status] >= 0
and [Status] <= 2)
GO
ALTER TABLE [dbo].[Task] ADD
CONSTRAINT [FK_Task_Task] FOREIGN KEY
([ParentID]) REFERENCES [dbo].[Task] ([ID])
GO
Remember the rules about trees? Rule #2 is enforced by the combination of
line 9, which allows a node's ParentID to be NULL, and the ALTER TABLE statement
at the end that defines a foreign key on the ParentID column. If there's a value
in the ParentID column, it must be the primary key of a row in the table.
But not just any row—rule #6 says that a node can't be a parent of itself. If
the data violated rule #6, it would lead to infinite recursion! The ParentID
constraint defined ensures that this doesn't happen.
An iterative solution
Figure 5 shows the raw data for our Task
application. All the information that an application needs is there—but it's
not in a very useful format!
The main problem with the raw data is that there's no guarantee about the
ordering of the nodes. The requirement that the application imposes on the data
is simple—the application must see each parent node before seeing any
of its children. And, although there's no way to construct a SELECT statement's
ORDER BY clause that will guarantee that, we can write a stored procedure
and a helper function that will do the job. Listing
2 and Listing 3 define the
FindRoot function and BuildTree stored proc, respectively.
Listing 2. The FindRoot function.
CREATE FUNCTION dbo.FindRoot(@id int)
RETURNS int
AS
BEGIN
DECLARE @parentID int
SELECT @parentID = parentID
FROM Task
WHERE id = @id
WHILE @parentID <> NULL
BEGIN
SELECT @id = @parentID
SELECT @parentID = parentID
FROM Task
WHERE id = @id
END
RETURN @id
END
Listing 3. The BuildTree stored procedure.
CREATE PROCEDURE BuildTree(@id int)
AS
SET NOCOUNT ON
CREATE TABLE #results(level tinyint, id int,
name varchar(16), description varchar(128),
type tinyint, status tinyint, parentID int)
DECLARE @name varchar(16)
DECLARE @description varchar(128)
DECLARE @type tinyint
DECLARE @status tinyint
DECLARE @parentID int
DECLARE @level tinyint
SELECT @level = 1
DECLARE @root int
SELECT @root = dbo.FindRoot(@id)
CREATE TABLE #stack (id int, level smallint)
INSERT INTO #stack VALUES (@root, @level)
WHILE @level > 0
BEGIN
IF EXISTS (SELECT * FROM #stack WHERE level = @level)
BEGIN
SELECT @id = s.id, @name = t.name, @description =
t.description, @type = t.type, @status = t.status,
@parentID = IsNull(t.parentID, 0)
FROM #stack s INNER JOIN Task t
ON p.id = s.id
WHERE level = @level
INSERT INTO #results VALUES (@level, @id, @name,
@description, @type, @status, @parentID)
DELETE FROM #stack
WHERE level = @level
AND id = @id
INSERT #stack
SELECT id, @level + 1
FROM Task
WHERE parentID = @id
IF @@ROWCOUNT > 0
BEGIN
SELECT @level = @level + 1
END
END--IF EXISTS
ELSE
BEGIN
SELECT @level = @level - 1
END
END -- WHILE
SELECT * FROM #results
ORDER BY level, name
FindRoot
FindRoot is pretty straightforward—given a primary key for row
x in the Task table, it will return the primary key for the root node of the
tree that x is in. If the node's ParentID is NULL, then the node
is the
root, so FindRoot returns x's ID. Otherwise, FindRoot iteratively travels up x's
ParentID links until it gets to the root of the tree.
BuildTree
Given the primary key for a row in the Task table, BuildTree will
return a resultset that lists the node's entire task family, beginning with the
root node of the tree. Each parent node is listed in the resultset before any of
its child nodes. Just what the application needs!
BuildTree uses two temporary tables and the FindRoot function. Temp table
#stack is used as a LIFO stack that holds the intermediate results as each node
is processed. I start by calling the FindRoot helper function and push the root
node onto the stack. Then, in the WHILE loop, I pick a row from #stack. After
adding the selected row to the resultset, I delete it from the stack, and then
push all of its children onto the stack. When the WHILE loop terminates, I've
processed all the parent nodes. I return the results to the client by selecting
all the rows in #results. I reuse level in the ORDER BY clause to ensure that
all parent rows are returned before any of their children.
A set-oriented solution
BuildTree is based on code that you'll find if you type
"hierarchical information" in the index of SQL Server's Books Online,
but it isn't very efficient. It processes one row in the #stack table for each
trip through the WHILE loop.
I wrote another stored procedure called BuildTreeNew that takes a more
set-oriented approach to solving the problem. It goes through the WHILE loop
once for each level in the tree, processing all nodes at that level
simultaneously. It's simpler than BuildTree too, disposing of the #stack table
altogether. Instead, it reuses the #results table for intermediate results as it
walks down the tree. You'll find it in this month's
>Download
file.
An XML-based solution?
What about XML? XML documents can represent hierarchical data by
nesting the elements in the document. Since SQL Server 2000 provides the ability
to return data as an XML document instead of a recordset by using the FOR XML
clause on the SELECT statement, maybe XML is the way to go?
Using the FOR XML AUTO mode, SQL Server will generate a hierarchical tree.
Sounds promising! But, unfortunately for the Task application, the structure of
the hierarchy is based on the table names in the SELECT statement. To show you
what I mean, I'm going to create a couple of tables that have a simple
parent-child relationship and load the tables with a few rows of data (see Listing 4).
Listing 4. A simple parent-child relationship.
CREATE TABLE [dbo].[Parent] ([ID] [int]
IDENTITY (1, 1) NOT NULL, [Name] [varchar] (8) COLLATE
SQL_Latin1_General_CP1_CI_AS NOT NULL) ON [PRIMARY]
GO
CREATE TABLE [dbo].[Child] ([ID] [int]
IDENTITY (1, 1) NOT NULL, {Name] [varchar] (8) COLLATE
SQL_Latin1_General_CP1_CI_AS NOT NULL,
[ParentID] [int] NULL) ON [PRIMARY]
GO
ALTER TABLE [dbo].[Child] WITH NOCHECK ADD
CONSTRAINT [PK_Child] PRIMARY KEY CLUSTERED
([ID]) ON [PRIMARY]
GO
ALTER TABLE [dbo].[Parent] WITH NOCHECK ADD
CONSTRAINT [PK_Parent] PRIMARY KEY CLUSTERED
([ID]) ON [PRIMARY]
GO
insert into Parent values ("P1")
insert into Parent values ("P2")
insert into Parent values ("P3")
insert into Child values ("P1.1",1)
insert into Child values ("P2.1",2)
insert into Child values ("P1.2",1)
insert into Child values ("P3.1",3)
insert into Child values ("P3.2",3)
insert into Child values ("P3.3",3)
Given such a parent-child relationship, here's the SQL that will produce an
XML document instead of a resultset.
SELECT p.name, c.name
FROM Parent p INNER JOIN Child c
ON p.id = c.parentID
ORDER BY p.id
FOR XML AUTO
Here's the resulting XML. I formatted it for the sake of easier readability.
<p name="P1">
<c name="P1.1"/>
<c name="P1.2"/>
</p>
<p name="P2">
<c name="P2.1"/>
</p>
<p name="P3">
<c name="P3.1"/>
<c name="P3.2"/>
<c name="P3.3"/>
</p>
Pretty cool, but it doesn't really solve my problem for the Task application
because I only have one table to select from!
There's another FOR XML alternative: FOR XML EXPLICIT. To use XML EXPLICIT,
two special columns must be included in the resultset, and the data columns must
be aliased in a special way so that FOR XML EXPLICIT will know how to structure
the output. A full explanation of FOR XML EXPLICIT is beyond the scope of this
article. But, like FOR XML AUTO, FOR XML EXPLICIT is intended for conventional
parent-child relationships rather than for arbitrarily deep tree structures, so
it probably isn't the solution I need for the Task application.
Conclusion
Data is sometimes hierarchical in nature. It pays to be able to
model, program, and display hierarchical data structures. Although SQL Server
2000 doesn't provide native support for hierarchical data, it's relatively easy
to support it with the help of a user-defined function and a set-oriented stored
procedure.
Download >TREESSQL.TXT
Sidebar: More on Trees and Nodes
SQL Server Professional
subscribers may also want to (re)read these
articles:
- Mark Scanlon's August 2000 article, "Hierarchical Data and Scope
Checking."
- Nolan Zak's September 2001 piece, "Hierarchical Data and Scope
Checking in Detail."
- Tom Moreau's March 2001 column, "Feeding XML to a Stored
Procedure."
- Anton Britten's February 2002 article, "Getting the XML Back
In."
- Alexander Yankelevich's November 1999 article, "Climbing the
Tree."
I'd also encourage Hardcore Visual Basic subscribers to (re)read Rod
Stephens' classic columns:
- "Tree Trimming Revisited" (June 1998).
- "The Tree Control," Parts 1 and 2 (December 2000 and January
2001).
- "XmlTree" (February 2002).
—kw
To find out more about SQL Server Professional and Pinnacle Publishing, visit their website at
http://www.pinpub.com/html/main.isx?sub=57
Note: This is not a Microsoft Corporation website. Microsoft is not responsible for its content.
This article is reproduced from the February 2003 issue of SQL Server Professional. Copyright 2003, by Pinnacle Publishing, Inc., unless otherwise noted. All rights are reserved. SQL Server Professional is an independently produced publication of Pinnacle Publishing, Inc. No part of this article may be used or reproduced in any fashion (except in brief quotations used in critical articles and reviews) without prior consent of Pinnacle Publishing, Inc. To contact Pinnacle Publishing, Inc., please call 1-800-493-4867 x4209.