Test Run

Graph-Based Shortest-Path Analysis with SQL

James McCaffrey

Download the Code Sample

David McCaffreyIn this column I explain how to implement a graph-based shortest-­path analysis in situations where you have graph data stored in a SQL database and want to use native T-SQL language processing. Traditional shortest-path approaches assume that the graph representation is stored in a data structure in machine memory. But large graphs, such as those that represent social networks, often won’t fit into memory, so storing and processing them using SQL is one option. The best way to understand where I’m headed is to look at the example graph in Figure 1, and the screenshot of a demo run in Figure 2.

Demo Graph for Shortest-Path Analysis
Figure 1 Demo Graph for Shortest-Path Analysis

Shortest-Path Analysis with T-SQL
Figure 2 Shortest-Path Analysis with T-SQL

The graph shown in Figure 1 is artificially small and has eight nodes (sometimes called vertices or vertexes) with IDs 111 through 888. You can imagine that the nodes represent people or computers. The directional arrows (usually called edges) between nodes are relationships. You can imagine that an arrow between two nodes represents an exchange of e-mails. In this example the graph edges have weights indicated by the values 1.0 or 2.0. Edge weights can have different meanings depending on your particular problem scenario. For example, the weights can represent some measure of social affinity or an indicator of when a message was sent.

The term “shortest path” can have different meanings. Suppose you’re interested in the shortest path between user 222 and user 444. This could mean the smallest number of hops to get from 222 to 444, which would be 4 (222 to 333 to 666 to 777 to 444). Or, shortest path could mean the smallest sum of weights between 222 and 444, which would be 5.0 (1.0 + 1.0 + 1.0 + 2.0).

In the left window in Figure 2, you can see I’m working with a database called dbShortPathDemo in SQL Server Management Studio. In the upper-right window I have a T-SQL script named ShortestPath.sql that defines and populates the demo database with information that corresponds to the graph in Figure 1, and code that defines a stored procedure named usp_ShortestPath. The stored procedure has just been called to analyze the shortest path between user 222 and user 444. In the bottom-right window you can see the results. The shortest path is given by the string “444;777;666;333;222.” The shortest path in terms of weight is 5.0 (displayed without the decimal). The bottom-right part of the image shows the final state of a table named tblAlgorithmData, which is the key to implementing the shortest-path code.

In the sections that follow, I’ll walk you through the T-SQL code that generated the screenshot in Figure 2 so you’ll be able to adapt the code to meet your own problem scenarios. I’m assuming you’re primarily a non-SQL developer (meaning you most often use C# or some other procedural programming language) but have some basic SQL knowledge. I present all the T-SQL code explained in this article; the code is also available at msdn.microsoft.com/magazine/msdnmag1212.

Setting up the Demo Database

To create a demo database I launched SQL Server Management Studio and connected to my server machine. I used SQL Server 2008, but the code presented here, with minor modifications, should work with most earlier and all newer versions of SQL Server. After connecting, I clicked File | New Query to get an editing window, then typed T-SQL code to create my dummy database:

use master
go
if exists(select * from sys.sysdatabases 
  where name='dbShortPathDemo')
drop database dbShortPathDemo
go
create database dbShortPathDemo
go
use dbShortPathDemo
go

Here I used all the default parameter values for the create database statement. In many scenarios you’ll be working with an existing database with real data, but I recommend experimenting with a small demo database. I prefer to use lowercase T-SQL code for the most part, which might offend SQL purists. I used the mouse to select these nine lines of code and then hit the F5 key to execute them. After verifying the demo database was created, I saved the script as ShortestPath.sql.

Next, I created a table within the demo database to hold the data that defines the graph to be analyzed:

create table tblGraph
(
fromNode bigint not null,
toNode bigint not null,
edgeWeight real not null
)
go

I used the mouse to highlight just these seven lines of code (so I wouldn’t re-execute the previous lines) and hit F5 to create the table. I use type bigint for node IDs. Other common node ID types you might run into include int for a relatively simple employee ID and varchar(11) for a Social Security number in xxx-xx-xxxx format. I use type real for column edgeWeight. T-SQL type real is just a convenient alias for float(24), which is similar to C# type double. In many scenarios you might not have an explicit edge weight between nodes, and you can omit the edgeWeight column.

Next, I populated table tblGraph by entering, highlighting and executing these 15 lines of code:

insert into tblGraph values(111,222,1.0)
insert into tblGraph values(111,555,1.0)
insert into tblGraph values(222,111,2.0)
insert into tblGraph values(222,333,1.0)
insert into tblGraph values(222,555,1.0)
insert into tblGraph values(333,666,1.0)
insert into tblGraph values(333,888,1.0)
insert into tblGraph values(444,888,1.0)
insert into tblGraph values(555,111,2.0)
insert into tblGraph values(555,666,1.0)
insert into tblGraph values(666,333,2.0)
insert into tblGraph values(666,777,1.0)
insert into tblGraph values(777,444,2.0)
insert into tblGraph values(777,888,1.0)
go

If you refer to Figure 1, you’ll see that each insert statement corresponds to one of the arrows in the graph.

Next, I created indexes on the fromNode and toNode columns in tblGraph to improve performance when the shortest-path stored procedure accesses the graph data:

create nonclustered index idxTblGraphFromNode on tblGraph(fromNode)
go
create nonclustered index idxTblGraphToNode on tblGraph(toNode)
go
I then created and populated a table to hold a list of unique node IDs:
create table tblNodes
(
nodeID bigint primary key
)
go
insert into tblNodes
select distinct fromNode from tblGraph
union
select distinct toNode from tblGraph
go

A table of just node IDs isn’t required for shortest-path analysis, but it’s useful if you want to perform error checking on the start node and end node input parameters to the shortest-path stored procedure. By applying the primary key clause to the nodeID column, I implicitly create a clustered index on that column. The SQL union statement, used with two or more select-distinct statements, is a convenient way to fetch a list of unique values from a table.

The Algorithm Data Table

The key to understanding and modifying the shortest-path stored procedure presented in this article is understanding a table of information named tblAlgorithmData. You can create the table by executing these T-SQL statements:

create table tblAlgorithmData
(
nodeID bigint not null,
dist real not null,
pred bigint not null,
inQueue bit not null
)
go
create index idxTblAlgorithmDataNodeID on tblAlgorithmData(nodeID)
create index idxTblAlgorithmDataDist on tblAlgorithmData(dist)
create index idxTblAlgorithmDataPred on tblAlgorithmData(pred)
create index idxTblAlgorithmDataInQueue on tblAlgorithmData(inQueue)
go

The table will have (at most) one row for each unique node in the graph, so for the example in this article, the table will have (at most) eight rows. Because nodeID is a unique identifier, I could’ve defined it as a primary key column. The dist column is the current distance from the start node to the node that has nodeID. This dist value changes as the stored procedure executes but holds the final distance when the stored procedure finishes. If you look at Figure 2, you can see that the dist value for node 444, the end node, is 5.0 units when the stored procedure finished.

The pred column holds the immediate predecessor to the node with nodeID of the shortest path from parameter start node to end node. For example, in Figure 2, the end node is 444. Its predecessor is 777. The predecessor of 777 is 666. The predecessor of 666 is 333. The predecessor of 333 is 222, the start node. Putting these pred values together gives the shortest path from end node to start node: 444 to 777 to 666 to 333 to 222. Notice that the algorithm I use gives just one shortest path, even in situations where there are two or more paths with the same shortest total distance; in this example, the path 444 to 777 to 666 to 555 to 222 also has a total distance of 5.0 units.

The inQueue column holds a bit value (functionally similar to C# type Boolean) that indicates whether the associated node should be reexamined as part of the shortest path. Put another way, if the value of inQueue is 1, the associated node needs to be examined by the algorithm as a neighbor to the current node in the algorithm. If the value of inQueue is 0, the associated node doesn’t need to be examined as a neighbor. The column name inQueue is somewhat misleading because the algorithm doesn’t really use an explicit queue, so you might want to consider using a name such as isActive. I use the name inQueue because in procedural programming language implementations of shortest-path algorithms, an explicit queue is often used and so the name is somewhat traditional.

The Algorithm

The algorithm I present in this article is a variation of Dijkstra’s Shortest Path (DSP) algorithm. In non-SQL pseudo-code, the variation of the DSP algorithm I use is presented in Figure 3.

Figure 3 Shortest-Path Algorithm

set dist of startNode to 0.0
set pred of startNode to undefined
set inQueue of startNode to true
while there are any nodes with inQueue = true
  set u = node with smallest dist and inQueue = true
  if dist of u is infinity break
  if u = endNode break
  fetch all neighbors of u
  foreach neighbor v that has inQueue = true
    if v has not been seen before then
      set dist of v to infinity
      set pred of v to undefined
      set inQueue of v to true
    end if
  end foreach
  set inQueue of u = false
  foreach neighbor v of u
    if ((dist from startNode to u) + (dist from u to v) <
      curr dist to v) then
        set dist of v to (dist from startNode to u) + (dist from u to v)
        set pred of v to u
    end if
  end foreach
end while
if (u != endNode) then
  path from startNode to endNode is unreachable
else
  retrieve path using pred values
  shortest path distance = dist of endNode
end if

The DSP algorithm is one of the most famous algorithms in computer science and you can find many painfully detailed explanations on the Internet, but very few complete implementations that use SQL. Although short, the DSP algo­rithm is very tricky, and the only way I was able to fully understand it was to trace through several examples using paper and pencil. The essence of the algorithm is that at a given node u, you’ll know the current shortest distance from the start node to u. Then you find all neighbors of u. For each neighbor node, v, you know the current distance from the start node to v. You can look up the distance from u to v. If the distance from start to u plus the distance from u to v is shorter than the current distance from start to v, you know you’ve found a shorter path from start to v. The inQueue variable prevents the algorithm from revisiting a node once it’s known that revisiting that node will not find a shorter path.

The Stored Procedure

I implemented the shortest path as a T-SQL stored procedure. The stored procedure definition begins:

create procedure usp_ShortestPath
  @startNode bigint,
  @endNode bigint,
  @path varchar(5000) output,
  @totDist real output
as
begin
  ...

I prepend the stored procedure usp_ShortestPath name with “usp” (user-defined stored procedure) to distinguish it from system stored procedures (“sp”), extended stored procedures (“xp”) and CLR stored procedures (“csp”). Parameters @startNode and @endNode are input parameters. The stored procedure has two result output parameters, @path and @totDist. The @path is arbitrarily set to type varchar(5000)—you might need to increase the 5000 maximum size depending on the type of node ID you’re using and the maximum path you expect.

Next, I initialize table tblAlgorithmData with information for the start node:

truncate table tblAlgorithmData
insert into tblAlgorithmData
values (@startNode, 0.0, -1, 1)
...

The truncate statement erases the contents of tblAlgorithmData from any previous call to usp_ShortestPath. Recall the columns for tblAlgorithmData are nodeID, dist, pred and inQueue. The distance from the start node to itself is 0.0, the predecessor of the start node is set to -1 to indicate undefined, and the inQueue bit is set to 1 to indicate true.

This code assumes that tblAlgorithmData has already been created in the current database as a permanent table. In some situations you might not have the security permissions needed to create the table; in these situations you can either ask the appropriate database administrator to do so for you, or you can create tblAgorithmData inside the stored procedure as a temp table or possibly a table variable.

This code also assumes that the values of both @startNode and @endNode are valid. If you have a table of all node IDs, you could check for this along the lines of:

if not exists(select nodeID from tblNodes where @startNode = nodeID)
  // Error out in some way //

Next, I prepare the main processing loop:

declare @u bigint
declare @d real = 0.0
declare @done bit = 0
while @done = 0
begin
...

Here @u is the node ID of the current node in the algorithm. Variable @d is a convenience and holds the distance from @startNode to node @u (as stored in tblAlgorithmData). The @done variable is a dummy loop control variable. You might want to put other explicit stopping criteria in the loop such as a maximum number of iterations, maximum hop count or maximum total path distance.

Inside the main processing loop, the first step is to find the value of @u—the node ID of the current node. This is the node that has the smallest dist column value in tblAlgorithmData and also has inQueue = 1:

select top 1 @u = nodeID, @d = dist from tblAlgorithmData
where inQueue = 1
order by dist asc
...

At this point the stored procedure checks for two loop exit conditions:

if @d = 2147483647.0 break
if @u = @endNode break
...

If the distance from the start node to @u (stored in @d) is equal to the somewhat mysterious-looking value of 2147483647.0, this means that the end node is not reachable from the start node. The value of 2147483647.0 is a stand-in for infinity. You can use any large value that can’t actually appear as a distance. I picked 2147483647.0 because 2147483647 (without the decimal) is the largest value for SQL type int. The other break condition simply checks to see if the end node has been reached. Because of the way the algorithm searches the graph using a breadth-first approach, if the end node is hit, the shortest path has been found.

Next, the stored procedure checks each neighbor of the current node @u and if a neighbor hasn’t been seen before, the neighbor is initialized into tblAlgorithmData:

insert into tblAlgorithmData
select toNode, 2147483647.0, -1, 1 from tblGraph where fromNode = @u
and not exists (select * from tblAlgorithmData where nodeID = toNode)
...

This SQL code is a bit subtle if you’re a coder who rarely uses SQL. The insert statement is conditional upon the not exists statement, which can be interpreted as, “if the neighbor node ID (value toNode) is not already in tblAlgorithmData (as nodeID).” For each neighbor node where the conditional is true, the insert statement initializes tblAlgorithmData with the neighbor’s nodeID, a dist value of infinity (as 2147483647.0), a predecessor of undefined (as -1), and an inQueue of true (as 1). Working with SQL statements that operate on sets of data, rather than iterating through a collection using a loop as you would with a procedural programming language, can be a difficult paradigm to master.

In this algorithm, nodes are initialized when they first appear as neighbor nodes. A significant alternative approach is to initialize all graph nodes at the very beginning of the algorithm. The problem with this approach is that for large graphs, populating tblAlgorithmData with possibly millions of nodes can take quite a bit of time.

At this point, the stored procedure marks the current node as processed:

update tblAlgorithmData set inQueue = 0
where nodeID = @u
...

Next, the stored procedure checks each neighbor to the current node to see if a new, shorter path to the neighbor has been found and, if so, updates the entries in tblAlgorithmData for that neighbor node:

update tblAlgorithmData
  set dist = @d + tblGraph.edgeWeight, pred = @u
  from tblAlgorithmData inner join tblGraph on 
    tblAlgorithmData.nodeID = tblGraph.toNode
  where tblGraph.fromNode = @u and tblAlgorithmData.inQueue = 1
    and (@d + tblGraph.edgeWeight) < tblAlgorithmData.dist
end -- loop
...

This is by far the trickiest part of the shortest-path implementation. Table tblGraph is joined to tblAlgorithmData so that all data can be used in the update statement. The current node’s ID is stored in @u and this value is matched to the fromNode in tblGraph. The neighbors to @u are stored in toNode in tblGraph, which is linked to the nodeID of tblAlgorithmData. Recall that @d holds the distance from the start node to the current node @u. The value in the edgeWeight column will hold the distance from @u to the neighbor. The sum of these two distances is a possible new shorter path from the current distance from the start node to the neighbor, which is stored in column dist. If a new shorter distance has been found, the dist column is updated with that new shorter distance and the predecessor to the neighbor node is recorded as @u, the current node. In situations where you define shortest path to mean the fewest number of hops between start node and end node, you would replace dist = @d + tblGraph.edgeWeight with dist = @d + 1.

At this point the main processing loop has exited and the cause of the exit can be checked:

if (@u != @endNode)
  begin
    set @path = 'NOT REACHABLE'
    set @totDist = -1.0
  end
  else
  begin
    set @path = ''
    declare @currNode bigint
    set @currNode = @endNode
  ...

If the value in @u is the ID of the end node, the end node has been found; if @u is anything other than the ID of end node, the loop exited before finding the end node. In this case I set the path string to ‘NOT REACHABLE’ and assign an arbitrary shortest-path total distance of -1.0 to indicate not reachable. If the end node was in fact found, I initialize the shortest-path string to the empty string and set up local variable @currNode to iterate through tblAlgorithmData to construct the shortest-path string.

The stored procedure definition code concludes by constructing the shortest-path string and assigning the shortest-path total distance:

...
    while @currNode != @startNode
    begin
      set @path = @path + cast(@currNode as varchar(19)) + ';'
      set @currNode = (select pred from tblAlgorithmData 
        where nodeID = @currNode)
    end
    set @path = @path + cast(@startNode as varchar(19))
    select @totDist = dist from tblAlgorithmData 
      where nodeID = @endNode
  end -- else
end -- usp_ShortestPath definition
go

Here I use the T-SQL string concatenation “+” operator. I use varchar(19) because the largest possible value for my nodeID type of bigint is 9,223,372,036,854,775,807, which has 19 digits.

With the stored procedure created, I can find the shortest path between any two nodes; for example:

declare @startNode bigint
declare @endNode bigint
declare @path varchar(5000)
declare @totDist real
set @startNode = 222
set @endNode = 444
exec usp_ShortestPath @startNode, @endNode, 
  @path output, @totDist output
select @path
select @totDist

Wrapping Up

Shortest-path graph analysis is likely to increase in importance as enterprises gather more data and store that data in a cloud environment. The code and explanation presented in this article should give you a solid basis for getting started with performing shortest-path analyses on your data. An important alternative to the native T-SQL language approach described in this article is to use a CLR stored procedure. Based on my experiences, a shortest-path analysis using a CLR stored procedure can give you improved performance at the expense of increased complexity. I’ll present a CLR stored procedure approach to graph-based shortest-path analysis in a future article.


Dr. James McCaffrey works for Volt Information Sciences Inc., where he manages technical training for software engineers working at the Microsoft Redmond, Wash., campus. He has worked on several Microsoft products including Internet Explorer and MSN Search. McCaffrey is the author of “.NET Test Automation Recipes” (Apress, 2006). He can be reached at jammc@microsoft.com.

Thanks to the following technical expert for reviewing this article: Shane Williams

 

Rate: