Test Run - Graph-Based Shortest-Path Analysis with SQL
By James McCaffrey | December 2012
In this column I explain how to implement a graph-based shortest-path analysis in situations where you have graph data stored in a SQL database and want to use native T-SQL language processing. Traditional shortest-path approaches assume that the graph representation is stored in a data structure in machine memory. But large graphs, such as those that represent social networks, often won’t fit into memory, so storing and processing them using SQL is one option. The best way to understand where I’m headed is to look at the example graph in Figure 1, and the screenshot of a demo run in Figure 2.
Figure 1 Demo Graph for Shortest-Path Analysis
Figure 2 Shortest-Path Analysis with T-SQL
The graph shown in Figure 1 is artificially small and has eight nodes (sometimes called vertices or vertexes) with IDs 111 through 888. You can imagine that the nodes represent people or computers. The directional arrows (usually called edges) between nodes are relationships. You can imagine that an arrow between two nodes represents an exchange of e-mails. In this example the graph edges have weights indicated by the values 1.0 or 2.0. Edge weights can have different meanings depending on your particular problem scenario. For example, the weights can represent some measure of social affinity or an indicator of when a message was sent.
The term “shortest path” can have different meanings. Suppose you’re interested in the shortest path between user 222 and user 444. This could mean the smallest number of hops to get from 222 to 444, which would be 4 (222 to 333 to 666 to 777 to 444). Or, shortest path could mean the smallest sum of weights between 222 and 444, which would be 5.0 (1.0 + 1.0 + 1.0 + 2.0).
In the left window in Figure 2, you can see I’m working with a database called dbShortPathDemo in SQL Server Management Studio. In the upper-right window I have a T-SQL script named ShortestPath.sql that defines and populates the demo database with information that corresponds to the graph in Figure 1, and code that defines a stored procedure named usp_ShortestPath. The stored procedure has just been called to analyze the shortest path between user 222 and user 444. In the bottom-right window you can see the results. The shortest path is given by the string “444;777;666;333;222.” The shortest path in terms of weight is 5.0 (displayed without the decimal). The bottom-right part of the image shows the final state of a table named tblAlgorithmData, which is the key to implementing the shortest-path code.
In the sections that follow, I’ll walk you through the T-SQL code that generated the screenshot in Figure 2 so you’ll be able to adapt the code to meet your own problem scenarios. I’m assuming you’re primarily a non-SQL developer (meaning you most often use C# or some other procedural programming language) but have some basic SQL knowledge. I present all the T-SQL code explained in this article; the code is also available at msdn.microsoft.com/magazine/msdnmag1212.
Dr. James McCaffrey works for Volt Information Sciences Inc., where he manages technical training for software engineers working at the Microsoft Redmond, Wash., campus. He has worked on several Microsoft products including Internet Explorer and MSN Search. McCaffrey is the author of “.NET Test Automation Recipes” (Apress, 2006). He can be reached at firstname.lastname@example.org.
Thanks to the following technical expert for reviewing this article: Shane Williams