Islands and Gaps in Sequential Numbers
As you know, relational databases are set-based and aren't particularly suited for row-at-a-time processing. However, in situations with heavily fragmented data (lots of gaps), you may find that row-based processing does better than set-based processing. Alexander Kozak demonstrates.
SQL Server developers often use either the IDENTITY column or the MAX() function (for example, MAX(id) + 1) to create unique identifiers for a table. I personally like the identities. In my opinion, they're convenient and flexible. One of the things that I don't like about them, though, is the way they inevitably create gaps in the sequence of IDs. An INSERT may fail and have to be rolled back, for example.
Another type of fragmentation in sequential numbers can happen in data warehouses. Usually you load data into a warehouse dailywell, nightly, to be more precise. You know the drill: The client sends one or more data files with transactions for the previous business day. If everything is running smoothly, each file will contain a range of continuous sequential numbers (IDs) for each business day. Let's assume that OrderIDs 715236 to 832455 occurred on January 5, 2004. But, what if on January 8, 2004, you received another portion of a January 5 transaction? In this case you'll have two alternatives:
- Delete previously loaded data and try to reload data for the same business day (that arrived on January 5 and 8) one by one. If your January 8 package includes transactions for different business dayslet's say for January 4, 5, and 7 as well as January 8you'll also need to separate data before or during data loading.
- Leave everything as is and simply load the data file as usual, including the new portion of data.
The first solution will result in physical fragmentation (gaps) in sequential numbers because of deletions. The second will create the separated islands of sequential continuous numbers for the same business day (logical fragmentation). There are a variety of situations that can lead to fragmentations in sequential numbers, and, sooner or later, most DBAs will have to answer the question "What are the gaps and (or) islands in my data?"
There are lots of ways to find the gaps (islands) in sequential continuous numbersall you need is a loop operatorbut that doesn't mean they're all equally efficient. Let's start by taking a look at a row-based solution, implemented in T-SQL. But first, let's create a sample table with gaps:
CREATE TABLE gaps(gapID int NOT NULL PRIMARY KEY) INSERT INTO gaps values(1) INSERT INTO gaps values(2) INSERT INTO gaps values(3) INSERT INTO gaps values(4) INSERT INTO gaps values(6) INSERT INTO gaps values(7) INSERT INTO gaps values(8) INSERT INTO gaps values(10) INSERT INTO gaps values(14) INSERT INTO gaps values(15) INSERT INTO gaps values(16) INSERT INTO gaps values(17) INSERT INTO gaps values(38)
I want to show the result as start and end of each data island (group). For our sample table gaps, we'd expect:
StartGroup EndGroup 1 4 6 8 10 10 14 17 38 38
You can find a row-based solution in Listing 1.
Listing 1. Finding gaps with a row-based solution.
IF EXISTS (SELECT name FROM sysobjects WHERE name = N'spu_findIslands' AND type = 'P') DROP PROCEDURE spu_findIslands GO CREATE PROCEDURE spu_findIslands AS SET NOCOUNT ON DECLARE @i int, @min int, @max int, @startGroup int, @EndGroup int SELECT @min = MIN(gapID), @max = MAX(gapID) FROM gaps IF @min IS NULL and @max IS NULL RETURN(-1) CREATE TABLE #result(startOfGroup int not null, endOfGroup int not null) SET @i = @min WHILE(@i <= @max) BEGIN WHILE(@i <= @max) BEGIN IF EXISTS(SELECT gapID FROM gaps WHERE gapID = @i) BEGIN SET @startGroup = @i BREAK END SELECT @i = @i + 1 END WHILE(@i <= @max) BEGIN SELECT @i = @i + 1 IF NOT EXISTS(SELECT gapID FROM gaps WHERE gapID = @i) BEGIN SET @EndGroup = @i -1 INSERT INTO #result VALUES(@startGroup, @EndGroup) BREAK END END END SELECT startOfGroup, endOfGroup FROM #RESULT GO
The algorithm, implemented in stored procedure spu_findIslands, is a simple state machine, flip-flopping between two stable states: gaps and islands. This solution works fine on our simple table, but it has some obvious drawbacks:
- It will be slow for the large tables.
- It's T-SQL specific, not standard ANSI SQL.
A set-based solution
If you carefully think about islands and gaps, you'll realize that the beginning of an "island" is characterized by a missing predecessor. In terms of SQL, this statement can be expressed by query with correlated sub-query, where tables in outer and inner queries are the same:
SELECT gapID FROM gaps tbl1 WHERE NOT EXISTS(SELECT * FROM gaps tbl2 WHERE tbl1.gapID - tbl2.gapID = 1)
Similarly, the last numbers of the island can be found as:
SELECT gapID FROM gaps tbl1 WHERE NOT EXISTS(SELECT * FROM gaps tbl2 WHERE tbl2.gapID - tbl1.gapID = 1)
We can use a JOIN to combine the two sets:
SELECT * FROM (SELECT gapID FROM gaps tbl1 WHERE NOT EXISTS(SELECT * FROM gaps tbl2 WHERE tbl1.gapID - tbl2.gapID = 1)) t1 INNER JOIN (SELECT gapID FROM gaps tbl1 WHERE NOT EXISTS(SELECT * FROM gaps tbl2 WHERE tbl2.gapID - tbl1.gapID = 1)) t2 ON t1.gapID <= t2.gapID
The final step will be grouping by start position of the island. The end position of the island will correspond to MIN(endPosition) in the group. Listing 2 shows the complete query.
Listing 2. A set-based solution for identifying islands.
SELECT t1.gapID as startOfGroup, MIN(t2.gapID) as endOfGroup FROM (SELECT gapID FROM gaps tbl1 WHERE NOT EXISTS(SELECT * FROM gaps tbl2 WHERE tbl1.gapID - tbl2.gapID = 1)) t1 INNER JOIN (SELECT gapID FROM gaps tbl1 WHERE NOT EXISTS(SELECT * FROM gaps tbl2 WHERE tbl2.gapID - tbl1.gapID = 1)) t2 ON t1.gapID <= t2.gapID GROUP BY t1.gapID
Listing 3 shows a similar solution for the gaps.
Listing 3. A set-based solution for finding gaps.
SELECT t1.col1 as startOfGroup, MIN(t2.col1) AS endOfGroup FROM (SELECT col1 = gapID+1 FROM gaps tbl1 WHERE NOT EXISTS(SELECT * FROM gaps tbl2 WHERE tbl2.gapID - tbl1.gapID = 1) AND gapID <> (SELECT MAX(gapID) FROM gaps)) t1 INNER JOIN (SELECT col1 = gapID-1 FROM gaps tbl1 WHERE NOT EXISTS(SELECT * FROM gaps tbl2 WHERE tbl1.gapID - tbl2.gapID = 1) AND gapID <> (SELECT MIN(gapID) FROM gaps)) t2 ON t1.col1 <= t2.col1 GROUP BY t1.col1
All of the solutions will also work for duplicate rows in heap. You can test that situation, if you'll drop the primary key in table gaps and then insert some duplicate rows. As a sort of ad hoc test for ANSI SQL compliance, I also tested the set-based solution for islands (Listing2) in Oracle9i, and everything worked fine.
After writing the code, I was pretty proud of my solutions, but I wanted to see how they'd fare with some "real" data. (To be honest, I was sure the set-based solutions would be blazingly fast; I just needed "proof.") So, I inserted about 80,000 more rows into the gaps table and ran my query for islands (Listing 2). Let's just say the CPU usage immediately jumped to 100 percent and stayed there until I was finally able to halt the query. I even had to restart SQL Server from another computer to get all the resources back. I was dumbfounded, and (dumbly?) re-ran the querynot just once, but two or three times to be sure. Sadly, I was able to duplicate the results.
Here I would ask you to stop your reading. Take a pen and paper, computerwhatever you needand try to figure out what's wrong with Listing 2. Can you see the problem in the query? Well, I couldn't, and I didn't have any clue! It turns out that it had nothing to do with the logic of my set-based solution. Read on.
I decided to take a look at the estimated execution plan one more time. (Of course, I'd done this initially with the 13-row gaps table. As I recalled, I didn't find anything interesting or outstanding then.) I started clicking and highlighting different operators of the estimated execution plan almost without any goal, and suddenly one of the nested loops showed 2.8536023E+9 for an estimated row count. Wow! For my table with 80,317 rows, a self-join would yield 6.450820489E+9. Why the discrepancy? It turns out the query optimizer was trying to create the intermediate result set for my inner join with 40 percent of Cartesian product size. Well, nice try, but I figured I'd better roll up my sleeves and try again.
I came up with three more set-based ANSI-compliant solutions: two for islands and one for gaps (see Listing 4). They all worked fine for my 13-row version of gaps, but didn't work at all for the 80,317-row version. It took me a while to understand why.
Listing 4. More set-based solutions.
-- SOLUTION #2 FOR ISLANDS SELECT t1.gapID as startGroup, (SELECT MIN(t2.gapID) FROM (SELECT gapID FROM gaps tbl1 WHERE NOT EXISTS (SELECT * FROM gaps tbl2 WHERE tbl2.gapID - tbl1.gapID = 1)) t2 WHERE t1.gapID <= t2.gapID GROUP BY t1.gapID) AS endGroup FROM (SELECT gapID FROM gaps tbl1 WHERE NOT EXISTS(SELECT * FROM gaps tbl2 WHERE tbl1.gapID - tbl2.gapID = 1)) t1 -- SOLUTION #3 FOR ISLANDS SELECT MIN(start) AS startGroup, endGroup FROM (SELECT g1.gapID AS start, (SELECT min(g2.gapID) FROM gaps g2 WHERE g2.gapID >= g1.gapID and NOT EXISTS (SELECT * FROM gaps g3 WHERE g3.gapID - g2.gapID = 1)) as endGroup FROM gaps g1) T1 GROUP BY endGroup -- SOLUTION #2 FOR GAPS SELECT startGroup = c1+1, endGroup = c2-1 FROM (SELECT t1.gapID AS c1, MIN(t2.gapID) AS c2 FROM gaps t1 INNER JOIN gaps t2 ON t1.gapID < t2.gapID GROUP BY t1.gapID) t3 WHERE c2-c1 > 1
Arithmetic of joins
Recall the basic subtraction rule, if A = B or A = B + 0, then A - B = 0. Right, obviously. But, what's obvious for humans isn't necessarily so for a computer. More precisely, the query optimizer of SQL Server treats cases A = B and A - B = 0 differently for joins. If you check the estimated execution plan for the inner join, joining table gaps to itself:
SELECT * FROM gaps t1 INNER JOIN gaps t2 ON t1.gapID = t2.gapID
you'll see that the estimated row count for the top input in the graphical execution plan (outer input table) equals 13, for the inner (bottom) input table equals 1, and for the nested loops/inner join element equals 13.
For the cross join:
SELECT * FROM gaps t1 cross JOIN gaps t2
the estimated row count for the outer input table equals 13, for the inner input table equals 13, and for the nested loops/inner join element equals 169.
Now, let's try to use an expression in the JOIN condition where the columns from both tables (t1.gapID and t2.gapID) are located on the same left side of the equals sign:
SELECT * FROM gaps t1 inner JOIN gaps t2 ON t1.gapID-t2.gapID = 0
For the outer input table, the estimated row count equals 13, for the inner input table 13, and for the inner loops 46! Bingohere's the problem! For table gaps with 80,000 rows, the query optimizer will estimate almost 23,000,000 rows for processing in a nested loop. (Warning! Be careful experimenting since you risk ending up as I did, having to restart SQL Server and/or reboot the server.)
So, the problem with my solutions (Listings 2, 3, and 4) was the wrong usage of JOIN conditions in the derived tables. When I changed those, everything started to work (see Listing 5).
Listing 5. The final set-based solution.
-- ISLANDS SELECT t1.gapID as startOfGroup, MIN(t2.gapID) as endOfGroup FROM (SELECT gapID FROM gaps tbl1 WHERE NOT EXISTS(SELECT * FROM gaps tbl2 WHERE tbl1.gapID = tbl2.gapID + 1)) t1 INNER JOIN (SELECT gapID FROM gaps tbl1 WHERE NOT EXISTS(SELECT * FROM gaps tbl2 WHERE tbl2.gapID = tbl1.gapID + 1)) t2 ON t1.gapID <= t2.gapID GROUP BY t1.gapID -- ISLANDS SELECT t1.gapID as startGroup, (SELECT MIN(t2.gapID) FROM (SELECT gapID FROM gaps tbl1 WHERE NOT EXISTS (SELECT * FROM gaps tbl2 WHERE tbl2.gapID = tbl1.gapID + 1)) t2 WHERE t1.gapID <= t2.gapID GROUP BY t1.gapID) AS endGroup FROM (SELECT gapID FROM gaps tbl1 WHERE NOT EXISTS(SELECT * FROM gaps tbl2 WHERE tbl1.gapID = tbl2.gapID + 1)) t1 -- GAPS SELECT t1.col1 as startOfGroup, MIN(t2.col1) AS endOfGroup FROM (SELECT col1 = gapID+1 FROM gaps tbl1 WHERE NOT EXISTS(SELECT * FROM gaps tbl2 WHERE tbl2.gapID = tbl1.gapID + 1) AND gapID <> (SELECT MAX(gapID) FROM gaps)) t1 INNER JOIN (SELECT col1 = gapID-1 FROM gaps tbl1 WHERE NOT EXISTS(SELECT * FROM gaps tbl2 WHERE tbl1.gapID = tbl2.gapID + 1) AND gapID <> (SELECT MIN(gapID) FROM gaps)) t2 ON t1.col1 <= t2.col1 GROUP BY t1.col1
Results and conclusions
Table 1 shows the results of my tests, performed with a Dell PowerEdge 2600 server with dual 2.4 GHz CPUs and 2GB of RAM.
Table 1. Test results for the solutions in Listing 1 and Listing 5.
Numbers of rows (islands)
Row-based solution for islands (Listing 1)
3 min, 41 sec
6 min, 41 sec
Set-based solution #1 for islands (Listing 5)
2 min, 41 sec
Set-based solution #2 for islands (Listing 5)
17 min, 14 sec
Set-based solution #3 for gaps (Listing 5)
2 min, 48 sec
When you analyze the results in Table 1, you can draw some conclusions. First of all, when dealing with databases, try to find a set-based, ANSI-compliant solution for your task. Usually it works faster than a row-based solution, but, under certain conditions, your set-based solution can change the behavior. Look at what happened to my beautiful queryset-based solution #1 for islands (Listing 5). The query gets bogged down quickly when it encounters a large amount of fragmentation (gaps). As you can see, there's a reverse proportionality between the number of islands and execution time in solution #1 (52 times more gaps runs 52 times slower). For confirmation, I ran a quick test. I deleted each third row in table gaps when it had 5,199,633 rows:
DELETE gaps WHERE gapID % 3 = 0
That operation decreased the number of rows by 30 percent, but increased the number of islands 5502-fold (1733317 / 315). According to my calculations, I would have to wait almost 16 hours for the result. (I canceled the query after 30 minutes of running both CPUs at 100 percent.) So, what happens to my query (solution #1 for islands in Listing 5) when I increase the number of islands? The problem is in the derived tables t1 and t2. When the number of islands grows, they produce more rows, and this, in turn, results in a rapid increase of total rows produced by the JOIN. And the JOIN condition t1.gapID <= t2.gapID only makes things worse.
On the other hand, the row-based solution (Listing 1) behaves pretty well and is able to process the case with 1,733,317 islands. Of course, the growing number of rows and/or growing number of islands affects the performance of such a solution, but its performance degrades much less rapidly than the set-based ones.
So, what's the conclusion? Does it mean that a row-based solution is better than set-based ones?
Well, it depends. If you have an application and want to "plug" the gaps by inserting new rows into them, the fragmentation will probably be relatively low. If you have a very large data warehouse and sometimes you need to purge and then reload the data, the fragmentation will also be low. For cases like thiswhere you have control over the number of gaps (islands)the best solution will be a set-based one. But if you check the estimated execution plan and realize that the sequence of the numbers has high fragmentation, it will probably behoove you to run a row-based solution.
Sidebar: SQL Server Professional Articles Related to IDENTITY
Readers may want to refer to several related articles on maintaining (or plugging) sequence numbers:
- Dr. Tom's Workshop: Identity Crisis (April 2001)
- Dr. Tom's Workshop: Gap Analysis (June 2001)
- Dr. Tom's Workshop: Generating Sequence Numbers (November 2003)
- SQL Essentials: Incrementing String IDs (June 2000)
- Tip: Identity Crisis, by Adina Reeve (February 2000)
- Tip: Generating Incrementing Keys without Using Identity, by John Thorpe (October 1999)
To find out more about thisjournalname and Pinnacle Publishing, visit their Web site at http://www.pinpub.com/
Note: This is not a Microsoft Corporation Web site. Microsoft is not responsible for its content.
This article is reproduced from the April 2004 issue of thisjournalname. Copyright 2004, by Pinnacle Publishing, Inc., unless otherwise noted. All rights are reserved. thisjournalname is an independently produced publication of Pinnacle Publishing, Inc. No part of this article may be used or reproduced in any fashion (except in brief quotations used in critical articles and reviews) without prior consent of Pinnacle Publishing, Inc. To contact Pinnacle Publishing, Inc., please call 1-800-788-1900.