Random Sampling in T-SQL
Whether you've got a gargantuan data warehouse, a huge transaction database, or a smaller workgroup database or data mart, it's not uncommon to want to "sample" your data. Although selecting a random sample of rows isn't a natural SQL operation, over the years, SQL Server professionals have developed a variety of techniques to sample a table. In this feature article, Brian Connolly presents these techniques, discusses their advantages and disadvantages, and performs a statistical evaluation of the population samples that each technique produces.
Imagine a company with a data warehouse that contains tens of millions of records on individuals. The analysts want to use it to do statistical analysis or data mining. Because it can be time-consuming to work with such a large number of records, analysts often select a small, random subset of records, work with the sample to build a model or draw some conclusions, and then test the model against the full population. Alternatively, an analyst could use OLAP, data mining, or some other type of BI tool to study the full customer population and identify some pattern or patterns that seem to represent a business opportunity. The smart analyst would then test these conclusions against a small random subset of "real" customers to ensure that his or her conclusions were valid.
So random sampling is important. But there's a conceptual hurdle to random sampling within SQL: Since SQL is a set-oriented language, the only subset operations are those based on column criteria or join operations. There's no notion of a "random sample" of rows. There are three techniques that SQL Server developers use to surmount this problem:
Add a column to the table, fill it with values from the Rand() function, and select an ordered subset based on that new column.
The same as the first technique, except that the Rand() column is seeded with a row primary key.
Use the T-SQL NewID() function as the random operator.
The RandomPopulation table
We'll use the following table to implement and compare each of these techniques:
create table RandomPopulation ( rowid int not null, pure_random float, seeded_random float NULL, newid_random uniqueidentifier NULL, pure_random_order int not null, seeded_random_order int not null, newid_random_order int not null)
This table represents the population to be sampled. Rowid is the primary key. If the table has 100,000 rows, a 1 percent random sample should return a set of 1000 random Rowids. Pure_random, seeded_random, and newid_random are the columns that I use to employ the three techniques listed earlier. For example, pure_random will have values obtained from successive calls to the Rand() function.
Pure_random_order will contain a rowid value. If there are 100,000 rows in the table, rowids will be between 0 and 99,999. The table row with the smallest pure_random number will have a pure_random_order value of 0. The row with the largest pure_random_number will have a pure_random_order value of 99,999. The following query returns a sample of RandomPopulation with @samplecount rows:
select rowid into #tpure from RandomPopulation where pure_random_order < @rowsample
Seeded_random_order and newid_random_order are used in the same way to order the seeded_random and newid_random columns, and select samples into temporary tables #tseeded and #tnewid. The following sections show how the three techniques listed earlier are used to populate the random values in the table and determine their relative order.
Technique 1: Use a Rand() column
The fundamental obstacle to using the Rand() function is that Rand() is only evaluated once per query. The only way to assign a different random number to every row in the table is to make a separate assignment for each row, either when creating the row or updating the table one row at a time:
while (@row < @rowcount) begin insert into RandomPopulation (rowid,pure_random, pure_random_order, seeded_random_order, newid_random_order) values(@row,rand(),-1,-1,-1) select @row = @row+1 end
Once pure_random is assigned, pure_random_order can be assigned to reflect the row order according to the Rand() function:
select IDENTITY(int, 0,1) as row_order,* into #tpureorder from /* need the derived table to force ordering need top 100 percent to allow an order by */ (select top 100 percent rowid, pure_random as random from RandomPopulation order by pure_random asc) x update RandomPopulation set pure_random_order = row_order from #tpureorder where #tpureorder.rowid = RandomPopulation.rowid
Techniques 2 and 3: Seeded random and NewID()
The following code shows how the other two techniques can be used. If Rand is seeded with a row-dependent value, it's evaluated not once per query, like Rand(), but rather once for each row. NewID() is also evaluated once per row, and the unique identifiers that it returns can be ordered. Both of these techniques allow assignment of "random" columns in a single query:
update RandomPopulation set seeded_random = rand( (( (rowid+1)) % 1000000) * (datepart(ms, GETDATE()) +1) ), newid_random = newid()
The seed chosen for Rand is a function of the row primary key and also the current millisecond. Rowid is included in the seed function because it forces Rand to be executed once per row.
Getdate() is evaluated once per query. The reason it's included in the seed function is so that when multiple tests of these techniques are executed, a variety of seeds are used. Note that if I run this test two times, both runs will have the same rowids, so in the absence of Getdate(), or another "random" function in the seed (such as @@idle), both Rand calls would return the same values. The additions and the modulo operation in the seed function guarantee non-zero numbers within the int datatype range.
The use of seeded random is problematic. Rand has been designed (and tested) to give a stream of statistically independent, uniformly distributed numbers on successive calls after it's seeded. However, the "seeded random" technique used earlier essentially takes the first random number from a different seed for each row. So while Rand(X), Rand(), Rand() ... Rand() would be expected to yield a valid stream of random numbers, there's no reason to expect Rand(N1), Rand(N2) ... Rand(NNN) to be random. For this reason, and also because the tests that follow demonstrate how bad it can be, I don't recommend you use this technique.
NewID() has the same advantage as seeded random: It's evaluated once per row, so "random" values can be assigned to each row in a table in a single query. However, it's evaluated by the operating system, not T-SQL. In addition, while unique identifiers can be ordered, the comparison functions aren't documented. Unlike Rand(), NewID() has not been designed for use as a statistically valid random number generator.
Once seeded_random and newid_random are assigned, seeded_random_order and newid_random_order can be assigned using the same algorithm that was used to assign pure_random_order earlier.
Random sampling experiment design
You can use any of the three techniques to select a random sample of rows from a table. Rand() is expensive because it requires the use of INSERTs, cursors, or single-row UPDATEs. Seeded Rand() and NewID() are easier to use, but it's important to test the statistical validity of the samples they produce.
There are many tests of random number generators. The Chi-squared test divides the output into "buckets" and ensures that on average the number of items in each bucket falls within the expected count for each bucket. For example, a random 1 percent sample of a RandomPopulation table with 100,000 rows will return 1000 rowids that are scattered between 0 and 99,999. Divide the space into 40 "buckets" from 0-2499, 2500-4999 ... 97500-99999. The random 1 percent sample should fill each bucket with about 25 rowids from RandomPopulation. Naturally, some buckets will have more and some will have less. The Chi-square test evaluates this deviation to see whether there's evidence of non-randomness. [See www.georgetown.edu/faculty/ballc/webtools/web_chi_tut.html for one of many tutorials on Chi-square. The site also has an associated Web Chi calculator you can use.Ed.]
Given k buckets, the Chi-square statistic that measures independence is as shown in Figure 1.
I use the following table to compute the Chi-square statistic for each of the three sampling techniques:
create table chi_squared_hist ( range_low int, range_high int, expected_rows int, pure_rows int null, seeded_rows int null, newid_rows int null )
The ranges and expected counts are dependent on the RandomPopulation row count and the number of buckets:
select @buckettop = @rowcount -1 while (@buckettop > 0) begin insert into chi_squared_hist (range_low, range_high, expected_rows) values (@buckettop - @rowcount/@bucketcount+1, @buckettop, (@rowcount/@bucketcount)*@sampleprop) select @buckettop = @buckettop - @rowcount/@bucketcount end
I compute the counts within each rowid range for each sampling technique as follows:
update chi_squared_hist set pure_rows = row_count_hist.pure, seeded_rows = row_count_hist.seeded, newid_rows = row_count_hist.newid from ( select range_low, range_high,expected_rows, count(all #tpure.rowid) as pure, count(all #tseeded.rowid) as seeded, count(all #tnewid.rowid) as newid from chi_squared_hist h inner join RandomPopulation rp on rp.rowid between h.range_low and h.range_high left outer join #tpure on rp.rowid = #tpure.rowid left outer join #tseeded on rp.rowid = #tseeded.rowid left outer join #tnewid on rp.rowid = #tnewid.rowid group by range_low, range_high, expected_rows ) row_count_hist where row_count_hist.range_low = chi_squared_hist.range_low
Using the counts in each bucket, the Chi-square statistic is computed like so:
update up_test_random_selection_results set pure_chi_sq = sum_diff.pure_chi_sq, seeded_chi_sq = sum_diff.seeded_chi_sq, newid_chi_sq = sum_diff.newid_chi_sq from ( select pure_chi_sq = sum( power (expected_rows - pure_rows,2) /1.0/expected_rows), seeded_chi_sq = sum( power (expected_rows - seeded_rows,2) /1.0/expected_rows), newid_chi_sq = sum( power (expected_rows - newid_rows,2) /1.0/expected_rows) from chi_squared_hist ) sum_diff where run_number = @run_number
There are other tests for randomness. Most of them test for serial independence, which means that there's no correlation between successive numbers, or that there's no correlation between arbitrary groups of numbers that have a fixed lag between them. However, in the sampling case, the order of the rowids that are selected from RandomPopulation isn't important. Only the coverage is important, and this is why the Chi-square test is used to measure the effectiveness of the T-SQL sampling techniques.
The following stored procedure (see the Download) performs all of the preceding steps N times, where N is less than the runs parameter. Rowcount is the RandomPopulation size, sampleprop is the proportion of RandomPopulation to be sampled, and bucket count is the number of buckets to be used for the Chi-square analysis.
create procedure up_test_random_selection @runs int, @rowcount int, @sampleprop float, @bucketcount int
I called the up_test_random_selection procedure, specifying 30 runs, 100,000 RandomPopulation rows, .01 for a 1 percent sample, and 40 buckets. Consequently, each 2,500 interval range should have about 25 rowids. Since 40 buckets were used, there are 39 degrees of freedom. We can say that the row sample for each technique is independent and uniformly distributed at the 90 percent confidence level if the Chi-square statistic for each technique is Chi-Sq[.90, 39]. The threshold level of this statistic from a Chi-square function table is 50.660. As long as the Chi-square total for each technique is less than this level, there's no reason to suspect that the sample isn't random.
The pure random technique is clearly the best method for obtaining a random sample, consistently yielding a Chi-square statistic less than the threshold value (see Figure 2). The average Chi-square statistic for the NewID() method is slightly higher than the average for the pure random technique, and it has considerably more variance (94.2 vs. 57.9). It exceeded the threshold value on three of the 30 runs. However, we're working with a 90 percent confidence threshold, so this doesn't necessarily prove that its samples are non-random.
Seeded random is either very good (almost too good) or extremely bad. In run 12, its Chi-square statistic is 538.7, more than 10 times the threshold. It's clearly not random, as the histogram in Figure 3 shows.
This shows the observed counts for the first quarter of the buckets in run 12. Three of the 10 buckets are completely empty for the seeded random technique, and a similar pattern holds true for the other 30 buckets in run 12. Buckets that aren't empty have far more observed rows than the expected value of 25.
Conclusions and recommendations
The seeded random technique should never be used. It gives results that are dangerous and misleading, because it can give seemingly random results in one run, and completely non-random results in the next.
Pure random is the best technique, provided that you can afford the overhead of row-at-a-time operations to set up the table for sampling, but if this overhead is too great, consider using NewID(). Bear in mind, though, that because NewID() is assigned by the operating system, you should always test its behavior on the target systemperhaps using the code included in the accompanying Download as a starting point.
To find out more about SQL Server Professional and Pinnacle Publishing, visit their Web site at http://www.pinpub.com/
Note: This is not a Microsoft Corporation Web site. Microsoft is not responsible for its content.
This article is reproduced from the March 2004 issue of SQL Server Professional. Copyright 2004, by Pinnacle Publishing, Inc., unless otherwise noted. All rights are reserved. SQL Server Professional is an independently produced publication of Pinnacle Publishing, Inc. No part of this article may be used or reproduced in any fashion (except in brief quotations used in critical articles and reviews) without prior consent of Pinnacle Publishing, Inc. To contact Pinnacle Publishing, Inc., please call 1-800-788-1900.