Counting Your Rows Before They Hatch

 

Microsoft Corporation

September, 1998

Attempting to Pass a Count Value to GetRows

I just got back from Vbits, where I stood in the back of one of the more popular sessions that discussed Client/Server application development. The speaker (who will remain nameless) was energetically extolling the virtues of his sample application code when I noticed that he was showing a segment of code something like the following:

Function MyGetData(IDWanted as Integer) as Variant
Dim rs as rdoResultset
Dim rowsfound as Integer
Dim SQL as string
SQL = "Select Count(*) From MyTable Where ID = " & IDWanted
Set rs = db.OpenResultset(SQL, rdOpenForwardOnly, rdConcurReadOnly)
rowsfound = rs(0) 'The count of rows where ID = 
SQL = "Select * From MyTable Where ID = " & IDWanted
Set rs = db.OpenResultset(SQL, rdOpenForwardOnly, rdConcurReadOnly)
MyGetData = GetRows(rowsfound)
Exit Function

I gulped and left the room knowing that we were going to have to do some damage control when I got back. When I talked with the speaker, a well-known consultant in his own right, he said that this approach was a well-accepted way to deal with the problem of not knowing how many rows were going to result from a query. I disagreed and we agreed to disagree and leave it at that. However, so that you don't make the same mistake, let's take a closer look at the problem and how various approaches deal or don't deal with the issues.

Counting Rows: One of the Issues

First, let's assume you are building a query that returns an unknown, but finite number of rows. You know by experience that the number of hits would be in the range of say, 0 to 50 rows. This upper limit might be much higher, but you know what I think about large resultsets, so your WHERE clause limits this number to a fairly small number of rows. You also want to get these rows out of the result set quickly, and you have learned that GetRows does a neat job of this with a single command—not to mention that it's dramatically faster than looping through the rows.

The problem is that you don't know exactly how many rows will result from your query. You want to know because it makes coding GetRows easier. If the reason is because you simply want to code a For/Next loop with a finite counter, consider using a Do/Until loop instead. However, once you know the number of rows, you think that it's easier to simply "get" just that many rows and you don't have to add additional code to check for more (or fewer) rows. But consider that GetRows can be coded to fetch "all" of the rows simply by passing a very large number—say 500,000—well beyond the scope of possibility. RDO simply stops fetching when all the rows have been read, so this is not a problem. As a matter of fact, in ADO, you can use the adGetRowsRest option that simply gets all of the (remaining) rows. Since the Variant array is dynamically allocated, you don't have to worry about size and you can tell how many rows were returned by using the Ubound function as illustrated in the GetRows examples in online help.

Understanding the Query Processor and Chickens

However, you also need to consider the problem that the query processor has to deal with. For example, imagine the farmer's young daughter going into the hen house to collect eggs for the first time. She would like to know how many eggs are there so she can bring the right number of egg-cartons from the house. However, because she has no experience with egg gathering, she goes into the hen house and simply counts the eggs first. With this knowledge, she returns from the house with the exact number of cartons, only to discover she has more eggs than the cartons will hold. She returns with more cartons, and starts to fetch again, only to discover that now she has far more cartons than needed. What she did not know is that the family dachshund has an affinity for eggs (and chickens) and visited the hen house between her visits. Your experience with fetching rows can have the same effect—and leaving out data rows that are there, but not fetched, can leave egg on your face.

How Count(*) Leaves the Problem Unsolved

Looking closer at our code we discover that SELECT COUNT(*) does not return the number of rows you might get in subsequent queries. Sure, COUNT(*) indicates how many rows would have been returned if the operation was performed at the instant the first query was executed. However, multiuser DBMS systems do not work that way any more than chickens lay eggs on demand. What if a number of rows are added to the database between the time the first and second queries were executed? This is not that unlikely, because the data we are working on is often the data that other users are also working on.

The other problem with this example is performance—especially since it was offered as a way to boost performance. In actuality, this approach cuts performance nearly in half. That is, both queries must

  • Be compiled (they were submitted as SQL statements and not stored procedures or even rdoQuery objects).
  • Be executed (this means that the query engine has to fetch all of the rows that qualify into RAM and count them up the first time, and return all of the rows the next time).
  • Return a result set (this means that the ODBC and RDO layers have to deal with building the rdoResultset object).
  • Make at least one (and more likely several) round trip(s) each to the server to set up and deal with the queries.

Yes, once the first query is executed, the data pages that hold the rows would probably reside in the cache. But no, that does not mean that the query processor would be able to skip this step—it would still have to redetermine which rows qualified and at least make sure these pages were loaded. If the system is busy or the data is scattered over a dozen pages, the chance of them still being in RAM gets even slimmer. Sure, no data I/O would be needed if the page was in memory and this can save time. However, if this is a complex query, the chances of the results being the same are slimmer yet—because the number of potentially inclusive operations would be greater.

If you insist on using the Count(*) approach, make sure you include a HOLD LOCK in the query to make sure that the count you get the first time matches the operation run the second time. The problem with this approach is that it hurts scalability and concurrency.

Another Idea...But Still a Problem

I picked up a thread on VBDATA-L@peach.ease.lsoft.com lately that also attempted to address this problem. They suggest that instead of doing a "SELECT count(*)", you can also get the number of records for a table by querying SQL Server's sysindexes table. Try this procedure that takes the table name as a parameter and returns the number of rows.

Create Procedure GetRows @TableName VarChar(255)
As
SELECT idx.rows
FROM sysobjects obj, sysindexes idx
WHERE obj.type = 'u'
AND obj.name = @TableName
AND obj.id = idx.id
AND idx.indid < 2

The problem with this approach is that the sysindexes rowcount is only updated when you use UPDATE STATISTICS, so its value is likely very old. Of course, if the table is static, this might not be a problem.

Doing It Right the First Time

A more reasonable approach would be to code GetRows so it fetches either:

  • A number of rows greater than the expected rows. This way, you are reasonably certain you will get all rows. It's like taking a few more egg cartons to the hen house based on yesterday's egg production. No one ever broke eggs when they dropped empty cartons.
  • A manageable number of rows to be processed by the user with the option of returning for additional rows as needed.

In either case, you will want to check the resultset EOF property to ensure that all rows are fetched—even if you choose to ignore my advice and use COUNT(*).