MSDN Magazine > Issues and Downloads > 2000 > July >  Cutting Edge: Writing Custom OLE DB Providers U...
This article may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist. To maintain the flow of the article, we've left these URLs in the text, but disabled the links.
MSDN Magazine

Writing Custom OLE DB Providers Using ATL
Dino Esposito

Code for this article: Cutting0700.exe (108KB)

O
LE DB is the technology that converts the Universal Data Access (UDA) philosophy into concrete programming calls. The idea behind it is surprisingly simple. The data source is wrapped by a module that exposes a standard set of COM interfaces. The interaction between a client module (called an OLE DB consumer) and the data source holder (called an OLE DB provider) always takes place through COM. Therefore, as you might guess, OLE DB consumers and providers are really COM objects. To work with them, all you need are some robust COM skills. Refer to Don Box's House of COM column in the July 1999 issue of MSJ for insightful observations about the way OLE DB was first conceived and implemented.
      In this column, I'll briefly discuss some important concepts of OLE DB and emphasize its superior architecture. Then I'll move on to writing custom OLE DB providers. In particular, I'll show you how to improve the code generated by the ATL wizard to determine which columns of data were requested by a given call.

Advantages of OLE DB

Figure 1 OLE DB Architecture
Figure 1 OLE DB Architecture
      Microsoft is touting OLE DB as the preferred technology for data-driven applications deployed over the Web, as well as in other scenarios. OLE DB is supposed to replace ODBC over time in most typical enterprise scenarios. Architecturally speaking, OLE DB is superior to ODBC because of the way it was designed. The use of COM in OLE DB eliminates the need for a runtime layer to identify the data source holder (the ODBC drivers) and to translate from the ODBC SDK to the actual entry points that drivers expose.
      Furthermore, the use of COM, and in particular the use of OLE DB services, makes the schema more flexible since OLE DB services are intermediate components that work like interceptors and can enhance both the overall functionality and the performance of an OLE DB provider (see Figure 1). Such services can be enabled and disabled individually to guarantee the most flexibility. The architectural advantages of OLE DB services are even more evident if you consider how painful and bothersome it would be to extend an ODBC driver in the same manner.
      One OLE DB service is the Client Cursor Engine that, when invoked, obtains the rowset from the data provider, caches it locally, and then makes it available to the caller through the same IRowset interface. The service can also implement additional interfaces and provide extended functionalities like scrolling and sorting. The Client Cursor Engine is the key element that enables disconnected recordsets in ActiveX® Data Objects (ADO). When using ADO (whether with Visual Basic® or Visual C++®), you automatically invoke the Client Cursor Engine by setting the Connection or Recordset object's CursorLocation property to adUseClient.


Dim rs As New ADODB.Recordset
rs.CursorLocation = adUseClient
Another useful service is the Synchronization service, which allows you to reconnect and resynchronize a previously disconnected ADO recordset.
      As you may have guessed, many of the ADO recordset facilities have been implemented through OLE DB services. Resource pooling and automatic transaction enlistment within Microsoft® Transaction Services (MTS) or Windows® 2000 Component Services are features brought to you by OLE DB services.
      But services are not the most important reason to discard all your legacy ODBC and Remote Data Objects (RDO) code and upgrade to OLE DB and ADO. Direct ODBC calls issued through the ODBC API, or its COM counterpart RDO, can still keep up with OLE DB in terms of raw performance. So you should look carefully before you decide to jump on the OLE DB/ADO bandwagon. This is especially true if you're not using Microsoft SQL Server™ as your database provider.
      When everything else is equal, the choice between ODBC and OLE DB is about nonrelational and hierarchical data. OLE DB gives you the unprecedented power to manipulate both relational and nonrelational data through a common API. Plus, as of version 2.5 of the OLE DB specification, you can also manipulate semistructured and hierarchical data (such as XML data). To learn more about OLE DB 2.5, check out Cutting Edge in the October 1999 issue of Microsoft Internet Developer and Bob Beauchemin's article, "Using OLE DB and ADO 2.5 to Represent Nonlinear, Nonrelational Types of Data," in the November 1999 issue of MSJ.
      OLE DB gives you great flexibility. Through services and data providers, you can expose any data in terms of ADO recordsets, regardless of the storage media or its physical location. In addition, you could even define a custom query language to replace SQL. Such a language could be lightweight, more complex than SQL, or simply closer to the structure of the data you're actually representing. Of course, flexibility always comes at a price. The more flexible you are, the more layers of code you must traverse to provide the expected result.

Is OLE DB Faster than ODBC?

      ODBC is generally faster than OLE DB when you want to fetch sequential rows, and in particular when you do bulk fetching. This is because ODBC takes only one call to retrieve all the needed rows.
      OLE DB is generally faster when you want to fetch individual, nonsequential rowsâ€"for example, every third row of a table. OLE DB takes two calls to actually retrieve data. First you call IRowset::GetNextRows to get an HROW handle to the row. Then you invoke IRowset::GetData to obtain the data stored in a given row. GetData takes the previously retrieved HROW to identify the specific record.
      OLE DB doesn't support bulk fetching in quite the same way as ODBC. You can fetch multiple row handles in OLE DB, but you can only get data for one row at a time. This is clearly demonstrated by the prototypes of both GetNextRows and GetData:


HRESULT GetNextRows(
  HCHAPTER hChapter,
  LONG lRowsOffset,
  LONG cRows,
  ULONG *pcRowsObtained,
  HROW **prghRows);

HRESULT GetData(
  HROW hRow,
  HACCESSOR hAccessor,
  void *pData);
      The exception to this is the OLE DB Provider for SQL Server 7.0 (SQLOLEDB). This provider implements the IRowsetFastLoad interface to enable a consumer application to obtain a true bulk fetch capability from SQL Server 7.0. Once you've established a connection to the data source, set the SQLOLEDB provider-specific property called SSPROP_ENABLEFASTLOAD to a variant value of True, like so:


IDBProperties *pIProps;
DBPROP rgps[1];
DBPROPSET PropSet;
rgps[0].dwPropertyID = SSPROP_ENABLEFASTLOAD;
rgps[0].vValue.boolVal = VARIANT_TRUE;
PropSet.rgProperties = rgps;
pIDBInit->QueryInterface(IID_IDBProperties, &pIProps);
pIProps->SetProperties(1, &PropSet);
Next, create a session. The newly created session exposes an IRowsetFastLoad interface that consumers can access.
      In summary, for general applications working against a traditional database, both ODBC and OLE DB are fine, and in terms of performance, you won't detect a great difference. The advantages of OLE DB are that its design is more uniform than ODBC and it offers the possibility of handling nonrelational data through a common API.

Writing Custom OLE DB Providers

      The two features I just mentionedâ€"OLE DB's uniform design and ability to handle many data typesâ€"are the key factors that make writing OLE DB providers for proprietary data an intriguing new possibility. The ability to make your data accessible through ADO recordsets no matter where it resides and whatever its organization means better interoperability for all sorts of data-driven applications. And to obtain this, you don't need to hire ODBC driver experts. You just need to rely on your existing COM skills. As a COM developer, it's much more likely that you will be asked to write a custom OLE DB provider than extend or enhance an ODBC driver.
      You'll probably use ATL to write your own custom OLE DB providers. The latest Platform SDK ships with a few basic examples. One example of an OLE DB provider that uses the ATL templates can be found on MSDN™ CD 1 and is called ATLMTO. In the MSDN CDs you'll also find some technical articles on ATL template classes for OLE DB.
      In the June 1999 issue of MSJ I wrote an article, "Exposing Your Custom Data In a Standardized Way Through ADO and OLE DB", featuring a provider that returned e-mail messages as a recordset. That example had a significant limitation as do all the providers you write on top of code generated by the ATL Wizard. No matter which command string you use, the OLE DB provider always returns a rowset with the same layout. In other words, there's no way for you to specify which columns of data you want to obtain. You cannot do what a simple SQL query allows:


select firstname,lastname 
from contacts
where lastname like 'A%'
      This SQL query string tells the provider to return a rowset with only two columns of data (firstname and lastname) populated with all the records whose lastname field begins with A. This is SQL, though, and the provider you're working with understands the language. With the e-mail provider described in my June 1999 article, you could narrow the set of returned messages, but not the columns of data. For example, command text like this


select=Joe; about=MSDN; from=2000/04/01
would return all the messages sent by someone whose name contains Joe, with the word MSDN in the subject, and that was received on April 1, 2000, or later. However, the rowset will always contain a fixed set of columns. Is there a way to dynamically set the columns to be returned based on the information embedded in the command text? Yes there is, and I'm going to show you how.

A New OLE DB Provider

Figure 2 New Provider Object
Figure 2 New Provider Object
      To create a new OLE DB provider with ATL, you start a new ATL COM project, choose the in-process option, and add a new Provider object from the Data Access category (see Figure 2). At this point, you can save and build the project. What you end up with is a perfectly functional OLE DB provider that accepts a file specification as the command text and returns records with five columns. The columns map to the fields of the WIN32_FIND_ DATA structure, which is used by functions like FindFirstFile and FindNextFile to walk through the contents of a file system directory.
      Let's suppose you name the ATL project FileSys and call the provider class Files. The progID of the new provider will be FileSys.Files.1. The wizard creates a file called filesRS.h that contains code like this:


BEGIN_PROVIDER_COLUMN_MAP(CFilesWindowsFile)
  PROVIDER_COLUMN_ENTRY("FileAttributes", 1, dwFileAttributes)
  PROVIDER_COLUMN_ENTRY("FileSizeHigh", 2, nFileSizeHigh)
  PROVIDER_COLUMN_ENTRY("FileSizeLow", 3, nFileSizeLow)
  PROVIDER_COLUMN_ENTRY_STR("FileName", 4, cFileName)
  PROVIDER_COLUMN_ENTRY_STR("AltFileName", 5, cAlternateFileName)
END_PROVIDER_COLUMN_MAP()
The macros in the code set the structure of the rowset to be returned. Each record maps to an accessor structure represented by the CFilesWindowsFile class. This class inherits from the WIN32_FIND_DATA structure:


class CFilesWindowsFile: public WIN32_FIND_DATA
(Any C++ struct can be seen as a class with all public members.)
      I won't go into the details of the wizard-generated code's structure since that was covered in my June 1999 article, as well as in Lon Fisher's MSDN Online article, "Using the Visual C++ 6.0 OLE DB Provider Templates". I intend this column to be more of a follow-up to the MSJ article than a detailed account of the wizard code.
      The Windows 2000 user interface allows you to change the columns displayed in the Details view in Explorer. Two columns, Author and Version, are especially relevant here. Author returns the author of the document, if any. Version tells you about the version number of the specified file, provided that the file contains that information. The author's name is stored (for compound files only) in the SummaryInformation field. The module version is stored through a VERSIONINFO resource structure for all those files that can have resources, such as DLLs and EXEs. Figure 3 presents a view of the System32 folder in Windows 2000, showing all the files that can display their version number.
Figure 3 System32 Folder
Figure 3 System32 Folder

      Is it possible to obtain the same information through a recordset? In other words, can you extend the standard ATL OLE DB provider to return additional columns with author and version information for files? The following code snippet adds three more columns to the fixed schema seen previously. These columns will retrieve the author of a document, the version number of an executable, and the short text describing the type of the file:



PROVIDER_COLUMN_ENTRY_STR("Author", 6, m_szAuthor)
PROVIDER_COLUMN_ENTRY_STR("FileVersion", 7, m_szVersion)
PROVIDER_COLUMN_ENTRY_STR("FileType", 8, m_szType)
      In Figure 4 you can see the source code that retrieves this information and stores it in members of the CFilesWindowsFile class. Notice that the class defines the fields of the corresponding ADO recordset. Source code for reading the author and version information can be found in the files suminfo.cpp and version.cpp respectively, which are available with this month's source code from the link at the top of this article. The function shown in Figure 4 is the Execute method of the CFilesRowset class, the wizard-generated class that renders the rowset.
      The provider uses a custom FILTERSTRUCT structure to keep track of the directory (pszDir), the author name (pszAuthor), and the descriptive text (pszType) for the current document:


typedef struct tagFILTERSTRUCT  { /* lpfs */
    TCHAR pszDir[MAX_PATH];
    TCHAR pszAuthor[MAX_PATH];
    TCHAR pszType[100];

    // filter info
    TCHAR pszMatchAuthor[MAX_PATH];
    TCHAR pszMatchType[100];
} FILTERSTRUCT, FAR* LPFILTERSTRUCT;
      The remaining two fields are used to contain the information to be matched. For example, if you want to collect all the files in a certain folder whose author is Joe User and whose type is Microsoft Word Document (and no other files), fields like pszMatchAuthor and pszMatchType would contain just these strings to be matched.
      To query for files using author and type information, I need to create some syntax rules and define a minimal query language. For the sake of simplicity, I've chosen a token-based syntax like this:


DIR=c:\*.*;AUTHOR=Joe Users;TYPE="Word"
Figure 5 describes in detail the query language that the FileSys.Files provider understands. The parser called to ensure the syntax of the language is implemented in the file cmdparser.cpp.
       Figure 6 shows a simple Visual Basic-based application that retrieves files using this provider. (Incidentally, the program utilizes an ActiveX control called DBReportView for display. Its source code is available from http://www.microsoft.com/msj/0699/code/oledb.exe.) The provider correctly retrieves all documents of a certain type, written by a certain author, and located in a certain path. There is just one little inconvenience: at the moment there's no way to specify which columns to retrieve. The provider uses a static array of column information that always lists all the possible columns. The next step will be to narrow this argument.

Choosing Columns Dynamically

      The wizard-generated code uses the BEGIN_COLUMN_PROVIDER_MAP macro to fill the data structure that will make the columns information available to the various classes of the provider, particularly the command and rowset classes. The macro is used within the body of the CFilesWindowsFile class. Figure 7 shows a snapshot of the declaration. Basically, the CFilesWindowsFile class ends up containing a static member function with the following prototype:


template <class T>
static ATLCOLUMNINFO* GetColumnInfo(T* pT, ULONG* pcCols)
BEGIN_COLUMN_PROVIDER_MAP declares the function and starts filling its body. GetColumnInfo simply defines a static array of ATLCOLUMNINFO structures that is then filled and returned.


static ATLCOLUMNINFO _rgColumns [] = 
{
The code that returns the array is added, expanding the closing macro called END_COLUMN_PROVIDER_MAP:


}
*pcCols = sizeof(_rgColumns)/sizeof(ATLCOLUMNINFO); 
return _rgColumns;
      The various PROVIDER_COLUMN_ENTRY macros simply add items to the array. Figure 8 shows how the column macros would expand, and the final code that gets added to the CFilesWindowsFile class.
      ATL makes several flavors of the PROVIDER_COLUMN_ENTRY available (see atldb.h for details), but they differ in the way they pack the data that describes the column. I recommend that you use PROVIDER_COLUMN_ENTRY_STR instead of PROVIDER_COLUMN_ENTRY for all columns that will contain variable-length strings.
      Since the various column macros expand as shown previously, it comes as no surprise that the rowset's structure is fixed. GetColumnInfo is invoked to return information about the columns that will be added to the final rowset. If you leave the original wizard code unchanged, the GetColumnInfo function would always return the same set of columns. Thus, to determine at runtime which columns to return based on the command text being processed, you need to get rid of all the macros and instead write your own GetColumnInfo function.

When Consumers Request Columns

      In general, the provider can decide, according to its internal status, which columns to return. In most cases, though, you want the consumer to request a specific set of columns like you would in a SELECT SQL statement. To accomplish this, a certain logical link between the command text and the GetColumnInfo function is needed. In Figure 5 you saw the role and the syntax of the SELECT keyword in this provider's query language. A consumer application can issue a command like this


DIR=c:\*.*;SELECT=filename,filetype
and obtain a recordset with only two fields. Let's see how to reach this result.
      First, delete all of the calls to the column macros in the CFilesWindowsFile class. Then insert the following code:


template <class T>
static ATLCOLUMNINFO* GetColumnInfo(T* pT, ULONG* pcCols)
{
    return MyGetColumnInfo(pT, pcCols);
}
static ATLCOLUMNINFO* MyGetColumnInfo(CFilesRowset*, ULONG*);
      Now, make sure the CFilesRowset identifier is known at this point of the code. CFilesRowset is a class you don't define until later, so to avoid problems with the compiler add the following line before the CFilesWindowsFile class begins:


class CFilesRowset;
      Modify the FILTERSTRUCT structure to hold an additional member (pszCols) to keep track of the columns requested.


typedef struct tagFILTERSTRUCT  { /* lpfs */
    TCHAR pszDir[MAX_PATH];
    TCHAR pszAuthor[MAX_PATH];
    TCHAR pszType[100];
    TCHAR pszCols[100];

    // filter info
    TCHAR pszMatchAuthor[MAX_PATH];
    TCHAR pszMatchType[100];
} FILTERSTRUCT, FAR* LPFILTERSTRUCT;
There is a FILTERSTRUCT data member in my CFilesRowset class. This member is filled by reading the command text just before creating the rowset (see Figure 4). When it comes to creating the rowset, the ATL infrastructure calls the function GetColumnInfo on CFilesWindowsFile. The function is expected to return an array of ATLCOLUMNINFO structures that will determine the layout of the final rowset.


struct ATLCOLUMNINFO
{
    LPOLESTR pwszName;
    ITypeInfo *pTypeInfo;
    ULONG iOrdinal;
    DBCOLUMNFLAGS dwFlags;
    ULONG ulColumnSize;
    DBTYPE wType;
    BYTE bPrecision;
    BYTE bScale;
    DBID columnid;
    UINT cbOffset;
};
As the name suggests, this is not a data structure defined in the OLE DB specification; it's an ATL-specific shortcut.
      To satisfy the consumer's request for a certain group of columns, you have to implement the GetColumnInfo method to add as many columns as are specified in the SELECT keyword of the custom language. As mentioned earlier, the string is now held by the pszCols member of the FILTERSTRUCT structure. A specific m_fs member has been added to the CFilesRowset definition.
      In Figure 9 you can see the code that compares the names of all possible columns with the comma-separated string the consumer application provided. Whenever a match is found, a new item is added to the _rgColumns array that is returned to the caller at the end of the function. Notice the use of yet another macro, ADD_COLUMN_ENTRY. This time, however, it has nothing to do with ATL. The macro is defined in the filesRS.h file as follows:


#define ADD_COLUMN_ENTRY(ulCols, name, ordinal, colSize, type, 
                  precision, scale, guid, dataClass, member) \
 _rgColumns[ulCols].pwszName = (LPOLESTR)name; \
 _rgColumns[ulCols].pTypeInfo = (ITypeInfo*)NULL; \
 _rgColumns[ulCols].iOrdinal = (ULONG)ordinal; \
 _rgColumns[ulCols].dwFlags = 0; \
 _rgColumns[ulCols].ulColumnSize = (ULONG)colSize; \
 _rgColumns[ulCols].wType = (DBTYPE)type; \
 _rgColumns[ulCols].bPrecision = (BYTE)precision; \
 _rgColumns[ulCols].bScale = (BYTE)scale; \
 _rgColumns[ulCols].cbOffset = offsetof(dataClass, member);
I have Lon Fisher's article on MSDN Online to thank for this chunk of code.
      The _rgColumns array is not allocated dynamically, but it is large enough to hold an entry for each possible column. There's nothing that prevents you from allocating fresh memory to hold the column information. However, if you do this, don't forget to free the memory when you finish.
      To select all the available columns, you can assign the empty string or * to the SELECT keyword in the command text.


DIR=c:\*.*;SELECT=*
      Figure 10 shows the same sample app that I showed previously, but it's now working with the new version of the FileSys.Files provider.


Dim rs As New ADODB.Recordset
rs.CursorLocation = adUseClient
rs.Open txtDir.Text, "Provider=FileSys.Files.1"
Set rs.ActiveConnection = Nothing
If rs.RecordCount > 0 Then
        Set RV.Recordset = rs
End If
Figure 10 Old Version and New Version
Figure 10 Old Version and New Version

      Notice the command in the textbox at the bottom of the form; specifying three fields leads to three columns in the display grid.

Final Considerations

      Writing an OLE DB provider, especially with the help of ATL template classes, is not very difficult. The points at which you need to intervene are limited as long as you need a read-only provider and don't want to tackle service providers. Implementing dynamically sized rowsets is perhaps the only circumstance in which you need to change the structure of the ATL Wizard-generated code instead of simply adding new capabilities to work with your provider. The ATL support for OLE DB ends here.
      If you want to support bookmarks or add write capability, you must implement new interfaces beyond those provided by ATL. For updateable providers, implement IRowsetChange. (Notice that there's no system-provided IRowsetChangeImpl class.) The methods in IRowsetChange are used to update the values of columns in existing rows, delete existing rows, and insert new rows. To define bookmarks you need an implementation of IRowsetLocate. Look at the sample provider available with the latest Platform SDK to see how both features should be coded.
      Finally, what I've shown you here is just one approach for having your OLE DB provider choose columns dynamically. It's not the only one. You could, for example, override the GetColumnsInfo method of the Command and the Rowset class. In addition, the approach shown here can be improved in some ways. For example, the sample always retrieves the data for each column and then discards those that aren't needed. There's always room for improvement.

Dino Esposito is a senior trainer and consultant based in Rome. He has recently written Windows Script Host Programmer's Reference (WROX, 1999). You can reach Dino at desposito@vb2themax.com.

From the July 2000 issue of MSDN Magazine.

Page view tracker