Word Breaker and Stemmer Sample
The lrsample sample includes an example IWordBreaker implementation, an example IStemmer implementation, and the DLL entry points and exports for the component. The word breaker implementation parses Unicode text based on the occurrence of white space and punctuation. The stemmer implementation uses a small, static custom dictionary to look up and generate inflected forms for words. The sample words are English words and their inflected forms.
The language resource code sample is located in mssdk\samples\winbase\indexing\lrsample (where mssdk is the directory where the Platform SDK is installed).
To build the sample
- Set the Lib environment variable to drive:\mssdk\Lib;%Lib% and the Include environment variable to drive:\mssdk\Include;%Include% (where drive is the drive on which you installed the Platform SDK).
- Open a command prompt window, and then change the directory to the source path of the sample.
- At the command prompt, type nmake to build the language resource DLL.
To register the sample
- Copy the language resource DLL, the lrsample.dll file, to your installation directory (for example, drive:\MyDLLs).
- At the command prompt, type regsvr32.exe drive:\MyDLLs\lrsample.dll to self-register the language resource DLL.
The word breaker and stemmer are available when Indexing Service starts.
- The word breaker is registered under class ID d225281a-7ca9-4a46-ae7d-c63a9d4815d4.
- The stemmer is registered under class ID 0a275611-aa4d-4b39-8290-4baf77703f55.
The class IDs for these and other word breakers and stemmers are located in the following registry location:
The samples are registered under the English language and the "English_Sample" sublanguage.
The following table shows source files for the lrsample sample and provides a short description for each file.
|File|Description|
|---|---|
|Lrsample.cxx|C++ source file that contains the IWordBreaker interface implementation (CSampleWordBreaker), the IStemmer interface implementation (CSampleStemmer), and the DLL export functions|
|Lrsample.hxx|C++ header file that contains the class definitions for CSampleWordBreaker, CSampleStemmer, and CLanguageResourceSampleCF, the class factory for the sample|
|Langreg.hxx|C++ header file that contains utilities for registering language resources|
|Lrsample.def|Definition file that contains definitions for DLL export functions|
|Minici.hxx|C++ header file that contains utility functions for the sample|
CSampleWordBreaker is the class implementation of the IWordBreaker interface. In addition to the public interface methods, it implements a private method, CSampleWordBreaker::Tokenize. CSampleWordBreaker::BreakText reads text from the text source, TEXT_SOURCE, and calls the Tokenize method on the contents of the text buffer. CSampleWordBreaker::Tokenize breaks text based on the occurrence of white space and word separators for this locale, as determined by a call to GetStringTypeW in the Win32 API. CSampleWordBreaker::Tokenize also removes punctuation and makes the appropriate calls to the methods of the WordSink object.
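The tokenizing behavior described above can be sketched in portable C++. This is a minimal illustration, not the code in Lrsample.cxx: std::iswspace and std::iswpunct stand in for the GetStringTypeW classification, and the names TokenizeSketch, PutWordFn, and CollectWords are hypothetical. The callback plays the role of the word sink and receives each word with its offset in the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cwctype>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for the word sink: called once per word with the
// word text and its starting offset in the source buffer.
using PutWordFn = std::function<void(const std::wstring& word, std::size_t offset)>;

// Splits `text` at white space and punctuation and reports each word.
// Assumption: std::iswspace/std::iswpunct approximate the character
// classes that the sample obtains from GetStringTypeW.
void TokenizeSketch(const std::wstring& text, const PutWordFn& putWord)
{
    assert(putWord);  // a callable sink is required

    std::size_t start = 0;
    bool inWord = false;

    // Iterate one position past the end so a trailing word is flushed.
    for (std::size_t i = 0; i <= text.size(); ++i) {
        const bool isBreak = (i == text.size()) ||
                             std::iswspace(static_cast<std::wint_t>(text[i])) ||
                             std::iswpunct(static_cast<std::wint_t>(text[i]));
        if (!isBreak && !inWord) {
            start = i;        // first character of a new word
            inWord = true;
        } else if (isBreak && inWord) {
            putWord(text.substr(start, i - start), start);
            inWord = false;   // separators themselves are dropped
        }
    }
}

// Convenience wrapper that collects the reported words into a vector.
std::vector<std::wstring> CollectWords(const std::wstring& text)
{
    std::vector<std::wstring> words;
    TokenizeSketch(text, [&words](const std::wstring& w, std::size_t) {
        words.push_back(w);
    });
    return words;
}
```

Because every punctuation character is treated as a separator, a word such as "don't" is reported as two words in this sketch. A production word breaker would need locale-specific rules for such cases.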
CSampleStemmer is the class implementation of the IStemmer interface. CSampleStemmer::GenerateWordForms takes an input word and generates alternative, inflected word forms for that word based on the contents of a small, custom dictionary contained in the aStemForms array. This stemmer implementation does not perform any morphological analysis in determining inflected forms for a word. The aStemForms array contains English words and their inflected forms. You can modify the array to hold words and alternative forms for any other inflected language. The custom dictionary is implemented as a static array in this sample to illustrate basic stemmer functionality. You should not expand or use this dictionary implementation in a production environment.
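A dictionary-based lookup of this kind might look like the following sketch. It is an illustration under stated assumptions, not the sample's code: the StemEntry layout, the kStemForms contents, and the function name GenerateWordFormsSketch are all hypothetical, and the real aStemForms array may be organized differently. Matching is exact and case-sensitive, and no morphological analysis is performed.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical dictionary entry: a base word plus a null-terminated list
// of its inflected forms.
constexpr std::size_t kMaxForms = 4;
struct StemEntry {
    const wchar_t* stem;
    const wchar_t* forms[kMaxForms];
};

// Illustrative contents only; the sample's aStemForms array may differ.
static const StemEntry kStemForms[] = {
    { L"run",  { L"runs",  L"ran",    L"running", nullptr } },
    { L"walk", { L"walks", L"walked", L"walking", nullptr } },
};

// Returns the input word followed by the other members of the first
// dictionary entry that contains it. Unknown words yield only themselves.
std::vector<std::wstring> GenerateWordFormsSketch(const std::wstring& word)
{
    std::vector<std::wstring> result{ word };

    for (const StemEntry& entry : kStemForms) {
        // The word matches an entry if it equals the stem or any form.
        bool match = (word == entry.stem);
        for (std::size_t i = 0; !match && i < kMaxForms && entry.forms[i]; ++i)
            match = (word == entry.forms[i]);
        if (!match)
            continue;

        // Emit every member of the entry except the input word itself.
        if (word != entry.stem)
            result.push_back(entry.stem);
        for (std::size_t i = 0; i < kMaxForms && entry.forms[i]; ++i)
            if (word != entry.forms[i])
                result.push_back(entry.forms[i]);
        break;
    }
    return result;
}
```

Calling GenerateWordFormsSketch with a word from the table returns that word plus the rest of its entry, while a word that is not in the table is returned unchanged. The linear scan is acceptable only because the table is tiny, which is one reason a static array like this does not scale to production use.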
Error handling in the language resource sample is rudimentary. The sample handles only the most obvious error conditions. If you use the language resource sample as a basis for your own word breaker or stemmer, you must determine which additional error conditions to detect and handle.