Speech Recognition API and DDI (Windows Embedded CE 6.0)

1/6/2010

This topic describes basic APIs and DDIs required for speech recognition.

API

Just as ISpVoice is the main interface for speech synthesis, ISpRecoContext is the main interface for speech recognition. Like ISpVoice, it is an ISpEventSource, which means that it is the speech application's vehicle for receiving notifications of the speech recognition events it has requested.

An application has the choice of two different types of speech recognition engine (ISpRecognizer). A shared recognizer, which can be shared with other speech recognition applications, is recommended for most speech applications. To create an ISpRecoContext for a shared ISpRecognizer, an application need only call COM's CoCreateInstance on the component CLSID_SpSharedRecoContext. In this case, SAPI sets up the audio input itself, using SAPI's default audio input stream. For large server applications that run alone on a system, and for which performance is key, an InProc speech recognition engine is more appropriate. To create an ISpRecoContext for an InProc ISpRecognizer, the application must first call CoCreateInstance on the component CLSID_SpInprocRecognizer to create its own InProc ISpRecognizer. Then the application must call ISpRecognizer::SetInput (see also ISpObjectToken) to set up the audio input. Finally, the application can call ISpRecognizer::CreateRecoContext to obtain an ISpRecoContext.
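A minimal sketch of both creation paths might look like the following, assuming ATL's CComPtr and the SpCreateDefaultObjectFromCategoryId helper from sphelper.h, with error handling abbreviated:

#include <windows.h>
#include <atlbase.h>   // CComPtr
#include <sapi.h>
#include <sphelper.h>  // SpCreateDefaultObjectFromCategoryId

// Shared engine: one call yields a ready-to-use recognition context.
CComPtr<ISpRecoContext> cpSharedCtxt;
HRESULT hr = cpSharedCtxt.CoCreateInstance(CLSID_SpSharedRecoContext);

// InProc engine: create the recognizer, set its audio input explicitly,
// and then ask it for a recognition context.
CComPtr<ISpRecognizer>  cpRecognizer;
CComPtr<ISpAudio>       cpAudio;
CComPtr<ISpRecoContext> cpInprocCtxt;
hr = cpRecognizer.CoCreateInstance(CLSID_SpInprocRecognizer);
if (SUCCEEDED(hr))
    hr = SpCreateDefaultObjectFromCategoryId(SPCAT_AUDIOIN, &cpAudio);
if (SUCCEEDED(hr))
    hr = cpRecognizer->SetInput(cpAudio, TRUE);   // TRUE: allow format changes
if (SUCCEEDED(hr))
    hr = cpRecognizer->CreateRecoContext(&cpInprocCtxt);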

The next step is to set up notifications for the events the application is interested in. Because the ISpRecoContext is an ISpEventSource, which in turn is an ISpNotifySource, the application can call one of the ISpNotifySource methods on its ISpRecoContext to indicate where the events for that ISpRecoContext should be reported. Then it should call ISpEventSource::SetInterest to indicate which events it needs to be notified of. The most important event is SPEI_RECOGNITION, which indicates that the ISpRecognizer has recognized some speech for this ISpRecoContext. See SPEVENTENUM for details on the other available speech recognition events.
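Continuing the sketch above (cpRecoCtxt stands for whichever context was created there), an application that wants a waitable Win32 event handle and cares only about final recognitions might do this:

// Deliver notifications through a Win32 event handle, and request only
// SPEI_RECOGNITION events (both the interest and the queueing masks).
hr = cpRecoCtxt->SetNotifyWin32Event();
if (SUCCEEDED(hr))
    hr = cpRecoCtxt->SetInterest(SPFEI(SPEI_RECOGNITION),
                                 SPFEI(SPEI_RECOGNITION));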

Finally, a speech application must create, load, and activate an ISpRecoGrammar, which indicates what type of utterances to recognize: Dictation or Command and Control. First, the application creates an ISpRecoGrammar using ISpRecoContext::CreateGrammar. Then, the application loads the appropriate grammar, either by calling ISpRecoGrammar::LoadDictation for Dictation or one of the ISpRecoGrammar::LoadCmdxxx methods for Command and Control. Finally, to activate these grammars so that recognition can start, the application calls ISpRecoGrammar::SetDictationState for Dictation, or ISpRecoGrammar::SetRuleState or ISpRecoGrammar::SetRuleIdState for Command and Control.
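For example, the following sketch activates a Dictation grammar; the commented alternative shows the Command and Control path with a hypothetical compiled grammar file name:

CComPtr<ISpRecoGrammar> cpGrammar;
hr = cpRecoCtxt->CreateGrammar(0, &cpGrammar);    // 0: application-defined grammar id

// Dictation: load the default topic and activate it.
if (SUCCEEDED(hr))
    hr = cpGrammar->LoadDictation(NULL, SPLO_STATIC);
if (SUCCEEDED(hr))
    hr = cpGrammar->SetDictationState(SPRS_ACTIVE);

// Command and Control instead: load a compiled grammar (file name is
// hypothetical) and activate all of its top-level rules.
// hr = cpGrammar->LoadCmdFromFile(L"cmds.cfg", SPLO_STATIC);
// hr = cpGrammar->SetRuleState(NULL, NULL, SPRS_ACTIVE);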

When recognitions come back to the application by means of the requested notification mechanism, the lParam member of the SPEVENT structure is an ISpRecoResult from which the application can determine what was recognized, and for which ISpRecoGrammar of the ISpRecoContext.
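A sketch of retrieving that result, continuing the Win32-event notification setup above (the CSpEvent helper in sphelper.h automates the Release bookkeeping shown here):

// Wait on the Win32 event, then drain the context's event queue.
if (cpRecoCtxt->WaitForNotifyEvent(INFINITE) == S_OK)
{
    SPEVENT evt;
    ULONG   cFetched = 0;
    memset(&evt, 0, sizeof(evt));
    while (cpRecoCtxt->GetEvents(1, &evt, &cFetched) == S_OK && cFetched == 1)
    {
        if (evt.eEventId == SPEI_RECOGNITION &&
            evt.elParamType == SPET_LPARAM_IS_OBJECT)
        {
            // lParam holds an ISpRecoResult the application must Release.
            ISpRecoResult* pResult = reinterpret_cast<ISpRecoResult*>(evt.lParam);
            LPWSTR pwszText = NULL;
            if (SUCCEEDED(pResult->GetText(SP_GETWHOLEPHRASE, SP_GETWHOLEPHRASE,
                                           TRUE, &pwszText, NULL)))
            {
                // pwszText now holds the recognized phrase.
                ::CoTaskMemFree(pwszText);
            }
            pResult->Release();
        }
    }
}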

An ISpRecognizer, whether shared or InProc, can have multiple ISpRecoContexts associated with it, and each one can be notified in its own way of events pertaining to it. An ISpRecoContext can have multiple ISpRecoGrammars created from it, each one for recognizing different types of utterances.

DDI

At the most basic level, the SAPI 5.0 Recognition DDI provides the functionality for an engine to receive audio data from SAPI and return phrase recognitions. The two interfaces used for this are ISpSREngine, which is implemented by the engine, and ISpSREngineSite, which is implemented by SAPI. ISpSREngineSite also provides methods through which the engine communicates more detailed information about what it recognizes. Grammars and speakers provide information that helps engines recognize speech more accurately and are a critical part of the communication between SAPI and speech engines. There are two final aspects of the communication between an engine and SAPI: the order in which calls can be made, and threading issues. One of the key benefits of SAPI 5.0 is the simplification of threading issues.

An engine provides its services to SAPI through the ISpSREngine interface. The method through which all recognitions are made is ISpSREngine::RecognizeStream. When SAPI calls ISpSREngine::SetSite, it passes in a pointer to the ISpSREngineSite interface through which the engine communicates with SAPI during the execution of ISpSREngine::RecognizeStream. SAPI dedicates a thread to an ISpSREngine object, and the engine should not return from ISpSREngine::RecognizeStream until there is a failure, or until SAPI has indicated through ISpSREngineSite::Read that there is no more data to process and the engine has done whatever cleanup is appropriate.

SAPI isolates engine developers from the details of managing an audio device. SAPI maintains a logical stream of raw audio data that it indexes with a stream position index. With a stream position index, an engine can call ISpSREngineSite::Read to receive a buffer of raw audio data during the execution of ISpSREngine::RecognizeStream. This call blocks until all the requested data is available. If ISpSREngineSite::Read returns less data than requested, there is no more data, and the engine should return from ISpSREngine::RecognizeStream.
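On the engine side, a minimal sketch of this contract, assuming the SAPI 5.0 DDK signatures from sapiddk.h (CMyEngine, m_cpSite, m_ullPos, and the buffer size are all placeholders), might be:

#include <sapiddk.h>

// SAPI hands the engine its site pointer before recognition starts.
STDMETHODIMP CMyEngine::SetSite(ISpSREngineSite *pSite)
{
    m_cpSite = pSite;   // CComPtr<ISpSREngineSite> member
    return S_OK;
}

STDMETHODIMP CMyEngine::RecognizeStream(REFGUID rguidFmtId,
                                        const WAVEFORMATEX *pWaveFormatEx,
                                        HANDLE hRequestSync, HANDLE hDataAvailable,
                                        HANDLE hExit, BOOL fNewAudioStream,
                                        BOOL fRealTimeAudio,
                                        ISpObjectToken *pAudioObjectToken)
{
    BYTE abBuf[4096];   // arbitrary chunk size for this sketch

    for (;;)
    {
        ULONG cbRead = 0;
        // Blocks until the requested amount of audio is available; a short
        // read means the stream has ended and RecognizeStream should return.
        HRESULT hr = m_cpSite->Read(abBuf, sizeof(abBuf), &cbRead);
        if (FAILED(hr) || cbRead < sizeof(abBuf))
            break;

        // ... feed abBuf to the engine's decoder here ...
        m_ullPos += cbRead;   // position in SAPI's logical audio stream
    }
    return S_OK;
}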

Once an engine has recognized a phrase with enough certainty to warrant a notification to the application, it should call ISpSREngineSite::AddEvent. The stream position passed into AddEvent indicates the point in the audio stream after which the engine is seeking recognitions. To pass phrase hypotheses to the application, the engine calls ISpSREngineSite::Recognition, passing an SPRECORESULTINFO structure with the hypothesis flag set. Once an engine has a final recognition candidate, it must call ISpSREngineSite::Recognition with an SPRECORESULTINFO structure with the hypothesis flag not set. If an engine rejects all possible recognitions, it calls ISpSREngineSite::Recognition and passes in NULL, even though it previously called ISpSREngineSite::AddEvent. Although every call to ISpSREngineSite::AddEvent should be followed by one or more calls to ISpSREngineSite::Recognition, this restriction is not enforced by SAPI and must not be depended on.
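The reporting sequence might look like the following sketch, assuming the SPEVENT and SPRECORESULTINFO layouts from the SAPI 5.0 DDK headers; ullPhraseStart, ullPhraseEnd, hGrammar, and pPhrase (an ISpPhraseBuilder the engine has filled in) are placeholders:

// Announce that a phrase is being decoded at this stream position.
SPEVENT evt;
memset(&evt, 0, sizeof(evt));
evt.eEventId = SPEI_PHRASE_START;
evt.elParamType = SPET_LPARAM_IS_UNDEFINED;
evt.ullAudioStreamOffset = ullPhraseStart;
HRESULT hr = m_cpSite->AddEvent(&evt, NULL);  // NULL: not scoped to one context

// Report a hypothesis, and later the final result, through the same
// structure; only fHypothesis changes between the two calls.
SPRECORESULTINFO result;
memset(&result, 0, sizeof(result));
result.cbSize            = sizeof(result);
result.eResultType       = SPRT_DICTATION;  // or SPRT_CFG for command and control
result.fHypothesis       = TRUE;            // FALSE for the final recognition
result.ullStreamPosStart = ullPhraseStart;
result.ullStreamPosEnd   = ullPhraseEnd;
result.hGrammar          = hGrammar;        // grammar that produced the result
result.pPhrase           = pPhrase;         // ISpPhraseBuilder holding the phrase
hr = m_cpSite->Recognition(&result);

// Rejecting the utterance after an AddEvent is reported with a NULL:
// hr = m_cpSite->Recognition(NULL);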

The SAPI DDI layers make it possible for only one thread to execute between SAPI and an engine. The only method in ISpSREngine that does not enter and exit relatively quickly is ISpSREngine::RecognizeStream. The engine can remain executing and blocking inside RecognizeStream until it fails or receives no more data to process from ISpSREngineSite::Read. When the engine has the opportunity to give SAPI a chance to call back into ISpSREngine, it should call ISpSREngineSite::Synchronize and pass the stream position index up to which the engine has finished recognizing audio data. From whatever thread Synchronize is called on, SAPI can call back into any of the other ISpSREngine methods except RecognizeStream. For example, the speaker can change, and grammars can be unloaded, started, or stopped dynamically. At any point in time, ISpSREngine is called only on the original thread that created the ISpSREngine or on the thread from which the engine called ISpSREngineSite::Synchronize. If the engine only calls ISpSREngineSite::Synchronize on the same thread that entered ISpSREngine::RecognizeStream, then SAPI only ever calls into an instance of ISpSREngine on a single thread.
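In the RecognizeStream loop sketched earlier, that yield point might be a single call after each processed chunk:

// Yield to SAPI so it can deliver grammar, speaker, and state changes
// back into ISpSREngine on this same thread.
hr = m_cpSite->Synchronize(m_ullPos);
if (FAILED(hr))
    break;   // treat failure as a signal to stop processing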

See Also

Concepts

SAPI Overview