How to Write an RTF Reader

An RTF reader must do three basic things:

  1. Separate text from RTF controls.
  2. Parse an RTF control.
  3. Dispatch an RTF control.

Separating text from RTF controls is relatively simple, because all RTF controls begin with a backslash. Therefore, any incoming character that is not a backslash is text and will be handled as text, apart from the braces that open and close groups. (Of course, what one does with that text may be relatively complicated.)

Parsing an RTF control is also relatively simple. An RTF control is either (a) a control word, which is a sequence of alphabetic characters followed by an optional numeric parameter, or (b) a control symbol, which is a single non-alphanumeric character.
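
The two cases above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `parse_control` and its return shape are hypothetical, the input is assumed to be a `bytes` object, and the sketch assumes that a parameter may be negative and that a single space after a control word is a delimiter that belongs to the control. Any non-letter after the backslash is treated as a control symbol, and hex escapes such as \'hh are left to the caller.

```python
from typing import Optional, Tuple


def parse_control(data: bytes, pos: int) -> Tuple[str, Optional[int], int]:
    """Parse the control that starts at data[pos], which must be a backslash.

    Returns (name, parameter or None, index of the first byte after the control).
    """
    if data[pos:pos + 1] != b"\\":
        raise ValueError("parse_control must start at a backslash")
    pos += 1
    if pos >= len(data):
        raise ValueError("backslash at end of input")

    first = data[pos:pos + 1]
    if not first.isalpha():
        # Case (b): a control symbol is exactly one non-letter character.
        return first.decode("latin-1"), None, pos + 1

    # Case (a): a control word is a run of ASCII letters...
    start = pos
    while data[pos:pos + 1].isalpha():
        pos += 1
    name = data[start:pos].decode("ascii")

    # ...followed by an optional numeric parameter (assumed: may be negative).
    num_start = pos
    if data[pos:pos + 1] == b"-":
        pos += 1
    while data[pos:pos + 1].isdigit():
        pos += 1
    digits = data[num_start:pos]
    if digits in (b"", b"-"):
        param = None
        pos = num_start  # a lone hyphen is not a parameter; leave it as text
    else:
        param = int(digits.decode("ascii"))

    # Assumed: one trailing space is a delimiter and is consumed with the word.
    if data[pos:pos + 1] == b" ":
        pos += 1
    return name, param, pos
```

For example, `parse_control(b"\\fs24 Hello", 0)` yields the name `fs`, the parameter 24, and the index of the `H` that begins the following text.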

Dispatching an RTF control, on the other hand, is relatively complicated. A recursive-descent parser tends to be overly strict because RTF is intentionally vague about the order of various properties relative to one another. However, whatever method you use to dispatch an RTF control, your reader should do the following:

  • Ignore control words you don't understand.

    Many readers crash when they come across an unknown RTF control. Because Microsoft is continually adding new RTF controls, such a reader is limited to working with the RTF from one particular product (usually some version of Word for Windows).

  • Always understand \*.

    One of the most important things an RTF reader can do is to understand the \* control. This control introduces a destination that is not part of the document. It tells the RTF reader that if the reader does not understand the next control word, then it should skip the entire enclosing group. If your reader follows this rule and the one above, your reader will be able to cope with any future change to RTF short of a complete rewrite.

  • Remember that binary data can occur when you're skipping RTF.

    A simple way to skip a group in RTF is to keep a running count of the opening braces that the reader has encountered in the RTF stream. When the reader sees an opening brace, it increments the count; when the reader sees a closing brace, it decrements the count. When the count becomes negative, the end of the group has been found. Unfortunately, this doesn't work when the RTF file contains a \bin control, because the binary data that follows it can contain bytes that look like braces. The reader must explicitly check each control word it finds to see if it's a \bin control and, if so, skip the number of bytes given by that control's numeric parameter before resuming its scan for braces.
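
The three rules above can be combined in a small text-extracting reader, sketched here in Python. Everything named in it is an assumption made for illustration: the tables `SKIPPED_DESTINATIONS`, `TEXT_FOR_CONTROL`, and `KNOWN`, and the functions `read_control`, `skip_binary`, `skip_group`, and `extract_text`, are hypothetical and far smaller than anything a real reader would need. A compact regular-expression parser is included only so that the sketch stands alone, and hex escapes such as \'hh are not handled.

```python
import re

# Control word: backslash, letters, optional signed number, optional one-space delimiter.
CONTROL_WORD = re.compile(rb"\\([A-Za-z]+)(-?\d+)?( ?)")

# Hypothetical tables for this toy reader; a real reader would know many more controls.
SKIPPED_DESTINATIONS = {"fonttbl", "colortbl", "stylesheet", "info", "pict"}
TEXT_FOR_CONTROL = {"par": "\n", "line": "\n", "tab": "\t",
                    "{": "{", "}": "}", "\\": "\\"}
KNOWN = SKIPPED_DESTINATIONS | set(TEXT_FOR_CONTROL) | {"bin"}


def read_control(data, pos):
    """Parse the control at data[pos] (a backslash); return (name, param, next_pos)."""
    m = CONTROL_WORD.match(data, pos)
    if m:
        param = int(m.group(2)) if m.group(2) else None
        return m.group(1).decode("ascii"), param, m.end()
    return data[pos + 1:pos + 2].decode("latin-1"), None, pos + 2  # control symbol


def skip_binary(name, param, pos):
    """If the control just read is a bin control with parameter N, step over N raw bytes."""
    return pos + param if name == "bin" and param and param > 0 else pos


def skip_group(data, pos):
    """Return the index just past the brace that closes the group containing data[pos]."""
    depth = 0
    while pos < len(data):
        ch = data[pos:pos + 1]
        if ch == b"\\":
            # Escaped braces are consumed here as control symbols, so they are not counted.
            name, param, pos = read_control(data, pos)
            pos = skip_binary(name, param, pos)
            continue
        if ch == b"{":
            depth += 1
        elif ch == b"}":
            depth -= 1
        pos += 1
        if depth < 0:
            return pos
    raise ValueError("unbalanced braces while skipping a group")


def extract_text(data):
    """Collect document text, ignoring unknown controls and skipping unknown destinations."""
    out, pos = [], 0
    while pos < len(data):
        ch = data[pos:pos + 1]
        if ch != b"\\":
            if ch not in (b"{", b"}", b"\r", b"\n"):
                out.append(ch.decode("latin-1"))
            pos += 1
            continue
        name, param, pos = read_control(data, pos)
        if name == "*":
            # The next control word decides: if it is unknown, skip the enclosing group.
            if data[pos:pos + 1] == b"\\" and read_control(data, pos)[0] not in KNOWN:
                pos = skip_group(data, pos)
        elif name in SKIPPED_DESTINATIONS:
            pos = skip_group(data, pos)
        elif name in TEXT_FOR_CONTROL:
            out.append(TEXT_FOR_CONTROL[name])
        else:
            pos = skip_binary(name, param, pos)  # unknown controls are simply ignored
    return "".join(out)
```

Note that `skip_group` parses every control it passes instead of scanning for braces alone. That is what lets it step over the payload of a \bin control, and it also keeps escaped braces from disturbing the count.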