2.2.3.3 Extracting Original Plain Text from RTF

The de-encapsulating RTF reader MUST parse the RTF document as specified in [MSFT-RTF]. Before trying de-encapsulation, it MUST first recognize the encapsulated content, as specified in section 2.2.3.1.

To convert text inside RTF correctly, the de-encapsulating RTF reader SHOULD process control words and other information in RTF that affect the interpretation of text runs and the code page of those text runs. For more details about code page support, see [MSFT-RTF]. In particular, the de-encapsulating RTF reader SHOULD use the default code page, as specified in the RTF header, and it SHOULD use the code page information specified for each font in the font table. It SHOULD also track changes of the current font while processing the RTF text, and use the appropriate code page for the currently selected font. The de-encapsulating RTF reader MUST skip other parts of the RTF header, as specified in [MSFT-RTF].
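The code page bookkeeping described above can be sketched as follows. This is a minimal illustration, not a complete RTF header parser. The function names (default_code_page, font_code_pages), the regular expressions, and the fallback to code page 1252 are assumptions made for this example. The charset table is a partial mapping of \fcharsetN values to Windows code pages, and the sketch assumes that a \cpgN value in a font entry takes precedence over \fcharsetN.

```python
import re
from typing import Dict

# Partial mapping of \fcharsetN values to Windows code pages (illustrative only).
CHARSET_TO_CODE_PAGE = {
    0: 1252,    # ANSI
    128: 932,   # Shift-JIS
    161: 1253,  # Greek
    162: 1254,  # Turkish
    177: 1255,  # Hebrew
    178: 1256,  # Arabic
    186: 1257,  # Baltic
    204: 1251,  # Cyrillic
    238: 1250,  # Central European
}


def default_code_page(rtf: str, fallback: int = 1252) -> int:
    """Return the \\ansicpgN value from the RTF header, or a fallback (assumption)."""
    match = re.search(r"\\ansicpg(\d+)", rtf)
    return int(match.group(1)) if match else fallback


def font_code_pages(font_table: str, default_cp: int) -> Dict[int, int]:
    """Map each font number (\\fN) to the code page used for its text runs."""
    pages: Dict[int, int] = {}
    # Each font entry starts with \fN and runs up to the next ';' or brace.
    for entry in re.finditer(r"\\f(\d+)([^;{}]*)", font_table):
        font_number = int(entry.group(1))
        attributes = entry.group(2)
        cpg = re.search(r"\\cpg(\d+)", attributes)
        charset = re.search(r"\\fcharset(\d+)", attributes)
        if cpg:
            # Assumption: an explicit \cpgN wins over \fcharsetN.
            pages[font_number] = int(cpg.group(1))
        elif charset:
            pages[font_number] = CHARSET_TO_CODE_PAGE.get(
                int(charset.group(1)), default_cp
            )
        else:
            pages[font_number] = default_cp
    return pages
```

A reader built along these lines would look up the code page each time an \fN control word changes the current font in the document body, and would restore the previous font when the enclosing group closes.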

The de-encapsulating RTF reader MUST examine each control token, translate it to its textual equivalent, and emit it to the output stream. Any RTF formatting control words that do not have a textual representation MUST be ignored.

Individual textual characters can be escaped in RTF; these SHOULD be converted to their character equivalents and emitted to the output stream (for example: "\{", "\}", "\\", and "\'HH"). After unescaping, the resulting bytes SHOULD be interpreted in the code page that corresponds to the currently selected font. Characters produced from Unicode escapes (the \uN control word) and other control words SHOULD be interpreted as Unicode characters.
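The unescaping step can be sketched as follows. The function name unescape and its codec parameter are hypothetical. The sketch assumes the run is 7-bit RTF text that contains only literal characters and the four escapes listed above, and that the caller passes the Python codec name for the code page of the currently selected font. Consecutive "\'HH" bytes are buffered and decoded together so that double-byte code pages decode correctly. It does not handle \uN or a double-byte character whose trail byte is written as a literal character.

```python
import re

# Matches \'HH (group 1) or an escaped brace or backslash (group 2).
_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})|\\([{}\\])")


def unescape(run: str, codec: str = "cp1252") -> str:
    """Convert RTF character escapes in a text run to their character equivalents."""
    out = []
    pending = bytearray()  # bytes collected from consecutive \'HH escapes

    def flush() -> None:
        # Decode buffered bytes in the code page of the current font.
        if pending:
            out.append(pending.decode(codec, errors="replace"))
            pending.clear()

    pos = 0
    for match in _ESCAPE.finditer(run):
        if match.start() > pos:
            flush()
            out.append(run[pos:match.start()])  # literal text between escapes
        if match.group(1) is not None:
            pending.append(int(match.group(1), 16))
        else:
            flush()
            out.append(match.group(2))
        pos = match.end()
    flush()
    out.append(run[pos:])
    return "".join(out)
```

For example, unescape(r"caf\'e9") with the default codec yields "café", while the same byte sequence decoded under a different code page would produce a different character, which is why the current font has to be tracked.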

The \par and \line RTF control words SHOULD be translated to CRLF and emitted to the output stream.

The \tab control word SHOULD be translated to the horizontal tab character (%x09), which SHOULD be emitted to the output stream.
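The translation of \par, \line, and \tab, together with the rule that formatting control words without a textual representation are ignored, can be sketched as follows. The names TEXT_EQUIVALENTS and translate_control_words are hypothetical. The sketch assumes the run contains no groups and no escaped characters; a real reader would tokenize the document in a single pass so that an escaped backslash followed by letters is not mistaken for a control word.

```python
import re

# Control words that have a textual equivalent in the plain text output.
TEXT_EQUIVALENTS = {
    "par": "\r\n",   # paragraph break -> CRLF
    "line": "\r\n",  # line break -> CRLF
    "tab": "\t",     # horizontal tab (%x09)
}

# A control word: backslash, letters, optional signed numeric parameter,
# and an optional single space that acts as the delimiter.
_CONTROL_WORD = re.compile(r"\\([a-zA-Z]+)(-?\d+)? ?")


def translate_control_words(run: str) -> str:
    """Replace control words with their textual equivalent, or drop them."""
    return _CONTROL_WORD.sub(
        lambda match: TEXT_EQUIVALENTS.get(match.group(1), ""), run
    )
```

The delimiter space that follows a control word is consumed along with it, so it does not leak into the output as an extra character.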

Any remaining text MUST be copied to the target plain text document. Text SHOULD be interpreted in the code page that corresponds to the currently selected font.