Best Practices for Writing an MFT

A Media Foundation transform (MFT) must implement the IMFTransform interface. For detailed descriptions of how the methods of this interface should behave, refer to the reference pages for the interface.

This section gives some specific recommendations for writing various types of MFTs.

Time Stamps and Durations

An MFT must set as accurate a time stamp and duration as possible on all output samples. For a simple MFT that takes one input buffer and completely processes it into an output buffer, the MFT should just copy the time stamp and duration directly from the input sample to the output sample. However, many transforms are more complex than this and may require more complex calculations of output time. All MFTs should observe the following basic rules:

  • An MFT should try to put a time stamp and duration on all uncompressed video or audio output samples if an accurate time stamp or duration is given on the input samples or can be calculated.

  • An MFT should never guess the time stamp or duration. A wrong time stamp or duration is worse than none at all.

  • The time stamps and durations of the input samples should be preserved on the output samples as much as possible.

  • The output time stamps or durations might not match the input because the MFT is holding back data or breaking the output into different-sized pieces than the input. In that case, the MFT should calculate the output time stamp from the earliest input sample that contains data used to create the output sample. To calculate the output time stamp, add the input time stamp of the appropriate input sample to the duration of data that has already been transformed from that sample. The second example at the end of this section illustrates this idea.

  • If the input samples have duration, that duration should be preserved. If an input sample does not have duration, the MFT should calculate a duration if possible from the size of the output buffer or the data rate given by the media type.

  • Calculated durations should be truncated (rounded down), not rounded to the nearest increment. The pipeline has enough slack to handle durations that are slightly inaccurate, but it is easier for the pipeline to handle a duration that is 1% too short than a duration that is 1% too long. That said, there is no reason to deliberately shorten the durations, other than by rounding.

Decoders

A decoder converts compressed packets into uncompressed buffers of audio or video. Because the output is uncompressed, decoders have a special obligation to get the time stamps and durations correct.

Some compressed formats, most notably MPEG-2, do not have time stamps on all input packets and often have no duration on any packet. For these formats, the decoder is responsible for putting a valid time stamp and duration on every output sample by summing the implied durations of all the output since the last time stamped input sample.

For video, if the duration is not available in the compressed format, the decoder should calculate the duration as the inverse of the frame rate, converted to 100-nanosecond units and rounded down.

For audio, if the duration is not available in the compressed format, the decoder should calculate the duration as the inverse of the audio sample rate multiplied by the number of samples in the output buffer, converted to 100-nanosecond units and rounded down.

The only time a transform should output a sample without a time stamp is if the MFT has never received a time stamp on an input sample, or if there is no way to calculate an accurate output time stamp from the previous input time stamp.

Mixers

Note  Currently the Media Foundation pipeline does not support mixers, or any MFTs with more than one input.

A mixer takes multiple inputs and mixes them into one output. If the input streams are not completely rate-locked, or are slightly offset in time from each other, there can be ambiguity about which time to set on the output. Here are some guidelines, depending on the media type:

  • Audio. At startup or immediately after a drain or flush, an audio mixer should wait to produce output samples until it has received an input sample on all required input streams. At that point, it should choose the earliest time stamp of the initial samples to use as a baseline for the output time stamps. The other streams should be padded with silence to make up any time discrepancy. If a sample is received on an optional input stream, it should also be factored into the calculation. From that point on, the MFT should strive to produce a continuous and unbroken chain of output time stamps. In general, the MFT should not try to account for one stream drifting relative to another. Instead, it should calculate the output time stamps from the baseline time stamp, the output rate, and the buffer sizes. When another drain or flush occurs, the MFT should reset its baseline time stamps.

  • Video. At startup or immediately after a drain or flush, a video mixer should wait to produce output samples until it has received an input sample on all required input streams. At that point, it should choose the earliest time stamp of the initial samples to use as a baseline for the output time stamps. In general, it should strive to keep continuous and regular output time stamps and fixed durations, even if the input is not as regular, if necessary by repeating input frames.

Encoders

An encoder converts uncompressed audio or video into compressed packets. An encoder should follow these guidelines:

  • The encoder should follow the conventions of the output format. If the format does not typically time stamp every sample, as in MPEG-2, not every output sample needs to have a time stamp and a duration.

  • The input time stamps should be preserved in the output format, if the format has fields for time stamps, unless better time information is available from another source, such as the application itself.

Multiplexers

Note  Currently the Media Foundation pipeline does not support multiplexers, or any MFTs with more than one input.

A multiplexer combines two different audio or video streams into one interleaved format, such as AVI or MPEG-2 Transport Stream. A multiplexer should follow these guidelines:

  • The multiplexer should follow the conventions of the output format. If the format does not typically time stamp every sample, as in MPEG-2, not every output sample needs to have a time stamp and a duration.

  • The time stamp should reflect the earliest time that would be placed on any frame that begins in that packet, or the time of the first audio sample that would be decoded from that packet. Ignore this guideline if it conflicts with the conventions of the output format.

Demultiplexers

A demultiplexer splits an interleaved format, such as AVI or MPEG-2 Transport Stream, into the underlying audio and video streams.

If the format contains specific time stamp information that can be used to calculate accurate output time stamps based on the input time stamps, that information should be used. However, if the format contains times in a completely different base that bear no relation to the input time stamps, and an accurate offset to the input time stamp cannot be calculated, the format's own times should be ignored.

If the format does not have usable time stamp information, the demultiplexer should follow these rules:

  • Uncompressed output streams should have valid time stamps and durations if possible, calculated from the closest previous input time stamp.

  • Compressed output streams should have time stamps on only the first output sample derived from an input sample with a time stamp. If the input sample does not have a time stamp, no output samples derived from that input sample should have a time stamp. If the input sample is broken into multiple output samples, only the first output sample should have a time stamp, and the rest should have no time stamps.

Examples

Example 1. Suppose that a video effect always takes an uncompressed input frame, applies the effect, and copies it to the output. It never holds back any frames or buffers any input. This MFT simply copies the time stamp and duration from the input sample to the output sample, if they are available, and does no time calculations at all.

Example 2. Suppose that an audio effect transforms all but 10 milliseconds (ms) of each input buffer, saving the extra 10 ms to combine with the next buffer. It gets a stream of samples that all have a duration of 50 ms. The input times are shown in the following table.

Sample   Input time   Input duration   Output time   Output duration
1        20           50               20            40
2        70           50               60            50
3        121          50               110           50
4        171          50               161           50

All values are in milliseconds.

Note the 1-ms discrepancy between the actual duration of sample 2 and the implied duration based on the next time stamp (121 − 70 = 51).

Because the MFT holds back 10 ms, it outputs the first 40 ms of input sample 1 as output sample 1, with a time stamp of 20 ms and a duration of 40 ms.

Output sample 2 combines the 10 ms previously held back with 40 ms of input sample 2. This sample is given a time stamp of 60 ms (the time stamp of the previous input sample, 20 ms, plus the duration of the data already processed from that sample, 40 ms). It is given a duration of 50 ms.

Similarly, the next sample has a time stamp of 110 ms (70 ms + 40 ms) with a duration of 50 ms.

The next calculation is more interesting. The implied time stamp from the previous output time and duration would be 160 ms (time stamp 110 ms + duration 50 ms). However, the output time stamp is supposed to be calculated from the input time stamp of the earliest input sample that overlaps the output sample in time, plus the length of any data already processed from that sample. The closest overlapping input sample is sample 4 (time stamp = 171), but this is not the earliest one. The earliest overlapping sample is sample 3 (time stamp = 121). Adding the 40 ms that has already been processed from that sample, the result is 161 ms.

Discontinuities

A discontinuity is a break in an audio or video stream. Discontinuities can be caused by dropped packets on a network connection, corrupt file data, a switch from one source stream to another, or a wide range of other causes. Discontinuities are signaled by setting the MFSampleExtension_Discontinuity attribute on the first sample after the discontinuity. It is not possible to signal a discontinuity in the middle of a sample. Therefore, any discontinuous data should be sent in separate samples.

Some transforms, especially those that handle uncompressed data, such as audio and video effects, should ignore discontinuities when they process input data. These MFTs are generally designed to handle continuous data, and should treat any data they receive as continuous, even after a discontinuity.

If an MFT ignores a discontinuity on input data, it should still set the discontinuity flag on the output sample, if the output sample has the same time stamp as the input sample. If the output sample has a different time stamp, however, the MFT should not propagate the discontinuity. (This would be the case in some audio resamplers, for example.) A discontinuity at the wrong place in the stream is worse than no discontinuity.

Most decoders cannot ignore discontinuities, because a discontinuity affects the interpretation of the next sample. Any encoding technology that uses inter-frame compression, such as MPEG-2, falls into this category. Other encoding schemes, such as DV and MJPEG, use only intra-frame compression; decoders for those formats can safely ignore discontinuities.

Transforms that respond to discontinuities should generally output as much of the data before the discontinuity as they can, and discard the rest. The input sample with the discontinuity flag should be processed as though it were the first sample in the stream. (This behavior matches what is specified for the MFT_MESSAGE_COMMAND_DRAIN message. For more information, see IMFTransform::ProcessMessage.) However, the exact details will depend on the media format.

If a decoder does nothing to mitigate a discontinuity, it should copy the discontinuity flag to the output data. Demultiplexers and other MFTs that work entirely with compressed data must copy any discontinuities to their output streams. Otherwise, the downstream components may not be able to decode the compressed data correctly. In general, it is almost always correct to pass discontinuities downstream, unless the MFT contains explicit code to smooth out discontinuities.

Sample Attributes

The input samples might have attributes that must be copied to the corresponding output samples.

  • If the MFT returns VARIANT_TRUE for the MFPKEY_EXATTRIBUTE_SUPPORTED property, the MFT must copy the attributes.

  • If the MFPKEY_EXATTRIBUTE_SUPPORTED property is either VARIANT_FALSE or is not set, the client must copy the attributes.

For an MFT with one input and one output, you can use the following general rule:

  • If each input sample produces exactly one output sample, you can let the client copy the attributes. Leave the MFPKEY_EXATTRIBUTE_SUPPORTED property unset.

  • If there is not a one-to-one correspondence between input samples and output samples, the MFT must determine the correct attributes for output samples. Set the MFPKEY_EXATTRIBUTE_SUPPORTED property to VARIANT_TRUE.

IMF2DBuffer Support

If an MFT processes uncompressed video data, it should use the IMF2DBuffer interface to manipulate the sample buffers. To get this interface, call QueryInterface on the IMFMediaBuffer interface of any input or output buffer. Not using this interface when it is available may result in additional buffer copies. To make proper use of this interface, the transform should not lock the buffer through the IMFMediaBuffer interface when IMF2DBuffer is available.

For more information about processing video data, see Uncompressed Video Buffers.

Creating Hybrid DMO/MFT Objects

The IMFTransform interface is loosely based on IMediaObject, which is the primary interface for DirectX Media Objects (DMOs). It is possible to create objects that expose both interfaces. However, this can lead to naming collisions, because the interfaces have some methods that share the same name. You can solve this problem in one of two ways:

Solution 1: Include the following line at the top of any .cpp file that contains MFT functions:

#define MFT_UNIQUE_METHOD_NAMES

This changes the declaration of the IMFTransform interface so that all of the methods are prefixed with MFT. Thus, IMFTransform::ProcessInput becomes IMFTransform::MFTProcessInput, while IMediaObject::ProcessInput keeps its original name. This technique is most useful if you are converting an existing DMO to a hybrid DMO/MFT. You can add the new MFT methods without changing the DMO methods.

Solution 2: Use C++ syntax to disambiguate names that are inherited from more than one interface. For example, declare the MFT version of ProcessInput as follows:

CMyHybridObject::IMFTransform::ProcessInput(...)

And declare the DMO version of ProcessInput like this:

CMyHybridObject::IMediaObject::ProcessInput(...)

If you make an internal call to a method within the object, you can use this syntax, but doing so will override the virtual status of the method. A better way to make calls from inside the object is the following:

hr = ((IMediaObject*)this)->ProcessInput(...);

That way, if you derive another class from CMyHybridObject and override the CMyHybridObject::IMediaObject::ProcessInput method, the correct virtual method is called. The DMO interfaces are documented in the DirectShow SDK documentation.

See Also

Media Foundation Transforms

Build date: 4/27/2009