GPU Encoding in Expression Encoder 4 Pro SP1

Author: Eric Juteau

About the Author: Eric Juteau is a Principal Software Design Engineer in Test on the Microsoft Expression Encoder team and an avid enthusiast of everything related to video.


NOTE: Updated information about this feature is available in the GPU Encoding in Expression Encoder 4 Pro SP2 whitepaper.

Introduction

In this white paper, we will explore why GPU encoding is important, then cover step-by-step instructions for using Nvidia CUDA-accelerated GPUs for encoding in Expression Encoder Pro, limitations and best practices, recommended PC configurations, and troubleshooting.

What is GPU Encoding?

Traditional video transcoding runs similarly to a water pipeline: uncompressed video frames first flow out of the decoder component, are then processed (resized, de-interlaced, etc.) and finally are sent to the encoder component of the pipeline. If one of these operations takes significantly longer than the others, the system gets clogged and the flow of video frames slows down across the whole pipeline.

image: Transcoding pipeline
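The pipeline analogy can be sketched in a few lines: the whole pipeline runs no faster than its slowest stage. The stage names follow the description above, while the function name and the frames-per-second figures are hypothetical, chosen only for illustration and not measured with Expression Encoder.

```python
def pipeline_throughput(stage_fps):
    """Return (bottleneck stage, frames per second of the whole pipeline).

    stage_fps maps a stage name to the rate that stage could sustain alone.
    """
    bottleneck = min(stage_fps, key=stage_fps.get)
    return bottleneck, stage_fps[bottleneck]


# Illustrative numbers only (assumptions, not benchmark results).
cpu_only = {"decode": 120.0, "pre-process": 60.0, "encode": 20.0}
print(pipeline_throughput(cpu_only))      # ('encode', 20.0)

# A faster encoder stage lifts throughput until another stage becomes the limit.
faster_encode = {"decode": 120.0, "pre-process": 60.0, "encode": 90.0}
print(pipeline_throughput(faster_encode))  # ('pre-process', 60.0)
```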

With the current PCs on the market, the usual bottleneck of the transcoding process resides in that last encoder component, even on PCs with many processors. This is where the GPU comes into play. The GPU (graphics processing unit), typically found on graphics cards and used for drawing pixels on your computer screen, is a hardware resource that is also extremely efficient at doing many tasks in parallel and is very advantageous for video encoding.

While Expression Encoder can take advantage of the GPU's power in the editing phase for decoding, pre-processing and rendering the preview window, the GPU has been idle at encoding time in our earlier versions. Wouldn't it be great to harness its power to encode too? That is exactly what we introduced in Expression Encoder 4 Pro SP1 by integrating the new MainConcept H.264 CUDA encoder into our encoding pipeline. The result: encoding performance improves by a factor of 2-3x over CPU alone for offline cases, and some Live Smooth Streaming scenarios that weren't possible with software-based encoding become feasible, like encoding 3 or 4 HD streams in real time using a high-end laptop.

Today the GPU is used only for the video encoding process, leaving the CPU to focus on the decoding and pre-processing. Therefore we recommend matching the CPUs and GPUs accordingly. Please note that our team is investigating ways to make even better use of the GPU in the future.

GPU Encoding Usage

It is worth noting that GPU encoding is enabled by default in the Expression Encoder application, but it is disabled by default in the Encoder SDK. This discrepancy was introduced to ensure visibility of the feature while keeping API parity with the Encoder SDK. The GPU settings in the Encoder application can be found in the Tools->Options dialog in the Other tab (see below) and will be applied to all future encodes:

image: GPU settings in Expression Encoder 4

Expression Encoder includes support for multiple CUDA-enabled GPUs; these are all listed and can be enabled or disabled individually. If more than one GPU is available, video streams will be distributed equally to each GPU in a snake pattern (1-2-2-1, 1-2-3-3-2-1) from the highest resolution stream down, up to the number set in the “GPU Streams” setting. For example, if an 8-stream Smooth Streaming encode is started on a PC with two available CUDA GPUs, the first GPU would get streams 1, 4 and 5, while the second GPU would get streams 2 and 3, and the last and smallest three streams would be handled by the CPU.
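The snake distribution can be sketched as follows. This is an illustration of the pattern described above, not the encoder's actual implementation: the function name is hypothetical, the limit of 5 GPU streams is inferred from the 8-stream example, and repeating the forward-and-back cycle past its first pass is an assumption that happens to match that example.

```python
def assign_streams(num_streams, num_gpus, max_gpu_streams):
    """Map 1-based stream numbers (1 = highest resolution) to 'GPU n' or 'CPU'."""
    # One full snake cycle: forward, then backward, e.g. [1, 2, 2, 1] for two GPUs.
    cycle = list(range(1, num_gpus + 1)) + list(range(num_gpus, 0, -1))
    assignment = {}
    for stream in range(1, num_streams + 1):
        if num_gpus > 0 and stream <= max_gpu_streams:
            assignment[stream] = f"GPU {cycle[(stream - 1) % len(cycle)]}"
        else:
            # Streams beyond the GPU stream limit fall back to the CPU.
            assignment[stream] = "CPU"
    return assignment


# The example from the text: 8 streams, two GPUs, an assumed limit of 5.
# Streams 1, 4, 5 -> GPU 1; streams 2, 3 -> GPU 2; streams 6-8 -> CPU.
for stream, device in assign_streams(8, 2, 5).items():
    print(stream, device)
```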

While it is advisable to leave the “GPU Streams” setting to its default since it is the best GPU/CPU mix for performance, you can experiment with it to tailor the setting to your hardware configurations.

You can view how many streams are actually being encoded by the GPU in the Transcoding progress dialog as well as in the Statistics tab in Live mode, taking into account the current status of the GPU, the “GPU Stream” setting as well as other limitations described further below:

Transcoding progress dialog:

image: Transcoding progress dialog box

Live Statistics tab:

image: Live Statistics tab

As for the SDK, one line of code enables GPU encoding:

       H264EncodeDevices.EnableGpuEncoding = true;

Similarly, you can enable or disable a CUDA device and control the “GPU Stream” number with static APIs in H264EncodeDevices:

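       // Enable (or disable) the first CUDA device in the list.
       H264EncodeDevices.Devices[0].Enabled = true;
       // Set the maximum number of streams handed to the GPU(s).
       H264EncodeDevices.MaxNumberGPUStream = 5;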

These APIs affect the current instance of the application and, unlike their UI counterparts, will not persist.

Finally, information about whether each stream is encoded via the GPU or CPU can be accessed via the Job.EncodeProgress event.

Limitations

  1. GPU encoding is currently only supported on Nvidia GeForce, Quadro and Tesla-class boards with CUDA v1.1 or higher. More information about Nvidia CUDA technology is available here. Note that direct access to CUDA services is required, which makes running under remote desktop or as a service only possible using the Tesla-class boards (see below for more information). Note: Nvidia Kepler class GPU hardware (GeForce 6XX, Quadro K-series and Tesla K-series) is not supported in Expression Encoder 4.
  2. At the moment, only H.264 encodes are supported. This includes MP4 and H.264-based IIS Smooth Streaming in both Live and Transcoding modes.
  3. Multiple GPUs will only be useful on multiple stream encode operations. In other words, only one GPU will be effectively used on a single stream encode.
  4. GPU encoding has a performance overhead that doesn't pay off for small encodes of roughly 200 scan lines or fewer, which are better served by the CPU. For this reason, any video stream containing fewer than 204 scan lines is automatically encoded by the CPU.
  5. Adaptive B-Frames and Reference B-Frames are not currently supported by the GPU-based H.264 encoder, so those settings are ignored.
  6. Output quality differences between GPU-assisted encoding and pure CPU-based encoding should be expected since they are implemented very differently. For example, the GPU output doesn’t blur as much, and thus is harder to encode, which in turn can introduce more compression artifacts if the bitrate isn’t adequate.
  7. Because of memory constraints, we also recommend running memory-bound scenarios like HD IIS Smooth Streaming encodes only in 64-bit environments with 4GB or more RAM. For applications using the SDK, we also strongly recommend using the LARGEADDRESSAWARE setting so that the application can use as much memory as possible. See this post for more information.
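Limitations 2 and 4 amount to a simple per-stream eligibility check, sketched below. The function and constant names are hypothetical and the codec strings are illustrative labels; only the H.264 requirement and the 204-scan-line threshold come from the list above.

```python
# Streams with fewer scan lines than this are encoded by the CPU (limitation 4).
MIN_GPU_SCAN_LINES = 204


def can_use_gpu(codec, scan_lines, gpu_enabled=True):
    """Return True if a stream meets the conditions for GPU encoding."""
    return gpu_enabled and codec == "H.264" and scan_lines >= MIN_GPU_SCAN_LINES


print(can_use_gpu("H.264", 720))   # True
print(can_use_gpu("H.264", 200))   # False: too few scan lines
print(can_use_gpu("VC-1", 1080))   # False: only H.264 is supported
```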

How to Make the Most of GPU Encoding

In most cases, using the GPU removes the encoding bottleneck from the transcoding pipeline. Unfortunately, the main bottleneck will then very likely move to another part of the pipeline. Of course, we are continuing to investigate new ways to eliminate processing bottlenecks throughout the video pipeline.

Here are some guidelines to see the best performance results with GPU encoding:

  • Ensure the right GPU is used for the right job: Tesla for high-performance server / data center, Quadro for professional workstation, and GeForce for consumer and enthusiast PC.
  • Make sure to get the latest drivers for your video card directly from the Nvidia website.
  • GPU encoding requires that your GPU and CPU operate at the peak of their performance capabilities, so you should make sure that you have proper air flow in and around your PC as well as an adequate power supply for the system.
  • Pair the appropriate GPU with the PC: high-end PCs with high-end GPUs, low-end PCs with low-end GPUs. For example, you may not see significantly better performance from adding a $500 GPU to a low-end dual-core PC than from adding a $100 one, mainly because the CPU will be the bottleneck in both cases. On the other hand, enabling GPU encoding on a top-end dual quad-core Xeon with a low-end GPU might actually slow down the encoding.
  • Ensure that the GPU has enough memory (discrete cards perform best), especially to encode to HD. Video cards with 1GB or more are recommended.
  • Tweak the pre-processing settings: Expression Encoder defaults to the highest quality pre-processing algorithms (“Auto adaptive” de-interlacing, “Super Sampling” resizing, etc.). More than likely, the new encoding pipeline bottleneck will be in the pre-processing, so switching the resizing algorithm to bicubic, for example, could greatly speed up the encoding process. Note that in Encoder 4 SP1, we’ve also included “Selective Blend”, a new de-interlacer, which we strongly recommend using over Pixel Adaptive for better performance.
  • Our performance tests show that multiple GPUs actually perform better without being connected to each other via an SLI connector, and thus it is most likely preferable to remove the connector. Please refer to the video card manufacturer’s instructions for more information.
  • It is suggested to leave the “Lower process priority during encode” checkbox in the Options dialog unchecked to ensure proper GPU and CPU load balancing.

RDP and “Running As a Service” Issues

Any application using CUDA, including Encoder and applications using its SDK, needs direct access to the CUDA device. As a result, with a few exceptions discussed below, most types of video driver virtualization will disable the GPU encoding feature. That includes scenarios like VPC, Hyper-V, remote desktop (RDP) and running as a service.

It is worth noting that once the CUDA device has been instantiated by a process, it will be kept available to it. This means for example that you could start an encode operation on a local workstation, lock it and RDP to it later and while you don't currently have access to the CUDA device within your RDP session, your encoding process still would.

But if you actually need to start the encoding process either remotely or running headless, there are still a few options available, provided that you have the right hardware and software combination.

For remote access/RDP:

  • Using a remote access system which gives access to the GPU like VNC or Live Mesh will work, although it is less optimized and convenient than RDP.
  • [Tesla only] The Tesla Compute Cluster (TCC) driver enables access to the CUDA device through RDP. More information about the TCC driver can be found here.

For running as a service:

  • [Tesla only] The Tesla Compute Cluster (TCC) driver enables access to the CUDA device when running as a service. More information about the TCC driver can be found here.

How to enable the Tesla Compute Cluster (TCC) driver

  1. Make sure you have the latest Tesla driver available (263.06 or better). This is crucial, as older drivers will silently fail the next steps.
  2. In an administrator command prompt, navigate to %ProgramFiles%\NVIDIA Corporation\NVSMI and run this operation:

    nvidia-smi -g 0 -dm 1        (where 0 is the device id of the Tesla card)

  3. After TCC mode is successfully turned on, a full reboot is required.
  4. Repeat all operations upon any hardware modification that could change the device order.
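Because the command has to be repeated whenever the device order changes, it can be convenient to script it. The sketch below only wraps the command from step 2; the function names are hypothetical, and it assumes the script runs from an administrator prompt with nvidia-smi reachable from the current directory or PATH.

```python
import subprocess


def tcc_command(device_id):
    """Build the nvidia-smi argument list from step 2 for one device id."""
    if device_id < 0:
        raise ValueError("device id must be non-negative")
    return ["nvidia-smi", "-g", str(device_id), "-dm", "1"]


def enable_tcc(device_id):
    """Run the command; a full reboot is still required afterwards (step 3)."""
    # check=True raises CalledProcessError if nvidia-smi reports a failure.
    return subprocess.run(tcc_command(device_id), check=True)


print(" ".join(tcc_command(0)))  # nvidia-smi -g 0 -dm 1
```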

More information about the TCC driver is available here.

This is obviously a hot topic on the bleeding edge and it is safe to say that many options will become available in the near future regarding these limitations.

Suggested PC Configurations

Since the CPU is still very much in use to decode the sources and execute the pre-processing operations, having a CPU with multiple cores would definitely help the GPU encoding process. Of course, performance greatly varies depending on the sources and settings selected, so the suggestions below should only be taken as rough guidelines.

Memory and Storage

Unless multiple encoding processes are done in parallel, 6-8 GB of RAM is sufficient since Encoder is limited to 32-bit memory addressing.

In some cases, especially with high-bitrate sources (25 Mbps+), hard drive throughput can be a bottleneck, so using SSDs and/or RAID arrays in stripe mode would definitely help. Either way, we highly recommend using local storage for both source and target files for best performance.

CPU and GPU

First, ensure the right GPU is being used for the right job: Tesla for high-performance server / data center, Quadro for professional workstation, and GeForce for consumer and enthusiast PC. As a general rule, the GPU encoding performance is mostly driven by the number of CUDA cores. For example, the 48 cores the GeForce GT220 has are dwarfed by the Tesla C2050’s 448 cores, which will perform considerably better.

Scenario #1: One-stream MP4 encoding on home PC/laptop

Almost every PC will benefit from having an appropriately paired GPU, improving performance by a factor of 1.5-3x over CPU encoding alone. A GeForce-class board with 1GB of memory and a minimum of 128 CUDA cores is recommended. Of course, the more CUDA cores, the better.

Scenario #2: Live encoding / Smooth Streaming workstation

One or two mid- to high-end Quadro boards or a Quadro/Tesla combination would nicely match a high-end Core i7-based workstation, typically encoding 2-3x faster.

Scenario #3: Enterprise-level live/offline encoding server

Depending on the performance needs, one to two high-end Tesla boards are appropriate to accelerate a high-end dual Xeon server blade, tripling its HD encoding speed, as well as enabling you to encode as a service or control the encoding remotely via RDP.

To give you a better idea of appropriate hardware and associated performance in these kinds of scenarios, we’ve tested several machines to compare GPU-assisted encoding to CPU-only encoding. Please see this performance report for more information. The Live Performance Tool has also been updated to support GPU encoding. 

Troubleshooting

Here are a few solutions to some common problems with GPU encoding:

Problem: When encoding, GPU stream count is always 0.

Solutions:

  1. Verify that Expression Encoder 4 Pro is installed and that the project is set to encode H.264.
  2. Make sure that the latest version of the video driver from the Nvidia website is installed.
  3. Verify that the Nvidia GPU supports CUDA v1.1 or higher by going to the Tools->Options->Other dialog and trying to check the GPU Encoding checkbox. More information about your GPU can be discovered by using tools like GPU-Z or CUDA-Z.
  4. Verify that at least one stream has 204 scan lines or more, since smaller streams are always encoded by the CPU.

Problem: GPU encoding fails with “Not enough memory” error.

Solutions:

  1. Try again after closing all applications other than Expression Encoder Pro.
  2. Make sure that the latest version of the video driver from the Nvidia website is installed.
  3. If encoding multiple streams, remove a few streams in the video profile and/or reduce the “GPU Stream” setting in the Tools->Options->Other dialog.
  4. If all else fails, encode this project with GPU encoding turned off.

Problem: PC becomes unresponsive, video driver crashes, OS bluescreens or reboots.

Solutions:

  1. Make sure that the latest version of the video driver from the Nvidia website is installed.
  2. Verify that there is proper air flow in and around your PC, as well as an adequate power supply for the system. External tools like PC Wizard and GPU-Z can be used to monitor the temperature of each component and help identify an air flow problem.

Problem: GPU encoding takes longer to encode than expected.

Solutions:

  1. Try again after closing all applications other than Expression Encoder Pro.
  2. Make sure that the latest version of the video driver from the Nvidia website is installed.
  3. Monitor both CPU and GPU activity using tools like Task Manager and GPU-Z:
    1. If the GPU load is very high and CPU usage is low: reduce the number of streams encoded by the GPU by tweaking the “GPU Stream” setting in the Tools->Options->Other dialog to reduce the GPU bottleneck on multi-stream encodes. Upgrading the GPU would be another option to maximize performance.
    2. If the GPU isn’t running at its maximum clock speed during encoding: make sure the OS performance setting is set to high. In some instances, the GPU may be “throttled” down by the system to limit power consumption, which may explain low performance. In this case, please contact your PC manufacturer to inquire about possible solutions.
    3. If the CPU load is very high and GPU usage is low: consider changing the pre-process options (de-interlacing, resizing, etc.) to reduce the CPU load. You could also try to increase the number of streams encoded by the GPU to increase performance.
    4. If both CPU and GPU loads are low: this normally indicates a bottleneck outside of the encoder component, possibly related to the source access or decoding. If possible, use local sources and targets. Some decoders are more efficient than others, so it might be worth trying different codecs by disabling others in the Tools->Options->Compatibility dialog.
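The load-based checks in step 3 can be summarized as a small decision function. This is only a sketch: the function name and the 0.8 / 0.3 thresholds for "very high" and "low" load are assumptions, since the text gives no numeric cut-offs, and the readings themselves would come from tools such as Task Manager and GPU-Z.

```python
def diagnose(cpu_load, gpu_load, high=0.8, low=0.3):
    """Suggest a next step from CPU and GPU load, given as fractions in [0, 1]."""
    if gpu_load >= high and cpu_load <= low:
        return "GPU-bound: reduce the GPU stream count or upgrade the GPU"
    if cpu_load >= high and gpu_load <= low:
        return "CPU-bound: lighten pre-processing or raise the GPU stream count"
    if cpu_load <= low and gpu_load <= low:
        return "Source-bound: use local files or try a different decoder"
    # Anything else is not covered by the load-based cases above.
    return "No clear bottleneck: check GPU clock speed and OS power settings"


print(diagnose(cpu_load=0.10, gpu_load=0.95))
print(diagnose(cpu_load=0.90, gpu_load=0.20))
print(diagnose(cpu_load=0.15, gpu_load=0.10))
```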