Windows Advanced Rasterization Platform (WARP) In-Depth Guide

Andy Glaister, Principal Development Lead
Desktop and Graphics Technologies
November 2008

What is WARP?

WARP10 is a new component that will be part of DirectX graphics technology in Windows 7. WARP10 is a high speed, fully conformant software rasterizer. It is shipping in beta form in the November 2008 DirectX SDK. WARP stands for Windows Advanced Rasterization Platform. This paper describes the following aspects of WARP10.

Goals
Capabilities
How to Use WARP
Priorities
Target Customers
Architecture
Conformance
Performance

Goals

WARP10 has six key goals:

  • Completing the Platform
  • Replacing the Need for Custom Software Rasterizers
  • Enabling Maximum Performance from Hardware
  • Enabling Rendering When Direct3D 10 Hardware Is Not Available
  • Leveraging Existing Resources for Software Rendering
  • Enabling Scenarios that don’t require Graphics Hardware

Completing the Platform

One of the major advances in Direct3D 10 was the removal of capability bits (caps). This removal allows developers to take advantage of all the features of a wide range of video cards, knowing that their application will behave and look the same everywhere. The performance of these applications can be scaled by simply disabling expensive graphics features on low end cards or rendering to smaller targets. WARP10 contributes to this ‘No Caps’ goal by giving developers access to all Direct3D 10 graphics features, even on machines without Direct3D 10 graphics hardware.

Replacing the Need for Custom Software Rasterizers

WARP simplifies development by removing the need to spend time building a custom software rasterizer and tuning your application for it instead of for hardware. With a single, conformant, general purpose software rasterizer, there is no longer a need to write image rendering algorithms in multiple ways simply to run on hardware or software with different features and capabilities. Developers can still choose to implement algorithms in multiple ways if it would help achieve better performance or scaling; however, they do not need to change the API or rendering architecture used to implement those algorithms. Instead, a developer can focus on creating a great Direct3D 10 application and know that it will look the same and perform well on hardware or in software.

Enabling Maximum Performance from Hardware

When an application is tuned to run efficiently on hardware, it will run efficiently on WARP10 as well. The converse is also true: any application that is tuned to run well on WARP10 will perform extremely well when running on hardware. Using Direct3D 10 inefficiently can limit the ability of your application to scale across different hardware. WARP10 has a performance profile very similar to that of hardware, so tuning an application to use large batches, minimize state changes, and remove synchronization points or locks will benefit both hardware and WARP10.
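As a rough illustration of the "minimize state changes" advice, the sketch below sorts a list of draw calls by a render-state key so that draws sharing the same state are submitted back to back. The names DrawCall, stateKey, countStateChanges and sortByState are hypothetical and exist only for this example; they are not part of the Direct3D 10 API.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical draw record: stateKey stands in for whatever combination of
// shaders, textures and pipeline state must be bound before the draw.
struct DrawCall {
    std::uint32_t stateKey;
    std::uint32_t meshId;
};

// Counts how many times the bound state would change if the draws were
// submitted in the given order (the first draw always binds state once).
std::size_t countStateChanges(const std::vector<DrawCall>& draws) {
    std::size_t changes = 0;
    for (std::size_t i = 0; i < draws.size(); ++i) {
        if (i == 0 || draws[i].stateKey != draws[i - 1].stateKey) {
            ++changes;
        }
    }
    return changes;
}

// Groups draws with identical state next to each other. stable_sort keeps
// the original submission order within each state group.
void sortByState(std::vector<DrawCall>& draws) {
    std::stable_sort(draws.begin(), draws.end(),
                     [](const DrawCall& a, const DrawCall& b) {
                         return a.stateKey < b.stateKey;
                     });
}
```

Reordering like this is only safe for draws whose results do not depend on submission order (for example, opaque geometry with depth testing), so an application would normally apply it per pass rather than to a whole frame.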

Enabling Rendering When Direct3D 10 Hardware is Not Available

WARP allows fast rendering in a variety of situations where hardware implementations are unavailable, including:

  • When the user does not have any Direct3D capable hardware
  • When running as a service or in a server environment
  • When no video card is installed
  • When a video driver is not available, or is not working correctly
  • When a video card is out of memory, hangs or would take too many system resources to initialize.

Leveraging Existing Resources for Software Rendering

There is a huge community, many books, web sites, SDKs, samples, white papers, mailing lists and other resources that can help you take advantage of Direct3D 10 shader based image rendering. With WARP10 as a software fallback you can leverage existing, hardware-oriented knowledge to make your application better when running in hardware or software. In addition there are many excellent tools from the graphics card vendors and in the DirectX SDK that can help you design, build, develop, debug and analyze performance issues with graphics applications. These tools and knowledge can now benefit application development targeting both hardware and software when using WARP10.

Enabling Scenarios that don’t require Graphics Hardware

Image processing algorithms, printing, remoting, Virtual PCs and other emulators, high quality font rendering, charts and graphs: these algorithms and applications have typically been optimized for the CPU because they don’t want to, or can’t, take a dependency on hardware. With WARP10, a single architecture can accomplish these tasks fully in software, yet still take advantage of hardware acceleration if it’s available.

Capabilities

  • Fully supports all Direct3D 10 and 10.1 features
    • Fully supports all the precision requirements of the Direct3D 10 and 10.1 specification
    • Supports Direct3D 11 when used with FeatureLevel 9_1, 9_2, 9_3, 10_0 and 10_1
    • Supports all optional texture formats, such as multi-sample render targets and sampling from float surfaces.
    • Supports anti-aliased, high quality rendering up to 8x MSAA.
    • Supports anisotropic filtering
    • Supports 32 and 64 bit applications as well as large address aware 32 bit applications.
  • The minimum specification for WARP10 is the same as Windows Vista, specifically:
    • Minimum 800MHz CPU.
    • MMX, SSE or SSE2 is *not* required
    • Minimum 512MB of RAM.

How to Use WARP

The beta Direct3D 10, 10.1 and 11 components in the November SDK have an additional driver type that can be specified during CreateDevice – this is D3D10_DRIVER_TYPE_WARP. When this is specified, a WARP10 device will be created and no hardware device will be initialized.
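A minimal sketch of how an application might prefer hardware and fall back to WARP10 follows. It assumes the Direct3D 10.1 entry point D3D10CreateDevice1 from d3d10_1.h; the beta SDK described here may expose the driver type slightly differently. DriverKind, createWithFallback and createDeviceHardwareThenWarp are hypothetical names made up for this example. The Windows-specific part is guarded so the fallback logic can be compiled and exercised on its own.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

#if defined(_WIN32)
#include <d3d10_1.h>  // link with d3d10_1.lib
#endif

// Driver kinds this sketch knows about (hypothetical helper type).
enum class DriverKind { Hardware, Warp };

// Tries each driver kind in order and returns the index of the first one for
// which tryCreate succeeds, or -1 if none could be created.
int createWithFallback(const std::vector<DriverKind>& order,
                       const std::function<bool(DriverKind)>& tryCreate) {
    for (std::size_t i = 0; i < order.size(); ++i) {
        if (tryCreate(order[i])) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

#if defined(_WIN32)
// Creates a Direct3D 10.1 device, preferring hardware and falling back to
// WARP10. Returns nullptr if neither succeeds; the caller must Release()
// the returned device.
ID3D10Device1* createDeviceHardwareThenWarp() {
    ID3D10Device1* device = nullptr;
    auto tryCreate = [&device](DriverKind kind) {
        const D3D10_DRIVER_TYPE type = (kind == DriverKind::Warp)
            ? D3D10_DRIVER_TYPE_WARP
            : D3D10_DRIVER_TYPE_HARDWARE;
        const HRESULT hr = D3D10CreateDevice1(
            nullptr,                   // default adapter
            type,
            nullptr,                   // no software rasterizer module
            0,                         // no creation flags
            D3D10_FEATURE_LEVEL_10_1,
            D3D10_1_SDK_VERSION,
            &device);
        return SUCCEEDED(hr);
    };
    createWithFallback({DriverKind::Hardware, DriverKind::Warp}, tryCreate);
    return device;
}
#endif
```

An application that always wants software rendering, for example when running as a service, could pass only DriverKind::Warp in the order list.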

The November DirectX SDK includes a beta version of the WARP10 rasterizer. Because WARP10 uses the same software interface to Direct3D as the reference rasterizer, any Direct3D 10 or 10.1 application that supports running with the reference rasterizer can be tested with WARP10. Rename D3D10WARP.DLL to D3D10REF.DLL and place it in the same folder as the sample or application; when you switch to the reference rasterizer (ref), you will see WARP10 rendering.

If WARP10 is renamed D3D10Ref.DLL and placed in C:\Program Files (x86)\Microsoft DirectX SDK (November 2008)\Samples\C++\Direct3D\Bin\x86, the DirectX samples can all run against WARP10, either by clicking the ‘Toggle Ref’ button in the sample, or running the sample with /ref specified on the command line.

Priorities

The priorities we chose during development were:

  • Conformance, Stability, Reliability
  • Performance
  • Memory Consumption

Our primary goal during WARP10 development was to produce a rasterizer that met or exceeded all the precision and conformance requirements of the Direct3D 10 and 10.1 specifications, while achieving a high level of reliability and stability. If this rasterizer is going to be used as a fallback when hardware is not functioning, it is important that it works in all scenarios and configurations and on all types of machines.

Once those conformance targets were met, performance was an important goal. We wanted to achieve impressive performance, utilizing all the available computing power of modern CPUs, without compromising conformance for speed. We also decided that we would not expose any new APIs, configuration files or registry keys to enable, disable or optimize various features of WARP10. We felt that the Direct3D 10 API provided enough configuration options to allow developers to tune their applications to run more efficiently on a software rasterizer.

Because we use a deferred rendering architecture, we often batch up significant amounts of work to allow multiple CPU cores to execute in parallel with little synchronization. There is a tradeoff between the efficiency we can achieve in multi-threaded scaling and memory consumption. We have attempted to strike a balance of reasonable memory consumption while meeting our scaling efficiency goals. There is also some memory overhead from internal data structures and from storing some of the rasterization resources in more CPU friendly formats; this data has also been tuned to be as efficient as possible.

Target Customers

The target customer base for WARP10, put most concisely, is all applications that can use Direct3D 10 or 10.1. This includes:

Casual Games

Casual games typically have simple rendering requirements but also want the ability to use impressive visual effects that can be hardware accelerated. The majority of the best selling game titles for Windows are either simulations or casual games, neither of which requires high performance graphics, but both styles of games benefit greatly from modern shader based graphics and the ability to scale on hardware if present.

Existing Non-Gaming Applications

There is a large gamut of graphical applications that want to minimize the number of code paths in their rendering layer. WARP10 enables these applications to implement a single Direct3D 10, 10.1, or 11 code-path that can target a very large number of machine configurations.

Advanced Rendering Games

This category covers game developers who want to isolate graphics card or driver specific rendering errors. We believe that all games, even extremely graphically demanding ones, would benefit from being able to render their content using WARP to determine whether any visual artifacts they experience are due to rendering errors in the game or to problems with hardware or drivers.

As mentioned above, the target customers for WARP also include those that may not use Direct3D 10 or 10.1 currently. This includes applications that need to ‘always work’ on all machines, image processing applications that don’t want to write CPU and GPU versions of image processing algorithms, image processing algorithms where speed or use of the GPU is not critical, such as printing, and emulators and virtual environments that are attempting to display advanced 3D graphics.

The goal of the November SDK beta release is to get the WARP10 rasterizer into the hands of developers. We feel confident at this stage of our development that we have achieved a high level of conformance, and we are currently working on improving our performance and multi-core scalability. We are working hard to test WARP10 in more unusual situations and with more varied and diverse applications. Between this beta release and the final release of WARP10 in Windows 7, there should be noticeable improvements in performance and scaling efficiency, as well as fixes for any conformance issues found through feedback on this beta and our continued testing.

Architecture

WARP10 is based on the reference rasterizer codebase, so it shares the same software interface to both Direct3D 10 and DXGI. WARP10 ships in Windows 7 as a DLL named D3D10WARP.DLL, which can be found in the system folder. Two versions of WARP10 are installed on 64 bit machines: an x86 version and an x64 version. The x64 version may run faster in certain circumstances, as the code generator contained in WARP can take advantage of the additional registers that are available when running 64 bit applications.

WARP10 contains two high-speed, real-time compilers. The first is a high level intermediate language compiler that converts HLSL bytecode and the current render state into an optimized stream of vector commands for the GS, VS and PS stages of the pipeline. The second is a high performance Just-In-Time code generator that takes these commands and generates optimized SSE2, SSE4.1, x86 or x64 assembly code.

WARP10 uses the Vista OS thread pool and complex task management and dependency tracking to allow all parts of the rendering pipeline to be distributed efficiently across available CPU cores.

WARP10 uses a deferred rendering style architecture where rendering commands are batched and rasterization only occurs when sufficient data is available to use all the CPU resources efficiently. Work on the main application thread is minimized to allow the application to submit commands as quickly as possible. If an application is also multi-threaded, and it also chooses to use the Vista OS thread pool, work will be evenly distributed between WARP10 and the application.

The WARP10 code generator has been tuned to make best use of the CPU architecture of modern machines. WARP10 is able to run on all machines that can run Windows Vista, even if the machine does not support SSE. WARP10 has been optimized to run best on machines that support SSE2. It also contains optimizations for specific architecture and performance differences between AMD and Intel processors as well as extensive support for the SSE 4.1 extensions. Modern AMD Phenom and Intel Penryn quad core CPU’s are extremely efficient at running WARP10.  Future platforms and multi-processor machines are expected to be even more efficient.

WARP10 does not require graphics hardware to execute. This has a number of advantages, as it allows WARP10 to execute even in situations where hardware is not available or cannot be initialized.

Conformance

WARP10 passes all the standard WHQL conformance tests that are used to validate Direct3D hardware devices.

WARP10 has been tested against an extensive suite of Direct3D 10 and 10.1 applications and benchmarks as well as SDK samples from DirectX, NVIDIA and AMD.

We used PIX for Windows extensively in testing WARP10; we have a large library of single frame captures of applications that we compare between hardware and WARP10. The majority of the images appear almost identical between hardware and WARP10. Where small differences occur, we find they are within the tolerances defined by the Direct3D 10 specification.
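A comparison of this kind can be approximated with a simple per-channel difference check. The sketch below is not the tooling used for WARP10 testing; ImageDiff and compareImages are hypothetical names, the images are assumed to be flat arrays of 8-bit channels, and the tolerance is a caller-supplied assumption rather than a value taken from the Direct3D 10 specification.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Summary of the differences between two captures of the same frame.
struct ImageDiff {
    std::size_t differingChannels;  // channels whose difference exceeds the tolerance
    int maxDifference;              // largest absolute per-channel difference seen
};

// Compares two 8-bit-per-channel images stored as flat byte arrays.
// Returns false (leaving 'out' untouched) if the sizes do not match.
bool compareImages(const std::vector<std::uint8_t>& hardware,
                   const std::vector<std::uint8_t>& warp,
                   int tolerance,
                   ImageDiff& out) {
    if (hardware.size() != warp.size()) {
        return false;
    }
    out = ImageDiff{0, 0};
    for (std::size_t i = 0; i < hardware.size(); ++i) {
        const int diff =
            std::abs(static_cast<int>(hardware[i]) - static_cast<int>(warp[i]));
        if (diff > out.maxDifference) {
            out.maxDifference = diff;
        }
        if (diff > tolerance) {
            ++out.differingChannels;
        }
    }
    return true;
}
```

A capture pair would count as matching when differingChannels is zero for the chosen tolerance; maxDifference helps decide whether a mismatch is a rounding difference or a real rendering error.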

Performance

Applications and samples that were designed and built to run on Direct3D 10 hardware without any knowledge of WARP10 should run well when using WARP10; however, it’s often best to lower the quality settings and resolution as much as possible to achieve usable frame rates. If developers are aware of the capabilities and performance of WARP10, it’s possible to develop and tune applications that should run extremely well on both hardware and software.

During the development of WARP10 we have tried to avoid ‘fast path optimizations’ that might help speed up only some particular benchmarks or applications, but could confuse a developer if they see large differences in performance with only small changes in their application.

As WARP10 makes extensive use of multiple CPU cores, the best performance of the rasterizer will be found on modern quad core CPUs. WARP10 also runs significantly faster on machines with SSE4.1 extensions. We have done significant testing and performance tuning on machines with eight or more cores and SSE4.1, as we believe these high end machines will become more and more common during the lifetime of Windows 7.

When WARP10 is running on the CPU, we are limited compared to a graphics card in a number of ways. The front side bus bandwidth of a CPU is typically around or under 10GB/s, whereas a graphics card often has dedicated memory that provides 20-100GB/s or more of graphics bandwidth. Graphics hardware also has fixed function units that can perform complex and expensive tasks like texture filtering, format decompression or conversions asynchronously with very little overhead or power cost. Performing these operations on a typical CPU is expensive in terms of both power consumption and performance cost in cycles.
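To put the bandwidth gap in perspective, the following back-of-the-envelope sketch computes an upper bound on how many full passes over a render target fit into one second of memory bandwidth. The function names are hypothetical, and the inputs used in the note below are assumptions: an 800x600 target, 4 bytes per pixel, and 1GB taken as 10^9 bytes.

```cpp
#include <cstdint>

// Bytes touched by one full pass over a render target.
constexpr std::uint64_t bytesPerPass(std::uint64_t width,
                                     std::uint64_t height,
                                     std::uint64_t bytesPerPixel) {
    return width * height * bytesPerPixel;
}

// Upper bound on full-target passes per second at a given memory bandwidth.
// Ignores caches, latency and all other memory traffic, so real figures
// will be lower.
constexpr std::uint64_t maxPassesPerSecond(std::uint64_t bandwidthBytesPerSecond,
                                           std::uint64_t passBytes) {
    return bandwidthBytesPerSecond / passBytes;
}
```

Under those assumptions a single pass touches 1,920,000 bytes, so 10GB/s allows at most about 5,200 passes per second while 100GB/s allows about 52,000. Every texture fetch, depth test and blend draws on the same budget, which is why bandwidth-heavy effects are comparatively more expensive in software.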

A number of Direct3D 10 applications have built in benchmarking modes, which have been useful during WARP10 development. We have used a number of these applications as workloads to help tune the performance of the rasterizer with more ‘real world examples’.

The typical performance numbers we are seeing on an Intel Penryn based 3.0GHz Quad Core machine show that WARP10 can in some cases even outperform low end integrated Direct3D 10 GPUs on a number of benchmarks!

Low end discrete graphics hardware is typically 4-5x faster than WARP10 at running these benchmarks and, of course, these GPUs make minimal use of CPU resources as well. Mid-range or high-end graphics cards are significantly faster than WARP10 for many applications, particularly when an application can take advantage of the massive parallelism and memory bandwidth these graphics cards provide.

We don’t see WARP10 as a replacement for graphics hardware, particularly as reasonably performing low end Direct3D 10 discrete hardware is now available for under $25. The goal of WARP10 was to allow applications to target Direct3D 10 level hardware without having significantly different code paths or testing requirements when running on hardware or when running in software.

Example data:

Running Direct3D 10 Crysis at 800x600 with all quality settings at their lowest:

CPU                           Time (s)   Ave FPS   Min FPS   Min Frame   Max FPS   Max Frame
Core i7 8 Core @ 3.0GHz         271.57      7.36      3.46        1966     15.01         995
Penryn 4 Core @ 3.0GHz          351.35      5.69      2.49        1967     10.95         980
Penryn 2 Core @ 3.0GHz          573.98      3.48      1.35        1964      6.61         988
Core 2 Duo @ 2.6GHz             707.19      2.83      0.81        1959      5.18         982
Core 2 Duo @ 2.4GHz             763.25      2.62      0.76        1964      4.70         984
Core 2 Duo @ 2.1GHz             908.87      2.20      0.64        1965      3.72         986
Xeon 8 Core @ 2.0GHz            424.04      4.72      1.84        1967      9.56         988
AMD FX74 4 Core @ 3.0GHz        583.12      3.43      1.41        1967      5.78         986
Phenom 9550 4 Core @ 2.2GHz     664.69      3.01      0.53        1959      5.46         987

For comparison, this is the same test running on a variety of graphics hardware:

Graphics Card                 Time (s)   Ave FPS   Min FPS   Min Frame   Max FPS   Max Frame
NVIDIA 8800 GTS                  23.58     84.80     60.78        1957    130.83        1022
NVIDIA 8500 GT                   47.63     41.99     25.67        1986     72.57         991
NVIDIA Quadro 290                67.16     29.78     18.19        1969     49.87        1017
NVIDIA 8400 GS                   59.01     33.89     21.22        1962     51.82        1021
ATI 3400                         53.79     37.18     22.97         618     59.77        1021
ATI 3200                         67.19     29.77     18.91        1963     45.74         980
ATI 2400 PRO                     67.04     29.83     17.97         606     45.91         987
Intel DX10 Integrated           386.94      5.17      1.74        1974     16.22         995

© 2009 Microsoft Corporation. All rights reserved.