HLSL Shader Model 6.0

Describes the new wave operation intrinsics added to HLSL Shader Model 6.0.

Note: The Shader Model 6.0 specifics described here are a preview and subject to change, pending the release of supporting tools and drivers.
 

Shader model 6.0

For earlier shader models, HLSL programming exposes only a single thread of execution. Starting with model 6.0, new wave-level operations are provided to explicitly take advantage of the parallelism of current GPUs, where many threads can execute in lockstep on the same core simultaneously. For example, the model 6.0 intrinsics enable the elimination of barrier constructs when the scope of synchronization is within the width of the SIMD processor, or some other set of threads that are known to be atomic relative to each other.

Potential use cases include: stream compaction, reductions, block transpose, bitonic sort or Fast Fourier Transforms (FFT), binning, stream de-duplication, and similar scenarios.

Most of the intrinsics are available in both pixel shaders and compute shaders, though there are some exceptions (noted for each function). The functions have been added to the requirements for DirectX Feature Level 12.0, under API level 12.

The <type> parameter and return value for these functions imply the type of the expression. The supported types are those from the following list that are also present in the target shader model for your app:

  • half, half2, half3, half4
  • float, float2, float3, float4
  • double, double2, double3, double4
  • int, int2, int3, int4
  • uint, uint2, uint3, uint4
  • short, short2, short3, short4
  • ushort, ushort2, ushort3, ushort4
  • uint64_t, uint64_t2, uint64_t3, uint64_t4

Some operations (such as the bitwise operators) only support the integer types.

Terminology

| Term | Definition |
| Lane | A single thread of execution. The shader models before version 6.0 expose only one of these at the language level, leaving expansion to parallel SIMD processing entirely up to the implementation. |
| Wave | A set of lanes (threads) executed simultaneously in the processor. No explicit barriers are required to guarantee that they execute in parallel. Similar concepts include "warp" and "wavefront." |
| Current wave | The wave of lanes that the program is currently executing. |
| Inactive Lane | A lane which is not being executed, for example due to the flow of control, or insufficient work to fill the minimum size of the wave. |
| Active Lane | A lane for which execution is being performed. In pixel shaders, it may include any helper pixel lanes. |
| Quad | A set of 4 adjacent lanes corresponding to pixels arranged in a 2x2 square. They are used to estimate gradients by differencing in either x or y. A wave may be comprised of multiple quads. All pixels in an active quad are executed (and may be "Active Lanes"), but those that do not produce visible results are termed "Helper Lanes". |
| Helper Lane | A lane which is executed solely for the purpose of gradients in pixel shader quads. The output of such a lane will be discarded, and so not render to the destination surface. |
| Render Lane | A lane in a pixel shader quad that will emit a write to the destination buffer. A non-helper lane. |

 

Shading language intrinsics

All the operations of this shader model are exposed as a set of intrinsic functions, grouped below by purpose.

Wave Query

The intrinsics for querying a single wave.

| Intrinsic | Description | Pixel shader | Compute shader |
| WaveOnce | Identify operations that must happen only once per wave. | * | * |
| WaveGetLaneCount | Returns the number of lanes in the current wave. | * | * |
| WaveGetLaneIndex | Returns the index of the current lane within the current wave. | * | * |
| WaveIsHelperLane | Returns true if the current lane is a helper pixel of the specified lane. | * | |

 

Wave Vote

This set of intrinsics compares values across the threads of the current wave that are currently active.

| Intrinsic | Description | Pixel shader | Compute shader |
| WaveAnyTrue | Returns true if the expression is true in any active lane in the current wave. | * | * |
| WaveAllTrue | Returns true if the expression is true in all active lanes in the current wave. | * | * |
| WaveAllEqual | Returns true if the expression is the same for every active lane in the current wave (and so is uniform across the wave). | * | * |
| WaveBallot | Returns a 64-bit unsigned integer bitmask of the evaluation of the Boolean expression for all active lanes in the specified wave. | * | * |
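
The vote semantics in the table above can be illustrated with a small CPU-side model. This is a sketch in C++, not HLSL: the VoteLane struct and the ballot, any_true and all_true functions are hypothetical names introduced only for illustration, and the model assumes a wave of at most 64 lanes so that the mask fits the 64-bit result described for WaveBallot.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU-side model of one lane taking part in a vote.
struct VoteLane {
    bool active;  // lane is currently executing
    bool value;   // result of the Boolean expression in this lane
};

// Models the ballot: bit i is set when lane i is active and its expression is true.
// Assumes at most 64 lanes; any lanes beyond that are ignored by this sketch.
std::uint64_t ballot(const std::vector<VoteLane>& wave) {
    std::uint64_t mask = 0;
    for (std::size_t i = 0; i < wave.size() && i < 64; ++i) {
        if (wave[i].active && wave[i].value) {
            mask |= std::uint64_t{1} << i;
        }
    }
    return mask;
}

// Models the "any true" vote: at least one active lane evaluates to true.
bool any_true(const std::vector<VoteLane>& wave) {
    return ballot(wave) != 0;
}

// Models the "all true" vote: no active lane evaluates to false.
bool all_true(const std::vector<VoteLane>& wave) {
    for (const VoteLane& lane : wave) {
        if (lane.active && !lane.value) {
            return false;
        }
    }
    return true;
}
```

In this model an inactive lane never contributes to the result, whatever value it holds, which is the property that distinguishes a wave vote from a plain loop over all lanes.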

 

Wave Broadcast

These intrinsics enable all active lanes in the current wave to receive the value from the specified lane, effectively broadcasting it. The return value from an invalid lane is undefined.

| Intrinsic | Description | Pixel shader | Compute shader |
| WaveReadLaneAt | Returns the value of the expression for the given lane index within the specified wave. | * | * |
| WaveReadFirstLane | Returns the value of the expression for the active lane of the current wave with the smallest index. | * | * |
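
As a rough illustration of the broadcast behavior, the following C++ sketch models a wave on the CPU. It is not HLSL, and BroadcastLane, read_first_lane and read_lane_at are hypothetical names. The sketch reports failure where the text says the result is undefined, and it assumes that both out-of-range and inactive lanes count as invalid, which the source does not spell out.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU-side model of one lane holding an integer expression value.
struct BroadcastLane {
    bool active;  // lane is currently executing
    int value;    // value of the expression in this lane
};

// Models reading the first lane: copies the value of the active lane with the
// smallest index into `out`. Returns false when no lane is active.
bool read_first_lane(const std::vector<BroadcastLane>& wave, int& out) {
    for (const BroadcastLane& lane : wave) {
        if (lane.active) {
            out = lane.value;
            return true;
        }
    }
    return false;
}

// Models reading a given lane. Returning false stands in for the "undefined"
// result of an invalid lane; treating inactive lanes as invalid is an assumption.
bool read_lane_at(const std::vector<BroadcastLane>& wave, std::size_t index, int& out) {
    if (index >= wave.size() || !wave[index].active) {
        return false;
    }
    out = wave[index].value;
    return true;
}
```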

 

Wave Reduction

These intrinsics compute the specified operation across all active lanes in the wave and broadcast the final result to all active lanes. The final output is therefore guaranteed to be uniform across the wave.

| Intrinsic | Description | Pixel shader | Compute shader |
| WaveAllSum | Sums up the value of the expression across all active lanes in the current wave and replicates it to all lanes in the current wave. | * | * |
| WaveAllProduct | Multiplies the values of the expression together across all active lanes in the current wave. | * | * |
| WaveAllBitAnd | Returns the bitwise AND of all the values of the expression across all active lanes in the current wave. | * | * |
| WaveAllBitOr | Returns the bitwise OR of all the values of the expression across all active lanes in the current wave. | * | * |
| WaveAllBitXor | Returns the bitwise Exclusive OR of all the values of the expression across all active lanes in the current wave. | * | * |
| WaveAllMin | Computes the minimum value of the expression across all active lanes in the current wave. | * | * |
| WaveAllMax | Computes the maximum value of the expression across all active lanes in the current wave. | * | * |
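
A reduction can be pictured as a loop over the active lanes whose single result every active lane then observes. The C++ sketch below models that on the CPU for the sum, minimum and maximum cases. It is not HLSL, and ReduceLane, all_sum, all_min, all_max and the `fallback` parameter are hypothetical. The source does not say what a reduction returns when no lane is active, so the sketch simply returns the caller's fallback in that case.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical CPU-side model of one lane holding an integer expression value.
struct ReduceLane {
    bool active;  // lane is currently executing
    int value;    // value of the expression in this lane
};

// Models the sum reduction: adds the values of all active lanes.
// Every active lane would observe this same result.
int all_sum(const std::vector<ReduceLane>& wave) {
    int sum = 0;
    for (const ReduceLane& lane : wave) {
        if (lane.active) {
            sum += lane.value;
        }
    }
    return sum;
}

// Models the minimum reduction over active lanes.
// `fallback` is returned when no lane is active (an assumption of this sketch).
int all_min(const std::vector<ReduceLane>& wave, int fallback) {
    bool seen = false;
    int best = fallback;
    for (const ReduceLane& lane : wave) {
        if (lane.active) {
            best = seen ? std::min(best, lane.value) : lane.value;
            seen = true;
        }
    }
    return best;
}

// Models the maximum reduction over active lanes, with the same fallback rule.
int all_max(const std::vector<ReduceLane>& wave, int fallback) {
    bool seen = false;
    int best = fallback;
    for (const ReduceLane& lane : wave) {
        if (lane.active) {
            best = seen ? std::max(best, lane.value) : lane.value;
            seen = true;
        }
    }
    return best;
}
```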

 

Wave Scan and Prefix

These intrinsics apply the operation to each lane and leave each partial result of the computation in the corresponding lane.

| Intrinsic | Description | Pixel shader | Compute shader |
| WavePrefixSum | Returns the sum of all of the values in the active lanes with smaller indices than this one. | * | * |
| WavePrefixProduct | Returns the product of all of the values in the lanes before this one of the specified wave. | * | * |
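
The description of WavePrefixSum implies an exclusive scan: each lane receives the sum of the values in active lanes with smaller indices, not including its own value. The following C++ sketch models this on the CPU. It is not HLSL, PrefixLane and prefix_sum are hypothetical names, and the zero stored for inactive lanes is only a placeholder, since such lanes produce no result.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU-side model of one lane holding an integer expression value.
struct PrefixLane {
    bool active;  // lane is currently executing
    int value;    // value of the expression in this lane
};

// Models the exclusive prefix sum: result[i] is the sum of the values held by
// active lanes with an index smaller than i. Inactive lanes are skipped and
// keep the placeholder value 0 in the result.
std::vector<int> prefix_sum(const std::vector<PrefixLane>& wave) {
    std::vector<int> result(wave.size(), 0);
    int running = 0;
    for (std::size_t i = 0; i < wave.size(); ++i) {
        if (!wave[i].active) {
            continue;
        }
        result[i] = running;       // sum of earlier active lanes only
        running += wave[i].value;  // this lane counts for later lanes
    }
    return result;
}
```

If each lane passes 1 when it has an element to keep and 0 otherwise, the value returned to a lane is the output slot for its element, which is how a prefix sum supports the stream compaction use case mentioned earlier.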

 

Global Ordered Append

Global ordering is already supported, prior to shader model 6.0, for some UAV write operations via Append/Consume. Exposing it as a separate intrinsic allows its use in more general scenarios.

| Intrinsic | Description | Pixel shader | Compute shader |
| WaveGetOrderedIndex | Returns the index of this wave since the start of the draw/dispatch call. | * | * |
| GlobalOrderedCountIncrement | Returns a per-lane globally ordered count. | * | * |

 

Quad-wide Shuffle operations

These intrinsics perform swap operations on the values across a wave known to contain pixel shader quads as defined here. The indices of the pixels in the quad are defined in scan-line (reading) order, where the coordinates within a quad are:

  • [0] is at [x, y]
  • [1] is at [x+1, y]
  • [2] is at [x, y+1]
  • [3] is at [x+1, y+1]

These routines work in either compute shaders or pixel shaders. In compute shaders they operate in quads defined as evenly divided groups of 4 within an SIMD wave. In pixel shaders they should be used on waves captured by WaveQuadLanes, otherwise results are undefined.

| Intrinsic | Description | Pixel shader | Compute shader |
| QuadReadLaneAt | Returns the specified source value read from the lane of the current quad identified by quadLaneID [0..3] which must be uniform across the quad. | * | * |
| QuadSwapX | Returns the specified source value read from the other lane in this quad in the X direction. | * | * |
| QuadSwapY | Returns the specified source value read from the other lane in this quad in the Y direction. | * | * |
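
With the quad layout listed above, the other lane in the X direction differs only in bit 0 of the quad index, and the other lane in the Y direction differs only in bit 1. The C++ sketch below models this on the CPU for the compute shader case, where quads are consecutive groups of 4 lanes. It is not HLSL, and quad_swap_x, quad_swap_y and quad_read_lane_at are hypothetical names. Lanes that do not fill a complete quad are left unchanged, which is a choice of this sketch and not stated in the source.

```cpp
#include <cstddef>
#include <vector>

// Models a swap in X: each lane reads the value of its horizontal neighbor,
// which is the lane whose index differs in bit 0 (0<->1, 2<->3).
std::vector<int> quad_swap_x(const std::vector<int>& wave) {
    std::vector<int> out(wave);
    const std::size_t full = wave.size() - wave.size() % 4;  // lanes in complete quads
    for (std::size_t i = 0; i < full; ++i) {
        out[i] = wave[i ^ 1];
    }
    return out;
}

// Models a swap in Y: each lane reads the value of its vertical neighbor,
// which is the lane whose index differs in bit 1 (0<->2, 1<->3).
std::vector<int> quad_swap_y(const std::vector<int>& wave) {
    std::vector<int> out(wave);
    const std::size_t full = wave.size() - wave.size() % 4;
    for (std::size_t i = 0; i < full; ++i) {
        out[i] = wave[i ^ 2];
    }
    return out;
}

// Models reading one lane of the quad: every lane receives the value held by
// lane quadLaneID (0..3) of its own quad.
std::vector<int> quad_read_lane_at(const std::vector<int>& wave, std::size_t quadLaneID) {
    std::vector<int> out(wave);
    const std::size_t full = wave.size() - wave.size() % 4;
    for (std::size_t i = 0; i < full; ++i) {
        const std::size_t quadBase = i - i % 4;  // index of lane [0] of this quad
        out[i] = wave[quadBase + quadLaneID % 4];
    }
    return out;
}
```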

 

Hardware capability

To check whether the wave operation features are available on specific hardware, call ID3D12Device::CheckFeatureSupport and inspect the D3D12_FEATURE_DATA_D3D12_OPTIONS1 structure that it fills in.
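
The capability check might look like the following C++ sketch. The WaveOps, WaveLaneCountMin and WaveLaneCountMax fields are taken from the released D3D12 headers, not from this preview text, so treat them as an assumption about the final API. The WaveSupport struct and the has_fixed_lane_count and query_wave_support functions are hypothetical helpers, and `device` is assumed to be a valid, already-created ID3D12Device. The Direct3D part is guarded so that the helper code still compiles on other platforms.

```cpp
#if defined(_WIN32)
#include <d3d12.h>
#endif

// Hypothetical plain struct for carrying the query result; not part of D3D12.
struct WaveSupport {
    bool supported = false;     // wave operations are available
    unsigned laneCountMin = 0;  // smallest wave size the hardware may use
    unsigned laneCountMax = 0;  // largest wave size the hardware may use
};

// Hypothetical helper: true when wave operations are supported and the
// hardware reports a single wave size (minimum equals maximum).
inline bool has_fixed_lane_count(const WaveSupport& s) {
    return s.supported && s.laneCountMin == s.laneCountMax;
}

#if defined(_WIN32)
// Queries the device for wave operation support.
// `device` is assumed to be non-null and fully initialized.
inline WaveSupport query_wave_support(ID3D12Device* device) {
    WaveSupport result;
    D3D12_FEATURE_DATA_D3D12_OPTIONS1 options1 = {};
    const HRESULT hr = device->CheckFeatureSupport(
        D3D12_FEATURE_D3D12_OPTIONS1, &options1, sizeof(options1));
    if (SUCCEEDED(hr) && options1.WaveOps) {
        result.supported = true;
        result.laneCountMin = options1.WaveLaneCountMin;
        result.laneCountMax = options1.WaveLaneCountMax;
    }
    return result;
}
#endif
```

A shader that hard-codes a wave size would only be safe to use when has_fixed_lane_count returns true and the reported size matches. Otherwise the shader should query the lane count at run time.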

Related topics

Programming Guide for HLSL
Shader Model 6 intrinsics

 

 
