Kinect for Windows 1.5, 1.6, 1.7, 1.8
A Kinect streams out color, depth, and skeleton data one frame at a time. This section briefly describes the coordinate spaces for each data type and the API support for transforming data from one space to another.
Each frame, the color sensor captures a color image of everything visible in its field of view. A frame is made up of pixels; the number of pixels depends on the frame size, which is specified by the NUI_IMAGE_RESOLUTION Enumeration. Each pixel contains the red, green, and blue values at a particular (x, y) coordinate in the color image.
Each frame, the depth sensor captures a grayscale image of everything visible in its field of view. A frame is made up of pixels, and the frame size is again specified by the NUI_IMAGE_RESOLUTION Enumeration. Each pixel contains the Cartesian distance, in millimeters, from the camera plane to the nearest object at that particular (x, y) coordinate, as shown in Figure 1. The (x, y) coordinates of a depth frame do not represent physical units in the room; instead, they represent the location of a pixel in the depth frame.
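Because the (x, y) coordinates are pixel locations rather than physical units, reading a depth value amounts to indexing into the frame buffer. The following is a minimal sketch, assuming the frame has already been copied into a row-major buffer of unpacked millimeter values. `DepthFrame` and `depth_at` are hypothetical names used for illustration and are not part of the Kinect API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for one depth frame (not an SDK type).
// Assumption: values are unpacked depths in millimeters, stored row by row.
struct DepthFrame {
    std::size_t width;
    std::size_t height;
    std::vector<std::uint16_t> millimeters;  // width * height entries
};

// Returns the depth in millimeters stored at pixel (x, y).
// x counts columns from the left edge, y counts rows from the top edge.
inline std::uint16_t depth_at(const DepthFrame& frame,
                              std::size_t x, std::size_t y) {
    assert(x < frame.width && y < frame.height);
    return frame.millimeters[y * frame.width + x];
}
```

The value returned is a distance from the camera plane, while x and y only say where in the image that distance was measured.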
Figure 1. Depth stream values
When the depth stream has been opened with the NUI_IMAGE_STREAM_FLAG_DISTINCT_OVERFLOW_DEPTH_VALUES flag, three values indicate that the depth could not be reliably measured at a location. The "too near" value means an object was detected, but it is too near to the sensor to provide a reliable distance measurement. The "too far" value means an object was detected, but it is too far from the sensor to measure reliably. The "unknown" value means no object was detected. In C++, when the NUI_IMAGE_STREAM_FLAG_DISTINCT_OVERFLOW_DEPTH_VALUES flag is not specified, all of the overflow values are reported as a depth value of 0.
Depth Space Range
The depth sensor has two depth ranges: the default range and the near range (shown in the DepthRange Enumeration). This image illustrates the sensor depth ranges in meters. The default range is available in both the Kinect for Windows sensor and the Kinect for Xbox 360 sensor; the near range is available only in the Kinect for Windows sensor.
This diagram applies to the values returned by the managed API, or by the native API with the NUI_IMAGE_STREAM_FLAG_DISTINCT_OVERFLOW_DEPTH_VALUES flag turned on. With the flag turned off, out-of-range data is returned as zero.
This table lists the depth values for out-of-range readings.
|Out of range depth data||Has this value|
The same values apply in managed code, but with a small twist: Because the managed values are signed, the 13-bit value (0xFFF8 >> 3) sign-extends to -1, rather than 8191.
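The difference between the unsigned and signed readings of the same 16 bits can be reproduced in a few lines. This is a sketch only: the helper names are hypothetical, and it reinterprets the bits in C++ to imitate what a signed 16-bit type would see, rather than calling the managed API.

```cpp
#include <cassert>
#include <cstdint>

// The packed value quoted in the text above.
constexpr std::uint16_t kPackedValue = 0xFFF8;

// Unsigned view: the shift fills the top bits with zeros.
constexpr int shift_as_unsigned(std::uint16_t packed) {
    return packed >> 3;
}

// Signed view: the same 16 bits read as a signed 16-bit integer (-8),
// so the shift copies the sign bit into the top bits.
// Note: before C++20 both the cast and the shift of a negative value are
// implementation-defined; mainstream compilers behave as described here.
constexpr int shift_as_signed(std::uint16_t packed) {
    return static_cast<std::int16_t>(packed) >> 3;
}
```

With these definitions, `shift_as_unsigned(kPackedValue)` yields 8191 and `shift_as_signed(kPackedValue)` yields -1, matching the two values mentioned above.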
Each frame, the depth image captured is processed by the Kinect runtime into skeleton data. Skeleton data contains 3D position data for the skeletons of up to two people who are visible to the depth sensor. The position of a skeleton and of each skeleton joint (if active tracking is enabled) is stored as (x, y, z) coordinates. Unlike depth space, skeleton space coordinates are expressed in meters. The x, y, and z-axes are the body axes of the depth sensor as shown below.
Figure 2. Skeleton space
This is a right-handed coordinate system that places a Kinect at the origin with the positive z-axis extending in the direction in which the Kinect is pointed. The positive y-axis extends upward, and the positive x-axis extends to the left. Placing a Kinect on a surface that is not level (or tilting the sensor) to optimize the sensor's field of view can generate skeletons that appear to lean instead of standing upright.
Each skeleton frame also contains a floor-clipping-plane vector, which contains the coefficients of an estimated floor-plane equation. The skeleton tracking system updates this estimate for each frame and uses it as a clipping plane for removing the background and segmenting players. The general plane equation is:
Ax + By + Cz + D = 0
where:
A = vFloorClipPlane.x
B = vFloorClipPlane.y
C = vFloorClipPlane.z
D = vFloorClipPlane.w
The equation is normalized so that the physical interpretation of D is the height of the camera from the floor, in meters. Note that the floor might not always be visible or detectable. In this case, the floor clipping plane is a zero vector.
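One way to use the floor plane is to evaluate the left-hand side of the plane equation for a skeleton-space point, which gives that point's signed distance from the floor when (A, B, C) has unit length. The sketch below assumes that normalization. `Vec4` is a stand-in for the SDK's Vector4 so that the example compiles without the SDK headers, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Stand-in for the SDK's Vector4 (x, y, z, w); defined here only so the
// sketch is self-contained.
struct Vec4 {
    float x, y, z, w;
};

// Returns false when the plane is the zero vector, which the text above
// describes as the "floor not visible or detectable" case.
inline bool floor_detected(const Vec4& plane) {
    return plane.x != 0.0f || plane.y != 0.0f ||
           plane.z != 0.0f || plane.w != 0.0f;
}

// Evaluates Ax + By + Cz + D for a skeleton-space point (w is ignored).
// Assumption: (A, B, C) is unit length, so the result is in meters.
inline float height_above_floor(const Vec4& plane, const Vec4& point) {
    return plane.x * point.x + plane.y * point.y +
           plane.z * point.z + plane.w;
}
```

For example, with a level sensor 1.2 m above the floor the plane would be approximately (0, 1, 0, 1.2). Evaluating the sensor origin gives 1.2, and a point 1.2 m below the sensor gives 0. Check `floor_detected` first, because a zero plane makes every height evaluate to 0.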
By default, the skeleton system mirrors the user who is being tracked. That is, a person facing the sensor is considered to be looking in the -z direction in skeleton space. This accommodates an application that uses an avatar to represent the user since the avatar will be shown facing into the screen. However, if the avatar faces the user, mirroring would present the avatar as backwards. If needed, use a transformation matrix to flip the z-coordinates of the skeleton positions to orient the skeleton as necessary for your application.
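The z-flip described above can be written as multiplication by a diagonal matrix with entries (1, 1, -1). The following is a minimal sketch using plain structs; `Vec3`, `Mat3`, `kFlipZ`, and `transform` are hypothetical names and not SDK types.

```cpp
#include <cassert>

// Hypothetical skeleton-space position in meters.
struct Vec3 {
    float x, y, z;
};

// Hypothetical 3x3 matrix, row-major.
struct Mat3 {
    float m[3][3];
};

// Leaves x and y unchanged and negates z.
constexpr Mat3 kFlipZ = {{{1.0f, 0.0f, 0.0f},
                          {0.0f, 1.0f, 0.0f},
                          {0.0f, 0.0f, -1.0f}}};

// Multiplies the matrix by the column vector p.
inline Vec3 transform(const Mat3& a, const Vec3& p) {
    return {a.m[0][0] * p.x + a.m[0][1] * p.y + a.m[0][2] * p.z,
            a.m[1][0] * p.x + a.m[1][1] * p.y + a.m[1][2] * p.z,
            a.m[2][0] * p.x + a.m[2][1] * p.y + a.m[2][2] * p.z};
}
```

Applying `transform(kFlipZ, joint)` to every joint position flips the whole skeleton. A matrix is used here, rather than negating z directly, so that the flip can be combined with other transforms an application may already apply.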
Converting Coordinates between Spaces
The following APIs are designed to convert data from one coordinate space to another:
|NuiTransformSkeletonToDepthImage(Vector4, LONG*, LONG*, USHORT*)||Skeleton to depth|
|NuiTransformSkeletonToDepthImage(Vector4, FLOAT*, FLOAT*, NUI_IMAGE_RESOLUTION)||Skeleton to depth|
|NuiTransformSkeletonToDepthImage(Vector4, FLOAT*, FLOAT*)||Skeleton to depth|
|NuiTransformDepthImageToSkeleton(LONG, LONG, USHORT)||Depth to skeleton|
|NuiTransformDepthImageToSkeleton(LONG, LONG, USHORT, NUI_IMAGE_RESOLUTION)||Depth to skeleton|
|NuiImageGetColorPixelCoordinatesFromDepthPixelAtResolution||Depth to color|
|NuiImageGetColorPixelCoordinatesFromDepthPixel||Depth to color|
|NuiDepthPixelToPlayerIndex||Depth to player index|
|NuiDepthPixelToDepth||Packed format depth pixel to unpacked format depth pixel|
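To illustrate what the last two conversions in the table do, the sketch below unpacks a packed depth pixel by hand. It assumes the packed layout keeps the player index in the low 3 bits and the 13-bit depth in the upper bits, which is consistent with the 3-bit shift mentioned earlier. The helper names are hypothetical stand-ins, not the SDK functions themselves; in an application, prefer NuiDepthPixelToDepth and NuiDepthPixelToPlayerIndex.

```cpp
#include <cassert>
#include <cstdint>

// Assumed packed layout: [ 13 bits depth in mm | 3 bits player index ].
constexpr unsigned kPlayerIndexBits = 3;
constexpr std::uint16_t kPlayerIndexMask = (1u << kPlayerIndexBits) - 1;  // 0x0007

// Extracts the depth in millimeters from a packed depth pixel.
constexpr std::uint16_t packed_to_depth_mm(std::uint16_t packed) {
    return static_cast<std::uint16_t>(packed >> kPlayerIndexBits);
}

// Extracts the player index from a packed depth pixel.
constexpr std::uint16_t packed_to_player_index(std::uint16_t packed) {
    return static_cast<std::uint16_t>(packed & kPlayerIndexMask);
}

// Builds a packed pixel from its parts; useful for testing the helpers.
constexpr std::uint16_t pack_depth_pixel(std::uint16_t depth_mm,
                                         std::uint16_t player_index) {
    return static_cast<std::uint16_t>((depth_mm << kPlayerIndexBits) |
                                      (player_index & kPlayerIndexMask));
}
```

For example, a pixel packed from a depth of 1500 mm and player index 2 unpacks back to those same two values.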