November 2012

Volume 27 Number 11

Kinect - 3D Sight with Kinect

By Leland Holmquest | November 2012

In previous articles I’ve written about Kinect, I demonstrated how to create an application that uses the skeleton data from a Kinect for Windows sensor and how to test the application with Kinect Studio. At first glance, skeleton tracking seems to be the main value the developer gets from Kinect. After you’ve worked with Kinect for a while, however, you realize that the most significant capability that Kinect offers is the depth sensor. Through the depth sensor, an application can see the surroundings through three-dimensional “eyes.” In this article, I explain the basics of setting up and using the depth sensor. I also show you a more robust way of starting and using the Kinect in an application, a way that lets you tailor the Kinect to fit your specific needs. This approach is more advanced than that in my previous articles.

3D Sight

In the first article in this series, I demonstrated the skeleton-tracking capability of Kinect. The skeleton tracker is a great way to capture a user’s movement and actions. At times, however, the application needs information about the surroundings that is independent of a user. For example, what if the sensor is in a room and the application wants to track a ball or some other inanimate object? The skeleton tracker can’t provide a solution for inanimate objects. What is needed is the ability to see the room and understand it from a three-dimensional perspective. The depth sensor satisfies this need.

So, what is the depth sensor? Kinect has an infrared sensor and an infrared light source. Depth sensors in other devices typically use a technique called “time of flight” to get depth data. The basic idea of the time-of-flight technique is that a pulse of light is emitted, and the sensor measures how long it takes for that light to leave the emitter, bounce off an object and return. That time is then converted into the distance the light traveled. Many devices use this technique, and they tend to be expensive.

Kinect uses a different technique that provides the same kind of data but at a greatly reduced price. Kinect projects a pattern of dots from the IR Emitter (see Figure 1). The sensor then uses the size and spacing of the dots to create a 320 × 240 pixel map that captures the depth data. An application can parse this data and re-create the scene the Kinect is viewing. Alternatively, an application can use the depth data to look for patterns in each image, with algorithms ranging from simple to complex depending on what is required. For example, you can use a simple algorithm to create a histogram of a scene (a rough sketch follows Figure 1). A more elaborate scenario would involve sampling multiple frames to identify known configurations of gestures.


Figure 1 Kinect sensor components (from the MSDN Library article “Kinect for Windows Sensor Components and Specifications”)
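
Before diving into sensor setup, here’s what the histogram idea might look like in code. This is a minimal sketch rather than anything from the sensor API: it assumes the per-pixel distances (in millimeters) have already been extracted from a frame, as shown later in the “Seeing Is Believing” section, and the bin size and depth ceiling are arbitrary choices.

// A rough sketch of the histogram idea: count how many pixels of a single
// frame fall into each 250 mm band. Assumes depthsInMm already holds the
// extracted distances; the bin size and 4,000 mm ceiling are arbitrary.
private static int[] BuildDepthHistogram(int[] depthsInMm)
{
    const int binSizeMm = 250;
    const int maxDepthMm = 4000;
    var histogram = new int[maxDepthMm / binSizeMm + 1];
    foreach (int depth in depthsInMm)
    {
        if (depth <= 0 || depth > maxDepthMm)
            continue; // 0 means "unknown"; skip out-of-range readings
        histogram[depth / binSizeMm]++;
    }
    return histogram;
}

Comparing histograms from successive frames is one cheap way to notice that something in the scene has moved closer or farther away.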

Starting the Depth Sensor

In my previous Kinect articles, I took a simple approach to handling the Kinect: start it (assuming only one Kinect is available), enable the desired features and assume it keeps working. In the real world, however, an application can start without a Kinect connected, or the Kinect can get disconnected in the middle of running. You could also be faced with numerous other scenarios that our friend Mr. Murphy could throw at you. As developers, we need to consider these scenarios.

Let’s look at a better, more realistic approach to handling the Kinect. The first thing you need to do is create a member variable to reference a Kinect (note “a” Kinect; the application could have access to multiple Kinect units).

private KinectSensor _kinectSensor;

Next, you need a method that will discover any Kinect that is available to the application. You could just use the first Kinect, like this:

private KinectSensor _kinectSensor = KinectSensor.KinectSensors[0];

However, you could easily encounter a scenario in which multiple Kinects are connected to the machine but the first one has an error or isn’t yet available. A better approach is the following:

_kinectSensor = KinectSensor.KinectSensors.FirstOrDefault(x => x.Status == KinectStatus.Connected);

Using this technique, the program is looking for the first or default Kinect that has a status of Connected. The enumeration KinectStatus has the following values: Connected, DeviceNotGenuine, DeviceNotSupported, Disconnected, Error, Initializing, InsufficientBandwidth, NotPowered, NotReady and Undefined. Given these statuses, an application can identify what is happening with the (potentially) multitude of Kinect units attached to the machine.

Knowing the status of a unit enables the application to deal with the device appropriately. For example, in my previous Kinect articles, the applications expected a single Kinect to be available in the Connected status. But what happens when the USB connection gets cut? The API changes the status to Disconnected. The application, having subscribed to the StatusChanged event, can respond appropriately by switching over to another Connected Kinect or by providing feedback to the user that the Kinect is no longer functioning.
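
To make that concrete, here’s a minimal sketch of wiring up the StatusChanged event. The handler name and the simple fallback behavior are my own assumptions rather than the article’s code; the _kinectSensor field was declared earlier, and StatusText and the resource strings appear in the startup code shown shortly.

// A minimal sketch of reacting to StatusChanged; the handler name and the
// fallback logic are illustrative assumptions.
// Subscribe once, typically during application initialization:
KinectSensor.KinectSensors.StatusChanged += KinectSensorsStatusChanged;

private void KinectSensorsStatusChanged(object sender, StatusChangedEventArgs e)
{
    if (e.Status == KinectStatus.Disconnected && e.Sensor == _kinectSensor)
    {
        // The unit we were using went away; fall back to another Connected unit.
        _kinectSensor = KinectSensor.KinectSensors
            .FirstOrDefault(x => x.Status == KinectStatus.Connected);
        StatusText.Text = _kinectSensor == null
            ? Properties.Resources.KinectNotAvailable
            : Properties.Resources.KinectIsReady;
    }
    else if (e.Status == KinectStatus.Connected && _kinectSensor == null)
    {
        // A unit became available and we had none; start using it.
        _kinectSensor = e.Sensor;
        StatusText.Text = Properties.Resources.KinectIsReady;
    }
}

On a successful fallback you’d also repeat the startup steps shown in the next listing (enable the DepthStream, start the sensor and so on) for the newly selected unit.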

Here’s an important piece of information about using an Xbox 360 Kinect unit: On the developer’s machine, an Xbox 360 Kinect reports a status of Connected when the unit is available. When running on a non-developer machine, however, the status comes back as DeviceNotSupported. While running the application in debug, the debugger output window provides the following warning message: “The Kinect plugged into your computer is for use on the Xbox 360. You may continue using your Kinect for Xbox 360 on your computer for development purposes. Microsoft does not guarantee full compatibility for Kinect for Windows applications and the Kinect for Xbox 360.”

The application for this article is a simple Windows Presentation Foundation application. In it, the Kinect’s depth sensor is providing a view of the scene in front of it. The information is converted to a graphical representation of the three-dimensional space and projected to the user through a simple System.Windows.Controls.Image object.

If you fail to obtain a reference to a Connected Kinect unit, you provide feedback to the user:

if (_kinectSensor == null)
{
    StatusText.Text = Properties.Resources.KinectNotAvailable;
}
else
{
    _kinectSensor.DepthFrameReady += KinectSensorDepthFrameReady;
    StatusText.Text = Properties.Resources.KinectIsReady;
    _kinectSensor.DepthStream.Enable();
    _kinectSensor.Start();
    _kinectSensor.ElevationAngle = 0;
}

If there’s a handle to a valid Kinect, you subscribe to its DepthFrameReady event (I’ll cover this in more depth in the next section), let the user know that the Kinect is ready, enable the DepthStream, start the sensor and set it to an elevation angle of 0 degrees (neutral). The application is now ready to begin interacting with the surroundings.

Seeing Is Believing

In the DepthFrameReady event, you can gain access to the DepthImageFrame. Iterating through each pixel in this “image” enables you to identify the distance between the camera and whatever is in front of it. The data is packed into the pixel itself, and each pixel is represented by 2 bytes. The lower three bits (0–2) hold the player index value. This data is valuable: if the pixel is associated with a player, the index of that player is in these bits. The remaining bits (3–15) provide the distance data. To get to the desired data, you need to be aware of two formulas:

Distance Formula

int depth = depthPoint >> DepthImageFrame.PlayerIndexBitmaskWidth;

Player Formula

int player = depthPoint & DepthImageFrame.PlayerIndexBitmask;

Using these formulas, the application can determine the distance (in millimeters) from the camera to the object in front of the Kinect and the index of the player if the depth data is associated with a player.
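
As a small sketch of putting both formulas to work, the following loop walks a frame that has already been copied into a short[] array (as in Figure 2). The method name is my own assumption; also note that the player index is populated only when skeleton tracking is enabled on the sensor.

// A small sketch applying both formulas to every pixel; the method name and
// pixelData parameter are illustrative. Player indexes are only populated
// when the skeleton stream is enabled; otherwise the lower bits are 0.
private void InspectDepthPixels(short[] pixelData)
{
    for (int i = 0; i < pixelData.Length; i++)
    {
        short depthPoint = pixelData[i];

        // Distance in millimeters: shift the player-index bits out of the way.
        int depth = depthPoint >> DepthImageFrame.PlayerIndexBitmaskWidth;

        // Player index: 0 means "no player"; nonzero values identify tracked players.
        int player = depthPoint & DepthImageFrame.PlayerIndexBitmask;

        // Use depth and player here; for example, flag the nearest pixel
        // that belongs to a player.
    }
}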

In this application, however, a simple visualization of the depth data is desired. With each pixel of the frame holding a raw distance value, the data by itself isn’t of much value to us humans. We need a way to convert it to an image. There are multiple ways to achieve this effect. One is shown in Figure 2.

private void KinectSensorDepthFrameReady(object sender, DepthImageFrameReadyEventArgs e)
{
    using (DepthImageFrame frame = e.OpenDepthImageFrame())
    {
        if (frame != null)
        {
            // Copy the raw 16-bit depth pixels out of the frame.
            var pixelData = new short[frame.PixelDataLength];
            frame.CopyPixelDataTo(pixelData);

            // Bytes per row of the image.
            int stride = frame.Width * frame.BytesPerPixel;

            // Render the raw data as a 16-bit grayscale bitmap.
            DepthImageElement.Source =
                BitmapSource.Create(frame.Width,
                    frame.Height,
                    96,
                    96,
                    PixelFormats.Gray16,
                    null,
                    pixelData,
                    stride);
        }
    }
}

Figure 2 Convert to Image

In this method, you begin by using a DepthImageFrame. The using statement is important. Once the method goes out of scope, Dispose is called on the frame, ensuring that its resources are released in a timely manner rather than waiting for garbage collection. To witness the consequences of not employing using, record a loop of test data via Kinect Studio (see my article “Working with Kinect Studio”) and set up a test case that plays that loop against the application indefinitely. Using Task Manager, watch the memory allocation climb as the application continues to run, until it consumes too many resources and comes to a grinding halt.

The important point of this method is the call to BitmapSource.Create. Through the parameters you pass in, you tell it to convert the data from the DepthImageFrame into a 16-bit grayscale image. The resulting image is assigned to the Image element of the application, producing a picture in which the darker the color, the closer the object. In this image you can see the various things in the room, including the user, as shown in Figure 3.

Figure 3 Depth image in gray scale
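
If you want more control over the rendering than the raw Gray16 mapping gives you, a common alternative is to scale each distance into an 8-bit intensity yourself. The following is a minimal sketch rather than the article’s code; the DepthToGray8 name and the 4,000 mm ceiling are assumptions, and the scaling keeps the darker-is-closer convention of Figure 3.

// An alternative sketch: scale each distance into a 0-255 intensity (darker =
// closer, matching Figure 3) and build an 8-bit grayscale bitmap. The method
// name and the 4,000 mm ceiling are assumptions.
private static BitmapSource DepthToGray8(DepthImageFrame frame)
{
    var pixelData = new short[frame.PixelDataLength];
    frame.CopyPixelDataTo(pixelData);

    const int maxDepthMm = 4000;
    var intensities = new byte[pixelData.Length];
    for (int i = 0; i < pixelData.Length; i++)
    {
        int depth = pixelData[i] >> DepthImageFrame.PlayerIndexBitmaskWidth;
        intensities[i] = (depth <= 0 || depth > maxDepthMm)
            ? (byte)255                          // unknown or out of range: render white
            : (byte)(depth * 255 / maxDepthMm);  // nearer = darker
    }

    // Gray8 uses 1 byte per pixel, so the stride is simply the width.
    return BitmapSource.Create(frame.Width, frame.Height, 96, 96,
        PixelFormats.Gray8, null, intensities, frame.Width);
}

You could call a method like this from KinectSensorDepthFrameReady in place of the Gray16 conversion shown in Figure 2.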

What to Do with Depth

You can find numerous examples online to demonstrate the use of depth data. Some ideas include motion detectors, user gestures and a whole host of applications available once you start employing image-processing concepts such as edge, contour and blob detection. In essence, Kinect enables the application to see the three-dimensional world in front of it. What we do with that data is limited only by our imagination.
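
As one illustration, here’s a rough sketch of a naive frame-differencing motion detector built on depth data; it isn’t from the article. It keeps the previous frame’s pixels in a field and counts how many distances changed by more than a threshold; the field name, the 100 mm threshold and the 500-pixel count are arbitrary assumptions.

// A naive frame-differencing motion detector over depth data (illustrative only).
// _previousPixelData, the 100 mm threshold and the 500-pixel count are assumptions.
private short[] _previousPixelData;

private bool DetectMotion(short[] pixelData)
{
    const int thresholdMm = 100;
    const int minChangedPixels = 500;
    bool motion = false;

    if (_previousPixelData != null && _previousPixelData.Length == pixelData.Length)
    {
        int changed = 0;
        for (int i = 0; i < pixelData.Length; i++)
        {
            int depthNow = pixelData[i] >> DepthImageFrame.PlayerIndexBitmaskWidth;
            int depthBefore = _previousPixelData[i] >> DepthImageFrame.PlayerIndexBitmaskWidth;

            // Ignore "unknown" readings (zero) and count significant changes.
            if (depthNow > 0 && depthBefore > 0 &&
                Math.Abs(depthNow - depthBefore) > thresholdMm)
            {
                changed++;
            }
        }
        motion = changed > minChangedPixels;
    }

    _previousPixelData = (short[])pixelData.Clone();
    return motion;
}

You could feed it the same pixelData array copied in KinectSensorDepthFrameReady and react whenever it returns true.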

The DepthStream coupled with the skeleton tracker previously covered provide the vision components of the Kinect, or its “eyes.” With this data, an application can see the world around it. Applications can understand the three-dimensional world as well as the motion of the people that reside in that world. In the next article, I’ll explore the “ears” of the Kinect and explain what hearing can bring to natural user interfaces (NUIs). The goal of NUIs is to bring the user closer to ubiquitous computing environments: a place where the interface with the application is so natural and intuitive that the application blends into the background.


Leland Holmquest is a senior consultant at Microsoft, providing solutions for the U.S. Army. Prior to Microsoft, he gained 15 years of federal service, including the Naval Surface Warfare Center, Dahlgren, and diplomatic security. He is earning a Ph.D. in IT at George Mason University. He is the father of two daughters and is a devoted husband. You can reach him at lelandholmquest.wordpress.com.