Skeletal Joint Smoothing White Paper
Kinect for Windows 1.5, 1.6, 1.7, 1.8
by Mehran Azimi, Software Development Engineer
Advanced Technology Group (ATG)
- Introduction
- Why We Need Joint Filtering
- Joint Filtering Basics
- Smoothing Filters
- Practical Tips and Takeaways
- References
1. Introduction
The skeletal tracking (ST) system of the Natural User Interface (NUI) provides the joint positions of tracked persons’ skeletons. These joint positions describe each person’s position and pose, and they are used for many purposes, such as gesture detection, navigating user interfaces, and so on.
In practice, there is some noise present in the joint positions returned by the ST system. An important step before consuming ST data is to use a noise reduction filter to remove as much noise as possible from the joint data. Such filters are called smoothing filters because they result in smoother positions over time.
This white paper describes the filtering techniques and best practices for using skeleton joint data for a Kinect-enabled application, and its goal is to help developers choose an appropriate filtering technique and fine-tune the filter parameters to match their application needs. The paper covers different areas related to joint filtering, such as the type of noise one should expect in ST data; how filtering affects the latency and how forecasting can be used to reduce latency; the characteristics of an ideal joint filter in terms of responsiveness, latency, and smoothing effect; and how tracking state data returned by ST can be used to improve filtering. Then it describes the specific characteristics of a few useful filtering techniques in detail. The paper concludes with a summary of best practices and practical tips for filtering.
2. Why We Need Joint Filtering
Measurement errors and noise are by-products of almost any system that measures a physical quantity via a sensor. The characteristics of this error are usually described by accuracy and precision of the system, where accuracy is defined as the degree of closeness of measured quantity to its actual value, and precision is defined as the degree to which repeated measurements are close to each other. An accurate system does not have any systematic error in measurements and therefore does not add a systematic bias. A precise system results in measurements close to each other when the measurement is repeated [1,4]. The accuracy and precision concepts are illustrated in Table 1 for a system that is measuring a hand position in the real world.
Table 1. Accuracy vs. Precision: The black X represents the hand location in the real world, and red dots represent a few measurements of hand position by a measurement system.
(a) An inaccurate and imprecise system generates random-like measurements that are essentially useless in practice.
(b) An inaccurate but precise measurement system generates measurements that are close to each other, but have a systematic error or bias.
(c) An accurate and precise system generates identical measurements that are close to data in the real world. Unfortunately, 100% accurate and precise systems do not exist in the real world, because there will always be some error in practice.
(d) An accurate and modestly precise system generates measurements that are close to each other and are not systematically biased with respect to the data in the real world. This is what one should expect of a well-designed system in practice.
Just like any measurement system, the joint positions data returned by the NUI ST system has some noise. There are many parameters that affect the characteristics and level of noise, which include room lighting; a person’s body size; the person’s distance from the sensor array; the person’s pose (for example, for hand data, if the person’s hand is open or fisted); location of the sensor array; quantization noise; rounding effects introduced by computations; and so on. Note that the joint positions returned by ST are accurate, meaning that there is no bias in the joint position data to the actual positions in the real world. This means that if a person is standing still, then the average of the joint positions data, over time, is close to the positions in the real world. However, the joint positions data are not necessarily perfectly precise, meaning that they are scattered around the correct positions in each frame. In practice, the joint positions are accurate within a centimeter range, not millimeters.
There are cases in which the ST system does not have enough information in a captured frame to determine a specific joint position. Examples include occlusion by furniture or other persons, self-occlusion of a joint by the person’s other body parts, and a joint moving out of the sensor’s field of view. In most of these cases, the ST system is still able to infer the joint position, and the NUI_SKELETON_POSITION_TRACKING_STATE parameter, returned as part of a NUI_SKELETON_DATA structure, is set to NUI_SKELETON_POSITION_INFERRED for that joint. This parameter can be treated as the confidence level of the ST system regarding the joint position. Though the inferred joint positions are often a reasonable estimate of the joint position, they may become inaccurate in some cases, depending on the person’s pose. Therefore, one should expect that inferred joints have higher noise levels, along with a possible bias. This bias is usually observed as temporary spikes in the joint position data, which go away as the joint’s tracking state returns to NUI_SKELETON_POSITION_TRACKED.
Therefore, two types of noise are present in joint positions. One is the relatively small white noise that is always present for all joints and caused by imprecision; the other is temporary spikes caused by inaccuracy, which happen when the joint has an inferred tracking state. Since these noises have different characteristics, different filtering techniques should be used for each. That is, developers need to use a combination of two or three filtering techniques to achieve good results in a Kinect-enabled application.
3. Joint Filtering Basics
Before describing any specific filtering technique, there are a few important concepts to cover that are related to filtering; they include latency and how it relates to filtering delays, how forecasting can improve latency, the tradeoff between latency and smoothing effects in filter design, and what makes an ideal filter.
- Latency and How it Relates to Filtering
- Filter Smoothing Effect vs. Filtering Delay
- Error Propagation to Variables Calculated from Skeletal Tracking Data
- Filter Notations
- Filter Response to Step Function and Sine Waveform Inputs
- Using Joint Tracking State in Filtering
- Using Body Kinematics to Correct the Filter Output Data
3.1. Latency and How it Relates to Filtering
Latency can be defined as the time from when a person makes a move until the person sees the response to that movement on the screen. Latency degrades the experience as soon as people start to notice there is a delay in response to their movements. User research shows that 72% of people start noticing this delay when latency is more than 100 msec, and therefore, it is suggested that developers aim for an overall latency of 100 msec [15]. For a detailed discussion of latency, refer to [14] in References.
The joint filtering latency is how much time it takes for filter output to catch up to the actual joint position when there is a movement in a joint. This is shown in Figure 1, which shows that filter output is lagging behind input when there are changes in input. It is important to note that the latency introduced by joint filtering is not the CPU time it takes for the filtering routine to execute.
Figure 1. Output of a typical joint filter in response to a NUI joint movement. Note that latency added by joint filtering is the lag between output and input when there is movement in input data, and the amount depends on how quickly the joint is moving.
In general, the filtering delay depends on how quickly the input is changing, and hence, one cannot attribute a specific delay value to a given filter for all cases. This is referred to as phase distortion in signal processing [6,7]. A special class of filters called linear phase filters have the same delay for all input frequencies, and such a specific delay time can be attributed to the filter for all inputs. Reducing phase distortion is important in some signal processing applications, specifically in audio processing; however, it is not necessarily as important in NUI joint filtering, so having a linear phase filter is not a design criterion in NUI joint filtering.
A useful technique to reduce latency is to tweak the joint filter to predict the future joint positions. That is, the filter output would be a smoothed estimate of joint position in subsequent frames. If forecasting is used, then joint filtering would reduce the overall latency. However, since the forecasted outputs are estimated from previous data, the forecasted data may not always be accurate, especially when a movement is suddenly started or stopped. Forecasting may propagate and magnify the noise in previous data to future data, and hence, may increase the noise. Almost all joint filtering techniques can forecast or can be tweaked to forecast future outputs. The accuracy of predicted outputs depends on the underlying data model that the filter is using and how the filter parameters are selected. Note that it is usually practical to forecast a joint position for about two frames, which could reduce the filtering latency by about 66 msec in ideal cases. However, in practice, the smoothing effect of the filter, along with the prediction errors in forecasted data, would result in smaller latency reductions.
3.2. Filter Smoothing Effect vs. Filtering Delay
An ideal joint filter would remove all unwanted noise and jitters from the joint data resulting in smooth joint position data over time. It would also follow the movements of the joint without any lag or delay. Unfortunately, there is a tradeoff between these two objectives in practice, and choosing a filtering technique that aggressively smoothes out the data would result in higher filtering delay, which would increase the perceived latency. As an intuitive explanation for this concept, consider a case where a person is standing still and therefore the input to the joint filter is mostly a constant position along with some noise. In order to produce a smooth output, the filter should not be sensitive to the changes in input due to noise. Now suppose the person starts moving his/her hand. In order to be responsive to these movements, the filter should be designed to be sensitive to changes due to movement, which is an opposite of the requirement for noise removal. In practice, most filters take some time to see enough movement before they start following these changes in output, and therefore their output lags behind the changes in input.
Accordingly, one should understand how latency and smoothness affect the user experience, and identify which one is more important to create a good experience. Then, carefully choose a filtering method and fine-tune its parameters to match the specific needs of the application. In most Kinect applications, data output from the ST system is used for a variety of purposes—such as gesture detection, avatar retargeting, interacting with UI items and virtual objects, and so on—where all are different in terms of how smoothness and latency affect them. Similarly, joints have different characteristics from one another in terms of how fast they can move, or how they are used to create the overall experience. For example, in some applications, a person’s hands can move much faster than the spine joint, and therefore, one needs to use different filtering techniques for hands than the spine and other joints. Alternately, consider an application that uses hand movements to score how well a person is doing a particular exercise (that is, gesture scoring), and to animate the person’s avatar on screen at the same time. Latency is usually less important in gesture scoring, since the whole gesture should be done before computing the score; consequently, the application should use a high-latency filtering technique with more aggressive smoothing for gesture scoring, while also using a different low-latency filtering technique for animating a person’s avatar.
So, there is no filtering solution that fits the needs of all use cases for all joints. Depending on which joint data is to be used and how the filter output is consumed, one should apply different filtering techniques or fine-tune the parameters per joint and per application.
3.3. Error Propagation to Variables Calculated from Skeletal Tracking Data
One can apply the filtering techniques discussed in this paper to the variables calculated from joint positions. For example, an application may need to use the elbow angle formed between the person’s hand, elbow, and shoulder. Then the same filtering techniques discussed here can be applied to the calculated angle to smooth it out over time. Another example is using spherical coordinate systems local to a joint’s parents, where each point in space is represented by a (ρ,φ,θ) triplet, instead of a Cartesian system of (x,y,z) coordinates. For example, one can use the right shoulder as the origin and represent the elbow position by using this local spherical coordinate system. In this case, the Cartesian coordinates returned by ST are transformed to local spherical coordinates (ρ,φ,θ), and filtering is done on each component separately. The radius ρ, which represents the bone length, can be filtered more aggressively than the angles (φ,θ), which represent the joint movements.
It is important to note that applying mathematical operations to noisy joint data propagates the noise and may amplify its level. The underlying concept of noise propagation is similar in essence to the propagation of rounding or precision errors of floating-point variables in math operations, though rounding errors on floating-point values are small and are ignored in most practical cases, while measurement noise is relatively much larger. Arithmetic operations used to calculate body part sizes, such as addition, subtraction, and multiplication, amplify the noise [1-4]. For example, calculating bone length, or calculating relative joint coordinates, such as elbow position relative to shoulder or hand position relative to head, requires subtracting two joint positions. In all these cases, the noise in the resulting data is amplified. Trigonometric functions (such as sine, cosine, or inverse tangent), which are typically used for calculating or manipulating joint angles, affect the noise in different ways. The effect of any function on noise depends on the local slope of that function, that is, the local derivative of the function around the data point used. For example, suppose θ is a joint angle calculated from noisy ST data, and hence θ is noisy as well. The function tan θ is sensitive around θ = 90°, because (d/dθ)tan θ diverges to ∞ as θ approaches 90°. Hence, calculating tan θ would amplify the noise drastically if θ is close to 90°. However, since (d/dθ)tan θ = 1 around θ = 0°, the noise is neither amplified nor attenuated if θ is close to zero.
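As a rough illustration of these propagation rules, the first-order approximations can be sketched in C++. The helper names below are illustrative only and are not part of the Kinect SDK:

```cpp
#include <cmath>

// First-order error propagation: for y = f(x), a small input error dx
// maps to an output error of roughly |f'(x)| * dx.
// Gain of the tan function: d/dtheta tan(theta) = 1 / cos^2(theta).
double tan_noise_gain(double theta_rad) {
    double c = std::cos(theta_rad);
    return 1.0 / (c * c);
}

// For z = x - y with independent noise of standard deviations sx and sy,
// the noise standard deviation of z is sqrt(sx^2 + sy^2).
double subtraction_noise_std(double sx, double sy) {
    return std::sqrt(sx * sx + sy * sy);
}
```

For independent noise, subtraction of two equally noisy joint positions multiplies the noise standard deviation by about 1.41, while the gain of tan θ grows without bound as θ approaches 90°.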
A more thorough discussion of this topic can be found in References under “Error analysis and error propagation in noisy data”; see [1-4] for more details.
3.4. Filter Notations
The joint filter implementation in a typical application receives joint positions from ST as input in each frame, and returns the filtered joint positions as output. The filter treats each joint’s x, y, and z coordinates independently from other joints or other dimensions. That is, a filter is independently applied to the x, y, and z coordinate of each joint separately—and potentially each with different filtering parameters. Note that, though it is typical to directly filter the Cartesian position data returned by ST, one can apply the same filtering techniques to any data calculated from joint positions.
This means the input of each joint filter is a one-dimensional time series that represents a single joint position in a given dimension at frame n (for example, the y coordinate of the right hand position). The filter input at frame n is denoted by Xn, and the filter output generated by applying the filter is denoted by X̂n.
The forecasted output k frames ahead, calculated at time n, is denoted by X̂(n+k|n), where the “|n” notation denotes that this forecast is done at time n, and hence, by definition, only samples up to Xn are available for calculating X̂(n+k|n).
Figure 2. Filter notations used in this white paper
3.5. Filter Response to Step Function and Sine Waveform Inputs
In order to understand a filtering technique and how filtering parameters affect the filter characteristics, it is useful to study the filter output in response to some predefined inputs, specifically step function and sine wave inputs.
Step function input models a sudden jump in input data, and is defined as:

Xn = 0 for n < N, and Xn = 1 for n ≥ N

That is, input Xn is zero up to a given time N, at which it jumps to 1 and stays there. Though one does not expect to see such input in practice, a step function is helpful because it shows how quickly and how accurately the filter tracks sudden changes in input. Figure 3 shows the typical output of a filter in response to a step function input, as well as definitions for some of the filter characteristics of interest.
Figure 3. Typical response of a filter to step function input
- Rise time is the time required for the filter output to reach 90% of its asymptotic final value, which is 0.90 in the case of a unit step function input. In some fields, a similar parameter called the time constant is used, which is the time required for the filter output to reach 1 − e^−1 ≈ 63% of its final asymptotic value. A small rise time is an indication of a filter with low latency.
- Overshoot is when the output of the filter reaches a higher maximum value than input, and this is usually represented as a percentage of the final asymptotic value. Note that not all filters have overshoot in their output. Overshoot is undesirable, and it is usually present in low-latency filters that are sensitive to changes in input.
- Ringing is the effect where filter output oscillates before it settles down to its final value.
- Settling time is applicable to filters that have overshoot or ringing, and it is the time it takes for the filter output to reach and remain within a given error margin of its final asymptotic value. The range of the error margin is usually 10 percent.
Rise time shows how quickly the filter catches up with sudden changes in input, while overshoot, ringing, and settling time are indications of how well a filter can settle down after it has responded to a sudden change in input.
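These step-response metrics can be measured numerically. The sketch below feeds a unit step into a simple N-point averaging filter (the filter discussed in section 4.2) and counts the frames until the output reaches 90% of its final value; the function name is illustrative, and the discrete frame count only approximates the continuous-time rise time:

```cpp
#include <deque>
#include <cstddef>

// Frames needed for an N-point averaging filter's output to reach 90%
// of its final value after a unit step at frame 0 (illustrative sketch).
int rise_time_frames(std::size_t window) {
    std::deque<double> buf;
    for (int frame = 0;; ++frame) {
        buf.push_back(1.0);                  // unit step input
        if (buf.size() > window) buf.pop_front();
        double sum = 0.0;
        for (double v : buf) sum += v;
        double out = sum / window;           // average over the full window
        if (out >= 0.9) return frame;        // frames since the step began
    }
}
```

Consistent with the proportionality discussed later, doubling the window roughly doubles the measured rise time.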
A filter’s response to a step function does not reveal all of the useful characteristics of a filtering technique, because it only shows the filter’s response to sudden changes. It is also useful to study a filter’s response to sine waveform input, both in time and frequency domains. As shown in Figure 4, the response in the time domain can show the lag in filter output, which depends on input frequency in most cases (that is, in nonlinear phase filters). Note that the output may not reach the maximum or minimum level of the input due to the filter attenuating that frequency. This is sometimes referred to as dampening the peaks and valleys in input by aggressive smoothing. As an example, Figure 5 shows the hand position x of a person who has opened and closed his arm quickly two times, resulting in a sinusoidal peak in input. As noted, aggressive smoothing of data has resulted in filter output not reaching the same maximum and minimum level of the input. Also, it is interesting to note in Figures 4 and 5 that, in some frames, the input is increasing while the output is decreasing; this can be attributed to the filtering latency. This filter would not be a good choice for drawing a cursor on the screen based on the person’s current hand position, because it would produce an undesirable effect when the person’s hand changes direction; the cursor would catch up and change direction after a while, which would create an awkward experience for the person.
Figure 4. Typical response of a filter to sine waveform input
Figure 5. Aggressive smoothing would reduce the minimum and maximum in sinusoidal input data. The data is from actual hand movement where a person has opened and closed his arms rapidly two times.
The filter’s response in the frequency domain shows how a filter responds to the full range of input frequencies. All smoothing filters used for NUI joint filtering are low-pass filters; an ideal low-pass filter passes the input frequency components that are lower than a cut-off frequency and removes the components that are higher. The low-pass characteristic of NUI joint filters is based on the assumption that joint movements have relatively lower frequency than the noise, though this is not necessarily a correct assumption in all cases, specifically when a person makes rapid movements, such as quick hand movements or sudden jumps. Note that powerful filter design methods are available in signal processing textbooks that realize a desired frequency response, such as a low-pass filter of a given order N with a given cutoff frequency [6]. Since the frequency range of NUI joint data overlaps with that of the noise, low-pass filters designed only on frequency response criteria do not necessarily provide good results in Kinect applications. Frequency-domain filter design methods are not within the scope of this white paper; however, becoming familiar with the underlying concepts of filter design in the frequency domain helps in understanding filtering characteristics in general, and items [6,7] in References are good starting points.
3.6. Using Joint Tracking State in Filtering
Joints that are inferred are more likely to have temporary spike noise, because they are less accurate. The random noise levels are also usually higher for inferred joints. In practice, developers of Kinect applications should consider a joint’s tracking state to be valuable information about the quality of the joint data, and apply a more aggressive smoothing filter when a joint’s state is inferred. This can easily be done by checking the joint’s tracking state in the filter’s implementation and adaptively updating the filter parameters based on it. Also, filters that are specifically more powerful in removing spike noise should be applied when a joint’s state is inferred.
3.7. Using Body Kinematics to Correct the Filter Output Data
The anatomy and kinematics of a person provide valuable information that can be used to enhance the ST joint data. For example, some joints move or bend only in a certain direction, and some joints, such as hands, can move faster than others. How this information can be applied depends on the joint and the context in which the joint data are used. As a practical example, suppose that a person’s right hand joint is filtered with a low-latency filter that forecasts one frame into the future. This type of filter usually overshoots in response to sudden movements. Therefore, due to overshoot, the filter may calculate the hand position to be too far from the person’s body. This overshoot can be corrected by using an estimate of the hand bone length, and correcting the filtered hand position such that the hand is positioned an acceptable distance from the person’s body.
Another useful kinematic property is the limitation of hinge joints. A hinge joint is a joint that can bend along only one axis (that is, it has only one degree of freedom). For example, elbows and knees are hinge joints, because they can bend in only one direction and within a limited range. These kinematic limitations of the elbow or knee can be used to correct the filtered joint positions.
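The bone-length correction described above can be sketched as follows: if the filtered hand position ends up farther from the elbow than a plausible bone length, pull it back along the same direction. The names, and the idea that boneLength comes from an estimate averaged over time, are assumptions for illustration:

```cpp
#include <cmath>

struct Vec3 { double x, y, z; };

// Illustrative sketch: clamp an overshot filtered hand position so it
// lies no farther than boneLength from the elbow.
Vec3 ClampToBoneLength(const Vec3& elbow, const Vec3& hand, double boneLength) {
    double dx = hand.x - elbow.x, dy = hand.y - elbow.y, dz = hand.z - elbow.z;
    double d = std::sqrt(dx * dx + dy * dy + dz * dz);
    if (d <= boneLength || d == 0.0) return hand;   // no overshoot to fix
    double s = boneLength / d;                      // shrink toward the elbow
    return { elbow.x + dx * s, elbow.y + dy * s, elbow.z + dz * s };
}
```

The same pattern extends to hinge-joint limits: project the filtered position back into the joint’s allowed range instead of onto a sphere.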
4. Smoothing Filters
This section discusses details of a few smoothing filtering techniques that are useful for NUI joint filters. First are the Auto Regressive Moving Average (ARMA) filters, which are a general class of filters. Specific smoothing filters for noise removal are also covered, including Moving Average, Double Moving Average, Exponential Filter, Double Exponential Filter, and Savitzky-Golay filters. All these smoothing filters are special cases of ARMA filters. Finally, the section discusses a filter based on the Taylor series that is useful for forecasting.
Also covered are Median and Jitter Removal filters, which although they have some smoothing effect, are specifically useful for removing spike noise from data.
- Auto Regressive Moving Average (ARMA) Filters
- Simple Averaging Filter
- Double Moving Averaging Filter
- Savitzky–Golay Smoothing Filter
- Exponential Smoothing Filter
- Double Exponential Smoothing Filter
- Adaptive Double Exponential Smoothing Filter
- Taylor Series Filter
- Median Filter
- Jitter Removal Filter
4.1. Auto Regressive Moving Average (ARMA) Filters
Auto regressive moving average (ARMA) filters are a general class of linear filters. All the smoothing filters discussed in this paper are special cases of ARMA filters. The output of an ARMA filter is a weighted average of the current and N previous inputs, and M previous filter outputs:

X̂n = Σ(i=0..N) ai·Xn−i + Σ(j=1..M) bj·X̂n−j

where the ai and bj coefficients are the filtering parameters. The first term is known as the moving average (MA) term, and the second term is known as the auto-regressive (AR) term.
Moving average (MA) filters are a special case of ARMA filters where all bj parameters are zero:

X̂n = Σ(i=0..N) ai·Xn−i

The coefficients ai are the weight factors and are selected such that Σ ai = 1 in all NUI joint filter applications. This property is a result of using low-pass filtering and allowing DC components through without attenuation. As an intuitive explanation, suppose the input to be filtered is the constant 1 for all frames n. Then, one intuitively expects the filter output to be 1 as well for all n, which requires Σ ai = 1. This can be used as a quick sanity check that the derived ai coefficients are correct.
The moving average filter can be extended to a central moving average filter, where the filter output is the weighted average of N past and M future inputs:

X̂n = Σ(i=−M..N) ai·Xn−i
Since the output of this filter depends on M future inputs after Xn, then any implementation of this filter will add a latency of at least M frames. Therefore, this filter is only practical in offline cases in which all data is available in advance, or in cases in which increasing the latency by an order of a few frames is tolerable. For example, in some Kinect-enabled applications, it may make sense to score how well the person has done an exercise after the whole performance is finished. The central MA filters usually perform better in terms of noise removal than simple MA filters.
ARMA filters are usually designed by assuming an underlying model for the data and using this model to calculate the filter orders N and M, and the ai and bj coefficients. The data model should be chosen based on the actual data characteristics. In some approaches, the underlying data model is chosen based on a statistical approach that preserves the higher-order moments of the data. The details of these methods are outside the scope of this paper; see [5] for a detailed discussion. For each filter, this paper notes the order of statistical moments of the input data that the filter preserves. In general, filters that preserve higher-order moments perform better.
4.2. Simple Averaging Filter
The simple averaging filter is the simplest joint filter: the filter output is the average of the most recent N+1 inputs (the current input and N previous ones), which is an MA filter of order N with ai = 1/(N+1) for all i:

X̂n = (1/(N+1)) Σ(i=0..N) Xn−i
From a statistical point of view, the averaging filter is a naive filter that fits a horizontal line (that is, a constant) to N recent inputs and uses it as the filter output. Therefore, an averaging filter is not taking advantage of joint data characteristics or noise statistical distribution, and it preserves only the first-order moment of data, which is the average. A simple averaging filter doesn’t provide satisfactory results in most cases of filtering NUI joints.
An averaging filter with a large N results in more smoothing than a smaller N, but it introduces more filtering delay. The filtering delay can be seen in the output of an averaging filter in response to step function and sinusoidal waveform inputs, where the filtering delay is directly proportional to N. For example, the step function rise times for N=5 and N=10 are about 4.5 frames (148 msec) and 9 frames (297 msec), respectively. The simple averaging filter is a linear phase filter, which means that all frequency components in the input are delayed by the same amount [6]. To see this, try different frequencies for the sinusoidal input in the accompanying spreadsheet, and notice that the output delay is the same for all frequencies.
The smoothing effect of the filter can be more easily noticed when noise is added to step function or sinusoidal inputs. Also note that output from the averaging filter cannot reach the peaks and valleys of most sinusoidal waveform inputs.
Since an averaging filter fits a horizontal line to the data, it forecasts the future outputs as a constant as well, which is the average of the last inputs up to time n:

X̂(n+k|n) = X̂n
The simple averaging filter performs well in terms of forecasting only if the input data is stationary, that is, there is no trend in the data. In NUI joint filtering, stationary input data means that a joint has little movement. If there is movement in the joint, then the filter’s input data has a slope, and averaging filters perform poorly, especially in terms of forecasting. For NUI joint data with movement at a constant speed, it can be shown that the averaging filter has a constant bias in its output [8,9]. A method to eliminate this error is the double moving average filter, which is discussed in the next section.
4.3. Double Moving Averaging Filter
Double moving averages are used in many applications, such as stock market forecasting, and they are useful when data has a linear trend. A trend in NUI joint data is equivalent to a joint movement at a constant velocity. The underlying data model used by the double moving averaging filter fits a line to the local input data, and hence it adapts to changes in the input data better than the simple averaging filter [8,9]. Let M′n and M″n be the first and second order moving averages of the input data at time n:

M′n = (1/N) Σ(i=0..N−1) Xn−i
M″n = (1/N) Σ(i=0..N−1) M′n−i

Therefore M″n is a moving average of a moving average of the input data (hence the name double moving average). If we assume that the underlying data follow a linear model, it can be shown that M′n systematically lags behind the actual data, and that the second moving average M″n lags M′n by approximately the same amount. To account for the systematic lag, the difference of the first and second order averages (M′n − M″n) is added to the filter output. The filter output is then the first order moving average plus the trend adjustment term:

X̂n = M′n + (M′n − M″n) = 2M′n − M″n
A similar approach is used for adjusting trends in forecasting, and future data are estimated as:

X̂(n+k|n) = X̂n + k·(2/(N−1))·(M′n − M″n)
The double moving averaging filter has the advantage of being more responsive to changes in the input data than a moving averaging filter. Note that for a given window N, the filtering equations can be combined and the filter output equation can be rewritten as a weighted moving average of past inputs, which results in an easier implementation. For example, for N=2, the first and second moving averages are:

M′n = (Xn + Xn−1)/2
M″n = (Xn + 2Xn−1 + Xn−2)/4

which results in:

X̂n = 2M′n − M″n = (3Xn + 2Xn−1 − Xn−2)/4
It is interesting to note that this filter uses larger weights for recent inputs, which is generally true for any filtering window N.
4.4. Savitzky–Golay Smoothing Filter
A Savitzky-Golay smoothing filter (also known as a smoothing polynomial filter or a least-squares smoothing filter) fits, for each input Xn, a polynomial to the neighboring input data in a least-squares sense and uses the value of the polynomial at time n as the filter output. A polynomial of order K is defined as:

fK(x) = c0 + c1·x + c2·x² + … + cK·x^K
If we use N previous and M future samples as the neighbor samples, then a Savitzky-Golay filter finds the coefficients ci that minimize the term:

Σ (m = −N … M) [ fK(n+m) − X(n+m) ]²
and uses the polynomial value at time n as the filter output—that is, X̂n = fK(n). Though it may at first seem that this filter requires a complicated implementation, it turns out to be easy. It has been shown that the output of a Savitzky-Golay filter can be expressed as a weighted moving average filter; that is:

X̂n = Σ (i = −N … M) ai · X(n+i)
where the filtering coefficients ai are constants: they do not change for different n, or even for different inputs. Thus, implementing a Savitzky-Golay filter requires only choosing the appropriate filter order K and determining how many samples before and after should be used (that is, choosing N and M). The coefficients ai can then be calculated offline by using available off-the-shelf algorithms, and the filter output is easily calculated from these coefficients [11,12].
Savitzky-Golay filters are optimal in two different senses: first, they minimize the least-squares error in fitting a polynomial to each windowed frame of input data; and second, they preserve the first statistical moments of the input signal. In general, Savitzky-Golay filters are useful for cases in which the frequency span of the noise-free input data is large, and therefore, Savitzky-Golay filters are good candidates for NUI joint filtering.
A Savitzky-Golay filter of order K preserves the first K+1 moments of the data [11]. For K=0, the Savitzky-Golay filter fits fK(x)=c0, a constant value, to the neighbors of each input, which reduces to a simple averaging filter with equal weight coefficients ai. For K=1, a straight line is fitted to the local input data, which is usually referred to as linear regression in statistics textbooks. For K=2 and K=3, a parabola and a cubic curve are fitted, respectively. The choice of K affects the filter's smoothing effect; a cubic curve (K=3) is a good choice for NUI joint filtering, since it is the lowest-degree polynomial that supports inflection and is still smooth. Using a higher-order polynomial yields jumpy curves with too many local minima and maxima, which lessens the smoothing effect.
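As an illustration of the weighted-moving-average form, the sketch below uses the classic precomputed 5-point quadratic Savitzky-Golay coefficients (−3, 12, 17, 12, −3)/35, which correspond to K=2 (the same coefficients arise for K=3) with N=M=2. Note that using two future samples adds two frames of latency:

```python
# Standard 5-point Savitzky-Golay smoothing coefficients for K = 2, N = M = 2.
SG_WEIGHTS = [-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35]

def savgol_smooth(xs):
    """Smooth the interior samples of xs with the fixed weights; the first
    and last two samples are passed through unchanged for simplicity."""
    out = list(xs)
    for n in range(2, len(xs) - 2):
        out[n] = sum(w * xs[n + m] for m, w in zip(range(-2, 3), SG_WEIGHTS))
    return out

# A polynomial of order <= 2 passes through this filter unchanged,
# illustrating the moment-preserving property.
xs = [0.5 * n * n for n in range(10)]
smoothed = savgol_smooth(xs)
```

The weights sum to 1, so constant input is also preserved exactly.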
The Savitzky-Golay filter can also be used for estimating the derivatives of input [11], which is easily calculated from the coefficients ci. Thus, a Savitzky-Golay filter can produce the joint speed and acceleration along the smoothed position for NUI joint data.
4.5. Exponential Smoothing Filter
An exponential smoothing filter, also known as an exponentially weighted moving average (EWMA) filter, is a popular filter in many different fields. The exponential filter output is given by:

X̂n = α·Xn + (1−α)·X̂(n−1)
where α is called the dampening factor and 0≤α≤1. By repeatedly substituting for X̂(n−1), this can be expanded to obtain:

X̂n = α·Xn + α(1−α)·X(n−1) + α(1−α)²·X(n−2) + α(1−α)³·X(n−3) + …
Therefore, the filter output at time n is a weighted average of all past inputs, where the weights ai = α(1−α)^i decrease exponentially with time (more precisely, with geometric progression, which is the discrete version of an exponential function). All previous inputs contribute to the smoothed filter output, but their contribution is dampened by the increasing power of the parameter 1−α. Since X̂n depends on all past inputs, an exponential filter is said to have an infinite memory of all past inputs.
Similar to a simple averaging filter, the exponential filter fits a straight horizontal line to the input data. The difference is that an exponential filter places relatively more weight on recent input data, and is correspondingly more responsive to recent changes in input than a simple averaging filter.
The dampening factor α affects the filtering delay and smoothing effect. A larger α corresponds to larger weight on recent inputs, and results in faster dampening of older inputs. This results in less latency and less smoothing of the data. A small α gives larger weight to older input samples, and hence, the effect of older inputs is larger. This results in a more aggressive smoothing effect and more latency, as shown in Figure 6.
Figure 6. Effect of α on filtering delay of an exponential smoothing filter
Since an exponential filter fits a horizontal line to the data, it forecasts the future data as a constant value, similar to an averaging filter:

X̂(n+k|n) = X̂n
A variant of exponential filter, called a double exponential filter, addresses this limitation of exponential filters and is described in the next section.
One can incorporate state data for a joint into the implementation of an exponential filter by adaptively using a smaller α for joints for which tracking state indicates inferred positions. This results in more aggressive filtering of inferred joints.
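A minimal sketch of this idea follows; the class, the `inferred` flag, and the two α values are illustrative placeholders rather than SDK API:

```python
class ExponentialFilter:
    """EWMA filter that drops to a smaller dampening factor when the
    skeletal tracking system reports a joint position as inferred."""

    def __init__(self, alpha_tracked=0.5, alpha_inferred=0.1):
        self.alpha_tracked = alpha_tracked
        self.alpha_inferred = alpha_inferred
        self.prev = None  # last filter output

    def update(self, x, inferred=False):
        alpha = self.alpha_inferred if inferred else self.alpha_tracked
        if self.prev is None:
            self.prev = x  # seed the filter with the first input
        else:
            self.prev = alpha * x + (1 - alpha) * self.prev
        return self.prev

f = ExponentialFilter()
y0 = f.update(0.0)                   # seeds the filter
y1 = f.update(10.0)                  # tracked: alpha = 0.5
y2 = f.update(10.0, inferred=True)   # inferred: alpha = 0.1, heavier smoothing
```

With the smaller α, the inferred sample moves the output only slightly, which is exactly the more aggressive filtering described above.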
4.6. Double Exponential Smoothing Filter
The double exponential smoothing filter is a popular smoothing filter used in many applications. Similar to a double moving averaging filter, the double exponential smoothing filter smooths the smoothed output by applying a second exponential filter (hence the name double exponential), and it uses this to account for trends in the input data. There are various formulations of double exponential smoothing filters, with minor differences between them, but the most popular formulation is defined by the following set of equations:

X̂n = α·Xn + (1−α)·(X̂(n−1) + b(n−1))
bn = γ·(X̂n − X̂(n−1)) + (1−γ)·b(n−1)

where bn is the trend at time n and 0≤γ≤1.
As can be noted, the trend bn is calculated as an exponential filter of the difference between the filter's last two outputs. Then the sum of the current trend and the previous filter output—that is, X̂(n−1)+b(n−1)—is used in calculating the filter output. Including the trend helps to reduce the delay, as the filter fits a line to the local input data, where the trend bn is the slope of this fitted line. The parameter γ controls the weights on the input data used for calculating the trend, and hence controls how sensitive the trend is to recent changes in input. A large γ results in less latency in the trend; that is, the trend element follows recent changes in input faster, while a small γ gives larger weight to older input samples and, hence, results in a longer delay before the trend element catches up with changes in input.
Note that the trend is the smoothed difference between the last two estimated joint positions (that is, the trend is the smoothed value of X̂n − X̂(n−1)); so the trend can be thought of as the estimated velocity of the joint in the case of NUI joint filtering. Therefore, we can think of γ as the dampening factor used in exponential filtering of the joint velocity, with the smoothed joint velocity accounted for when the joint position is calculated as the filter output.
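A minimal sketch of a double exponential (Holt) smoothing filter follows; the class shape and parameter defaults are illustrative, not SDK code:

```python
class DoubleExponentialFilter:
    """Holt-style double exponential smoothing with a straight-line forecast."""

    def __init__(self, alpha=0.5, gamma=0.5):
        self.alpha = alpha    # dampening factor for position
        self.gamma = gamma    # dampening factor for trend (joint velocity)
        self.prev = None      # previous output
        self.trend = 0.0      # previous trend estimate

    def update(self, x):
        if self.prev is None:
            self.prev = x     # seed with the first input
            return x
        out = self.alpha * x + (1 - self.alpha) * (self.prev + self.trend)
        self.trend = self.gamma * (out - self.prev) + (1 - self.gamma) * self.trend
        self.prev = out
        return out

    def forecast(self, k):
        # Future data forecast as a straight line with slope equal to the trend.
        return self.prev + k * self.trend

f = DoubleExponentialFilter(alpha=0.5, gamma=0.5)
for x in [0.0, 2.0, 4.0, 6.0, 8.0]:
    y = f.update(x)
```

On this constant-velocity input the trend term steadily catches up with the true slope, and the forecast extrapolates along the fitted line.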
The trend factor bn can easily result in overshooting in the filter output when there are sudden joint movements or stops. For example, suppose a person suddenly moves his or her hand and then stops, which results in a filter input similar to a step function (that is, a sudden jump in filter input). This is shown in Figure 7. In this case, the trend term bn helps the filter output to catch up more quickly with this change in input; however, since bn is itself smoothed out, bn needs some time to settle back to zero, which results in overshoot and ringing in the output. Note that there is a delay between the maximum of bn and the overshoot.
Figure 7. Output and trend of a double exponential smoothing filter in response to step function input; α=0.35 and γ=0.70
The double exponential smoothing filter fits a line to the data and, therefore, forecasts the future data as a straight line with a slope equal to the trend term bn:

X̂(n+k|n) = X̂n + k·bn
In general, this filter performs better than a single smoothing filter in terms of forecasting. However, as shown in Figure 8, the filter overshoot is larger in forecasted outputs.
Figure 8. Output and trend of a double exponential smoothing filter in response to step function input; α=0.50 and γ=0.40
There are numerical techniques that adaptively update the α and γ parameters such that the error in filter predictions for a given k is minimized in a least-squares sense. For example, when predicting one sample ahead (that is, k=1), the prediction error at time n considering the past N forecasts is given by:

En = Σ (i = 0 … N−1) [ X(n−i) − X̂(n−i|n−i−1) ]²
Then α and γ are updated at time n in the direction that minimizes this prediction error (for details, see the Levenberg-Marquardt algorithm in [10,12]). This criterion is useful if precise forecasting is the only concern; however, it does not take into account the smoothing effect of the filter, so the α and γ parameters calculated by this approach do not necessarily result in smooth output.
4.7. Adaptive Double Exponential Smoothing Filter
A simple but useful improvement to a double exponential smoothing filter for NUI joints is to adjust the α and γ parameters adaptively based on the joint velocity, such that when the joint is moving slowly, more aggressive filtering is applied by using smaller α and γ values. This adaptation results in smoother output when a joint is not moving quickly. Conversely, larger α and γ values are used when the joint is moving quickly, which results in better responsiveness to input changes and, hence, lower latency.
This idea can be implemented in different ways. For example, an adaptive double exponential smoothing filter could be implemented by using two preset pairs of α and γ parameters: one for low-velocity cases, say αlow and γlow, and one for high-velocity cases, say αhigh and γhigh. There could also be two velocity thresholds, say vlow and vhigh. Then for each input Xn, the velocity is estimated as vn=|Xn−X(n−1)|, and the filtering parameters α and γ are set as a linear interpolation between their low and high values based on the current velocity. For example, the α parameter used at time n, denoted by αn, is set to be:

αn = αlow + ((vn − vlow) / (vhigh − vlow)) · (αhigh − αlow), clamped to [αlow, αhigh]
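The velocity-based interpolation can be sketched as follows; the threshold and endpoint values are illustrative placeholders to be tuned per application:

```python
def adaptive_alpha(v, v_low=0.05, v_high=0.5, a_low=0.1, a_high=0.8):
    """Linearly interpolate the dampening factor between a_low and a_high
    based on the estimated joint velocity v, clamped at the thresholds."""
    if v <= v_low:
        return a_low          # slow joint: aggressive smoothing
    if v >= v_high:
        return a_high         # fast joint: responsive, low latency
    t = (v - v_low) / (v_high - v_low)
    return a_low + t * (a_high - a_low)
```

The same interpolation would be applied to γ; velocities at or below vlow get the most smoothing, and velocities at or above vhigh get the least.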
4.8. Taylor Series Filter
The Taylor series expansion is a well-known representation of a function in mathematics, where a continuous function f(x) is expressed as an infinite sum of terms calculated from its derivatives at a given point a [13]:

f(x) = Σ (i = 0 … ∞) [ f(i)(a) / i! ] · (x−a)^i
where f(i)(a) is the ith derivative of the function f(x) at point a. This series is used in many applications for approximating a function as a polynomial of order N at points close to the expansion point a, where N is the number of terms of the expansion included in the approximation.
The Taylor series can be used for forecasting NUI joint data. The underlying assumption is that the joint movement is approximated over time by a polynomial of order N. In other words, we fit a polynomial of order N to the past N inputs, and then use this polynomial approximation to forecast the next joint data. Note that the NUI joint data are in discrete time; therefore, the derivatives used in the Taylor series coefficients are approximated numerically as higher-order backward differences of the input data. For example, the first, second, and third derivatives are approximated as:

f′(n) ≈ Xn − X(n−1)
f″(n) ≈ Xn − 2X(n−1) + X(n−2)
f‴(n) ≈ Xn − 3X(n−1) + 3X(n−2) − X(n−3)
As noted, the next input is estimated by using the fitted polynomial—that is, X̂(n+1|n) = f(n+1). Substituting the preceding terms for f(i)(n) in a Taylor series polynomial expansion of size N and choosing a=n results in an estimate for the next input in terms of past inputs:
For example, for N=3, by using the preceding equations, one obtains:

X̂(n+1|n) = (8/3)Xn − (5/2)X(n−1) + X(n−2) − (1/6)X(n−3)
which is, again, a weighted moving average of the past inputs.
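A minimal sketch of this forecast, assuming the N=3 backward-difference expansion (which works out to the fixed weights 8/3, −5/2, 1, −1/6), follows:

```python
def taylor_forecast(x0, x1, x2, x3):
    """Forecast X(n+1|n) from X(n)=x0, X(n-1)=x1, X(n-2)=x2, X(n-3)=x3
    using the N = 3 Taylor series expansion with backward differences."""
    return (8 / 3) * x0 - (5 / 2) * x1 + x2 - (1 / 6) * x3

# Constant-velocity movement is forecast exactly, since the backward
# differences are exact derivatives for linear data.
xs = [2 * n + 1 for n in range(6)]   # positions at frames 0..5
pred = taylor_forecast(xs[4], xs[3], xs[2], xs[1])
```

For data with higher-order curvature the backward differences are only approximate derivatives, so the forecast carries some error, which is one reason the Savitzky-Golay derivative estimates mentioned below can help.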
The Taylor series expansion is helpful because it can forecast the future positions of joints. Note that the Taylor series does not smooth the data or attempt to remove any noise by itself, though this can be compensated for by applying a smoothing filter, such as an exponential smoothing filter, to the output of a Taylor series filter. One approach for tweaking the Taylor series filter is to use the Taylor series to forecast the current input from the previous inputs (that is, X̂(n|n−1)) and then calculate the smoothed output as a linear interpolation between Xn and X̂(n|n−1). That is, the filter output is given by:

X̂n = α·Xn + (1−α)·X̂(n|n−1), where 0≤α≤1
Also note that approximating the derivatives using only backward differences is a naive approximation [13]. This is imposed on us because we do not have any future input data to use in the difference equations. The Savitzky-Golay filter is known to produce good estimates for the derivatives, so instead of the difference equations presented in the preceding text, one can use a Savitzky-Golay filter to calculate the derivatives and use them in a Taylor series filter to forecast future outputs. Also remember that the Taylor series approximation is accurate only near the expansion point (that is, when x−a is small), which means it is not practical to forecast more than one input into the future by using the Taylor series.
Note that the Taylor series filter may at first seem identical to the Savitzky-Golay filter, because both fit a local polynomial to the input data. However, they are different: a Savitzky-Golay filter over-fits a local polynomial to the input samples, meaning that the number of input samples used to calculate the model parameters is greater than the number of model parameters. Since there is more than one potential solution, the least-squares approach is used to find the model parameters that minimize the error. This allows the Savitzky-Golay filter to handle noise in the input data better, and the results are much smoother. In a Taylor series filter, however, there is no over-fitting of data, and the underlying polynomial is obtained simply by approximating the derivatives of the input: exactly N+1 samples (the N previous samples along with the current sample) are used in calculating the N+1 polynomial coefficients.
4.9. Median Filter
In a median filter (also known as moving median filter), the filter’s output is the median of the last N inputs. Median filters are useful in removing impulsive spike noises, as shown in Figure 9. Ideally, the filter size N should be selected to be larger than the duration of the spike noise peaks. However, the filter’s latency directly depends on N, and hence, a larger N adds more latency.
Figure 9. Median filter applied to actual NUI data. Ideally, the filter order N should be larger than the duration of the spike noises.
Median filters do not take advantage of the statistical distribution of data or noise, and though they have some smoothing effect, they are not suitable for removing random noise from joint data.
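A moving median filter is a few lines of code; this sketch (illustrative only) shows a single-frame spike being removed completely when the window is longer than the spike:

```python
from collections import deque
from statistics import median

def moving_median(xs, N=5):
    """Output the median of the last N inputs (fewer at the start)."""
    window = deque(maxlen=N)
    out = []
    for x in xs:
        window.append(x)
        out.append(median(window))
    return out

# A one-frame spike of +100 at frame 5 never reaches the output, because
# it is never the median of any 5-sample window.
xs = [1.0] * 10
xs[5] = 101.0
filtered = moving_median(xs, N=5)
```

If the spike lasted three frames instead of one, a 5-sample window would let it through, which is why N should exceed the spike duration.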
4.10. Jitter Removal Filter
A jitter removal filter attempts to dampen the spikes in input by limiting the changes allowed in the output in each frame. That is, the filter output is the same as the input if the difference between the current input and the previous filter output is less than a threshold. Otherwise, the filter limits the changes in output, which can be done by using different methods. For example, the following variant of a jitter removal filter uses an exponential filter to dampen large changes seen in the input:

X̂n = Xn, if |Xn − X̂(n−1)| ≤ T
X̂n = α·Xn + (1−α)·X̂(n−1), otherwise

where T is the jitter threshold.
Alternately, one can use a simple averaging filter instead of the exponential filter. Since median filters usually perform better in terms of removing spikes, changes in input can instead be limited by the median:

X̂n = Xn, if |Xn − Xmed| ≤ T
X̂n = Xmed, otherwise

where Xmed denotes the median of the last N inputs.
Jitter removal filters essentially bypass the filtering when no jump larger than the threshold is detected in the input data. Ideally, the threshold should be selected to be less than the impulsive jumps in the input that are due to spike noise, but larger than the normal changes that are due to actual joint movements; in practice, these two criteria may overlap. In a Kinect-enabled application, the joint tracking state can be used to adaptively select this threshold—that is, to use a smaller threshold when the joint data is inferred.
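The exponential-dampening variant can be sketched as follows; the threshold and α defaults are illustrative placeholders to be tuned per joint:

```python
class JitterRemovalFilter:
    """Pass small frame-to-frame changes through untouched; dampen jumps
    larger than the threshold with an exponential filter step."""

    def __init__(self, threshold=0.1, alpha=0.5):
        self.threshold = threshold
        self.alpha = alpha
        self.prev = None  # last filter output

    def update(self, x):
        if self.prev is None or abs(x - self.prev) <= self.threshold:
            self.prev = x  # small change: bypass filtering entirely
        else:
            # large jump: limit the change with exponential dampening
            self.prev = self.alpha * x + (1 - self.alpha) * self.prev
        return self.prev

f = JitterRemovalFilter(threshold=0.1, alpha=0.5)
f.update(1.0)
small = f.update(1.05)   # within threshold: passed through unchanged
spike = f.update(2.05)   # jump of 1.0: dampened toward the previous output
```

Because small movements bypass the filter, normal joint motion suffers no added latency; only suspect jumps are smoothed.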
5. Practical Tips and Takeaways
Following is a summary of the tips described in this white paper:
- No filtering solution fits all cases: There is no filtering technique that can be universally used in all Kinect-enabled applications. You must choose and fine-tune the filtering techniques that are right for your application.
- Latency vs. Smoothing Tradeoff: Be aware of the tradeoff between latency and smoothing in joint filtering. Understand how latency affects your application, and become familiar with different filtering techniques so you can choose and fine-tune the right filter for your application.
- Filter per-joint and per-application: Joints have a variety of characteristics, and depending on how the filter output is to be used, different filtering techniques should be used per joint.
- Remember you can filter any data: We usually apply filtering to the Cartesian coordinates (x,y,z) of joints; however, filtering can be applied to any data calculated from joint positions. For example, one can directly filter the bone length, relative coordinates of a joint with reference to another joint, spherical coordinates of a joint, and so on. Applying mathematical calculations to noisy data or calculating relative joints usually amplifies the noise, so be careful.
- It may take more than one filter to get good results: A good filtering solution is usually a combination of various filtering techniques, which may include applying a jitter removal filter to remove spike noise, a smoothing filter, and a forecasting filter to reduce latency, and then adjusting the outputs based on person kinematics and anatomy to avoid awkward cases caused by overshoot.
- Use the joint tracking state: Take advantage of the joint tracking state data, and use it in your filter implementations to apply more aggressive smoothing and jitter removal when a joint position is inferred.
- Include future frame data in filtering, if possible: In offline cases, or in cases when it’s acceptable to increase latency by a few frames, include the future joint data in filtering (for example, use ARMA filters, such as central moving average or Savitzky-Golay with M ≥ 0), which will result in better noise removal.
- Account for the actual time elapsed between two NUI frames: In some cases, the call to get the joint positions from the ST system may fail, which would result in a dropped frame. Therefore, the input data to the filter would be missing for that dropped frame, which means that the joint positions are no longer 33 msec apart from each other. Make sure that your filtering implementation uses the time stamp of the NUI frames and that it estimates the missing data points with an interpolation of data before and after the missing data.
- Reset the filter when a skeleton is lost: When a person moves outside of the camera’s field of view, the skeleton is lost and is no longer tracked. The ST system may use a different tracking ID for the same person later, or it may reuse that tracking ID for other persons. So make sure you reset the filter after a person’s skeleton is lost.
- Use debug visualization to see how filter parameters affect your application: Visualizing the filtered joint positions on top of the depth map is useful to see how your filtered data is different from the actual depth data. It is useful to visualize only one joint, like the right hand, and render a history of previous positions on screen. This joint path will show you how smooth the filter output was over the past frames.
- Use Kinect Studio to export joint data and analyze them offline: You can use Kinect Studio to record joint data from your application and export it for offline analysis.
Note that the SDK includes the NuiTransformSmooth method, which is an implementation of a joint filter that is based on double exponential smoothing combined with jitter removal and some overshoot control.
6. References
Error Analysis and Error Propagation in Noisy Data
1. John Robert Taylor, An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements. University Science Books, 1999.
2. Manfred Drosg, Dealing with Uncertainties: A Guide to Error Analysis. Springer, 2009.
3. Philip Bevington and D. Keith Robinson, Data Reduction and Error Analysis for the Physical Sciences, 3rd edition. McGraw-Hill Science, 2003.
4. Ifan Hughes and Thomas Hase, Measurements and their Uncertainties: A Practical Guide to Modern Error Analysis. Oxford University Press, 2010.
ARMA Filters
5. Peter J. Brockwell and Richard A. Davis, Time Series: Theory and Methods, 2nd edition. Springer, 2009.
Digital Filter Design in Frequency Domain
6. Alan V. Oppenheim, Ronald W. Schafer, and John R. Buck, Discrete-Time Signal Processing. Prentice Hall, 1999.
7. A. Antoniou, Digital Filters: Analysis, Design, and Applications. McGraw-Hill, 2000.
Moving Average and Exponential Filters
8. Robert Goodell Brown, Smoothing, Forecasting and Prediction of Discrete Time Series. Dover Publications, 2004.
9. Hoang Pham, Springer Handbook of Engineering Statistics. Springer, 2006.
10. Jorge Nocedal and Stephen J. Wright, Numerical Optimization, 2nd edition. Springer, 2006.
Savitzky-Golay Filter
11. Vijay Madisetti, The Digital Signal Processing Handbook. CRC Press, 2009.
12. William H. Press, Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, 2007.
Taylor Series and Derivative Estimation Using Difference Equations
13. Joe D. Hoffman, Numerical Methods for Engineers and Scientists, 2nd edition. CRC Press, 2001.