Article
08/10/2015

October 2012

Volume 27 Number 10

Numerics - Testing Math Functions in Microsoft Cloud Numerics

By Stuart Brorson | October 2012

Suppose you need to perform a mathematical calculation. For example, suppose you need to know the sine of 34°. What do you do? You probably turn to a calculator, computer or some other smart device. On your device, you type in “sin(34)” and you get an answer, often with 16 decimal digits of precision. But how do you know your answer is correct?

We’re so accustomed to getting mathematical answers from our electronic gizmos that nobody thinks about whether the answers are correct! Just about everybody takes it for granted that our machines give us the right answers. However, for a small set of software quality engineers, correctness can’t be taken for granted; the job is all about getting the right answer. This article explains how the math functions in the new Microsoft Cloud Numerics math library are tested.

Most scientific and engineering computations use floating-point arithmetic. The IEEE standardized the basic workings of floating-point math in 1985. A key feature of the standard is that it acknowledges not all numbers can be represented on a computer. While the set of real numbers is infinite and uncountable, the set of IEEE floating-point numbers is necessarily finite and countable because the numeric representation of floating-point numbers uses a fixed number of bits. The construction of IEEE floating-point numbers implies that they’re also rational, so commonly used numbers such as π aren’t expressed exactly, and operations using them, such as sin(x), aren’t exact, either.

Moreover, unlike with integers, the spacing between IEEE floating-point numbers is locally uniform, but it increases logarithmically with the magnitude of the number itself. This logarithmic aspect is depicted in Figure 1, which shows schematically the locations of uniform chunks of IEEE floating-point numbers on the real number line. In the figure, valid chunks of floating-point numbers (embedded in the real number line) are indicated by vertical lines. Note that the distance between valid floating-point numbers increases logarithmically as the magnitude of x increases. The distance between two adjacent floating-point numbers is often called the Unit of Least Precision or Unit in Last Place (ULP), and is provided as a built-in function called eps(x) in many common math programs. A consequence of the variation in spacing is that a small number close to 0 has no effect when added to a relatively larger number such as 1.

Figure 1 The Floating-Point Number Grid Depicted Schematically

Besides codifying formats for floating-point numbers, the IEEE standard provides strict guidance about how accurate the returns from the most fundamental math operations must be. For example, the IEEE standard specifies that the returns from the four arithmetic operations (+, -, * and /) must be “best rounded,” meaning that the answer returned must be closest to the “mathematically correct” result. Although this requirement was originally met with some resistance, returning the best-rounded result now takes place in hardware, showing how commonplace it has become. More interestingly, the IEEE floating-point spec also requires that the square root function return the best-rounded result. This is interesting because it opens the door to two questions: “How accurately can an irrational function be computed using the IEEE floating-point representation?” and “How should you test the accuracy of a function implementation?”

Computing Accurate Functions

What factors go into determining the accuracy of a computation? The reality is that we should expect to encounter inaccuracies, as round-off can occur in any step in an algorithm. These errors might compound, although sometimes they might also cancel. In practice, an algorithm designer must learn to live with and contain the inherent inaccuracy of numeric computations.

You can separate the factors contributing to computational error into two categories:

Factors related to the algorithm used to perform the computation. In particular, this means the stability of the algorithm itself and its sensitivity to round-off errors. A poor, unstable algorithm will tend to return less-accurate results, whereas a good, stable algorithm will return more-accurate results.
The intrinsic nature of the function itself and the domain of its inputs. There’s a theoretical limit to the achievable accuracy of any function implementation when computation uses a finite (as opposed to an infinite) number of bits. The reason is that round-off acts like a source of error in the computation, and the behavior of the function itself determines whether this error is amplified or attenuated as the calculation proceeds from the inputs to the outputs. For example, computing values of a function near a singularity, a zero or a fast oscillation might be less accurate at those points than computing a function that varies slowly over the same input domain. When talking about this intrinsic attainable accuracy, we speak of how “well conditioned” the function is.

Accordingly, testing the accuracy of a function implementation involves verifying that the algorithm returns results as accurate as theoretically possible. The limits of theoretical possibility are established by the conditioning of the function at each of its inputs. To put it another way, using floating-point numbers to compute a result introduces two possible sources of error: Errors that we can avoid by using good, stable algorithms (factor 1), and errors that are harder to avoid because they’re related to the function’s behavior at its inputs (factor 2).

In numerical analysis, the concept of conditioning is quantified by the so-called “condition number,” which measures the sensitivity of a function to variations in its inputs. Depending on the exact nature of the function under consideration (for example, the number of inputs, scalar versus matrix and so on), there are a number of complicated expressions for the condition number. The simplest case is for a differentiable function of one variable, such as 1/x, x², sin(x) or any of the other functions you likely encountered in high school algebra. In this simple case the condition number of the function f is given by:

Equation 1

We’ll refer to this later as Equation 1. As always, the notation f’(x) means the derivative of the function f(x). A large condition number (K_f* >> 1*) indicates high sensitivity to relative perturbations, whereas a small condition number (K_f* <= 1*) indicates the function is relatively insensitive to perturbations. Intuitively, you can see that if a function has a singularity—that is, the function blows up, such as 1/x near x = 0—the derivative will capture this effect and return a large condition number.

For example, consider the function f(x) = 1/(1-x). This function obviously blows up (that is, becomes infinite) at x = 1. The derivative is f’(x) = 1/(1-x)2. Plugging this into Equation 1 and eliminating terms gives the condition number for this function:

The condition number K_f(x) goes to infinity around x = 1, meaning that computations involving this function will be sensitive to perturbations such as round-off error when xis close to 1. Because f(x) blows up for x → 1, this behavior is expected. Also, the condition number K_f* (x)* goes to 0 when x is close to 0. This indicates that computations using this function will be insensitive to perturbations when x → 0. This makes intuitive sense because f(x) tends to a constant value of 1 when x is much smaller than 1.

Testing Function Accuracy

So what does a practical test of a mathematical function look like? Suppose we want to test our implementation of the scalar function y = f(x), where x is the input and y is the returned output value. Given a floating-point input value x, testing requires three items:

The value computed by the function under test, y_computed* = f*_computed(x).
The “mathematically true” result. Assume we had an oracle that could tell us the exact, mathematically correct answer. Then round that value to the nearest floating-point value (so it can be represented on a computer). Call this value y_true* = f*_true(x).
A testing tolerance. For some functions, this tolerance can be 0, which means that ycomputed is exactly the same as the mathematically true value, ytrue. For example, the floating-point spec mandates that the function sqrt() returns the floating-point value that lies closest to the exact answer (that is, best-rounded). In the general case, however, we can’t expect the returns from all functions to be the best-rounded approximations to their exact values. Therefore, the testing tolerance must incorporate information about how much error is allowed in the return from f(x). Note that the tolerance might depend on the details of the function f(x), as well as the exact value of the input, x.

With these ingredients, the accuracy of our function implementation is deemed acceptable (that is, it passes our test) if

| y_computed* - y*_true* | < tol(f; x),*

where the tolerance depends both upon the input value x and the behavior of the function f itself.

In words, this equation says that the function passes the test if the returned value differs from the “true” value by an amount less than the testing tolerance (the allowed error). Note that | y_computed* - y*_true | is the absolute error of the computed function.

In this formalism, a function test is performed by creating a large number of input values lying in the input domain of the function, running these values through the function under test and comparing the function outputs y_computed against the best-rounded (mathematically true) values, y_true. The values of the inputs should be chosen to cover all relevant portions of the function’s valid input domain.

At this point, the tester has two questions to answer: What is the “true” result of a function? and What is a reasonable tolerance?

To answer the first question, the easiest thing to do is to use an “infinite precision” math package to create the input/output pairs used for testing. This package doesn’t use 32- or 64-bit floating-point numbers to compute a value. Rather, it uses a numeric representation—or better yet, a symbolic representation—of a number that may carry the ability to compute an arbitrarily large number of digits through the computation at the expense of computational speed. Several commercially available mathematics packages implement infinite precision math. Also, many infinite precision math libraries can be plugged into common languages. The ability to use infinite precision math at modern speeds is a recent innovation, making this type of testing convenient. Any of these resources is sufficient to create so-called “golden value” pairs useful for testing a function’s floating-point implementation.

Putting It Together—Getting the Testing Tolerance

Once we have golden values, we need to answer the other question: What is a reasonable testing tolerance? Getting the correct testing tolerance is the critical component of testing. If the tolerance is unrealistically small, the function will never pass its test, even if the best algorithm is used to compute it. On the other hand, if the testing tolerance is too large, it means that the allowed error is larger than it needs to be, and the function will pass the test, even if the algorithm is faulty.

The condition number defined earlier is the key to determining the acceptable error allowed in function testing. The expression for the condition number can be rearranged to read:

Equation 2

We’ll refer to this later as Equation 2. If we identify ∆x as the distance between one floating-point number and the next, this expression tells us how much the output of f(x) will jump as we move from one floating-point number x to its neighbor, x + ∆x. Traditionally, the field of computer numerics takes for ∆x the function eps(x), the spacing between any two adjacent floating-point numbers. Note that because the grid spacing is non-constant (see Figure 1), eps(x) is a function of x. (As noted earlier, the distance between one floating-point number and its nearest neighbor is also related to a ULP—the magnitude represented by the least-significant bit.)

Next, we demand that the output error in our test, y_computed* - y*_true, be less than some multiple of this jump. That is, we ask that, where C is a constant:

Intuitively, this expression captures the following idea: If x makes one step on the floating-point grid, the change in the output should be no greater than Equation 2. An accurate function f(x) will change its output by only the amount derived from the condition number, and no more. Therefore, output perturbations occurring during the computation of f(x) should be less than the change caused by taking one step on the floating-point grid. The constant C is a “fudge factor.” Obviously, as C increases, the allowed tolerance increases with it. We can therefore interpret this constant as the allowed functional error over the theoretical minimum. In real testing, Ctakes integer values between 1 and 10, which we interpret as the allowed error, expressed in ULPs.

Finally, using the definition of condition number, we can rearrange the equation in a way that exposes the testing tolerance:

| y_computed* - y*_true* | ≤ C | f'(x) | eps(x) = tolerance*

This is our desired test—it’s the expression that must be satisfied by an accurate implementation of the function f(x) for any input x. In other words, this is the expression used to validate scalar math functions. Note that this expression should hold true for all valid input values x.

The final question to be settled is, “What is the correct value of C to use in testing?” Because we interpret C as the number of ULP errors allowed, C is an integer of order 1. We set C by starting with an initial C value (usually 10) and running the test. If the test passes, decrease Cby 1 and run the test again. Repeat this procedure for decreasing C values until the test fails. Then choose the value of C that last allowed the test to pass (but always keep a human in the loop to assure reasonableness). The goal of this process is to close the door to passing as much as possible (while still allowing the function to pass the test) so you can be assured the function is tested as rigorously as possible. Well-behaved functions typically require C values between 1 and 3, but more complicated functions might require larger C values for passing. We find that very few function implementations require C > 10 to pass, and when we find a function that requires C > 10, it often signals a suboptimal implementation.

From the perspective of software testing, once C is determined for a function’s test (and the test passes), we know that any subsequent test failure means that something in the function implementation has changed—probably for the worse (for example, a regression has occurred). Of course, trade-offs in algorithm design may dictate that it’s worth it to give some slack on C if the regression is small (for example, a speed versus accuracy trade-off). Likewise, after improving an algorithm, it’s worth it to see if C can be tightened further. In any event, one job of software testing becomes managing the testing tolerances as the rest of the software is developed.

Golden Value Testing and Beyond

In the testing methodology we’ve discussed so far, we used precomputed input/output pairs scattered throughout the entire input domain of the function under test. We call this type of testing golden value testing, meaning that the infinite precision test oracle provides golden input/output pairs that are known to be highly accurate. The pairs can be stored in a file and read into the test program when it’s executed. As a black-box method to test functional accuracy, this approach works quite well. However, other types of function tests also provide important ways to validate function behavior. Other tests include:

Special value testing: Many functions return known, exact theoretical values for certain inputs, such as cos(0) => 1, sin(π) => 0 and gamma(1/2) = √π, for examples. In a special value test, known fixed inputs are fed to the function under test, and the function return is compared to the known result. Of course, because irrational numbers such as π don’t lie exactly on the floating-point grid, the closest floating-point approximation to those numbers must be used in testing, and a non-zero value for sine and testing tolerance (as computed earlier) is used to validate the computational results.
Identity testing: Many functions obey identities that are true for all inputs over a known domain. For example, sin(x)^2 + cos(x)^2 == 1 for all inputs x. Another one is arcsin(x) == -i log(i*x + sqrt(1 – x2)). The difference between an identity test and a special value test is that the identity test is true for arbitrary input, whereas the special value test only holds true for one particular input value. As a consequence, identity tests validate the relationships between functions.

You must be careful with identity tests. For one thing, some identities are badly conditioned. For example, intermediate results might grow to be quite large even though the final result is moderately sized. In this case, small, allowable errors incurred during intermediate steps in the calculation can cause spurious test failures. Also, many functions (for example, the inverse trig functions) are multivalued in the complex plane. Blindly using identity tests might cause spurious test failures when the left- and right-hand sides of the identity return correct values lying on different Riemann sheets in the complex plane. The tester must carefully craft identity tests to avoid both of these problems.

Inverse testing: This involves computing a function composed with its inverse and verifying the result is the same as the input to the test. For example, for positive x we can test log(exp(x)) = x. You might think of this as a special case of identity testing, and indeed it is. However, because so many functions have inverses—and mathematical consistency demands that a function composed with its inverse return the original input—we use a distinct test category to validate that functions and their inverses behave in consistent ways. The same caveats that apply to identity tests (such as conditioning or returns lying on different Riemann sheets) also apply to inverse testing.
Series summation testing: Almost all transcendental functions admit a series expansion over some input domain. In series summation testing, we compare the value returned by the function under test to the value returned by an explicitly summed series in the test. You might regard series summation as a type of identity test. However, the considerations going into creating a good series summation test are similar across functions (for example, stability and convergence criteria). Therefore, we consider this type of testing conceptually different from identity testing.
Continued fraction expansions: Some functions may also be evaluated over a known domain using a continued fraction expansion. When such an expansion exists, it might converge much more quickly than a series sum. Therefore, for certain functions, the series summation test might be substituted with a continued fraction expansion.
Constructed value testing (also known as model-based testing): In constructed value testing, you create a simple, verifiably correct implementation of the function within the test itself. This type of test doesn’t necessarily apply to scalar math functions. However, a good math package isn’t limited to the set of analytic functions. For example, consider functions such as floor, ceiling, mod, concatenation operators for vectors and so on. Each of these functions may be tested by writing code that implements the function using basic computer language constructs. For example, floor, ceiling and the various numeric chop functions may be modeled using cast operators that convert the floating-point value to an integer and then convert the result back to a float (with some simple math performed to mimic the exact behavior of the function under consideration). A benefit of this type of test is that the model implementation is usually trivial and is therefore verifiable by inspection.
Error testing: These tests ensure that invalid input is handled properly by returning a clean error that enables a programmer to clearly and quickly understand the problem. For example, when working in real numbers, sqrt(x) should return a useful error (or NaN) if x < 0. Some input errors are known to crash the computer if they aren’t handled properly, and the tester’s job is to make sure this doesn’t occur.

The benefit of these types of tests is that they provide a cross-check of the function’s correctness using only mathematics itself. Because these methods use the self-consistency of mathematics as the foundation of testing, they’re “known good” and don’t rely on numeric computations using third-party software.

Validating the Approach

One way to appreciate the advantage of condition number tolerances is to examine the tolerances used to validate scalar math functions that have singularities in their input domains. For example, the common trig functions tan and cot are singular at an infinite number of points along the real number line. If fixed, constant tolerances were used for accuracy validation, either these functions would fail their tests for input values close to the singularities, or the testing tolerances necessarily would be so large that the tests wouldn’t be effective. With condition number tolerances, we can validate the scalar functions to within a small handful of ULPs over the entire input domain of the function. Shown in Figure 2are examples of passing ULP values we’ve found during qualification of functions distributed in the Microsoft Cloud Numerics product. That is, these are the ULP values at which the corresponding function passes its test.

Figure 2 Passing ULP Values

Function	Input Domain	Passing ULP	Comment
Atan2	Real numbers	1	Two input, four quadrant arctan
Cot	Real numbers	3
Coth	Complex numbers	3
Cuberoot	Real numbers	2
Ellipe	Real numbers x such that 0 ≤ x ≤ 1	4	Elliptic integral E
Ellipk	Real numbers x such that 0 ≤ x ≤ 1	6	Elliptic integral K
Expm1	Real numbers	3	Exp(x) - 1
Gamma	Real numbers	3
Log	Real numbers	3
Log	Complex numbers	3
Log1p	Real numbers	1	Log(1+x)
Log1p	Complex numbers	1	Log(1+x)
Psi	Real numbers	6	Psi function—derivative of gamma function
Sqrt	Real numbers	0	IEEE spec requires this to be “best rounded”
Tan	Real numbers	1
Tanh	Real numbers	4
Tanh	Complex numbers	4

The conclusion to be drawn from the examples in Figure 2 is that the special functions are implemented using algorithms that are accurate to within a small number of ULPs. The worst cases presented in Figure 2 have an error less than 6 ULPs, which corresponds to log10(2^6) = 1.8 decimal digits of error in the lowest-order digits of the number. For a floating-point double, this corresponds to a relative error of 1.3323e-015.

State-of-the-Art Testing

To review, a rigorous testing methodology has been successfully applied to validation of the algorithms used in the Microsoft Cloud Numerics math packages. Using testing tolerances derived from the function’s condition number has enabled us to identify and correct incorrect function implementations, even when the errors occurred only in the last few digits of the function returns. Fixed testing tolerances can’t provide a stringent test that doesn’t also cause spurious failures (such as near function singularities). The majority of our testing involved using golden value input/output values that are precomputed using infinite precision math packages and then stored in files that are read at test time. In addition to using golden values, we also performed cross-checks on our function implementations using a variety of different math identities to validate the behavior of our functions.

We believe users of Microsoft Cloud Numerics can use its array of math and statistics library functions and be confident that the functions have been thoroughly vetted using state-of-the-art software testing practices.

For more information, we encourage you to read David Goldberg’s classic description of floating-point math, “What Every Computer Scientist Should Know About Floating-Point Arithmetic,” which is available at bit.ly/vBhP9m and other sites.

Stuart Brorson *is an SDET at the Microsoft NERD Center in Cambridge, Mass. He joined Microsoft when the company acquired Interactive Supercomputing, which developed software allowing users to run mathematical and matrix computations on parallel supercomputers. For fun, Brorson enjoys hacking electronics and playing Irish fiddle around Boston. *

Ben Moskowitz is an SDET specializing in testing mathematics-based projects at Microsoft. He has contributed to the releases of Cloud Numerics and Solver Foundation, an optimization package and platform that won the Visual Studio Lighthouse Award in 2011. Off-hours, he spends time with his wife, caring for their city-dwelling goats.

Alan Edelman is a professor of mathematics and a member of the Computer Science and AI Laboratories at MIT in Cambridge, Mass. Professor Edelman has won numerous prizes for his work on numerical analysis, parallel computing and random matrix theory. He was the founder of Interactive Supercomputing, is one of the founders of the Julia project, has learned much from Velvel Kahan and has participated heavily in the Numerical Mathematics Consortium on issues of floating-point mathematics*.***

Thanks to the following technical expert for reviewing this article: Paul Ringseth