Practical Floating Point | TommyWu's Lab

Published: January 4, 2011

Author: TommyWu

Translation · Original: Friday Q&A 2011-01-04: Practical Floating Point · Author: Mike Ash

Original: https://www.mikeash.com/pyblog/friday-qa-2011-01-04-practical-floating-point.html Published: 2011-01-04 Author: Mike Ash Translator: MiMo (mimo-v2.5-pro); code blocks are kept in the original English

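Welcome to the first Friday Q&A of the new year. For the first post of 0x7DB, I decided to write about practical floating point, a topic suggested by Phil Holland.

First, I want to set out the scope of this article. I do not intend a deep theoretical discussion of floating-point calculations, nor even the sort of things you need to know for heavy numeric or scientific work. The classic What Every Computer Scientist Should Know About Floating-Point Arithmetic covers that ground well.

What I intend to cover is how to approach floating-point arithmetic in a practical, pragmatic way when writing everyday Mac or iOS applications: where to use floating point, where not to use it, what it is good for, and some useful tricks.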

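Myths

Before getting into the meat, I want to discuss two myths that are fairly pervasive in the programming community:

Floating point calculations are slow
Floating point calculations are inaccurate

Floating-point accuracy is harder to characterize than simply calling it "inaccurate". Many floating-point calculations produce exact results. Most others produce the closest result to the exact answer that can be represented. You need to understand accuracy properly to use floating point properly, but it is not always bad and not always a problem.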

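Floating Point Representation

While I don't want to get into the exact binary representation of floating-point numbers, it is useful to understand the basics of how they are represented. Those interested in the lowest-level details can read the IEEE-754 spec.

Note that nothing in C requires the floating-point types to use IEEE-754 semantics. However, that is what all Apple platforms use, and what you are likely to find almost anywhere else, so everything I discuss here assumes IEEE-754.

You are probably familiar with scientific notation. To put a number in scientific notation, you normalize it by multiplying or dividing by 10 until it lies in the range [1, 10), and then multiply by the power of 10 that restores its original magnitude: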

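42 = 4.2 × 10¹
998.75 = 9.9875 × 10²
0.125 = 1.25 × 10⁻¹

Floating-point numbers work the same way, except in base 2: the number is normalized into the range [1, 2) and multiplied by a power of 2:

42 = 101010₂ = 1.01010₂ × 2⁵
998.75 = 1111100110.11₂ = 1.11110011011₂ × 2⁹
0.125 = 0.001₂ = 1.0₂ × 2⁻³

Each of these can be written as a (mantissa, exponent) pair:

(1.01010, 5)
(1.11110011011, 9)
(1.0, -3)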

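You'll notice that the leading digit in all three is 1. In fact, the leading digit is always 1, except when representing zero, which is a special case. Since the leading digit is always 1, there is no need to store it. The pairs can then be reduced to:

(01010, 5)
(11110011011, 9)
(0, -3)

There are some special cases: zero is one, as are infinity and a few others. But the basic form is these simple pairs.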
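The binary normalization described above can be sketched in C. This is an illustration of the idea rather than how the hardware stores numbers: `normalize_binary` and `print_pair` are hypothetical helper names, and the sketch assumes a positive, finite, normal input. Halving or doubling such a value only changes its exponent, so the loops recover the pair without rounding.

```c
#include <stdio.h>

/* Hypothetical helper: splits a positive, finite x into a mantissa in
   [1, 2) and a power-of-two exponent so that x == mantissa * 2^exponent.
   Halving and doubling only adjust the exponent, so each step is exact
   for normal values. */
double normalize_binary(double x, int *exponent)
{
    int e = 0;
    while (x >= 2.0) { x /= 2.0; e++; }
    while (x < 1.0)  { x *= 2.0; e--; }
    *exponent = e;
    return x;
}

/* Prints the (mantissa, exponent) pair for one value, with the mantissa
   shown in decimal. */
void print_pair(double x)
{
    int e;
    double m = normalize_binary(x, &e);
    printf("%g = %.11g x 2^%d\n", x, m, e);
}
```

For 42.0 the helper reports a mantissa of 1.3125 and an exponent of 5; 1.3125 is simply 1.01010₂ written in decimal.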

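Observations

Knowing how these numbers are represented, we can make some useful observations about their properties.

Any integer whose binary representation fits within the mantissa can be represented exactly, with no error. For a double, this means that any integer up to 2⁵³, or about 9 quadrillion, can be represented exactly. For a float, integers up to 2²⁴, or a bit under 16.8 million, can be represented exactly.

Larger numbers can be represented as well, but with less precision. Immediately past the limits above, only even numbers can be represented. As the numbers grow further, only multiples of 4 can be represented, then multiples of 8, then 16, and so on.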
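These limits can be checked with a round trip through the floating-point type. A minimal sketch, assuming IEEE-754 double and float; the helper names are hypothetical, and the inputs are assumed to be at most 2⁶² in magnitude so that converting back to an integer cannot overflow.

```c
#include <stdbool.h>

/* Hypothetical helpers: report whether an integer comes back unchanged
   after being converted to double or float. Assumes |n| <= 2^62 so the
   conversion back to long long stays in range. */
bool survives_double(long long n)
{
    return (long long)(double)n == n;
}

bool survives_float(long long n)
{
    return (long long)(float)n == n;
}
```

With these, 2⁵³ (9007199254740992) and 2⁵³ + 2 survive the trip through double while 2⁵³ + 1 does not; for float the same pattern appears around 2²⁴ (16777216).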

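Fractions can be represented exactly if and only if they can be expressed as a sum of powers of two that fits in the mantissa. For example, 3/4 = 1/2 + 1/4 = 1.1₂ × 2⁻¹ = (1, -1). However, a seemingly simple number such as 1/10 cannot be represented exactly in floating point. The best you can do is a close approximation: (10011001100110011…, -4).

To put it differently: every floating-point number can be written out exactly as a finite decimal, but many finite decimals cannot be represented exactly as floating-point numbers. This is why you should never use floating point to represent currency.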
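A small sketch of the currency point, assuming amounts are tracked as whole cents in an integer type (one common approach, not something the article prescribes). The function names are hypothetical. On typical IEEE-754 double arithmetic, adding 0.1 ten times does not land exactly on 1.0, while the integer total stays exact.

```c
#include <stddef.h>

/* Adds 0.1 to a running total `count` times using double arithmetic.
   Because 0.1 is only approximated in binary, rounding errors build up. */
double sum_dimes_as_double(size_t count)
{
    double total = 0.0;
    for (size_t i = 0; i < count; i++)
        total += 0.1;
    return total;
}

/* Adds 10 cents `count` times using integer arithmetic, which is exact. */
long sum_dimes_as_cents(size_t count)
{
    long total = 0;
    for (size_t i = 0; i < count; i++)
        total += 10;
    return total;
}
```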

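Literals

When writing floating-point constants in code, it is important to be mindful of the semantic difference between integer constants and floating-point constants. For example, the following trap is common:

    double halfpi = 1/2 * M_PI;

Because 1 and 2 are both integer constants, 1/2 is an integer division that evaluates to 0, so halfpi ends up as 0. To fix this, simply put a decimal point on the literals to make them floating-point values. In a case like this only one of the numbers needs it, because the other will be converted to floating point automatically, but it is clearer to do it for both:

    double halfpi = 1.0/2.0 * M_PI;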
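The two versions can be compared side by side. In this sketch `PI_APPROX` is an assumed stand-in for M_PI (which is a widely available extension rather than part of standard C), and the function names are hypothetical.

```c
/* Assumed stand-in for M_PI so the sketch does not depend on extensions. */
#define PI_APPROX 3.14159265358979323846

/* The trap: 1/2 is evaluated in integer arithmetic and yields 0, so the
   whole expression is 0.0. */
double half_pi_integer_literals(void)
{
    return 1/2 * PI_APPROX;
}

/* The fix: floating-point literals force a floating-point division. */
double half_pi_float_literals(void)
{
    return 1.0/2.0 * PI_APPROX;
}
```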

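Accuracy

There are various accuracy requirements placed on arithmetic operations on floating-point numbers. In particular, the four basic operations of addition, subtraction, multiplication, and division are required to produce exactly the correct result if that result is representable. If it is not representable, they must produce the floating-point number closest to the correct result.

Combine this with the fact that a large range of integers is exactly representable. As long as the operands and the result stay within that range, addition, subtraction, and multiplication of integers stored in floating-point numbers will be exact. Division with an integral result will be exact as well. In general, you can store integers in floating-point numbers and, as long as you know the values stay within the required range, count on full accuracy with no unpredictability.

This is how Cocoa can use CGFloat for graphics coordinates. At first glance it might seem like a bad idea: pixels are discrete units, while floating point is continuous and inexact. However, any operation that works on whole pixels produces exact results with no inaccuracy. Using floating point gives Cocoa the extra flexibility to produce good approximate results when not working on whole pixels.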
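As a quick illustration of exact integer arithmetic in floating point, the following sketch sums whole numbers entirely in doubles. The function name is hypothetical, and the exactness claim holds only while every running total stays below 2⁵³.

```c
/* Sums the integers 1..n while storing everything in doubles. As long as
   n * (n + 1) / 2 stays below 2^53, every intermediate value is an
   exactly representable integer, so each addition is exact. */
double sum_to_n_in_double(int n)
{
    double total = 0.0;
    for (int i = 1; i <= n; i++)
        total += (double)i;
    return total;
}
```

For n = 1000000 the result is exactly 500000500000, the same value integer arithmetic would give.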

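Comparison

It is commonly said that a C programmer should never use == to compare floating-point numbers. There is even a gcc warning specifically to catch this: -Wfloat-equal. The reason given is that floating-point inaccuracy means two numbers which should be exactly equal may in fact differ slightly due to rounding errors or other computational inexactness.

While this is often true and a good rule of thumb to follow, as you can see it is not always the case. It is perfectly reasonable to use == on floating-point values as long as you know that the values are completely accurate. For example, if you are working purely with well-behaved integers, == presents no problem:

    double x = 1.0 + 2.0 * 3.0;
    double y = (29.0 - 1.0) / 7.0 + 3.0;
    if(x == y) {
        // guaranteed to be true
    }

Once values that are not exactly representable, or functions that only approximate their results, enter the picture, that guarantee goes away:

    double one1 = 0.1 * 10.0;
    double one2 = 1.0 / 3.0 * 3.0;
    double one3 = 4.0 * atan(1.0) / M_PI;
    if(one1 == 1.0 || one2 == 1.0 || one3 == 1.0) {
        // no guarantee any of these will be true
    }

In those cases, compare against a small tolerance instead:

    #include <math.h>

    BOOL FloatAlmostEqual(double x, double y, double delta)
    {
        return fabs(x - y) <= delta;
    }

    double one = 0.1 * 10.0;
    if(FloatAlmostEqual(one, 1.0, 0.0000001)) {
        // this will be true
    }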
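A fixed delta such as 0.0000001 suits values near 1, but it can be too strict for very large values and too loose for very small ones. One common variation, offered here as an assumption rather than something from the article, scales the tolerance by the larger of the two magnitudes. `almost_equal_relative` and its parameter names are hypothetical.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical variant of the delta comparison: the allowed difference is
   rel_tol times the larger of the two magnitudes, so the check adapts to
   the scale of the inputs. Two zeros compare equal; a NaN input never
   compares equal. */
bool almost_equal_relative(double x, double y, double rel_tol)
{
    double ax = fabs(x);
    double ay = fabs(y);
    double larger = ax > ay ? ax : ay;
    return fabs(x - y) <= rel_tol * larger;
}
```

Which form fits best depends on the data: an absolute delta is simpler when the expected magnitude is known, while the relative form is a reasonable default when it is not.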

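Special Numbers

There are a few kinds of special floating-point numbers that are useful to understand.

The first is zero. IEEE-754 actually has two zeroes: positive and negative. While they are largely the same, and even compare as equal using ==, they do behave slightly differently. For example, 1.0 * 0.0 produces positive zero, but 1.0 * -0.0 produces negative zero. The concept of negative zero makes sense when you consider floating-point values as approximations to some theoretical exact number. Positive zero represents not only the precise quantity zero, but also a small range of extremely small positive numbers very close to zero. Likewise, negative zero represents zero and a small range of negative numbers very close to zero.

For the most part, negative zero has few practical consequences and can be ignored. For cases where it needs to be detected, it can be checked using signbit:

    BOOL IsNegativeZero(double x)
    {
        return x == 0.0 && signbit(x);
    }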
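One place where the two zeroes visibly differ is division. A sketch, assuming IEEE-754 semantics, under which dividing a nonzero finite number by zero yields an infinity whose sign follows the signs of the operands. `reciprocal` and `is_negative_zero` are hypothetical names; the latter restates the check above in plain C.

```c
#include <math.h>
#include <stdbool.h>

/* Returns 1/x. Under IEEE-754 semantics, 1/+0 is +infinity and 1/-0 is
   -infinity, so the sign of a zero can show up in later results. */
double reciprocal(double x)
{
    return 1.0 / x;
}

/* Same test as IsNegativeZero, using bool instead of BOOL. */
bool is_negative_zero(double x)
{
    return x == 0.0 && signbit(x);
}
```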

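Infinity can be written in code with the INFINITY macro and detected with isinf(x). For the most part, floating-point infinities behave the way you would expect. Adding or subtracting a finite number produces infinity. Multiplying or dividing by a positive number produces infinity, and by a negative number switches the sign. Dividing a finite number by infinity produces zero.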
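The rules above can be written down as a single check. This is a sketch assuming IEEE-754 semantics; `infinity_rules_hold` is a hypothetical name, and its argument is assumed to be a positive, finite, nonzero value.

```c
#include <math.h>
#include <stdbool.h>

/* Checks each infinity rule listed above for one positive, finite,
   nonzero value. Returns true only if all of them hold. */
bool infinity_rules_hold(double finite)
{
    double inf = INFINITY;
    return isinf(inf + finite)
        && isinf(inf - finite)
        && inf * finite == INFINITY
        && inf / finite == INFINITY
        && inf * -finite == -INFINITY
        && finite / inf == 0.0;
}
```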

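Finally, there is Not a Number, or NaN. This represents the mathematical concept of "undefined", or at least a result that cannot be represented as a real number. NaN is produced by operations such as taking the square root of -1, calculating 0.0/0.0, or INFINITY - INFINITY.

NaNs have several unusual behaviors. Perhaps the most surprising is that NaN is not equal to anything, not even itself: the expression x == x is false if x is a NaN. NaNs also propagate through calculations. Any floating-point arithmetic operation with a NaN operand produces NaN as its result. This means that code can do a single check for NaN at the end of a long calculation, rather than checking after each operation that could potentially produce one.

NaN can be written in code with the NAN macro and detected using isnan. It can also be detected using x != x, but this is not recommended, as some compilers get a little too clever while optimizing and make that expression always false.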
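The "check once at the end" pattern might look like the following sketch. `mean_ratio` is a hypothetical function, and the example assumes IEEE-754 semantics so that 0.0/0.0 quietly yields NaN instead of trapping.

```c
#include <math.h>
#include <stddef.h>

/* Averages the ratios num[i] / den[i]. A 0.0/0.0 anywhere in the input
   produces NaN, which then propagates through the running sum and the
   final division, so the caller needs only one isnan check on the
   result. */
double mean_ratio(const double *num, const double *den, size_t count)
{
    double sum = 0.0;
    for (size_t i = 0; i < count; i++)
        sum += num[i] / den[i];
    return sum / (double)count;
}
```

A caller would compute the mean and then test the result with isnan once, instead of validating every individual division.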

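Math Functions

The math.h header contains a ton of useful math functions. Each function comes in two variants. The plain function, for example sin, takes a double and returns a double. Functions ending in f, for example sinf, do the same thing but operate on float, which makes them a bit faster when your values are all float. There are a few categories of functions worth mentioning:

Trigonometric functions: sin, cos, tan, and others are all provided. These are, of course, useful for all kinds of geometric calculations.
Exponential functions: exp calculates powers of the mathematical constant e, and log calculates natural logarithms. Other functions are available to calculate powers of two and logarithms in other bases.
Powers: the pow function calculates arbitrary exponents. The sqrt function is specifically optimized for square roots.
Integer conversion: various functions to get a nearby integer from a floating-point number, such as ceil (rounds up), floor (rounds down), trunc (rounds toward zero), round (rounds to nearest, with halfway cases away from zero), and rint (rounds according to the current rounding mode, which by default is to nearest with ties to even).
Specialized floating point functions: many functions which provide better performance or accuracy, or additional capabilities, by taking advantage of the nature of floating point, such as fma (performs a multiply and an add), log1p (calculates log(1 + x)), and hypot (calculates the length of a hypotenuse).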

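Further Reading

Mac OS X ships with some good documentation on floating-point numbers. man float discusses their general representation and behavior. man math discusses the various functions in math.h. Most of those functions also have their own man pages, which go into more detail.

Finally, the classic What Every Computer Scientist Should Know About Floating-Point Arithmetic is a somewhat difficult but extremely useful read for anyone who really wants to understand how all of this works and what consequences it has.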

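Conclusion

Floating-point arithmetic can be strange, but if you understand the basics of how it works, it is nothing to be afraid of and can be extremely useful. Floating point can be great for physics, graphics, and even just internal bookkeeping. It is important to always be mindful of accuracy and other limits, but within those limits there is much that it is good for.

That's it for this time. Come back in two weeks for another exciting edition. If you get bored while waiting, why not send me a topic suggestion?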
