
Comment by Sirenos

4 years ago

Why is everyone complaining about people finding floats hard? Sure, scientific notation is easy to grasp, but you can't honestly tell me that it's trivial AFTER you consider rounding modes, subnormals, etc. Maybe if hardware had infinite precision like the real numbers nobody would be complaining ;)

One thing I dislike in discussions about floats is this incessant focus on the binary representation. The representation is NOT the essence of the number. Sure it matters if you are a hardware guy, or need to work with serialized floats, or some NaN-boxing trickery, but you can get a perfectly good understanding of binary floats by playing around with the key parameters of a floating point number system:

- precision = how many significand bits (digits) you have available

- exponent range = lower/upper bounds for exponent

- radix = 2 for binary floats

Consider listing out all possible floats given precision=3, exponents from -1 to 1, radix=2 (a short sketch of this exercise appears below). See what happens when you have a real number that needs more than 3 bits of precision. What is the maximum rounding error under different rounding strategies? Then move on to subnormals and see how they open a can of worms in underflow scenarios that you don't see in integer arithmetic. For anyone interested in a short book covering all this, I would recommend "Numerical Computing with IEEE Floating Point Arithmetic" by Overton [1].

[1]: https://cs.nyu.edu/~overton/book/index.html
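
To make the exercise concrete, here is a minimal sketch in Python (the variable names are mine) that enumerates every positive value in the precision=3, exponent range -1..1, radix=2 system:

    # Toy float system: precision=3, exponents -1..1, radix=2 (positive values only).
    precision, emin, emax, radix = 3, -1, 1, 2

    values = set()
    for e in range(emin, emax + 1):
        # Normalized significands have a leading 1, so they run from 100 to 111 in binary.
        for m in range(radix ** (precision - 1), radix ** precision):
            values.add(m * radix ** (e - (precision - 1)))

    print(sorted(values))
    # [0.5, 0.625, 0.75, 0.875, 1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 3.5]
    # The gap between neighbours doubles at each exponent step, and nothing sits
    # between 0 and 0.5 until you add subnormals.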

I think the binary representation is the essence of floating point numbers, and if you go beyond the "sometimes, the result is slightly wrong" stage, you have to understand it.

And so far, the explanation in the article is the best I've found, not least because subnormal numbers appear naturally.

There is a mathematical foundation behind it of course, but it is not easy for a programmer like me. I think it is better to think in terms of bits and the integers they make, because that's what the computer sees. And going this way, you get NaN-boxing and serialization as a bonus.

Now, I tend to be most comfortable with a "machine first", bottom-up, low-level approach to problems. Mathematical and architectural concepts are fine and all, but unless I have some idea of what things look like in memory and the kind of instructions being run, I tend to feel lost. Some people may be more comfortable with high-level reasoning; we don't all have the same approach, and that's what I call real diversity. It is a good thing.

  • Sorry, I didn't mean to downplay the value of using concrete examples. I absolutely agree that everyone learns better from concrete settings, which is why my original comment fixed the parameters for people to play with. I was referring more to the discussions of how exponents are stored with a bias, how the leading mantissa bit is an implied 1 (except for subnormals), and so on. All of these are distracting details that can (and should) be covered once the reader has a strong intuition for the more fundamental aspects.
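
    For reference, a small sketch in Python (using only `struct`; the helper name is mine) of the stored form being described, showing the 1023 bias and the implied leading bit of a 64-bit double:

        import struct

        def fields(x):
            """Decode the stored bit fields of a 64-bit IEEE-754 double."""
            bits = struct.unpack('<Q', struct.pack('<d', x))[0]
            sign     = bits >> 63
            biased_e = (bits >> 52) & 0x7ff        # stored exponent, biased by 1023
            mantissa = bits & ((1 << 52) - 1)      # stored fraction; the leading 1 is implied
            return sign, biased_e, hex(mantissa)

        print(fields(1.5))   # (0, 1023, '0x8000000000000'): 1023 - 1023 = 0, so +1.1 (binary) * 2**0
        print(fields(0.5))   # (0, 1022, '0x0'):             1022 - 1023 = -1, so +1.0 (binary) * 2**-1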

The binary form is important to understand the implementation details. You even mention underflow. It's difficult for most people to initially understand why a float can't accurately store a large number that an integer of the same size can represent exactly.

The binary form handily demonstrates the limitations. Understanding the floating point instructions is kinda optional but still valuable.

Otherwise everyone should just use varint-encoded arbitrary precision numbers.

  • It also explains why 0.1+0.2 is not 0.3. With binary IEEE-754 floats, none of those can be represented exactly[a]. With decimal IEEE-754 floats, it's possible, but the majority of hardware people interact with works on binary floats.

    [a]: Sure, if you `console.log(0.1)`, you'll get 0.1, but it's not possible to express it in binary exactly; only after rounding. 0.5, however, is exactly representable.

    •     Python 3.9.5
          >>> 0.1.hex()
          '0x1.999999999999ap-4'
          >>> 0.2.hex()
          '0x1.999999999999ap-3'
          >>> (0.1 + 0.2).hex()
          '0x1.3333333333334p-2'
          >>> 0.3.hex()
          '0x1.3333333333333p-2'


  • > It's difficult for most people to initially understand why a float can't accurately store a large number that an integer of the same size can represent exactly.

    Because you don't have all the digits available just for the mantissa? That seems quite intuitive to me, even if you don't know about the corner cases of FP. This isn't one of them.

    • I was going to respond with something like this. I think for getting a general flavor*, just talking about scientific notation with rounding to a particular number of digits at every step is fine.

      I guess the one thing that explicitly looking at the bits does bring to the table is the understanding that (of course) the number of bits or digits in the mantissa must be less than the number of bits or digits in an equivalent-length integer. This is pretty obvious if you think about the fact that they are the same size in memory, but if we're talking about sizes in memory, then we're already talking about bits implicitly, so we may as well make it explicit.

      * actually, much more than just getting a general flavor: most of the time in numerical linear algebra work, the actual bitwise representation is irrelevant, so you can get pretty far without thinking about bits.
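
      As a quick concrete check of the same-size-integer point above (assuming 64-bit IEEE doubles with their 53-bit significand):

          # The first integer a 64-bit double cannot hold exactly is 2**53 + 1,
          # even though a 64-bit integer represents it with room to spare.
          n = 2**53 + 1
          print(float(n) == n)      # False: the conversion had to round
          print(float(n) == n - 1)  # True: it rounded down to 2**53
          print(float(n))           # 9007199254740992.0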

>"Sure it matters if you are a hardware guy, or need to work with serialized floats, or some NaN-boxing trickery"

Might you or someone else elaborate on what "NaN-boxing trickery" is and why such "trickery" is needed?

  • In a 64-bit double, a NaN only fixes the 11 exponent bits (all ones) plus one mantissa bit to distinguish it from infinity, which leaves roughly 51 mantissa bits (52 if you also use the sign bit) free, so you can put data in there without impeding your ability to represent every floating point value.

    For example, if you are designing a scripting language interpreter, you can make every value a floating point number and use some of the NaNs to represent pointers, booleans, etc.

    See also http://wingolog.org/archives/2011/05/18/value-representation...
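
    A minimal sketch of the idea in Python (real interpreters do this in C at the representation level; `box`/`unbox` are hypothetical names, and whether a payload survives a round trip through a Python float object is an implementation detail):

        import struct

        QNAN = 0x7ff8000000000000  # quiet-NaN pattern: exponent all ones + quiet bit

        def box(payload):
            """Stash a small integer (e.g. a 48-bit pointer) in a quiet NaN's spare bits."""
            assert 0 <= payload < (1 << 48)
            return struct.unpack('<d', struct.pack('<Q', QNAN | payload))[0]

        def unbox(value):
            """Recover the payload from a boxed NaN."""
            bits = struct.unpack('<Q', struct.pack('<d', value))[0]
            return bits & ((1 << 48) - 1)

        # CPython on common platforms preserves the payload, but that is not guaranteed.
        v = box(0xdeadbeef)
        print(v)                 # nan -- it still looks like an ordinary float
        print(hex(unbox(v)))     # 0xdeadbeef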