IEEE-754 Floating Point
Every position, velocity, and color in your engine is a 32-bit float, and floats lie in small, specific, reproducible ways. Knowing exactly how they are built turns a class of "impossible" bugs (jitter far from the origin, a replay that desyncs on a teammate's PC, a reverb that tanks the frame rate) into things you can see coming. We work from the bits up, in C++ and Rust.
01Why floats
A 32-bit float buys you a huge dynamic range, from sub-millimeter to kilometers, in a fixed number of bits, with arithmetic the CPU and GPU run in hardware. That range, not raw speed, is the reason games use them: integer add and multiply are usually just as fast, but integers can't cheaply hold fractions or span that range, and fixed-point trades the range away for uniform precision.
The catch is that a float is an approximation with exactly defined rounding, and a handful of conventions decide how it rounds. Get those conventions wrong and you get the classic float bugs. Get them right and the bugs become predictable.
People use one word, "epsilon" (machine epsilon), for two different numbers, and the confusion causes wrong tolerance checks. FLT_EPSILON (which is what std::numeric_limits<float>::epsilon() and Rust's f32::EPSILON return) is 2^−23 ≈ 1.19e-7, the gap between 1.0 and the next representable float[3][4]. The rounding-error bound (Goldberg's version) is half that, 2^−24 ≈ 5.96e-8, because round-to-nearest is never off by more than half a step[1]. They differ by 2×. When you write a tolerance, know which one you mean.
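To make the two numbers concrete, here is a minimal Rust sketch. The function names are illustrative, not from any library; the only real API used is f32::EPSILON, and the behavior assumes the default round-to-nearest, ties-to-even mode.

```rust
/// Gap between 1.0 and the next representable f32 (2^-23). This is FLT_EPSILON.
fn gap_at_one() -> f32 {
    f32::EPSILON
}

/// Relative rounding-error bound under round-to-nearest (2^-24): half the gap.
fn rounding_bound() -> f32 {
    f32::EPSILON / 2.0
}

/// True when adding `delta` to 1.0 produces a float different from 1.0.
fn changes_one(delta: f32) -> bool {
    1.0f32 + delta != 1.0f32
}
```

Adding gap_at_one() to 1.0 moves to the next float, while adding rounding_bound() lands exactly halfway and rounds back to 1.0, so changes_one returns true for the first and false for the second.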
02The bit layout
A binary32 float is three fields in 32 bits: 1 sign bit, 8 exponent bits, and 23 fraction bits[1][2].
Three fields rebuild the number: value = (−1)^sign × 1.fraction × 2^(exponent − 127). The sign bit flips it positive or negative. The fraction bits sit after an implicit leading 1. to form a significand in the range [1, 2). The stored exponent shifts that significand by a power of two, after subtracting the bias of 127 so one 8-bit field can encode both tiny and huge magnitudes.
Need a refresher on binary fractions and 2^−k?
Digits after a binary point work like decimal, but in halves. The first place is 1/2 (2^−1), the next 1/4 (2^−2), then 1/8, and so on. So binary 1.101 is 1 + 1/2 + 0/4 + 1/8 = 1.625. That is exactly what 1.fraction means above: a 1, then 23 bits each worth half of the bit before it.
A negative exponent is just division: 2^−3 is 1/8. The exponent field slides the binary point left (smaller) or right (larger), which is why one format spans sub-millimeter to kilometer. A value is representable exactly only when it is a sum of these powers of two within the 24-bit budget, which is the reason 0.1 never lands exactly (covered in the rounding section).
The exponent is stored with a bias of 127, so a stored 1..254 means an actual −126..+127. The leading 1. is implicit for normal numbers and not stored, which is why 23 stored fraction bits give 24 bits of significand precision. (The standard's word is significand; "mantissa" is the older name for the same thing.) Stored exponent 0 is reserved for zero and subnormals; stored exponent 255 for infinity and NaN. binary64 (a double) uses the same scheme with 11 exponent bits (bias 1023) and 52 fraction bits.
Zero out the exponent field of a normal float and the number goes subnormal: the implicit leading bit flips off. In code, the three fields come out with a shift and a mask:
// C++20: std::bit_cast is the UB-free way to reinterpret. (Casting through
// a uint32_t* is type-punning and technically undefined behavior.)
#include <bit>
#include <cstdint>
// value is any float you want to inspect.
uint32_t bits = std::bit_cast<uint32_t>(value);
float back = std::bit_cast<float>(bits); // round-trips to the same value
uint32_t sign = bits >> 31; // 1 bit
uint32_t exponent = (bits >> 23) & 0xFF; // 8 bits
uint32_t fraction = bits & 0x7FFFFF; // 23 bits
// Rust: to_bits / from_bits are the safe, stable way (const fn since 1.83).
let bits: u32 = value.to_bits();
let back: f32 = f32::from_bits(bits);
let sign = bits >> 31;
let exponent = (bits >> 23) & 0xFF; // 8 bits
let fraction = bits & 0x7F_FFFF; // 23 bits
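Going the other way, from fields back to a value, is a useful check on the layout. The sketch below is an illustration under stated assumptions: Fields, split, pow2, and rebuild are hypothetical helper names, and the reconstruction is done in f64 so that every step is exact for binary32 inputs.

```rust
/// The three binary32 fields, exactly as stored.
struct Fields {
    sign: u32,     // 1 bit
    exponent: u32, // 8 bits, biased by 127
    fraction: u32, // 23 bits
}

fn split(value: f32) -> Fields {
    let bits = value.to_bits();
    Fields {
        sign: bits >> 31,
        exponent: (bits >> 23) & 0xFF,
        fraction: bits & 0x7F_FFFF,
    }
}

/// Exact 2^k as an f64, built from its own exponent field.
/// Valid for -1022 <= k <= 1023, which covers every binary32 exponent.
fn pow2(k: i32) -> f64 {
    f64::from_bits(((1023 + k) as u64) << 52)
}

/// Rebuild the value from its fields. Returns None for infinity and NaN.
fn rebuild(f: &Fields) -> Option<f64> {
    let sign = if f.sign == 1 { -1.0 } else { 1.0 };
    let frac = f.fraction as f64 / 8_388_608.0; // fraction / 2^23, i.e. 0.fraction
    match f.exponent {
        255 => None,                                   // infinity or NaN
        0 => Some(sign * frac * pow2(-126)),           // zero or subnormal: leading bit is 0
        e => Some(sign * (1.0 + frac) * pow2(e as i32 - 127)), // normal: leading bit is 1
    }
}
```

For example, 6.5 is binary 110.1, or 1.101 × 2^2, so split(6.5) gives sign 0, stored exponent 129, and fraction 0x500000, and rebuild returns Some(6.5).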
03Normals & subnormals
Normal numbers have that implicit leading 1 and a stored exponent of 1..254. When a result gets too small to normalize (below 2^−126 for binary32), the format doesn't jump straight to zero. It switches to subnormals (also called denormals): stored exponent 0, implicit leading bit 0, filling the gap to zero with constant absolute spacing. This is gradual underflow[1].
FLT_MIN is not the smallest float. std::numeric_limits<float>::min() (and Rust's f32::MIN_POSITIVE) is the smallest normal, about 1.18e-38. The smallest positive value is the smallest subnormal, denorm_min, about 1.4e-45, roughly seven orders of magnitude smaller[3]. And note f32::MIN in Rust is the most-negative value, not the smallest magnitude. Three different numbers, routinely confused.
Subnormals keep precision degrading gracefully toward zero instead of falling off a cliff. They have a performance story too, on some hardware a sharp one, which gets its own section at the end.
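The three easily confused limits can be pinned down in a few lines of Rust. This is a sketch with illustrative function names; the constants and is_subnormal are real f32 API, and the results assume the default floating-point environment (no flush-to-zero).

```rust
/// Smallest positive normal f32 (FLT_MIN in C): 2^-126.
fn smallest_normal() -> f32 {
    f32::MIN_POSITIVE
}

/// Smallest positive subnormal f32 (denorm_min in C++): 2^-149, bit pattern 1.
fn smallest_subnormal() -> f32 {
    f32::from_bits(1)
}

/// Most negative finite f32. It has nothing to do with magnitudes near zero.
fn most_negative() -> f32 {
    f32::MIN
}

/// Spacing between neighbouring subnormals; constant across the whole range.
fn subnormal_spacing() -> f32 {
    f32::from_bits(2) - f32::from_bits(1)
}
```

Halving smallest_normal() gives a subnormal rather than zero, which is gradual underflow in action; halving smallest_subnormal() is an exact tie between 0 and 2^−149 and rounds to zero.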
04Signed zero, infinity, NaN
The reserved exponents encode the special values, and each has a sharp edge.
- Signed zero. +0.0 and −0.0 compare equal under == but have different bits and behave differently: 1.0 / +0.0 is +∞, 1.0 / −0.0 is −∞. Functions like copysign and atan2 tell them apart.
- Infinity. From overflow or division of a nonzero value by zero. Arithmetic on it mostly stays infinite; ∞ − ∞ is NaN.
- NaN. From 0.0/0.0, ∞ − ∞, sqrt(−1). NaN is not equal to anything, including itself, which is exactly why x != x is the portable NaN test (in practice, use std::isnan / f32::is_nan).
First, x == NAN is always false; never test for NaN by comparing to it. Second, NaN is not a single bit pattern: any all-ones exponent with a nonzero fraction is a NaN, so there are millions of them. The top fraction bit distinguishes quiet from signaling NaN (on x86 and ARM, 1 means quiet). The sign of a NaN is not portable, so don't rely on it[4].
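The special values are easy to poke at directly. A minimal Rust sketch follows; the three helper names are illustrative, and bits_are_nan simply encodes the rule stated above (all-ones exponent, nonzero fraction).

```rust
/// `x != x` is true only for NaN. `f32::is_nan` is the readable spelling.
fn is_nan_by_self_compare(x: f32) -> bool {
    x != x
}

/// Classify a raw bit pattern as NaN: exponent field all ones, fraction nonzero.
fn bits_are_nan(bits: u32) -> bool {
    (bits >> 23) & 0xFF == 0xFF && bits & 0x7F_FFFF != 0
}

/// Reciprocal, used to show that the sign of zero picks the sign of infinity.
fn reciprocal(x: f32) -> f32 {
    1.0 / x
}
```

reciprocal(0.0) is +∞ and reciprocal(-0.0) is −∞ even though the two zeros compare equal, and bits_are_nan accepts many distinct patterns (0x7FC00000, 0xFF800001, and so on) while rejecting 0x7F800000, which is infinity.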
05Rounding, and why 0.1 isn't exact
When a result can't be represented exactly, it rounds. IEEE-754's default is round-to-nearest, ties-to-even: on an exact halfway case, pick the result whose last significand bit is 0[2].
Ties go to even, not always up. This removes the upward statistical bias that round-half-up would accumulate over millions of operations. The other four modes (toward zero, toward ±∞, ties-away) exist but are rarely the default, and many library functions assume round-to-nearest-even, so changing the mode at runtime is fragile[2].
This is why 0.1 isn't exact. A fraction is representable in binary only if it's n / 2^k. One tenth is 1/(2×5), and that factor of 5 makes it a non-terminating, repeating binary fraction, the same way 1/3 repeats in decimal. The nearest binary32 to 0.1 is slightly larger than one tenth[6]. This is also why 0.1 + 0.2 != 0.3 in a double: three separate roundings of the constants, then a fourth on the addition. (In binary32 the same four roundings happen to land exactly on 0.3f, so that comparison comes out true by luck, not because anything was exact.) Integers, by contrast, are exact up to 2^24 in binary32 (2^53 in a double), so not everything is approximate, just any value that isn't a sum of powers of two within the precision budget.
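A short Rust sketch of the rounding behavior described in this section, with illustrative function names. Note that the width you test matters: in f64 the sum 0.1 + 0.2 misses 0.3, while in f32 the rounding errors happen to land on 0.3f32. The last helper shows ties-to-even at 2^24, where neighbouring floats are 2 apart and every odd integer is an exact tie.

```rust
/// The f32 nearest to one tenth, widened exactly to f64 for inspection.
fn tenth_f32_as_f64() -> f64 {
    0.1f32 as f64
}

/// Does 0.1 + 0.2 compare equal to 0.3 in double precision?
fn sum_matches_f64() -> bool {
    0.1f64 + 0.2f64 == 0.3f64
}

/// The same comparison in single precision.
fn sum_matches_f32() -> bool {
    0.1f32 + 0.2f32 == 0.3f32
}

/// Add a small integer to 2^24, where consecutive f32 values are 2 apart.
fn add_at_2_pow_24(n: f32) -> f32 {
    16_777_216.0f32 + n
}
```

add_at_2_pow_24(1.0) is a tie between 16,777,216 and 16,777,218 and rounds down to the even significand; add_at_2_pow_24(3.0) is a tie between 16,777,218 and 16,777,220 and rounds up. Ties go to even, not always in one direction.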
06Machine epsilon and ULP
Floats are not evenly spaced. A ULP (unit in the last place) is the gap between one float and the next, and it doubles at every power of two. Near 1.0 the gap is 2^−23; just below 16,777,216 (2^24) the gap is 1.0, and above it, 2.0[1].
Walk one ULP at a time and the gap grows with magnitude; past 2^24 it reaches 2.0 and stepping skips the odd integers.
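Measuring the gap takes one bit-level step. The sketch below uses hypothetical helper names and, as a simplifying assumption, handles only positive finite inputs (for those, adding 1 to the bit pattern gives the next float up).

```rust
/// Next representable f32 above `x`, for positive finite `x` only.
fn next_up_positive(x: f32) -> f32 {
    assert!(x.is_finite() && x.is_sign_positive(), "sketch handles positive finite values only");
    f32::from_bits(x.to_bits() + 1)
}

/// Size of one ULP at `x`: the distance to the next float above it.
fn ulp(x: f32) -> f32 {
    next_up_positive(x) - x
}
```

ulp(1.0) equals f32::EPSILON, ulp(16_777_216.0) is 2.0, and ulp(100_000.0) is 0.0078125: a position stored in meters, 100 km from the origin, can only move in steps of about 7.8 mm.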
Two common wrong answers, and why: epsilon() is the gap at 1.0, not the smallest float and not the rounding bound; and far-from-origin jitter is a magnitude problem (the ULP grows with distance from the origin), which neither switching to doubles nor changing the rounding mode actually solves.
07Catastrophic cancellation
Catastrophic cancellation happens when you subtract two nearly-equal numbers that already carry rounding error: the equal leading digits cancel, promoting the low-order error to the front. The subtraction itself is exact; it exposes error that was already there[1].
Goldberg's example: computing b² − 4ac with b=3.34, a=1.22, c=2.28 in 3-digit arithmetic. The true answer is 0.0292, but b² rounds to 11.2 and 4ac to 11.1, so the result is 0.1, an error of about 70 ULP[1]. The multiplications introduced the error; the subtraction merely revealed it. The fix is a numerically stable reformulation (for the quadratic formula, compute the larger-magnitude root first, then use the product-of-roots identity), not "throw more precision at it."
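The stable reformulation for the quadratic can be sketched in Rust as follows. The function names are illustrative, the code assumes a is nonzero, and it uses f32 on purpose so the cancellation is easy to trigger.

```rust
/// Textbook formula: both roots from (-b ± sqrt(disc)) / 2a.
fn roots_naive(a: f32, b: f32, c: f32) -> Option<(f32, f32)> {
    let disc = b * b - 4.0 * a * c;
    if !(disc >= 0.0) {
        return None; // no real roots (or NaN input)
    }
    let s = disc.sqrt();
    Some(((-b + s) / (2.0 * a), (-b - s) / (2.0 * a)))
}

/// Stable form: compute the larger-magnitude root without subtracting
/// nearly equal values, then get the other from the product of roots, c / a.
fn roots_stable(a: f32, b: f32, c: f32) -> Option<(f32, f32)> {
    let disc = b * b - 4.0 * a * c;
    if !(disc >= 0.0) {
        return None;
    }
    // copysign gives the square root the sign of b, so the two terms add.
    let q = -0.5 * (b + disc.sqrt().copysign(b));
    if q == 0.0 {
        return Some((0.0, 0.0)); // b == 0 and disc == 0: double root at zero
    }
    Some((q / a, c / q))
}
```

With a = 1, b = −10000, c = 1 the true roots are about 9999.9999 and 0.0001. In f32 the discriminant rounds to exactly 1e8, so roots_naive returns (10000.0, 0.0): the small root is lost entirely. roots_stable returns (10000.0, 0.0001), correct to f32 precision.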
08Summing a million numbers
Add up a long list of floats naively and the running total grows large while each new term is small, so each addition rounds away part of the term. The error accumulates. Kahan summation keeps a compensation term that recovers what was lost[8].
float sum = 0.0f, c = 0.0f; // c carries the lost low-order bits
for (float x : values) {
float y = x - c; // add back last round's lost part
float t = sum + y; // big + small: low bits of y fall off here
c = (t - sum) - y; // recover exactly what fell off
sum = t;
}
let (mut sum, mut c) = (0.0f32, 0.0f32); // c carries the lost low-order bits
for &x in &values {
let y = x - c; // add back last round's lost part
let t = sum + y; // big + small: low bits of y fall off here
c = (t - sum) - y; // recover exactly what fell off
sum = t;
}
Naive summation error grows with the count of terms; Kahan's stays roughly constant, effectively independent of n until n approaches 1/epsilon[8]. Add a large value and then a long stream of tiny ones, and the naive total stalls while the Kahan total keeps climbing.
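Here is the same algorithm wrapped in two functions so the stall can be reproduced. This is a sketch: naive_sum and kahan_sum are illustrative names, and the input (2^24 followed by a thousand 1.0s) is chosen so that every naive addition is an exact tie that rounds back down.

```rust
/// Plain left-to-right accumulation.
fn naive_sum(values: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    for &x in values {
        sum += x;
    }
    sum
}

/// Kahan (compensated) summation: `c` tracks the rounding error of each add.
fn kahan_sum(values: &[f32]) -> f32 {
    let (mut sum, mut c) = (0.0f32, 0.0f32);
    for &x in values {
        let y = x - c;     // apply the correction carried from the last step
        let t = sum + y;   // low bits of y may be rounded away here
        c = (t - sum) - y; // what was actually added, minus what we meant to add
        sum = t;
    }
    sum
}
```

On the input described above, naive_sum stays at 16,777,216 because 2^24 + 1 is not representable and ties round to even, while kahan_sum returns 16,778,216, the exact total.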
09FMA, order, and -ffast-math
Three related facts about how arithmetic is evaluated, each a source of "different answer on a different build."
FMA (fused multiply-add) computes a*b + c with a single rounding of the full-precision product-plus-sum, instead of rounding the product and then the sum[9]. It's more accurate for that expression and, on hardware with an FMA unit, often as fast as a bare multiply. C++ spells it std::fma; Rust spells it f32::mul_add.
float r = std::fma(a, b, c); // a*b + c, rounded once. Fast only if FP_FAST_FMAF is defined (FP_FAST_FMA is the double macro).
let r = a.mul_add(b, c); // (a*b) + c, rounded once.
The catch: a compiler may contract a*b + c into a single FMA on its own, which changes the result versus the two-rounding form. "More accurate" is not "identical," and whether contraction happens depends on the compiler and flags, so the same source gives different numbers on different toolchains[9][10].
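The difference between one rounding and two is easy to see with inputs picked to make the product inexact. A minimal Rust sketch with illustrative function names; it relies on Rust not contracting a * b + c on its own, and on mul_add being a true fused operation.

```rust
/// Product rounded to f32, then the sum rounded again.
fn two_roundings(a: f32, b: f32, c: f32) -> f32 {
    a * b + c
}

/// Fused multiply-add: the exact product plus c, rounded once.
fn one_rounding(a: f32, b: f32, c: f32) -> f32 {
    a.mul_add(b, c)
}
```

Take a = b = 1 + ε and c = −(1 + 2ε), with ε = f32::EPSILON. The exact product is 1 + 2ε + ε², and the ε² term is far below half a ULP, so two_roundings drops it and returns 0.0. one_rounding keeps it and returns ε² (2^−46). Neither is a bug; they are two different, well-defined answers, which is exactly why silent contraction changes results.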
Float addition is not associative. (a+b)+c can differ from a+(b+c) because each step rounds[1]. So summation order, SIMD lane-reduction order, and multi-threaded partial-sum order all change the result. It isn't randomness; a fixed order is perfectly deterministic, the answer just changes when the order does.
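Non-associativity takes three numbers to demonstrate. The sketch below uses illustrative function names; the grouping is the only difference between the two.

```rust
/// (a + b) + c
fn left_to_right(a: f32, b: f32, c: f32) -> f32 {
    (a + b) + c
}

/// a + (b + c)
fn right_to_left(a: f32, b: f32, c: f32) -> f32 {
    a + (b + c)
}
```

With a = 1e20, b = −1e20, c = 1.0, left_to_right cancels the big terms first and returns 1.0, while right_to_left absorbs the 1.0 into −1e20 (it is far smaller than one ULP there) and returns 0.0. Both results are fully deterministic for their order; a SIMD or multi-threaded reduction simply picks a different order.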
-ffast-math trades IEEE behavior for speed, in ways that bite later. It bundles several relaxations[10]. -fassociative-math lets the compiler reassociate, which is what deletes the Kahan compensation from §8. -ffinite-math-only lets it assume no NaN or infinity, so it can optimize isnan(x) to false and silently break NaN sentinels. -fno-signed-zeros lets it rewrite x + 0.0 as x (wrong for −0.0). And on GCC and Clang (Linux), linking with -ffast-math pulls in a startup object (crtfastmath.o) whose constructor switches on flush-to-zero (FTZ/DAZ) for the whole process, changing the behavior of unrelated code, including libraries you didn't compile. Audit it per translation unit; never blanket-enable it in a codebase that uses NaN sentinels, compensated sums, or lockstep determinism.
10Comparing floats
a == b on floats is an exact comparison with no tolerance at all, and after any rounding or reordering that's usually not what you want. There are three tolerance strategies, and each fails somewhere[7].
- Absolute (|a−b| ≤ ε): works near zero, fails at large magnitudes, where a fixed ε is smaller than one ULP and the test degrades into ==.
- Relative (|a−b| ≤ ε·max(|a|,|b|)): works across magnitudes, breaks down near zero, and is meaningless across zero (opposite signs).
- ULP-based: reinterpret the bits as integers and subtract; the magnitude is the count of representable floats between them. Intuitive, but undefined across signs and gigantic when comparing a tiny number to 0.0.
Dawson's practical answer: combine an absolute "safety net" for the near-zero case with a relative or ULP comparison otherwise, and pick the absolute tolerance from the scale of your problem. His summary is worth memorizing: "zero is a huge nuisance, avoid it if possible"[7].
A fixed absolute tolerance works near zero but falls apart at large magnitudes, where ε drops below one ULP and the test silently becomes ==; relative and ULP comparisons stay scale-invariant.
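One way to combine the strategies, in the spirit of Dawson's advice, is sketched below in Rust. The function names, parameter names, and the choice to treat any non-finite mismatch as unequal are assumptions of this sketch, not a quoted implementation; the tolerances have to come from the scale of your own problem.

```rust
/// Combined check: absolute safety net near zero, relative tolerance otherwise.
fn nearly_equal(a: f32, b: f32, abs_tol: f32, rel_tol: f32) -> bool {
    if a == b {
        return true; // exact matches, including equal infinities and ±0.0
    }
    if !a.is_finite() || !b.is_finite() {
        return false; // NaN never matches; infinity only matches itself
    }
    let diff = (a - b).abs();
    if diff <= abs_tol {
        return true; // near-zero safety net
    }
    diff <= rel_tol * a.abs().max(b.abs())
}

/// Number of representable floats between a and b.
/// None when either is NaN or the signs differ, where the count is not meaningful.
fn ulp_distance(a: f32, b: f32) -> Option<u32> {
    if a.is_nan() || b.is_nan() || a.is_sign_negative() != b.is_sign_negative() {
        return None;
    }
    Some(a.to_bits().abs_diff(b.to_bits()))
}
```

nearly_equal(1.0e8, 100_000_008.0, 1e-6, 1e-6) is true through the relative branch even though the values differ by 8.0, and nearly_equal(1.0e-10, 2.0e-10, 1e-6, 1e-6) is true through the absolute net. ulp_distance(0.0, f32::MIN_POSITIVE) is 8,388,608, which is the "gigantic near zero" failure mode from the list above.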
11Determinism across machines
Identical source can produce different float results on different machines. This is the thread the Game Loop tutorial kept pointing at, and it's why a lockstep RTS or a fighting-game replay can desync on a teammate's PC.
The causes are specific, and listing them is the point, because "floats are random" is wrong[11]:
- Different compilers and optimization levels reassociate and contract differently.
- FMA contraction on one platform but not another.
- x87 80-bit intermediates vs SSE 32/64-bit on x86.
- Transcendental functions. IEEE-754 requires + − × ÷ sqrt fma and remainder to be correctly rounded, but it does not require it of sin, cos, exp, so vendor math libraries legitimately differ[11].
- FTZ/DAZ mode differences (next section).
More bits doesn't make two different rounding sequences agree. Cross-platform determinism comes from controlling the operations: fixed-point arithmetic, or strict same-compiler, same-flags builds with contraction and x87 disabled and transcendentals replaced by your own implementations. Glenn Fiedler's writeup documents a shipped lockstep game (Battlezone II) that had to force single precision and wrap the transcendentals because AMD and Intel disagreed on sin/cos[11].
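To show what "controlling the operations" looks like on the fixed-point route, here is a minimal Q16.16 sketch in Rust. The Fixed type and its methods are hypothetical, written for illustration only; a real simulation would also need division, overflow policy, and conversions. The point is that every operation is integer arithmetic, which produces the same bits on every machine.

```rust
/// Q16.16 fixed-point: 16 integer bits and 16 fractional bits in an i32.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Fixed(i32);

impl Fixed {
    const FRAC_BITS: u32 = 16;

    /// Convert a whole number; i16 keeps the result inside the Q16.16 range.
    fn from_int(n: i16) -> Fixed {
        Fixed((n as i32) << Self::FRAC_BITS)
    }

    /// Addition is plain integer addition (wrapping on overflow in this sketch).
    fn add(self, rhs: Fixed) -> Fixed {
        Fixed(self.0.wrapping_add(rhs.0))
    }

    /// Multiply in 64 bits, then shift the extra 16 fractional bits back out.
    /// The arithmetic shift rounds toward negative infinity.
    fn mul(self, rhs: Fixed) -> Fixed {
        let wide = self.0 as i64 * rhs.0 as i64;
        Fixed((wide >> Self::FRAC_BITS) as i32)
    }
}
```

For example, Fixed(0x0001_8000) is 1.5, and Fixed::from_int(3).mul(Fixed(0x0001_8000)) is Fixed(0x0004_8000), which is 4.5. The trade, as noted in the first section, is that the dynamic range is gone: precision is a uniform 1/65536 everywhere.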
12The subnormal cliff
On some hardware, arithmetic that produces or consumes subnormals drops into a microcode-assisted slow path that is dramatically slower than the same operation on normal numbers. Bruce Dawson measured an SSE multiply with a subnormal hitting roughly a 175× slowdown on a Core 2, and about a 140-cycle penalty on a Sandy Bridge mulps[12]. The size is hardware-dependent (often negligible on recent AMD, and on addition), so quote it with its CPU rather than as a universal number.
The fix is FTZ (flush-to-zero, for outputs) and DAZ (denormals-are-zero, for inputs), two bits in the x86 MXCSR control register that make the CPU treat subnormals as zero, trading correctness near zero for speed. It is a per-thread CPU mode, so set it on every audio or physics worker thread, not just the main one[13].
#include <xmmintrin.h> // FTZ (flush-to-zero, outputs)
#include <pmmintrin.h> // DAZ (denormals-are-zero, inputs; SSE3)
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON); // MXCSR bit 15
_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); // MXCSR bit 6; call per thread
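// Rust: no std API; use the x86 intrinsics (unsafe) on each worker thread.
// These intrinsics are deprecated since Rust 1.75 because the compiler assumes
// the default MXCSR state, so treat this as a sketch and check the _mm_setcsr
// documentation before shipping it.
#[allow(deprecated)]
use std::arch::x86_64::{_mm_getcsr, _mm_setcsr};
unsafe { _mm_setcsr(_mm_getcsr() | 0x8040); } // set FTZ (bit 15) and DAZ (bit 6)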
Recursive filters and reverb tails decay exponentially toward zero, so when a signal goes quiet (a DAW transport stops and streams zeros into a reverb) they generate a flood of subnormals, and on an affected CPU the audio thread blows its budget and drops out. Enabling FTZ/DAZ on the audio thread is the standard mitigation. And recall the link-time footgun from §9: -ffast-math turns FTZ on process-wide, which is how most people enable this without meaning to[10].
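How a decaying signal ends up in the subnormal range can be counted directly. The Rust sketch below uses a hypothetical helper, decay_steps, and assumes the default floating-point environment (FTZ and DAZ off). The step cap is there because a tail does not necessarily ever reach zero.

```rust
/// Repeatedly multiply `start` by `decay`, up to `max_steps` times.
/// Returns (step at which the value first became subnormal,
///          step at which it first became exactly zero),
/// with None for an event that did not happen within the cap.
fn decay_steps(start: f32, decay: f32, max_steps: u32) -> (Option<u32>, Option<u32>) {
    let mut x = start;
    let mut first_subnormal = None;
    for step in 1..=max_steps {
        x *= decay;
        if x == 0.0 {
            return (first_subnormal, Some(step));
        }
        if first_subnormal.is_none() && x.is_subnormal() {
            first_subnormal = Some(step);
        }
    }
    (first_subnormal, None)
}
```

decay_steps(1.0, 0.5, 1000) returns (Some(127), Some(150)): the value goes subnormal at 2^−127 and finally rounds to zero at 2^−150. With a slow decay such as 0.999 the value goes subnormal and then stops shrinking, because multiplying a tiny subnormal by 0.999 rounds back to the same value, so every later step keeps operating on a subnormal. That is the pattern that keeps a quiet feedback loop on the slow path.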
13Pitfalls
- Using FLT_EPSILON (gap at 1.0) as the rounding bound (half that), or as a usable tolerance far from 1.0.
- Testing for NaN with ==. x == NAN is always false; use x != x. And -ffinite-math-only deletes the check entirely.
- Kahan summation under fast-math. -fassociative-math / -ffast-math proves the compensation zero and removes it.
- Treating FLT_MIN as the smallest float. FLT_MIN is the smallest normal, not the smallest positive (that's denorm_min).
- Pointer type-punning. (uint32_t*)&f is UB in C++; use std::bit_cast or memcpy.
14What's next
This closes the determinism thread from the Game Loop and pairs with 3D Math as the numerics foundation. The bit tricks here are the same muscle as the Bit Shifting tutorial (the famous fast inverse square root is exactly this bit-level reinterpretation[14]). Next on the build path is Platform & Window, then the Vulkan renderer. The full path is on the series hub.
- David Goldberg. "What Every Computer Scientist Should Know About Floating-Point Arithmetic." ACM Computing Surveys, 1991. docs.oracle.com. Bit layout, ULP and machine epsilon, gradual underflow, catastrophic vs benign cancellation (the 70-ULP quadratic example), non-associativity.
- IEEE. "IEEE Standard for Floating-Point Arithmetic," IEEE 754-2019. ieeexplore.ieee.org. binary32/binary64 definitions, round-to-nearest-ties-to-even as the default, the correctly-rounded operation set.
- cppreference. std::numeric_limits<T>::epsilon / min / denorm_min. en.cppreference.com. epsilon = gap from 1.0 to the next float = FLT_EPSILON; min() = smallest normal; denorm_min() = smallest subnormal.
- Rust. Primitive type f32. doc.rust-lang.org. to_bits/from_bits, mul_add (single rounding), next_up/next_down (stable 1.86), EPSILON = 1.19209290e-07, MIN_POSITIVE, NaN sign non-portability.
- cppreference. std::bit_cast. en.cppreference.com. C++20, the UB-free reinterpretation of a float's bits (vs type-punning through a pointer).
- Exploring Binary. "Why 0.1 Does Not Exist In Floating-Point." exploringbinary.com. The exact nearest value to 0.1 and the repeating-binary reason it isn't representable.
- Bruce Dawson. "Comparing Floating Point Numbers, 2012 Edition." randomascii.wordpress.com. Absolute vs relative vs ULP comparison, where each fails, the combined approach, and "zero is a huge nuisance."
- "Kahan summation algorithm." Wikipedia. en.wikipedia.org. The compensated-summation algorithm and its error bound versus naive summation (and the Neumaier and pairwise variants).
- cppreference. std::fma. en.cppreference.com. Single-rounding semantics of a*b+c, FP_FAST_FMA, and the relationship to compiler contraction (FP_CONTRACT).
- Simon Byrne. "Beware of fast-math." simonbyrne.github.io. The sub-flags -ffast-math enables, including reassociation breaking Kahan, finite-math killing isnan, and crtfastmath.o setting FTZ/DAZ process-wide at link time.
- Glenn Fiedler. "Floating Point Determinism." gafferongames.com. The causes of cross-platform float divergence and the Battlezone II lockstep case (forced single precision, wrapped transcendentals).
- Bruce Dawson. "That's Not Normal: the Performance of Odd Floats." randomascii.wordpress.com. Measured subnormal slowdowns with the specific CPUs, and the FTZ/DAZ control-register fix.
- Intel. "Set the FTZ and DAZ Flags," oneAPI C++ Compiler Developer Guide. intel.com. The _MM_SET_FLUSH_ZERO_MODE / _MM_SET_DENORMALS_ZERO_MODE intrinsics and the MXCSR FTZ (bit 15) / DAZ (bit 6) bits.
- "Fast inverse square root." Wikipedia. en.wikipedia.org. The Quake III 0x5f3759df bit-level hack, its ~0.17% error after one Newton step, and the modern hardware rsqrt instructions that supersede it.