All tutorials Mighty Professional
Tutorial 16 ยท Engine Programming

Bit Shifting from Scratch

Every flag check, packed color, texture coordinate, random number, and spatial key in a game engine bottoms out in shifts and masks. We start from a single bit and build up to the things a senior engineer still gets wrong: why dividing a negative number by a shift is a bug, why 1 << 32 has three different answers, and why hand-writing a shift for speed buys you nothing. Fourteen live widgets, C and Rust side by side, and the primary sources behind each trick.

Time~60 min LevelJunior on-ramp; senior review for the traps PrereqsYou can read C or pseudocode. No prior binary needed; §2 builds it. The later engine sections assume you have seen a game loop. HardwareNone. A feel for how an integer sits in a register helps in §5 and §16.

01Why bits

An integer in a register is not a number. It is 32 or 64 wires, each carrying a 0 or a 1, and the CPU offers a handful of single-cycle operations that treat those wires directly: shift them sideways, AND them, OR them, XOR them. Numeric meaning is a convention layered on top. Bit manipulation is what you do when you drop the convention and work the wires.

In a game engine that happens constantly, usually in the hottest code:

What you'll have by the end

A correct mental model of the two shift operators and when each is arithmetic or logical. The three traps that bite seniors: signed division by a power of two, shift-count undefined behavior, and the multiply-is-not-slower reality. Rotates and the mask toolkit. Then the engine payload: flags and bitsets, bitboards, RGBA and 565 packing, Morton swizzling, fixed-point arithmetic, the Quake III inverse square root, xorshift, and hardware popcount, clz, ctz, and BMI2. Every concept has a widget you can poke.

02Binary & two's complement

Positional notation in base two: bit i carries the value 2i. The byte 0b00101101 is 32 + 8 + 4 + 1 = 45. Because long binary strings are unreadable, we group four bits into one hex digit (a nibble): 0x2D is the same 45. Two nibbles make a byte, four bytes make a 32-bit word.

The right-most bit is the LSB; the left-most is the MSB. For an unsigned 32-bit integer the MSB is just the 231 place. For a signed integer it is the sign.

Two's complement, in one paragraph

Nearly every machine you will ship on stores signed integers in : the top bit carries negative weight. In 8 bits the value is −128·b7 + 64·b6 + … + 1·b0, so 0b11111111 is −1, not 255. Negating is "flip every bit, then add one" (-x == ~x + 1). Three consequences fall out of this and they drive most of the tutorial:

C++20 went further and declared two's complement the only legal signed representation, ending decades of "in principle a C compiler could use sign-magnitude" hedging[12]. Click the bits below and watch the signed and unsigned readings diverge at bit 31.

Live ยท Bit inspector
binary
-
hex
-
unsigned
-
signed
-
reading
-
Click any bit to toggle it. The sign bit (31) is outlined. The unsigned and signed readings agree until bit 31 is set, where the unsigned value jumps past two billion and the signed value goes negative. They are the same 32 wires, read two ways.

03Left & right shifts

A shift slides every bit a fixed number of places. Left shift (<<) moves bits toward the MSB and feeds zeros in at the bottom. Each step doubles the value, because every bit moves to the next-higher power of two. Right shift (>>) moves bits toward the LSB and (for an unsigned value) feeds zeros in at the top, halving the value and discarding the bit that falls off the bottom.

So x << 3 is x × 8 and x >> 3 is roughly x ÷ 8. The word "roughly" is doing real work there, and §6 is about exactly how much. For now, watch the bits move. Load a negative value and the right shift's two modes (next section) start to matter.

Live ยท Shift playground
hex
-
value
-

Left shift past bit 31 and the high bits fall off the top, silently. That overflow is exactly why left-shifting a signed value into the sign bit is undefined behavior (§7). Load −20, set the right shift to logical, and watch a "divide by two" turn a small negative into a huge positive: the sign bit got a zero instead of a copy of itself.

04Signed vs unsigned: SAR and SHR

Left shift has one behavior. Right shift has two, and the operand's type picks between them. A right shift always feeds zeros into the top (the x86 instruction is SHR). An arithmetic right shift copies the sign bit into the top, so a negative number stays negative (the instruction is SAR).

In C and C++ the rule is: unsigned >> is logical, signed >> is arithmetic. The signed case was technically implementation-defined before C++20, which is standardese for "every real compiler sign-extends, but the standard would not promise it." C++20 removed the weasel words and now requires arithmetic right shift for signed types[12]. On x86, ARM, and every other mainstream target it was always SAR for signed and SHR for unsigned[10].

The same bits, shifted two ways
int32_t  signedVal   = -8;          // 0xFFFFFFF8
uint32_t unsignedVal = 0xFFFFFFF8u; // same 32 bits, read as unsigned

// Signed >> is arithmetic (SAR): the sign bit copies down.
signedVal   >> 1;   // -4   (0xFFFFFFFC)  divide-by-two stays correct

// Unsigned >> is logical (SHR): zeros come in at the top.
unsignedVal >> 1;   // 0x7FFFFFFC = 2147483644, not -4

// The sign-broadcast trick: -1 if negative, 0 if not.
int32_t signMask = signedVal >> 31;   // 0xFFFFFFFF (all ones)

// Branchless absolute value, straight out of Hacker's Delight.
int32_t mask = x >> 31;             // 0 or -1
int32_t absX = (x ^ mask) - mask;   // flip-and-add-one only when negative
let signed_val: i32 = -8;            // 0xFFFFFFF8
let unsigned_val: u32 = 0xFFFFFFF8;  // same bits, unsigned type

// In Rust the TYPE decides, just like C, but it is guaranteed:
signed_val   >> 1;   // -4            arithmetic, sign copies down
unsigned_val >> 1;   // 0x7FFFFFFC    logical, zero-filled

// Sign broadcast. Rust has no UB here; the semantics are defined.
let sign_mask: i32 = signed_val >> 31;   // -1 (all ones)

// Branchless abs (std has i32::abs; this shows the bit trick).
let mask = x >> 31;                  // 0 or -1
let abs_x = (x ^ mask) - mask;
The one-line takeaway

If you reinterpret a pointer or a value through an unsigned type and shift right, you get a logical shift, even if the bits "mean" a negative number. Mixing signedness around a shift is one of the quieter ways to ship a bug. Pick the type that matches the meaning, then let the compiler pick SAR or SHR.

05Multiply, divide & the "shift is faster" myth

Here is a belief that made sense on a 1980s microprocessor and has been quietly wrong ever since: that replacing x * 8 with x << 3 makes your code faster. On the CPUs you ship games on, it does not, because the compiler already emits the shift for you. The optimization has a name, , and it has been standard since before most shipping engines were written.

Two facts retire the myth. First, the makes any shift cost one cycle regardless of distance: shifting by 1 and shifting by 31 are the same price. The "big shifts are slow" intuition is a memory of the 8086, which really did loop one bit per cycle. Second, on a modern x86 core an integer multiply is about 3 cycles of latency with one finishing every cycle, and a shift is 1 cycle[11]. That gap is real in a tight dependent chain, but exploiting it is the compiler's job, and it does, as the disassembly below shows:

Live ยท What the compiler actually emits

        

Representative x86-64 output at -O2. The exact instructions vary by compiler and version; the shapes do not. x * 8 and x << 3 compile to the same instruction, so choosing one over the other is a statement about intent, not speed. Division is where shifts genuinely save cycles, and where the signed case hides a trap.

So when should you write a shift? When the value is a bit pattern, not a number. color >> 24 says "pull out the alpha byte." 1u << layer says "the bit for this layer." Writing those as color / 16777216 or pow(2, layer) would be obfuscation. Use the shift to signal bit work; use * and / for arithmetic and trust the compiler to reduce it.

06The signed-division trap

This is the one that gets seniors. For unsigned values, x / 8 and x >> 3 are identical. For signed values they are not, and the difference is a real bug that survives code review because the two expressions look interchangeable.

The reason is rounding direction. An arithmetic right shift rounds toward negative infinity (it floors): -1 >> 1 is -1, because 0xFFFFFFFF shifted right with sign extension is still 0xFFFFFFFF. But C and C++ integer division rounds toward zero (it truncates): -1 / 2 is 0. So -7 >> 1 is -4 while -7 / 2 is -3. Off by one, only for negatives, only when the division is inexact.

The compiler knows this, which is why signed x / 8 in the widget above was not a single sar. It adds a bias of 2n−1 to negative operands before shifting, nudging them back toward zero[5][17]. The bias is computed branchlessly with the sign-broadcast mask from §4. Drag the dividend negative and watch the floor and the true quotient split apart, then watch the bias close the gap.

Live ยท Round toward zero vs round toward −∞ 2^2 = 4
x >> n (floor)
-
x / 2โฟ (C, toward 0)
-
bias added
-
(x + bias) >> n
-

The corrected form (x + ((x >> 31) & (2โฟ - 1))) >> n adds 0 to non-negative operands and 2โฟ−1 to negative ones, reproducing C's toward-zero division with one extra add and one AND. This is precisely the instruction sequence the compiler emits for x / 8 on a signed int.

07Undefined behavior & the count-masking footgun

Shifts have a surprising amount of packed into a simple-looking operator. The C and C++ standards say the result is undefined when the shift count is negative, or greater than or equal to the operand width, and (for signed left shifts) when a bit is pushed into or past the sign bit[12]. "Undefined" means the compiler may assume it never happens and optimize on that assumption.

The count rule is the practical landmine, because the hardware disagrees with the language. An x86 shift instruction masks the count to the low 5 bits for a 32-bit operand, or 6 bits for a 64-bit operand[10]. So shl eax, 32 is a no-op (32 masked to 0), not a zero. C leaves the case undefined precisely so the compiler can emit that bare masked instruction without checking the count[13]. The result: one line of standard-conforming code can produce three different answers.

Live ยท One shift, three answers 1 & 31 = 1

C standard

-
value << count

x86 hardware

-
count masked to 5 bits

compile-time fold

-
constant folded to 0

Count ≥ 32: this expression is undefined behavior in C and C++.

Push the count to 32 or beyond. The C-standard column goes red (undefined), the x86 column wraps the count and keeps shifting the live value, and a compiler that folds the expression at compile time may simply produce 0. All three are "correct" under their own rules. The takeaway: never let a shift count reach the operand width.
Other languages made different choices

Because the widgets here run in your browser, they are a live example of a third rule. JavaScript coerces operands to 32-bit integers for bitwise ops and masks the shift count to 5 bits, so 1 << 32 is 1, and it has a separate >>> operator for unsigned (logical) right shift. Java does the same masking (& 31 for int, & 63 for long) and also spells logical right shift >>>. None of these are "undefined": they are defined to mask. Only C and C++ leave it open so the compiler can emit the raw hardware instruction. Port a shift between these languages without thinking and the count-masking rule will eventually surprise you.

08Rotates, and why C has no rotate operator

A rotate is a shift that wraps: bits that fall off one end reappear at the other, so no information is lost and the population count never changes. CPUs have had rotate instructions for decades (ROL and ROR on x86). Hash functions, cryptographic primitives, and PRNGs lean on them constantly because rotation mixes bits without throwing any away.

C and C++ have no rotate operator, so for years people wrote the idiom (x << n) | (x >> (32 - n)). That idiom has a bug: when n is 0, the second shift becomes x >> 32, which is the undefined behavior from §7. The safe version masks the complement: (x << n) | (x >> ((-n) & 31)). Modern compilers pattern-match both forms back to a single ROL[2]. Since C++20 you can stop playing this game and call std::rotl and std::rotr from <bit>, which are correct for every count including 0[12].

Rotate, three ways
#include <bit>     // C++20
#include <cstdint>

// The classic idiom. Correct ONLY for 1 <= n <= 31.
uint32_t rotl_naive(uint32_t value, int n) {
    return (value << n) | (value >> (32 - n));   // UB when n == 0
}

// The safe hand-written form: mask the complement so n == 0 works.
uint32_t rotl_safe(uint32_t value, int n) {
    return (value << (n & 31)) | (value >> ((-n) & 31));
}

// The right answer in modern C++: defined for every n, compiles to ROL.
uint32_t rotl_std(uint32_t value, int n) {
    return std::rotl(value, n);
}
// Rust has had rotate in the standard library from the start.
// These are defined for every count and lower to ROL / ROR.
let rotated_left  = value.rotate_left(n);    // u32 -> u32
let rotated_right = value.rotate_right(n);

// Writing the shift idiom by hand would panic in debug builds if the
// shift amount reached 32 (Rust checks shift overflow), so use the
// built-ins. They are the canonical form, not a convenience wrapper.
Live ยท Rotate ring (16-bit)
In rotate mode the set-bit count never changes; bits leaving the top reappear at the bottom. Flip to shift mode and rotate repeatedly: the word bleeds to zero as bits fall off the end. That conservation is why rotation is the mixing primitive of choice in hashes and PRNGs, where you want to stir bits without losing entropy.

09The mask toolkit

Shifts position a bit; masks act on it. The whole toolkit is four bitwise operators and a couple of idioms, and almost every higher-level trick in this tutorial is built from them[1]. Knuth catalogs the scholarly version under "Bitwise Tricks and Techniques"[4].

OperationExpressionWhat it does
Test bit i(x >> i) & 1Isolate one bit as 0 or 1.
Set bit ix | (1u << i)Force the bit to 1, leave the rest.
Clear bit ix & ~(1u << i)Force the bit to 0.
Toggle bit ix ^ (1u << i)Flip the bit.
Low n bits set(1u << n) - 1A mask of n ones. UB-adjacent: at n = 32 the shift itself is undefined.
Extract a field(x >> shift) & maskPull a packed sub-value out.
Isolate lowest set bitx & -xKeep only the rightmost 1. The basis of bitset iteration.
Clear lowest set bitx & (x - 1)Drop the rightmost 1. Loop on this to visit set bits.

The last two deserve a second look because they fall straight out of two's complement. -x is ~x + 1, which flips every bit up to and including the lowest set bit, so x & -x keeps exactly that bit. And subtracting 1 borrows through the trailing zeros up to the lowest set bit, so x & (x - 1) erases it. Together they let you walk the set bits of a word in as many steps as there are set bits, not 32 or 64. That is the engine behind iterating a bitset, which is the next section.

10Flags, bitsets & bitboards

The most common bit work in a game is the humblest: one integer, one bit per yes/no fact. Sixty-four booleans in a single 64-bit word, tested and combined with the operators from §9, no cache misses, no branches.

Collision layers and masks

Physics engines decide whether two objects can collide with a pair of bitmasks. Each body has a category (one bit: which layer it is) and a mask (which layers it accepts). Two bodies collide only if each one's category is in the other's mask: (catA & maskB) && (catB & maskA). Box2D spells this exactly this way[14]; Unity's layer collision matrix is the same idea wearing a UI. Two ANDs replace a lookup table.

Live ยท Collision filter (the Box2D rule)

Body A

category (pick one)
mask (accepts)
cat = - ยท mask = -

Body B

category (pick one)
mask (accepts)
cat = - ยท mask = -
PASS THROUGH
Set Body A to category Player accepting everything, and Body B to Pickup accepting only Player. Both directions pass, so they collide. Now make a Bullet whose mask excludes other Bullets: bullets fly through each other while still hitting everything else. The whole filter is two AND operations per pair.

Bitsets as component signatures

An ECS often gives each entity a bitset "signature": bit k set means the entity has component k. A system that wants everything with Position and Velocity builds a query mask and accepts an entity when (signature & queryMask) == queryMask. Iterating the matching components means walking the set bits with the x & (x - 1) idiom from §9. The same pattern drives free-list allocators (a set bit is a free slot) and dirty-flag tracking.

Bitboards

Board games take this to its logical end. A bitboard packs an 8×8 board into a 64-bit word, one bit per square. The payoff is that a shift moves every piece at once: shifting the whole board left by 8 slides every piece one rank north, by 1 one file east. The catch is wraparound. Shifting east moves a piece on the h-file onto the a-file of the next rank, so you AND with a "not the a-file" mask to delete the wrapped bits[14]. Population count gives material; count-trailing-zeros gives the next occupied square.

Live ยท Bitboard mover (64-bit) pop: 5
board
-
last move
click a square, then shift
Place a piece on the right edge (the h-file) and shift east: it vanishes instead of teleporting to the left edge, because notAFile masked away the wrapped bit. Drop the mask and that piece would reappear on the far side, one rank up. The mask is the difference between a move generator and a bug.

11Packing: color & bitfields

Shifts and masks are how you stuff several small values into one word and pull them back out. The canonical example is color. An 8-bit-per-channel RGBA pixel is four bytes packed into a 32-bit word, and the whole codec is shifts on the way in, masks on the way out.

Pack and unpack RGBA8888
uint32_t pack(uint8_t r, uint8_t g, uint8_t b, uint8_t a) {
    return (uint32_t(r) << 24) | (uint32_t(g) << 16)
         | (uint32_t(b) <<  8) |  uint32_t(a);
}

// Cast to uint32_t BEFORE shifting. (uint8_t(r) << 24) would promote to
// int and push a 1 into the sign bit: the signed-left-shift UB from ยง7.

uint8_t red(uint32_t c)   { return (c >> 24) & 0xFF; }
uint8_t green(uint32_t c) { return (c >> 16) & 0xFF; }
uint8_t blue(uint32_t c)  { return (c >>  8) & 0xFF; }
uint8_t alpha(uint32_t c) { return  c        & 0xFF; }
fn pack(r: u8, g: u8, b: u8, a: u8) -> u32 {
    (r as u32) << 24 | (g as u32) << 16 | (b as u32) << 8 | a as u32
}

// In Rust the `as u32` widens before the shift, so there is no sign-bit
// hazard. Shifting a u32 left into bit 31 is just a value, never UB.

fn red(c: u32)   -> u8 { ((c >> 24) & 0xFF) as u8 }
fn green(c: u32) -> u8 { ((c >> 16) & 0xFF) as u8 }
fn blue(c: u32)  -> u8 { ((c >>  8) & 0xFF) as u8 }
fn alpha(c: u32) -> u8 {  (c        & 0xFF) as u8 }

Squeeze the same color into 16 bits and you get RGB565, still used for framebuffers and textures where bandwidth matters more than fidelity[15]. Packing drops the low bits of each channel (r >> 3, g >> 2, b >> 3); unpacking has to invent them back, usually by replicating the high bits down. That reconstruction is lossy, which is what produces banding in 16-bit gradients. The widget shows the error.

Live ยท Color packer
input
stored
packed
-
bits
-
Switch to RGB565 and watch the "stored" swatch drift from the input as the low bits of each channel are thrown away. Slide a channel slowly in 565 mode and the swatch steps instead of glides: only 32 red levels survive where 8-bit had 256. The error readout is the per-channel reconstruction gap.
Endianness is not bit order

A common confusion: people assume that packing colors with shifts is sensitive to endianness. It is not. Shifts operate on the value in a register, where bit i always means 2i, the same on every machine. Endianness only describes the order bytes land in memory when you store that register. (r << 24) | ... produces the identical number on a little-endian and a big-endian CPU; the bytes only reshuffle when you write the word to a buffer and read it back as bytes. When you genuinely need to swap byte order (network packets, file formats), that is a separate operation: bswap, or std::byteswap in C++23.

12Morton order & swizzling

Packing puts independent fields side by side. does something subtler: it interleaves the bits of two coordinates so that points close together in 2D get keys close together in 1D. Guy Morton described it in a 1966 IBM technical report on file sequencing for a geodetic database[6], and it has been quietly load-bearing in graphics ever since.

The encode is pure shift-and-mask: spread x into the even bit positions, spread y into the odd positions, OR them together. Walking the resulting indices 0, 1, 2, 3 traces a recursive Z (hence "Z-order curve"), and that Z keeps spatial neighbors mostly adjacent in memory. GPUs lay textures out in Morton-order tiles, or close relatives of them, so that a 2D sampling neighborhood lands on one cache line. Linear octrees and quadtrees store nodes sorted by Morton code, computing parent and child relationships with shifts instead of pointer chases (the spatial partitioning tutorial builds those trees). Parallel BVH builders sort primitives by Morton code as their first step.

Live ยท Morton (Z-order) tracer
index
-
decoded
-
interleave

blue = x bits (even positions) ยท red = y bits (odd positions)

The index bits alternate y, x, y, x reading from the top. Switch to row-major and the path makes long horizontal sweeps: cell 7 and cell 8 are neighbors in index but opposite ends of the grid in space. Z-order keeps the jump small, which is the whole point for cache locality. The big diagonal leaps in Z-order are where the curve crosses a quadrant boundary, the curve's known weak spot.

For wider coordinates you do not loop bit by bit. The standard trick is a sequence of shift-and-mask steps with "magic" constants that spread the bits in logarithmically few operations, the inverse of the SWAR fold you will see in §15[1]. And on recent x86, a single PDEP instruction deposits bits into a mask in one go, collapsing the whole interleave to two instructions (§16).

13Fixed-point arithmetic

Before hardware floating point was free, and in any system that needs bit-exact reproducibility today, fractions are stored as integers with an implied binary point. The shift you keep in your head is the whole idea. A Q16.16 value is a 32-bit integer; the real number it represents is that integer divided by 216.

Conversion is a shift: an integer n becomes n << 16. Addition and subtraction just work, because the points line up. Multiplication is where the shift earns its keep: multiplying two Q16.16 values gives a result scaled by 232, so you shift the product right by 16 to put the point back, using a 64-bit intermediate so the high bits do not fall off. Division shifts the other way.

Q16.16 fixed-point multiply
using fixed = int32_t;            // Q16.16
constexpr int SHIFT = 16;

fixed from_int(int n)    { return n << SHIFT; }
fixed mul(fixed a, fixed b) {
    // 32x32 -> 64 so the top bits survive, then shift the point back.
    int64_t wide = int64_t(a) * int64_t(b);
    return fixed(wide >> SHIFT);
}
fixed div(fixed a, fixed b) {
    int64_t wide = (int64_t(a) << SHIFT);
    return fixed(wide / b);
}
type Fixed = i32;                 // Q16.16
const SHIFT: u32 = 16;

fn from_int(n: i32) -> Fixed { n << SHIFT }
fn mul(a: Fixed, b: Fixed) -> Fixed {
    // Widen to i64 so the product does not overflow, then re-align.
    let wide = a as i64 * b as i64;
    (wide >> SHIFT) as Fixed
}
fn div(a: Fixed, b: Fixed) -> Fixed {
    (((a as i64) << SHIFT) / b as i64) as Fixed
}

Why bother in 2026? Determinism. Floating-point results can differ across compilers, CPUs, and optimization flags (fused multiply-add here, a different rounding there), which quietly desynchronizes a lockstep simulation where every client must compute bit-identical results[18]. Fixed-point integer math is exactly reproducible by construction, so RTS netcode and replay systems still reach for it. It was also the only option on consoles without an FPU: the SNES and the Game Boy Advance did their world math in fixed-point shifts.

Live ยท Fixed-point number line
raw integer
-
real value
-
smallest step
-
split
-
Click bits to build a value; the blue-underlined bits are the fraction. Load 0.1 and switch between formats: in Q4.4 it cannot be represented (the nearest step is 1/16 = 0.0625), in Q16.16 it is close but still not exact, because 0.1 is not a sum of negative powers of two any more than it is in floating point. The integer/fraction boundary is just where you decide the point sits.

14Two famous hacks

Two bit tricks earned their fame in games. Both reinterpret the same bits as different types, which is the deepest move in the bit-manipulation playbook.

The fast inverse square root

Computing 1/√x for vector normalization once cost a division and a square root, both slow. The trick popularized by Quake III Arena in 1999 replaces them with a subtraction and a shift[19]. Reinterpret the float's 32 bits as an integer; because IEEE-754 lays out the exponent in the high bits, the integer value is roughly a scaled logarithm of the float. A right shift by one halves that integer, which halves the log, which square-roots the value; subtracting from the magic constant 0x5f3759df negates it, turning √x into 1/√x. Reinterpret back to float and you have a seed good to about 3%, then one Newton-Raphson step sharpens it to a fraction of a percent. Chris Lomont reverse-engineered why the constant is near-optimal in 2003[8].

Live ยท Float bits & the 0x5f3759df hack
signexponent (8)mantissa (23)
bits as int
-
exponent field
-
i >> 1
-

raw approx

-
error -

+1 Newton

-
error -

true 1/√x

-
The single right shift i >> 1 is the heart of it: halving the integer reinterpretation of a float roughly halves its exponent, which is how a shift can approximate a square root. The red error curve is the bare bit-hack; the green curve is one Newton step later. The purple line marks your current x.

Xorshift: a PRNG that is only shifts

George Marsaglia's 2003 xorshift generators produce a stream of pseudo-random numbers from nothing but shifts and XORs[9]. Three lines, no multiply, no table: XOR the state with a shifted copy of itself, three times, with carefully chosen shift amounts. A 32-bit xorshift runs through all 232−1 non-zero states before repeating. It is fast enough for particle jitter, loot rolls, and procedural placement, though it fails some stringent statistical tests and is not suitable for cryptography; later variants like xorshift128+ fixed the quality issues[9].

Live ยท Xorshift32 scatter state: -
points plotted
0
uint32_t x = seed;          // never zero
x ^= x << 13;
x ^= x >> 17;
x ^= x << 5;                 // x is the next output
Each point takes the low 16 bits of an output for its column and the high 16 for its row. The plane fills evenly: three shifts and three XORs are enough to smear the state across the whole range. Reset and single-step to watch the 32-bit state churn one iteration at a time.

15Counting bits: popcount, clz, ctz

Three queries about a word show up everywhere: how many bits are set (popcount), how many leading zeros (clz), how many trailing zeros (ctz). Popcount gives set cardinality and Hamming distance. Leading zeros give an integer log base two, which sizes hash tables and buddy allocators. Trailing zeros give the index of the lowest set bit, which is how you iterate a bitset.

Modern hardware does all three in one instruction: POPCNT arrived with SSE4.2 in 2008, and LZCNT and TZCNT came with the BMI extensions[10]. There is a sharp edge: the older bit-scan instructions BSR and BSF leave their result undefined when the input is zero, so clz and ctz of zero need care; LZCNT and TZCNT were defined to return the operand width instead. C++20 wraps all of this in <bit>: std::popcount, std::countl_zero, std::countr_zero, and std::bit_width, each lowering to the hardware instruction when present[12].

When the instruction is not available, the classic fallback is a population count: treat the word as many small lanes, add adjacent pairs, then nibbles, then bytes, halving the lane count each step. Thirty-two bits fall in five operations instead of a 32-iteration loop. The technique goes back to HAKMEM, MIT AI Memo 239 from 1972[3], and is tabulated in Hacker's Delight[2]. Trailing-zero counts have an equally clever software form using a de Bruijn sequence and a small lookup table[7].

Live ยท Popcount race -

naive loop

-

SWAR fold

-

hardware

-
The naive loop tests one bit per iteration: 32 steps. The SWAR fold sums in parallel, halving the field width each row, and reaches the same total in 5. The hardware POPCNT does it in a single instruction. Run the fold and watch the partial sums accumulate as the lane width doubles each row.

16Hardware bit-manip: BMI1 and BMI2

Since 2013, x86 has shipped two extension sets, BMI1 and BMI2, that turn several of this tutorial's multi-step idioms into single instructions[10]. Compilers emit them automatically when you target a recent enough CPU; recognizing them in a disassembly tells you what the source was doing.

InstructionReplacesMeaning
BLSRx & (x - 1)Clear the lowest set bit.
BLSIx & -xIsolate the lowest set bit.
ANDN~a & bAND-NOT in one op, no separate NOT.
BZHIx & ((1 << n) - 1)Zero the bits at and above position n.
SHLX / SHRX / SARX<< / >>Shift by a variable count without touching flags.
PEXT / PDEPspread/gather loopsExtract or deposit bits selected by a mask, in parallel.

Two are worth dwelling on. The variable-count shifts SHLX, SHRX, and SARX do not read or write the flags register and do not pin the count to the CL register, so they avoid a false dependency that the legacy shl reg, cl form carries; each is a single micro-op[11]. And PDEP collapses the entire Morton-encode from §12 to one instruction by depositing coordinate bits straight into an interleave mask.

One caveat before you reach for PEXT

PEXT and PDEP are single-cycle on Intel and on AMD from Zen 3 onward, but on AMD's Zen, Zen+, and Zen 2 they are implemented in microcode and take on the order of a hundred cycles[11]. Code that assumed they were fast (notably some chess engines using PEXT for sliding-piece attack tables) ran dramatically slower on those parts. If you dispatch on PEXT, gate it on the vendor and family, not just the feature bit.

17Pitfalls

The recap, as a checklist of the mistakes that survive review because the code looks correct.

18What's intentionally missing, and what's next

This tutorial stayed on scalar integer bit work. Several neighbors were left out on purpose. SIMD bit operations (operating on 128, 256, or 512 bits at once, including the pshufb byte-shuffle as a parallel lookup) live in the SIMD tutorial. Bit-reversal permutations and the butterfly addressing behind the FFT are an audio and signal-processing topic. Gray codes, where consecutive values differ in one bit, matter for rotary encoders and some error-tolerant schemes. Entropy coders like rANS pack fractional bits for compression. Constant-time bit tricks for cryptography deliberately avoid data-dependent branches and timing, a discipline of its own. Each is a shift-and-mask story with different stakes.

For where these bits go next: the hash tables tutorial uses the shift-and-xor finalizers and Fibonacci hashing that this tutorial's mixing primitives feed; the spatial partitioning tutorial builds the Morton-ordered octrees from §12; and the memory model tutorial covers the alignment masks ((p + a - 1) & ~(a - 1)) that round pointers up to a power-of-two boundary.

19Sources

  1. Sean Eron Anderson. "Bit Twiddling Hacks." Stanford Graphics Lab, 1997–2005. graphics.stanford.edu/~seander/bithacks.html. The canonical web reference for branchless bit idioms: lowest-set-bit isolation, bit spreading for Morton codes, parallel population counts, and sign tricks. Code is in the public domain.
  2. Henry S. Warren Jr. Hacker's Delight, 2nd ed., Addison-Wesley, 2012. The standard book on integer bit manipulation. Source for the branchless absolute value, the rotate pattern-matching, the SWAR population counts, and the division-by-constant derivations.
  3. Michael Beeler, R. William Gosper, Rich Schroeppel. "HAKMEM." MIT AI Memo 239, February 1972. dspace.mit.edu. Item 169 gives the parallel (SWAR) population count on the 36-bit PDP-6/10, the ancestor of every fast popcount fallback.
  4. Donald E. Knuth. The Art of Computer Programming, Volume 4A: Combinatorial Algorithms, Part 1, §7.1.3 "Bitwise Tricks and Techniques." Addison-Wesley, 2011. The exhaustive scholarly treatment of bit operations, magic masks, and broadword computing.
  5. Torbjรถrn Granlund, Peter L. Montgomery. "Division by Invariant Integers using Multiplication." Proc. ACM SIGPLAN PLDI, 1994, pp. 61–72. PDF. The algorithm compilers use to turn division by a constant into a reciprocal multiply and shift, including the bias for signed division covered in §6.
  6. Guy M. Morton. "A Computer Oriented Geodetic Data Base; and a New Technique in File Sequencing." IBM Ltd., Ottawa, 1966. The first description of the Z-order (Morton) curve: interleaving coordinate bits to preserve spatial locality in a 1D key.
  7. Charles E. Leiserson, Harald Prokop, Keith H. Randall. "Using de Bruijn Sequences to Index a 1 in a Computer Word." MIT Laboratory for Computer Science, 1998. PDF (MIT, via Internet Archive). The multiply-and-lookup method for count-trailing-zeros without a hardware instruction.
  8. Chris Lomont. "Fast Inverse Square Root." Purdue University tech note, 2003. PDF. Derives why 0x5f3759df is near-optimal and searches for a marginally better constant. The reference analysis of the Quake III hack.
  9. George Marsaglia. "Xorshift RNGs." Journal of Statistical Software, 8(14):1–6, 2003. jstatsoft.org. The shift-and-xor random number generators, with periods 2k−1. Later work (Panneton & L'Ecuyer; Vigna) analyzes their statistical weaknesses and the scrambled successors.
  10. Intel. Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 2: Instruction Set Reference. Entries for SAL/SAR/SHL/SHR (count masked to 5 or 6 bits), ROL/ROR, SHLD/SHRD, SARX/SHLX/SHRX, POPCNT, LZCNT/TZCNT, and the BMI1/BMI2 sets. Mirrored at felixcloutier.com/x86.
  11. Agner Fog. "Instruction Tables" and "The Microarchitecture of Intel, AMD and VIA CPUs." Technical University of Denmark. agner.org/optimize. Source for shift and multiply latencies and throughputs, the single-micro-op BMI2 shifts, and the microcoded PEXT/PDEP on pre-Zen-3 AMD.
  12. ISO C++. <bit> header (P0553 "Bit operations" and P0556), and P1236/P0907 making signed integers two's complement with defined arithmetic right shift. C++20. Reference: cppreference.com/w/cpp/header/bit. Defines rotl, rotr, popcount, countl_zero, countr_zero, bit_width.
  13. David Wragg. "Shift Instructions." 2012. david.wragg.org. Explains why C leaves over-width shifts undefined: so the compiler can emit the bare x86 shift, whose count is already masked, without a guard.
  14. Chess Programming Wiki. "Bitboards," "General Setwise Operations," "BMI2." chessprogramming.org. The reference for board-as-64-bit-word representations, the file-mask shift moves, and population-count material evaluation.
  15. Khronos Group. OpenGL and Vulkan format specifications, packed pixel formats (GL_UNSIGNED_SHORT_5_6_5, VK_FORMAT_R5G6B5, R8G8B8A8). registry.khronos.org. The bit layouts for the color packing in §11.
  16. Christer Ericson. "Order your graphics draw calls around!" realtimecollisiondetection.net, 2008. realtimecollisiondetection.net. The standard reference for bit-packed draw-call sort keys feeding a radix sort.
  17. Randal E. Bryant, David R. O'Hallaron. Computer Systems: A Programmer's Perspective, 3rd ed., Pearson, 2015, ch. 2. The teaching reference for two's complement, the arithmetic-shift division bias, and integer overflow semantics.
  18. Glenn Fiedler. "Floating Point Determinism." Gaffer On Games, 2010. gafferongames.com. Why floating-point results vary across compilers and CPUs, and why deterministic lockstep simulations turn to fixed-point integer math.
  19. id Software. Quake III Arena source release, code/game/q_math.c (Q_rsqrt), GPL, 2005 (code circa 1999). github.com/id-Software. The shipped form of the fast inverse square root, magic constant and all.

See also