Tutorial 16 · Engine Programming

Bit Shifting from Scratch

Every flag check, packed color, texture coordinate, random number, and spatial key in a game engine bottoms out in shifts and masks. We start from a single bit and build up to the things a senior engineer still gets wrong: why dividing a negative number by a shift is a bug, why 1 << 32 has three different answers, and why hand-writing a shift for speed buys you nothing. Fourteen live widgets, C and Rust side by side, and the primary sources behind each trick.

Time~60 min LevelJunior on-ramp; senior review for the traps PrereqsYou can read C or pseudocode. No prior binary needed; §2 builds it. The later engine sections assume you have seen a game loop. HardwareNone. A feel for how an integer sits in a register helps in §5 and §16.

01Why bits

An integer in a register is not a number. It is 32 or 64 wires, each carrying a 0 or a 1, and the CPU offers a handful of single-cycle operations that treat those wires directly: shift them sideways, AND them, OR them, XOR them. Numeric meaning is a convention layered on top. Bit manipulation is what you do when you drop the convention and work the wires.

In a game engine that happens constantly, usually in the hottest code:

Flags and layers. Which input buttons are down, which collision layers an object belongs to, which components an entity has. One integer, one bit per option, tested with a single AND.
Packing. An RGBA color is four bytes shifted into one 32-bit word. A draw-call sort key packs material, depth, and pass into one integer so a radix sort can order the whole frame^[16].
Spatial keys. Interleaving the bits of an (x, y, z) coordinate produces a , the address scheme behind texture swizzling, linear octrees, and GPU bounding-volume builders.
Cheap math. Multiply and divide by powers of two, fractions, fast reciprocal-square-root seeds, and shift-based random number generators.
Counting. How many bits are set (population count), where the lowest set bit is (count trailing zeros). The primitives behind bitset iteration and free-list allocation.

What you'll have by the end

A correct mental model of the two shift operators and when each is arithmetic or logical. The three traps that bite seniors: signed division by a power of two, shift-count undefined behavior, and the multiply-is-not-slower reality. Rotates and the mask toolkit. Then the engine payload: flags and bitsets, bitboards, RGBA and 565 packing, Morton swizzling, fixed-point arithmetic, the Quake III inverse square root, xorshift, and hardware popcount, clz, ctz, and BMI2. Every concept has a widget you can poke.

02Binary & two's complement

Positional notation in base two: bit i carries the value 2ⁱ. The byte 0b00101101 is 32 + 8 + 4 + 1 = 45. Because long binary strings are unreadable, we group four bits into one hex digit (a nibble): 0x2D is the same 45. Two nibbles make a byte, four bytes make a 32-bit word.

The right-most bit is the LSB; the left-most is the MSB. For an unsigned 32-bit integer the MSB is just the 2³¹ place. For a signed integer it is the sign.

Two's complement, in one paragraph

Nearly every machine you will ship on stores signed integers in : the top bit carries negative weight. In 8 bits the value is −128·b₇ + 64·b₆ + … + 1·b₀, so 0b11111111 is −1, not 255. Negating is "flip every bit, then add one" (-x == ~x + 1). Three consequences fall out of this and they drive most of the tutorial:

There is one more negative value than positive: the range is −2³¹ to 2³¹−1. INT_MIN has no positive twin, so -INT_MIN overflows.
The sign lives entirely in the top bit, so copying that bit downward (an arithmetic right shift) preserves sign.
x >> 31 on a signed 32-bit value is either all-zeros (x non-negative) or all-ones (x negative). That all-or-nothing mask is the seed of dozens of branchless tricks.

C++20 went further and declared two's complement the only legal signed representation, ending decades of "in principle a C compiler could use sign-magnitude" hedging^[12]. Click the bits below and watch the signed and unsigned readings diverge at bit 31.

binary: -
hex: -
unsigned: -
signed: -
reading: -

03Left & right shifts

A shift slides every bit a fixed number of places. Left shift (<<) moves bits toward the MSB and feeds zeros in at the bottom. Each step doubles the value, because every bit moves to the next-higher power of two. Right shift (>>) moves bits toward the LSB and (for an unsigned value) feeds zeros in at the top, halving the value and discarding the bit that falls off the bottom.

So x << 3 is x × 8 and x >> 3 is roughly x ÷ 8. The word "roughly" is doing real work there, and §6 is about exactly how much. For now, watch the bits move. Load a negative value and the right shift's two modes (next section) start to matter.

hex: -
value: -

04Signed vs unsigned: SAR and SHR

Left shift has one behavior. Right shift has two, and the operand's type picks between them. A right shift always feeds zeros into the top (the x86 instruction is SHR). An arithmetic right shift copies the sign bit into the top, so a negative number stays negative (the instruction is SAR).

In C and C++ the rule is: unsigned >> is logical, signed >> is arithmetic. The signed case was technically implementation-defined before C++20, which is standardese for "every real compiler sign-extends, but the standard would not promise it." C++20 removed the weasel words and now requires arithmetic right shift for signed types^[12]. On x86, ARM, and every other mainstream target it was always SAR for signed and SHR for unsigned^[10].

The same bits, shifted two ways

int32_t  signedVal   = -8;          // 0xFFFFFFF8
uint32_t unsignedVal = 0xFFFFFFF8u; // same 32 bits, read as unsigned

// Signed >> is arithmetic (SAR): the sign bit copies down.
signedVal   >> 1;   // -4   (0xFFFFFFFC)  divide-by-two stays correct

// Unsigned >> is logical (SHR): zeros come in at the top.
unsignedVal >> 1;   // 0x7FFFFFFC = 2147483644, not -4

// The sign-broadcast trick: -1 if negative, 0 if not.
int32_t signMask = signedVal >> 31;   // 0xFFFFFFFF (all ones)

// Branchless absolute value, straight out of Hacker's Delight.
int32_t mask = x >> 31;             // 0 or -1
int32_t absX = (x ^ mask) - mask;   // flip-and-add-one only when negative

let signed_val: i32 = -8;            // 0xFFFFFFF8
let unsigned_val: u32 = 0xFFFFFFF8;  // same bits, unsigned type

// In Rust the TYPE decides, just like C, but it is guaranteed:
signed_val   >> 1;   // -4            arithmetic, sign copies down
unsigned_val >> 1;   // 0x7FFFFFFC    logical, zero-filled

// Sign broadcast. Rust has no UB here; the semantics are defined.
let sign_mask: i32 = signed_val >> 31;   // -1 (all ones)

// Branchless abs (std has i32::abs; this shows the bit trick).
let mask = x >> 31;                  // 0 or -1
let abs_x = (x ^ mask) - mask;

The one-line takeaway

If you reinterpret a pointer or a value through an unsigned type and shift right, you get a logical shift, even if the bits "mean" a negative number. Mixing signedness around a shift is one of the quieter ways to ship a bug. Pick the type that matches the meaning, then let the compiler pick SAR or SHR.

05Multiply, divide & the "shift is faster" myth

Here is a belief that made sense on a 1980s microprocessor and has been quietly wrong ever since: that replacing x * 8 with x << 3 makes your code faster. On the CPUs you ship games on, it does not, because the compiler already emits the shift for you. The optimization has a name, , and it has been standard since before most shipping engines were written.

Two facts retire the myth. First, the makes any shift cost one cycle regardless of distance: shifting by 1 and shifting by 31 are the same price. The "big shifts are slow" intuition is a memory of the 8086, which really did loop one bit per cycle. Second, on a modern x86 core an integer multiply is about 3 cycles of latency with one finishing every cycle, and a shift is 1 cycle^[11]. That gap is real in a tight dependent chain, but exploiting it is the compiler's job, and it does, as the disassembly below shows:

Representative x86-64 output at -O2. The exact instructions vary by compiler and version; the shapes do not. x * 8 and x << 3 compile to the same instruction, so choosing one over the other is a statement about intent, not speed. Division is where shifts genuinely save cycles, and where the signed case hides a trap.

So when should you write a shift? When the value is a bit pattern, not a number. color >> 24 says "pull out the alpha byte." 1u << layer says "the bit for this layer." Writing those as color / 16777216 or pow(2, layer) would be obfuscation. Use the shift to signal bit work; use * and / for arithmetic and trust the compiler to reduce it.

06The signed-division trap

This is the one that gets seniors. For unsigned values, x / 8 and x >> 3 are identical. For signed values they are not, and the difference is a real bug that survives code review because the two expressions look interchangeable.

The reason is rounding direction. An arithmetic right shift rounds toward negative infinity (it floors): -1 >> 1 is -1, because 0xFFFFFFFF shifted right with sign extension is still 0xFFFFFFFF. But C and C++ integer division rounds toward zero (it truncates): -1 / 2 is 0. So -7 >> 1 is -4 while -7 / 2 is -3. Off by one, only for negatives, only when the division is inexact.

The compiler knows this, which is why signed x / 8 in the widget above was not a single sar. It adds a bias of 2ⁿ−1 to negative operands before shifting, nudging them back toward zero^[5]^[17]. The bias is computed branchlessly with the sign-broadcast mask from §4. Drag the dividend negative and watch the floor and the true quotient split apart, then watch the bias close the gap.

dividend x -9

shift n (divisor = 2ⁿ) 2

x >> n (floor): -
x / 2ⁿ (C, toward 0): -
bias added: -
(x + bias) >> n: -

The corrected form (x + ((x >> 31) & (2ⁿ - 1))) >> n adds 0 to non-negative operands and 2ⁿ−1 to negative ones, reproducing C's toward-zero division with one extra add and one AND. This is precisely the instruction sequence the compiler emits for x / 8 on a signed int.

07Undefined behavior & the count-masking footgun

Shifts have a surprising amount of packed into a simple-looking operator. The C and C++ standards say the result is undefined when the shift count is negative, or greater than or equal to the operand width, and (for signed left shifts) when a bit is pushed into or past the sign bit^[12]. "Undefined" means the compiler may assume it never happens and optimize on that assumption.

The count rule is the practical landmine, because the hardware disagrees with the language. An x86 shift instruction masks the count to the low 5 bits for a 32-bit operand, or 6 bits for a 64-bit operand^[10]. So shl eax, 32 is a no-op (32 masked to 0), not a zero. C leaves the case undefined precisely so the compiler can emit that bare masked instruction without checking the count^[13]. The result: one line of standard-conforming code can produce three different answers.

shift count 1

C standard

-
value << count

x86 hardware

-
count masked to 5 bits

compile-time fold

-
constant folded to 0

Count ≥ 32: this expression is undefined behavior in C and C++.

Other languages made different choices

Because the widgets here run in your browser, they are a live example of a third rule. JavaScript coerces operands to 32-bit integers for bitwise ops and masks the shift count to 5 bits, so 1 << 32 is 1, and it has a separate >>> operator for unsigned (logical) right shift. Java does the same masking (& 31 for int, & 63 for long) and also spells logical right shift >>>. None of these are "undefined": they are defined to mask. Only C and C++ leave it open so the compiler can emit the raw hardware instruction. Port a shift between these languages without thinking and the count-masking rule will eventually surprise you.

08Rotates, and why C has no rotate operator

A rotate is a shift that wraps: bits that fall off one end reappear at the other, so no information is lost and the population count never changes. CPUs have had rotate instructions for decades (ROL and ROR on x86). Hash functions, cryptographic primitives, and PRNGs lean on them constantly because rotation mixes bits without throwing any away.

C and C++ have no rotate operator, so for years people wrote the idiom (x << n) | (x >> (32 - n)). That idiom has a bug: when n is 0, the second shift becomes x >> 32, which is the undefined behavior from §7. The safe version masks the complement: (x << n) | (x >> ((-n) & 31)). Modern compilers pattern-match both forms back to a single ROL^[2]. Since C++20 you can stop playing this game and call std::rotl and std::rotr from <bit>, which are correct for every count including 0^[12].

Rotate, three ways

#include <bit>     // C++20
#include <cstdint>

// The classic idiom. Correct ONLY for 1 <= n <= 31.
uint32_t rotl_naive(uint32_t value, int n) {
    return (value << n) | (value >> (32 - n));   // UB when n == 0
}

// The safe hand-written form: mask the complement so n == 0 works.
uint32_t rotl_safe(uint32_t value, int n) {
    return (value << (n & 31)) | (value >> ((-n) & 31));
}

// The right answer in modern C++: defined for every n, compiles to ROL.
uint32_t rotl_std(uint32_t value, int n) {
    return std::rotl(value, n);
}

// Rust has had rotate in the standard library from the start.
// These are defined for every count and lower to ROL / ROR.
let rotated_left  = value.rotate_left(n);    // u32 -> u32
let rotated_right = value.rotate_right(n);

// Writing the shift idiom by hand would panic in debug builds if the
// shift amount reached 32 (Rust checks shift overflow), so use the
// built-ins. They are the canonical form, not a convenience wrapper.

09The mask toolkit

Shifts position a bit; masks act on it. The whole toolkit is four bitwise operators and a couple of idioms, and almost every higher-level trick in this tutorial is built from them^[1]. Knuth catalogs the scholarly version under "Bitwise Tricks and Techniques"^[4].

Operation	Expression	What it does
Test bit i	`(x >> i) & 1`	Isolate one bit as 0 or 1.
Set bit i	`x \| (1u << i)`	Force the bit to 1, leave the rest.
Clear bit i	`x & ~(1u << i)`	Force the bit to 0.
Toggle bit i	`x ^ (1u << i)`	Flip the bit.
Low n bits set	`(1u << n) - 1`	A mask of n ones. UB-adjacent: at n = 32 the shift itself is undefined.
Extract a field	`(x >> shift) & mask`	Pull a packed sub-value out.
Isolate lowest set bit	`x & -x`	Keep only the rightmost 1. The basis of bitset iteration.
Clear lowest set bit	`x & (x - 1)`	Drop the rightmost 1. Loop on this to visit set bits.

The last two deserve a second look because they fall straight out of two's complement. -x is ~x + 1, which flips every bit up to and including the lowest set bit, so x & -x keeps exactly that bit. And subtracting 1 borrows through the trailing zeros up to the lowest set bit, so x & (x - 1) erases it. Together they let you walk the set bits of a word in as many steps as there are set bits, not 32 or 64. That is the engine behind iterating a bitset, which is the next section.

10Flags, bitsets & bitboards

The most common bit work in a game is the humblest: one integer, one bit per yes/no fact. Sixty-four booleans in a single 64-bit word, tested and combined with the operators from §9, no cache misses, no branches.

Collision layers and masks

Physics engines decide whether two objects can collide with a pair of bitmasks. Each body has a category (one bit: which layer it is) and a mask (which layers it accepts). Two bodies collide only if each one's category is in the other's mask: (catA & maskB) && (catB & maskA). Box2D spells this exactly this way^[14]; Unity's layer collision matrix is the same idea wearing a UI. Two ANDs replace a lookup table.

Body A

category (pick one)

mask (accepts)

cat = - · mask = -

Body B

category (pick one)

mask (accepts)

cat = - · mask = -

PASS THROUGH

Bitsets as component signatures

An ECS often gives each entity a bitset "signature": bit k set means the entity has component k. A system that wants everything with Position and Velocity builds a query mask and accepts an entity when (signature & queryMask) == queryMask. Iterating the matching components means walking the set bits with the x & (x - 1) idiom from §9. The same pattern drives free-list allocators (a set bit is a free slot) and dirty-flag tracking.

Bitboards

Board games take this to its logical end. A bitboard packs an 8×8 board into a 64-bit word, one bit per square. The payoff is that a shift moves every piece at once: shifting the whole board left by 8 slides every piece one rank north, by 1 one file east. The catch is wraparound. Shifting east moves a piece on the h-file onto the a-file of the next rank, so you AND with a "not the a-file" mask to delete the wrapped bits^[14]. Population count gives material; count-trailing-zeros gives the next occupied square.

board: -
last move: click a square, then shift

Place a piece on the right edge (the h-file) and shift east: it vanishes instead of teleporting to the left edge, because notAFile masked away the wrapped bit. Drop the mask and that piece would reappear on the far side, one rank up. The mask is the difference between a move generator and a bug.

11Packing: color & bitfields

Shifts and masks are how you stuff several small values into one word and pull them back out. The canonical example is color. An 8-bit-per-channel RGBA pixel is four bytes packed into a 32-bit word, and the whole codec is shifts on the way in, masks on the way out.

Pack and unpack RGBA8888

uint32_t pack(uint8_t r, uint8_t g, uint8_t b, uint8_t a) {
    return (uint32_t(r) << 24) | (uint32_t(g) << 16)
         | (uint32_t(b) <<  8) |  uint32_t(a);
}

// Cast to uint32_t BEFORE shifting. (uint8_t(r) << 24) would promote to
// int and push a 1 into the sign bit: the signed-left-shift UB from §7.

uint8_t red(uint32_t c)   { return (c >> 24) & 0xFF; }
uint8_t green(uint32_t c) { return (c >> 16) & 0xFF; }
uint8_t blue(uint32_t c)  { return (c >>  8) & 0xFF; }
uint8_t alpha(uint32_t c) { return  c        & 0xFF; }

fn pack(r: u8, g: u8, b: u8, a: u8) -> u32 {
    (r as u32) << 24 | (g as u32) << 16 | (b as u32) << 8 | a as u32
}

// In Rust the `as u32` widens before the shift, so there is no sign-bit
// hazard. Shifting a u32 left into bit 31 is just a value, never UB.

fn red(c: u32)   -> u8 { ((c >> 24) & 0xFF) as u8 }
fn green(c: u32) -> u8 { ((c >> 16) & 0xFF) as u8 }
fn blue(c: u32)  -> u8 { ((c >>  8) & 0xFF) as u8 }
fn alpha(c: u32) -> u8 {  (c        & 0xFF) as u8 }

Squeeze the same color into 16 bits and you get RGB565, still used for framebuffers and textures where bandwidth matters more than fidelity^[15]. Packing drops the low bits of each channel (r >> 3, g >> 2, b >> 3); unpacking has to invent them back, usually by replicating the high bits down. That reconstruction is lossy, which is what produces banding in 16-bit gradients. The widget shows the error.

R 224

G 122

B 95

A 255

input

stored

packed: -
bits: -

Endianness is not bit order

A common confusion: people assume that packing colors with shifts is sensitive to endianness. It is not. Shifts operate on the value in a register, where bit i always means 2ⁱ, the same on every machine. Endianness only describes the order bytes land in memory when you store that register. (r << 24) | ... produces the identical number on a little-endian and a big-endian CPU; the bytes only reshuffle when you write the word to a buffer and read it back as bytes. When you genuinely need to swap byte order (network packets, file formats), that is a separate operation: bswap, or std::byteswap in C++23.

12Morton order & swizzling

Packing puts independent fields side by side. does something subtler: it interleaves the bits of two coordinates so that points close together in 2D get keys close together in 1D. Guy Morton described it in a 1966 IBM technical report on file sequencing for a geodetic database^[6], and it has been quietly load-bearing in graphics ever since.

The encode is pure shift-and-mask: spread x into the even bit positions, spread y into the odd positions, OR them together. Walking the resulting indices 0, 1, 2, 3 traces a recursive Z (hence "Z-order curve"), and that Z keeps spatial neighbors mostly adjacent in memory. GPUs lay textures out in Morton-order tiles, or close relatives of them, so that a 2D sampling neighborhood lands on one cache line. Linear octrees and quadtrees store nodes sorted by Morton code, computing parent and child relationships with shifts instead of pointer chases (the spatial partitioning tutorial builds those trees). Parallel BVH builders sort primitives by Morton code as their first step.

index: -
decoded: -
interleave

blue = x bits (even positions) · red = y bits (odd positions)

For wider coordinates you do not loop bit by bit. The standard trick is a sequence of shift-and-mask steps with "magic" constants that spread the bits in logarithmically few operations, the inverse of the SWAR fold you will see in §15^[1]. And on recent x86, a single PDEP instruction deposits bits into a mask in one go, collapsing the whole interleave to two instructions (§16).

13Fixed-point arithmetic

Before hardware floating point was free, and in any system that needs bit-exact reproducibility today, fractions are stored as integers with an implied binary point. The shift you keep in your head is the whole idea. A Q16.16 value is a 32-bit integer; the real number it represents is that integer divided by 2¹⁶.

Conversion is a shift: an integer n becomes n << 16. Addition and subtraction just work, because the points line up. Multiplication is where the shift earns its keep: multiplying two Q16.16 values gives a result scaled by 2³², so you shift the product right by 16 to put the point back, using a 64-bit intermediate so the high bits do not fall off. Division shifts the other way.

Q16.16 fixed-point multiply

using fixed = int32_t;            // Q16.16
constexpr int SHIFT = 16;

fixed from_int(int n)    { return n << SHIFT; }
fixed mul(fixed a, fixed b) {
    // 32x32 -> 64 so the top bits survive, then shift the point back.
    int64_t wide = int64_t(a) * int64_t(b);
    return fixed(wide >> SHIFT);
}
fixed div(fixed a, fixed b) {
    int64_t wide = (int64_t(a) << SHIFT);
    return fixed(wide / b);
}

type Fixed = i32;                 // Q16.16
const SHIFT: u32 = 16;

fn from_int(n: i32) -> Fixed { n << SHIFT }
fn mul(a: Fixed, b: Fixed) -> Fixed {
    // Widen to i64 so the product does not overflow, then re-align.
    let wide = a as i64 * b as i64;
    (wide >> SHIFT) as Fixed
}
fn div(a: Fixed, b: Fixed) -> Fixed {
    (((a as i64) << SHIFT) / b as i64) as Fixed
}

Why bother in 2026? Determinism. Floating-point results can differ across compilers, CPUs, and optimization flags (fused multiply-add here, a different rounding there), which quietly desynchronizes a lockstep simulation where every client must compute bit-identical results^[18]. Fixed-point integer math is exactly reproducible by construction, so RTS netcode and replay systems still reach for it. It was also the only option on consoles without an FPU: the SNES and the Game Boy Advance did their world math in fixed-point shifts.

raw integer: -
real value: -
smallest step: -
split: -

14Two famous hacks

Two bit tricks earned their fame in games. Both reinterpret the same bits as different types, which is the deepest move in the bit-manipulation playbook.

The fast inverse square root

Computing 1/√x for vector normalization once cost a division and a square root, both slow. The trick popularized by Quake III Arena in 1999 replaces them with a subtraction and a shift^[19]. Reinterpret the float's 32 bits as an integer; because IEEE-754 lays out the exponent in the high bits, the integer value is roughly a scaled logarithm of the float. A right shift by one halves that integer, which halves the log, which square-roots the value; subtracting from the magic constant 0x5f3759df negates it, turning √x into 1/√x. Reinterpret back to float and you have a seed good to about 3%, then one Newton-Raphson step sharpens it to a fraction of a percent. Chris Lomont reverse-engineered why the constant is near-optimal in 2003^[8].

signexponent (8)mantissa (23)

x = 2^4.0

bits as int: -
exponent field: -
i >> 1: -

raw approx

-
error -

+1 Newton

-
error -

true 1/√x

The single right shift i >> 1 is the heart of it: halving the integer reinterpretation of a float roughly halves its exponent, which is how a shift can approximate a square root. The red error curve is the bare bit-hack; the green curve is one Newton step later. The purple line marks your current x.

Xorshift: a PRNG that is only shifts

George Marsaglia's 2003 xorshift generators produce a stream of pseudo-random numbers from nothing but shifts and XORs^[9]. Three lines, no multiply, no table: XOR the state with a shifted copy of itself, three times, with carefully chosen shift amounts. A 32-bit xorshift runs through all 2³²−1 non-zero states before repeating. It is fast enough for particle jitter, loot rolls, and procedural placement, though it fails some stringent statistical tests and is not suitable for cryptography; later variants like xorshift128+ fixed the quality issues^[9].

points plotted: 0

uint32_t x = seed;          // never zero
x ^= x << 13;
x ^= x >> 17;
x ^= x << 5;                 // x is the next output

15Counting bits: popcount, clz, ctz

Three queries about a word show up everywhere: how many bits are set (popcount), how many leading zeros (clz), how many trailing zeros (ctz). Popcount gives set cardinality and Hamming distance. Leading zeros give an integer log base two, which sizes hash tables and buddy allocators. Trailing zeros give the index of the lowest set bit, which is how you iterate a bitset.

Modern hardware does all three in one instruction: POPCNT arrived with SSE4.2 in 2008, and LZCNT and TZCNT came with the BMI extensions^[10]. There is a sharp edge: the older bit-scan instructions BSR and BSF leave their result undefined when the input is zero, so clz and ctz of zero need care; LZCNT and TZCNT were defined to return the operand width instead. C++20 wraps all of this in <bit>: std::popcount, std::countl_zero, std::countr_zero, and std::bit_width, each lowering to the hardware instruction when present^[12].

When the instruction is not available, the classic fallback is a population count: treat the word as many small lanes, add adjacent pairs, then nibbles, then bytes, halving the lane count each step. Thirty-two bits fall in five operations instead of a 32-iteration loop. The technique goes back to HAKMEM, MIT AI Memo 239 from 1972^[3], and is tabulated in Hacker's Delight^[2]. Trailing-zero counts have an equally clever software form using a de Bruijn sequence and a small lookup table^[7].

naive loop

SWAR fold

hardware

The naive loop tests one bit per iteration: 32 steps. The SWAR fold sums in parallel, halving the field width each row, and reaches the same total in 5. The hardware POPCNT does it in a single instruction. Run the fold and watch the partial sums accumulate as the lane width doubles each row.

16Hardware bit-manip: BMI1 and BMI2

Since 2013, x86 has shipped two extension sets, BMI1 and BMI2, that turn several of this tutorial's multi-step idioms into single instructions^[10]. Compilers emit them automatically when you target a recent enough CPU; recognizing them in a disassembly tells you what the source was doing.

Instruction	Replaces	Meaning
`BLSR`	`x & (x - 1)`	Clear the lowest set bit.
`BLSI`	`x & -x`	Isolate the lowest set bit.
`ANDN`	`~a & b`	AND-NOT in one op, no separate NOT.
`BZHI`	`x & ((1 << n) - 1)`	Zero the bits at and above position n.
`SHLX / SHRX / SARX`	`<<` / `>>`	Shift by a variable count without touching flags.
`PEXT / PDEP`	spread/gather loops	Extract or deposit bits selected by a mask, in parallel.

Two are worth dwelling on. The variable-count shifts SHLX, SHRX, and SARX do not read or write the flags register and do not pin the count to the CL register, so they avoid a false dependency that the legacy shl reg, cl form carries; each is a single micro-op^[11]. And PDEP collapses the entire Morton-encode from §12 to one instruction by depositing coordinate bits straight into an interleave mask.

One caveat before you reach for PEXT

PEXT and PDEP are single-cycle on Intel and on AMD from Zen 3 onward, but on AMD's Zen, Zen+, and Zen 2 they are implemented in microcode and take on the order of a hundred cycles^[11]. Code that assumed they were fast (notably some chess engines using PEXT for sliding-piece attack tables) ran dramatically slower on those parts. If you dispatch on PEXT, gate it on the vendor and family, not just the feature bit.

17Pitfalls

The recap, as a checklist of the mistakes that survive review because the code looks correct.

Signed division is not a shift. x / 8 and x >> 3 diverge for negative, inexact values. Write the division and let the compiler add the bias.
Shift count at or above the width is undefined. value << 32 on a 32-bit type is UB in C and C++, even though x86 masks it to a no-op and Java and JavaScript define the mask. Never let a count reach the width.
Arithmetic versus logical is decided by the operand type. Shifting through an unsigned reinterpretation gives a logical shift even when the bits mean a negative number.
Left-shifting a negative or into the sign bit is undefined. Cast to unsigned before packing values into the top bit, as in the RGBA pack from §11.
The rotate idiom is UB at zero. (x << n) | (x >> (32 - n)) breaks when n is 0. Use std::rotl, or mask the complement.
Shifting for speed is cargo-culting. The compiler strength-reduces multiplies and divides by constants. Hand-written shifts buy clarity about intent, not cycles.
(1 << n) - 1 overflows at the width. Building an all-ones mask of n bits is itself a shift, undefined when n equals the type width. Special-case it or use BZHI.
Endianness is byte order in memory, not bit weight. Shifts are endianness-independent; only serializing the word to bytes involves endianness.
PEXT/PDEP are slow on pre-Zen-3 AMD. Gate on vendor and family, not just the CPUID feature flag.
Bit-scan of zero is undefined on the legacy instructions. BSR/BSF leave the result undefined for input 0; prefer LZCNT/TZCNT or guard the zero case.

18What's intentionally missing, and what's next

This tutorial stayed on scalar integer bit work. Several neighbors were left out on purpose. SIMD bit operations (operating on 128, 256, or 512 bits at once, including the pshufb byte-shuffle as a parallel lookup) live in the SIMD tutorial. Bit-reversal permutations and the butterfly addressing behind the FFT are an audio and signal-processing topic. Gray codes, where consecutive values differ in one bit, matter for rotary encoders and some error-tolerant schemes. Entropy coders like rANS pack fractional bits for compression. Constant-time bit tricks for cryptography deliberately avoid data-dependent branches and timing, a discipline of its own. Each is a shift-and-mask story with different stakes.

For where these bits go next: the hash tables tutorial uses the shift-and-xor finalizers and Fibonacci hashing that this tutorial's mixing primitives feed; the spatial partitioning tutorial builds the Morton-ordered octrees from §12; and the memory model tutorial covers the alignment masks ((p + a - 1) & ~(a - 1)) that round pointers up to a power-of-two boundary.

19Sources

Sean Eron Anderson. "Bit Twiddling Hacks." Stanford Graphics Lab, 1997–2005. graphics.stanford.edu/~seander/bithacks.html. The canonical web reference for branchless bit idioms: lowest-set-bit isolation, bit spreading for Morton codes, parallel population counts, and sign tricks. Code is in the public domain.
Henry S. Warren Jr. Hacker's Delight, 2nd ed., Addison-Wesley, 2012. The standard book on integer bit manipulation. Source for the branchless absolute value, the rotate pattern-matching, the SWAR population counts, and the division-by-constant derivations.
Michael Beeler, R. William Gosper, Rich Schroeppel. "HAKMEM." MIT AI Memo 239, February 1972. dspace.mit.edu. Item 169 gives the parallel (SWAR) population count on the 36-bit PDP-6/10, the ancestor of every fast popcount fallback.
Donald E. Knuth. The Art of Computer Programming, Volume 4A: Combinatorial Algorithms, Part 1, §7.1.3 "Bitwise Tricks and Techniques." Addison-Wesley, 2011. The exhaustive scholarly treatment of bit operations, magic masks, and broadword computing.
Torbjörn Granlund, Peter L. Montgomery. "Division by Invariant Integers using Multiplication." Proc. ACM SIGPLAN PLDI, 1994, pp. 61–72. PDF. The algorithm compilers use to turn division by a constant into a reciprocal multiply and shift, including the bias for signed division covered in §6.
Guy M. Morton. "A Computer Oriented Geodetic Data Base; and a New Technique in File Sequencing." IBM Ltd., Ottawa, 1966. The first description of the Z-order (Morton) curve: interleaving coordinate bits to preserve spatial locality in a 1D key.
Charles E. Leiserson, Harald Prokop, Keith H. Randall. "Using de Bruijn Sequences to Index a 1 in a Computer Word." MIT Laboratory for Computer Science, 1998. PDF (MIT, via Internet Archive). The multiply-and-lookup method for count-trailing-zeros without a hardware instruction.
Chris Lomont. "Fast Inverse Square Root." Purdue University tech note, 2003. PDF. Derives why 0x5f3759df is near-optimal and searches for a marginally better constant. The reference analysis of the Quake III hack.
George Marsaglia. "Xorshift RNGs." Journal of Statistical Software, 8(14):1–6, 2003. jstatsoft.org. The shift-and-xor random number generators, with periods 2^k−1. Later work (Panneton & L'Ecuyer; Vigna) analyzes their statistical weaknesses and the scrambled successors.
Intel. Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 2: Instruction Set Reference. Entries for SAL/SAR/SHL/SHR (count masked to 5 or 6 bits), ROL/ROR, SHLD/SHRD, SARX/SHLX/SHRX, POPCNT, LZCNT/TZCNT, and the BMI1/BMI2 sets. Mirrored at felixcloutier.com/x86.
Agner Fog. "Instruction Tables" and "The Microarchitecture of Intel, AMD and VIA CPUs." Technical University of Denmark. agner.org/optimize. Source for shift and multiply latencies and throughputs, the single-micro-op BMI2 shifts, and the microcoded PEXT/PDEP on pre-Zen-3 AMD.
ISO C++. <bit> header (P0553 "Bit operations" and P0556), and P1236/P0907 making signed integers two's complement with defined arithmetic right shift. C++20. Reference: cppreference.com/w/cpp/header/bit. Defines rotl, rotr, popcount, countl_zero, countr_zero, bit_width.
David Wragg. "Shift Instructions." 2012. david.wragg.org. Explains why C leaves over-width shifts undefined: so the compiler can emit the bare x86 shift, whose count is already masked, without a guard.
Chess Programming Wiki. "Bitboards," "General Setwise Operations," "BMI2." chessprogramming.org. The reference for board-as-64-bit-word representations, the file-mask shift moves, and population-count material evaluation.
Khronos Group. OpenGL and Vulkan format specifications, packed pixel formats (GL_UNSIGNED_SHORT_5_6_5, VK_FORMAT_R5G6B5, R8G8B8A8). registry.khronos.org. The bit layouts for the color packing in §11.
Christer Ericson. "Order your graphics draw calls around!" realtimecollisiondetection.net, 2008. realtimecollisiondetection.net. The standard reference for bit-packed draw-call sort keys feeding a radix sort.
Randal E. Bryant, David R. O'Hallaron. Computer Systems: A Programmer's Perspective, 3rd ed., Pearson, 2015, ch. 2. The teaching reference for two's complement, the arithmetic-shift division bias, and integer overflow semantics.
Glenn Fiedler. "Floating Point Determinism." Gaffer On Games, 2010. gafferongames.com. Why floating-point results vary across compilers and CPUs, and why deterministic lockstep simulations turn to fixed-point integer math.
id Software. Quake III Arena source release, code/game/q_math.c (Q_rsqrt), GPL, 2005 (code circa 1999). github.com/id-Software. The shipped form of the fast inverse square root, magic constant and all.