Tutorial 09 · Engine Programming

x86-64 Assembly
from Scratch

Sixteen general-purpose registers. Sixteen 256-bit vector registers since AVX, thirty-two 512-bit ones on AVX-512 hardware. An instruction encoding that ranges from one byte to fifteen, a calling convention that disagrees between Linux and Windows, and a microarchitecture that turns your serial-looking code into a wide, speculative dispatch across eight or more execution ports. The view from the assembly layer is what tells you why your inner loop runs at the speed it does. We work from the CPU model up to a SIMD matrix-vector kernel and a μop scheduling demo, with every claim cited to a vendor manual or a refereed paper.

Time~65 min LevelEngine programmer, intermediate to senior PrereqsYou can read C or C++ comfortably. You know what a stack frame and a function pointer are. The Memory Model tutorial pairs naturally with §15 of this one. HardwareAny x86-64 machine from the last decade

01Why assembly still matters when nobody writes it

The honest answer is that very few production games ship hand-written assembly. The compiler is usually within a few percent of optimal on hot paths, and a single ABI mismatch in hand-written code can corrupt a callee-saved register in a way that crashes three frames later. The reason to learn assembly in 2026 is not to write it. It is to read it. The compiler's output is the ground truth for what your CPU is being asked to do, and reading it answers questions that source-level reasoning can't.

Concrete situations where the disassembly is the only authoritative source:

The optimizer didn't do what you assumed. A loop you thought was unrolled wasn't. A condition you thought compiled to cmov compiled to a branch. A small struct you thought was returned in registers got spilled to memory. Godbolt^[1] exists for these moments, and the answer is always in the assembly.
A bug only reproduces under release optimizations. The compiler reordered or elided code in a way that exposed a latent UB. Reading the assembly is faster than printf-debugging a release build.
You're integrating native libraries across an ABI. A C++ function calling into FFI code from Rust, the Switch Homebrew toolchain, an FMOD plugin, or a custom JIT, has to follow the platform's calling convention exactly. The only way to verify is to look at what each side emits.
Microarchitecture tuning. When a hot SIMD loop is running at half throughput, the answer lives in the dependency chain or port pressure of the emitted instructions, not in the source.
Debug symbols at the crash site. Reading a stack trace from a stripped binary in production means reading add rsp, 0x28; ret and inferring the function.

Every section of this tutorial answers a question that has surfaced in real engine work. The history is short, the CPU model is the standard mental picture a shipping engine programmer needs, and the SIMD walkthrough in §13 is a variant of the matrix-vector transform almost every renderer ships^[2].

What you'll have by the end

A working ability to read AT&T and Intel syntax x86-64 disassembly. Concrete knowledge of the System V AMD64 ABI (Linux, macOS, BSD, PlayStation) and the Microsoft x64 ABI (Windows, Xbox) and where they differ. The instruction encoding well enough to read a hex dump as instructions. The SIMD vocabulary (lanes, packed/scalar, SSE/AVX/AVX-512) and what a 4×4 matrix-vector multiply looks like at the assembly level. The microarchitectural vocabulary (μops, ports, dependency chains, retirement) that the Intel and AMD optimization manuals^[3]^[4] and Agner Fog's manuals^[5] are written in. Six live, in-browser widgets you can step through.

A tiny disassembly to set the tone

Two lines of C. Compile with gcc -O2 on x86-64 Linux:

add.c

int add(int a, int b) {
  return a + b;
}

The compiler emits:

add.s · Intel syntax, AT&T-style comments

add:
  lea   eax, [rdi + rsi]   ; eax = rdi + rsi (load-effective-address used as 3-op add)
  ret                      ; pop return address from [rsp], jump to it

Four lessons hiding in two instructions. First, the function arguments arrived in rdi and rsi because the System V ABI^[6] puts the first two integer arguments there; the same code on Windows would have used rcx and rdx^[7]. Second, the return value goes in rax (32 bits of which is eax). Third, the compiler used lea (Load Effective Address) as a three-operand non-destructive add: lea is the closest thing x86 has to an ARM-style add dst, src1, src2, and the compiler reaches for it constantly. Fourth, there is no stack frame at all, because nothing here required spilling state and the function is a leaf.

The rest of this tutorial unpacks each of those four observations from first principles. By the end you will be able to look at a fifty-line dump and tell where the register allocator gave up, where a function was inlined, and which loop the compiler vectorized.

02A short history of the architecture you're reading

x86-64 is the carrying state of nearly fifty years of architectural decisions, each made with the previous decade of code in mind. The instruction encoding has prefixes whose purpose is to extend an earlier 16-bit encoding. The calling conventions changed because the register count tripled. The vector extensions were added in four named generations. None of this is obvious from the documentation; each piece makes sense only against what it replaced.

1978

Intel 8086.^[9] A 16-bit CISC processor with eight named 16-bit registers (AX, BX, CX, DX, SI, DI, BP, SP) and a segmented address space. The instruction encoding is the basis of every x86 CPU since: most instructions are still recognized in their 1978 form, and the assembler mnemonics are mostly unchanged.

1985

Intel 80386. 32-bit registers (EAX, EBX, …), the flat-32 address space that displaced segmentation in practice, and protected mode. Every register from the 8086 was widened to 32 bits with an "E" prefix, and the encoding was extended with a prefix byte to choose between 16-bit and 32-bit operand sizes. The pattern of "extend by a prefix, keep the old encoding intact" set the precedent for everything that followed.

1997

MMX. Intel's first SIMD extension: 64-bit integer vectors aliased onto the FPU registers. Mostly abandoned by 2005; the AMD64 instruction set deprecated MMX-only code paths and the SSE family replaced it.

1999

SSE (Streaming SIMD Extensions). Eight new 128-bit registers (XMM0–XMM7), packed-single-precision-float arithmetic, and the first explicit prefetch instructions^[10]. SSE became the floating-point baseline; AMD64 made SSE2 (added 128-bit integer ops and packed doubles) part of the mandatory ISA, which is why every x86-64 compiler emits XMM-based floating point and the 1985-era x87 FPU stack is reserved for legacy paths.

2003

AMD64 (x86-64).^[8] AMD's extension to 64-bit, debuting on the K8 Opteron. Sixteen general-purpose registers (R8–R15 added, the eight 8086-derived registers widened to RAX, RBX, …). Sixteen XMM registers. A new REX prefix encodes the extended registers and the 64-bit operand size. RIP-relative addressing replaces the segment-based addressing that 32-bit position-independent code used. Intel adopted AMD's design the following year on the Prescott-based Xeon (under the name "Intel 64", originally "EM64T").

2008

SSE4.2. String/CRC instructions, the family the C++ standard library still uses for std::find on integer ranges and for std::hash when SSE4.2 is available.

2011

AVX. 256-bit YMM registers (the lower 128 bits alias XMM), a new three-operand VEX encoding^[10], and the separately-introduced FMA3 fused-multiply-add extension that follows on Haswell (2013). AMD's Jaguar (PS4/Xbox One, 2013) brought AVX (not AVX2) to consoles; PS5 and Xbox Series X/S (AMD Zen 2, 2020) brought AVX2. AVX2 is the practical SIMD baseline most current PC and console titles target.

2016–2022

AVX-512. 512-bit ZMM registers, thirty-two of them, with separate mask registers (K0–K7). First shipped on the Knights Landing Xeon Phi (2016) and the Skylake-X high-end desktop (2017); reached client mobile on Ice Lake (2019). Intel disabled it on client P-cores from Alder Lake (2021) through Arrow Lake (2024) once the chips shipped with E-cores that lack the unit. AMD shipped AVX-512 starting with Zen 4 (2022) and continued it on Zen 5. AVX-512 is the only x86 vector instruction set with first-class predication via the K-mask registers, which is part of why it remains worth writing by hand when the workload is in cache.

2023

APX (Advanced Performance Extensions, announced).^[11] Intel's proposal to add sixteen more general-purpose registers (R16–R31), three-operand encoding for legacy integer instructions, conditional loads and stores. Shipping silicon expected mid-decade; mentioned here only so the register-count number stays current.

The consequence for what you read in 2026: most disassembly is plain x86-64 with SSE2 floating point. Hot paths use AVX2 (256-bit) on PC and on PS5 / Xbox Series. AVX-512 shows up in compression, simulation, and ML kernels; rarely in shipped game code on consumer hardware because of the Intel client-side gap from Alder Lake through Arrow Lake. ARM, which we touch briefly in §13, is the other architecture worth knowing: Nintendo Switch and Switch 2, Apple Silicon, and a growing fraction of Windows-on-Arm devices all ship it, and the calling conventions, register file, and SIMD model (NEON, SVE2) follow different rules.

Intel syntax vs AT&T syntax: which is which, and which should I learn?

Two notational conventions exist for x86 assembly. The instructions are the same; the surface syntax differs:

Intel syntax writes mov eax, ebx with destination first. No register sigils. Memory references look like [rdi + 8]. Used by NASM, MASM, the Intel and AMD reference manuals^[10], and Compiler Explorer's default. Also the syntax GCC and Clang emit if you pass -masm=intel.

AT&T syntax writes movl %ebx, %eax with source first. Registers prefixed with %, immediates with $, and a suffix on the mnemonic encodes the operand size (l = long = 32 bits). Memory references look like 8(%rdi). Used by the GNU assembler (as) and the default output of objdump on Linux. Inherited from the AT&T Unix assembler convention from the early 1980s.

Read both. Intel syntax matches the vendor manuals; AT&T is what you'll see in objdump -d output from a Linux build. This tutorial uses Intel syntax in code listings and notes AT&T differences where they matter.

03The CPU you're actually programming for

An assembly instruction is not what executes. A modern x86 core decodes each instruction into one or more , schedules those μops across multiple execution units in parallel, and retires them in program order at the back end^[3]. The instruction stream you read is the contract; what runs is a heavily reordered, register-renamed, speculatively-executed shadow of it. This section is the sketch of that machine, enough to make sense of why some instructions are "free" and others stall a whole pipeline.

A modern x86 pipeline, simplified to the parts you care about reading assembly for:

Fetch. The CPU reads 16 to 32 bytes per cycle from the L1 instruction cache (32 KB on Skylake- and Zen-class cores; 48–64 KB on Sunny Cove, Golden Cove, and Lion Cove^[12]) into an aligned fetch window.
Decode. Four to six decoders convert variable-length x86 instructions into fixed-format μops on recent Intel cores; Zen 3 and 4 use four. Most simple instructions decode to a single μop; string ops, integer divide, and a handful of complex memory addressing modes decode to several.
μop cache. Recent x86 cores keep a decoded-μop cache in front of the rename stage. A hot loop that fits in the cache (1.5K μops on Skylake; 4K on Golden Cove and later^[12]) bypasses the legacy decoders entirely. AMD's equivalent on Zen 4 holds about 6.75K instructions.
Rename and dispatch. The renamer maps each architectural register named in the instruction to a physical register from a pool of several hundred. This is what lets the CPU execute "add rax, 1; add rax, 2" with no false dependency between the two adds: each writes to a different physical register, and the renamer tracks which physical register is the current rax.
Schedule and execute. The scheduler issues ready μops to a set of execution ports: 8 on Skylake, 10 on Sunny Cove, 12 on Golden Cove, and similar on recent Zen^[13]. A given port can execute one μop per cycle. Some ports are specialized; integer divide, store-data, branches, and vector multiply each live on a subset.
Retire. A reorder buffer holds completed μops until every earlier μop has finished. Retirement happens in program order at 4 μops per cycle on Skylake, rising to 6 on Golden Cove and 8 on Lion Cove and Zen 4^[12]. Once an instruction's μops retire, its writes become architecturally visible; until then the CPU can roll them back (on a branch mispredict, an exception, or a memory-order violation).

The two practical consequences for reading assembly. First, an instruction's latency (cycles from issue to result available) is different from its throughput (number of times per cycle the CPU can issue it). A vmulps on Skylake has a latency of 4 cycles but a throughput of two per cycle^[13]: a chain of eight dependent multiplies takes 8 × 4 = 32 cycles; eight independent ones issue in 4 cycles and the last result lands 4 cycles after that, on the order of 8 cycles total. Same instruction count, ~4× difference in finish time. Second, the difference between a "fast" and "slow" implementation of the same algorithm at the assembly level is often not the instruction count; it is whether the instructions form a serial dependency chain pinning everything to one port, or whether they spread out across the execution ports the CPU has on offer.

The widget below schedules a simple loop's μops onto four ALU ports and watches what stalls and what doesn't. We come back to this picture in §16 with a real worked example:

Sixteen add μops scheduled on a four-port integer ALU, modeled after Skylake's ports 0/1/5/6 (each accepts one integer add per cycle^[13]). In serial mode each add waits for the previous one (one chain of dependent adds; 16 cycles). In parallel mode the same sixteen adds use four independent accumulators (one chain per port; 4 cycles). Same instruction count, same arithmetic, a 4× difference in finish time.

04The register file

x86-64 exposes sixteen general-purpose 64-bit registers and sixteen (with AVX-512: thirty-two) vector registers. Knowing the names is half of reading a disassembly:

64-bit	32-bit	16-bit	8-bit low	Conventional role (System V)
`rax`	`eax`	`ax`	`al`	return value (int); scratch
`rbx`	`ebx`	`bx`	`bl`	callee-saved
`rcx`	`ecx`	`cx`	`cl`	4th integer arg
`rdx`	`edx`	`dx`	`dl`	3rd integer arg; high half of 128-bit return
`rsi`	`esi`	`si`	`sil`	2nd integer arg
`rdi`	`edi`	`di`	`dil`	1st integer arg
`rbp`	`ebp`	`bp`	`bpl`	frame pointer; callee-saved
`rsp`	`esp`	`sp`	`spl`	stack pointer
`r8`	`r8d`	`r8w`	`r8b`	5th integer arg
`r9`	`r9d`	`r9w`	`r9b`	6th integer arg
`r10`–`r11`	…			scratch
`r12`–`r15`	…			callee-saved

Writes to a 32-bit register zero-extend into the 64-bit register. Writes to an 8-bit or 16-bit register do not: they leave the upper bits unchanged. This is an AMD64 spec rule^[8], not a compiler convention, and it is the reason compilers emit xor eax, eax to zero the full rax: the 32-bit write zero-extends and the encoding is one byte shorter than mov rax, 0.

Vector registers come in three sizes that alias each other:

xmm0–xmm15 (with AVX-512: xmm0–xmm31): 128 bits. SSE/SSE2/SSE3/SSE4 baseline.
ymm0–ymm15 (with AVX-512: ymm0–ymm31): 256 bits. AVX/AVX2. The lower 128 bits of ymm0 is xmm0.
zmm0–zmm31: 512 bits. AVX-512. The lower 256 bits is ymm.

The aliasing is not free. On Skylake-class Intel cores, a naive mix of legacy-encoded SSE (xmm) and VEX-encoded AVX (ymm) without an intervening vzeroupper can incur a state-save penalty of tens of cycles per transition^[3]. Sunny Cove and later replace the hard stall with a smaller, more amortized cost, but the rule the compiler follows hasn't changed: emit vzeroupper at AVX-to-SSE boundaries. Inline assembly that mixes the two has to do the same.

Step through eight instructions and watch the register file update. Each row is one 64-bit register; columns are the high and low 32 bits. The arrow shows the current instruction; orange highlights the just-written register:

05Anatomy of an instruction

An x86-64 instruction is between 1 and 15 bytes. Its decoded form is a sequence of optional and mandatory fields, in this order^[10]:

Legacy prefixes (0–4 bytes). Address-size override, operand-size override, segment override, repeat prefix, LOCK.
REX prefix (0–1 byte). Required to encode the 64-bit operand size, registers R8–R15, or the SIL/DIL/BPL/SPL byte registers. Starts with the high nibble 0x4; the low nibble carries four bits W, R, X, B.
Opcode (1–3 bytes). Selects the instruction. Sometimes part of the opcode encodes a register too (the "+r" forms).
ModRM (0–1 byte). For instructions with operands, encodes the addressing mode and one or two register fields.
SIB (0–1 byte). Scale/Index/Base for the memory addressing modes that need it ([rbx + rcx*4 + 0x10]).
Displacement (0, 1, 2, or 4 bytes). The constant offset in a memory operand.
Immediate (0, 1, 2, 4, or 8 bytes). A literal value baked into the instruction.

AVX adds two more prefix families that replace REX and pack a "non-destructive source" field, letting vaddps zmm0, zmm1, zmm2 compute zmm0 = zmm1 + zmm2 without clobbering either source. The VEX prefix (2 or 3 bytes) covers AVX/AVX2; the EVEX prefix (4 bytes) covers AVX-512 and adds the mask register selector and rounding controls.

Click an instruction below to see its bytes broken out. Most production disassemblers can show this view (objdump -d -M intel --show-raw-insn, llvm-mc --show-encoding); seeing it once builds the muscle memory:

Each instruction's encoding is from the Intel SDM Volume 2 instruction reference^[10]. Real instruction encoders can pick different (equally valid) encodings of the same instruction; for example, add rax, rbx can be encoded with the destination in the ModRM reg field or in the ModRM r/m field. We show the canonical form GCC emits.

What's a ModRM byte actually?

One byte split into three fields: mod (top 2 bits), reg (middle 3 bits), r/m (bottom 3 bits). The reg field names a register; the r/m field names either a register or a memory operand depending on what mod says. A few examples:

mod=11 means "r/m is a register" (so the instruction has two register operands). mod=00 means "r/m is a memory operand with no displacement" (e.g., [rax]). mod=01 and mod=10 add an 8-bit or 32-bit displacement. r/m=100 is a sentinel meaning "a SIB byte follows," used for the indexed addressing modes like [rax + rcx*4].

The reg and r/m fields are 3 bits each. To name the 16 x86-64 registers you need 4 bits; the missing bit comes from the REX prefix's R and B bits. That's why REX is required whenever your instruction touches R8–R15.

06The core instructions, by frequency

Recent x86-64 has on the order of a thousand distinct mnemonics, with several thousand encoded variants when you count operand sizes and addressing modes^[13]. The fifteen or so in the table below account for the overwhelming majority of any non-vector compiler output. Sorted roughly by how often they appear:

Mnemonic	What it does	Form you'll usually see
`mov`	Copy bits. Register-to-register, register-to-memory, memory-to-register, immediate-to-register. The basic data movement instruction.	`mov rax, [rdi + 8]`
`lea`	Compute an address (or any 3-operand arithmetic that fits the addressing modes), don't dereference. See §7.	`lea rax, [rdi + rsi*4]`
`add` / `sub`	Integer add/subtract. Two-operand: `dst = dst op src`.	`add rsp, 0x28`
`imul`	Signed multiply. Most often seen as two-operand (`imul dst, src` → dst *= src) or three-operand with an immediate (`imul dst, src, imm` → dst = src × imm).	`imul rax, rcx`
`shl` / `shr` / `sar`	Bit shift left, logical right, arithmetic right. `sar` preserves the sign bit; `shr` doesn't.	`shl rax, 3`
`and` / `or` / `xor`	Bitwise ops. `xor reg, reg` is the canonical zeroing idiom.	`xor eax, eax`
`cmp` / `test`	Set the flags register. `cmp a, b` = subtract without storing; `test a, b` = AND without storing.	`cmp rax, 0`
`j`cc	Conditional jump. `je` = jump if equal (ZF=1), `jne`, `jl`, `jg`, `jb`, `ja` (signed vs unsigned). Follows a `cmp` or `test`.	`jne loop`
`jmp`	Unconditional jump.	`jmp .L7`
`call` / `ret`	Function call and return. `call` pushes the return address and jumps; `ret` pops and jumps.	`call malloc`
`push` / `pop`	Decrement RSP by 8 and store; or load and increment by 8. Used in function prologues/epilogues for callee-saved registers.	`push rbp`
`cmov`cc	Conditional move. `cmovge dst, src` = if SF=OF then dst=src. Branchless conditionals; see §11.	`cmovl rax, rdi`
`setcc`	Set a byte to 1 if a condition holds, else 0. Often used with `movzx` to materialize a 0/1 in a register.	`setl al`
`movzx` / `movsx`	Move with zero-extension or sign-extension. Bridge between 8/16-bit and 32/64-bit registers.	`movzx eax, byte ptr [rdi]`

The cc suffix on jumps, conditional moves, and setcc is one of sixteen condition codes that test the flags register set by the most recent cmp, test, or arithmetic instruction: z/e (zero/equal), nz/ne, l/g (signed less/greater), b/a (unsigned below/above), s (negative), o (overflow), and the rest. A jcc reads the flags and conditionally jumps; a cmovcc reads the flags and conditionally moves; a setcc reads the flags and conditionally writes 1 to a byte^[10].

Memory addressing supports one form, used by mov, lea, and most others: [base + index*scale + displacement], where base and index are 64-bit registers, scale is 1/2/4/8, and displacement is a signed 8 or 32-bit constant. Any subset can be omitted. RIP-relative addressing ([rip + offset]) is the special case used for position-independent globals on x86-64.

07The `lea` swiss army knife

lea (Load Effective Address) computes a memory address and writes it to a register without dereferencing. Because the addressing mode is the same as the one mov uses, lea can be used as a fast three-operand arithmetic instruction: any value that fits the base + index*scale + displacement form can be computed in one instruction without clobbering the source registers and without touching the flags. The compiler uses this constantly:

lea_tricks.s

; a + b without an add, and without trashing flags or either input
lea  rax, [rdi + rsi]

; 5 * x in one instruction (4*x + x)
lea  rax, [rdi + rdi*4]

; 9 * x  →  (8*x + x), via lea with scale=8
lea  rax, [rdi + rdi*8]

; Address arithmetic: &array[index] for an array of 4-byte ints
lea  rax, [rdi + rsi*4]

; Combined: 3*x + 7 in one instruction
lea  rax, [rdi + rdi*2 + 7]

Things to know. lea does not read memory, so the address it computes does not need to be valid. lea does not set flags, which makes it useful in the middle of a chain of conditional code where you can't afford to clobber them. The simple two-component form ([base + index*scale] or [base + disp]) is one cycle of latency and several per cycle of throughput on modern Intel and AMD^[13]. The three-component form ([base + index*scale + disp] with a non-unit scale) is the "complex LEA" on Sandy Bridge through Skylake: routed through port 1 only, with three-cycle latency^[12]. Ice Lake and later collapse most of the gap. A practical consequence is that the compiler will sometimes prefer shl + add over lea for arithmetic that needs to be on the critical path of a Skylake-class part.

08Calling conventions: System V vs Windows x64

Every function call across the public ABI on x86-64 follows one of two specifications: the System V AMD64 ABI^[6] (Linux, macOS, the BSDs, PlayStation 4, PlayStation 5) or the Microsoft x64 ABI^[7] (Windows, Xbox One, Xbox Series X/S). They disagree on which registers carry arguments, which are callee-saved, how struct returns work, and how the stack is aligned at the call site:

System V AMD64

Linux · macOS · BSD · PS4/PS5

Int args (in order)	`rdi, rsi, rdx, rcx, r8, r9`
Float args	`xmm0`–`xmm7`
Int return	`rax` (high half: `rdx`)
Float return	`xmm0` (high half: `xmm1`)
Caller-saved	`rax, rcx, rdx, rsi, rdi, r8–r11`, all `xmm*`
Callee-saved	`rbx, rbp, r12–r15`, `rsp`
Stack alignment at call	16-byte (so `rsp` = 16k − 8 on entry)
Red zone	128 bytes below `rsp` usable without adjusting
Shadow space	None

Microsoft x64

Windows · Xbox (MSVC toolchain)

Int args (in order)	`rcx, rdx, r8, r9`
Float args	`xmm0`–`xmm3`
Int return	`rax`
Float return	`xmm0`
Caller-saved	`rax, rcx, rdx, r8–r11`, `xmm0–xmm5`
Callee-saved	`rbx, rbp, rdi, rsi, rsp, r12–r15`, `xmm6–xmm15`
Stack alignment at call	16-byte (same)
Red zone	None
Shadow space	32 bytes the caller must allocate above the args

Two differences cause most ABI bugs in practice. First, rdi and rsi are callee-saved on Windows but caller-saved (argument registers) on System V. Code that hand-codes assembly without restoring rdi and rsi works on Linux and silently corrupts state on Windows^[7]. Second, Windows requires the caller to allocate a 32-byte "shadow space" above the four register arguments so the callee can spill them. A function that calls into Windows code from System V (or vice versa) needs a thunk that translates the calling convention.

The widget below shows the same function call under both ABIs side by side. A function render(vec3 pos, float scale, int flags) is called; the visualizer shows where each argument ends up and which register the call clobbers:

09Stack frames in practice

The function-call stack on x86-64 grows downward: rsp decreases on a call or a push, increases on a ret or pop. A typical non-leaf function emits a prologue that allocates a frame, saves the registers it needs to preserve, and ends with an epilogue that reverses the prologue:

function_with_frame.s

my_function:
  ; ── Prologue ─────────────────────────────────────
  push rbp                  ; save the previous frame pointer
  mov  rbp, rsp             ; rbp now points at our frame base
  push rbx                  ; save callee-saved registers we'll use
  push r12
  sub  rsp, 0x28            ; 40 bytes of locals; keeps rsp 16-byte aligned

  ; ── Body ─────────────────────────────────────────
  ...

  ; ── Epilogue ────────────────────────────────────
  add  rsp, 0x28            ; deallocate locals
  pop  r12                  ; restore callee-saved registers in reverse
  pop  rbx
  pop  rbp
  ret

Three rules to remember when reading frames:

The stack must be 16-byte aligned at every call site. Because call pushes the 8-byte return address, rsp on function entry is always 16k − 8. The prologue's push rbp brings it back to 16k. Sub‑sequent local allocations must keep the alignment; this is why frames are sized in multiples of 16 with an extra 8 if needed.
The frame pointer is optional on x86-64. With -fomit-frame-pointer (default at -O1 and above in GCC and Clang for many years), the compiler skips push rbp; mov rbp, rsp and addresses locals off rsp directly. This frees a register but makes stack unwinding harder; profilers and debuggers fall back to DWARF CFI or .eh_frame. Fedora 38 (2023) and Ubuntu 24.04 (2024) reversed their distro defaults and now ship system binaries with frame pointers on, so that production profilers like perf can unwind through them cheaply^[14]. Apple's AArch64 ABI requires the frame pointer on every non-leaf function^[15].
Leaf functions have no prologue at all. A function that calls nothing else and uses only caller-saved registers (like the add from §1) emits a single instruction body and a ret. On System V it can also use the 128-byte "red zone" below rsp for scratch storage without subtracting from rsp at all^[6].

The red zone is the System V detail that catches Windows porters: a leaf function can legitimately have local_a at [rsp - 8], with rsp never adjusted. On Windows that area is fair game for any signal handler or kernel transition to clobber, so the Microsoft ABI specifies no red zone and the compiler must sub rsp, ... for any local storage.

10Reading compiler output: a worked vector normalize

A canonical 3-D engine primitive: take a vec3, divide each component by its length, return the unit vector. Source:

normalize.cpp

struct vec3 { float x, y, z; };

vec3 normalize(vec3 v) {
  float lenSquared = v.x * v.x + v.y * v.y + v.z * v.z;
  float invLen    = 1.0f / sqrtf(lenSquared);
  return { v.x * invLen, v.y * invLen, v.z * invLen };
}

With clang -O2 -mavx2 -mfma on System V (illustrated; exact instruction selection varies across versions and surrounding context):

normalize.s · illustrative clang -O2 -mavx2 -mfma output · Intel syntax

normalize(vec3):
  ; System V vec3 ABI: xmm0[0]=x, xmm0[1]=y (upper lanes undefined); xmm1[0]=z.
  vmovshdup    xmm2, xmm0                ; xmm2 = {xmm0[1], xmm0[1], …} → lane 0 holds y
  vmulss       xmm3, xmm1, xmm1          ; xmm3[0] = z*z   (vmulss touches only lane 0)
  vfmadd231ss  xmm3, xmm2, xmm2          ; xmm3[0] += y*y
  vfmadd231ss  xmm3, xmm0, xmm0          ; xmm3[0] += x*x      → lenSquared
  vsqrtss      xmm3, xmm3, xmm3          ; xmm3[0] = sqrt(lenSquared)
  vmovss       xmm4, dword ptr [rip + .LC0] ; xmm4[0] = 1.0f from a rodata constant
  vdivss       xmm3, xmm4, xmm3          ; xmm3[0] = invLen
  vmulss       xmm0, xmm0, xmm3          ; xmm0[0] = x*invLen; xmm0[1..] unchanged
  vmulss       xmm2, xmm2, xmm3          ; xmm2[0] = y*invLen
  vmulss       xmm1, xmm1, xmm3          ; xmm1[0] = z*invLen
  vunpcklps    xmm0, xmm0, xmm2          ; xmm0 lane 1 ← xmm2[0] (y*invLen)
  ret                                    ; return {x', y', ·, ·} in xmm0; z' in xmm1[0]

A few things are happening here that aren't obvious from the source. The two-XMM vec3 argument convention is from the System V ABI^[6]: an aggregate of three floats has its x,y in xmm0 and z in xmm1. The compiler used vfmadd231ss (fused multiply-add, scalar single-precision) twice to combine the three squared-component sums into one cumulative result; the FMA instruction does a += b*c in one μop with one rounding step instead of two, which is both faster and more accurate^[16]. The vsqrtss / vdivss pair is the textbook scalar square root and reciprocal; on hot paths a renderer might replace it with vrsqrtss (approximate reciprocal-square-root, ~11-bit precision) followed by a Newton-Raphson iteration, trading two cycles of latency for some accuracy. The compiler doesn't do that substitution by default because IEEE-754 strict mode requires the rounding behavior of sqrt + div; -ffast-math permits it.

Reading conventions: the v prefix on each mnemonic is the AVX (VEX-encoded) three-operand form. Without AVX, the same operation would be mulss xmm0, xmm0; addss xmm0, xmm3 with the destination always also being a source. Three-operand AVX lets the compiler keep xmm0 unmodified and write the result somewhere else, which usually shortens the dependency chain. The ss suffix means "scalar single-precision," i.e., one float in the low 32 bits of the XMM register, ignoring the other three lanes^[17].

11Branches, mispredictions, and the branchless idiom

The CPU does not wait for a conditional branch to resolve before fetching the next instructions. It predicts the outcome of every branch (using a per-branch history table and a global branch history register, both heavily refined over the last twenty years^[18]) and speculatively executes the predicted path. When the prediction is right (the usual case for tight loops, monotonic conditions, and any pattern with a stable history), the branch is approximately free. When it is wrong, the CPU rolls back the speculative work and refetches; the cost is ten to twenty cycles on recent Intel and AMD client cores^[12].

"Approximately free" is worth measuring. Cloudflare's 2021 microbenchmark of a long chain of predicted unconditional jumps reports steady-state costs of 2 cycles per jmp on an Intel Xeon Gold 6262 (degrading past the ~4096-entry dense BTB capacity), ~3.5 cycles per jmp on Zen 2 (AMD EPYC 7642), and 1 cycle on Apple's M1 when the loop fits in roughly 4 KB of instructions, rising to 3 cycles per jmp once the loop overflows the smaller of the M1's branch-target buffers; never-taken conditional branches stay near 0.3 cycles per block regardless of micro-arch^[40]. Two consequences worth keeping. First, "predicted branch is free" is shorthand, not literal: every micro-arch pays one to four cycles even when the prediction is right. Second, the cost is non-linear in the working set: a hot loop that fits inside the BTB and one that just doesn't can run at three- or four-fold different speeds with no source change.

For data-dependent branches that the predictor cannot learn (a random check against a 50/50 input, or a comparison against a key that changes per iteration), the misprediction cost dominates. The remedy is branchless code: replace the conditional with a computation that produces the same result regardless of the predicate, using cmovcc or bit tricks. Take the absolute-value-of-int problem:

abs.s · branchy vs branchless

; Branchy: easy to read, mispredicts catastrophically on random inputs
abs_branchy:
  test edi, edi
  jns  .Lpos                   ; jump if not signed (positive)
  neg  edi
.Lpos:
  mov  eax, edi
  ret

; Branchless: same answer, no jump. Cost is constant.
abs_branchless:
  mov  eax, edi                 ; eax = x (the default)
  neg  eax                      ; eax = -x; flags reflect -x
  cmovl eax, edi                 ; if -x < 0 (i.e., x was positive), take the original instead
  ret                            ; eax holds |x|; UB for x = INT_MIN, same as the branchy form

The bit-trick form is shorter still: mov eax, edi; cdq; xor eax, edx; sub eax, edx. cdq sign-extends eax into edx, producing an all-ones mask if x is negative and an all-zeros mask otherwise; the xor and sub then compute (x XOR mask) − mask, which evaluates to x when the mask is zero and to −x when the mask is all-ones. Four instructions, no jump, no condition codes consumed, used throughout the Linux kernel and many engine math libraries^[19].

When to use which. Branchless is faster when the predictor cannot learn the pattern. It is slower when the predictor can, because cmov introduces an unconditional data dependency: the result waits for both the source and the predicate, even when the predicate could have been predicted and the dependent computation skipped. The canonical demonstration is the most-upvoted answer on Stack Overflow^[20]: summing the elements above a threshold in a 32k-int array runs several times faster when the array is sorted first, because the predictor latches onto the threshold crossover after a handful of iterations and from then on the branch is free. The branchless form gets no such win and, on the sorted input, loses to the branchy form.

The widget animates this. Run the same predicate over an array that's either random or sorted; the branchy version takes constant time on the sorted input and falls off a cliff on the random one:

branchy cycles

···

branchless cycles

···

ratio

···

12SIMD: SSE, AVX, AVX-512

SIMD (Single Instruction, Multiple Data) replaces a loop body that operates on one scalar with one instruction that operates on a vector of values. Every modern x86-64 CPU has SSE2 mandatorily; almost every chip from the last decade has AVX2; recent server and high-end client chips have AVX-512^[10]. The mental model:

An XMM register holds 128 bits = 4 floats = 2 doubles = 16 bytes = 4 int32s.
A YMM register holds 256 bits = 8 floats = 4 doubles = 32 bytes = 8 int32s.
A ZMM register holds 512 bits = 16 floats = 8 doubles = 64 bytes = 16 int32s.
Each value in the register is a lane. SIMD instructions act on every lane in parallel: vaddps ymm0, ymm1, ymm2 adds eight pairs of single-precision floats simultaneously.

The suffix on a SIMD mnemonic names the lane format. ps = packed single (4/8/16 floats), pd = packed double, ss = scalar single (one float, others untouched), sd = scalar double. Integer variants use b/w/d/q for byte/word/dword/qword. vpaddd ymm0, ymm1, ymm2 is eight 32-bit integer adds in parallel^[17].

The widget below animates one SIMD operation across a register. vaddps on a YMM register adds two eight-lane vectors in one cycle of throughput. Press play and watch eight independent adds light up at the same time:

What about ARM? AAPCS64 and NEON in 30 seconds.

ARM (AArch64) is the architecture of Apple Silicon, Nintendo Switch, Nintendo Switch 2, every modern Android phone, and the ARM-based AWS Graviton servers. The ISA differs from x86-64 in three big ways: it is fixed-length 32-bit (no variable-length encoding), it is load-store (no read-modify-write on memory operands), and it has 31 general-purpose 64-bit registers (X0–X30, plus a zero register XZR and the stack pointer SP).

The AArch64 Procedure Call Standard^[21] passes integer args in X0–X7 and float args in V0–V7. Callee-saved registers are X19–X28. SIMD lives in 32 "V" registers, each 128 bits; instructions like fmla v0.4s, v1.4s, v2.4s are the rough equivalent of vfmadd231ps xmm0, xmm1, xmm2. The SVE/SVE2 vector extensions (vector-length-agnostic; on Apple Silicon and modern ARM server cores) generalize this to longer registers without recompiling.

Most of this tutorial transfers across by mechanical translation: register names change, the instruction syntax changes, but the same dependency chains, port pressure, and ABI bugs apply.

13A worked SIMD example: 4×4 matrix-vector multiply

The transformation that every renderer applies to every vertex: multiply a 4-component vector by a 4×4 matrix. Naively, sixteen multiplies and twelve adds (or four dot products of length 4) per vertex. With SSE the same operation is four vmulps and three vaddps on 128-bit vectors. The standard formulation, for a column-major matrix M = [c0 | c1 | c2 | c3] and a vector v = (x, y, z, w):

Mv = c 0 \cdotx + c 1 \cdoty + c 2 \cdotz + c 3 \cdotw

Each c_i·s is a scalar-times-vector, which in SSE is a broadcast of the scalar across four lanes followed by a multiply. The three additions chain together. Source:

matvec.cpp · SSE 4x4 column-major mat * vec

// 16-byte aligned for movaps. col[i] is column i of M.
struct alignas(16) mat4 { __m128 col[4]; };

__m128 mul(const mat4& M, __m128 v) {
  __m128 x = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));  // broadcast v.x
  __m128 y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));  // broadcast v.y
  __m128 z = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));  // broadcast v.z
  __m128 w = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3));  // broadcast v.w
  __m128 r = _mm_mul_ps(M.col[0], x);                  // r  = c0 * x
  r = _mm_fmadd_ps(M.col[1], y, r);                     // r += c1 * y
  r = _mm_fmadd_ps(M.col[2], z, r);                     // r += c2 * z
  r = _mm_fmadd_ps(M.col[3], w, r);                     // r += c3 * w
  return r;
}

With clang -O3 -mavx2 -mfma, the body is nine instructions plus the return, split into two parallel FMA chains to shorten the critical path:

matvec.s · illustrative clang -O3 -mavx2 -mfma output

mul(mat4 const&, __m128):
  ; rdi = const mat4*  (the matrix);  xmm0 = v
  vshufps      xmm1, xmm0, xmm0, 0x00       ; xmm1 = {v.x, v.x, v.x, v.x}
  vshufps      xmm2, xmm0, xmm0, 0x55       ; xmm2 = {v.y, v.y, v.y, v.y}
  vshufps      xmm3, xmm0, xmm0, 0xAA       ; xmm3 = {v.z, v.z, v.z, v.z}
  vshufps      xmm0, xmm0, xmm0, 0xFF       ; xmm0 = {v.w, v.w, v.w, v.w} (last use of v, OK to overwrite)
  vmulps       xmm1, xmm1, [rdi + 0x00]     ; xmm1 = c0 * v.x     (chain A start)
  vmulps       xmm0, xmm0, [rdi + 0x30]     ; xmm0 = c3 * v.w     (chain B start)
  vfmadd231ps  xmm1, xmm2, [rdi + 0x10]     ; xmm1 += c1 * v.y    (chain A: dst = src1*src2 + dst)
  vfmadd231ps  xmm0, xmm3, [rdi + 0x20]     ; xmm0 += c2 * v.z    (chain B)
  vaddps       xmm0, xmm0, xmm1             ; merge:  c0·x + c1·y + c2·z + c3·w
  ret

Two observations worth pulling out. The compiler split the source's single FMA chain into two parallel chains and merged them with a final vaddps: critical-path latency drops from one vmulps + three serial vfmadds (≈16 cycles on Skylake's 4-cycle FMA) to one vmulps, one vfmadd, one vaddps (≈12 cycles), at the cost of one extra instruction. The matrix columns are loaded directly inside the FMA and MUL memory-operand forms ([rdi + 0x10], etc.); on Intel and Zen the load and the compute fuse at the front end into a single macro-op^[13], so an explicit vmovaps would be redundant.

What's intentionally missing from this code. The 128-bit (one vector) form processes one vertex at a time. The throughput-tuned form transposes the data to SoA so that one ymm/zmm register holds the same component (x, then y, then z) for many vertices, and a single broadcast-and-FMA transforms 8 vertices at a time with AVX or 16 with AVX-512. The matrix is also typically loaded once and reused across many vertices; this code reloads it every call. On the GPU the same transform is the vertex shader, run massively in parallel, which is why CPU-side vertex transformation today is reserved for skinning, particle simulation, and physics: workloads where the CPU's branchy logic is worth more than the GPU's lane count.

14Intrinsics vs inline assembly vs writing pure asm

Three ways to drop below the C/C++ language barrier, in roughly decreasing order of how often you should use them:

Compiler intrinsics. Header-defined functions (<immintrin.h> on x86, <arm_neon.h> on ARM) that compile to a single SIMD instruction each^[17]. The compiler still allocates registers, schedules, and inlines. This is the default. Almost every shipping engine math library is intrinsics, not hand assembly.
Inline assembly. GCC/Clang's asm statement^[22] embeds asm in a C function with constraints describing the inputs, outputs, and clobbers. Useful for instructions with no intrinsic (CPUID, RDTSC, RDPMC, MSR access) and for hand-tuned hot inner loops where the register allocator's choices are hurting. MSVC does not support inline asm in x64 code^[23]; the Microsoft replacement is intrinsics or a separate .asm file compiled by ml64.
Pure assembly files. .s on GCC/Clang via GAS, .asm on MSVC via MASM or NASM. Used for the bottom-most platform code: context switches, fiber/coroutine implementations, atomic primitives that predate compiler intrinsics, signal frame mangling. Vanishingly rare in application or game code.

An example of when inline assembly is the right answer. CPUID, used at startup to detect available CPU features, has no C-level mechanism to invoke. GCC's intrinsic __cpuid handles it, but if you need the raw form:

cpuid.c · inline asm with operand constraints

static inline void cpuid(int leaf, int *eax, int *ebx, int *ecx, int *edx) {
  __asm__ (
    "cpuid"
    : "=a"(*eax),       // output: eax → *eax
      "=b"(*ebx),       // output: ebx → *ebx
      "=c"(*ecx),       // output: ecx → *ecx
      "=d"(*edx)        // output: edx → *edx
    : "a"(leaf)         // input:  leaf → eax
  );
}

The output constraints "=a", "=b", "=c", "=d" tell the compiler that cpuid writes those four registers and the values should go to the named C variables. The input constraint "a"(leaf) tells the compiler to load leaf into eax before executing the instruction. This is the entire mechanism; everything else is constraint syntax^[22].

Two reliable failure modes. Forgetting clobbers. If your inline asm modifies a register that's not in the input or output list, you have to declare it in the clobber list, or the compiler will assume the value it had before the asm is still there afterwards. Mixing inline asm with optimization-fragile patterns. Inline asm is a black box to the optimizer; it inhibits inlining, vectorization, and instruction scheduling across the asm. A tight intrinsic-based loop is almost always faster than the same loop with one inline-asm hot instruction in the middle.

15Atomics and fences at the machine level

A pointer to the Memory Model tutorial, condensed: every std::atomic operation in C++ lowers to a specific machine instruction whose ordering guarantees match the requested memory_order. The mapping on x86-64^[24]:

Relaxed load / store. Plain mov. x86 TSO already gives load-acquire and store-release ordering for free; relaxed and acquire/release have identical emitted code at the read or write site.
Acquire load. Plain mov. Same as relaxed on x86 because of TSO.
Release store. Plain mov. Same.
Sequentially-consistent load. Plain mov. The seq-cst constraint is enforced by the corresponding seq-cst store, not the load.
Sequentially-consistent store. Either mov followed by mfence (GCC's historical choice) or xchg [mem], reg (Clang's choice; xchg with a memory operand has an implicit LOCK prefix and is a full barrier). Both flush the store buffer before subsequent loads can retire^[25]. The two forms are functionally interchangeable; on modern hardware xchg is typically a few cycles faster.
Atomic RMW (fetch_add, compare_exchange). A LOCK-prefixed instruction (lock xadd, lock cmpxchg). The LOCK prefix asserts cross-core atomicity on the cache line and acts as a full barrier on x86.

On ARM (AArch64) the lowering is more visible. An acquire load lowers to LDAR (Load-Acquire Register); a release store lowers to STLR (Store-Release Register); since the C++20 fix codified in P0668^[24], a seq-cst load is LDAR and a seq-cst store is STLR alone. Pre-P0668 toolchains sometimes added a leading DMB ISH before STR for the seq-cst store, which is heavier but conservative. The Lê, Pop, Cohen, Zappa Nardelli 2013 paper on the corrected Chase-Lev deque^[26] is the standard reference for what the C++ memory model demands of ARM codegen; see the memory model tutorial for the full picture.

16μops, ports, and dependency chains in practice

Earlier we sketched the OOO engine. With the SIMD walkthrough in mind, this is what microarchitectural tuning actually looks like. Take a reduction: summing a million floats.

sum_serial.cpp · the naive loop

float sum_serial(const float* a, int n) {
  float s = 0;
  for (int i = 0; i < n; ++i) s += a[i];
  return s;
}

Compiled with -O2 (no fast-math), the inner loop is one addss per element. addss on Skylake has a latency of 4 cycles and a throughput of 2 per cycle^[13]. Because each iteration depends on s from the previous iteration, the loop is bound by the latency, not the throughput: one iteration every 4 cycles. The throughput of 2 adds per cycle is unreachable because there is only one chain. Eight ports of execution are mostly idle.

The fix is to break the dependency chain: use multiple accumulators that the compiler can schedule independently. -O3 -ffast-math on a modern compiler will do this for you (or -fassociative-math, which is the specific permission needed), producing something like:

sum_parallel.cpp · multi-accumulator + SIMD, after autovectorization

float sum_parallel(const float* a, int n) {
  __m256 s0 = _mm256_setzero_ps(), s1 = _mm256_setzero_ps();
  __m256 s2 = _mm256_setzero_ps(), s3 = _mm256_setzero_ps();
  for (int i = 0; i < n; i += 32) {
    s0 = _mm256_add_ps(s0, _mm256_loadu_ps(a + i + 0));
    s1 = _mm256_add_ps(s1, _mm256_loadu_ps(a + i + 8));
    s2 = _mm256_add_ps(s2, _mm256_loadu_ps(a + i + 16));
    s3 = _mm256_add_ps(s3, _mm256_loadu_ps(a + i + 24));
  }
  __m256 s = _mm256_add_ps(_mm256_add_ps(s0, s1), _mm256_add_ps(s2, s3));
  // horizontal sum of s into a scalar; details elided
  ...
}

Four independent accumulators, each a 256-bit vector of eight floats. Per iteration the loop does four 256-bit adds and four 256-bit loads. The four adds run on different dependency chains, which removes the cross-iteration latency that pinned the naive version. The four loads share two L1 load ports. Skylake has two FMA/ADD ports at 256 bits each and two L1 load ports at 256 bits each^[13]^[12], so both halves of the loop need two cycles. Steady-state throughput is one iteration every two cycles: 16 floats per cycle. Compared to the naive loop's 0.25 floats per cycle (one element per 4-cycle addss latency), that is a 64× speedup, decomposed as 8× from the SIMD width and 8× from breaking the latency chain. The loop is now L1-load-bandwidth-bound; the bottleneck moves to L2 once the working set exceeds the L1 footprint, and to DRAM once it exceeds L2.

Three tools to verify this analysis without re-running the program. llvm-mca^[27] takes a chunk of assembly and reports the expected steady-state IPC and per-port pressure. uica.uops.info^[28] does the same with a richer microarchitectural model. perf stat -e cycles,instructions,branches,branch-misses,... on Linux gives the actual run-time numbers from CPU performance counters. The three together cover prediction, simulation, and measurement.

17Pitfalls

Mixing legacy SSE and VEX-encoded AVX. Switching from non-VEX SSE (movaps xmm0, ...) into VEX AVX (vmulps ymm0, ...) without an intervening vzeroupper can incur a state-save penalty of tens of cycles on Skylake-class cores^[3]. The compiler emits vzeroupper at AVX/SSE boundaries; hand-written code must too.
Assuming x86 latency is the cost. Latency only matters on the critical path. A μop with latency 6 and throughput 2-per-cycle can issue every 0.5 cycles when its inputs are independent; the same μop costs 6 cycles per iteration when chained. Read both numbers from uops.info^[13], not just the latency.
Partial-register stalls. Writing al and then reading rax used to stall on Pentium 4 and Sandy Bridge; on modern cores it's usually free, but it depends on the specific micro-arch. Treat movzx eax, al as the safe explicit form when bridging width.
Clobbering a callee-saved register and not restoring it. The ABI tables in §8 are the contract; ignoring them produces bugs that surface as data corruption many calls away. The compiler's stack-protector and stack-usage warnings (-fstack-protector-strong, -Wstack-usage) catch some adjacent failures, but the real defenses are explicit clobber lists in inline asm and reviewing the disassembly of the prologue/epilogue.
Hand-aligning data when the compiler already did. alignas(16) in C++ and the aligned attribute in C produce the same memory layout the compiler is going to assume for SSE loads. Hand-aligning by adding pad fields by hand is a constant source of bugs after struct layout changes.
Optimizing inner loops the compiler already optimized. Before writing intrinsics, look at the compiler's output at -O3 with -march=native. Modern autovectorizers handle the standard patterns (reductions, maps, prefix scans). Hand-rolled intrinsics are right when the compiler bailed out on the pattern, not when it didn't.
RDTSC isn't a clock. It's a free-running TSC that on modern chips runs at a fixed (often the nominal) frequency regardless of dynamic frequency scaling^[29]. Don't use it to measure wall-clock time. Don't use it across cores without serializing. Use std::chrono::steady_clock for wall time and rdpmc with a privileged setup for cycle counting.

18What's next

Where to go from here:

Read your own engine's hot paths. Pick the math, physics, animation, and renderer-glue libraries you ship. Build a release binary with debug info. objdump -d --disassembler-options=intel-mnemonic on Linux, dumpbin /disasm on MSVC, or the Compiler Explorer^[1] "load my source" workflow. Look at what your hot inline functions actually compile to.
Sit with Agner Fog's manuals.^[5] The five PDFs at agner.org/optimize are the practitioner reference: optimization manuals 1–4 cover the C, asm, micro-arch, and instruction-tables material; manual 5 is calling conventions. Free, no DRM, updated through 2024.
Run llvm-mca and uica on the loops you write. Both are free; both take a small region of assembly and predict throughput and port pressure. Use them before and after intrinsic refactors to confirm the change you intended is the change the model sees.
Read a real ABI document end-to-end. The System V AMD64 ABI^[6] is sixty pages and gives a strikingly different mental model of "what a function call is" than the C language does. Aggregate passing rules, classification of struct fields, and the variadic-function ABI are all worth a slow read.
Pair with the Memory Model tutorial. The atomic-instruction lowering from §15 is the bridge: once you've read the memory model tutorial, the lock cmpxchg16b in a Chase-Lev deque and the vzeroupper at an AVX boundary are the same kind of thing: implementation details the language hides until you read the assembly.

19Sources & further reading

Numbered citations refer to the superscripts above. Most are freely available; a few are vendor manuals that require a no-cost registration.

A note on originality

The prose, code samples, CSS, and interactive widgets on this page are original writing. The instruction-encoding decomposition follows the Intel SDM Volume 2 chapter on instruction format [10]. The two ABI tables in §8 are a side-by-side condensation of the System V AMD64 ABI [6] and the Microsoft x64 calling convention reference [7]; consult them before relying on any field. The branchy-vs-branchless numerics in §11 trace to Bruce Dawson's analysis and the canonical Stack Overflow demonstration [20]; the bit-trick form of abs is from Hacker's Delight §2-4 [19]. The C++ atomic lowering table in §15 matches the mapping documented at [24].

Godbolt, M. Compiler Explorer. godbolt.org. The interactive compile-and-disassemble tool used in every example in this tutorial; mature support for GCC, Clang, MSVC, ICX, and a long tail of other compilers.
Lengyel, E. (2011). Mathematics for 3D Game Programming and Computer Graphics (3rd ed.). Cengage. The standard reference for the vector and matrix routines that ship in every renderer; SSE versions of normalize, cross, and matmul in Chapter 4.
Intel Corporation. Intel® 64 and IA-32 Architectures Optimization Reference Manual. Order Number 248966. intel.com. The normative source for Intel microarchitecture details: pipeline depth, μop fusion, AVX state transitions, the SSE-AVX transition penalty.
Advanced Micro Devices. Software Optimization Guide for AMD Family 19h Processors (Zen 3 and Zen 4). Publication #56665 / #57487. amd.com. AMD's counterpart to the Intel optimization manual; the canonical Zen 3/Zen 4 latency and throughput numbers.
Fog, A. Software optimization resources. agner.org/optimize. Five PDFs: optimizing C++, optimizing asm, microarchitecture of Intel/AMD/VIA CPUs, instruction tables, and calling conventions. Updated as recently as 2024. Free; no registration.
Matz, M., Hubička, J., Jaeger, A., & Mitchell, M. System V Application Binary Interface, AMD64 Architecture Processor Supplement. gitlab.com/x86-psABIs/x86-64-ABI. The normative document for Linux, macOS, the BSDs, and the PlayStation console toolchains. Continuously revised; the LaTeX source is the canonical version.
Microsoft. x64 calling convention. learn.microsoft.com. The Microsoft x64 ABI reference: argument registers, callee/caller-saved, shadow space, struct passing rules.
Advanced Micro Devices. AMD64 Architecture Programmer's Manual, Volume 1: Application Programming. Publication #24592. amd.com. The original AMD64 architecture spec; describes the REX prefix, the 64-bit operand size rules, and the zero-extension behavior of 32-bit register writes.
Intel Corporation. (1979). iAPX 86/88, 186/188 User's Manual. The original 8086 reference. Historical only; reproduced at archive.org.
Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual. Order Numbers 253665–253669. intel.com. Five volumes; Volume 1 is the architecture overview, Volume 2 is the instruction set reference (the encoding tables), Volume 3 covers system programming.
Intel Corporation. (2023). Intel® Advanced Performance Extensions (Intel® APX) Architecture Specification. intel.com. The proposed extension to 32 GPRs and three-operand encoding for legacy integer instructions.
Fog, A. The microarchitecture of Intel, AMD and VIA CPUs. Manual 3 of the agner.org/optimize series. Periodically updated; the 2024 edition covers Zen 4, Raptor Lake, and Arrow Lake.
Abel, A., & Reineke, J. (2019). uops.info: Characterizing Latency, Throughput, and Port Usage of Instructions on Intel Microarchitectures. ASPLOS. Project home: uops.info. Machine-measured latency, throughput, and port-usage tables for nearly every x86-64 instruction on every recent micro-architecture.
Möller, T. (2021). Frame pointers: why you should keep them. Brendan Gregg's writeup on the cost of omitting frame pointers for production profiling. brendangregg.com. Covers why Fedora 38 and Ubuntu 24.04 re-enabled -fno-omit-frame-pointer by default.
Apple. (2024). Writing ARM64 code for Apple platforms. developer.apple.com. Apple's ARM64 platform conventions; covers the frame pointer requirement, which is stricter than the AAPCS64 baseline.
Muller, J.-M., Brisebarre, N., de Dinechin, F., Jeannerod, C.-P., Lefèvre, V., Melquiond, G., Revol, N., Stehlé, D., & Torres, S. (2018). Handbook of Floating-Point Arithmetic (2nd ed.). Birkhäuser. The reference for FMA rounding semantics: a single rounding step after the multiply-then-add, instead of two.
Intel Corporation. Intel® Intrinsics Guide. intel.com/intrinsics-guide. The searchable reference for every intrinsic in the <immintrin.h> family, with mnemonic, encoding, latency, and a per-architecture availability badge.
Yeh, T.-Y., & Patt, Y. N. (1991). Two-Level Adaptive Training Branch Prediction. MICRO. PDF. The two-level predictor that became the basis of every shipping CPU's branch predictor; modern designs add path-history, TAGE, perceptron predictors, and indirect-target tables on top.
Warren, H. S. (2012). Hacker's Delight (2nd ed.). Addison-Wesley. The canonical reference for bit-twiddling: branchless absolute value, sign extension, population count, etc. §2-4 has the absolute-value bit trick used in §11.
Stack Overflow. Why is processing a sorted array faster than processing an unsorted array? (2012). stackoverflow.com. The most-upvoted Stack Overflow answer ever; canonical demonstration of branch-predictor effects in C++.
ARM Limited. Procedure Call Standard for the Arm® 64-bit Architecture (AArch64). Document IHI 0055. github.com/ARM-software/abi-aa. The AArch64 ABI: argument registers, callee-saved, parameter passing rules, the NEON vector ABI.
Free Software Foundation. Extended Asm — Assembler Instructions with C Expression Operands. GCC manual. gcc.gnu.org. The reference for GCC inline assembly constraints; Clang follows the same spec for compatibility.
Microsoft. Inline assembler. learn.microsoft.com. The MSVC documentation; states explicitly that inline asm is not supported in x64 or ARM64 code, with the redirect to intrinsics and external .asm via MASM.
Boehm, H.-J., & Giroux, O. (2018). P0668R5 — Revising the C++ memory model. WG21. open-std.org. The proposal that tightened seq_cst ordering on weak hardware; includes the canonical x86 and ARM lowering tables for every memory order. Also useful: Mapping C/C++ to processors, cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html.
Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A, §8.2 (Memory Ordering), §8.3 (Serializing Instructions). The normative source on MFENCE, LFENCE, SFENCE, and the LOCK prefix as full barriers.
Lê, N. M., Pop, A., Cohen, A., & Zappa Nardelli, F. (2013). Correct and Efficient Work-Stealing for Weak Memory Models. PPoPP. PDF. The paper that corrected the 2005 Chase-Lev deque with the right C11 acquire/release ordering for ARM and POWER.
LLVM Project. llvm-mca — LLVM Machine Code Analyzer. llvm.org. A static throughput analyzer that takes a region of assembly and reports per-port pressure and steady-state IPC; ships with LLVM.
Abel, A., & Reineke, J. (2022). uiCA: Accurate Throughput Prediction of Basic Blocks on Recent Intel Microarchitectures. ICS. Project home: uica.uops.info. A more recent simulator with a richer model of the front end, μop cache, and back-end ports.
Paoloni, G. (2010). How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures. Intel white paper. The reference for the right way to use RDTSC and RDTSCP for cycle-level benchmarking, including the serialization caveats.
Hennessy, J. L., & Patterson, D. A. (2017). Computer Architecture: A Quantitative Approach (6th ed.). Morgan Kaufmann. The textbook on pipelining, OOO execution, and cache hierarchies. Chapters 2 and 3 are the long-form version of §3 of this tutorial.
Drepper, U. (2007). What Every Programmer Should Know About Memory. PDF. The long-form treatment of the memory hierarchy, prefetch, and cache effects; still the clearest single document on the topic.
Giesen, F. (Ryg). The ryg blog. fgiesen.wordpress.com. Practitioner-grade writeups on SIMD, codec implementation, and the pixel pipeline; the closest thing the field has to a working game-programmer's blog on this material.
Muła, W. Practical SIMD and bit-twiddling notes. 0x80.pl. Hundreds of microbenchmarks and worked examples for SSE, AVX2, AVX-512, NEON. The reference for "how do I vectorize this small thing?"
Dawson, B. Random ASCII. randomascii.wordpress.com. Long-running practitioner blog from a Valve / Microsoft profiling engineer; the source for many of the canonical "what does this assembly cost on Skylake" investigations.
Lemire, D. Daniel Lemire's blog. lemire.me/blog. Microbenchmarks, SIMD analyses, and a steady stream of evidence-based posts on modern x86 performance.
Patterson, D. A., & Waterman, A. (2017). The RISC-V Reader: An Open Architecture Atlas. Strawberry Canyon. A short, accessible alternative to x86 for understanding how a simpler ISA decodes; useful background even if you'll never ship to RISC-V.
Intel Corporation. Intel® VTune™ Profiler User Guide. intel.com. The reference profiler for Intel hardware; reads the same performance counters perf does, with a richer microarchitectural-event taxonomy.
Bendersky, E. Eli Bendersky's website. eli.thegreenplace.net. Long-form articles on linkers, ELF, position-independent code, and how compilers actually generate the assembly you read. Strong on the loader and dynamic-linking side that this tutorial skips.
Hyde, R. (2003). The Art of Assembly Language. No Starch Press. The classic intro book to assembly; uses HLA, but Part III is x86 assembly proper. Free online at plantation-productions.com.
Majkowski, M. (2021). Branch predictor: How many "if"s are too many? Cloudflare blog. blog.cloudflare.com. Microbenchmarks of long predicted-branch chains on Intel Xeon Gold 6262, AMD EPYC 7642 (Zen 2), and Apple M1, with reproducible code. Source for the per-architecture per-jmp cycle numbers and the BTB-capacity-cliff behavior cited in §11.