x86-64 Assembly
from Scratch
Sixteen general-purpose registers. Sixteen 256-bit vector registers since AVX, thirty-two 512-bit ones on AVX-512 hardware. An instruction encoding that ranges from one byte to fifteen, a calling convention that disagrees between Linux and Windows, and a microarchitecture that turns your serial-looking code into a wide, speculative dispatch across eight or more execution ports. The view from the assembly layer is what tells you why your inner loop runs at the speed it does. We work from the CPU model up to a SIMD matrix-vector kernel and a ฮผop scheduling demo, with every claim cited to a vendor manual or a refereed paper.
01Why assembly still matters when nobody writes it
The honest answer is that very few production games ship hand-written assembly. The compiler is usually within a few percent of optimal on hot paths, and a single ABI mismatch in hand-written code can corrupt a callee-saved register in a way that crashes three frames later. The reason to learn assembly in 2026 is not to write it. It is to read it. The compiler's output is the ground truth for what your CPU is being asked to do, and reading it answers questions that source-level reasoning can't.
Concrete situations where the disassembly is the only authoritative source:
- The optimizer didn't do what you assumed. A loop you thought was unrolled wasn't. A condition you thought compiled to
cmovcompiled to a branch. A small struct you thought was returned in registers got spilled to memory. Godbolt[1] exists for these moments, and the answer is always in the assembly. - A bug only reproduces under release optimizations. The compiler reordered or elided code in a way that exposed a latent UB. Reading the assembly is faster than printf-debugging a release build.
- You're integrating native libraries across an ABI. A C++ function calling into FFI code from Rust, the Switch Homebrew toolchain, an FMOD plugin, or a custom JIT, has to follow the platform's calling convention exactly. The only way to verify is to look at what each side emits.
- Microarchitecture tuning. When a hot SIMD loop is running at half throughput, the answer lives in the dependency chain or port pressure of the emitted instructions, not in the source.
- Debug symbols at the crash site. Reading a stack trace from a stripped binary in production means reading
add rsp, 0x28; retand inferring the function.
Every section of this tutorial answers a question that has surfaced in real engine work. The history is short, the CPU model is the standard mental picture a shipping engine programmer needs, and the SIMD walkthrough in ยง13 is a variant of the matrix-vector transform almost every renderer ships[2].
A working ability to read AT&T and Intel syntax x86-64 disassembly. Concrete knowledge of the System V AMD64 ABI (Linux, macOS, BSD, PlayStation) and the Microsoft x64 ABI (Windows, Xbox) and where they differ. The instruction encoding well enough to read a hex dump as instructions. The SIMD vocabulary (lanes, packed/scalar, SSE/AVX/AVX-512) and what a 4ร4 matrix-vector multiply looks like at the assembly level. The microarchitectural vocabulary (ฮผops, ports, dependency chains, retirement) that the Intel and AMD optimization manuals[3][4] and Agner Fog's manuals[5] are written in. Six live, in-browser widgets you can step through.
A tiny disassembly to set the tone
Two lines of C. Compile with gcc -O2 on x86-64 Linux:
int add(int a, int b) { return a + b; }
The compiler emits:
add: lea eax, [rdi + rsi] ; eax = rdi + rsi (load-effective-address used as 3-op add) ret ; pop return address from [rsp], jump to it
Four lessons hiding in two instructions. First, the function arguments arrived in rdi and rsi because the System V ABI[6] puts the first two integer arguments there; the same code on Windows would have used rcx and rdx[7]. Second, the return value goes in rax (32 bits of which is eax). Third, the compiler used lea (Load Effective Address) as a three-operand non-destructive add: lea is the closest thing x86 has to an ARM-style add dst, src1, src2, and the compiler reaches for it constantly. Fourth, there is no stack frame at all, because nothing here required spilling state and the function is a leaf.
The rest of this tutorial unpacks each of those four observations from first principles. By the end you will be able to look at a fifty-line dump and tell where the register allocator gave up, where a function was inlined, and which loop the compiler vectorized.
02A short history of the architecture you're reading
x86-64 is the carrying state of nearly fifty years of architectural decisions, each made with the previous decade of code in mind. The instruction encoding has prefixes whose purpose is to extend an earlier 16-bit encoding. The calling conventions changed because the register count tripled. The vector extensions were added in four named generations. None of this is obvious from the documentation; each piece makes sense only against what it replaced.
std::find on integer ranges and for std::hash when SSE4.2 is available.
The consequence for what you read in 2026: most disassembly is plain x86-64 with SSE2 floating point. Hot paths use AVX2 (256-bit) on PC and on PS5 / Xbox Series. AVX-512 shows up in compression, simulation, and ML kernels; rarely in shipped game code on consumer hardware because of the Intel client-side gap from Alder Lake through Arrow Lake. ARM, which we touch briefly in ยง13, is the other architecture worth knowing: Nintendo Switch and Switch 2, Apple Silicon, and a growing fraction of Windows-on-Arm devices all ship it, and the calling conventions, register file, and SIMD model (NEON, SVE2) follow different rules.
Intel syntax vs AT&T syntax: which is which, and which should I learn?
Two notational conventions exist for x86 assembly. The instructions are the same; the surface syntax differs:
Intel syntax writes mov eax, ebx with destination first. No register sigils. Memory references look like [rdi + 8]. Used by NASM, MASM, the Intel and AMD reference manuals[10], and Compiler Explorer's default. Also the syntax GCC and Clang emit if you pass -masm=intel.
AT&T syntax writes movl %ebx, %eax with source first. Registers prefixed with %, immediates with $, and a suffix on the mnemonic encodes the operand size (l = long = 32 bits). Memory references look like 8(%rdi). Used by the GNU assembler (as) and the default output of objdump on Linux. Inherited from the AT&T Unix assembler convention from the early 1980s.
Read both. Intel syntax matches the vendor manuals; AT&T is what you'll see in objdump -d output from a Linux build. This tutorial uses Intel syntax in code listings and notes AT&T differences where they matter.
03The CPU you're actually programming for
An assembly instruction is not what executes. A modern x86 core decodes each instruction into one or more , schedules those ฮผops across multiple execution units in parallel, and retires them in program order at the back end[3]. The instruction stream you read is the contract; what runs is a heavily reordered, register-renamed, speculatively-executed shadow of it. This section is the sketch of that machine, enough to make sense of why some instructions are "free" and others stall a whole pipeline.
A modern x86 pipeline, simplified to the parts you care about reading assembly for:
- Fetch. The CPU reads 16 to 32 bytes per cycle from the L1 instruction cache (32 KB on Skylake- and Zen-class cores; 48โ64 KB on Sunny Cove, Golden Cove, and Lion Cove[12]) into an aligned fetch window.
- Decode. Four to six decoders convert variable-length x86 instructions into fixed-format ฮผops on recent Intel cores; Zen 3 and 4 use four. Most simple instructions decode to a single ฮผop; string ops, integer divide, and a handful of complex memory addressing modes decode to several.
- ฮผop cache. Recent x86 cores keep a decoded-ฮผop cache in front of the rename stage. A hot loop that fits in the cache (1.5K ฮผops on Skylake; 4K on Golden Cove and later[12]) bypasses the legacy decoders entirely. AMD's equivalent on Zen 4 holds about 6.75K instructions.
- Rename and dispatch. The renamer maps each architectural register named in the instruction to a physical register from a pool of several hundred. This is what lets the CPU execute "
add rax, 1; add rax, 2" with no false dependency between the two adds: each writes to a different physical register, and the renamer tracks which physical register is the currentrax. - Schedule and execute. The scheduler issues ready ฮผops to a set of execution ports: 8 on Skylake, 10 on Sunny Cove, 12 on Golden Cove, and similar on recent Zen[13]. A given port can execute one ฮผop per cycle. Some ports are specialized; integer divide, store-data, branches, and vector multiply each live on a subset.
- Retire. A reorder buffer holds completed ฮผops until every earlier ฮผop has finished. Retirement happens in program order at 4 ฮผops per cycle on Skylake, rising to 6 on Golden Cove and 8 on Lion Cove and Zen 4[12]. Once an instruction's ฮผops retire, its writes become architecturally visible; until then the CPU can roll them back (on a branch mispredict, an exception, or a memory-order violation).
The two practical consequences for reading assembly. First, an instruction's latency (cycles from issue to result available) is different from its throughput (number of times per cycle the CPU can issue it). A vmulps on Skylake has a latency of 4 cycles but a throughput of two per cycle[13]: a chain of eight dependent multiplies takes 8 ร 4 = 32 cycles; eight independent ones issue in 4 cycles and the last result lands 4 cycles after that, on the order of 8 cycles total. Same instruction count, ~4ร difference in finish time. Second, the difference between a "fast" and "slow" implementation of the same algorithm at the assembly level is often not the instruction count; it is whether the instructions form a serial dependency chain pinning everything to one port, or whether they spread out across the execution ports the CPU has on offer.
The widget below schedules a simple loop's ฮผops onto four ALU ports and watches what stalls and what doesn't. We come back to this picture in ยง16 with a real worked example:
04The register file
x86-64 exposes sixteen general-purpose 64-bit registers and sixteen (with AVX-512: thirty-two) vector registers. Knowing the names is half of reading a disassembly:
| 64-bit | 32-bit | 16-bit | 8-bit low | Conventional role (System V) |
|---|---|---|---|---|
rax | eax | ax | al | return value (int); scratch |
rbx | ebx | bx | bl | callee-saved |
rcx | ecx | cx | cl | 4th integer arg |
rdx | edx | dx | dl | 3rd integer arg; high half of 128-bit return |
rsi | esi | si | sil | 2nd integer arg |
rdi | edi | di | dil | 1st integer arg |
rbp | ebp | bp | bpl | frame pointer; callee-saved |
rsp | esp | sp | spl | stack pointer |
r8 | r8d | r8w | r8b | 5th integer arg |
r9 | r9d | r9w | r9b | 6th integer arg |
r10โr11 | โฆ | scratch | ||
r12โr15 | โฆ | callee-saved | ||
Writes to a 32-bit register zero-extend into the 64-bit register. Writes to an 8-bit or 16-bit register do not: they leave the upper bits unchanged. This is an AMD64 spec rule[8], not a compiler convention, and it is the reason compilers emit xor eax, eax to zero the full rax: the 32-bit write zero-extends and the encoding is one byte shorter than mov rax, 0.
Vector registers come in three sizes that alias each other:
xmm0โxmm15(with AVX-512:xmm0โxmm31): 128 bits. SSE/SSE2/SSE3/SSE4 baseline.ymm0โymm15(with AVX-512:ymm0โymm31): 256 bits. AVX/AVX2. The lower 128 bits ofymm0isxmm0.zmm0โzmm31: 512 bits. AVX-512. The lower 256 bits isymm.
The aliasing is not free. On Skylake-class Intel cores, a naive mix of legacy-encoded SSE (xmm) and VEX-encoded AVX (ymm) without an intervening vzeroupper can incur a state-save penalty of tens of cycles per transition[3]. Sunny Cove and later replace the hard stall with a smaller, more amortized cost, but the rule the compiler follows hasn't changed: emit vzeroupper at AVX-to-SSE boundaries. Inline assembly that mixes the two has to do the same.
Step through eight instructions and watch the register file update. Each row is one 64-bit register; columns are the high and low 32 bits. The arrow shows the current instruction; orange highlights the just-written register:
05Anatomy of an instruction
An x86-64 instruction is between 1 and 15 bytes. Its decoded form is a sequence of optional and mandatory fields, in this order[10]:
- Legacy prefixes (0โ4 bytes). Address-size override, operand-size override, segment override, repeat prefix,
LOCK. - REX prefix (0โ1 byte). Required to encode the 64-bit operand size, registers R8โR15, or the SIL/DIL/BPL/SPL byte registers. Starts with the high nibble
0x4; the low nibble carries four bits W, R, X, B. - Opcode (1โ3 bytes). Selects the instruction. Sometimes part of the opcode encodes a register too (the "+r" forms).
- ModRM (0โ1 byte). For instructions with operands, encodes the addressing mode and one or two register fields.
- SIB (0โ1 byte). Scale/Index/Base for the memory addressing modes that need it (
[rbx + rcx*4 + 0x10]). - Displacement (0, 1, 2, or 4 bytes). The constant offset in a memory operand.
- Immediate (0, 1, 2, 4, or 8 bytes). A literal value baked into the instruction.
AVX adds two more prefix families that replace REX and pack a "non-destructive source" field, letting vaddps zmm0, zmm1, zmm2 compute zmm0 = zmm1 + zmm2 without clobbering either source. The VEX prefix (2 or 3 bytes) covers AVX/AVX2; the EVEX prefix (4 bytes) covers AVX-512 and adds the mask register selector and rounding controls.
Click an instruction below to see its bytes broken out. Most production disassemblers can show this view (objdump -d -M intel --show-raw-insn, llvm-mc --show-encoding); seeing it once builds the muscle memory:
What's a ModRM byte actually?
One byte split into three fields: mod (top 2 bits), reg (middle 3 bits), r/m (bottom 3 bits). The reg field names a register; the r/m field names either a register or a memory operand depending on what mod says. A few examples:
mod=11 means "r/m is a register" (so the instruction has two register operands). mod=00 means "r/m is a memory operand with no displacement" (e.g., [rax]). mod=01 and mod=10 add an 8-bit or 32-bit displacement. r/m=100 is a sentinel meaning "a SIB byte follows," used for the indexed addressing modes like [rax + rcx*4].
The reg and r/m fields are 3 bits each. To name the 16 x86-64 registers you need 4 bits; the missing bit comes from the REX prefix's R and B bits. That's why REX is required whenever your instruction touches R8โR15.
06The core instructions, by frequency
Recent x86-64 has on the order of a thousand distinct mnemonics, with several thousand encoded variants when you count operand sizes and addressing modes[13]. The fifteen or so in the table below account for the overwhelming majority of any non-vector compiler output. Sorted roughly by how often they appear:
| Mnemonic | What it does | Form you'll usually see |
|---|---|---|
mov | Copy bits. Register-to-register, register-to-memory, memory-to-register, immediate-to-register. The basic data movement instruction. | mov rax, [rdi + 8] |
lea | Compute an address (or any 3-operand arithmetic that fits the addressing modes), don't dereference. See ยง7. | lea rax, [rdi + rsi*4] |
add / sub | Integer add/subtract. Two-operand: dst = dst op src. | add rsp, 0x28 |
imul | Signed multiply. Most often seen as two-operand (imul dst, src โ dst *= src) or three-operand with an immediate (imul dst, src, imm โ dst = src ร imm). | imul rax, rcx |
shl / shr / sar | Bit shift left, logical right, arithmetic right. sar preserves the sign bit; shr doesn't. | shl rax, 3 |
and / or / xor | Bitwise ops. xor reg, reg is the canonical zeroing idiom. | xor eax, eax |
cmp / test | Set the flags register. cmp a, b = subtract without storing; test a, b = AND without storing. | cmp rax, 0 |
jcc | Conditional jump. je = jump if equal (ZF=1), jne, jl, jg, jb, ja (signed vs unsigned). Follows a cmp or test. | jne loop |
jmp | Unconditional jump. | jmp .L7 |
call / ret | Function call and return. call pushes the return address and jumps; ret pops and jumps. | call malloc |
push / pop | Decrement RSP by 8 and store; or load and increment by 8. Used in function prologues/epilogues for callee-saved registers. | push rbp |
cmovcc | Conditional move. cmovge dst, src = if SF=OF then dst=src. Branchless conditionals; see ยง11. | cmovl rax, rdi |
setcc | Set a byte to 1 if a condition holds, else 0. Often used with movzx to materialize a 0/1 in a register. | setl al |
movzx / movsx | Move with zero-extension or sign-extension. Bridge between 8/16-bit and 32/64-bit registers. | movzx eax, byte ptr [rdi] |
The cc suffix on jumps, conditional moves, and setcc is one of sixteen condition codes that test the flags register set by the most recent cmp, test, or arithmetic instruction: z/e (zero/equal), nz/ne, l/g (signed less/greater), b/a (unsigned below/above), s (negative), o (overflow), and the rest. A jcc reads the flags and conditionally jumps; a cmovcc reads the flags and conditionally moves; a setcc reads the flags and conditionally writes 1 to a byte[10].
Memory addressing supports one form, used by mov, lea, and most others: [base + index*scale + displacement], where base and index are 64-bit registers, scale is 1/2/4/8, and displacement is a signed 8 or 32-bit constant. Any subset can be omitted. RIP-relative addressing ([rip + offset]) is the special case used for position-independent globals on x86-64.
07The lea swiss army knife
lea (Load Effective Address) computes a memory address and writes it to a register without dereferencing. Because the addressing mode is the same as the one mov uses, lea can be used as a fast three-operand arithmetic instruction: any value that fits the base + index*scale + displacement form can be computed in one instruction without clobbering the source registers and without touching the flags. The compiler uses this constantly:
; a + b without an add, and without trashing flags or either input lea rax, [rdi + rsi] ; 5 * x in one instruction (4*x + x) lea rax, [rdi + rdi*4] ; 9 * x โ (8*x + x), via lea with scale=8 lea rax, [rdi + rdi*8] ; Address arithmetic: &array[index] for an array of 4-byte ints lea rax, [rdi + rsi*4] ; Combined: 3*x + 7 in one instruction lea rax, [rdi + rdi*2 + 7]
Things to know. lea does not read memory, so the address it computes does not need to be valid. lea does not set flags, which makes it useful in the middle of a chain of conditional code where you can't afford to clobber them. The simple two-component form ([base + index*scale] or [base + disp]) is one cycle of latency and several per cycle of throughput on modern Intel and AMD[13]. The three-component form ([base + index*scale + disp] with a non-unit scale) is the "complex LEA" on Sandy Bridge through Skylake: routed through port 1 only, with three-cycle latency[12]. Ice Lake and later collapse most of the gap. A practical consequence is that the compiler will sometimes prefer shl + add over lea for arithmetic that needs to be on the critical path of a Skylake-class part.
08Calling conventions: System V vs Windows x64
Every function call across the public ABI on x86-64 follows one of two specifications: the System V AMD64 ABI[6] (Linux, macOS, the BSDs, PlayStation 4, PlayStation 5) or the Microsoft x64 ABI[7] (Windows, Xbox One, Xbox Series X/S). They disagree on which registers carry arguments, which are callee-saved, how struct returns work, and how the stack is aligned at the call site:
| Int args (in order) | rdi, rsi, rdx, rcx, r8, r9 |
|---|---|
| Float args | xmm0โxmm7 |
| Int return | rax (high half: rdx) |
| Float return | xmm0 (high half: xmm1) |
| Caller-saved | rax, rcx, rdx, rsi, rdi, r8โr11, all xmm* |
| Callee-saved | rbx, rbp, r12โr15, rsp |
| Stack alignment at call | 16-byte (so rsp = 16k โ 8 on entry) |
| Red zone | 128 bytes below rsp usable without adjusting |
| Shadow space | None |
| Int args (in order) | rcx, rdx, r8, r9 |
|---|---|
| Float args | xmm0โxmm3 |
| Int return | rax |
| Float return | xmm0 |
| Caller-saved | rax, rcx, rdx, r8โr11, xmm0โxmm5 |
| Callee-saved | rbx, rbp, rdi, rsi, rsp, r12โr15, xmm6โxmm15 |
| Stack alignment at call | 16-byte (same) |
| Red zone | None |
| Shadow space | 32 bytes the caller must allocate above the args |
Two differences cause most ABI bugs in practice. First, rdi and rsi are callee-saved on Windows but caller-saved (argument registers) on System V. Code that hand-codes assembly without restoring rdi and rsi works on Linux and silently corrupts state on Windows[7]. Second, Windows requires the caller to allocate a 32-byte "shadow space" above the four register arguments so the callee can spill them. A function that calls into Windows code from System V (or vice versa) needs a thunk that translates the calling convention.
The widget below shows the same function call under both ABIs side by side. A function render(vec3 pos, float scale, int flags) is called; the visualizer shows where each argument ends up and which register the call clobbers:
09Stack frames in practice
The function-call stack on x86-64 grows downward: rsp decreases on a call or a push, increases on a ret or pop. A typical non-leaf function emits a prologue that allocates a frame, saves the registers it needs to preserve, and ends with an epilogue that reverses the prologue:
my_function: ; โโ Prologue โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ push rbp ; save the previous frame pointer mov rbp, rsp ; rbp now points at our frame base push rbx ; save callee-saved registers we'll use push r12 sub rsp, 0x28 ; 40 bytes of locals; keeps rsp 16-byte aligned ; โโ Body โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ ... ; โโ Epilogue โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ add rsp, 0x28 ; deallocate locals pop r12 ; restore callee-saved registers in reverse pop rbx pop rbp ret
Three rules to remember when reading frames:
- The stack must be 16-byte aligned at every
callsite. Becausecallpushes the 8-byte return address,rspon function entry is always 16k โ 8. The prologue'spush rbpbrings it back to 16k. Subโsequent local allocations must keep the alignment; this is why frames are sized in multiples of 16 with an extra 8 if needed. - The frame pointer is optional on x86-64. With
-fomit-frame-pointer(default at-O1and above in GCC and Clang for many years), the compiler skipspush rbp; mov rbp, rspand addresses locals offrspdirectly. This frees a register but makes stack unwinding harder; profilers and debuggers fall back to DWARF CFI or.eh_frame. Fedora 38 (2023) and Ubuntu 24.04 (2024) reversed their distro defaults and now ship system binaries with frame pointers on, so that production profilers likeperfcan unwind through them cheaply[14]. Apple's AArch64 ABI requires the frame pointer on every non-leaf function[15]. - Leaf functions have no prologue at all. A function that calls nothing else and uses only caller-saved registers (like the
addfrom ยง1) emits a single instruction body and aret. On System V it can also use the 128-byte "red zone" belowrspfor scratch storage without subtracting fromrspat all[6].
The red zone is the System V detail that catches Windows porters: a leaf function can legitimately have local_a at [rsp - 8], with rsp never adjusted. On Windows that area is fair game for any signal handler or kernel transition to clobber, so the Microsoft ABI specifies no red zone and the compiler must sub rsp, ... for any local storage.
10Reading compiler output: a worked vector normalize
A canonical 3-D engine primitive: take a vec3, divide each component by its length, return the unit vector. Source:
struct vec3 { float x, y, z; }; vec3 normalize(vec3 v) { float lenSquared = v.x * v.x + v.y * v.y + v.z * v.z; float invLen = 1.0f / sqrtf(lenSquared); return { v.x * invLen, v.y * invLen, v.z * invLen }; }
With clang -O2 -mavx2 -mfma on System V (illustrated; exact instruction selection varies across versions and surrounding context):
normalize(vec3): ; System V vec3 ABI: xmm0[0]=x, xmm0[1]=y (upper lanes undefined); xmm1[0]=z. vmovshdup xmm2, xmm0 ; xmm2 = {xmm0[1], xmm0[1], โฆ} โ lane 0 holds y vmulss xmm3, xmm1, xmm1 ; xmm3[0] = z*z (vmulss touches only lane 0) vfmadd231ss xmm3, xmm2, xmm2 ; xmm3[0] += y*y vfmadd231ss xmm3, xmm0, xmm0 ; xmm3[0] += x*x โ lenSquared vsqrtss xmm3, xmm3, xmm3 ; xmm3[0] = sqrt(lenSquared) vmovss xmm4, dword ptr [rip + .LC0] ; xmm4[0] = 1.0f from a rodata constant vdivss xmm3, xmm4, xmm3 ; xmm3[0] = invLen vmulss xmm0, xmm0, xmm3 ; xmm0[0] = x*invLen; xmm0[1..] unchanged vmulss xmm2, xmm2, xmm3 ; xmm2[0] = y*invLen vmulss xmm1, xmm1, xmm3 ; xmm1[0] = z*invLen vunpcklps xmm0, xmm0, xmm2 ; xmm0 lane 1 โ xmm2[0] (y*invLen) ret ; return {x', y', ยท, ยท} in xmm0; z' in xmm1[0]
A few things are happening here that aren't obvious from the source. The two-XMM vec3 argument convention is from the System V ABI[6]: an aggregate of three floats has its x,y in xmm0 and z in xmm1. The compiler used vfmadd231ss (fused multiply-add, scalar single-precision) twice to combine the three squared-component sums into one cumulative result; the FMA instruction does a += b*c in one ฮผop with one rounding step instead of two, which is both faster and more accurate[16]. The vsqrtss / vdivss pair is the textbook scalar square root and reciprocal; on hot paths a renderer might replace it with vrsqrtss (approximate reciprocal-square-root, ~11-bit precision) followed by a Newton-Raphson iteration, trading two cycles of latency for some accuracy. The compiler doesn't do that substitution by default because IEEE-754 strict mode requires the rounding behavior of sqrt + div; -ffast-math permits it.
Reading conventions: the v prefix on each mnemonic is the AVX (VEX-encoded) three-operand form. Without AVX, the same operation would be mulss xmm0, xmm0; addss xmm0, xmm3 with the destination always also being a source. Three-operand AVX lets the compiler keep xmm0 unmodified and write the result somewhere else, which usually shortens the dependency chain. The ss suffix means "scalar single-precision," i.e., one float in the low 32 bits of the XMM register, ignoring the other three lanes[17].
11Branches, mispredictions, and the branchless idiom
The CPU does not wait for a conditional branch to resolve before fetching the next instructions. It predicts the outcome of every branch (using a per-branch history table and a global branch history register, both heavily refined over the last twenty years[18]) and speculatively executes the predicted path. When the prediction is right (the usual case for tight loops, monotonic conditions, and any pattern with a stable history), the branch is approximately free. When it is wrong, the CPU rolls back the speculative work and refetches; the cost is ten to twenty cycles on recent Intel and AMD client cores[12].
"Approximately free" is worth measuring. Cloudflare's 2021 microbenchmark of a long chain of predicted unconditional jumps reports steady-state costs of 2 cycles per jmp on an Intel Xeon Gold 6262 (degrading past the ~4096-entry dense BTB capacity), ~3.5 cycles per jmp on Zen 2 (AMD EPYC 7642), and 1 cycle on Apple's M1 when the loop fits in roughly 4 KB of instructions, rising to 3 cycles per jmp once the loop overflows the smaller of the M1's branch-target buffers; never-taken conditional branches stay near 0.3 cycles per block regardless of micro-arch[40]. Two consequences worth keeping. First, "predicted branch is free" is shorthand, not literal: every micro-arch pays one to four cycles even when the prediction is right. Second, the cost is non-linear in the working set: a hot loop that fits inside the BTB and one that just doesn't can run at three- or four-fold different speeds with no source change.
For data-dependent branches that the predictor cannot learn (a random check against a 50/50 input, or a comparison against a key that changes per iteration), the misprediction cost dominates. The remedy is branchless code: replace the conditional with a computation that produces the same result regardless of the predicate, using cmovcc or bit tricks. Take the absolute-value-of-int problem:
; Branchy: easy to read, mispredicts catastrophically on random inputs abs_branchy: test edi, edi jns .Lpos ; jump if not signed (positive) neg edi .Lpos: mov eax, edi ret ; Branchless: same answer, no jump. Cost is constant. abs_branchless: mov eax, edi ; eax = x (the default) neg eax ; eax = -x; flags reflect -x cmovl eax, edi ; if -x < 0 (i.e., x was positive), take the original instead ret ; eax holds |x|; UB for x = INT_MIN, same as the branchy form
The bit-trick form is shorter still: mov eax, edi; cdq; xor eax, edx; sub eax, edx. cdq sign-extends eax into edx, producing an all-ones mask if x is negative and an all-zeros mask otherwise; the xor and sub then compute (x XOR mask) โ mask, which evaluates to x when the mask is zero and to โx when the mask is all-ones. Four instructions, no jump, no condition codes consumed, used throughout the Linux kernel and many engine math libraries[19].
When to use which. Branchless is faster when the predictor cannot learn the pattern. It is slower when the predictor can, because cmov introduces an unconditional data dependency: the result waits for both the source and the predicate, even when the predicate could have been predicted and the dependent computation skipped. The canonical demonstration is the most-upvoted answer on Stack Overflow[20]: summing the elements above a threshold in a 32k-int array runs several times faster when the array is sorted first, because the predictor latches onto the threshold crossover after a handful of iterations and from then on the branch is free. The branchless form gets no such win and, on the sorted input, loses to the branchy form.
The widget animates this. Run the same predicate over an array that's either random or sorted; the branchy version takes constant time on the sorted input and falls off a cliff on the random one:
12SIMD: SSE, AVX, AVX-512
SIMD (Single Instruction, Multiple Data) replaces a loop body that operates on one scalar with one instruction that operates on a vector of values. Every modern x86-64 CPU has SSE2 mandatorily; almost every chip from the last decade has AVX2; recent server and high-end client chips have AVX-512[10]. The mental model:
- An XMM register holds 128 bits = 4 floats = 2 doubles = 16 bytes = 4 int32s.
- A YMM register holds 256 bits = 8 floats = 4 doubles = 32 bytes = 8 int32s.
- A ZMM register holds 512 bits = 16 floats = 8 doubles = 64 bytes = 16 int32s.
- Each value in the register is a lane. SIMD instructions act on every lane in parallel:
vaddps ymm0, ymm1, ymm2adds eight pairs of single-precision floats simultaneously.
The suffix on a SIMD mnemonic names the lane format. ps = packed single (4/8/16 floats), pd = packed double, ss = scalar single (one float, others untouched), sd = scalar double. Integer variants use b/w/d/q for byte/word/dword/qword. vpaddd ymm0, ymm1, ymm2 is eight 32-bit integer adds in parallel[17].
The widget below animates one SIMD operation across a register. vaddps on a YMM register adds two eight-lane vectors in one cycle of throughput. Press play and watch eight independent adds light up at the same time:
What about ARM? AAPCS64 and NEON in 30 seconds.
ARM (AArch64) is the architecture of Apple Silicon, Nintendo Switch, Nintendo Switch 2, every modern Android phone, and the ARM-based AWS Graviton servers. The ISA differs from x86-64 in three big ways: it is fixed-length 32-bit (no variable-length encoding), it is load-store (no read-modify-write on memory operands), and it has 31 general-purpose 64-bit registers (X0โX30, plus a zero register XZR and the stack pointer SP).
The AArch64 Procedure Call Standard[21] passes integer args in X0โX7 and float args in V0โV7. Callee-saved registers are X19โX28. SIMD lives in 32 "V" registers, each 128 bits; instructions like fmla v0.4s, v1.4s, v2.4s are the rough equivalent of vfmadd231ps xmm0, xmm1, xmm2. The SVE/SVE2 vector extensions (vector-length-agnostic; on Apple Silicon and modern ARM server cores) generalize this to longer registers without recompiling.
Most of this tutorial transfers across by mechanical translation: register names change, the instruction syntax changes, but the same dependency chains, port pressure, and ABI bugs apply.
13A worked SIMD example: 4ร4 matrix-vector multiply
The transformation that every renderer applies to every vertex: multiply a 4-component vector by a 4ร4 matrix. Naively, sixteen multiplies and twelve adds (or four dot products of length 4) per vertex. With SSE the same operation is four vmulps and three vaddps on 128-bit vectors. The standard formulation, for a column-major matrix M = [c0 | c1 | c2 | c3] and a vector v = (x, y, z, w):
Each ciยทs is a scalar-times-vector, which in SSE is a broadcast of the scalar across four lanes followed by a multiply. The three additions chain together. Source:
// 16-byte aligned for movaps. col[i] is column i of M. struct alignas(16) mat4 { __m128 col[4]; }; __m128 mul(const mat4& M, __m128 v) { __m128 x = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0)); // broadcast v.x __m128 y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1)); // broadcast v.y __m128 z = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2)); // broadcast v.z __m128 w = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3)); // broadcast v.w __m128 r = _mm_mul_ps(M.col[0], x); // r = c0 * x r = _mm_fmadd_ps(M.col[1], y, r); // r += c1 * y r = _mm_fmadd_ps(M.col[2], z, r); // r += c2 * z r = _mm_fmadd_ps(M.col[3], w, r); // r += c3 * w return r; }
With clang -O3 -mavx2 -mfma, the body is nine instructions plus the return, split into two parallel FMA chains to shorten the critical path:
mul(mat4 const&, __m128): ; rdi = const mat4* (the matrix); xmm0 = v vshufps xmm1, xmm0, xmm0, 0x00 ; xmm1 = {v.x, v.x, v.x, v.x} vshufps xmm2, xmm0, xmm0, 0x55 ; xmm2 = {v.y, v.y, v.y, v.y} vshufps xmm3, xmm0, xmm0, 0xAA ; xmm3 = {v.z, v.z, v.z, v.z} vshufps xmm0, xmm0, xmm0, 0xFF ; xmm0 = {v.w, v.w, v.w, v.w} (last use of v, OK to overwrite) vmulps xmm1, xmm1, [rdi + 0x00] ; xmm1 = c0 * v.x (chain A start) vmulps xmm0, xmm0, [rdi + 0x30] ; xmm0 = c3 * v.w (chain B start) vfmadd231ps xmm1, xmm2, [rdi + 0x10] ; xmm1 += c1 * v.y (chain A: dst = src1*src2 + dst) vfmadd231ps xmm0, xmm3, [rdi + 0x20] ; xmm0 += c2 * v.z (chain B) vaddps xmm0, xmm0, xmm1 ; merge: c0ยทx + c1ยทy + c2ยทz + c3ยทw ret
Two observations worth pulling out. The compiler split the source's single FMA chain into two parallel chains and merged them with a final vaddps: critical-path latency drops from one vmulps + three serial vfmadds (โ16 cycles on Skylake's 4-cycle FMA) to one vmulps, one vfmadd, one vaddps (โ12 cycles), at the cost of one extra instruction. The matrix columns are loaded directly inside the FMA and MUL memory-operand forms ([rdi + 0x10], etc.); on Intel and Zen the load and the compute fuse at the front end into a single macro-op[13], so an explicit vmovaps would be redundant.
What's intentionally missing from this code. The 128-bit (one vector) form processes one vertex at a time. The throughput-tuned form transposes the data to SoA so that one ymm/zmm register holds the same component (x, then y, then z) for many vertices, and a single broadcast-and-FMA transforms 8 vertices at a time with AVX or 16 with AVX-512. The matrix is also typically loaded once and reused across many vertices; this code reloads it every call. On the GPU the same transform is the vertex shader, run massively in parallel, which is why CPU-side vertex transformation today is reserved for skinning, particle simulation, and physics: workloads where the CPU's branchy logic is worth more than the GPU's lane count.
14Intrinsics vs inline assembly vs writing pure asm
Three ways to drop below the C/C++ language barrier, in roughly decreasing order of how often you should use them:
- Compiler intrinsics. Header-defined functions (
<immintrin.h>on x86,<arm_neon.h>on ARM) that compile to a single SIMD instruction each[17]. The compiler still allocates registers, schedules, and inlines. This is the default. Almost every shipping engine math library is intrinsics, not hand assembly. - Inline assembly. GCC/Clang's
asmstatement[22] embeds asm in a C function with constraints describing the inputs, outputs, and clobbers. Useful for instructions with no intrinsic (CPUID, RDTSC, RDPMC, MSR access) and for hand-tuned hot inner loops where the register allocator's choices are hurting. MSVC does not support inline asm in x64 code[23]; the Microsoft replacement is intrinsics or a separate.asmfile compiled byml64. - Pure assembly files.
.son GCC/Clang via GAS,.asmon MSVC via MASM or NASM. Used for the bottom-most platform code: context switches, fiber/coroutine implementations, atomic primitives that predate compiler intrinsics, signal frame mangling. Vanishingly rare in application or game code.
An example of when inline assembly is the right answer. CPUID, used at startup to detect available CPU features, has no C-level mechanism to invoke. GCC's intrinsic __cpuid handles it, but if you need the raw form:
static inline void cpuid(int leaf, int *eax, int *ebx, int *ecx, int *edx) { __asm__ ( "cpuid" : "=a"(*eax), // output: eax โ *eax "=b"(*ebx), // output: ebx โ *ebx "=c"(*ecx), // output: ecx โ *ecx "=d"(*edx) // output: edx โ *edx : "a"(leaf) // input: leaf โ eax ); }
The output constraints "=a", "=b", "=c", "=d" tell the compiler that cpuid writes those four registers and the values should go to the named C variables. The input constraint "a"(leaf) tells the compiler to load leaf into eax before executing the instruction. This is the entire mechanism; everything else is constraint syntax[22].
Two reliable failure modes. Forgetting clobbers. If your inline asm modifies a register that's not in the input or output list, you have to declare it in the clobber list, or the compiler will assume the value it had before the asm is still there afterwards. Mixing inline asm with optimization-fragile patterns. Inline asm is a black box to the optimizer; it inhibits inlining, vectorization, and instruction scheduling across the asm. A tight intrinsic-based loop is almost always faster than the same loop with one inline-asm hot instruction in the middle.
15Atomics and fences at the machine level
A pointer to the Memory Model tutorial, condensed: every std::atomic operation in C++ lowers to a specific machine instruction whose ordering guarantees match the requested memory_order. The mapping on x86-64[24]:
- Relaxed load / store. Plain
mov. x86 TSO already gives load-acquire and store-release ordering for free; relaxed and acquire/release have identical emitted code at the read or write site. - Acquire load. Plain
mov. Same as relaxed on x86 because of TSO. - Release store. Plain
mov. Same. - Sequentially-consistent load. Plain
mov. The seq-cst constraint is enforced by the corresponding seq-cst store, not the load. - Sequentially-consistent store. Either
movfollowed bymfence(GCC's historical choice) orxchg [mem], reg(Clang's choice;xchgwith a memory operand has an implicit LOCK prefix and is a full barrier). Both flush the store buffer before subsequent loads can retire[25]. The two forms are functionally interchangeable; on modern hardwarexchgis typically a few cycles faster. - Atomic RMW (
fetch_add,compare_exchange). ALOCK-prefixed instruction (lock xadd,lock cmpxchg). The LOCK prefix asserts cross-core atomicity on the cache line and acts as a full barrier on x86.
On ARM (AArch64) the lowering is more visible. An acquire load lowers to LDAR (Load-Acquire Register); a release store lowers to STLR (Store-Release Register); since the C++20 fix codified in P0668[24], a seq-cst load is LDAR and a seq-cst store is STLR alone. Pre-P0668 toolchains sometimes added a leading DMB ISH before STR for the seq-cst store, which is heavier but conservative. The Lรช, Pop, Cohen, Zappa Nardelli 2013 paper on the corrected Chase-Lev deque[26] is the standard reference for what the C++ memory model demands of ARM codegen; see the memory model tutorial for the full picture.
16ฮผops, ports, and dependency chains in practice
Earlier we sketched the OOO engine. With the SIMD walkthrough in mind, this is what microarchitectural tuning actually looks like. Take a reduction: summing a million floats.
float sum_serial(const float* a, int n) { float s = 0; for (int i = 0; i < n; ++i) s += a[i]; return s; }
Compiled with -O2 (no fast-math), the inner loop is one addss per element. addss on Skylake has a latency of 4 cycles and a throughput of 2 per cycle[13]. Because each iteration depends on s from the previous iteration, the loop is bound by the latency, not the throughput: one iteration every 4 cycles. The throughput of 2 adds per cycle is unreachable because there is only one chain. Eight ports of execution are mostly idle.
The fix is to break the dependency chain: use multiple accumulators that the compiler can schedule independently. -O3 -ffast-math on a modern compiler will do this for you (or -fassociative-math, which is the specific permission needed), producing something like:
float sum_parallel(const float* a, int n) { __m256 s0 = _mm256_setzero_ps(), s1 = _mm256_setzero_ps(); __m256 s2 = _mm256_setzero_ps(), s3 = _mm256_setzero_ps(); for (int i = 0; i < n; i += 32) { s0 = _mm256_add_ps(s0, _mm256_loadu_ps(a + i + 0)); s1 = _mm256_add_ps(s1, _mm256_loadu_ps(a + i + 8)); s2 = _mm256_add_ps(s2, _mm256_loadu_ps(a + i + 16)); s3 = _mm256_add_ps(s3, _mm256_loadu_ps(a + i + 24)); } __m256 s = _mm256_add_ps(_mm256_add_ps(s0, s1), _mm256_add_ps(s2, s3)); // horizontal sum of s into a scalar; details elided ... }
Four independent accumulators, each a 256-bit vector of eight floats. Per iteration the loop does four 256-bit adds and four 256-bit loads. The four adds run on different dependency chains, which removes the cross-iteration latency that pinned the naive version. The four loads share two L1 load ports. Skylake has two FMA/ADD ports at 256 bits each and two L1 load ports at 256 bits each[13][12], so both halves of the loop need two cycles. Steady-state throughput is one iteration every two cycles: 16 floats per cycle. Compared to the naive loop's 0.25 floats per cycle (one element per 4-cycle addss latency), that is a 64ร speedup, decomposed as 8ร from the SIMD width and 8ร from breaking the latency chain. The loop is now L1-load-bandwidth-bound; the bottleneck moves to L2 once the working set exceeds the L1 footprint, and to DRAM once it exceeds L2.
Three tools to verify this analysis without re-running the program. llvm-mca[27] takes a chunk of assembly and reports the expected steady-state IPC and per-port pressure. uica.uops.info[28] does the same with a richer microarchitectural model. perf stat -e cycles,instructions,branches,branch-misses,... on Linux gives the actual run-time numbers from CPU performance counters. The three together cover prediction, simulation, and measurement.
17Pitfalls
- Mixing legacy SSE and VEX-encoded AVX. Switching from non-VEX SSE (
movaps xmm0, ...) into VEX AVX (vmulps ymm0, ...) without an interveningvzerouppercan incur a state-save penalty of tens of cycles on Skylake-class cores[3]. The compiler emitsvzeroupperat AVX/SSE boundaries; hand-written code must too. - Assuming x86 latency is the cost. Latency only matters on the critical path. A ฮผop with latency 6 and throughput 2-per-cycle can issue every 0.5 cycles when its inputs are independent; the same ฮผop costs 6 cycles per iteration when chained. Read both numbers from uops.info[13], not just the latency.
- Partial-register stalls. Writing
aland then readingraxused to stall on Pentium 4 and Sandy Bridge; on modern cores it's usually free, but it depends on the specific micro-arch. Treatmovzx eax, alas the safe explicit form when bridging width. - Clobbering a callee-saved register and not restoring it. The ABI tables in ยง8 are the contract; ignoring them produces bugs that surface as data corruption many calls away. The compiler's stack-protector and stack-usage warnings (
-fstack-protector-strong,-Wstack-usage) catch some adjacent failures, but the real defenses are explicit clobber lists in inline asm and reviewing the disassembly of the prologue/epilogue. - Hand-aligning data when the compiler already did.
alignas(16)in C++ and thealignedattribute in C produce the same memory layout the compiler is going to assume for SSE loads. Hand-aligning by adding pad fields by hand is a constant source of bugs after struct layout changes. - Optimizing inner loops the compiler already optimized. Before writing intrinsics, look at the compiler's output at
-O3with-march=native. Modern autovectorizers handle the standard patterns (reductions, maps, prefix scans). Hand-rolled intrinsics are right when the compiler bailed out on the pattern, not when it didn't. - RDTSC isn't a clock. It's a free-running TSC that on modern chips runs at a fixed (often the nominal) frequency regardless of dynamic frequency scaling[29]. Don't use it to measure wall-clock time. Don't use it across cores without serializing. Use
std::chrono::steady_clockfor wall time andrdpmcwith a privileged setup for cycle counting.
18What's next
Where to go from here:
- Read your own engine's hot paths. Pick the math, physics, animation, and renderer-glue libraries you ship. Build a release binary with debug info.
objdump -d --disassembler-options=intel-mnemonicon Linux,dumpbin /disasmon MSVC, or the Compiler Explorer[1] "load my source" workflow. Look at what your hot inline functions actually compile to. - Sit with Agner Fog's manuals.[5] The five PDFs at agner.org/optimize are the practitioner reference: optimization manuals 1โ4 cover the C, asm, micro-arch, and instruction-tables material; manual 5 is calling conventions. Free, no DRM, updated through 2024.
- Run llvm-mca and uica on the loops you write. Both are free; both take a small region of assembly and predict throughput and port pressure. Use them before and after intrinsic refactors to confirm the change you intended is the change the model sees.
- Read a real ABI document end-to-end. The System V AMD64 ABI[6] is sixty pages and gives a strikingly different mental model of "what a function call is" than the C language does. Aggregate passing rules, classification of struct fields, and the variadic-function ABI are all worth a slow read.
- Pair with the Memory Model tutorial. The atomic-instruction lowering from ยง15 is the bridge: once you've read the memory model tutorial, the
lock cmpxchg16bin a Chase-Lev deque and thevzeroupperat an AVX boundary are the same kind of thing: implementation details the language hides until you read the assembly.
19Sources & further reading
Numbered citations refer to the superscripts above. Most are freely available; a few are vendor manuals that require a no-cost registration.
The prose, code samples, CSS, and interactive widgets on this page are original writing. The instruction-encoding decomposition follows the Intel SDM Volume 2 chapter on instruction format [10]. The two ABI tables in ยง8 are a side-by-side condensation of the System V AMD64 ABI [6] and the Microsoft x64 calling convention reference [7]; consult them before relying on any field. The branchy-vs-branchless numerics in ยง11 trace to Bruce Dawson's analysis and the canonical Stack Overflow demonstration [20]; the bit-trick form of abs is from Hacker's Delight ยง2-4 [19]. The C++ atomic lowering table in ยง15 matches the mapping documented at [24].
- Godbolt, M. Compiler Explorer. godbolt.org. The interactive compile-and-disassemble tool used in every example in this tutorial; mature support for GCC, Clang, MSVC, ICX, and a long tail of other compilers.
-
Lengyel, E. (2011). Mathematics for 3D Game Programming and Computer Graphics (3rd ed.). Cengage. The standard reference for the vector and matrix routines that ship in every renderer; SSE versions of
normalize,cross, andmatmulin Chapter 4. - Intel Corporation. Intelยฎ 64 and IA-32 Architectures Optimization Reference Manual. Order Number 248966. intel.com. The normative source for Intel microarchitecture details: pipeline depth, ฮผop fusion, AVX state transitions, the SSE-AVX transition penalty.
- Advanced Micro Devices. Software Optimization Guide for AMD Family 19h Processors (Zen 3 and Zen 4). Publication #56665 / #57487. amd.com. AMD's counterpart to the Intel optimization manual; the canonical Zen 3/Zen 4 latency and throughput numbers.
- Fog, A. Software optimization resources. agner.org/optimize. Five PDFs: optimizing C++, optimizing asm, microarchitecture of Intel/AMD/VIA CPUs, instruction tables, and calling conventions. Updated as recently as 2024. Free; no registration.
- Matz, M., Hubiฤka, J., Jaeger, A., & Mitchell, M. System V Application Binary Interface, AMD64 Architecture Processor Supplement. gitlab.com/x86-psABIs/x86-64-ABI. The normative document for Linux, macOS, the BSDs, and the PlayStation console toolchains. Continuously revised; the LaTeX source is the canonical version.
- Microsoft. x64 calling convention. learn.microsoft.com. The Microsoft x64 ABI reference: argument registers, callee/caller-saved, shadow space, struct passing rules.
- Advanced Micro Devices. AMD64 Architecture Programmer's Manual, Volume 1: Application Programming. Publication #24592. amd.com. The original AMD64 architecture spec; describes the REX prefix, the 64-bit operand size rules, and the zero-extension behavior of 32-bit register writes.
- Intel Corporation. (1979). iAPX 86/88, 186/188 User's Manual. The original 8086 reference. Historical only; reproduced at archive.org.
- Intel Corporation. Intelยฎ 64 and IA-32 Architectures Software Developer's Manual. Order Numbers 253665โ253669. intel.com. Five volumes; Volume 1 is the architecture overview, Volume 2 is the instruction set reference (the encoding tables), Volume 3 covers system programming.
- Intel Corporation. (2023). Intelยฎ Advanced Performance Extensions (Intelยฎ APX) Architecture Specification. intel.com. The proposed extension to 32 GPRs and three-operand encoding for legacy integer instructions.
- Fog, A. The microarchitecture of Intel, AMD and VIA CPUs. Manual 3 of the agner.org/optimize series. Periodically updated; the 2024 edition covers Zen 4, Raptor Lake, and Arrow Lake.
- Abel, A., & Reineke, J. (2019). uops.info: Characterizing Latency, Throughput, and Port Usage of Instructions on Intel Microarchitectures. ASPLOS. Project home: uops.info. Machine-measured latency, throughput, and port-usage tables for nearly every x86-64 instruction on every recent micro-architecture.
-
Mรถller, T. (2021). Frame pointers: why you should keep them. Brendan Gregg's writeup on the cost of omitting frame pointers for production profiling. brendangregg.com. Covers why Fedora 38 and Ubuntu 24.04 re-enabled
-fno-omit-frame-pointerby default. - Apple. (2024). Writing ARM64 code for Apple platforms. developer.apple.com. Apple's ARM64 platform conventions; covers the frame pointer requirement, which is stricter than the AAPCS64 baseline.
- Muller, J.-M., Brisebarre, N., de Dinechin, F., Jeannerod, C.-P., Lefรจvre, V., Melquiond, G., Revol, N., Stehlรฉ, D., & Torres, S. (2018). Handbook of Floating-Point Arithmetic (2nd ed.). Birkhรคuser. The reference for FMA rounding semantics: a single rounding step after the multiply-then-add, instead of two.
-
Intel Corporation. Intelยฎ Intrinsics Guide. intel.com/intrinsics-guide. The searchable reference for every intrinsic in the
<immintrin.h>family, with mnemonic, encoding, latency, and a per-architecture availability badge. - Yeh, T.-Y., & Patt, Y. N. (1991). Two-Level Adaptive Training Branch Prediction. MICRO. PDF. The two-level predictor that became the basis of every shipping CPU's branch predictor; modern designs add path-history, TAGE, perceptron predictors, and indirect-target tables on top.
- Warren, H. S. (2012). Hacker's Delight (2nd ed.). Addison-Wesley. The canonical reference for bit-twiddling: branchless absolute value, sign extension, population count, etc. ยง2-4 has the absolute-value bit trick used in ยง11.
- Stack Overflow. Why is processing a sorted array faster than processing an unsorted array? (2012). stackoverflow.com. The most-upvoted Stack Overflow answer ever; canonical demonstration of branch-predictor effects in C++.
- ARM Limited. Procedure Call Standard for the Armยฎ 64-bit Architecture (AArch64). Document IHI 0055. github.com/ARM-software/abi-aa. The AArch64 ABI: argument registers, callee-saved, parameter passing rules, the NEON vector ABI.
- Free Software Foundation. Extended Asm โ Assembler Instructions with C Expression Operands. GCC manual. gcc.gnu.org. The reference for GCC inline assembly constraints; Clang follows the same spec for compatibility.
-
Microsoft. Inline assembler. learn.microsoft.com. The MSVC documentation; states explicitly that inline asm is not supported in x64 or ARM64 code, with the redirect to intrinsics and external
.asmvia MASM. - Boehm, H.-J., & Giroux, O. (2018). P0668R5 โ Revising the C++ memory model. WG21. open-std.org. The proposal that tightened seq_cst ordering on weak hardware; includes the canonical x86 and ARM lowering tables for every memory order. Also useful: Mapping C/C++ to processors, cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html.
- Intel Corporation. Intelยฎ 64 and IA-32 Architectures Software Developer's Manual, Volume 3A, ยง8.2 (Memory Ordering), ยง8.3 (Serializing Instructions). The normative source on MFENCE, LFENCE, SFENCE, and the LOCK prefix as full barriers.
- Lรช, N. M., Pop, A., Cohen, A., & Zappa Nardelli, F. (2013). Correct and Efficient Work-Stealing for Weak Memory Models. PPoPP. PDF. The paper that corrected the 2005 Chase-Lev deque with the right C11 acquire/release ordering for ARM and POWER.
- LLVM Project. llvm-mca โ LLVM Machine Code Analyzer. llvm.org. A static throughput analyzer that takes a region of assembly and reports per-port pressure and steady-state IPC; ships with LLVM.
- Abel, A., & Reineke, J. (2022). uiCA: Accurate Throughput Prediction of Basic Blocks on Recent Intel Microarchitectures. ICS. Project home: uica.uops.info. A more recent simulator with a richer model of the front end, ฮผop cache, and back-end ports.
- Paoloni, G. (2010). How to Benchmark Code Execution Times on Intelยฎ IA-32 and IA-64 Instruction Set Architectures. Intel white paper. The reference for the right way to use RDTSC and RDTSCP for cycle-level benchmarking, including the serialization caveats.
- Hennessy, J. L., & Patterson, D. A. (2017). Computer Architecture: A Quantitative Approach (6th ed.). Morgan Kaufmann. The textbook on pipelining, OOO execution, and cache hierarchies. Chapters 2 and 3 are the long-form version of ยง3 of this tutorial.
- Drepper, U. (2007). What Every Programmer Should Know About Memory. PDF. The long-form treatment of the memory hierarchy, prefetch, and cache effects; still the clearest single document on the topic.
- Giesen, F. (Ryg). The ryg blog. fgiesen.wordpress.com. Practitioner-grade writeups on SIMD, codec implementation, and the pixel pipeline; the closest thing the field has to a working game-programmer's blog on this material.
- Muลa, W. Practical SIMD and bit-twiddling notes. 0x80.pl. Hundreds of microbenchmarks and worked examples for SSE, AVX2, AVX-512, NEON. The reference for "how do I vectorize this small thing?"
- Dawson, B. Random ASCII. randomascii.wordpress.com. Long-running practitioner blog from a Valve / Microsoft profiling engineer; the source for many of the canonical "what does this assembly cost on Skylake" investigations.
- Lemire, D. Daniel Lemire's blog. lemire.me/blog. Microbenchmarks, SIMD analyses, and a steady stream of evidence-based posts on modern x86 performance.
- Patterson, D. A., & Waterman, A. (2017). The RISC-V Reader: An Open Architecture Atlas. Strawberry Canyon. A short, accessible alternative to x86 for understanding how a simpler ISA decodes; useful background even if you'll never ship to RISC-V.
-
Intel Corporation. Intelยฎ VTuneโข Profiler User Guide. intel.com. The reference profiler for Intel hardware; reads the same performance counters
perfdoes, with a richer microarchitectural-event taxonomy. - Bendersky, E. Eli Bendersky's website. eli.thegreenplace.net. Long-form articles on linkers, ELF, position-independent code, and how compilers actually generate the assembly you read. Strong on the loader and dynamic-linking side that this tutorial skips.
- Hyde, R. (2003). The Art of Assembly Language. No Starch Press. The classic intro book to assembly; uses HLA, but Part III is x86 assembly proper. Free online at plantation-productions.com.
- Majkowski, M. (2021). Branch predictor: How many "if"s are too many? Cloudflare blog. blog.cloudflare.com. Microbenchmarks of long predicted-branch chains on Intel Xeon Gold 6262, AMD EPYC 7642 (Zen 2), and Apple M1, with reproducible code. Source for the per-architecture per-jmp cycle numbers and the BTB-capacity-cliff behavior cited in ยง11.