All tutorials Mighty Professional
Tutorial 09 ยท Engine Programming

x86-64 Assembly
from Scratch

Sixteen general-purpose registers. Sixteen 256-bit vector registers since AVX, thirty-two 512-bit ones on AVX-512 hardware. An instruction encoding that ranges from one byte to fifteen, a calling convention that disagrees between Linux and Windows, and a microarchitecture that turns your serial-looking code into a wide, speculative dispatch across eight or more execution ports. The view from the assembly layer is what tells you why your inner loop runs at the speed it does. We work from the CPU model up to a SIMD matrix-vector kernel and a ฮผop scheduling demo, with every claim cited to a vendor manual or a refereed paper.

Time~65 min LevelEngine programmer, intermediate to senior PrereqsYou can read C or C++ comfortably. You know what a stack frame and a function pointer are. The Memory Model tutorial pairs naturally with ยง15 of this one. HardwareAny x86-64 machine from the last decade

01Why assembly still matters when nobody writes it

The honest answer is that very few production games ship hand-written assembly. The compiler is usually within a few percent of optimal on hot paths, and a single ABI mismatch in hand-written code can corrupt a callee-saved register in a way that crashes three frames later. The reason to learn assembly in 2026 is not to write it. It is to read it. The compiler's output is the ground truth for what your CPU is being asked to do, and reading it answers questions that source-level reasoning can't.

Concrete situations where the disassembly is the only authoritative source:

Every section of this tutorial answers a question that has surfaced in real engine work. The history is short, the CPU model is the standard mental picture a shipping engine programmer needs, and the SIMD walkthrough in ยง13 is a variant of the matrix-vector transform almost every renderer ships[2].

What you'll have by the end

A working ability to read AT&T and Intel syntax x86-64 disassembly. Concrete knowledge of the System V AMD64 ABI (Linux, macOS, BSD, PlayStation) and the Microsoft x64 ABI (Windows, Xbox) and where they differ. The instruction encoding well enough to read a hex dump as instructions. The SIMD vocabulary (lanes, packed/scalar, SSE/AVX/AVX-512) and what a 4ร—4 matrix-vector multiply looks like at the assembly level. The microarchitectural vocabulary (ฮผops, ports, dependency chains, retirement) that the Intel and AMD optimization manuals[3][4] and Agner Fog's manuals[5] are written in. Six live, in-browser widgets you can step through.

A tiny disassembly to set the tone

Two lines of C. Compile with gcc -O2 on x86-64 Linux:

add.c
int add(int a, int b) {
  return a + b;
}

The compiler emits:

add.s ยท Intel syntax, AT&T-style comments
add:
  lea   eax, [rdi + rsi]   ; eax = rdi + rsi (load-effective-address used as 3-op add)
  ret                      ; pop return address from [rsp], jump to it

Four lessons hiding in two instructions. First, the function arguments arrived in rdi and rsi because the System V ABI[6] puts the first two integer arguments there; the same code on Windows would have used rcx and rdx[7]. Second, the return value goes in rax (32 bits of which is eax). Third, the compiler used lea (Load Effective Address) as a three-operand non-destructive add: lea is the closest thing x86 has to an ARM-style add dst, src1, src2, and the compiler reaches for it constantly. Fourth, there is no stack frame at all, because nothing here required spilling state and the function is a leaf.

The rest of this tutorial unpacks each of those four observations from first principles. By the end you will be able to look at a fifty-line dump and tell where the register allocator gave up, where a function was inlined, and which loop the compiler vectorized.

02A short history of the architecture you're reading

x86-64 is the carrying state of nearly fifty years of architectural decisions, each made with the previous decade of code in mind. The instruction encoding has prefixes whose purpose is to extend an earlier 16-bit encoding. The calling conventions changed because the register count tripled. The vector extensions were added in four named generations. None of this is obvious from the documentation; each piece makes sense only against what it replaced.

1978
Intel 8086.[9] A 16-bit CISC processor with eight named 16-bit registers (AX, BX, CX, DX, SI, DI, BP, SP) and a segmented address space. The instruction encoding is the basis of every x86 CPU since: most instructions are still recognized in their 1978 form, and the assembler mnemonics are mostly unchanged.
1985
Intel 80386. 32-bit registers (EAX, EBX, โ€ฆ), the flat-32 address space that displaced segmentation in practice, and protected mode. Every register from the 8086 was widened to 32 bits with an "E" prefix, and the encoding was extended with a prefix byte to choose between 16-bit and 32-bit operand sizes. The pattern of "extend by a prefix, keep the old encoding intact" set the precedent for everything that followed.
1997
MMX. Intel's first SIMD extension: 64-bit integer vectors aliased onto the FPU registers. Mostly abandoned by 2005; the AMD64 instruction set deprecated MMX-only code paths and the SSE family replaced it.
1999
SSE (Streaming SIMD Extensions). Eight new 128-bit registers (XMM0โ€“XMM7), packed-single-precision-float arithmetic, and the first explicit prefetch instructions[10]. SSE became the floating-point baseline; AMD64 made SSE2 (added 128-bit integer ops and packed doubles) part of the mandatory ISA, which is why every x86-64 compiler emits XMM-based floating point and the 1985-era x87 FPU stack is reserved for legacy paths.
2003
AMD64 (x86-64).[8] AMD's extension to 64-bit, debuting on the K8 Opteron. Sixteen general-purpose registers (R8โ€“R15 added, the eight 8086-derived registers widened to RAX, RBX, โ€ฆ). Sixteen XMM registers. A new REX prefix encodes the extended registers and the 64-bit operand size. RIP-relative addressing replaces the segment-based addressing that 32-bit position-independent code used. Intel adopted AMD's design the following year on the Prescott-based Xeon (under the name "Intel 64", originally "EM64T").
2008
SSE4.2. String/CRC instructions, the family the C++ standard library still uses for std::find on integer ranges and for std::hash when SSE4.2 is available.
2011
AVX. 256-bit YMM registers (the lower 128 bits alias XMM), a new three-operand VEX encoding[10], and the separately-introduced FMA3 fused-multiply-add extension that follows on Haswell (2013). AMD's Jaguar (PS4/Xbox One, 2013) brought AVX (not AVX2) to consoles; PS5 and Xbox Series X/S (AMD Zen 2, 2020) brought AVX2. AVX2 is the practical SIMD baseline most current PC and console titles target.
2016โ€“2022
AVX-512. 512-bit ZMM registers, thirty-two of them, with separate mask registers (K0โ€“K7). First shipped on the Knights Landing Xeon Phi (2016) and the Skylake-X high-end desktop (2017); reached client mobile on Ice Lake (2019). Intel disabled it on client P-cores from Alder Lake (2021) through Arrow Lake (2024) once the chips shipped with E-cores that lack the unit. AMD shipped AVX-512 starting with Zen 4 (2022) and continued it on Zen 5. AVX-512 is the only x86 vector instruction set with first-class predication via the K-mask registers, which is part of why it remains worth writing by hand when the workload is in cache.
2023
APX (Advanced Performance Extensions, announced).[11] Intel's proposal to add sixteen more general-purpose registers (R16โ€“R31), three-operand encoding for legacy integer instructions, conditional loads and stores. Shipping silicon expected mid-decade; mentioned here only so the register-count number stays current.

The consequence for what you read in 2026: most disassembly is plain x86-64 with SSE2 floating point. Hot paths use AVX2 (256-bit) on PC and on PS5 / Xbox Series. AVX-512 shows up in compression, simulation, and ML kernels; rarely in shipped game code on consumer hardware because of the Intel client-side gap from Alder Lake through Arrow Lake. ARM, which we touch briefly in ยง13, is the other architecture worth knowing: Nintendo Switch and Switch 2, Apple Silicon, and a growing fraction of Windows-on-Arm devices all ship it, and the calling conventions, register file, and SIMD model (NEON, SVE2) follow different rules.

Intel syntax vs AT&T syntax: which is which, and which should I learn?

Two notational conventions exist for x86 assembly. The instructions are the same; the surface syntax differs:

Intel syntax writes mov eax, ebx with destination first. No register sigils. Memory references look like [rdi + 8]. Used by NASM, MASM, the Intel and AMD reference manuals[10], and Compiler Explorer's default. Also the syntax GCC and Clang emit if you pass -masm=intel.

AT&T syntax writes movl %ebx, %eax with source first. Registers prefixed with %, immediates with $, and a suffix on the mnemonic encodes the operand size (l = long = 32 bits). Memory references look like 8(%rdi). Used by the GNU assembler (as) and the default output of objdump on Linux. Inherited from the AT&T Unix assembler convention from the early 1980s.

Read both. Intel syntax matches the vendor manuals; AT&T is what you'll see in objdump -d output from a Linux build. This tutorial uses Intel syntax in code listings and notes AT&T differences where they matter.

03The CPU you're actually programming for

An assembly instruction is not what executes. A modern x86 core decodes each instruction into one or more , schedules those ฮผops across multiple execution units in parallel, and retires them in program order at the back end[3]. The instruction stream you read is the contract; what runs is a heavily reordered, register-renamed, speculatively-executed shadow of it. This section is the sketch of that machine, enough to make sense of why some instructions are "free" and others stall a whole pipeline.

A modern x86 pipeline, simplified to the parts you care about reading assembly for:

The two practical consequences for reading assembly. First, an instruction's latency (cycles from issue to result available) is different from its throughput (number of times per cycle the CPU can issue it). A vmulps on Skylake has a latency of 4 cycles but a throughput of two per cycle[13]: a chain of eight dependent multiplies takes 8 ร— 4 = 32 cycles; eight independent ones issue in 4 cycles and the last result lands 4 cycles after that, on the order of 8 cycles total. Same instruction count, ~4ร— difference in finish time. Second, the difference between a "fast" and "slow" implementation of the same algorithm at the assembly level is often not the instruction count; it is whether the instructions form a serial dependency chain pinning everything to one port, or whether they spread out across the execution ports the CPU has on offer.

The widget below schedules a simple loop's ฮผops onto four ALU ports and watches what stalls and what doesn't. We come back to this picture in ยง16 with a real worked example:

Live ยท Port pressure scheduler
Sixteen add ฮผops scheduled on a four-port integer ALU, modeled after Skylake's ports 0/1/5/6 (each accepts one integer add per cycle[13]). In serial mode each add waits for the previous one (one chain of dependent adds; 16 cycles). In parallel mode the same sixteen adds use four independent accumulators (one chain per port; 4 cycles). Same instruction count, same arithmetic, a 4ร— difference in finish time.

04The register file

x86-64 exposes sixteen general-purpose 64-bit registers and sixteen (with AVX-512: thirty-two) vector registers. Knowing the names is half of reading a disassembly:

64-bit32-bit16-bit8-bit lowConventional role (System V)
raxeaxaxalreturn value (int); scratch
rbxebxbxblcallee-saved
rcxecxcxcl4th integer arg
rdxedxdxdl3rd integer arg; high half of 128-bit return
rsiesisisil2nd integer arg
rdiedididil1st integer arg
rbpebpbpbplframe pointer; callee-saved
rspespspsplstack pointer
r8r8dr8wr8b5th integer arg
r9r9dr9wr9b6th integer arg
r10โ€“r11โ€ฆscratch
r12โ€“r15โ€ฆcallee-saved

Writes to a 32-bit register zero-extend into the 64-bit register. Writes to an 8-bit or 16-bit register do not: they leave the upper bits unchanged. This is an AMD64 spec rule[8], not a compiler convention, and it is the reason compilers emit xor eax, eax to zero the full rax: the 32-bit write zero-extends and the encoding is one byte shorter than mov rax, 0.

Vector registers come in three sizes that alias each other:

The aliasing is not free. On Skylake-class Intel cores, a naive mix of legacy-encoded SSE (xmm) and VEX-encoded AVX (ymm) without an intervening vzeroupper can incur a state-save penalty of tens of cycles per transition[3]. Sunny Cove and later replace the hard stall with a smaller, more amortized cost, but the rule the compiler follows hasn't changed: emit vzeroupper at AVX-to-SSE boundaries. Inline assembly that mixes the two has to do the same.

Step through eight instructions and watch the register file update. Each row is one 64-bit register; columns are the high and low 32 bits. The arrow shows the current instruction; orange highlights the just-written register:

Live ยท Register-file stepper
A small program: load two integers, compute (a + b) * 2 via an lea trick, store to memory, then return. Watch the partial-write rule: when the code writes eax, the upper 32 bits of rax zero out. When it writes al, the upper 56 bits of rax are untouched.

05Anatomy of an instruction

An x86-64 instruction is between 1 and 15 bytes. Its decoded form is a sequence of optional and mandatory fields, in this order[10]:

  1. Legacy prefixes (0โ€“4 bytes). Address-size override, operand-size override, segment override, repeat prefix, LOCK.
  2. REX prefix (0โ€“1 byte). Required to encode the 64-bit operand size, registers R8โ€“R15, or the SIL/DIL/BPL/SPL byte registers. Starts with the high nibble 0x4; the low nibble carries four bits W, R, X, B.
  3. Opcode (1โ€“3 bytes). Selects the instruction. Sometimes part of the opcode encodes a register too (the "+r" forms).
  4. ModRM (0โ€“1 byte). For instructions with operands, encodes the addressing mode and one or two register fields.
  5. SIB (0โ€“1 byte). Scale/Index/Base for the memory addressing modes that need it ([rbx + rcx*4 + 0x10]).
  6. Displacement (0, 1, 2, or 4 bytes). The constant offset in a memory operand.
  7. Immediate (0, 1, 2, 4, or 8 bytes). A literal value baked into the instruction.

AVX adds two more prefix families that replace REX and pack a "non-destructive source" field, letting vaddps zmm0, zmm1, zmm2 compute zmm0 = zmm1 + zmm2 without clobbering either source. The VEX prefix (2 or 3 bytes) covers AVX/AVX2; the EVEX prefix (4 bytes) covers AVX-512 and adds the mask register selector and rounding controls.

Click an instruction below to see its bytes broken out. Most production disassemblers can show this view (objdump -d -M intel --show-raw-insn, llvm-mc --show-encoding); seeing it once builds the muscle memory:

Live ยท Instruction-encoding decoder
Each instruction's encoding is from the Intel SDM Volume 2 instruction reference[10]. Real instruction encoders can pick different (equally valid) encodings of the same instruction; for example, add rax, rbx can be encoded with the destination in the ModRM reg field or in the ModRM r/m field. We show the canonical form GCC emits.
What's a ModRM byte actually?

One byte split into three fields: mod (top 2 bits), reg (middle 3 bits), r/m (bottom 3 bits). The reg field names a register; the r/m field names either a register or a memory operand depending on what mod says. A few examples:

mod=11 means "r/m is a register" (so the instruction has two register operands). mod=00 means "r/m is a memory operand with no displacement" (e.g., [rax]). mod=01 and mod=10 add an 8-bit or 32-bit displacement. r/m=100 is a sentinel meaning "a SIB byte follows," used for the indexed addressing modes like [rax + rcx*4].

The reg and r/m fields are 3 bits each. To name the 16 x86-64 registers you need 4 bits; the missing bit comes from the REX prefix's R and B bits. That's why REX is required whenever your instruction touches R8โ€“R15.

06The core instructions, by frequency

Recent x86-64 has on the order of a thousand distinct mnemonics, with several thousand encoded variants when you count operand sizes and addressing modes[13]. The fifteen or so in the table below account for the overwhelming majority of any non-vector compiler output. Sorted roughly by how often they appear:

MnemonicWhat it doesForm you'll usually see
movCopy bits. Register-to-register, register-to-memory, memory-to-register, immediate-to-register. The basic data movement instruction.mov rax, [rdi + 8]
leaCompute an address (or any 3-operand arithmetic that fits the addressing modes), don't dereference. See ยง7.lea rax, [rdi + rsi*4]
add / subInteger add/subtract. Two-operand: dst = dst op src.add rsp, 0x28
imulSigned multiply. Most often seen as two-operand (imul dst, src โ†’ dst *= src) or three-operand with an immediate (imul dst, src, imm โ†’ dst = src ร— imm).imul rax, rcx
shl / shr / sarBit shift left, logical right, arithmetic right. sar preserves the sign bit; shr doesn't.shl rax, 3
and / or / xorBitwise ops. xor reg, reg is the canonical zeroing idiom.xor eax, eax
cmp / testSet the flags register. cmp a, b = subtract without storing; test a, b = AND without storing.cmp rax, 0
jccConditional jump. je = jump if equal (ZF=1), jne, jl, jg, jb, ja (signed vs unsigned). Follows a cmp or test.jne loop
jmpUnconditional jump.jmp .L7
call / retFunction call and return. call pushes the return address and jumps; ret pops and jumps.call malloc
push / popDecrement RSP by 8 and store; or load and increment by 8. Used in function prologues/epilogues for callee-saved registers.push rbp
cmovccConditional move. cmovge dst, src = if SF=OF then dst=src. Branchless conditionals; see ยง11.cmovl rax, rdi
setccSet a byte to 1 if a condition holds, else 0. Often used with movzx to materialize a 0/1 in a register.setl al
movzx / movsxMove with zero-extension or sign-extension. Bridge between 8/16-bit and 32/64-bit registers.movzx eax, byte ptr [rdi]

The cc suffix on jumps, conditional moves, and setcc is one of sixteen condition codes that test the flags register set by the most recent cmp, test, or arithmetic instruction: z/e (zero/equal), nz/ne, l/g (signed less/greater), b/a (unsigned below/above), s (negative), o (overflow), and the rest. A jcc reads the flags and conditionally jumps; a cmovcc reads the flags and conditionally moves; a setcc reads the flags and conditionally writes 1 to a byte[10].

Memory addressing supports one form, used by mov, lea, and most others: [base + index*scale + displacement], where base and index are 64-bit registers, scale is 1/2/4/8, and displacement is a signed 8 or 32-bit constant. Any subset can be omitted. RIP-relative addressing ([rip + offset]) is the special case used for position-independent globals on x86-64.

07The lea swiss army knife

lea (Load Effective Address) computes a memory address and writes it to a register without dereferencing. Because the addressing mode is the same as the one mov uses, lea can be used as a fast three-operand arithmetic instruction: any value that fits the base + index*scale + displacement form can be computed in one instruction without clobbering the source registers and without touching the flags. The compiler uses this constantly:

lea_tricks.s
; a + b without an add, and without trashing flags or either input
lea  rax, [rdi + rsi]

; 5 * x in one instruction (4*x + x)
lea  rax, [rdi + rdi*4]

; 9 * x  โ†’  (8*x + x), via lea with scale=8
lea  rax, [rdi + rdi*8]

; Address arithmetic: &array[index] for an array of 4-byte ints
lea  rax, [rdi + rsi*4]

; Combined: 3*x + 7 in one instruction
lea  rax, [rdi + rdi*2 + 7]

Things to know. lea does not read memory, so the address it computes does not need to be valid. lea does not set flags, which makes it useful in the middle of a chain of conditional code where you can't afford to clobber them. The simple two-component form ([base + index*scale] or [base + disp]) is one cycle of latency and several per cycle of throughput on modern Intel and AMD[13]. The three-component form ([base + index*scale + disp] with a non-unit scale) is the "complex LEA" on Sandy Bridge through Skylake: routed through port 1 only, with three-cycle latency[12]. Ice Lake and later collapse most of the gap. A practical consequence is that the compiler will sometimes prefer shl + add over lea for arithmetic that needs to be on the critical path of a Skylake-class part.

08Calling conventions: System V vs Windows x64

Every function call across the public ABI on x86-64 follows one of two specifications: the System V AMD64 ABI[6] (Linux, macOS, the BSDs, PlayStation 4, PlayStation 5) or the Microsoft x64 ABI[7] (Windows, Xbox One, Xbox Series X/S). They disagree on which registers carry arguments, which are callee-saved, how struct returns work, and how the stack is aligned at the call site:

System V AMD64
Linux ยท macOS ยท BSD ยท PS4/PS5
Int args (in order)rdi, rsi, rdx, rcx, r8, r9
Float argsxmm0โ€“xmm7
Int returnrax (high half: rdx)
Float returnxmm0 (high half: xmm1)
Caller-savedrax, rcx, rdx, rsi, rdi, r8โ€“r11, all xmm*
Callee-savedrbx, rbp, r12โ€“r15, rsp
Stack alignment at call16-byte (so rsp = 16k โˆ’ 8 on entry)
Red zone128 bytes below rsp usable without adjusting
Shadow spaceNone
Microsoft x64
Windows ยท Xbox (MSVC toolchain)
Int args (in order)rcx, rdx, r8, r9
Float argsxmm0โ€“xmm3
Int returnrax
Float returnxmm0
Caller-savedrax, rcx, rdx, r8โ€“r11, xmm0โ€“xmm5
Callee-savedrbx, rbp, rdi, rsi, rsp, r12โ€“r15, xmm6โ€“xmm15
Stack alignment at call16-byte (same)
Red zoneNone
Shadow space32 bytes the caller must allocate above the args

Two differences cause most ABI bugs in practice. First, rdi and rsi are callee-saved on Windows but caller-saved (argument registers) on System V. Code that hand-codes assembly without restoring rdi and rsi works on Linux and silently corrupts state on Windows[7]. Second, Windows requires the caller to allocate a 32-byte "shadow space" above the four register arguments so the callee can spill them. A function that calls into Windows code from System V (or vice versa) needs a thunk that translates the calling convention.

The widget below shows the same function call under both ABIs side by side. A function render(vec3 pos, float scale, int flags) is called; the visualizer shows where each argument ends up and which register the call clobbers:

Live ยท ABI side-by-side
A vec3 argument is more interesting than it looks. System V passes a 3ร—float aggregate that fits in two XMM registers (xmm0 for x,y; xmm1 for z). Microsoft x64 passes any aggregate larger than 8 bytes by reference: the caller writes it to its stack and passes a pointer in rcx[7].

09Stack frames in practice

The function-call stack on x86-64 grows downward: rsp decreases on a call or a push, increases on a ret or pop. A typical non-leaf function emits a prologue that allocates a frame, saves the registers it needs to preserve, and ends with an epilogue that reverses the prologue:

function_with_frame.s
my_function:
  ; โ”€โ”€ Prologue โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
  push rbp                  ; save the previous frame pointer
  mov  rbp, rsp             ; rbp now points at our frame base
  push rbx                  ; save callee-saved registers we'll use
  push r12
  sub  rsp, 0x28            ; 40 bytes of locals; keeps rsp 16-byte aligned

  ; โ”€โ”€ Body โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
  ...

  ; โ”€โ”€ Epilogue โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
  add  rsp, 0x28            ; deallocate locals
  pop  r12                  ; restore callee-saved registers in reverse
  pop  rbx
  pop  rbp
  ret

Three rules to remember when reading frames:

The red zone is the System V detail that catches Windows porters: a leaf function can legitimately have local_a at [rsp - 8], with rsp never adjusted. On Windows that area is fair game for any signal handler or kernel transition to clobber, so the Microsoft ABI specifies no red zone and the compiler must sub rsp, ... for any local storage.

10Reading compiler output: a worked vector normalize

A canonical 3-D engine primitive: take a vec3, divide each component by its length, return the unit vector. Source:

normalize.cpp
struct vec3 { float x, y, z; };

vec3 normalize(vec3 v) {
  float lenSquared = v.x * v.x + v.y * v.y + v.z * v.z;
  float invLen    = 1.0f / sqrtf(lenSquared);
  return { v.x * invLen, v.y * invLen, v.z * invLen };
}

With clang -O2 -mavx2 -mfma on System V (illustrated; exact instruction selection varies across versions and surrounding context):

normalize.s ยท illustrative clang -O2 -mavx2 -mfma output ยท Intel syntax
normalize(vec3):
  ; System V vec3 ABI: xmm0[0]=x, xmm0[1]=y (upper lanes undefined); xmm1[0]=z.
  vmovshdup    xmm2, xmm0                ; xmm2 = {xmm0[1], xmm0[1], โ€ฆ} โ†’ lane 0 holds y
  vmulss       xmm3, xmm1, xmm1          ; xmm3[0] = z*z   (vmulss touches only lane 0)
  vfmadd231ss  xmm3, xmm2, xmm2          ; xmm3[0] += y*y
  vfmadd231ss  xmm3, xmm0, xmm0          ; xmm3[0] += x*x      โ†’ lenSquared
  vsqrtss      xmm3, xmm3, xmm3          ; xmm3[0] = sqrt(lenSquared)
  vmovss       xmm4, dword ptr [rip + .LC0] ; xmm4[0] = 1.0f from a rodata constant
  vdivss       xmm3, xmm4, xmm3          ; xmm3[0] = invLen
  vmulss       xmm0, xmm0, xmm3          ; xmm0[0] = x*invLen; xmm0[1..] unchanged
  vmulss       xmm2, xmm2, xmm3          ; xmm2[0] = y*invLen
  vmulss       xmm1, xmm1, xmm3          ; xmm1[0] = z*invLen
  vunpcklps    xmm0, xmm0, xmm2          ; xmm0 lane 1 โ† xmm2[0] (y*invLen)
  ret                                    ; return {x', y', ยท, ยท} in xmm0; z' in xmm1[0]

A few things are happening here that aren't obvious from the source. The two-XMM vec3 argument convention is from the System V ABI[6]: an aggregate of three floats has its x,y in xmm0 and z in xmm1. The compiler used vfmadd231ss (fused multiply-add, scalar single-precision) twice to combine the three squared-component sums into one cumulative result; the FMA instruction does a += b*c in one ฮผop with one rounding step instead of two, which is both faster and more accurate[16]. The vsqrtss / vdivss pair is the textbook scalar square root and reciprocal; on hot paths a renderer might replace it with vrsqrtss (approximate reciprocal-square-root, ~11-bit precision) followed by a Newton-Raphson iteration, trading two cycles of latency for some accuracy. The compiler doesn't do that substitution by default because IEEE-754 strict mode requires the rounding behavior of sqrt + div; -ffast-math permits it.

Reading conventions: the v prefix on each mnemonic is the AVX (VEX-encoded) three-operand form. Without AVX, the same operation would be mulss xmm0, xmm0; addss xmm0, xmm3 with the destination always also being a source. Three-operand AVX lets the compiler keep xmm0 unmodified and write the result somewhere else, which usually shortens the dependency chain. The ss suffix means "scalar single-precision," i.e., one float in the low 32 bits of the XMM register, ignoring the other three lanes[17].

11Branches, mispredictions, and the branchless idiom

The CPU does not wait for a conditional branch to resolve before fetching the next instructions. It predicts the outcome of every branch (using a per-branch history table and a global branch history register, both heavily refined over the last twenty years[18]) and speculatively executes the predicted path. When the prediction is right (the usual case for tight loops, monotonic conditions, and any pattern with a stable history), the branch is approximately free. When it is wrong, the CPU rolls back the speculative work and refetches; the cost is ten to twenty cycles on recent Intel and AMD client cores[12].

"Approximately free" is worth measuring. Cloudflare's 2021 microbenchmark of a long chain of predicted unconditional jumps reports steady-state costs of 2 cycles per jmp on an Intel Xeon Gold 6262 (degrading past the ~4096-entry dense BTB capacity), ~3.5 cycles per jmp on Zen 2 (AMD EPYC 7642), and 1 cycle on Apple's M1 when the loop fits in roughly 4 KB of instructions, rising to 3 cycles per jmp once the loop overflows the smaller of the M1's branch-target buffers; never-taken conditional branches stay near 0.3 cycles per block regardless of micro-arch[40]. Two consequences worth keeping. First, "predicted branch is free" is shorthand, not literal: every micro-arch pays one to four cycles even when the prediction is right. Second, the cost is non-linear in the working set: a hot loop that fits inside the BTB and one that just doesn't can run at three- or four-fold different speeds with no source change.

For data-dependent branches that the predictor cannot learn (a random check against a 50/50 input, or a comparison against a key that changes per iteration), the misprediction cost dominates. The remedy is branchless code: replace the conditional with a computation that produces the same result regardless of the predicate, using cmovcc or bit tricks. Take the absolute-value-of-int problem:

abs.s ยท branchy vs branchless
; Branchy: easy to read, mispredicts catastrophically on random inputs
abs_branchy:
  test edi, edi
  jns  .Lpos                   ; jump if not signed (positive)
  neg  edi
.Lpos:
  mov  eax, edi
  ret

; Branchless: same answer, no jump. Cost is constant.
abs_branchless:
  mov  eax, edi                 ; eax = x (the default)
  neg  eax                      ; eax = -x; flags reflect -x
  cmovl eax, edi                 ; if -x < 0 (i.e., x was positive), take the original instead
  ret                            ; eax holds |x|; UB for x = INT_MIN, same as the branchy form

The bit-trick form is shorter still: mov eax, edi; cdq; xor eax, edx; sub eax, edx. cdq sign-extends eax into edx, producing an all-ones mask if x is negative and an all-zeros mask otherwise; the xor and sub then compute (x XOR mask) โˆ’ mask, which evaluates to x when the mask is zero and to โˆ’x when the mask is all-ones. Four instructions, no jump, no condition codes consumed, used throughout the Linux kernel and many engine math libraries[19].

When to use which. Branchless is faster when the predictor cannot learn the pattern. It is slower when the predictor can, because cmov introduces an unconditional data dependency: the result waits for both the source and the predicate, even when the predicate could have been predicted and the dependent computation skipped. The canonical demonstration is the most-upvoted answer on Stack Overflow[20]: summing the elements above a threshold in a 32k-int array runs several times faster when the array is sorted first, because the predictor latches onto the threshold crossover after a handful of iterations and from then on the branch is free. The branchless form gets no such win and, on the sorted input, loses to the branchy form.

The widget animates this. Run the same predicate over an array that's either random or sorted; the branchy version takes constant time on the sorted input and falls off a cliff on the random one:

Live ยท Branchy vs branchless
branchy cycles
ยทยทยท
branchless cycles
ยทยทยท
ratio
ยทยทยท
A simplified model. The branchy version pays 14 cycles per mispredict (a midpoint of recent Intel/AMD client measurements[12]) and 1 cycle per correct prediction; the branchless version pays 3 cycles per element unconditionally. On sorted data the predictor mispredicts twice (once at warm-up, once at the crossover) and the branchy version wins; on random data the predictor never learns and the branchless version wins decisively.

12SIMD: SSE, AVX, AVX-512

SIMD (Single Instruction, Multiple Data) replaces a loop body that operates on one scalar with one instruction that operates on a vector of values. Every modern x86-64 CPU has SSE2 mandatorily; almost every chip from the last decade has AVX2; recent server and high-end client chips have AVX-512[10]. The mental model:

The suffix on a SIMD mnemonic names the lane format. ps = packed single (4/8/16 floats), pd = packed double, ss = scalar single (one float, others untouched), sd = scalar double. Integer variants use b/w/d/q for byte/word/dword/qword. vpaddd ymm0, ymm1, ymm2 is eight 32-bit integer adds in parallel[17].

The widget below animates one SIMD operation across a register. vaddps on a YMM register adds two eight-lane vectors in one cycle of throughput. Press play and watch eight independent adds light up at the same time:

Live ยท SIMD lanes in flight
A YMM-register operation is one ฮผop on Skylake and Zen; throughput is two per cycle for additions and FMAs on Skylake-X and later[13]. The animation is wall-clock-paced for readability; the real CPU finishes the whole thing in a single cycle.
What about ARM? AAPCS64 and NEON in 30 seconds.

ARM (AArch64) is the architecture of Apple Silicon, Nintendo Switch, Nintendo Switch 2, every modern Android phone, and the ARM-based AWS Graviton servers. The ISA differs from x86-64 in three big ways: it is fixed-length 32-bit (no variable-length encoding), it is load-store (no read-modify-write on memory operands), and it has 31 general-purpose 64-bit registers (X0โ€“X30, plus a zero register XZR and the stack pointer SP).

The AArch64 Procedure Call Standard[21] passes integer args in X0โ€“X7 and float args in V0โ€“V7. Callee-saved registers are X19โ€“X28. SIMD lives in 32 "V" registers, each 128 bits; instructions like fmla v0.4s, v1.4s, v2.4s are the rough equivalent of vfmadd231ps xmm0, xmm1, xmm2. The SVE/SVE2 vector extensions (vector-length-agnostic; on Apple Silicon and modern ARM server cores) generalize this to longer registers without recompiling.

Most of this tutorial transfers across by mechanical translation: register names change, the instruction syntax changes, but the same dependency chains, port pressure, and ABI bugs apply.

13A worked SIMD example: 4ร—4 matrix-vector multiply

The transformation that every renderer applies to every vertex: multiply a 4-component vector by a 4ร—4 matrix. Naively, sixteen multiplies and twelve adds (or four dot products of length 4) per vertex. With SSE the same operation is four vmulps and three vaddps on 128-bit vectors. The standard formulation, for a column-major matrix M = [c0 | c1 | c2 | c3] and a vector v = (x, y, z, w):

Mv = c0ยทx + c1ยทy + c2ยทz + c3ยทw

Each ciยทs is a scalar-times-vector, which in SSE is a broadcast of the scalar across four lanes followed by a multiply. The three additions chain together. Source:

matvec.cpp ยท SSE 4x4 column-major mat * vec
// 16-byte aligned for movaps. col[i] is column i of M.
struct alignas(16) mat4 { __m128 col[4]; };

__m128 mul(const mat4& M, __m128 v) {
  __m128 x = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));  // broadcast v.x
  __m128 y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));  // broadcast v.y
  __m128 z = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));  // broadcast v.z
  __m128 w = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3));  // broadcast v.w
  __m128 r = _mm_mul_ps(M.col[0], x);                  // r  = c0 * x
  r = _mm_fmadd_ps(M.col[1], y, r);                     // r += c1 * y
  r = _mm_fmadd_ps(M.col[2], z, r);                     // r += c2 * z
  r = _mm_fmadd_ps(M.col[3], w, r);                     // r += c3 * w
  return r;
}

With clang -O3 -mavx2 -mfma, the body is nine instructions plus the return, split into two parallel FMA chains to shorten the critical path:

matvec.s ยท illustrative clang -O3 -mavx2 -mfma output
mul(mat4 const&, __m128):
  ; rdi = const mat4*  (the matrix);  xmm0 = v
  vshufps      xmm1, xmm0, xmm0, 0x00       ; xmm1 = {v.x, v.x, v.x, v.x}
  vshufps      xmm2, xmm0, xmm0, 0x55       ; xmm2 = {v.y, v.y, v.y, v.y}
  vshufps      xmm3, xmm0, xmm0, 0xAA       ; xmm3 = {v.z, v.z, v.z, v.z}
  vshufps      xmm0, xmm0, xmm0, 0xFF       ; xmm0 = {v.w, v.w, v.w, v.w} (last use of v, OK to overwrite)
  vmulps       xmm1, xmm1, [rdi + 0x00]     ; xmm1 = c0 * v.x     (chain A start)
  vmulps       xmm0, xmm0, [rdi + 0x30]     ; xmm0 = c3 * v.w     (chain B start)
  vfmadd231ps  xmm1, xmm2, [rdi + 0x10]     ; xmm1 += c1 * v.y    (chain A: dst = src1*src2 + dst)
  vfmadd231ps  xmm0, xmm3, [rdi + 0x20]     ; xmm0 += c2 * v.z    (chain B)
  vaddps       xmm0, xmm0, xmm1             ; merge:  c0ยทx + c1ยทy + c2ยทz + c3ยทw
  ret

Two observations worth pulling out. The compiler split the source's single FMA chain into two parallel chains and merged them with a final vaddps: critical-path latency drops from one vmulps + three serial vfmadds (โ‰ˆ16 cycles on Skylake's 4-cycle FMA) to one vmulps, one vfmadd, one vaddps (โ‰ˆ12 cycles), at the cost of one extra instruction. The matrix columns are loaded directly inside the FMA and MUL memory-operand forms ([rdi + 0x10], etc.); on Intel and Zen the load and the compute fuse at the front end into a single macro-op[13], so an explicit vmovaps would be redundant.

What's intentionally missing from this code. The 128-bit (one vector) form processes one vertex at a time. The throughput-tuned form transposes the data to SoA so that one ymm/zmm register holds the same component (x, then y, then z) for many vertices, and a single broadcast-and-FMA transforms 8 vertices at a time with AVX or 16 with AVX-512. The matrix is also typically loaded once and reused across many vertices; this code reloads it every call. On the GPU the same transform is the vertex shader, run massively in parallel, which is why CPU-side vertex transformation today is reserved for skinning, particle simulation, and physics: workloads where the CPU's branchy logic is worth more than the GPU's lane count.

14Intrinsics vs inline assembly vs writing pure asm

Three ways to drop below the C/C++ language barrier, in roughly decreasing order of how often you should use them:

An example of when inline assembly is the right answer. CPUID, used at startup to detect available CPU features, has no C-level mechanism to invoke. GCC's intrinsic __cpuid handles it, but if you need the raw form:

cpuid.c ยท inline asm with operand constraints
static inline void cpuid(int leaf, int *eax, int *ebx, int *ecx, int *edx) {
  __asm__ (
    "cpuid"
    : "=a"(*eax),       // output: eax โ†’ *eax
      "=b"(*ebx),       // output: ebx โ†’ *ebx
      "=c"(*ecx),       // output: ecx โ†’ *ecx
      "=d"(*edx)        // output: edx โ†’ *edx
    : "a"(leaf)         // input:  leaf โ†’ eax
  );
}

The output constraints "=a", "=b", "=c", "=d" tell the compiler that cpuid writes those four registers and the values should go to the named C variables. The input constraint "a"(leaf) tells the compiler to load leaf into eax before executing the instruction. This is the entire mechanism; everything else is constraint syntax[22].

Two reliable failure modes. Forgetting clobbers. If your inline asm modifies a register that's not in the input or output list, you have to declare it in the clobber list, or the compiler will assume the value it had before the asm is still there afterwards. Mixing inline asm with optimization-fragile patterns. Inline asm is a black box to the optimizer; it inhibits inlining, vectorization, and instruction scheduling across the asm. A tight intrinsic-based loop is almost always faster than the same loop with one inline-asm hot instruction in the middle.

15Atomics and fences at the machine level

A pointer to the Memory Model tutorial, condensed: every std::atomic operation in C++ lowers to a specific machine instruction whose ordering guarantees match the requested memory_order. The mapping on x86-64[24]:

On ARM (AArch64) the lowering is more visible. An acquire load lowers to LDAR (Load-Acquire Register); a release store lowers to STLR (Store-Release Register); since the C++20 fix codified in P0668[24], a seq-cst load is LDAR and a seq-cst store is STLR alone. Pre-P0668 toolchains sometimes added a leading DMB ISH before STR for the seq-cst store, which is heavier but conservative. The Lรช, Pop, Cohen, Zappa Nardelli 2013 paper on the corrected Chase-Lev deque[26] is the standard reference for what the C++ memory model demands of ARM codegen; see the memory model tutorial for the full picture.

16ฮผops, ports, and dependency chains in practice

Earlier we sketched the OOO engine. With the SIMD walkthrough in mind, this is what microarchitectural tuning actually looks like. Take a reduction: summing a million floats.

sum_serial.cpp ยท the naive loop
float sum_serial(const float* a, int n) {
  float s = 0;
  for (int i = 0; i < n; ++i) s += a[i];
  return s;
}

Compiled with -O2 (no fast-math), the inner loop is one addss per element. addss on Skylake has a latency of 4 cycles and a throughput of 2 per cycle[13]. Because each iteration depends on s from the previous iteration, the loop is bound by the latency, not the throughput: one iteration every 4 cycles. The throughput of 2 adds per cycle is unreachable because there is only one chain. Eight ports of execution are mostly idle.

The fix is to break the dependency chain: use multiple accumulators that the compiler can schedule independently. -O3 -ffast-math on a modern compiler will do this for you (or -fassociative-math, which is the specific permission needed), producing something like:

sum_parallel.cpp ยท multi-accumulator + SIMD, after autovectorization
float sum_parallel(const float* a, int n) {
  __m256 s0 = _mm256_setzero_ps(), s1 = _mm256_setzero_ps();
  __m256 s2 = _mm256_setzero_ps(), s3 = _mm256_setzero_ps();
  for (int i = 0; i < n; i += 32) {
    s0 = _mm256_add_ps(s0, _mm256_loadu_ps(a + i + 0));
    s1 = _mm256_add_ps(s1, _mm256_loadu_ps(a + i + 8));
    s2 = _mm256_add_ps(s2, _mm256_loadu_ps(a + i + 16));
    s3 = _mm256_add_ps(s3, _mm256_loadu_ps(a + i + 24));
  }
  __m256 s = _mm256_add_ps(_mm256_add_ps(s0, s1), _mm256_add_ps(s2, s3));
  // horizontal sum of s into a scalar; details elided
  ...
}

Four independent accumulators, each a 256-bit vector of eight floats. Per iteration the loop does four 256-bit adds and four 256-bit loads. The four adds run on different dependency chains, which removes the cross-iteration latency that pinned the naive version. The four loads share two L1 load ports. Skylake has two FMA/ADD ports at 256 bits each and two L1 load ports at 256 bits each[13][12], so both halves of the loop need two cycles. Steady-state throughput is one iteration every two cycles: 16 floats per cycle. Compared to the naive loop's 0.25 floats per cycle (one element per 4-cycle addss latency), that is a 64ร— speedup, decomposed as 8ร— from the SIMD width and 8ร— from breaking the latency chain. The loop is now L1-load-bandwidth-bound; the bottleneck moves to L2 once the working set exceeds the L1 footprint, and to DRAM once it exceeds L2.

Three tools to verify this analysis without re-running the program. llvm-mca[27] takes a chunk of assembly and reports the expected steady-state IPC and per-port pressure. uica.uops.info[28] does the same with a richer microarchitectural model. perf stat -e cycles,instructions,branches,branch-misses,... on Linux gives the actual run-time numbers from CPU performance counters. The three together cover prediction, simulation, and measurement.

17Pitfalls

18What's next

Where to go from here:

19Sources & further reading

Numbered citations refer to the superscripts above. Most are freely available; a few are vendor manuals that require a no-cost registration.

A note on originality

The prose, code samples, CSS, and interactive widgets on this page are original writing. The instruction-encoding decomposition follows the Intel SDM Volume 2 chapter on instruction format [10]. The two ABI tables in ยง8 are a side-by-side condensation of the System V AMD64 ABI [6] and the Microsoft x64 calling convention reference [7]; consult them before relying on any field. The branchy-vs-branchless numerics in ยง11 trace to Bruce Dawson's analysis and the canonical Stack Overflow demonstration [20]; the bit-trick form of abs is from Hacker's Delight ยง2-4 [19]. The C++ atomic lowering table in ยง15 matches the mapping documented at [24].

  1. Godbolt, M. Compiler Explorer. godbolt.org. The interactive compile-and-disassemble tool used in every example in this tutorial; mature support for GCC, Clang, MSVC, ICX, and a long tail of other compilers.
  2. Lengyel, E. (2011). Mathematics for 3D Game Programming and Computer Graphics (3rd ed.). Cengage. The standard reference for the vector and matrix routines that ship in every renderer; SSE versions of normalize, cross, and matmul in Chapter 4.
  3. Intel Corporation. Intelยฎ 64 and IA-32 Architectures Optimization Reference Manual. Order Number 248966. intel.com. The normative source for Intel microarchitecture details: pipeline depth, ฮผop fusion, AVX state transitions, the SSE-AVX transition penalty.
  4. Advanced Micro Devices. Software Optimization Guide for AMD Family 19h Processors (Zen 3 and Zen 4). Publication #56665 / #57487. amd.com. AMD's counterpart to the Intel optimization manual; the canonical Zen 3/Zen 4 latency and throughput numbers.
  5. Fog, A. Software optimization resources. agner.org/optimize. Five PDFs: optimizing C++, optimizing asm, microarchitecture of Intel/AMD/VIA CPUs, instruction tables, and calling conventions. Updated as recently as 2024. Free; no registration.
  6. Matz, M., Hubiฤka, J., Jaeger, A., & Mitchell, M. System V Application Binary Interface, AMD64 Architecture Processor Supplement. gitlab.com/x86-psABIs/x86-64-ABI. The normative document for Linux, macOS, the BSDs, and the PlayStation console toolchains. Continuously revised; the LaTeX source is the canonical version.
  7. Microsoft. x64 calling convention. learn.microsoft.com. The Microsoft x64 ABI reference: argument registers, callee/caller-saved, shadow space, struct passing rules.
  8. Advanced Micro Devices. AMD64 Architecture Programmer's Manual, Volume 1: Application Programming. Publication #24592. amd.com. The original AMD64 architecture spec; describes the REX prefix, the 64-bit operand size rules, and the zero-extension behavior of 32-bit register writes.
  9. Intel Corporation. (1979). iAPX 86/88, 186/188 User's Manual. The original 8086 reference. Historical only; reproduced at archive.org.
  10. Intel Corporation. Intelยฎ 64 and IA-32 Architectures Software Developer's Manual. Order Numbers 253665โ€“253669. intel.com. Five volumes; Volume 1 is the architecture overview, Volume 2 is the instruction set reference (the encoding tables), Volume 3 covers system programming.
  11. Intel Corporation. (2023). Intelยฎ Advanced Performance Extensions (Intelยฎ APX) Architecture Specification. intel.com. The proposed extension to 32 GPRs and three-operand encoding for legacy integer instructions.
  12. Fog, A. The microarchitecture of Intel, AMD and VIA CPUs. Manual 3 of the agner.org/optimize series. Periodically updated; the 2024 edition covers Zen 4, Raptor Lake, and Arrow Lake.
  13. Abel, A., & Reineke, J. (2019). uops.info: Characterizing Latency, Throughput, and Port Usage of Instructions on Intel Microarchitectures. ASPLOS. Project home: uops.info. Machine-measured latency, throughput, and port-usage tables for nearly every x86-64 instruction on every recent micro-architecture.
  14. Mรถller, T. (2021). Frame pointers: why you should keep them. Brendan Gregg's writeup on the cost of omitting frame pointers for production profiling. brendangregg.com. Covers why Fedora 38 and Ubuntu 24.04 re-enabled -fno-omit-frame-pointer by default.
  15. Apple. (2024). Writing ARM64 code for Apple platforms. developer.apple.com. Apple's ARM64 platform conventions; covers the frame pointer requirement, which is stricter than the AAPCS64 baseline.
  16. Muller, J.-M., Brisebarre, N., de Dinechin, F., Jeannerod, C.-P., Lefรจvre, V., Melquiond, G., Revol, N., Stehlรฉ, D., & Torres, S. (2018). Handbook of Floating-Point Arithmetic (2nd ed.). Birkhรคuser. The reference for FMA rounding semantics: a single rounding step after the multiply-then-add, instead of two.
  17. Intel Corporation. Intelยฎ Intrinsics Guide. intel.com/intrinsics-guide. The searchable reference for every intrinsic in the <immintrin.h> family, with mnemonic, encoding, latency, and a per-architecture availability badge.
  18. Yeh, T.-Y., & Patt, Y. N. (1991). Two-Level Adaptive Training Branch Prediction. MICRO. PDF. The two-level predictor that became the basis of every shipping CPU's branch predictor; modern designs add path-history, TAGE, perceptron predictors, and indirect-target tables on top.
  19. Warren, H. S. (2012). Hacker's Delight (2nd ed.). Addison-Wesley. The canonical reference for bit-twiddling: branchless absolute value, sign extension, population count, etc. ยง2-4 has the absolute-value bit trick used in ยง11.
  20. Stack Overflow. Why is processing a sorted array faster than processing an unsorted array? (2012). stackoverflow.com. The most-upvoted Stack Overflow answer ever; canonical demonstration of branch-predictor effects in C++.
  21. ARM Limited. Procedure Call Standard for the Armยฎ 64-bit Architecture (AArch64). Document IHI 0055. github.com/ARM-software/abi-aa. The AArch64 ABI: argument registers, callee-saved, parameter passing rules, the NEON vector ABI.
  22. Free Software Foundation. Extended Asm โ€” Assembler Instructions with C Expression Operands. GCC manual. gcc.gnu.org. The reference for GCC inline assembly constraints; Clang follows the same spec for compatibility.
  23. Microsoft. Inline assembler. learn.microsoft.com. The MSVC documentation; states explicitly that inline asm is not supported in x64 or ARM64 code, with the redirect to intrinsics and external .asm via MASM.
  24. Boehm, H.-J., & Giroux, O. (2018). P0668R5 โ€” Revising the C++ memory model. WG21. open-std.org. The proposal that tightened seq_cst ordering on weak hardware; includes the canonical x86 and ARM lowering tables for every memory order. Also useful: Mapping C/C++ to processors, cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html.
  25. Intel Corporation. Intelยฎ 64 and IA-32 Architectures Software Developer's Manual, Volume 3A, ยง8.2 (Memory Ordering), ยง8.3 (Serializing Instructions). The normative source on MFENCE, LFENCE, SFENCE, and the LOCK prefix as full barriers.
  26. Lรช, N. M., Pop, A., Cohen, A., & Zappa Nardelli, F. (2013). Correct and Efficient Work-Stealing for Weak Memory Models. PPoPP. PDF. The paper that corrected the 2005 Chase-Lev deque with the right C11 acquire/release ordering for ARM and POWER.
  27. LLVM Project. llvm-mca โ€” LLVM Machine Code Analyzer. llvm.org. A static throughput analyzer that takes a region of assembly and reports per-port pressure and steady-state IPC; ships with LLVM.
  28. Abel, A., & Reineke, J. (2022). uiCA: Accurate Throughput Prediction of Basic Blocks on Recent Intel Microarchitectures. ICS. Project home: uica.uops.info. A more recent simulator with a richer model of the front end, ฮผop cache, and back-end ports.
  29. Paoloni, G. (2010). How to Benchmark Code Execution Times on Intelยฎ IA-32 and IA-64 Instruction Set Architectures. Intel white paper. The reference for the right way to use RDTSC and RDTSCP for cycle-level benchmarking, including the serialization caveats.
  30. Hennessy, J. L., & Patterson, D. A. (2017). Computer Architecture: A Quantitative Approach (6th ed.). Morgan Kaufmann. The textbook on pipelining, OOO execution, and cache hierarchies. Chapters 2 and 3 are the long-form version of ยง3 of this tutorial.
  31. Drepper, U. (2007). What Every Programmer Should Know About Memory. PDF. The long-form treatment of the memory hierarchy, prefetch, and cache effects; still the clearest single document on the topic.
  32. Giesen, F. (Ryg). The ryg blog. fgiesen.wordpress.com. Practitioner-grade writeups on SIMD, codec implementation, and the pixel pipeline; the closest thing the field has to a working game-programmer's blog on this material.
  33. Muล‚a, W. Practical SIMD and bit-twiddling notes. 0x80.pl. Hundreds of microbenchmarks and worked examples for SSE, AVX2, AVX-512, NEON. The reference for "how do I vectorize this small thing?"
  34. Dawson, B. Random ASCII. randomascii.wordpress.com. Long-running practitioner blog from a Valve / Microsoft profiling engineer; the source for many of the canonical "what does this assembly cost on Skylake" investigations.
  35. Lemire, D. Daniel Lemire's blog. lemire.me/blog. Microbenchmarks, SIMD analyses, and a steady stream of evidence-based posts on modern x86 performance.
  36. Patterson, D. A., & Waterman, A. (2017). The RISC-V Reader: An Open Architecture Atlas. Strawberry Canyon. A short, accessible alternative to x86 for understanding how a simpler ISA decodes; useful background even if you'll never ship to RISC-V.
  37. Intel Corporation. Intelยฎ VTuneโ„ข Profiler User Guide. intel.com. The reference profiler for Intel hardware; reads the same performance counters perf does, with a richer microarchitectural-event taxonomy.
  38. Bendersky, E. Eli Bendersky's website. eli.thegreenplace.net. Long-form articles on linkers, ELF, position-independent code, and how compilers actually generate the assembly you read. Strong on the loader and dynamic-linking side that this tutorial skips.
  39. Hyde, R. (2003). The Art of Assembly Language. No Starch Press. The classic intro book to assembly; uses HLA, but Part III is x86 assembly proper. Free online at plantation-productions.com.
  40. Majkowski, M. (2021). Branch predictor: How many "if"s are too many? Cloudflare blog. blog.cloudflare.com. Microbenchmarks of long predicted-branch chains on Intel Xeon Gold 6262, AMD EPYC 7642 (Zen 2), and Apple M1, with reproducible code. Source for the per-architecture per-jmp cycle numbers and the BTB-capacity-cliff behavior cited in ยง11.

See also