Tutorial 11 · Engine Programming

Debugging C++ in Release Mode from Scratch

The bug your QA filed is in the optimized binary the player runs, not the debug build that prints all your asserts. This tutorial is the practical playbook for that binary: which debug info survives -O2, how to read disassembly when the source is gone, hardware watchpoints for the writer you can't catch in source, sanitizers for the bug you can't reproduce, crash dumps for the bug that already happened on someone else's machine, and the small library of tactics that turns a heisenbug into a regular bug. Six live widgets, no IDE-specific UI, every claim cited.

Time: ~70 min
Level: Intermediate to senior
Prereqs: You can read C or C++ comfortably and you have used a debugger at least casually. The x86-64 Assembly tutorial pairs naturally with §6 and §7. The Memory Model tutorial is the right companion for §11.
Tools: Any of GDB, LLDB, WinDbg, plus objdump or llvm-objdump

01 The skill that actually ships

Almost every consumer-facing C++ binary that crashes on a player's machine was compiled with optimizations on. The debug build is a development tool, not a shipping target: it is too slow to hit frame rate, too fat to fit in console memory, and too forgiving to expose the data races that only matter when the scheduler is fast. The bug your support pipeline collects, the call stack the crash reporter uploads, the assembly the kernel attaches to a faulting page: all of those came from the release binary. Reading it as a first language, not as a translation of the source, is the skill that closes the gap between "we couldn't reproduce it" and "we shipped the fix."

Categories of bug that classically only surface under optimization:

Latent UB the optimizer takes seriously. The standard says signed overflow is undefined; -O2 may rewrite or delete the check that depended on it^[1]. -O0 kept the check and the bug looked benign.
Data races that the slower debug runtime hid. Debug allocators, debug iterators, and _ITERATOR_DEBUG_LEVEL add latency that smooths out the unsafe interleaving the release scheduler exposes^[2].
Reads of uninitialized memory. MSVC's debug CRT fills fresh heap allocations with 0xCD, freed heap with 0xDD, and the guard bytes around each allocation ("no man's land") with 0xFD^[3]. The 0xCC you see on uninitialized stack is the MSVC compiler's debug behavior (fill with the int 3 opcode), not the CRT's. Release builds do no filling at all: the memory holds whatever was already there.
Inlining-dependent bugs. A function called through a pointer in debug becomes a direct, inlined call in release. The argument-evaluation order, the lifetime of a temporary, the address taken of a local, all of those can change.
Stripped or different symbol resolution. A weak symbol resolution that happened to pick the local definition in debug can pick a vendored library's version in release; an inlined-and-elided helper in one translation unit can survive in another.
Floating-point reassociation. -ffast-math permits the compiler to reorder additions, fuse multiply-adds, treat NaN as impossible, and assume no signed zero^[4]. Determinism across -O0 and -O2 goes with it.
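The reassociation item is easy to see in isolation. The sketch below is a minimal illustration, assuming IEEE-754 doubles and a build without -ffast-math; the function names are made up for the example. The same three addends are summed under two groupings, which is exactly the freedom -ffast-math hands to the compiler.

```cpp
#include <cassert>

// Same three addends, two groupings. IEEE-754 addition is not associative,
// so whichever grouping the compiler ends up emitting is visible in the result.
constexpr double sum_left(double a, double b, double c)  { return (a + b) + c; }
constexpr double sum_right(double a, double b, double c) { return a + (b + c); }

// With a = 1e30, b = -1e30, c = 1.0:
//   sum_left  -> (1e30 - 1e30) + 1.0 = 1.0
//   sum_right -> 1e30 + (-1e30 + 1.0); the 1.0 is far below the rounding
//                step of 1e30, so the inner sum is still -1e30 and the
//                total is 0.0
```

Without -ffast-math the compiler must keep the grouping the source wrote, so both functions are stable across -O0 and -O2. With it, either grouping is a legal translation of either function.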

Game-specific reasons the release binary is the only one that matters: a console build is the release binary plus a small amount of profile data; a memory-budget bug only triggers when the engine pre-allocates the production heap; a 4-ms frame spike from a rare allocator path needs to run at production rates to surface. None of those reproduce under -O0.

What you'll have by the end

A working playbook for the optimized binary: which debug-info survives -O2 and how to use it, how to read a release-mode disassembly cold, how to set up sanitizers and what they cover, how to read a Windows minidump or a Linux core file, what hardware watchpoints can do that source-level breakpoints can't, and which release-only bugs are usually UB in disguise. Six interactive widgets, including a side-by-side optimizer ladder, a stack-frame walker that follows DWARF unwind data, an AddressSanitizer shadow-memory simulator, and a data-race detector with happens-before edges.

The shape of a release-only bug

A small, complete example of the genre. Two C functions, identical inputs, same compiler. The first is a debug build. The second is the same source compiled with -O2:

debug · -O0 -g

int guard(int n) {
  int doubled = n + n;       // safe at -O0: even on overflow,
  if (doubled < n) return -1; // the compare actually runs
  return doubled;
}
// guard(INT_MAX) returns -1, as the source suggests.

release · -O2

int guard(int n) {
  int doubled = n + n;
  if (doubled < n) return -1;
  return doubled;
}
// guard(INT_MAX) hits signed overflow, which is undefined behavior
// per the C standard (C17 §6.5/5). The optimizer assumes overflow
// never occurs, under which (n + n < n) is equivalent to (n < 0),
// and typically rewrites the check that way. INT_MAX passes the
// rewritten check, so the function returns whatever doubled is
// (typically the two's-complement result, -2, on x86-64), unchecked.

The two binaries disagree on the same source. Neither compiler is wrong. The C and C++ standards leave signed overflow undefined^[1]; the optimizer is permitted to assume the program never overflows and to simplify or delete any check that can only fire after overflow. -O0 kept the literal compare because the optimizer was off; -O2 rewrote it for the same reason it simplifies any other expression. The bug is a UB-shaped hole in the source. The release binary is the one telling the truth.
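One way to close the hole, sketched under the assumption that the intent of guard was to reject inputs whose double does not fit in an int: test the range before the addition, so that no signed overflow is ever evaluated. The name guard_checked is illustrative and not part of the original example.

```cpp
#include <cassert>
#include <climits>

// Returns n + n, or -1 when the sum would not fit in an int.
// The range test runs BEFORE the addition, so the overflowing expression
// is never evaluated and the optimizer has nothing to assume away.
constexpr int guard_checked(int n) {
  return (n > INT_MAX / 2 || n < INT_MIN / 2) ? -1 : n + n;
}
```

Because the check is written in terms of values that always exist, -O0 and -O2 have to agree on it. Building the original guard with -fsanitize=undefined is the complementary step: it reports the overflow at run time instead of letting the two builds quietly diverge.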

The same shape recurs throughout this tutorial: a thing the debugger or the source pretends to know turns out to be a fiction the optimizer has already moved past, and the path forward is to read the artifact (the assembly, the shadow-memory map, the unwind table, the minidump) instead.

02 What the binary still knows about itself

An optimized binary has lost most of its source-level structure. Stack-allocated locals are register-resident or folded into other expressions. Loops are unrolled, inlined, vectorized, or removed. Functions disappear into their callers. What remains is the machine code plus a separate side-channel of debug information the compiler emitted alongside it. The two main formats are DWARF (used by Linux, macOS, BSD, almost every non-Microsoft toolchain) and CodeView, packaged in PDB files (used by MSVC and most Windows native toolchains)^[7]^[8]. Reading a release crash without one of these is reading hex.

The five tables a debugger actually reads:

Symbol table. Maps function names to address ranges. Stripping moves it to a separate .debug file or PDB; an unstripped binary carries it inline. Lets the debugger print Game::Update + 0x47 instead of 0x4012a7.
Line table. The mapping from instruction address to (source file, line, column). DWARF stores it as a state-machine program in .debug_line^[7]; PDB stores it in CodeView line records. Coarse after optimization: a single source line maps to several non-contiguous instruction ranges, and several source lines may map to the same instruction.
Location list. The mapping from (variable, instruction range) to "where the value lives" (a register, a stack slot, a constant, a derived expression). DWARF stores this in .debug_loc or .debug_loclists. The reason your debugger sometimes prints "value optimized out" and sometimes prints a register: the location list either does or does not have a description for that PC^[7].
Unwind information. Per-function table of how to walk back up the stack from any instruction inside the function. DWARF calls these CFI and stores them in .eh_frame on Linux/macOS and .debug_frame in old toolchains; Windows x64 uses .pdata and .xdata sections that are part of the PE specification, not optional^[9].
Type information. Records of struct layout, enum names, vtable shapes, template instantiations. The largest category of debug-info by far; on Windows the type info is a major reason a PDB can be tens or hundreds of megabytes for a game binary.

This layout has two practical consequences. First, you can keep optimization on and still get most of the line, location, and unwind information; -g on GCC and Clang, and /Zi on MSVC, do not change codegen^[10]. The debug info is a separate side product. Second, stripping the binary moves the debug info to a separate file (the .dSYM on macOS, the .debug file on Linux, the PDB on Windows). A symbol server is the standard way to keep those files retrievable per crash dump^[11].

The unwind tables in particular are easy to undervalue. Without .eh_frame or .pdata, walking back from a faulting instruction needs the frame pointer chain, which the optimizer omits by default. With them, the debugger reads "at this PC, the return address is at [rsp + 0x28], the saved rbx is at [rsp + 0x20]" and walks the stack with no help from the program. This is the entire reason a stripped, optimized release binary still produces a clean stack trace from a crash.
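The quoted rule can be modeled in a few lines. The sketch below is a toy, not the DWARF CFI or .pdata encoding: UnwindRow, Registers, and the map-based stack are hypothetical stand-ins chosen for the example, and only one callee-saved register is tracked.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Toy unwind row for one PC range. Offsets are byte displacements from the
// current rsp, in the spirit of a CFI row.
struct UnwindRow {
  std::uint64_t return_addr_offset;  // where the return address sits
  std::uint64_t saved_rbx_offset;    // where the caller's rbx was spilled
};

struct Registers {
  std::uint64_t rip, rsp, rbx;
};

using Stack = std::map<std::uint64_t, std::uint64_t>;  // address -> 8-byte slot

// One step up the stack: recover the caller's rip, rbx, and rsp from the row
// and the stack contents alone. No frame pointer is consulted.
Registers unwind_one_frame(const Registers& callee, const UnwindRow& row,
                           const Stack& stack) {
  Registers caller;
  caller.rip = stack.at(callee.rsp + row.return_addr_offset);
  caller.rbx = stack.at(callee.rsp + row.saved_rbx_offset);
  // After `ret` pops the return address, rsp points just above that slot.
  caller.rsp = callee.rsp + row.return_addr_offset + 8;
  return caller;
}
```

A real unwinder repeats this step with the row that covers the recovered rip, which is why the tables have to describe every PC a fault or a sample can land on.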

"Optimized out" in the debugger: what it actually means

You set a breakpoint at line foo.cpp:42 in a release build. You hit it. You ask for the value of localVariable. The debugger prints <optimized out>. Three things might be true:

The variable is in a register that has been reused. Its value lived in r12 for cycles 100–115, but at line 42 the function has finished with it and r12 now holds something else. The DWARF location list correctly reports "no longer here." The truth is that the value at line 42 doesn't exist anywhere; the optimizer didn't copy it to a stack slot for your debugger's benefit.

The variable was folded into a larger expression. If x is only ever used as x + 1 and the surrounding code uses the increment directly, x may not have a runtime representation at all. The location list in this case will not even have an entry.

The variable was a constant. If radius was initialized to 1.0f and never changed, the literal 1.0f is what got compiled in, and there is no register or memory holding "the variable named radius."

The fix isn't to mistrust the debugger. The fix is to read the disassembly to see where the value of localVariable actually came from, and inspect it there. On GDB, info args and info locals print values by way of the location list as the compiler wrote it, and info address or info scope describe the location itself; LLDB and WinDbg have equivalents. The rest is reading.
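The first two cases reduce to the same lookup. The sketch below is a simplified model, assuming a location list is a flat set of PC ranges; LocEntry and locate are hypothetical names, and real DWARF entries hold location expressions instead of strings.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy location-list entry: for PCs in [begin, end) the variable lives at `where`.
struct LocEntry {
  std::uint64_t begin;
  std::uint64_t end;
  std::string where;  // e.g. "r12" or "[rsp+0x10]"
};

// What a debugger does when asked to print a variable: find the entry that
// covers the current PC. No covering entry means "<optimized out>".
std::string locate(const std::vector<LocEntry>& loclist, std::uint64_t pc) {
  for (const LocEntry& e : loclist) {
    if (pc >= e.begin && pc < e.end) return e.where;
  }
  return "<optimized out>";
}
```

The practical consequence is that moving the breakpoint a few instructions earlier, to a PC that an entry still covers, often turns the message back into a value.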

03 Building for debuggability without giving up speed

The compiler flags that affect debuggability split into two groups: the ones that emit debug info (free, in code-gen terms) and the ones that change codegen (not free). The former should always be on for any build you might want to debug, including release-with-debug-info shipping builds. The latter need a deliberate trade-off.

Flag	Compiler	What it does	Codegen cost
`-g` / `-g3`	GCC, Clang	Emit DWARF debug info. `-g3` additionally emits macro definitions.	None. Output is a separate set of `.debug_*` sections.
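`/Zi`	MSVC	Emit CodeView debug info to a separate PDB.	None from `/Zi` itself. Linking with `/DEBUG` changes the `/OPT` defaults to `NOREF` and `NOICF`, so shipping builds pass `/OPT:REF /OPT:ICF` explicitly; keep `/ZI` (Edit and Continue) to non-shipping builds.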
`-gsplit-dwarf`	GCC, Clang	Split debug info into `.dwo` files. Reduces link time and binary size on disk.	None.
`-fno-omit-frame-pointer`	GCC, Clang	Reserve `rbp` as the frame pointer in every function. Improves backtrace robustness when the unwind table is missing or wrong.	~1–2% on most code; one fewer general-purpose register^[12].
`/Oy-`	MSVC	Disable frame-pointer omission. Same trade-off.	Same as `-fno-omit-frame-pointer`.
`-fasynchronous-unwind-tables`	GCC, Clang	Emit `.eh_frame` unwind rows that are exact at every instruction boundary, not just at call sites.	Larger binary, no perf cost. On by default on x86-64 Linux.
`-O2 -g` / `/O2 /Zi`	both	The shipping-with-symbols build. Optimizations on, debug info on the side.	None beyond `-O2` itself.
`-Og`	GCC, Clang	Optimize for debugging: enable optimizations that don't disturb stepping/locals.	Slower than `-O2`; faster than `-O0`; locals usually still inspectable.
`-fno-strict-aliasing`	GCC, Clang	Disable type-based alias analysis. Stops the compiler from assuming that an `int*` and a `float*` point to disjoint storage.	1–10% on autovectorized loops; defangs a major class of UB-induced miscompile^[13].
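The last row has a source-level alternative that keeps alias analysis on. The sketch below is one common idiom, not a claim about any particular engine: copy the bytes with memcpy instead of reading one type through a pointer to another. The name float_bits is illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Type-pun through memcpy: defined behavior with strict aliasing enabled.
// Compilers typically lower this fixed-size copy to a single register move.
std::uint32_t float_bits(float f) {
  static_assert(sizeof(std::uint32_t) == sizeof(float), "assumes a 32-bit float");
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof bits);
  return bits;
}

// The pattern to avoid: reading a float object through a uint32_t lvalue
// breaks the aliasing rules, and an optimizing build may reorder or drop
// the access.
//   std::uint32_t bad = *reinterpret_cast<std::uint32_t*>(&f);
```

Code written this way behaves the same with or without -fno-strict-aliasing, which lets the flag be treated as a safety net for legacy code instead of a requirement.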

The default for any in-development build of an engine: -O2 -g -fno-omit-frame-pointer, with sanitizer flags enabled in the configurations that need them (covered in §8). Shipping builds keep -O2 -g and split the debug info to a symbol server. The real argument is over the frame pointer.

Fedora 38 (2023) and Ubuntu 24.04 (2024) switched their default package builds to keep frame pointers on at -O2, which closed a long-running argument in the Linux ecosystem^[12]. The benefit is that perf and any sampling profiler can build accurate stack traces by walking rbp at native speed; without it, perf has to either parse .eh_frame on every sample (expensive) or use --call-graph=lbr with hardware Last-Branch-Record support (limited to 32 entries on most parts). The cost of keeping rbp reserved is in the low percent range on the workloads that have been measured. For a game that wants production-quality profiles, the trade is usually worth it; for a benchmark suite looking for the last 1%, it isn't.

Apple's ARM64 platform ABI mandates the frame pointer^[14]; you don't get a choice. Microsoft's x64 ABI doesn't require it but supplies the unwind tables, which means stack walks work with or without^[9].

A note on asserts in shipping builds

The NDEBUG macro disables assert() from <cassert>. Many engines define a separate GAME_ASSERT that is kept on in shipping for the cheap ones (pointer non-null, range check, invariant) and stripped for the expensive ones. The cheap ones turn an undefined-behavior crash into a defined error log plus a contextful minidump. Mike Acton's CppCon 2014 talk on the Insomniac engine makes the related case for explicit invariants in production-quality code^[15].
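A minimal sketch of that split, assuming a two-tier scheme. The handler hook, the GAME_ASSERT_SLOW name, and the GAME_ENABLE_SLOW_ASSERTS switch are hypothetical; real engines differ in how they log and where they trap.

```cpp
#include <cassert>   // the standard assert(), which NDEBUG removes
#include <cstdio>
#include <cstdlib>

namespace game {

using AssertHandler = void (*)(const char* expr, const char* file, int line);

// Default policy: log, then abort so the crash reporter captures a dump
// at the failing check instead of at some later wild access.
inline void log_and_abort(const char* expr, const char* file, int line) {
  std::fprintf(stderr, "GAME_ASSERT failed: %s (%s:%d)\n", expr, file, line);
  std::abort();
}

// Replaceable, so a title can route failures to its own reporter.
inline AssertHandler& assert_handler() {
  static AssertHandler handler = log_and_abort;
  return handler;
}

}  // namespace game

// Cheap checks: always compiled in, independent of NDEBUG.
#define GAME_ASSERT(cond)                                            \
  do {                                                               \
    if (!(cond)) game::assert_handler()(#cond, __FILE__, __LINE__);  \
  } while (0)

// Expensive checks: compiled in only when the build opts in.
#if defined(GAME_ENABLE_SLOW_ASSERTS)
#define GAME_ASSERT_SLOW(cond) GAME_ASSERT(cond)
#else
#define GAME_ASSERT_SLOW(cond) ((void)0)
#endif
```

Typical use is GAME_ASSERT(entity != nullptr) on a hot path and GAME_ASSERT_SLOW(tree.is_balanced()) around a walk that is too costly for shipping, where is_balanced stands in for whatever expensive invariant the engine has.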

The optimizer ladder below shows what each level of -O actually changes for a small loop. Click a flag to see the assembly the compiler emits and the transformations it applied:

Output is approximate, modeled after Clang 17 on x86-64 Linux. Per-element cycle figures are steady-state estimates assuming the data is in L1; actual throughput depends on the microarchitecture, alignment, and surrounding code. The transformations shown are the standard set the optimizer applies in order: dead code elimination^[16], mem-to-reg, constant propagation, common subexpression elimination, loop-invariant code motion, induction-variable simplification, loop unrolling, and SLP/loop autovectorization.

04 The debugger as a data tool

The popular conception of a debugger is a UI: set a breakpoint, hit it, look at locals. That mental model gets you about ten percent of what the tool can do. The debugger is a programmable inspection surface attached to a running process, with full access to memory, registers, threads, system calls, and signals. Most of the bugs that don't fall out of "set breakpoint, hit breakpoint, look around" are ones where you script the debugger to collect data over many iterations and then look at the result.

The four classes of tactic worth knowing, all available in GDB, LLDB, and WinDbg in some form:

Conditional and counted breakpoints

"Stop here, but only when i > 10000 and node->parent == nullptr." On GDB this is a conditional breakpoint^[17]: break NavMesh.cpp:402 if i > 10000 && node->parent == nullptr. The cost is one stop-evaluate-continue cycle per hit; for a hot path, this is too slow to use naively. The standard speedup is to evaluate the condition closer to the inferior: with a remote target such as gdbserver, GDB's set breakpoint condition-evaluation target has the stub evaluate the condition and skips the round-trip to the debugger. Where that is not available, the fallback is to compile the check into the program temporarily, as an if around a trap instruction.

"Stop here on the 1024th hit." Ignore counts: ignore 1 1023 on GDB, equivalent option on LLDB. Useful for finding the iteration where an invariant first breaks: ignore until just before, single-step from there.

Tracepoints and breakpoint commands

"Don't stop, just log." A tracepoint on GDB or a breakpoint with a commands ... continue block prints values to a log and lets the program run on. The output is structured: log the call site, the argument, a timestamp, and a small fingerprint of the relevant state. Run the failing scenario, look at the log offline. Often replaces printf-debugging because it doesn't require a recompile and won't drift when someone reorganizes the file.

Reverse and replay debugging

rr, originally developed at Mozilla and now maintained at rr-debugger/rr, records the system calls and non-deterministic events of a Linux process and replays them deterministically inside GDB^[18]. Inside the replay you can step backwards, set breakpoints in the future and run forward, and re-run the same buggy execution as many times as you need. The recording has a one-off slowdown, in the 1.2× to 4× range for typical workloads; the replay is at full speed. WinDbg has Microsoft's Time Travel Debugging for the same workflow on Windows^[19]. The single class of bug rr and TTD rule out faster than anything else: "the value of this pointer changed and I don't know who changed it". Run backwards from the wrong value to the assignment.

Scripted post-mortem inspection

The Python APIs of GDB and LLDB, and the JavaScript dx engine in WinDbg, let you walk arbitrary structures programmatically. A two-line script that walks every entity in the ECS and reports the ones with a specific component-mask flag set will dump it in a quarter of a second. Scripting the debugger is the right call when the question is "across all current data, which N satisfy P," and the alternative is clicking through ten thousand entries by hand.

Tip · Save the session

A debugging session that found a bug is also a regression test for whether you fixed it. Save the breakpoint definitions, the scripted commands, and the conditional expressions to a .gdbinit, .lldbinit, or WinDbg script. Re-running the same script against the patched binary should now reach the end without tripping the assertion or logging the off-by-one. The script becomes documentation that the next person on the bug can re-execute.

05 Hardware watchpoints: catching the writer

The single most useful debugger feature most engineers underuse is the hardware watchpoint. A breakpoint stops execution at an instruction; a watchpoint stops execution when a specific memory location is read or written, no matter which instruction did it. The mechanism is hardware-supported: x86 has four debug-address registers DR0–DR3, each holding a linear address, with the read/write/execute mask and the access size encoded in DR7^[20]. ARMv8 has a similar mechanism through the DBGWVR/DBGWCR register pairs; the architecture allows between 2 and 16 watchpoint pairs and most A-profile cores ship with 4^[21].

The shape of a problem hardware watchpoints solve faster than anything else: "this byte should always be 0xAB, but at some point in the frame it is 0xCC. Who is writing it?" Source-level reasoning runs out of road; the offender could be any function, any thread, any memcpy with a wrong length. A watchpoint on the byte traps the single store that broke the invariant, dumps a stack at the offending instruction, and the bug is the line above.

Setting one is one command:

gdb · watch a single byte

# Trap any write to the byte at 0x7ffff7e0a420.
(gdb) watch *(uint8_t *)0x7ffff7e0a420

# Or: trap any write to the cookie field of a known object.
(gdb) watch player->cookie

# rwatch / awatch trap reads / any access. Same DR0-DR3 hardware.
(gdb) rwatch *(uint32_t *)&config->magic
(gdb) awatch *(uint32_t *)&config->magic

The widget below is a simulator. A small program writes to memory cells in a loop; pretend you are debugging it and don't know which iteration corrupts the watched cell. Click a cell to set a hardware watchpoint and run; the simulator pauses on the offending write and shows the stack at that instant:

Cell states: live, watched, caught write.

Click a cell to set or clear a hardware watchpoint, then Run. The simulated program does a memcpy with a slightly wrong length and a memset later in the frame. The watched cell traps the offending write at the instruction that did it; the stack and registers at the moment of the trap are what your debugger would print. The four-watchpoint limit is the same one you would hit with DR0–DR3 on real hardware.

Two practical limits. First, hardware watchpoints are scarce: four on x86, four on most ARM cores, sometimes eight or sixteen. The debugger will silently fall back to software watchpoints if you ask for a fifth, which is single-stepping plus a memory check at every step. Software watchpoints are correct but four to five orders of magnitude slower; a frame that ran in 16 ms takes minutes. The fix is to use fewer; pick the watch range carefully. Second, hardware watchpoints fire on the granularity of 1, 2, 4, or 8 bytes (Intel's encoding^[20]). Watching a 64-byte struct needs a software watchpoint or a clever placement of four 8-byte watchpoints if the corruption is known to be aligned.
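The budget arithmetic is worth doing before reaching for the debugger. The sketch below assumes the x86-64 rule that each debug register watches a naturally aligned window of at most 8 bytes; watchpoints_needed is an illustrative helper, not a debugger API.

```cpp
#include <cstdint>

// Number of 8-byte hardware watchpoints needed to cover [addr, addr + len),
// given that each one watches a naturally aligned 8-byte window.
constexpr std::uint64_t watchpoints_needed(std::uint64_t addr, std::uint64_t len) {
  return len == 0 ? 0 : ((addr + len - 1) / 8) - (addr / 8) + 1;
}

static_assert(watchpoints_needed(0x1000, 64) == 8, "a 64-byte struct needs twice the x86 budget");
static_assert(watchpoints_needed(0x1000, 32) == 4, "32 aligned bytes fit DR0-DR3 exactly");
static_assert(watchpoints_needed(0x1004, 32) == 5, "32 misaligned bytes spill into a fifth window");
```

When the answer exceeds four, the usual move is to bisect: watch half of the range, see whether the trap fires, and narrow from there.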

"My watchpoint never fires." Three reasons.

The address moved. If you watched a fixed location (a raw address, or watch -l player->cookie on GDB) and player was reallocated, the address you watched is no longer the address the field lives at. A watchpoint set this way is on a linear address, not on a logical name. Re-set it after any reallocation.

The write is going through DMA or a kernel-side memcpy. Hardware watchpoints fire only on user-space CPU writes. A driver doing copy_to_user from kernel space, or a GPU DMA writing to a mapped buffer, will not trigger them. The fix is to add a software check at the boundary of the suspect copy.

The page is not present. A watchpoint on an address that the kernel has paged out will not fire when the page is brought back in by another thread. This is rare on a desktop with plenty of RAM and common on a memory-pressured embedded target.
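For the second case, the software check at the boundary can be as small as a wrapper around the suspect copy. The sketch below uses a hypothetical layout (16 payload bytes followed by a one-byte cookie that must read 0xAB) held in a single array, so the deliberately oversized copy in the usage note stays inside one object. Slot and checked_copy are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr std::size_t kPayloadSize = 16;
constexpr unsigned char kCookie = 0xAB;

// Hypothetical layout: payload bytes, then the cookie, in one array.
struct Slot {
  unsigned char bytes[kPayloadSize + 1];
  Slot() {
    std::memset(bytes, 0, kPayloadSize);
    bytes[kPayloadSize] = kCookie;
  }
  bool cookie_intact() const { return bytes[kPayloadSize] == kCookie; }
};

// Checks the invariant on both sides of the copy. Returns false only when
// the cookie was intact before this copy and clobbered after it.
bool checked_copy(Slot& dst, const void* src, std::size_t len) {
  assert(len <= sizeof dst.bytes);  // keep the demo itself in bounds
  const bool ok_before = dst.cookie_intact();
  std::memcpy(dst.bytes, src, len);
  const bool ok_after = dst.cookie_intact();
  return !(ok_before && !ok_after);
}
```

A copy of 16 bytes returns true; a copy of 17 bytes from a buffer of 0xCC returns false and identifies the call site, which is the information the hardware watchpoint would have provided had the write come from user-space CPU code.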

06 Reading optimized assembly when the source is gone

A crash in a third-party DLL, a stripped vendored library, a driver fault, an inlined helper from a header you don't have: all of these end with you in front of a disassembly window with no source. The assembly tutorial covers the ISA itself; this section covers the workflow for inferring meaning from a code chunk you didn't write.

Five questions, in order, that turn a screen of hex into actionable information.

1. What function are you in?

The faulting RIP on the crash dump is an absolute address. Resolve it to a function with the symbol table: addr2line -e game.exe -f 0x7ff6a02b14c7 on the GNU toolchain, the equivalent in llvm-symbolizer, or SymFromAddr from dbghelp.dll on Windows^[22]. If the binary is stripped, the closest exported symbol is usually still resolvable: the dynamic-linker symbol table is separate from the debug-info one and is what the loader uses; strip by default leaves it. From the function name and the offset (+0x47) you can narrow disassembly to one function instead of the entire .text.

2. What does the prologue tell you?

The first instructions of a function on x86-64 are a stereotyped prologue: zero or more push reg for callee-saved registers being preserved, then a sub rsp, N that allocates the local frame. The prologue tells you, before you read a single line of the body, how many callee-saved registers the function uses (r12–r15, rbx, rbp on System V^[23]) and how much stack the function needs:

a typical non-leaf prologue, x86-64 System V

push  rbp                   ; save caller's frame pointer (only with -fno-omit-frame-pointer)
mov   rbp, rsp              ; establish this function's frame pointer (same condition)
push  r15                   ; preserve callee-saved r15
push  r14                   ; preserve callee-saved r14
push  rbx                   ; preserve callee-saved rbx
sub   rsp, 0x48             ; allocate 72 bytes of frame
                            ; total stack used: 4 saves * 8 + 72 + 8 ret-addr = 112 bytes

The frame size puts an upper bound on the local data: 72 bytes of stack means at most ~9 8-byte locals or 18 4-byte ones, often less because the compiler aligns and pads. Three callee-saved registers (excluding the frame-pointer push of rbp) means the function did enough work to need three registers it couldn't clobber; a leaf function with no work would have skipped the prologue entirely. A frame size of 0x28 (40 bytes) on a function that calls another function is typical on Windows: 32 bytes of shadow space plus 8 bytes to keep rsp 16-byte-aligned at the next call^[24].
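The frame arithmetic generalizes to any prologue. The sketch below is a worked check under the stated ABI rule that rsp is 16-byte aligned at every call instruction; frame_bytes and keeps_call_alignment are names made up for the example.

```cpp
#include <cstdint>

// Stack consumed by one call on x86-64: the return address pushed by `call`,
// one 8-byte slot per `push`, and the `sub rsp, N` local area.
constexpr std::uint64_t frame_bytes(unsigned pushes, std::uint64_t sub_rsp) {
  return 8 /* return address */ + 8ull * pushes + sub_rsp;
}

// rsp was 16-byte aligned just before the caller's `call`, so a function
// that makes calls of its own needs this total to be a multiple of 16.
constexpr bool keeps_call_alignment(unsigned pushes, std::uint64_t sub_rsp) {
  return frame_bytes(pushes, sub_rsp) % 16 == 0;
}

static_assert(frame_bytes(4, 0x48) == 112, "the System V prologue: 4 pushes, sub rsp, 0x48");
static_assert(keeps_call_alignment(4, 0x48), "112 is a multiple of 16");
static_assert(frame_bytes(0, 0x28) == 48, "Windows: 32 shadow + 8 pad + 8 return address");
static_assert(keeps_call_alignment(0, 0x28), "48 is a multiple of 16");
```

A prologue that fails the alignment check in a function that makes calls usually means an instruction was miscounted, for example a push hidden further down the listing.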

3. Where do the arguments live?

The first six integer arguments on System V are in rdi, rsi, rdx, rcx, r8, r9; the first four on Windows are in rcx, rdx, r8, r9^[23]^[24]. The first eight floating-point arguments go in xmm0–xmm7 on System V; the first four in xmm0–xmm3 on Windows. A function whose body starts with mov rbx, rdi is preserving its first argument across a call. A function whose body starts with mov rax, [rdi] is dereferencing its first argument: that argument is a pointer, and the next reads tell you the struct shape.

4. What does each call site tell you?

A call rel32 at offset 0x47 in the function is calling some other function at a known offset; resolve that offset against the symbol table or the import table. A call qword ptr [rip + 0x...] is calling through a static function pointer: the address is in the GOT (Global Offset Table) on Linux or the IAT (Import Address Table) on Windows, both populated by the loader from the import metadata. A call qword ptr [rax + 0x18] is typically a virtual call: rax holds the vtable pointer loaded from the object a moment earlier, +0x18 is the byte offset of the slot (slot 3 with 8-byte entries), and the slot index plus the class name from RTTI usually identifies the method.

5. Where do the data accesses point?

RIP-relative loads mov rax, [rip + 0x12345] are the standard way an optimized binary refers to globals, string literals, vtables, and constant data. The displacement is computed against the next instruction's address; a debugger or objdump will resolve them to the symbol they hit. Loads with a base+index addressing mode (mov eax, [rdi + rsi*4]) are array accesses; the base is usually a function argument (a pointer to the array), the index is the loop counter, and the scale is sizeof(element).
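Both address computations are simple enough to do by hand when no tool is available. The sketch below spells them out; the function names and the example address 0x401000 are illustrative.

```cpp
#include <cstdint>

// Effective address of a RIP-relative operand: the address of the NEXT
// instruction plus the sign-extended 32-bit displacement.
constexpr std::uint64_t rip_relative_target(std::uint64_t insn_addr,
                                            unsigned insn_len,
                                            std::int32_t disp32) {
  return insn_addr + insn_len +
         static_cast<std::uint64_t>(static_cast<std::int64_t>(disp32));
}

// Effective address of a base + index*scale operand, e.g. [rdi + rsi*4].
constexpr std::uint64_t indexed_address(std::uint64_t base, std::uint64_t index,
                                        unsigned scale) {
  return base + index * scale;
}

// 48 8b 05 45 23 01 00 is `mov rax, [rip + 0x12345]`, 7 bytes long.
static_assert(rip_relative_target(0x401000, 7, 0x12345) == 0x41334c,
              "0x401007 + 0x12345");
// Element 3 of an int array at 0x5000 lives at 0x500c.
static_assert(indexed_address(0x5000, 3, 4) == 0x500c, "base + 3 * sizeof(int)");
```

The common mistake is to add the displacement to the address of the instruction itself; the instruction length matters, and it varies with the encoding.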

The widget below is the same workflow, applied to a faulting instruction in a stripped binary. Step through and the panel on the right will narrate what each instruction tells you about the stack frame, the argument registers, and where the data lives:

Each row is 8 bytes of stack memory. The callee-saved registers and the return address are recovered from the function's .eh_frame entry on Linux or its .pdata/.xdata entry on Windows^[9]; this is the same procedure the debugger and the kernel oops handler use. The local-variable rows are recovered from .debug_loc when the binary has it; "spilled register" rows are inferred from the prologue.

07 Fingerprinting library code

Many of the disassemblies you'll find yourself reading are not your code. The C runtime, the C++ standard library, the platform allocator, and a handful of library functions account for a large share of any program's .text. Knowing them by sight saves the time of resolving every faulting RIP back to a name. The patterns are surprisingly stable across compiler versions, and they look distinctive once you've seen them.

Six patterns worth memorizing.

memcpy and similar. A short prologue, then a wide vector loop with paired AVX or SSE loads and stores: vmovdqu ymm0, [rsi + ...]; vmovdqu [rdi + ...], ymm0. The ERMS (Enhanced REP MOVSB/STOSB) path on modern Intel collapses to rep movsb, one instruction the CPU runs as a microcoded fast loop^[25]. Either shape is a memcpy.
memset. Same shape as memcpy, but the stores come from a broadcast fill value instead of loads: vmovd xmm0, esi; vpbroadcastb ymm0, xmm0; vmovdqu [rdi], ymm0, or rep stosb.
strlen. An aligned-load loop followed by a SIMD compare against zero and a tzcnt or bsf on the result mask: pcmpeqb xmm0, xmm1; pmovmskb eax, xmm0; bsf ecx, eax. The pattern is "load, compare against zero, mask, bit-scan." Glibc's strlen uses this exact shape on x86-64.
Vtable dispatch. mov rax, [rcx]; call qword ptr [rax + N] is a virtual call. rcx on Windows or rdi on System V is the this pointer; [rcx] reads the vtable pointer from offset 0; [rax + N] indexes into the vtable. The slot index is N / 8.
Allocation site. A call to operator new(unsigned long) on Linux (mangled _Znwm) or malloc directly. Followed by a null check, a placement-new constructor inlined, and the resulting pointer assigned to a register or stored. The mangled C++ symbol is the giveaway: _Z is the prefix of the Itanium C++ ABI name mangling that GCC and Clang use^[26].
Stack-protector epilogue. A function compiled with -fstack-protector-strong ends with mov rax, [rsp + N]; xor rax, fs:[0x28]; jne __stack_chk_fail on Linux. The fs:[0x28] is the per-thread cookie; the XOR-and-compare is the canary check^[27].

The widget below is a quiz of the kind you'd be doing in a real debugging session: a snippet of stripped assembly, three candidate functions, pick the one that matches:

08 Sanitizers: instrumented runtime checks

A sanitizer is the compiler's offer to insert runtime checks around every potentially-unsafe operation. Four are in common use today: Clang ships all four, GCC ships ASan, UBSan, and TSan, and MSVC ships ASan:

AddressSanitizer (ASan). Detects out-of-bounds reads and writes on heap, stack, and globals; use-after-free; use-after-return; double-free^[6]. The mechanism is a shadow memory: every 8 bytes of program memory map to one byte of shadow that records how many of the 8 are addressable. Every load and store is preceded by a shadow check the compiler inserts inline. The original paper measured an average 1.73× slowdown on SPEC CPU2006, with the worst benchmarks (perlbench, xalancbmk) closer to 2.7×^[6]; memory overhead is roughly 2× to 4×.
UndefinedBehaviorSanitizer (UBSan). Detects signed integer overflow, shift-by-too-many-bits, null dereference (in some modes), reading from a misaligned pointer, and a long list of others^[5]. Lower overhead than ASan; the checks are a handful of extra instructions per operation.
ThreadSanitizer (TSan). Detects data races by instrumenting every memory access with a vector-clock update^[28]. 5× to 15× slowdown is typical, and the memory overhead is large (often 5× to 10×). Used on CI machines and in playtests, not in shipping builds.
MemorySanitizer (MSan). Detects reads of uninitialized memory by tracking a bit-precise shadow: one shadow bit per bit of application memory, recording whether that bit has been initialized^[29]. Requires every dependency, including the standard library, to be MSan-instrumented; this makes it harder to deploy than ASan, but when you can deploy it, it catches a class of bug nothing else does.

On a game, the standard configuration is: ASan + UBSan on every CI build that doesn't need shipping performance; TSan on a separate "race-hunt" build run nightly; MSan only when a specific bug suggests an uninitialized read. A debug build with -fsanitize=address,undefined catches an enormous fraction of latent bugs at no engineering cost beyond the build flags.

The shadow-memory mechanism is worth understanding because the failure mode of "ASan didn't catch it" is usually traceable to it. Each 8-byte aligned chunk of memory has one shadow byte; the byte's value is 0 (all 8 bytes addressable), 1–7 (only the first k bytes are addressable, the rest are a redzone), or a negative value (the entire chunk is poisoned). On every memory access the compiler emits an inline check that loads the shadow byte and traps if the access reads or writes a poisoned chunk.
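The encoding and the inline check fit in a short model. The sketch below is a toy, assuming 64 bytes of application memory, accesses that stay within one 8-byte chunk, and a single arbitrary negative poison value; ToyShadow and its methods are hypothetical names, and the real runtime maps addresses to shadow with a shift and an offset instead of an array index.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::int8_t kPoison = -1;  // arbitrary negative value for this toy

// 64 bytes of "application memory" described by 8 shadow bytes.
// 0 = all 8 bytes addressable, 1..7 = only the first k bytes addressable,
// negative = whole chunk poisoned.
struct ToyShadow {
  std::array<std::int8_t, 8> shadow;

  ToyShadow() { shadow.fill(kPoison); }

  // Mark [begin, begin + size) addressable; begin must be 8-byte aligned.
  void unpoison(std::size_t begin, std::size_t size) {
    assert(begin % 8 == 0 && begin + size <= shadow.size() * 8);
    std::size_t chunk = begin / 8;
    for (; size >= 8; size -= 8) shadow[chunk++] = 0;
    if (size > 0) shadow[chunk] = static_cast<std::int8_t>(size);
  }

  // Poison every chunk the range touches, as a free() would.
  void poison(std::size_t begin, std::size_t size) {
    for (std::size_t c = begin / 8; c * 8 < begin + size; ++c) shadow[c] = kPoison;
  }

  // The inline check: is an access of `size` bytes at `addr` allowed?
  bool access_ok(std::size_t addr, std::size_t size) const {
    assert(size >= 1 && size <= 8 && addr / 8 < shadow.size());
    const std::int8_t k = shadow[addr / 8];
    if (k == 0) return true;  // fast path: whole chunk addressable
    const std::int8_t last = static_cast<std::int8_t>((addr % 8) + size - 1);
    return last < k;          // a negative k fails for every offset
  }
};
```

For a 13-byte buffer at offset 16, the shadow bytes for its two chunks are 0 and 5: a one-byte access at offset 28 passes, and the same access at offset 29 traps, which is the off-by-one the redzone exists to catch.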

The widget below is a small simulation. Allocate a buffer; the runtime poisons the redzones around it. Read or write past the end of the buffer; ASan traps the access and prints the report. Free the buffer; the entire chunk turns into a use-after-free trap if you touch it again:

Widget panels: user memory (32 bytes shown) and shadow memory (one byte per 8 bytes of user memory). Cell states: addressable, redzone, poisoned, freed.

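The shadow-byte encoding here mirrors the real AddressSanitizer: a chunk fully addressable is 0, a partially-addressable chunk holds the count of addressable bytes, and a poisoned chunk holds a negative value identifying the kind of poison (heap-redzone, freed, stack-after-return)^[6]. Two blind spots follow from the design. Redzones surround whole objects only, so an overflow from one field into its neighbor inside the same struct (an off-by-one on a uint32_t field, say) stays in addressable memory and slips through. And because the check looks up the shadow byte of the chunk the access starts in, an unaligned access that straddles a chunk boundary can be partially out of bounds without being caught.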

Sanitizers that survive into production

Two newer approaches are cheap enough to run in, or much closer to, shipping builds. HWASan uses AArch64's top-byte-ignore feature to keep a small tag in the otherwise unused top byte of each pointer and checks it on every access against the tag recorded for the memory being touched; ARMv8.5's Memory Tagging Extension (MTE) moves the same check into hardware with a 4-bit tag, and on hardware that supports MTE the overhead is a few percent^[30]. GWP-ASan samples a small fraction of allocations at random and routes them through a guard-page allocator; it misses the rest, but adds essentially zero cost in steady state and catches the same class of overflow on the sampled allocations^[31]. Memory tagging is deployed in Android, and GWP-ASan in both Android and Chrome, at scale.

Windows has Page Heap (enabled per-binary with gflags.exe /p /enable game.exe /full), which places each allocation on its own page with the next page unmapped^[32]. Catches the same class of overflow as ASan with no recompile, at the cost of a much larger memory footprint and a substantial allocator slowdown. Useful when you can reproduce a bug but can't rebuild the binary that reproduces it.

09Crash dumps and post-mortem inspection

A crash dump is a snapshot of a process's address space at the moment it faulted. With matching debug info, the snapshot is a full debugger session you can re-attach to as many times as you need. With no debug info, it is a wall of hex. The format and the tooling differ by platform, but the workflow is the same.

Windows: minidumps

MiniDumpWriteDump writes a structured snapshot to a .dmp file^[33]. The flags chosen at write time decide what's in it. MiniDumpNormal is the smallest (a few hundred KB): registers, stack of every thread, list of loaded modules, system info. MiniDumpWithFullMemory is the largest (the entire process's committed memory): every heap, every mapped file, every thread stack. The middle flags (WithDataSegs, WithProcessThreadData, WithIndirectlyReferencedMemory) are the practical compromise for shipping titles that need to investigate without uploading gigabytes per crash.

Open a minidump in WinDbg with windbg -z game.dmp, point it at the matching PDB and source server with .sympath and .srcpath, and the standard commands work as if the process were live: k for the stack, ~* k for every thread's stack, !analyze -v for the heuristic root cause, dt for type-directed memory inspection, !heap -p -a addr for the allocation history of an address with Page Heap on. The Microsoft public symbol server (https://msdl.microsoft.com/download/symbols) provides PDBs for every shipped Windows DLL; configure it once and your stack traces include named frames inside kernel32, user32, d3d11^[11].

Linux: core dumps

The kernel writes a core file when a process is killed by a signal that's set up to dump (SIGSEGV, SIGABRT, SIGBUS, etc.) and ulimit -c permits it. The path pattern lives in /proc/sys/kernel/core_pattern and on most systemd-based distros (Fedora, Arch, modern Debian, openSUSE) pipes the dump to systemd-coredump^[34]. coredumpctl debug PID drops you into GDB attached to the snapshot, with the matching binary and debug-info files automatically located via the build-id. Ubuntu defaults to apport instead, with the same idea and a different command-line shape.

The same workflow without systemd-coredump: gdb game core. Use set debug-file-directory to point GDB at your .debug files; use set sysroot or set solib-search-path to point it at copies of the shared libraries the process had loaded; then bt to walk the stack, info threads to list every thread, and thread apply all bt to walk every stack at once.

What to look at first

A consistent triage order, roughly platform-independent:

Faulting instruction and address. What kind of fault was it (read, write, execute), and at what address? A SIGSEGV with an address near zero is a null deref. An address with the high bits set (e.g., 0xfffffffffffffff8) is often a small negative offset applied to a null pointer, such as indexing one element before a null base. An address that matches 0xfeeefeee, 0xdeadbeef, or another sentinel is a pointer that was loaded from memory the runtime had deliberately filled with that pattern, typically freed or never-initialized storage.
The faulting thread's stack. Count frames; identify the topmost one in code you own. The crash is often inside a library call that one of your functions handed the wrong arguments.
Other threads' stacks. If the bug is a race, the corruptor is on a different thread. Look for threads parked on a mutex the faulting thread holds; if that thread is in turn waiting on something they hold, that's a deadlock. Look for threads inside memcpy or memset with their destination overlapping your structure.
The argument registers at the crash. The first six (System V: rdi, rsi, rdx, rcx, r8, r9) or four (Windows: rcx, rdx, r8, r9) integer argument registers hold the arguments at function entry; later in the body they may have been reused. A SIGSEGV in the first few instructions of a function with rdi == 0 on Linux means someone called you with a null first argument.
The address of the corruption. If a sanitizer is installed (or Page Heap, or GWP-ASan) and reported the corruption, use its allocation/free history. Otherwise use a hardware watchpoint on the next reproduction.
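The first step of the triage, reading the faulting address, can be sketched as a small classifier. This is a heuristic under stated assumptions, not a rule any debugger implements: `FaultHint` and `classify_fault_address` are hypothetical names, the 64 KiB window is an arbitrary choice, and the pattern list combines the sentinels named above with the debug-CRT fill bytes (0xCD, 0xDD, 0xFD) from [3].

```cpp
#include <cstdint>

enum class FaultHint { NearNull, NegativeOffsetFromNull, FillPattern, Unknown };

// Heuristic classification of a 64-bit faulting address.
inline FaultHint classify_fault_address(std::uint64_t addr) {
    constexpr std::uint64_t kWindow = 0x10000;  // assumed: 64 KiB around zero

    if (addr < kWindow) return FaultHint::NearNull;
    // Addresses just below 2^64 are small negative numbers as signed values.
    if (addr > ~std::uint64_t{0} - kWindow) return FaultHint::NegativeOffsetFromNull;

    // Known 32-bit fill patterns; the pointer may hold one copy (upper half
    // zero) or two copies (an 8-byte read of filled memory).
    constexpr std::uint32_t kPatterns[] = {
        0xfeeefeeeu, 0xdeadbeefu, 0xcdcdcdcdu, 0xddddddddu, 0xfdfdfdfdu};
    const auto low = static_cast<std::uint32_t>(addr);
    const auto high = static_cast<std::uint32_t>(addr >> 32);
    for (std::uint32_t p : kPatterns) {
        if (low == p && (high == 0 || high == p)) return FaultHint::FillPattern;
    }
    return FaultHint::Unknown;
}
```

A hint of FillPattern says the pointer itself came from recycled or uninitialized memory, so the next question is who freed or never initialized the object that held it.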

10Heisenbugs and undefined behavior

A heisenbug is one whose presence depends on observation: it disappears under the debugger, reappears in release, vanishes if you add a printf, comes back when you remove one. The category is real, and the cause is almost always one of:

Optimizer-exposed UB. Signed overflow, strict aliasing, null deref, race on a non-atomic, oversized shift. The optimizer assumes UB doesn't happen and rewrites code accordingly. Adding a printf changes the surrounding code enough that the optimizer's assumption changes too, and the bug moves^[1].
Reads of uninitialized memory. The value read is whatever was at that address, and that depends on every prior allocator call and every prior write. Adding instrumentation changes the allocator state. MSan catches these reliably; until you have it, the bug looks like magic.
Timing-dependent races. The debug runtime adds latency at every locking and allocator boundary. Release runs without it. The race fires only in release.
Order-of-evaluation differences. C++17 fixed some of the most surprising ones, but argument-evaluation order is still not specified for many cases^[35]. f(g(), h()) may call g before or after h; if both have side effects on shared state, the result differs.
Floating-point reassociation. Under -ffast-math, the optimizer fuses, reorders, and treats NaN as impossible. Two builds of the same source can give different bit-exact answers; cumulative drift in physics or AI can become divergent over many frames^[4].

The textbook example, walked through in Chris Lattner's "What Every C Programmer Should Know About Undefined Behavior" series^[36], is a function that dereferences a pointer before checking it for null. The C standard says dereferencing a null pointer is undefined; the compiler is then permitted to assume the pointer is non-null at every later point, which means the subsequent if (p) check is dead code that the optimizer deletes. The same shape shipped as a Linux kernel privilege-escalation bug in 2009 (CVE-2009-1897): a tun-driver patch added sk = tun->sk just above an existing null check on tun, GCC removed the check on the same logic, and the kernel oops became a local-root exploit^[36]. The pattern is generic: the optimizer's correctness only requires the source's behavior to match in non-UB executions, and UB in the source is the lever that lets it delete code.
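The shape is easier to recognize in code. Here is a minimal sketch with hypothetical stand-in types (`Tun`, `Sock`, `poll_checked`, `kPollErr` are illustration names, not the kernel's code); the buggy ordering appears only as a comment, because executing it with a null pointer is undefined behavior.

```cpp
struct Sock { int flags; };   // hypothetical stand-ins for the real structures
struct Tun  { Sock* sk; };

constexpr int kPollErr = -1;

// Shape of the bug:
//
//   int poll_buggy(Tun* tun) {
//       Sock* sk = tun->sk;          // dereference first...
//       if (!tun) return kPollErr;   // ...so the optimizer may delete this check
//       return sk ? sk->flags : 0;
//   }

// Fixed ordering: the null check precedes the first dereference, so there is
// no earlier access from which the compiler could infer tun != nullptr.
inline int poll_checked(Tun* tun) {
    if (!tun) return kPollErr;
    Sock* sk = tun->sk;
    return sk ? sk->flags : 0;
}
```

When reviewing for this pattern, look at initializers in declarations at the top of a function: they are where a dereference most often slips in ahead of the check.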

The categories below are worth recognizing on sight because they explain the great majority of "it works in debug" reports.

Strict aliasing

The C and C++ standards say that an object's stored value can only be accessed through an lvalue of a compatible type, or through a character type such as char or unsigned char (and, since C++17, std::byte)^[13]. The optimizer uses this to assume that an int* and a float* point to disjoint storage and to reorder loads and stores accordingly. The classic failure:

strict-aliasing UB · do not write this

#include <cstdint>

float bits_to_float(uint32_t bits) {
  return *(float*)&bits;     // UB: reads a uint32_t object through a float lvalue
}                            // at -O2 may return 0 or a stale register value

The compliant equivalents are std::memcpy(&result, &bits, sizeof result), which works in every C++ version, or std::bit_cast<float>(bits) in C++20. With optimization on, mainstream compilers typically reduce both to the same register move; both are defined; neither relies on the strict-aliasing exception.
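A minimal sketch of the memcpy form, reusing the `bits_to_float` name from the example above and adding a hypothetical inverse, `float_to_bits`. It assumes a 32-bit IEEE-754 float, which the static_asserts check at compile time.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit float");
static_assert(std::numeric_limits<float>::is_iec559,
              "this sketch assumes IEEE-754 floats");

// Defined behavior: copies the object representation byte by byte.
inline float bits_to_float(std::uint32_t bits) {
    float result;
    std::memcpy(&result, &bits, sizeof result);
    return result;
}

// The inverse direction, same technique.
inline std::uint32_t float_to_bits(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}
```

In a C++20 codebase, std::bit_cast from the <bit> header expresses the same conversion in one expression and rejects mismatched sizes at compile time.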

Use-after-move

The C++ standard library's "moved-from" objects are in a "valid but unspecified" state. Many user-defined types follow the same convention. Reading from a moved-from object is not UB by the standard, but it is almost always a bug; the value is whatever the move constructor left behind, which depends on the implementation. The Clang static analyzer's cplusplus.Move checker and clang-tidy's bugprone-use-after-move catch most of these statically.
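When the source object has to stay usable after the move, one defensive option is to put it back into a known state in the same expression. A minimal sketch, with `take_all` as a hypothetical helper name:

```cpp
#include <string>
#include <utility>
#include <vector>

// Moves the contents out of `source` and leaves `source` as a freshly
// default-constructed (empty) vector, not in an unspecified moved-from state.
inline std::vector<std::string> take_all(std::vector<std::string>& source) {
    return std::exchange(source, {});
}
```

After the call, the source is guaranteed empty and can be reused immediately, so a later read of it is a defined read of an empty container and not a dependency on what a particular standard library leaves behind.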

Iterator invalidation

Calling vector::push_back or vector::insert can reallocate the underlying buffer, invalidating every iterator and pointer into the vector. A range-for loop over a vector that calls a function that pushes back into the same vector is the canonical case. Debug iterators (_GLIBCXX_DEBUG, MSVC's _ITERATOR_DEBUG_LEVEL=2) catch these at runtime; release iterators do not^[2].
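A sketch of the canonical case and one common way around it. The function name `append_halves_of_evens` is hypothetical; the buggy range-for form is shown only as a comment because running it is undefined behavior once push_back reallocates.

```cpp
#include <cstddef>
#include <vector>

// Buggy shape:
//
//   for (int v : values)
//       if (v % 2 == 0) values.push_back(v / 2);   // may invalidate the loop's iterators

// Index-based loop over a snapshot of the original size. Indices stay valid
// across reallocation; iterators, pointers, and references do not.
inline void append_halves_of_evens(std::vector<int>& values) {
    const std::size_t original_size = values.size();
    for (std::size_t i = 0; i < original_size; ++i) {
        const int v = values[i];  // copy the element before any push_back
        if (v % 2 == 0) values.push_back(v / 2);
    }
}
```

Copying the element into a local matters as much as the index: a reference obtained from values[i] would dangle after the push_back just like an iterator would.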

11Race conditions and TSan

A data race in C++ is two threads accessing the same memory location, at least one of them writing, at least one of the accesses not being atomic, and neither access ordered before the other by synchronization^[37]. The C++ memory model declares this to be undefined behavior; the optimizer is permitted to reorder, hoist, or fuse the accesses on the assumption that races don't happen. The Memory Model tutorial covers the model in depth; this section is the debugging side.

The right tool for detecting races is ThreadSanitizer^[28]. Its mechanism is a vector clock per thread: every synchronization (lock release and acquire, atomics with release-acquire ordering) advances and merges the clocks of the threads involved; every instrumented memory access is stamped with the accessing thread's current clock value; an access that conflicts with an earlier one whose stamp is not ordered before the current thread's clock is a race. The instrumentation is heavy (5–15× slowdown), but it catches races even when the timing happens to be benign in this run, because it reasons from happens-before order and not from observed symptoms. It still sees only the code paths the run actually executes.
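The happens-before test can be modeled in a few lines. What follows is a toy sketch of the vector-clock idea for two threads and write-write conflicts only; it is not TSan's implementation, and every name in it (`ThreadState`, `LockState`, `LastWrite`, `make_thread`, `acquire`, `release`, `record_write`) is hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kThreads = 2;
using VectorClock = std::array<std::uint64_t, kThreads>;

struct ThreadState {
    std::size_t tid;
    VectorClock clock;
};

// Each thread starts at epoch 1 in its own slot and 0 in everyone else's.
inline ThreadState make_thread(std::size_t tid) {
    assert(tid < kThreads);
    ThreadState t{tid, {}};
    t.clock[tid] = 1;
    return t;
}

struct LockState { VectorClock released{}; };

// Stamp left on a memory location by its most recent write.
struct LastWrite {
    bool valid = false;
    std::size_t tid = 0;
    std::uint64_t epoch = 0;
};

// Acquire: the thread learns everything the last releaser knew.
inline void acquire(ThreadState& t, const LockState& lock) {
    for (std::size_t i = 0; i < kThreads; ++i)
        t.clock[i] = std::max(t.clock[i], lock.released[i]);
}

// Release: publish the thread's clock on the lock, then start a new epoch.
inline void release(ThreadState& t, LockState& lock) {
    for (std::size_t i = 0; i < kThreads; ++i)
        lock.released[i] = std::max(lock.released[i], t.clock[i]);
    ++t.clock[t.tid];
}

// Records a write and returns true if it races with the previous write:
// the previous writer's epoch is newer than what this thread has observed.
inline bool record_write(const ThreadState& t, LastWrite& last) {
    const bool race = last.valid && last.tid != t.tid &&
                      last.epoch > t.clock[last.tid];
    last = LastWrite{true, t.tid, t.clock[t.tid]};
    return race;
}
```

Two unsynchronized writes from different threads report a race regardless of which ran first in wall-clock time; wrapping each write in acquire and release on the same LockState orders them and the report goes away.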

The widget below is a small simulation. Two threads increment a shared counter. With no synchronization, the result depends on the interleaving. With a mutex or a relaxed atomic, the count is correct. The right panel shows TSan-style happens-before edges and the race report:
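The race-free variant of the counter can be sketched in standard C++. This is a minimal example, assuming a toolchain where std::thread is available (some need -pthread at link time); `count_with_relaxed_atomic` is a hypothetical name.

```cpp
#include <atomic>
#include <thread>

// Two threads each perform `increments_per_thread` atomic increments.
// fetch_add is an atomic read-modify-write, so no increment is lost even
// with relaxed ordering; join() orders the final load after both workers.
inline int count_with_relaxed_atomic(int increments_per_thread) {
    std::atomic<int> counter{0};
    auto work = [&counter, increments_per_thread] {
        for (int i = 0; i < increments_per_thread; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread first(work);
    std::thread second(work);
    first.join();
    second.join();
    return counter.load(std::memory_order_relaxed);
}
```

Relaxed ordering is enough here only because nothing else is published through the counter; if other data had to become visible along with a particular count, the accesses would need release and acquire ordering.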

When TSan can't help

TSan instruments user-space accesses with the runtime it controls. The bugs it can't catch:

Races involving signal handlers. The handler interrupts at an unpredictable point; TSan's runtime cannot intercept the entry. Most async-signal-safe code is hand-written and the constraints are documented; the right defense is code review against the signal-safety list, not a sanitizer.
Races on memory shared with another process or kernel-side. A driver writing to a buffer your process maps via mmap, a SHM region another process is touching, a GPU DMA. None of these go through TSan's instrumented load/store; TSan sees only your side.
Races inside hand-written assembly or intrinsics. The compiler doesn't see the load and store, so doesn't instrument them.
Benign races that are still UB. The "double-checked locking" pattern, before C++11 atomics, is the textbook example. The race is benign in any execution that completes; the standard still says it is undefined.

For the bugs TSan can't catch, the next-best tool is rr or TTD: record a failing run, inspect the interleaving deterministically. For races that won't reproduce locally, the tactic is to add invariant checks in the suspect region (assert(state == EXPECTED)) and ship the build to a wider set of testers; the assert turns the silent corruption into a defined crash with a stack trace.
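One caveat for the invariant-check tactic: the standard assert macro compiles to nothing when NDEBUG is defined, which is the usual release configuration. A check that should survive into a tester build needs its own macro. A minimal sketch, with `ENGINE_VERIFY`, `verify_handler`, and `default_verify_handler` as hypothetical names:

```cpp
#include <cstdio>
#include <cstdlib>

using VerifyHandler = void (*)(const char* expr, const char* file, int line);

// Default behavior: log the failed expression and abort, so the crash
// reporter captures a stack trace at the point the invariant broke.
inline void default_verify_handler(const char* expr, const char* file, int line) {
    std::fprintf(stderr, "invariant failed: %s (%s:%d)\n", expr, file, line);
    std::abort();
}

// Replaceable handler, e.g. to write a dump before terminating.
inline VerifyHandler& verify_handler() {
    static VerifyHandler handler = default_verify_handler;
    return handler;
}

// Unlike assert(), this is active regardless of NDEBUG.
#define ENGINE_VERIFY(cond)                                   \
    do {                                                      \
        if (!(cond)) verify_handler()(#cond, __FILE__, __LINE__); \
    } while (0)
```

Usage in the suspect region is a single line such as ENGINE_VERIFY(state == EXPECTED); keeping the condition cheap matters, since the check now runs in the optimized build.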

12Game-specific debugging

The same principles apply to game engines, with three extensions: a GPU runs in parallel and can hang independently; the frame budget is hard and a 4-ms spike is a bug, not a slowdown; and a multiplayer game's bug may live in the divergence between two clients, not on either alone.

GPU hangs and TDR

A GPU command that doesn't complete within the OS-defined timeout (about 2 seconds on Windows^[38]) triggers Timeout Detection and Recovery: the OS resets the GPU, the application loses its device, and the next API call fails with DXGI_ERROR_DEVICE_REMOVED on Direct3D or VK_ERROR_DEVICE_LOST on Vulkan. The crash dump from a TDR is a snapshot of CPU state plus, if the runtime supports it (D3D12's DRED, Vulkan's VK_EXT_device_fault), the last GPU commands and the page-fault address.

The right tool for a TDR is a frame capture. RenderDoc^[39] captures every command submitted in a frame and replays it against a clean GPU; the offending draw is the one that doesn't return. PIX^[40] on Windows does the same with deeper integration into D3D12 and Xbox toolchains. The shader debugger inside both lets you step through the failing thread of the failing wave.

Frame spikes

Frame-time anomalies are easier to debug as a profile than as a traditional bug. A 4-ms spike on a 16-ms frame budget is a regression even if no functional output is wrong. The two kinds of profiler:

Sampling profilers. perf on Linux^[41], ETW (Windows Performance Recorder, xperf) on Windows^[42]. Interrupt the program at a fixed rate (1 kHz typical), record the call stack, aggregate. Statistical, no instrumentation cost. Best for "where is the time going overall"; misses individual spikes that don't sample.
Tracing profilers. Tracy^[43], Superluminal, the engine's own scope-based timer (PIX markers feed PIX, NVTX ranges feed Nsight). Record an entry-and-exit timestamp around each annotated region. Heavier, but every spike is captured and attributable.

A practical default for engine work: sampling profiler running on every CI build to catch broad regressions, instrumented tracing turned on in playtest builds to catch individual spikes. Both produce flame graphs the team should look at as a regular practice, not only when something is broken^[44].
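The scope-based timer behind a tracing profiler is a small RAII type. A minimal single-threaded sketch follows; `ScopedTimer`, `ScopeRecord`, and `scope_log` are hypothetical names, and a real profiler would write to a per-thread buffer and avoid allocating in the destructor.

```cpp
#include <chrono>
#include <string>
#include <utility>
#include <vector>

struct ScopeRecord {
    std::string name;
    std::chrono::nanoseconds elapsed;
};

// Global log of finished scopes. Not thread-safe; a sketch only.
inline std::vector<ScopeRecord>& scope_log() {
    static std::vector<ScopeRecord> log;
    return log;
}

// Records the time between construction and destruction of the object.
class ScopedTimer {
    using Clock = std::chrono::steady_clock;  // monotonic, unaffected by wall-clock changes

public:
    explicit ScopedTimer(std::string name)
        : name_(std::move(name)), start_(Clock::now()) {}

    ~ScopedTimer() {
        const auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(
            Clock::now() - start_);
        scope_log().push_back({std::move(name_), elapsed});
    }

    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    std::string name_;
    Clock::time_point start_;
};
```

Declaring a ScopedTimer at the top of a block is the whole annotation; because every entry and exit is recorded, a single 4-ms outlier shows up as one long record and not as a statistical blur.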

Determinism as a debugging tool

A deterministic engine is one that, given the same starting state and the same input log, produces bit-exact identical output. Achieving determinism is non-trivial (no walltime, no rand() without a seeded RNG, no platform-specific floating-point reassociation, no thread-order-dependent updates) but is usually worth doing for the simulation tier of the engine. The payoff for debugging is large: a bug reported as a frame-1024 desync between two clients can be reproduced by replaying the input log; a regression introduced by a refactor is found by running the same log against the old and new binaries and looking at the first frame they differ.

Lockstep multiplayer games are deterministic by necessity, since the full simulation runs on every client and the network only carries inputs. The same machinery doubles as the bug-reproduction tool: the input log from a desynced session is the smallest possible reproduction. RTS games, fighting games, and many RPGs ship with the input log saved on every match for this reason.
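The "first frame they differ" search is a short loop once the simulation step is a pure function of state and input. A minimal sketch in C++17 under that assumption; `SimState`, `Input`, `StepFn`, and `first_divergent_frame` are hypothetical names, and a real engine would compare a hash of the full simulation state per frame, not a single integer.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

using SimState = std::uint64_t;  // stand-in for the engine's simulation state
using Input = std::uint8_t;      // stand-in for one frame of recorded input
using StepFn = std::function<SimState(SimState, Input)>;

// Replays the same input log through two step functions (e.g. the old and
// the new build) and returns the index of the first frame whose resulting
// state differs, or std::nullopt if the runs never diverge.
inline std::optional<std::size_t> first_divergent_frame(
        const StepFn& old_step, const StepFn& new_step,
        SimState initial, const std::vector<Input>& input_log) {
    SimState a = initial;
    SimState b = initial;
    for (std::size_t frame = 0; frame < input_log.size(); ++frame) {
        a = old_step(a, input_log[frame]);
        b = new_step(b, input_log[frame]);
        if (a != b) return frame;
    }
    return std::nullopt;
}
```

The returned frame index is where to set the breakpoint: every frame before it is known-identical, so the divergent update is somewhere in that one step.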

13Pitfalls

Trusting the line table. A single source line in -O2 code maps to several disjoint instruction ranges, and several source lines map to the same instruction. The debugger picks a representative; treat it as a hint, not a fact. To find what's actually at a PC, read the disassembly.
"It works in debug" as a fix. Different runtime, different memory layout, different scheduler. The bug is still there; you just stopped seeing it. Find a sanitizer that catches it before declaring victory.
Catching the second bug, not the first. A use-after-free crashes hundreds of instructions after the actual error, in a place that looks like the bug. The pointer was freed five frames ago; the dereference now is the symptom. Tools that record allocation history (ASan, Page Heap, !heap -p -a) point at the original free.
Using volatile as a thread-safety primitive. volatile in C and C++ disables compiler optimizations on accesses to a variable; it does not impose memory barriers, does not provide atomicity, and does not stop the CPU from reordering. Use std::atomic for cross-thread state^[37].
Disabling sanitizers because they slow CI. A 3× slowdown on CI is cheap insurance against shipping a heap corruption. Run the sanitizer-instrumented build as a separate job on a smaller subset of tests; don't trade away coverage for build time.
Collecting only minimal minidumps. The smallest minidumps don't include heap memory; the most useful thing for debugging a heap corruption is the heap. Pick the minidump flags that capture indirectly-referenced memory (MiniDumpWithIndirectlyReferencedMemory on Windows^[33]) so the dump includes the data the registers and stack actually point at.
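Reading the stack trace as the truth. An optimized binary's call stack can omit frames. If bar ends with a tail call to foo, bar's frame is gone by the time foo crashes, and the trace shows foo called directly from bar's caller; inlined functions can vanish the same way when debug info is thin. Cross-check the top frame against the faulting RIP and the disassembly around each return address.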
Looking for the bug under the streetlight. The cause of a crash in memcpy is almost never inside memcpy. Walk the stack until you find the first frame in code that you control and that has a clearly-wrong value.

14What's next

The natural follow-on tutorials and references:

The x86-64 Assembly tutorial, for the ISA-level vocabulary §6 used freely.
The Memory Model tutorial, for the foundation under §11 on data races.
Read the DWARF 5 spec^[7]. The sections on line number information (§6.2) and call frame information (§6.4) are the parts that pay off most quickly.
Brendan Gregg's Systems Performance^[45] for the long-form treatment of perf, eBPF, and the production-debugging tooling on Linux.
John Robbins' Debugging Microsoft .NET 2.0 Applications^[46]. Covers WinDbg, minidumps, and the SOS extension at length; the Windows counterpart to Gregg's book.
Try rr on a real bug. The on-ramp is small (rr record ./your_program, then rr replay), and the workflow is hard to give up once you've used it for a "who set this pointer" hunt.
Set up a symbol server for your team. The friction of "I can't symbolicate my crash" is what kills most teams' dump-collection pipelines; a working symbol server removes it.

15Sources & further reading

Numbered citations refer to the superscripts above. Primary references first, practitioner resources second.

A note on originality

The prose, code samples, CSS, and interactive widgets on this page are original writing. The shadow-memory mechanism in §8's ASan widget follows the published AddressSanitizer paper [6]; the shadow encoding (chunk fully-addressable / partial / poisoned) is the actual encoding ASan uses. The DWARF call-frame description in §6 is the standard mechanism documented in the DWARF 5 specification [7]. The vector-clock model behind §11's TSan visualization is the algorithm described in [28]. The release-vs-debug example in §1 is a standard demonstration of UB-induced miscompile, repeated in many compiler-engineering talks.

ISO/IEC. (2020). Programming languages — C++ (ISO/IEC 14882:2020). Working draft N4860 available at wg21.link/n4860. [defns.undefined] defines undefined behavior; [expr.pre] paragraph 4 covers arithmetic results that are not representable in their type, which includes signed integer overflow.
Microsoft. Checked iterators and iterator debug levels. learn.microsoft.com. Documents _ITERATOR_DEBUG_LEVEL, the macro that turns on iterator-validation checks in MSVC's standard library.
Microsoft. CRT debug heap details. learn.microsoft.com. Documents the 0xCD/0xDD/0xFD fill bytes the debug CRT writes to allocated, freed, and no-mans-land memory; the source of the "you'll see 0xCDCDCDCD" lore.
Free Software Foundation. Optimize Options — -ffast-math. GCC manual. gcc.gnu.org. Lists the sub-flags -ffast-math implies (-fno-signed-zeros, -fno-trapping-math, -funsafe-math-optimizations, etc.) and what each one permits the compiler to assume.
LLVM Project. UndefinedBehaviorSanitizer. clang.llvm.org. The reference for the UBSan check categories, runtime cost, and the trap-vs-recover modes.
Serebryany, K., Bruening, D., Potapenko, A., & Vyukov, D. (2012). AddressSanitizer: A Fast Address Sanity Checker. USENIX ATC. PDF. The original paper. Describes the shadow-memory layout, the redzone scheme, the use-after-free quarantine, and the measured 1.7–3× slowdown on SPEC CPU2006.
DWARF Debugging Information Format Committee. (2017). DWARF Debugging Information Format Version 5. dwarfstd.org. The current spec. Section 6.2 (line number information), Section 6.4 (call frame information), and Chapter 7 (data representation) are the most-consulted parts.
Microsoft. microsoft-pdb. github.com/microsoft/microsoft-pdb. Microsoft's partial open-source documentation of the Program Database file format. Combined with LLVM's pdbview docs, sufficient to read CodeView records.
Microsoft. x64 exception handling. learn.microsoft.com. The .pdata and .xdata sections of the PE format and how the OS uses them to walk the stack on exceptions and crash-dump generation.
Free Software Foundation. Debugging Options. GCC manual. gcc.gnu.org. The reference for -g, -gN, -ggdb, -gdwarf-N, and the explicit promise that they don't change generated code.
Microsoft. Symbol servers and symbol stores. learn.microsoft.com. The Windows symbol-server protocol; works with WinDbg, Visual Studio, and the Windows Performance Toolkit.
Gregg, B. (2024). The Return of the Frame Pointers. brendangregg.com. The case for keeping -fno-omit-frame-pointer on by default; discusses Fedora 38 and Ubuntu 24.04 making it the default for system packages.
ISO/IEC. (2018). Programming languages — C (ISO/IEC 9899:2018). §6.5/7 (the strict-aliasing rule). Carried into C++ as [basic.lval] in the C++ standard.
Apple. Writing ARM64 code for Apple platforms. developer.apple.com. Apple's ARM64 platform conventions; mandates the frame pointer at x29, with linkage at every call.
Acton, M. (2014). Data-Oriented Design and C++. CppCon 2014. YouTube. The reference talk on the Insomniac engine team's approach to runtime layout and the role of asserts and explicit invariants in production code.
LLVM Project. LLVM's Analysis and Transform Passes. llvm.org. The catalog of optimization passes the optimizer applies, in the order they apply. Useful both for reading the assembly the optimizer produced and for understanding why a particular transformation happened.
Free Software Foundation. Debugging with GDB. sourceware.org/gdb. The GDB reference manual. Chapters on conditional breakpoints, watchpoints, the Python API, and reverse debugging.
O'Callahan, R., Jones, C., Froyd, N., Huey, K., Noll, A., & Partush, N. (2017). Engineering Record And Replay For Deployability. USENIX ATC. PDF. The paper describing rr's design: deterministic replay via syscall recording and signal interception.
Microsoft. Time Travel Debugging — Overview. learn.microsoft.com. WinDbg's record-and-replay extension. Same idea as rr, integrated into the Windows debugger.
Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide. Order Number 253669. intel.com. Chapter 17 covers the debug registers (DR0–DR7), the access-type encoding, and the breakpoint-condition fields.
ARM Limited. Arm Architecture Reference Manual for A-profile architecture. Document DDI 0487. developer.arm.com. Section D2 (the Debug architecture) covers the watchpoint and breakpoint registers (WVR/WCR, BVR/BCR) and their access semantics.
Microsoft. DbgHelp Library. learn.microsoft.com. The Windows debug-helper API; SymFromAddr, StackWalk64, and the symbol-server interface used by every Windows debugger.
Matz, M., Hubička, J., Jaeger, A., & Mitchell, M. (2014). System V Application Binary Interface, AMD64 Architecture Processor Supplement. gitlab.com/x86-psABIs/x86-64-ABI. The Linux/macOS/BSD calling convention reference. Argument registers, callee-saved set, struct-classification rules.
Microsoft. x64 calling convention. learn.microsoft.com. The Windows x64 ABI: argument registers, callee/caller-saved sets, shadow space, struct passing.
Intel Corporation. Intel® 64 and IA-32 Architectures Optimization Reference Manual. §3.7.6 (Enhanced REP MOVSB and STOSB). Order Number 248966. intel.com. The microcode optimization that makes rep movsb a viable memcpy on Ivy Bridge and later.
Itanium C++ ABI Working Group. Itanium C++ ABI. itanium-cxx-abi.github.io. The C++ ABI used by GCC, Clang, ICC, and most non-Microsoft toolchains. §5 documents the name-mangling scheme that produces symbols like _Znwm for operator new(unsigned long).
Free Software Foundation. Instrumentation Options — -fstack-protector. GCC manual. gcc.gnu.org. Documents the stack-canary insertion, the __stack_chk_fail handler, and the variants -fstack-protector, -fstack-protector-strong, and -fstack-protector-all.
Serebryany, K., & Iskhodzhanov, T. (2009). ThreadSanitizer — data race detection in practice. WBIA. PDF. The original TSan paper; the modern compiler-instrumented v2 design is documented in the Clang docs at clang.llvm.org.
Stepanov, E., & Serebryany, K. (2015). MemorySanitizer: fast detector of uninitialized memory use in C++. CGO. PDF. Bit-level shadow tracking; requires every dependency to be instrumented.
Serebryany, K., Stepanov, E., Shlyapnikov, A., Tsyrklevich, V., & Vyukov, D. (2018). Memory Tagging and how it improves C/C++ memory safety. arXiv:1802.09517. arxiv.org. Covers HWASan and the relationship to ARMv8.5 MTE; deployed in Android.
LLVM Project. GWP-ASan. llvm.org. The sampling-based heap allocator that catches a fraction of out-of-bounds and use-after-free at near-zero overhead. Deployed in Chrome and Android.
Microsoft. GFlags and PageHeap. learn.microsoft.com. The gflags /p /enable mode that places each allocation on its own page with the next page unmapped.
Microsoft. MiniDumpWriteDump function. learn.microsoft.com. The reference for the dump-type flags and what each one captures. Choosing the right flags is the difference between a useful and a useless crash report.
systemd Project. systemd-coredump. freedesktop.org. The default core-dump handler on most modern Linux distributions; the coredumpctl command-line interface to it.
Dos Reis, G., Sutter, H., & Caves, J. (2016). P0145R3: Refining Expression Evaluation Order for Idiomatic C++. WG21. open-std.org. The C++17 paper that nailed down the evaluation order of common patterns (a->b(), a[b]) that had been unspecified.
Lattner, C. (2011). What Every C Programmer Should Know About Undefined Behavior, Part 1. LLVM Project Blog. blog.llvm.org. Three-part series walking through canonical UB-induced miscompiles: signed overflow, dereferencing-before-null-check, oversized shifts, strict-aliasing violations. The "their code broke when someone else did a debug build" framing is from Part 1. The shipping example mentioned in §10 is CVE-2009-1897, walked through by Jonathan Corbet at LWN: Fun with NULL pointers, part 1 (lwn.net/Articles/342330) — GCC eliminating an existing null check on tun in the Linux kernel tun-driver after a preceding dereference.
Boehm, H.-J., & Adve, S. V. (2008). Foundations of the C++ concurrency memory model. PLDI. dl.acm.org. The paper that became the C++11 memory model. §1.10 of the C++ standard is the normative version.
Microsoft. Timeout Detection and Recovery (TDR). learn.microsoft.com. The Windows Display Driver Model's protection against GPU hangs; the 2-second default and how to configure it.
Karlsson, B. RenderDoc. renderdoc.org. Free, open-source frame-capture and replay tool for Vulkan, D3D11, D3D12, OpenGL. The standard tool for "what did my GPU actually receive?"
Microsoft. PIX on Windows. devblogs.microsoft.com. Microsoft's GPU profiler and frame-capture tool; deeper integration with D3D12 and the Xbox toolchain than RenderDoc, with support for shader debugging at the wave level.
Linux kernel. perf: Linux profiling with performance counters. perf.wiki.kernel.org. The reference Linux profiler. perf record, perf report, perf script, the --call-graph options.
Microsoft. Event Tracing for Windows (ETW). learn.microsoft.com. The kernel-level tracing infrastructure on Windows; xperf, Windows Performance Recorder, and Windows Performance Analyzer all sit on top of it.
Taudul, B. Tracy Profiler. github.com/wolfpld/tracy. Real-time, nanosecond-resolution frame profiler designed for games. Open source, integrates with most engines through a small C API.
Gregg, B. (2016). The Flame Graph. Communications of the ACM 59(6). queue.acm.org. The visualization that became the standard for sampling-profile output; usable on stacks from perf, ETW, or any other sampling source.
Gregg, B. (2020). Systems Performance: Enterprise and the Cloud (2nd ed.). Pearson. The reference book for performance debugging on Linux. Chapters 6 (CPUs), 7 (memory), and 15 (BPF) are the most-read.
Robbins, J. (2007). Debugging Microsoft .NET 2.0 Applications. Microsoft Press. WinDbg, SOS, ADPlus, and the Windows-side debugging machinery; aged in places (the .NET 2.0 specifics) but the WinDbg material is still the most comprehensive.
Bendersky, E. Eli Bendersky's website. eli.thegreenplace.net. Long-form articles on DWARF, ELF, the loader, and the corners of the GNU toolchain that show up in release-mode debugging. Strong on the format-of-the-debug-info side.
Dawson, B. Random ASCII. randomascii.wordpress.com. Practitioner blog from a profiling and debugging engineer at Valve and previously Microsoft. The "thirty-one days of debugging" series and the ETW writeups are the foundation reading for Windows-side performance work.
Giesen, F. (Ryg). The ryg blog. fgiesen.wordpress.com. Practitioner-grade writeups on codec internals, SIMD, and the kinds of release-mode bugs that come up in shipping codec work at RAD.
Cooper, K. D., & Torczon, L. (2011). Engineering a Compiler (2nd ed.). Morgan Kaufmann. The standard textbook on compilers; chapters on register allocation, instruction selection, and the optimizer pipeline give the underlying view of why -O2 output looks the way it does.
Levine, J. R. (1999). Linkers and Loaders. Morgan Kaufmann. Old, still authoritative on ELF, PE, the dynamic linker, and what the loader does between execve and main.
Drepper, U. (2011). How to Write Shared Libraries. PDF. The reference for symbol resolution, GOT/PLT, the visibility attributes, and the IFUNC mechanism. The companion document to the same author's "What Every Programmer Should Know About Memory."