Tutorial 05 · Engine Programming

The C++ Memory Model
from Scratch

Eight cores. Three levels of cache, the first two private per core. A store buffer that holds your write for tens of cycles before any other core can see it, longer when the line is contested. The C++ memory model is the contract that lets you reason about what your threads can observe without naming any of this. We start with one cache line, work up through acquire-release, hazard pointers, and the ARM-correct Chase-Lev deque, and run every step as a live demo in your browser.

Time~65 min LevelSenior engine programmer PrereqsYou can read C++ at an intermediate level. You've heard of std::atomic. The Job Systems tutorial uses these primitives; this one teaches what they mean. HardwareSome idea that CPUs have caches

01Why a memory model exists

A single-threaded program is easy to reason about. The compiler may reorder instructions for speed, the CPU may execute them out of order, the cache may hold a write for a thousand cycles before flushing it to memory, none of that matters. The single thread does its own reads and only ever sees the values it last wrote, no matter how aggressively the layers underneath shuffle the execution. The illusion is total. The hardware spends transistors to maintain it.

A multi-threaded program shatters that illusion the moment a second thread looks at the same memory. The compiler's reorderings now affect what another thread sees. The CPU's out-of-order execution now matters. The store buffer that held the write for fast retirement is now a delay another thread observes. The cache that hasn't told its neighbors about the new value is now a divergence. A is the contract the platform offers about what reorderings are visible across threads, so a programmer can write code that's correct without naming the hardware.

What you'll have by the end

A working understanding of the C++ memory model strong enough to write a thread-safe reference counter, a single-producer/single-consumer lock-free queue, a Treiber stack with the ABA bug fixed by hazard pointers, and the Lê-Pop-Cohen-Zappa-Nardelli 2013 corrected Chase-Lev work-stealing deque^[1]. A live ordering playground lets you flip orderings on a producer-consumer pair and watch invariants break. By the end you'll know why seq_cst is expensive on ARM and free on x86^[2], why memory_order_relaxed is not "no ordering," why hazard pointers exist when generational counters could do the job, and what the difference between lock-free and wait-free actually buys you.

The textbook example that breaks under threading

Two threads, two shared integers initialized to zero. Thread A stores 1 to x then loads y. Thread B stores 1 to y then loads x. At least one of the loads must return 1, right? Both stores happen before any load on the other thread, so when the loads run, at least one of the stores must be visible:

store_buffer_demo.cpp · the intuition that doesn't survive contact

// Shared globals, initialized to zero.
int x = 0;
int y = 0;

// Thread A:
x = 1;             // store
int a_view_of_y = y; // load

// Thread B (running in parallel):
y = 1;             // store
int b_view_of_x = x; // load

// Question: can BOTH a_view_of_y == 0 AND b_view_of_x == 0?
// Intuition says no. One of the stores has to land before the loads run.

On an x86 CPU, both loads returning zero is not only possible, it's the failure mode tests use to detect a missing fence^[2]. The reason is the store buffer: when thread A writes x = 1, the store sits in the per-core store buffer for some cycles before being committed to the cache. Thread A's own subsequent load of y can run before that buffered store is visible to thread B. Thread B does the same thing with its store of y. Both threads finish their loads before either store reaches the cache the other thread reads from. Both loads return zero. Sequential consistency is broken, and there's no compiler optimization involved at all.

The widget below runs that exact pattern as a model: two threads, each with a store buffer of configurable depth, each issuing a store followed by a load. Click run and watch the outcome distribution. With buffer depth zero (loads observe stores immediately), the (0, 0) outcome never appears. As soon as the buffer holds even one outstanding store, (0, 0) starts showing up:

store buffer depth 2

trials per click 2000

(1, 1) both saw store

···

(0, 1) or (1, 0)

···

(0, 0) both saw stale

···

02A short history of getting cross-thread ordering right

The C++ memory model didn't drop fully formed in 2011. Its shape is the cumulative answer to thirty years of confusing concurrency bugs, each of which forced someone to write down what the hardware actually guarantees. A tour, because the current shape only makes sense in the context of what it replaced:

1979

Lamport defines sequential consistency.^[4] "The result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." The strongest model worth shipping, and the one most programmers still imagine when they think about threading. It is not what most hardware gives you.

1995

Sun's TSO becomes the SPARC default.^[5] Total Store Order: stores are seen by other cores in program order, but a load can retire before an earlier store to a different address becomes visible. The shape x86 would adopt informally and document much later.

2004

The Java 5 memory model (JSR-133).^[6] Manson, Pugh, and Adve formalize a memory model with happens-before as the central relation. Volatile fields gain release/acquire semantics. Java becomes the first widely-deployed language with a written contract about cross-thread visibility.

2005

Boehm: "Threads cannot be implemented as a library."^[7] The watershed paper that ends the pthreads-on-top-of-C era. Boehm shows that compilers, ignorant of threads, will reorder code in ways that break race-free reasoning. The C and C++ committees notice. C++11 starts on its memory model.

2008

Boehm and Adve: "Foundations of the C++ Concurrency Memory Model."^[8] The PLDI paper that became the basis of the C++11 atomics and orderings. relaxed, acquire, release, acq_rel, and seq_cst as a controlled family of relaxations from sequential consistency.

2010

x86-TSO formalized.^[2] Sewell, Sarkar, Owens, Nardelli, and Myreen publish a precise, machine-checked model of x86 ordering. Settles a decade of arguments about whether Intel hardware really did have a store buffer and what was actually guaranteed.

2011

C++11 ships std::atomic.^[3] The C++ standard gains a memory model. The familiar five orderings appear. std::atomic_thread_fence gives standalone barriers. The fence-vs-atomic-operation distinction (Hans Boehm's "fence pairs with an atomic, not another fence" rule) is encoded in the abstract machine.

2012

Maranget, Sarkar, Sewell: "A Tutorial Introduction to the ARM and POWER Relaxed Memory Models."^[9] The first plain-language treatment of how weak ARM ordering is, with the running examples that became canonical (independent reads of independent writes, message-passing pattern, store-buffer pattern). The paper that taught a generation of systems programmers how to read a litmus test.

2013

Lê, Pop, Cohen, Zappa Nardelli correct Chase-Lev.^[1] They find that the 2005 Chase-Lev work-stealing deque has races on ARM and POWER, and publish the version with the correct acquire/release annotations. Modern work-stealing runtimes that target ARM (Rust's Rayon, Go's scheduler, several AAA engines on PS5 and XSX) use this corrected variant or one equivalent.

2014

ARM v8 adds dedicated load-acquire and store-release instructions.^[10] LDAR and STLR. Half-fences in hardware, cheaper than the full DMB ISH barrier the v7-era code used. The C++ memory_order_acquire and memory_order_release finally have a direct hardware lowering on ARM, not just a stronger fence.

2017

C++20 fixes seq_cst on ARM and POWER.^[11] P0668 "Revising the C++ memory model" addresses a known bug where the C++11 specification of seq_cst was not strong enough to forbid certain executions on weak hardware. The fix tightens the standard and the implementation strategies; the practical effect is that seq_cst stores on ARM emit DMB ISH; STR; DMB ISH instead of the cheaper-but-wrong STLR.

2024

C++26 adopts std::execution (senders/receivers).^[12] Structured concurrency on top of the same memory model. The orderings haven't changed; the way you compose work on top of them has. Out of scope for this tutorial except as the next thing to learn after.

The recurring pattern: the math is from 1979, the formal model from 2005-2010, the standard library from 2011, the corrections to make the standard work on ARM from 2013-2017. Each step the previous one is taken for granted. Most production C++ code in 2026 is still written against a mental model that's closer to Lamport 1979 than to C++20, which is why so much of it is wrong.

03What the CPU actually does

Before the C++ orderings make sense, you need the picture of the hardware they're hiding. A modern CPU is roughly the following, per core:

An L1 data cache (32 KB on Zen 4 and Skylake, 48 KB on Sunny Cove and Zen 5, 128-192 KB on Apple M-series performance cores^[13]), the only piece of memory the core touches at execution speed.
A store buffer (a per-core FIFO, on the order of 50 to 200+ entries on recent x86 and ARM; ~56 on Skylake, ~64 on Zen 4, larger on recent Apple cores) that holds retired stores until they can be committed to L1. The store retires from the pipeline immediately; the buffer drains lazily.
A load buffer that holds in-flight loads and lets the core speculate past them.
On weakly-ordered architectures (POWER, classical ARM implementations), an invalidation queue that defers incoming "this cache line is now invalid" messages so they don't block the pipeline. x86 hides this layer behind TSO.

The L1 cache talks to other cores' L1 caches through a cache coherence protocol, typically . MESI guarantees that for any one cache line, only one core at a time holds a writable copy; readers all see the same value. Cache coherence is what makes memory_order_relaxed still meaningful. Even with relaxed, two threads reading the same atomic see a coherent sequence of values; what relaxed gives up is ordering across distinct atomics.

The widget below animates one cache line moving through the four MESI states as two cores read and write. Click a core's read or write button; watch the state change and the message traffic between the caches. The legend at the bottom names each state:

Core 0

Core 1

M Modified · dirty, exclusive to this core E Exclusive · clean, only this core has it S Shared · clean, multiple cores have it I Invalid · this core's copy is stale

A read into an Invalid line issues a "read" message; if no other core has it, the requesting core ends in Exclusive. If another core has it Modified, the holder writes back and both end in Shared. A write into a Shared line broadcasts "invalidate" to the others before transitioning to Modified. The invalidate is the source of the cross-core latency that std::atomic operations pay every time they cross a cache line another core was reading^[14].

The store buffer is the second crucial piece. When the core writes to memory, the write retires from the execution pipeline into the store buffer almost immediately. From the core's perspective, the write is done. From the cache's perspective, and therefore from every other core's perspective, the write hasn't happened yet. The buffer drains in the background, committing entries to L1 once the cache line is in the right MESI state. Until then, every other core sees stale data.

The visible effects on multithreaded code:

Store-to-load reordering, single-core. A store followed by a load to a different address can complete with the load reading the cache before the store reaches it. The core itself uses store-buffer forwarding so its own loads see its own stores; other cores do not.
Store-to-load reordering, cross-core. The Dekker pattern from §1. Both threads' stores are still in their respective store buffers when their respective loads run.
Independent reads of independent writes (IRIW). Four cores: two writers, two readers. The two readers can disagree about the order in which the two writes became visible. Only seq_cst forbids this on weak hardware^[9].

Why does the store buffer exist at all? Why can't the CPU just commit stores directly?

Because the cache line might not be in the right state. To write a value to a cache line, the core needs the line in Modified or Exclusive state. If the line is Shared (other cores have it) or Invalid (the core doesn't have it), the write would stall waiting for the invalidate-broadcast or read-for-ownership round trip. That's 50 to 200 cycles per store. On a code path with ten stores in a row, that's serialization on cache traffic, and the CPU pipeline would be idle the entire time.

The store buffer hides the latency. The store retires from the pipeline into the buffer, the pipeline keeps going, the buffer waits in the background for the cache line. By the time the buffer wants to commit, the cache line is usually ready. The store buffer is one of the most important latency-hiding tricks in modern CPUs and is why single-threaded performance is what it is. The cost is that another core's view of memory is now stale relative to the executing core's view, which is exactly the problem the memory model has to specify.

04Sequential consistency, the strawman

Sequential consistency (SC) is Lamport's 1979 model^[4]: every operation, by every thread, can be placed in one global total order, and every thread sees that order. Inside the order, each thread's operations appear in program order. Inside the order, between threads, the operations can interleave however the schedule dictated. The two rules:

One total order exists.
Each thread's contributions to that order match its program order.

SC is what most programmers imagine. It is not what any mainstream production CPU implements as the default, because preserving it across cores requires either fully synchronous cache invalidations (slow) or the per-store global broadcasts that make seq_cst stores on ARM cost a fence on each side. Shipping hardware uses TSO (x86, SPARC v9, IBM zSeries), or one of the weakly-ordered families (ARMv8, POWER, RISC-V's RVWMO), and language standards expose those models with optional knobs to recover SC where you need it.

Two patterns the SC mental model gets wrong on real hardware. Both are litmus tests from the Sewell and Maranget papers^[2]^[9]:

Store Buffer (SB)

init: x=0, y=0

T0:           T1:
  x = 1;        y = 1;
  r0 = y;       r1 = x;

forbid: r0=0 AND r1=0  (under SC)
allowed: r0=0 AND r1=0  (under TSO and weaker)

Message Passing (MP)

init: data=0, ready=0

T0:                  T1:
  data = 42;           while (ready == 0);
  ready = 1;           r = data;

forbid: r=0    (under SC, or with acq/rel)
allowed: r=0   (under TSO, ARM weak, without barriers)

Independent Reads of Independent Writes (IRIW)

init: x=0, y=0

T0: x = 1;          T1: y = 1;
T2: r0=x; r1=y;     T3: r2=y; r3=x;

forbid: r0=1, r1=0, r2=1, r3=0  (under SC)
allowed on POWER and pre-2018 ARMv7
forbidden on x86 (TSO is multi-copy atomic)
forbidden on ARMv8 since the 2018 revision
  (Pulte et al., "Simplifying ARM Concurrency")

The IRIW one is the cleanest demonstration that the world is not always flat. Two writes happen in parallel. Two distinct readers, each reading both atomics, can disagree about the order in which the writes became visible. Reader 2 saw x become 1 before y; reader 3 saw y become 1 before x. On x86 this can't happen because TSO is multi-copy atomic: every core sees stores from every other core in the same order, even though each core can reorder its own store with its own subsequent load. On POWER and pre-2018 ARMv7 it can. Modern ARMv8 was retroactively declared multi-copy atomic in 2018^[35], which closed this particular surprise on AArch64; weak orderings on ARMv8 still let SB, MP, and LB outcomes through, just not IRIW.

05x86-TSO vs ARM: the two memory models that ship

Two memory models cover the vast majority of CPUs games and servers run on. Both relax sequential consistency in specific ways, and the relaxations cost specific fences to recover:

x86-TSO (Total Store Order)

Sewell et al. 2010^[2] proved that x86 (Intel and AMD) behave as if each core has a FIFO store buffer between its registers and a global, coherent memory. The model permits exactly one relaxation from SC:

Loads can move ahead of earlier stores to a different address (the store-buffer effect). This is the Dekker case from §1.

Everything else holds. Loads do not reorder with earlier loads. Stores do not reorder with earlier stores. A load on one core observing a value implies the corresponding store has drained from some store buffer to the shared cache. All cores observe all stores in the same global order (multi-copy atomicity). This is why x86 atomics are almost free: a normal load is acquire by default, a normal store is release by default, only seq_cst stores need an MFENCE to flush the store buffer.

ARMv8-A and POWER (weak memory models)

Maranget, Sarkar, Sewell 2012^[9] documents ARMv7 and POWER as multiple-copy non-atomic with a per-core ordering relation called reads-from. The relaxations from SC permitted by both architectures at the time:

Loads can move ahead of earlier loads (no implicit load-load ordering).
Stores can move ahead of earlier stores (no implicit store-store ordering).
Loads can move ahead of earlier stores (as on x86).
Two readers can observe two writers' stores in different orders (IRIW). POWER still permits this; ARMv8 was retroactively re-defined as multi-copy atomic in the 2018 spec revision^[35], closing the IRIW gap but leaving the load-load and store-store reorderings in place.

ARMv8-A introduced acquire/release semantics as first-class instructions: LDAR (Load-Acquire Register) and STLR (Store-Release Register)^[10]. These are half-fences: an LDAR prevents subsequent loads and stores from moving ahead of it; an STLR prevents preceding loads and stores from moving past it. Cheaper than the v7-era DMB ISH full fence. The C++ memory_order_acquire and memory_order_release lower to these directly on ARMv8.

The widget below compares the litmus-test outcomes under SC, x86-TSO, and ARM. Toggle the model and watch which outcomes the model permits. Numbers are from running the formal models in the Sewell and Maranget papers, not from a real CPU:

06The five C++ memory orderings

C++ exposes five values for the ordering parameter on every atomic operation. Two are pure loads, two are pure stores, one is both. Reading the table the wrong direction is how people misuse them. Read it as "the operation does this", not as "the ordering does this":

Ordering	Use on	What it gives you	What it costs on x86 / ARM
`memory_order_relaxed`	load, store, RMW	Atomicity and modification-order consistency on this atomic. No ordering across other atomics or non-atomics.	x86: plain MOV. ARM: plain LDR / STR.
`memory_order_acquire`	load, RMW	This load synchronizes-with a corresponding release store on the same atomic. No subsequent load or store can move ahead of this load.	x86: plain MOV (acquire is free). ARM: LDAR (one half-fence).
`memory_order_release`	store, RMW	This store synchronizes-with a subsequent acquire load on the same atomic that reads this value. No preceding load or store can move past this store.	x86: plain MOV (release is free on TSO). ARM: STLR (one half-fence).
`memory_order_acq_rel`	RMW only	The RMW is both acquire (on the value it read) and release (on the value it wrote).	x86: LOCK-prefixed RMW. ARM: LDAXR / STLXR pair.
`memory_order_seq_cst`	load, store, RMW	All seq_cst operations across all threads have a single total order. Stronger than acquire+release where IRIW would otherwise be permitted; on ARMv8 post-2018 the gap closed for the IRIW shape but seq_cst still adds ordering for SB-like patterns.	x86: store needs MFENCE or compiles to XCHG. ARMv8: STLR for the store, LDAR for the load (modern GCC/Clang); pre-P0668 some compilers used the conservative `DMB ISH; STR; DMB ISH` form.^[11]

Two things to internalize from the table. First, an ordering is a property of the operation, not of the atomic. The same std::atomic<int> can be loaded relaxed by one thread and acquired by another; only the acquire load participates in synchronizes-with. Second, fences only buy you something when they pair with an atomic operation, never with each other. An std::atomic_thread_fence(memory_order_release) pairs with a later atomic acquire load; two fences alone do nothing.

What does "synchronizes-with" actually mean in the C++ standard?

The C++ abstract machine defines three relations on operations: sequenced-before (intra-thread program order), synchronizes-with (cross-thread cause-and-effect), and happens-before (the transitive closure of the two). A data race is when two operations both access the same memory, at least one is a write, and neither happens-before the other.

Synchronizes-with is the cross-thread connector. The standard case: a release store on an atomic synchronizes-with an acquire load on the same atomic if the load reads the value the store wrote (or any later value in the modification order). The synchronizes-with edge is what turns sequenced-before edges on either thread into happens-before edges that cross the threads.

Practical translation: if your acquire load returns the value your release store wrote, then every memory operation sequenced before the release store on the writer is happens-before every memory operation sequenced after the acquire load on the reader. That's the whole game. Acquire/release is just the C++ vocabulary for "publish this data and subscribe to it safely."

07Acquire and release: publish-subscribe in two lines

The single most important pattern in lock-free programming is publishing. A producer thread builds an object somewhere in memory, then sets an atomic flag to advertise that the object is ready. A consumer thread polls the flag, and when it observes "ready," it can safely use the object. The producer's release store on the flag synchronizes-with the consumer's acquire load on the flag, and every write the producer did before the release is now visible to the consumer after the acquire.

The pattern is so common it has a name in the formal literature: message passing (MP), the second litmus test from §4. Acquire/release is what makes it work without an explicit lock:

message_passing.cpp · the publication pattern

// Producer thread:
int sharedData[1024];                       // non-atomic payload
std::atomic<bool> payloadReady{false};

void producer() {
  for (int i = 0; i < 1024; ++i)
    sharedData[i] = computeValue(i);              // non-atomic writes
  payloadReady.store(true, std::memory_order_release);  // the synchronizing store
}

// Consumer thread:
void consumer() {
  while (!payloadReady.load(std::memory_order_acquire))  // the synchronizing load
    std::this_thread::yield();
  // After the acquire returned true, every write the producer did
  // before the release is happens-before this point. Safe to read.
  int sum = 0;
  for (int i = 0; i < 1024; ++i)
    sum += sharedData[i];                         // non-atomic reads, no race
}

The non-obvious bits. sharedData is a plain non-atomic array, and there is no data race on it because the acquire-release pair establishes a happens-before edge from every preceding producer write to every subsequent consumer read; the C++ standard treats non-atomic accesses as race-free when a synchronizing pair brackets them. If either the producer's store or the consumer's load drops to relaxed, the synchronizes-with edge breaks and the writes to sharedData are no longer guaranteed to be visible. On x86 the assembly is identical between release/acquire and relaxed because TSO gives the ordering for free; on ARMv8 the difference is LDAR / STLR instead of LDR / STR.

The widget below is the same pattern as a live animation. The producer builds a payload while the consumer polls. Toggle the orderings on the flag and watch the consumer read garbage when the orderings are relaxed:

trials

···

payload OK

···

torn read

···

Release-acquire vs release-consume

C++11 also defined memory_order_consume, a strictly weaker form of acquire intended for pointer dereferences. The idea: if you only need ordering on memory the acquired pointer points to, you can use the data dependency through the pointer instead of a full acquire fence. On ARM and POWER, that saves a cycle per dereference. In practice, no major compiler implements consume correctly; they all promote it to acquire. The C++17 standard discouraged its use; consume is still in the spec but treated as acquire by every shipping compiler^[16]. Use acquire.

08Sequentially consistent: the strong default

The default for std::atomic operations is seq_cst. It's also the slowest on ARM. The promise: every seq_cst operation, across every thread, can be placed in a single total order that all threads agree on. That extra agreement is what forbids IRIW. Acquire/release on its own permits two readers to see two writers' stores in different orders; seq_cst on those operations does not.

The cost on weakly-ordered hardware is real, and the exact lowering has shifted over the past decade. Pre-P0668, some compilers emitted DMB ISH; STR; DMB ISH around seq_cst stores on ARMv8, conservatively pairing the store with full data-memory barriers on both sides. Post-P0668^[11], modern GCC and Clang emit just STLR for the store and LDAR for the load: an LDAR-STLR pair is sufficient to recover seq_cst under the multi-copy-atomic ARMv8 model^[35]. On x86-TSO the same store compiles to either MOV + MFENCE (GCC) or XCHG (Clang); the load is a plain MOV. With the post-P0668 lowering, the marginal cost of seq_cst over release/acquire on individual loads and stores is often zero on ARMv8: both compile to LDAR/STLR. The cost shows up on RMW-heavy paths, where some compilers and codegen targets still bracket seq_cst RMWs with extra DMB ISH fences, and on x86 where seq_cst stores still need MFENCE while release stores are plain MOVs.

The rule of thumb that survives contact with profilers:

Use seq_cst by default while you're learning. It's the easiest to reason about. The cost on the algorithms you're learning with is typically irrelevant.
Use acquire/release in production code where the pattern is publication. The compiler emits identical code on x86 and saves a fence per operation on ARM.
Use relaxed for counters and statistics. Hit counts, frame numbers, profiler markers. No ordering needed; you just want the atomicity and modification-order consistency.
Never use consume. It doesn't work as specified.

The expensive case: a counter incremented by many threads where one thread occasionally reads it. The reads don't need to be seq_cst; the counter doesn't need to be seq_cst either. A relaxed RMW is correct and cheap. Spend the ordering only on the atomic that synchronizes with another thread's reading of something else.

09The atomic reference counter

The reference counter is the smallest non-trivial lock-free primitive and the one most asked about in engine interviews. The job: an object can be held by several owners on several threads, an owner can drop its hold, and the last drop runs the destructor. Threads must agree on who is last. The bug-free version has two interesting orderings, one per direction:

ref_counted.cpp · the canonical version

class RefCounted {
  mutable std::atomic<uint32_t> refCount{0};

public:
  void addRef() const noexcept {
    // Relaxed is correct here. We already hold a reference (the caller has a
    // pointer to this object), so the object can't be destroyed underneath us.
    // No ordering with other memory is needed; just bump the counter.
    refCount.fetch_add(1, std::memory_order_relaxed);
  }

  void release() const noexcept {
    // acq_rel on the decrement gives us two things:
    //   - release: every memory access before this release is visible to
    //     the thread that observes refCount == 0 after its own decrement.
    //   - acquire: when our decrement gives refCount == 0, we see every
    //     memory access that other threads did before their releases.
    // Without the acquire side, the destructor could read stale fields.
    if (refCount.fetch_sub(1, std::memory_order_acq_rel) == 1)
      delete this;
  }
};

The classic alternative is to use memory_order_release on the decrement and a separate std::atomic_thread_fence(memory_order_acquire) on the path that triggers the delete. This shaves the fence cost on every release that doesn't hit zero, which is most of them. Boost's intrusive_ptr uses this trick^[17]:

ref_counted_optimized.cpp · the Boost variant

void release() const noexcept {
  // Release is enough on the common path: publish all our writes
  // so the eventual deleting thread can see them.
  if (refCount.fetch_sub(1, std::memory_order_release) == 1) {
    // We are the deleter. Now we need an acquire fence to pair with
    // every other thread's release decrement.
    std::atomic_thread_fence(std::memory_order_acquire);
    delete this;
  }
}

Three things engine interviewers like asking about this:

Why relaxed on addRef? Because the caller has a live reference; the object cannot be destroyed until after this call, so no ordering against other memory is required. Increments don't synchronize-with anything.
What if I used relaxed on release? Bug. The destructor on the last-decrement thread can read stale fields the other threads wrote. The release-acquire pair is what makes those writes visible.
What if I used seq_cst everywhere? Correct, but pays an extra fence per decrement on ARM. The default for shared_ptr in many implementations is exactly this conservative choice. Most refcount-heavy paths show up as the dominant cost in profiles because of it.

std::shared_ptr in libstdc++ implements roughly the optimized variant^[18]. The weak count is a separate atomic with its own ordering. make_shared fuses the object and the control block into one allocation, which is faster to allocate but ties the object's storage lifetime to the weak count: the storage isn't freed until both strong and weak counts hit zero. That's a real consideration when objects are large and weak pointers can be long-lived.

10False sharing and the cache line

Cache coherence works at the granularity of a cache line, not an individual address. On x86-64 and most ARM, that's 64 bytes^[19]. On Apple M-series and some POWER CPUs, it's 128 bytes. When two atomics live on the same cache line, every write to one invalidates the other in every other core's cache, even though logically the threads aren't sharing the data. The cache line ping-pongs between cores, each write paying a 50-to-200-cycle coherence round trip^[14]. The data isn't actually shared; the cache line is. The bug has a name: false sharing.

The fix is to put each independently-accessed atomic on its own cache line, by padding the struct so the next field lands on the next line:

false_sharing.cpp · what NOT to do

// Both atomics land on the same 64-byte cache line. Every increment by
// one thread invalidates the line in the other thread's cache.
struct Counters {
  std::atomic<uint64_t> producerCount;  // thread A writes
  std::atomic<uint64_t> consumerCount;  // thread B writes
};

false_sharing_fixed.cpp · with alignas

// Each atomic is forced onto its own 64-byte cache line. No false sharing.
// C++17 also has std::hardware_destructive_interference_size for portability.
struct Counters {
  alignas(64) std::atomic<uint64_t> producerCount;
  alignas(64) std::atomic<uint64_t> consumerCount;
};

// Or, if you don't control the struct layout, put the atomic in its own
// struct and pad to a full cache line:
struct alignas(64) PaddedAtomic {
  std::atomic<uint64_t> value;
  char pad[64 - sizeof(std::atomic<uint64_t>)];
};

The widget below runs the two-thread benchmark in your browser. The model approximates the cache-line ping-pong with a per-line lock that serializes writes from different threads. Toggle padding on and off and watch the throughput change:

threads 2

unpadded ops/ms

···

padded ops/ms

···

speedup

···

C++17 gave us a portable hint:

portable_alignment.cpp

// std::hardware_destructive_interference_size is the implementation's
// recommended alignment to avoid false sharing. On x86-64 it's 64.
// On Apple M-series toolchains it's typically 128.
struct Counters {
  alignas(std::hardware_destructive_interference_size)
    std::atomic<uint64_t> producerCount;
  alignas(std::hardware_destructive_interference_size)
    std::atomic<uint64_t> consumerCount;
};

Implementation note: libstdc++ defaults this to 64 even on aarch64, which produces false sharing on Apple Silicon if you ship a binary compiled for generic ARM. Some teams hardcode alignas(128) for portability with Apple targets; others use a build-time switch. Lemire ran a measured benchmark in 2023^[19] showing the Apple M-series 128-byte effective line; he is the citation to point at if a teammate asks why you padded to 128 when "everyone knows the line is 64."

11SPSC: a lock-free queue with no CAS

The single-producer/single-consumer (SPSC) ring buffer is the simplest useful lock-free data structure and the one that hits the most use cases per line of code. Used for: render-thread to RHI thread command buffers, gameplay-thread to audio-thread events, profiler scopes, telemetry. The producer pushes; the consumer pops; both run on different threads; neither blocks the other; there's no compare-and-swap involved because each side owns its own pointer.

spsc_ring.cpp · the producer owns writeIndex, the consumer owns readIndex

template <typename T, size_t Capacity>
class SpscRing {
  static_assert((Capacity & (Capacity - 1)) == 0,
    "Capacity must be a power of two so the mask is fast");

  // One cache line per index. False sharing between producer and consumer
  // is the most common SPSC bug in production code.
  alignas(64) std::atomic<size_t> writeIndex{0};   // producer writes, consumer reads
  alignas(64) std::atomic<size_t> readIndex{0};    // consumer writes, producer reads
  alignas(64) T storage[Capacity];

public:
  // Returns false if the ring is full. Called only by the producer thread.
  bool push(const T& value) {
    // We own the write index, so relaxed read is fine; no other thread changes it.
    const size_t currentWrite = writeIndex.load(std::memory_order_relaxed);
    // Acquire on the read index pairs with the consumer's release on readIndex.
    // The synchronizes-with edge guarantees the consumer's previous slot read is
    // happens-before our upcoming slot write, even though the addresses differ.
    const size_t currentRead = readIndex.load(std::memory_order_acquire);
    if (currentWrite - currentRead == Capacity) return false;  // full

    storage[currentWrite & (Capacity - 1)] = value;   // non-atomic write to slot

    // Release on write index: every prior write (including the slot)
    // happens-before any consumer's matching acquire load.
    writeIndex.store(currentWrite + 1, std::memory_order_release);
    return true;
  }

  // Returns false if the ring is empty. Called only by the consumer thread.
  bool pop(T& out) {
    const size_t currentRead = readIndex.load(std::memory_order_relaxed);
    const size_t currentWrite = writeIndex.load(std::memory_order_acquire);
    if (currentWrite == currentRead) return false;       // empty

    out = storage[currentRead & (Capacity - 1)];        // non-atomic read of slot

    readIndex.store(currentRead + 1, std::memory_order_release);
    return true;
  }
};

The pattern is two release-acquire pairs, one per direction. The producer's release on writeIndex publishes the slot write; the consumer's acquire on writeIndex picks it up. The consumer's release on readIndex tells the producer the slot is free; the producer's acquire on readIndex picks it up. Each side reads the index it owns relaxed, since no other thread modifies it. Monotonically-increasing indices (instead of wrapping at capacity) make the empty/full check a simple subtraction; the mask happens at the slot lookup.

What's intentionally missing

This SPSC supports one producer and one consumer; using it with two producers on the same ring will lose pushes. The slot type T must be trivially copyable; non-trivial types need placement-new and explicit destruction, which adds complexity for the half-pushed/half-popped windows. There's no batched push, no backpressure beyond returning false, and no notification mechanism for a sleeping consumer (use a condition variable or std::atomic::wait in C++20^[21]). The Rust port of this is one of the entries in §17's playground.

producer rate 200 Hz

consumer rate 200 Hz

writes

···

reads

···

occupancy

···

12The ABA problem

Lock-free designs lean heavily on compare-and-swap (CAS): "if the atomic still holds the value I read, replace it with this new value." CAS is the workhorse of every multi-producer queue, every Treiber stack, every work-stealing deque. It has one classic failure mode that catches almost everyone the first time: ABA.

The pattern: thread A reads pointer P and sees value X. Thread A is preempted. Thread B pops X from the structure, frees it, allocates a new node, and the allocator happens to hand back the same address. Thread B pushes the new node, so P now holds X again, but it's a different object. Thread A wakes up, does a CAS expecting X, succeeds, and corrupts the structure because it's operating on the recycled address as if it were the original.

The litmus example: a Treiber stack^[22]. The simplest lock-free LIFO. Pop the top:

treiber_stack_aba.cpp · the broken version

struct Node {
  int value;
  Node* next;
};

std::atomic<Node*> topOfStack{nullptr};

Node* pop() {
  Node* oldTop = topOfStack.load(std::memory_order_acquire);
  while (oldTop) {
    // Read top->next BEFORE the CAS. This is where ABA bites.
    Node* newTop = oldTop->next;
    if (topOfStack.compare_exchange_weak(oldTop, newTop,
          std::memory_order_acq_rel,
          std::memory_order_acquire)) {
      return oldTop;
    }
    // CAS failed; oldTop now holds the current top. Try again.
  }
  return nullptr;
}

// The race:
// Thread A: reads top = X (a Node with X->next = Y). About to CAS X -> Y.
// Thread B: pop X (top is now Y). pop Y (top is now Z). push X (recycled!).
//           Now top = X, but X->next = Z (B set it when pushing).
// Thread A: CAS expects X, finds X, succeeds. Sets top = Y. But Y was freed!
// Result: top points at freed memory, or the stack has lost nodes Z, ...

Three families of fix:

Tagged pointers. Pack a counter into the unused high bits of the pointer. Current x86-64 (4-level paging) uses canonical 48-bit virtual addresses, leaving the top 16 bits for tags; 5-level paging (Ice Lake-SP and later) extends addresses to 57 bits, reducing the available tag bits to 7. AArch64 typically uses 48 or 52 bits and supports Top-Byte-Ignore (8 free top bits for user code), though PAC consumes some of those bits when enabled^[23]. The CAS compares pointer-plus-counter; each push increments the counter, so the recycled pointer has a new tag and the CAS fails.
Double-width CAS. Use std::atomic<TaggedPtr> where TaggedPtr is a 128-bit struct (pointer + 64-bit counter). Compiles to CMPXCHG16B on x86 and CASP on ARMv8.1. Slower than a normal CAS but simpler than tagged pointers if you don't have spare bits.
Memory reclamation. Don't free the popped node until you know no concurrent reader is looking at it. Hazard pointers (§14) and epoch-based reclamation are the production techniques.

step

CAS result

···

stack state

···

13compare_exchange_weak vs strong

Every C++ atomic has two CAS variants. The semantic difference: compare_exchange_strong returns false only when the values actually differ; compare_exchange_weak is allowed to return false spuriously, even when the values match. The reason is hardware:

x86. CAS is the CMPXCHG instruction (or LOCK CMPXCHG for the atomic variant). One instruction; never fails spuriously. Weak and strong compile to the same code.
ARMv7 and ARMv8. The natural CAS is a load-linked / store-conditional pair (LDREX/STREX on ARMv7, LDXR/STXR on ARMv8). The store-conditional fails not only when the value changed but also when a context switch happened, an interrupt fired, or the cache line was invalidated between the load and the store. The hardware can't tell the difference, so weak inherits the spurious-failure behavior. Strong wraps a retry loop around it.
ARMv8.1+ has CAS. CAS, CASA, CASL, CASAL instructions added in 2014. Single-instruction CAS, no spurious failures, no LL/SC retry loop. Modern compilers emit these when -march=armv8.1-a or newer is set; otherwise they fall back to the LDXR/STXR loop.

The rule:

Use compare_exchange_weak when you already have a retry loop. The classic case: a CAS-loop pop on a stack or queue. The loop body re-reads the expected value on each iteration, so a spurious failure is the same as a real conflict from the loop's perspective. You save the strong's internal retry.
Use compare_exchange_strong when there's no loop. One-shot CAS attempts (set-once-if-zero, race to install a sentinel) want strong. Otherwise you'd have to write the retry loop yourself, defeating the purpose.

A correct weak loop:

cas_weak_loop.cpp

// Atomically multiply a counter by 1.1 (a value the hardware can't do directly).
void multiplyBy11Percent(std::atomic<double>& counter) {
  double currentValue = counter.load(std::memory_order_relaxed);
  double nextValue;
  do {
    nextValue = currentValue * 1.1;
    // Weak: the loop will retry on spurious failure, which is fine.
    // On real conflict the loop also retries; same code path.
  } while (!counter.compare_exchange_weak(currentValue, nextValue,
            std::memory_order_relaxed));
}

Note the third argument is the ordering on success. The fourth argument (omitted here, defaults to the same) is the ordering on failure. Failure can be weaker than success; failure of a CAS by definition didn't write anything, so it doesn't need the release side.

14Hazard pointers: safe memory reclamation

Tagged pointers fix the ABA identity problem on a CAS. They don't solve the lifetime problem: if thread A is holding pointer P while thread B frees the object P points at, thread A's subsequent dereference is undefined behavior, no matter how clever your tag is. Some scheme has to keep P's target alive until thread A drops it.

Maged Michael's 2002 PODC paper^[24] introduced hazard pointers: a small per-thread array of pointer slots that name the objects this thread is currently using. Before a writer frees a popped node, it scans every thread's hazard array; if any slot names the about-to-be-freed address, the writer defers the free. The reader's job is to publish what it's looking at; the writer's job is to check before freeing.

hazard_pointers_sketch.cpp · the protocol

// One hazard slot per thread, two slots per thread is enough for most lock-free
// structures (Michael's paper shows one slot suffices for the Treiber stack).
thread_local std::atomic<void*> hazardSlot{nullptr};

// Reader side: load the pointer, publish it, re-validate.
template <typename T>
T* protectedLoad(std::atomic<T*>& source) {
  T* candidate;
  do {
    candidate = source.load(std::memory_order_acquire);
    // Publish what we're about to dereference. The seq_cst here is required:
    // every writer's scan of hazard slots must see this before its free.
    hazardSlot.store(candidate, std::memory_order_seq_cst);
    // Re-validate: if the source still points at candidate, no writer could
    // have freed it (because we got the hazard slot up first). Otherwise retry.
  } while (candidate != source.load(std::memory_order_acquire));
  return candidate;
}

// Writer side: deferred free. Push to a retire list; periodically scan
// every thread's hazard slot and free what nobody's looking at.
void retire(void* node) {
  retireList.push(node);
  if (retireList.size() >= 128) scanAndFree();
}

void scanAndFree() {
  // Snapshot all threads' hazard slots into a set.
  std::unordered_set<void*> protectedSet;
  for (auto& threadSlot : allThreadHazardSlots())
    if (void* p = threadSlot.load(std::memory_order_seq_cst))
      protectedSet.insert(p);

  // Free anything in retireList that isn't in protectedSet; keep the rest.
  for (auto it = retireList.begin(); it != retireList.end(); ) {
    if (!protectedSet.count(*it)) { delete *it; it = retireList.erase(it); }
    else                            { ++it; }
  }
}

The seq_cst on the hazard slot publication is real and necessary. Without it, a writer's hazard scan can race with a reader's hazard store: writer sees old slot, frees, then reader's store goes through to a now-freed memory address it doesn't dereference but does look at. The original Michael paper proved that a release on the store and an acquire on the scan are insufficient; the strict total order over scan-vs-publish requires seq_cst. C++26 adds std::hazard_pointer as a library-supplied implementation^[25]; until then, folly::hazptr^[26] is the production-quality reference.

Hazard pointers vs epoch reclamation

The other production technique is epoch-based reclamation (EBR): every thread enters a "global epoch" when it starts reading; writers retire nodes into a per-epoch bucket; nodes are freed only after every thread has advanced past the epoch they were retired in. Faster than hazard pointers on the read side (no per-load fence). Slower on the worst-case write side, and prone to leaks if any thread stalls. Linux's RCU is the EBR family. Hazard pointers are preferred when readers can be preempted; EBR is preferred when reads are fast and stalls are rare. Crossbeam in Rust gives you both^[27].

15The Chase-Lev work-stealing deque

The Chase-Lev deque is the backbone of every modern work-stealing scheduler^[28]. The shape: each worker thread owns a deque of jobs. The owner pushes and pops at the bottom (LIFO, cache-friendly: the next job a worker grabs is the one it just produced). Other workers steal at the top (FIFO, fairness: the longest-waiting job goes first). One owner, many thieves. No locks. The owner's bottom-only access is free of contention with thieves except on the rare empty-deque case; a thief uses CAS to claim its slot.

The 2005 Chase-Lev paper specified the algorithm against sequential consistency. On x86-TSO it works as written. On ARM, POWER, and other weak memory models, it has races. Lê, Pop, Cohen, and Zappa Nardelli's 2013 PPoPP paper^[1] identified the races and provided the corrected acquire/release annotations. Rust's crossbeam-deque and the work-stealing variants built on top of it (Rayon, Tokio's blocking pool) ship the Lê-corrected algorithm directly^[27]; engine schedulers on weak hardware that use a Chase-Lev shape have to use the same corrections, even when their published architecture talks don't name the paper. Not every engine uses a Chase-Lev deque, though: Naughty Dog's fiber-aware design uses per-priority queues, Unreal's TaskGraph went through several iterations, and many studios run on a single global MPMC queue à la enkiTS.

chase_lev_deque.cpp · the ARM-correct Lê variant

template <typename Job>
class ChaseLevDeque {
  // Owner-thread state: top is the thief side, bottom is the owner side.
  // Both monotonic; never wrap. Slots are addressed top mod capacity.
  alignas(64) std::atomic<int64_t> top{0};        // thieves load+CAS this
  alignas(64) std::atomic<int64_t> bottom{0};     // owner stores this
  alignas(64) std::atomic<CircularArray*> storage;  // resizable buffer

public:
  // Owner only. Push to the bottom.
  void push(Job job) {
    // Relaxed: only this thread writes bottom; only this thread reads it here.
    int64_t currentBottom = bottom.load(std::memory_order_relaxed);
    int64_t currentTop    = top.load(std::memory_order_acquire);
    CircularArray* buffer = storage.load(std::memory_order_relaxed);
    if (currentBottom - currentTop > buffer->capacity() - 1) {
      buffer = growBuffer(buffer, currentBottom, currentTop);
      storage.store(buffer, std::memory_order_release);
    }
    buffer->put(currentBottom, job);                          // non-atomic write to slot
    // Release: publishes the slot write to any thief about to acquire-load bottom.
    // Lê et al. found the original Chase-Lev used a relaxed store here, which is
    // a race on ARM. The release fixes it.
    bottom.store(currentBottom + 1, std::memory_order_release);
  }

  // Owner only. Pop from the bottom (LIFO for cache locality).
  bool pop(Job& out) {
    int64_t currentBottom = bottom.load(std::memory_order_relaxed) - 1;
    CircularArray* buffer = storage.load(std::memory_order_relaxed);
    // Tentatively claim the slot by decrementing bottom. Release: ordering for
    // thieves that may try to steal here.
    bottom.store(currentBottom, std::memory_order_release);
    // seq_cst load on top, paired with the seq_cst CAS in steal(), is the
    // fix Lê et al. identified. The 2005 paper had a relaxed load that races
    // with thieves on ARM.
    int64_t currentTop = top.load(std::memory_order_seq_cst);
    if (currentTop <= currentBottom) {
      out = buffer->get(currentBottom);                       // non-atomic slot read
      if (currentTop == currentBottom) {
        // Last item, racing with potential stealer. CAS to settle.
        bool won = top.compare_exchange_strong(currentTop, currentTop + 1,
                       std::memory_order_seq_cst,
                       std::memory_order_relaxed);
        bottom.store(currentBottom + 1, std::memory_order_relaxed);
        return won;
      }
      return true;
    }
    // Empty. Restore bottom to its pre-pop value.
    bottom.store(currentBottom + 1, std::memory_order_relaxed);
    return false;
  }

  // Thieves only. Steal from the top (FIFO).
  bool steal(Job& out) {
    // seq_cst on both ends of the top/bottom pair guarantees a total order
    // between concurrent pop() and steal() attempts.
    int64_t currentTop    = top.load(std::memory_order_seq_cst);
    int64_t currentBottom = bottom.load(std::memory_order_acquire);
    if (currentTop >= currentBottom) return false;             // empty
    CircularArray* buffer = storage.load(std::memory_order_consume);
    Job claimed = buffer->get(currentTop);                       // non-atomic slot read
    // CAS to claim the slot. On success, this slot is ours.
    if (top.compare_exchange_strong(currentTop, currentTop + 1,
          std::memory_order_seq_cst,
          std::memory_order_relaxed)) {
      out = claimed;
      return true;
    }
    return false;                                                // CAS lost, another thief got it
  }
};

What the Lê paper changed: the original used memory_order_relaxed on the bottom-store in push and on the top-load in pop. That is correct on x86-TSO but admits an ARM execution where a thief observes the new bottom before the slot write. Releasing the bottom-store on push and seq_cst-ing the top-load in pop forbids that execution. The CAS on the empty case in pop against a concurrent steal needs seq_cst on both sides so the two atomics participate in the same total order. The deque is a clean example of how orderings on different atomics interact and why seq_cst on the right pair is sometimes the only correct option.

Two implementation notes the original paper preserves and this listing follows: the storage.load(memory_order_consume) in steal() is faithful to Lê et al., who used consume to express "we only need ordering on the buffer pointer's pointee, not on unrelated memory." Every shipping compiler promotes consume to acquire, so the practical lowering is identical, and the pitfalls section in §20 stands. The growBuffer step is sketched as one atomic store; in production you have to retain old buffers until no thief is still mid-steal in them, which is its own deferred-reclamation problem (hazard pointers or epoch, again).

A widget for this lives in §17

The Chase-Lev deque is too tangled to animate cleanly in a small canvas: four threads, two atomics, an array. The playground in §17 has a deque visualizer where you can run the owner and two thieves and watch the operations interleave under each memory model.

16The seqlock: optimistic reads of compound state

Some data is too big for a single atomic. A 4×4 transform matrix, a vector of bones, a UI animation state with a dozen floats. The Linux kernel has the same problem with the jiffies clock and uses a seqlock^[29]: a sequence number that flips between even (consistent) and odd (writer-in-progress). Readers don't lock. They snapshot the sequence, copy the data, snapshot the sequence again. If the two sequences match and are even, the read was consistent; otherwise retry.

seqlock.cpp · optimistic reads, no writer blocking

template <typename T>
class Seqlock {
  alignas(64) std::atomic<uint64_t> sequence{0};
  T payload{};

public:
  // One writer. Sequence: even -> odd (writing) -> even+2 (done).
  void write(const T& value) {
    uint64_t beforeWrite = sequence.load(std::memory_order_relaxed);
    // Flip to odd. Release prevents the compiler/CPU from sinking the payload
    // write below this point on the same thread.
    sequence.store(beforeWrite + 1, std::memory_order_release);
    payload = value;                                                       // non-atomic write
    // Flip back to even+2. Release publishes the payload to any reader's
    // subsequent acquire-load of the sequence counter.
    sequence.store(beforeWrite + 2, std::memory_order_release);
  }

  // Many readers. Returns the value if read was consistent, retries otherwise.
  T read() {
    T snapshot;
    uint64_t before, after;
    do {
      before = sequence.load(std::memory_order_acquire);
      if (before & 1) continue;                                       // writer mid-flight
      snapshot = payload;                                                  // non-atomic read
      // Acquire fence: prevents the payload read above from being reordered
      // after the second sequence load. This is the seqlock's load-load barrier.
      std::atomic_thread_fence(std::memory_order_acquire);
      after = sequence.load(std::memory_order_relaxed);
    } while (before != after);                                            // torn read; retry
    return snapshot;
  }
};

Why this works: the writer flips to odd, performs its non-atomic writes, flips to even+2. A reader that catches an odd sequence retries. A reader that catches an even sequence snapshots the payload, then re-reads the sequence; if the writer started mid-copy, the second read shows a different value and the reader retries. Memory ordering: the release on the second sequence store publishes the payload writes; the acquire on the reader's first sequence load picks them up; the second sequence load detects any concurrent writer.

There is one subtle pitfall on weak hardware: payload = value and snapshot = payload can technically race when reader and writer overlap, which the C++ standard treats as undefined behavior even though the seqlock protocol guarantees the torn read is discarded. The kernel and Folly seqlock implementations accept this gap, which is well understood and works on every shipping compiler in practice. The portable C++ fixes are: copy the payload through std::memcpy (which the standard treats more permissively than struct assignment), or wrap each field in std::atomic_ref (C++20) and read with memory_order_relaxed. Boehm's "Can Seqlocks Get Along With Programming Language Memory Models?" and Lewiss-Brown's CppCon 2022 talk both walk through the trade-offs^[30].

Use a seqlock when reads vastly outnumber writes, the payload is too big for a single atomic, and stale reads are fine. Examples: a global config snapshot, a "current frame stats" struct, a recent-input-snapshot for animation IK. Don't use it for a queue or stack; the retry loop assumes the reader can re-attempt cheaply.

17Try it yourself: the memory-ordering playground

The playground below runs a producer-consumer pair against a model of x86-TSO and ARM weak ordering. You set the ordering on each atomic operation, hit run, and the playground simulates many interleavings and counts how often the invariant breaks. Two preset programs are available: the message-passing pattern from §7, and the Dekker store-buffer pattern from §1.

⌬ Ordering Playground

Ready. Click "Run 10,000 trials" to begin.

The simulator is approximate. It models x86-TSO with a per-thread store buffer, ARMv8-style weak ordering with per-atomic local visibility delays, and SC as a single global interleaving. The "ARM weak" mode here includes IRIW as a permitted outcome to make the litmus test demonstrable; modern ARMv8 (post-2018 multicopy-atomic revision) actually forbids that specific shape, while POWER still permits it. Real hardware has more constraints than this model encodes (per-address coherence is exact) and more flexibility on POWER specifically. The educational point is which orderings let an invariant fail, not the exact failure rate.

18How Unreal does it

Unreal Engine exposes three layers of atomic primitives, each with its own ordering defaults^[31]:

FThreadSafeCounter and FThreadSafeCounter64. A wrapper over std::atomic with the operations Unreal historically needed (Increment, Decrement, Add, Subtract, Reset). All operations are seq_cst internally. Used in engine code where the conservative ordering doesn't matter on profiles.
TAtomic<T>. Unreal's typed atomic wrapper; closer to std::atomic. Lets you specify EMemoryOrder per operation. Used in the hot paths the engine team cares about.
UE::Tasks and the underlying TaskGraph. The scheduler from the Jobs tutorial. Internally uses a work-stealing deque structurally equivalent to the Chase-Lev variant in §15, plus a separate "named thread" path for the render and RHI threads.

The pattern most engine teams converge on, after their first major ARM-platform port:

Replace the default-seq_cst increments with relaxed where possible. FThreadSafeCounter's fences are dominant in long uptime profiles on Switch and PS5 builds.
Audit every CAS loop for compare_exchange_weak use. Strong is correct but pays a retry on every spurious failure on ARM v7 / pre-LSE ARM v8.
Pad the producer and consumer atomics of every SPSC ring (render-to-RHI, audio-engine event ring, profiler ring) onto separate cache lines. The first frame-profiler I/O ring an engine team ships is almost always unpadded; the second one isn't.
Use UE::FPlatformAtomics for the platform-specific paths that need the exact instruction (e.g. InterlockedCompareExchange128 for ABA fixes via tagged 128-bit pointers).

19How the same atomic lowers on x86 and ARM

The clearest way to internalize what each ordering costs is to read the assembly the compiler emits on each architecture. Six operations, two architectures, GCC 14 at -O2. The instruction counts are real; you can reproduce them at godbolt.org:

Operation	x86-64 (SSE/AVX)	aarch64 (ARMv8.0)	aarch64 (ARMv8.1+ with LSE)
`load(relaxed)`	`MOV` rax, [counter]	`LDR` x0, [counter]	`LDR` x0, [counter]
`load(acquire)`	`MOV` rax, [counter]	`LDAR` x0, [counter]	`LDAR` x0, [counter]
`load(seq_cst)`	`MOV` rax, [counter]	`LDAR` x0, [counter]	`LDAR` x0, [counter]
`store(relaxed)`	`MOV` [counter], rax	`STR` [counter], x0	`STR` [counter], x0
`store(release)`	`MOV` [counter], rax	`STLR` [counter], x0	`STLR` [counter], x0
`store(seq_cst)`	`MOV` + `MFENCE` (GCC) or `XCHG` (Clang)	`STLR` (modern GCC/Clang); older toolchains: `DMB ISH; STR; DMB ISH`	`STLR` (LSE doesn't change seq_cst load/store, only RMWs)
`fetch_add(relaxed)`	`LOCK XADD` [counter], 1	`LDXR; ADD; STXR` loop	`LDADD` (single instr)
`fetch_add(seq_cst)`	`LOCK XADD` [counter], 1	`LDAXR; ADD; STLXR` loop	`LDADDAL` (single instr)
`compare_exchange_strong(seq_cst)`	`LOCK CMPXCHG` [counter], rax	`LDAXR; CMP; B.NE done; STLXR; CBNZ retry`	`CASAL` (single instr)

LSE = Large System Extensions, part of ARMv8.1-A (2014). Adds LDADD, CAS, SWP, and family as single instructions. Apple M1+, Cortex-A75+, Neoverse-N1+ support LSE. Older cores (Cortex-A57 in Nintendo Switch, Cortex-A53 in early ARM phones) do not, and pay the LDXR/STXR loop. -march=armv8.1-a+lse or -mcpu=apple-a14 enables the LSE codegen; without it, GCC and Clang fall back to LL/SC even on hardware that supports LSE^[32].

What the table is telling you:

On x86, ordering is mostly free. The atomic primitives are MOV + LOCK CMPXCHG. Only seq_cst stores pay an MFENCE. The difference between relaxed and acquire/release on x86 atomics is zero instructions; on a profile they look identical.
On ARM, ordering buys you fences. Relaxed is plain LDR/STR; acquire/release switch to LDAR/STLR; seq_cst stores get full DMB barriers. Each step costs cycles. The difference between picking the right ordering and overspecifying is measurable on ARM in a way it isn't on x86.
On pre-LSE ARM, every RMW is a CAS loop. fetch_add compiles to a load-link/store-conditional loop that can fail under cache contention. Under heavy contention, the loop dominates. ARMv8.1's LDADD-family single-instruction RMWs are the reason modern phones and consoles got much cheaper atomic counters.

20Pitfalls

A field-guide list, drawn from real engine post-mortems and from the patterns this tutorial spent fifteen sections building up:

volatile is not an atomic. volatile tells the compiler not to cache the value in a register. It says nothing about the CPU's store buffer, cache coherence, or cross-thread ordering. volatile int is not thread-safe; std::atomic<int> is. The exception: device-mapped I/O registers, where volatile is required (and not for the threading reason).
memory_order_consume doesn't work. The spec is fine; no compiler implements it correctly. Every compiler that supports it (GCC, Clang, MSVC) promotes it to acquire internally. P0735 and P1726 in WG21 are trying to fix the spec; until then, use acquire^[16].
Standalone fences alone do nothing. An std::atomic_thread_fence only matters when paired with an atomic operation of the right ordering on the other thread. Two fences without atomic operations between them are a no-op in the C++ abstract machine.
seq_cst is not a magic safety blanket. Putting seq_cst on every operation is correct, but it pays an ARM fence cost per operation, and it doesn't fix a wrong data structure. If your code has a race that seq_cst fixes and acquire/release doesn't, you have an IRIW pattern, which is rare.
RMW operations have two orderings. fetch_add, compare_exchange, exchange are read-modify-write; they're both a load and a store. Pick the ordering that covers what the operation has to synchronize with. acq_rel on a CAS is the common case; release-only on a CAS that just publishes (no acquire needed) is the special case.
Two-way CAS ordering arguments. compare_exchange's third argument is the success ordering; the optional fourth is the failure ordering. Failure can be weaker than success (failed CAS didn't store anything). A common mistake is to copy-paste acq_rel to both, which forces an acquire fence on the no-op failure path.
Cache-line padding is platform-specific. 64 bytes on x86-64 and most ARM. 128 on Apple M-series and POWER. std::hardware_destructive_interference_size in libstdc++ defaults to 64 even on aarch64, which produces false sharing on Apple Silicon. Either hard-code 128 for portability or query the platform manual.
The compiler can still reorder non-atomic accesses around atomic ones. memory_order_relaxed only enforces atomicity of the one access, not ordering against unrelated writes. The compiler is free to reorder non-atomic stores past a relaxed atomic store. If you want the order, you need acquire/release.
The C++ memory model only describes well-formed programs. A program with a data race (two non-synchronized accesses to a non-atomic where at least one is a write) is undefined behavior, full stop. No "actual hardware shows the value 5" reasoning rescues it; UB means the compiler can do anything, and modern compilers do.

21What's next

Where to go from here:

Pair with the Job Systems tutorial. Now that the orderings make sense, re-read the Chase-Lev deque implementation in §10 of that tutorial. Then read Lê et al. 2013^[1] end-to-end. Then read Naughty Dog's GDC talk on parallelizing their engine with fibers.
Read the formal models. Sewell's x86-TSO^[2], Maranget's ARM/POWER tutorial^[9], and Alglave et al.'s "Herding Cats"^[33] are the trio of papers that took weak memory from "you have to test on the actual hardware" to "you can run a model checker and prove your invariants."
Run a litmus checker yourself. herd7 from the Diy7 tool suite^[34] takes a litmus test in plain text and reports which outcomes each memory model permits. litmus7 compiles the same test for the hardware and runs it. The combination is how the formal-model papers were checked; you can run them on a 16-core development machine in seconds.

Then build something. Write a SPSC. Port the Chase-Lev. Find a bug. Read the disassembly. The orderings are a contract, and the only way to internalize a contract is to enforce it.

22Sources & further reading

Numbered citations refer to the superscripts above. Everything below is either freely available on the open web or linked from a published paper's PDF.

A note on originality

The prose, code samples, CSS, and interactive widgets on this page are original writing. The atomic ref counter follows the canonical Boehm pattern. The Chase-Lev deque implementation in §15 follows Lê, Pop, Cohen, Zappa Nardelli (2013) [1] with the original Chase & Lev (2005) [28] attributed at the point of use. The x86-TSO description and litmus-test framing tracks Sewell et al. (2010) [2]. The ARM/POWER weak-model description and the litmus-test labeling track Maranget, Sarkar, Sewell (2012) [9]. The seqlock implementation follows the canonical Linux kernel pattern with the C++ memory ordering analysis from Lewiss-Brown's CppCon 2022 talk [30]. The hazard-pointer protocol follows Maged Michael (2002) [24].

Lê, N. M., Pop, A., Cohen, A., & Zappa Nardelli, F. (2013). Correct and Efficient Work-Stealing for Weak Memory Models. PPoPP. PDF. The paper that corrected the 2005 Chase-Lev deque for ARM and POWER.
Sewell, P., Sarkar, S., Owens, S., Zappa Nardelli, F., & Myreen, M. O. (2010). x86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors. Communications of the ACM. PDF. The formal model of x86 atomics; the canonical citation for "x86 has a store buffer."
ISO/IEC. (2020). ISO/IEC 14882:2020 — Programming languages — C++. [intro.races] §6.9.2. The normative text on the C++ memory model: sequenced-before, synchronizes-with, happens-before. Free working draft: eel.is/c++draft.
Lamport, L. (1979). How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs. IEEE Transactions on Computers C-28(9). PDF. The definition of sequential consistency. The starting point for everything since.
SPARC International. (1992). The SPARC Architecture Manual, Version 8. §7 (Memory). PDF. The original Total Store Order specification. The model x86-TSO mirrors.
Manson, J., Pugh, W., & Adve, S. V. (2005). The Java Memory Model. POPL. PDF. JSR-133. The first widely-deployed language memory model based on happens-before.
Boehm, H.-J. (2005). Threads Cannot Be Implemented As a Library. PLDI. PDF. The paper that ended the pthreads-on-top-of-C era.
Boehm, H.-J., & Adve, S. V. (2008). Foundations of the C++ Concurrency Memory Model. PLDI. PDF. The basis of C++11 atomics and orderings.
Maranget, L., Sarkar, S., & Sewell, P. (2012). A Tutorial Introduction to the ARM and POWER Relaxed Memory Models. PDF. The plain-language reference for weak memory, written against the pre-2018 ARMv7/v8 model.
ARM Limited. Arm Architecture Reference Manual for A-profile architecture. §B2 (The AArch64 application level memory model). developer.arm.com. The normative source for LDAR, STLR, DMB ISH, and the ARMv8 model.
Boehm, H.-J., & Giroux, O. (2018). P0668R5 — Revising the C++ memory model. WG21. open-std.org. The paper that tightened seq_cst on ARM and POWER.
Dominiak, M., Baker, L., Howes, L., Shoop, K., Garland, M., Niebler, E., & Lelbach, B. (2024). P2300R10 — std::execution. WG21. Adopted into C++26. wg21.link/P2300.
Frumusanu, A. (2021). Apple Announces the M1 Pro & M1 Max SoCs: Massive Performance Uplifts. AnandTech. anandtech.com. Memory-system width and store-buffer depth measurements for M1 Max.
McKenney, P. E. (2024). Is Parallel Programming Hard, And, If So, What Can You Do About It? (Edition 2024.10.29a) kernel.org/perfbook. The canonical practitioner book on memory ordering, RCU, and the Linux kernel's approach. Free PDF.
Preshing, J. (2012). Acquire and Release Semantics. preshing.com. The clearest practitioner explanation of acquire/release.
McKenney, P. E., Maranget, L., Sutter, H., & others. (2017). P0750R1 — Consume. WG21. wg21.link/p0750r1. The history of why memory_order_consume is broken and the proposals to fix it.
Boost. boost::intrusive_ptr. boost.org. The release-decrement-plus-acquire-fence pattern. Used widely in game engines for hot-path ref counting.
GNU libstdc++. shared_ptr implementation. gcc.gnu.org. Implementation notes on the libstdc++ shared_ptr control block, fence selection, and make_shared layout.
Lemire, D. (2023). Measuring the size of the cache line empirically. lemire.me. 64 B on x86-64, 128 B on Apple M-series.
alic.dev (2023). The False Sharing Penalty. alic.dev. A measured benchmark showing ~2-4× slowdown from false sharing on x86, scaling with thread count.
Giroux, O. (2019). P0514R4 — Efficient concurrent waiting for C++20. WG21. PDF. The proposal that added std::atomic::wait/notify.
Treiber, R. K. (1986). Systems Programming: Coping with Parallelism. IBM Research Report RJ 5118. The original lock-free stack. The Treiber stack is the canonical lock-free LIFO and the simplest non-trivial CAS-based data structure.
Wikipedia. ABA problem. en.wikipedia.org. Background, with examples, plus the tagged-pointer family of solutions.
Michael, M. M. (2002). Safe Memory Reclamation for Dynamic Lock-Free Objects Using Atomic Reads and Writes. PODC. Tom Hart's MS thesis, citing the original PODC paper (the PODC paper itself is paywalled; the technique is also documented in Michael 2004 IEEE TPDS). The introduction of hazard pointers.
Wong, M., Liu, D., Michael, M., & others. (2024). P2530 — Hazard Pointers for C++26. WG21. PDF. The standard-library hazard-pointer proposal.
Meta. folly::hazptr. GitHub. github.com/facebook/folly. Production hazard-pointer implementation maintained by Maged Michael.
crossbeam-rs. Crossbeam: tools for concurrent programming in Rust. GitHub. github.com/crossbeam-rs/crossbeam. The reference Rust implementation of epoch-based reclamation (crossbeam-epoch) and concurrent data structures.
Chase, D., & Lev, Y. (2005). Dynamic Circular Work-Stealing Deque. SPAA. PDF. The original deque algorithm; correct on SC, racy on ARM/POWER (see [1]).
Linux kernel. seqlock.h. elixir.bootlin.com. The reference implementation. Used for jiffies, vdso, and the read-mostly read-side of dozens of subsystems.
Lewiss-Brown, H. (2022). Atomics in C++ — From "Hello World" to Lock-Free Data Structures. CppCon. YouTube. The seqlock and SPSC discussions are the practitioner-grade walkthrough of memcpy semantics and torn-read protection.
Epic Games. Atomic Operations in Unreal Engine. dev.epicgames.com. FPlatformAtomics, the TAtomic wrapper, and the FThreadSafeCounter family.
ARM Limited. ARMv8.1-A Large System Extensions (LSE). developer.arm.com. The single-instruction LDADD/CAS/SWP family. Targets that need LSE codegen: -march=armv8.1-a+lse or higher.
Alglave, J., Maranget, L., & Tautschnig, M. (2014). Herding Cats: Modelling, Simulation, Testing, and Data Mining for Weak Memory. ACM TOPLAS. PDF. The mathematical foundation of the herd7 model checker; what powers the modern litmus-test ecosystem.
Alglave, J., & Maranget, L. The diy7 toolsuite (herd7 and litmus7). diy.inria.fr. The reference litmus-test runner.
Pulte, C., Flur, S., Deacon, W., French, J., Sarkar, S., & Sewell, P. (2018). Simplifying ARM Concurrency: Multicopy-atomic Axiomatic and Operational Models for ARMv8. POPL. PDF. The paper that retroactively defined ARMv8 as multicopy-atomic, closing IRIW on AArch64 while leaving the SB/MP/LB reorderings in place.
Cox, R. (2021). Hardware Memory Models. Three-part series, research.swtch.com. Part 1, Part 2 (Programming Language Memory Models), Part 3 (The Go Memory Model). The clearest tour-of-the-territory writing on the topic, from the author of the Go runtime.
Preshing, J. (2013). An Introduction to Lock-Free Programming. preshing.com. The introductory companion to the acquire/release post.
Vyukov, D. Bounded MPMC queue. 1024cores. 1024cores.net. The reference multi-producer / multi-consumer lock-free queue, with the sequence-number protocol that makes a bounded MPMC queue correct without garbage collection.
Williams, A. (2019). C++ Concurrency in Action (2nd ed.). Manning. The practitioner book on C++11/17/20 concurrency. Chapter 5 is the deepest treatment of the memory model in any C++ textbook.
Intel. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1. §8.2 "Memory Ordering." intel.com. The normative source for x86 memory ordering.
Sutter, H. (2005). The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software. Dr. Dobb's Journal. gotw.ca. The essay that named the problem. Now twenty years old; still right.
Michael, M. M., & Scott, M. L. (1996). Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms. PODC. PDF. The Michael-Scott queue: a canonical lock-free MPMC queue, foundational reading for the algorithm family that includes Java's ConcurrentLinkedQueue and Rust's crossbeam-deque.
Stewart, P. (2018). Threads and Locks: The Linux Kernel's Memory Model. Linux Foundation. kernel.org. The kernel's "LKMM" treatment of atomics and ordering, with the differences from the C11 model called out.