Build a Game Engine · Memory & Concurrency

Lock-free Queues & Memory Ordering

The job pool and the audio thread both want a queue that never blocks: the job pool so workers don't serialize on a mutex, the audio callback because it cannot take a lock at all. Lock-free queues deliver that, and they are also where concurrency bugs go to hide. We build the SPSC ring that needs no CAS, work through Vyukov's MPMC design, and confront the two traps (ABA, and weak memory) that make hand-rolled lock-free dangerous.

Time~55 min LevelSenior PrereqsThe Memory Model (atomics, acquire/release) and Job Systems (the thread pool that needs this). StackC++ & Rust

01Why lock-free

Two concrete cases pull an engine toward lock-free queues, and one honest reason to think twice.

The job pool. Every worker pulls tasks from a shared queue. Put a mutex on it and the workers serialize on that one lock exactly when you wanted them parallel (the scaling problem from the Job Systems tutorial).
The audio thread. The audio callback runs under a hard deadline. If it takes a mutex and a lower-priority thread holds it, priority inversion can stall the callback past its deadline and the sound glitches. The control thread hands parameters to the audio thread through a lock-free ring instead^[11].
The counterpoint. Lock-free is genuinely hard, easy to get subtly wrong, and hard to test. Often a mutex plus a ring buffer is fast enough. Reach for lock-free when you can prove the lock is the bottleneck, and prefer it for progress and tail-latency, not raw throughput.

02The memory model in one screen

The Memory Model tutorial builds this in full; here is the working subset. Two ideas, kept separate: atomicity (an operation happens all-or-nothing) and (which writes another thread is guaranteed to see, and when).

relaxed: atomic, but no ordering. Fine for a counter nobody synchronizes on.
acquire / release: a release store synchronizes-with an acquire load that reads its value, so everything the writer did before the release becomes visible to the reader after the acquire^[3]^[4]. This is the pair lock-free queues are built on.
seq_cst: a single global order on top of acquire/release. The safe default, the most expensive.

Acquire/release does not "make it atomic"

Atomicity and ordering are independent; even relaxed is atomic. And the acquire only synchronizes if it actually reads the value the release wrote. Reading a stale earlier value forms no happens-before edge.

03CAS and the progress taxonomy

(CAS) is the universal building block: atomically write a new value only if the location still holds the expected one. Loops of CAS are how multiple threads agree without a lock. But "lock-free" is a precise progress guarantee, and three levels get conflated.

The hierarchy (Herlihy & Shavit)

Obstruction-free: a thread makes progress if it runs alone (others paused). Allows livelock under contention.
: some thread always makes progress system-wide. No livelock; an individual thread can still starve.
Wait-free: every thread finishes in a bounded number of its own steps. No starvation, and usually expensive or impractical.

Source: Herlihy & Shavit, The Art of Multiprocessor Programming^[2].

Two misreadings of the word

"Lock-free means no waiting" is false: it promises the system progresses, not your thread, which can lose a CAS forever. The "lock" is not a mutex: it refers to the program locking up. A mutex algorithm isn't lock-free because, as Preshing puts it, once a thread takes the mutex "your worst enemy could simply never schedule that thread again" and everyone stalls^[1]. Scope every claim to the operation and the variant: Vyukov's MPMC (next) and moodycamel's queue are commonly called lock-free, yet neither is wait-free.

04SPSC: reach for this first

Single-producer, single-consumer is the queue to reach for first, and it needs no CAS at all. The reason: each index has exactly one writer. The producer owns the write index, the consumer owns the read index, so no two threads ever contend to modify the same atomic.

Correctness comes from one acquire/release pair per direction. The producer writes the slot, then does a release store of the write index to publish it; the consumer does an acquire load of the write index before reading the slot. Each side reads its own index with relaxed, because nothing else writes it. boost documents its spsc_queue as lock-free (wait-free for single-element push/pop), and moodycamel's reader-writer queue as "completely wait-free, no compare-and-swap loop"^[10].

SPSC bounded ring (single owner per index)

template <typename T, size_t Capacity>          // Capacity is a power of two
class SpscRing {
    alignas(64) std::atomic<size_t> writeIndex{0};   // producer writes, consumer reads
    alignas(64) std::atomic<size_t> readIndex{0};    // own cache line each: no false sharing
    alignas(64) T storage[Capacity];
public:
    bool push(const T& value) {                          // producer thread only
        size_t w = writeIndex.load(std::memory_order_relaxed);   // we own it
        size_t r = readIndex.load(std::memory_order_acquire);    // pairs with consumer release
        if (w - r == Capacity) return false;                       // full
        storage[w & (Capacity - 1)] = value;
        writeIndex.store(w + 1, std::memory_order_release);       // publish
        return true;
    }
    bool pop(T& out) {                                     // consumer thread only
        size_t r = readIndex.load(std::memory_order_relaxed);    // we own it
        size_t w = writeIndex.load(std::memory_order_acquire);   // pairs with producer release
        if (w == r) return false;                              // empty
        out = storage[r & (Capacity - 1)];
        readIndex.store(r + 1, std::memory_order_release);        // free the slot
        return true;
    }
};

use std::sync::atomic::{AtomicUsize, Ordering};

// CAPACITY is a power of two; one writer per index means no CAS.
impl<T: Copy, const CAPACITY: usize> SpscRing<T, CAPACITY> {
    fn push(&self, value: T) -> Result<(), T> {            // producer only
        let w = self.write_index.load(Ordering::Relaxed);     // we own it
        let r = self.read_index.load(Ordering::Acquire);      // pairs with consumer release
        if w - r == CAPACITY { return Err(value); }              // full
        unsafe { *self.storage[w & (CAPACITY - 1)].get() = value; }
        self.write_index.store(w + 1, Ordering::Release);     // publish
        Ok(())
    }
    fn pop(&self) -> Option<T> {                            // consumer only
        let r = self.read_index.load(Ordering::Relaxed);      // we own it
        let w = self.write_index.load(Ordering::Acquire);     // pairs with producer release
        if w == r { return None; }                            // empty
        let value = unsafe { *self.storage[r & (CAPACITY - 1)].get() };
        self.read_index.store(r + 1, Ordering::Release);      // free the slot
        Some(value)
    }
}

The widget runs it. The producer and consumer indices chase around the ring; no lock is ever held, and neither index is ever touched by two threads:

05Bounded MPMC: Vyukov

Many producers and many consumers can't each own an index, so you need CAS. Dmitry Vyukov's bounded MPMC queue is the design most engines reach for, and the Job Systems tutorial uses it^[5].

It's a fixed array of cells, each holding a payload plus an atomic sequence number, with two position counters. To enqueue: load the enqueue position, index the cell, and compare its sequence to the position. If they're equal the cell is free, so CAS the position forward and, on success, write the payload and release-store the cell's sequence to mark it ready. Dequeue mirrors this. The cell sequence is loaded with acquire and stored with release; the position counters use relaxed CAS.

It is not, strictly, lock-free, and the sequence number is a generation counter

Vyukov says so himself: it is "not lockfree in the official meaning, just implemented by atomic operations without mutexes"^[5]. A producer that claims a slot and then stalls blocks consumers waiting on that slot. And the per-cell sequence isn't a full/empty flag; it's a lap/generation counter, which is also what makes the cell immune to ABA (a stale producer sees the wrong sequence and fails its check) without a separate tag.

06The ABA problem

is the classic lock-free trap. A CAS reads value A; another thread changes the location A → B → A (often by freeing a node and the allocator handing the same address back); the original CAS then succeeds because the bits match, even though the "A" is now a different object or freed memory. A false-positive CAS.

Three families of fix, and the trap inside the first one:

Tagged / versioned pointers: pack a counter into spare pointer bits (or use double-width CAS) so A-then-A still differs by the tag.
Hazard pointers: readers publish what they're dereferencing; a writer won't free anything a hazard slot names.
Epoch-based reclamation: retire freed nodes into epoch buckets and free only after every thread has moved on. Rust's crossbeam-epoch ships this^[9].

Tags fix the comparison, not the memory

A versioned pointer stops the CAS from matching a recycled value, but if your thread still holds a raw pointer to a node another thread freed, dereferencing it is undefined behavior no matter what the tag says. Memory lifetime needs hazard pointers or epochs. The Memory Model tutorial walks the Treiber-stack race that produces ABA in detail.

Step through it. Thread A reads the top, pauses; thread B pops two nodes, frees them, and pushes a new node at the recycled address; A resumes and its CAS wrongly succeeds. Turn on tags to watch it fail and retry instead:

07x86 vs ARM: the latent bug

The most dangerous thing about lock-free code is that a wrong version often passes every test on your desktop. x86's memory model (TSO) is strong: a plain load is already an acquire and a plain store already a release, so relaxed where you needed acquire/release is accidentally correct^[7].

ARM and POWER are weak: those reorderings really happen. The Cambridge lowering table makes it concrete: an acquire/release on x86 compiles to a plain MOV, while on ARM it needs LDAR/STLR instructions^[6]. So a missing fence is invisible on your dev box and corrupts state on a Switch, a phone, or Apple Silicon. It's a bug, not a timing fluke.

Need a refresher on why memory operations get reordered?

Two layers reorder, and both must be controlled. The compiler reorders, fuses, and hoists loads and stores while optimizing; as far as it knows, single-threaded results are unchanged. Then the CPU reorders at runtime: a store sits in a store buffer before reaching cache, and a later load to a different address can complete first. Neither layer can tell that another thread is watching.

This is invisible in single-threaded code, which is why it stays hidden until two threads communicate through shared memory. An acquire/release pair (or a fence) is the instruction to both layers: do not move these operations across this point. On x86-TSO the hardware already gives that ordering on plain loads and stores, so the constraint is mostly a compiler barrier; on weakly-ordered ARM it lowers to real ordering instructions (LDAR/STLR).

The consoles are x86, which hides this

The PS5 and Xbox Series are AMD Zen 2, the same x86-TSO model as a PC. A race that only appears there is timing exposing a bug that exists on x86 too; only genuinely weak targets (ARM) change the model itself. Test lock-free code on ARM hardware, or you are not testing the ordering.

Run the message-passing test: writer sets data = 42 then sets a flag; reader waits on the flag then reads data. Toggle the ordering and the hardware model:

Wrong answers, and why: lock-free is system-wide progress (not no-waiting, not just mutex-absence); and the cross-platform corruption is a weak-memory ordering bug x86 hides, not a speed or memory-size issue.

08When not to go lock-free

The strongest practical advice in this tutorial: most of the time, don't hand-roll it. Lock-free correctness is subtle on weak hardware, hard to test because races hide, and easy to get wrong in reclamation.

The expert consensus is to use a proven implementation. The C++ Core Guidelines say not to write lock-free code unless you must, and to prefer a known library^[12]. In practice: in C++ reach for moodycamel::ConcurrentQueue (MPMC) or ReaderWriterQueue (SPSC, wait-free)^[10]; in Rust reach for crossbeam's ArrayQueue (bounded MPMC), SegQueue (unbounded), and crossbeam-deque (the Chase-Lev work-stealing deque, weak-memory-correct, used under Rayon and Tokio)^[9]^[14]. The code in this tutorial is for understanding; ship the library.

09Pitfalls

"Lock-free means no waiting"It means system-wide progress; a single thread can still starve. That's wait-free.

Works on x86, breaks on ARMA missing acquire/release; TSO hides it, weak memory reveals it. Test on ARM.

CAS on a recycled valueABA: the bits matched but the object changed. Tag the pointer and reclaim safely.

Tag but still crashTags fix the comparison, not lifetime; you still need hazard pointers or epochs.

Slow SPSC ringhead and tail share a cache line: false sharing. Pad each to its own line.

CAS where you didn't need itSPSC has one owner per index, so it needs none. Save CAS for true MPMC.

Hand-rolling itLock-free is where correctness goes to die; ship moodycamel/crossbeam unless you can prove the need.

10What's next

That fills the last gap in the concurrency foundation, alongside the Memory Model, Allocators, and Job Systems. The lock-free ring here is exactly what feeds the audio thread later. With the foundations done, the series turns to the renderer: The GPU & Graphics Pipeline, then your first triangle in Vulkan. The full path is on the series hub.

Jeff Preshing. "An Introduction to Lock-Free Programming." preshing.com. Lock-free as system-wide progress; the "lock" means lock-up, not mutex.
Maurice Herlihy and Nir Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann. The obstruction-free / lock-free / wait-free progress hierarchy and the ABA problem.
Jeff Preshing. "Acquire and Release Semantics." preshing.com. The synchronizes-with edge a release store forms with an acquire load that reads it.
cppreference. std::memory_order. en.cppreference.com. The exact relaxed / acquire / release / seq_cst semantics.
Dmitry Vyukov. "Bounded MPMC queue." 1024cores. The per-cell sequence-number design, the orderings, and the "not lockfree in the official meaning" caveat.
Cambridge Computer Lab. "C/C++11 mappings to processors." cl.cam.ac.uk. The instruction lowering per ordering per ISA: x86 MOV vs ARM LDAR/STLR.
Peter Sewell et al. "x86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors." cl.cam.ac.uk. The formal x86 store-buffer model (loads acquire, stores release, for free).
The Rustonomicon. "Atomics." doc.rust-lang.org. Acquire/release pairing in Rust and "free on strongly-ordered platforms."
crossbeam (Rust). github.com/crossbeam-rs/crossbeam. Epoch-based reclamation, ArrayQueue/SegQueue, and the Chase-Lev crossbeam-deque.
Cameron Desrochers. moodycamel ConcurrentQueue and ReaderWriterQueue. github.com/cameron314/concurrentqueue. The MPMC queue ("lock-free, not quite wait-free") and the wait-free SPSC queue.
Ross Bencina. "Real-time audio programming 101: time waits for nothing." rossbencina.com. Why the audio callback must not lock, and the lock-free ring pattern.
C++ Core Guidelines, Concurrency (CP.100–CP.101). isocpp.github.io. Don't use lock-free unless you must; prefer a known library.
David Chase and Yossi Lev. "Dynamic Circular Work-Stealing Deque." SPAA 2005. vanderbilt.edu. The work-stealing deque behind job systems.
Nhat Minh Lê et al. "Correct and Efficient Work-Stealing for Weak Memory Models." PPoPP 2013. fzn.fr. The acquire/release fixes that make Chase-Lev correct on ARM/POWER.