Tutorial 09 · Engine Programming

Job Systems
from Scratch

A working job scheduler of the kind that runs The Last of Us and Doom Eternal at 60 fps. We start with one thread and work up to a fiber-aware, dependency-tracking, work-stealing system that uses every core on the box. No engine, no framework: just the data structures, the memory model, and live demos.

Time~50 min LevelJunior engine programmer PrereqsYou can read C++ or Rust at a beginner level HardwareSome passing acquaintance with the idea that CPUs have cores and caches helps

01Why a game engine needs a job system

A 60 fps game has 16.6 milliseconds to update the world, simulate physics, run animation, cull the scene, build a few thousand draw calls, and stream textures off disk. A modern console has eight CPU cores. A modern PC has between eight and thirty-two. If you spend all sixteen of those milliseconds on one core and leave the other thirty-one idle, your game runs at 30 fps and your fans stay quiet. The job of a is to spread that work across every core the machine has.

"Spread the work across cores" is harder than it sounds. Threads have to coordinate. Memory has to be cache-friendly. A naïve solution that hands work to other threads can run slower than the single-threaded version it replaced. A job system is the layer that handles all of that for you.

What you'll have by the end

A ~200-line work-stealing job scheduler with dependency tracking, in C++ and Rust, that you can read in one sitting. A JavaScript port runs live on this page, drives a worker-timeline visualizer, and lets you build your own task graphs in a browser playground. By the end you'll know why Naughty Dog uses fibers, why Unity passes structs to jobs instead of references, why Unreal added a thing called a Pipe, and how to avoid the bug where padding a struct makes your code 4x faster.

The fixed-thread era is over

In the mid-2000s, when consoles first shipped with multiple cores, engines were typically refactored along subsystem lines. One thread for the renderer, one for physics, one for audio, one for everything else. This is intuitive because it maps onto the way the engine is already organized. It also stops scaling around four cores because there is no fifth subsystem to assign. Worse, the workload of each subsystem changes minute to minute. A frame heavy on physics leaves the render thread idle, a frame heavy on rendering leaves the physics thread idle, and the user sees stutters.

The breakthrough, which the industry settled on between roughly 2009 and 2015, was to stop splitting by subsystem and start splitting by data. Instead of one fat physics thread that integrates every rigid body, you have a thousand small , each of which integrates ten bodies. Throw all thousand jobs at a pool of workers, let the workers grab them as they finish whatever they were doing, and every core stays busy. Tatarchuk's GDC 2015 talk on the Destiny renderer put this bluntly: "a game engine must be designed from the ground up for job-based multithreading"^[3].

The widget below contrasts the two models on a synthetic frame of 240 work units. The "fixed threads" mode runs a render thread, a physics thread, and a gameplay thread, each with its own fixed workload. The "job system" mode dumps the same total work into a pool of workers. The tail-latency gap is the point:

cores 8

workload skew (heavy physics frame ⇆ heavy render frame) 0.50

frame time

···

core utilization

···

02A short history of getting parallel

Job systems didn't drop fully formed from the sky. The shape they're in today is the product of three distinct decades of pain, and skipping the lineage is how people end up rebuilding the 2008 version of the wrong idea. A short tour:

1995

Cilk introduces work-stealing. Blumofe and Leiserson at MIT publish "Cilk: An Efficient Multithreaded Runtime System"^[5]. They prove that a randomized work-stealing scheduler runs within a constant factor of optimal for a wide class of parallel programs. Every modern job system, including the one we'll build, descends from this paper.

2005

Chase and Lev publish the deque. Their SPAA paper "Dynamic Circular Work-Stealing Deque"^[6] describes the lock-free data structure that became the backbone of every work-stealing runtime since: Cilk, TBB, Go, Rust's Rayon, and Java's ForkJoinPool. We'll implement a variant in §10.

2006

The Xbox 360 and PS3 ship with multi-core CPUs. Games start using a render thread, a gameplay thread, and a few helpers. It works on three cores. It does not scale.

2009

Tony Albrecht gives "Pitfalls of Object-Oriented Programming."^[1] On PS3 hardware he reorganizes a scene-traversal loop and gets a 6x speedup. The talk crystallizes "data-oriented design" as a discipline. Job systems aren't worth building until the data is organized to support them, and Albrecht's talk is where the industry collectively notices.

2010

Johan Andersson's "Parallel Futures of a Game Engine."^[11] DICE moves Frostbite to a job-graph model, parallel command-buffer recording, and PS3 SPU-based shading. The Battlefield engine becomes the public reference for a job-system-based AAA engine.

2015

The fiber moment. Christian Gyrling gives "Parallelizing the Naughty Dog Engine Using Fibers"^[2] at GDC. Six worker threads, 160 fibers, all jobs run on fibers, dependencies suspend fibers cooperatively. Bungie's Genova and Tatarchuk give two companion talks on Destiny^[4]^[3]: the entire engine is a job graph. Stefan Reinalter writes the five-part Molecular Matters series^[7] showing a Chase-Lev work-stealing scheduler from scratch. The blueprint is now public.

2018

Unity ships the C# Job System and Burst.^[8] The first time a "real" data-parallel scheduler is in the hands of every hobbyist. The Burst compiler turns a restricted dialect of C# into LLVM-optimized SIMD machine code.

2022

Unreal Engine 5 introduces UE::Tasks.^[9] Epic replaces named-thread affinity with a Pipe primitive that auto-serializes access to non-thread-safe APIs. By UE 5.4 the RHI is parallelized too.

2024

C++26 adopts std::execution.^[10] Senders, receivers, and structured concurrency become a standard library primitive. The language gets, for the first time, a vocabulary for describing parallel work that doesn't depend on the OS thread API.

The pattern: the math is from 1995, the data structure from 2005, the data-layout discipline from 2009, the complete engine deployment from 2015, the standard-library vocabulary from 2024. Roughly a decade between each, with each step taking the previous one for granted. The rest of this tutorial walks through the same sequence in one sitting.

03The naïve worker pool

Start with the simplest thing that could possibly work. N worker threads, one shared queue of work, and a mutex protecting the queue. Each worker loops: grab the mutex, pop a job, drop the mutex, run the job. When the queue is empty, the worker sleeps on a condition variable until somebody adds work.

naive_pool.cpp · the simplest design

// A simple thread pool: N worker threads share one job queue,
// protected by a mutex. The first design every junior writes.
class NaivePool {
  // The lock that protects the queue. Exactly one thread can hold it.
  std::mutex                          queueLock;

  // Lets idle workers sleep until somebody calls notify_one().
  // We wake them when a new job arrives or when shutting down.
  std::condition_variable             workAvailable;

  // The queue of jobs. Each job is a callable with no arguments.
  std::queue<std::function<void()>>    jobQueue;

  // One std::thread per worker. They all run workerLoop().
  std::vector<std::thread>             workerThreads;

  // Flip this to false in the destructor to tell workers to exit.
  bool                                 isRunning = true;

public:
  // Spawn `workerCount` threads. Each enters workerLoop() immediately.
  NaivePool(int workerCount) {
    for (int i = 0; i < workerCount; i++)
      workerThreads.emplace_back([this] { workerLoop(); });
  }

  // Add a job to the queue and wake one sleeping worker.
  void submit(std::function<void()> job) {
    {
      // Take the lock just long enough to push. Releasing it inside
      // the braces lets workers actually pop the job we just added.
      std::lock_guard lock(queueLock);    // every submitter contends here
      jobQueue.push(std::move(job));
    }
    // Wake one worker. If multiple are sleeping, only one needs to run.
    workAvailable.notify_one();
  }

  // Every worker runs this loop until shutdown.
  void workerLoop() {
    while (isRunning) {
      std::function<void()> nextJob;
      {
        std::unique_lock lock(queueLock);  // every worker also contends here
        // Sleep until there's work to do, or we're shutting down.
        // cv.wait() releases the lock while sleeping, then re-acquires it
        // before returning, so the queue check below is always safe.
        workAvailable.wait(lock, [&] {
          return !jobQueue.empty() || !isRunning;
        });
        if (!isRunning) return;             // shutdown signal
        nextJob = std::move(jobQueue.front());
        jobQueue.pop();
      }                                  // release the lock before running the job
      nextJob();                          // run outside the lock so other workers can fetch theirs
    }
  }
};

This works. It's about 50 lines of C++, it's correct, and it will give you a real speedup on, say, four cores running 100 ms of mostly-independent work. It's the design every junior implements first. It also stops scaling fast, and understanding why is the whole reason you read the rest of this tutorial.

Predict

Before reading the next list: look at the code above and try to name at least one reason it stops scaling from 4 cores to 32. Two reasons for partial credit, three for the full set. Then read on.

Three things go wrong as you scale

Lock contention. Every worker grabs the queue's mutex on every job. At 16 workers and a million tiny jobs per second, the lock is the bottleneck and the workers spend most of their time waiting in line to look at the queue.
Cache-line ping-pong. The queue's head and tail pointers, the mutex's state, and the condition variable's wait list usually share a cache line. Every push and pop forces that line to be transferred between cores. We'll measure this directly in §5.
It can't express dependencies. Real frame work has structure: animation runs after physics, render runs after visibility culling. With a flat queue, the only way to enforce ordering is to block, and a blocked thread is a wasted core.

Each of these problems has a fix. We'll work through them one at a time. By the end the lock is gone, the queue is sharded so no two cores fight over the same line, and dependencies are first-class.

What's a mutex, again?

A mutex (mutual exclusion lock) is a primitive that guarantees only one thread at a time can be inside a "critical section" of code. A thread that tries to enter while someone else is in waits, either by spinning or by going to sleep until the holder releases the lock.

Mutexes are correct but they don't scale: they serialize the code they protect. If a critical section takes one microsecond, sixteen threads contending for it can only do a million ops per second total, regardless of how many cores you have.

A condition variable is a partner primitive that lets a thread sleep until some condition becomes true (here, "the queue is non-empty"). It's the standard way to avoid busy-waiting inside a critical section.

04Threads aren't tasks. Tasks aren't threads.

Most beginner code mixes these two ideas. They are not the same.

A thread is an execution context: a stack, a program counter, and a slot in the operating system's scheduler. Creating one on Linux costs roughly 10 microseconds (Lemire 2020). Each one reserves about a megabyte of address space for its stack, and once you have more threads than physical cores, the kernel has to context-switch between them, which costs on the order of a microsecond per swap on Linux, more once you count the cache pollution that follows. So you want few threads.

A task (or job) is a unit of work: a function and its inputs. Creating one costs essentially nothing. So you want many tasks.

The rule

One worker thread per physical core (or one per logical core minus one, leaving one for the OS, depending on platform). Then divide the actual work into thousands of tasks that those workers chew through. Threads are about the machine. Tasks are about the work.

Unity's manual states this explicitly: "The job system creates worker threads matching available CPU cores… [Unity] ensures that there are only enough threads to match the capacity of the CPU cores"^[8]. Naughty Dog runs six worker threads on PS4, pinned one per core, and then puts a hundred and sixty fibers on top of them to hold the work^[2]. Both engines are following the same rule.

In our scheduler:

eq. 1 \cdot worker count workers = max (1, N cores - 1)

We reserve one logical core for the main thread and the OS. Everything else becomes a worker. On an 8-core machine that's 7 worker threads plus the main thread, so 8 threads total but only 7 of them are dedicated to jobs.

Once you separate the two concepts, a few other things become obvious. Tasks are cheap to make, so you should make a lot of them: granularity matters, and we'll quantify it in §7. Tasks don't need their own stack: they borrow the worker's stack while they run. That makes them about 64 bytes each instead of a megabyte.

The first time you write a multi-core program and have it run slower than the single-core version, the cause is usually . It is one of the more surprising bugs in concurrent code, and one of the easier ones to fix once you know the shape.

Caches don't fetch one byte at a time. They fetch a whole cache line, typically 64 bytes on x86-64 (Intel and AMD, including the Zen 2 silicon in the PS5 and Xbox Series X) and 128 bytes on Apple's M-series chips^[12]. Two variables that sit within 64 bytes of each other end up on the same cache line. Now imagine two threads, each writing to one of those variables. The variables are independent. The cores are independent. But the cache line is one object, and the cache-coherence protocol (MESI on x86) treats it as one object^[13]:

Core 0 writes to variable a. The line is loaded into Core 0's L1, marked as Modified.
Core 1 wants to write to variable b. The line is in Core 0's L1, so Core 1 has to invalidate Core 0's copy, then pull the line over to its own L1.
Core 0 wants to write to a again. The line is now in Core 1's L1. Pull it back.
Repeat. Forever.

Refresher: how does CPU caching even work?

Main memory (RAM) is slow. Accessing it costs roughly 100 nanoseconds, which is something like 300 to 400 CPU cycles. If every load and store went all the way to RAM, your CPU would spend most of its time idle. Caches exist to fix that.

Modern CPUs have a hierarchy of caches, each one bigger and slower than the last. Approximate sizes and latencies on a modern x86 core:

Every load goes to L1 first. If the data is there (a "hit"), it returns in a few cycles. If not (a "miss"), it falls back to L2, then L3, then main memory, each step costing more. A program that accesses memory in cache-friendly patterns can run 10× to 100× faster than the same program with bad patterns, even if the work is identical.

Caches don't fetch one byte at a time. They fetch a fixed-size block called a cache line. When you load x, the CPU pulls in the 64 bytes around x on x86 (or 128 bytes on Apple Silicon), because nearby memory is often accessed next. That's why "data locality" matters so much: a tight loop that touches sequential memory only pays the cache miss once per 64 bytes.

The catch, and the reason §5 exists, is that writes have to keep all the caches in sync. If Core 0 writes a byte on a cache line that Core 1 has cached, Core 1's copy must be invalidated. The MESI protocol (Modified, Exclusive, Shared, Invalid) is the state machine that tracks this across cores. Two threads can independently corrupt each other's performance just by writing to nearby variables: that's false sharing, the topic of this section.

Tools to measure your cache behavior in practice: perf stat -e cache-misses on Linux, Intel VTune's "Microarchitecture Exploration" view, and Apple's Instruments "Counters" template. Profiling cache misses is often the single biggest win available once a program is correct.

The line pings between the cores' caches even though the data they touch is logically independent. Hence "false sharing": the sharing looks like sharing only to the cache, not to the programmer.

Run the demo. The top row has two counters packed adjacently in a struct; both threads write to it. The bottom row has the same counters, padded to live on separate cache lines. Same code, same workload, different memory layout:

The fix

Make sure data that is touched by different threads sits on different cache lines. In C++17 the standard provides a portable constant for the right alignment:

padding.cpp · the standard idiom

// std::hardware_destructive_interference_size is the smallest offset
// between two objects that's guaranteed to avoid false sharing.
// It lives in <new>. Typical values: 64 bytes on x86, 128 on Apple Silicon.
constexpr size_t cacheLineBytes = std::hardware_destructive_interference_size;

// Per-worker statistics. One instance per worker thread, accessed
// concurrently. Without padding, jobsCompleted and jobsStolen would
// share a cache line and produce ping-pong contention.
struct WorkerStats {
  // alignas tells the compiler to start this member on a fresh cache line,
  // inserting padding bytes before it if needed.
  alignas(cacheLineBytes) std::atomic<int64_t> jobsCompleted;

  // This counter now lives on its own cache line too, so writes
  // here don't invalidate jobsCompleted on the other core's cache.
  alignas(cacheLineBytes) std::atomic<int64_t> jobsStolen;
};
// sizeof(WorkerStats) is now 128 bytes on x86, not 16. We trade memory
// for performance: a tiny per-worker overhead, a large speedup on the
// hot path.

The alignas declaration tells the compiler to start each member on a fresh cache line, padding with empty bytes if necessary. Yes, it wastes memory. A struct that "should" be 16 bytes balloons to 128. Worth it: you're paying a few bytes of address space for an order-of-magnitude speedup on the hot path. Worth checking that your compiler actually emits the padding too; on some toolchains alignas on a member inside a struct can silently get ignored if the parent struct's alignment is smaller.

A portability footnote

C++17 added std::hardware_destructive_interference_size precisely so you wouldn't have to hardcode 64. Unfortunately, as of late 2024 libstdc++ still warns about using it as an ABI hazard because its value is implementation-defined. Most engine codebases just hardcode alignas(64) for x86 and call it a day. If you ship on Apple Silicon, use 128. If you want to be fully portable, take the warning and use the constant.

06Lock-free queues

Now we go back to the queue. We have N workers and we want them to push and pop concurrently without a mutex. The phrase to look up is lock-free queue: a queue whose operations are implemented using atomic compare-and-swap (CAS) operations rather than locks. Done right, multiple workers can make progress simultaneously instead of serializing through a critical section.

The classic design is Dmitry Vyukov's bounded MPMC queue^[16]: a fixed-size array where each slot has a sequence counter, and one CAS per enqueue or dequeue. It's not strictly lock-free (a stalled producer can block consumers waiting on that specific slot), but it's used in shipping engines because the fast path is two cache-line accesses and zero locks.

eq. 2 \cdot Vyukov enqueue (sketch) pos \leftarrow load (tail) slot \leftarrow buf [pos & mask] if slot. seq = pos: CAS (tail, pos, pos +1); if succeeded, write payload, set seq = pos +1 .

Each slot tracks its own version (the sequence number). Producers race for the slot whose sequence equals the current tail. Whoever wins the CAS writes the payload and bumps the sequence to mark the slot full; whoever loses sees a stale sequence and retries. Consumers race symmetrically on the head.

What is a CAS? What is memory ordering?

Compare-and-swap (CAS) is a hardware atomic operation: "if the value at this address is X, atomically replace it with Y and tell me you succeeded; otherwise leave it alone and tell me what it actually is." It's the foundation of lock-free programming. Every modern CPU has one (lock cmpxchg on x86, casal on ARMv8.1+, an LL/SC pair otherwise).

Memory ordering controls how aggressively the compiler and CPU can reorder loads and stores around an atomic operation. The C++ levels you'll see most:

memory_order_relaxed: no ordering; the atomic is atomic but nothing else is constrained. Good for counters you read for diagnostics.
memory_order_release on a store: anything written before this store becomes visible to a thread that does an acquire-load and observes the value written by this store (or a value later in its release sequence). Touching the same atomic isn't enough; if the acquire reads a stale value from before the release, no happens-before is established.
memory_order_acquire on a load: anything written by the releasing thread before its release-store is visible after this load, assuming the load actually observes that release-store's value.
memory_order_acq_rel: for read-modify-write ops like fetch_add or compare_exchange. Acts as acquire on the load side and release on the store side. The right choice when an RMW is both consuming a published value and publishing one of its own.
memory_order_seq_cst: every seq_cst operation across all threads agrees on a single total order. The strongest and slowest option, and the default for std::atomic ops when you don't pass an order. Use it when you can't reason about anything weaker.

A useful mental model: release = "publish," acquire = "subscribe." Anything you wrote before the publish is visible to anyone who subscribes and sees your publish. That's what makes a lock-free queue work: the producer publishes the payload with a release-store on the sequence number; the consumer reads the sequence with an acquire-load and, having observed it, is guaranteed to see the payload. Jeff Preshing's blog has the canonical explanation^[17].

One footnote on cost: on x86 the gap between acquire/release and seq_cst is small, since plain loads and stores are already acquire/release and only seq_cst stores need an mfence. On ARM and Power the gap is much wider, which is why being precise about ordering matters even if your dev box can't tell the difference. (memory_order_consume also exists in the standard but every mainstream compiler promotes it to acquire; you can ignore it.)

Run the queue below. Producers (purple) push payloads; consumers (green) pull them. Sequence numbers on each slot flip from "ready to write" to "ready to read" and back:

producers 2

consumers 3

rate 1.0×

enqueued

dequeued

contended CAS

An MPMC queue is the right structure when you have a true many-to-many relationship. In a job system you mostly don't. The pattern that wins is this:

One queue (a deque, really) per worker thread.
The owning worker is the only producer and almost always the only consumer.
Other workers occasionally steal from the deque when they run out of their own work.

That structure is called a Chase-Lev deque, and it's the topic of the next section. The MPMC queue earns its keep in the global submission path (the main thread handing top-level work to the scheduler) where many threads might be producers and the cost of contention is amortized over a coarse-grained event.

07Work stealing

Now the puzzle. We have N worker threads. We want them all busy. We don't want them fighting for a lock. A single global queue (even a lock-free one) is still a contention point. What if every worker had its own queue?

That works fine when the work is balanced. It falls apart the moment one worker finishes early and the others are still backed up: worker A is idle, worker B has a queue of fifty jobs, and they can't help each other. We need a way for idle workers to pull work from busy ones.

Work stealing, the design from Cilk in 1995^[5], solves this by making the per-worker queue a deque (double-ended queue) with asymmetric access:

The owner pushes and pops at the bottom of its deque. This is the fast path; almost always uncontended.
A thief (another worker that ran out of work) steals from the top. This is the slow path; contention here is rare but it still has to be correct.

Owner operations are LIFO (last-in, first-out), which keeps a worker's hot data in its L1. Thief operations are FIFO (first-in, first-out), which tends to grab older, larger jobs that are more likely to contain further sub-work for the thief to subsume. The whole structure is the Chase-Lev deque^[6]; a 2013 correction by Lê et al. for ARM and POWER memory models is the version you should actually implement^[18].

The visual:

workers 6

initial imbalance 0.70

jobs done

steals

idle workers

The "work-first" principle

Frigo, Leiserson, and Randall's PLDI 1998 paper on the Cilk-5 scheduler^[19] articulates the design principle that makes work-stealing schedulers fast, not just correct: optimize the path where no stealing happens. They call this the work-first principle. Since steals are rare in a well-balanced computation, the common case is "owner pops a job from its own deque and runs it." That path should cost something close to a function call.

Concretely, in our scheduler:

The owner's push/pop are almost non-atomic. They use memory_order_relaxed on the bottom counter; the only atomic operation is the bottom-counter store, and there's an acquire-fence on the rare path where the owner and a thief race over the last element.
The thief's steal is a single CAS on the top counter.
The deque storage is owned by one worker (no cross-thread allocation in the fast path).

The cost of an uncontended push or pop ends up close to a non-atomic increment plus a release-store: tens of nanoseconds, not hundreds. The cost of a steal is higher because the CAS forces cache-line ownership exchange between cores, so it's typically a low-hundreds-of-cycles operation. As long as steals are a small fraction of total operations, the average is dominated by the fast path. Reinalter's Molecular Matters series^[7] walks through a complete C++ implementation with measured cycle counts; it's worth reading after this section.

How granular is granular?

One question that comes up every time you build a work-stealing scheduler: how big should each job be? Too small and you spend more time managing the queue than running the work. Too big and you can't load-balance because there aren't enough jobs to steal.

The Cilk-5 paper's design principle is that the spawn fast-path should cost only a small multiple of an ordinary function call so that spawn overhead doesn't dominate the work. The practical version: each job should do at least one to two orders of magnitude more work than your queue operations take. If push/pop is around 10 ns, aim for at least 1 µs of work per job. At that ratio, a 16.6 ms frame can comfortably schedule on the order of ten thousand jobs without spending serious time on the queue itself. Unity's documentation lands in the same range: a batchSize of 32 to 128 for cheap operations, 1 for heavy ones^[8].

08Dependencies as a DAG

Real frame work isn't a flat pile of independent jobs. It has structure. Visibility culling has to finish before draw-call building can start. Animation has to finish before skinning. Audio listener position depends on the player's transform, which is updated by gameplay. The graph of which job depends on which is a directed acyclic graph (DAG), and a real scheduler has to honor it.

A sketch of a typical frame's job graph:

You can see why job systems beat fixed-thread designs. The graph has natural fan-out at every level. A fixed-thread design has to serialize the colored boxes in some order; a job system runs every box that's ready in parallel.

Three ways to express dependencies

Different engines pick different syntax for the same idea. The mechanics map onto a small handful of primitives:

Counters (Naughty Dog). Each job decrements a shared atomic counter when it finishes. A waiting job sleeps until the counter hits zero^[2]. The counter is the dependency primitive; you can wait on the counter, branch on it, or use it to launch a continuation. Simple and explicit.
Handles (Unity). Every Schedule() call returns a JobHandle. You pass it into the next Schedule() call as a dependency, or combine several handles with JobHandle.CombineDependencies(a, b, c). The scheduler tracks the resulting DAG and runs each job once its predecessors complete^[8].
Continuations (Unreal Tasks, C++26 senders). Each task carries a list of prerequisite tasks. When a prerequisite completes, it decrements a counter on its dependents; whichever decrements to zero gets enqueued. No worker ever blocks waiting^[9].

Under the hood, all three are doing the same thing: each task has a refcount of unresolved prerequisites, and when a prerequisite finishes it decrements its dependents' refcounts; whichever drops to zero is enqueued. The syntax differs but the data structure is the same.

In code, the heart of a dependency-tracking scheduler is about a dozen lines:

deps.cpp · core of a DAG scheduler

// A Task is a node in the dependency DAG. Three fields:
//   1. The callable to run (its "body").
//   2. How many prerequisites still haven't finished.
//   3. Which other tasks are waiting on this one.
struct Task {
  std::function<void()>   body;
  std::atomic<int>          unresolvedPrereqs;   // drops as prereqs finish
  std::vector<Task*>          dependents;          // who depends on us
};

// Called once for each task right after the caller builds it.
// If no prereqs are open (e.g. the task is a root), enqueue immediately.
void submitWhenReady(Task* task) {
  if (task->unresolvedPrereqs.load(memory_order_acquire) == 0)
    pushToReadyQueue(task);                          // no prereqs left, run it
}

// A worker calls this when it pops a task from a queue.
void runTask(Task* task) {
  task->body();                                       // the actual work

  // Now notify everyone waiting on this task. Each dependent's prereq
  // counter drops by one; if it hits zero, we just unblocked it.
  for (Task* dependent : task->dependents) {
    // fetch_sub atomically decrements and returns the value BEFORE
    // the decrement. So a return value of 1 means "we were the last
    // prereq, the counter is now 0, this dependent is runnable."
    int previousCount = dependent->unresolvedPrereqs.fetch_sub(1,
        memory_order_acq_rel);
    if (previousCount == 1)
      pushToReadyQueue(dependent);                   // this dependent is now runnable
  }
}

Three fields per task, two functions. Add the worker pool from §3 and the work-stealing deque from §7, and you have a working DAG scheduler. §10 puts it all together.

Build a small graph and run it through the scheduler:

graph

workers 4

elapsed

0 ms

critical path

···

workers used

Work and span

Two numbers describe a parallel computation. Work (T₁) is the total time on one core. Span (T_∞) is the time on infinite cores, which equals the critical-path length. On P cores, an optimal scheduler runs in roughly T₁/P + T_∞ time^[20]. The first term is "we have to do the work"; the second is "we can't beat the longest dependency chain." Engine performance work is about shrinking either of those.

09Fibers and why Naughty Dog uses them

So far our scheduler is built on threads alone. There's a problem with that model, and it shows up the first time you write a job that wants to wait for another job to finish in the middle of its own work.

Imagine a job that builds a draw list. Halfway through, it needs the result of a culling job that's currently running on another worker. What should it do?

Block on a condition variable. The worker now sits idle waiting for the culling job, even though the queue is full of other work. Wasted core.
Spin-wait. The worker burns CPU watching a flag. Even more wasted core.
Use continuation-passing. Restructure the draw-list job into two jobs: the part before the wait, and a continuation that runs when culling finishes. Works, but it's invasive to the caller and you can't suspend in the middle of a complex function.

None of those is what Naughty Dog reached for. They wanted to write code like this:

job_with_wait · the "obvious" code

void buildDrawList() {
  prepareCommandBuffer();
  JobCounter* cullingDone = scheduleCulling();
  waitForCounter(cullingDone, 0);            // block until culling is done
  encodeVisibleObjects();
}

"Wait for the counter" looks blocking, but they wanted it to not waste the worker. Their answer was : lightweight user-mode threads that you switch by hand, in user space, with no kernel involvement.

What is a fiber

A fiber is a stack and a saved register state. Switching from fiber A to fiber B means: save A's CPU registers onto A's stack, load B's registers from B's stack, jump to where B was suspended. The OS doesn't know. With hand-written assembly, the switch is on the order of tens of cycles (Boost.Fiber documents "less than a hundred"). Compare to a thread context switch, which crosses the user-kernel boundary and costs roughly a microsecond, more once you count the TLB and cache penalty on wake-up.

On Windows, fibers are a kernel-supported primitive (ConvertThreadToFiber, CreateFiber, SwitchToFiber)^[21]. On every other platform you can either use Boost.Context (which is mostly hand-written assembly per ABI) or implement the switch yourself in about thirty lines of asm per platform. Marl, Google's open-source job system^[22], has implementations for x86, x64, aarch64, mips, ppc, riscv, and more in its osfiber_asm_*.S files.

The Naughty Dog architecture

Putting it all together, Gyrling's GDC 2015 talk describes:

6 worker threads, pinned one per PS4 core. The threads are the execution units.
160 fibers total: 128 with 64 KiB stacks for ordinary jobs, 32 with 512 KiB stacks for jobs that need more stack.
3 priority queues (low, normal, high). No work-stealing in the classical Cilk sense; the queues are global.
Atomic counters are the synchronization primitive. waitForCounter(n) doesn't block the worker; it parks the current fiber on the counter and switches to another fiber. When the counter drops to n, the parked fiber becomes runnable again^[2].

The result is that jobs can wait inline, in normal-looking code, without wasting workers. A thread never blocks on an OS primitive; it only ever switches fibers. The hardware stays saturated.

Step through a fiber switch:

idle. press Step to advance.

One worker, three fibers. The worker switches between them as each fiber waits on a counter. Dotted lines show suspended call stacks.

Each fiber preserves its full call stack across a switch. That is what makes waitForCounter() callable from the middle of any function, no matter how deep in the call graph, without the caller knowing.

Fibers versus C++20 coroutines

A natural question in 2026: should I use C++20 coroutines instead of fibers? They sound similar. They're not.

Stackful (fibers). Each fiber has a real stack of fixed size. Any function can suspend the fiber, no matter how deep. Switching is fast but the stack is allocated up front^[23].
Stackless (C++20 coroutines). The compiler transforms the coroutine function into a state machine. Only that function can suspend; if you want a callee to suspend, you have to mark it a coroutine too, and so on up the chain. Frames are heap-allocated and just hold the captured state, typically tens of bytes.

For a game engine job system, the choice usually comes down to: do you want to be able to call waitForCounter() from anywhere in a job's call stack? Fibers say yes; coroutines say only from coroutine-marked frames. Naughty Dog, Marl, and most AAA engines pick fibers for exactly this reason. The "everything is a coroutine" tax is real, and engines have a lot of code that wasn't written with suspension in mind.

Fibers aren't free

Each fiber stack is real memory. Naughty Dog's 160 fibers cost 128 × 64 KiB + 32 × 512 KiB ≈ 24 MiB just for stacks, before any actual work. The choice is which problems you'd rather have: the runtime cost of preallocated stacks, or the language tax of stackless coroutines everywhere. There is no correct answer; there are tradeoffs.

10A working scheduler, in your language

Below is a complete, working scheduler in two languages: modern C++ (20) and Rust. The C++ version is about 230 lines and uses no dependencies beyond the standard library. The Rust version is around 200 lines, also stdlib-only. Both implement the same design: a fixed pool of OS-thread workers, per-worker Chase-Lev-flavored work-stealing deques, dependency tracking with atomic refcounts, and a simple wait-for primitive built on condition variables (no fibers; we'll discuss the cost in the callout below).

Pick a language. Read the comments; every interesting line has one.

scheduler · pick a language ↓

// ───────────────────────────────────────────────────────────
// scheduler.cpp · work-stealing job system with DAG support
// Build: g++ -std=c++20 -O2 -pthread scheduler.cpp
// ───────────────────────────────────────────────────────────
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <random>
#include <thread>
#include <vector>
#include <new>

namespace mpg {

// We hardcode 64 here to dodge libstdc++ ABI warnings on
// hardware_destructive_interference_size. For Apple Silicon
// builds, raise this to 128.
constexpr size_t kCacheLine = 64;

struct Task {
  std::function<void()>          body;
  std::atomic<int>                 openPrereqs{0};       // std::atomic is non-copyable
  std::vector<Task*>                dependents;        // so we initialize in-place here
  std::atomic<bool>                done{false};
};

// A bounded ring-buffer deque, owned by one worker.
// Owner pushes/pops at bottom; thieves CAS-steal at top.
// This is the Chase-Lev structure, simplified for clarity.
class Deque {
  static constexpr size_t N = 1024;       // per-worker capacity
  alignas(kCacheLine) std::atomic<int64_t> top{0};
  alignas(kCacheLine) std::atomic<int64_t> bottom{0};
  alignas(kCacheLine) Task* buffer[N]{};
public:
  bool push(Task* t) {
    int64_t b = bottom.load(std::memory_order_relaxed);
    int64_t tp = top.load(std::memory_order_acquire);
    if (b - tp >= (int64_t)N) return false;             // full
    buffer[b % N] = t;
    // Release ensures the buffer write is visible BEFORE the
    // bottom bump that a thief might observe.
    bottom.store(b + 1, std::memory_order_release);
    return true;
  }

  Task* popOwner() {                                  // fast path, by owner
    int64_t b = bottom.load(std::memory_order_relaxed) - 1;
    bottom.store(b, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    int64_t tp = top.load(std::memory_order_relaxed);
    if (tp <= b) {
      Task* t = buffer[b % N];
      if (tp != b) return t;                              // uncontested
      // last element: race a thief for it
      int64_t expected = tp;
      bool won = top.compare_exchange_strong(
        expected, tp + 1,
        std::memory_order_seq_cst, std::memory_order_relaxed);
      bottom.store(b + 1, std::memory_order_relaxed);
      return won ? t : nullptr;
    } else {
      bottom.store(b + 1, std::memory_order_relaxed);
      return nullptr;                                     // empty
    }
  }

  Task* steal() {                                       // slow path, by thief
    int64_t tp = top.load(std::memory_order_acquire);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    int64_t b = bottom.load(std::memory_order_acquire);
    if (tp < b) {
      Task* t = buffer[tp % N];
      int64_t expected = tp;
      bool won = top.compare_exchange_strong(
        expected, tp + 1,
        std::memory_order_seq_cst, std::memory_order_relaxed);
      return won ? t : nullptr;
    }
    return nullptr;
  }
};

class Scheduler {
  std::vector<std::unique_ptr<Deque>> deques;
  std::vector<std::thread>             workers;
  std::atomic<bool>                  running{true};
  std::atomic<int>                   inFlight{0};
  std::mutex                          sleepMu;
  std::condition_variable             wakeCv;

  static thread_local int tlsIndex;

public:
  Scheduler(int n = -1) {
    if (n < 0) n = std::max(1, (int)std::thread::hardware_concurrency() - 1);
    deques.reserve(n);
    for (int i = 0; i < n; i++)
      deques.emplace_back(std::make_unique<Deque>());
    workers.reserve(n);
    for (int i = 0; i < n; i++)
      workers.emplace_back([this, i] { workerLoop(i); });
  }

  ~Scheduler() {
    running.store(false, std::memory_order_release);
    wakeCv.notify_all();
    for (auto& t : workers) t.join();
  }

  // Build a task and connect its prereqs. Returns a pointer the
  // caller owns; release with delete after wait().
  // Caller invariant: prereqs were submitted before this task.
  Task* submit(std::function<void()> body,
                std::initializer_list<Task*> prereqs = {}) {
    Task* t = new Task();
    t->body = std::move(body);
    int open = 0;
    for (Task* p : prereqs) {
      if (!p->done.load(std::memory_order_acquire)) {
        p->dependents.push_back(t);
        open++;
      }
    }
    t->openPrereqs.store(open, std::memory_order_release);
    inFlight.fetch_add(1, std::memory_order_relaxed);
    if (open == 0) enqueue(t);
    return t;
  }

  void wait() {                                       // drain all in-flight tasks
    while (inFlight.load(std::memory_order_acquire) > 0) {
      if (Task* t = tryGet(-1)) runTask(t);             // help out
      else std::this_thread::yield();
    }
  }

private:
  void enqueue(Task* t) {
    int idx = tlsIndex >= 0 ? tlsIndex : 0;
    deques[idx]->push(t);
    wakeCv.notify_one();
  }

  Task* tryGet(int myIdx) {
    if (myIdx >= 0)
      if (Task* t = deques[myIdx]->popOwner()) return t;
    // Try to steal from a random victim.
    thread_local std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> pick(0, (int)deques.size() - 1);
    for (int attempt = 0; attempt < (int)deques.size() * 2; attempt++) {
      int v = pick(rng);
      if (v == myIdx) continue;
      if (Task* t = deques[v]->steal()) return t;
    }
    return nullptr;
  }

  void runTask(Task* t) {
    t->body();
    t->done.store(true, std::memory_order_release);
    for (Task* d : t->dependents) {
      if (d->openPrereqs.fetch_sub(1, std::memory_order_acq_rel) == 1)
        enqueue(d);
    }
    inFlight.fetch_sub(1, std::memory_order_acq_rel);
  }

  void workerLoop(int idx) {
    tlsIndex = idx;
    while (running.load(std::memory_order_acquire)) {
      if (Task* t = tryGet(idx)) {
        runTask(t);
      } else {
        std::unique_lock g(sleepMu);
        wakeCv.wait_for(g, std::chrono::microseconds(100));
      }
    }
  }
};

thread_local int Scheduler::tlsIndex = -1;

} // namespace mpg

// ── usage ────────────────────────────────────────────────
// mpg::Scheduler sched;
// auto* a = sched.submit([] { /* work */ });
// auto* b = sched.submit([] { /* work */ }, {a});  // b depends on a
// auto* c = sched.submit([] { /* work */ }, {a, b});
// sched.wait();
// // (then delete a, b, c, or use a pool)

// ───────────────────────────────────────────────────────────
// scheduler.rs · work-stealing job system with DAG support
// Build: rustc -O scheduler.rs
// ───────────────────────────────────────────────────────────
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, AtomicI32, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

pub struct Task {
    body: Mutex<Option<Box<dyn FnOnce() + Send>>>,
    open_prereqs: AtomicI32,
    dependents: Mutex<Vec<Arc<Task>>>,
    done: AtomicBool,
}

// Per-worker deque. Owner: push_back / pop_back (LIFO).
// Thief: pop_front (FIFO). Mutex-protected for brevity; in
// production you'd use crossbeam::deque, which is lock-free.
struct WorkerDeque {
    inner: Mutex<VecDeque<Arc<Task>>>,
}

impl WorkerDeque {
    fn new() -> Self { Self { inner: Mutex::new(VecDeque::new()) } }
    fn push(&self, t: Arc<Task>)   { self.inner.lock().unwrap().push_back(t); }
    fn pop_owner(&self) -> Option<Arc<Task>> {
        self.inner.lock().unwrap().pop_back()
    }
    fn steal(&self) -> Option<Arc<Task>> {
        self.inner.lock().unwrap().pop_front()
    }
}

pub struct Scheduler {
    deques: Arc<Vec<WorkerDeque>>,
    running: Arc<AtomicBool>,
    in_flight: Arc<AtomicI32>,
    wake: Arc<(Mutex<()>, Condvar)>,
    workers: Vec<thread::JoinHandle<()>>,
}

impl Scheduler {
    pub fn new(n: usize) -> Self {
        let n = n.max(1);
        let deques: Arc<Vec<_>> = Arc::new((0..n).map(|_| WorkerDeque::new()).collect());
        let running = Arc::new(AtomicBool::new(true));
        let in_flight = Arc::new(AtomicI32::new(0));
        let wake = Arc::new((Mutex::new(()), Condvar::new()));
        let mut workers = Vec::new();
        for i in 0..n {
            let (deques, running, in_flight, wake) =
                (deques.clone(), running.clone(), in_flight.clone(), wake.clone());
            workers.push(thread::spawn(move || {
                worker_loop(i, deques, running, in_flight, wake);
            }));
        }
        Self { deques, running, in_flight, wake, workers }
    }

    pub fn submit(&self, body: Box<dyn FnOnce() + Send>,
                  prereqs: &[Arc<Task>]) -> Arc<Task> {
        let t = Arc::new(Task {
            body: Mutex::new(Some(body)),
            open_prereqs: AtomicI32::new(0),
            dependents: Mutex::new(Vec::new()),
            done: AtomicBool::new(false),
        });
        let mut open = 0;
        for p in prereqs {
            if !p.done.load(Ordering::Acquire) {
                p.dependents.lock().unwrap().push(t.clone());
                open += 1;
            }
        }
        t.open_prereqs.store(open, Ordering::Release);
        self.in_flight.fetch_add(1, Ordering::Relaxed);
        if open == 0 { self.enqueue(t.clone()); }
        t
    }

    fn enqueue(&self, t: Arc<Task>) {
        self.deques[0].push(t);
        self.wake.1.notify_one();
    }

    pub fn wait(&self) {
        while self.in_flight.load(Ordering::Acquire) > 0 {
            thread::yield_now();
        }
    }
}

fn try_get(my_idx: usize, deques: &[WorkerDeque]) -> Option<Arc<Task>> {
    if let Some(t) = deques[my_idx].pop_owner() { return Some(t); }
    // Round-robin through the other deques starting from our neighbor.
    // A production version would use a per-thread RNG (rand crate) so
    // thieves don't pile onto the same victim.
    for offset in 1..deques.len() {
        let v = (my_idx + offset) % deques.len();
        if let Some(t) = deques[v].steal() { return Some(t); }
    }
    None
}

fn run_task(t: Arc<Task>, deques: &[WorkerDeque],
            in_flight: &AtomicI32) {
    if let Some(body) = t.body.lock().unwrap().take() {
        body();
    }
    t.done.store(true, Ordering::Release);
    let dependents = t.dependents.lock().unwrap().drain(..).collect::<Vec<_>>();
    for d in dependents {
        if d.open_prereqs.fetch_sub(1, Ordering::AcqRel) == 1 {
            deques[0].push(d);
        }
    }
    in_flight.fetch_sub(1, Ordering::AcqRel);
}

fn worker_loop(idx: usize,
               deques: Arc<Vec<WorkerDeque>>,
               running: Arc<AtomicBool>,
               in_flight: Arc<AtomicI32>,
               wake: Arc<(Mutex<()>, Condvar)>) {
    while running.load(Ordering::Acquire) {
        if let Some(t) = try_get(idx, &deques) {
            run_task(t, &deques, &in_flight);
        } else {
            let g = wake.0.lock().unwrap();
            let _ = wake.1
                .wait_timeout(g, std::time::Duration::from_micros(100))
                .unwrap();
        }
    }
}

What's intentionally missing

This implementation is meant to read clearly, not to be the fastest possible, and it assumes the caller submits tasks in topological order (parent before child) from a single thread. Production versions:

Synchronize the dependents vector so a task can register itself with a prerequisite that's still running, without a race against that prerequisite finishing.
Pool tasks instead of new/delete per task. Reinalter's part 2^[7] shows the per-thread linear allocator pattern.
Use fibers (or coroutines) so jobs can wait mid-run. See §9 and Marl's source^[22].
Replace the global submission point with the calling worker's own deque, so the main thread can also be a scheduler client.
Add priority queues, named threads for non-thread-safe APIs, and profiler hooks.
Replace the spin/sleep wake path with an event count primitive for better latency.

11Try it yourself

The playground below runs a JavaScript port of the C++ scheduler above. The library is exposed as MPGJobs; you can build task graphs, run them, and see the worker timeline animate as the graph executes. Hit Run (or Ctrl+Enter / Cmd+Enter) to execute. Output prints below; the worker timeline is drawn on the right.

▸ playground.js · live JavaScript scheduler, in your browser

// Build a small frame-shaped graph and run it on 4 workers.
const sched = MPGJobs.create({ workers: 4 });

// "work" simulates a job by burning some microseconds.
const work = (ms) => () => MPGJobs.busy(ms);

// Root: gather input.
const input = sched.submit(work(0.4), 'input');

// Three things branch off input.
const ai       = sched.submit(work(1.8), 'AI',       [input]);
const gameplay = sched.submit(work(1.2), 'gameplay', [input]);
const physics  = sched.submit(work(3.0), 'physics',  [input]);

// Render path joins on all three.
const culling  = sched.submit(work(2.5), 'culling',  [physics, gameplay]);
const animation = sched.submit(work(2.0), 'animation', [gameplay]);
const draw     = sched.submit(work(2.4), 'draw lists', [culling, animation, ai]);

// Audio runs alongside everything.
const audio    = sched.submit(work(1.5), 'audio',    [gameplay]);

const present  = sched.submit(work(0.3), 'present',  [draw, audio]);

const t0 = MPGJobs.now();
await sched.run();
const t1 = MPGJobs.now();

console.log(`frame completed in ${(t1 - t0).toFixed(2)} ms`);
console.log(`critical path: input → physics → culling → draw → present`);
console.log(`(try changing the worker count or job sizes)`);

Edit anything. Drop the worker count to 1 (a single-threaded version) and the frame stretches out. Bump it to 8 and the parallel jobs collapse into a tight bar. Try a chain of 30 dependent jobs and you'll see that adding workers doesn't help past the critical-path length.

12How Unity does it

Unity's job system, shipped fully in 2018^[8], is a managed-language thin wrapper on top of a native C++ scheduler that looks a lot like what we just built. Three pieces matter.

The job interfaces

You implement a job by writing a struct with one method. The struct is the body; the fields are the inputs and outputs. Unity copies the struct into the scheduler, so jobs can't capture references the way a closure would.

IJob: one unit of work, runs once on a worker.
IJobParallelFor: N independent iterations; the scheduler picks a batch size and hands out ranges to workers.
IJobFor: similar to IJobParallelFor but lets you control batching and parallelism more precisely. Preferred in modern code.
IJobChunk / IJobEntity: ECS-specific; iterates over archetype chunks. Used inside DOTS / Entities packages.

Particles.cs · a Unity IJobParallelFor

using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;

[BurstCompile]
struct UpdateParticles : IJobParallelFor {
    public NativeArray<float3> positions;
    [ReadOnly] public NativeArray<float3> velocities;
    public float dt;

    public void Execute(int index) {
        positions[index] += velocities[index] * dt;
    }
}

// At call site:
var job   = new UpdateParticles { positions = pos, velocities = vel, dt = 0.016f };
var handle = job.Schedule(positions.Length, 64);   // batchSize=64
handle.Complete();                                // fence at end of frame

Three things to notice

1. Handles are dependencies. Schedule returns a JobHandle. Pass it into the next Schedule call to express "this depends on that." JobHandle.CombineDependencies(a, b, c) gives you a handle that's satisfied when all three are. The result is the same DAG you'd build with our prereqs array^[8].

2. The safety system catches data races at schedule time. Every NativeArray tracks which jobs read or write it. If you try to schedule a job that writes to a container that another scheduled job is also writing, Unity throws an exception with a clear message before either job runs. The AtomicSafetyHandle system runs in the editor and play mode only; it's compiled out of release builds^[8]. It's the single most useful feature of the Unity job system if you're new to multithreading, because it turns "an undetectable race" into "a build-time error."

3. Burst is a separate piece. [BurstCompile] hands the job's struct to Unity's Burst compiler, which translates a restricted dialect of C# called HPC# (high-performance C#) through LLVM into target-specific machine code with full SIMD codegen^[24]. The Unity.Mathematics vector types (float4, float3, int4) map directly onto hardware SIMD registers. The job system and Burst are orthogonal: you can use jobs without Burst, and you can Burst-compile functions that aren't jobs. Together they're typically why you see "10x to 100x" speedup numbers in Unity benchmarks. As of Unity 6.6 Burst is becoming built into the engine rather than a separate package^[24].

Awaitable is not the job system

Unity 6 added an Awaitable type with async/await integration. It's for main-thread sequencing ("wait until next frame", "wait for a download to finish"), not for parallel compute. The docs are explicit: "to get the most of multi-core CPUs and parallelize your algorithms, use the job system instead." Use both. They live at different layers.

13How Unreal does it

Unreal has two layered task systems, the old one and the new one, and learning Unreal multithreading in 2026 mostly means understanding which one to reach for.

TaskGraph (the legacy system)

The original Unreal task system, FTaskGraphInterface, is a job graph where each task has prerequisites and runs on either a specific named thread or "any thread." The named threads are GameThread, ActualRenderingThread, RHIThread, AudioThread, and a few others^[25]. The reason named threads exist is that many of Unreal's subsystems (UObjects, the RHI, the audio engine) are not thread-safe, so work that touches them has to be funneled onto a specific thread.

This works but it's rigid. As soon as you have several non-thread-safe APIs you can't share a thread, you either add more named threads or you queue everything onto one and bottleneck on it. Both options are bad. Hence the new API.

UE::Tasks (the new system)

Introduced in UE 5.0 and matured through 5.3+, UE::Tasks lives in Tasks/Task.h. The model is closer to what we built in §10: launch a task, attach prerequisites, the scheduler handles the rest. The signature interest is the Pipe primitive^[9]:

unreal_tasks.cpp · the new API

using namespace UE::Tasks;

// Independent task, no prereqs.
FTask a = Launch(UE_SOURCE_LOCATION, [] { DoPhysics(); });

// Dependent task: waits for a, then runs.
FTask b = Launch(UE_SOURCE_LOCATION, [] { DoCulling(); }, a);

// Multiple prereqs.
FTask c = Launch(UE_SOURCE_LOCATION, [] { BuildDrawLists(); },
                Prerequisites(a, b));

// Pipe: any tasks given the same pipe run serially.
static FPipe WorldPipe(TEXT("UWorld"));
Launch(UE_SOURCE_LOCATION, [] { TickActor(actor1); }, WorldPipe);
Launch(UE_SOURCE_LOCATION, [] { TickActor(actor2); }, WorldPipe);
// Both run on workers, but only one at a time.

A Pipe is the modern replacement for named-thread affinity. Instead of saying "this work has to run on the game thread because UWorld isn't thread-safe," you say "this work shares the UWorld pipe, run them one at a time in submission order." The scheduler can put successive pipe tasks on different worker threads as long as they don't overlap in time. That's much better load distribution than nailing everything to one thread.

Other Unreal pieces worth knowing

FRunnable / FRunnableThread: a single long-lived dedicated thread. Use for streaming workers, network threads, or anything that wants its own execution forever.
ParallelFor(N, lambda): parallel loop helper built on the task graph. Docs note it uses TaskGraph for "unbalanced tasks and offers better work distribution among threads at the cost of a little bit more synchronization"^[26].
TPromise<T> / TFuture<T>: standard producer/consumer future-promise pair. The producer eventually calls SetValue; the consumer waits on the future.
Render thread / RHI thread pipelining: the game thread runs frame N+1's simulation while the render thread builds frame N's commands and the RHI thread submits frame N−1 to the GPU^[27]. UE 5.4 (2024) extended this with multiple parallel RHI workers.

Where to look in the engine

If you want to read the source: Engine/Source/Runtime/Core/Public/Tasks/ for the new API, Engine/Source/Runtime/Core/Public/Async/TaskGraphInterfaces.h for the legacy one. Alex Stevens's Unreal Fest Gold Coast 2024 talk "How to Benefit from Multithreading in Your Unreal Engine Projects"^[28] is the best recent walkthrough, with a companion repository of measured examples.

14The 2026 state of the art

The next decade of job systems is going to look different from the last one. Three threads worth following.

C++26 senders and receivers

In June 2024 at the WG21 St. Louis plenary, the C++ standard committee adopted std::execution (paper P2300) into the C++26 working draft^[10]. It's the first time the C++ standard library has a vocabulary for describing parallel and async work that isn't tied to the OS thread API. Briefly:

A sender is a lazy description of an async operation. Think "a function object for sync code, but for async."
A receiver is a bundle of callbacks: set_value, set_error, set_stopped.
A scheduler is a factory for senders that complete on a particular execution context (CPU thread pool, GPU, custom).
Composition is via algorithms like then, let_value, when_all, start_detached.

stdexec_sketch.cpp · what a C++26 job graph looks like

using namespace std::execution;

auto sched = thread_pool.get_scheduler();

auto physics  = schedule(sched) | then([] { return runPhysics(); });
auto animation = schedule(sched) | then([] { return runAnim();    });
auto render    = when_all(physics, animation)
              | then([](auto&& physRes, auto&& animRes) {
                  buildDrawLists(physRes, animRes);
                });

sync_wait(render);

NVIDIA's stdexec is the reference implementation and you can use it today^[29]. The CUDA scheduler in the same repo demonstrates the design's intended payoff: the same sender pipeline can target either a CPU thread pool or a GPU, transparently. Whether engines actually adopt this is a separate question, but the vocabulary is now standard, which means library authors can write portable async code without rolling a runtime.

Structured concurrency

Nathaniel Smith's 2018 essay "Notes on structured concurrency"^[30] argued that unconstrained spawn (or go, or async) primitives are the concurrency analog of goto: they leak control flow out of lexical scope, making cancellation, error propagation, and resource cleanup intractable. The proposed alternative, a "nursery" that wraps a set of child tasks in a scope that cannot exit until all children complete, has since been adopted by Trio (Python), Kotlin coroutines, Swift concurrency, and the C++ senders/receivers proposal. For job systems, the practical impact is: an exception thrown by a leaf job propagates cleanly back to its parent, and cancellation of the parent cancels every child. Both are surprisingly hard to get right with raw thread pools.

GPU work graphs

The same idea is now appearing on the GPU. D3D12 work graphs, demoed at GDC 2024^[31], let a shader enqueue more shader work directly, in arbitrary DAG shapes, without going back to the CPU. The GPU manages its own scheduler. The CPU job system you just learned about is the model; the GPU side is finally catching up.

Why this section matters even if you skip the libraries

Sender/receiver shapes are showing up in real code, and "structured" gets used as a noun on multithreading slides now. Even if your engine never adopts std::execution, the underlying abstraction makes every other job system easier to read.

15Pitfalls and how to spot them

A list of things that will go wrong, what they look like, and how to fix them. Every one of these has bitten me or someone I know.

The ABA problem

A thread reads value A from a CAS slot, gets preempted, returns to find the slot still says A, and CASes successfully. But while it was away, the value went from A to B and back to A; the slot is no longer the same object, even though the CAS thinks it is^[32]. The classic fix is a tag/stamp packed into the high bits of the pointer (a "tagged pointer"), so two A's a million operations apart are distinguishable. Safe-memory-reclamation schemes like hazard pointers, RCU, or epoch-based reclamation also work.

Thread 1 wants to pop the top. It reads the top pointer (call it A) and computes the new top. Then Thread 2 races ahead: it pops A, pops B, then pushes back what is happens to live at the same address as the original A. Thread 1 wakes up. Its CAS sees A again and succeeds. But the next pointer is now stale.

Spin-loop pitfalls

Don't busy-spin on an atomic without telling the CPU. A tight while (!flag) burns power, starves the SMT sibling, and on Skylake-and-later trips a memory-order-violation pipeline flush when the wait ends. Use _mm_pause() on x86 (or __builtin_ia32_pause(), YieldProcessor()) which emits the PAUSE instruction^[33]. On ARM use __yield(). If you spin for more than a few hundred iterations, fall back to sleeping on a futex or event-count primitive.

spinwait.cpp · the right way

// Pattern: spin briefly with PAUSE, then fall back to sleeping.
// We hope the flag flips within a microsecond, but we don't want
// to waste a whole core if it doesn't.
auto startTime = std::chrono::steady_clock::now();
while (!readyFlag.load(std::memory_order_acquire)) {
  // Emit ~64 PAUSE instructions. PAUSE tells the CPU we're spinning,
  // so it can yield SMT resources to the sibling and avoid a pipeline
  // flush when the wait finally ends.
  for (int spins = 0; spins < 64; spins++)
    _mm_pause();

  // If we've been spinning too long, give up the core. Either park the
  // current fiber on the flag (best in a fiber-aware scheduler) or
  // sleep on a futex/event-count primitive.
  auto elapsedNs = std::chrono::duration_cast<std::chrono::microseconds>(
      std::chrono::steady_clock::now() - startTime).count();
  if (elapsedNs > 10) {
    parkFiberOrSleep(readyFlag);                         // then give up
    return;
  }
}

Priority inversion

A high-priority task is blocked by a resource held by a low-priority task, while a medium-priority task preempts the low one, indefinitely delaying the high one^[34]. The Mars Pathfinder mission hit this in 1997. The fix is priority inheritance (boost the holder's priority while the high-priority task waits) or just avoiding long lock-holds. For a job system, the practical risk is a long-running low-priority job parking a fiber the high-priority queue needs. Solve it by sizing the fiber pool generously.

Three threads: High, Med, Low. Low takes a lock. High arrives, blocks. Med wakes and preempts Low. High keeps waiting. Toggle "priority inheritance" to fix it: Low is temporarily boosted to High's priority while it holds the lock, so Med can't get in.

Memory ordering bugs

The worst kind: code that looks right on x86 because x86 is total-store-order and forgives most sins, but fails on ARM the moment you ship to a Switch or a phone. Russ Cox's "Hardware Memory Models" essay^[35] is the clearest walkthrough of why x86 forgives and ARM doesn't. Test on weak hardware. Run sanitizers (TSan finds most ordering bugs). When in doubt, use seq_cst and measure; you can always relax later if the measurement says it matters.

trials

r1=0, r2=0 (reorder)

other outcomes

Classic store-buffer test: both threads store 1 to X (then Y), then load Y (then X) into a local. On x86, you'd never see both reads return 0. On ARM, you can, because the store buffers and weak ordering allow the load to be observed before the store on the same thread completes. This is the bug class that fires the moment you ship to a Switch.

Hyperthreading surprise

Two SMT siblings share an L1 cache, execution units, store buffers, and TLB. Two heavy threads on one physical core fight each other for those resources, and may be slower than running serialized^[36]. If your engine has eight worker threads and the box has eight physical / sixteen logical cores, prefer to pin to physical cores when possible. Treat the SMT siblings as a bonus lane for asymmetric work (audio next to render is usually fine; two render workers on the same core is usually not).

throughput

···

L1 contention

···

Each physical core has two SMT lanes that share L1, the execution units, and the store buffer. When two heavy threads land on the same core, they fight over the same resources. Spread them across physical cores and throughput roughly doubles.

16Where to go from here

You now have a code-level picture of what a job system is and why every modern engine has one. The rest is reading other people's code: every shipped engine has details that surprise you, and the way to build instincts is to read a few.

Read these libraries

EnkiTS: Doug Binks's zlib-licensed task scheduler in plain C++^[37]. About 1,500 lines. No fibers. The cleanest small example of a production-quality scheduler.
Marl: Google's hybrid thread + fiber scheduler^[22]. Cross-platform fiber assembly, work-stealing, around 5,000 lines. Industrial-strength.
FiberTaskingLib: an explicit reimplementation of the Naughty Dog design^[38]. Reading this is the easiest way to understand the Gyrling talk.
Intel GTS: Intel's Games Task Scheduler^[39]. Optimized for game workloads, with detailed documentation of design choices.
NVIDIA stdexec: the reference C++26 senders/receivers implementation^[29]. Read for the future, not the present.

Read these papers

Blumofe and Leiserson, Scheduling Multithreaded Computations by Work Stealing^[20]. The mathematical foundation.
Chase and Lev, Dynamic Circular Work-Stealing Deque^[6]. The data structure.
Lê, Pop, Cohen, Zappa Nardelli, Correct and Efficient Work-Stealing for Weak Memory Models^[18]. The version you'd actually ship.
Frigo, Leiserson, Randall, The Implementation of the Cilk-5 Multithreaded Language^[19]. The work-first principle.

Talks

Gyrling, Parallelizing the Naughty Dog Engine Using Fibers, GDC 2015^[2]. The canonical AAA reference.
Tatarchuk, Destiny's Multithreaded Rendering Architecture, GDC 2015^[3]. The companion case study.
Parent, Better Code: Concurrency, code::dive 2016^[40]. The C++ futures-and-channels mindset.
Stevens, How to Benefit from Multithreading in Your Unreal Engine Projects, Unreal Fest Gold Coast 2024^[28]. The recent Epic perspective.

The final exam

Five questions covering the whole tutorial. If you can answer all five without scrolling back, you've got the fundamentals.

17Sources & further reading

Numbered citations refer to the superscripts above. Everything below is either freely available on the open web or linked from a GDC vault page.

A note on originality

The prose, code, CSS, and interactive demos on this page are original writing. Implementation details of the Chase-Lev deque follow Chase and Lev (2005) [6] with the correctness corrections from Lê et al. (2013) [18]; both are attributed at the point of use. Architecture numbers for the Naughty Dog system (6 workers, 160 fibers, 3 queues) come from the Gyrling GDC 2015 deck [2]. The Unity and Unreal API descriptions are paraphrases of the linked official documentation. The opening "fixed-thread era is over" framing tracks the Tatarchuk GDC 2015 talk's "designed from the ground up for job-based multithreading" assertion [3].

Albrecht, T. (2009). Pitfalls of Object-Oriented Programming. Game Connect: Asia Pacific. PDF. The canonical primary source for data-oriented design.
Gyrling, C. (2015). Parallelizing the Naughty Dog Engine Using Fibers. GDC. GDC Vault, slides PDF. The defining fiber-based job-system talk.
Tatarchuk, N. (2015). Destiny's Multithreaded Rendering Architecture. GDC. Slides. "A game engine must be designed from the ground up for job-based multithreading."
Genova, B. (2015). Multithreading the Entire Destiny Engine. GDC. GDC Vault. The Bungie engine-side companion talk.
Blumofe, R. D., Joerg, C. F., Kuszmaul, B. C., Leiserson, C. E., Randall, K. H., & Zhou, Y. (1995). Cilk: An Efficient Multithreaded Runtime System. PPoPP. ACM DL. The first practical work-stealing scheduler.
Chase, D., & Lev, Y. (2005). Dynamic Circular Work-Stealing Deque. SPAA. PDF. The deque used in essentially every modern scheduler.
Reinalter, S. (2015–2016). Job System 2.0: Lock-Free Work Stealing. Molecular Musings blog. Part 1, Part 2, Part 3, Part 4, Part 5. A complete worked C++ implementation.
Unity Technologies. Job System Overview / Unity Manual. Unity 6.3. docs.unity3d.com. Also: Entities job scheduling.
Epic Games. Tasks System in Unreal Engine. UE 5.7. dev.epicgames.com. Plus the Launch API.
Dominiak, M., Baker, L., Howes, L., Shoop, K., Garland, M., Niebler, E., & Adelstein Lelbach, B. (2024). P2300R10 std::execution. WG21 paper, adopted into C++26 at the St. Louis plenary, 29 June 2024. wg21.link.
Andersson, J. (2010). Parallel Futures of a Game Engine (v2.0). Stockholm Game Developer Forum. SlideShare. DICE's Frostbite engine job-graph design.
Lemire, D. (2023). Measuring the size of the cache line empirically. lemire.me. 64 B on x86-64, 128 B on Apple M-series.
Wikipedia. False sharing. en.wikipedia.org. "If any data in a cache line is modified, the entire cache line must be synchronized."
alic.dev (2023). The False Sharing Penalty. alic.dev/blog/false-sharing. A measured benchmark showing roughly 2× slowdown from false sharing on x86.
Lemire, D. (2021). Memory access on the Apple M1 processor. lemire.me.
Vyukov, D. Bounded MPMC queue. 1024cores. 1024cores.net (mirror: sites.google.com). The reference MPMC design.
Preshing, J. (2012). Acquire and Release Semantics. preshing.com. The canonical practitioner explanation of memory ordering.
Lê, N. M., Pop, A., Cohen, A., & Zappa Nardelli, F. (2013). Correct and Efficient Work-Stealing for Weak Memory Models. PPoPP. PDF. The version of Chase-Lev that's correct on ARM and POWER.
Frigo, M., Leiserson, C. E., & Randall, K. H. (1998). The Implementation of the Cilk-5 Multithreaded Language. PLDI. PDF. The work-first principle.
Blumofe, R. D., & Leiserson, C. E. (1999). Scheduling Multithreaded Computations by Work Stealing. J. ACM. PDF. Proves work stealing is asymptotically optimal.
Microsoft. Fibers (Win32). Microsoft Learn. learn.microsoft.com.
Google. Marl: a hybrid thread / fiber task scheduler written in C++ 11. GitHub: github.com/google/marl. Cross-platform fiber implementation.
Various (2024). Stackless vs. Stackful Coroutines. Discussions and primary references: cppreference; Boost.Fiber overview at boost.org.
Unity Technologies. Burst User Manual. Package 1.8. docs.unity3d.com. Also: Burst becoming built-in in 6.6.
Epic Games. FTaskGraphInterface API. dev.epicgames.com.
Epic Games. ParallelFor API. dev.epicgames.com.
Epic Games. Parallel Rendering Overview. UE 5.7. dev.epicgames.com.
Stevens, A. (2024). How to Benefit from Multithreading in Your Unreal Engine Projects. Unreal Fest Gold Coast. Talk page; companion repo.
NVIDIA. stdexec: reference implementation of P2300. GitHub: github.com/NVIDIA/stdexec.
Smith, N. J. (2018). Notes on structured concurrency, or: Go statement considered harmful. vorpus.org. Plus Roman Elizarov's independent Kotlin treatment.
Chajdas, M. & Hargreaves, S. (2024). D3D12 Work Graphs. GDC. gpuopen.com. GPU-side work-graph scheduling.
Wikipedia. ABA problem. en.wikipedia.org.
Intel. PAUSE: Spin Loop Hint. felixcloutier.com.
Wikipedia. Priority inversion. en.wikipedia.org. The Mars Pathfinder incident is the classic.
Cox, R. (2021). Hardware Memory Models (Memory Models, Part 1). research.swtch.com.
Wikipedia. Simultaneous multithreading. en.wikipedia.org.
Binks, D. enkiTS: a permissively-licensed, lightweight, embeddable, task scheduler. GitHub: github.com/dougbinks/enkiTS.
RichieSams. FiberTaskingLib. GitHub: github.com/RichieSams/FiberTaskingLib. A reimplementation of the Naughty Dog architecture.
Intel. GTS: Games Task Scheduler. GitHub: github.com/GameTechDev/GTS-GamesTaskScheduler.
Parent, S. (2016). Better Code: Concurrency. code::dive. YouTube. "No raw synchronization primitives."
Frykholm, N. (2010). Task Management: A Practical Example. Bitsquid blog. bitsquid.blogspot.com. The single-global-queue counterpoint to work-stealing.
Desrochers, C. (2014). moodycamel::ConcurrentQueue. GitHub: github.com/cameron314/concurrentqueue; design write-up at moodycamel.com.