Memory Allocators
from Scratch
A 60 Hz frame is 16.6 milliseconds. A generic malloc on a contested heap can spike past a millisecond on its own, and a shipping game allocates and frees tens of thousands of times per frame. The fix is not a faster malloc, it's a different allocator for each kind of lifetime: a bump pointer for per-frame scratch, a pool for fixed-size particles, a buddy or TLSF for the texture heap, a slab for the entity cache, and a generational handle so a freed slot can be reused without a stale pointer slipping through. We build every one of them, with a live demo for each, and we cite the 1965 paper that started it, the 2004 paper that Unity's dynamic heap is built on, and the 1994 paper Linux's SLUB still descends from.
01Why every engine writes its own allocator
A generic malloc has a hard job. It must serve any size, from a 12-byte node to a 16-megabyte texture, with no idea how long the caller will hold it, from any thread, and with no help from the caller about which of those it's about to do next. The price of that generality is a free-list search, a heap-lock acquisition, sometimes a syscall, and an answer whose worst case can be hundreds of times slower than its median. An engine knows more than that. It knows the per-frame scratch lifetime is one frame. It knows particles are 96 bytes each. It knows the texture pool needs alignment to 4 KB. The job of an engine allocator is to turn that domain knowledge into faster code with simpler invariants, without ever surprising the frame budget.
The widget below samples allocation latencies from two allocators on the same workload. The first is the system malloc under realistic fragmentation. The second is a linear bump allocator. Both report a median in the same ballpark. The tail is the story: malloc has a heavy right tail where the rarest calls hit a page fault, a free-list scan, or a heap-lock wait, and the worst sample is two orders of magnitude over the median. The bump allocator has effectively no tail because every call does the same three instructions:
A working understanding of the six allocator families that show up in every shipping engine: linear, stack, pool, buddy, TLSF, slab. A live race that pushes the same workload through each and counts the holes it leaves behind. A working AoS → SoA trace, and a generational-handle implementation that rejects a stale handle the same way a Bevy Entity or an EnTT entity does[2][3]. By the end you'll know why Unity's dynamic heap uses TLSF[4], why Unreal's per-frame allocator is called FMemStack and not FMemPool[5], why the slab paper from 1994 is still the right answer for object caches[6], and why "buddy wastes 25% on average" is folklore[7].
Four questions every allocator answers
The rest of the tutorial walks one allocator per section, and each one is built to answer four questions in a different way. You can mostly predict an allocator's behavior from the answers:
- What sizes does it serve? One fixed size (pool, slab), a small set of size classes (TLSF, slab caches), powers of two (buddy), or anything (linear, free-list).
- What lifetime does it serve? A single frame (linear, stack), a single scope (stack with a marker), object-by-object (pool, slab, buddy, TLSF), or anywhere from a frame to the whole session (general-purpose).
- What's the worst-case time for one allocation? Constant (linear, stack, pool, slab fast path, TLSF), logarithmic in the heap size (buddy), or unbounded (first-fit, best-fit on long free lists).
- What kind of waste does it produce? No waste (linear), internal fragmentation only (pool, buddy, slab), external fragmentation (general-purpose, free-list), or some bounded combination.
02A short history of getting allocation right
The C library's malloc didn't drop fully formed. Its shape is the cumulative answer to sixty years of "this is faster on real hardware" arguments, each of which forced someone to write a better allocator down. A tour, because the modern set only makes sense in context:
__alloc_pages sixty years later.
apr_pool implement.
SmallObjectAllocator: per-size pools chained by index, with a free list embedded in the empty slots themselves. The C++ template-library spelling of an idea that was already in every game engine and most kernels.
std over the next decade, including the polymorphic memory resources of C++17[20].
03Under the call to malloc
Before any of the specialized allocators make sense, the picture under a generic malloc needs to be on paper. The call doesn't reach for a magic heap; it reaches for a data structure the C library has been maintaining since the program started, and the cost of one call is the cost of finding a fit in that structure, plus a system call when the structure is empty.
A generic heap is a region of address space, broken into blocks. Each block has a header (size, free/used flag, sometimes a back-pointer for coalescing) and a payload that the caller sees. Free blocks are threaded onto one or more free lists, sorted by size or by address depending on the policy. A call to malloc(n) finds a free block whose payload is at least n bytes, splits off the remainder, and returns the payload pointer. A call to free(p) looks at the header in front of p, marks it free, and tries to coalesce with the previous and next blocks via boundary tags[10].
Three things on the slow path are what make malloc spike past a millisecond:
- The free-list scan. First-fit walks the list; best-fit walks all of it; segregated fits index by size class. On a heap with thousands of free blocks, a linear scan is bounded only by how many fits exist, not by the request.
- The heap lock. Modern allocators (jemalloc, tcmalloc, mimalloc) keep a thread-local cache so most calls never touch a lock; the kernel-allocator lineage from Bonwick & Adams 2001[8] uses per-CPU magazines for the same purpose. Either way, the global tier underneath is still a mutex, and a spike happens when two threads need it at once.
- The system call. When the heap is empty, the allocator asks the kernel for more pages with
mmaporsbrk. The kernel zeroes the pages on first touch (a page fault per 4 KB). One page fault is microseconds, not nanoseconds; a thousand of them in a frame is the entire budget.
Every allocator in this tutorial removes one or more of those costs by giving up some of the generality. A linear allocator removes the free-list scan by removing the free list. A pool removes the size search by fixing the size. A slab removes the constructor cost by reusing the slot. Each is a worse general-purpose allocator and a better one for its workload.
"Calling new" in C++ does two things: operator new (which calls malloc) plus the constructor. The constructor is yours. The malloc half is what this tutorial replaces. operator new[] on a non-trivial type also stores an array cookie ahead of the returned pointer, which makes delete and delete[] not interchangeable. None of the specialized allocators in the rest of this tutorial use operator new; they all take raw bytes and let the engine code construct in place with placement new.
04The linear (bump) allocator
The simplest allocator that's still useful. One pointer. Every allocation moves the pointer forward by the requested size. There is no per-object free. The entire region resets in one instruction at the end of the frame, the level, or whatever lifetime the caller picked. Unreal calls its version FMemStack[5]; Naughty Dog ships the same pattern under its own name for per-job scratch[24]; Apache calls a sibling design apr_pool. Hanson 1990 named the academic idea[13]; Ryan Fleury's "Untangling Lifetimes" essay is the clearest modern walkthrough[26].
The widget is the picture you keep in your head. Click alloc 32 to grab 32 bytes; the head pointer moves. Free is the reset button: one assignment, all 256 bytes released at once. Try to allocate past the end and you get the only failure mode this allocator has, out-of-memory:
The code, in 30 lines
// One pointer of state. No headers, no free lists, no locks. struct LinearAllocator { char* base; // start of the region char* head; // next free byte char* end; // one past the end void init(void* memory, size_t bytes) { base = head = (char*)memory; end = base + bytes; } // Allocate `size` bytes aligned to `alignment` (default 16, SIMD-friendly). void* allocate(size_t size, size_t alignment = 16) { // Round head up to alignment (alignment is required to be a power of two). uintptr_t raw = reinterpret_cast<uintptr_t>(head); uintptr_t aligned = (raw + alignment - 1) & ~(uintptr_t(alignment) - 1); char* candidate = reinterpret_cast<char*>(aligned); if (candidate + size > end) return nullptr; // out of memory head = candidate + size; return candidate; } // No per-object free. The whole region resets in one assignment. void reset() { head = base; } size_t used() const { return head - base; } size_t capacity() const { return end - base; } };
Every line of the allocator's hot path is on screen. allocate is a load, a round-up, a compare, a store. reset is a single store. The only branch is the out-of-memory check, and the alignment math is a power-of-two mask. On every architecture, this is in the noise of a frame.
This is the teaching version, not a shipping one. A production linear allocator adds: a chain of pages so it can grow when one region fills, an explicit "checkpoint" or "mark" API so the engine can rewind to a saved head (that's the stack allocator in the next section), a debug build that overwrites freed memory with 0xCD on reset to catch use-after-reset bugs, and a thread-local instance so calls don't need a lock. Frykholm's "Custom Memory Allocation in C++" walks through every one of these layers[27].
05The stack allocator (linear with markers)
A linear allocator with one extra rule: you can save the head pointer ("push a marker") and rewind to it later ("pop to a marker"). Allocations are still pointer bumps; freeing happens in LIFO order at the granularity of a marker, not an individual object. This is what the C call stack is, generalized: a function pushes its locals on entry and pops them on exit, and the caller does the same one level up. Engines use the same shape for nested scopes that have their own scratch: load a level, push a marker, allocate everything the loader needs, pop the marker, and every byte the loader touched is free in one instruction.
Try the widget. Allocate a few blocks, hit save marker, allocate a few more, then pop to marker. Everything after the marker is gone in one step; everything before it stays:
The marker pattern in code
struct StackAllocator { char* base; char* head; char* end; using Marker = char*; // just the head value at the time of save Marker mark() const { return head; } void pop(Marker m) { head = m; } // rewind to the saved head void* allocate(size_t size, size_t alignment = 16) { uintptr_t raw = reinterpret_cast<uintptr_t>(head); uintptr_t aligned = (raw + alignment - 1) & ~(uintptr_t(alignment) - 1); char* candidate = reinterpret_cast<char*>(aligned); if (candidate + size > end) return nullptr; head = candidate + size; return candidate; } }; // RAII helper so a scope's allocations free automatically. class StackScope { StackAllocator& allocator; StackAllocator::Marker savedHead; public: explicit StackScope(StackAllocator& a) : allocator(a), savedHead(a.mark()) {} ~StackScope() { allocator.pop(savedHead); } // pop on scope exit }; // Use: void loadLevel(StackAllocator& frame) { StackScope scope(frame); // save marker auto* manifest = (Manifest*)frame.allocate(sizeof(Manifest)); parseManifest(manifest); loadAssets(frame, manifest); } // every allocation in here is freed here
The StackScope destructor is the entire lifetime story. Anything the function allocates from frame is released when the function returns. Anything allocated by a caller's scope before loadLevel is untouched, because the caller's marker is lower in the stack. This is the discipline Unreal's FMemMark documents and what Frostbite's "Scope Stack Allocation" paper named[28].
06The pool allocator (fixed-size free list)
One size of slot, many slots, a free list threaded through the empty ones. Allocate pops the head of the free list. Free pushes the slot back onto the head. Both operations are O(1) and the slots are interchangeable, which gives the allocator a property linear allocators don't have: per-object free in any order, with no external fragmentation, because every hole is exactly the same shape as every fit. Wilson's survey calls this "simple segregated storage"[1]; Alexandrescu's Loki SmallObjectAllocator is the C++ template version[15]; Boost.Pool is the canonical OSS one[29].
The fragmentation immunity is what's worth seeing. Click alloc, alloc, alloc, then free the middle one. The hole is a single slot, the same size as every other slot, and the next alloc reuses it without searching. There is no "I have 96 bytes free but no contiguous block big enough" failure mode here, because every block is already the right size:
The free list lives in the empty slots
template <typename T, size_t kSlotCount> class PoolAllocator { static_assert(sizeof(T) >= sizeof(void*), "slot must be big enough to hold a free-list pointer"); // Storage: kSlotCount slots, each sized like T and aligned for T. alignas(T) char storage[kSlotCount * sizeof(T)]; // Free-list head: index of the next free slot, or -1 if pool is full. int freeHead = 0; char* slotPtr(int i) { return storage + i * sizeof(T); } public: PoolAllocator() { // Thread the free list through every slot: slot[i] -> slot[i+1], last -> -1. for (int i = 0; i < (int)kSlotCount - 1; ++i) *reinterpret_cast<int*>(slotPtr(i)) = i + 1; *reinterpret_cast<int*>(slotPtr(kSlotCount - 1)) = -1; } // O(1): pop the free-list head. T* allocate() { if (freeHead < 0) return nullptr; // pool empty char* slot = slotPtr(freeHead); int nextHead = *reinterpret_cast<int*>(slot); freeHead = nextHead; return reinterpret_cast<T*>(slot); } // O(1): push the slot back onto the free-list head. void deallocate(T* p) { int idx = (reinterpret_cast<char*>(p) - storage) / sizeof(T); *reinterpret_cast<int*>(slotPtr(idx)) = freeHead; // next = old head freeHead = idx; // new head } };
The clever bit is that the free list costs zero memory: each empty slot's first few bytes are the link, since the slot is empty by definition. A pool of 24 slots of 96-byte particles uses exactly 24 * 96 = 2304 bytes plus a single 4-byte head index. Compare to a generic allocator's header overhead (typically 8 to 16 bytes per allocation), and a pool of small objects pays back the constant factor in cache occupancy alone.
Pools are immune to external fragmentation (free space between allocations) but pay internal fragmentation (slot bigger than what the caller actually needed). A pool of 96-byte slots holding 80-byte payloads wastes 16 bytes per slot, forever. The fix is to have several pools, one per common size, and pick the smallest pool whose slot fits. That's what the slab allocator does at scale; it's what TLSF does with size classes; and it's what every "small-object allocator" in production does, including Loki[15].
07The buddy allocator
A pool serves one size. A buddy allocator serves many, with one rule that makes coalescing on free almost free: every block is a power-of-two size, and every block has a buddy at a fixed address (its address XOR'd with its size). When a block frees, the allocator checks whether the buddy is also free, and if so the two coalesce into a block of the next power of two, recursively. Allocation is the inverse: find the smallest power-of-two block that fits the request, splitting larger free blocks if no suitable one exists. Linux's __alloc_pages has used a buddy at the page-allocator level since the kernel had a memory manager[30]; AMD's Vulkan Memory Allocator uses one for GPU memory before TLSF[18]; BitSquid documented a CPU buddy in 2015[31].
The widget is a binary tree where every node is a block of a power-of-two size. Click a leaf to allocate; the leaf darkens. Click a freed leaf to free; if its buddy is also free, both coalesce up the tree:
Why coalescing is one XOR
A buddy is determined by address. If a block is at offset a with size s (both powers of two, with a a multiple of s), the buddy of that block sits at offset a ⊕ s. So when block B frees, the allocator computes B's buddy address, looks up that address in the free table, and if it's free, merges. The merged block has size 2s and starts at the lower of the two buddy offsets, a & ~s; its own buddy at the next level sits at that offset ⊕ 2s. Coalescing terminates in O(log heap_size) steps in the worst case (no extra work after one merge fails).
// Given a block at offset `a` with size `s` (both powers of two, s aligned), // the buddy of this block is at offset (a XOR s). Free both and they merge. size_t buddyOf(size_t offset, size_t size) { return offset ^ size; } // Round a request up to the smallest power-of-two block that fits. size_t roundUpPow2(size_t n) { if (n <= 1) return 1; return 1ULL << (64 - __builtin_clzll(n - 1)); // 1 << ceil(log2(n)) } // Allocate: find the smallest free block of `size` (a power of two). // If none exists, split a larger free block recursively. Block* allocate(size_t size) { size_t classIdx = log2(size); // which size-class list for (size_t i = classIdx; i < kMaxClasses; ++i) { if (!freeList[i].empty()) { Block* big = freeList[i].pop(); while (i > classIdx) { // split down to the requested size --i; Block* upper = split(big); // upper-half buddy freeList[i].push(upper); } return big; } } return nullptr; // no free block large enough } // Free: try to merge with the buddy; repeat up the tree. void deallocate(Block* p) { size_t sz = p->size; while (sz < kMaxBlockSize) { size_t buddy = buddyOf(p->offset, sz); Block* b = blockAt(buddy); if (!b->isFree || b->size != sz) break; // buddy unavailable freeList[log2(sz)].remove(b); if (buddy < p->offset) p = b; // keep the lower-address block sz *= 2; } freeList[log2(sz)].push(p); }
Buddy is the right tool when the request sizes are roughly power-of-two-shaped: 4 KB pages, GPU texture mips (which are literally power-of-two-sized), variable-size scratch arenas. It's the wrong tool when sizes don't bunch near powers of two, because the internal-fragmentation tax is paid on every allocation. A request for 65 bytes gets a 128-byte block. That's a 49% waste, every time, until the buddy at offset 65..127 happens to be useful for another 65-byte request, which on real engine workloads it rarely is.
08TLSF: a general-purpose allocator with a worst case
Linear, stack, pool, and buddy each give up some of the generality of malloc to win constant time. TLSF (Two-Level Segregated Fit) goes the other direction: keep the generality of a free-list allocator (any size, any lifetime, no special workload assumption) but index the free lists with a two-level bitmap that the hardware's bit-scan instruction can search in one cycle. The result is a ฮ(1) allocator with a bounded worst case: the SPE 2008 paper reports a malloc path of fewer than 200 processor instructions on the x86 of the era[17]. Unity's dynamic heap uses TLSF[4], AMD's VMA defaults to it since v3.0[18], and MorphOS ships it as its system allocator[32].
The two-level structure is the whole trick. Every free block is bucketed by its size into a first-level class indexed by its power-of-two bucket, and a second-level class that linearly subdivides that bucket into 2SLI sub-buckets (shipping implementations typically pick SLI = 4 or 5; the widget below uses SLI = 2 for visual density). Each (first, second) pair has its own free list. Two bitmaps record which lists are non-empty:
Two bit-scans, two masked AND operations. FL_bitmap tells you which power-of-two buckets have any free block. The mask hides classes smaller than what you need. SL_bitmap[fl] tells you which subdivision within the chosen power-of-two bucket has a fit. Each bit-scan is one instruction on x86 and ARM. The whole search is six or seven instructions regardless of how full the heap is.
The widget below visualises the bitmaps. Pick a request size; the first-level scan lights up the smallest non-empty class at or above your size, and the second-level scan lights up the smallest sub-bucket within that class. The free list at that (fl, sl) position has the answer, popped in O(1):
The size-class encoding
For a request size n, a smallest block size 2FLI (16 bytes in the widget, so FLI = 4), and a chosen SLI (number of second-level bits), the class indices are:
Within a power-of-two bucket of size [2f, 2f+1), the second-level index sl is the position of n in that range, scaled to 2SLI sub-buckets. With SLI = 4, each power-of-two range splits into 16 sub-buckets: the 1024..2048 range gets sub-buckets at 1024, 1088, 1152, ..., each 64 bytes wide. Worst-case waste from rounding up to the next bucket is 2f โ SLI bytes, which is at most 1 / 2SLI of the requested size: with SLI = 4 the request is rounded up by at most 6.25%, with SLI = 5 at most 3.1%.
The bitmap is what makes TLSF ฮ(1). Compare to a segregated free list with the same size classes but no bitmap: finding the next non-empty class requires a linear scan over class indices, which is O(N classes). On a 64-class heap that's 64 indirect-load comparisons. The bitmap collapses all of them into one bit-scan instruction. The 2008 paper bounds the malloc path at fewer than 200 processor instructions on x86 of the era[17]; on modern CPUs the count is comparable, because the bit-scan is still one cycle and the rest is one cache-line load and a few list-link writes.
Matt Conte's mattconte/tlsf on GitHub is the canonical OSS implementation: BSD-licensed, ~1000 lines of C, 4-byte per-block header overhead, ~3 KB of allocator state per pool[33]. AMD's VMA bundles a C++ adaptation; Unity's native heap embeds a private variant.
09The slab allocator
Bonwick's 1994 paper started with one observation: in a kernel, allocating an object isn't just paying for the bytes, it's paying for the constructor. A network socket struct needs its lock initialized, its buffer pointers cleared, its linked-list links zeroed. If you free the socket and later allocate another one, you pay for that initialization again. The slab idea is to keep freed objects in their partially-constructed state so the next allocate just hands one back, with the cold-start init amortized across reuses[6]. The 2001 follow-up added per-CPU magazines so the fast path doesn't touch a lock at all[8]. Linux's SLUB (the descendant that finally retired SLAB in 6.8) runs on every Linux system in production[25].
A slab is a contiguous region of memory carved into objects of one type. A slab cache is a set of slabs, plus a partial list (slabs with some free, some used), a full list (no free objects), and an empty list (all free, ready to be returned to the page allocator). Allocation finds a partial slab, pops a free object. Free pushes it back; if the slab becomes empty, it moves to the empty list and may be reclaimed.
The widget below shows three slabs of a single object type. Each slab is a row of slots, labelled empty / partial / full. The next allocation comes from a partial slab. The magazine column on the right is a per-CPU stack of recently-freed objects: pushes and pops here are lock-free (a single CAS on a counter), and only when the magazine over- or underflows does the slow path through the cache layer fire:
Why the slab is a great default for entity caches
An ECS spawns and despawns enemies, bullets, pickups, particles, audio sources. Each type has a fixed object size. The allocation pattern is bursty (a wave of 30 enemies, then 200 bullets) and locality matters (iterating enemies should be cache-friendly). The slab cache hits every requirement:
- Fixed size per type, so no internal fragmentation beyond the slab's per-object header (usually 4 to 8 bytes, often elidable).
- Contiguous storage, so iterating live objects in a slab is sequential, prefetchable, and free of pointer-chasing. This is an AoS layout at the object level: traversals that touch every field of every object hit the line economy the next section describes.
- Per-CPU magazines, so the hot path is lock-free in the absence of cross-thread frees.
- Partial-slab promotion/demotion, so empty slabs can be returned to the page allocator and used for something else when the working set shrinks.
Frykholm's "Custom Memory Allocation in C++"[27] and Reinalter's Molecule engine memory series[34] both describe pool-and-slab-shaped layouts for gameplay entity types. The structural similarity to an ECS archetype is hard to miss: an archetype chunk is a contiguous run of one (component-set) shape, indexed by entity slot, with allocation reduced to "take the next empty slot in the partial chunk." The magazine layer is the part archetypes don't usually replicate, because the threading model is different.
10Fragmentation: internal, external, and why it isn't what you think
Fragmentation is the gap between "bytes free" and "bytes usefully allocatable". There are two kinds, and they sit at opposite ends of every allocator's tradeoff. Internal fragmentation is space wasted inside a block: a 65-byte request in a 128-byte buddy slot wastes 63 bytes; a 12-byte object in a 96-byte pool slot wastes 84. External fragmentation is space wasted between blocks: a free-list allocator with thousands of 16-byte holes and a request for 4 KB has plenty of bytes free but no usable block. Pools have only internal fragmentation by construction; linear allocators have neither (until they fail entirely); buddy has predictable internal fragmentation; general-purpose malloc can develop arbitrary external fragmentation.
The widget below runs the same allocation workload (random sizes, random lifetimes from a bimodal distribution, ~10,000 ops) through four allocators in parallel. Watch the largest contiguous free block over time: pool stays at maximum because all slots are interchangeable; buddy decays slowly because coalescing handles most frees; TLSF decays slowly thanks to size-class fits; the naive first-fit free-list decays the fastest and never recovers:
The worst case is theoretical, the typical case isn't
Robson 1971 proved that any non-relocating allocator has a worst-case overhead bound of ฮ(M log n), where M is the maximum simultaneous live bytes and n is the ratio of largest to smallest allocation[12]. The proof constructs an adversarial sequence of allocations and frees that no clever policy can survive without paying that bound. This is real, and it's why moving garbage collectors can guarantee what a native heap can't.
Wilson, Johnstone, Neely, and Boles 1995 ran the other direction: measured allocators on real C and C++ programs, with the lifetime patterns those programs actually produce, not synthetic random traces[1]. The finding, refined in Johnstone and Wilson's 1998 follow-up[14]: real programs fragment far less than worst-case theory predicts. Address-ordered best-fit and segregated fits both hit near-zero fragmentation on most benchmarks once header and alignment overhead are accounted for separately. The bimodal lifetime distribution of real programs (most objects die young; a small fraction lives forever) does most of the work; the allocator just has to not actively make it worse.
The practical implication is the inversion of the textbook advice. The theory says any allocator can be adversarially fragmented; the measurements say the choice of policy mostly doesn't matter for real workloads except when the workload is itself synthetic. A shipping engine that suffers fragmentation is almost never suffering allocator theory; it's suffering a long-lived allocation pattern (texture pool, level state, asset cache) that pinned one part of the heap and prevents coalescing of everything around it. The fix is structural (segregate by lifetime) more often than algorithmic (better allocator)[35].
The numbers that exist are GDC slides and developer blog posts. MacDougall's 2016 talk for Sony London Studio has concrete heap-fragmentation traces[35]; Dubรฉ's blog series reports 10 to 30 MB of fragmentation on an unnamed Xbox 360 title[36]; Gyrling's Naughty Dog talk covers allocator architecture without measured fragmentation[24]. Treat any "Crysis fragmented X%" or "Far Cry hit a Y MB hole" claim without a slide URL as developer folklore.
11Array-of-Structures vs Structure-of-Arrays
Allocators decide where memory comes from. Layout decides how it's arranged, and on the same CPU, with the same algorithm, the layout is what decides whether the CPU is fed or stalled. The headline experiment is Tony Albrecht's "Pitfalls of Object-Oriented Programming" from 2009[21]. On a PS3 scene-tree traversal, he reports four data points: 19.2 ms for the original OO graph of node objects, 12.9 ms after reordering the node data, 4.8 ms after switching to homogeneous arrays with linear traversal (the SoA step itself, about 4ร faster than the baseline), and 3.3 ms after adding software prefetch on top. About 6ร end-to-end, of which roughly 4ร is the layout change and the remainder is prefetching. The "20ร speedup from SoA" that gets repeated online has no shipping-workload citation behind it; Albrecht's numbers are the ones to anchor your intuition on.
The difference is a cache question. A 64-byte cache line holds, say, 4 entities at 16 bytes each (AoS) or 16 floats from one field (SoA). A loop that touches one field per entity loads the entire entity in AoS and four entities of unused fields per cache miss; the same loop in SoA loads 16 floats of just the field it cares about, which is the optimal use of the line.
struct Entity { vec3 position; // 12 B vec3 velocity; // 12 B float health; // 4 B uint32_t flags; // 4 B }; // 32 B per entity Entity entities[N]; // Loop touches positions only: for (int i = 0; i < N; ++i) entities[i].position.x += dt; // Each cache line: 64 B / 32 B = 2 entities, // of which 8 / 32 = 25% is the field we want.
struct Entities { vec3* positions; vec3* velocities; float* healths; uint32_t* flags; int count; }; // Loop touches positions only: for (int i = 0; i < entities.count; ++i) entities.positions[i].x += dt; // Each cache line: 64 B / 12 B โ 5 positions, // 100% used. ~4x the useful bandwidth per miss.
The widget below traces cache-line usage for both layouts. Pick a workload (touch one field, or touch all fields). The bars on the left show cache lines loaded; the colored fraction is the useful bytes. AoS wins when the workload touches every field of every entity (because the entity is already on the line); SoA wins when the workload touches one or two fields, which is the common case in physics, rendering, and AI passes:
When AoS is right anyway
SoA isn't always the answer. If you iterate every field of every entity together (a serializer, a debug visualizer that draws all state, an entity that's used as a single unit), AoS is fine and the indirection of SoA's separate arrays costs you. The right rule is the one Acton's CppCon 2014 talk repeats throughout[38]: look at the dominant access pattern, lay out the data for that pattern, and accept that the minor patterns will pay for it. Most engine systems have one dominant pass (physics integrates position and velocity, the renderer iterates transforms, the audio system iterates AudioSources); design for it, not for hypothetical generality.
An ECS "archetype" is exactly SoA at the storage level: an archetype is a unique combination of component types, and all entities with that combination live in one chunk where each component type is a contiguous array. EnTT's storage[3] and Bevy's storage[2] both implement this. The performance argument is the one in the widget above: a physics system that integrates position and velocity touches only those two arrays of the archetype's chunk, with full cache-line utilization.
12Generational handles: the right pointer for a slot you'll free and reuse
A raw pointer is what the compiler hands you: a 64-bit address into the address space, valid until something frees it. The "until something frees it" is the trouble. In a game with thousands of entities being spawned and destroyed every second, holding a raw pointer to an entity is a contract you can't enforce: by the time you dereference it, the slot may belong to a different entity. The C++ answer is shared_ptr and weak_ptr, which charge an atomic refcount and a control block per object. The engine answer is the generational handle: 32 or 64 bits split into an index (which slot) and a generation (which generation of that slot). Freeing bumps the generation. A handle is valid if and only if handle.gen == slot.gen. Stale handles fail loudly; live ones cost a single cache-line load to validate. Frykholm's 2011 BitSquid post is the canonical description[23]; floooh's "Handles are the better pointers" is the modern blog standard[39]; Bevy's Entity[2] and EnTT's entt::entity[3] are the shipping ECS implementations.
The widget shows the lifecycle. Spawn returns a handle. Free a slot bumps that slot's generation. A handle stored from before the free still points to the same slot index but has the old generation, and the lookup correctly rejects it as stale:
The 32-bit handle
struct Handle { uint32_t bits; // 24 bits index | 8 bits generation static Handle make(uint32_t idx, uint32_t gen) { return { (idx & 0xFFFFFF) | (gen << 24) }; } uint32_t index() const { return bits & 0xFFFFFF; } uint32_t generation() const { return bits >> 24; } }; template <typename T> class HandleTable { struct Slot { T value; uint32_t generation; // bumps on free bool alive; }; std::vector<Slot> slots; std::vector<uint32_t> freeIndices; // indices ready for reuse public: Handle spawn(const T& v) { uint32_t i; if (!freeIndices.empty()) { i = freeIndices.back(); freeIndices.pop_back(); } else { i = (uint32_t)slots.size(); slots.push_back({}); } slots[i].value = v; slots[i].alive = true; return Handle::make(i, slots[i].generation); } void free(Handle h) { if (!isValid(h)) return; slots[h.index()].alive = false; slots[h.index()].generation++; // retires every prior handle to this slot freeIndices.push_back(h.index()); // slot can now be reused } bool isValid(Handle h) const { const uint32_t i = h.index(); return i < slots.size() && slots[i].alive && slots[i].generation == h.generation(); } T* get(Handle h) { return isValid(h) ? &slots[h.index()].value : nullptr; } };
24 bits of index supports 16,777,216 entities, which is enough for any shipping game. 8 bits of generation supports 256 distinct generations per slot before it wraps. Wrap is the only failure mode: after 256 free-and-reuse cycles on the same slot, a handle from generation N would validate against generation N+256 incorrectly. The mitigation is to widen the generation field (typical AAA shapes are 16 or 24 bits), or to track separately whether a slot has ever wrapped and conservatively reject ambiguous handles. EnTT uses 12 bits of generation by default[3]; Bevy uses 32 bits[2].
A shared_ptr charges an atomic refcount per pointed-to object plus a separately-allocated control block (libstdc++ uses 24 bytes plus the deleter; libc++ similar; either way the overhead lives on the heap and is read on every copy). A weak_ptr on top of that to detect death is a second indirection. A generational handle pays one 32-bit word per handle, one 4-byte field per slot, and an O(1) lookup. Validation is a single compare. Cross-thread sharing is a value copy. There is no atomic per access, because the handle is a value: it doesn't refer to memory the way a pointer does, it refers to a slot whose memory the table owns. P0661 proposed standardising the pattern as std::slot_map[40]; the WG21 committee didn't adopt it, but everyone shipping built one anyway.
13Cheat sheet
A page you can paste into your engine's contributing guide. For each allocator: what sizes, what lifetimes, what worst-case time, what waste, and what shipping engine uses it where.
| Allocator | Sizes | Lifetimes | Worst-case | Waste | Ship example |
|---|---|---|---|---|---|
| Linear / bump | Any | Bulk-free only | O(1) | None until full | UE5 FMemStack |
| Stack (LIFO markers) | Any | Nested scopes | O(1) | None until full | Frostbite scope-stack[28] |
| Pool (fixed-size) | One size | Any per-object | O(1) | Internal only (slot vs payload) | Particle pools, bullet pools |
| Buddy | Powers of two | Any per-object | O(log heap) | ~1.44ร requested (Russell)[7] | Linux page allocator, VMA GPU |
| TLSF | Any | Any per-object | ฮ(1), <200 instr | โค 1/2SLI per alloc | Unity dynamic heap, VMA default |
| Slab | One size per cache | Any per-object | O(1) on magazine | Slab overhead + internal | Linux SLUB, ECS archetypes |
| Free-list (first-fit) | Any | Any per-object | O(n) free blocks | External fragmentation | Generic malloc, dlmalloc |
14How shipping engines stack these up
No engine uses one allocator. They use ten, each pinned to a specific subsystem, sometimes per thread. The interesting question isn't which allocator; it's which allocator for what.
Unreal Engine 5
UE5 exposes a single global FMalloc interface, with several concrete backends the engine picks based on platform and build configuration. On the production path, the choice is one of the binned allocators: MallocBinned, MallocBinned2, MallocBinned3, or MimallocBinned[41]. All four are binned: a set of fixed-size pools for small allocations (which dominate engine workloads), plus a large-block allocator for anything above the bin ceiling. Debug and instrumentation builds layer further backends on top (MallocStomp, MallocLeakDetection, MallocAnsi) for canary checks and leak tracking. On top of FMalloc, the engine provides FMemStack[5], a thread-local linear-with-markers allocator for per-frame and per-scope temporaries; FMemMark is the RAII wrapper that enforces the scope discipline. Allocations on FMemStack are documented as invalid past their scope, which is the rewind property the linear allocator section described.
Unity
Unity's native dynamic heap uses TLSF[4], documented on the engine's manual page. On top of the dynamic heap, the DOTS, Burst, and Native Collections layer exposes scoped allocators via the Allocator enum[42]: Allocator.Temp (per-frame linear, fastest), Allocator.TempJob (per-job, lifetime under four frames), Allocator.Persistent (general-purpose, backed by the TLSF heap). Each maps to a different underlying allocator and a different lifetime contract. ECS components live in archetype chunks, each chunk a contiguous run of one component-set shape, which is a slab-pattern layout at scale.
Naughty Dog, Decima, Frostbite, id Tech
Gyrling's 2015 GDC slides[24] are the most-cited public document on the Naughty Dog engine's memory architecture: per-frame linear allocators feeding per-job scratch arenas, with fiber-local ownership so cross-fiber memory pressure is observable in the profiler. Decima (Killzone, Horizon) has no dedicated allocator talk in the public record; the SIGGRAPH 2017 deck[43] covers rendering with brief memory notes. Frostbite has the Fredriksson "Scope Stack Allocation" slides for the CPU-side scope allocator[28] and the FrameGraph talk for transient GPU memory[44]. id Tech publishes its old source: the original Doom 3's idHeap in neo/idlib/Heap.cpp[45] is a tiered small-block + medium-pool + large-block layout (the BFG fork later replaced this with thin wrappers around the platform allocator).
The pattern across all of them
Strip the names and the structure is the same: a per-frame linear allocator for scratch, a general-purpose allocator (TLSF or binned) for object-lifetime allocations, a buddy or page-aligned segregated heap for large blocks, pools/slabs/archetypes for hot fixed-size types, and generational handles or weak references for any identity that outlives a frame. The interesting engineering is in how each engine decides which is which, and how their profilers attribute fragmentation back to a subsystem. The allocator menu, once you know it, is small.
15Pitfalls
Things that will burn you in production that the tutorial code is too clean to show.
- Alignment is not optional. SIMD types want 16-byte alignment. AVX wants 32. Cache lines are 64. GPU buffers want 256 or 4096. An allocator that hands back unaligned pointers will fault on x86 SIMD aligned-load instructions (
MOVAPS,VMOVDQA), fault on older ARM scalar accesses, and pay a cache-line-split load on every fat load that straddles a line even where the hardware tolerates it. Always take the requested alignment as a parameter; round the head up before returning. - Debug allocators are not free. Pattern-fill on alloc, pattern-fill on free, guard pages, leak tracking, stack-trace capture. Each adds 2 to 10 microseconds per allocation. UE5 ships
MallocLeakDetectionfor this reason as a separate backend; you do not want it in shipping builds. - Thread-local linear allocators trap fiber-based jobs. If a job allocates from a TLS arena on thread A, then yields and resumes on thread B (fibers, work-stealing schedulers), the TLS lookup on B sees a different arena. The bytes are still in A's arena, but the job can no longer reach them. Solve by attaching the arena to the job, not the thread. The job-systems tutorial covers the work-stealing case in detail.
- Pool free-list pointers are an information leak in debug builds. A freed pool slot contains the index of the next free slot, not the application's data. Some studios overwrite the rest of the slot with a poison pattern (e.g.
0xCDCDCDCD) on free to make use-after-free in the slot crash predictably. Skip this in shipping builds. - Generational handles don't survive deserialization without renumbering. A save game stores handles. After load, every spawned entity gets a fresh slot index and a fresh generation, so the saved handles point to the wrong slots. The fix is to serialize a stable ID (entity name, UUID) and translate it back to a handle at load time. This is the same problem network replication has.
- Cross-allocator frees crash silently. A pointer returned from
FMemStackpassed toFMalloc::Freecorrupts the heap. Every allocator must own its own free. The standard pattern is to make pointers from non-general-purpose allocators non-deletable (the call site holds an allocator handle, not a raw pointer). - Buddy overallocation can dominate slab gains. A 65-byte slab-style request to a buddy that rounds to 128 costs you 49% of the block. If your request distribution doesn't hug powers of two, layer a pool or a slab over the buddy rather than handing it raw sizes.
16What's next
Three threads to pull on from here.
Read the surveys. Wilson, Johnstone, Neely, Boles 1995[1] remains the best 100-page introduction to the field; Johnstone & Wilson 1998[14] is the measurement update. The Jones, Hosking, Moss Garbage Collection Handbook[46] covers the moving-GC half of the picture: the allocators native engines can't use, and what they do that buddy/TLSF/slab approximate.
Read the talks. Gyrling 2015[24], MacDougall 2016[35], Pedriana 2007[19], and Acton 2014[38] are the four GDC/CppCon talks that, between them, cover most of what shipping AAA studios do with memory. Read them in that order; the dependencies run cleanly.
Read the source. Matt Conte's TLSF[33] is the cleanest production allocator on GitHub at around a thousand lines of C; the AMD VMA repository has a buddy implementation alongside its TLSF default[18]; Bevy ECS[2] and EnTT[3] both have well-commented generational-handle implementations. The original Doom 3 idHeap source[45] is the rare AAA in-engine allocator shipped with its source tree, and worth reading end-to-end for how a real game put the pieces together.
17Sources
- Wilson, P. R., Johnstone, M. S., Neely, M., Boles, D. (1995). "Dynamic Storage Allocation: A Survey and Critical Review." Proc. International Workshop on Memory Management (IWMM '95), LNCS 986. PDF. The canonical taxonomy of allocator strategies and the critique that synthetic random traces don't match real workloads.
- Bevy ECS,
Entitydocumentation. docs.rs/bevy_ecs/entity/struct.Entity. Shipping Rust ECS using index+generation entity IDs. - EnTT, The Entity Component System guide. skypjack.github.io/entt. Shipping C++ ECS with 12-bit-generation entity IDs and a sparse-set archetype storage.
- Unity Manual, "Dynamic heap allocator." docs.unity3d.com. Explicit: "applies the Two Level Segregated Fit (TLSF) algorithm."
- Unreal Engine,
FMemStackBaseAPI documentation. dev.epicgames.com. The official spec for UE's per-frame linear allocator, including the rewind-on-pop semantics. - Bonwick, J. (1994). "The Slab Allocator: An Object-Caching Kernel Memory Allocator." USENIX Summer 1994. PDF. The origin of slab/object caching, the "construct cost amortized across reuse" thesis.
- Russell, D. L. (1977). "Internal Fragmentation in a Class of Buddy Systems." SIAM Journal on Computing 6(4), pp. 607-621. SIAM. Primary source for the ~1.44ร expected overallocation under uniform request distribution. The popular "buddy wastes 25%" claim is a rounded simplification, not a measured number.
- Bonwick, J., Adams, J. (2001). "Magazines and Vmem: Extending the Slab Allocator to Many CPUs and Arbitrary Resources." USENIX ATC. PDF. Per-CPU magazine layer; the lock-free fast path used by every modern multicore allocator.
- Knowlton, K. C. (1965). "A Fast Storage Allocator." Communications of the ACM 8(10), pp. 623-624. ACM DL. First public description of the binary buddy algorithm.
- Knuth, D. E. (1997). The Art of Computer Programming, Volume 1: Fundamental Algorithms, 3rd ed., Addison-Wesley. Section 2.5 "Dynamic Storage Allocation," pp. 435-456. The canonical textbook treatment, source of the 50% rule and the boundary-tag idea.
- Shore, J. E. (1977). "Anomalous Behavior of the Fifty-Percent Rule in Dynamic Memory Allocation." CACM 20(11), pp. 812-820. ACM DL. The follow-up that measured when Knuth's 50% rule does and doesn't hold.
- Robson, J. M. (1971). "An Estimate of the Store Size Necessary for Dynamic Storage Allocation." Journal of the ACM 18(3), pp. 416-423. ACM DL. The proof that no non-relocating allocator can avoid worst-case fragmentation overhead.
- Hanson, D. R. (1990). "Fast Allocation and Deallocation of Memory Based on Object Lifetimes." Software: Practice and Experience 20(1), pp. 5-12. Wiley. The academic origin of region (arena, bump) allocation.
- Johnstone, M. S., Wilson, P. R. (1998). "The Memory Fragmentation Problem: Solved?" ISMM '98, ACM SIGPLAN Notices 34(3), pp. 26-36. PDF. Measured fragmentation on real C/C++ programs; near-zero for sensible policies once header overhead is factored out.
- Alexandrescu, A. (2001). Modern C++ Design, Chapter 4 "Small-Object Allocation." Addison-Wesley. O'Reilly. The Loki
SmallObjectAllocator: segregated pools by size, embedded free lists. - Masmano, M., Ripoll, I., Crespo, A., Real, J. (2004). "TLSF: A New Dynamic Memory Allocator for Real-Time Systems." Proc. ECRTS 2004, pp. 79-86. IEEE. The original TLSF paper. ฮ(1) bound via two-level bitmap and hardware bit-scan.
- Masmano, M., Ripoll, I., Real, J., Crespo, A., Wellings, A. J. (2008). "Implementation of a constant-time dynamic storage allocator." Software: Practice and Experience 38(10), pp. 995-1026. Wiley. Implementation details and measurement: ~150-200 cycles per malloc on a 2008-era x86.
- AMD Vulkan Memory Allocator. gpuopen-librariesandsdks.github.io/VulkanMemoryAllocator. Made TLSF the default allocator in v3.0.0 (2022); shipping in many Vulkan titles and Google Filament.
- Pedriana, P. (2007). "EASTL โ Electronic Arts Standard Template Library." WG21 N2271. open-std.org. The paper that documented every place the C++ standard library hurts game engine code and proposed the fixes that eventually landed in C++17 (
std::pmr) and C++20. - Halpern, P. (2014). "Polymorphic Memory Resources." WG21 N3916. open-std.org. The design rationale for C++17's
std::pmr(adopted via P0220). - Albrecht, T. (2009). "Pitfalls of Object-Oriented Programming." Sony Computer Entertainment Europe R&D, GCAP 2009. PDF. The PS3 scene-tree case study; 19.2 ms โ 3.3 ms (~6ร end-to-end) by replacing OO node graphs with homogeneous arrays. Cite this for the actual measured speedup, not the folklore "20ร".
- Llopis, N. (2009). "Data-Oriented Design." Games from Within. gamesfromwithin.com. The blog post that named the movement.
- Frykholm, N. (2011). "Managing Decoupling Part 4: The ID Lookup Table." BitSquid blog. bitsquid.blogspot.com. The canonical AAA-engine source for the generational-handle pattern. Stingray ships it.
- Gyrling, C. (2015). "Parallelizing the Naughty Dog Engine Using Fibers." GDC 2015. Slides PDF. Per-frame and per-job linear allocators, fiber-local arenas.
- Babka, V. (2024). "Removing SLAB from the kernel." LWN. lwn.net. The 6.8 kernel ships without SLAB; SLUB is the sole allocator.
- Fleury, R. (2022). "Untangling Lifetimes: The Arena Allocator." rfleury.com. The cleanest modern essay on arena allocators in C and adjacent languages.
- Frykholm, N. (2010). "Custom Memory Allocation in C++." BitSquid blog. bitsquid.blogspot.com. The Stingray/BitSquid in-engine allocator design.
- Fredriksson, A. "Scope Stack Allocation." DICE/Frostbite. PDF. The Frostbite description of nested-scope linear allocation.
- Boost.Pool documentation. boost.org. The canonical OSS "simple segregated storage" reference.
- Gorman, M. Understanding the Linux Virtual Memory Manager, Ch. 6. kernel.org. The Linux page allocator (buddy at order 0-10) explained.
- Frykholm, N. (2015). "Allocation Adventures 3: The Buddy Allocator." BitSquid blog. bitsquid.blogspot.com. A CPU buddy allocator in an engine context.
- MorphOS, TLSF system allocator. morphos-team.net. MorphOS 2.0+ ships TLSF as the default allocator.
- Conte, M.
tlsf: Two-Level Segregated Fit memory allocator implementation. github.com/mattconte/tlsf. The canonical BSD-licensed reference implementation, ~1000 lines of C, 4-byte per-block overhead. - Reinalter, S. "Molecule engine: Memory system" series. blog.molecular-matters.com. Linear, stack, pool, arena allocators walked through in working C++.
- MacDougall, A. (2016). "Building a Low-Fragmentation Memory System for 64-bit Games." Sony London Studio, GDC 2016. Slides PDF. Concrete shipping-game fragmentation numbers; the segregated-by-size-and-lifetime architecture.
- Dubรฉ, J.-F. (2012). "Memory Management Part 4: Allocators Overhead, Waste and Fragmentation." jfdube.wordpress.com. Developer-testimony fragmentation report on a shipping Xbox 360 title (~10-30 MB of fragmentation from level-load temporaries).
- Nikolov, S. (2018). "OOP is Dead, Long Live Data-Oriented Design." CppCon 2018. YouTube. Measured cache-miss and performance deltas on an HTML rendering engine refactor.
- Acton, M. (2014). "Data-Oriented Design and C++." CppCon 2014. YouTube. The DOD framing for engine programmers: "the purpose of code is to transform data."
- Weissflog, A. (2018). "Handles are the better pointers." floooh.github.io. The modern blog standard for the handle-table pattern.
- Deutsch, A. (2017). "Slot Map Data Structure." WG21 P0661R0. open-std.org. The proposal to standardise
std::slot_map. Not adopted, but the formal definition. - Unreal Engine,
FMallocandMallocBinned3API documentation. dev.epicgames.com. The official spec for UE's allocator abstraction and binned-allocator backends. - Unity Manual, "Allocator overview." docs.unity3d.com. The DOTS / Native Collections allocator enum:
Temp,TempJob,Persistent. - Decima Engine SIGGRAPH 2017 presentation. advances.realtimerendering.com. Rendering-focused, brief memory notes.
- O'Donnell, Y. (2017). "FrameGraph: Extensible Rendering Architecture in Frostbite." GDC 2017. gdcvault.com. Transient GPU resource memory.
- id Software, Doom 3 source,
neo/idlib/Heap.cpp. github.com/id-Software/DOOM-3. Production AAA heap source: tieredidHeap::SmallAllocate/MediumAllocate/LargeAllocate. - Jones, R., Hosking, A., Moss, E. (2023). The Garbage Collection Handbook, 2nd ed., CRC Press. gchandbook.org. The book on relocating allocators (compacting GCs); the half of the field native engines can't use directly.