Tutorial 07 · Engine Programming

C++ Containers

Eight cores. A 64-byte cache line. A 70-cycle gap between an L1 hit and an L3 hit, another 200 cycles to DRAM. The container that wins on paper is rarely the one that wins on hardware, because the data structure is the layout in memory and the layout decides whether the CPU is fed or stalled. We walk every standard container the engine actually ships with: std::array, std::vector, std::deque, std::list, std::map, std::unordered_map, the adaptors, the views. Memory pictures, complexity proofs, and a live race that makes the 50× gap between vector and list visible.

Time~55 min LevelMid C++ engineer PrereqsYou've written a for loop over a vector. You've heard of big-O. The memory-model tutorial is useful if you want to know why unordered_map can't be lock-free. HardwareSome idea that DRAM is slower than cache

01Why a container choice is a performance decision

Big-O is a description of how a data structure scales. It is not a description of how fast that data structure is. A linked list and a sorted vector both report O(n) insertion in the middle, and on hardware built in 1985 they performed about the same. On hardware built since 2005 the vector wins by a margin so wide it stops being an algorithm question and becomes a memory-layout question. Bjarne Stroustrup spent an entire Going Native 2012 keynote making this point with a stopwatch^[1], and a decade later "default to vector" is the conventional advice in most modern engine codebases. For sequence containers, the layout in memory is most of the performance story; complexity, reference stability, and lookup pattern decide the rest.

The widget below makes that visible. The left column reads N integers from a contiguous block (a std::vector<int>). The right column reads N integers from nodes scattered across the heap (a std::list<int>). Both touch the same number of bytes. Both report O(n) traversal. The cache walker shows the difference: the vector reads sweep through cache lines in order; the list reads jump between unrelated lines, paying a cache miss on most accesses:

elements 256

speed 2×

vector misses

···

list misses

···

ratio

···

What you'll have by the end

A working mental model of every standard C++ container: what it actually looks like in memory, what its complexity guarantees promise (and what they don't), and which one to reach for in a given gameplay system. Plus a live container race that lets you push a million elements through each, a hash-table animator that flips between separate chaining and SwissTable-style open addressing, and a cheat sheet you can paste into your engine's contributing guide.

The framing for the rest of the tutorial

Every container in this tutorial answers four questions, and you can mostly predict its performance from the answers:

Where do the elements live? One contiguous block (vector, array), several contiguous blocks (deque, hive), or one independent heap allocation per element (list, map, node-based unordered_map)?
How are they ordered? Insertion order (vector, deque, list), sorted by key (map, set), or hashed (unordered_map, unordered_set)?
Which operations does the container want to be fast? Random index, find-by-key, push at the back, push anywhere, ordered iteration.
Which references survive which mutations? A pointer into a vector can become invalid on the next push_back; a pointer into a list survives every mutation that doesn't touch that specific node. Engine code that holds long-lived handles depends on knowing this exactly.

Answer those four for each container and the rest of the standard-library reference page reduces to small print.

02A short history of getting containers right

The C++ standard library's container set didn't drop fully formed in 1998. Its shape is the cumulative answer to twenty-five years of "this is faster on real hardware" arguments, each of which forced the committee or a library author to write the better data structure down. A short tour, because the modern set only makes sense in context:

1979

Alex Stepanov starts the generic-programming project. While at GE Research, Stepanov begins formalizing iterators and algorithms separately from the data structures they operate on. The C++ Standard Template Library, written with Meng Lee in 1993-1994, is the eventual product^[2]. The structural decision that sequence containers, associative containers, and algorithms are three orthogonal axes (you mix and match) dates from here.

1994

STL accepted into ISO C++. The first standardized container set ships: vector, deque, list, set, map, multiset, multimap, plus the stack, queue, and priority_queue adaptors. Associative containers are tree-based; the standard never names a specific tree, but the combination of logarithmic complexity and iterator-stable insert/erase narrows the practical choice to a balanced BST. Every shipping implementation (libstdc++, libc++, MSVC) lands on red-black^[3].

2003

Boost.Container ships flat_map. A sorted-vector-backed associative container. Lookup costs the same O(log n) binary search the tree gives you, but the elements live contiguously and the cache loves them. Insertion in the middle is now O(n), which is a price most lookup-heavy code is happy to pay^[4].

2007

Paul Pedriana's EASTL paper (N2271).^[5] Electronic Arts publishes its in-house STL replacement, documenting every place the standard hurts game code: allocator interface, debug-build cost, alignment, intrusive containers, fixed-storage containers. The committee starts paying attention. Most of EASTL's missing-pieces list eventually lands in std over the next decade.

2011

C++11 ships std::array, std::unordered_map, std::unordered_set.^[6] Fixed-size sequence with no heap allocation; hash tables with average-constant lookup. The unordered containers are constrained to closed-addressing (separate chaining) by a bucket API that leaks into the interface^[7]. That decision is what every "faster than unordered_map" library is replacing.

2012

Stroustrup, "Going Native" keynote.^[1] The "vector vs list, ordered insertion" benchmark that every C++ engineer has seen. The result: vector beats list by orders of magnitude at sizes that fit in cache and DRAM. The keynote popularizes "default to vector" as a rule.

2014

Facebook releases Folly's fbvector.^[8] Andrei Alexandrescu's reimplementation of std::vector with a 1.5× growth factor in the middle of its size range (so a future reallocation can reuse freed memory), jemalloc-aware sizing, and explicit relocation for trivially-movable types. The reported wins on Facebook's workloads are modest but real.

2017

Matt Kulukundis, "Designing a Fast, Efficient, Cache-friendly Hash Table, Step by Step."^[9] The CppCon talk that introduces SwissTable: 16-slot groups, a 1-byte control word per slot with the top 7 bits of the hash, SIMD parallel match. 2-3× faster than std::unordered_map on Google's workloads. Open-sourced in Abseil in 2018^[10].

2019

Facebook open-sources F14.^[11] The 14-key-per-chunk shape, related to SwissTable but with three storage variants (Value, Node, Vector) tuned for different key/value sizes. The convenience alias F14FastMap picks between the Value and Vector layouts at compile time based on entry size; users who need pointer stability reach for F14NodeMap directly.

2023

std::flat_map and std::flat_set adopted for C++23. The boost::container design from 2003, twenty years later, becomes standard. flat_map and flat_multimap arrive through P0429^[30]; flat_set and flat_multiset arrive separately through P1222^[31]. Lookup-heavy ordered associative containers no longer need a third-party dependency.

2026

C++26 adds std::hive.^[12] Matt Bentley's plf::colony, formerly used for game-engine entity pools, becomes the standard's stable-reference bucket array. Erase keeps the slot, iteration skips the gap, pointers survive every mutation that does not touch their specific element.

The recurring pattern: the algorithm is from the 1970s, the cache-friendly layout is from the 2010s, the standardization is a few years behind whatever the big shops have already shipped. By 2026 the standard has closed the sorted-associative and bucket-array gaps (flat_map, hive); the hash-table gap is the conspicuous remainder, since std::unordered_map is still constrained by the C++11 bucket API while the open-addressing variants outside std are 2-3× faster. The original 1998 containers are still the right default for most code. The remainder of this tutorial walks them in dependency order: array, vector, deque, list, map, unordered_map, the adaptors, the views, and the post-std alternatives at the end.

03The memory hierarchy the container sits on

Before any specific container makes sense, the picture of the hardware they're trying to feed. A modern CPU spends most of its time waiting for memory:

Tier	Typical latency (cycles)	Capacity
Register	~0	~32 × 8 B per core
L1 data cache	~4	32-48 KB per core
L2 cache	~12	256 KB-1 MB per core
L3 cache (shared)	~40	4-64 MB per socket
DRAM	~250	8-128 GB

Cycle counts are Skylake-class x86 rule-of-thumb numbers, drawn from Hennessy and Patterson^[13], Intel's optimization manual^[14], and Agner Fog's measurements; AMD Zen, Apple M-series, and ARM Neoverse have their own profiles, and server parts with mesh interconnects push L3 and remote-socket DRAM substantially higher. The point is the ratio, not the exact number. A miss to DRAM is roughly 60 times the cost of an L1 hit on a modern desktop core.

The unit the cache moves around is a cache line: 64 bytes on x86-64 and most ARM, 128 bytes on Apple M-series. A line holds 16 ints, or 8 doubles, or 8 64-bit pointers. When the CPU touches a byte that isn't in cache, the hardware pulls the entire line that contains it, and pays the DRAM latency once. Every subsequent byte on the same line is free.

This is the central fact behind every container choice in the rest of the tutorial. A contiguous container amortizes the cost of one DRAM fetch across 16 elements. A node-based container with a general-purpose allocator usually pays that fetch once per element, because consecutive nodes rarely share a cache line. (Custom pool allocators and intrusive containers can close some of the gap; see §16.)

What about the hardware prefetcher?

x86 (and most ARM) CPUs have a stream prefetcher that watches recent cache misses and, if they form a regular linear pattern, fetches ahead so the next miss is already on its way. A sequential vector scan triggers it: the prefetcher correctly predicts that the next line is wanted and brings it in before the CPU asks. A list traversal does not trigger it: the next node's address is in the current node's next pointer, and the prefetcher can't follow indirection^[14].

This is why the gap between vector and list scales with array size. Below ~32 KB everything fits in L1 and the difference is small. Above L3 the vector keeps up with DRAM bandwidth (10-40 GB/s) while the list bottlenecks on miss latency (~250 cycles per element on a Skylake-class core). Stroustrup's keynote demo^[1] reported the vector finishing in roughly 80 seconds while the list took close to two hours on the same hardware at 500K elements, somewhere around 50-100× depending on which run you pick. The exact factor depends on element size, allocator behaviour, and the working-set range; the order of magnitude is what's robust.

With that established, the containers come in three structural families. The rest of the sections are organized by family:

Contiguous sequence: array, vector. One block, indexable, cache-loving.
Chunked or node-based sequence: deque (chunked), list and forward_list (one node per element).
Associative: map/set (sorted, node-based, tree); unordered_map/unordered_set (hashed).

Plus the adaptors that wrap one of those (stack, queue, priority_queue) and the non-owning views (span, string_view) that point into someone else's storage.

04`std::array`: fixed size, on the stack

The smallest, simplest container. A thin wrapper around a C-style array, sized at compile time, no heap allocation, all elements live in the object itself^[6]. std::array<float, 3> is exactly three floats in twelve bytes, nothing more. It exists for two reasons: it gives a C array a real STL interface (.begin(), .size(), .data(), etc.), and it makes the size part of the type so the compiler can check it.

array_basics.cpp · what's actually in memory

// Three floats, allocated wherever the std::array object lives.
// If 'positions' is a local, it's on the stack; if it's a member of a class
// allocated on the heap, it lives in that class's heap block. No
// separate allocation either way.
std::array<float, 3> positions = {0.0f, 1.0f, 2.0f};

// sizeof(positions) is exactly 12. There's no length field; the size is
// encoded in the type so the compiler always knows it.
static_assert(sizeof(positions) == 3 * sizeof(float));

// The interface is std-container-shaped: iterate, index, ask the size.
for (float& element : positions) element *= 2.0f;

// data() returns a pointer to the first element, the same way a C array decays.
// Useful when calling C APIs (OpenGL, Vulkan, sockets) that want a raw pointer.
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, positions.data());

Use it whenever the size is a compile-time constant and the container is small enough that copying it by value is cheap. Vertex positions, RGBA colors, 4×4 matrix rows, hash digests, fixed-channel audio mixes, lookup tables of known size. Anywhere a C programmer would have written float v[3];.

When std::array is the wrong answer

The size is part of the type. std::array<int, 3> and std::array<int, 4> are different types and you can't assign one to the other. If the size of the data is "small but not fixed" (e.g. up to 16 entries in a small-buffer-optimized list), reach for a SmallVector from EASTL or LLVM^[15], not std::array with the maximum size and a separate length counter.

05`std::vector`: the workhorse

The default. Stroustrup's advice in The C++ Programming Language, fourth edition, is "use vector as your default container"^[16] and the C++ Core Guidelines repeat it^[17]. The reasoning is exactly the cache argument from §1: a contiguous block of memory is the friendliest layout for the CPU, and most workloads don't need more than that.

A std::vector is three pointers (or equivalents) in the object itself: begin, end, and end-of-storage^[18]. The actual elements live in a single contiguous heap block. size() is end - begin; capacity() is end_of_storage - begin. Indexing is one pointer arithmetic plus one load. Iteration is a pointer walk that the hardware prefetcher loves.

The four guarantees that matter

Constant-time random access. v[i] is one pointer add, one load.
Amortized constant-time push_back. When there's capacity, push_back is one assignment plus one increment. When capacity runs out, it doubles (or 1.5×s) and reallocates. The amortized cost is O(1) regardless^[19].
Linear-time insert/erase in the middle. Every element after the insertion point has to be shifted by one slot. On contiguous memory this is one memmove, which the CPU runs at memory bandwidth.
Contiguous storage, guaranteed. &v[0], v.data(), and v.begin() all point at the same single block. You can pass v.data() to OpenGL, to memcpy, to ::write(). std::vector<bool> is the famous exception; we'll come back to that.

The catch is the third bullet's twin: every reallocation invalidates every pointer, reference, and iterator into the vector, and inserts that don't grow the vector still invalidate everything from the insertion point to the end. The next two sections take each in turn.

06Growth, amortization, and the cost of `push_back`

When a vector's size() reaches its capacity(), the next push_back has to allocate a bigger block, move every existing element into it, and free the old one. That's a linear-time operation. Doing it on every push would make push linear-time in steady state, which is terrible. The standard library hides this with amortization: it grows the block by a constant factor (typically 2× or 1.5×) so that the cost of all the reallocations across N pushes is O(N) total, which means the cost per push averages out to O(1)^[19].

One push at a time, this widget records the cost. Capacity steps up at doubling boundaries; reallocation events appear as tall red bars; the running average per push converges to a small constant even though individual pushes cost O(size):

target size 64

total work

···

reallocations

···

avg per push

···

With a 2× growth factor (libstdc++ and libc++ default for push_back) the geometric series 1 + 2 + 4 + ... + N sums to about 2N, so the average cost per push converges toward a constant near 2. The 1.5× growth factor (MSVC, and Folly's fbvector in its middle size range) reaches a slightly higher constant per push but lets a future reallocation reuse the just-freed block, which the 2× growth cannot do^[8].

eq. 1 \cdot amortized push_back cost cost (N pushes) = 1 + 2 + 4 + \dots + 2 k = 2N - 1

Geometric growth turns N pushes into about 2N total work, which is O(N) and gives O(1) amortized per push. The word amortized is doing real work: any individual push that triggers a reallocation is O(size), not O(1). Code that needs a hard real-time bound on a single push has to call reserve() first.

Why 1.5× sometimes beats 2×

Set the growth factor to r and call the initial capacity 1. After k growths the sum of every previously freed block is 1 + r + r² + … + r^k-1 = (r^k − 1) / (r − 1). The block being allocated for the next growth needs r^k units. Ask when the new allocation fits inside the cumulative freed region: r^k ≤ (r^k − 1) / (r − 1), which simplifies to r² ≤ r + 1. The largest r satisfying that inequality is the golden ratio, φ ≈ 1.618. Any factor below φ eventually fits, so the allocator can reuse the just-freed addresses; r = 2 sits permanently above φ and never can.

That's the case for picking a factor below 2. The case against is that smaller factors mean more reallocations, so the average per-push cost rises. The shipping picks split the difference: MSVC uses 1.5×, Folly's fbvector uses 1.5× in the middle of its size range and 2× at the small and very large ends^[8], libstdc++ and libc++ stay at 2×. Whether the in-place reuse actually happens depends on the allocator; jemalloc handles it well, glibc's ptmalloc less so.

In practice, if you know how big the vector will get, the cure for all of this is one line:

reserve.cpp · the one line that fixes most push_back perf

std::vector<EntityId> visibleEntities;

// We know roughly how many entities will be visible this frame.
// One allocation up front beats log2(N) reallocations during the loop.
visibleEntities.reserve(estimatedVisibleCount);

for (const Entity& entity : worldEntities) {
  if (frustum.contains(entity.bounds))
    visibleEntities.push_back(entity.id);    // no reallocation until we exceed the reservation
}

07Iterator invalidation: the rule you can't forget

Every container has a set of "things that, if you do them, will invalidate some references into the container." Get it wrong and you have a dangling pointer that the compiler can't catch. The rules are precise and worth memorizing for the containers you use most:

Container	What invalidates iterators	What invalidates pointers/references
`vector`	Anything that triggers a reallocation invalidates all iterators. Insert or erase invalidates iterators from the change point to `end()`.	Same as iterators. References are pointers in disguise here.
`deque`	Insert or erase anywhere except the front/back invalidates all iterators.	Insert at front or back leaves all pointers/references valid; insert in the middle invalidates everything.
`list`, `forward_list`	Only the iterators to erased elements are invalidated.	Only the pointers to erased elements. Everything else survives.
`map`, `set`, `multi*`	Same as `list`: only the erased ones.	Same as `list`: only the erased ones.
`unordered_map`, `unordered_set`	Insert may trigger a rehash, which invalidates all iterators. Erase invalidates only the erased iterator.	Insert with no rehash leaves pointers/references valid; rehash leaves pointers/references valid (only iterators are invalidated). Erase invalidates only the erased element's references.

Adapted from the cppreference invalidation tables^[19]^[20] and the formal ISO wording^[3]. The "pointer/reference vs iterator" distinction for unordered_map rehash is subtle and worth confirming on the cppreference page for any version you target.

Each button on the widget below performs one operation on a small vector. Two pinned iterators sit at fixed positions; they turn red the instant they go dangling. Every reallocation flips every iterator; every middle-insert flips everything from the insertion point onward.

size

capacity

it₁ status

valid

it₂ status

valid

08`std::deque`: chunked, two-ended

std::deque ("double-ended queue") gives you push_front and pop_front at O(1), which vector doesn't. It does this by storing elements in a sequence of fixed-size chunks, with a top-level index ("map") of pointers to those chunks^[21]. Inserting at the front allocates a new chunk and points the map at it; the existing chunks don't move. Indexing is two loads (chunk pointer, then element) instead of one.

Implementations pick a chunk size, and the picks differ wildly^[22]:

libstdc++: 512-byte chunks. For deque<int>, 128 elements per chunk; for any element larger than 512 bytes, one element per chunk.
libc++: 4096-byte chunks when the element is smaller than 256 bytes; a fixed 16 elements per chunk above that.
MSVC: the chunk size shrinks aggressively with element size. The formula is 16 elements for 1-byte elements, 8 for 2-byte, 4 for 4-byte (so deque<int> has chunks of 4), 2 for 8-byte, and one element per chunk for anything larger than 8 bytes. This last regime is the one with the famously poor cache behavior: a deque of any reasonably-sized struct on MSVC is approximately one heap allocation per element.

Pictured below: a generic libstdc++-style layout. Push at either end and a new chunk is allocated when the current end fills; existing chunks never move.

elements

···

chunks

···

chunk size

8 elems

Two-level structure: a map of pointers, each pointing to a fixed-size chunk. Iteration walks within a chunk (cache-friendly), then jumps to the next chunk (a small cache miss). push_front never moves existing elements; it just allocates a new chunk at the front. This is why std::deque is the natural backing store for a FIFO queue.

Use it when you need O(1) push and pop at both ends, when reference stability after front/back insertion matters, or when you want to grow without periodic 2× reallocations. Common production uses:

Job queues where producers push at one end and the worker pops from the other.
Event histories where new events go on the back and stale ones get popped off the front.
The default backing store of std::queue and std::stack. queue<T> is just queue<T, deque<T>>.

deque is not always cache-friendly

The chunked layout means iteration sweeps each chunk sequentially (good) and then jumps to the next chunk (one cache miss per chunk boundary). With a 512-byte libstdc++ chunk and 8-byte doubles, that's 64 elements per chunk and one miss per 64. Still far better than a list's one miss per element, and within a small constant of a vector's one per 16. On MSVC with its one-element-per-chunk policy for anything larger than 8 bytes, the picture is much worse, and the resulting deque behaves more like a list than a vector. For pure-iteration workloads on libstdc++ or libc++, std::vector typically wins by a small constant; for two-ended access, deque wins because the alternative is a vector with O(n) insert(begin()).

09`std::list` and `std::forward_list`

A doubly-linked list. Each element lives in its own heap allocation, plus two pointers (prev and next). std::forward_list drops the prev pointer to halve the per-node overhead, at the cost of being singly-linked^[23].

list has two real selling points in modern C++:

Reference stability under all mutations. A pointer into a node survives every insert and every erase that doesn't touch that specific node. Worth real money in code that wires up long-lived handles to objects in containers.
O(1) splice. list1.splice(pos, list2, it) moves a node from list2 into list1 by relinking pointers, no copies, no allocations^[23]. There is no other standard container that lets you move an element between containers in O(1). LRU-eviction caches and ready-queue schedulers both lean on this.

Outside of those two use cases, it is almost always the wrong choice. The cache-miss penalty from §1 is the headline; smaller costs add up too:

Per-node allocation. Each push_back on a list<int> calls the allocator. A vector calls the allocator log2(N) times.
Per-node memory overhead. On libstdc++ x86-64, a list<int> node struct is 24 bytes (8-byte prev, 8-byte next, 4-byte int, 4 bytes of padding), plus the allocator header (typically 8-16 bytes on glibc). The payload is one int. The overhead ratio sits around 6-10×, before any allocator fragmentation.
No random access. l[i] doesn't compile; you walk from begin().

The next section measures the gap directly.

10List vs vector, measured

Stroustrup's Going Native 2012 benchmark^[1]: generate N random integers; for each one, find its sorted position (linear scan from begin()) and insert it there. Repeat for both a vector and a list. The number of comparisons is the same; the number of structural mutations is the same; the constant factor is everything.

Press RACE to run a JavaScript port of the benchmark. The absolute numbers won't match a native C++ run (V8 doesn't have the same heap layout as glibc), but the ratio tracks the C++ result:

N elements 2000

vector ms

···

list ms

···

ratio

···

The vector wins because its sorted-position search sweeps contiguous memory; the list pays a cache miss per node on the same search. The insert phase costs the vector a memmove of cache-resident bytes and the list a pointer relink. Even though the list does less structural work per insert, it pays more total time because the search dominates. The lesson is Stroustrup's: complexity is the same, layout decides^[1].

When list actually wins

Two cases survive the cache argument. (a) Splice-heavy workloads where you move elements between containers without iteration: an LRU cache that touches one element per request, then moves that node to the front, is a textbook fit. (b) Code that holds long-lived pointers into the container across many mutations: a list<Job> where job dependency edges store raw Job*. In both cases pre-C++26, the contiguous alternative needs a separate indirection table to provide the reference stability that list gets for free; whether that's a better deal depends on the specifics. As of C++26, std::hive (see §16) covers case (b) with a cache-friendlier layout, so the only case that still goes to list by default is the splice-heavy one in (a).

11`std::map` and `std::set`: ordered associative

std::map<K, V> is a sorted, balanced binary search tree from keys to values. The standard requires logarithmic-time lookup, insert, and erase, and ordered iteration^[3]. The standard never names a specific data structure, but every shipping implementation (libstdc++, libc++, MSVC) uses a red-black tree^[24]. AVL trees meet the same complexity bound but rotate more on insert/erase; red-black trades a slightly taller tree for fewer rotations, which is the right balance when erase has to be O(log n) too.

Each element is its own heap-allocated node containing the key/value pair, a parent pointer, two child pointers, and one color bit. That's about 48 bytes of overhead per node on 64-bit platforms, plus the payload. Lookup is a logarithmic walk from the root, branching by key comparison at each node.

key 7

size

···

height

···

last op

···

When to use map, and when not

Use std::map when you need ordered iteration (lowest key first, highest key last) or when you need to find ranges by key (lower_bound, upper_bound, equal_range). For example: per-frame replay events keyed by timestamp, where you stream events in time order; or an interval tree by start coordinate.

Do not use std::map as a default associative container. For unordered "find by key" lookups, std::unordered_map is faster on average (constant vs logarithmic) and a flat sorted-vector (std::flat_map in C++23, or boost::container::flat_map before that) is faster than both for read-heavy workloads on small to medium key sets, because the contiguous layout fits the cache^[4].

12`std::unordered_map` and the bucket interface

A hash table. Average-constant-time lookup, insert, and erase; worst-case linear if the hash function is adversarial. C++11 standardized it with a specific implementation strategy baked into the interface: closed addressing (each bucket holds a linked list of entries that hashed to it), exposed through the public bucket_count(), bucket_size(), and local_iterator APIs^[20]. That interface choice locks every shipping std::unordered_map into separate chaining, which is why third-party hash tables routinely beat it by 2-3×^[7].

The widget shows the closed-addressing layout. Each bucket points to a linked list of {hash, key, value} nodes. Lookup hashes the key, picks the bucket, then walks the bucket's list comparing keys. The load factor (size() / bucket_count()) controls how long the bucket chains get; the default max_load_factor is 1.0, so on average each bucket holds one element.

key to insert 42

load-factor target 1.0

size

···

buckets

···

load factor

···

max chain

···

Separate chaining: each bucket is the head of a linked list. Lookup pays one cache miss to read the bucket, then one per chain node, plus the key comparison. The bucket API in std::unordered_map exposes this enough that implementations can't switch to open addressing without breaking ABI^[7].

Hash quality matters more than you think

The hash function decides whether the table runs at O(1) or degenerates to O(n). std::hash<int> on libstdc++ and libc++ is the identity function: hash(42) == 42. That works fine when the bucket count is prime (which it is on libstdc++) and fails badly when keys differ only in bits the bucket count ignores, for example integer indices that all share their low bits. Hash-flooding attacks are real, and the standard library does not defend against them. Production code on internet-facing inputs should use a randomized seed (SipHash, WyHash) or accept the DoS risk^[25].

For game code, where the inputs are usually engine-controlled, the standard's default is fine. The bigger issue is what the table costs vs. what it could cost.

13Open addressing and SwissTable

Outside the standard library, almost every modern hash table is open-addressing (no separate buckets; every entry is stored inline in one flat array, and collisions are resolved by probing). The flat layout is the cache argument from §1, applied to hash tables: one DRAM fetch covers many adjacent probe positions.

The state of the art is Google's SwissTable^[9], open-sourced in Abseil in 2018^[10]. The same general shape was adopted by Folly's F14^[11], by Rust's standard HashMap (via the hashbrown crate), and by Go's map starting in version 1.24. The chunk size diverges: SwissTable groups 16 slots so the control bytes fill one 128-bit SIMD register and rehashes at load factor 7/8 = 87.5%. F14 groups 14 slots and uses the remaining 2 bytes of the 16-byte aligned block as a per-chunk overflow count: when an insert is displaced past this chunk it is incremented, on erase decremented, and an unsuccessful probe can stop as soon as the count is zero rather than walking a chain of tombstones^[37]. F14's max load factor is 12/14 ≈ 85.7%, slightly lower than SwissTable's; that is the cost of widening the metadata.

The design has four moving parts:

The table is one flat array of slots. Each slot has a 1-byte control word next to its {key, value} entry.
The 64-bit hash is split into H1 (57 bits) for picking the starting position, and H2 (7 bits) stored in the control word for fast filtering^[26]. The 7-bit width is not arbitrary: it leaves one bit for the empty/tombstone/full state tag, and 16 of them (one per slot in a group) fit in a 128-bit SSE register, which is what the SIMD compare needs.
On x86 with SSE2, slots are grouped into groups of 16. Lookup loads the 16 control bytes into one 128-bit SIMD register and runs one _mm_cmpeq_epi8 against the H2 we're searching for; _mm_movemask_epi8 turns the result into a 16-bit mask of candidate positions^[9]. On builds without SSE2 (older ARM, portable fallback) the group size drops to 8 and bit-twiddling replaces the SIMD compare.
The candidates are then key-equality-checked; misses move to the next group. Most lookups touch one group, which is one cache line.

Press "find" on the widget below with a key. H1 picks a group; the 16 control bytes load into a single SIMD register; one compare against the 7-bit H2 of the search key produces the match mask; the matched candidate gets a full key comparison:

lookup key 42

size

···

groups

···

last probe

···

overhead

1 B / slot

SwissTable's whole trick: scan 16 slots in one SIMD instruction by storing 7 bits of the hash next to each slot. The control word also tags empty and tombstone slots, so the same instruction tells the inserter where to put a new entry. Memory overhead is 1 byte per slot. By contrast, a node-based unordered_map pays one next-pointer and (typically) one cached-hash word per node, plus the allocator header for each malloc'd node^[11].

Where the time goes

Closed-addressing std::unordered_map spends most of its lookup time on the two cache misses (bucket pointer, then first chain node) and on the linked-list pointer chase. SwissTable spends most of its time on the one cache miss to load the group's 16 control bytes; the SIMD compare is one instruction, and the key comparison is on a hot cache line. The gap matches the prediction: 2-3× faster lookup, ~30% less memory^[9]^[11].

The standard committee is aware of this. P2767, the proposal for a flat-storage std::flat_unordered_map, has been in flight for several revisions. Until it lands, the pragmatic answer for new C++ engine code is to take a dependency on absl::flat_hash_map or folly::F14FastMap and pretend std::unordered_map doesn't exist.

14The adaptors: `stack`, `queue`, `priority_queue`

Three thin wrappers, each constraining an underlying container to a specific access pattern:

std::stack<T>. LIFO. Defaults to std::deque<T> underneath, but any container with back(), push_back(), and pop_back() works. Pass std::vector<T> as the second template argument and you get a vector-backed stack with no deque overhead.
std::queue<T>. FIFO. Defaults to std::deque<T> underneath. A vector won't work (it has no efficient pop_front); a std::list does but is slower than the deque default.
std::priority_queue<T>. The biggest element is always on top. Defaults to std::vector<T> backed by a binary heap, which is exactly the right shape for this access pattern^[27].

The binary-heap layout is worth knowing about because std::priority_queue is a hot data structure in game code: AI A* searches, sound mixer priority, animation event scheduling, every "do the most important thing next" system. The heap is a vector where, for the element at index i, its children are at 2i+1 and 2i+2, and the parent is at (i-1)/2. push appends and sifts up; pop takes the root, moves the last element to the root, and sifts down.

value 42

size

···

top

···

last op compares

···

A binary max-heap stored in a vector. Push and pop are O(log n) (one sift-up or sift-down through the tree). Building a heap from a vector with std::make_heap is O(n), not O(n log n), because of the way work concentrates at the leaves^[27].

The standard library's priority queue is a max-heap by default. For Dijkstra and A*, where you want the smallest distance first, pass std::greater<T> as the comparator:

astar.cpp · min-heap for shortest-path search

struct PathNode {
  NodeId nodeId;
  float  estimatedTotalCost;        // f = g + h: A* total estimate through this node
};
struct SmallerFirst {
  bool operator()(const PathNode& a, const PathNode& b) const {
    // priority_queue is a max-heap by default; we invert the comparator
    // so the smallest cost ends up at the top.
    return a.estimatedTotalCost > b.estimatedTotalCost;
  }
};

std::priority_queue<PathNode, std::vector<PathNode>, SmallerFirst> openSet;

openSet.push({startNodeId, heuristic(startNodeId, goal)});
while (!openSet.empty()) {
  auto currentNode = openSet.top();   // O(1) lookup of best frontier node
  openSet.pop();                       // O(log n) re-heap
  if (currentNode.nodeId == goal) break;
  for (NodeId neighbor : neighborsOf(currentNode.nodeId))
    openSet.push({neighbor, gCostSoFar(neighbor) + heuristic(neighbor, goal)});
}

priority_queue's missing API

std::priority_queue does not let you change the priority of an element in place; the standard exposes no "decrease-key" operation. A Dijkstra implementation has to push a fresh entry with the lower cost and lazily skip stale ones on pop. For workloads where decrease-key is on the hot path (graph search with many revisited nodes), reach for a Fibonacci heap, a pairing heap, or an indexed binary heap; std::priority_queue is the wrong tool^[27].

15`std::span` and `std::string_view`: non-owning views

Two C++20 (and one C++17) additions that aren't really containers, but they're how you pass containers around without copying. They are both views: a pointer plus a size, owning nothing, valid only as long as the underlying storage outlives them^[28].

std::span<T> (C++20). A view of a contiguous range of T. Construct it from a std::vector<T>, a std::array<T, N>, a C array, anything contiguous. The function on the other side gets a uniform interface to all of them.
std::string_view (C++17). The same idea, specialized for character ranges. Has the methods that make string handling pleasant (find, substr, starts_with, ends_with) without owning the bytes.

Use them as parameter types whenever a function takes a contiguous range and doesn't need to extend its lifetime. The function gets to work with any contiguous source without overloading on the container type, and the caller doesn't pay for a copy:

span_param.cpp · the right shape for read-only sequence parameters

// Old way: one overload per container type, each with a copy if you
// took the parameter by value, or coupling to a specific container if
// you took it by reference. Both unsatisfying.
void drawTriangles(const std::vector<Vertex>& vs);
void drawTriangles(const std::array<Vertex, 64>& vs);
void drawTriangles(const Vertex* vs, size_t count);

// Modern way: one function, takes a view, the caller passes whatever
// they have. No copies, no overload set, no error-prone count parameter.
void drawTriangles(std::span<const Vertex> vertices);

// Call sites work uniformly:
drawTriangles(meshVertexVector);             // vector → span
drawTriangles(framePushArray);               // std::array → span
drawTriangles({rawPtr, count});              // raw pointer + count → span

Views don't own anything

A view is a pointer plus a length. The underlying storage has to outlive the view, or the pointer dangles. The single most common bug with std::string_view is constructing it from a temporary std::string and storing the view past the temporary's lifetime. Both Clang and MSVC ship lifetime-bound warnings ([[clang::lifetimebound]], /W4) that catch many of these statically; turn them on^[28].

16Beyond `std`: flat_map, hive, intrusive lists, ring buffers

The standard's container set is the default, not the ceiling. A short tour of the non-std containers most game engines pick up:

flat_map and flat_set (C++23 / Boost.Container)

A sorted vector of {key, value} pairs, wrapped in an associative interface. Lookup is O(log n) binary search on contiguous memory; insertion and middle-erase are both O(n) because everything after the change point shifts. The trade is: insertion and middle-erase are slower than the tree's O(log n), lookup and iteration are faster, memory overhead is lower. Lookup-heavy and small-to-medium key sets are the sweet spot. C++23 standardized it as std::flat_map and std::flat_set; Boost has had it since 2003^[4].

std::hive / plf::colony (C++26)

A bucket array with stable references on insert and erase. Each block of slots holds a fixed number of elements plus a skipfield that records which slots are erased. Iteration walks the skipfield, jumping over erased slots in constant time. Insert reuses an erased slot if one exists, else appends. Erase marks the slot and bumps the skipfield, leaving every pointer to other elements valid^[12].

Matt Bentley designed it for entity-component-system code where game objects come and go thousands of times per frame and other systems hold pointers into the live pool. Standardized as std::hive in C++26.

Intrusive containers (Boost.Intrusive, EASTL)

A list or tree where each element carries the link pointers, not the container. The container is just a head pointer; the user manages the storage. Two payoffs: no per-node heap allocation (the link state lives inside the element you already own), and an element can be in multiple intrusive containers at once. EASTL's intrusive_list and Boost.Intrusive's list are the canonical implementations^[5]. Engine schedulers, free-lists, and ECS sparse sets all use this pattern.

Ring buffers (custom, no std equivalent)

A fixed-capacity contiguous array with two indices (head and tail) that wrap around. Push and pop are both O(1) with no allocation; the size is a hard cap, not a soft one. Used wherever a bounded queue makes sense: audio mix ringbuffers, replay event buffers, lock-free job queues. The standard committee has not landed on a single ring-buffer design because the variants don't reduce to one shape: bounded vs growing, SPSC vs MPMC, full-on-overwrite vs full-on-block. P0260's "concurrent queues" effort has been in flight for years without settling.

SmallVector / fixed_vector (LLVM, EASTL)

A vector with an inline buffer of N elements; falls back to heap allocation when N is exceeded. The first N pushes touch no heap at all, which matches the workload pattern for "build a small list inside a function and discard it" code. LLVM's llvm::SmallVector<T, N> is the most widely cited implementation^[15], and EASTL ships fixed_vector<T, N> in the same shape.

17Allocators and why games override them

Every standard container takes an allocator template parameter. The default is std::allocator<T>, which forwards to ::operator new and ::operator delete. On most platforms that's a thread-safe general-purpose allocator (glibc malloc, jemalloc, tcmalloc), and it's the right answer for most code.

Game engines override it for three reasons^[5]:

Memory budget. Each subsystem gets a fixed pool. The renderer's containers allocate out of the renderer's pool; the AI's out of the AI's. Out-of-memory is a budget violation that's tractable to debug, not a process-wide crash.
Frame allocators. A linear-bumping arena that resets at the start of each frame. Allocating from it is one pointer increment; freeing is a no-op until the frame ends. Per-frame container allocations are essentially free if you wire this up.
Alignment. SSE/AVX/NEON vector types want 16- or 32-byte alignment. The default allocator returns whatever the platform's malloc guarantees: glibc x86-64 gives 16 bytes for objects ≥ 32 bytes and 8 bytes otherwise; Windows and macOS guarantee 16 bytes unconditionally. None of those satisfy a std::vector<__m256> on x86, which wants 32-byte alignment for AVX loads. Game code that ships DirectXMath's XMVECTOR or Unreal's VectorRegister in containers has to supply an overaligned allocator, because std::allocator doesn't.

C++17 added std::pmr (polymorphic memory resources), which lets you pick an allocator at runtime instead of at type-instantiation time. std::pmr::vector<int> is a vector of int that takes its allocator as a constructor argument. This is the modern way to integrate engine pools without making every container a different type. The standard ships std::pmr::monotonic_buffer_resource (a linear arena) and std::pmr::unsynchronized_pool_resource (a per-thread pool) out of the box.

pmr_frame.cpp · per-frame linear arena, no malloc on the hot path

// One frame's worth of scratch storage. Reused every frame.
static std::array<std::byte, 4 * 1024 * 1024> frameArenaStorage;
std::pmr::monotonic_buffer_resource frameArena(
    frameArenaStorage.data(), frameArenaStorage.size());

void simulateFrame() {
  frameArena.release();    // O(1) "free everything from last frame"

  // Containers allocate inside the arena. Each push_back is one
  // pointer bump in the arena, not a malloc.
  std::pmr::vector<VisibleEntity> visibleEntities(&frameArena);
  visibleEntities.reserve(1024);

  for (const Entity& entity : worldEntities)
    if (mainCameraFrustum.contains(entity.bounds))
      visibleEntities.push_back({entity.id, entity.transform});

  submitDraws(visibleEntities);
  // visibleEntities goes out of scope here. For trivially destructible
  // element types, no destructors run; the buffer's deallocate() on a
  // monotonic arena is a no-op, so the whole teardown is free until
  // frameArena.release() reclaims the arena at the top of next frame.
}

18The container race

One widget pushes a million elements through each of the five sequence containers, then iterates them. The numbers are JavaScript estimates with a calibration factor, not native C++ throughput, but the ratios match published native benchmarks within a factor of two. Pick a workload and a size and watch where each container falls:

N elements 10000

vector ms

···

deque ms

···

list ms

···

map ms

···

unord ms

···

Random access shows the biggest gap. vector is one indexing operation, deque is two, list doesn't support it at all (the JS demo walks from begin() per query, which is what real std::list code would have to do), map is a logarithmic tree walk, unordered_map is a hash plus a bucket walk. The same N, the same query stream; depending on the workload the spread runs from a small constant factor (push N) to one or two orders of magnitude (iterate) to several orders of magnitude (random index, where list's O(n)-per-access loses badly).

19The decision cheat sheet

A one-page summary, organized by the question you're actually asking when you reach for a container:

If you need…	Reach for	Why
A sequence of T, default case	`std::vector<T>`	Contiguous, cache-friendly, O(1) push_back amortized, the right answer for the bulk of sequence workloads^[16].
A fixed-size sequence, no heap allocation	`std::array<T, N>`	Size is part of the type. Lives inline in its owner.
A small bounded sequence, usually fits in N, sometimes overflows	`llvm::SmallVector<T, N>` or `eastl::fixed_vector`	Inline buffer for N, heap fallback past N. Zero allocations on the common path.
O(1) push and pop at both ends	`std::deque<T>`	Chunked storage means front pushes don't shift everything.
O(1) splice between containers, or stable pointers across all mutations	`std::list<T>` or `std::hive<T>` (C++26)	Node-based. Pointer to element X survives any operation that doesn't touch X. Hive is the cache-friendly version.
Lookup by key, frequent inserts and erases, doesn't fit in a flat_map	`absl::flat_hash_map<K, V>` or `folly::F14FastMap<K, V>`	Open-addressing SIMD-scan, 2-3× faster than `std::unordered_map` on the same workload^[9].
Lookup by key, small to medium static-ish key set, read-heavy	`std::flat_map<K, V>` (C++23) or `boost::container::flat_map`	Sorted vector. Binary search on contiguous memory beats hashing at small N^[4].
Ordered iteration, range queries by key	`std::map<K, V>`	Red-black tree, sorted iteration is automatic, `lower_bound`/`upper_bound` work.
LIFO behavior, want to expose only push/pop/top	`std::stack<T, std::vector<T>>`	Vector-backed stack. Default of `deque` is fine but the vector version has less per-op overhead.
FIFO behavior	`std::queue<T>`	Default `deque` backing is the right answer. For bounded lock-free, a ring buffer is faster.
"Do the highest-priority job next"	`std::priority_queue<T>`	Vector + binary heap. O(log n) push and pop, O(1) top.
A read-only view of a contiguous range	`std::span<const T>` / `std::string_view`	Pointer + size. No copy, no allocation, no template-on-container.

20The container pitfalls list

Mistakes that ship to production every year, organized by the container that produces them:

vector pitfalls

vector<bool> is not a container of bool. The standard permits a packed representation (one bit per element) and explicitly carves out the contiguous-storage and ordinary-reference guarantees so implementations can pack; every shipping implementation does. operator[] returns a proxy, &v[0] isn't a bool*, and the type doesn't satisfy the normal sequence-container requirements^[19]. Use std::vector<char>, std::vector<std::uint8_t>, or std::bitset.
Iterators across push_back are dangling. One of the most common iterator-invalidation bugs in C++ container code. Take a copy of the index instead, or call reserve() first to fix the capacity in advance.
shrink_to_fit() is a hint. It is allowed to do nothing. If you need the memory back, swap with an empty vector: std::vector<T>(v.begin(), v.end()).swap(v).

deque pitfalls

MSVC's deque collapses to one element per chunk for any element larger than 8 bytes. A deque<MyStruct> on MSVC is essentially one heap allocation per element, with all the cache misses that implies. The libstdc++ and libc++ designs cap the chunk size by bytes (512 and 4096 respectively) and behave much more predictably as the element size grows^[22].
Iteration has a per-chunk-boundary cost. Each increment checks whether we're at a chunk boundary; within a chunk the branch is predictable, but every chunk transition adds a mispredict plus an indirect load. For hot inner loops, copy to a vector first.

list pitfalls

It almost always isn't the right answer. Reach for it only when you have a real splice or a real reference-stability requirement. See §10.
size() was O(n) in C++03 and is O(1) in C++11. The committee changed it specifically so callers could stop pre-caching the size. If your code targets a really old library, this may still bite you.

map pitfalls

m[key] default-constructs the value if the key isn't present and writes the new entry. A "lookup" that mutates the map. Use find or contains (C++20) for "is it there?" checks.
String keys with no transparent comparator allocate on every lookup. Pre-C++14, m.find("foo") on a map<std::string, V> constructs a temporary std::string. Use std::less<> as the comparator to get a transparent compare^[29].

unordered_map pitfalls

The default hash for std::string is fine; the default for int is the identity. Bucketing by hash(k) % bucket_count means an identity hash collides every key that differs in high bits. For workloads with structured integer keys (especially indices), wrap the key in a hashed type or pass a custom hasher^[25].
Insertion may rehash, which invalidates all iterators (but not pointers/references). Code that inserts in a loop with iterators outstanding has to call reserve() first, just like vector^[20].
It is the slow hash table. If a hash table shows up in your profile, swap in absl::flat_hash_map or folly::F14FastMap. The interface is source-compatible enough for most call sites, but the iterator-invalidation rules differ: flat storage means rehash invalidates pointers and references too, not just iterators. Check your reference-holding code before swapping.

21What's next

The decision shape: pick the layout (one block, chunked, node-based, hashed), pick the access pattern, the container follows. The cheat sheet in §19 has the lookup table; the rest of this tutorial is the reasoning behind each row.

Past this tutorial:

The C++ memory model tutorial covers the atomics and orderings you need to make concurrent containers behave. Lock-free queues use ring buffers and rely on the orderings from that tutorial.
The job-systems tutorial uses work-stealing deques, which are concurrent variants of the deque from §8. The deque shape was specifically chosen to make stealing efficient.
The Abseil hash containers (flat_hash_map, node_hash_map) and Folly's F14 are the production-grade open-addressing implementations. Both ship benchmarks; read the F14 design doc for the trade-offs^[11].
For deep production guidance, EASTL's source is a guided tour of game-grade STL: every container is rewritten with allocator-awareness, fixed-storage variants, and intrusive nodes^[5].

22Sources & further reading

Numbered citations refer to the superscripts above. Primary sources only: standards, vendor documentation, published papers, talk transcripts.

A note on originality

The prose, code samples, CSS, and interactive demos on this page are original. The vector-vs-list framing follows Stroustrup's Going Native 2012 keynote [1]. The SwissTable description follows Matt Kulukundis's CppCon 2017 talk [9] and the Abseil design page [10]; the "why exactly 7 bits for H2" derivation in §13 is the author's reading, not a quote. The growth-factor argument in §6 starts from Folly's claim [8] but the explicit golden-ratio derivation (r² ≤ r + 1) is laid out here. The F14 description follows Facebook's open-source design doc [11]. The EASTL allocator-model critique follows Paul Pedriana's N2271 paper [5]; the three game-engine-specific allocator concerns in §17 (budget, frame arena, SIMD overalignment) and their DirectXMath/Unreal grounding are the author's framing. All deque chunk-size numbers were verified against the actual libstdc++, libc++, and MSVC STL source. All complexity guarantees and iterator-invalidation rules match cppreference and the published C++ standard.

Stroustrup, B. (2012). C++11 Style. Going Native 2012 keynote. YouTube. Includes the vector-vs-list sorted-insertion benchmark and the "cache effects dominate" argument; follow-up at isocpp.org/blog/2014/06/stroustrup-lists.
Stevens, A. (1995). Alexander Stepanov and STL Interview. Dr. Dobb's Journal. stlport.org. The history of how generic programming led to the STL container set.
ISO/IEC 14882:2020 (and subsequent revisions). Programming languages — C++. Standard reference for container complexity guarantees, sequence/associative requirements, and iterator invalidation. Open draft mirrored at eel.is/c++draft.
Boost. Boost.Container Non-Standard Containers: flat_(multi)map / flat_(multi)set. boost.org. The reference implementation of sorted-vector associative containers, dating to 2003.
Pedriana, P. (2007). EASTL: Electronic Arts Standard Template Library. WG21 paper N2271. open-std.org. The committee-facing case for the allocator, intrusive, and fixed-storage container set that game code needs.
cppreference. std::array. en.cppreference.com/w/cpp/container/array. Fixed-size sequence, no heap allocation, inline storage.
Muñoz, J. D. (2022). Advancing the State of the Art for std::unordered_map Implementations. Overload journal, ACCU. accu.org. Why the bucket API locks std::unordered_map into closed addressing, and what implementers can still optimize within that constraint.
Facebook / Meta. FBVector — Folly C++ Library. github.com/facebook/folly. The 1.5× growth factor, jemalloc cooperation, and trivial-relocation optimization that distinguish fbvector from std::vector.
Kulukundis, M. (2017). Designing a Fast, Efficient, Cache-friendly Hash Table, Step by Step. CppCon 2017. YouTube. The original public description of SwissTable: H1/H2 split, 16-slot SIMD groups, 1-byte control words, 2-3× faster than std::unordered_map.
Google. Abseil Swiss Tables and absl::Hash. abseil.io/blog/20180927-swisstables. Open-sourcing announcement and integration with Abseil's hash framework.
Facebook / Meta. F14 Hash Table. github.com/facebook/folly. 14-key chunks, three storage variants (Value/Node/Vector), engineering announcement at engineering.fb.com.
Bentley, M. Introduction of std::hive to the standard library. WG21 paper P0447 (latest revision R28). open-std.org/p0447r26; tracker at cplusplus/papers#328. The standardized bucket-array container, evolved from plf::colony; reference implementation at github.com/mattreecebentley/plf_hive. Adopted for C++26.
Hennessy, J. L., & Patterson, D. A. Computer Architecture: A Quantitative Approach, 6th ed. Morgan Kaufmann (2017). The canonical citation for cache-hierarchy latency ratios and the cycle counts in §3.
Intel. Intel® 64 and IA-32 Architectures Optimization Reference Manual. intel.com. Chapter on memory hierarchy and the hardware prefetcher's pattern requirements.
LLVM. llvm::SmallVector documentation. llvm.org. The reference inline-buffer-with-heap-fallback design.
Stroustrup, B. (2013). The C++ Programming Language, 4th ed. Addison-Wesley. §31.4: "Use vector as your default container." The recommendation that std::vector is the right default; the cppcoreguidelines mirror it.
C++ Core Guidelines Working Group. SL.con.1: Prefer using STL array or vector instead of a C array. isocpp.github.io.
GCC libstdc++. stl_vector.h source. github.com/gcc-mirror/gcc. Three-pointer layout (begin / end / end-of-storage), 2× growth factor.
cppreference. std::vector. en.cppreference.com/w/cpp/container/vector. Complexity table, iterator-invalidation rules, vector<bool> note.
cppreference. std::unordered_map. en.cppreference.com/w/cpp/container/unordered_map. Bucket API, complexity, rehash and invalidation rules.
cppreference. std::deque. en.cppreference.com/w/cpp/container/deque. Chunked-array implementation note, indexing complexity.
Chen, R. (2023). Inside STL: The deque, implementation. The Old New Thing, Microsoft DevBlogs. devblogs.microsoft.com. Side-by-side description of libstdc++, libc++, and MSVC's deque chunk sizes.
cppreference. std::list. en.cppreference.com/w/cpp/container/list. Doubly-linked layout, O(1) splice, reference-stability guarantee.
Chen, R. (2023). Inside STL: The map, set, multimap, and multiset. The Old New Thing, Microsoft DevBlogs. devblogs.microsoft.com. Confirms red-black tree as the implementation strategy across libstdc++, libc++, and MSVC.
Aumasson, J.-P., & Bernstein, D. J. (2012). SipHash: a fast short-input PRF. aumasson.jp/siphash/siphash.pdf. The keyed-hash design that became the default for hash-flooding-resistant hash tables (Rust, Python 3, Ruby).
Google. Swiss Tables Design Notes. abseil.io/about/design/swisstables. H1/H2 hash split, control-byte layout, SIMD match details.
cppreference. std::priority_queue. en.cppreference.com/w/cpp/container/priority_queue. Binary-heap backing, O(log n) push/pop, no decrease-key.
cppreference. std::span. en.cppreference.com/w/cpp/container/span. C++20 non-owning view, lifetime requirements.
cppreference. std::map. en.cppreference.com/w/cpp/container/map. std::less<> transparent comparator details under C++14.
Le, Z., et al. A Standard flat_map. WG21 paper P0429R9. open-std.org/p0429r9. The paper that brought std::flat_map and std::flat_multimap into C++23.
Zhihao Yuan. A Standard flat_set. WG21 paper P1222. wg21.link/p1222r4. The companion paper for std::flat_set and std::flat_multiset in C++23.
Sutter, H. (2014). GotW #90 Solution: Factories. herbsutter.com. Touches the "vector by default" recommendation in the context of return-type design.
Alexandrescu, A. (2014). Optimization Tips — Mo' Hustle Mo' Problems. CppCon 2014. YouTube. Small-string and small-vector optimizations behind FBString and FBVector.
Bentley, M. (2017). Colonies, performance and why you should care. CppCon 2016. YouTube. The original public talk on plf::colony, designed for game-engine entity pools.
Celis, P., Larson, P.-Å., & Munro, J. I. (1985). Robin Hood Hashing. FOCS 1985. semanticscholar.org. The variance-reducing open-addressing technique that several modern hash tables build on.
Stepanov, A., & Lee, M. (1995). The Standard Template Library. HP Labs Technical Report HPL-95-11. stepanovpapers.com. The original STL design document.
Bronson, N., & Shi, X. (2019). Open-sourcing F14 for faster, more memory-efficient hash tables. Facebook Engineering. engineering.fb.com. Source for the 14-slot chunk choice ("filters 14 slots at once"), the SSE2/NEON intra-chunk SIMD compare, the 12/14 max load factor, and the chunk-overflow-count mechanism. The F14 design markdown in the Folly repository (F14.md) gives the chunk-layout details (14 tag bytes + 2 metadata bytes = one 16-byte aligned block).

01Why a container choice is a performance decision

The framing for the rest of the tutorial

02A short history of getting containers right

03The memory hierarchy the container sits on

04std::array: fixed size, on the stack

05std::vector: the workhorse

The four guarantees that matter

06Growth, amortization, and the cost of push_back

Why 1.5× sometimes beats 2×

07Iterator invalidation: the rule you can't forget

08std::deque: chunked, two-ended

09std::list and std::forward_list

10List vs vector, measured

11std::map and std::set: ordered associative

When to use map, and when not

12std::unordered_map and the bucket interface

Hash quality matters more than you think

13Open addressing and SwissTable

Where the time goes

14The adaptors: stack, queue, priority_queue

15std::span and std::string_view: non-owning views

16Beyond std: flat_map, hive, intrusive lists, ring buffers

flat_map and flat_set (C++23 / Boost.Container)

std::hive / plf::colony (C++26)

Intrusive containers (Boost.Intrusive, EASTL)

Ring buffers (custom, no std equivalent)

SmallVector / fixed_vector (LLVM, EASTL)

17Allocators and why games override them

18The container race

19The decision cheat sheet

20The container pitfalls list

vector pitfalls

deque pitfalls

list pitfalls

map pitfalls

unordered_map pitfalls

21What's next

22Sources & further reading

See also

04`std::array`: fixed size, on the stack

05`std::vector`: the workhorse

06Growth, amortization, and the cost of `push_back`

08`std::deque`: chunked, two-ended

09`std::list` and `std::forward_list`

11`std::map` and `std::set`: ordered associative

12`std::unordered_map` and the bucket interface

14The adaptors: `stack`, `queue`, `priority_queue`

15`std::span` and `std::string_view`: non-owning views

16Beyond `std`: flat_map, hive, intrusive lists, ring buffers