All tutorials Mighty Professional
Tutorial 07 ยท Engine Programming

C++ Containers

Eight cores. A 64-byte cache line. A 70-cycle gap between an L1 hit and an L3 hit, another 200 cycles to DRAM. The container that wins on paper is rarely the one that wins on hardware, because the data structure is the layout in memory and the layout decides whether the CPU is fed or stalled. We walk every standard container the engine actually ships with: std::array, std::vector, std::deque, std::list, std::map, std::unordered_map, the adaptors, the views. Memory pictures, complexity proofs, and a live race that makes the 50ร— gap between vector and list visible.

Time~55 min LevelMid C++ engineer PrereqsYou've written a for loop over a vector. You've heard of big-O. The memory-model tutorial is useful if you want to know why unordered_map can't be lock-free. HardwareSome idea that DRAM is slower than cache

01Why a container choice is a performance decision

Big-O is a description of how a data structure scales. It is not a description of how fast that data structure is. A linked list and a sorted vector both report O(n) insertion in the middle, and on hardware built in 1985 they performed about the same. On hardware built since 2005 the vector wins by a margin so wide it stops being an algorithm question and becomes a memory-layout question. Bjarne Stroustrup spent an entire Going Native 2012 keynote making this point with a stopwatch[1], and a decade later "default to vector" is the conventional advice in most modern engine codebases. For sequence containers, the layout in memory is most of the performance story; complexity, reference stability, and lookup pattern decide the rest.

The widget below makes that visible. The left column reads N integers from a contiguous block (a std::vector<int>). The right column reads N integers from nodes scattered across the heap (a std::list<int>). Both touch the same number of bytes. Both report O(n) traversal. The cache walker shows the difference: the vector reads sweep through cache lines in order; the list reads jump between unrelated lines, paying a cache miss on most accesses:

Live ยท Cache-line walker ยท vector vs list
vector misses
ยทยทยท
list misses
ยทยทยท
ratio
ยทยทยท
A 64-byte cache line holds 16 ints. The vector's linear sweep pays one cache miss per 16 elements; the list pays one per node, because nodes from a general-purpose allocator usually land on unrelated lines. The "ratio" stat is the multiplier between them, and it grows with N until DRAM bandwidth becomes the vector's dominant bottleneck.
What you'll have by the end

A working mental model of every standard C++ container: what it actually looks like in memory, what its complexity guarantees promise (and what they don't), and which one to reach for in a given gameplay system. Plus a live container race that lets you push a million elements through each, a hash-table animator that flips between separate chaining and SwissTable-style open addressing, and a cheat sheet you can paste into your engine's contributing guide.

The framing for the rest of the tutorial

Every container in this tutorial answers four questions, and you can mostly predict its performance from the answers:

Answer those four for each container and the rest of the standard-library reference page reduces to small print.

02A short history of getting containers right

The C++ standard library's container set didn't drop fully formed in 1998. Its shape is the cumulative answer to twenty-five years of "this is faster on real hardware" arguments, each of which forced the committee or a library author to write the better data structure down. A short tour, because the modern set only makes sense in context:

1979
Alex Stepanov starts the generic-programming project. While at GE Research, Stepanov begins formalizing iterators and algorithms separately from the data structures they operate on. The C++ Standard Template Library, written with Meng Lee in 1993-1994, is the eventual product[2]. The structural decision that sequence containers, associative containers, and algorithms are three orthogonal axes (you mix and match) dates from here.
1994
STL accepted into ISO C++. The first standardized container set ships: vector, deque, list, set, map, multiset, multimap, plus the stack, queue, and priority_queue adaptors. Associative containers are tree-based; the standard never names a specific tree, but the combination of logarithmic complexity and iterator-stable insert/erase narrows the practical choice to a balanced BST. Every shipping implementation (libstdc++, libc++, MSVC) lands on red-black[3].
2003
Boost.Container ships flat_map. A sorted-vector-backed associative container. Lookup costs the same O(log n) binary search the tree gives you, but the elements live contiguously and the cache loves them. Insertion in the middle is now O(n), which is a price most lookup-heavy code is happy to pay[4].
2007
Paul Pedriana's EASTL paper (N2271).[5] Electronic Arts publishes its in-house STL replacement, documenting every place the standard hurts game code: allocator interface, debug-build cost, alignment, intrusive containers, fixed-storage containers. The committee starts paying attention. Most of EASTL's missing-pieces list eventually lands in std over the next decade.
2011
C++11 ships std::array, std::unordered_map, std::unordered_set.[6] Fixed-size sequence with no heap allocation; hash tables with average-constant lookup. The unordered containers are constrained to closed-addressing (separate chaining) by a bucket API that leaks into the interface[7]. That decision is what every "faster than unordered_map" library is replacing.
2012
Stroustrup, "Going Native" keynote.[1] The "vector vs list, ordered insertion" benchmark that every C++ engineer has seen. The result: vector beats list by orders of magnitude at sizes that fit in cache and DRAM. The keynote popularizes "default to vector" as a rule.
2014
Facebook releases Folly's fbvector.[8] Andrei Alexandrescu's reimplementation of std::vector with a 1.5ร— growth factor in the middle of its size range (so a future reallocation can reuse freed memory), jemalloc-aware sizing, and explicit relocation for trivially-movable types. The reported wins on Facebook's workloads are modest but real.
2017
Matt Kulukundis, "Designing a Fast, Efficient, Cache-friendly Hash Table, Step by Step."[9] The CppCon talk that introduces SwissTable: 16-slot groups, a 1-byte control word per slot with the top 7 bits of the hash, SIMD parallel match. 2-3ร— faster than std::unordered_map on Google's workloads. Open-sourced in Abseil in 2018[10].
2019
Facebook open-sources F14.[11] The 14-key-per-chunk shape, related to SwissTable but with three storage variants (Value, Node, Vector) tuned for different key/value sizes. The convenience alias F14FastMap picks between the Value and Vector layouts at compile time based on entry size; users who need pointer stability reach for F14NodeMap directly.
2023
std::flat_map and std::flat_set adopted for C++23. The boost::container design from 2003, twenty years later, becomes standard. flat_map and flat_multimap arrive through P0429[30]; flat_set and flat_multiset arrive separately through P1222[31]. Lookup-heavy ordered associative containers no longer need a third-party dependency.
2026
C++26 adds std::hive.[12] Matt Bentley's plf::colony, formerly used for game-engine entity pools, becomes the standard's stable-reference bucket array. Erase keeps the slot, iteration skips the gap, pointers survive every mutation that does not touch their specific element.

The recurring pattern: the algorithm is from the 1970s, the cache-friendly layout is from the 2010s, the standardization is a few years behind whatever the big shops have already shipped. By 2026 the standard has closed the sorted-associative and bucket-array gaps (flat_map, hive); the hash-table gap is the conspicuous remainder, since std::unordered_map is still constrained by the C++11 bucket API while the open-addressing variants outside std are 2-3ร— faster. The original 1998 containers are still the right default for most code. The remainder of this tutorial walks them in dependency order: array, vector, deque, list, map, unordered_map, the adaptors, the views, and the post-std alternatives at the end.

03The memory hierarchy the container sits on

Before any specific container makes sense, the picture of the hardware they're trying to feed. A modern CPU spends most of its time waiting for memory:

TierTypical latency (cycles)CapacityRelative cost of a miss
Register~0~32 ร— 8 B per core
L1 data cache~432-48 KB per core
L2 cache~12256 KB-1 MB per core
L3 cache (shared)~404-64 MB per socket
DRAM~2508-128 GB

Cycle counts are Skylake-class x86 rule-of-thumb numbers, drawn from Hennessy and Patterson[13], Intel's optimization manual[14], and Agner Fog's measurements; AMD Zen, Apple M-series, and ARM Neoverse have their own profiles, and server parts with mesh interconnects push L3 and remote-socket DRAM substantially higher. The point is the ratio, not the exact number. A miss to DRAM is roughly 60 times the cost of an L1 hit on a modern desktop core.

The unit the cache moves around is a cache line: 64 bytes on x86-64 and most ARM, 128 bytes on Apple M-series. A line holds 16 ints, or 8 doubles, or 8 64-bit pointers. When the CPU touches a byte that isn't in cache, the hardware pulls the entire line that contains it, and pays the DRAM latency once. Every subsequent byte on the same line is free.

This is the central fact behind every container choice in the rest of the tutorial. A contiguous container amortizes the cost of one DRAM fetch across 16 elements. A node-based container with a general-purpose allocator usually pays that fetch once per element, because consecutive nodes rarely share a cache line. (Custom pool allocators and intrusive containers can close some of the gap; see ยง16.)

What about the hardware prefetcher?

x86 (and most ARM) CPUs have a stream prefetcher that watches recent cache misses and, if they form a regular linear pattern, fetches ahead so the next miss is already on its way. A sequential vector scan triggers it: the prefetcher correctly predicts that the next line is wanted and brings it in before the CPU asks. A list traversal does not trigger it: the next node's address is in the current node's next pointer, and the prefetcher can't follow indirection[14].

This is why the gap between vector and list scales with array size. Below ~32 KB everything fits in L1 and the difference is small. Above L3 the vector keeps up with DRAM bandwidth (10-40 GB/s) while the list bottlenecks on miss latency (~250 cycles per element on a Skylake-class core). Stroustrup's keynote demo[1] reported the vector finishing in roughly 80 seconds while the list took close to two hours on the same hardware at 500K elements, somewhere around 50-100ร— depending on which run you pick. The exact factor depends on element size, allocator behaviour, and the working-set range; the order of magnitude is what's robust.

With that established, the containers come in three structural families. The rest of the sections are organized by family:

  1. Contiguous sequence: array, vector. One block, indexable, cache-loving.
  2. Chunked or node-based sequence: deque (chunked), list and forward_list (one node per element).
  3. Associative: map/set (sorted, node-based, tree); unordered_map/unordered_set (hashed).

Plus the adaptors that wrap one of those (stack, queue, priority_queue) and the non-owning views (span, string_view) that point into someone else's storage.

04std::array: fixed size, on the stack

The smallest, simplest container. A thin wrapper around a C-style array, sized at compile time, no heap allocation, all elements live in the object itself[6]. std::array<float, 3> is exactly three floats in twelve bytes, nothing more. It exists for two reasons: it gives a C array a real STL interface (.begin(), .size(), .data(), etc.), and it makes the size part of the type so the compiler can check it.

array_basics.cpp ยท what's actually in memory
// Three floats, allocated wherever the std::array object lives.
// If 'positions' is a local, it's on the stack; if it's a member of a class
// allocated on the heap, it lives in that class's heap block. No
// separate allocation either way.
std::array<float, 3> positions = {0.0f, 1.0f, 2.0f};

// sizeof(positions) is exactly 12. There's no length field; the size is
// encoded in the type so the compiler always knows it.
static_assert(sizeof(positions) == 3 * sizeof(float));

// The interface is std-container-shaped: iterate, index, ask the size.
for (float& element : positions) element *= 2.0f;

// data() returns a pointer to the first element, the same way a C array decays.
// Useful when calling C APIs (OpenGL, Vulkan, sockets) that want a raw pointer.
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, positions.data());

Use it whenever the size is a compile-time constant and the container is small enough that copying it by value is cheap. Vertex positions, RGBA colors, 4ร—4 matrix rows, hash digests, fixed-channel audio mixes, lookup tables of known size. Anywhere a C programmer would have written float v[3];.

When std::array is the wrong answer

The size is part of the type. std::array<int, 3> and std::array<int, 4> are different types and you can't assign one to the other. If the size of the data is "small but not fixed" (e.g. up to 16 entries in a small-buffer-optimized list), reach for a SmallVector from EASTL or LLVM[15], not std::array with the maximum size and a separate length counter.

05std::vector: the workhorse

The default. Stroustrup's advice in The C++ Programming Language, fourth edition, is "use vector as your default container"[16] and the C++ Core Guidelines repeat it[17]. The reasoning is exactly the cache argument from ยง1: a contiguous block of memory is the friendliest layout for the CPU, and most workloads don't need more than that.

A std::vector is three pointers (or equivalents) in the object itself: begin, end, and end-of-storage[18]. The actual elements live in a single contiguous heap block. size() is end - begin; capacity() is end_of_storage - begin. Indexing is one pointer arithmetic plus one load. Iteration is a pointer walk that the hardware prefetcher loves.

Diagram ยท vector in memory
Three pointers, one block. size() is the number of constructed elements; capacity() is how much storage is reserved before the next reallocation. The gap between them is "free room"; push_back uses it without allocating.

The four guarantees that matter

The catch is the third bullet's twin: every reallocation invalidates every pointer, reference, and iterator into the vector, and inserts that don't grow the vector still invalidate everything from the insertion point to the end. The next two sections take each in turn.

06Growth, amortization, and the cost of push_back

When a vector's size() reaches its capacity(), the next push_back has to allocate a bigger block, move every existing element into it, and free the old one. That's a linear-time operation. Doing it on every push would make push linear-time in steady state, which is terrible. The standard library hides this with amortization: it grows the block by a constant factor (typically 2ร— or 1.5ร—) so that the cost of all the reallocations across N pushes is O(N) total, which means the cost per push averages out to O(1)[19].

One push at a time, this widget records the cost. Capacity steps up at doubling boundaries; reallocation events appear as tall red bars; the running average per push converges to a small constant even though individual pushes cost O(size):

Live ยท push_back amortization
total work
ยทยทยท
reallocations
ยทยทยท
avg per push
ยทยทยท
With a 2ร— growth factor (libstdc++ and libc++ default for push_back) the geometric series 1 + 2 + 4 + ... + N sums to about 2N, so the average cost per push converges toward a constant near 2. The 1.5ร— growth factor (MSVC, and Folly's fbvector in its middle size range) reaches a slightly higher constant per push but lets a future reallocation reuse the just-freed block, which the 2ร— growth cannot do[8].
eq. 1 ยท amortized push_back cost cost(N pushes) = 1 + 2 + 4 + โ€ฆ + 2k = 2N โˆ’ 1

Geometric growth turns N pushes into about 2N total work, which is O(N) and gives O(1) amortized per push. The word amortized is doing real work: any individual push that triggers a reallocation is O(size), not O(1). Code that needs a hard real-time bound on a single push has to call reserve() first.

Why 1.5ร— sometimes beats 2ร—

Set the growth factor to r and call the initial capacity 1. After k growths the sum of every previously freed block is 1 + r + rยฒ + โ€ฆ + rk-1 = (rk โˆ’ 1) / (r โˆ’ 1). The block being allocated for the next growth needs rk units. Ask when the new allocation fits inside the cumulative freed region: rk โ‰ค (rk โˆ’ 1) / (r โˆ’ 1), which simplifies to rยฒ โ‰ค r + 1. The largest r satisfying that inequality is the golden ratio, ฯ† โ‰ˆ 1.618. Any factor below ฯ† eventually fits, so the allocator can reuse the just-freed addresses; r = 2 sits permanently above ฯ† and never can.

That's the case for picking a factor below 2. The case against is that smaller factors mean more reallocations, so the average per-push cost rises. The shipping picks split the difference: MSVC uses 1.5ร—, Folly's fbvector uses 1.5ร— in the middle of its size range and 2ร— at the small and very large ends[8], libstdc++ and libc++ stay at 2ร—. Whether the in-place reuse actually happens depends on the allocator; jemalloc handles it well, glibc's ptmalloc less so.

In practice, if you know how big the vector will get, the cure for all of this is one line:

reserve.cpp ยท the one line that fixes most push_back perf
std::vector<EntityId> visibleEntities;

// We know roughly how many entities will be visible this frame.
// One allocation up front beats log2(N) reallocations during the loop.
visibleEntities.reserve(estimatedVisibleCount);

for (const Entity& entity : worldEntities) {
  if (frustum.contains(entity.bounds))
    visibleEntities.push_back(entity.id);    // no reallocation until we exceed the reservation
}

07Iterator invalidation: the rule you can't forget

Every container has a set of "things that, if you do them, will invalidate some references into the container." Get it wrong and you have a dangling pointer that the compiler can't catch. The rules are precise and worth memorizing for the containers you use most:

ContainerWhat invalidates iteratorsWhat invalidates pointers/references
vector Anything that triggers a reallocation invalidates all iterators. Insert or erase invalidates iterators from the change point to end(). Same as iterators. References are pointers in disguise here.
deque Insert or erase anywhere except the front/back invalidates all iterators. Insert at front or back leaves all pointers/references valid; insert in the middle invalidates everything.
list, forward_list Only the iterators to erased elements are invalidated. Only the pointers to erased elements. Everything else survives.
map, set, multi* Same as list: only the erased ones. Same as list: only the erased ones.
unordered_map, unordered_set Insert may trigger a rehash, which invalidates all iterators. Erase invalidates only the erased iterator. Insert with no rehash leaves pointers/references valid; rehash leaves pointers/references valid (only iterators are invalidated). Erase invalidates only the erased element's references.

Adapted from the cppreference invalidation tables[19][20] and the formal ISO wording[3]. The "pointer/reference vs iterator" distinction for unordered_map rehash is subtle and worth confirming on the cppreference page for any version you target.

Each button on the widget below performs one operation on a small vector. Two pinned iterators sit at fixed positions; they turn red the instant they go dangling. Every reallocation flips every iterator; every middle-insert flips everything from the insertion point onward.

Live ยท iterator invalidation
size
3
capacity
4
itโ‚ status
valid
itโ‚‚ status
valid
A reallocation moves the entire block, so every iterator into the old block becomes a pointer into freed memory. Even an insert that doesn't reallocate (capacity sufficient) shifts every element after the insertion point, so iterators that pointed there are now off by one. Code that holds an iterator across a mutation is a UB hazard waiting for a test to find it.

08std::deque: chunked, two-ended

std::deque ("double-ended queue") gives you push_front and pop_front at O(1), which vector doesn't. It does this by storing elements in a sequence of fixed-size chunks, with a top-level index ("map") of pointers to those chunks[21]. Inserting at the front allocates a new chunk and points the map at it; the existing chunks don't move. Indexing is two loads (chunk pointer, then element) instead of one.

Implementations pick a chunk size, and the picks differ wildly[22]:

Pictured below: a generic libstdc++-style layout. Push at either end and a new chunk is allocated when the current end fills; existing chunks never move.

Diagram ยท deque chunk map
elements
ยทยทยท
chunks
ยทยทยท
chunk size
8 elems
Two-level structure: a map of pointers, each pointing to a fixed-size chunk. Iteration walks within a chunk (cache-friendly), then jumps to the next chunk (a small cache miss). push_front never moves existing elements; it just allocates a new chunk at the front. This is why std::deque is the natural backing store for a FIFO queue.

Use it when you need O(1) push and pop at both ends, when reference stability after front/back insertion matters, or when you want to grow without periodic 2ร— reallocations. Common production uses:

deque is not always cache-friendly

The chunked layout means iteration sweeps each chunk sequentially (good) and then jumps to the next chunk (one cache miss per chunk boundary). With a 512-byte libstdc++ chunk and 8-byte doubles, that's 64 elements per chunk and one miss per 64. Still far better than a list's one miss per element, and within a small constant of a vector's one per 16. On MSVC with its one-element-per-chunk policy for anything larger than 8 bytes, the picture is much worse, and the resulting deque behaves more like a list than a vector. For pure-iteration workloads on libstdc++ or libc++, std::vector typically wins by a small constant; for two-ended access, deque wins because the alternative is a vector with O(n) insert(begin()).

09std::list and std::forward_list

A doubly-linked list. Each element lives in its own heap allocation, plus two pointers (prev and next). std::forward_list drops the prev pointer to halve the per-node overhead, at the cost of being singly-linked[23].

list has two real selling points in modern C++:

  1. Reference stability under all mutations. A pointer into a node survives every insert and every erase that doesn't touch that specific node. Worth real money in code that wires up long-lived handles to objects in containers.
  2. O(1) splice. list1.splice(pos, list2, it) moves a node from list2 into list1 by relinking pointers, no copies, no allocations[23]. There is no other standard container that lets you move an element between containers in O(1). LRU-eviction caches and ready-queue schedulers both lean on this.

Outside of those two use cases, it is almost always the wrong choice. The cache-miss penalty from ยง1 is the headline; smaller costs add up too:

The next section measures the gap directly.

10List vs vector, measured

Stroustrup's Going Native 2012 benchmark[1]: generate N random integers; for each one, find its sorted position (linear scan from begin()) and insert it there. Repeat for both a vector and a list. The number of comparisons is the same; the number of structural mutations is the same; the constant factor is everything.

Press RACE to run a JavaScript port of the benchmark. The absolute numbers won't match a native C++ run (V8 doesn't have the same heap layout as glibc), but the ratio tracks the C++ result:

Live ยท sorted-insert race
vector ms
ยทยทยท
list ms
ยทยทยท
ratio
ยทยทยท
The vector wins because its sorted-position search sweeps contiguous memory; the list pays a cache miss per node on the same search. The insert phase costs the vector a memmove of cache-resident bytes and the list a pointer relink. Even though the list does less structural work per insert, it pays more total time because the search dominates. The lesson is Stroustrup's: complexity is the same, layout decides[1].
When list actually wins

Two cases survive the cache argument. (a) Splice-heavy workloads where you move elements between containers without iteration: an LRU cache that touches one element per request, then moves that node to the front, is a textbook fit. (b) Code that holds long-lived pointers into the container across many mutations: a list<Job> where job dependency edges store raw Job*. In both cases pre-C++26, the contiguous alternative needs a separate indirection table to provide the reference stability that list gets for free; whether that's a better deal depends on the specifics. As of C++26, std::hive (see ยง16) covers case (b) with a cache-friendlier layout, so the only case that still goes to list by default is the splice-heavy one in (a).

11std::map and std::set: ordered associative

std::map<K, V> is a sorted, balanced binary search tree from keys to values. The standard requires logarithmic-time lookup, insert, and erase, and ordered iteration[3]. The standard never names a specific data structure, but every shipping implementation (libstdc++, libc++, MSVC) uses a red-black tree[24]. AVL trees meet the same complexity bound but rotate more on insert/erase; red-black trades a slightly taller tree for fewer rotations, which is the right balance when erase has to be O(log n) too.

Each element is its own heap-allocated node containing the key/value pair, a parent pointer, two child pointers, and one color bit. That's about 48 bytes of overhead per node on 64-bit platforms, plus the payload. Lookup is a logarithmic walk from the root, branching by key comparison at each node.

Live ยท red-black tree
size
ยทยทยท
height
ยทยทยท
last op
ยทยทยท
Red-black invariants: every leaf is black; the root is black; a red node has black children; every root-to-leaf path has the same number of black nodes. These constraints bound the height at ~2ยทlog2(n), which keeps lookup logarithmic. Insertion rotates and recolors when needed to restore the invariants in O(log n) time[24].

When to use map, and when not

Use std::map when you need ordered iteration (lowest key first, highest key last) or when you need to find ranges by key (lower_bound, upper_bound, equal_range). For example: per-frame replay events keyed by timestamp, where you stream events in time order; or an interval tree by start coordinate.

Do not use std::map as a default associative container. For unordered "find by key" lookups, std::unordered_map is faster on average (constant vs logarithmic) and a flat sorted-vector (std::flat_map in C++23, or boost::container::flat_map before that) is faster than both for read-heavy workloads on small to medium key sets, because the contiguous layout fits the cache[4].

12std::unordered_map and the bucket interface

A hash table. Average-constant-time lookup, insert, and erase; worst-case linear if the hash function is adversarial. C++11 standardized it with a specific implementation strategy baked into the interface: closed addressing (each bucket holds a linked list of entries that hashed to it), exposed through the public bucket_count(), bucket_size(), and local_iterator APIs[20]. That interface choice locks every shipping std::unordered_map into separate chaining, which is why third-party hash tables routinely beat it by 2-3ร—[7].

The widget shows the closed-addressing layout. Each bucket points to a linked list of {hash, key, value} nodes. Lookup hashes the key, picks the bucket, then walks the bucket's list comparing keys. The load factor (size() / bucket_count()) controls how long the bucket chains get; the default max_load_factor is 1.0, so on average each bucket holds one element.

Live ยท unordered_map (separate chaining)
size
ยทยทยท
buckets
ยทยทยท
load factor
ยทยทยท
max chain
ยทยทยท
Separate chaining: each bucket is the head of a linked list. Lookup pays one cache miss to read the bucket, then one per chain node, plus the key comparison. The bucket API in std::unordered_map exposes this enough that implementations can't switch to open addressing without breaking ABI[7].

Hash quality matters more than you think

The hash function decides whether the table runs at O(1) or degenerates to O(n). std::hash<int> on libstdc++ and libc++ is the identity function: hash(42) == 42. That works fine when the bucket count is prime (which it is on libstdc++) and fails badly when keys differ only in bits the bucket count ignores, for example integer indices that all share their low bits. Hash-flooding attacks are real, and the standard library does not defend against them. Production code on internet-facing inputs should use a randomized seed (SipHash, WyHash) or accept the DoS risk[25].

For game code, where the inputs are usually engine-controlled, the standard's default is fine. The bigger issue is what the table costs vs. what it could cost.

13Open addressing and SwissTable

Outside the standard library, almost every modern hash table is open-addressing (no separate buckets; every entry is stored inline in one flat array, and collisions are resolved by probing). The flat layout is the cache argument from ยง1, applied to hash tables: one DRAM fetch covers many adjacent probe positions.

The state of the art is Google's SwissTable[9], open-sourced in Abseil in 2018[10]. The same general shape was adopted by Folly's F14[11], by Rust's standard HashMap (via the hashbrown crate), and by Go's map starting in version 1.24. The chunk size diverges: SwissTable groups 16 slots so the control bytes fill one 128-bit SIMD register and rehashes at load factor 7/8 = 87.5%. F14 groups 14 slots and uses the remaining 2 bytes of the 16-byte aligned block as a per-chunk overflow count: when an insert is displaced past this chunk it is incremented, on erase decremented, and an unsuccessful probe can stop as soon as the count is zero rather than walking a chain of tombstones[37]. F14's max load factor is 12/14 โ‰ˆ 85.7%, slightly lower than SwissTable's; that is the cost of widening the metadata.

The design has four moving parts:

Press "find" on the widget below with a key. H1 picks a group; the 16 control bytes load into a single SIMD register; one compare against the 7-bit H2 of the search key produces the match mask; the matched candidate gets a full key comparison:

Live ยท SwissTable lookup
size
ยทยทยท
groups
ยทยทยท
last probe
ยทยทยท
overhead
1 B / slot
SwissTable's whole trick: scan 16 slots in one SIMD instruction by storing 7 bits of the hash next to each slot. The control word also tags empty and tombstone slots, so the same instruction tells the inserter where to put a new entry. Memory overhead is 1 byte per slot. By contrast, a node-based unordered_map pays one next-pointer and (typically) one cached-hash word per node, plus the allocator header for each malloc'd node[11].

Where the time goes

Closed-addressing std::unordered_map spends most of its lookup time on the two cache misses (bucket pointer, then first chain node) and on the linked-list pointer chase. SwissTable spends most of its time on the one cache miss to load the group's 16 control bytes; the SIMD compare is one instruction, and the key comparison is on a hot cache line. The gap matches the prediction: 2-3ร— faster lookup, ~30% less memory[9][11].

The standard committee is aware of this. P2767, the proposal for a flat-storage std::flat_unordered_map, has been in flight for several revisions. Until it lands, the pragmatic answer for new C++ engine code is to take a dependency on absl::flat_hash_map or folly::F14FastMap and pretend std::unordered_map doesn't exist.

14The adaptors: stack, queue, priority_queue

Three thin wrappers, each constraining an underlying container to a specific access pattern:

The binary-heap layout is worth knowing about because std::priority_queue is a hot data structure in game code: AI A* searches, sound mixer priority, animation event scheduling, every "do the most important thing next" system. The heap is a vector where, for the element at index i, its children are at 2i+1 and 2i+2, and the parent is at (i-1)/2. push appends and sifts up; pop takes the root, moves the last element to the root, and sifts down.

Live ยท binary heap
size
ยทยทยท
top
ยทยทยท
last op compares
ยทยทยท
A binary max-heap stored in a vector. Push and pop are O(log n) (one sift-up or sift-down through the tree). Building a heap from a vector with std::make_heap is O(n), not O(n log n), because of the way work concentrates at the leaves[27].

The standard library's priority queue is a max-heap by default. For Dijkstra and A*, where you want the smallest distance first, pass std::greater<T> as the comparator:

astar.cpp ยท min-heap for shortest-path search
struct PathNode {
  NodeId nodeId;
  float  estimatedTotalCost;        // f = g + h: A* total estimate through this node
};
struct SmallerFirst {
  bool operator()(const PathNode& a, const PathNode& b) const {
    // priority_queue is a max-heap by default; we invert the comparator
    // so the smallest cost ends up at the top.
    return a.estimatedTotalCost > b.estimatedTotalCost;
  }
};

std::priority_queue<PathNode, std::vector<PathNode>, SmallerFirst> openSet;

openSet.push({startNodeId, heuristic(startNodeId, goal)});
while (!openSet.empty()) {
  auto currentNode = openSet.top();   // O(1) lookup of best frontier node
  openSet.pop();                       // O(log n) re-heap
  if (currentNode.nodeId == goal) break;
  for (NodeId neighbor : neighborsOf(currentNode.nodeId))
    openSet.push({neighbor, gCostSoFar(neighbor) + heuristic(neighbor, goal)});
}
priority_queue's missing API

std::priority_queue does not let you change the priority of an element in place; the standard exposes no "decrease-key" operation. A Dijkstra implementation has to push a fresh entry with the lower cost and lazily skip stale ones on pop. For workloads where decrease-key is on the hot path (graph search with many revisited nodes), reach for a Fibonacci heap, a pairing heap, or an indexed binary heap; std::priority_queue is the wrong tool[27].

15std::span and std::string_view: non-owning views

Two C++20 (and one C++17) additions that aren't really containers, but they're how you pass containers around without copying. They are both views: a pointer plus a size, owning nothing, valid only as long as the underlying storage outlives them[28].

Use them as parameter types whenever a function takes a contiguous range and doesn't need to extend its lifetime. The function gets to work with any contiguous source without overloading on the container type, and the caller doesn't pay for a copy:

span_param.cpp ยท the right shape for read-only sequence parameters
// Old way: one overload per container type, each with a copy if you
// took the parameter by value, or coupling to a specific container if
// you took it by reference. Both unsatisfying.
void drawTriangles(const std::vector<Vertex>& vs);
void drawTriangles(const std::array<Vertex, 64>& vs);
void drawTriangles(const Vertex* vs, size_t count);

// Modern way: one function, takes a view, the caller passes whatever
// they have. No copies, no overload set, no error-prone count parameter.
void drawTriangles(std::span<const Vertex> vertices);

// Call sites work uniformly:
drawTriangles(meshVertexVector);             // vector โ†’ span
drawTriangles(framePushArray);               // std::array โ†’ span
drawTriangles({rawPtr, count});              // raw pointer + count โ†’ span
Views don't own anything

A view is a pointer plus a length. The underlying storage has to outlive the view, or the pointer dangles. The single most common bug with std::string_view is constructing it from a temporary std::string and storing the view past the temporary's lifetime. Both Clang and MSVC ship lifetime-bound warnings ([[clang::lifetimebound]], /W4) that catch many of these statically; turn them on[28].

16Beyond std: flat_map, hive, intrusive lists, ring buffers

The standard's container set is the default, not the ceiling. A short tour of the non-std containers most game engines pick up:

flat_map and flat_set (C++23 / Boost.Container)

A sorted vector of {key, value} pairs, wrapped in an associative interface. Lookup is O(log n) binary search on contiguous memory; insertion and middle-erase are both O(n) because everything after the change point shifts. The trade is: insertion and middle-erase are slower than the tree's O(log n), lookup and iteration are faster, memory overhead is lower. Lookup-heavy and small-to-medium key sets are the sweet spot. C++23 standardized it as std::flat_map and std::flat_set; Boost has had it since 2003[4].

std::hive / plf::colony (C++26)

A bucket array with stable references on insert and erase. Each block of slots holds a fixed number of elements plus a skipfield that records which slots are erased. Iteration walks the skipfield, jumping over erased slots in constant time. Insert reuses an erased slot if one exists, else appends. Erase marks the slot and bumps the skipfield, leaving every pointer to other elements valid[12].

Matt Bentley designed it for entity-component-system code where game objects come and go thousands of times per frame and other systems hold pointers into the live pool. Standardized as std::hive in C++26.

Intrusive containers (Boost.Intrusive, EASTL)

A list or tree where each element carries the link pointers, not the container. The container is just a head pointer; the user manages the storage. Two payoffs: no per-node heap allocation (the link state lives inside the element you already own), and an element can be in multiple intrusive containers at once. EASTL's intrusive_list and Boost.Intrusive's list are the canonical implementations[5]. Engine schedulers, free-lists, and ECS sparse sets all use this pattern.

Ring buffers (custom, no std equivalent)

A fixed-capacity contiguous array with two indices (head and tail) that wrap around. Push and pop are both O(1) with no allocation; the size is a hard cap, not a soft one. Used wherever a bounded queue makes sense: audio mix ringbuffers, replay event buffers, lock-free job queues. The standard committee has not landed on a single ring-buffer design because the variants don't reduce to one shape: bounded vs growing, SPSC vs MPMC, full-on-overwrite vs full-on-block. P0260's "concurrent queues" effort has been in flight for years without settling.

SmallVector / fixed_vector (LLVM, EASTL)

A vector with an inline buffer of N elements; falls back to heap allocation when N is exceeded. The first N pushes touch no heap at all, which matches the workload pattern for "build a small list inside a function and discard it" code. LLVM's llvm::SmallVector<T, N> is the most widely cited implementation[15], and EASTL ships fixed_vector<T, N> in the same shape.

17Allocators and why games override them

Every standard container takes an allocator template parameter. The default is std::allocator<T>, which forwards to ::operator new and ::operator delete. On most platforms that's a thread-safe general-purpose allocator (glibc malloc, jemalloc, tcmalloc), and it's the right answer for most code.

Game engines override it for three reasons[5]:

C++17 added std::pmr (polymorphic memory resources), which lets you pick an allocator at runtime instead of at type-instantiation time. std::pmr::vector<int> is a vector of int that takes its allocator as a constructor argument. This is the modern way to integrate engine pools without making every container a different type. The standard ships std::pmr::monotonic_buffer_resource (a linear arena) and std::pmr::unsynchronized_pool_resource (a per-thread pool) out of the box.

pmr_frame.cpp ยท per-frame linear arena, no malloc on the hot path
// One frame's worth of scratch storage. Reused every frame.
static std::array<std::byte, 4 * 1024 * 1024> frameArenaStorage;
std::pmr::monotonic_buffer_resource frameArena(
    frameArenaStorage.data(), frameArenaStorage.size());

void simulateFrame() {
  frameArena.release();    // O(1) "free everything from last frame"

  // Containers allocate inside the arena. Each push_back is one
  // pointer bump in the arena, not a malloc.
  std::pmr::vector<VisibleEntity> visibleEntities(&frameArena);
  visibleEntities.reserve(1024);

  for (const Entity& entity : worldEntities)
    if (mainCameraFrustum.contains(entity.bounds))
      visibleEntities.push_back({entity.id, entity.transform});

  submitDraws(visibleEntities);
  // visibleEntities goes out of scope here. For trivially destructible
  // element types, no destructors run; the buffer's deallocate() on a
  // monotonic arena is a no-op, so the whole teardown is free until
  // frameArena.release() reclaims the arena at the top of next frame.
}

18The container race

One widget pushes a million elements through each of the five sequence containers, then iterates them. The numbers are JavaScript estimates with a calibration factor, not native C++ throughput, but the ratios match published native benchmarks within a factor of two. Pick a workload and a size and watch where each container falls:

Live ยท five-container race
vector ms
ยทยทยท
deque ms
ยทยทยท
list ms
ยทยทยท
map ms
ยทยทยท
unord ms
ยทยทยท
Random access shows the biggest gap. vector is one indexing operation, deque is two, list doesn't support it at all (the JS demo walks from begin() per query, which is what real std::list code would have to do), map is a logarithmic tree walk, unordered_map is a hash plus a bucket walk. The same N, the same query stream; depending on the workload the spread runs from a small constant factor (push N) to one or two orders of magnitude (iterate) to several orders of magnitude (random index, where list's O(n)-per-access loses badly).

19The decision cheat sheet

A one-page summary, organized by the question you're actually asking when you reach for a container:

If you needโ€ฆReach forWhy
A sequence of T, default case std::vector<T> Contiguous, cache-friendly, O(1) push_back amortized, the right answer for the bulk of sequence workloads[16].
A fixed-size sequence, no heap allocation std::array<T, N> Size is part of the type. Lives inline in its owner.
A small bounded sequence, usually fits in N, sometimes overflows llvm::SmallVector<T, N> or eastl::fixed_vector Inline buffer for N, heap fallback past N. Zero allocations on the common path.
O(1) push and pop at both ends std::deque<T> Chunked storage means front pushes don't shift everything.
O(1) splice between containers, or stable pointers across all mutations std::list<T> or std::hive<T> (C++26) Node-based. Pointer to element X survives any operation that doesn't touch X. Hive is the cache-friendly version.
Lookup by key, frequent inserts and erases, doesn't fit in a flat_map absl::flat_hash_map<K, V> or folly::F14FastMap<K, V> Open-addressing SIMD-scan, 2-3ร— faster than std::unordered_map on the same workload[9].
Lookup by key, small to medium static-ish key set, read-heavy std::flat_map<K, V> (C++23) or boost::container::flat_map Sorted vector. Binary search on contiguous memory beats hashing at small N[4].
Ordered iteration, range queries by key std::map<K, V> Red-black tree, sorted iteration is automatic, lower_bound/upper_bound work.
LIFO behavior, want to expose only push/pop/top std::stack<T, std::vector<T>> Vector-backed stack. Default of deque is fine but the vector version has less per-op overhead.
FIFO behavior std::queue<T> Default deque backing is the right answer. For bounded lock-free, a ring buffer is faster.
"Do the highest-priority job next" std::priority_queue<T> Vector + binary heap. O(log n) push and pop, O(1) top.
A read-only view of a contiguous range std::span<const T> / std::string_view Pointer + size. No copy, no allocation, no template-on-container.

20The container pitfalls list

Mistakes that ship to production every year, organized by the container that produces them:

vector pitfalls

deque pitfalls

list pitfalls

map pitfalls

unordered_map pitfalls

21What's next

The decision shape: pick the layout (one block, chunked, node-based, hashed), pick the access pattern, the container follows. The cheat sheet in ยง19 has the lookup table; the rest of this tutorial is the reasoning behind each row.

Past this tutorial:

22Sources & further reading

Numbered citations refer to the superscripts above. Primary sources only: standards, vendor documentation, published papers, talk transcripts.

A note on originality

The prose, code samples, CSS, and interactive demos on this page are original. The vector-vs-list framing follows Stroustrup's Going Native 2012 keynote [1]. The SwissTable description follows Matt Kulukundis's CppCon 2017 talk [9] and the Abseil design page [10]; the "why exactly 7 bits for H2" derivation in ยง13 is the author's reading, not a quote. The growth-factor argument in ยง6 starts from Folly's claim [8] but the explicit golden-ratio derivation (rยฒ โ‰ค r + 1) is laid out here. The F14 description follows Facebook's open-source design doc [11]. The EASTL allocator-model critique follows Paul Pedriana's N2271 paper [5]; the three game-engine-specific allocator concerns in ยง17 (budget, frame arena, SIMD overalignment) and their DirectXMath/Unreal grounding are the author's framing. All deque chunk-size numbers were verified against the actual libstdc++, libc++, and MSVC STL source. All complexity guarantees and iterator-invalidation rules match cppreference and the published C++ standard.

  1. Stroustrup, B. (2012). C++11 Style. Going Native 2012 keynote. YouTube. Includes the vector-vs-list sorted-insertion benchmark and the "cache effects dominate" argument; follow-up at isocpp.org/blog/2014/06/stroustrup-lists.
  2. Stevens, A. (1995). Alexander Stepanov and STL Interview. Dr. Dobb's Journal. stlport.org. The history of how generic programming led to the STL container set.
  3. ISO/IEC 14882:2020 (and subsequent revisions). Programming languages โ€” C++. Standard reference for container complexity guarantees, sequence/associative requirements, and iterator invalidation. Open draft mirrored at eel.is/c++draft.
  4. Boost. Boost.Container Non-Standard Containers: flat_(multi)map / flat_(multi)set. boost.org. The reference implementation of sorted-vector associative containers, dating to 2003.
  5. Pedriana, P. (2007). EASTL: Electronic Arts Standard Template Library. WG21 paper N2271. open-std.org. The committee-facing case for the allocator, intrusive, and fixed-storage container set that game code needs.
  6. cppreference. std::array. en.cppreference.com/w/cpp/container/array. Fixed-size sequence, no heap allocation, inline storage.
  7. Muรฑoz, J. D. (2022). Advancing the State of the Art for std::unordered_map Implementations. Overload journal, ACCU. accu.org. Why the bucket API locks std::unordered_map into closed addressing, and what implementers can still optimize within that constraint.
  8. Facebook / Meta. FBVector โ€” Folly C++ Library. github.com/facebook/folly. The 1.5ร— growth factor, jemalloc cooperation, and trivial-relocation optimization that distinguish fbvector from std::vector.
  9. Kulukundis, M. (2017). Designing a Fast, Efficient, Cache-friendly Hash Table, Step by Step. CppCon 2017. YouTube. The original public description of SwissTable: H1/H2 split, 16-slot SIMD groups, 1-byte control words, 2-3ร— faster than std::unordered_map.
  10. Google. Abseil Swiss Tables and absl::Hash. abseil.io/blog/20180927-swisstables. Open-sourcing announcement and integration with Abseil's hash framework.
  11. Facebook / Meta. F14 Hash Table. github.com/facebook/folly. 14-key chunks, three storage variants (Value/Node/Vector), engineering announcement at engineering.fb.com.
  12. Bentley, M. Introduction of std::hive to the standard library. WG21 paper P0447 (latest revision R28). open-std.org/p0447r26; tracker at cplusplus/papers#328. The standardized bucket-array container, evolved from plf::colony; reference implementation at github.com/mattreecebentley/plf_hive. Adopted for C++26.
  13. Hennessy, J. L., & Patterson, D. A. Computer Architecture: A Quantitative Approach, 6th ed. Morgan Kaufmann (2017). The canonical citation for cache-hierarchy latency ratios and the cycle counts in ยง3.
  14. Intel. Intelยฎ 64 and IA-32 Architectures Optimization Reference Manual. intel.com. Chapter on memory hierarchy and the hardware prefetcher's pattern requirements.
  15. LLVM. llvm::SmallVector documentation. llvm.org. The reference inline-buffer-with-heap-fallback design.
  16. Stroustrup, B. (2013). The C++ Programming Language, 4th ed. Addison-Wesley. ยง31.4: "Use vector as your default container." The recommendation that std::vector is the right default; the cppcoreguidelines mirror it.
  17. C++ Core Guidelines Working Group. SL.con.1: Prefer using STL array or vector instead of a C array. isocpp.github.io.
  18. GCC libstdc++. stl_vector.h source. github.com/gcc-mirror/gcc. Three-pointer layout (begin / end / end-of-storage), 2ร— growth factor.
  19. cppreference. std::vector. en.cppreference.com/w/cpp/container/vector. Complexity table, iterator-invalidation rules, vector<bool> note.
  20. cppreference. std::unordered_map. en.cppreference.com/w/cpp/container/unordered_map. Bucket API, complexity, rehash and invalidation rules.
  21. cppreference. std::deque. en.cppreference.com/w/cpp/container/deque. Chunked-array implementation note, indexing complexity.
  22. Chen, R. (2023). Inside STL: The deque, implementation. The Old New Thing, Microsoft DevBlogs. devblogs.microsoft.com. Side-by-side description of libstdc++, libc++, and MSVC's deque chunk sizes.
  23. cppreference. std::list. en.cppreference.com/w/cpp/container/list. Doubly-linked layout, O(1) splice, reference-stability guarantee.
  24. Chen, R. (2023). Inside STL: The map, set, multimap, and multiset. The Old New Thing, Microsoft DevBlogs. devblogs.microsoft.com. Confirms red-black tree as the implementation strategy across libstdc++, libc++, and MSVC.
  25. Aumasson, J.-P., & Bernstein, D. J. (2012). SipHash: a fast short-input PRF. aumasson.jp/siphash/siphash.pdf. The keyed-hash design that became the default for hash-flooding-resistant hash tables (Rust, Python 3, Ruby).
  26. Google. Swiss Tables Design Notes. abseil.io/about/design/swisstables. H1/H2 hash split, control-byte layout, SIMD match details.
  27. cppreference. std::priority_queue. en.cppreference.com/w/cpp/container/priority_queue. Binary-heap backing, O(log n) push/pop, no decrease-key.
  28. cppreference. std::span. en.cppreference.com/w/cpp/container/span. C++20 non-owning view, lifetime requirements.
  29. cppreference. std::map. en.cppreference.com/w/cpp/container/map. std::less<> transparent comparator details under C++14.
  30. Le, Z., et al. A Standard flat_map. WG21 paper P0429R9. open-std.org/p0429r9. The paper that brought std::flat_map and std::flat_multimap into C++23.
  31. Zhihao Yuan. A Standard flat_set. WG21 paper P1222. open-std.org/p1222r4. The companion paper for std::flat_set and std::flat_multiset in C++23.
  32. Sutter, H. (2014). GotW #90 Solution: Factories. herbsutter.com. Touches the "vector by default" recommendation in the context of return-type design.
  33. Alexandrescu, A. (2014). Optimization Tips โ€” Mo' Hustle Mo' Problems. CppCon 2014. YouTube. Small-string and small-vector optimizations behind FBString and FBVector.
  34. Bentley, M. (2017). Colonies, performance and why you should care. CppCon 2016. YouTube. The original public talk on plf::colony, designed for game-engine entity pools.
  35. Celis, P., Larson, P.-ร…., & Munro, J. I. (1985). Robin Hood Hashing. FOCS 1985. semanticscholar.org. The variance-reducing open-addressing technique that several modern hash tables build on.
  36. Stepanov, A., & Lee, M. (1995). The Standard Template Library. HP Labs Technical Report HPL-95-11. stepanovpapers.com. The original STL design document.
  37. Bronson, N., & Shi, X. (2019). Open-sourcing F14 for faster, more memory-efficient hash tables. Facebook Engineering. engineering.fb.com. Source for the 14-slot chunk choice ("filters 14 slots at once"), the SSE2/NEON intra-chunk SIMD compare, the 12/14 max load factor, and the chunk-overflow-count mechanism. The F14 design markdown in the Folly repository (F14.md) gives the chunk-layout details (14 tag bytes + 2 metadata bytes = one 16-byte aligned block).

See also