C++ Containers
Eight cores. A 64-byte cache line. A 70-cycle gap between an L1 hit and an L3 hit, another 200 cycles to DRAM. The container that wins on paper is rarely the one that wins on hardware, because the data structure is the layout in memory and the layout decides whether the CPU is fed or stalled. We walk every standard container the engine actually ships with: std::array, std::vector, std::deque, std::list, std::map, std::unordered_map, the adaptors, the views. Memory pictures, complexity proofs, and a live race that makes the 50ร gap between vector and list visible.
01Why a container choice is a performance decision
Big-O is a description of how a data structure scales. It is not a description of how fast that data structure is. A linked list and a sorted vector both report O(n) insertion in the middle, and on hardware built in 1985 they performed about the same. On hardware built since 2005 the vector wins by a margin so wide it stops being an algorithm question and becomes a memory-layout question. Bjarne Stroustrup spent an entire Going Native 2012 keynote making this point with a stopwatch[1], and a decade later "default to vector" is the conventional advice in most modern engine codebases. For sequence containers, the layout in memory is most of the performance story; complexity, reference stability, and lookup pattern decide the rest.
The widget below makes that visible. The left column reads N integers from a contiguous block (a std::vector<int>). The right column reads N integers from nodes scattered across the heap (a std::list<int>). Both touch the same number of bytes. Both report O(n) traversal. The cache walker shows the difference: the vector reads sweep through cache lines in order; the list reads jump between unrelated lines, paying a cache miss on most accesses:
A working mental model of every standard C++ container: what it actually looks like in memory, what its complexity guarantees promise (and what they don't), and which one to reach for in a given gameplay system. Plus a live container race that lets you push a million elements through each, a hash-table animator that flips between separate chaining and SwissTable-style open addressing, and a cheat sheet you can paste into your engine's contributing guide.
The framing for the rest of the tutorial
Every container in this tutorial answers four questions, and you can mostly predict its performance from the answers:
- Where do the elements live? One contiguous block (
vector,array), several contiguous blocks (deque,hive), or one independent heap allocation per element (list,map, node-basedunordered_map)? - How are they ordered? Insertion order (
vector,deque,list), sorted by key (map,set), or hashed (unordered_map,unordered_set)? - Which operations does the container want to be fast? Random index, find-by-key, push at the back, push anywhere, ordered iteration.
- Which references survive which mutations? A pointer into a
vectorcan become invalid on the nextpush_back; a pointer into alistsurvives every mutation that doesn't touch that specific node. Engine code that holds long-lived handles depends on knowing this exactly.
Answer those four for each container and the rest of the standard-library reference page reduces to small print.
02A short history of getting containers right
The C++ standard library's container set didn't drop fully formed in 1998. Its shape is the cumulative answer to twenty-five years of "this is faster on real hardware" arguments, each of which forced the committee or a library author to write the better data structure down. A short tour, because the modern set only makes sense in context:
vector, deque, list, set, map, multiset, multimap, plus the stack, queue, and priority_queue adaptors. Associative containers are tree-based; the standard never names a specific tree, but the combination of logarithmic complexity and iterator-stable insert/erase narrows the practical choice to a balanced BST. Every shipping implementation (libstdc++, libc++, MSVC) lands on red-black[3].
std over the next decade.
std::array, std::unordered_map, std::unordered_set.[6] Fixed-size sequence with no heap allocation; hash tables with average-constant lookup. The unordered containers are constrained to closed-addressing (separate chaining) by a bucket API that leaks into the interface[7]. That decision is what every "faster than unordered_map" library is replacing.
std::vector with a 1.5ร growth factor in the middle of its size range (so a future reallocation can reuse freed memory), jemalloc-aware sizing, and explicit relocation for trivially-movable types. The reported wins on Facebook's workloads are modest but real.
std::unordered_map on Google's workloads. Open-sourced in Abseil in 2018[10].
F14FastMap picks between the Value and Vector layouts at compile time based on entry size; users who need pointer stability reach for F14NodeMap directly.
flat_map and flat_multimap arrive through P0429[30]; flat_set and flat_multiset arrive separately through P1222[31]. Lookup-heavy ordered associative containers no longer need a third-party dependency.
std::hive.[12] Matt Bentley's plf::colony, formerly used for game-engine entity pools, becomes the standard's stable-reference bucket array. Erase keeps the slot, iteration skips the gap, pointers survive every mutation that does not touch their specific element.
The recurring pattern: the algorithm is from the 1970s, the cache-friendly layout is from the 2010s, the standardization is a few years behind whatever the big shops have already shipped. By 2026 the standard has closed the sorted-associative and bucket-array gaps (flat_map, hive); the hash-table gap is the conspicuous remainder, since std::unordered_map is still constrained by the C++11 bucket API while the open-addressing variants outside std are 2-3ร faster. The original 1998 containers are still the right default for most code. The remainder of this tutorial walks them in dependency order: array, vector, deque, list, map, unordered_map, the adaptors, the views, and the post-std alternatives at the end.
03The memory hierarchy the container sits on
Before any specific container makes sense, the picture of the hardware they're trying to feed. A modern CPU spends most of its time waiting for memory:
| Tier | Typical latency (cycles) | Capacity | Relative cost of a miss |
|---|---|---|---|
| Register | ~0 | ~32 ร 8 B per core | |
| L1 data cache | ~4 | 32-48 KB per core | |
| L2 cache | ~12 | 256 KB-1 MB per core | |
| L3 cache (shared) | ~40 | 4-64 MB per socket | |
| DRAM | ~250 | 8-128 GB |
Cycle counts are Skylake-class x86 rule-of-thumb numbers, drawn from Hennessy and Patterson[13], Intel's optimization manual[14], and Agner Fog's measurements; AMD Zen, Apple M-series, and ARM Neoverse have their own profiles, and server parts with mesh interconnects push L3 and remote-socket DRAM substantially higher. The point is the ratio, not the exact number. A miss to DRAM is roughly 60 times the cost of an L1 hit on a modern desktop core.
The unit the cache moves around is a cache line: 64 bytes on x86-64 and most ARM, 128 bytes on Apple M-series. A line holds 16 ints, or 8 doubles, or 8 64-bit pointers. When the CPU touches a byte that isn't in cache, the hardware pulls the entire line that contains it, and pays the DRAM latency once. Every subsequent byte on the same line is free.
This is the central fact behind every container choice in the rest of the tutorial. A contiguous container amortizes the cost of one DRAM fetch across 16 elements. A node-based container with a general-purpose allocator usually pays that fetch once per element, because consecutive nodes rarely share a cache line. (Custom pool allocators and intrusive containers can close some of the gap; see ยง16.)
What about the hardware prefetcher?
x86 (and most ARM) CPUs have a stream prefetcher that watches recent cache misses and, if they form a regular linear pattern, fetches ahead so the next miss is already on its way. A sequential vector scan triggers it: the prefetcher correctly predicts that the next line is wanted and brings it in before the CPU asks. A list traversal does not trigger it: the next node's address is in the current node's next pointer, and the prefetcher can't follow indirection[14].
This is why the gap between vector and list scales with array size. Below ~32 KB everything fits in L1 and the difference is small. Above L3 the vector keeps up with DRAM bandwidth (10-40 GB/s) while the list bottlenecks on miss latency (~250 cycles per element on a Skylake-class core). Stroustrup's keynote demo[1] reported the vector finishing in roughly 80 seconds while the list took close to two hours on the same hardware at 500K elements, somewhere around 50-100ร depending on which run you pick. The exact factor depends on element size, allocator behaviour, and the working-set range; the order of magnitude is what's robust.
With that established, the containers come in three structural families. The rest of the sections are organized by family:
- Contiguous sequence:
array,vector. One block, indexable, cache-loving. - Chunked or node-based sequence:
deque(chunked),listandforward_list(one node per element). - Associative:
map/set(sorted, node-based, tree);unordered_map/unordered_set(hashed).
Plus the adaptors that wrap one of those (stack, queue, priority_queue) and the non-owning views (span, string_view) that point into someone else's storage.
04std::array: fixed size, on the stack
The smallest, simplest container. A thin wrapper around a C-style array, sized at compile time, no heap allocation, all elements live in the object itself[6]. std::array<float, 3> is exactly three floats in twelve bytes, nothing more. It exists for two reasons: it gives a C array a real STL interface (.begin(), .size(), .data(), etc.), and it makes the size part of the type so the compiler can check it.
// Three floats, allocated wherever the std::array object lives. // If 'positions' is a local, it's on the stack; if it's a member of a class // allocated on the heap, it lives in that class's heap block. No // separate allocation either way. std::array<float, 3> positions = {0.0f, 1.0f, 2.0f}; // sizeof(positions) is exactly 12. There's no length field; the size is // encoded in the type so the compiler always knows it. static_assert(sizeof(positions) == 3 * sizeof(float)); // The interface is std-container-shaped: iterate, index, ask the size. for (float& element : positions) element *= 2.0f; // data() returns a pointer to the first element, the same way a C array decays. // Useful when calling C APIs (OpenGL, Vulkan, sockets) that want a raw pointer. glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, positions.data());
Use it whenever the size is a compile-time constant and the container is small enough that copying it by value is cheap. Vertex positions, RGBA colors, 4ร4 matrix rows, hash digests, fixed-channel audio mixes, lookup tables of known size. Anywhere a C programmer would have written float v[3];.
The size is part of the type. std::array<int, 3> and std::array<int, 4> are different types and you can't assign one to the other. If the size of the data is "small but not fixed" (e.g. up to 16 entries in a small-buffer-optimized list), reach for a SmallVector from EASTL or LLVM[15], not std::array with the maximum size and a separate length counter.
05std::vector: the workhorse
The default. Stroustrup's advice in The C++ Programming Language, fourth edition, is "use vector as your default container"[16] and the C++ Core Guidelines repeat it[17]. The reasoning is exactly the cache argument from ยง1: a contiguous block of memory is the friendliest layout for the CPU, and most workloads don't need more than that.
A std::vector is three pointers (or equivalents) in the object itself: begin, end, and end-of-storage[18]. The actual elements live in a single contiguous heap block. size() is end - begin; capacity() is end_of_storage - begin. Indexing is one pointer arithmetic plus one load. Iteration is a pointer walk that the hardware prefetcher loves.
The four guarantees that matter
- Constant-time random access.
v[i]is one pointer add, one load. - Amortized constant-time push_back. When there's capacity, push_back is one assignment plus one increment. When capacity runs out, it doubles (or 1.5รs) and reallocates. The amortized cost is O(1) regardless[19].
- Linear-time insert/erase in the middle. Every element after the insertion point has to be shifted by one slot. On contiguous memory this is one
memmove, which the CPU runs at memory bandwidth. - Contiguous storage, guaranteed.
&v[0],v.data(), andv.begin()all point at the same single block. You can passv.data()to OpenGL, tomemcpy, to::write().std::vector<bool>is the famous exception; we'll come back to that.
The catch is the third bullet's twin: every reallocation invalidates every pointer, reference, and iterator into the vector, and inserts that don't grow the vector still invalidate everything from the insertion point to the end. The next two sections take each in turn.
06Growth, amortization, and the cost of push_back
When a vector's size() reaches its capacity(), the next push_back has to allocate a bigger block, move every existing element into it, and free the old one. That's a linear-time operation. Doing it on every push would make push linear-time in steady state, which is terrible. The standard library hides this with amortization: it grows the block by a constant factor (typically 2ร or 1.5ร) so that the cost of all the reallocations across N pushes is O(N) total, which means the cost per push averages out to O(1)[19].
One push at a time, this widget records the cost. Capacity steps up at doubling boundaries; reallocation events appear as tall red bars; the running average per push converges to a small constant even though individual pushes cost O(size):
Geometric growth turns N pushes into about 2N total work, which is O(N) and gives O(1) amortized per push. The word amortized is doing real work: any individual push that triggers a reallocation is O(size), not O(1). Code that needs a hard real-time bound on a single push has to call reserve() first.
Why 1.5ร sometimes beats 2ร
Set the growth factor to r and call the initial capacity 1. After k growths the sum of every previously freed block is 1 + r + rยฒ + โฆ + rk-1 = (rk โ 1) / (r โ 1). The block being allocated for the next growth needs rk units. Ask when the new allocation fits inside the cumulative freed region: rk โค (rk โ 1) / (r โ 1), which simplifies to rยฒ โค r + 1. The largest r satisfying that inequality is the golden ratio, ฯ โ 1.618. Any factor below ฯ eventually fits, so the allocator can reuse the just-freed addresses; r = 2 sits permanently above ฯ and never can.
That's the case for picking a factor below 2. The case against is that smaller factors mean more reallocations, so the average per-push cost rises. The shipping picks split the difference: MSVC uses 1.5ร, Folly's fbvector uses 1.5ร in the middle of its size range and 2ร at the small and very large ends[8], libstdc++ and libc++ stay at 2ร. Whether the in-place reuse actually happens depends on the allocator; jemalloc handles it well, glibc's ptmalloc less so.
In practice, if you know how big the vector will get, the cure for all of this is one line:
std::vector<EntityId> visibleEntities; // We know roughly how many entities will be visible this frame. // One allocation up front beats log2(N) reallocations during the loop. visibleEntities.reserve(estimatedVisibleCount); for (const Entity& entity : worldEntities) { if (frustum.contains(entity.bounds)) visibleEntities.push_back(entity.id); // no reallocation until we exceed the reservation }
07Iterator invalidation: the rule you can't forget
Every container has a set of "things that, if you do them, will invalidate some references into the container." Get it wrong and you have a dangling pointer that the compiler can't catch. The rules are precise and worth memorizing for the containers you use most:
| Container | What invalidates iterators | What invalidates pointers/references |
|---|---|---|
vector |
Anything that triggers a reallocation invalidates all iterators. Insert or erase invalidates iterators from the change point to end(). |
Same as iterators. References are pointers in disguise here. |
deque |
Insert or erase anywhere except the front/back invalidates all iterators. | Insert at front or back leaves all pointers/references valid; insert in the middle invalidates everything. |
list, forward_list |
Only the iterators to erased elements are invalidated. | Only the pointers to erased elements. Everything else survives. |
map, set, multi* |
Same as list: only the erased ones. |
Same as list: only the erased ones. |
unordered_map, unordered_set |
Insert may trigger a rehash, which invalidates all iterators. Erase invalidates only the erased iterator. | Insert with no rehash leaves pointers/references valid; rehash leaves pointers/references valid (only iterators are invalidated). Erase invalidates only the erased element's references. |
Adapted from the cppreference invalidation tables[19][20] and the formal ISO wording[3]. The "pointer/reference vs iterator" distinction for unordered_map rehash is subtle and worth confirming on the cppreference page for any version you target.
Each button on the widget below performs one operation on a small vector. Two pinned iterators sit at fixed positions; they turn red the instant they go dangling. Every reallocation flips every iterator; every middle-insert flips everything from the insertion point onward.
08std::deque: chunked, two-ended
std::deque ("double-ended queue") gives you push_front and pop_front at O(1), which vector doesn't. It does this by storing elements in a sequence of fixed-size chunks, with a top-level index ("map") of pointers to those chunks[21]. Inserting at the front allocates a new chunk and points the map at it; the existing chunks don't move. Indexing is two loads (chunk pointer, then element) instead of one.
Implementations pick a chunk size, and the picks differ wildly[22]:
- libstdc++: 512-byte chunks. For
deque<int>, 128 elements per chunk; for any element larger than 512 bytes, one element per chunk. - libc++: 4096-byte chunks when the element is smaller than 256 bytes; a fixed 16 elements per chunk above that.
- MSVC: the chunk size shrinks aggressively with element size. The formula is 16 elements for 1-byte elements, 8 for 2-byte, 4 for 4-byte (so
deque<int>has chunks of 4), 2 for 8-byte, and one element per chunk for anything larger than 8 bytes. This last regime is the one with the famously poor cache behavior: adequeof any reasonably-sized struct on MSVC is approximately one heap allocation per element.
Pictured below: a generic libstdc++-style layout. Push at either end and a new chunk is allocated when the current end fills; existing chunks never move.
Use it when you need O(1) push and pop at both ends, when reference stability after front/back insertion matters, or when you want to grow without periodic 2ร reallocations. Common production uses:
- Job queues where producers push at one end and the worker pops from the other.
- Event histories where new events go on the back and stale ones get popped off the front.
- The default backing store of
std::queueandstd::stack.queue<T>is justqueue<T, deque<T>>.
The chunked layout means iteration sweeps each chunk sequentially (good) and then jumps to the next chunk (one cache miss per chunk boundary). With a 512-byte libstdc++ chunk and 8-byte doubles, that's 64 elements per chunk and one miss per 64. Still far better than a list's one miss per element, and within a small constant of a vector's one per 16. On MSVC with its one-element-per-chunk policy for anything larger than 8 bytes, the picture is much worse, and the resulting deque behaves more like a list than a vector. For pure-iteration workloads on libstdc++ or libc++, std::vector typically wins by a small constant; for two-ended access, deque wins because the alternative is a vector with O(n) insert(begin()).
09std::list and std::forward_list
A doubly-linked list. Each element lives in its own heap allocation, plus two pointers (prev and next). std::forward_list drops the prev pointer to halve the per-node overhead, at the cost of being singly-linked[23].
list has two real selling points in modern C++:
- Reference stability under all mutations. A pointer into a node survives every insert and every erase that doesn't touch that specific node. Worth real money in code that wires up long-lived handles to objects in containers.
- O(1) splice.
list1.splice(pos, list2, it)moves a node fromlist2intolist1by relinking pointers, no copies, no allocations[23]. There is no other standard container that lets you move an element between containers in O(1). LRU-eviction caches and ready-queue schedulers both lean on this.
Outside of those two use cases, it is almost always the wrong choice. The cache-miss penalty from ยง1 is the headline; smaller costs add up too:
- Per-node allocation. Each
push_backon alist<int>calls the allocator. A vector calls the allocator log2(N) times. - Per-node memory overhead. On libstdc++ x86-64, a
list<int>node struct is 24 bytes (8-byteprev, 8-bytenext, 4-byte int, 4 bytes of padding), plus the allocator header (typically 8-16 bytes on glibc). The payload is one int. The overhead ratio sits around 6-10ร, before any allocator fragmentation. - No random access.
l[i]doesn't compile; you walk frombegin().
The next section measures the gap directly.
10List vs vector, measured
Stroustrup's Going Native 2012 benchmark[1]: generate N random integers; for each one, find its sorted position (linear scan from begin()) and insert it there. Repeat for both a vector and a list. The number of comparisons is the same; the number of structural mutations is the same; the constant factor is everything.
Press RACE to run a JavaScript port of the benchmark. The absolute numbers won't match a native C++ run (V8 doesn't have the same heap layout as glibc), but the ratio tracks the C++ result:
Two cases survive the cache argument. (a) Splice-heavy workloads where you move elements between containers without iteration: an LRU cache that touches one element per request, then moves that node to the front, is a textbook fit. (b) Code that holds long-lived pointers into the container across many mutations: a list<Job> where job dependency edges store raw Job*. In both cases pre-C++26, the contiguous alternative needs a separate indirection table to provide the reference stability that list gets for free; whether that's a better deal depends on the specifics. As of C++26, std::hive (see ยง16) covers case (b) with a cache-friendlier layout, so the only case that still goes to list by default is the splice-heavy one in (a).
11std::map and std::set: ordered associative
std::map<K, V> is a sorted, balanced binary search tree from keys to values. The standard requires logarithmic-time lookup, insert, and erase, and ordered iteration[3]. The standard never names a specific data structure, but every shipping implementation (libstdc++, libc++, MSVC) uses a red-black tree[24]. AVL trees meet the same complexity bound but rotate more on insert/erase; red-black trades a slightly taller tree for fewer rotations, which is the right balance when erase has to be O(log n) too.
Each element is its own heap-allocated node containing the key/value pair, a parent pointer, two child pointers, and one color bit. That's about 48 bytes of overhead per node on 64-bit platforms, plus the payload. Lookup is a logarithmic walk from the root, branching by key comparison at each node.
When to use map, and when not
Use std::map when you need ordered iteration (lowest key first, highest key last) or when you need to find ranges by key (lower_bound, upper_bound, equal_range). For example: per-frame replay events keyed by timestamp, where you stream events in time order; or an interval tree by start coordinate.
Do not use std::map as a default associative container. For unordered "find by key" lookups, std::unordered_map is faster on average (constant vs logarithmic) and a flat sorted-vector (std::flat_map in C++23, or boost::container::flat_map before that) is faster than both for read-heavy workloads on small to medium key sets, because the contiguous layout fits the cache[4].
12std::unordered_map and the bucket interface
A hash table. Average-constant-time lookup, insert, and erase; worst-case linear if the hash function is adversarial. C++11 standardized it with a specific implementation strategy baked into the interface: closed addressing (each bucket holds a linked list of entries that hashed to it), exposed through the public bucket_count(), bucket_size(), and local_iterator APIs[20]. That interface choice locks every shipping std::unordered_map into separate chaining, which is why third-party hash tables routinely beat it by 2-3ร[7].
The widget shows the closed-addressing layout. Each bucket points to a linked list of {hash, key, value} nodes. Lookup hashes the key, picks the bucket, then walks the bucket's list comparing keys. The load factor (size() / bucket_count()) controls how long the bucket chains get; the default max_load_factor is 1.0, so on average each bucket holds one element.
Hash quality matters more than you think
The hash function decides whether the table runs at O(1) or degenerates to O(n). std::hash<int> on libstdc++ and libc++ is the identity function: hash(42) == 42. That works fine when the bucket count is prime (which it is on libstdc++) and fails badly when keys differ only in bits the bucket count ignores, for example integer indices that all share their low bits. Hash-flooding attacks are real, and the standard library does not defend against them. Production code on internet-facing inputs should use a randomized seed (SipHash, WyHash) or accept the DoS risk[25].
For game code, where the inputs are usually engine-controlled, the standard's default is fine. The bigger issue is what the table costs vs. what it could cost.
13Open addressing and SwissTable
Outside the standard library, almost every modern hash table is open-addressing (no separate buckets; every entry is stored inline in one flat array, and collisions are resolved by probing). The flat layout is the cache argument from ยง1, applied to hash tables: one DRAM fetch covers many adjacent probe positions.
The state of the art is Google's SwissTable[9], open-sourced in Abseil in 2018[10]. The same general shape was adopted by Folly's F14[11], by Rust's standard HashMap (via the hashbrown crate), and by Go's map starting in version 1.24. The chunk size diverges: SwissTable groups 16 slots so the control bytes fill one 128-bit SIMD register and rehashes at load factor 7/8 = 87.5%. F14 groups 14 slots and uses the remaining 2 bytes of the 16-byte aligned block as a per-chunk overflow count: when an insert is displaced past this chunk it is incremented, on erase decremented, and an unsuccessful probe can stop as soon as the count is zero rather than walking a chain of tombstones[37]. F14's max load factor is 12/14 โ 85.7%, slightly lower than SwissTable's; that is the cost of widening the metadata.
The design has four moving parts:
- The table is one flat array of slots. Each slot has a 1-byte control word next to its
{key, value}entry. - The 64-bit hash is split into H1 (57 bits) for picking the starting position, and H2 (7 bits) stored in the control word for fast filtering[26]. The 7-bit width is not arbitrary: it leaves one bit for the empty/tombstone/full state tag, and 16 of them (one per slot in a group) fit in a 128-bit SSE register, which is what the SIMD compare needs.
- On x86 with SSE2, slots are grouped into groups of 16. Lookup loads the 16 control bytes into one 128-bit SIMD register and runs one
_mm_cmpeq_epi8against the H2 we're searching for;_mm_movemask_epi8turns the result into a 16-bit mask of candidate positions[9]. On builds without SSE2 (older ARM, portable fallback) the group size drops to 8 and bit-twiddling replaces the SIMD compare. - The candidates are then key-equality-checked; misses move to the next group. Most lookups touch one group, which is one cache line.
Press "find" on the widget below with a key. H1 picks a group; the 16 control bytes load into a single SIMD register; one compare against the 7-bit H2 of the search key produces the match mask; the matched candidate gets a full key comparison:
Where the time goes
Closed-addressing std::unordered_map spends most of its lookup time on the two cache misses (bucket pointer, then first chain node) and on the linked-list pointer chase. SwissTable spends most of its time on the one cache miss to load the group's 16 control bytes; the SIMD compare is one instruction, and the key comparison is on a hot cache line. The gap matches the prediction: 2-3ร faster lookup, ~30% less memory[9][11].
The standard committee is aware of this. P2767, the proposal for a flat-storage std::flat_unordered_map, has been in flight for several revisions. Until it lands, the pragmatic answer for new C++ engine code is to take a dependency on absl::flat_hash_map or folly::F14FastMap and pretend std::unordered_map doesn't exist.
14The adaptors: stack, queue, priority_queue
Three thin wrappers, each constraining an underlying container to a specific access pattern:
std::stack<T>. LIFO. Defaults tostd::deque<T>underneath, but any container withback(),push_back(), andpop_back()works. Passstd::vector<T>as the second template argument and you get a vector-backed stack with no deque overhead.std::queue<T>. FIFO. Defaults tostd::deque<T>underneath. A vector won't work (it has no efficientpop_front); astd::listdoes but is slower than the deque default.std::priority_queue<T>. The biggest element is always on top. Defaults tostd::vector<T>backed by a binary heap, which is exactly the right shape for this access pattern[27].
The binary-heap layout is worth knowing about because std::priority_queue is a hot data structure in game code: AI A* searches, sound mixer priority, animation event scheduling, every "do the most important thing next" system. The heap is a vector where, for the element at index i, its children are at 2i+1 and 2i+2, and the parent is at (i-1)/2. push appends and sifts up; pop takes the root, moves the last element to the root, and sifts down.
The standard library's priority queue is a max-heap by default. For Dijkstra and A*, where you want the smallest distance first, pass std::greater<T> as the comparator:
struct PathNode { NodeId nodeId; float estimatedTotalCost; // f = g + h: A* total estimate through this node }; struct SmallerFirst { bool operator()(const PathNode& a, const PathNode& b) const { // priority_queue is a max-heap by default; we invert the comparator // so the smallest cost ends up at the top. return a.estimatedTotalCost > b.estimatedTotalCost; } }; std::priority_queue<PathNode, std::vector<PathNode>, SmallerFirst> openSet; openSet.push({startNodeId, heuristic(startNodeId, goal)}); while (!openSet.empty()) { auto currentNode = openSet.top(); // O(1) lookup of best frontier node openSet.pop(); // O(log n) re-heap if (currentNode.nodeId == goal) break; for (NodeId neighbor : neighborsOf(currentNode.nodeId)) openSet.push({neighbor, gCostSoFar(neighbor) + heuristic(neighbor, goal)}); }
std::priority_queue does not let you change the priority of an element in place; the standard exposes no "decrease-key" operation. A Dijkstra implementation has to push a fresh entry with the lower cost and lazily skip stale ones on pop. For workloads where decrease-key is on the hot path (graph search with many revisited nodes), reach for a Fibonacci heap, a pairing heap, or an indexed binary heap; std::priority_queue is the wrong tool[27].
15std::span and std::string_view: non-owning views
Two C++20 (and one C++17) additions that aren't really containers, but they're how you pass containers around without copying. They are both views: a pointer plus a size, owning nothing, valid only as long as the underlying storage outlives them[28].
std::span<T>(C++20). A view of a contiguous range ofT. Construct it from astd::vector<T>, astd::array<T, N>, a C array, anything contiguous. The function on the other side gets a uniform interface to all of them.std::string_view(C++17). The same idea, specialized for character ranges. Has the methods that make string handling pleasant (find,substr,starts_with,ends_with) without owning the bytes.
Use them as parameter types whenever a function takes a contiguous range and doesn't need to extend its lifetime. The function gets to work with any contiguous source without overloading on the container type, and the caller doesn't pay for a copy:
// Old way: one overload per container type, each with a copy if you // took the parameter by value, or coupling to a specific container if // you took it by reference. Both unsatisfying. void drawTriangles(const std::vector<Vertex>& vs); void drawTriangles(const std::array<Vertex, 64>& vs); void drawTriangles(const Vertex* vs, size_t count); // Modern way: one function, takes a view, the caller passes whatever // they have. No copies, no overload set, no error-prone count parameter. void drawTriangles(std::span<const Vertex> vertices); // Call sites work uniformly: drawTriangles(meshVertexVector); // vector โ span drawTriangles(framePushArray); // std::array โ span drawTriangles({rawPtr, count}); // raw pointer + count โ span
A view is a pointer plus a length. The underlying storage has to outlive the view, or the pointer dangles. The single most common bug with std::string_view is constructing it from a temporary std::string and storing the view past the temporary's lifetime. Both Clang and MSVC ship lifetime-bound warnings ([[clang::lifetimebound]], /W4) that catch many of these statically; turn them on[28].
16Beyond std: flat_map, hive, intrusive lists, ring buffers
The standard's container set is the default, not the ceiling. A short tour of the non-std containers most game engines pick up:
flat_map and flat_set (C++23 / Boost.Container)
A sorted vector of {key, value} pairs, wrapped in an associative interface. Lookup is O(log n) binary search on contiguous memory; insertion and middle-erase are both O(n) because everything after the change point shifts. The trade is: insertion and middle-erase are slower than the tree's O(log n), lookup and iteration are faster, memory overhead is lower. Lookup-heavy and small-to-medium key sets are the sweet spot. C++23 standardized it as std::flat_map and std::flat_set; Boost has had it since 2003[4].
std::hive / plf::colony (C++26)
A bucket array with stable references on insert and erase. Each block of slots holds a fixed number of elements plus a skipfield that records which slots are erased. Iteration walks the skipfield, jumping over erased slots in constant time. Insert reuses an erased slot if one exists, else appends. Erase marks the slot and bumps the skipfield, leaving every pointer to other elements valid[12].
Matt Bentley designed it for entity-component-system code where game objects come and go thousands of times per frame and other systems hold pointers into the live pool. Standardized as std::hive in C++26.
Intrusive containers (Boost.Intrusive, EASTL)
A list or tree where each element carries the link pointers, not the container. The container is just a head pointer; the user manages the storage. Two payoffs: no per-node heap allocation (the link state lives inside the element you already own), and an element can be in multiple intrusive containers at once. EASTL's intrusive_list and Boost.Intrusive's list are the canonical implementations[5]. Engine schedulers, free-lists, and ECS sparse sets all use this pattern.
Ring buffers (custom, no std equivalent)
A fixed-capacity contiguous array with two indices (head and tail) that wrap around. Push and pop are both O(1) with no allocation; the size is a hard cap, not a soft one. Used wherever a bounded queue makes sense: audio mix ringbuffers, replay event buffers, lock-free job queues. The standard committee has not landed on a single ring-buffer design because the variants don't reduce to one shape: bounded vs growing, SPSC vs MPMC, full-on-overwrite vs full-on-block. P0260's "concurrent queues" effort has been in flight for years without settling.
SmallVector / fixed_vector (LLVM, EASTL)
A vector with an inline buffer of N elements; falls back to heap allocation when N is exceeded. The first N pushes touch no heap at all, which matches the workload pattern for "build a small list inside a function and discard it" code. LLVM's llvm::SmallVector<T, N> is the most widely cited implementation[15], and EASTL ships fixed_vector<T, N> in the same shape.
17Allocators and why games override them
Every standard container takes an allocator template parameter. The default is std::allocator<T>, which forwards to ::operator new and ::operator delete. On most platforms that's a thread-safe general-purpose allocator (glibc malloc, jemalloc, tcmalloc), and it's the right answer for most code.
Game engines override it for three reasons[5]:
- Memory budget. Each subsystem gets a fixed pool. The renderer's containers allocate out of the renderer's pool; the AI's out of the AI's. Out-of-memory is a budget violation that's tractable to debug, not a process-wide crash.
- Frame allocators. A linear-bumping arena that resets at the start of each frame. Allocating from it is one pointer increment; freeing is a no-op until the frame ends. Per-frame container allocations are essentially free if you wire this up.
- Alignment. SSE/AVX/NEON vector types want 16- or 32-byte alignment. The default allocator returns whatever the platform's malloc guarantees: glibc x86-64 gives 16 bytes for objects โฅ 32 bytes and 8 bytes otherwise; Windows and macOS guarantee 16 bytes unconditionally. None of those satisfy a
std::vector<__m256>on x86, which wants 32-byte alignment for AVX loads. Game code that shipsDirectXMath'sXMVECTORor Unreal'sVectorRegisterin containers has to supply an overaligned allocator, becausestd::allocatordoesn't.
C++17 added std::pmr (polymorphic memory resources), which lets you pick an allocator at runtime instead of at type-instantiation time. std::pmr::vector<int> is a vector of int that takes its allocator as a constructor argument. This is the modern way to integrate engine pools without making every container a different type. The standard ships std::pmr::monotonic_buffer_resource (a linear arena) and std::pmr::unsynchronized_pool_resource (a per-thread pool) out of the box.
// One frame's worth of scratch storage. Reused every frame. static std::array<std::byte, 4 * 1024 * 1024> frameArenaStorage; std::pmr::monotonic_buffer_resource frameArena( frameArenaStorage.data(), frameArenaStorage.size()); void simulateFrame() { frameArena.release(); // O(1) "free everything from last frame" // Containers allocate inside the arena. Each push_back is one // pointer bump in the arena, not a malloc. std::pmr::vector<VisibleEntity> visibleEntities(&frameArena); visibleEntities.reserve(1024); for (const Entity& entity : worldEntities) if (mainCameraFrustum.contains(entity.bounds)) visibleEntities.push_back({entity.id, entity.transform}); submitDraws(visibleEntities); // visibleEntities goes out of scope here. For trivially destructible // element types, no destructors run; the buffer's deallocate() on a // monotonic arena is a no-op, so the whole teardown is free until // frameArena.release() reclaims the arena at the top of next frame. }
18The container race
One widget pushes a million elements through each of the five sequence containers, then iterates them. The numbers are JavaScript estimates with a calibration factor, not native C++ throughput, but the ratios match published native benchmarks within a factor of two. Pick a workload and a size and watch where each container falls:
19The decision cheat sheet
A one-page summary, organized by the question you're actually asking when you reach for a container:
| If you needโฆ | Reach for | Why |
|---|---|---|
| A sequence of T, default case | std::vector<T> |
Contiguous, cache-friendly, O(1) push_back amortized, the right answer for the bulk of sequence workloads[16]. |
| A fixed-size sequence, no heap allocation | std::array<T, N> |
Size is part of the type. Lives inline in its owner. |
| A small bounded sequence, usually fits in N, sometimes overflows | llvm::SmallVector<T, N> or eastl::fixed_vector |
Inline buffer for N, heap fallback past N. Zero allocations on the common path. |
| O(1) push and pop at both ends | std::deque<T> |
Chunked storage means front pushes don't shift everything. |
| O(1) splice between containers, or stable pointers across all mutations | std::list<T> or std::hive<T> (C++26) |
Node-based. Pointer to element X survives any operation that doesn't touch X. Hive is the cache-friendly version. |
| Lookup by key, frequent inserts and erases, doesn't fit in a flat_map | absl::flat_hash_map<K, V> or folly::F14FastMap<K, V> |
Open-addressing SIMD-scan, 2-3ร faster than std::unordered_map on the same workload[9]. |
| Lookup by key, small to medium static-ish key set, read-heavy | std::flat_map<K, V> (C++23) or boost::container::flat_map |
Sorted vector. Binary search on contiguous memory beats hashing at small N[4]. |
| Ordered iteration, range queries by key | std::map<K, V> |
Red-black tree, sorted iteration is automatic, lower_bound/upper_bound work. |
| LIFO behavior, want to expose only push/pop/top | std::stack<T, std::vector<T>> |
Vector-backed stack. Default of deque is fine but the vector version has less per-op overhead. |
| FIFO behavior | std::queue<T> |
Default deque backing is the right answer. For bounded lock-free, a ring buffer is faster. |
| "Do the highest-priority job next" | std::priority_queue<T> |
Vector + binary heap. O(log n) push and pop, O(1) top. |
| A read-only view of a contiguous range | std::span<const T> / std::string_view |
Pointer + size. No copy, no allocation, no template-on-container. |
20The container pitfalls list
Mistakes that ship to production every year, organized by the container that produces them:
vector pitfalls
vector<bool>is not a container of bool. The standard permits a packed representation (one bit per element) and explicitly carves out the contiguous-storage and ordinary-reference guarantees so implementations can pack; every shipping implementation does.operator[]returns a proxy,&v[0]isn't abool*, and the type doesn't satisfy the normal sequence-container requirements[19]. Usestd::vector<char>,std::vector<std::uint8_t>, orstd::bitset.- Iterators across
push_backare dangling. One of the most common iterator-invalidation bugs in C++ container code. Take a copy of the index instead, or callreserve()first to fix the capacity in advance. shrink_to_fit()is a hint. It is allowed to do nothing. If you need the memory back, swap with an empty vector:std::vector<T>(v.begin(), v.end()).swap(v).
deque pitfalls
- MSVC's deque collapses to one element per chunk for any element larger than 8 bytes. A
deque<MyStruct>on MSVC is essentially one heap allocation per element, with all the cache misses that implies. The libstdc++ and libc++ designs cap the chunk size by bytes (512 and 4096 respectively) and behave much more predictably as the element size grows[22]. - Iteration has a per-chunk-boundary cost. Each increment checks whether we're at a chunk boundary; within a chunk the branch is predictable, but every chunk transition adds a mispredict plus an indirect load. For hot inner loops, copy to a
vectorfirst.
list pitfalls
- It almost always isn't the right answer. Reach for it only when you have a real splice or a real reference-stability requirement. See ยง10.
size()was O(n) in C++03 and is O(1) in C++11. The committee changed it specifically so callers could stop pre-caching the size. If your code targets a really old library, this may still bite you.
map pitfalls
m[key]default-constructs the value if the key isn't present and writes the new entry. A "lookup" that mutates the map. Usefindorcontains(C++20) for "is it there?" checks.- String keys with no transparent comparator allocate on every lookup. Pre-C++14,
m.find("foo")on amap<std::string, V>constructs a temporarystd::string. Usestd::less<>as the comparator to get a transparent compare[29].
unordered_map pitfalls
- The default hash for
std::stringis fine; the default forintis the identity. Bucketing byhash(k) % bucket_countmeans an identity hash collides every key that differs in high bits. For workloads with structured integer keys (especially indices), wrap the key in a hashed type or pass a custom hasher[25]. - Insertion may rehash, which invalidates all iterators (but not pointers/references). Code that inserts in a loop with iterators outstanding has to call
reserve()first, just like vector[20]. - It is the slow hash table. If a hash table shows up in your profile, swap in
absl::flat_hash_maporfolly::F14FastMap. The interface is source-compatible enough for most call sites, but the iterator-invalidation rules differ: flat storage means rehash invalidates pointers and references too, not just iterators. Check your reference-holding code before swapping.
21What's next
The decision shape: pick the layout (one block, chunked, node-based, hashed), pick the access pattern, the container follows. The cheat sheet in ยง19 has the lookup table; the rest of this tutorial is the reasoning behind each row.
Past this tutorial:
- The C++ memory model tutorial covers the atomics and orderings you need to make concurrent containers behave. Lock-free queues use ring buffers and rely on the orderings from that tutorial.
- The job-systems tutorial uses work-stealing deques, which are concurrent variants of the deque from ยง8. The deque shape was specifically chosen to make stealing efficient.
- The Abseil hash containers (
flat_hash_map,node_hash_map) and Folly's F14 are the production-grade open-addressing implementations. Both ship benchmarks; read the F14 design doc for the trade-offs[11]. - For deep production guidance, EASTL's source is a guided tour of game-grade STL: every container is rewritten with allocator-awareness, fixed-storage variants, and intrusive nodes[5].
22Sources & further reading
Numbered citations refer to the superscripts above. Primary sources only: standards, vendor documentation, published papers, talk transcripts.
The prose, code samples, CSS, and interactive demos on this page are original. The vector-vs-list framing follows Stroustrup's Going Native 2012 keynote [1]. The SwissTable description follows Matt Kulukundis's CppCon 2017 talk [9] and the Abseil design page [10]; the "why exactly 7 bits for H2" derivation in ยง13 is the author's reading, not a quote. The growth-factor argument in ยง6 starts from Folly's claim [8] but the explicit golden-ratio derivation (rยฒ โค r + 1) is laid out here. The F14 description follows Facebook's open-source design doc [11]. The EASTL allocator-model critique follows Paul Pedriana's N2271 paper [5]; the three game-engine-specific allocator concerns in ยง17 (budget, frame arena, SIMD overalignment) and their DirectXMath/Unreal grounding are the author's framing. All deque chunk-size numbers were verified against the actual libstdc++, libc++, and MSVC STL source. All complexity guarantees and iterator-invalidation rules match cppreference and the published C++ standard.
- Stroustrup, B. (2012). C++11 Style. Going Native 2012 keynote. YouTube. Includes the vector-vs-list sorted-insertion benchmark and the "cache effects dominate" argument; follow-up at isocpp.org/blog/2014/06/stroustrup-lists.
- Stevens, A. (1995). Alexander Stepanov and STL Interview. Dr. Dobb's Journal. stlport.org. The history of how generic programming led to the STL container set.
- ISO/IEC 14882:2020 (and subsequent revisions). Programming languages โ C++. Standard reference for container complexity guarantees, sequence/associative requirements, and iterator invalidation. Open draft mirrored at eel.is/c++draft.
- Boost. Boost.Container Non-Standard Containers: flat_(multi)map / flat_(multi)set. boost.org. The reference implementation of sorted-vector associative containers, dating to 2003.
- Pedriana, P. (2007). EASTL: Electronic Arts Standard Template Library. WG21 paper N2271. open-std.org. The committee-facing case for the allocator, intrusive, and fixed-storage container set that game code needs.
- cppreference. std::array. en.cppreference.com/w/cpp/container/array. Fixed-size sequence, no heap allocation, inline storage.
-
Muรฑoz, J. D. (2022). Advancing the State of the Art for std::unordered_map Implementations. Overload journal, ACCU. accu.org. Why the bucket API locks
std::unordered_mapinto closed addressing, and what implementers can still optimize within that constraint. -
Facebook / Meta. FBVector โ Folly C++ Library. github.com/facebook/folly. The 1.5ร growth factor, jemalloc cooperation, and trivial-relocation optimization that distinguish fbvector from
std::vector. -
Kulukundis, M. (2017). Designing a Fast, Efficient, Cache-friendly Hash Table, Step by Step. CppCon 2017. YouTube. The original public description of SwissTable: H1/H2 split, 16-slot SIMD groups, 1-byte control words, 2-3ร faster than
std::unordered_map. - Google. Abseil Swiss Tables and absl::Hash. abseil.io/blog/20180927-swisstables. Open-sourcing announcement and integration with Abseil's hash framework.
- Facebook / Meta. F14 Hash Table. github.com/facebook/folly. 14-key chunks, three storage variants (Value/Node/Vector), engineering announcement at engineering.fb.com.
-
Bentley, M. Introduction of std::hive to the standard library. WG21 paper P0447 (latest revision R28). open-std.org/p0447r26; tracker at cplusplus/papers#328. The standardized bucket-array container, evolved from
plf::colony; reference implementation at github.com/mattreecebentley/plf_hive. Adopted for C++26. - Hennessy, J. L., & Patterson, D. A. Computer Architecture: A Quantitative Approach, 6th ed. Morgan Kaufmann (2017). The canonical citation for cache-hierarchy latency ratios and the cycle counts in ยง3.
- Intel. Intelยฎ 64 and IA-32 Architectures Optimization Reference Manual. intel.com. Chapter on memory hierarchy and the hardware prefetcher's pattern requirements.
- LLVM. llvm::SmallVector documentation. llvm.org. The reference inline-buffer-with-heap-fallback design.
-
Stroustrup, B. (2013). The C++ Programming Language, 4th ed. Addison-Wesley. ยง31.4: "Use vector as your default container." The recommendation that
std::vectoris the right default; the cppcoreguidelines mirror it. -
C++ Core Guidelines Working Group. SL.con.1: Prefer using STL
arrayorvectorinstead of a C array. isocpp.github.io. - GCC libstdc++. stl_vector.h source. github.com/gcc-mirror/gcc. Three-pointer layout (begin / end / end-of-storage), 2ร growth factor.
-
cppreference. std::vector. en.cppreference.com/w/cpp/container/vector. Complexity table, iterator-invalidation rules,
vector<bool>note. - cppreference. std::unordered_map. en.cppreference.com/w/cpp/container/unordered_map. Bucket API, complexity, rehash and invalidation rules.
- cppreference. std::deque. en.cppreference.com/w/cpp/container/deque. Chunked-array implementation note, indexing complexity.
- Chen, R. (2023). Inside STL: The deque, implementation. The Old New Thing, Microsoft DevBlogs. devblogs.microsoft.com. Side-by-side description of libstdc++, libc++, and MSVC's deque chunk sizes.
- cppreference. std::list. en.cppreference.com/w/cpp/container/list. Doubly-linked layout, O(1) splice, reference-stability guarantee.
- Chen, R. (2023). Inside STL: The map, set, multimap, and multiset. The Old New Thing, Microsoft DevBlogs. devblogs.microsoft.com. Confirms red-black tree as the implementation strategy across libstdc++, libc++, and MSVC.
- Aumasson, J.-P., & Bernstein, D. J. (2012). SipHash: a fast short-input PRF. aumasson.jp/siphash/siphash.pdf. The keyed-hash design that became the default for hash-flooding-resistant hash tables (Rust, Python 3, Ruby).
- Google. Swiss Tables Design Notes. abseil.io/about/design/swisstables. H1/H2 hash split, control-byte layout, SIMD match details.
- cppreference. std::priority_queue. en.cppreference.com/w/cpp/container/priority_queue. Binary-heap backing, O(log n) push/pop, no decrease-key.
- cppreference. std::span. en.cppreference.com/w/cpp/container/span. C++20 non-owning view, lifetime requirements.
-
cppreference. std::map. en.cppreference.com/w/cpp/container/map.
std::less<>transparent comparator details under C++14. -
Le, Z., et al. A Standard flat_map. WG21 paper P0429R9. open-std.org/p0429r9. The paper that brought
std::flat_mapandstd::flat_multimapinto C++23. -
Zhihao Yuan. A Standard flat_set. WG21 paper P1222. open-std.org/p1222r4. The companion paper for
std::flat_setandstd::flat_multisetin C++23. - Sutter, H. (2014). GotW #90 Solution: Factories. herbsutter.com. Touches the "vector by default" recommendation in the context of return-type design.
- Alexandrescu, A. (2014). Optimization Tips โ Mo' Hustle Mo' Problems. CppCon 2014. YouTube. Small-string and small-vector optimizations behind FBString and FBVector.
-
Bentley, M. (2017). Colonies, performance and why you should care. CppCon 2016. YouTube. The original public talk on
plf::colony, designed for game-engine entity pools. - Celis, P., Larson, P.-ร ., & Munro, J. I. (1985). Robin Hood Hashing. FOCS 1985. semanticscholar.org. The variance-reducing open-addressing technique that several modern hash tables build on.
- Stepanov, A., & Lee, M. (1995). The Standard Template Library. HP Labs Technical Report HPL-95-11. stepanovpapers.com. The original STL design document.
- Bronson, N., & Shi, X. (2019). Open-sourcing F14 for faster, more memory-efficient hash tables. Facebook Engineering. engineering.fb.com. Source for the 14-slot chunk choice ("filters 14 slots at once"), the SSE2/NEON intra-chunk SIMD compare, the 12/14 max load factor, and the chunk-overflow-count mechanism. The F14 design markdown in the Folly repository (F14.md) gives the chunk-layout details (14 tag bytes + 2 metadata bytes = one 16-byte aligned block).