Big O Notation
A hundred objects in the test scene, ten thousand in the shipping one. The build runs at 60 fps on the laptop and 4 fps on the target box. The animation that took 1 ยตs on day one takes 18 ms once the encounter is populated. Big O is the language game programmers use to predict that, before the QA report comes in. We start with the 1894 definition, work up through the cousin notations, the master theorem, and four worked engine case studies (broad-phase collision, BVH ray traversal, A*, draw-call sorting), and run every step as a live demo in your browser.
01Why a notation, not a stopwatch
A naive collision check on the entities in a scene is two nested loops: for every pair of entities, test whether they overlap. On the four-object test scene that's six comparisons; the function takes 200 nanoseconds and nobody profiles it. On the eight-hundred-object shipping scene that's roughly 320,000 comparisons, the function takes 6 milliseconds, the physics step misses its frame budget, and the player's first encounter ships at 17 fps. A stopwatch on day one tells you nothing about day three hundred. A notation that says this function does work proportional to nยฒ does, because the multiplier the player will pay was visible the moment the function was written.
is that language. It strips a function down to its dominant growth term and throws away everything else (constants, lower-order terms, the specific hardware). What's left is the shape of the curve, which is the thing that determines whether a system survives the jump from test scene to shipping content. The shape doesn't care about your CPU. It cares about how the cost grows when the workload does.
A working understanding of asymptotic notation strong enough to read a published algorithm description and predict its runtime curve; the discipline to spot accidental quadratic behavior in a code review; the math for the four most common divide-and-conquer recurrences via the master theorem; a feel for when Big O lies (cache effects, constant factors); and four worked engine case studies (broad-phase collision, BVH ray traversal, A*, draw-call sorting) where the asymptotic choice is the design choice. By the end you'll know why The Last of Us Part II's spatial hash beats sweep-and-prune on dense crowds, why ray-traced reflections are tractable at all[1], why Unreal sorts draw calls with a radix pass[2], and why Tetris is NP-complete[3].
The shape that ate the encounter
The widget below plots three implementations of "find all pairs of objects that might be touching": brute force at O(nยฒ), sweep-and-prune at roughly O(n log n), and a uniform spatial hash at O(n + k). Drag the object-count slider:
02A short history of writing down growth rates
Big O was not invented for computer science. It was invented for analytic number theory, and the people who pinned down its modern shape did so over the course of about 80 years. A tour, because the current usage only makes sense in context:
The recurring shape: a notation invented in 1894 to bound error terms in prime-counting formulas, sharpened in 1976 to handle algorithmic lower bounds, and extended in 1999 to model the memory hierarchy that real machines actually have. The version you'll use is Knuth's; the version that catches every bug is the one with cache-aware constants stapled on.
03The definition, picked apart
Big O is defined for functions from non-negative integers (or reals) to non-negative reals. The formal statement, the way Knuth wrote it[10] and CLRS teaches it[11]:
Read aloud: "f(n) is O(g(n)) if there exist constants c and nโ such that for every n at least nโ, f(n) is sandwiched between zero and c times g(n)." The c is what lets you throw away constant factors; the nโ is what lets you throw away lower-order terms. Both are existential, which means the definition is satisfied as soon as you can find any pair that works.
The three things this definition does not say are at least as important as what it does say:
- It is not an equation. The "=" in "f = O(g)" is read "is" or "is in," not "equals." Some authors write f โ O(g) to make the set membership explicit, and that's strictly more correct. The "=" is convention because that's how Bachmann wrote it[7].
- It is one-sided. Big O gives an upper bound. f(n) = O(nยฒ) is true if f grows like nยฒ, but also if f grows like n, or n log n, or any slower function. Saying "linear search is O(nยฒ)" is technically correct and almost always misleading.
- It is asymptotic. Big O makes no claim about small n. An algorithm that is O(n) can be slower than one that is O(nยฒ) on every input you ever feed it, as long as the latter's nยฒ is multiplied by a small enough constant. The classic example: insertion sort beats quicksort below n โ 16, which is exactly why every production sort cuts over to insertion sort at small sizes[12].
Drag the widget below. The orange curve is f(n); the violet curve is c ยท g(n) for the chosen c. The vertical line is nโ. Big O holds when the orange curve stays under the violet one for every n to the right of the line:
Why is nโ existentially quantified?
The point of nโ is to let you ignore noise at small n. A function might do weird things for the first ten or hundred inputs (a startup cost, a one-time table fill, a cache that hasn't warmed up) and then settle into its real growth shape. Big O cares about the real growth shape, not the noise.
Choosing nโ = 0 is a stronger claim ("the bound holds for every input"); choosing nโ = 106 is a weaker one ("the bound holds for every input I care about"). The definition says you only need to find any nโ, which is the right tradeoff for ignoring small-input artifacts without throwing away the bound.
04Omega, Theta, and the rest of the family
Big O is an upper bound. The full asymptotic toolkit has four cousins. Each is one inequality flip away from the others:
| Name | Symbol | Meaning | Definition | Reads as |
|---|---|---|---|---|
| Big O | O(g) | Upper bound, not tight | โc, nโ โnโฅnโ: f โค cยทg | "grows no faster than" |
| Big Omega | ฮฉ(g) | Lower bound, not tight | โc, nโ โnโฅnโ: f โฅ cยทg | "grows no slower than" |
| Big Theta | ฮ(g) | Tight bound, same growth | f = O(g) and f = ฮฉ(g) | "grows exactly like" |
| Little o | o(g) | Strictly slower | โc > 0 โnโ โnโฅnโ: f < cยทg | "grows strictly slower than" |
| Little omega | ฯ(g) | Strictly faster | โc > 0 โnโ โnโฅnโ: f > cยทg | "grows strictly faster than" |
The flip from "big" to "little" is the flip from existential to universal on the constant: big-O finds one c that bounds f above; little-o requires that every positive c eventually bounds f above. That's a much stronger statement: f can't just be no-bigger-than g by some constant, it has to grow strictly slower.
The same function fits five bounds. ฮ(nยฒ) is the tightest description (matching upper and lower); the others are looser. In a tutorial, saying "merge sort is ฮ(n log n)" is more informative than "merge sort is O(n log n)" because ฮ rules out the merge sort that secretly runs in O(n) (it doesn't), while O leaves that possibility open.
What about little-o?
Little-o is what you reach for when you want to claim that one function is fundamentally faster-growing than another, not just that one happens to be bigger. The statement 1/n = o(1) says 1/n shrinks to zero, which is the asymptotic version of "1/n vanishes." The statement n log n = o(nยฒ) says no constant multiple of n log n can keep up with nยฒ: the gap between them grows without bound.
Knuth's 1976 paper makes the distinction explicit and warns against the common abuse of writing O(n) when ฮ(n) is meant[10]. The abuse is unfortunately so widespread that "an O(n log n) sort" almost always means "a ฮ(n log n) sort" in practice. Read all O() in this sense unless the author explicitly distinguishes.
Algorithm analysis usually wants ฮ (matching upper and lower bounds) for the algorithm you wrote, and ฮฉ for the problem you're solving (the best possible algorithm). "Merge sort is O(n log n)" and "comparison-based sorting is ฮฉ(n log n)" combine to say "merge sort is asymptotically optimal among comparison sorts," which is the actual claim.
05The growth zoo, racing
About eight complexity classes show up in 95% of working algorithms. Memorize their shapes and you can read most papers without doing the algebra:
| Class | Name | Where you'll see it | At n=106 (1 ns/op) |
|---|---|---|---|
| O(1) | Constant | Hash table lookup (average), array index, push to a stack | 1 ns |
| O(log n) | Logarithmic | Binary search, balanced BST operation, BVH ray traversal level | 20 ns |
| O(โn) | Square-root | Sqrt decomposition; tile size in cache-oblivious algorithms | 1 ยตs |
| O(n) | Linear | Linear search, single pass over an array, building a histogram | 1 ms |
| O(n log n) | Linearithmic | Comparison-based sort, FFT, sweep-and-prune broad phase | 20 ms |
| O(nยฒ) | Quadratic | Nested-loop pair test, bubble sort, naive matrix-vector product | 17 min |
| O(nยณ) | Cubic | Naive matrix multiply, Floyd-Warshall, all-pairs shortest path | 32 years |
| O(2n) | Exponential | Brute-force SAT, subset enumeration, game-tree search at full depth | โซ age of universe |
| O(n!) | Factorial | Brute-force TSP, permutation enumeration | โซโซ age of universe |
The "at n=106" column makes the spacing concrete: ten orders of magnitude separate O(n) from O(nยฒ), and another seventeen separate O(nยฒ) from O(nยณ). These are the gaps that distinguish a frame budget from a freeze, and a freeze from "won't terminate before the game studio goes bankrupt."
The widget below races them on a log-log plot. On log-log axes a polynomial nk is a straight line with slope k, which is why log-log is the standard way to compare growth rates by eye:
06When constants lie
Big O drops constants on purpose: a notation that cared about constants couldn't compare implementations across hardware, compilers, or decades. The price is that two algorithms with the same Big O can differ by orders of magnitude in real-world cost, and an algorithm with worse Big O can beat one with better Big O on inputs you'll actually see.
The textbook example is 100n versus nยฒ / 100. The first is O(n); the second is O(nยฒ). By any asymptotic measure, the linear algorithm wins. Solve for the crossover:
Below n = 10,000 the "worse" algorithm is genuinely faster. If you ship a system that never exceeds 10,000 elements, picking the quadratic implementation is the correct engineering decision. The asymptotic ranking is informative, not authoritative.
What constants stand for in practice
An algorithm's hidden constant absorbs all the per-operation work the abstraction throws away. Three things commonly drive a small Big O into an unshippable wall:
- Cache misses. An L1 hit is ~4 cycles; an L3 hit is ~30; a main-memory hit is ~200[15]. A linked-list traversal that visits n nodes scattered across memory is asymptotically O(n), but it's also 50ร slower than a contiguous O(n) array traversal. Both are O(n); only one is shippable in a hot loop.
- Branch mispredictions. A modern CPU pipelines 15โ20 instructions ahead; a mispredicted branch flushes that pipeline at a cost of 15โ20 cycles. An algorithm with the same Big O as a branch-free alternative but a data-dependent branch can lose by a factor of 5ร[16]. Most production sort fallbacks (insertion sort at small n, sorting networks at very small n) are about branch prediction, not asymptotic complexity.
- Memory allocation. An O(n) algorithm that mallocs once per element on a server-grade allocator pays a few hundred nanoseconds per call; the same algorithm with a bump allocator pays a few. The Big O is the same. The shipping budget is not.
Frigo et al.'s 1999 cache-oblivious paper[13] formalized the first of these into a parallel complexity model that counts cache transfers instead of CPU operations. Sometimes the asymptotic count over cache transfers is the bound that matters; the CPU-op count is constant-factor noise. The takeaway is not that Big O is wrong, it's that Big O is one of two or three complexity functions a serious analysis needs.
An algorithm with hidden constant 106 is asymptotically the same as one with hidden constant 1, but on inputs of size n โค 106 the second is up to a million times faster. The Coppersmith-Winograd matrix-multiply algorithm has a famous constant in the trillions; no shipping linear-algebra library uses it[14]. The Big O is an admission of ignorance about the constant, not a claim that the constant is small.
07Worst, best, average, amortized
An algorithm's runtime is a function of its input. Different inputs produce different runtimes. Big O over an algorithm has to specify which inputs you mean:
- Worst case. The runtime on the slowest input of size n. The default. When the literature says "quicksort is O(nยฒ)," this is the case meant.
- Best case. The runtime on the fastest input. Rarely interesting; insertion sort's best case is O(n) (sorted input), which is occasionally relevant.
- Average case. The expected runtime over some distribution of inputs. Quicksort's average is O(n log n) under a uniform-random input distribution. Useful but harder to claim cleanly: the answer depends on what you assume about the input.
- Amortized. The average runtime per operation over a long sequence. std::vector::push_back's amortized cost is O(1) even though individual pushes can cost O(n) when the vector resizes.
Quicksort: the canonical worst-versus-average gap
Quicksort partitions the array around a pivot, recurses on the two halves, and concatenates. On a random pivot it halves the array on average, giving the recurrence T(n) = 2T(n/2) + ฮ(n), which the master theorem (ยง8) solves to ฮ(n log n). On an adversarial pivot (always-smallest or always-largest), it produces partitions of sizes 0 and nโ1, giving T(n) = T(nโ1) + ฮ(n), which sums to ฮ(nยฒ).
A sorted-or-reverse-sorted input plus a naive first-element pivot triggers the worst case on every recursion. The widget runs quicksort on three input patterns:
Amortized: std::vector::push_back
A dynamic array doubles its capacity when full. Most pushes are O(1) (write to the next slot). The pushes that trigger a resize are O(n) (copy every element to a bigger buffer). Worst-case per push is O(n); that's not interesting because it's misleading. Amortized per push, summed over a sequence and divided by length, is O(1).
The proof, from Tarjan's 1985 paper that formalized the technique[17]: n pushes do at most n single-slot writes plus a sequence of resize-copies of sizes 1, 2, 4, 8, ..., 2โlog nโ. That sum is bounded by 2n. Total work is at most 3n; per-push amortized cost is at most 3 = O(1).
08The Master Theorem
Most divide-and-conquer algorithms produce a recurrence of the form:
"To solve a problem of size n, do a recursive subsolves each of size n/b, plus f(n) of additional work to combine them." Merge sort is a = 2, b = 2, f(n) = ฮ(n). Binary BVH construction with sweep partitioning is similar.
The Master Theorem[11] classifies these recurrences in three cases, depending on whether the recursive work or the combine work dominates. Let ccrit = logb a:
- Case 1: leaves dominate. If f(n) = O(nc) with c < ccrit, then T(n) = ฮ(nccrit). The recursion tree has so many leaves that they outweigh the combine work.
- Case 2: every level costs the same. If f(n) = ฮ(nccrit), then T(n) = ฮ(nccrit log n). Each of the logb n levels does the same amount of work. Merge sort lives here: ccrit = 1, f(n) = ฮ(n), total ฮ(n log n).
- Case 3: root dominates. If f(n) = ฮฉ(nc) with c > ccrit, and a regularity condition holds, then T(n) = ฮ(f(n)). The top-level combine work is so expensive that the recursion contributes nothing significant.
Worked examples
Three standard divide-and-conquer recurrences and what the theorem says about them:
| Algorithm | Recurrence | ccrit | Case | Result |
|---|---|---|---|---|
| Binary search | T(n) = T(n/2) + O(1) | 0 | Case 2 (c=0) | ฮ(log n) |
| Merge sort | T(n) = 2T(n/2) + O(n) | 1 | Case 2 | ฮ(n log n) |
| Karatsuba multiplication | T(n) = 3T(n/2) + O(n) | logโ3 โ 1.585 | Case 1 | ฮ(nlogโ3) โ ฮ(n1.585) |
| Strassen matrix multiply | T(n) = 7T(n/2) + O(nยฒ) | logโ7 โ 2.807 | Case 1 | ฮ(nlogโ7) โ ฮ(n2.807) |
| Naive matrix multiply (recursive) | T(n) = 8T(n/2) + O(nยฒ) | 3 | Case 1 | ฮ(nยณ) |
| BVH top-down (binned SAH) | T(n) = 2T(n/2) + O(n) | 1 | Case 2 | ฮ(n log n) |
The pattern: how many subproblems versus how aggressively you shrink them is the lever. Strassen's algorithm beats the naive cubic matrix multiply because it trades one recursive call for a constant amount of extra combine work; that one fewer recursion drops the exponent from 3 to logโ 7. The same insight scales: the long arc of matrix-multiply progress (Strassen 1969, Coppersmith-Winograd 1990, the modern improvements down to O(n2.371552))[18] is a sequence of finer subdivisions that drop a further. Only Strassen is practical; the rest are galactic (ยง14).
The cases that aren't in the Master Theorem
The classical Master Theorem doesn't cover every recurrence. Three gaps:
- Logs in f(n). If f(n) = ฮ(n logk n), the original theorem doesn't apply. The Akra-Bazzi method extends it.
- Uneven splits. T(n) = T(n/3) + T(2n/3) + O(n) (median-of-medians selection variants) requires Akra-Bazzi or direct analysis.
- Subtractive recurrences. T(n) = T(n-1) + O(n) (quicksort worst case, naive recursion) isn't divide-and-conquer; it's "shrink by 1." The answer is the sum ฮ(nยฒ), found by direct summation.
For the recurrences this tutorial cares about, the three-case Master Theorem above is sufficient. The Akra-Bazzi version is in CLRS chapter 4[11].
09Measuring growth empirically
Sometimes the algorithm is too complex to analyze by hand and you have to measure. The classical trick: run it at sizes n, 2n, 4n, 8n, plot on log-log axes, and read the slope.
On log-log axes, T(n) = c ยท nk becomes log T = log c + k ยท log n, which is a straight line with slope k and intercept log c. Eyeballing the slope is a coarse but reliable diagnostic: a slope of 1 means O(n), a slope of 2 means O(nยฒ), anything between 1 and 2 is suspicious (could be O(n log n), could be O(n1.5), hard to tell from slope alone).
A more discriminating test: the doubling ratio. Measure T(n) and T(2n); the ratio T(2n) / T(n) diagnoses the class:
| Doubling ratio | Diagnosis | Slope on log-log |
|---|---|---|
| ~1 | O(1) | 0 |
| ~1 + 1/log n | O(log n) | tends to 0 |
| ~2 | O(n) | 1 |
| ~2.1โ2.3 | O(n log n) | slightly above 1 |
| ~4 | O(nยฒ) | 2 |
| ~8 | O(nยณ) | 3 |
The widget below shows a measured running time and lets you guess the complexity. The data has noise (cache effects, GC pauses, scheduler hiccups) so the diagnosis isn't always exact at small n; that's the realistic case.
10Case study: broad-phase collision
The classic engine-side O(nยฒ) trap. Physics simulations need to find every pair of objects whose bounding volumes overlap. A naive double-loop checks n(nโ1)/2 pairs; at n = 1000 that's 500,000 pairs every frame, at 60 fps that's 30 million pair tests per second, and a pair test is several SIMD ops, so the broad phase alone budgets out the rest of physics on a single core.
Three standard fixes, with different complexity in different regimes:
Sweep-and-prune (sort and sweep)
Sort the bounding-box endpoints along one axis (say X). Two boxes can only overlap on the X axis if neither's max-X is before the other's min-X. Sweep through the sorted endpoints with an active list; when you hit a min-X, add the box to the active list; when you hit a max-X, remove it. Every pair of boxes simultaneously active is a candidate. Repeat on Y and Z; take the intersection.
The first frame's cost is dominated by the initial sort at O(n log n); the sweep itself is O(n + k) where k is the number of active-pair encounters. In subsequent frames the endpoint lists are nearly sorted (objects move a little, not a lot, per step), and insertion sort on a nearly-sorted list runs in O(n + s) where s is the number of out-of-order pair swaps. In steady state that puts per-frame cost at O(n + k + s), which is close to linear when both k and s are O(n). The pathological case (every box overlaps every other) blows k up to O(nยฒ) and you're back to the naive cost. The Cohen et al. 1995 I-COLLIDE paper introduced the temporal-coherence trick[4].
Uniform spatial hash
Partition space into a regular grid. Hash each object's AABB into the cells it touches. Two objects can only collide if they share a cell. Build cost is O(n); query cost is the sum over cells of the per-cell pair-test count.
If average density is d objects per cell and the cell size is at least as large as the biggest object's AABB (so each object touches O(1) cells), the per-cell pair-test count averages dยฒ/2, and total cost is O(n ยท d), which collapses to O(n) when d is bounded. The catch: d is bounded only when objects are spread out. A pile of crates on the same tile reduces to brute force inside that tile. NVIDIA GPU Gems 3 describes a GPU spatial-hash implementation that handles dense scenes by sub-binning[5].
BVH for moving objects
Build a bounding volume hierarchy over the objects' AABBs and query each object against the tree. Construction is O(n log n); per-query is O(log n) in well-spread scenes; total is O(n log n). Bullet ships btDbvtBroadphase (dynamic AABB tree) and Box2D ships b2DynamicTree as the default broad phase for exactly this reason; PhysX defaults to sweep-and-prune but supports a BVH alternative for sparse scenes[1].
The widget runs all three on the same simulated scene. At low object counts brute force wins on absolute time because its constant is tiny; at high counts it loses by an enormous margin:
11Case study: BVH ray traversal
Ray tracing a scene of n triangles by brute force means testing the ray against every triangle: O(n) per ray, O(n ยท pixels) per image. At n = 107 triangles and a 1080p image that's 2ร1013 ray-triangle tests per frame, which is several orders of magnitude past tractable.
A reduces this to O(log n) expected per ray on well-built scenes. The trick: organize triangles into a binary tree where each node's AABB encloses every triangle in its subtree. To intersect a ray with the scene, test the ray against the root's AABB; if it misses, the whole subtree is irrelevant. If it hits, recurse on both children. Continue until reaching a leaf, where you test the ray against the small bucket of triangles inside.
The Physically Based Rendering textbook lays out the math: with a balanced BVH and the surface-area heuristic for the build, expected per-ray cost is logarithmic in n[1]. The worst case is still O(n) (a degenerate ray that hits every leaf), but well-built scenes hold the log bound on every ray that matters.
12Case study: A* and the branching factor
A* searches a graph for the shortest path from start to goal. At each step it picks the frontier node with the lowest estimated total cost f(n) = g(n) + h(n), where g is the known cost so far and h is a heuristic estimate to the goal. With an admissible heuristic, A* is optimal; with no heuristic, it degenerates to Dijkstra; with a perfect heuristic, it walks a straight line to the goal.
The worst-case complexity is O(bd), where b is the branching factor (how many neighbors a node has) and d is the depth of the solution[20]. On a 2D grid b = 4 (von Neumann) or b = 8 (Moore neighborhood); on a navmesh b is the average degree of a navmesh polygon, usually 3โ6.
The exponential isn't in n (the size of the graph); it's in d (how far the goal is). For a typical 100ร100 grid with the goal 50 steps away, naive BFS would visit roughly 450 nodes if it didn't track visited ones; with the visited-set, BFS visits n = 104 nodes. A* with a good heuristic visits something like 5 ยท 50 = 250 nodes: the visited region is a corridor along the optimal path, not a disk.
The heuristic's quality is the lever. A perfect heuristic (knows the true distance) makes A* visit exactly the optimal path. An admissible-but-loose heuristic (Manhattan distance on a grid with obstacles) makes A* explore a fan along the path. An inadmissible heuristic (overestimates) makes A* fast but the result is no longer optimal:
13Case study: sorting draw calls
A renderer issues thousands of draw calls per frame. Reordering them to group calls that share a material, mesh, or render target is one of the biggest performance wins available: state changes on the GPU are expensive, and a sort that batches them amortizes the cost[2].
The standard pattern: every draw call gets a 32- or 64-bit sort key packing render target, shader, material, depth, and so on into bit fields in priority order. Sort the array of keys. Issue the calls in the sorted order. Comparing the sort cost matters because the renderer runs it every frame:
- std::sort (introsort). O(n log n) comparison-based. On a uniformly-random array of 64-bit keys, modern x86 implementations run on the order of a few milliseconds per 100,000 elements; partially-sorted input is faster because introsort short-circuits into insertion sort.
- Radix sort. O(d ยท n) where d is the number of digits (typically bytes) in the key. For a 64-bit key with 8-bit digits, d = 8. Cache-aware implementations (the ska_sort variant is a representative benchmark) run roughly several times faster than std::sort on the same input. The speedup grows with n: small at 8k calls, decisive past 100k.
Why isn't every sort a radix sort? Two reasons. First, comparison sorts work on any comparable type; radix sorts work on fixed-width integer keys. Second, naive radix sort scatters writes across buckets and trashes the cache, so the constant is larger than the textbook complexity suggests; the implementations that beat std::sort handle that explicitly. The Bevy engine's issue tracker has a detailed discussion of radix-vs-comparison for the sort phase[21]; the canonical real-time-collision-detection blog post by Christer Ericson lays out the bit-packing pattern most engines use[2].
Radix sort being O(n) (treating d as a constant for fixed-width keys) sounds like it violates the ฮฉ(n log n) lower bound for sorting. It doesn't, because the lower bound applies to comparison-based sorts; radix is not comparison-based. The lower bound's proof rests on counting binary-decision-tree leaves[11]; radix sort doesn't make comparisons, so the proof doesn't apply.
// A 64-bit draw-call sort key. Field order is priority order: // the leftmost (most significant) bits drive the coarsest grouping. // Lay out fields top-down so a single 64-bit integer compare // produces the desired ordering with no branches. struct DrawCallKey { uint64_t renderTarget : 2; // 0=shadow, 1=opaque, 2=transparent, 3=postfx uint64_t shaderProgram : 10; // 1024 programs is plenty uint64_t materialId : 14; // 16K unique materials per frame uint64_t meshId : 14; // 16K unique meshes; rebinding VBO is costly uint64_t depth : 24; // front-to-back for opaque; reversed for transparent // 64-bit compare = whole-struct compare. The bit layout // implies the priority. Sorting an array of these is one // radix pass over the 8 bytes, or one std::sort over uint64_t. }; static_assert(sizeof(DrawCallKey) == 8, "key must pack into a uint64");
The bit layout is the asymptotic algorithm: by putting render target in the high bits, every same-target draw is contiguous after sorting; by putting depth in the low bits, ties at higher levels resolve to depth order. The sort itself is incidental. The design is the key layout.
14Galactic algorithms
An algorithm is galactic if its asymptotic complexity is the best known but its hidden constants are so large that it loses to a worse-asymptotic algorithm on every input that fits in the observable universe[14]. The term was popularized in Lipton and Regan's 2010s essays on algorithm theory.
The flagship example is matrix multiplication. The naive triple-loop algorithm is O(nยณ). The progression of asymptotic improvements has been steady since the 1960s:
| Year | Author(s) | Exponent ฯ | Practical? |
|---|---|---|---|
| 1969 | Strassen | logโ7 โ 2.807 | Yes; production for n โฅ 100 |
| 1978 | Pan | 2.795 | No; constants too large |
| 1981 | Schรถnhage | 2.522 | No |
| 1990 | Coppersmith-Winograd | 2.376 | No; canonical galactic algorithm |
| 2014 | Le Gall | 2.3728639 | No |
| 2024 | Vassilevska Williams, Y. Xu, Z. Xu, Zhou[18] | 2.371552 | No |
| 2025 | Alman, Duan, Williams, Y. Xu, Z. Xu, Zhou[18] | 2.371339 | No |
Every entry past Strassen is galactic. Estimates of the crossover n at which a Coppersmith-Winograd-style algorithm would beat a cache-blocked cubic GEMM put it well past 1020, more than any concrete matrix that ships in a graphics or scientific application. For shipping graphics, audio, and physics code, GEMM is O(nยณ); the asymptotic wins are mathematical theorems, not engineering choices.
Galactic algorithms are still worth studying, for two reasons. First, they bound what's possible: every step closer to nยฒ moves the lower-bound theorem closer. Second, the techniques sometimes transfer down: Strassen's algorithm is practical, and it began the chain. If a future technique drops the constant by a factor of n, the resulting algorithm beats O(nยณ) in cells you can actually rent. The lesson is that "asymptotically faster" is a necessary, not sufficient, condition for "actually faster."
15NP-hard games
Big O can describe problems for which no polynomial-time algorithm is known, and which are believed to require exponential time. The class NP is the set of decision problems whose "yes" answers can be checked in polynomial time given a witness. The class P is the subset whose answers can also be computed in polynomial time. The P vs NP question is whether these are the same class.
A problem is NP-hard if every problem in NP reduces to it in polynomial time; NP-complete if it's both NP-hard and in NP. The Cook-Levin theorem (Cook 1971, Levin 1973 independently) showed that SAT is NP-complete, the first such proof[22]; thousands of problems have since been shown NP-complete by polynomial reduction from SAT.
Several puzzles are known NP-hard or NP-complete in their generalized forms:
| Game | Problem statement | Complexity | Reference |
|---|---|---|---|
| Generalized Tetris | Maximize survival on a known sequence of pieces | NP-complete | [3] |
| Sokoban | Push boxes onto target tiles | PSPACE-complete | [23] |
| Super Mario Bros | Reach the end of a generalized level | NP-hard | [24] |
| Minesweeper consistency | Is a given board consistent? | NP-complete | [25] |
| Generalized Sudoku | Solve an nยฒ ร nยฒ Sudoku | NP-complete | [26] |
These results apply to generalized versions (Tetris on an n ร m board, Sudoku on nยฒ ร nยฒ). The fixed-size game you play in your browser is solvable by brute force; the asymptotic claim is about the family. The practical engineering takeaway is that AI for these problems can't be expected to scale: heuristics and learned policies dominate every published agent because exact solutions are out of reach.
Procedural generation hits the same wall in the other direction. Generating a guaranteed-solvable Sokoban level requires solving Sokoban, which is PSPACE-complete[23]. Generators ship anyway, by using a constrained subset of the rules where the problem becomes tractable, or by checking solvability with a bounded search budget and discarding levels that don't certify within it.
16Beyond polynomial
Past exponential are growth rates so vast they have no operational meaning. The number of atoms in the observable universe is roughly 1080; 2n crosses that threshold at n โ 267, and n! crosses it at n โ 60. The grows faster than any primitive-recursive function; at very small inputs it produces numbers that can't be written down in conventional notation[27].
These aren't useful complexities for algorithms ("the algorithm runs in A(n, n) time" is functionally indistinguishable from "the algorithm doesn't terminate"), but they show up as inverse functions in tight upper bounds. The classic example: union-find with union-by-rank and path compression has an amortized cost per operation of ฮฑ(n), where ฮฑ is the inverse Ackermann[17]. For any n that fits in any computer ever built, ฮฑ(n) โค 4: in practical terms, the operation is constant-time. The function is technically not constant because it grows, just incomprehensibly slowly.
17Try it yourself
The playground lets you pick a data structure and an operation, and shows the asymptotic cost along with the actual operation animated. Six common structures, four operations each.
18How engines actually use this
Asymptotic complexity drives most major data-structure decisions in shipped engines, with constant-factor refinements layered on top. Concrete choices:
- Unreal Engine's
TMap. Backed byTSet, which is a chained hash table: each bucket stores indices into a contiguous element array, and collisions are resolved by walking a per-bucket chain. Average-case O(1) lookup, with the O(n) worst case showing up under pathological hash collisions. Choice of chaining (over open addressing) lets Unreal store elements densely while keeping iteration cheap. Documentation describes the design tradeoffs explicitly[28]. - Unity's
NativeParallelMultiHashMap. A Burst-compatible hash table that allows multiple values per key, used in the Entities package for spatial partitioning. Average O(1) insert and lookup, with explicit bucket-count tuning because the Burst compiler doesn't dynamically resize across job boundaries. (Older Collections versions exposed the same type asNativeMultiHashMap; renamed in Collections 2.x.) - Bevy ECS's archetype graph. Adding or removing a component changes an entity's archetype. Bevy maintains an
Edgescache on each archetype, mapping each component bundle to the destination archetype, so transitions are O(1) amortized after the first cache miss. Without it, the lookup would be O(a) in the number of archetypes per move. - Unreal's GPU particle sort. Transparent particles are sorted on the GPU via a bitonic merge sort (
FParticleSortBuffers/GPUSort.cpp) at O(n logยฒ n) work, parallel across cores. Equivalent CPU sorts at the same scale would be O(n log n) serial but pay a round-trip to upload sorted indices to the GPU, which is the wrong constant for the particle pipeline. - Spatial hash for proximity queries. A uniform grid keyed on world-space cell coordinates, with each cell storing the list of objects whose AABB touches it, gives O(1) amortized insert and O(k + 1) per-query for k objects near the query point. The cell size has to be at least as large as the largest object's AABB to keep object-touches-cells at O(1). Teschner et al. 2003 analyze the cell-size tradeoff for collision detection[29].
The shared shape: pick the asymptotic class first, then refine for cache and parallelism. An engine that picks O(nยฒ) by accident and then "optimizes" the constant will not catch the bug. An engine that picks O(n log n) and then optimizes the constant ships at 60 fps.
19Pitfalls
The mistakes that occur most often in code review, in roughly decreasing frequency:
- Accidental quadratic from string concatenation. Building a string by repeated s = s + part is O(nยฒ) in most languages (each concat copies the whole accumulator). Use a string builder or a join; both are O(n). The same bug appears in tight loops building command buffers, log lines, or shader source.
- Linear search in a hot loop. "Find this asset by name" inside an inner loop turns a per-frame cost of O(1) into O(n), which compounds with anything wrapping it. Cache the lookup outside the loop or use a hash table.
- O(log n) in a tight loop is not "free." Logarithm time is constant in casual analysis but adds up: a binary search per pixel at 1080p is 2 million log-time operations, 50 ns each, 100 ms per frame. The log shows up in profilers as a flat 8% nobody notices until they look.
- Hash table worst case. A hash table is O(1) average and O(n) worst case. If your hash function is bad, the worst case is the actual case. Use cryptographic-quality hashes for adversarial inputs (user-supplied names) and check for collision chains in QA builds.
- Treating amortized as worst-case. A vector push is O(1) amortized but O(n) on the resize. In a real-time loop, the resize is a 16 ms hitch nobody profiled. Pre-reserve when you know the size.
- Ignoring constants on small n. An O(nยฒ) algorithm with constant 1 beats an O(n log n) algorithm with constant 1000 below n โ 100. If the data structure is small and the operation is hot, the asymptotically-better algorithm is the wrong choice.
- Vector copy in a hot loop. Passing a
std::vectorby value copies it: O(n) per call. The function signature is innocuous, the cost is not. Pass by const-reference unless you actually need the copy. - Recursive complexity miscounted. A recursion that branches twice per level isn't O(n); it's O(2depth). Fibonacci-style naive recursion is the canonical bug; memoization fixes it.
20What's next
The natural follow-ups are the algorithms whose complexity this tutorial only sketched:
- Sorting and searching deep-dives. Why introsort, why radix, why pdqsort, why timsort. The O(n log n) family is more crowded than it looks.
- The cache-oblivious model. Big O over cache transfers instead of CPU operations. Frigo and Leiserson's O(N/B ยท logM/B(N/B)) bound for cache-oblivious sort is the right thing to think about for large-data engine work[13].
- Parallel algorithms. The work-span model (PRAM, Cilk). The job systems tutorial introduces the runtime; the complexity model for analyzing parallel algorithms is a tutorial of its own.
- NP-hard heuristics. If a problem is NP-hard, exact solutions are out; approximation algorithms (PTAS, FPTAS), local search, and learned heuristics are the practical answers. AI pathfinding past A* and game-AI literature live here.
- Lower bounds. When you've shown an algorithm runs in O(f(n)), the next question is whether any algorithm can do better. Information-theoretic lower bounds, comparison-tree lower bounds, and circuit-complexity lower bounds answer this and are the hardest part of the field.
For the engine programmer, the most valuable next step is empirical: profile a real engine, find every O(nยฒ) hidden in it, and replace it. The bugs are there.
Sources
- Matt Pharr, Wenzel Jakob, Greg Humphreys. Physically Based Rendering: From Theory to Implementation, 3rd ed., 2018, ยง4.3 "Bounding Volume Hierarchies." pbr-book.org. Source for the BVH-traversal expected O(log n) bound and the surface-area heuristic build.
- Christer Ericson. "Order your graphics draw calls around!" realtimecollisiondetection.net, 2008. realtimecollisiondetection.net. The standard reference for bit-packed draw-call sort keys and the radix-sort pattern.
- Erik D. Demaine, Susan Hohenberger, David Liben-Nowell. "Tetris is Hard, Even to Approximate." COCOON 2003. arXiv:cs/0210020. Original NP-completeness proof for generalized Tetris.
- Jonathan D. Cohen, Ming C. Lin, Dinesh Manocha, Madhav K. Ponamgi. "I-COLLIDE: An Interactive and Exact Collision Detection System for Large-Scale Environments." Proc. ACM Interactive 3D Graphics, 1995. The original sweep-and-prune (sort-and-sweep) paper, with the O(n + k) analysis and worst-case O(nยฒ) note.
- Scott Le Grand. "Broad-Phase Collision Detection with CUDA." GPU Gems 3, ch. 32, NVIDIA, 2007. developer.nvidia.com. Spatial-hash analysis for the GPU broad phase, including sub-binning for dense scenes.
- Erica N. Walker. "Math Origins: Orders of Growth." Mathematical Association of America Convergence, 2017. maa.org. Historical survey covering du Bois-Reymond 1870, Bachmann 1894, and Landau 1909.
- Paul Bachmann. Die analytische Zahlentheorie, vol. II, Teubner, 1894. archive.org scan. First published use of the O symbol; "O" stands for "Ordnung."
- Edmund Landau. Handbuch der Lehre von der Verteilung der Primzahlen, Teubner, 1909. archive.org scan. Introduces the little-o notation; the source of "Landau symbols."
- G. H. Hardy, J. E. Littlewood. "Some Problems of Diophantine Approximation: Part II. The Trigonometrical Series Associated with the Elliptic ฮธ-Functions." Acta Mathematica 37, 1914. Introduces the original ฮฉ notation as a non-vanishing-remainder marker; predates Knuth's stricter algorithmic ฮฉ.
- Donald E. Knuth. "Big Omicron and Big Omega and Big Theta." ACM SIGACT News 8(2):18โ24, 1976. dl.acm.org. The standardization paper for algorithm analysis: defines Big O, Big ฮฉ, Big ฮ as used today.
- Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms, 4th ed., MIT Press, 2022, ch. 3โ4. The textbook reference for asymptotic notation, the master theorem, and the comparison-sort lower bound.
- David R. Musser. "Introspective Sorting and Selection Algorithms." Software: Practice and Experience 27(8):983โ993, 1997. PDF. The introsort design used by libstdc++, libc++, and MSVC's std::sort.
- Matteo Frigo, Charles E. Leiserson, Harald Prokop, Sridhar Ramachandran. "Cache-Oblivious Algorithms." Proc. 40th IEEE FOCS, 1999. PDF. The original cache-oblivious paper; sets up the complexity model that counts memory transfers.
- Richard J. Lipton, Kenneth W. Regan. "Galactic Algorithms." Gรถdel's Lost Letter and P=NP, 23 October 2010. rjlipton.com. The post that introduced the term, credited to Regan, for asymptotically-superior algorithms that are uncompetitive at any realistically-computable input size. Coppersmith-Winograd is the standard example.
- Peter Norvig. "Teach Yourself Programming in Ten Years," "Approximate timing for various operations" table. norvig.com/21-days.html. Memory-hierarchy latency ballpark numbers (L1 ~0.5 ns, L2 ~7 ns, DRAM ~100 ns).
- Agner Fog. "The microarchitecture of Intel, AMD and VIA CPUs," chapter on branch prediction. PDF. Source for the 15โ20 cycle branch-misprediction penalty on modern x86 cores.
- Robert E. Tarjan. "Amortized Computational Complexity." SIAM Journal on Algebraic Discrete Methods 6(2):306โ318, 1985. Introduces the formal amortized-analysis framework; covers the O(ฮฑ(n)) bound for union-find.
- Virginia Vassilevska Williams, Yinzhan Xu, Zixuan Xu, Renfei Zhou. "New Bounds for Matrix Multiplication: from Alpha to Omega." Proc. SODA 2024 (arXiv:2307.07970); result ฯ < 2.371552. Subsequently improved by Josh Alman, Ran Duan, and the same group in "More Asymmetry Yields Faster Matrix Multiplication," Proc. SODA 2025 (arXiv:2404.16349), to ฯ < 2.371339. Currently the best published exponents. arXiv 2307.07970, arXiv 2404.16349.
- NVIDIA. "NVIDIA Turing GPU Architecture Whitepaper," 2018, "RT Core" section. PDF. Hardware-accelerated BVH traversal in fixed-function silicon.
- Peter E. Hart, Nils J. Nilsson, Bertram Raphael. "A Formal Basis for the Heuristic Determination of Minimum Cost Paths." IEEE Transactions on Systems Science and Cybernetics 4(2):100โ107, 1968. Original A* paper; the O(bd) worst-case bound and the optimality result under admissible heuristics.
- Bevy Engine. "Use radix sort for sort phase and sprite sorting." GitHub issue #4291, 2022. github.com. Proposal and benchmarks comparing radix versus comparison sort for the render-phase sort in an open-source engine. (At the time of writing, Bevy uses radix sort only for sprite picking, via the
radsortcrate; the main draw-call sort is stillstd::sort's equivalent.) - Stephen A. Cook. "The Complexity of Theorem-Proving Procedures." Proc. ACM STOC, 1971, pp. 151โ158. dl.acm.org. The original NP-completeness theorem; SAT is NP-complete. Leonid Levin proved an equivalent result independently in "Universal Sequential Search Problems," Problemy Peredachi Informatsii 9(3):115โ116, 1973 (English translation: Problems of Information Transmission 9:265โ266), hence the joint Cook-Levin name.
- Joseph Culberson. "Sokoban is PSPACE-complete." Technical Report, University of Alberta, 1997. webdocs.cs.ualberta.ca. PSPACE-completeness proof for generalized Sokoban.
- Greg Aloupis, Erik D. Demaine, Alan Guo, Giovanni Viglietta. "Classic Nintendo Games are (Computationally) Hard." Theoretical Computer Science 586:135โ160, 2015. NP-hardness proofs for generalized Super Mario Bros, Donkey Kong, Legend of Zelda, Metroid, and Pokรฉmon.
- Richard Kaye. "Minesweeper is NP-Complete." The Mathematical Intelligencer 22(2):9โ15, 2000. NP-completeness proof for the Minesweeper Consistency Problem.
- Takayuki Yato, Takahiro Seta. "Complexity and Completeness of Finding Another Solution and Its Application to Puzzles." IEICE Transactions E86-A(5):1052โ1060, 2003. NP-completeness proof for generalized nยฒรnยฒ Sudoku.
- Wilhelm Ackermann. "Zum Hilbertschen Aufbau der reellen Zahlen." Mathematische Annalen 99:118โ133, 1928. Original Ackermann function (later simplified by Pรฉter and Robinson to the two-argument form used today).
- Epic Games. "TMap Reference," Unreal Engine 5 documentation. dev.epicgames.com. Documentation of Unreal's hash-map data structure and its design choices.
- Matthias Teschner, Bruno Heidelberger, Matthias Mรผller, Danat Pomeranets, Markus Gross. "Optimized Spatial Hashing for Collision Detection of Deformable Objects." Proc. Vision, Modeling, Visualization (VMV) 2003. PDF. Cell-size analysis for uniform spatial hashing applied to collision detection; the canonical reference on the tradeoff between query cost and per-cell occupancy.