Tutorial 06 · Engine Programming

Big O Notation

A hundred objects in the test scene, ten thousand in the shipping one. The build runs at 60 fps on the laptop and 4 fps on the target box. The animation that took 1 µs on day one takes 18 ms once the encounter is populated. Big O is the language game programmers use to predict that, before the QA report comes in. We start with the 1894 definition, work up through the cousin notations, the master theorem, and four worked engine case studies (broad-phase collision, BVH ray traversal, A*, draw-call sorting), and run every step as a live demo in your browser.

Time~55 min LevelJunior to mid; review for senior PrereqsYou can read C++ or pseudocode at intermediate level. You've heard of arrays, hash tables, and trees. HardwareNone, but knowing that CPUs have caches helps in §6 and §13.

01Why a notation, not a stopwatch

A naive collision check on the entities in a scene is two nested loops: for every pair of entities, test whether they overlap. On the four-object test scene that's six comparisons; the function takes 200 nanoseconds and nobody profiles it. On the eight-hundred-object shipping scene that's roughly 320,000 comparisons, the function takes 6 milliseconds, the physics step misses its frame budget, and the player's first encounter ships at 17 fps. A stopwatch on day one tells you nothing about day three hundred. A notation that says this function does work proportional to n² does, because the multiplier the player will pay was visible the moment the function was written.

is that language. It strips a function down to its dominant growth term and throws away everything else (constants, lower-order terms, the specific hardware). What's left is the shape of the curve, which is the thing that determines whether a system survives the jump from test scene to shipping content. The shape doesn't care about your CPU. It cares about how the cost grows when the workload does.

What you'll have by the end

A working understanding of asymptotic notation strong enough to read a published algorithm description and predict its runtime curve; the discipline to spot accidental quadratic behavior in a code review; the math for the four most common divide-and-conquer recurrences via the master theorem; a feel for when Big O lies (cache effects, constant factors); and four worked engine case studies (broad-phase collision, BVH ray traversal, A*, draw-call sorting) where the asymptotic choice is the design choice. By the end you'll know why The Last of Us Part II's spatial hash beats sweep-and-prune on dense crowds, why ray-traced reflections are tractable at all^[1], why Unreal sorts draw calls with a radix pass^[2], and why Tetris is NP-complete^[3].

The shape that ate the encounter

The widget below plots three implementations of "find all pairs of objects that might be touching": brute force at O(n²), sweep-and-prune at roughly O(n log n), and a uniform spatial hash at O(n + k). Drag the object-count slider:

scene size 800

cell density 2.0

brute (O(n²))

···

sweep+prune

···

spatial hash

···

02A short history of writing down growth rates

Big O was not invented for computer science. It was invented for analytic number theory, and the people who pinned down its modern shape did so over the course of about 80 years. A tour, because the current usage only makes sense in context:

1870

Paul du Bois-Reymond defines an order-of-growth relation.^[6] The first formal notation for comparing the growth of two functions. The notation didn't survive, but the idea did: it's the same comparison Big O makes, with different ink.

1894

Paul Bachmann introduces the O symbol.^[7] Volume II of his Analytische Zahlentheorie writes "f(x) = O(g(x))" with O standing for Ordnung ("order"). This is the first appearance of the notation that survived. Bachmann uses it to bound error terms in prime-counting formulas.

1909

Edmund Landau adds little-o.^[8] Landau's Handbuch der Lehre von der Verteilung der Primzahlen ("Handbook of the Theory of the Distribution of Prime Numbers") adopts Bachmann's big-O and introduces the lower-case o for strict asymptotic dominance. The pair is still called the Landau symbols in the math literature.

1914

Hardy and Littlewood extend the family.^[9] G.H. Hardy and J.E. Littlewood introduce big-Ω as a number-theoretic lower bound (a non-vanishing remainder, "f is not little-o of g"). The symbol is borrowed later by computer science with a different and stricter definition.

1976

Knuth standardizes Big O, Big Ω, and Big Θ for algorithm analysis.^[10] Donald Knuth's "Big Omicron and Big Omega and Big Theta" in SIGACT News formalizes the trio with the definitions algorithms researchers use today. Knuth wrote it because authors were misusing O() for lower bounds (saying things like "sorting requires O(n log n) comparisons") and he wanted a notation that wouldn't bury the distinction. The modern Ω is stricter than Hardy and Littlewood's: it requires a uniform lower bound, not just an infinitely-often one.

1990

CLRS canonizes the asymptotic toolkit for teaching.^[11] The first edition of Cormen, Leiserson, Rivest (and later Stein) puts asymptotic notation in chapter 3 of the textbook that becomes the global default. Two generations of undergraduates learn Big O from this presentation.

1997

Musser publishes introsort.^[12] David Musser's "Introspective Sorting and Selection Algorithms" introduces the hybrid that runs quicksort but switches to heapsort when recursion depth exceeds 2 log₂ n. This is the moment Big-O's "worst case versus expected case" distinction becomes a shipping engineering decision: introsort guarantees O(n log n) worst case while keeping quicksort's empirical speed on the common inputs.

1999

Frigo, Leiserson, Prokop, Ramachandran formalize cache-obliviousness.^[13] The paper shows that asymptotic counts of CPU operations are an incomplete model on machines with deep memory hierarchies, and that algorithms with the same Big O can differ by orders of magnitude depending on how they interact with the cache. The first widely-cited statement that Big O over memory accesses is sometimes the bound that matters, not Big O over arithmetic operations.

2010s

The "galactic algorithm" idea enters common usage.^[14] Lipton and Regan popularize the term for asymptotically-superior algorithms whose constants are so vast they would never beat the simpler approach on any realistically-sized input. Coppersmith-Winograd matrix multiplication at O(n^2.373) is the canonical example; in practice every shipping engine uses an O(n³) cache-blocked GEMM.

2026

Asymptotics live alongside cache, branch, and parallelism cost models. Modern engine work treats Big O as the first filter (catches a quadratic before it ships) and then layers cache-line, branch-prediction, and SIMD reasoning on top of the survivors. The single-number complexity is still load-bearing for design choices; it is no longer the whole story.

The recurring shape: a notation invented in 1894 to bound error terms in prime-counting formulas, sharpened in 1976 to handle algorithmic lower bounds, and extended in 1999 to model the memory hierarchy that real machines actually have. The version you'll use is Knuth's; the version that catches every bug is the one with cache-aware constants stapled on.

03The definition, picked apart

Big O is defined for functions from non-negative integers (or reals) to non-negative reals. The formal statement, the way Knuth wrote it^[10] and CLRS teaches it^[11]:

eq. 1 \cdot big-O f(n) = O (g(n)) \Leftrightarrow \exists c > 0, n₀ \geq 0 : \forall n \geq n₀, 0 \leq f(n) \leq c \cdot g(n)

Read aloud: "f(n) is O(g(n)) if there exist constants c and n₀ such that for every n at least n₀, f(n) is sandwiched between zero and c times g(n)." The c is what lets you throw away constant factors; the n₀ is what lets you throw away lower-order terms. Both are existential, which means the definition is satisfied as soon as you can find any pair that works.

The three things this definition does not say are at least as important as what it does say:

It is not an equation. The "=" in "f = O(g)" is read "is" or "is in," not "equals." Some authors write f ∈ O(g) to make the set membership explicit, and that's strictly more correct. The "=" is convention because that's how Bachmann wrote it^[7].
It is one-sided. Big O gives an upper bound. f(n) = O(n²) is true if f grows like n², but also if f grows like n, or n log n, or any slower function. Saying "linear search is O(n²)" is technically correct and almost always misleading.
It is asymptotic. Big O makes no claim about small n. An algorithm that is O(n) can be slower than one that is O(n²) on every input you ever feed it, as long as the latter's n² is multiplied by a small enough constant. The classic example: insertion sort beats quicksort below n ≈ 16, which is exactly why every production sort cuts over to insertion sort at small sizes^[12].

Drag the widget below. The orange curve is f(n); the violet curve is c · g(n) for the chosen c. The vertical line is n₀. Big O holds when the orange curve stays under the violet one for every n to the right of the line:

multiplier c 4.0

threshold n₀ 10

choice of g(n)

f(n)

3n² + 12n + 50

status

···

For f(n) = 3n² + 12n + 50, the bound f = O(n²) holds with c = 4, n₀ = 16: f(16) = 3·256 + 192 + 50 = 1010 and 4·16² = 1024, so f(16) ≤ 4·g(16), and the inequality survives for every larger n. Push the slider down to n₀ = 13 (still at c = 4) and the bound fails: f(13) = 713 beats 4·169 = 676. Either bump c up or n₀ back up to recover. With g = n, no constant c works asymptotically, because the n² term eventually beats any linear cap. Toggle to Ω or Θ to test the lower-bound and tight-bound definitions from §4.

Why is n₀ existentially quantified?

The point of n₀ is to let you ignore noise at small n. A function might do weird things for the first ten or hundred inputs (a startup cost, a one-time table fill, a cache that hasn't warmed up) and then settle into its real growth shape. Big O cares about the real growth shape, not the noise.

Choosing n₀ = 0 is a stronger claim ("the bound holds for every input"); choosing n₀ = 10⁶ is a weaker one ("the bound holds for every input I care about"). The definition says you only need to find any n₀, which is the right tradeoff for ignoring small-input artifacts without throwing away the bound.

04Omega, Theta, and the rest of the family

Big O is an upper bound. The full asymptotic toolkit has four cousins. Each is one inequality flip away from the others:

Name	Symbol	Meaning	Definition	Reads as
Big O	O(g)	Upper bound, not tight	∃c, n₀ ∀n≥n₀: f ≤ c·g	"grows no faster than"
Big Omega	Ω(g)	Lower bound, not tight	∃c, n₀ ∀n≥n₀: f ≥ c·g	"grows no slower than"
Big Theta	Θ(g)	Tight bound, same growth	f = O(g) and f = Ω(g)	"grows exactly like"
Little o	o(g)	Strictly slower	∀c > 0 ∃n₀ ∀n≥n₀: f < c·g	"grows strictly slower than"
Little omega	ω(g)	Strictly faster	∀c > 0 ∃n₀ ∀n≥n₀: f > c·g	"grows strictly faster than"

The flip from "big" to "little" is the flip from existential to universal on the constant: big-O finds one c that bounds f above; little-o requires that every positive c eventually bounds f above. That's a much stronger statement: f can't just be no-bigger-than g by some constant, it has to grow strictly slower.

eq. 2 \cdot concrete examples 3n² + 5n = O(n²), Θ(n²), Ω(n), o(n³), ω(n)

The same function fits five bounds. Θ(n²) is the tightest description (matching upper and lower); the others are looser. In a tutorial, saying "merge sort is Θ(n log n)" is more informative than "merge sort is O(n log n)" because Θ rules out the merge sort that secretly runs in O(n) (it doesn't), while O leaves that possibility open.

What about little-o?

Little-o is what you reach for when you want to claim that one function is fundamentally faster-growing than another, not just that one happens to be bigger. The statement 1/n = o(1) says 1/n shrinks to zero, which is the asymptotic version of "1/n vanishes." The statement n log n = o(n²) says no constant multiple of n log n can keep up with n²: the gap between them grows without bound.

Knuth's 1976 paper makes the distinction explicit and warns against the common abuse of writing O(n) when Θ(n) is meant^[10]. The abuse is unfortunately so widespread that "an O(n log n) sort" almost always means "a Θ(n log n) sort" in practice. Read all O() in this sense unless the author explicitly distinguishes.

A practical rule of thumb

Algorithm analysis usually wants Θ (matching upper and lower bounds) for the algorithm you wrote, and Ω for the problem you're solving (the best possible algorithm). "Merge sort is O(n log n)" and "comparison-based sorting is Ω(n log n)" combine to say "merge sort is asymptotically optimal among comparison sorts," which is the actual claim.

05The growth zoo, racing

About eight complexity classes show up in 95% of working algorithms. Memorize their shapes and you can read most papers without doing the algebra:

Class	Name	Where you'll see it	At n=10⁶ (1 ns/op)
O(1)	Constant	Hash table lookup (average), array index, push to a stack	1 ns
O(log n)	Logarithmic	Binary search, balanced BST operation, BVH ray traversal level	20 ns
O(√n)	Square-root	Sqrt decomposition; tile size in cache-oblivious algorithms	1 µs
O(n)	Linear	Linear search, single pass over an array, building a histogram	1 ms
O(n log n)	Linearithmic	Comparison-based sort, FFT, sweep-and-prune broad phase	20 ms
O(n²)	Quadratic	Nested-loop pair test, bubble sort, naive matrix-vector product	17 min
O(n³)	Cubic	Naive matrix multiply, Floyd-Warshall, all-pairs shortest path	32 years
O(2ⁿ)	Exponential	Brute-force SAT, subset enumeration, game-tree search at full depth	≫ age of universe
O(n!)	Factorial	Brute-force TSP, permutation enumeration	≫≫ age of universe

The "at n=10⁶" column makes the spacing concrete: ten orders of magnitude separate O(n) from O(n²), and another seventeen separate O(n²) from O(n³). These are the gaps that distinguish a frame budget from a freeze, and a freeze from "won't terminate before the game studio goes bankrupt."

The widget below races them on a log-log plot. On log-log axes a polynomial n^k is a straight line with slope k, which is why log-log is the standard way to compare growth rates by eye:

per-operation time 1 ns

scan to n = 1000

O(1) O(log n) O(n) O(n log n) O(n²) O(n³) O(2ⁿ)

06When constants lie

Big O drops constants on purpose: a notation that cared about constants couldn't compare implementations across hardware, compilers, or decades. The price is that two algorithms with the same Big O can differ by orders of magnitude in real-world cost, and an algorithm with worse Big O can beat one with better Big O on inputs you'll actually see.

The textbook example is 100n versus n² / 100. The first is O(n); the second is O(n²). By any asymptotic measure, the linear algorithm wins. Solve for the crossover:

eq. 3 \cdot crossover point 100 n = n 2 / 100 \Rightarrow n = 10,000

Below n = 10,000 the "worse" algorithm is genuinely faster. If you ship a system that never exceeds 10,000 elements, picking the quadratic implementation is the correct engineering decision. The asymptotic ranking is informative, not authoritative.

linear constant a (cost = a · n) 100

quadratic constant b (cost = n² / b) 100

crossover n

···

linear wins if

···

What constants stand for in practice

An algorithm's hidden constant absorbs all the per-operation work the abstraction throws away. Three things commonly drive a small Big O into an unshippable wall:

Cache misses. An L1 hit is ~4 cycles; an L3 hit is ~30; a main-memory hit is ~200^[15]. A linked-list traversal that visits n nodes scattered across memory is asymptotically O(n), but it's also 50× slower than a contiguous O(n) array traversal. Both are O(n); only one is shippable in a hot loop.
Branch mispredictions. A modern CPU pipelines 15–20 instructions ahead; a mispredicted branch flushes that pipeline at a cost of 15–20 cycles. An algorithm with the same Big O as a branch-free alternative but a data-dependent branch can lose by a factor of 5×^[16]. Most production sort fallbacks (insertion sort at small n, sorting networks at very small n) are about branch prediction, not asymptotic complexity.
Memory allocation. An O(n) algorithm that mallocs once per element on a server-grade allocator pays a few hundred nanoseconds per call; the same algorithm with a bump allocator pays a few. The Big O is the same. The shipping budget is not.

Frigo et al.'s 1999 cache-oblivious paper^[13] formalized the first of these into a parallel complexity model that counts cache transfers instead of CPU operations. Sometimes the asymptotic count over cache transfers is the bound that matters; the CPU-op count is constant-factor noise. The takeaway is not that Big O is wrong, it's that Big O is one of two or three complexity functions a serious analysis needs.

Constants live forever

An algorithm with hidden constant 10⁶ is asymptotically the same as one with hidden constant 1, but on inputs of size n ≤ 10⁶ the second is up to a million times faster. The Coppersmith-Winograd matrix-multiply algorithm has a famous constant in the trillions; no shipping linear-algebra library uses it^[14]. The Big O is an admission of ignorance about the constant, not a claim that the constant is small.

07Worst, best, average, amortized

An algorithm's runtime is a function of its input. Different inputs produce different runtimes. Big O over an algorithm has to specify which inputs you mean:

Worst case. The runtime on the slowest input of size n. The default. When the literature says "quicksort is O(n²)," this is the case meant.
Best case. The runtime on the fastest input. Rarely interesting; insertion sort's best case is O(n) (sorted input), which is occasionally relevant.
Average case. The expected runtime over some distribution of inputs. Quicksort's average is O(n log n) under a uniform-random input distribution. Useful but harder to claim cleanly: the answer depends on what you assume about the input.
Amortized. The average runtime per operation over a long sequence. std::vector::push_back's amortized cost is O(1) even though individual pushes can cost O(n) when the vector resizes.

Quicksort: the canonical worst-versus-average gap

Quicksort partitions the array around a pivot, recurses on the two halves, and concatenates. On a random pivot it halves the array on average, giving the recurrence T(n) = 2T(n/2) + Θ(n), which the master theorem (§8) solves to Θ(n log n). On an adversarial pivot (always-smallest or always-largest), it produces partitions of sizes 0 and n−1, giving T(n) = T(n−1) + Θ(n), which sums to Θ(n²).

A sorted-or-reverse-sorted input plus a naive first-element pivot triggers the worst case on every recursion. The widget runs quicksort on three input patterns:

array size 200

comparisons at n

···

vs n log n

···

Amortized: std::vector::push_back

A dynamic array doubles its capacity when full. Most pushes are O(1) (write to the next slot). The pushes that trigger a resize are O(n) (copy every element to a bigger buffer). Worst-case per push is O(n); that's not interesting because it's misleading. Amortized per push, summed over a sequence and divided by length, is O(1).

The proof, from Tarjan's 1985 paper that formalized the technique^[17]: n pushes do at most n single-slot writes plus a sequence of resize-copies of sizes 1, 2, 4, 8, ..., 2^{⌈log n⌉}. That sum is bounded by 2n. Total work is at most 3n; per-push amortized cost is at most 3 = O(1).

pushes 128

growth factor 2.0

peak push cost

···

amortized average

···

08The Master Theorem

Most divide-and-conquer algorithms produce a recurrence of the form:

eq. 4 \cdot the standard form T(n) = a \cdot T(n / b) + f(n)

"To solve a problem of size n, do a recursive subsolves each of size n/b, plus f(n) of additional work to combine them." Merge sort is a = 2, b = 2, f(n) = Θ(n). Binary BVH construction with sweep partitioning is similar.

The Master Theorem^[11] classifies these recurrences in three cases, depending on whether the recursive work or the combine work dominates. Let c_crit = log_b a:

Case 1: leaves dominate. If f(n) = O(n^c) with c < c_crit, then T(n) = Θ(n^c_crit). The recursion tree has so many leaves that they outweigh the combine work.
Case 2: every level costs the same. If f(n) = Θ(n^c_crit), then T(n) = Θ(n^c_crit log n). Each of the log_b n levels does the same amount of work. Merge sort lives here: c_crit = 1, f(n) = Θ(n), total Θ(n log n).
Case 3: root dominates. If f(n) = Ω(n^c) with c > c_crit, and a regularity condition holds, then T(n) = Θ(f(n)). The top-level combine work is so expensive that the recursion contributes nothing significant.

Worked examples

Three standard divide-and-conquer recurrences and what the theorem says about them:

Algorithm	Recurrence	c_crit	Case	Result
Binary search	T(n) = T(n/2) + O(1)	0	Case 2 (c=0)	Θ(log n)
Merge sort	T(n) = 2T(n/2) + O(n)	1	Case 2	Θ(n log n)
Karatsuba multiplication	T(n) = 3T(n/2) + O(n)	log₂3 ≈ 1.585	Case 1	Θ(n^log₂3) ≈ Θ(n^1.585)
Strassen matrix multiply	T(n) = 7T(n/2) + O(n²)	log₂7 ≈ 2.807	Case 1	Θ(n^log₂7) ≈ Θ(n^2.807)
Naive matrix multiply (recursive)	T(n) = 8T(n/2) + O(n²)	3	Case 1	Θ(n³)
BVH top-down (binned SAH)	T(n) = 2T(n/2) + O(n)	1	Case 2	Θ(n log n)

The pattern: how many subproblems versus how aggressively you shrink them is the lever. Strassen's algorithm beats the naive cubic matrix multiply because it trades one recursive call for a constant amount of extra combine work; that one fewer recursion drops the exponent from 3 to log₂ 7. The same insight scales: the long arc of matrix-multiply progress (Strassen 1969, Coppersmith-Winograd 1990, the modern improvements down to O(n^2.371552))^[18] is a sequence of finer subdivisions that drop a further. Only Strassen is practical; the rest are galactic (§14).

The cases that aren't in the Master Theorem

The classical Master Theorem doesn't cover every recurrence. Three gaps:

Logs in f(n). If f(n) = Θ(n log^k n), the original theorem doesn't apply. The Akra-Bazzi method extends it.
Uneven splits. T(n) = T(n/3) + T(2n/3) + O(n) (median-of-medians selection variants) requires Akra-Bazzi or direct analysis.
Subtractive recurrences. T(n) = T(n-1) + O(n) (quicksort worst case, naive recursion) isn't divide-and-conquer; it's "shrink by 1." The answer is the sum Θ(n²), found by direct summation.

For the recurrences this tutorial cares about, the three-case Master Theorem above is sufficient. The Akra-Bazzi version is in CLRS chapter 4^[11].

09Measuring growth empirically

Sometimes the algorithm is too complex to analyze by hand and you have to measure. The classical trick: run it at sizes n, 2n, 4n, 8n, plot on log-log axes, and read the slope.

On log-log axes, T(n) = c · n^k becomes log T = log c + k · log n, which is a straight line with slope k and intercept log c. Eyeballing the slope is a coarse but reliable diagnostic: a slope of 1 means O(n), a slope of 2 means O(n²), anything between 1 and 2 is suspicious (could be O(n log n), could be O(n^1.5), hard to tell from slope alone).

A more discriminating test: the doubling ratio. Measure T(n) and T(2n); the ratio T(2n) / T(n) diagnoses the class:

Doubling ratio	Diagnosis	Slope on log-log
~1	O(1)	0
~1 + 1/log n	O(log n)	tends to 0
~2	O(n)	1
~2.1–2.3	O(n log n)	slightly above 1
~4	O(n²)	2
~8	O(n³)	3

The widget below shows a measured running time and lets you guess the complexity. The data has noise (cache effects, GC pauses, scheduler hiccups) so the diagnosis isn't always exact at small n; that's the realistic case.

slope (log-log fit)

···

doubling ratio

···

verdict

···

10Case study: broad-phase collision

The classic engine-side O(n²) trap. Physics simulations need to find every pair of objects whose bounding volumes overlap. A naive double-loop checks n(n−1)/2 pairs; at n = 1000 that's 500,000 pairs every frame, at 60 fps that's 30 million pair tests per second, and a pair test is several SIMD ops, so the broad phase alone budgets out the rest of physics on a single core.

Three standard fixes, with different complexity in different regimes:

Sweep-and-prune (sort and sweep)

Sort the bounding-box endpoints along one axis (say X). Two boxes can only overlap on the X axis if neither's max-X is before the other's min-X. Sweep through the sorted endpoints with an active list; when you hit a min-X, add the box to the active list; when you hit a max-X, remove it. Every pair of boxes simultaneously active is a candidate. Repeat on Y and Z; take the intersection.

The first frame's cost is dominated by the initial sort at O(n log n); the sweep itself is O(n + k) where k is the number of active-pair encounters. In subsequent frames the endpoint lists are nearly sorted (objects move a little, not a lot, per step), and insertion sort on a nearly-sorted list runs in O(n + s) where s is the number of out-of-order pair swaps. In steady state that puts per-frame cost at O(n + k + s), which is close to linear when both k and s are O(n). The pathological case (every box overlaps every other) blows k up to O(n²) and you're back to the naive cost. The Cohen et al. 1995 I-COLLIDE paper introduced the temporal-coherence trick^[4].

Uniform spatial hash

Partition space into a regular grid. Hash each object's AABB into the cells it touches. Two objects can only collide if they share a cell. Build cost is O(n); query cost is the sum over cells of the per-cell pair-test count.

If average density is d objects per cell and the cell size is at least as large as the biggest object's AABB (so each object touches O(1) cells), the per-cell pair-test count averages d²/2, and total cost is O(n · d), which collapses to O(n) when d is bounded. The catch: d is bounded only when objects are spread out. A pile of crates on the same tile reduces to brute force inside that tile. NVIDIA GPU Gems 3 describes a GPU spatial-hash implementation that handles dense scenes by sub-binning^[5].

BVH for moving objects

Build a bounding volume hierarchy over the objects' AABBs and query each object against the tree. Construction is O(n log n); per-query is O(log n) in well-spread scenes; total is O(n log n). Bullet ships btDbvtBroadphase (dynamic AABB tree) and Box2D ships b2DynamicTree as the default broad phase for exactly this reason; PhysX defaults to sweep-and-prune but supports a BVH alternative for sparse scenes^[1].

The widget runs all three on the same simulated scene. At low object counts brute force wins on absolute time because its constant is tiny; at high counts it loses by an enormous margin:

objects 120

grid cell size 50

brute pairs/frame

···

SAP pairs/frame

···

hash pairs/frame

···

"Pairs/frame" is the number of close-proximity checks each algorithm queues for the narrow phase. The simulation runs in real wall-clock time, not collapsed into one paint, so the pacing matches the algorithm's actual cost shape. Note that very small grid cells push the hash toward brute force (because every cell has at most one object and the hash overhead dominates); very large cells push it toward brute force on the inside of each cell. There's a sweet spot near "cell size ≈ object diameter."

11Case study: BVH ray traversal

Ray tracing a scene of n triangles by brute force means testing the ray against every triangle: O(n) per ray, O(n · pixels) per image. At n = 10⁷ triangles and a 1080p image that's 2×10¹³ ray-triangle tests per frame, which is several orders of magnitude past tractable.

A reduces this to O(log n) expected per ray on well-built scenes. The trick: organize triangles into a binary tree where each node's AABB encloses every triangle in its subtree. To intersect a ray with the scene, test the ray against the root's AABB; if it misses, the whole subtree is irrelevant. If it hits, recurse on both children. Continue until reaching a leaf, where you test the ray against the small bucket of triangles inside.

The Physically Based Rendering textbook lays out the math: with a balanced BVH and the surface-area heuristic for the build, expected per-ray cost is logarithmic in n^[1]. The worst case is still O(n) (a degenerate ray that hits every leaf), but well-built scenes hold the log bound on every ray that matters.

scene objects 80

rays per click 1

brute tests

···

BVH tests

···

speedup

···

The BVH is built with a median split on the longer axis at each level. Brute force always tests n objects; BVH tests the AABB of every node on the ray's path plus the leaf objects in the leaves it descends into. For a well-spread scene, the descended-into count is log₂(n) bounded by a small constant. Rays that traverse a lot of empty space see the biggest speedups; rays that graze many objects see less. NVIDIA's RT cores fix this in hardware by performing the BVH walk in fixed-function silicon^[19].

12Case study: A* and the branching factor

A* searches a graph for the shortest path from start to goal. At each step it picks the frontier node with the lowest estimated total cost f(n) = g(n) + h(n), where g is the known cost so far and h is a heuristic estimate to the goal. With an admissible heuristic, A* is optimal; with no heuristic, it degenerates to Dijkstra; with a perfect heuristic, it walks a straight line to the goal.

The worst-case complexity is O(b^d), where b is the branching factor (how many neighbors a node has) and d is the depth of the solution^[20]. On a 2D grid b = 4 (von Neumann) or b = 8 (Moore neighborhood); on a navmesh b is the average degree of a navmesh polygon, usually 3–6.

The exponential isn't in n (the size of the graph); it's in d (how far the goal is). For a typical 100×100 grid with the goal 50 steps away, naive BFS would visit roughly 4⁵⁰ nodes if it didn't track visited ones; with the visited-set, BFS visits n = 10⁴ nodes. A* with a good heuristic visits something like 5 · 50 = 250 nodes: the visited region is a corridor along the optimal path, not a disk.

The heuristic's quality is the lever. A perfect heuristic (knows the true distance) makes A* visit exactly the optimal path. An admissible-but-loose heuristic (Manhattan distance on a grid with obstacles) makes A* explore a fan along the path. An inadmissible heuristic (overestimates) makes A* fast but the result is no longer optimal:

obstacle density 0.20

nodes expanded

···

path length

···

optimal?

···

Dijkstra is A* with h = 0; it explores every reachable node up to the goal's cost, which is a disk. Manhattan heuristic is admissible on 4-connected grids; A* explores a wedge biased toward the goal. Weighted A* multiplies h by a factor > 1 to push harder toward the goal; faster, but the path is no longer guaranteed optimal. The shape change is the asymptotic improvement: from O(n) for Dijkstra to O(d · b_eff) for A* with a good heuristic, where b_eff is the effective branching factor a heuristic prunes the search to.

13Case study: sorting draw calls

A renderer issues thousands of draw calls per frame. Reordering them to group calls that share a material, mesh, or render target is one of the biggest performance wins available: state changes on the GPU are expensive, and a sort that batches them amortizes the cost^[2].

The standard pattern: every draw call gets a 32- or 64-bit sort key packing render target, shader, material, depth, and so on into bit fields in priority order. Sort the array of keys. Issue the calls in the sorted order. Comparing the sort cost matters because the renderer runs it every frame:

std::sort (introsort). O(n log n) comparison-based. On a uniformly-random array of 64-bit keys, modern x86 implementations run on the order of a few milliseconds per 100,000 elements; partially-sorted input is faster because introsort short-circuits into insertion sort.
Radix sort. O(d · n) where d is the number of digits (typically bytes) in the key. For a 64-bit key with 8-bit digits, d = 8. Cache-aware implementations (the ska_sort variant is a representative benchmark) run roughly several times faster than std::sort on the same input. The speedup grows with n: small at 8k calls, decisive past 100k.

Why isn't every sort a radix sort? Two reasons. First, comparison sorts work on any comparable type; radix sorts work on fixed-width integer keys. Second, naive radix sort scatters writes across buckets and trashes the cache, so the constant is larger than the textbook complexity suggests; the implementations that beat std::sort handle that explicitly. The Bevy engine's issue tracker has a detailed discussion of radix-vs-comparison for the sort phase^[21]; the canonical real-time-collision-detection blog post by Christer Ericson lays out the bit-packing pattern most engines use^[2].

Radix sort being O(n) (treating d as a constant for fixed-width keys) sounds like it violates the Ω(n log n) lower bound for sorting. It doesn't, because the lower bound applies to comparison-based sorts; radix is not comparison-based. The lower bound's proof rests on counting binary-decision-tree leaves^[11]; radix sort doesn't make comparisons, so the proof doesn't apply.

draw_key.hpp · packing a sort key

// A 64-bit draw-call sort key. Field order is priority order:
// the leftmost (most significant) bits drive the coarsest grouping.
// Lay out fields top-down so a single 64-bit integer compare
// produces the desired ordering with no branches.
struct DrawCallKey {
  uint64_t renderTarget   : 2;   // 0=shadow, 1=opaque, 2=transparent, 3=postfx
  uint64_t shaderProgram  : 10;  // 1024 programs is plenty
  uint64_t materialId     : 14;  // 16K unique materials per frame
  uint64_t meshId         : 14;  // 16K unique meshes; rebinding VBO is costly
  uint64_t depth          : 24;  // front-to-back for opaque; reversed for transparent

  // 64-bit compare = whole-struct compare. The bit layout
  // implies the priority. Sorting an array of these is one
  // radix pass over the 8 bytes, or one std::sort over uint64_t.
};
static_assert(sizeof(DrawCallKey) == 8, "key must pack into a uint64");

The bit layout is the asymptotic algorithm: by putting render target in the high bits, every same-target draw is contiguous after sorting; by putting depth in the low bits, ties at higher levels resolve to depth order. The sort itself is incidental. The design is the key layout.

14Galactic algorithms

An algorithm is galactic if its asymptotic complexity is the best known but its hidden constants are so large that it loses to a worse-asymptotic algorithm on every input that fits in the observable universe^[14]. The term was popularized in Lipton and Regan's 2010s essays on algorithm theory.

The flagship example is matrix multiplication. The naive triple-loop algorithm is O(n³). The progression of asymptotic improvements has been steady since the 1960s:

Year	Author(s)	Exponent ω	Practical?
1969	Strassen	log₂7 ≈ 2.807	Yes; production for n ≥ 100
1978	Pan	2.795	No; constants too large
1981	Schönhage	2.522	No
1990	Coppersmith-Winograd	2.376	No; canonical galactic algorithm
2014	Le Gall	2.3728639	No
2024	Vassilevska Williams, Y. Xu, Z. Xu, Zhou^[18]	2.371552	No
2025	Alman, Duan, Williams, Y. Xu, Z. Xu, Zhou^[18]	2.371339	No

Every entry past Strassen is galactic. Estimates of the crossover n at which a Coppersmith-Winograd-style algorithm would beat a cache-blocked cubic GEMM put it well past 10²⁰, more than any concrete matrix that ships in a graphics or scientific application. For shipping graphics, audio, and physics code, GEMM is O(n³); the asymptotic wins are mathematical theorems, not engineering choices.

Galactic algorithms are still worth studying, for two reasons. First, they bound what's possible: every step closer to n² moves the lower-bound theorem closer. Second, the techniques sometimes transfer down: Strassen's algorithm is practical, and it began the chain. If a future technique drops the constant by a factor of n, the resulting algorithm beats O(n³) in cells you can actually rent. The lesson is that "asymptotically faster" is a necessary, not sufficient, condition for "actually faster."

15NP-hard games

Big O can describe problems for which no polynomial-time algorithm is known, and which are believed to require exponential time. The class NP is the set of decision problems whose "yes" answers can be checked in polynomial time given a witness. The class P is the subset whose answers can also be computed in polynomial time. The P vs NP question is whether these are the same class.

A problem is NP-hard if every problem in NP reduces to it in polynomial time; NP-complete if it's both NP-hard and in NP. The Cook-Levin theorem (Cook 1971, Levin 1973 independently) showed that SAT is NP-complete, the first such proof^[22]; thousands of problems have since been shown NP-complete by polynomial reduction from SAT.

Several puzzles are known NP-hard or NP-complete in their generalized forms:

Game	Problem statement	Complexity	Reference
Generalized Tetris	Maximize survival on a known sequence of pieces	NP-complete	^[3]
Sokoban	Push boxes onto target tiles	PSPACE-complete	^[23]
Super Mario Bros	Reach the end of a generalized level	NP-hard	^[24]
Minesweeper consistency	Is a given board consistent?	NP-complete	^[25]
Generalized Sudoku	Solve an n² × n² Sudoku	NP-complete	^[26]

These results apply to generalized versions (Tetris on an n × m board, Sudoku on n² × n²). The fixed-size game you play in your browser is solvable by brute force; the asymptotic claim is about the family. The practical engineering takeaway is that AI for these problems can't be expected to scale: heuristics and learned policies dominate every published agent because exact solutions are out of reach.

Procedural generation hits the same wall in the other direction. Generating a guaranteed-solvable Sokoban level requires solving Sokoban, which is PSPACE-complete^[23]. Generators ship anyway, by using a constrained subset of the rules where the problem becomes tractable, or by checking solvability with a bounded search budget and discarding levels that don't certify within it.

16Beyond polynomial

Past exponential are growth rates so vast they have no operational meaning. The number of atoms in the observable universe is roughly 10⁸⁰; 2ⁿ crosses that threshold at n ≈ 267, and n! crosses it at n ≈ 60. The grows faster than any primitive-recursive function; at very small inputs it produces numbers that can't be written down in conventional notation^[27].

These aren't useful complexities for algorithms ("the algorithm runs in A(n, n) time" is functionally indistinguishable from "the algorithm doesn't terminate"), but they show up as inverse functions in tight upper bounds. The classic example: union-find with union-by-rank and path compression has an amortized cost per operation of α(n), where α is the inverse Ackermann^[17]. For any n that fits in any computer ever built, α(n) ≤ 4: in practical terms, the operation is constant-time. The function is technically not constant because it grows, just incomprehensibly slowly.

n 25

···

2ⁿ

···

n³

···

17Try it yourself

The playground lets you pick a data structure and an operation, and shows the asymptotic cost along with the actual operation animated. Six common structures, four operations each.

data structure

operation

n 32

cost (this run)

···

asymptotic

···

18How engines actually use this

Asymptotic complexity drives most major data-structure decisions in shipped engines, with constant-factor refinements layered on top. Concrete choices:

Unreal Engine's TMap. Backed by TSet, which is a chained hash table: each bucket stores indices into a contiguous element array, and collisions are resolved by walking a per-bucket chain. Average-case O(1) lookup, with the O(n) worst case showing up under pathological hash collisions. Choice of chaining (over open addressing) lets Unreal store elements densely while keeping iteration cheap. Documentation describes the design tradeoffs explicitly^[28].
Unity's NativeParallelMultiHashMap. A Burst-compatible hash table that allows multiple values per key, used in the Entities package for spatial partitioning. Average O(1) insert and lookup, with explicit bucket-count tuning because the Burst compiler doesn't dynamically resize across job boundaries. (Older Collections versions exposed the same type as NativeMultiHashMap; renamed in Collections 2.x.)
Bevy ECS's archetype graph. Adding or removing a component changes an entity's archetype. Bevy maintains an Edges cache on each archetype, mapping each component bundle to the destination archetype, so transitions are O(1) amortized after the first cache miss. Without it, the lookup would be O(a) in the number of archetypes per move.
Unreal's GPU particle sort. Transparent particles are sorted on the GPU via a bitonic merge sort (FParticleSortBuffers / GPUSort.cpp) at O(n log² n) work, parallel across cores. Equivalent CPU sorts at the same scale would be O(n log n) serial but pay a round-trip to upload sorted indices to the GPU, which is the wrong constant for the particle pipeline.
Spatial hash for proximity queries. A uniform grid keyed on world-space cell coordinates, with each cell storing the list of objects whose AABB touches it, gives O(1) amortized insert and O(k + 1) per-query for k objects near the query point. The cell size has to be at least as large as the largest object's AABB to keep object-touches-cells at O(1). Teschner et al. 2003 analyze the cell-size tradeoff for collision detection^[29].

The shared shape: pick the asymptotic class first, then refine for cache and parallelism. An engine that picks O(n²) by accident and then "optimizes" the constant will not catch the bug. An engine that picks O(n log n) and then optimizes the constant ships at 60 fps.

19Pitfalls

The mistakes that occur most often in code review, in roughly decreasing frequency:

Accidental quadratic from string concatenation. Building a string by repeated s = s + part is O(n²) in most languages (each concat copies the whole accumulator). Use a string builder or a join; both are O(n). The same bug appears in tight loops building command buffers, log lines, or shader source.
Linear search in a hot loop. "Find this asset by name" inside an inner loop turns a per-frame cost of O(1) into O(n), which compounds with anything wrapping it. Cache the lookup outside the loop or use a hash table.
O(log n) in a tight loop is not "free." Logarithm time is constant in casual analysis but adds up: a binary search per pixel at 1080p is 2 million log-time operations, 50 ns each, 100 ms per frame. The log shows up in profilers as a flat 8% nobody notices until they look.
Hash table worst case. A hash table is O(1) average and O(n) worst case. If your hash function is bad, the worst case is the actual case. Use cryptographic-quality hashes for adversarial inputs (user-supplied names) and check for collision chains in QA builds.
Treating amortized as worst-case. A vector push is O(1) amortized but O(n) on the resize. In a real-time loop, the resize is a 16 ms hitch nobody profiled. Pre-reserve when you know the size.
Ignoring constants on small n. An O(n²) algorithm with constant 1 beats an O(n log n) algorithm with constant 1000 below n ≈ 100. If the data structure is small and the operation is hot, the asymptotically-better algorithm is the wrong choice.
Vector copy in a hot loop. Passing a std::vector by value copies it: O(n) per call. The function signature is innocuous, the cost is not. Pass by const-reference unless you actually need the copy.
Recursive complexity miscounted. A recursion that branches twice per level isn't O(n); it's O(2^depth). Fibonacci-style naive recursion is the canonical bug; memoization fixes it.

20What's next

The natural follow-ups are the algorithms whose complexity this tutorial only sketched:

Sorting and searching deep-dives. Why introsort, why radix, why pdqsort, why timsort. The O(n log n) family is more crowded than it looks.
The cache-oblivious model. Big O over cache transfers instead of CPU operations. Frigo and Leiserson's O(N/B · log_M/B(N/B)) bound for cache-oblivious sort is the right thing to think about for large-data engine work^[13].
Parallel algorithms. The work-span model (PRAM, Cilk). The job systems tutorial introduces the runtime; the complexity model for analyzing parallel algorithms is a tutorial of its own.
NP-hard heuristics. If a problem is NP-hard, exact solutions are out; approximation algorithms (PTAS, FPTAS), local search, and learned heuristics are the practical answers. AI pathfinding past A* and game-AI literature live here.
Lower bounds. When you've shown an algorithm runs in O(f(n)), the next question is whether any algorithm can do better. Information-theoretic lower bounds, comparison-tree lower bounds, and circuit-complexity lower bounds answer this and are the hardest part of the field.

For the engine programmer, the most valuable next step is empirical: profile a real engine, find every O(n²) hidden in it, and replace it. The bugs are there.

Sources

Matt Pharr, Wenzel Jakob, Greg Humphreys. Physically Based Rendering: From Theory to Implementation, 3rd ed., 2018, §4.3 "Bounding Volume Hierarchies." pbr-book.org. Source for the BVH-traversal expected O(log n) bound and the surface-area heuristic build.
Christer Ericson. "Order your graphics draw calls around!" realtimecollisiondetection.net, 2008. realtimecollisiondetection.net. The standard reference for bit-packed draw-call sort keys and the radix-sort pattern.
Erik D. Demaine, Susan Hohenberger, David Liben-Nowell. "Tetris is Hard, Even to Approximate." COCOON 2003. arXiv:cs/0210020. Original NP-completeness proof for generalized Tetris.
Jonathan D. Cohen, Ming C. Lin, Dinesh Manocha, Madhav K. Ponamgi. "I-COLLIDE: An Interactive and Exact Collision Detection System for Large-Scale Environments." Proc. ACM Interactive 3D Graphics, 1995. The original sweep-and-prune (sort-and-sweep) paper, with the O(n + k) analysis and worst-case O(n²) note.
Scott Le Grand. "Broad-Phase Collision Detection with CUDA." GPU Gems 3, ch. 32, NVIDIA, 2007. developer.nvidia.com. Spatial-hash analysis for the GPU broad phase, including sub-binning for dense scenes.
Erica N. Walker. "Math Origins: Orders of Growth." Mathematical Association of America Convergence, 2017. maa.org. Historical survey covering du Bois-Reymond 1870, Bachmann 1894, and Landau 1909.
Paul Bachmann. Die analytische Zahlentheorie, vol. II, Teubner, 1894. archive.org scan. First published use of the O symbol; "O" stands for "Ordnung."
Edmund Landau. Handbuch der Lehre von der Verteilung der Primzahlen, Teubner, 1909. archive.org scan. Introduces the little-o notation; the source of "Landau symbols."
G. H. Hardy, J. E. Littlewood. "Some Problems of Diophantine Approximation: Part II. The Trigonometrical Series Associated with the Elliptic θ-Functions." Acta Mathematica 37, 1914. Introduces the original Ω notation as a non-vanishing-remainder marker; predates Knuth's stricter algorithmic Ω.
Donald E. Knuth. "Big Omicron and Big Omega and Big Theta." ACM SIGACT News 8(2):18–24, 1976. dl.acm.org. The standardization paper for algorithm analysis: defines Big O, Big Ω, Big Θ as used today.
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms, 4th ed., MIT Press, 2022, ch. 3–4. The textbook reference for asymptotic notation, the master theorem, and the comparison-sort lower bound.
David R. Musser. "Introspective Sorting and Selection Algorithms." Software: Practice and Experience 27(8):983–993, 1997. PDF. The introsort design used by libstdc++, libc++, and MSVC's std::sort.
Matteo Frigo, Charles E. Leiserson, Harald Prokop, Sridhar Ramachandran. "Cache-Oblivious Algorithms." Proc. 40th IEEE FOCS, 1999. PDF. The original cache-oblivious paper; sets up the complexity model that counts memory transfers.
Richard J. Lipton, Kenneth W. Regan. "Galactic Algorithms." Gödel's Lost Letter and P=NP, 23 October 2010. rjlipton.com. The post that introduced the term, credited to Regan, for asymptotically-superior algorithms that are uncompetitive at any realistically-computable input size. Coppersmith-Winograd is the standard example.
Peter Norvig. "Teach Yourself Programming in Ten Years," "Approximate timing for various operations" table. norvig.com/21-days.html. Memory-hierarchy latency ballpark numbers (L1 ~0.5 ns, L2 ~7 ns, DRAM ~100 ns).
Agner Fog. "The microarchitecture of Intel, AMD and VIA CPUs," chapter on branch prediction. PDF. Source for the 15–20 cycle branch-misprediction penalty on modern x86 cores.
Robert E. Tarjan. "Amortized Computational Complexity." SIAM Journal on Algebraic Discrete Methods 6(2):306–318, 1985. Introduces the formal amortized-analysis framework; covers the O(α(n)) bound for union-find.
Virginia Vassilevska Williams, Yinzhan Xu, Zixuan Xu, Renfei Zhou. "New Bounds for Matrix Multiplication: from Alpha to Omega." Proc. SODA 2024 (arXiv:2307.07970); result ω < 2.371552. Subsequently improved by Josh Alman, Ran Duan, and the same group in "More Asymmetry Yields Faster Matrix Multiplication," Proc. SODA 2025 (arXiv:2404.16349), to ω < 2.371339. Currently the best published exponents. arXiv 2307.07970, arXiv 2404.16349.
NVIDIA. "NVIDIA Turing GPU Architecture Whitepaper," 2018, "RT Core" section. PDF. Hardware-accelerated BVH traversal in fixed-function silicon.
Peter E. Hart, Nils J. Nilsson, Bertram Raphael. "A Formal Basis for the Heuristic Determination of Minimum Cost Paths." IEEE Transactions on Systems Science and Cybernetics 4(2):100–107, 1968. Original A* paper; the O(b^d) worst-case bound and the optimality result under admissible heuristics.
Bevy Engine. "Use radix sort for sort phase and sprite sorting." GitHub issue #4291, 2022. github.com. Proposal and benchmarks comparing radix versus comparison sort for the render-phase sort in an open-source engine. (At the time of writing, Bevy uses radix sort only for sprite picking, via the radsort crate; the main draw-call sort is still std::sort's equivalent.)
Stephen A. Cook. "The Complexity of Theorem-Proving Procedures." Proc. ACM STOC, 1971, pp. 151–158. dl.acm.org. The original NP-completeness theorem; SAT is NP-complete. Leonid Levin proved an equivalent result independently in "Universal Sequential Search Problems," Problemy Peredachi Informatsii 9(3):115–116, 1973 (English translation: Problems of Information Transmission 9:265–266), hence the joint Cook-Levin name.
Joseph Culberson. "Sokoban is PSPACE-complete." Technical Report, University of Alberta, 1997. webdocs.cs.ualberta.ca. PSPACE-completeness proof for generalized Sokoban.
Greg Aloupis, Erik D. Demaine, Alan Guo, Giovanni Viglietta. "Classic Nintendo Games are (Computationally) Hard." Theoretical Computer Science 586:135–160, 2015. NP-hardness proofs for generalized Super Mario Bros, Donkey Kong, Legend of Zelda, Metroid, and Pokémon.
Richard Kaye. "Minesweeper is NP-Complete." The Mathematical Intelligencer 22(2):9–15, 2000. NP-completeness proof for the Minesweeper Consistency Problem.
Takayuki Yato, Takahiro Seta. "Complexity and Completeness of Finding Another Solution and Its Application to Puzzles." IEICE Transactions E86-A(5):1052–1060, 2003. NP-completeness proof for generalized n²×n² Sudoku.
Wilhelm Ackermann. "Zum Hilbertschen Aufbau der reellen Zahlen." Mathematische Annalen 99:118–133, 1928. Original Ackermann function (later simplified by Péter and Robinson to the two-argument form used today).
Epic Games. "TMap Reference," Unreal Engine 5 documentation. dev.epicgames.com. Documentation of Unreal's hash-map data structure and its design choices.
Matthias Teschner, Bruno Heidelberger, Matthias Müller, Danat Pomeranets, Markus Gross. "Optimized Spatial Hashing for Collision Detection of Deformable Objects." Proc. Vision, Modeling, Visualization (VMV) 2003. PDF. Cell-size analysis for uniform spatial hashing applied to collision detection; the canonical reference on the tradeoff between query cost and per-cell occupancy.