All tutorials Mighty Professional
Tutorial 06 ยท Engine Programming

Big O Notation

A hundred objects in the test scene, ten thousand in the shipping one. The build runs at 60 fps on the laptop and 4 fps on the target box. The animation that took 1 ยตs on day one takes 18 ms once the encounter is populated. Big O is the language game programmers use to predict that, before the QA report comes in. We start with the 1894 definition, work up through the cousin notations, the master theorem, and four worked engine case studies (broad-phase collision, BVH ray traversal, A*, draw-call sorting), and run every step as a live demo in your browser.

Time~55 min LevelJunior to mid; review for senior PrereqsYou can read C++ or pseudocode at intermediate level. You've heard of arrays, hash tables, and trees. HardwareNone, but knowing that CPUs have caches helps in ยง6 and ยง13.

01Why a notation, not a stopwatch

A naive collision check on the entities in a scene is two nested loops: for every pair of entities, test whether they overlap. On the four-object test scene that's six comparisons; the function takes 200 nanoseconds and nobody profiles it. On the eight-hundred-object shipping scene that's roughly 320,000 comparisons, the function takes 6 milliseconds, the physics step misses its frame budget, and the player's first encounter ships at 17 fps. A stopwatch on day one tells you nothing about day three hundred. A notation that says this function does work proportional to nยฒ does, because the multiplier the player will pay was visible the moment the function was written.

is that language. It strips a function down to its dominant growth term and throws away everything else (constants, lower-order terms, the specific hardware). What's left is the shape of the curve, which is the thing that determines whether a system survives the jump from test scene to shipping content. The shape doesn't care about your CPU. It cares about how the cost grows when the workload does.

What you'll have by the end

A working understanding of asymptotic notation strong enough to read a published algorithm description and predict its runtime curve; the discipline to spot accidental quadratic behavior in a code review; the math for the four most common divide-and-conquer recurrences via the master theorem; a feel for when Big O lies (cache effects, constant factors); and four worked engine case studies (broad-phase collision, BVH ray traversal, A*, draw-call sorting) where the asymptotic choice is the design choice. By the end you'll know why The Last of Us Part II's spatial hash beats sweep-and-prune on dense crowds, why ray-traced reflections are tractable at all[1], why Unreal sorts draw calls with a radix pass[2], and why Tetris is NP-complete[3].

The shape that ate the encounter

The widget below plots three implementations of "find all pairs of objects that might be touching": brute force at O(nยฒ), sweep-and-prune at roughly O(n log n), and a uniform spatial hash at O(n + k). Drag the object-count slider:

Live ยท Pair-test cost as the scene scales
brute (O(nยฒ))
ยทยทยท
sweep+prune
ยทยทยท
spatial hash
ยทยทยท
Comparisons-per-frame is a proxy for cost; on a modern CPU each pair test is a handful of cycles, so the absolute milliseconds depend on hardware, but the ratios between the curves are robust. Sweep-and-prune's average is O(n log n) with a worst case of O(nยฒ) when every box overlaps every other[4]; the spatial hash is approximately O(n + k) where k is the number of actually-overlapping pairs[5].

02A short history of writing down growth rates

Big O was not invented for computer science. It was invented for analytic number theory, and the people who pinned down its modern shape did so over the course of about 80 years. A tour, because the current usage only makes sense in context:

1870
Paul du Bois-Reymond defines an order-of-growth relation.[6] The first formal notation for comparing the growth of two functions. The notation didn't survive, but the idea did: it's the same comparison Big O makes, with different ink.
1894
Paul Bachmann introduces the O symbol.[7] Volume II of his Analytische Zahlentheorie writes "f(x) = O(g(x))" with O standing for Ordnung ("order"). This is the first appearance of the notation that survived. Bachmann uses it to bound error terms in prime-counting formulas.
1909
Edmund Landau adds little-o.[8] Landau's Handbuch der Lehre von der Verteilung der Primzahlen ("Handbook of the Theory of the Distribution of Prime Numbers") adopts Bachmann's big-O and introduces the lower-case o for strict asymptotic dominance. The pair is still called the Landau symbols in the math literature.
1914
Hardy and Littlewood extend the family.[9] G.H. Hardy and J.E. Littlewood introduce big-ฮฉ as a number-theoretic lower bound (a non-vanishing remainder, "f is not little-o of g"). The symbol is borrowed later by computer science with a different and stricter definition.
1976
Knuth standardizes Big O, Big ฮฉ, and Big ฮ˜ for algorithm analysis.[10] Donald Knuth's "Big Omicron and Big Omega and Big Theta" in SIGACT News formalizes the trio with the definitions algorithms researchers use today. Knuth wrote it because authors were misusing O() for lower bounds (saying things like "sorting requires O(n log n) comparisons") and he wanted a notation that wouldn't bury the distinction. The modern ฮฉ is stricter than Hardy and Littlewood's: it requires a uniform lower bound, not just an infinitely-often one.
1990
CLRS canonizes the asymptotic toolkit for teaching.[11] The first edition of Cormen, Leiserson, Rivest (and later Stein) puts asymptotic notation in chapter 3 of the textbook that becomes the global default. Two generations of undergraduates learn Big O from this presentation.
1997
Musser publishes introsort.[12] David Musser's "Introspective Sorting and Selection Algorithms" introduces the hybrid that runs quicksort but switches to heapsort when recursion depth exceeds 2 logโ‚‚ n. This is the moment Big-O's "worst case versus expected case" distinction becomes a shipping engineering decision: introsort guarantees O(n log n) worst case while keeping quicksort's empirical speed on the common inputs.
1999
Frigo, Leiserson, Prokop, Ramachandran formalize cache-obliviousness.[13] The paper shows that asymptotic counts of CPU operations are an incomplete model on machines with deep memory hierarchies, and that algorithms with the same Big O can differ by orders of magnitude depending on how they interact with the cache. The first widely-cited statement that Big O over memory accesses is sometimes the bound that matters, not Big O over arithmetic operations.
2010s
The "galactic algorithm" idea enters common usage.[14] Lipton and Regan popularize the term for asymptotically-superior algorithms whose constants are so vast they would never beat the simpler approach on any realistically-sized input. Coppersmith-Winograd matrix multiplication at O(n2.373) is the canonical example; in practice every shipping engine uses an O(nยณ) cache-blocked GEMM.
2026
Asymptotics live alongside cache, branch, and parallelism cost models. Modern engine work treats Big O as the first filter (catches a quadratic before it ships) and then layers cache-line, branch-prediction, and SIMD reasoning on top of the survivors. The single-number complexity is still load-bearing for design choices; it is no longer the whole story.

The recurring shape: a notation invented in 1894 to bound error terms in prime-counting formulas, sharpened in 1976 to handle algorithmic lower bounds, and extended in 1999 to model the memory hierarchy that real machines actually have. The version you'll use is Knuth's; the version that catches every bug is the one with cache-aware constants stapled on.

03The definition, picked apart

Big O is defined for functions from non-negative integers (or reals) to non-negative reals. The formal statement, the way Knuth wrote it[10] and CLRS teaches it[11]:

eq. 1 ยท big-O f(n) = O ( g(n) )   โ‡”   โˆƒ c > 0, nโ‚€ โ‰ฅ 0 : โˆ€ n โ‰ฅ nโ‚€, 0 โ‰ค f(n) โ‰ค c ยท g(n)

Read aloud: "f(n) is O(g(n)) if there exist constants c and nโ‚€ such that for every n at least nโ‚€, f(n) is sandwiched between zero and c times g(n)." The c is what lets you throw away constant factors; the nโ‚€ is what lets you throw away lower-order terms. Both are existential, which means the definition is satisfied as soon as you can find any pair that works.

The three things this definition does not say are at least as important as what it does say:

Drag the widget below. The orange curve is f(n); the violet curve is c ยท g(n) for the chosen c. The vertical line is nโ‚€. Big O holds when the orange curve stays under the violet one for every n to the right of the line:

Live ยท Find c and nโ‚€ that prove the bound
f(n)
3nยฒ + 12n + 50
status
ยทยทยท
For f(n) = 3nยฒ + 12n + 50, the bound f = O(nยฒ) holds with c = 4, nโ‚€ = 16: f(16) = 3ยท256 + 192 + 50 = 1010 and 4ยท16ยฒ = 1024, so f(16) โ‰ค 4ยทg(16), and the inequality survives for every larger n. Push the slider down to nโ‚€ = 13 (still at c = 4) and the bound fails: f(13) = 713 beats 4ยท169 = 676. Either bump c up or nโ‚€ back up to recover. With g = n, no constant c works asymptotically, because the nยฒ term eventually beats any linear cap. Toggle to ฮฉ or ฮ˜ to test the lower-bound and tight-bound definitions from ยง4.
Why is nโ‚€ existentially quantified?

The point of nโ‚€ is to let you ignore noise at small n. A function might do weird things for the first ten or hundred inputs (a startup cost, a one-time table fill, a cache that hasn't warmed up) and then settle into its real growth shape. Big O cares about the real growth shape, not the noise.

Choosing nโ‚€ = 0 is a stronger claim ("the bound holds for every input"); choosing nโ‚€ = 106 is a weaker one ("the bound holds for every input I care about"). The definition says you only need to find any nโ‚€, which is the right tradeoff for ignoring small-input artifacts without throwing away the bound.

04Omega, Theta, and the rest of the family

Big O is an upper bound. The full asymptotic toolkit has four cousins. Each is one inequality flip away from the others:

Name Symbol Meaning Definition Reads as
Big O O(g) Upper bound, not tight โˆƒc, nโ‚€ โˆ€nโ‰ฅnโ‚€: f โ‰ค cยทg "grows no faster than"
Big Omega ฮฉ(g) Lower bound, not tight โˆƒc, nโ‚€ โˆ€nโ‰ฅnโ‚€: f โ‰ฅ cยทg "grows no slower than"
Big Theta ฮ˜(g) Tight bound, same growth f = O(g) and f = ฮฉ(g) "grows exactly like"
Little o o(g) Strictly slower โˆ€c > 0 โˆƒnโ‚€ โˆ€nโ‰ฅnโ‚€: f < cยทg "grows strictly slower than"
Little omega ฯ‰(g) Strictly faster โˆ€c > 0 โˆƒnโ‚€ โˆ€nโ‰ฅnโ‚€: f > cยทg "grows strictly faster than"

The flip from "big" to "little" is the flip from existential to universal on the constant: big-O finds one c that bounds f above; little-o requires that every positive c eventually bounds f above. That's a much stronger statement: f can't just be no-bigger-than g by some constant, it has to grow strictly slower.

eq. 2 ยท concrete examples 3nยฒ + 5n = O(nยฒ), ฮ˜(nยฒ), ฮฉ(n), o(nยณ), ฯ‰(n)

The same function fits five bounds. ฮ˜(nยฒ) is the tightest description (matching upper and lower); the others are looser. In a tutorial, saying "merge sort is ฮ˜(n log n)" is more informative than "merge sort is O(n log n)" because ฮ˜ rules out the merge sort that secretly runs in O(n) (it doesn't), while O leaves that possibility open.

What about little-o?

Little-o is what you reach for when you want to claim that one function is fundamentally faster-growing than another, not just that one happens to be bigger. The statement 1/n = o(1) says 1/n shrinks to zero, which is the asymptotic version of "1/n vanishes." The statement n log n = o(nยฒ) says no constant multiple of n log n can keep up with nยฒ: the gap between them grows without bound.

Knuth's 1976 paper makes the distinction explicit and warns against the common abuse of writing O(n) when ฮ˜(n) is meant[10]. The abuse is unfortunately so widespread that "an O(n log n) sort" almost always means "a ฮ˜(n log n) sort" in practice. Read all O() in this sense unless the author explicitly distinguishes.

A practical rule of thumb

Algorithm analysis usually wants ฮ˜ (matching upper and lower bounds) for the algorithm you wrote, and ฮฉ for the problem you're solving (the best possible algorithm). "Merge sort is O(n log n)" and "comparison-based sorting is ฮฉ(n log n)" combine to say "merge sort is asymptotically optimal among comparison sorts," which is the actual claim.

05The growth zoo, racing

About eight complexity classes show up in 95% of working algorithms. Memorize their shapes and you can read most papers without doing the algebra:

Class Name Where you'll see it At n=106 (1 ns/op)
O(1)ConstantHash table lookup (average), array index, push to a stack1 ns
O(log n)LogarithmicBinary search, balanced BST operation, BVH ray traversal level20 ns
O(โˆšn)Square-rootSqrt decomposition; tile size in cache-oblivious algorithms1 ยตs
O(n)LinearLinear search, single pass over an array, building a histogram1 ms
O(n log n)LinearithmicComparison-based sort, FFT, sweep-and-prune broad phase20 ms
O(nยฒ)QuadraticNested-loop pair test, bubble sort, naive matrix-vector product17 min
O(nยณ)CubicNaive matrix multiply, Floyd-Warshall, all-pairs shortest path32 years
O(2n)ExponentialBrute-force SAT, subset enumeration, game-tree search at full depthโ‰ซ age of universe
O(n!)FactorialBrute-force TSP, permutation enumerationโ‰ซโ‰ซ age of universe

The "at n=106" column makes the spacing concrete: ten orders of magnitude separate O(n) from O(nยฒ), and another seventeen separate O(nยฒ) from O(nยณ). These are the gaps that distinguish a frame budget from a freeze, and a freeze from "won't terminate before the game studio goes bankrupt."

The widget below races them on a log-log plot. On log-log axes a polynomial nk is a straight line with slope k, which is why log-log is the standard way to compare growth rates by eye:

Live ยท Common growth rates on log-log axes
O(1) O(log n) O(n) O(n log n) O(nยฒ) O(nยณ) O(2โฟ)
Hover to read off the per-class cost at a given n. The "per-operation time" slider sets how many nanoseconds one elementary step costs; a 16.6 ms reference line marks the 60 fps frame budget so you can read off "how big an n fits in a frame at this complexity." A 1 GHz operation rate is generous for most game-engine work; modern CPUs do 3โ€“5 of these per cycle for trivial ops but pay 100+ ns for L3 misses.

06When constants lie

Big O drops constants on purpose: a notation that cared about constants couldn't compare implementations across hardware, compilers, or decades. The price is that two algorithms with the same Big O can differ by orders of magnitude in real-world cost, and an algorithm with worse Big O can beat one with better Big O on inputs you'll actually see.

The textbook example is 100n versus nยฒ / 100. The first is O(n); the second is O(nยฒ). By any asymptotic measure, the linear algorithm wins. Solve for the crossover:

eq. 3 ยท crossover point 100n = n2 / 100 โ‡’ n = 10,000

Below n = 10,000 the "worse" algorithm is genuinely faster. If you ship a system that never exceeds 10,000 elements, picking the quadratic implementation is the correct engineering decision. The asymptotic ranking is informative, not authoritative.

Live ยท Where does the "worse" algorithm cross over?
crossover n
ยทยทยท
linear wins if
ยทยทยท
Both curves are plotted on linear axes so the crossover is visible as the intersection. The linear curve always wins eventually; whether it wins on the inputs you ship is the question.

What constants stand for in practice

An algorithm's hidden constant absorbs all the per-operation work the abstraction throws away. Three things commonly drive a small Big O into an unshippable wall:

Frigo et al.'s 1999 cache-oblivious paper[13] formalized the first of these into a parallel complexity model that counts cache transfers instead of CPU operations. Sometimes the asymptotic count over cache transfers is the bound that matters; the CPU-op count is constant-factor noise. The takeaway is not that Big O is wrong, it's that Big O is one of two or three complexity functions a serious analysis needs.

Constants live forever

An algorithm with hidden constant 106 is asymptotically the same as one with hidden constant 1, but on inputs of size n โ‰ค 106 the second is up to a million times faster. The Coppersmith-Winograd matrix-multiply algorithm has a famous constant in the trillions; no shipping linear-algebra library uses it[14]. The Big O is an admission of ignorance about the constant, not a claim that the constant is small.

07Worst, best, average, amortized

An algorithm's runtime is a function of its input. Different inputs produce different runtimes. Big O over an algorithm has to specify which inputs you mean:

Quicksort: the canonical worst-versus-average gap

Quicksort partitions the array around a pivot, recurses on the two halves, and concatenates. On a random pivot it halves the array on average, giving the recurrence T(n) = 2T(n/2) + ฮ˜(n), which the master theorem (ยง8) solves to ฮ˜(n log n). On an adversarial pivot (always-smallest or always-largest), it produces partitions of sizes 0 and nโˆ’1, giving T(n) = T(nโˆ’1) + ฮ˜(n), which sums to ฮ˜(nยฒ).

A sorted-or-reverse-sorted input plus a naive first-element pivot triggers the worst case on every recursion. The widget runs quicksort on three input patterns:

Live ยท Quicksort comparisons under three input patterns
comparisons at n
ยทยทยท
vs n log n
ยทยทยท
The "sorted" and "reverse" curves are the quadratic worst case; they grow as nยฒ/2. Random input gives the expected n log n curve. Production C++ standard library implementations switch to heapsort when recursion depth exceeds 2 logโ‚‚ n to guarantee a worst-case O(n log n), the Musser introsort design[12].

Amortized: std::vector::push_back

A dynamic array doubles its capacity when full. Most pushes are O(1) (write to the next slot). The pushes that trigger a resize are O(n) (copy every element to a bigger buffer). Worst-case per push is O(n); that's not interesting because it's misleading. Amortized per push, summed over a sequence and divided by length, is O(1).

The proof, from Tarjan's 1985 paper that formalized the technique[17]: n pushes do at most n single-slot writes plus a sequence of resize-copies of sizes 1, 2, 4, 8, ..., 2โŒˆlog nโŒ‰. That sum is bounded by 2n. Total work is at most 3n; per-push amortized cost is at most 3 = O(1).

Live ยท The cost of each push, with running average
peak push cost
ยทยทยท
amortized average
ยทยทยท
The growth factor controls the resize-cost-versus-memory-overhead tradeoff. 2ร— (the standard) wastes up to half the buffer; 1.5ร— (Facebook's folly::fbvector) wastes less but resizes more often. Both have amortized O(1) push; the constant is different.

08The Master Theorem

Most divide-and-conquer algorithms produce a recurrence of the form:

eq. 4 ยท the standard form T(n) = a ยท T(n / b ) + f(n)

"To solve a problem of size n, do a recursive subsolves each of size n/b, plus f(n) of additional work to combine them." Merge sort is a = 2, b = 2, f(n) = ฮ˜(n). Binary BVH construction with sweep partitioning is similar.

The Master Theorem[11] classifies these recurrences in three cases, depending on whether the recursive work or the combine work dominates. Let ccrit = logb a:

Worked examples

Three standard divide-and-conquer recurrences and what the theorem says about them:

Algorithm Recurrence ccrit Case Result
Binary searchT(n) = T(n/2) + O(1)0Case 2 (c=0)ฮ˜(log n)
Merge sortT(n) = 2T(n/2) + O(n)1Case 2ฮ˜(n log n)
Karatsuba multiplicationT(n) = 3T(n/2) + O(n)logโ‚‚3 โ‰ˆ 1.585Case 1ฮ˜(nlogโ‚‚3) โ‰ˆ ฮ˜(n1.585)
Strassen matrix multiplyT(n) = 7T(n/2) + O(nยฒ)logโ‚‚7 โ‰ˆ 2.807Case 1ฮ˜(nlogโ‚‚7) โ‰ˆ ฮ˜(n2.807)
Naive matrix multiply (recursive)T(n) = 8T(n/2) + O(nยฒ)3Case 1ฮ˜(nยณ)
BVH top-down (binned SAH)T(n) = 2T(n/2) + O(n)1Case 2ฮ˜(n log n)

The pattern: how many subproblems versus how aggressively you shrink them is the lever. Strassen's algorithm beats the naive cubic matrix multiply because it trades one recursive call for a constant amount of extra combine work; that one fewer recursion drops the exponent from 3 to logโ‚‚ 7. The same insight scales: the long arc of matrix-multiply progress (Strassen 1969, Coppersmith-Winograd 1990, the modern improvements down to O(n2.371552))[18] is a sequence of finer subdivisions that drop a further. Only Strassen is practical; the rest are galactic (ยง14).

The cases that aren't in the Master Theorem

The classical Master Theorem doesn't cover every recurrence. Three gaps:

  • Logs in f(n). If f(n) = ฮ˜(n logk n), the original theorem doesn't apply. The Akra-Bazzi method extends it.
  • Uneven splits. T(n) = T(n/3) + T(2n/3) + O(n) (median-of-medians selection variants) requires Akra-Bazzi or direct analysis.
  • Subtractive recurrences. T(n) = T(n-1) + O(n) (quicksort worst case, naive recursion) isn't divide-and-conquer; it's "shrink by 1." The answer is the sum ฮ˜(nยฒ), found by direct summation.

For the recurrences this tutorial cares about, the three-case Master Theorem above is sufficient. The Akra-Bazzi version is in CLRS chapter 4[11].

09Measuring growth empirically

Sometimes the algorithm is too complex to analyze by hand and you have to measure. The classical trick: run it at sizes n, 2n, 4n, 8n, plot on log-log axes, and read the slope.

On log-log axes, T(n) = c ยท nk becomes log T = log c + k ยท log n, which is a straight line with slope k and intercept log c. Eyeballing the slope is a coarse but reliable diagnostic: a slope of 1 means O(n), a slope of 2 means O(nยฒ), anything between 1 and 2 is suspicious (could be O(n log n), could be O(n1.5), hard to tell from slope alone).

A more discriminating test: the doubling ratio. Measure T(n) and T(2n); the ratio T(2n) / T(n) diagnoses the class:

Doubling ratioDiagnosisSlope on log-log
~1O(1)0
~1 + 1/log nO(log n)tends to 0
~2O(n)1
~2.1โ€“2.3O(n log n)slightly above 1
~4O(nยฒ)2
~8O(nยณ)3

The widget below shows a measured running time and lets you guess the complexity. The data has noise (cache effects, GC pauses, scheduler hiccups) so the diagnosis isn't always exact at small n; that's the realistic case.

Live ยท Guess the complexity from measured data
slope (log-log fit)
ยทยทยท
doubling ratio
ยทยทยท
verdict
ยทยทยท
Click "new mystery curve" to randomize the underlying algorithm; pick a complexity class to check. The slope is the least-squares fit of log T vs log n; the doubling ratio is computed from the two largest measured points. Real profiling data is noisier than this; the empirical method is reliable down to about a factor of 2 on the exponent.

10Case study: broad-phase collision

The classic engine-side O(nยฒ) trap. Physics simulations need to find every pair of objects whose bounding volumes overlap. A naive double-loop checks n(nโˆ’1)/2 pairs; at n = 1000 that's 500,000 pairs every frame, at 60 fps that's 30 million pair tests per second, and a pair test is several SIMD ops, so the broad phase alone budgets out the rest of physics on a single core.

Three standard fixes, with different complexity in different regimes:

Sweep-and-prune (sort and sweep)

Sort the bounding-box endpoints along one axis (say X). Two boxes can only overlap on the X axis if neither's max-X is before the other's min-X. Sweep through the sorted endpoints with an active list; when you hit a min-X, add the box to the active list; when you hit a max-X, remove it. Every pair of boxes simultaneously active is a candidate. Repeat on Y and Z; take the intersection.

The first frame's cost is dominated by the initial sort at O(n log n); the sweep itself is O(n + k) where k is the number of active-pair encounters. In subsequent frames the endpoint lists are nearly sorted (objects move a little, not a lot, per step), and insertion sort on a nearly-sorted list runs in O(n + s) where s is the number of out-of-order pair swaps. In steady state that puts per-frame cost at O(n + k + s), which is close to linear when both k and s are O(n). The pathological case (every box overlaps every other) blows k up to O(nยฒ) and you're back to the naive cost. The Cohen et al. 1995 I-COLLIDE paper introduced the temporal-coherence trick[4].

Uniform spatial hash

Partition space into a regular grid. Hash each object's AABB into the cells it touches. Two objects can only collide if they share a cell. Build cost is O(n); query cost is the sum over cells of the per-cell pair-test count.

If average density is d objects per cell and the cell size is at least as large as the biggest object's AABB (so each object touches O(1) cells), the per-cell pair-test count averages dยฒ/2, and total cost is O(n ยท d), which collapses to O(n) when d is bounded. The catch: d is bounded only when objects are spread out. A pile of crates on the same tile reduces to brute force inside that tile. NVIDIA GPU Gems 3 describes a GPU spatial-hash implementation that handles dense scenes by sub-binning[5].

BVH for moving objects

Build a bounding volume hierarchy over the objects' AABBs and query each object against the tree. Construction is O(n log n); per-query is O(log n) in well-spread scenes; total is O(n log n). Bullet ships btDbvtBroadphase (dynamic AABB tree) and Box2D ships b2DynamicTree as the default broad phase for exactly this reason; PhysX defaults to sweep-and-prune but supports a BVH alternative for sparse scenes[1].

The widget runs all three on the same simulated scene. At low object counts brute force wins on absolute time because its constant is tiny; at high counts it loses by an enormous margin:

Live ยท Three broad-phase algorithms on the same scene
brute pairs/frame
ยทยทยท
SAP pairs/frame
ยทยทยท
hash pairs/frame
ยทยทยท
"Pairs/frame" is the number of close-proximity checks each algorithm queues for the narrow phase. The simulation runs in real wall-clock time, not collapsed into one paint, so the pacing matches the algorithm's actual cost shape. Note that very small grid cells push the hash toward brute force (because every cell has at most one object and the hash overhead dominates); very large cells push it toward brute force on the inside of each cell. There's a sweet spot near "cell size โ‰ˆ object diameter."

11Case study: BVH ray traversal

Ray tracing a scene of n triangles by brute force means testing the ray against every triangle: O(n) per ray, O(n ยท pixels) per image. At n = 107 triangles and a 1080p image that's 2ร—1013 ray-triangle tests per frame, which is several orders of magnitude past tractable.

A reduces this to O(log n) expected per ray on well-built scenes. The trick: organize triangles into a binary tree where each node's AABB encloses every triangle in its subtree. To intersect a ray with the scene, test the ray against the root's AABB; if it misses, the whole subtree is irrelevant. If it hits, recurse on both children. Continue until reaching a leaf, where you test the ray against the small bucket of triangles inside.

The Physically Based Rendering textbook lays out the math: with a balanced BVH and the surface-area heuristic for the build, expected per-ray cost is logarithmic in n[1]. The worst case is still O(n) (a degenerate ray that hits every leaf), but well-built scenes hold the log bound on every ray that matters.

Live ยท Brute force vs BVH, traced live
brute tests
ยทยทยท
BVH tests
ยทยทยท
speedup
ยทยทยท
The BVH is built with a median split on the longer axis at each level. Brute force always tests n objects; BVH tests the AABB of every node on the ray's path plus the leaf objects in the leaves it descends into. For a well-spread scene, the descended-into count is log2(n) bounded by a small constant. Rays that traverse a lot of empty space see the biggest speedups; rays that graze many objects see less. NVIDIA's RT cores fix this in hardware by performing the BVH walk in fixed-function silicon[19].

12Case study: A* and the branching factor

A* searches a graph for the shortest path from start to goal. At each step it picks the frontier node with the lowest estimated total cost f(n) = g(n) + h(n), where g is the known cost so far and h is a heuristic estimate to the goal. With an admissible heuristic, A* is optimal; with no heuristic, it degenerates to Dijkstra; with a perfect heuristic, it walks a straight line to the goal.

The worst-case complexity is O(bd), where b is the branching factor (how many neighbors a node has) and d is the depth of the solution[20]. On a 2D grid b = 4 (von Neumann) or b = 8 (Moore neighborhood); on a navmesh b is the average degree of a navmesh polygon, usually 3โ€“6.

The exponential isn't in n (the size of the graph); it's in d (how far the goal is). For a typical 100ร—100 grid with the goal 50 steps away, naive BFS would visit roughly 450 nodes if it didn't track visited ones; with the visited-set, BFS visits n = 104 nodes. A* with a good heuristic visits something like 5 ยท 50 = 250 nodes: the visited region is a corridor along the optimal path, not a disk.

The heuristic's quality is the lever. A perfect heuristic (knows the true distance) makes A* visit exactly the optimal path. An admissible-but-loose heuristic (Manhattan distance on a grid with obstacles) makes A* explore a fan along the path. An inadmissible heuristic (overestimates) makes A* fast but the result is no longer optimal:

Live ยท A* frontier under different heuristics
nodes expanded
ยทยทยท
path length
ยทยทยท
optimal?
ยทยทยท
Dijkstra is A* with h = 0; it explores every reachable node up to the goal's cost, which is a disk. Manhattan heuristic is admissible on 4-connected grids; A* explores a wedge biased toward the goal. Weighted A* multiplies h by a factor > 1 to push harder toward the goal; faster, but the path is no longer guaranteed optimal. The shape change is the asymptotic improvement: from O(n) for Dijkstra to O(d ยท beff) for A* with a good heuristic, where beff is the effective branching factor a heuristic prunes the search to.

13Case study: sorting draw calls

A renderer issues thousands of draw calls per frame. Reordering them to group calls that share a material, mesh, or render target is one of the biggest performance wins available: state changes on the GPU are expensive, and a sort that batches them amortizes the cost[2].

The standard pattern: every draw call gets a 32- or 64-bit sort key packing render target, shader, material, depth, and so on into bit fields in priority order. Sort the array of keys. Issue the calls in the sorted order. Comparing the sort cost matters because the renderer runs it every frame:

Why isn't every sort a radix sort? Two reasons. First, comparison sorts work on any comparable type; radix sorts work on fixed-width integer keys. Second, naive radix sort scatters writes across buckets and trashes the cache, so the constant is larger than the textbook complexity suggests; the implementations that beat std::sort handle that explicitly. The Bevy engine's issue tracker has a detailed discussion of radix-vs-comparison for the sort phase[21]; the canonical real-time-collision-detection blog post by Christer Ericson lays out the bit-packing pattern most engines use[2].

Radix sort being O(n) (treating d as a constant for fixed-width keys) sounds like it violates the ฮฉ(n log n) lower bound for sorting. It doesn't, because the lower bound applies to comparison-based sorts; radix is not comparison-based. The lower bound's proof rests on counting binary-decision-tree leaves[11]; radix sort doesn't make comparisons, so the proof doesn't apply.

draw_key.hpp ยท packing a sort key
// A 64-bit draw-call sort key. Field order is priority order:
// the leftmost (most significant) bits drive the coarsest grouping.
// Lay out fields top-down so a single 64-bit integer compare
// produces the desired ordering with no branches.
struct DrawCallKey {
  uint64_t renderTarget   : 2;   // 0=shadow, 1=opaque, 2=transparent, 3=postfx
  uint64_t shaderProgram  : 10;  // 1024 programs is plenty
  uint64_t materialId     : 14;  // 16K unique materials per frame
  uint64_t meshId         : 14;  // 16K unique meshes; rebinding VBO is costly
  uint64_t depth          : 24;  // front-to-back for opaque; reversed for transparent

  // 64-bit compare = whole-struct compare. The bit layout
  // implies the priority. Sorting an array of these is one
  // radix pass over the 8 bytes, or one std::sort over uint64_t.
};
static_assert(sizeof(DrawCallKey) == 8, "key must pack into a uint64");

The bit layout is the asymptotic algorithm: by putting render target in the high bits, every same-target draw is contiguous after sorting; by putting depth in the low bits, ties at higher levels resolve to depth order. The sort itself is incidental. The design is the key layout.

14Galactic algorithms

An algorithm is galactic if its asymptotic complexity is the best known but its hidden constants are so large that it loses to a worse-asymptotic algorithm on every input that fits in the observable universe[14]. The term was popularized in Lipton and Regan's 2010s essays on algorithm theory.

The flagship example is matrix multiplication. The naive triple-loop algorithm is O(nยณ). The progression of asymptotic improvements has been steady since the 1960s:

YearAuthor(s)Exponent ฯ‰Practical?
1969Strassenlogโ‚‚7 โ‰ˆ 2.807Yes; production for n โ‰ฅ 100
1978Pan2.795No; constants too large
1981Schรถnhage2.522No
1990Coppersmith-Winograd2.376No; canonical galactic algorithm
2014Le Gall2.3728639No
2024Vassilevska Williams, Y. Xu, Z. Xu, Zhou[18]2.371552No
2025Alman, Duan, Williams, Y. Xu, Z. Xu, Zhou[18]2.371339No

Every entry past Strassen is galactic. Estimates of the crossover n at which a Coppersmith-Winograd-style algorithm would beat a cache-blocked cubic GEMM put it well past 1020, more than any concrete matrix that ships in a graphics or scientific application. For shipping graphics, audio, and physics code, GEMM is O(nยณ); the asymptotic wins are mathematical theorems, not engineering choices.

Galactic algorithms are still worth studying, for two reasons. First, they bound what's possible: every step closer to nยฒ moves the lower-bound theorem closer. Second, the techniques sometimes transfer down: Strassen's algorithm is practical, and it began the chain. If a future technique drops the constant by a factor of n, the resulting algorithm beats O(nยณ) in cells you can actually rent. The lesson is that "asymptotically faster" is a necessary, not sufficient, condition for "actually faster."

15NP-hard games

Big O can describe problems for which no polynomial-time algorithm is known, and which are believed to require exponential time. The class NP is the set of decision problems whose "yes" answers can be checked in polynomial time given a witness. The class P is the subset whose answers can also be computed in polynomial time. The P vs NP question is whether these are the same class.

A problem is NP-hard if every problem in NP reduces to it in polynomial time; NP-complete if it's both NP-hard and in NP. The Cook-Levin theorem (Cook 1971, Levin 1973 independently) showed that SAT is NP-complete, the first such proof[22]; thousands of problems have since been shown NP-complete by polynomial reduction from SAT.

Several puzzles are known NP-hard or NP-complete in their generalized forms:

GameProblem statementComplexityReference
Generalized TetrisMaximize survival on a known sequence of piecesNP-complete[3]
SokobanPush boxes onto target tilesPSPACE-complete[23]
Super Mario BrosReach the end of a generalized levelNP-hard[24]
Minesweeper consistencyIs a given board consistent?NP-complete[25]
Generalized SudokuSolve an nยฒ ร— nยฒ SudokuNP-complete[26]

These results apply to generalized versions (Tetris on an n ร— m board, Sudoku on nยฒ ร— nยฒ). The fixed-size game you play in your browser is solvable by brute force; the asymptotic claim is about the family. The practical engineering takeaway is that AI for these problems can't be expected to scale: heuristics and learned policies dominate every published agent because exact solutions are out of reach.

Procedural generation hits the same wall in the other direction. Generating a guaranteed-solvable Sokoban level requires solving Sokoban, which is PSPACE-complete[23]. Generators ship anyway, by using a constrained subset of the rules where the problem becomes tractable, or by checking solvability with a bounded search budget and discarding levels that don't certify within it.

16Beyond polynomial

Past exponential are growth rates so vast they have no operational meaning. The number of atoms in the observable universe is roughly 1080; 2n crosses that threshold at n โ‰ˆ 267, and n! crosses it at n โ‰ˆ 60. The grows faster than any primitive-recursive function; at very small inputs it produces numbers that can't be written down in conventional notation[27].

These aren't useful complexities for algorithms ("the algorithm runs in A(n, n) time" is functionally indistinguishable from "the algorithm doesn't terminate"), but they show up as inverse functions in tight upper bounds. The classic example: union-find with union-by-rank and path compression has an amortized cost per operation of ฮฑ(n), where ฮฑ is the inverse Ackermann[17]. For any n that fits in any computer ever built, ฮฑ(n) โ‰ค 4: in practical terms, the operation is constant-time. The function is technically not constant because it grows, just incomprehensibly slowly.

Live ยท How big is big? Magnitude at small n
n!
ยทยทยท
2โฟ
ยทยทยท
nยณ
ยทยทยท
All values are shown as their log10: a bar at 80 represents a number with 81 digits. The "atoms in universe" reference line (logโ‚โ‚€ โ‰ˆ 80) and the "nanoseconds since big bang" line (logโ‚โ‚€ โ‰ˆ 26.6) make the difference between "expensive" and "not finishable" visually concrete. n! at n โ‰ˆ 60 crosses atoms-in-universe (60! โ‰ˆ 8.3 ร— 1081); 2n doesn't until n โ‰ˆ 267 (2266 โ‰ˆ 1.2 ร— 1080). Both shape the limit of brute-force search.

17Try it yourself

The playground lets you pick a data structure and an operation, and shows the asymptotic cost along with the actual operation animated. Six common structures, four operations each.

Live ยท Data structures versus operations
cost (this run)
ยทยทยท
asymptotic
ยทยทยท
"Cost" is the number of structural visits (comparisons, pointer follows, hash probes) the operation performed. The asymptotic line shows the Big O for this (structure, operation) combo. Note that the playground's hash table assumes a good hash; degenerate hashes degrade to O(n) worst case.

18How engines actually use this

Asymptotic complexity drives most major data-structure decisions in shipped engines, with constant-factor refinements layered on top. Concrete choices:

The shared shape: pick the asymptotic class first, then refine for cache and parallelism. An engine that picks O(nยฒ) by accident and then "optimizes" the constant will not catch the bug. An engine that picks O(n log n) and then optimizes the constant ships at 60 fps.

19Pitfalls

The mistakes that occur most often in code review, in roughly decreasing frequency:

20What's next

The natural follow-ups are the algorithms whose complexity this tutorial only sketched:

For the engine programmer, the most valuable next step is empirical: profile a real engine, find every O(nยฒ) hidden in it, and replace it. The bugs are there.

Sources

  1. Matt Pharr, Wenzel Jakob, Greg Humphreys. Physically Based Rendering: From Theory to Implementation, 3rd ed., 2018, ยง4.3 "Bounding Volume Hierarchies." pbr-book.org. Source for the BVH-traversal expected O(log n) bound and the surface-area heuristic build.
  2. Christer Ericson. "Order your graphics draw calls around!" realtimecollisiondetection.net, 2008. realtimecollisiondetection.net. The standard reference for bit-packed draw-call sort keys and the radix-sort pattern.
  3. Erik D. Demaine, Susan Hohenberger, David Liben-Nowell. "Tetris is Hard, Even to Approximate." COCOON 2003. arXiv:cs/0210020. Original NP-completeness proof for generalized Tetris.
  4. Jonathan D. Cohen, Ming C. Lin, Dinesh Manocha, Madhav K. Ponamgi. "I-COLLIDE: An Interactive and Exact Collision Detection System for Large-Scale Environments." Proc. ACM Interactive 3D Graphics, 1995. The original sweep-and-prune (sort-and-sweep) paper, with the O(n + k) analysis and worst-case O(nยฒ) note.
  5. Scott Le Grand. "Broad-Phase Collision Detection with CUDA." GPU Gems 3, ch. 32, NVIDIA, 2007. developer.nvidia.com. Spatial-hash analysis for the GPU broad phase, including sub-binning for dense scenes.
  6. Erica N. Walker. "Math Origins: Orders of Growth." Mathematical Association of America Convergence, 2017. maa.org. Historical survey covering du Bois-Reymond 1870, Bachmann 1894, and Landau 1909.
  7. Paul Bachmann. Die analytische Zahlentheorie, vol. II, Teubner, 1894. archive.org scan. First published use of the O symbol; "O" stands for "Ordnung."
  8. Edmund Landau. Handbuch der Lehre von der Verteilung der Primzahlen, Teubner, 1909. archive.org scan. Introduces the little-o notation; the source of "Landau symbols."
  9. G. H. Hardy, J. E. Littlewood. "Some Problems of Diophantine Approximation: Part II. The Trigonometrical Series Associated with the Elliptic ฮธ-Functions." Acta Mathematica 37, 1914. Introduces the original ฮฉ notation as a non-vanishing-remainder marker; predates Knuth's stricter algorithmic ฮฉ.
  10. Donald E. Knuth. "Big Omicron and Big Omega and Big Theta." ACM SIGACT News 8(2):18โ€“24, 1976. dl.acm.org. The standardization paper for algorithm analysis: defines Big O, Big ฮฉ, Big ฮ˜ as used today.
  11. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms, 4th ed., MIT Press, 2022, ch. 3โ€“4. The textbook reference for asymptotic notation, the master theorem, and the comparison-sort lower bound.
  12. David R. Musser. "Introspective Sorting and Selection Algorithms." Software: Practice and Experience 27(8):983โ€“993, 1997. PDF. The introsort design used by libstdc++, libc++, and MSVC's std::sort.
  13. Matteo Frigo, Charles E. Leiserson, Harald Prokop, Sridhar Ramachandran. "Cache-Oblivious Algorithms." Proc. 40th IEEE FOCS, 1999. PDF. The original cache-oblivious paper; sets up the complexity model that counts memory transfers.
  14. Richard J. Lipton, Kenneth W. Regan. "Galactic Algorithms." Gรถdel's Lost Letter and P=NP, 23 October 2010. rjlipton.com. The post that introduced the term, credited to Regan, for asymptotically-superior algorithms that are uncompetitive at any realistically-computable input size. Coppersmith-Winograd is the standard example.
  15. Peter Norvig. "Teach Yourself Programming in Ten Years," "Approximate timing for various operations" table. norvig.com/21-days.html. Memory-hierarchy latency ballpark numbers (L1 ~0.5 ns, L2 ~7 ns, DRAM ~100 ns).
  16. Agner Fog. "The microarchitecture of Intel, AMD and VIA CPUs," chapter on branch prediction. PDF. Source for the 15โ€“20 cycle branch-misprediction penalty on modern x86 cores.
  17. Robert E. Tarjan. "Amortized Computational Complexity." SIAM Journal on Algebraic Discrete Methods 6(2):306โ€“318, 1985. Introduces the formal amortized-analysis framework; covers the O(ฮฑ(n)) bound for union-find.
  18. Virginia Vassilevska Williams, Yinzhan Xu, Zixuan Xu, Renfei Zhou. "New Bounds for Matrix Multiplication: from Alpha to Omega." Proc. SODA 2024 (arXiv:2307.07970); result ฯ‰ < 2.371552. Subsequently improved by Josh Alman, Ran Duan, and the same group in "More Asymmetry Yields Faster Matrix Multiplication," Proc. SODA 2025 (arXiv:2404.16349), to ฯ‰ < 2.371339. Currently the best published exponents. arXiv 2307.07970, arXiv 2404.16349.
  19. NVIDIA. "NVIDIA Turing GPU Architecture Whitepaper," 2018, "RT Core" section. PDF. Hardware-accelerated BVH traversal in fixed-function silicon.
  20. Peter E. Hart, Nils J. Nilsson, Bertram Raphael. "A Formal Basis for the Heuristic Determination of Minimum Cost Paths." IEEE Transactions on Systems Science and Cybernetics 4(2):100โ€“107, 1968. Original A* paper; the O(bd) worst-case bound and the optimality result under admissible heuristics.
  21. Bevy Engine. "Use radix sort for sort phase and sprite sorting." GitHub issue #4291, 2022. github.com. Proposal and benchmarks comparing radix versus comparison sort for the render-phase sort in an open-source engine. (At the time of writing, Bevy uses radix sort only for sprite picking, via the radsort crate; the main draw-call sort is still std::sort's equivalent.)
  22. Stephen A. Cook. "The Complexity of Theorem-Proving Procedures." Proc. ACM STOC, 1971, pp. 151โ€“158. dl.acm.org. The original NP-completeness theorem; SAT is NP-complete. Leonid Levin proved an equivalent result independently in "Universal Sequential Search Problems," Problemy Peredachi Informatsii 9(3):115โ€“116, 1973 (English translation: Problems of Information Transmission 9:265โ€“266), hence the joint Cook-Levin name.
  23. Joseph Culberson. "Sokoban is PSPACE-complete." Technical Report, University of Alberta, 1997. webdocs.cs.ualberta.ca. PSPACE-completeness proof for generalized Sokoban.
  24. Greg Aloupis, Erik D. Demaine, Alan Guo, Giovanni Viglietta. "Classic Nintendo Games are (Computationally) Hard." Theoretical Computer Science 586:135โ€“160, 2015. NP-hardness proofs for generalized Super Mario Bros, Donkey Kong, Legend of Zelda, Metroid, and Pokรฉmon.
  25. Richard Kaye. "Minesweeper is NP-Complete." The Mathematical Intelligencer 22(2):9โ€“15, 2000. NP-completeness proof for the Minesweeper Consistency Problem.
  26. Takayuki Yato, Takahiro Seta. "Complexity and Completeness of Finding Another Solution and Its Application to Puzzles." IEICE Transactions E86-A(5):1052โ€“1060, 2003. NP-completeness proof for generalized nยฒร—nยฒ Sudoku.
  27. Wilhelm Ackermann. "Zum Hilbertschen Aufbau der reellen Zahlen." Mathematische Annalen 99:118โ€“133, 1928. Original Ackermann function (later simplified by Pรฉter and Robinson to the two-argument form used today).
  28. Epic Games. "TMap Reference," Unreal Engine 5 documentation. dev.epicgames.com. Documentation of Unreal's hash-map data structure and its design choices.
  29. Matthias Teschner, Bruno Heidelberger, Matthias Mรผller, Danat Pomeranets, Markus Gross. "Optimized Spatial Hashing for Collision Detection of Deformable Objects." Proc. Vision, Modeling, Visualization (VMV) 2003. PDF. Cell-size analysis for uniform spatial hashing applied to collision detection; the canonical reference on the tradeoff between query cost and per-cell occupancy.

See also