Sorting Algorithms
Every frame your engine sorts something: draw calls by material and depth, transparent sprites back-to-front, particles, broad-phase AABBs, animation events. The routines doing the work are not std::sort as your textbook drew it. They are introsort with insertion-sort cutoffs, pdqsort with branchless partitioning, timsort with a Fibonacci stack, and 32-bit radix passes packed by hand. We build each one from scratch, run them live in your browser, and end with three case studies from shipping engines.
01Why a sort, not a stopwatch
The frame budget at 60 fps is 16.67 ms. A modern open-world title submits 5,000 to 15,000 draw calls per frame and wants them grouped by render target, then shader, then material, then depth, so the GPU swaps state as few times as possible[1]. A comparison sort with a comparator that dereferences a material or shader pointer per element does on the order of 1.4 n log n key comparisons (a hundred thousand to a quarter-million at this scale), each spending tens of nanoseconds once the pointer chasing leaves L1; the total can run past a millisecond. A four-pass radix sort over a packed 32-bit key finishes the same work in roughly 0.1 ms. Bevy benchmarked exactly this swap in 2022 and reported median sort time on its opaque-3D sort phase (the many_cubes benchmark) dropping from 0.728 ms to 0.094 ms[2]. The difference is not the asymptotic class. The difference is which sort, on what key, with what memory layout.
This tutorial walks through the family of sorts an engine programmer actually meets. Comparison sorts (quicksort, mergesort, heapsort, and their modern hybrids) are bounded below by Ω(n log n) for n distinct elements, a bound we prove in §3. Non-comparison sorts (radix, counting, bucket) escape that bound by looking at the bits of the keys directly, and that is why engine renderers reach for them. Parallel sorts (bitonic, GPU radix) trade total work for parallel depth, which is what makes them fit on a GPU.
A working mental model of every sort that ships in a major language runtime or game engine. Knowledge of which one to pick on which data, why std::sort in libstdc++ is introsort while libc++ and Rust ship pdqsort-derived implementations, why Python and V8 use timsort (and why Python 3.11 swapped it for powersort), how the timsort stack-invariant bug shipped for thirteen years before anyone noticed, why Unreal sorts particles on the GPU with bitonic merge but draw calls on the CPU with radix, and how to pack a 32-bit sort key that puts opaque-front-to-back, then transparent-back-to-front, in one comparison.
The cost of the wrong sort
The widget below puts a comparison sort and a radix sort on the same axis: object count along the X, milliseconds per frame along the Y. The comparison curve is n log n with a measured per-comparison cost of ~30 ns (the cost of one pointer dereference plus one string compare, ballparked from Norvig's latency numbers[3]). The radix curve is linear in n at ~3 ns per element per pass, four passes for a 32-bit key. The crossover is around 200 items; above 1,000 the gap is order-of-magnitude:
02A short history of sorting
Sorting predates computing by a century: tabulating machines for the 1890 US Census used punched-card sort routines invented by Herman Hollerith, and the algorithm they ran was a radix-by-column LSD pass. Modern sorting research started with von Neumann in 1945 and is still active. The dates that matter for engine work:
qsort implementations and a generation of C libraries descend from this paper.
std::sort has been introsort since the GCC 3.x era; MSVC and libc++ followed (libc++ finally caught up in LLVM 14)[10].
list.sort. The design document Objects/listsort.txt is a model of how to write up a sort. Joshua Bloch's port landed in OpenJDK in 2009 (JDK-6804124) and shipped with Java 7 in 2011 as the default sort for object arrays. V8 adopted timsort for JavaScript Array.prototype.sort in V8 v7.0 (2018).
slice::sort_unstable is pdqsort; Boost shipped a port; the algorithm became the de facto modern quicksort.
Two patterns hold across the timeline. One: every winning production sort is a hybrid. Quicksort plus insertion sort plus heapsort fallback (introsort, pdqsort); mergesort plus insertion sort on small runs plus galloping merge (timsort, powersort). Nobody ships the single-routine textbook algorithm. Two: the bug surface is not in the inner loop, it is in the boilerplate around it. The 2015 timsort stack-invariant bug and the binary-search midpoint overflow Joshua Bloch reported in 2006[17] were both in framing logic (merge state, midpoint arithmetic), not in the comparison itself.
03Why comparison sorts can't beat n log n
Before building any sort, it is worth knowing the floor. The claim is that any algorithm that sorts by comparing pairs of elements (no peeking at their bits, no precomputed table) must do at least n log2 n comparisons in the worst case, up to constants. The proof is a counting argument. It is short and worth doing in full because it explains why radix sort and counting sort can exist at all: they violate the assumption.
Model a comparison sort as a decision tree: the root is the first comparison the algorithm makes; the two children are the two possible outcomes; each leaf is one possible final permutation. For an input of n distinct elements there are n! possible permutations, so the tree needs at least n! leaves. A binary tree with L leaves has height at least log2 L. So the height (the worst-case number of comparisons on some input) is at least log2(n!).
The middle step uses Stirling's approximation: n! > (n/e)n for n ≥ 1. The constant log2(e) ≈ 1.443 is the only thing absorbed into the Ω. The conclusion: any sort that uses only pairwise comparisons does at least n log2 n comparisons in the worst case[18]. Mergesort, heapsort, and well-tuned quicksort hit this bound up to a constant.
The widget below builds the decision tree for n = 3 by hand. There are 3! = 6 permutations, so the tree has 6 leaves and minimum height ⌈log2(6)⌉ = 3 comparisons. Click a permutation to highlight the path:
So how does radix sort beat the bound?
Radix sort never asks "is a less than b?". It looks at the bits of each key directly: bucket by the low byte, then the next byte, then the next. The decision-tree model does not apply because the algorithm is not making binary decisions on comparison outcomes; it is doing a constant amount of work per byte per element.
For an n-element array of w-bit keys, radix sort runs in O(w · n / b) where b is the bits per pass. With w = 32, b = 8, that is four linear passes. Asymptotically it is O(n) when w is fixed (which it is for draw-call keys). The bound in §3 only forbids beating n log n with comparisons.
04Insertion sort, the small-n champion
Insertion sort is the sort every production routine falls back to at small sizes. The asymptotic class is O(n²), which textbooks then file under "bad" and move on. The textbooks are wrong about that part. Insertion sort wins outright below roughly 16 to 32 elements on modern CPUs, which is exactly the regime where the recursive sorts spend most of their time once they have partitioned the data down. The reasons it wins are not asymptotic, they are mechanical: a tight inner loop that the branch predictor learns instantly, no recursion overhead, no pivot selection, and a memory access pattern that prefetches perfectly.
template<typename It, typename Compare> void insertion_sort(It first, It last, Compare less) { // Outer loop: bring the i-th element into its place among [first, i). for (It cursor = first + 1; cursor != last; ++cursor) { auto key = std::move(*cursor); // Save the element we're placing. It target = cursor; // Walk left while the element on the left is bigger than key. while (target != first && less(key, *(target - 1))) { *target = std::move(*(target - 1)); // Shift the bigger one right by one slot. --target; } *target = std::move(key); // Drop key into the gap. } }
Cost analysis: the outer loop runs n − 1 times; the inner loop walks back as far as the new element needs to go. On a fully sorted input the inner loop exits immediately and the total cost is O(n). On a reverse-sorted input the inner loop walks all the way back every time and the total cost is n(n−1)/2 = Θ(n²). The average case on random data is Θ(n²) with constant ½: half of the elements on the left are larger on average.
The widget below races insertion sort against a recursive sort (a simple quicksort) at varying array sizes. The crossover is the cutoff every production sort uses:
Why don't all sorts cut over at the same n?
The crossover depends on the cost of one comparison. For primitive types (ints, floats), the comparator is one CPU instruction, so insertion sort's quadratic comparisons are cheap and the crossover sits higher (Java's DualPivotQuicksort cuts at n=44). For boxed objects with virtual comparators, each compare is a pointer-dereference plus an indirect call, which is expensive, and the crossover sits lower (Java's timsort uses a 32-element minimum run length, effectively the insertion-sort threshold; libc++ uses n=6 for non-trivial types).
Cache behavior matters too. Insertion sort touches contiguous memory in a tight loop; on a hot cache line it does a few dozen swaps before missing. Quicksort partition reads from both ends of the range, which hits two cache lines at once. At small n that overhead dominates.
05Quicksort: partition, recurse, pray
Tony Hoare's quicksort, from a 1959 stint at Moscow State University, is the most-cited sorting algorithm and the foundation of every modern production sort except the timsort-family. The idea is one of the cleanest in algorithm design: pick a pivot, partition the array into "less than pivot" and "greater than pivot," recurse on each half. The expected cost is n log n, but the worst case is n², and the difference depends entirely on how you choose the pivot.
The partition
Two partition schemes ship in production code. Hoare's partition is faster on average; Lomuto's partition is simpler to write correctly. Most modern sorts use a tuned variant of Hoare's. The C++ pseudocode for Hoare-style:
template<typename It, typename Compare> It hoare_partition(It first, It last, Compare less) { auto pivot = *first; // In production, median-of-3 here, not just first. It leftCursor = first - 1; It rightCursor = last; while (true) { // Walk leftCursor right until we find something ≥ pivot. do { ++leftCursor; } while (less(*leftCursor, pivot)); // Walk rightCursor left until we find something ≤ pivot. do { --rightCursor; } while (less(pivot, *rightCursor)); if (leftCursor >= rightCursor) return rightCursor; // Cursors met or crossed; done. std::iter_swap(leftCursor, rightCursor); // Swap the out-of-place pair. } }
The recursion is the easy part. The hard part, the part where every quicksort variant earns its keep, is picking the pivot.
The pivot problem
If the pivot is the median of the data, quicksort halves the array each call and runs in n log n. If the pivot is the minimum (or maximum), one partition is empty and the recursion depth equals n: quadratic, plus a stack overflow.
Pivot strategies in roughly increasing sophistication:
- First element. Trivial. Worst case on already-sorted inputs, which is the most common shape in real data. This is the textbook version; nobody ships it.
- Random element. Expected n log n, no adversarial input can force quadratic, but in practice the random number generator costs more per call than the saved comparisons.
- Median-of-three. Sample the first, middle, and last element; use their median. Defeats sorted and reverse-sorted inputs. Bentley and McIlroy's 1993 paper uses this for small arrays[8].
- Pseudomedian-of-nine ("ninther"). For large arrays, sample three groups of three, take the median of each, then the median of those medians. Bentley and McIlroy again. Much harder to defeat with crafted input.
- Median of medians (BFPRT). Compute the exact median in O(n). Guarantees n log n worst case but the constant is so bad that nobody uses it for sorting outside textbooks.
Modern hybrids (introsort, pdqsort) give up on pivot perfection and accept that any deterministic strategy can be defeated. They monitor partition quality, and if the partition is too imbalanced, they take a fallback path: introsort switches to heapsort, pdqsort shuffles pivot candidates. §8 and §10.
The widget below shows quicksort running on data you control. Switch the pivot strategy and the input shape to see when quicksort runs at n log n and when it falls apart:
06Mergesort, the stability champion
Mergesort is von Neumann's algorithm. Split the array in half; recursively sort each half; merge the two sorted halves into one. The merge step is the substance of the algorithm: two cursors walking forward, copying the smaller of the two current elements into the output. The recursion is log2n levels deep; each level does O(n) work; total n log n, worst case and average case both.
Mergesort's selling points over quicksort:
- Stable. Elements with equal keys end up in their original relative order. This is required for the "sort by department, then sort by name" pattern, and for draw-call sorts where ties are broken by submission order.
- Worst case is n log n. No adversarial input. Quicksort needs pivot tricks; mergesort needs nothing.
- External-memory friendly. Merge is purely sequential reads and writes, no random access. Sorts on disk and tape were almost always mergesorts in the mainframe era and still are for very large inputs.
- Parallelizable cleanly. Each split can run on its own core with no synchronization until the merge.
The drawbacks are the reason it is not the default in-memory sort. Mergesort needs O(n) auxiliary memory for the merge buffer; quicksort sorts in place with O(log n) stack. Cache-wise, the merge step reads from two ranges and writes to a third, which on most CPUs costs more than quicksort's two-cursor partition that reads and writes the same range. For in-memory primitive arrays the constant on mergesort is about 1.5× to 2× quicksort, which is why std::sort in C++ is not mergesort.
template<typename T> void merge(T* a, T* buffer, int lo, int mid, int hi) { // Copy [lo, hi) to scratch buffer; merge from buffer back into a. for (int k = lo; k < hi; ++k) buffer[k] = a[k]; int leftCursor = lo, rightCursor = mid; for (int writeAt = lo; writeAt < hi; ++writeAt) { if (leftCursor >= mid) a[writeAt] = buffer[rightCursor++]; else if (rightCursor >= hi) a[writeAt] = buffer[leftCursor++]; else if (buffer[rightCursor] < buffer[leftCursor]) a[writeAt] = buffer[rightCursor++]; else a[writeAt] = buffer[leftCursor++]; // '<' not '<=': preserves stability. } } template<typename T> void mergesort(T* a, T* buffer, int lo, int hi) { if (hi - lo < 2) return; // Production: cut over to insertion sort below ~16. int mid = lo + (hi - lo) / 2; // Note: (lo + hi) / 2 overflows; this avoids that bug. mergesort(a, buffer, lo, mid); mergesort(a, buffer, mid, hi); merge(a, buffer, lo, mid, hi); }
The midpoint comment is not a stylistic note. Joshua Bloch found in 2006 that the canonical (lo + hi) / 2 in the Java standard library's binary search overflowed on arrays larger than 230, a bug that had survived since 1969[17]. The fix, lo + (hi − lo) / 2, is the same one mergesort needs. Use it.
07Heapsort, the worst-case insurance policy
Heapsort, J.W.J. Williams 1964[6], is the in-place worst-case-n log n sort. Quicksort can degrade to quadratic; mergesort needs n auxiliary words; heapsort needs neither. Production code uses heapsort as a fallback: introsort calls it when recursion depth exceeds 2 log2n, and embedded systems that cannot afford the merge buffer use it as their default.
The algorithm has two phases. First, build a max-heap in place: the largest element ends up at index 0. Then repeatedly swap the root (largest remaining) to the end of the array and sift down what landed at the root to restore the heap property. Each swap-pop step takes O(log n) sift-down time, done n times: O(n log n) total.
The non-obvious part is the build. The naive method (insert n elements one by one) is n log n. Floyd's bottom-up heapify, published the same year as Williams' paper[7], is O(n). The proof is a geometric sum: half the array is at the leaves and needs zero sift-down, a quarter is at level h−1 and needs at most one sift-down step, an eighth needs at most two, and the sum Σ k · 2−k is bounded by a constant.
template<typename T> void sift_down(T* a, int root, int heapEnd) { // Walk root down toward the leaves; at each step, swap with the larger child if needed. while (true) { int leftChild = 2 * root + 1; int rightChild = 2 * root + 2; int biggest = root; if (leftChild < heapEnd && a[leftChild] > a[biggest]) biggest = leftChild; if (rightChild < heapEnd && a[rightChild] > a[biggest]) biggest = rightChild; if (biggest == root) return; // Heap property restored. std::swap(a[root], a[biggest]); root = biggest; // Follow the swap down. } } template<typename T> void heapsort(T* a, int n) { // Phase 1: build a max-heap in place, bottom-up. Floyd's O(n) heapify. for (int i = n / 2 - 1; i >= 0; --i) sift_down(a, i, n); // Phase 2: extract the max n times, growing a sorted suffix. for (int heapEnd = n - 1; heapEnd > 0; --heapEnd) { std::swap(a[0], a[heapEnd]); // Largest remaining moves to its final slot. sift_down(a, 0, heapEnd); // Restore heap on the shrinking prefix. } }
The performance footnote is the part textbooks usually omit. Heapsort is n log n in the worst case, but on random data its constant is roughly 2 to 3 times worse than quicksort's[9]. The reason is cache: heapsort's sift-down walks a parent-child chain that doubles in stride at every level, so once the heap exceeds the L1 cache the parent reads cost L2 or memory latency. Quicksort partitions over contiguous memory, prefetches well. Heapsort is the right answer when you need the worst-case guarantee; it is the wrong answer when you do not.
08Introsort: quicksort with a parachute
David Musser's 1997 introsort[9] is the algorithm that made quicksort safe to standardize. The idea is one sentence long: run quicksort, monitor recursion depth, and if it exceeds 2 log2n, finish the current subarray with heapsort. Quicksort's average case stays intact; heapsort's worst-case bound becomes the worst case for the whole algorithm. Add an insertion-sort cutoff at small n, and you have the std::sort that has shipped in libstdc++ since GCC 3.x and in MSVC since Visual Studio 2008. libc++ implemented the n log n worst-case guarantee with introsort in LLVM 14 (2022); LLVM 17 (2023) replaced it with a pdqsort-derived implementation contributed by Google (§10)[10].
template<typename It> void introsort_loop(It first, It last, int depthBudget) { while (last - first > 16) { // Insertion-sort cutoff at the end of the call. if (depthBudget == 0) { // Quicksort recursion is going too deep. heapsort(first, last); // Fall back to the worst-case-safe sort. return; } --depthBudget; It pivot = median_of_three(first, first + (last - first) / 2, last - 1); It cut = hoare_partition(first, last, *pivot); introsort_loop(cut, last, depthBudget); // Recurse on the right half. last = cut; // Loop again on the left half (tail-call elimination). } } template<typename It> void introsort(It first, It last) { int depthBudget = 2 * log2(last - first); // 2 log₂n is the introsort constant. introsort_loop(first, last, depthBudget); insertion_sort(first, last); // One final pass cleans up everything < 16. }
Musser's killer-sequence experiment is worth repeating: he constructed a 100,000-element input designed to defeat median-of-three quicksort and measured introsort's running time at 1/200 of the median-of-three quicksort's[9]. Heapsort kicks in once, finishes the bad subarray in n log n, and the average-case behavior is untouched on the rest of the input.
Why 2 log2n?
The constant in front of the log is a tradeoff. A perfectly balanced quicksort recurses exactly log2n deep; a median-of-three quicksort on random data averages somewhat above that. The factor 2 gives quicksort enough rope to finish on inputs that lean unlucky but not pathological. Set it to 1.0 and heapsort triggers on benign inputs; set it to 4.0 and the quadratic worst case approaches before heapsort saves you. 2 log2n is Musser's measured sweet spot and has been left alone.
Introsort's downside is that it inherits all of quicksort's bad cases for "many equal keys": the standard Hoare partition stalls when most elements equal the pivot. Bentley and McIlroy 1993 solved this with a three-way partition (the Dutch national flag problem), which is what makes pdqsort, in §10, genuinely better than introsort on real data.
09Timsort, the natural-runs mergesort
Tim Peters started writing timsort for Python's list.sort in 2002. The design document, Objects/listsort.txt in the CPython repo, is the best how-to-write-up-a-sort artifact in the wild and worth reading in full[11]. The premise: real data is not random. Logs come in mostly-sorted, then with a few out-of-order entries. Database query results arrive in chunks that are themselves sorted. Game-engine event streams from the prior frame are usually monotonic in timestamp. A sort that exploits existing order beats one that ignores it.
Timsort is a stable mergesort tuned for runs. The four moving parts:
- Run detection. Scan left-to-right looking for the longest stretch of consecutive elements that is non-decreasing or strictly decreasing. Reverse the decreasing runs in place (so they become increasing). A "run" is the unit of work.
- Minimum run length (minrun). If the natural run is shorter than minrun, use insertion sort to extend it to minrun elements. Minrun is chosen between 32 and 64 so that n / minrun is a power of two (or just under), which gives balanced merges[11].
- The merge stack. Push each run onto a stack. After each push, check three Fibonacci-style invariants on the top three runs (X, Y, Z, top of stack last): runlen(X) > runlen(Y) + runlen(Z), runlen(Y) > runlen(Z), runlen(X+Y) > runlen(Z). If any is violated, merge Y with the smaller of X and Z. This keeps merges roughly balanced.
- Galloping merge. A merge between two runs that are very imbalanced (most of one run is smaller than most of the other) uses exponential search instead of linear, so a long suffix of "wins" by one side does not pay O(k) for each element.
The 2015 invariant bug
Stijn de Gouw and colleagues at CWI Amsterdam, working in the KeY verification tool to prove timsort correct, discovered that the merge stack's invariant was not actually being maintained[14]. mergeCollapse only checked the invariant against the top three runs of the stack. After it merged the top runs and pushed the result back, runs further down the stack could be left in a state that violated the invariant globally. The error did not produce wrong output (the sort still returned a sorted array), but it broke the bound on stack depth: de Gouw's team constructed run-length sequences that overflowed the fixed-size stack OpenJDK had allocated, crashing the JVM with ArrayIndexOutOfBoundsException.
Two fixes were available: enlarge the stack to cover the actual worst case under the broken invariant, or fix mergeCollapse to enforce the invariant for all stack entries. Both Python (bpo-23515, fixed in 3.4.4 / 2.7.11, December 2015) and OpenJDK chose the first path: re-derive the stack size from a worst-case analysis. de Gouw's paper proposes the second path as the more principled fix; it has not been adopted in mainline implementations.
Two lessons. One, formal verification of standard-library sort is a tractable thing in 2026 and was not in 2002. Two, the inner loops of sort algorithms get exhaustively tested; the framing logic (stack invariants, merge ordering, recursion bounds) gets read by few and proved by fewer. The bug had been in production for thirteen years.
10pdqsort, the branchless modern quicksort
Pattern-defeating quicksort, Orson Peters 2021[16], is the descendant of introsort. It adds three ideas that, together, make it the fastest general-purpose unstable sort on most real inputs:
- BlockQuicksort partitioning. Edelkamp and Weiß 2016[21] showed that the quicksort partition's unpredictable branch on the pivot comparison costs more than the comparison itself. Their fix: collect the comparison outcomes for a block of ~64 elements into a bitmask first, then do the swaps in a separate pass with cmov-style branchless code. Branch mispredicts go from one-per-element to zero.
- Pattern defeat. Pdqsort tracks whether the most recent partition was unbalanced (one side < n/8, say). If yes, it shuffles the pivot candidates to a different position before retrying, which probabilistically breaks any adversarial input. If the imbalanced partition keeps happening, it falls back to heapsort the way introsort does.
- Three-way partition for equal keys. When pdqsort detects that many elements equal the pivot (the partition repeatedly produces a tiny right side), it switches to a three-way partition (less, equal, greater) and the equal-elements region gets skipped entirely on the recursion. Inputs with many duplicate keys run in O(n · k) where k is the number of distinct values.
Pdqsort is now the default unstable sort in Rust (slice::sort_unstable), one of the algorithms shipped in Boost.Sort, and the basis of the new std::sort that Google contributed to LLVM libc++ in 2022. On Google's internal benchmarks of std::sort, the pdqsort-aligned implementation produced a ~50% speedup on integer arrays and a ~10% to 20% speedup on mixed object workloads[22].
The branchless partition is worth seeing in detail. The conventional Hoare partition has a data-dependent branch on the pivot comparison; on random data the branch predictor is wrong half the time. BlockQuicksort buffers two arrays of indices (the positions of "wrong-side" elements found in the left and right blocks), then swaps them in a paired pass. The inner loop has no data-dependent branches at all on modern x86 because the comparison result is folded into an arithmetic operation:
template<typename T> void block_scan_left(T* block, T pivot, unsigned char* offsets, int& count) { count = 0; for (int i = 0; i < 64; ++i) { offsets[count] = (unsigned char)i; count += (block[i] >= pivot); // No branch: bool-to-int extends the count. } } // Symmetric block_scan_right. Then a second loop swaps element pairs from the // two offset arrays; that pass has one predictable loop branch and no data-dependent ones.
Combined effect: pdqsort matches introsort on adversarial inputs (both fall back to heapsort), beats it by 30 to 70% on integer workloads (branchless partition), beats it dramatically on inputs with many equal keys (three-way partition), and runs in linear time on the "ascending then a few out-of-order" and "descending then a few out-of-order" patterns that timsort catches but introsort handles in n log n.
11Powersort: timsort with a proof
Munro and Wild's 2018 paper "Nearly-Optimal Mergesorts"[15] attacked the part of timsort that even Tim Peters described as a heuristic: the merge policy. Timsort's three-element stack-invariant check is a pragmatic rule that mostly produces balanced merges. Mostly is not the same as optimal. Powersort replaces the heuristic with a rule derived from the Mehlhorn nearly-optimal binary search tree construction: assign each "run boundary" a power based on its position in the array, and merge runs whose power is exceeded by the next boundary.
The result is a stable mergesort that is provably within an additive linear term of the information-theoretic optimum for the entropy of the run-length sequence. In English: powersort is at least as good as timsort on every input, and strictly better when the input has many runs of unequal length. Tim Peters merged powersort into CPython in September 2021 (gh-78742 / PR #28108), shipping in Python 3.11 in October 2022[23]. The patch kept the other timsort machinery (run detection, galloping, minrun, insertion-sort cutoff) intact and swapped only the merge-trigger rule. PyPy, Apple's WebKit, and AssemblyScript adopted powersort in the same period.
Powersort is the right default for "modern stable sort" in a fresh codebase. Timsort with the original heuristic is what shipped, and what most existing systems still run, because the difference is measurable but not dramatic (within 5 to 10% on most workloads, up to 30% on inputs with long structured runs). The shape of the algorithm is the same: run-detection, push-on-stack, merge-when-policy-triggers. Powersort changes one rule, the merge trigger, and inherits everything else.
What is a "power" of a run boundary?
Powersort assigns each boundary between two runs a power equal to the number of leading zero bits in the binary representation of (start1 + len1 + start2) / total_length, viewed as a fraction. A boundary that bisects the array gets a high power; a boundary near one end gets a low power.
The merge rule: maintain a stack; when pushing a new run, merge previous-top runs whose boundary-power is at least the new run's boundary-power, then push. This is equivalent to building a balanced binary tree of merges over the runs, weighted by their length. Mehlhorn 1977 proved this construction gives a tree whose total weighted path length is within an additive linear term of optimal[24].
The asymmetry with timsort: timsort decides whether to merge based on the lengths of the top three runs. Powersort decides based on the position of the boundary in the array. The latter is a property of the input structure, not the algorithm's state, so it produces strictly better merge trees.
12Radix sort: linear time, no comparisons
Radix sort is the right answer for "I have a million 32-bit keys and I want them sorted." It runs in O(d · n) where d is the number of digits (or bytes, or whatever radix you pick), reads memory in long sequential streams, and parallelizes cleanly across cores or GPU threads. The cost is restricted to keys that are bounded-width integers or fixed-precision floats reinterpreted as integers. Every modern engine renderer uses it for draw-call sorting (§16).
Two flavors. LSD radix sort is the standard for sorting numeric keys: scan low byte first, then next byte, then next. Each pass is a stable bucket sort on 256 buckets (for a byte-radix). After four passes on 32-bit keys the array is fully sorted. MSD radix sort is the variant for strings and variable-length keys.
void radix_sort_u32(uint32_t* keys, uint32_t* scratch, int n) { uint32_t* src = keys, * dst = scratch; for (int passShift = 0; passShift < 32; passShift += 8) { int bucketCount[256] = {0}; // Pass 1: count how many elements go in each bucket. for (int i = 0; i < n; ++i) ++bucketCount[(src[i] >> passShift) & 0xFF]; // Prefix sum: turn counts into starting offsets. int total = 0; for (int b = 0; b < 256; ++b) { int c = bucketCount[b]; bucketCount[b] = total; total += c; } // Pass 2: scatter elements into dst at their bucket offsets. for (int i = 0; i < n; ++i) { int bucket = (src[i] >> passShift) & 0xFF; dst[bucketCount[bucket]++] = src[i]; } std::swap(src, dst); // Next pass reads from where we just wrote. } if (src != keys) std::memcpy(keys, src, n * sizeof(uint32_t)); }
Cost: 4 passes × n reads × n writes = 8n memory operations, each sequential. On a 16 GB/s memory bus that is roughly 0.5 ns per key per pass: 2 ns/key total for the whole sort. A comparison sort at the same scale costs 30 to 60 ns/key. The crossover with introsort is around n = 100 for 32-bit integer keys; below that the radix sort's setup cost (256-entry bucket histograms, four passes regardless of n) wins.
Sorting floats with a radix sort
The trick that makes radix sort useful in renderers is that IEEE 754 floats, reinterpreted as 32-bit ints, sort correctly for positive values. (Negative-zero handling and negative-value sign-flipping need three extra lines.) Aras Pranckevičius's "Rough sorting by depth" post is the canonical reference for using the top bits of a float as part of a packed sort key[26]: "Floating point numbers have the nice property that if you interpret their bits as integers, larger numbers result in larger integers — within the same sign." A 10-bit slice of a positive float gives you 1024 depth buckets, which is enough granularity for any front-to-back or back-to-front render pass.
13Counting sort: when the key range is tiny
Counting sort is the inner pass radix sort wraps. Given an array of n keys, each in [0, k), sort in O(n + k): count occurrences of each value, prefix-sum the counts, scatter elements into output. When k = O(n) the cost is linear; when k is much larger than n the histogram-allocation cost dominates and counting sort loses to a comparison sort.
Use cases in engines:
- Particle priority bucketing. A particle system with 4-bit priority gives 16 buckets; one counting pass groups particles per frame in O(n).
- Material ID grouping. If material IDs are dense in [0, 256), counting sort by material ID puts every draw call of the same material together with one pass.
- Histogram-based effects. Tone mapping's luminance histogram is a counting sort on log-luminance; the "sort" output is the prefix sums.
Java's Arrays.sort for byte, char, and short primitives uses counting sort, because the key range (256, 65536, 65536) is small enough that the histogram allocation is cheap[12]. For int and longer types it switches to dual-pivot quicksort, because allocating a 4-billion-entry histogram for 32-bit ints is not free.
14Bitonic sort and the GPU
Comparison sorts and radix sort assume one cursor walking sequentially. A GPU has thousands of cursors and they cannot easily coordinate. The sorts that run well on GPUs are sorting networks, where which-position-compares-which is decided ahead of time. Ken Batcher's 1968 bitonic sort[27] is the canonical example.
The structure: a bitonic sequence is one that monotonically increases then monotonically decreases. A bitonic sequence of length n can be sorted by n/2 compare-and-swap operations in parallel (the "bitonic merge"), producing two bitonic sequences of length n/2 each, which then recurse. Total operations: O(n log²n). Total depth (longest chain of dependent operations): O(log²n). On a GPU with enough threads, depth, not total work, determines wall time.
Bitonic sort's cost summary versus the CPU sorts:
| Sort | Total work | Parallel depth | In place | Stable |
|---|---|---|---|---|
| quicksort (serial) | O(n log n) | O(n log n) | yes | no |
| mergesort (serial) | O(n log n) | O(n log n) | no | yes |
| radix sort (serial) | O(d·n) | O(d·n) | aux | yes |
| bitonic sort (parallel) | O(n log²n) | O(log²n) | yes | no |
| GPU radix sort | O(d·n) | O(d log n) | aux | yes |
Bitonic sort does more total work than mergesort (n log²n vs n log n) but every step in a level is independent. On 4096 GPU threads sorting 4096 elements, bitonic finishes in 78 parallel steps; serial mergesort would need ~50,000. The work is wasted relative to the optimum; the wall time is not. NVIDIA's GPU Gems 2, chapter 46, describes the production version[28]; Unreal's particle pipeline uses a bitonic merge sort in GPUSort.cpp to depth-sort transparent particles entirely on the GPU.
15Live race: every sort, side by side
Six sorts running on the same input, animated as bar charts. Click "go" and watch them race. Pick the input shape from the dropdown to see how each algorithm responds to sorted, reverse-sorted, partially-sorted, and few-unique inputs:
16Engine case: sorting draw calls
Every renderer sorts its draw calls before submitting them. The reasons are direct: GPU state changes (binding a new shader, new texture, new vertex buffer) cost real time, and the cost is per change, not per call. Five thousand draw calls in arbitrary order can mean five thousand shader swaps. Five thousand draw calls sorted by shader can mean ten. Sorting is the cheap way to recover most of the throughput a naive renderer leaves on the table.
The Ericson 2008 design pattern, still the standard fifteen years later[1], packs everything that matters into a single 64-bit sort key:
The bit-order encodes the sort hierarchy. Render-target group first, then opaque-vs-transparent split (because their depth orders are opposite), then by shader, then by material, then by depth bucket. The trick: for opaque draws the depth is encoded as the float bits directly (front-to-back, which lets the early-Z reject more pixels); for transparent draws the depth is bit-inverted (back-to-front, required for correct alpha blending). One sort, two orderings.
Once the key is packed, the sort itself is a radix-sort on the integer. Unreal Engine ships RadixSort32 in its core library as a stable 32-bit radix sort designed for this purpose[25]. Bevy evaluated swapping its comparison sort for radsort (a stable radix sort crate) in 2022 and reported the median sort_phase_system time on the many_cubes benchmark dropping from 0.728 ms to 0.094 ms[2]. The result was workload-dependent: phases with many similar keys saw the linear-time radix win; phases with few items saw the radix sort's fixed setup cost lose to a tuned comparison sort. The lesson generalizes: radix-sort wins are most pronounced at scale, on densely-packed keys.
17Engine case: sort-and-sweep broadphase
Collision detection's broadphase finds candidate pairs of objects whose AABBs overlap. The naive algorithm is n²; sweep-and-prune reduces it to roughly n log n per frame by sorting AABB endpoints along an axis and sweeping a moving "active list" across them[29].
The sort here is interesting because the input is almost-sorted from frame to frame. Object positions change a little per frame; the endpoint list mostly preserves order from the previous frame. Insertion sort, which is O(n) on already-sorted input, beats every n log n sort on this workload. Production physics engines (Bullet, Havok, PhysX) use insertion sort exactly here for exactly this reason, sometimes specialized to update the endpoint structure incrementally rather than re-running the sort from scratch.
Some physics engines skip sort-and-sweep entirely. Box2D's modern broadphase uses a b2DynamicTree, a dynamically rebalanced AABB tree, with rotations chosen by the same surface-area heuristic that BVH builders use[30]. The ordering work doesn't disappear, it moves inside the data structure; the tree's incremental rebalancing fills the role of an explicit per-frame sort.
18Picking the right sort
The decision table for "which sort do I reach for":
| Workload | Right answer | Why |
|---|---|---|
| Generic C++, types with cheap compare | std::sort (introsort / pdqsort) | Fast on average, n log n worst case, in place, library-provided. |
| Generic C++, types with expensive compare | pdqsort or timsort | Fewer comparisons; timsort if stability matters, pdqsort if it doesn't. |
| Generic, must be stable | std::stable_sort (timsort or merge) | Stability needs explicit handling; the unstable sorts don't preserve equal-key order. |
| Real data, often partially sorted | timsort (Python, Java, V8) or powersort | Adaptive: O(n) on sorted prefixes, run-detection avoids redundant work. |
| Bounded integer keys, n > 100 | radix sort (LSD) | Linear time, no comparisons, parallelizes cleanly. Engine renderer default. |
| Tiny integer range (k = O(1)) | counting sort | One pass, no comparisons. Java does this for byte/char/short. |
| n < 32, any keys | insertion sort | Recursive sorts pay more overhead than the inner-loop work at this scale. |
| Worst-case latency budget (real-time) | heapsort or introsort | Guaranteed n log n; quicksort alone can blow the budget on bad input. |
| Data on GPU | bitonic sort or GPU radix | Both fit GPU parallelism. Radix wins for fixed-width keys; bitonic for arbitrary compare. |
| Data on disk, larger than RAM | external mergesort | Sequential I/O, log-many-merge, the textbook external-sort algorithm. |
| Strings of varied length | MSD radix or three-way radix quicksort | Avoids re-scanning prefixes; O(N) where N is total character count. |
The most common production mistake is "I'll just call std::sort." It is right roughly 80% of the time. The 20% where it is wrong are the cases that show up in profilers as a percent or two of frame time that nobody notices until they look. The case-study sections above are the canonical wins; the table covers the rest.
19Pitfalls
Sort bugs in production code, ranked by frequency:
- Non-strict weak ordering.
std::sortrequires a comparator that is a strict weak order: irreflexive (!(a < a)), antisymmetric (a < b ⇒ !(b < a)), transitive (a < b ∧ b < c ⇒ a < c), and the equivalence relation derived from "neither is less than the other" must be transitive too. Violations are undefined behavior, including infinite loops and segfaults on libstdc++. A common bug: comparing floats where one might be NaN. NaN < anything is false, so NaN breaks the partition. - Unstable when you needed stable.
std::sortis allowed to reorder equal elements;std::stable_sortis not. Multi-pass sorts (sort by depth, then by material) only work if both passes are stable. The bug shows up as occasional flickering on sprites with the same material and similar depth. - Comparator state. Comparators with mutable state (a counter, a cache) violate the assumption that the order is deterministic. The sort may call the comparator more than once on the same pair, or in arbitrary order. Stateful comparators that produce inconsistent results crash; ones that produce consistent results but mutate are surprisingly hard to detect.
- Integer overflow in the comparator. Writing return a - b; as a comparator works for floats and bounded ints; on 32-bit ints with both very large and very negative values it overflows and produces wrong-sign results. Write return (a > b) - (a < b); or return a < b ? -1 : (b < a ? 1 : 0);.
- Sorting a container that mutates during the sort. Iterators invalidate when the underlying container resizes. The sort can corrupt memory if a comparator triggers a container modification. Common when comparators call user code that triggers asset loads.
- The 2003 Java binary-search overflow. Joshua Bloch found that the canonical (lo + hi) / 2 in Java's stdlib binary search overflowed for arrays larger than 230[17]. The same bug appears in any mergesort midpoint that uses the naive formula. Use lo + (hi − lo) / 2.
- Allocating a merge buffer in the inner loop. A mergesort that
new[]s its scratch buffer for every recursive call costs more in allocation than it saves in algorithmic work. Allocate once at the top. - Sorting pointers when the indirection costs more than the sort. A sort that compares
a->materialagainstb->materialdereferences twice per compare. On n = 10,000 with a 30 ns dereference, that is 300 µs in pointer chasing per pass. Pack the sort key into the array directly (the draw-call key trick).
20What's next
The natural follow-ups, in roughly increasing depth:
- Selection algorithms. "Find the k-th smallest" without sorting the whole array. Quickselect is to quicksort what binary search is to a sorted lookup. Median-of-medians (BFPRT 1973) gives a deterministic linear-time selection. The introselect fallback in Musser 1997 is to selection what introsort is to sorting.
- External-memory sorting. Data larger than RAM. k-way merge sort plus a tape-or-disk layout. The classic algorithm, but the modern version uses NVMe SSDs and the cost model (bandwidth instead of seek time) flips most of the choices.
- Parallel and concurrent sorts. Multi-way mergesort with work stealing, parallel quicksort, GPU radix sort. The work-span model (PRAM, Cilk). Our job systems tutorial introduces the runtime; the parallel-sort algorithms layer on top.
- Top-k and partial sort. When you don't need the whole order, just the smallest k.
std::partial_sortuses a heap-based variant; for very large k, a sample-and-sort approach is faster. - Sorting under adversarial input. Cryptographic-quality hashing of comparison keys to defeat hash-flooding attacks. The 2011 PHP / Java / Python hash-collision DOS was an attack on sort-equivalent data structures.
- Cache-oblivious sorting. Frigo and Leiserson's O(N/B · logM/B(N/B)) bound for sort over memory-transfer cost, not CPU operations[31]. The right model for sorting really large arrays.
For the engine programmer, the most useful next step is empirical: profile your renderer's draw-call sort. The bug is probably std::sort on a pointer-and-string comparator. Pack a 32-bit key, swap it for a radix sort, watch the millisecond come back.
Sources
- Christer Ericson. "Order your graphics draw calls around!" realtimecollisiondetection.net blog, October 2008. realtimecollisiondetection.net. The canonical reference for bit-packed draw-call sort keys; covers the render-target / opaque-transparent / shader / material / depth bit-order pattern.
- Bevy Engine. "Use radix sort for sort phase and sprite sorting." GitHub issue #4291, 2022. github.com/bevyengine/bevy/issues/4291. Source for the 0.728 ms → 0.094 ms median-sort-time figure on the many_cubes benchmark and the radsort migration discussion.
- Peter Norvig. "Teach Yourself Programming in Ten Years," "Approximate timing for various operations" table. norvig.com/21-days.html. Memory-hierarchy latency ballparks used to estimate per-comparison cost: L1 ~0.5 ns, branch mispredict ~5 ns, DRAM ~100 ns.
- Donald E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching, 2nd ed., Addison-Wesley, 1998. Chapters 5.1 to 5.3 cover the history of mergesort, the analysis of comparison-based sorting, and the lower-bound proof.
- C. A. R. Hoare. "Quicksort." The Computer Journal 5(1):10–16, 1962. comjnl 5.1.10. The original quicksort paper. Hoare describes the algorithm he wrote at Moscow State University in 1959 while working on machine translation.
- J. W. J. Williams. "Algorithm 232 - Heapsort." Communications of the ACM 7(6):347–348, 1964. dl.acm.org/doi/10.1145/512274.512284. The original heapsort paper; also introduces the binary heap as a standalone data structure.
- Robert W. Floyd. "Algorithm 245 - Treesort 3." Communications of the ACM 7(12):701, 1964. The bottom-up heap-construction algorithm that brings heapify from O(n log n) to O(n).
- Jon L. Bentley, M. Douglas McIlroy. "Engineering a Sort Function." Software: Practice and Experience 23(11):1249–1265, November 1993. PDF. The paper most BSD
qsortimplementations descend from; introduces pseudomedian-of-nine and the three-way partition for the Dutch-flag problem. - David R. Musser. "Introspective Sorting and Selection Algorithms." Software: Practice and Experience 27(8):983–993, August 1997. PDF. The introsort paper. The 2 log2n depth budget and the killer-sequence experiment (1/200 of median-of-three quicksort's time) are both here.
- LLVM Project. "libcxx
std::sort." Implementation inlibcxx/include/__algorithm/sort.h. github.com/llvm/llvm-project. Source for the insertion-sort cutoff sizes (30 for trivial swaps, 6 otherwise) and the depth-budget constant. - Tim Peters. listsort.txt, the timsort design document. CPython repository,
Objects/listsort.txt. github.com/python/cpython. Source for the minrun-in-[32, 64] choice, the merge-stack Fibonacci-style invariants, galloping mode, and the empirical motivation (real-world inputs are not random). - Vladimir Yaroslavskiy. "Dual-Pivot Quicksort algorithm." Proposal to the OpenJDK core-libs mailing list, 2009. PDF (archived). The dual-pivot quicksort that became
java.util.Arrays.sortfor primitive types in Java 7. - Bernhard Beckert, Jonas Schiffl, Peter H. Schmitt, Mattias Ulbrich. "Proving JDK's Dual Pivot Quicksort Correct." Verified Software: Theories, Tools, and Experiments (VSTTE), 2017. PDF. The KeY-based formal verification of Yaroslavskiy's algorithm; turns "we believe this works" into a machine-checkable proof.
- Stijn de Gouw, Jurriaan Rot, Frank S. de Boer, Richard Bubel, Reiner Hähnle. "OpenJDK's java.utils.Collection.sort() is broken: The Good, the Bad and the Worst Case." Computer Aided Verification (CAV), 2015. The discovery of the 2002–2015 timsort stack-invariant bug; Python 3.4.4 and OpenJDK both patched as a result. envisage-project.eu summary.
- J. Ian Munro, Sebastian Wild. "Nearly-Optimal Mergesorts: Fast, Practical Sorting Methods That Optimally Adapt to Existing Runs." European Symposium on Algorithms (ESA), 2018. arXiv:1805.04154. The powersort paper. Adopted by CPython 3.11, PyPy, AssemblyScript, and Apple's WebKit.
- Orson Peters. "Pattern-defeating Quicksort." arXiv preprint, June 2021. arXiv:2106.05123. The pdqsort paper. Now the default unstable sort in Rust (
slice::sort_unstable) and in Boost.Sort. - Joshua Bloch. "Nearly All Binary Searches and Mergesorts are Broken." Google Research Blog, June 2006. ai.googleblog.com. The discovery that (lo + hi) / 2 overflows for arrays larger than 230 and the fix that propagated through every standard library.
- Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms, 4th ed., MIT Press, 2022, chapter 8 "Sorting in Linear Time" and chapter 8.1 "Lower bounds for sorting." The decision-tree lower-bound proof in canonical form.
- Lester R. Ford, Selmer M. Johnson. "A Tournament Problem." The American Mathematical Monthly 66(5):387–389, May 1959. The merge-insertion sort that achieves the information-theoretic minimum number of comparisons for small n.
- M. Douglas McIlroy. "A Killer Adversary for Quicksort." Software: Practice and Experience 29(4):341–344, April 1999. PDF. Shows that for any deterministic pivot rule, an adversary can construct input forcing quadratic time. The motivation for the introsort fallback.
- Stefan Edelkamp, Armin Weiß. "BlockQuicksort: Avoiding Branch Mispredictions in Quicksort." European Symposium on Algorithms (ESA), 2016. arXiv:1604.06697. The block-partition technique that pdqsort builds on; demonstrates 2× to 3× speedup from branchless partitioning on integer arrays.
- Danila Kutenin. "Changing std::sort at Google's Scale and Beyond." danlark.org blog, April 2022. danlark.org. Source for the ~50% speedup figure on integer arrays from swapping libc++'s introsort for pdqsort across Google's codebase.
- CPython issue gh-78742 / Pull Request #28108. "Use Powersort merge strategy." Author: Tim Peters; merged September 6, 2021; shipped in Python 3.11 (October 2022). github.com/python/cpython/pull/28108.
- Kurt Mehlhorn. "A best possible bound for the weighted path length of binary search trees." SIAM Journal on Computing 6(2):235–239, 1977. The nearly-optimal binary-search-tree construction whose principle powersort applies to merge ordering.
- Epic Games. "RadixSort32." Unreal Engine documentation. docs.unrealengine.com. The 32-bit radix sort that ships in Unreal's core for draw-call sorting and similar use cases.
- Aras Pranckevičius. "Rough sorting by depth." aras-p.info blog, January 2014. aras-p.info. The reference for using the top bits of a float-as-int as a radix-sortable depth field.
- Kenneth E. Batcher. "Sorting networks and their applications." Proc. AFIPS Spring Joint Computer Conference, 1968, pp. 307–314. PDF. The original bitonic sort paper and the foundational sorting-network reference.
- Peter Kipfer, Rüdiger Westermann. "Improved GPU Sorting." GPU Gems 2, chapter 46, NVIDIA, 2005. developer.nvidia.com. Production GPU bitonic sort implementation with cache-tiling and merge-step optimizations.
- Jonathan D. Cohen, Ming C. Lin, Dinesh Manocha, Madhav K. Ponamgi. "I-COLLIDE: An Interactive and Exact Collision Detection System for Large-Scale Environments." Proc. ACM Interactive 3D Graphics, 1995. The original sweep-and-prune (sort-and-sweep) paper.
- Ingo Wald, Solomon Boulos, Peter Shirley. "Ray Tracing Deformable Scenes Using Dynamic Bounding Volume Hierarchies." ACM Transactions on Graphics 26(1), 2007. The surface-area heuristic for BVH construction; the same heuristic drives the tree-rebalancing pass in physics broadphase trees.
- Matteo Frigo, Charles E. Leiserson, Harald Prokop, Sridhar Ramachandran. "Cache-Oblivious Algorithms." Proc. 40th IEEE FOCS, 1999. PDF. The cache-oblivious sort bound, O(N/B · logM/B(N/B)), that becomes the right complexity model for very large arrays.