All tutorials Mighty Professional
Tutorial 08 · Engine Programming

Sorting Algorithms

Every frame your engine sorts something: draw calls by material and depth, transparent sprites back-to-front, particles, broad-phase AABBs, animation events. The routines doing the work are not std::sort as your textbook drew it. They are introsort with insertion-sort cutoffs, pdqsort with branchless partitioning, timsort with a Fibonacci stack, and 32-bit radix passes packed by hand. We build each one from scratch, run them live in your browser, and end with three case studies from shipping engines.

Time~60 min LevelJunior to mid; review for senior PrereqsYou can read C++ or pseudocode at intermediate level. Big O notation (read the prior tutorial if it feels rusty). HardwareNone, but a feel for cache and branch prediction helps in §10 and §16.

01Why a sort, not a stopwatch

The frame budget at 60 fps is 16.67 ms. A modern open-world title submits 5,000 to 15,000 draw calls per frame and wants them grouped by render target, then shader, then material, then depth, so the GPU swaps state as few times as possible[1]. A comparison sort with a comparator that dereferences a material or shader pointer per element does on the order of 1.4 n log n key comparisons (a hundred thousand to a quarter-million at this scale), each spending tens of nanoseconds once the pointer chasing leaves L1; the total can run past a millisecond. A four-pass radix sort over a packed 32-bit key finishes the same work in roughly 0.1 ms. Bevy benchmarked exactly this swap in 2022 and reported median sort time on its opaque-3D sort phase (the many_cubes benchmark) dropping from 0.728 ms to 0.094 ms[2]. The difference is not the asymptotic class. The difference is which sort, on what key, with what memory layout.

This tutorial walks through the family of sorts an engine programmer actually meets. Comparison sorts (quicksort, mergesort, heapsort, and their modern hybrids) are bounded below by Ω(n log n) for n distinct elements, a bound we prove in §3. Non-comparison sorts (radix, counting, bucket) escape that bound by looking at the bits of the keys directly, and that is why engine renderers reach for them. Parallel sorts (bitonic, GPU radix) trade total work for parallel depth, which is what makes them fit on a GPU.

What you'll have by the end

A working mental model of every sort that ships in a major language runtime or game engine. Knowledge of which one to pick on which data, why std::sort in libstdc++ is introsort while libc++ and Rust ship pdqsort-derived implementations, why Python and V8 use timsort (and why Python 3.11 swapped it for powersort), how the timsort stack-invariant bug shipped for thirteen years before anyone noticed, why Unreal sorts particles on the GPU with bitonic merge but draw calls on the CPU with radix, and how to pack a 32-bit sort key that puts opaque-front-to-back, then transparent-back-to-front, in one comparison.

The cost of the wrong sort

The widget below puts a comparison sort and a radix sort on the same axis: object count along the X, milliseconds per frame along the Y. The comparison curve is n log n with a measured per-comparison cost of ~30 ns (the cost of one pointer dereference plus one string compare, ballparked from Norvig's latency numbers[3]). The radix curve is linear in n at ~3 ns per element per pass, four passes for a 32-bit key. The crossover is around 200 items; above 1,000 the gap is order-of-magnitude:

Live · Comparison vs radix at frame scale
comparison sort
···
radix sort (4 pass)
···
frame budget at 60 fps
16.67 ms
The comparison curve assumes a sort that hits memory on every compare; if your keys fit in registers and your data is hot in L1, the per-compare drops to 5 to 8 ns and the crossover moves right. The radix curve assumes a 32-bit packed key with one byte per pass; widening to 64 bits and 8 passes doubles its slope but does not change the shape. Bevy benchmarked this kind of swap in 2022 and saw its opaque-3D sort phase drop from 0.728 ms to 0.094 ms on the many_cubes benchmark[2].

02A short history of sorting

Sorting predates computing by a century: tabulating machines for the 1890 US Census used punched-card sort routines invented by Herman Hollerith, and the algorithm they ran was a radix-by-column LSD pass. Modern sorting research started with von Neumann in 1945 and is still active. The dates that matter for engine work:

1945
John von Neumann describes merge sort.[4] The first computer-program sort, designed for the EDVAC. Two passes per merge level, log2n levels, the bound is n log n. Stable, in-place is awkward, external-memory friendly. Knuth's Art of Computer Programming vol. 3 traces the lineage.
1959
Shell sort. Donald Shell publishes a gap-decreasing insertion sort variant. Not optimal asymptotically (analysis is messy), but it was the first sort that beat insertion in practice on real machines. Still appears occasionally in embedded contexts for its small code footprint.
1961
Tony Hoare publishes quicksort.[5] First appears as Algorithm 64 in Communications of the ACM, July 1961; the analysis paper follows in The Computer Journal the next year. Hoare had devised it in 1959 working on a Russian-English machine translation project at the UK's National Physical Laboratory, after a year studying probability at Moscow State University under Kolmogorov. The partition step swaps elements around a pivot until the array is split into "less than pivot" and "greater than pivot." Expected n log n, worst case . The worst case will matter in §5.
1964
Williams introduces heap sort. Floyd fixes the build.[6] J.W.J. Williams publishes heapsort in Communications of the ACM. Robert Floyd publishes a bottom-up heap-construction algorithm in the same year that brings the build cost from O(n log n) down to O(n)[7]. In-place, guaranteed n log n, cache-hostile.
1993
Bentley and McIlroy engineer the qsort that ships.[8] "Engineering a Sort Function" in Software: Practice and Experience. Pseudomedian-of-nine pivot selection, three-way partition for the Dutch-national-flag problem (many equal keys), explicit small-array fallback. Most BSD qsort implementations and a generation of C libraries descend from this paper.
1997
Musser publishes introsort.[9] "Introspective Sorting and Selection Algorithms" gives quicksort a worst-case guarantee: monitor recursion depth, switch to heapsort once it exceeds 2 log2n. Worst case becomes n log n, average case stays at quicksort's constant. libstdc++ std::sort has been introsort since the GCC 3.x era; MSVC and libc++ followed (libc++ finally caught up in LLVM 14)[10].
2002
Tim Peters writes timsort.[11] Adaptive natural-runs mergesort for CPython's list.sort. The design document Objects/listsort.txt is a model of how to write up a sort. Joshua Bloch's port landed in OpenJDK in 2009 (JDK-6804124) and shipped with Java 7 in 2011 as the default sort for object arrays. V8 adopted timsort for JavaScript Array.prototype.sort in V8 v7.0 (2018).
2009
Yaroslavskiy's dual-pivot quicksort enters the JDK.[12] Vladimir Yaroslavskiy proposes a quicksort variant with two pivots that partitions into three regions per call. Java 7's Arrays.sort for primitive types adopts it. Verified correct in 2017 using the KeY proof tool[13].
2015
The timsort stack-invariant bug.[14] Stijn de Gouw et al. prove, while trying to verify timsort in KeY, that the merge-collapse invariant was wrong. The Java implementation crashes on a constructed sequence of runs; Python's was sized too small in principle but not in practice. The patch made it into Python 3.4.4 and OpenJDK. A sort algorithm in production for thirteen years had a worst-case stack-overflow bug.
2018
Munro and Wild publish powersort.[15] "Nearly-Optimal Mergesorts" gives timsort a merge policy with provable optimality up to an additive linear term. CPython 3.11 swaps timsort's heuristic for the powersort rule in 2022; PyPy, AssemblyScript, and Apple's WebKit follow.
2021
Orson Peters publishes pdqsort.[16] Pattern-defeating quicksort: introsort plus block-partitioning (Edelkamp and Weiß's BlockQuicksort) plus pivot-shuffling on bad partitions plus the Dutch-flag trick for equal keys. Branchless inner partition. Rust's slice::sort_unstable is pdqsort; Boost shipped a port; the algorithm became the de facto modern quicksort.
2026
The sorts that ship. A modern engine ends up with three or four routines: introsort or pdqsort for general comparison sorting, radix sort for fixed-width keys (draw calls, particles), bitonic sort on the GPU for what is already on the GPU. The choice is not "which is fastest" in the abstract; it is "which fits the shape of the data and the memory it lives in."

Two patterns hold across the timeline. One: every winning production sort is a hybrid. Quicksort plus insertion sort plus heapsort fallback (introsort, pdqsort); mergesort plus insertion sort on small runs plus galloping merge (timsort, powersort). Nobody ships the single-routine textbook algorithm. Two: the bug surface is not in the inner loop, it is in the boilerplate around it. The 2015 timsort stack-invariant bug and the binary-search midpoint overflow Joshua Bloch reported in 2006[17] were both in framing logic (merge state, midpoint arithmetic), not in the comparison itself.

03Why comparison sorts can't beat n log n

Before building any sort, it is worth knowing the floor. The claim is that any algorithm that sorts by comparing pairs of elements (no peeking at their bits, no precomputed table) must do at least n log2 n comparisons in the worst case, up to constants. The proof is a counting argument. It is short and worth doing in full because it explains why radix sort and counting sort can exist at all: they violate the assumption.

Model a comparison sort as a decision tree: the root is the first comparison the algorithm makes; the two children are the two possible outcomes; each leaf is one possible final permutation. For an input of n distinct elements there are n! possible permutations, so the tree needs at least n! leaves. A binary tree with L leaves has height at least log2 L. So the height (the worst-case number of comparisons on some input) is at least log2(n!).

eq. 1 · the lower bound T(n) log2(n!) log2((n/e)n) = n log2(n) − n log2(e) = Ω(n log n)

The middle step uses Stirling's approximation: n! > (n/e)n for n ≥ 1. The constant log2(e) ≈ 1.443 is the only thing absorbed into the Ω. The conclusion: any sort that uses only pairwise comparisons does at least n log2 n comparisons in the worst case[18]. Mergesort, heapsort, and well-tuned quicksort hit this bound up to a constant.

The widget below builds the decision tree for n = 3 by hand. There are 3! = 6 permutations, so the tree has 6 leaves and minimum height ⌈log2(6)⌉ = 3 comparisons. Click a permutation to highlight the path:

Live · The decision tree for n = 3
leaves needed
3! = 6
min height
⌈log₂6⌉ = 3
path to this leaf
3
Every comparison sort on three elements has a decision tree shaped like this one (with the comparisons reordered). The shortest possible worst-case path is the tree's height. For n = 3 the optimum is 3 comparisons; for n = 10 it is 22 (⌈log2(10!)⌉ = 22). Merge insertion sort by Ford and Johnson 1959 achieves the information-theoretic optimum for small n[19]; nobody uses it because the bookkeeping costs more than the comparisons it saves.
So how does radix sort beat the bound?

Radix sort never asks "is a less than b?". It looks at the bits of each key directly: bucket by the low byte, then the next byte, then the next. The decision-tree model does not apply because the algorithm is not making binary decisions on comparison outcomes; it is doing a constant amount of work per byte per element.

For an n-element array of w-bit keys, radix sort runs in O(w · n / b) where b is the bits per pass. With w = 32, b = 8, that is four linear passes. Asymptotically it is O(n) when w is fixed (which it is for draw-call keys). The bound in §3 only forbids beating n log n with comparisons.

04Insertion sort, the small-n champion

Insertion sort is the sort every production routine falls back to at small sizes. The asymptotic class is O(n²), which textbooks then file under "bad" and move on. The textbooks are wrong about that part. Insertion sort wins outright below roughly 16 to 32 elements on modern CPUs, which is exactly the regime where the recursive sorts spend most of their time once they have partitioned the data down. The reasons it wins are not asymptotic, they are mechanical: a tight inner loop that the branch predictor learns instantly, no recursion overhead, no pivot selection, and a memory access pattern that prefetches perfectly.

insertion sort · C++
template<typename It, typename Compare>
void insertion_sort(It first, It last, Compare less) {
  // Outer loop: bring the i-th element into its place among [first, i).
  for (It cursor = first + 1; cursor != last; ++cursor) {
    auto key = std::move(*cursor);                // Save the element we're placing.
    It target = cursor;
    // Walk left while the element on the left is bigger than key.
    while (target != first && less(key, *(target - 1))) {
      *target = std::move(*(target - 1));        // Shift the bigger one right by one slot.
      --target;
    }
    *target = std::move(key);                     // Drop key into the gap.
  }
}

Cost analysis: the outer loop runs n − 1 times; the inner loop walks back as far as the new element needs to go. On a fully sorted input the inner loop exits immediately and the total cost is O(n). On a reverse-sorted input the inner loop walks all the way back every time and the total cost is n(n−1)/2 = Θ(n²). The average case on random data is Θ(n²) with constant ½: half of the elements on the left are larger on average.

The widget below races insertion sort against a recursive sort (a simple quicksort) at varying array sizes. The crossover is the cutoff every production sort uses:

Live · Where insertion sort beats quicksort
crossover n
···
insertion at n=16
···
quick at n=16
···
Times are measured by counting array writes; the unit is "swap-equivalents." Switch input shape to see how the crossover moves: insertion sort is unbeatable on already-sorted input (it runs in linear time), and it stays competitive on "few unique" because the inner loop tends to find its home quickly. Cutoff sizes used in production: libstdc++ n=16, libc++ n=30 for trivial swaps else 6, Java's DualPivotQuicksort n=44 (was 47 through Java 14, retuned since), pdqsort n=24, Go n=12[10].
Why don't all sorts cut over at the same n?

The crossover depends on the cost of one comparison. For primitive types (ints, floats), the comparator is one CPU instruction, so insertion sort's quadratic comparisons are cheap and the crossover sits higher (Java's DualPivotQuicksort cuts at n=44). For boxed objects with virtual comparators, each compare is a pointer-dereference plus an indirect call, which is expensive, and the crossover sits lower (Java's timsort uses a 32-element minimum run length, effectively the insertion-sort threshold; libc++ uses n=6 for non-trivial types).

Cache behavior matters too. Insertion sort touches contiguous memory in a tight loop; on a hot cache line it does a few dozen swaps before missing. Quicksort partition reads from both ends of the range, which hits two cache lines at once. At small n that overhead dominates.

05Quicksort: partition, recurse, pray

Tony Hoare's quicksort, from a 1959 stint at Moscow State University, is the most-cited sorting algorithm and the foundation of every modern production sort except the timsort-family. The idea is one of the cleanest in algorithm design: pick a pivot, partition the array into "less than pivot" and "greater than pivot," recurse on each half. The expected cost is n log n, but the worst case is , and the difference depends entirely on how you choose the pivot.

The partition

Two partition schemes ship in production code. Hoare's partition is faster on average; Lomuto's partition is simpler to write correctly. Most modern sorts use a tuned variant of Hoare's. The C++ pseudocode for Hoare-style:

Hoare partition · C++
template<typename It, typename Compare>
It hoare_partition(It first, It last, Compare less) {
  auto pivot = *first;             // In production, median-of-3 here, not just first.
  It leftCursor  = first  - 1;
  It rightCursor = last;
  while (true) {
    // Walk leftCursor right until we find something ≥ pivot.
    do { ++leftCursor;  } while (less(*leftCursor,  pivot));
    // Walk rightCursor left until we find something ≤ pivot.
    do { --rightCursor; } while (less(pivot, *rightCursor));
    if (leftCursor >= rightCursor) return rightCursor;  // Cursors met or crossed; done.
    std::iter_swap(leftCursor, rightCursor);          // Swap the out-of-place pair.
  }
}

The recursion is the easy part. The hard part, the part where every quicksort variant earns its keep, is picking the pivot.

The pivot problem

If the pivot is the median of the data, quicksort halves the array each call and runs in n log n. If the pivot is the minimum (or maximum), one partition is empty and the recursion depth equals n: quadratic, plus a stack overflow.

Pivot strategies in roughly increasing sophistication:

Modern hybrids (introsort, pdqsort) give up on pivot perfection and accept that any deterministic strategy can be defeated. They monitor partition quality, and if the partition is too imbalanced, they take a fallback path: introsort switches to heapsort, pdqsort shuffles pivot candidates. §8 and §10.

The widget below shows quicksort running on data you control. Switch the pivot strategy and the input shape to see when quicksort runs at n log n and when it falls apart:

Live · Quicksort with different pivots
comparisons
···
max recursion depth
···
vs n log n
···
Try "first element" with "sorted input": the comparisons climb to n(n−1)/2 and recursion depth equals n, the textbook quadratic worst case. Switch to "median-of-three" and the same sorted input becomes a best-case n log n. Then try "organ-pipe" with median-of-three: McIlroy 1999 showed that for any deterministic pivot rule, an adversary can construct an input that drives it quadratic[20]. The organ-pipe shape (1, 3, 5, ..., max, ..., 4, 2) is a known killer.

06Mergesort, the stability champion

Mergesort is von Neumann's algorithm. Split the array in half; recursively sort each half; merge the two sorted halves into one. The merge step is the substance of the algorithm: two cursors walking forward, copying the smaller of the two current elements into the output. The recursion is log2n levels deep; each level does O(n) work; total n log n, worst case and average case both.

Mergesort's selling points over quicksort:

The drawbacks are the reason it is not the default in-memory sort. Mergesort needs O(n) auxiliary memory for the merge buffer; quicksort sorts in place with O(log n) stack. Cache-wise, the merge step reads from two ranges and writes to a third, which on most CPUs costs more than quicksort's two-cursor partition that reads and writes the same range. For in-memory primitive arrays the constant on mergesort is about 1.5× to 2× quicksort, which is why std::sort in C++ is not mergesort.

top-down mergesort · C++
template<typename T>
void merge(T* a, T* buffer, int lo, int mid, int hi) {
  // Copy [lo, hi) to scratch buffer; merge from buffer back into a.
  for (int k = lo; k < hi; ++k) buffer[k] = a[k];
  int leftCursor = lo, rightCursor = mid;
  for (int writeAt = lo; writeAt < hi; ++writeAt) {
    if      (leftCursor  >= mid)                       a[writeAt] = buffer[rightCursor++];
    else if (rightCursor >= hi)                        a[writeAt] = buffer[leftCursor++];
    else if (buffer[rightCursor] < buffer[leftCursor]) a[writeAt] = buffer[rightCursor++];
    else                                                a[writeAt] = buffer[leftCursor++];   // '<' not '<=': preserves stability.
  }
}

template<typename T>
void mergesort(T* a, T* buffer, int lo, int hi) {
  if (hi - lo < 2) return;            // Production: cut over to insertion sort below ~16.
  int mid = lo + (hi - lo) / 2;  // Note: (lo + hi) / 2 overflows; this avoids that bug.
  mergesort(a, buffer, lo, mid);
  mergesort(a, buffer, mid, hi);
  merge(a, buffer, lo, mid, hi);
}

The midpoint comment is not a stylistic note. Joshua Bloch found in 2006 that the canonical (lo + hi) / 2 in the Java standard library's binary search overflowed on arrays larger than 230, a bug that had survived since 1969[17]. The fix, lo + (hi − lo) / 2, is the same one mergesort needs. Use it.

Live · The mergesort recursion tree
levels
···
comparisons
···
aux memory
···
The tree has ⌈log2n⌉ levels; each level merges every element exactly once for O(n) work per level. Total: O(n log n). The aux-memory figure is n because the merge writes to a buffer the size of the active range. In-place mergesort exists (rotation-based) but its constant is so bad that nobody ships it: timsort's "in-place" merge uses a buffer no bigger than the smaller of the two runs, which is the practical compromise.

07Heapsort, the worst-case insurance policy

Heapsort, J.W.J. Williams 1964[6], is the in-place worst-case-n log n sort. Quicksort can degrade to quadratic; mergesort needs n auxiliary words; heapsort needs neither. Production code uses heapsort as a fallback: introsort calls it when recursion depth exceeds 2 log2n, and embedded systems that cannot afford the merge buffer use it as their default.

The algorithm has two phases. First, build a max-heap in place: the largest element ends up at index 0. Then repeatedly swap the root (largest remaining) to the end of the array and sift down what landed at the root to restore the heap property. Each swap-pop step takes O(log n) sift-down time, done n times: O(n log n) total.

The non-obvious part is the build. The naive method (insert n elements one by one) is n log n. Floyd's bottom-up heapify, published the same year as Williams' paper[7], is O(n). The proof is a geometric sum: half the array is at the leaves and needs zero sift-down, a quarter is at level h−1 and needs at most one sift-down step, an eighth needs at most two, and the sum Σ k · 2−k is bounded by a constant.

heapsort with Floyd's heapify · C++
template<typename T>
void sift_down(T* a, int root, int heapEnd) {
  // Walk root down toward the leaves; at each step, swap with the larger child if needed.
  while (true) {
    int leftChild  = 2 * root + 1;
    int rightChild = 2 * root + 2;
    int biggest    = root;
    if (leftChild  < heapEnd && a[leftChild]  > a[biggest]) biggest = leftChild;
    if (rightChild < heapEnd && a[rightChild] > a[biggest]) biggest = rightChild;
    if (biggest == root) return;                // Heap property restored.
    std::swap(a[root], a[biggest]);
    root = biggest;                          // Follow the swap down.
  }
}

template<typename T>
void heapsort(T* a, int n) {
  // Phase 1: build a max-heap in place, bottom-up. Floyd's O(n) heapify.
  for (int i = n / 2 - 1; i >= 0; --i) sift_down(a, i, n);
  // Phase 2: extract the max n times, growing a sorted suffix.
  for (int heapEnd = n - 1; heapEnd > 0; --heapEnd) {
    std::swap(a[0], a[heapEnd]);          // Largest remaining moves to its final slot.
    sift_down(a, 0, heapEnd);                // Restore heap on the shrinking prefix.
  }
}

The performance footnote is the part textbooks usually omit. Heapsort is n log n in the worst case, but on random data its constant is roughly 2 to 3 times worse than quicksort's[9]. The reason is cache: heapsort's sift-down walks a parent-child chain that doubles in stride at every level, so once the heap exceeds the L1 cache the parent reads cost L2 or memory latency. Quicksort partitions over contiguous memory, prefetches well. Heapsort is the right answer when you need the worst-case guarantee; it is the wrong answer when you do not.

Live · Heapsort step-by-step
phase
···
comparisons
···
swaps
···
Phase 1 ("build") starts from index ⌊n/2⌋−1 (the last internal node) and works backward toward the root, sifting each one down. Half the array (the leaves) is never touched in this phase. That is Floyd's O(n) build. Phase 2 ("extract") swaps root to the end and re-sifts; each step is one O(log n) sift-down. The visible cost asymmetry is the reason heapsort is the slowest of the n log n sorts in practice.

08Introsort: quicksort with a parachute

David Musser's 1997 introsort[9] is the algorithm that made quicksort safe to standardize. The idea is one sentence long: run quicksort, monitor recursion depth, and if it exceeds 2 log2n, finish the current subarray with heapsort. Quicksort's average case stays intact; heapsort's worst-case bound becomes the worst case for the whole algorithm. Add an insertion-sort cutoff at small n, and you have the std::sort that has shipped in libstdc++ since GCC 3.x and in MSVC since Visual Studio 2008. libc++ implemented the n log n worst-case guarantee with introsort in LLVM 14 (2022); LLVM 17 (2023) replaced it with a pdqsort-derived implementation contributed by Google (§10)[10].

introsort · C++
template<typename It>
void introsort_loop(It first, It last, int depthBudget) {
  while (last - first > 16) {                // Insertion-sort cutoff at the end of the call.
    if (depthBudget == 0) {                  // Quicksort recursion is going too deep.
      heapsort(first, last);                  // Fall back to the worst-case-safe sort.
      return;
    }
    --depthBudget;
    It pivot = median_of_three(first, first + (last - first) / 2, last - 1);
    It cut = hoare_partition(first, last, *pivot);
    introsort_loop(cut, last, depthBudget);          // Recurse on the right half.
    last = cut;                                       // Loop again on the left half (tail-call elimination).
  }
}

template<typename It>
void introsort(It first, It last) {
  int depthBudget = 2 * log2(last - first);     // 2 log₂n is the introsort constant.
  introsort_loop(first, last, depthBudget);
  insertion_sort(first, last);                       // One final pass cleans up everything < 16.
}

Musser's killer-sequence experiment is worth repeating: he constructed a 100,000-element input designed to defeat median-of-three quicksort and measured introsort's running time at 1/200 of the median-of-three quicksort's[9]. Heapsort kicks in once, finishes the bad subarray in n log n, and the average-case behavior is untouched on the rest of the input.

Why 2 log2n?

The constant in front of the log is a tradeoff. A perfectly balanced quicksort recurses exactly log2n deep; a median-of-three quicksort on random data averages somewhat above that. The factor 2 gives quicksort enough rope to finish on inputs that lean unlucky but not pathological. Set it to 1.0 and heapsort triggers on benign inputs; set it to 4.0 and the quadratic worst case approaches before heapsort saves you. 2 log2n is Musser's measured sweet spot and has been left alone.

Introsort's downside is that it inherits all of quicksort's bad cases for "many equal keys": the standard Hoare partition stalls when most elements equal the pivot. Bentley and McIlroy 1993 solved this with a three-way partition (the Dutch national flag problem), which is what makes pdqsort, in §10, genuinely better than introsort on real data.

09Timsort, the natural-runs mergesort

Tim Peters started writing timsort for Python's list.sort in 2002. The design document, Objects/listsort.txt in the CPython repo, is the best how-to-write-up-a-sort artifact in the wild and worth reading in full[11]. The premise: real data is not random. Logs come in mostly-sorted, then with a few out-of-order entries. Database query results arrive in chunks that are themselves sorted. Game-engine event streams from the prior frame are usually monotonic in timestamp. A sort that exploits existing order beats one that ignores it.

Timsort is a stable mergesort tuned for runs. The four moving parts:

Live · Timsort run detection and merge stack
runs found
···
comparisons
···
vs n log n
···
On already-sorted input timsort runs in O(n): one run found, no merges. On random input it costs roughly the same as a plain stable mergesort. The win is real-world data, which is rarely random. On CPython's sortperf.py microbenchmarks, timsort runs at or near linear time on every input shape that contains pre-existing order (already-sorted, reverse-sorted, partially-sorted, equal-elements), where mergesort and quicksort stay at n log n[11]. The merge stack height is bounded by logφ(n) ≈ 1.44 log2n when the Fibonacci invariant is preserved; if it is not, the stack overflows. Which is exactly what happened in 2015.

The 2015 invariant bug

Stijn de Gouw and colleagues at CWI Amsterdam, working in the KeY verification tool to prove timsort correct, discovered that the merge stack's invariant was not actually being maintained[14]. mergeCollapse only checked the invariant against the top three runs of the stack. After it merged the top runs and pushed the result back, runs further down the stack could be left in a state that violated the invariant globally. The error did not produce wrong output (the sort still returned a sorted array), but it broke the bound on stack depth: de Gouw's team constructed run-length sequences that overflowed the fixed-size stack OpenJDK had allocated, crashing the JVM with ArrayIndexOutOfBoundsException.

Two fixes were available: enlarge the stack to cover the actual worst case under the broken invariant, or fix mergeCollapse to enforce the invariant for all stack entries. Both Python (bpo-23515, fixed in 3.4.4 / 2.7.11, December 2015) and OpenJDK chose the first path: re-derive the stack size from a worst-case analysis. de Gouw's paper proposes the second path as the more principled fix; it has not been adopted in mainline implementations.

Two lessons. One, formal verification of standard-library sort is a tractable thing in 2026 and was not in 2002. Two, the inner loops of sort algorithms get exhaustively tested; the framing logic (stack invariants, merge ordering, recursion bounds) gets read by few and proved by fewer. The bug had been in production for thirteen years.

10pdqsort, the branchless modern quicksort

Pattern-defeating quicksort, Orson Peters 2021[16], is the descendant of introsort. It adds three ideas that, together, make it the fastest general-purpose unstable sort on most real inputs:

Pdqsort is now the default unstable sort in Rust (slice::sort_unstable), one of the algorithms shipped in Boost.Sort, and the basis of the new std::sort that Google contributed to LLVM libc++ in 2022. On Google's internal benchmarks of std::sort, the pdqsort-aligned implementation produced a ~50% speedup on integer arrays and a ~10% to 20% speedup on mixed object workloads[22].

The branchless partition is worth seeing in detail. The conventional Hoare partition has a data-dependent branch on the pivot comparison; on random data the branch predictor is wrong half the time. BlockQuicksort buffers two arrays of indices (the positions of "wrong-side" elements found in the left and right blocks), then swaps them in a paired pass. The inner loop has no data-dependent branches at all on modern x86 because the comparison result is folded into an arithmetic operation:

block partition (sketch) · C++
template<typename T>
void block_scan_left(T* block, T pivot,
                     unsigned char* offsets, int& count) {
  count = 0;
  for (int i = 0; i < 64; ++i) {
    offsets[count] = (unsigned char)i;
    count += (block[i] >= pivot);   // No branch: bool-to-int extends the count.
  }
}
// Symmetric block_scan_right. Then a second loop swaps element pairs from the
// two offset arrays; that pass has one predictable loop branch and no data-dependent ones.

Combined effect: pdqsort matches introsort on adversarial inputs (both fall back to heapsort), beats it by 30 to 70% on integer workloads (branchless partition), beats it dramatically on inputs with many equal keys (three-way partition), and runs in linear time on the "ascending then a few out-of-order" and "descending then a few out-of-order" patterns that timsort catches but introsort handles in n log n.

11Powersort: timsort with a proof

Munro and Wild's 2018 paper "Nearly-Optimal Mergesorts"[15] attacked the part of timsort that even Tim Peters described as a heuristic: the merge policy. Timsort's three-element stack-invariant check is a pragmatic rule that mostly produces balanced merges. Mostly is not the same as optimal. Powersort replaces the heuristic with a rule derived from the Mehlhorn nearly-optimal binary search tree construction: assign each "run boundary" a power based on its position in the array, and merge runs whose power is exceeded by the next boundary.

The result is a stable mergesort that is provably within an additive linear term of the information-theoretic optimum for the entropy of the run-length sequence. In English: powersort is at least as good as timsort on every input, and strictly better when the input has many runs of unequal length. Tim Peters merged powersort into CPython in September 2021 (gh-78742 / PR #28108), shipping in Python 3.11 in October 2022[23]. The patch kept the other timsort machinery (run detection, galloping, minrun, insertion-sort cutoff) intact and swapped only the merge-trigger rule. PyPy, Apple's WebKit, and AssemblyScript adopted powersort in the same period.

Powersort is the right default for "modern stable sort" in a fresh codebase. Timsort with the original heuristic is what shipped, and what most existing systems still run, because the difference is measurable but not dramatic (within 5 to 10% on most workloads, up to 30% on inputs with long structured runs). The shape of the algorithm is the same: run-detection, push-on-stack, merge-when-policy-triggers. Powersort changes one rule, the merge trigger, and inherits everything else.

What is a "power" of a run boundary?

Powersort assigns each boundary between two runs a power equal to the number of leading zero bits in the binary representation of (start1 + len1 + start2) / total_length, viewed as a fraction. A boundary that bisects the array gets a high power; a boundary near one end gets a low power.

The merge rule: maintain a stack; when pushing a new run, merge previous-top runs whose boundary-power is at least the new run's boundary-power, then push. This is equivalent to building a balanced binary tree of merges over the runs, weighted by their length. Mehlhorn 1977 proved this construction gives a tree whose total weighted path length is within an additive linear term of optimal[24].

The asymmetry with timsort: timsort decides whether to merge based on the lengths of the top three runs. Powersort decides based on the position of the boundary in the array. The latter is a property of the input structure, not the algorithm's state, so it produces strictly better merge trees.

12Radix sort: linear time, no comparisons

Radix sort is the right answer for "I have a million 32-bit keys and I want them sorted." It runs in O(d · n) where d is the number of digits (or bytes, or whatever radix you pick), reads memory in long sequential streams, and parallelizes cleanly across cores or GPU threads. The cost is restricted to keys that are bounded-width integers or fixed-precision floats reinterpreted as integers. Every modern engine renderer uses it for draw-call sorting (§16).

Two flavors. LSD radix sort is the standard for sorting numeric keys: scan low byte first, then next byte, then next. Each pass is a stable bucket sort on 256 buckets (for a byte-radix). After four passes on 32-bit keys the array is fully sorted. MSD radix sort is the variant for strings and variable-length keys.

LSD radix sort on uint32 · C++
void radix_sort_u32(uint32_t* keys, uint32_t* scratch, int n) {
  uint32_t* src = keys, * dst = scratch;
  for (int passShift = 0; passShift < 32; passShift += 8) {
    int bucketCount[256] = {0};
    // Pass 1: count how many elements go in each bucket.
    for (int i = 0; i < n; ++i)
      ++bucketCount[(src[i] >> passShift) & 0xFF];
    // Prefix sum: turn counts into starting offsets.
    int total = 0;
    for (int b = 0; b < 256; ++b) {
      int c = bucketCount[b]; bucketCount[b] = total; total += c;
    }
    // Pass 2: scatter elements into dst at their bucket offsets.
    for (int i = 0; i < n; ++i) {
      int bucket = (src[i] >> passShift) & 0xFF;
      dst[bucketCount[bucket]++] = src[i];
    }
    std::swap(src, dst);   // Next pass reads from where we just wrote.
  }
  if (src != keys) std::memcpy(keys, src, n * sizeof(uint32_t));
}

Cost: 4 passes × n reads × n writes = 8n memory operations, each sequential. On a 16 GB/s memory bus that is roughly 0.5 ns per key per pass: 2 ns/key total for the whole sort. A comparison sort at the same scale costs 30 to 60 ns/key. The crossover with introsort is around n = 100 for 32-bit integer keys; below that the radix sort's setup cost (256-entry bucket histograms, four passes regardless of n) wins.

Live · LSD radix sort, byte by byte
passes
···
total ops
···
vs n log n
···
Radix bits is a tunable: more bits per pass means fewer passes but larger histograms (which cost memory and stress the cache). 8 bits is the production default; 4 bits is used here so the bucket structure is visible. Unreal's RadixSort32 ships with an 8-bit-per-pass implementation; Bevy's radsort uses a tunable that defaults to 8[25].

Sorting floats with a radix sort

The trick that makes radix sort useful in renderers is that IEEE 754 floats, reinterpreted as 32-bit ints, sort correctly for positive values. (Negative-zero handling and negative-value sign-flipping need three extra lines.) Aras Pranckevičius's "Rough sorting by depth" post is the canonical reference for using the top bits of a float as part of a packed sort key[26]: "Floating point numbers have the nice property that if you interpret their bits as integers, larger numbers result in larger integers — within the same sign." A 10-bit slice of a positive float gives you 1024 depth buckets, which is enough granularity for any front-to-back or back-to-front render pass.

13Counting sort: when the key range is tiny

Counting sort is the inner pass radix sort wraps. Given an array of n keys, each in [0, k), sort in O(n + k): count occurrences of each value, prefix-sum the counts, scatter elements into output. When k = O(n) the cost is linear; when k is much larger than n the histogram-allocation cost dominates and counting sort loses to a comparison sort.

Use cases in engines:

Java's Arrays.sort for byte, char, and short primitives uses counting sort, because the key range (256, 65536, 65536) is small enough that the histogram allocation is cheap[12]. For int and longer types it switches to dual-pivot quicksort, because allocating a 4-billion-entry histogram for 32-bit ints is not free.

14Bitonic sort and the GPU

Comparison sorts and radix sort assume one cursor walking sequentially. A GPU has thousands of cursors and they cannot easily coordinate. The sorts that run well on GPUs are sorting networks, where which-position-compares-which is decided ahead of time. Ken Batcher's 1968 bitonic sort[27] is the canonical example.

The structure: a bitonic sequence is one that monotonically increases then monotonically decreases. A bitonic sequence of length n can be sorted by n/2 compare-and-swap operations in parallel (the "bitonic merge"), producing two bitonic sequences of length n/2 each, which then recurse. Total operations: O(n log²n). Total depth (longest chain of dependent operations): O(log²n). On a GPU with enough threads, depth, not total work, determines wall time.

Bitonic sort's cost summary versus the CPU sorts:

SortTotal workParallel depthIn placeStable
quicksort (serial)O(n log n)O(n log n)yesno
mergesort (serial)O(n log n)O(n log n)noyes
radix sort (serial)O(d·n)O(d·n)auxyes
bitonic sort (parallel)O(n log²n)O(log²n)yesno
GPU radix sortO(d·n)O(d log n)auxyes

Bitonic sort does more total work than mergesort (n log²n vs n log n) but every step in a level is independent. On 4096 GPU threads sorting 4096 elements, bitonic finishes in 78 parallel steps; serial mergesort would need ~50,000. The work is wasted relative to the optimum; the wall time is not. NVIDIA's GPU Gems 2, chapter 46, describes the production version[28]; Unreal's particle pipeline uses a bitonic merge sort in GPUSort.cpp to depth-sort transparent particles entirely on the GPU.

Live · The bitonic sort network for n = 8
total ops
···
parallel depth
···
on 8 threads
···
Each "layer" of the network can run in parallel: every compare-and-swap in a layer is between disjoint pairs of wires. The depth of the network determines wall time on a parallel machine; the total ops determines power consumed. Bitonic sort's O(log²n) depth is what makes it fit on GPUs whose strength is throughput, not latency. Sorting 4096 floats on an RTX-class GPU finishes in roughly 10 µs of wall time despite doing ~100,000 compare-and-swap ops total.

15Live race: every sort, side by side

Six sorts running on the same input, animated as bar charts. Click "go" and watch them race. Pick the input shape from the dropdown to see how each algorithm responds to sorted, reverse-sorted, partially-sorted, and few-unique inputs:

Live · Six sorts on identical input
insertion quicksort mergesort heapsort timsort radix
"Work" is counted in array accesses; the unit is unitless but lets you compare algorithms fairly on the same input. Switching to "mostly-sorted with noise" reveals timsort's adaptive behavior: it finishes in roughly n accesses while the others stay at n log n. Switching to "already sorted" shows insertion sort win outright at O(n). Heapsort never wins, but never loses spectacularly either: it is the algorithm with the smallest variance across input shapes.

16Engine case: sorting draw calls

Every renderer sorts its draw calls before submitting them. The reasons are direct: GPU state changes (binding a new shader, new texture, new vertex buffer) cost real time, and the cost is per change, not per call. Five thousand draw calls in arbitrary order can mean five thousand shader swaps. Five thousand draw calls sorted by shader can mean ten. Sorting is the cheap way to recover most of the throughput a naive renderer leaves on the table.

The Ericson 2008 design pattern, still the standard fifteen years later[1], packs everything that matters into a single 64-bit sort key:

VP
T
shader id (13b)
material id (16b)
depth (24b)
seq (8b)
render target transparency bit shader id material id depth sequence number

The bit-order encodes the sort hierarchy. Render-target group first, then opaque-vs-transparent split (because their depth orders are opposite), then by shader, then by material, then by depth bucket. The trick: for opaque draws the depth is encoded as the float bits directly (front-to-back, which lets the early-Z reject more pixels); for transparent draws the depth is bit-inverted (back-to-front, required for correct alpha blending). One sort, two orderings.

Once the key is packed, the sort itself is a radix-sort on the integer. Unreal Engine ships RadixSort32 in its core library as a stable 32-bit radix sort designed for this purpose[25]. Bevy evaluated swapping its comparison sort for radsort (a stable radix sort crate) in 2022 and reported the median sort_phase_system time on the many_cubes benchmark dropping from 0.728 ms to 0.094 ms[2]. The result was workload-dependent: phases with many similar keys saw the linear-time radix win; phases with few items saw the radix sort's fixed setup cost lose to a tuned comparison sort. The lesson generalizes: radix-sort wins are most pronounced at scale, on densely-packed keys.

Live · Pack a 64-bit draw-call key
shader binds avoided
···
material binds avoided
···
sort cost (ns/key)
~2
Bits to the left of the sort key dominate the order. "Render target first" guarantees you finish one pass before starting another; "shader second" minimizes pipeline swaps; "depth last" runs front-to-back within each material so the GPU's early-Z reject works hardest. Same draw calls, different bit-pack, different render time. The packing strategy is in FMeshDrawCommand in Unreal's renderer and in PhaseItem in Bevy.

17Engine case: sort-and-sweep broadphase

Collision detection's broadphase finds candidate pairs of objects whose AABBs overlap. The naive algorithm is ; sweep-and-prune reduces it to roughly n log n per frame by sorting AABB endpoints along an axis and sweeping a moving "active list" across them[29].

The sort here is interesting because the input is almost-sorted from frame to frame. Object positions change a little per frame; the endpoint list mostly preserves order from the previous frame. Insertion sort, which is O(n) on already-sorted input, beats every n log n sort on this workload. Production physics engines (Bullet, Havok, PhysX) use insertion sort exactly here for exactly this reason, sometimes specialized to update the endpoint structure incrementally rather than re-running the sort from scratch.

Some physics engines skip sort-and-sweep entirely. Box2D's modern broadphase uses a b2DynamicTree, a dynamically rebalanced AABB tree, with rotations chosen by the same surface-area heuristic that BVH builders use[30]. The ordering work doesn't disappear, it moves inside the data structure; the tree's incremental rebalancing fills the role of an explicit per-frame sort.

18Picking the right sort

The decision table for "which sort do I reach for":

WorkloadRight answerWhy
Generic C++, types with cheap comparestd::sort (introsort / pdqsort)Fast on average, n log n worst case, in place, library-provided.
Generic C++, types with expensive comparepdqsort or timsortFewer comparisons; timsort if stability matters, pdqsort if it doesn't.
Generic, must be stablestd::stable_sort (timsort or merge)Stability needs explicit handling; the unstable sorts don't preserve equal-key order.
Real data, often partially sortedtimsort (Python, Java, V8) or powersortAdaptive: O(n) on sorted prefixes, run-detection avoids redundant work.
Bounded integer keys, n > 100radix sort (LSD)Linear time, no comparisons, parallelizes cleanly. Engine renderer default.
Tiny integer range (k = O(1))counting sortOne pass, no comparisons. Java does this for byte/char/short.
n < 32, any keysinsertion sortRecursive sorts pay more overhead than the inner-loop work at this scale.
Worst-case latency budget (real-time)heapsort or introsortGuaranteed n log n; quicksort alone can blow the budget on bad input.
Data on GPUbitonic sort or GPU radixBoth fit GPU parallelism. Radix wins for fixed-width keys; bitonic for arbitrary compare.
Data on disk, larger than RAMexternal mergesortSequential I/O, log-many-merge, the textbook external-sort algorithm.
Strings of varied lengthMSD radix or three-way radix quicksortAvoids re-scanning prefixes; O(N) where N is total character count.

The most common production mistake is "I'll just call std::sort." It is right roughly 80% of the time. The 20% where it is wrong are the cases that show up in profilers as a percent or two of frame time that nobody notices until they look. The case-study sections above are the canonical wins; the table covers the rest.

19Pitfalls

Sort bugs in production code, ranked by frequency:

20What's next

The natural follow-ups, in roughly increasing depth:

For the engine programmer, the most useful next step is empirical: profile your renderer's draw-call sort. The bug is probably std::sort on a pointer-and-string comparator. Pack a 32-bit key, swap it for a radix sort, watch the millisecond come back.

Sources

  1. Christer Ericson. "Order your graphics draw calls around!" realtimecollisiondetection.net blog, October 2008. realtimecollisiondetection.net. The canonical reference for bit-packed draw-call sort keys; covers the render-target / opaque-transparent / shader / material / depth bit-order pattern.
  2. Bevy Engine. "Use radix sort for sort phase and sprite sorting." GitHub issue #4291, 2022. github.com/bevyengine/bevy/issues/4291. Source for the 0.728 ms → 0.094 ms median-sort-time figure on the many_cubes benchmark and the radsort migration discussion.
  3. Peter Norvig. "Teach Yourself Programming in Ten Years," "Approximate timing for various operations" table. norvig.com/21-days.html. Memory-hierarchy latency ballparks used to estimate per-comparison cost: L1 ~0.5 ns, branch mispredict ~5 ns, DRAM ~100 ns.
  4. Donald E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching, 2nd ed., Addison-Wesley, 1998. Chapters 5.1 to 5.3 cover the history of mergesort, the analysis of comparison-based sorting, and the lower-bound proof.
  5. C. A. R. Hoare. "Quicksort." The Computer Journal 5(1):10–16, 1962. comjnl 5.1.10. The original quicksort paper. Hoare describes the algorithm he wrote at Moscow State University in 1959 while working on machine translation.
  6. J. W. J. Williams. "Algorithm 232 - Heapsort." Communications of the ACM 7(6):347–348, 1964. dl.acm.org/doi/10.1145/512274.512284. The original heapsort paper; also introduces the binary heap as a standalone data structure.
  7. Robert W. Floyd. "Algorithm 245 - Treesort 3." Communications of the ACM 7(12):701, 1964. The bottom-up heap-construction algorithm that brings heapify from O(n log n) to O(n).
  8. Jon L. Bentley, M. Douglas McIlroy. "Engineering a Sort Function." Software: Practice and Experience 23(11):1249–1265, November 1993. PDF. The paper most BSD qsort implementations descend from; introduces pseudomedian-of-nine and the three-way partition for the Dutch-flag problem.
  9. David R. Musser. "Introspective Sorting and Selection Algorithms." Software: Practice and Experience 27(8):983–993, August 1997. PDF. The introsort paper. The 2 log2n depth budget and the killer-sequence experiment (1/200 of median-of-three quicksort's time) are both here.
  10. LLVM Project. "libcxx std::sort." Implementation in libcxx/include/__algorithm/sort.h. github.com/llvm/llvm-project. Source for the insertion-sort cutoff sizes (30 for trivial swaps, 6 otherwise) and the depth-budget constant.
  11. Tim Peters. listsort.txt, the timsort design document. CPython repository, Objects/listsort.txt. github.com/python/cpython. Source for the minrun-in-[32, 64] choice, the merge-stack Fibonacci-style invariants, galloping mode, and the empirical motivation (real-world inputs are not random).
  12. Vladimir Yaroslavskiy. "Dual-Pivot Quicksort algorithm." Proposal to the OpenJDK core-libs mailing list, 2009. PDF (archived). The dual-pivot quicksort that became java.util.Arrays.sort for primitive types in Java 7.
  13. Bernhard Beckert, Jonas Schiffl, Peter H. Schmitt, Mattias Ulbrich. "Proving JDK's Dual Pivot Quicksort Correct." Verified Software: Theories, Tools, and Experiments (VSTTE), 2017. PDF. The KeY-based formal verification of Yaroslavskiy's algorithm; turns "we believe this works" into a machine-checkable proof.
  14. Stijn de Gouw, Jurriaan Rot, Frank S. de Boer, Richard Bubel, Reiner Hähnle. "OpenJDK's java.utils.Collection.sort() is broken: The Good, the Bad and the Worst Case." Computer Aided Verification (CAV), 2015. The discovery of the 2002–2015 timsort stack-invariant bug; Python 3.4.4 and OpenJDK both patched as a result. envisage-project.eu summary.
  15. J. Ian Munro, Sebastian Wild. "Nearly-Optimal Mergesorts: Fast, Practical Sorting Methods That Optimally Adapt to Existing Runs." European Symposium on Algorithms (ESA), 2018. arXiv:1805.04154. The powersort paper. Adopted by CPython 3.11, PyPy, AssemblyScript, and Apple's WebKit.
  16. Orson Peters. "Pattern-defeating Quicksort." arXiv preprint, June 2021. arXiv:2106.05123. The pdqsort paper. Now the default unstable sort in Rust (slice::sort_unstable) and in Boost.Sort.
  17. Joshua Bloch. "Nearly All Binary Searches and Mergesorts are Broken." Google Research Blog, June 2006. ai.googleblog.com. The discovery that (lo + hi) / 2 overflows for arrays larger than 230 and the fix that propagated through every standard library.
  18. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms, 4th ed., MIT Press, 2022, chapter 8 "Sorting in Linear Time" and chapter 8.1 "Lower bounds for sorting." The decision-tree lower-bound proof in canonical form.
  19. Lester R. Ford, Selmer M. Johnson. "A Tournament Problem." The American Mathematical Monthly 66(5):387–389, May 1959. The merge-insertion sort that achieves the information-theoretic minimum number of comparisons for small n.
  20. M. Douglas McIlroy. "A Killer Adversary for Quicksort." Software: Practice and Experience 29(4):341–344, April 1999. PDF. Shows that for any deterministic pivot rule, an adversary can construct input forcing quadratic time. The motivation for the introsort fallback.
  21. Stefan Edelkamp, Armin Weiß. "BlockQuicksort: Avoiding Branch Mispredictions in Quicksort." European Symposium on Algorithms (ESA), 2016. arXiv:1604.06697. The block-partition technique that pdqsort builds on; demonstrates 2× to 3× speedup from branchless partitioning on integer arrays.
  22. Danila Kutenin. "Changing std::sort at Google's Scale and Beyond." danlark.org blog, April 2022. danlark.org. Source for the ~50% speedup figure on integer arrays from swapping libc++'s introsort for pdqsort across Google's codebase.
  23. CPython issue gh-78742 / Pull Request #28108. "Use Powersort merge strategy." Author: Tim Peters; merged September 6, 2021; shipped in Python 3.11 (October 2022). github.com/python/cpython/pull/28108.
  24. Kurt Mehlhorn. "A best possible bound for the weighted path length of binary search trees." SIAM Journal on Computing 6(2):235–239, 1977. The nearly-optimal binary-search-tree construction whose principle powersort applies to merge ordering.
  25. Epic Games. "RadixSort32." Unreal Engine documentation. docs.unrealengine.com. The 32-bit radix sort that ships in Unreal's core for draw-call sorting and similar use cases.
  26. Aras Pranckevičius. "Rough sorting by depth." aras-p.info blog, January 2014. aras-p.info. The reference for using the top bits of a float-as-int as a radix-sortable depth field.
  27. Kenneth E. Batcher. "Sorting networks and their applications." Proc. AFIPS Spring Joint Computer Conference, 1968, pp. 307–314. PDF. The original bitonic sort paper and the foundational sorting-network reference.
  28. Peter Kipfer, Rüdiger Westermann. "Improved GPU Sorting." GPU Gems 2, chapter 46, NVIDIA, 2005. developer.nvidia.com. Production GPU bitonic sort implementation with cache-tiling and merge-step optimizations.
  29. Jonathan D. Cohen, Ming C. Lin, Dinesh Manocha, Madhav K. Ponamgi. "I-COLLIDE: An Interactive and Exact Collision Detection System for Large-Scale Environments." Proc. ACM Interactive 3D Graphics, 1995. The original sweep-and-prune (sort-and-sweep) paper.
  30. Ingo Wald, Solomon Boulos, Peter Shirley. "Ray Tracing Deformable Scenes Using Dynamic Bounding Volume Hierarchies." ACM Transactions on Graphics 26(1), 2007. The surface-area heuristic for BVH construction; the same heuristic drives the tree-rebalancing pass in physics broadphase trees.
  31. Matteo Frigo, Charles E. Leiserson, Harald Prokop, Sridhar Ramachandran. "Cache-Oblivious Algorithms." Proc. 40th IEEE FOCS, 1999. PDF. The cache-oblivious sort bound, O(N/B · logM/B(N/B)), that becomes the right complexity model for very large arrays.

See also