Tutorial 08 · Engine Programming

Sorting Algorithms

Every frame your engine sorts something: draw calls by material and depth, transparent sprites back-to-front, particles, broad-phase AABBs, animation events. The routines doing the work are not std::sort as your textbook drew it. They are introsort with insertion-sort cutoffs, pdqsort with branchless partitioning, timsort with a Fibonacci stack, and 32-bit radix passes packed by hand. We build each one from scratch, run them live in your browser, and end with three case studies from shipping engines.

Time~60 min LevelJunior to mid; review for senior PrereqsYou can read C++ or pseudocode at intermediate level. Big O notation (read the prior tutorial if it feels rusty). HardwareNone, but a feel for cache and branch prediction helps in §10 and §16.

01Why a sort, not a stopwatch

The frame budget at 60 fps is 16.67 ms. A modern open-world title submits 5,000 to 15,000 draw calls per frame and wants them grouped by render target, then shader, then material, then depth, so the GPU swaps state as few times as possible^[1]. A comparison sort with a comparator that dereferences a material or shader pointer per element does on the order of 1.4 n log n key comparisons (a hundred thousand to a quarter-million at this scale), each spending tens of nanoseconds once the pointer chasing leaves L1; the total can run past a millisecond. A four-pass radix sort over a packed 32-bit key finishes the same work in roughly 0.1 ms. Bevy benchmarked exactly this swap in 2022 and reported median sort time on its opaque-3D sort phase (the many_cubes benchmark) dropping from 0.728 ms to 0.094 ms^[2]. The difference is not the asymptotic class. The difference is which sort, on what key, with what memory layout.

This tutorial walks through the family of sorts an engine programmer actually meets. Comparison sorts (quicksort, mergesort, heapsort, and their modern hybrids) are bounded below by Ω(n log n) for n distinct elements, a bound we prove in §3. Non-comparison sorts (radix, counting, bucket) escape that bound by looking at the bits of the keys directly, and that is why engine renderers reach for them. Parallel sorts (bitonic, GPU radix) trade total work for parallel depth, which is what makes them fit on a GPU.

What you'll have by the end

A working mental model of every sort that ships in a major language runtime or game engine. Knowledge of which one to pick on which data, why std::sort in libstdc++ is introsort while libc++ and Rust ship pdqsort-derived implementations, why Python and V8 use timsort (and why Python 3.11 swapped it for powersort), how the timsort stack-invariant bug shipped for thirteen years before anyone noticed, why Unreal sorts particles on the GPU with bitonic merge but draw calls on the CPU with radix, and how to pack a 32-bit sort key that puts opaque-front-to-back, then transparent-back-to-front, in one comparison.

The cost of the wrong sort

The widget below puts a comparison sort and a radix sort on the same axis: object count along the X, milliseconds per frame along the Y. The comparison curve is n log n with a measured per-comparison cost of ~30 ns (the cost of one pointer dereference plus one string compare, ballparked from Norvig's latency numbers^[3]). The radix curve is linear in n at ~3 ns per element per pass, four passes for a 32-bit key. The crossover is around 200 items; above 1,000 the gap is order-of-magnitude:

draw calls 5000

per-comparison cost 30 ns

comparison sort

···

radix sort (4 pass)

···

frame budget at 60 fps

16.67 ms

02A short history of sorting

Sorting predates computing by a century: tabulating machines for the 1890 US Census used punched-card sort routines invented by Herman Hollerith, and the algorithm they ran was a radix-by-column LSD pass. Modern sorting research started with von Neumann in 1945 and is still active. The dates that matter for engine work:

1945

John von Neumann describes merge sort.^[4] The first computer-program sort, designed for the EDVAC. Two passes per merge level, log₂n levels, the bound is n log n. Stable, in-place is awkward, external-memory friendly. Knuth's Art of Computer Programming vol. 3 traces the lineage.

1959

Shell sort. Donald Shell publishes a gap-decreasing insertion sort variant. Not optimal asymptotically (analysis is messy), but it was the first sort that beat n² insertion in practice on real machines. Still appears occasionally in embedded contexts for its small code footprint.

1961

Tony Hoare publishes quicksort.^[5] First appears as Algorithm 64 in Communications of the ACM, July 1961; the analysis paper follows in The Computer Journal the next year. Hoare had devised it in 1959 working on a Russian-English machine translation project at the UK's National Physical Laboratory, after a year studying probability at Moscow State University under Kolmogorov. The partition step swaps elements around a pivot until the array is split into "less than pivot" and "greater than pivot." Expected n log n, worst case n². The worst case will matter in §5.

1964

Williams introduces heap sort. Floyd fixes the build.^[6] J.W.J. Williams publishes heapsort in Communications of the ACM. Robert Floyd publishes a bottom-up heap-construction algorithm in the same year that brings the build cost from O(n log n) down to O(n)^[7]. In-place, guaranteed n log n, cache-hostile.

1993

Bentley and McIlroy engineer the qsort that ships.^[8] "Engineering a Sort Function" in Software: Practice and Experience. Pseudomedian-of-nine pivot selection, three-way partition for the Dutch-national-flag problem (many equal keys), explicit small-array fallback. Most BSD qsort implementations and a generation of C libraries descend from this paper.

1997

Musser publishes introsort.^[9] "Introspective Sorting and Selection Algorithms" gives quicksort a worst-case guarantee: monitor recursion depth, switch to heapsort once it exceeds 2 log₂n. Worst case becomes n log n, average case stays at quicksort's constant. libstdc++ std::sort has been introsort since the GCC 3.x era; MSVC and libc++ followed (libc++ finally caught up in LLVM 14)^[10].

2002

Tim Peters writes timsort.^[11] Adaptive natural-runs mergesort for CPython's list.sort. The design document Objects/listsort.txt is a model of how to write up a sort. Joshua Bloch's port landed in OpenJDK in 2009 (JDK-6804124) and shipped with Java 7 in 2011 as the default sort for object arrays. V8 adopted timsort for JavaScript Array.prototype.sort in V8 v7.0 (2018).

2009

Yaroslavskiy's dual-pivot quicksort enters the JDK.^[12] Vladimir Yaroslavskiy proposes a quicksort variant with two pivots that partitions into three regions per call. Java 7's Arrays.sort for primitive types adopts it. Verified correct in 2017 using the KeY proof tool^[13].

2015

The timsort stack-invariant bug.^[14] Stijn de Gouw et al. prove, while trying to verify timsort in KeY, that the merge-collapse invariant was wrong. The Java implementation crashes on a constructed sequence of runs; Python's was sized too small in principle but not in practice. The patch made it into Python 3.4.4 and OpenJDK. A sort algorithm in production for thirteen years had a worst-case stack-overflow bug.

2018

Munro and Wild publish powersort.^[15] "Nearly-Optimal Mergesorts" gives timsort a merge policy with provable optimality up to an additive linear term. CPython 3.11 swaps timsort's heuristic for the powersort rule in 2022; PyPy, AssemblyScript, and Apple's WebKit follow.

2021

Orson Peters publishes pdqsort.^[16] Pattern-defeating quicksort: introsort plus block-partitioning (Edelkamp and Weiß's BlockQuicksort) plus pivot-shuffling on bad partitions plus the Dutch-flag trick for equal keys. Branchless inner partition. Rust's slice::sort_unstable is pdqsort; Boost shipped a port; the algorithm became the de facto modern quicksort.

2026

The sorts that ship. A modern engine ends up with three or four routines: introsort or pdqsort for general comparison sorting, radix sort for fixed-width keys (draw calls, particles), bitonic sort on the GPU for what is already on the GPU. The choice is not "which is fastest" in the abstract; it is "which fits the shape of the data and the memory it lives in."

Two patterns hold across the timeline. One: every winning production sort is a hybrid. Quicksort plus insertion sort plus heapsort fallback (introsort, pdqsort); mergesort plus insertion sort on small runs plus galloping merge (timsort, powersort). Nobody ships the single-routine textbook algorithm. Two: the bug surface is not in the inner loop, it is in the boilerplate around it. The 2015 timsort stack-invariant bug and the binary-search midpoint overflow Joshua Bloch reported in 2006^[17] were both in framing logic (merge state, midpoint arithmetic), not in the comparison itself.

03Why comparison sorts can't beat n log n

Before building any sort, it is worth knowing the floor. The claim is that any algorithm that sorts by comparing pairs of elements (no peeking at their bits, no precomputed table) must do at least n log₂ n comparisons in the worst case, up to constants. The proof is a counting argument. It is short and worth doing in full because it explains why radix sort and counting sort can exist at all: they violate the assumption.

Model a comparison sort as a decision tree: the root is the first comparison the algorithm makes; the two children are the two possible outcomes; each leaf is one possible final permutation. For an input of n distinct elements there are n! possible permutations, so the tree needs at least n! leaves. A binary tree with L leaves has height at least log₂ L. So the height (the worst-case number of comparisons on some input) is at least log₂(n!).

eq. 1 \cdot the lower bound T(n) \geq log 2 (n!) \geq log 2 ((n/e) n) = n log 2 (n) - n log 2 (e) = Ω(n log n)

The middle step uses Stirling's approximation: n! > (n/e)ⁿ for n ≥ 1. The constant log₂(e) ≈ 1.443 is the only thing absorbed into the Ω. The conclusion: any sort that uses only pairwise comparisons does at least n log₂ n comparisons in the worst case^[18]. Mergesort, heapsort, and well-tuned quicksort hit this bound up to a constant.

The widget below builds the decision tree for n = 3 by hand. There are 3! = 6 permutations, so the tree has 6 leaves and minimum height ⌈log₂(6)⌉ = 3 comparisons. Click a permutation to highlight the path:

trace permutation

leaves needed

3! = 6

min height

⌈log₂6⌉ = 3

path to this leaf

So how does radix sort beat the bound?

Radix sort never asks "is a less than b?". It looks at the bits of each key directly: bucket by the low byte, then the next byte, then the next. The decision-tree model does not apply because the algorithm is not making binary decisions on comparison outcomes; it is doing a constant amount of work per byte per element.

For an n-element array of w-bit keys, radix sort runs in O(w · n / b) where b is the bits per pass. With w = 32, b = 8, that is four linear passes. Asymptotically it is O(n) when w is fixed (which it is for draw-call keys). The bound in §3 only forbids beating n log n with comparisons.

04Insertion sort, the small-n champion

Insertion sort is the sort every production routine falls back to at small sizes. The asymptotic class is O(n²), which textbooks then file under "bad" and move on. The textbooks are wrong about that part. Insertion sort wins outright below roughly 16 to 32 elements on modern CPUs, which is exactly the regime where the recursive sorts spend most of their time once they have partitioned the data down. The reasons it wins are not asymptotic, they are mechanical: a tight inner loop that the branch predictor learns instantly, no recursion overhead, no pivot selection, and a memory access pattern that prefetches perfectly.

insertion sort · C++

template<typename It, typename Compare>
void insertion_sort(It first, It last, Compare less) {
  // Outer loop: bring the i-th element into its place among [first, i).
  for (It cursor = first + 1; cursor != last; ++cursor) {
    auto key = std::move(*cursor);                // Save the element we're placing.
    It target = cursor;
    // Walk left while the element on the left is bigger than key.
    while (target != first && less(key, *(target - 1))) {
      *target = std::move(*(target - 1));        // Shift the bigger one right by one slot.
      --target;
    }
    *target = std::move(key);                     // Drop key into the gap.
  }
}

Cost analysis: the outer loop runs n − 1 times; the inner loop walks back as far as the new element needs to go. On a fully sorted input the inner loop exits immediately and the total cost is O(n). On a reverse-sorted input the inner loop walks all the way back every time and the total cost is n(n−1)/2 = Θ(n²). The average case on random data is Θ(n²) with constant ½: half of the elements on the left are larger on average.

The widget below races insertion sort against a recursive sort (a simple quicksort) at varying array sizes. The crossover is the cutoff every production sort uses:

input shape

crossover n

···

insertion at n=16

···

quick at n=16

···

Times are measured by counting array writes; the unit is "swap-equivalents." Switch input shape to see how the crossover moves: insertion sort is unbeatable on already-sorted input (it runs in linear time), and it stays competitive on "few unique" because the inner loop tends to find its home quickly. Cutoff sizes used in production: libstdc++ n=16, libc++ n=30 for trivial swaps else 6, Java's DualPivotQuicksort n=44 (was 47 through Java 14, retuned since), pdqsort n=24, Go n=12^[10].

Why don't all sorts cut over at the same n?

The crossover depends on the cost of one comparison. For primitive types (ints, floats), the comparator is one CPU instruction, so insertion sort's quadratic comparisons are cheap and the crossover sits higher (Java's DualPivotQuicksort cuts at n=44). For boxed objects with virtual comparators, each compare is a pointer-dereference plus an indirect call, which is expensive, and the crossover sits lower (Java's timsort uses a 32-element minimum run length, effectively the insertion-sort threshold; libc++ uses n=6 for non-trivial types).

Cache behavior matters too. Insertion sort touches contiguous memory in a tight loop; on a hot cache line it does a few dozen swaps before missing. Quicksort partition reads from both ends of the range, which hits two cache lines at once. At small n that overhead dominates.

05Quicksort: partition, recurse, pray

Tony Hoare's quicksort, from a 1959 stint at Moscow State University, is the most-cited sorting algorithm and the foundation of every modern production sort except the timsort-family. The idea is one of the cleanest in algorithm design: pick a pivot, partition the array into "less than pivot" and "greater than pivot," recurse on each half. The expected cost is n log n, but the worst case is n², and the difference depends entirely on how you choose the pivot.

The partition

Two partition schemes ship in production code. Hoare's partition is faster on average; Lomuto's partition is simpler to write correctly. Most modern sorts use a tuned variant of Hoare's. The C++ pseudocode for Hoare-style:

Hoare partition · C++

template<typename It, typename Compare>
It hoare_partition(It first, It last, Compare less) {
  auto pivot = *first;             // In production, median-of-3 here, not just first.
  It leftCursor  = first  - 1;
  It rightCursor = last;
  while (true) {
    // Walk leftCursor right until we find something ≥ pivot.
    do { ++leftCursor;  } while (less(*leftCursor,  pivot));
    // Walk rightCursor left until we find something ≤ pivot.
    do { --rightCursor; } while (less(pivot, *rightCursor));
    if (leftCursor >= rightCursor) return rightCursor;  // Cursors met or crossed; done.
    std::iter_swap(leftCursor, rightCursor);          // Swap the out-of-place pair.
  }
}

The recursion is the easy part. The hard part, the part where every quicksort variant earns its keep, is picking the pivot.

The pivot problem

If the pivot is the median of the data, quicksort halves the array each call and runs in n log n. If the pivot is the minimum (or maximum), one partition is empty and the recursion depth equals n: quadratic, plus a stack overflow.

Pivot strategies in roughly increasing sophistication:

First element. Trivial. Worst case on already-sorted inputs, which is the most common shape in real data. This is the textbook version; nobody ships it.
Random element. Expected n log n, no adversarial input can force quadratic, but in practice the random number generator costs more per call than the saved comparisons.
Median-of-three. Sample the first, middle, and last element; use their median. Defeats sorted and reverse-sorted inputs. Bentley and McIlroy's 1993 paper uses this for small arrays^[8].
Pseudomedian-of-nine ("ninther"). For large arrays, sample three groups of three, take the median of each, then the median of those medians. Bentley and McIlroy again. Much harder to defeat with crafted input.
Median of medians (BFPRT). Compute the exact median in O(n). Guarantees n log n worst case but the constant is so bad that nobody uses it for sorting outside textbooks.

Modern hybrids (introsort, pdqsort) give up on pivot perfection and accept that any deterministic strategy can be defeated. They monitor partition quality, and if the partition is too imbalanced, they take a fallback path: introsort switches to heapsort, pdqsort shuffles pivot candidates. §8 and §10.

The widget below shows quicksort running on data you control. Switch the pivot strategy and the input shape to see when quicksort runs at n log n and when it falls apart:

pivot strategy

input shape

size 48

comparisons

···

max recursion depth

···

vs n log n

···

06Mergesort, the stability champion

Mergesort is von Neumann's algorithm. Split the array in half; recursively sort each half; merge the two sorted halves into one. The merge step is the substance of the algorithm: two cursors walking forward, copying the smaller of the two current elements into the output. The recursion is log₂n levels deep; each level does O(n) work; total n log n, worst case and average case both.

Mergesort's selling points over quicksort:

Stable. Elements with equal keys end up in their original relative order. This is required for the "sort by department, then sort by name" pattern, and for draw-call sorts where ties are broken by submission order.
Worst case is n log n. No adversarial input. Quicksort needs pivot tricks; mergesort needs nothing.
External-memory friendly. Merge is purely sequential reads and writes, no random access. Sorts on disk and tape were almost always mergesorts in the mainframe era and still are for very large inputs.
Parallelizable cleanly. Each split can run on its own core with no synchronization until the merge.

The drawbacks are the reason it is not the default in-memory sort. Mergesort needs O(n) auxiliary memory for the merge buffer; quicksort sorts in place with O(log n) stack. Cache-wise, the merge step reads from two ranges and writes to a third, which on most CPUs costs more than quicksort's two-cursor partition that reads and writes the same range. For in-memory primitive arrays the constant on mergesort is about 1.5× to 2× quicksort, which is why std::sort in C++ is not mergesort.

top-down mergesort · C++

template<typename T>
void merge(T* a, T* buffer, int lo, int mid, int hi) {
  // Copy [lo, hi) to scratch buffer; merge from buffer back into a.
  for (int k = lo; k < hi; ++k) buffer[k] = a[k];
  int leftCursor = lo, rightCursor = mid;
  for (int writeAt = lo; writeAt < hi; ++writeAt) {
    if      (leftCursor  >= mid)                       a[writeAt] = buffer[rightCursor++];
    else if (rightCursor >= hi)                        a[writeAt] = buffer[leftCursor++];
    else if (buffer[rightCursor] < buffer[leftCursor]) a[writeAt] = buffer[rightCursor++];
    else                                                a[writeAt] = buffer[leftCursor++];   // '<' not '<=': preserves stability.
  }
}

template<typename T>
void mergesort(T* a, T* buffer, int lo, int hi) {
  if (hi - lo < 2) return;            // Production: cut over to insertion sort below ~16.
  int mid = lo + (hi - lo) / 2;  // Note: (lo + hi) / 2 overflows; this avoids that bug.
  mergesort(a, buffer, lo, mid);
  mergesort(a, buffer, mid, hi);
  merge(a, buffer, lo, mid, hi);
}

The midpoint comment is not a stylistic note. Joshua Bloch found in 2006 that the canonical (lo + hi) / 2 in the Java standard library's binary search overflowed on arrays larger than 2³⁰, a bug that had survived since 1969^[17]. The fix, lo + (hi − lo) / 2, is the same one mergesort needs. Use it.

array size 16

levels

···

comparisons

···

aux memory

···

07Heapsort, the worst-case insurance policy

Heapsort, J.W.J. Williams 1964^[6], is the in-place worst-case-n log n sort. Quicksort can degrade to quadratic; mergesort needs n auxiliary words; heapsort needs neither. Production code uses heapsort as a fallback: introsort calls it when recursion depth exceeds 2 log₂n, and embedded systems that cannot afford the merge buffer use it as their default.

The algorithm has two phases. First, build a max-heap in place: the largest element ends up at index 0. Then repeatedly swap the root (largest remaining) to the end of the array and sift down what landed at the root to restore the heap property. Each swap-pop step takes O(log n) sift-down time, done n times: O(n log n) total.

The non-obvious part is the build. The naive method (insert n elements one by one) is n log n. Floyd's bottom-up heapify, published the same year as Williams' paper^[7], is O(n). The proof is a geometric sum: half the array is at the leaves and needs zero sift-down, a quarter is at level h−1 and needs at most one sift-down step, an eighth needs at most two, and the sum Σ k · 2^−k is bounded by a constant.

heapsort with Floyd's heapify · C++

template<typename T>
void sift_down(T* a, int root, int heapEnd) {
  // Walk root down toward the leaves; at each step, swap with the larger child if needed.
  while (true) {
    int leftChild  = 2 * root + 1;
    int rightChild = 2 * root + 2;
    int biggest    = root;
    if (leftChild  < heapEnd && a[leftChild]  > a[biggest]) biggest = leftChild;
    if (rightChild < heapEnd && a[rightChild] > a[biggest]) biggest = rightChild;
    if (biggest == root) return;                // Heap property restored.
    std::swap(a[root], a[biggest]);
    root = biggest;                          // Follow the swap down.
  }
}

template<typename T>
void heapsort(T* a, int n) {
  // Phase 1: build a max-heap in place, bottom-up. Floyd's O(n) heapify.
  for (int i = n / 2 - 1; i >= 0; --i) sift_down(a, i, n);
  // Phase 2: extract the max n times, growing a sorted suffix.
  for (int heapEnd = n - 1; heapEnd > 0; --heapEnd) {
    std::swap(a[0], a[heapEnd]);          // Largest remaining moves to its final slot.
    sift_down(a, 0, heapEnd);                // Restore heap on the shrinking prefix.
  }
}

The performance footnote is the part textbooks usually omit. Heapsort is n log n in the worst case, but on random data its constant is roughly 2 to 3 times worse than quicksort's^[9]. The reason is cache: heapsort's sift-down walks a parent-child chain that doubles in stride at every level, so once the heap exceeds the L1 cache the parent reads cost L2 or memory latency. Quicksort partitions over contiguous memory, prefetches well. Heapsort is the right answer when you need the worst-case guarantee; it is the wrong answer when you do not.

size 16

speed 1.0×

phase

···

comparisons

···

swaps

···

08Introsort: quicksort with a parachute

David Musser's 1997 introsort^[9] is the algorithm that made quicksort safe to standardize. The idea is one sentence long: run quicksort, monitor recursion depth, and if it exceeds 2 log₂n, finish the current subarray with heapsort. Quicksort's average case stays intact; heapsort's worst-case bound becomes the worst case for the whole algorithm. Add an insertion-sort cutoff at small n, and you have the std::sort that has shipped in libstdc++ since GCC 3.x and in MSVC since Visual Studio 2008. libc++ implemented the n log n worst-case guarantee with introsort in LLVM 14 (2022); LLVM 17 (2023) replaced it with a pdqsort-derived implementation contributed by Google (§10)^[10].

introsort · C++

template<typename It>
void introsort_loop(It first, It last, int depthBudget) {
  while (last - first > 16) {                // Insertion-sort cutoff at the end of the call.
    if (depthBudget == 0) {                  // Quicksort recursion is going too deep.
      heapsort(first, last);                  // Fall back to the worst-case-safe sort.
      return;
    }
    --depthBudget;
    It pivot = median_of_three(first, first + (last - first) / 2, last - 1);
    It cut = hoare_partition(first, last, *pivot);
    introsort_loop(cut, last, depthBudget);          // Recurse on the right half.
    last = cut;                                       // Loop again on the left half (tail-call elimination).
  }
}

template<typename It>
void introsort(It first, It last) {
  int depthBudget = 2 * log2(last - first);     // 2 log₂n is the introsort constant.
  introsort_loop(first, last, depthBudget);
  insertion_sort(first, last);                       // One final pass cleans up everything < 16.
}

Musser's killer-sequence experiment is worth repeating: he constructed a 100,000-element input designed to defeat median-of-three quicksort and measured introsort's running time at 1/200 of the median-of-three quicksort's^[9]. Heapsort kicks in once, finishes the bad subarray in n log n, and the average-case behavior is untouched on the rest of the input.

Why 2 log₂n?

The constant in front of the log is a tradeoff. A perfectly balanced quicksort recurses exactly log₂n deep; a median-of-three quicksort on random data averages somewhat above that. The factor 2 gives quicksort enough rope to finish on inputs that lean unlucky but not pathological. Set it to 1.0 and heapsort triggers on benign inputs; set it to 4.0 and the quadratic worst case approaches before heapsort saves you. 2 log₂n is Musser's measured sweet spot and has been left alone.

Introsort's downside is that it inherits all of quicksort's bad cases for "many equal keys": the standard Hoare partition stalls when most elements equal the pivot. Bentley and McIlroy 1993 solved this with a three-way partition (the Dutch national flag problem), which is what makes pdqsort, in §10, genuinely better than introsort on real data.

09Timsort, the natural-runs mergesort

Tim Peters started writing timsort for Python's list.sort in 2002. The design document, Objects/listsort.txt in the CPython repo, is the best how-to-write-up-a-sort artifact in the wild and worth reading in full^[11]. The premise: real data is not random. Logs come in mostly-sorted, then with a few out-of-order entries. Database query results arrive in chunks that are themselves sorted. Game-engine event streams from the prior frame are usually monotonic in timestamp. A sort that exploits existing order beats one that ignores it.

Timsort is a stable mergesort tuned for runs. The four moving parts:

Run detection. Scan left-to-right looking for the longest stretch of consecutive elements that is non-decreasing or strictly decreasing. Reverse the decreasing runs in place (so they become increasing). A "run" is the unit of work.
Minimum run length (minrun). If the natural run is shorter than minrun, use insertion sort to extend it to minrun elements. Minrun is chosen between 32 and 64 so that n / minrun is a power of two (or just under), which gives balanced merges^[11].
The merge stack. Push each run onto a stack. After each push, check three Fibonacci-style invariants on the top three runs (X, Y, Z, top of stack last): runlen(X) > runlen(Y) + runlen(Z), runlen(Y) > runlen(Z), runlen(X+Y) > runlen(Z). If any is violated, merge Y with the smaller of X and Z. This keeps merges roughly balanced.
Galloping merge. A merge between two runs that are very imbalanced (most of one run is smaller than most of the other) uses exponential search instead of linear, so a long suffix of "wins" by one side does not pay O(k) for each element.

input shape

runs found

···

comparisons

···

vs n log n

···

On already-sorted input timsort runs in O(n): one run found, no merges. On random input it costs roughly the same as a plain stable mergesort. The win is real-world data, which is rarely random. On CPython's sortperf.py microbenchmarks, timsort runs at or near linear time on every input shape that contains pre-existing order (already-sorted, reverse-sorted, partially-sorted, equal-elements), where mergesort and quicksort stay at n log n^[11]. The merge stack height is bounded by log_φ(n) ≈ 1.44 log₂n when the Fibonacci invariant is preserved; if it is not, the stack overflows. Which is exactly what happened in 2015.

The 2015 invariant bug

Stijn de Gouw and colleagues at CWI Amsterdam, working in the KeY verification tool to prove timsort correct, discovered that the merge stack's invariant was not actually being maintained^[14]. mergeCollapse only checked the invariant against the top three runs of the stack. After it merged the top runs and pushed the result back, runs further down the stack could be left in a state that violated the invariant globally. The error did not produce wrong output (the sort still returned a sorted array), but it broke the bound on stack depth: de Gouw's team constructed run-length sequences that overflowed the fixed-size stack OpenJDK had allocated, crashing the JVM with ArrayIndexOutOfBoundsException.

Two fixes were available: enlarge the stack to cover the actual worst case under the broken invariant, or fix mergeCollapse to enforce the invariant for all stack entries. Both Python (bpo-23515, fixed in 3.4.4 / 2.7.11, December 2015) and OpenJDK chose the first path: re-derive the stack size from a worst-case analysis. de Gouw's paper proposes the second path as the more principled fix; it has not been adopted in mainline implementations.

Two lessons. One, formal verification of standard-library sort is a tractable thing in 2026 and was not in 2002. Two, the inner loops of sort algorithms get exhaustively tested; the framing logic (stack invariants, merge ordering, recursion bounds) gets read by few and proved by fewer. The bug had been in production for thirteen years.

10pdqsort, the branchless modern quicksort

Pattern-defeating quicksort, Orson Peters 2021^[16], is the descendant of introsort. It adds three ideas that, together, make it the fastest general-purpose unstable sort on most real inputs:

BlockQuicksort partitioning. Edelkamp and Weiß 2016^[21] showed that the quicksort partition's unpredictable branch on the pivot comparison costs more than the comparison itself. Their fix: collect the comparison outcomes for a block of ~64 elements into a bitmask first, then do the swaps in a separate pass with cmov-style branchless code. Branch mispredicts go from one-per-element to zero.
Pattern defeat. Pdqsort tracks whether the most recent partition was unbalanced (one side < n/8, say). If yes, it shuffles the pivot candidates to a different position before retrying, which probabilistically breaks any adversarial input. If the imbalanced partition keeps happening, it falls back to heapsort the way introsort does.
Three-way partition for equal keys. When pdqsort detects that many elements equal the pivot (the partition repeatedly produces a tiny right side), it switches to a three-way partition (less, equal, greater) and the equal-elements region gets skipped entirely on the recursion. Inputs with many duplicate keys run in O(n · k) where k is the number of distinct values.

Pdqsort is now the default unstable sort in Rust (slice::sort_unstable), one of the algorithms shipped in Boost.Sort, and the basis of the new std::sort that Google contributed to LLVM libc++ in 2022. On Google's internal benchmarks of std::sort, the pdqsort-aligned implementation produced a ~50% speedup on integer arrays and a ~10% to 20% speedup on mixed object workloads^[22].

The branchless partition is worth seeing in detail. The conventional Hoare partition has a data-dependent branch on the pivot comparison; on random data the branch predictor is wrong half the time. BlockQuicksort buffers two arrays of indices (the positions of "wrong-side" elements found in the left and right blocks), then swaps them in a paired pass. The inner loop has no data-dependent branches at all on modern x86 because the comparison result is folded into an arithmetic operation:

block partition (sketch) · C++

template<typename T>
void block_scan_left(T* block, T pivot,
                     unsigned char* offsets, int& count) {
  count = 0;
  for (int i = 0; i < 64; ++i) {
    offsets[count] = (unsigned char)i;
    count += (block[i] >= pivot);   // No branch: bool-to-int extends the count.
  }
}
// Symmetric block_scan_right. Then a second loop swaps element pairs from the
// two offset arrays; that pass has one predictable loop branch and no data-dependent ones.

Combined effect: pdqsort matches introsort on adversarial inputs (both fall back to heapsort), beats it by 30 to 70% on integer workloads (branchless partition), beats it dramatically on inputs with many equal keys (three-way partition), and runs in linear time on the "ascending then a few out-of-order" and "descending then a few out-of-order" patterns that timsort catches but introsort handles in n log n.

11Powersort: timsort with a proof

Munro and Wild's 2018 paper "Nearly-Optimal Mergesorts"^[15] attacked the part of timsort that even Tim Peters described as a heuristic: the merge policy. Timsort's three-element stack-invariant check is a pragmatic rule that mostly produces balanced merges. Mostly is not the same as optimal. Powersort replaces the heuristic with a rule derived from the Mehlhorn nearly-optimal binary search tree construction: assign each "run boundary" a power based on its position in the array, and merge runs whose power is exceeded by the next boundary.

The result is a stable mergesort that is provably within an additive linear term of the information-theoretic optimum for the entropy of the run-length sequence. In English: powersort is at least as good as timsort on every input, and strictly better when the input has many runs of unequal length. Tim Peters merged powersort into CPython in September 2021 (gh-78742 / PR #28108), shipping in Python 3.11 in October 2022^[23]. The patch kept the other timsort machinery (run detection, galloping, minrun, insertion-sort cutoff) intact and swapped only the merge-trigger rule. PyPy, Apple's WebKit, and AssemblyScript adopted powersort in the same period.

Powersort is the right default for "modern stable sort" in a fresh codebase. Timsort with the original heuristic is what shipped, and what most existing systems still run, because the difference is measurable but not dramatic (within 5 to 10% on most workloads, up to 30% on inputs with long structured runs). The shape of the algorithm is the same: run-detection, push-on-stack, merge-when-policy-triggers. Powersort changes one rule, the merge trigger, and inherits everything else.

What is a "power" of a run boundary?

Powersort assigns each boundary between two runs a power equal to the number of leading zero bits in the binary representation of (start1 + len1 + start2) / total_length, viewed as a fraction. A boundary that bisects the array gets a high power; a boundary near one end gets a low power.

The merge rule: maintain a stack; when pushing a new run, merge previous-top runs whose boundary-power is at least the new run's boundary-power, then push. This is equivalent to building a balanced binary tree of merges over the runs, weighted by their length. Mehlhorn 1977 proved this construction gives a tree whose total weighted path length is within an additive linear term of optimal^[24].

The asymmetry with timsort: timsort decides whether to merge based on the lengths of the top three runs. Powersort decides based on the position of the boundary in the array. The latter is a property of the input structure, not the algorithm's state, so it produces strictly better merge trees.

12Radix sort: linear time, no comparisons

Radix sort is the right answer for "I have a million 32-bit keys and I want them sorted." It runs in O(d · n) where d is the number of digits (or bytes, or whatever radix you pick), reads memory in long sequential streams, and parallelizes cleanly across cores or GPU threads. The cost is restricted to keys that are bounded-width integers or fixed-precision floats reinterpreted as integers. Every modern engine renderer uses it for draw-call sorting (§16).

Two flavors. LSD radix sort is the standard for sorting numeric keys: scan low byte first, then next byte, then next. Each pass is a stable bucket sort on 256 buckets (for a byte-radix). After four passes on 32-bit keys the array is fully sorted. MSD radix sort is the variant for strings and variable-length keys.

LSD radix sort on uint32 · C++

void radix_sort_u32(uint32_t* keys, uint32_t* scratch, int n) {
  uint32_t* src = keys, * dst = scratch;
  for (int passShift = 0; passShift < 32; passShift += 8) {
    int bucketCount[256] = {0};
    // Pass 1: count how many elements go in each bucket.
    for (int i = 0; i < n; ++i)
      ++bucketCount[(src[i] >> passShift) & 0xFF];
    // Prefix sum: turn counts into starting offsets.
    int total = 0;
    for (int b = 0; b < 256; ++b) {
      int c = bucketCount[b]; bucketCount[b] = total; total += c;
    }
    // Pass 2: scatter elements into dst at their bucket offsets.
    for (int i = 0; i < n; ++i) {
      int bucket = (src[i] >> passShift) & 0xFF;
      dst[bucketCount[bucket]++] = src[i];
    }
    std::swap(src, dst);   // Next pass reads from where we just wrote.
  }
  if (src != keys) std::memcpy(keys, src, n * sizeof(uint32_t));
}

Cost: 4 passes × n reads × n writes = 8n memory operations, each sequential. On a 16 GB/s memory bus that is roughly 0.5 ns per key per pass: 2 ns/key total for the whole sort. A comparison sort at the same scale costs 30 to 60 ns/key. The crossover with introsort is around n = 100 for 32-bit integer keys; below that the radix sort's setup cost (256-entry bucket histograms, four passes regardless of n) wins.

elements 16

radix bits per pass

passes

···

total ops

···

vs n log n

···

Radix bits is a tunable: more bits per pass means fewer passes but larger histograms (which cost memory and stress the cache). 8 bits is the production default; 4 bits is used here so the bucket structure is visible. Unreal's RadixSort32 ships with an 8-bit-per-pass implementation; Bevy's radsort uses a tunable that defaults to 8^[25].

Sorting floats with a radix sort

The trick that makes radix sort useful in renderers is that IEEE 754 floats, reinterpreted as 32-bit ints, sort correctly for positive values. (Negative-zero handling and negative-value sign-flipping need three extra lines.) Aras Pranckevičius's "Rough sorting by depth" post is the canonical reference for using the top bits of a float as part of a packed sort key^[26]: "Floating point numbers have the nice property that if you interpret their bits as integers, larger numbers result in larger integers — within the same sign." A 10-bit slice of a positive float gives you 1024 depth buckets, which is enough granularity for any front-to-back or back-to-front render pass.

13Counting sort: when the key range is tiny

Counting sort is the inner pass radix sort wraps. Given an array of n keys, each in [0, k), sort in O(n + k): count occurrences of each value, prefix-sum the counts, scatter elements into output. When k = O(n) the cost is linear; when k is much larger than n the histogram-allocation cost dominates and counting sort loses to a comparison sort.

Use cases in engines:

Particle priority bucketing. A particle system with 4-bit priority gives 16 buckets; one counting pass groups particles per frame in O(n).
Material ID grouping. If material IDs are dense in [0, 256), counting sort by material ID puts every draw call of the same material together with one pass.
Histogram-based effects. Tone mapping's luminance histogram is a counting sort on log-luminance; the "sort" output is the prefix sums.

Java's Arrays.sort for byte, char, and short primitives uses counting sort, because the key range (256, 65536, 65536) is small enough that the histogram allocation is cheap^[12]. For int and longer types it switches to dual-pivot quicksort, because allocating a 4-billion-entry histogram for 32-bit ints is not free.

14Bitonic sort and the GPU

Comparison sorts and radix sort assume one cursor walking sequentially. A GPU has thousands of cursors and they cannot easily coordinate. The sorts that run well on GPUs are sorting networks, where which-position-compares-which is decided ahead of time. Ken Batcher's 1968 bitonic sort^[27] is the canonical example.

The structure: a bitonic sequence is one that monotonically increases then monotonically decreases. A bitonic sequence of length n can be sorted by n/2 compare-and-swap operations in parallel (the "bitonic merge"), producing two bitonic sequences of length n/2 each, which then recurse. Total operations: O(n log²n). Total depth (longest chain of dependent operations): O(log²n). On a GPU with enough threads, depth, not total work, determines wall time.

Bitonic sort's cost summary versus the CPU sorts:

Sort	Total work	Parallel depth	In place	Stable
quicksort (serial)	O(n log n)	O(n log n)	yes	no
mergesort (serial)	O(n log n)	O(n log n)	no	yes
radix sort (serial)	O(d·n)	O(d·n)	aux	yes
bitonic sort (parallel)	O(n log²n)	O(log²n)	yes	no
GPU radix sort	O(d·n)	O(d log n)	aux	yes

Bitonic sort does more total work than mergesort (n log²n vs n log n) but every step in a level is independent. On 4096 GPU threads sorting 4096 elements, bitonic finishes in 78 parallel steps; serial mergesort would need ~50,000. The work is wasted relative to the optimum; the wall time is not. NVIDIA's GPU Gems 2, chapter 46, describes the production version^[28]; Unreal's particle pipeline uses a bitonic merge sort in GPUSort.cpp to depth-sort transparent particles entirely on the GPU.

array size (power of 2)

total ops

···

parallel depth

···

on 8 threads

···

15Live race: every sort, side by side

Six sorts running on the same input, animated as bar charts. Click "go" and watch them race. Pick the input shape from the dropdown to see how each algorithm responds to sorted, reverse-sorted, partially-sorted, and few-unique inputs:

input shape

array size 64

speed 1.0×

insertion quicksort mergesort heapsort timsort radix

16Engine case: sorting draw calls

Every renderer sorts its draw calls before submitting them. The reasons are direct: GPU state changes (binding a new shader, new texture, new vertex buffer) cost real time, and the cost is per change, not per call. Five thousand draw calls in arbitrary order can mean five thousand shader swaps. Five thousand draw calls sorted by shader can mean ten. Sorting is the cheap way to recover most of the throughput a naive renderer leaves on the table.

The Ericson 2008 design pattern, still the standard fifteen years later^[1], packs everything that matters into a single 64-bit sort key:

shader id (13b)

material id (16b)

depth (24b)

seq (8b)

render target transparency bit shader id material id depth sequence number

The bit-order encodes the sort hierarchy. Render-target group first, then opaque-vs-transparent split (because their depth orders are opposite), then by shader, then by material, then by depth bucket. The trick: for opaque draws the depth is encoded as the float bits directly (front-to-back, which lets the early-Z reject more pixels); for transparent draws the depth is bit-inverted (back-to-front, required for correct alpha blending). One sort, two orderings.

Once the key is packed, the sort itself is a radix-sort on the integer. Unreal Engine ships RadixSort32 in its core library as a stable 32-bit radix sort designed for this purpose^[25]. Bevy evaluated swapping its comparison sort for radsort (a stable radix sort crate) in 2022 and reported the median sort_phase_system time on the many_cubes benchmark dropping from 0.728 ms to 0.094 ms^[2]. The result was workload-dependent: phases with many similar keys saw the linear-time radix win; phases with few items saw the radix sort's fixed setup cost lose to a tuned comparison sort. The lesson generalizes: radix-sort wins are most pronounced at scale, on densely-packed keys.

primary sort field

shader binds avoided

···

material binds avoided

···

sort cost (ns/key)

Bits to the left of the sort key dominate the order. "Render target first" guarantees you finish one pass before starting another; "shader second" minimizes pipeline swaps; "depth last" runs front-to-back within each material so the GPU's early-Z reject works hardest. Same draw calls, different bit-pack, different render time. The packing strategy is in FMeshDrawCommand in Unreal's renderer and in PhaseItem in Bevy.

17Engine case: sort-and-sweep broadphase

Collision detection's broadphase finds candidate pairs of objects whose AABBs overlap. The naive algorithm is n²; sweep-and-prune reduces it to roughly n log n per frame by sorting AABB endpoints along an axis and sweeping a moving "active list" across them^[29].

The sort here is interesting because the input is almost-sorted from frame to frame. Object positions change a little per frame; the endpoint list mostly preserves order from the previous frame. Insertion sort, which is O(n) on already-sorted input, beats every n log n sort on this workload. Production physics engines (Bullet, Havok, PhysX) use insertion sort exactly here for exactly this reason, sometimes specialized to update the endpoint structure incrementally rather than re-running the sort from scratch.

Some physics engines skip sort-and-sweep entirely. Box2D's modern broadphase uses a b2DynamicTree, a dynamically rebalanced AABB tree, with rotations chosen by the same surface-area heuristic that BVH builders use^[30]. The ordering work doesn't disappear, it moves inside the data structure; the tree's incremental rebalancing fills the role of an explicit per-frame sort.

18Picking the right sort

The decision table for "which sort do I reach for":

Workload	Right answer	Why
Generic C++, types with cheap compare	`std::sort` (introsort / pdqsort)	Fast on average, n log n worst case, in place, library-provided.
Generic C++, types with expensive compare	pdqsort or timsort	Fewer comparisons; timsort if stability matters, pdqsort if it doesn't.
Generic, must be stable	`std::stable_sort` (timsort or merge)	Stability needs explicit handling; the unstable sorts don't preserve equal-key order.
Real data, often partially sorted	timsort (Python, Java, V8) or powersort	Adaptive: O(n) on sorted prefixes, run-detection avoids redundant work.
Bounded integer keys, n > 100	radix sort (LSD)	Linear time, no comparisons, parallelizes cleanly. Engine renderer default.
Tiny integer range (k = O(1))	counting sort	One pass, no comparisons. Java does this for byte/char/short.
n < 32, any keys	insertion sort	Recursive sorts pay more overhead than the inner-loop work at this scale.
Worst-case latency budget (real-time)	heapsort or introsort	Guaranteed n log n; quicksort alone can blow the budget on bad input.
Data on GPU	bitonic sort or GPU radix	Both fit GPU parallelism. Radix wins for fixed-width keys; bitonic for arbitrary compare.
Data on disk, larger than RAM	external mergesort	Sequential I/O, log-many-merge, the textbook external-sort algorithm.
Strings of varied length	MSD radix or three-way radix quicksort	Avoids re-scanning prefixes; O(N) where N is total character count.

The most common production mistake is "I'll just call std::sort." It is right roughly 80% of the time. The 20% where it is wrong are the cases that show up in profilers as a percent or two of frame time that nobody notices until they look. The case-study sections above are the canonical wins; the table covers the rest.

19Pitfalls

Sort bugs in production code, ranked by frequency:

Non-strict weak ordering. std::sort requires a comparator that is a strict weak order: irreflexive (!(a < a)), antisymmetric (a < b ⇒ !(b < a)), transitive (a < b ∧ b < c ⇒ a < c), and the equivalence relation derived from "neither is less than the other" must be transitive too. Violations are undefined behavior, including infinite loops and segfaults on libstdc++. A common bug: comparing floats where one might be NaN. NaN < anything is false, so NaN breaks the partition.
Unstable when you needed stable. std::sort is allowed to reorder equal elements; std::stable_sort is not. Multi-pass sorts (sort by depth, then by material) only work if both passes are stable. The bug shows up as occasional flickering on sprites with the same material and similar depth.
Comparator state. Comparators with mutable state (a counter, a cache) violate the assumption that the order is deterministic. The sort may call the comparator more than once on the same pair, or in arbitrary order. Stateful comparators that produce inconsistent results crash; ones that produce consistent results but mutate are surprisingly hard to detect.
Integer overflow in the comparator. Writing return a - b; as a comparator works for floats and bounded ints; on 32-bit ints with both very large and very negative values it overflows and produces wrong-sign results. Write return (a > b) - (a < b); or return a < b ? -1 : (b < a ? 1 : 0);.
Sorting a container that mutates during the sort. Iterators invalidate when the underlying container resizes. The sort can corrupt memory if a comparator triggers a container modification. Common when comparators call user code that triggers asset loads.
The 2003 Java binary-search overflow. Joshua Bloch found that the canonical (lo + hi) / 2 in Java's stdlib binary search overflowed for arrays larger than 2³⁰^[17]. The same bug appears in any mergesort midpoint that uses the naive formula. Use lo + (hi − lo) / 2.
Allocating a merge buffer in the inner loop. A mergesort that new[]s its scratch buffer for every recursive call costs more in allocation than it saves in algorithmic work. Allocate once at the top.
Sorting pointers when the indirection costs more than the sort. A sort that compares a->material against b->material dereferences twice per compare. On n = 10,000 with a 30 ns dereference, that is 300 µs in pointer chasing per pass. Pack the sort key into the array directly (the draw-call key trick).

20What's next

The natural follow-ups, in roughly increasing depth:

Selection algorithms. "Find the k-th smallest" without sorting the whole array. Quickselect is to quicksort what binary search is to a sorted lookup. Median-of-medians (BFPRT 1973) gives a deterministic linear-time selection. The introselect fallback in Musser 1997 is to selection what introsort is to sorting.
External-memory sorting. Data larger than RAM. k-way merge sort plus a tape-or-disk layout. The classic algorithm, but the modern version uses NVMe SSDs and the cost model (bandwidth instead of seek time) flips most of the choices.
Parallel and concurrent sorts. Multi-way mergesort with work stealing, parallel quicksort, GPU radix sort. The work-span model (PRAM, Cilk). Our job systems tutorial introduces the runtime; the parallel-sort algorithms layer on top.
Top-k and partial sort. When you don't need the whole order, just the smallest k. std::partial_sort uses a heap-based variant; for very large k, a sample-and-sort approach is faster.
Sorting under adversarial input. Cryptographic-quality hashing of comparison keys to defeat hash-flooding attacks. The 2011 PHP / Java / Python hash-collision DOS was an attack on sort-equivalent data structures.
Cache-oblivious sorting. Frigo and Leiserson's O(N/B · log_M/B(N/B)) bound for sort over memory-transfer cost, not CPU operations^[31]. The right model for sorting really large arrays.

For the engine programmer, the most useful next step is empirical: profile your renderer's draw-call sort. The bug is probably std::sort on a pointer-and-string comparator. Pack a 32-bit key, swap it for a radix sort, watch the millisecond come back.

Sources

Christer Ericson. "Order your graphics draw calls around!" realtimecollisiondetection.net blog, October 2008. realtimecollisiondetection.net. The canonical reference for bit-packed draw-call sort keys; covers the render-target / opaque-transparent / shader / material / depth bit-order pattern.
Bevy Engine. "Use radix sort for sort phase and sprite sorting." GitHub issue #4291, 2022. github.com/bevyengine/bevy/issues/4291. Source for the 0.728 ms → 0.094 ms median-sort-time figure on the many_cubes benchmark and the radsort migration discussion.
Peter Norvig. "Teach Yourself Programming in Ten Years," "Approximate timing for various operations" table. norvig.com/21-days.html. Memory-hierarchy latency ballparks used to estimate per-comparison cost: L1 ~0.5 ns, branch mispredict ~5 ns, DRAM ~100 ns.
Donald E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching, 2nd ed., Addison-Wesley, 1998. Chapters 5.1 to 5.3 cover the history of mergesort, the analysis of comparison-based sorting, and the lower-bound proof.
C. A. R. Hoare. "Quicksort." The Computer Journal 5(1):10–16, 1962. comjnl 5.1.10. The original quicksort paper. Hoare describes the algorithm he wrote at Moscow State University in 1959 while working on machine translation.
J. W. J. Williams. "Algorithm 232 - Heapsort." Communications of the ACM 7(6):347–348, 1964. dl.acm.org/doi/10.1145/512274.512284. The original heapsort paper; also introduces the binary heap as a standalone data structure.
Robert W. Floyd. "Algorithm 245 - Treesort 3." Communications of the ACM 7(12):701, 1964. The bottom-up heap-construction algorithm that brings heapify from O(n log n) to O(n).
Jon L. Bentley, M. Douglas McIlroy. "Engineering a Sort Function." Software: Practice and Experience 23(11):1249–1265, November 1993. PDF. The paper most BSD qsort implementations descend from; introduces pseudomedian-of-nine and the three-way partition for the Dutch-flag problem.
David R. Musser. "Introspective Sorting and Selection Algorithms." Software: Practice and Experience 27(8):983–993, August 1997. PDF. The introsort paper. The 2 log₂n depth budget and the killer-sequence experiment (1/200 of median-of-three quicksort's time) are both here.
LLVM Project. "libcxx std::sort." Implementation in libcxx/include/__algorithm/sort.h. github.com/llvm/llvm-project. Source for the insertion-sort cutoff sizes (30 for trivial swaps, 6 otherwise) and the depth-budget constant.
Tim Peters. listsort.txt, the timsort design document. CPython repository, Objects/listsort.txt. github.com/python/cpython. Source for the minrun-in-[32, 64] choice, the merge-stack Fibonacci-style invariants, galloping mode, and the empirical motivation (real-world inputs are not random).
Vladimir Yaroslavskiy. "Dual-Pivot Quicksort algorithm." Proposal to the OpenJDK core-libs mailing list, 2009. PDF (archived). The dual-pivot quicksort that became java.util.Arrays.sort for primitive types in Java 7.
Bernhard Beckert, Jonas Schiffl, Peter H. Schmitt, Mattias Ulbrich. "Proving JDK's Dual Pivot Quicksort Correct." Verified Software: Theories, Tools, and Experiments (VSTTE), 2017. PDF. The KeY-based formal verification of Yaroslavskiy's algorithm; turns "we believe this works" into a machine-checkable proof.
Stijn de Gouw, Jurriaan Rot, Frank S. de Boer, Richard Bubel, Reiner Hähnle. "OpenJDK's java.utils.Collection.sort() is broken: The Good, the Bad and the Worst Case." Computer Aided Verification (CAV), 2015. The discovery of the 2002–2015 timsort stack-invariant bug; Python 3.4.4 and OpenJDK both patched as a result. envisage-project.eu summary.
J. Ian Munro, Sebastian Wild. "Nearly-Optimal Mergesorts: Fast, Practical Sorting Methods That Optimally Adapt to Existing Runs." European Symposium on Algorithms (ESA), 2018. arXiv:1805.04154. The powersort paper. Adopted by CPython 3.11, PyPy, AssemblyScript, and Apple's WebKit.
Orson Peters. "Pattern-defeating Quicksort." arXiv preprint, June 2021. arXiv:2106.05123. The pdqsort paper. Now the default unstable sort in Rust (slice::sort_unstable) and in Boost.Sort.
Joshua Bloch. "Nearly All Binary Searches and Mergesorts are Broken." Google Research Blog, June 2006. ai.googleblog.com. The discovery that (lo + hi) / 2 overflows for arrays larger than 2³⁰ and the fix that propagated through every standard library.
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms, 4th ed., MIT Press, 2022, chapter 8 "Sorting in Linear Time" and chapter 8.1 "Lower bounds for sorting." The decision-tree lower-bound proof in canonical form.
Lester R. Ford, Selmer M. Johnson. "A Tournament Problem." The American Mathematical Monthly 66(5):387–389, May 1959. The merge-insertion sort that achieves the information-theoretic minimum number of comparisons for small n.
M. Douglas McIlroy. "A Killer Adversary for Quicksort." Software: Practice and Experience 29(4):341–344, April 1999. PDF. Shows that for any deterministic pivot rule, an adversary can construct input forcing quadratic time. The motivation for the introsort fallback.
Stefan Edelkamp, Armin Weiß. "BlockQuicksort: Avoiding Branch Mispredictions in Quicksort." European Symposium on Algorithms (ESA), 2016. arXiv:1604.06697. The block-partition technique that pdqsort builds on; demonstrates 2× to 3× speedup from branchless partitioning on integer arrays.
Danila Kutenin. "Changing std::sort at Google's Scale and Beyond." danlark.org blog, April 2022. danlark.org. Source for the ~50% speedup figure on integer arrays from swapping libc++'s introsort for pdqsort across Google's codebase.
CPython issue gh-78742 / Pull Request #28108. "Use Powersort merge strategy." Author: Tim Peters; merged September 6, 2021; shipped in Python 3.11 (October 2022). github.com/python/cpython/pull/28108.
Kurt Mehlhorn. "A best possible bound for the weighted path length of binary search trees." SIAM Journal on Computing 6(2):235–239, 1977. The nearly-optimal binary-search-tree construction whose principle powersort applies to merge ordering.
Epic Games. "RadixSort32." Unreal Engine documentation. docs.unrealengine.com. The 32-bit radix sort that ships in Unreal's core for draw-call sorting and similar use cases.
Aras Pranckevičius. "Rough sorting by depth." aras-p.info blog, January 2014. aras-p.info. The reference for using the top bits of a float-as-int as a radix-sortable depth field.
Kenneth E. Batcher. "Sorting networks and their applications." Proc. AFIPS Spring Joint Computer Conference, 1968, pp. 307–314. PDF. The original bitonic sort paper and the foundational sorting-network reference.
Peter Kipfer, Rüdiger Westermann. "Improved GPU Sorting." GPU Gems 2, chapter 46, NVIDIA, 2005. developer.nvidia.com. Production GPU bitonic sort implementation with cache-tiling and merge-step optimizations.
Jonathan D. Cohen, Ming C. Lin, Dinesh Manocha, Madhav K. Ponamgi. "I-COLLIDE: An Interactive and Exact Collision Detection System for Large-Scale Environments." Proc. ACM Interactive 3D Graphics, 1995. The original sweep-and-prune (sort-and-sweep) paper.
Ingo Wald, Solomon Boulos, Peter Shirley. "Ray Tracing Deformable Scenes Using Dynamic Bounding Volume Hierarchies." ACM Transactions on Graphics 26(1), 2007. The surface-area heuristic for BVH construction; the same heuristic drives the tree-rebalancing pass in physics broadphase trees.
Matteo Frigo, Charles E. Leiserson, Harald Prokop, Sridhar Ramachandran. "Cache-Oblivious Algorithms." Proc. 40th IEEE FOCS, 1999. PDF. The cache-oblivious sort bound, O(N/B · log_M/B(N/B)), that becomes the right complexity model for very large arrays.

01Why a sort, not a stopwatch

The cost of the wrong sort

02A short history of sorting

03Why comparison sorts can't beat n log n

04Insertion sort, the small-n champion

05Quicksort: partition, recurse, pray

The partition

The pivot problem

06Mergesort, the stability champion

07Heapsort, the worst-case insurance policy

08Introsort: quicksort with a parachute

Why 2 log2n?

09Timsort, the natural-runs mergesort

The 2015 invariant bug

10pdqsort, the branchless modern quicksort

11Powersort: timsort with a proof

12Radix sort: linear time, no comparisons

Sorting floats with a radix sort

13Counting sort: when the key range is tiny

14Bitonic sort and the GPU

15Live race: every sort, side by side

16Engine case: sorting draw calls

17Engine case: sort-and-sweep broadphase

18Picking the right sort

19Pitfalls

20What's next

Sources

See also

Why 2 log₂n?