Build a Game Engine · Resources

Data Compression for Games

Compression shrinks downloads and can make loads faster, but only as a trade: CPU cycles for bytes moved. We build the ideas up from entropy through LZ77 and Huffman to the modern codecs, then make the case that for load times decode speed, not ratio, is the number that matters. And we keep general-purpose compression strictly separate from GPU texture compression, because conflating them is the classic mistake.

Time~55 min LevelMid PrereqsThe Asset Pipeline (packages) and File Streaming. Bit Shifting helps. StackC++ & Rust

01Why compress

Three distinct payoffs, and they are not the same goal: smaller download and package size (storefront and patch budgets), higher effective load bandwidth (read fewer bytes off disk and inflate them in RAM), and as the mechanism behind both, a trade of CPU for I/O. That trade only pays when the device is I/O-bound and the decoder keeps up.

The split that runs through this whole tutorial

There are two different worlds here, and blurring them is the senior-reviewer trap. General-purpose lossless compression (zstd, LZ4, DEFLATE, Oodle Kraken) shrinks meshes, audio data, levels, and the package itself; it is decompressed to RAM before use. (BCn, ASTC) is lossy, fixed-ratio, and stays compressed in VRAM where the GPU samples it directly. Same word, opposite constraints. We build the general-purpose story first and fence the texture story off in §8.

And compression does not always speed up loads. On a fast NVMe with a slow, high-ratio codec, the decoder becomes the bottleneck and loading gets slower. That is exactly why decode speed gets its own section (§6).

02What compression can't do

Shannon's source-coding theorem sets a hard floor: no lossless coder can, on average over a source, use fewer bits per symbol than the source's ^[1].

The bound is on the average over the distribution, not any single message.
High-entropy (incompressible) data, already-compressed or encrypted bytes, cannot be losslessly shrunk, and by the pigeonhole principle any coder that shrinks some inputs must make others larger. "You can't compress random data" is the loose version of this.
Lossy compression (audio, textures) sidesteps the bound by discarding information. That's a different game; the entropy floor is about lossless.

Run-length encoding is the warm-up: replace runs of a repeated byte with (count, value). It shines on long uniform runs (masks, indexed art) and expands data with no runs (encoding ABCD as 1A1B1C1D doubles it), the simplest proof that compression can make things bigger.

03LZ77: the sliding window

Real data repeats, just not in fixed-length runs. (Ziv and Lempel, 1977) encodes the input as a mix of literals and back-references: a (distance, length) pair that copies a substring already seen, from a sliding window of recent output^[2].

A back-reference points into already-decoded output (distance back, length to copy), not an external dictionary or the future.
Matches may overlap the current position (a length greater than the distance is legal, and is how runs get encoded).
LZ77 is a family, not one format: zstd, LZ4, DEFLATE, Brotli, and Oodle Kraken are all LZ77-based but mutually incompatible. The window is bounded; the size is a ratio-vs-memory knob.

The window slides as the encoder scans. Repetitive text finds long matches; feed it high-entropy text and the matches vanish:

A teaching LZ77 encoder (not production)

struct Token { bool isMatch; uint8_t literal; int distance; int length; };

std::vector<Token> lz77Encode(const std::vector<uint8_t>& input, int windowSize, int minMatch) {
    std::vector<Token> tokens;
    int position = 0;
    while (position < (int)input.size()) {
        int bestLength = 0, bestDistance = 0;
        int windowStart = std::max(0, position - windowSize);   // the window of seen output
        for (int candidate = windowStart; candidate < position; ++candidate) {
            int length = 0;                                   // match may overlap position (length > distance is legal)
            while (position + length < (int)input.size() &&
                   input[candidate + length] == input[position + length]) ++length;
            if (length > bestLength) { bestLength = length; bestDistance = position - candidate; }
        }
        if (bestLength >= minMatch) { tokens.push_back({true, 0, bestDistance, bestLength}); position += bestLength; }
        else                       { tokens.push_back({false, input[position], 0, 0}); position += 1; }
    }
    return tokens;
}

struct Token { is_match: bool, literal: u8, distance: usize, length: usize }

fn lz77_encode(input: &[u8], window_size: usize, min_match: usize) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut position = 0;
    while position < input.len() {
        let (mut best_length, mut best_distance) = (0, 0);
        let window_start = position.saturating_sub(window_size);    // the window of seen output
        for candidate in window_start..position {
            let mut length = 0;                                  // overlapping matches (length > distance) are legal
            while position + length < input.len()
                && input[candidate + length] == input[position + length] { length += 1; }
            if length > best_length { best_length = length; best_distance = position - candidate; }
        }
        if best_length >= min_match { tokens.push(Token { is_match: true, literal: 0, distance: best_distance, length: best_length }); position += best_length; }
        else { tokens.push(Token { is_match: false, literal: input[position], distance: 0, length: 0 }); position += 1; }
    }
    tokens
}

What's intentionally missing

This finder is O(n·window) naive scanning; real codecs use hash chains, binary trees, or suffix automata, plus lazy matching. It emits no actual bitstream (no LZSS literal/match flag bits, no following entropy stage), ignores match-length caps, and isn't decode-safe or interoperable. It exists only to make the mechanism visible.

04Entropy coding

LZ77 produces a stream of tokens; an entropy coder then packs those symbols near their information content. Huffman (1952) builds an optimal : frequent symbols get shorter codes^[3].

Huffman is optimal only among whole-bit codes

It must spend an integer number of bits per symbol, so a symbol with probability 0.9 (which "wants" about 0.15 bits) still costs a full bit. Arithmetic coding, range coding, and modern ANS (asymmetric numeral systems, Duda 2013) close that fractional-bit gap, which is why ANS is an entropy back-end in zstd^[4]. And entropy coding composes with LZ; it isn't an alternative to it. DEFLATE and zstd both run LZ first, then an entropy coder.

Set the symbol frequencies and the tree rebuilds bottom-up, rarest symbols pushed deepest:

05DEFLATE and the modern codecs

DEFLATE (RFC 1951) is the classic combination: LZ77 plus Huffman, with a 32 KB window. It's the algorithm inside zlib, gzip, and PNG^[5]. (zlib and gzip are container wrappers around the raw DEFLATE stream; the 32 KB window is a DEFLATE limit, not a law of LZ.)

The modern general-purpose codecs sit at different points on a three-axis tradeoff of ratio, speed, and memory:

LZ4: very fast, lower ratio; decode can approach memory-bandwidth limits^[7].
zstd: tunable from negative "fast" levels up to level 22, with an FSE (a tANS variant) entropy coder plus Huffman for literals^[6].
Brotli: higher ratio, slower, with a static dictionary tuned for web text specifically^[8].

Calling a real codec (zstd, one-shot)

// Size the destination with the codec's bound, then ALWAYS check the result.
std::vector<char> dst(ZSTD_compressBound(src.size()));
size_t n = ZSTD_compress(dst.data(), dst.size(), src.data(), src.size(), 3);  // level 3
if (ZSTD_isError(n)) { /* handle */ }
dst.resize(n);
// Decode: get the size from the frame header, then ZSTD_decompress.

// The `zstd` crate wraps libzstd; encode_all/decode_all are the one-shot helpers.
use zstd::stream::{encode_all, decode_all};
let compressed = encode_all(&source[..], 3)?;   // level 3 (negative levels go faster)
let restored   = decode_all(&compressed[..])?;  // size read from the frame

The chart plots ratio against decode throughput. Drag the zstd level; switch the data to already-compressed and every ratio collapses:

06Decode speed sets load time

For load times, decode throughput, not ratio, often sets the floor. The arithmetic is the argument.

Worked example

A 1 GB asset set compresses 2:1 to 500 MB. An NVMe at 5 GB/s delivers those 500 MB in 0.1 s. A decoder that emits 400 MB/s of decompressed output (the convention benchmarks quote) needs 2.5 s to rebuild the gigabyte. The decoder is now the bottleneck, and a higher-ratio-but-slower codec would make it worse, not better. This is why LZ4 and zstd are tuned for decode speed, and why a codec optimal for download size can be the wrong choice for load latency.

Two API shapes matter here: block (whole buffer at once, output size known up front) and streaming (bounded memory, data arrives incrementally). Random access into a large compressed asset needs seekable framing (independent blocks); a single monolithic stream forces decode-from-the-start, which is why the packaging step lays assets out in independently decodable chunks.

07Oodle & console hardware

The games industry leaned into the decode-speed argument. RAD's Oodle suite (the Kraken codec and relatives) is a high-ratio-yet-fast LZ-family compressor distributed with Unreal Engine^[10], built on the premise that the decoder should never be the load bottleneck^[9].

Consoles took it further into silicon. The PlayStation 5's SSD runs at a raw 5.5 GB/s, and a dedicated hardware decompression unit in its I/O complex decodes Kraken transparently at load (Mark Cerny called it "zlib's smarter cousin"), yielding roughly 8 to 9 GB/s of decompressed output^[11].

Two things to state carefully

The widely-quoted "equivalent to 9 Zen 2 cores" line is a talk framing that RAD's own engineers have called out of context, so treat it as a soundbite, not a measured equivalence^[9]. And Oodle Texture is a separate product from Oodle Kraken: it's an encoder that makes BCn texture data compress better under a following LZ pass, not itself a GPU texture format. Which is the bridge to the last section.

08GPU texture compression

This is the other world. BCn and ASTC are lossy, fixed-rate, block-based formats. Each fixed-size texel block (4×4 for BCn) encodes to a fixed number of bits, and the decisive property is that the texture stays compressed in VRAM and the GPU's texture units sample it directly, decoding just the block they touch^[12]^[13].

Fixed rate plus block layout is what enables that random access: to sample a texel, the GPU finds its 4×4 block and decodes only that block in the sampler. A variable-rate format like DEFLATE's output can't address an arbitrary texel in O(1), which is exactly why general-purpose codecs can't serve as a runtime texture format^[14].

General-purpose (zstd/Kraken)GPU texture (BCn/ASTC)

When decodedfully, to RAM, before useper-block, in the sampler, on use

Ratiovariable (content-dependent)fixed by format

Accesssequential / blockrandom, by texel

Lossy?losslesslossy

They compose, and they interact with zero-copy

You ship BCn textures and pack the whole package with a lossless codec for download size (Oodle Texture exists to make that second pass work better). And recall the zero-copy point from the Asset Pipeline: a BCn blob is already in a GPU-consumable layout and can be uploaded largely as-is, but a zstd-compressed asset cannot be memory-mapped and used in place, you must decode it into a separate buffer first.

Wrong answers, and why: when the decoder is the bottleneck a higher-ratio/slower codec hurts; the GPU can't sample a general-purpose stream at all; and the texture-format distinction is about fixed-rate random access, not which one is lossy.

09Pitfalls

Compression slowed the loadFast disk, slow high-ratio codec: the decoder is the bottleneck. Pick for decode speed.

"Compress the already-compressed"Re-compressing JPEG/encrypted/random data wastes CPU; ratio is ~1.0 (entropy floor).

zstd for sampled texturesThe GPU can't sample a general-purpose stream; use BCn/ASTC for runtime textures.

Ranking codecs on ratio aloneRatio and decode speed are different axes; LZ4 wins loads it loses on size.

zlib ≠ DEFLATE ≠ gzipDEFLATE is the stream; zlib and gzip are container wrappers with headers/checksums.

One monolithic compressed blobNo seekable framing means no random access; decode must start from the beginning.

"Texture ratio depends on the image"BCn/ASTC rate is fixed by format; the content affects quality, not size.

10What's next

Resources are handled: streamed, cooked, referenced, packaged, and compressed. Now the engine can finally draw. The next phase is rendering, starting with The GPU & Graphics Pipeline and then your first triangle in Vulkan, which consumes the cooked meshes and the BCn textures this phase produced. The full path is on the series hub.

Claude E. Shannon. "A Mathematical Theory of Communication." Bell System Technical Journal, 1948. overview. The source-coding theorem: average code length is lower-bounded by entropy.
Jacob Ziv and Abraham Lempel. "A Universal Algorithm for Sequential Data Compression." IEEE Trans. Information Theory 23(3), 1977. ethw.org. The sliding-window back-reference scheme.
David A. Huffman. "A Method for the Construction of Minimum-Redundancy Codes." Proc. IRE 40(9), 1952. overview. Optimal integer-length prefix codes.
Jarosław Duda. "Asymmetric numeral systems." arXiv:1311.2540, 2013. arxiv.org. Arithmetic-class ratio at table-lookup speed; the entropy back-end in zstd.
L. Peter Deutsch. "DEFLATE Compressed Data Format Specification," RFC 1951, 1996. rfc-editor.org. DEFLATE = LZ77 + Huffman; the 32 KB window; the three block types.
Yann Collet and Murray Kucherawy. "Zstandard Compression," RFC 8878, 2021; and the zstd project site. Tunable levels (incl. negative), the benchmark table, the ANS back-end.
Yann Collet. LZ4. github.com/lz4/lz4. Very fast LZ77-family codec; decode approaching RAM-bandwidth limits; the safe decoder API.
Jyrki Alakuijala and Zoltán Szabadka. "Brotli Compressed Data Format," RFC 7932, 2016. rfc-editor.org. LZ77 + Huffman + context modeling and a static web dictionary.
Charles Bloom (RAD Game Tools). "How Oodle Kraken and Oodle Texture supercharge the IO system of the PS5." cbloomrants.blogspot.com. The decoder-never-the-bottleneck design; the comment thread where the "9 Zen 2 cores" framing is disputed.
RAD Game Tools / Epic. Oodle, and its inclusion in Unreal Engine. radgametools.com. The Kraken family; cross-platform, smaller downloads and faster loads.
Mark Cerny. "The Road to PS5" (2020), as reported by VentureBeat and Digital Foundry. venturebeat.com. 5.5 GB/s SSD; a dedicated I/O-complex decompressor for Kraken; ~8 to 9 GB/s decompressed.
The Khronos Group. Khronos Data Format Specification (BC1 through BC7 / BPTC block decoding). registry.khronos.org. Fixed-size 4×4 blocks, 64/128-bit encodings, lossy.
"S3 Texture Compression." Wikipedia. en.wikipedia.org. Fixed-rate, random pixel access, stays compressed in VRAM; the DXT ratios.
The Khronos Group / Arm. Adaptive Scalable Texture Compression (ASTC). overview. 128-bit blocks, selectable block sizes (4×4 to 12×12), per-block adaptive, lossy.