File Streaming
for Game Engines
A working file streamer of the kind that lets Spider-Man swing across Manhattan and Nanite render a quarry of one-billion-triangle statues without a loading screen. We start with one blocking read() and work up to an async, GPU-decompressing, priority-scheduling, sparse-virtual-texturing pipeline. No engine, no framework: just the data structures, the kernel APIs, and live demos.
01 Why a game engine needs a streamer
The PS5 ships with 16 GB of unified memory. Marvel's Spider-Man 2 is around 100 GB on disk. Cyberpunk 2077 is 70 GB. Call of Duty's installed footprint regularly crosses 200 GB. That's the central problem: the world the player walks through is five to ten times larger than the RAM it has to live in. A streamer is the system that hides that fact. It pages assets in as the player approaches them, pages them out when they're no longer visible, and does it fast enough that the player never sees the seams.
The job system (see the previous tutorial) was about using every CPU core. The streamer is about using every byte of storage bandwidth without melting the frame budget. It's the second hard system every shipping engine has to solve, and it's the one the player actually notices when you get it wrong. "Loading…" screens, texture pop-in, geometry pop-in, audio that cuts in two seconds late: those are all streaming bugs.
A ~250-line asynchronous streamer with priority scheduling, LRU residency, and a CPU decompression hook, in C++ and Rust. A JavaScript port runs live on this page, drives a player-through-a-world simulator, and lets you turn knobs on cache size, prefetch radius, and I/O bandwidth in a browser playground. By the end you'll know why DirectStorage exists, why Spider-Man's swing speed determines a tile size, why Nanite pages geometry like textures have been paged for fifteen years, and what "BypassIO" actually skips.
The fixed-budget reality
Three constraints make this hard.
- The world is bigger than RAM. By a lot. A modern open-world game maintains a hundred gigabytes of textures, meshes, audio, animation, navigation data, and AI scripts. Eight to sixteen of those gigabytes are available at runtime; the rest live on storage and have to come in on demand.
- Storage is slower than RAM. By a lot. A read from L1 cache takes around 4 cycles. A read from RAM takes around 300. A read from an NVMe SSD takes 70 to 150 µs, which is hundreds of thousands of cycles at typical clock speeds. The streamer's whole job is to schedule those reads so the player is never waiting for one.
- The frame budget is 16.6 ms. Whatever the streamer does (block, allocate, decompress, copy, bind) has to fit in the slack alongside rendering, physics, audio, and gameplay. Hitches in the streamer are visible as hitches in the game.
The widget below shows the central tradeoff. On the left, a naïve game that loads a whole region before the player can enter it: the level-based design every PS1-era game shipped with. On the right, a streaming game that pages content in as the player moves through the world. Wait time and peak memory go in opposite directions:
02 A short history of getting fast at I/O
Streaming has been reinvented in every console generation, each time against tighter constraints. A short tour, because the current shape of the problem only makes sense in the context of the constraints it inherited:
EnqueueRequests for batched submission with D3D12 fence synchronization and lets a single request span a range of mip subresources. The boring-looking version that actually makes the API usable.
The recurring pattern: the data layout is from 2008, the data-parallel codec from 2022, the recognition that storage is the bottleneck from 2011 onward. The piece that keeps changing is which layer the platform has acknowledged. By 2026 the OS, the driver, and the GPU all treat streaming as a first-class workload. Five years ago they did not.
03 The storage hierarchy and the cost of a read
Every conversation about streaming has to start with where the data actually lives, because the gap between "in L1" and "on a spinning disk" is about seven orders of magnitude (roughly 1 ns versus roughly 10 ms). Engineers who skip this part end up shipping code that's correct on the dev box and unplayable on the bottom-of-spec laptop.
| Tier | Typical latency | Bandwidth | Relative time to read 64 KB |
|---|---|---|---|
| L1 cache | ~1 ns | ~1 TB/s | |
| L3 cache | ~10 ns | ~500 GB/s | |
| DDR5 RAM | ~80 ns | ~70 GB/s | |
| NVMe Gen 5 | ~50 µs | ~14 GB/s | |
| NVMe Gen 4 | ~70 µs | ~7 GB/s | |
| SATA SSD | ~150 µs | ~550 MB/s | |
| HDD (sequential) | ~10 ms (first seek) | ~150 MB/s | |
| HDD (random 4 KB) | ~13 ms per IO | ~1 MB/s effective | |
Sources: NVMe Gen 4 sequential throughput matches the Samsung 990 Pro datasheet (7,450 MB/s sequential read, 1.4M IOPS at QD32/16T)[13]; Gen 5 throughput matches the Crucial T705 review (~14.5 GB/s sequential)[14]. Jeff Dean's "Numbers Every Programmer Should Know"[15] is the canonical citation for the cache/RAM/SSD latency rule-of-thumb table.
Two numbers matter independently: latency (how long any single read takes) and bandwidth (how much data per second you can sustain). They are not the same. A modern NVMe can serve 7 GB/s of sequential reads, but the first 4 KB still takes 70 microseconds; if you're scattering reads, latency dominates and you'll never see the bandwidth number on the box. Designing a streamer is mostly figuring out how to coalesce reads so the bandwidth number is the one that matters.
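The latency/bandwidth split can be made concrete with a first-order cost model: one read costs a fixed latency plus its size divided by the sustained bandwidth. The sketch below is only that, a rough model that ignores queueing, caching, and command overhead. `DeviceModel`, `readSeconds`, and `chunkedReadSeconds` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical first-order device model: a fixed per-request latency plus
// a sustained transfer rate. Real drives also have queueing and caching
// effects that this deliberately ignores.
struct DeviceModel {
    double latencySeconds;        // time before the first byte arrives
    double bandwidthBytesPerSec;  // sustained sequential throughput
};

// Time for ONE read of byteCount bytes issued at queue depth 1.
double readSeconds(const DeviceModel& device, std::uint64_t byteCount) {
    assert(device.bandwidthBytesPerSec > 0.0);
    return device.latencySeconds +
           static_cast<double>(byteCount) / device.bandwidthBytesPerSec;
}

// Time to fetch totalBytes as back-to-back blocking reads of chunkBytes each.
double chunkedReadSeconds(const DeviceModel& device,
                          std::uint64_t totalBytes,
                          std::uint64_t chunkBytes) {
    assert(chunkBytes > 0);
    const std::uint64_t fullChunks = totalBytes / chunkBytes;
    const std::uint64_t tailBytes = totalBytes % chunkBytes;
    double seconds =
        static_cast<double>(fullChunks) * readSeconds(device, chunkBytes);
    if (tailBytes > 0) {
        seconds += readSeconds(device, tailBytes);
    }
    return seconds;
}
```

Plugging in the NVMe Gen 4 row (`DeviceModel{70e-6, 7e9}`), fetching 64 MiB as blocking 4 KiB reads takes about 1.16 s under this model, while a single 64 MiB read takes just under 10 ms. The data is the same; only the number of times you pay the latency changes.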
Refresher: what's a "sequential" vs "random" read, anyway?
A sequential read asks the storage device for a contiguous range of bytes ("give me 64 KB starting at offset 0x10000"). A random read asks for many small ranges scattered across the device ("give me these forty 4-KB blocks at unrelated offsets").
On an HDD the gap is dramatic because the head has to physically move between unrelated offsets. Each seek costs ~10 ms. Forty seeks at 10 ms each is 400 ms, almost half a second, in which you'd have moved just 160 KB of data. The same 160 KB read sequentially is one seek plus one millisecond of transfer.
On an SSD there's no head, but there's still a controller, and small reads still pay command-processing overhead. The gap is much smaller (5-10x) but it's there. The implication for streaming: bigger, fewer reads beat smaller, more reads, even when the data is the same.
The IOPS (I/O operations per second) and bandwidth columns of an SSD spec sheet are different stories. A Samsung 990 Pro is rated for 7.45 GB/s sequential read and 1.4 million random 4-KB IOPS[13]. 1.4M × 4 KB = 5.6 GB/s, so even the random number is bandwidth-bound at high queue depth, but only at queue depth 32 or higher. Queue depth is how many reads are in flight at once: how many you've handed to the drive that it hasn't finished yet. NVMe controllers run several NAND channels in parallel internally, so QD32 means there's enough work waiting that every channel stays busy. At queue depth 1 you submit one read, wait for it to come back, then submit the next, so the drive sees one request at a time and most of that internal parallelism does nothing. You're back to ~150 µs per IO, which is ~26 MB/s. The streamer's job is to keep the queue depth high.
The widget below races a fixed payload of 320 MB across four storage classes. Toggle the access pattern between sequential and random; the HDD's random column collapses to a sliver:
"NVMe is fast" is true but useless. You don't get 7 GB/s unless you submit large reads at high queue depth. A naïve loop that reads one 4-KB block at a time, blocking each time, will see 20-50 MB/s on the same drive. Most of the work in the rest of this tutorial is about keeping the queue full.
04 The naïve loader
Start with the simplest thing that could possibly work. A worker thread, a queue of pending reads, a blocking read() for each one. The caller submits a request and gets back a future; the worker dequeues, reads, and signals.
```cpp
#include <condition_variable>
#include <cstdint>
#include <future>
#include <mutex>
#include <queue>
#include <stdexcept>
#include <thread>
#include <unistd.h>  // pread()

// One worker thread, a queue of read requests, a blocking pread()
// for each one. The caller gets back a future it can wait on.
struct ReadRequest {
    int fileDescriptor;   // already-opened file (returned by open())
    int64_t byteOffset;   // where in the file to start reading
    size_t byteCount;     // how many bytes to read
    void* destination;    // caller-provided buffer to fill
    std::promise<void> completionPromise;  // signaled when the read finishes
};

class NaiveLoader {
    // The mutex protects pendingQueue and isRunning. Held only briefly:
    // just long enough to push or pop a single request.
    std::mutex queueMutex;
    // Lets the worker sleep until somebody calls notify_one(), so we
    // don't busy-wait on an empty queue.
    std::condition_variable requestAvailable;
    // FIFO of pending reads. Producers push at the back, the worker
    // pops from the front.
    std::queue<ReadRequest> pendingQueue;
    // Destructor flips this to false so the worker exits its loop.
    // Declared BEFORE workerThread: members are initialized in declaration
    // order, and the worker reads this flag as soon as it starts.
    bool isRunning = true;
    // The single OS thread that drains pendingQueue. One thread = one
    // outstanding read at a time. That's the design's main weakness.
    std::thread workerThread;

public:
    NaiveLoader() : workerThread([this] { workerLoop(); }) {}

    ~NaiveLoader() {
        {
            // Write the flag under the lock so the worker cannot miss the wakeup.
            std::lock_guard lock(queueMutex);
            isRunning = false;
        }
        requestAvailable.notify_one();  // wake the worker so it can see isRunning
        workerThread.join();
    }

    // Enqueue a read. Returns a future that becomes ready once the
    // worker has serviced this particular request.
    std::future<void> submit(int fileDescriptor, int64_t byteOffset,
                             size_t byteCount, void* destination) {
        ReadRequest request{fileDescriptor, byteOffset, byteCount, destination, {}};
        auto completionFuture = request.completionPromise.get_future();
        {
            // Hold the lock only long enough to push.
            std::lock_guard lock(queueMutex);
            pendingQueue.push(std::move(request));
        }
        requestAvailable.notify_one();  // wake the worker if it's sleeping
        return completionFuture;
    }

    // The worker thread runs this loop until shutdown.
    void workerLoop() {
        for (;;) {
            ReadRequest request;
            {
                std::unique_lock lock(queueMutex);
                // Sleep until there's a request to handle, or we're shutting down.
                // cv.wait() drops the lock while sleeping and re-acquires it on wake,
                // so the queue check below is always safe.
                requestAvailable.wait(lock, [&] {
                    return !pendingQueue.empty() || !isRunning;
                });
                if (!isRunning) return;
                request = std::move(pendingQueue.front());
                pendingQueue.pop();
            }
            // THE BLOCKING CALL. The worker sits here for ~70 µs (NVMe cache hit)
            // to ~10 ms (HDD seek + read). The whole point of §5 is to stop blocking
            // here so the device queue can stay full.
            ssize_t bytesRead = pread(request.fileDescriptor, request.destination,
                                      request.byteCount, request.byteOffset);
            // A short read (0 <= bytesRead < byteCount) still counts as success
            // here; a production loader would loop or report it.
            if (bytesRead >= 0) {
                request.completionPromise.set_value();
            } else {
                request.completionPromise.set_exception(
                    std::make_exception_ptr(std::runtime_error("read failed")));
            }
        }
    }
};
```
This works. It compiles, it's about 50 lines, and on a fast SSD it will load a level in a few seconds. It's the design every junior writes first. It is also the design that doesn't scale beyond level loading, for three specific reasons.
Before reading the next list: look at the code above and try to name at least one reason it won't deliver more than a few hundred MB/s on a 7 GB/s NVMe. Two for partial credit, three for the full set.
Three things go wrong as the requests get hotter
- Queue depth of one. The worker reads one block, waits for it, then reads the next. The drive never sees more than a single outstanding request, so its internal parallelism (multiple NAND channels, multiple DMA engines) does nothing. NVMe was designed for QD32; QD1 is leaving 95% of the bandwidth on the floor.
- Sync calls cross the kernel boundary. Every `pread()` is a syscall: a user-to-kernel transition, an argument copy, a return, and on some kernels a copy from page cache to user buffer. At a million small reads per second the syscall overhead alone is gigahertz of CPU[16].
- One worker can't keep up. If the streamer needs 4 GB/s of throughput from 64 KB reads, that's ~60,000 IOPS. At 70 µs per read, that's at least 4 in flight at all times, and you really want 16-32 in flight to absorb latency variance. One worker doing blocking reads has exactly one in flight.
Each of these has a fix. We work through them in §5, then layer on the compression, residency, and priority machinery in the sections after that. The naïve loader is the worker pool of streaming: a correct starting point, deliberately broken so the rest of the tutorial has something to fix.
05 Async I/O: let the kernel do the waiting
Synchronous I/O blocks the thread until the data arrives. Asynchronous I/O hands the kernel a description of what you want, returns immediately, and notifies you later. Every modern platform has the same shape underneath: a submission queue (SQ) that you push descriptors into and a completion queue (CQ) that the kernel pushes results onto.
- Linux: io_uring. Jens Axboe's 2019 design[16]. Two mmapped ring buffers shared between user and kernel. With `IORING_SETUP_SQPOLL` the kernel polls the submission ring on a dedicated thread; with `IORING_SETUP_IOPOLL` it polls the storage device directly[17]. Fully populated: zero syscalls per I/O.
- Windows (legacy): I/O Completion Ports. The NT-era design[18]. Open the file with `FILE_FLAG_OVERLAPPED`; associate the handle with a port; submit reads with `ReadFile()` and an OVERLAPPED struct; reap completions with `GetQueuedCompletionStatus`. Still the workhorse on Windows.
- Windows 11: IoRing. A direct port of the io_uring model with `BuildIoRingReadFile` and `BuildIoRingRegisterBuffers`[19]. Pre-register buffers, submit by integer index, no per-IO pinning. The current Windows fast path under DirectStorage.
Throughput equals concurrency over latency. To hit 100,000 IOPS at 100 µs per IO you need 100,000 × 100 µs = 10 requests in flight. Synchronous I/O has a queue depth of one, so its throughput is 1 ÷ 100 µs = 10,000 IOPS. Async I/O lifts that ceiling by submitting many requests before any of them complete.
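That relationship is Little's law, and it is small enough to keep around as two helper functions when sizing a ring. This is a minimal sketch; `requiredInFlight` and `achievableIops` are hypothetical names, and the model assumes every request sees the same latency.

```cpp
#include <cassert>

// Little's law for a storage queue: concurrency = throughput x latency.
// Returns how many requests must be in flight to sustain targetIops.
double requiredInFlight(double targetIops, double latencySeconds) {
    assert(targetIops >= 0.0 && latencySeconds >= 0.0);
    return targetIops * latencySeconds;
}

// The same law solved for throughput: the IOPS ceiling at a given queue depth.
double achievableIops(double queueDepth, double latencySeconds) {
    assert(latencySeconds > 0.0);
    return queueDepth / latencySeconds;
}
```

`requiredInFlight(100000.0, 100e-6)` gives the 10 requests from the paragraph above, and `achievableIops(1.0, 100e-6)` gives the 10,000 IOPS ceiling of synchronous I/O. Because real latency varies from read to read, treat the first result as a floor rather than a target.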
What does "zero syscalls per I/O" actually mean?
A normal Linux read involves at least one read syscall. The thread traps into the kernel, the kernel does its work, the thread returns. On modern x86, the trap-and-return costs a few hundred nanoseconds even if nothing else happens, before any of the I/O is even initiated.
io_uring's tricks let you skip the trap on the fast path:
- Submission queue polling (`IORING_SETUP_SQPOLL`). A kernel thread polls the submission ring for new entries. You push a submission queue entry into the shared mmap region, and the kernel notices on its own. No syscall to submit.
- Completion polling (`IORING_SETUP_IOPOLL`, `O_DIRECT` only). Instead of the storage driver firing an interrupt on completion, your thread busy-waits on the completion ring. No interrupt handler, no scheduler involvement, no syscall to reap.
With both enabled, a thread can sit in user space, populate SQ entries, and read CQ entries with nothing more than atomic stores and loads. That's where the "zero syscalls" claim comes from.
Below is a Linux io_uring submission for a single read. The same pattern works in IOCP and IoRing; only the function names change.
```cpp
#include <liburing.h>
#include <stdint.h>

// 1. Set up the ring once. The kernel allocates both shared
//    rings (submission and completion) and maps them into our address space.
struct io_uring ring;
int setupResult = io_uring_queue_init(
    256,                   // queue depth: up to 256 outstanding reads
    &ring,
    IORING_SETUP_SQPOLL);  // kernel polls the submission ring; no syscall per submit
if (setupResult < 0) {
    // Negative errno. Handle the failure before touching the ring.
}

// 2. Pre-register the destination buffer. Once registered, the kernel
//    can DMA directly into it without re-pinning pages on every read.
struct iovec destinationBuffer = {
    .iov_base = destinationPtr,  // where the bytes will land
    .iov_len  = 65536            // 64 KiB max read into this buffer
};
io_uring_register_buffers(&ring, &destinationBuffer, 1);

// 3. Per read: grab a Submission Queue Entry, fill it in, submit.
struct io_uring_sqe* submissionEntry = io_uring_get_sqe(&ring);
if (submissionEntry == NULL) {
    // The submission ring is full: drain completions and retry.
}
io_uring_prep_read_fixed(
    submissionEntry,
    fileDescriptor,
    destinationPtr,
    65536,                        // bytes to read
    fileOffset,
    /*registeredBufferIndex=*/0); // matches the index we registered above
submissionEntry->user_data = (uintptr_t)myRequestId;  // tag so completions can be matched back to requests
io_uring_submit(&ring);  // under SQPOLL this publishes the entry and only enters
                         // the kernel if the poller thread has gone idle

// 4. Later, drain Completion Queue Entries (CQEs). Each CQE carries the
//    user_data tag back so we know which request just finished.
struct io_uring_cqe* completionEntry;
while (io_uring_peek_cqe(&ring, &completionEntry) == 0) {
    RequestId requestId = (RequestId)completionEntry->user_data;
    int resultCode = completionEntry->res;  // bytes read, or negative errno on failure
    on_complete(requestId, resultCode);
    io_uring_cqe_seen(&ring, completionEntry);  // release the slot so the kernel can reuse it
}
```
Step through the ring with the widget. Producers (yellow) push submission entries; the kernel (blue) consumes them, performs the I/O, and writes completion entries back. The two rings move independently: you can submit 32 reads before any of them complete and the kernel will reorder them however the storage stack pleases.
It is tempting to mmap() the asset file and let the OS demand-page. Don't. Crotty, Leis, and Pavlo's CIDR 2022 paper "Are You Sure You Want to Use MMAP in Your DBMS?"[20] documents three killer issues that apply equally to game streamers: page-table contention under concurrent access, single-threaded eviction in the kernel, and TLB shootdowns. Use the explicit async APIs; they give you control over latency, ordering, and the contents of the page cache.
06 Compression and the GPU decompression revolution
Storage is the bottleneck, so the obvious optimization is to send fewer bytes across it. Every modern engine ships its assets compressed, and the decompressor sits between the raw read and the resource that gets bound to a draw call. The question is which compressor and, more importantly in 2026, which processor runs it.
Two layers of compression
Texture data is unusual: most of it lives in compressed formats at runtime, not just on disk. BC1 through BC7 (collectively "BCn") are GPU-native compressed formats: each one packs a 4×4 block of texels into a small fixed-size payload, and the GPU's texture sampler decodes them on the fly during every lookup[21]. There's no engine-visible decompression step; the textures stay in this format from disk to render.
The different BC formats trade off bit rate, channel count, and quality. Picking the right one per texture is the single biggest lever a content team has on VRAM and disk footprint:
| Format | Bytes / block | Bits / pixel | Channels | Typical use |
|---|---|---|---|---|
| BC1 | 8 | 4 | RGB (or RGB + 1-bit α) | Opaque diffuse / albedo textures where the cheapest format is good enough |
| BC2 | 16 | 8 | RGB + 4-bit α | Mostly obsolete; superseded by BC3 / BC7 |
| BC3 | 16 | 8 | RGB + smooth α | Standard for textures with gradient transparency |
| BC4 | 8 | 4 | Single channel | Heightmaps, masks, roughness, AO |
| BC5 | 16 | 8 | Two channels | Tangent-space normal maps (X and Y; Z is reconstructed) |
| BC6H | 16 | 8 | HDR RGB (no α) | HDR cubemaps, IBL probes, lightmaps |
| BC7 | 16 | 8 | RGB or RGBA | The high-quality default. Significantly fewer artifacts than BC1/BC3, same on-disk size as BC3 |
All BCn formats compress at a fixed ratio (the block-byte count is the same regardless of content), so the encoder's job is to choose the encoding that best approximates the original 16 texels under that fixed budget. BC7 has 8 different encoding modes per block and picks the best one; BC1 has just one. That's why BC7 looks dramatically better on photographic textures while still costing the same 1 byte per pixel.
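Because the ratio is fixed, the VRAM and disk cost of a BCn texture follows directly from its dimensions and the "Bytes / block" column. The helpers below are a minimal sketch of that arithmetic; `bcnMipBytes` and `bcnMipChainBytes` are hypothetical names, and the mip chain is assumed to halve each dimension down to 1×1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Size in bytes of one mip level of a block-compressed texture.
// Dimensions are rounded UP to whole 4x4 blocks, so even a 1x1 mip
// occupies one full block.
std::uint64_t bcnMipBytes(std::uint32_t width, std::uint32_t height,
                          std::uint32_t bytesPerBlock) {
    assert(width > 0 && height > 0);
    const std::uint64_t blocksWide = (width + 3u) / 4u;
    const std::uint64_t blocksHigh = (height + 3u) / 4u;
    return blocksWide * blocksHigh * bytesPerBlock;
}

// Size of the full mip chain, halving each dimension until both reach 1.
std::uint64_t bcnMipChainBytes(std::uint32_t width, std::uint32_t height,
                               std::uint32_t bytesPerBlock) {
    std::uint64_t total = 0;
    for (;;) {
        total += bcnMipBytes(width, height, bytesPerBlock);
        if (width == 1 && height == 1) break;
        width = std::max<std::uint32_t>(1, width / 2);
        height = std::max<std::uint32_t>(1, height / 2);
    }
    return total;
}
```

A 4096×4096 texture is 16 MiB at 16 bytes per block (BC3, BC5, BC6H, BC7) and 8 MiB at 8 bytes per block (BC1, BC4). The full mip chain adds roughly a third on top of the base level.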
Once a texture is in BCn, the bits on disk are still bulky enough to be worth compressing further. So a shipped texture really goes through three stages: BCn encoding, an Oodle Texture re-encode, and a stream codec.
Three independent compressions, each addressing a different audience. BCn is for the GPU sampler (which never sees decompressed data). Oodle Texture[22] is a re-encoder that picks BCn blocks the next codec compresses well; this yields a 5-50% reduction at perceptually equivalent quality. The stream codec (GDeflate, Zstd, Oodle Kraken) compresses everything one last time for the disk.
The CPU decompression bottleneck
Until about 2022, decompression happened on the CPU. Read the compressed bytes from disk, decompress on a worker thread, upload to the GPU. CPU decompression caps out around 1-3 GB/s per core on LZ-family codecs. Once the SSD can deliver 7 GB/s of compressed data, the CPU is the bottleneck, not the storage. NVIDIA's measurement showed that under CPU decompression an actual game asset load was running at 3 GB/s on PCIe Gen 4 hardware that could have delivered 12+[23].
Moving decompression onto the GPU
The breakthrough is to do the decompression on the GPU as a compute shader, with a codec designed for SIMD throughput. NVIDIA's GDeflate[23] is the canonical example. It's a variant of DEFLATE that splits the input into 64 KiB tiles, each compressed independently, with the bitstream specifically formatted to expose SIMD-level parallelism. The open spec describes a 32-way sub-stream swizzle so a warp can parse it in parallel[40]. AMD and Intel publish hardware metacommands that accelerate the decode further. DirectStorage 1.4[12] adds Zstandard with the same parallelization story.
The throughput picture, when everything is set up correctly:
How the codecs compare
The widget toggle is a teaser; the full picture is a tradeoff between three numbers. Faster decode means the engine is decompressor-limited later or never. Higher ratio means less disk and less bandwidth. Encoder speed only matters at build time but is what determines whether you can run the codec on every CI build or only on a nightly. Approximate published numbers, normalized to a single modern x86 core except where noted:
| Codec | Decode (GB/s) | Ratio vs raw | Where it runs | Notes |
|---|---|---|---|---|
| zlib (Deflate, lvl 6) | ~0.4 | ~2.0× | 1 CPU core | The 1990s baseline. Still the format inside .pak/.zip, but slow. |
| LZ4 (fast) | ~5.0 | ~2.1× | 1 CPU core | Decode-fast, ratio-mediocre. Good when you'd otherwise leave data uncompressed. |
| Zstandard (lvl 9) | ~2.5 | ~2.5× | 1 CPU core | Modern default. Beats zlib on both axes. Added to DirectStorage 1.4. |
| Oodle Selkie | ~7.0 | ~2.0× | 1 CPU core | Tuned for raw decode speed. Pair with content where ratio matters less than latency. |
| Oodle Mermaid | ~4.5 | ~2.3× | 1 CPU core | Middle of the Oodle line. Good default if you're already on Oodle. |
| Oodle Kraken | ~2.5 | ~2.7× | 1 CPU core or PS5 silicon | The high-ratio variant. PS5 decodes it in hardware at ~9 GB/s. |
| Oodle Leviathan | ~1.0 | ~3.0× | 1 CPU core | Maximum ratio. Slowest of the line. Useful for cold patches and downloads. |
| GDeflate (GPU) | ~14+ | ~1.9× | GPU compute shader | Slight ratio penalty vs vanilla DEFLATE; massive speedup. PC fast path. |
| BCPack (Xbox silicon) | ~4.8 effective | varies | Xbox Series I/O block | Texture-format-aware: works on BCn directly, so ratio depends on texture content. |
Numbers are order-of-magnitude. Oodle's published comparison[41] gives more precise figures on specific hardware; Zstd's own benchmarks[42] compare across content types. Real ratios depend heavily on what's being compressed: text and code see ~3-4×, BCn texture data sees ~1.3-1.8×, already-compressed audio sees almost nothing.
The geometric reading of the table: each row is a point on a speed-vs-ratio Pareto frontier. The Oodle family is built to cover that frontier with four codecs so a project can match the codec to the asset class: animations and audio that need fast decode use Selkie or Mermaid; cold storage that needs minimum size uses Leviathan; the bulk of the runtime-loaded content uses Kraken. Zstd and zlib live on the same frontier but cover less of it.
Yes, and that's exactly why PS5 games don't ship a GPU GDeflate decoder. The hardware Kraken block on the I/O complex delivers ~9 GB/s of decompressed data without spending a shader cycle or a CPU core. GPU compute decompression exists for the platforms that don't have that silicon: PC, where the GPU has compute slack to spare; older consoles, where neither the I/O block nor DirectStorage is available. On PS5 (and Xbox Series, with BCPack + the LZ block), the right answer is to let the silicon do it. The reason the two designs coexist is platform: PC needs the flexibility because the codec can change between OS updates, and a chip respin can't.
How does GDeflate even parallelize DEFLATE?
To understand the parallelization trick you first need one fact about GPU execution. A GPU doesn't run threads one by one the way a CPU does. It runs them in groups, typically of 32 (NVIDIA's term is "warp"; AMD calls it a "wave", and some AMD hardware uses 64), that execute the same instruction on different data, in lockstep, every clock. That's where the GPU's enormous throughput comes from. If the 32 threads in a warp can each read a different input value at the same time, you get 32× the work. If they all want to read from the same place, they serialize and you get 1×.
Classic DEFLATE has the second pattern. Its bitstream is fundamentally serial: a Huffman-coded literal/length token depends on the previous bits to know its own length, and the LZ77 back-references can reach arbitrarily far into the already-decoded output. You cannot start decoding mid-stream without knowing the state right before it, so 32 GPU threads pointed at a single DEFLATE stream all queue up behind one decoder. That's the worst case for a warp.
GDeflate fixes this in two stacked levels:
- Tile-level parallelism (coarse). The encoder splits the input into 64 KiB tiles and emits each one as a fully independent stream, with its own Huffman table and its own LZ77 window. Different warps decompress different tiles with no dependency between them. This is how you fill a 5,000-thread GPU.
- Sub-stream parallelism (fine, within a tile). Even inside one tile, the bits are not laid out as a single long stream. The encoder pre-shuffles them into 32 interleaved sub-streams: tokens 0, 32, 64, 96, … all live in sub-stream 0; tokens 1, 33, 65, 97, … live in sub-stream 1; and so on through sub-stream 31. Each of the 32 threads in a warp is assigned one sub-stream and decodes it independently. Now the warp does 32 decodes per cycle instead of 1.
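The token-to-sub-stream mapping in the second level can be sketched in a few lines. This is a toy: it deals whole integer "tokens" round-robin across 32 lanes and puts them back together serially, which shows the ordering property but not the real GDeflate bit layout, Huffman decode, or LZ77 copy. `Token`, `swizzle`, and `unswizzle` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kLaneCount = 32;  // one sub-stream per thread in a warp

using Token = int;  // stand-in for one decoded literal/length/distance token
using SubStreams = std::array<std::vector<Token>, kLaneCount>;

// Encoder side: token i goes to sub-stream (i % 32), keeping its relative
// order within that sub-stream.
SubStreams swizzle(const std::vector<Token>& tokens) {
    SubStreams lanes;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        lanes[i % kLaneCount].push_back(tokens[i]);
    }
    return lanes;
}

// Decoder side: on a GPU each lane would be walked by its own thread. Here
// we interleave serially to show the original order is fully recoverable.
std::vector<Token> unswizzle(const SubStreams& lanes, std::size_t tokenCount) {
    std::vector<Token> tokens(tokenCount);
    for (std::size_t i = 0; i < tokenCount; ++i) {
        const std::vector<Token>& lane = lanes[i % kLaneCount];
        assert(i / kLaneCount < lane.size());
        tokens[i] = lane[i / kLaneCount];
    }
    return tokens;
}
```

With 100 tokens numbered 0 to 99, lane 0 holds 0, 32, 64, 96 and lane 1 holds 1, 33, 65, 97, matching the list above. No lane ever needs to look at another lane's data to find its next token, which is the property a warp needs.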
"Swizzle" is the name for that pre-shuffling step. The original bitstream is conceptually a single deck of cards; the encoder deals it out into 32 hands, one per thread. The decoder's job is to read its assigned hand in order. The 32 hands recombine into the original stream after decode, so the output is byte-identical to vanilla DEFLATE; only the layout of the bits is different.
GDeflate's compressed output is slightly larger than vanilla DEFLATE because each sub-stream carries a bit of redundant framing. You trade a small ratio penalty (a few percent) for the warp-level parallelism. In practice the tradeoff is overwhelmingly worth it: the GPU decoder hits PCIe-bandwidth numbers (~14 GB/s) where a single CPU core can manage maybe ~1 GB/s.
Zstd's parallelization story (added in DirectStorage 1.4) is similar in spirit but builds on Zstd's existing frame format, which already permits independent chunks. The sub-stream swizzle isn't needed because Zstd's frame structure already exposes parallelism the GPU can exploit.
The PS5 chose silicon
Sony picked a different path. Instead of using compute shaders, they put a dedicated hardware decompression block on the I/O complex[7]. That block decodes Oodle Kraken (a high-ratio LZ-family codec from RAD Game Tools / Epic)[24] at line rate: the SSD reads 5.5 GB/s of compressed data, the decompressor outputs ~9 GB/s of decompressed data, and the CPU and GPU never touch any of it. Bloom's analysis on Oodle Texture on PS5[25] shows the layered codec stack (Oodle Texture + Kraken) hitting ~17 GB/s effective on real BCn data.
Xbox Series did the same idea at lower throughput: BCPack (a hardware texture-format-aware codec) plus a general-purpose LZ block, both in silicon, feeding 2.4 GB/s of raw NVMe up to 4.8 GB/s effective[8].
"Effective bandwidth" is what the engine sees: bytes available per second after decompression. It depends on three things: storage raw bandwidth, decompression throughput, and compression ratio. Hitting the platform's advertised number requires all three to be in balance. The console architectures balance them in silicon; PC needs DirectStorage plus a competent compute-shader decoder to get there, and the gap was the whole reason DirectStorage exists.
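The three-way balance reduces to a single `min`: the engine sees either everything the drive can deliver once expanded, or everything the decompressor can emit, whichever is smaller. The function below is a simplified sketch under that assumption (it ignores PCIe limits and per-request overhead), and `effectiveBandwidth` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>

// Bytes per second the engine sees after decompression.
//   rawBytesPerSec                 compressed bytes the storage can deliver
//   compressionRatio               uncompressed size / compressed size (>= 1)
//   decompressorOutputBytesPerSec  ceiling on what the decoder can emit
double effectiveBandwidth(double rawBytesPerSec, double compressionRatio,
                          double decompressorOutputBytesPerSec) {
    assert(compressionRatio >= 1.0);
    return std::min(rawBytesPerSec * compressionRatio,
                    decompressorOutputBytesPerSec);
}
```

The Xbox Series figures above correspond to `effectiveBandwidth(2.4e9, 2.0, ...)` with a decoder fast enough not to be the limit: 2.4 GB/s raw at an implied 2:1 ratio gives 4.8 GB/s. The CPU-bound PC case looks different: a 7 GB/s drive at 2:1 feeding a decoder that tops out at 3 GB/s yields 3 GB/s, and the drive sits mostly idle.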
07 Bundles: why one big file beats a million little ones
The next question after "how do we read fast?" is "what should we be reading?" The naïve approach is one file per asset: one texture per .png, one mesh per .fbx. Production engines don't do this. They bundle thousands of assets into a single file with a manifest that indexes them. Unreal calls them .pak files; Unity calls them AssetBundles[26]; id Tech calls them WADs.
Why bundling helps
- Seek amortization. Opening a file costs tens of microseconds. Opening 10,000 files costs hundreds of milliseconds. One open + 10,000 reads at known offsets is much faster.
- Layout control. You can place co-loaded assets next to each other on disk, so a 64 KB read pulls in the whole group instead of needing one seek per asset.
- Filesystem amplification. Modern filesystems store per-file metadata (inode, dentry, journal). 10,000 tiny files cost more than one large file plus a manifest.
- Streaming-friendly format. The bundle's internal layout can be aligned to compression-block boundaries, sector boundaries, or anything else the streamer cares about.
```cpp
#include <cstdint>

// File layout:
//   [Header]            fixed-size, names the version + the manifest offset.
//   [Asset bytes]       concatenated, possibly compressed, aligned to 4 KB.
//   [Manifest entries]  one per asset: id, offset, compressed size, uncompressed size, flags.
// The manifest lives at the END so writers can stream assets without seeking back.
struct BundleHeader {
    char     magic[4];            // "MPGB"
    uint32_t version;
    uint64_t manifestOffset;
    uint32_t manifestEntryCount;
    uint32_t defaultCodec;        // 0 = none, 1 = zstd, 2 = gdeflate, ...
};

struct ManifestEntry {
    uint64_t assetId;             // hash of the logical name
    uint64_t byteOffset;          // where the asset starts in the bundle
    uint32_t compressedSize;
    uint32_t uncompressedSize;
    uint32_t flags;               // codec override, alignment hints, etc.
    uint32_t reserved;
};

// Both structs are read straight from disk, so pin their sizes at compile
// time to catch accidental padding changes.
static_assert(sizeof(BundleHeader) == 24, "BundleHeader must be 24 bytes");
static_assert(sizeof(ManifestEntry) == 32, "ManifestEntry must be 32 bytes");
```
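Once the manifest is in memory, resolving an asset ID to a byte range is a lookup with no further I/O. The sketch below assumes the bundle writer sorted the manifest by `assetId`, which the layout above does not state; with an unsorted manifest you would build a hash map at load time instead. `findAsset` is a hypothetical helper, and `ManifestEntry` is repeated so the block compiles on its own.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Mirrors the ManifestEntry layout above so this sketch is self-contained.
struct ManifestEntry {
    std::uint64_t assetId;
    std::uint64_t byteOffset;
    std::uint32_t compressedSize;
    std::uint32_t uncompressedSize;
    std::uint32_t flags;
    std::uint32_t reserved;
};

// ASSUMPTION: manifest is sorted ascending by assetId.
// Returns the matching entry, or nullptr if the asset is not in this bundle.
const ManifestEntry* findAsset(const std::vector<ManifestEntry>& manifest,
                               std::uint64_t assetId) {
    auto found = std::lower_bound(
        manifest.begin(), manifest.end(), assetId,
        [](const ManifestEntry& entry, std::uint64_t id) {
            return entry.assetId < id;
        });
    if (found == manifest.end() || found->assetId != assetId) {
        return nullptr;
    }
    return &*found;
}
```

The returned entry's `byteOffset` and `compressedSize` are exactly what a read request needs, so one `open()` plus an in-memory binary search replaces a filesystem path lookup per asset.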
Single huge bundles are a problem when you ship a 200 MB patch that touches one byte of one asset. The patcher either re-downloads the bundle (bad), uses binary diffing (better, fragile), or splits the bundle into smaller "chunks" the patcher can replace independently (the modern answer). Steam's Content Delivery uses fixed-size 1 MB chunks for exactly this reason; Unreal's IO Store splits .pak into .utoc/.ucas with chunked content.
08 Residency pools and eviction policies
Once an asset is loaded, it sits in RAM (or VRAM) until something kicks it out. The data structure that tracks who is in and who is out is a residency pool: a fixed-size cache of assets, indexed by ID, with an eviction policy. A modern engine has several (a texture pool, a mesh pool, an audio pool, an animation pool), each with its own budget.
The policy question is which asset to evict when a new one needs space. Four classical choices:
- LRU (Least Recently Used). A doubly-linked list ordered by last access. On a hit, move to the front; on a miss, evict from the back. The textbook default.
- LFU (Least Frequently Used). Track an access counter per asset; evict the lowest. Adapts to "frequency" rather than "recency." Vulnerable to a one-time spike permanently inflating a counter.
- Hot/cold (2Q, SLRU, segmented LRU). Two LRU lists with a promotion rule, introduced by Johnson and Shasha[43]. New items enter the cold list and only move to the hot list on a second access; eviction always pulls from cold first. Scan-resistant by construction: a one-shot sweep over more items than fit in cache churns only the cold list, leaving the hot working set untouched. Linux's active/inactive page lists and Caffeine's W-TinyLFU (TinyLFU admission filter in front of an SLRU) are production variants.
- ARC (Adaptive Replacement Cache). Megiddo and Modha's 2003 FAST paper[27]. Maintains two LRU lists (one for recent items, one for frequently-accessed items), plus two "ghost" lists of evicted IDs. Ghost-list hits tell ARC which way to shift its split. Adaptive, low-overhead, used in ZFS and, briefly, in PostgreSQL 8.0 before it was replaced there.
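The hot/cold idea is easier to see in code than in prose. The sketch below is a small SLRU-style cache counted in items rather than bytes. The class name `SegmentedLru` and its capacities are assumptions made for illustration, and it is not the 2Q algorithm exactly as published. New IDs enter the cold list, a second access promotes to hot, and overflow from hot is demoted back to the front of cold.

```cpp
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <list>
#include <unordered_map>

class SegmentedLru {
    struct Slot {
        bool hot;                               // which list the ID lives in
        std::list<uint64_t>::iterator position; // stays valid across splice()
    };
    std::list<uint64_t> coldList, hotList;      // front = most recently used
    std::unordered_map<uint64_t, Slot> slots;
    size_t coldCapacity, hotCapacity;

    void evictColdTail() {
        slots.erase(coldList.back());
        coldList.pop_back();
    }

public:
    SegmentedLru(size_t coldItems, size_t hotItems)
        : coldCapacity(coldItems), hotCapacity(hotItems) {}

    bool contains(uint64_t id) const { return slots.count(id) > 0; }

    // Records one access and returns true on a hit.
    bool access(uint64_t id) {
        auto found = slots.find(id);
        if (found == slots.end()) {
            // Miss: new items always start in the cold list.
            coldList.push_front(id);
            slots[id] = {false, coldList.begin()};
            if (coldList.size() > coldCapacity) evictColdTail();
            return false;
        }
        Slot& slot = found->second;
        if (slot.hot) {
            hotList.splice(hotList.begin(), hotList, slot.position);
            return true;
        }
        // Second access: promote from cold to hot.
        hotList.splice(hotList.begin(), coldList, slot.position);
        slot.hot = true;
        if (hotList.size() > hotCapacity) {
            // Demote the oldest hot item instead of evicting it outright.
            const uint64_t demoted = hotList.back();
            coldList.splice(coldList.begin(), hotList, std::prev(hotList.end()));
            slots[demoted].hot = false;
            if (coldList.size() > coldCapacity) evictColdTail();
        }
        return true;
    }
};
```

With a cold capacity of 2 and a hot capacity of 2, touching IDs 1 and 2 twice each and then sweeping ten one-shot IDs through the cache leaves 1 and 2 resident. Only the cold list churns.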
Compare LRU and ARC against the same access pattern. ARC keeps frequent items even under recent-only access bursts:
Pin lists and priority tiers
Pure LRU/ARC is rarely shipped raw. Production pools layer two things on top:
- Pin lists. Some assets must never be evicted: the player character mesh, the UI atlas, the loaded weapon textures. They're "pinned" and don't participate in eviction.
- Priority tiers. Different asset classes get different effective recency. A texture used by a UI screen is "more important" than a texture on a distant rock. The eviction ranking is tier × recency, not raw recency.
An alternative to discrete tiers is continuous "heat": every access bumps the heat by a constant; heat decays exponentially over time. The eviction candidate is simply the lowest-heat item. It's a clean abstraction that subsumes both recency (because heat decays) and frequency (because heat accumulates).
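A minimal sketch of the heat idea follows, assuming a half-life parameterization of the exponential decay and lazy evaluation (heat is only recomputed when an asset is touched or inspected). The class name `HeatTracker`, the half-life, and the bump constant are all illustrative choices, not values from any particular engine.

```cpp
#include <cmath>
#include <cstdint>
#include <optional>
#include <unordered_map>

class HeatTracker {
    struct Entry {
        double heat;               // heat as of lastUpdateSeconds
        double lastUpdateSeconds;
    };
    std::unordered_map<uint64_t, Entry> entries;
    double halfLifeSeconds;        // heat halves every this many seconds
    double bumpPerAccess;          // constant added on each access

    double decayed(const Entry& entry, double nowSeconds) const {
        const double elapsed = nowSeconds - entry.lastUpdateSeconds;
        return entry.heat * std::exp2(-elapsed / halfLifeSeconds);
    }

public:
    HeatTracker(double halfLife, double bump)
        : halfLifeSeconds(halfLife), bumpPerAccess(bump) {}

    // Decay the stored heat up to now, then add the access bump.
    void recordAccess(uint64_t assetId, double nowSeconds) {
        Entry& entry = entries[assetId];   // new entries start at heat 0
        entry.heat = decayed(entry, nowSeconds) + bumpPerAccess;
        entry.lastUpdateSeconds = nowSeconds;
    }

    double heatAt(uint64_t assetId, double nowSeconds) const {
        auto found = entries.find(assetId);
        return found == entries.end() ? 0.0 : decayed(found->second, nowSeconds);
    }

    // The eviction candidate: the tracked asset with the lowest current heat.
    std::optional<uint64_t> coldest(double nowSeconds) const {
        std::optional<uint64_t> best;
        double bestHeat = 0.0;
        for (const auto& [assetId, entry] : entries) {
            const double heat = decayed(entry, nowSeconds);
            if (!best || heat < bestHeat) { best = assetId; bestHeat = heat; }
        }
        return best;
    }
};
```

The linear scan in `coldest` keeps the sketch short. A pool with many assets would more likely keep candidates in buckets or a heap that is rebuilt periodically.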
09Priority: what to load first
Eviction is the question of what to remove from RAM. Priority is the question of what to load first. They're symmetric: the streamer is constantly choosing between candidate reads, and the order matters as much as the reads themselves.
The priority of a pending tile is a function of player state. Typical inputs:
- Distance to player. Closer is more urgent.
- Screen-space size. A faraway tile that covers a lot of the screen (because the camera is looking at it) outranks a closer tile behind the player.
- View frustum bias. Tiles in the camera's cone get a multiplier; tiles outside it get a discount.
- Velocity prediction. If the player is moving in a direction, weight tiles in that direction higher. This is Insomniac's "streaming cone" trick[4].
- Importance flags. Quest-critical assets and player-character pieces outrank scenery.
The exact form varies by engine, but the spirit is the same: a few cheap-to-compute geometric heuristics fold into one scalar; the streamer pops candidates off a max-heap by that scalar. There's no "right" formula. You tune it against the worst-case traversal in your game and look for where pop-in appears.
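One way the inputs listed above might fold into a single scalar is sketched here in 2D. Every weight, the one-second lookahead, the 0.5 cosine cutoff for the view cone, and the type names are hypothetical tuning values invented for this example. As the text says, there is no right formula.

```cpp
#include <algorithm>
#include <cmath>

struct Vec2 { float x, y; };

struct PlayerState {
    Vec2 position;
    Vec2 viewDirection;   // assumed unit length
    Vec2 velocity;        // world units per second
};

struct TileCandidate {
    Vec2  center;
    float screenCoverage; // fraction of the screen the tile covers, 0..1
    bool  questCritical;
};

inline float dot(Vec2 a, Vec2 b) { return a.x * b.x + a.y * b.y; }

inline float priorityScore(const PlayerState& player, const TileCandidate& tile) {
    // Velocity prediction: measure distance from where the player will be.
    const float lookaheadSeconds = 1.0f;
    const Vec2 predicted{player.position.x + player.velocity.x * lookaheadSeconds,
                         player.position.y + player.velocity.y * lookaheadSeconds};
    const Vec2 fromPredicted{tile.center.x - predicted.x, tile.center.y - predicted.y};
    const float distanceTerm = 1.0f / (1.0f + std::sqrt(dot(fromPredicted, fromPredicted)));

    // View frustum bias: boost tiles inside the camera cone, discount the rest.
    const Vec2 fromPlayer{tile.center.x - player.position.x,
                          tile.center.y - player.position.y};
    const float range = std::sqrt(dot(fromPlayer, fromPlayer));
    const float facing = range > 0.0f ? dot(fromPlayer, player.viewDirection) / range : 1.0f;
    const float frustumBias = facing > 0.5f ? 2.0f : 0.5f;

    // Screen-space size and importance flags act as further multipliers.
    const float coverageTerm = 1.0f + 4.0f * std::clamp(tile.screenCoverage, 0.0f, 1.0f);
    const float importance = tile.questCritical ? 4.0f : 1.0f;

    return distanceTerm * frustumBias * coverageTerm * importance;
}
```

The result is the `priorityScore` a caller would store on each pending read before pushing it onto the max-heap.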
A streamer working through the heap
The player (yellow dot) moves through a grid of tiles. Each tile has a priority score computed from the heuristics above; the streamer's max-heap drains in score order while the budget allows. Drag the player or change the view direction and the priorities re-rank live:
Hysteresis is the trick of using different thresholds for entering and leaving a state. Without it, a tile right on the eviction-threshold boundary can be loaded, evicted, loaded, evicted as the player wiggles. The fix is to keep an asset resident until it falls well below the load threshold, typically at 1.5x or 2x the distance. The same pattern applies to streaming: don't request a tile until it crosses a stricter "I'm about to need this" threshold, not the looser "this might be visible" one.
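The two-threshold rule fits in a few lines. This is a sketch under the assumption that distance is the only input; the names `HysteresisBand` and `decide` are made up for the example, and the radii are arbitrary.

```cpp
struct HysteresisBand {
    float loadRadius;   // request a tile once it is at least this close
    float evictRadius;  // evict only once it is at least this far; keep > loadRadius
};

enum class TileAction { None, Load, Evict };

// The gap between the two radii is the dead zone: inside it, resident tiles
// stay resident and missing tiles stay unrequested, so small player movements
// near either threshold cannot cause load/evict ping-pong.
inline TileAction decide(const HysteresisBand& band, bool isResident, float distance) {
    if (!isResident && distance <= band.loadRadius) return TileAction::Load;
    if (isResident && distance >= band.evictRadius) return TileAction::Evict;
    return TileAction::None;
}
```

With `HysteresisBand{100.0f, 150.0f}`, a tile loaded at distance 99 survives the player backing off to 149 and is only released past 150.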
10Sparse Virtual Textures
Up to now we've been treating "tile" as an abstract unit. For textures specifically, there's a beautiful trick that turns the entire screen-space mip selection problem into a streaming problem. The trick is sparse virtual texturing (SVT), and Sean Barrett's GDC 2008 talk[1] is where it crystallized. id Tech 5 shipped it commercially as MegaTexture in Rage[2]; every modern engine has a variant.
The idea, in three pieces
- One huge logical texture. Pretend you have a 128k × 128k texture for the whole world. It would be 64 GB at 4 bytes per pixel; obviously it doesn't fit in VRAM.
- A small physical cache. A real GPU texture, maybe 4096 × 4096, divided into 64 × 64-pixel tiles (16 KB each at 4 bytes per pixel, 4,096 slots in total). This is what's actually resident.
- An indirection table. A small lookup texture (2048 × 2048 pixels at the finest level, one pixel per logical tile) that maps logical tile coordinates to physical-cache coordinates. The shader samples the indirection texture, then samples the physical cache at the offset it found.
With those three pieces, a shader that wants to sample the logical 128k × 128k texture at UV (u, v) does this:
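// Resources assumed to be bound by the engine.
Texture2D<float4> indirection;      // one texel per logical tile
Texture2D<float4> physicalCache;    // the resident tile atlas
SamplerState      linearSampler;
static const float logicalTilesPerSide       = 2048.0; // 128k texels / 64-px tiles
static const float physicalCacheTilesPerSide = 64.0;   // 4096 texels / 64-px tiles

float4 SampleSVT(float2 logicalUv)
{
    // 1. Find which logical tile we're in. With a 128k texture and 64-px
    //    tiles, there are 2048 tiles on a side. logicalTileCoord is in [0, 2048).
    float2 logicalTileCoord = floor(logicalUv * logicalTilesPerSide);

    // 2. Read the indirection texture at that coordinate. Each pixel in the
    //    indirection map names the physical-cache tile coordinates that hold
    //    the requested logical tile (or signals "not resident").
    float4 indirectionEntry = indirection.Load(int3(logicalTileCoord, 0));

    // 3. Translate to physical-cache UV space.
    //    indirectionEntry.xy = which tile slot in the physical cache (in tile units)
    //    indirectionEntry.z  = which mip level we actually have resident
    //                          (may be coarser than requested if the finer one is missing)
    //    A tile resident at mip m covers 2^m finest-level tiles per side, so the
    //    offset inside it is measured on that mip's tile grid.
    float  tilesPerSideAtResidentMip = logicalTilesPerSide / exp2(indirectionEntry.z);
    float2 offsetWithinTile = frac(logicalUv * tilesPerSideAtResidentMip);
    float2 physicalCacheUv =
        (indirectionEntry.xy + offsetWithinTile) / physicalCacheTilesPerSide;

    // 4. Sample the physical cache. The cache stores every tile at its own
    //    resolution in a single level, so we sample level 0. Hardware filtering
    //    works normally; the only complication is gutter pixels at tile borders
    //    (see callout below).
    return physicalCache.SampleLevel(linearSampler, physicalCacheUv, 0);
}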
Two texture samples per logical sample. The first one is into a small, cache-friendly indirection texture; the second is into the physical cache. Everything else, including hardware bilinear filtering inside a tile, works as normal.
How the streamer knows what to load
The remaining problem: which tiles of the logical texture should we have resident? We can't know without rendering the frame, because the answer depends on the camera. The answer is a feedback pass: a low-resolution render that writes, for every shaded pixel, the logical tile coordinates that pixel would have sampled. After the frame, the CPU (or a compute shader) reads the feedback buffer, deduplicates it, and submits load requests for any tile that's wanted and not yet resident.
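The CPU side of the feedback pass amounts to "deduplicate, then subtract the resident set". Below is a minimal sketch, assuming the feedback buffer has already been read back as one packed 64-bit tile key per pixel. The key layout in `packTileId`, the `kNoTile` sentinel, and the function name `tilesToRequest` are hypothetical.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

struct TileId { uint16_t x, y; uint8_t mip; };

// Hypothetical key layout: mip in bits 32..39, y in bits 16..31, x in bits 0..15.
inline uint64_t packTileId(TileId tile) {
    return (uint64_t(tile.mip) << 32) | (uint64_t(tile.y) << 16) | uint64_t(tile.x);
}

// Value written by feedback pixels that sampled no virtual texture at all.
constexpr uint64_t kNoTile = ~0ull;

// Returns each wanted-but-missing tile exactly once, in first-seen order.
inline std::vector<uint64_t> tilesToRequest(
        const std::vector<uint64_t>& feedbackPixels,
        const std::unordered_set<uint64_t>& residentTiles) {
    std::unordered_set<uint64_t> alreadySeen;
    std::vector<uint64_t> requests;
    for (uint64_t key : feedbackPixels) {
        if (key == kNoTile) continue;                 // background pixel
        if (residentTiles.count(key) > 0) continue;   // already in the cache
        if (alreadySeen.insert(key).second) requests.push_back(key);
    }
    return requests;
}
```

A natural extension is to count how many pixels wanted each tile and use that count as the load priority. The sketch leaves it out.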
Modern hardware accelerates this with Sampler Feedback, a D3D12 feature shipped in Shader Model 6.5[28]. The hardware tracks, per draw, which tiles of a tiled resource were actually sampled. The example in Microsoft's DevBlog post shows the committed footprint of a tiled-resource full-mip-chain texture dropping from 524,288 KB (~512 MiB) to 51,584 KB (~50 MiB), a 10× reduction with no visual change[28]. Intel's GDC 2021 follow-up walks the same numbers on a scene with hundreds of gigabytes of total texture data[29]. On Xbox, this is wired directly into the BCPack decompressor and marketed as Sampler Feedback Streaming[8].
The hardware: tiled resources and sparse binding
The "physical cache" trick predates GPU hardware support. id Tech 5 transcoded tiles on the CPU and uploaded them as ordinary textures[2]. Today's GPUs give you the cache for free as a tiled resource in D3D12[30] or a sparse image in Vulkan[31]: you allocate a logical-size resource, but tiles of it are individually backed by physical memory through UpdateTileMappings[32] or vkQueueBindSparse. The shader doesn't need to know any of this; it samples the logical resource and the hardware MMU does the indirection.
Gutter pixels and the bilinear-bleed problem
Hardware bilinear filtering averages four neighboring pixels (trilinear does the same in two adjacent mip levels). If tile A is resident but its neighbor tile B is not, sampling near the edge of A reaches into B's memory, which has stale data. Worse, even if both tiles are resident, they may live at non-adjacent physical-cache locations, so the bleed sees pixels from a completely unrelated tile.
The fix is gutter pixels. Each tile is stored with a border of pixels copied from the neighbor tiles, typically a 4-pixel skirt on every side. Filtering near the edge then samples within the gutter, which has the same data the neighbor would have provided. The gutter is the reason a "64-pixel tile" is actually stored as 72 × 72 = 5,184 texels, about 27% more than its 4,096-texel payload.
An alternative is to do the bilinear filter manually in the shader from four point samples, each translated through the indirection table. It's correct but expensive (four indirection lookups instead of one), and most engines stick with gutters.
11Cluster streaming: doing for geometry what SVT did for textures
Textures had been virtualized for fifteen years before geometry caught up. The breakthrough is Unreal Engine 5's Nanite[9], demoed in 2020 and shipped in 2022. The idea is structurally identical to SVT, but the unit is a triangle cluster, not a texel tile.
The shape of the system
- Clusters of ~128 triangles. A mesh is preprocessed into clusters: small connected groups of triangles with a shared bounding sphere and a normal cone. The cluster is the unit of everything downstream: culling, LOD, streaming.
- A DAG of LOD groups. Adjacent clusters are merged and simplified to produce a coarser LOD. This is recursive: clusters → groups of clusters → groups of groups, all the way up to a single root cluster representing the whole mesh.
- Pages of 128 KB. Cluster groups within the same LOD that share spatial locality are packed into 128 KB pages. The page is the unit of streaming I/O. Karis et al.'s SIGGRAPH talk[9] describes the compression scheme and the page layout in detail.
- GPU-driven cluster culling. Every frame, a compute pass walks the DAG, picks the right LOD level for each region of the screen, and emits a draw list. Clusters that aren't resident are replaced with their nearest resident ancestor, so the surface is always rendered, just at coarser detail.
The hierarchy is what makes streaming work without visible pop. A camera far from the model only ever asks for the root and the coarse intermediate levels, which are always resident. As the camera moves closer, the GPU's cluster-culling pass starts asking for fine-grained leaves; the streamer loads the corresponding pages; the renderer falls back to the coarser ancestor for any leaves that aren't yet in. Pop-in becomes a gradual sharpening rather than a hard appearance.
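The fallback rule can be sketched on the CPU with a simplified hierarchy. Real Nanite makes this choice on the GPU over a DAG whose groups can have several parents. The version below assumes a plain tree with one parent index per node, and the names `ClusterNode` and `nearestResident` are invented for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct ClusterNode {
    int32_t parentIndex;  // -1 for the root
    bool    resident;     // true once the node's page has been streamed in
};

// Walks up from the wanted node until it finds something that can be drawn.
// Returns the wanted node itself when it is resident, otherwise its nearest
// resident ancestor, or -1 if nothing on the path is loaded.
inline int32_t nearestResident(const std::vector<ClusterNode>& nodes, int32_t wantedIndex) {
    int32_t current = wantedIndex;
    while (current >= 0 && !nodes[static_cast<size_t>(current)].resident) {
        current = nodes[static_cast<size_t>(current)].parentIndex;
    }
    return current;
}
```

Keeping the root and coarse levels permanently resident is what guarantees the walk never returns -1 in practice. A miss on the wanted node is also the natural place to enqueue a page request.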
The recurring pattern: a fixed-budget physical cache, a logical-to-physical indirection, and a feedback signal that drives streaming. SVT applies it to texels; Nanite applies it to triangles. The same idea is being applied to BVH nodes for ray tracing, to shader variants for material streaming, and to volumetric data in renderers like Disney's Hyperion.
12DirectStorage, PS5, and the modern fast paths
Everything we've discussed so far works on top of the OS's normal file API. That used to be the best you could do on PC; on console it never was. The 2020 generation of hardware (PS5 and Xbox Series) shipped with dedicated I/O silicon, and the PC platform spent the next four years catching up. By 2026 the gap is mostly closed.
DirectStorage: what the API actually skips
Microsoft's DirectStorage, currently version 1.4[12], is the explicit PC fast path. It is not a magic spell; it is a stack of specific decisions:
- Batched submission. Many EnqueueRequest calls are flushed by a single Submit. The kernel sees one batch, not N separate syscalls.
- BypassIO. A Windows kernel feature that skips most of the I/O stack: the filesystem filter chain, the FastFat/NTFS dispatcher, the cache manager[33]. The driver gets the request as close to "DMA from this LBA range to this physical page" as the kernel knows how to provide. Currently NVMe-only, NTFS-only, noncached-only, but on those it's the difference between 2 GB/s and 7 GB/s.
- GPU-side decompression. A compute-shader decoder for GDeflate (since 1.1) and Zstd (since 1.4) reads the compressed buffer in VRAM and writes the decompressed buffer to the target resource, in the same place the GPU was going to read it from. The CPU never sees the uncompressed bytes.
- Fence synchronization with D3D12. An I/O request can wait on or signal a D3D12 fence[11]. You can compose stream loads into your existing graphics queue without polling.
// ── One-time setup ────────────────────────────────────────────
// (HRESULT checks omitted for brevity.)
// The factory is the entry point to the DirectStorage runtime.
ComPtr<IDStorageFactory> storageFactory;
DStorageGetFactory(IID_PPV_ARGS(&storageFactory));

// Size the staging buffer for the GPU-decompression path. The runtime
// default is 32 MiB; 128 MiB here is a tuning choice for larger requests.
storageFactory->SetStagingBufferSize(128u * 1024u * 1024u);

// Create a "GPU queue": reads land directly in GPU resources, with
// optional GPU-side decompression on the way.
ComPtr<IDStorageQueue1> gpuQueue;
DSTORAGE_QUEUE_DESC queueDescriptor{};
queueDescriptor.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
queueDescriptor.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;  // max outstanding requests
queueDescriptor.Priority   = DSTORAGE_PRIORITY_NORMAL;
queueDescriptor.Device     = d3d12Device.Get();            // the D3D12 device we'll write into
storageFactory->CreateQueue(&queueDescriptor, IID_PPV_ARGS(&gpuQueue));

// Open the asset bundle once. We'll keep this handle for the lifetime
// of the game and submit many reads against it.
ComPtr<IDStorageFile> bundleFile;
storageFactory->OpenFile(L"assets.pak", IID_PPV_ARGS(&bundleFile));

// ── Per asset: describe the read, push it into the queue ──────
DSTORAGE_REQUEST textureRequest{};

// Where the compressed bytes are coming from.
textureRequest.Options.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
textureRequest.Source.File.Source = bundleFile.Get();
textureRequest.Source.File.Offset = manifestEntry.byteOffset;
textureRequest.Source.File.Size   = manifestEntry.compressedSize;

// How to decompress them. GDeflate is decoded by a GPU compute shader;
// the CPU never sees the uncompressed bytes.
textureRequest.Options.CompressionFormat = DSTORAGE_COMPRESSION_FORMAT_GDEFLATE;
                                           // or DSTORAGE_COMPRESSION_FORMAT_ZSTD
textureRequest.UncompressedSize = manifestEntry.uncompressedSize;

// Where the decompressed bytes end up: directly into a region of an
// existing D3D12 texture resource.
textureRequest.Options.DestinationType = DSTORAGE_REQUEST_DESTINATION_TEXTURE_REGION;
textureRequest.Destination.Texture.Resource         = textureResource.Get();
textureRequest.Destination.Texture.SubresourceIndex = mipLevelIndex;
textureRequest.Destination.Texture.Region           = textureRegion;

gpuQueue->EnqueueRequest(&textureRequest);

// Submit the batch and signal a D3D12 fence when the GPU is finished
// writing. Any other GPU work that consumes the texture can wait on
// the same fence value, no CPU polling required.
gpuQueue->EnqueueSignal(streamingFence.Get(), nextFenceValue);
gpuQueue->Submit();
The shape mirrors io_uring: build a request, push it, eventually reap completions. The pieces that are new are which hardware sees the data and when: the bytes go disk → DMA → VRAM → compute-shader decode → final resource, with the CPU only touching the request descriptors.
PS5 and Xbox Series: silicon shortcuts
The console architectures predate DirectStorage and chose a slightly different tradeoff. Both put dedicated decompression hardware on the I/O complex, so the codec runs in fixed-function silicon rather than on a compute shader.
- PS5. Mark Cerny's "Road to PS5" describes a custom 12-channel NVMe interface, a dedicated DMA controller, two I/O coprocessors, and a hardware Oodle Kraken block[7]. Raw read: 5.5 GB/s. Effective post-Kraken: roughly 8-9 GB/s, with Oodle Texture's RDO encoder pushing real measured workloads above 17 GB/s on BCn data[25].
- Xbox Series. Microsoft's Velocity Architecture[8]: a 2.4 GB/s NVMe, a hardware LZ block, a hardware BCPack block (texture-format aware, more effective on BCn than general-purpose codecs), DirectStorage, and Sampler Feedback Streaming. Effective I/O performance: 4.8 GB/s. Microsoft summarized the change as "approximately 100x the I/O performance in current generation consoles" relative to Xbox One.
Hardware decompression is faster per watt and frees up shader cores. Compute-shader decompression is more flexible and doesn't add silicon area. The console answer made sense when you knew the codec in advance and could fix it forever; the PC answer is what made GDeflate and Zstd shippable updates rather than chip respins. Neither is wrong; they optimize for different constraints.
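The "effective" figures quoted above are just raw bandwidth multiplied by a compression ratio, which makes it easy to back out the ratio each number implies. A small sketch of that arithmetic follows; the function names and the optional decoder cap are assumptions added for illustration.

```cpp
// Compression ratio a codec must average for a drive with the given raw rate
// to deliver the given effective (post-decompression) rate.
constexpr double impliedCompressionRatio(double effectiveGBps, double rawGBps) {
    return effectiveGBps / rawGBps;
}

// Effective rate for a given ratio, capped by how fast the decoder can emit
// bytes. The cap is a hypothetical parameter; pass a large value to ignore it.
constexpr double effectiveGBps(double rawGBps, double compressionRatio, double decoderCapGBps) {
    return rawGBps * compressionRatio < decoderCapGBps
               ? rawGBps * compressionRatio
               : decoderCapGBps;
}
```

Plugging in the figures above: 4.8 / 2.4 gives a ratio of 2.0 for Xbox Series, 8 to 9 over 5.5 gives roughly 1.45 to 1.64 for Kraken on PS5, and 17 / 5.5 is about 3.1 for the Oodle Texture case.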
13A working streamer, in your language
Below is a complete streamer in two languages: modern C++ (C++20) and Rust. The C++ version is about 250 lines and uses no dependencies beyond the standard library. The Rust version is similar. Both implement the same design: a pool of loader threads that bounds the number of in-flight reads, a priority queue that orders requests by score, an LRU pool that bounds residency, and a callback that runs CPU-side decompression. There is no platform-specific I/O; the goal is to read clearly. In production you'd swap the loader for io_uring / IoRing / DirectStorage and the rest of the system is unchanged.
// ───────────────────────────────────────────────────────────
// streamer.cpp · priority-scheduled async streamer with LRU
// Build: g++ -std=c++20 -O2 -pthread -c streamer.cpp
// ───────────────────────────────────────────────────────────
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <functional>
#include <list>
#include <memory>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <unordered_map>
#include <unordered_set>
#include <vector>

namespace mpg {

// A stable, content-agnostic identifier (typically a hash of the asset's
// logical name). Pending reads, residency pool lookups, and dedup all
// key on this ID.
using AssetId = uint64_t;

// Everything the streamer needs to fetch and decompress one asset.
struct PendingRead {
    AssetId  assetId;
    float    priorityScore;         // higher = load sooner; computed by caller
    uint64_t byteOffsetInBundle;    // where in the bundle to start reading
    uint32_t compressedByteCount;   // bytes to read from disk
    uint32_t uncompressedByteCount; // bytes after the decoder runs
};

// Comparator for the std::priority_queue. The std heap is a max-heap by
// default; we order by priorityScore so the highest-priority item pops first.
struct HigherPriorityFirst {
    bool operator()(const PendingRead& left, const PendingRead& right) const {
        return left.priorityScore < right.priorityScore;
    }
};

// The data the streamer hands back to the engine once a read finishes.
struct ResidentAsset {
    std::vector<std::byte> decompressedBytes;
};

// A least-recently-used residency pool. The doubly-linked list is
// ordered front=newest, back=oldest. The hash map indexes into the list
// so lookups are O(1) and the "promote to front on access" is also O(1).
// Not thread-safe on its own; the Streamer guards it with its mutex.
class ResidencyPool {
    size_t capacityBytes;     // budget set by the caller; we evict to stay under it
    size_t residentBytes = 0; // total size of everything currently in the pool

    struct CacheEntry {
        AssetId       assetId;
        ResidentAsset asset;
    };
    std::list<CacheEntry> recencyOrder; // front = most recently used
    std::unordered_map<AssetId, std::list<CacheEntry>::iterator> indexByAssetId;

public:
    explicit ResidencyPool(size_t capacityInBytes) : capacityBytes(capacityInBytes) {}

    // Look up an asset. If found, also bump it to the front of the LRU list.
    // Returns nullptr if not resident.
    const ResidentAsset* touch(AssetId assetId) {
        auto indexEntry = indexByAssetId.find(assetId);
        if (indexEntry == indexByAssetId.end()) return nullptr;
        // splice() moves the node within the same list in O(1).
        recencyOrder.splice(recencyOrder.begin(), recencyOrder, indexEntry->second);
        return &indexEntry->second->asset;
    }

    // Add a freshly-decoded asset to the pool. Evict from the LRU tail until
    // we're back under budget.
    void insert(AssetId assetId, ResidentAsset asset) {
        if (indexByAssetId.count(assetId)) return; // already resident; nothing to do
        residentBytes += asset.decompressedBytes.size();
        recencyOrder.push_front({assetId, std::move(asset)});
        indexByAssetId[assetId] = recencyOrder.begin();

        // Evict from the back (oldest) until we're under budget again.
        while (residentBytes > capacityBytes && !recencyOrder.empty()) {
            auto& evictionTarget = recencyOrder.back();
            residentBytes -= evictionTarget.asset.decompressedBytes.size();
            indexByAssetId.erase(evictionTarget.assetId);
            recencyOrder.pop_back();
        }
    }

    bool contains(AssetId assetId) const { return indexByAssetId.count(assetId) > 0; }
};

class Streamer {
    // ── configuration ─────────────────────────────────────────
    std::string   bundlePath;    // path to the asset bundle on disk
    ResidencyPool residencyPool; // the LRU cache of decoded assets
    int           workerCount;   // how many threads pull from the queue

    // ── synchronization for the pending-request queue ─────────
    // The mutex protects pendingQueue, alreadyQueuedIds, and residencyPool.
    std::mutex queueMutex;
    std::condition_variable requestAvailable; // signalled when a new read lands
    std::priority_queue<PendingRead, std::vector<PendingRead>, HigherPriorityFirst> pendingQueue;
    std::unordered_set<AssetId> alreadyQueuedIds; // dedup; clears as workers finish

    // ── worker pool ───────────────────────────────────────────
    std::vector<std::thread> workerThreads;
    std::atomic<bool> isRunning{true}; // flipped to false in the destructor

    // The decoder callback turns compressed bytes into decompressed bytes.
    // In production this would dispatch to GDeflate on the GPU or to a
    // hardware block on console.
    std::function<
        std::vector<std::byte>(const std::byte* src, size_t srcByteCount, size_t dstByteCount)
    > decode;

public:
    Streamer(std::string bundleFilePath,
             size_t poolCapacityBytes,
             int concurrentReads,
             auto decoderCallback)
        : bundlePath(std::move(bundleFilePath)),
          residencyPool(poolCapacityBytes),
          workerCount(concurrentReads),
          decode(std::move(decoderCallback))
    {
        workerThreads.reserve(workerCount);
        for (int workerIndex = 0; workerIndex < workerCount; workerIndex++)
            workerThreads.emplace_back([this] { workerLoop(); });
    }

    ~Streamer() {
        {
            // Flip the flag under the mutex so a worker that has just checked
            // the predicate cannot miss the wake-up below.
            std::lock_guard lock(queueMutex);
            isRunning.store(false);
        }
        requestAvailable.notify_all(); // wake every worker so they can exit
        for (auto& worker : workerThreads) worker.join();
    }

    // Add a tile-load request to the priority queue. Silently dedupes
    // against already-resident assets and already-pending requests.
    void request(PendingRead incomingRequest) {
        std::lock_guard lock(queueMutex);
        if (residencyPool.contains(incomingRequest.assetId)) return;
        if (!alreadyQueuedIds.insert(incomingRequest.assetId).second) return;
        pendingQueue.push(std::move(incomingRequest));
        requestAvailable.notify_one(); // nudge one sleeping worker
    }

    // Engine-facing accessor. Promotes the entry on the LRU list as a side effect.
    // Takes the mutex because workers insert into the same pool.
    const ResidentAsset* access(AssetId assetId) {
        std::lock_guard lock(queueMutex);
        return residencyPool.touch(assetId);
    }

private:
    // One of these runs on each worker thread. Pulls the highest-priority
    // pending read off the queue, executes it, decodes, then stores the
    // result in the residency pool.
    void workerLoop() {
        // Each worker keeps its own ifstream so concurrent seeks don't fight
        // over a single file pointer.
        std::ifstream perWorkerFile(bundlePath, std::ios::binary);

        while (isRunning.load()) {
            // 1. Wait for a request, then pop the highest-priority one.
            PendingRead request;
            {
                std::unique_lock lock(queueMutex);
                requestAvailable.wait(lock, [&] { return !pendingQueue.empty() || !isRunning; });
                if (!isRunning) return;
                request = pendingQueue.top();
                pendingQueue.pop();
            }

            // 2. Read the compressed bytes from the bundle.
            std::vector<std::byte> compressedBytes(request.compressedByteCount);
            perWorkerFile.seekg(request.byteOffsetInBundle);
            perWorkerFile.read(reinterpret_cast<char*>(compressedBytes.data()),
                               request.compressedByteCount);
            if (!perWorkerFile) {
                // Short or failed read: reset the stream and drop the request
                // so the caller can ask for it again.
                perWorkerFile.clear();
                std::lock_guard lock(queueMutex);
                alreadyQueuedIds.erase(request.assetId);
                continue;
            }

            // 3. Decode on this worker. In production this would dispatch to a
            //    GPU compute queue (GDeflate / Zstd) or a hardware block (Kraken).
            auto decompressedBytes = decode(
                compressedBytes.data(), request.compressedByteCount,
                request.uncompressedByteCount);

            // 4. Publish the result. Clear the dedup bit so a fresh request for
            //    the same asset (e.g. after it gets evicted) can be enqueued again.
            {
                std::lock_guard lock(queueMutex);
                residencyPool.insert(request.assetId, ResidentAsset{std::move(decompressedBytes)});
                alreadyQueuedIds.erase(request.assetId);
            }
        }
    }
};

} // namespace mpg

// ── usage ─────────────────────────────────────────────────────
//
// A no-op decoder: pretend the on-disk bytes are already uncompressed.
// auto identityDecoder = [](const std::byte* src, size_t srcN, size_t dstN) {
//     return std::vector<std::byte>(src, src + srcN);
// };
// mpg::Streamer streamer(
//     "assets.pak",
//     /*poolCapacityBytes=*/ 1ull << 30, // 1 GiB residency budget
//     /*concurrentReads=*/ 16,
//     identityDecoder);
//
// streamer.request({.assetId=tileId, .priorityScore=score, .byteOffsetInBundle=off, ...});
// if (auto* asset = streamer.access(tileId)) bindTexture(asset);
// ───────────────────────────────────────────────────────────
// streamer.rs · priority-scheduled async streamer with LRU
// Build: rustc -O --crate-type lib streamer.rs
// ───────────────────────────────────────────────────────────
use std::cmp::Ordering;
use std::collections::{BinaryHeap, HashMap, HashSet};
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
// Renamed on import: std::cmp::Ordering and the atomic Ordering would clash.
use std::sync::atomic::{AtomicBool, Ordering as AtomicOrdering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// A stable, content-agnostic identifier (typically a hash of the asset's
/// logical name). All streamer maps and dedup sets are keyed on this.
pub type AssetId = u64;

/// Everything the streamer needs to fetch and decompress one asset.
pub struct PendingRead {
    pub asset_id: AssetId,
    pub priority_score: f32, // higher = load sooner
    pub byte_offset_in_bundle: u64,
    pub compressed_byte_count: u32,
    pub uncompressed_byte_count: u32,
}

// BinaryHeap is a max-heap by Ord. We implement Ord by priority so the
// highest-priority item pops first. f32 doesn't implement Ord because of
// NaN, so we use total_cmp(), which defines a total ordering. PartialEq is
// defined through the same comparison so the two always agree.
impl PartialEq for PendingRead {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for PendingRead {}
impl PartialOrd for PendingRead {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for PendingRead {
    fn cmp(&self, other: &Self) -> Ordering {
        self.priority_score.total_cmp(&other.priority_score)
    }
}

/// The data the streamer hands back to the engine.
pub struct ResidentAsset {
    pub decompressed_bytes: Vec<u8>,
}

// One entry in the LRU pool. last_used_tick is bumped every time the asset
// is accessed; eviction picks the entry with the smallest tick.
struct CacheEntry {
    asset: ResidentAsset,
    last_used_tick: u64,
}

// A doubly-linked intrusive LRU list is awkward in safe Rust, so for
// readability we use a HashMap plus a recency counter. Eviction is O(n)
// in the number of resident assets; a production version would use a
// proper LRU crate or hand-rolled list.
pub struct ResidencyPool {
    capacity_bytes: usize,
    resident_bytes: usize,
    next_tick: u64,
    entries: HashMap<AssetId, CacheEntry>,
}

impl ResidencyPool {
    pub fn new(capacity_bytes: usize) -> Self {
        Self {
            capacity_bytes,
            resident_bytes: 0,
            next_tick: 0,
            entries: HashMap::new(),
        }
    }

    /// Look up an asset. If found, bump its recency so it survives eviction longer.
    pub fn touch(&mut self, asset_id: AssetId) -> Option<&ResidentAsset> {
        self.next_tick += 1;
        let tick_now = self.next_tick;
        if let Some(entry) = self.entries.get_mut(&asset_id) {
            entry.last_used_tick = tick_now;
            return Some(&entry.asset);
        }
        None
    }

    /// Add a freshly-decoded asset. Evict the least-recently-used entries
    /// until we're back under capacity.
    pub fn insert(&mut self, asset_id: AssetId, asset: ResidentAsset) {
        if self.entries.contains_key(&asset_id) {
            return;
        }
        self.resident_bytes += asset.decompressed_bytes.len();
        self.next_tick += 1;
        self.entries.insert(asset_id, CacheEntry { asset, last_used_tick: self.next_tick });

        while self.resident_bytes > self.capacity_bytes && !self.entries.is_empty() {
            // Find the entry with the smallest tick: the LRU victim.
            let victim_id = *self.entries.iter()
                .min_by_key(|(_, entry)| entry.last_used_tick)
                .unwrap().0;
            let evicted = self.entries.remove(&victim_id).unwrap();
            self.resident_bytes -= evicted.asset.decompressed_bytes.len();
        }
    }
}

// State the worker pool shares behind an Arc. The mutexes are small and
// always held briefly. Lock order is always request_queue, then residency_pool.
struct SharedState {
    bundle_path: String,
    residency_pool: Mutex<ResidencyPool>,
    request_queue: Mutex<RequestQueue>,
    request_available: Condvar,
    decode: Box<dyn Fn(&[u8], usize) -> Vec<u8> + Send + Sync>,
}

// Pairs the priority heap with a dedup set. Both live behind the same
// mutex because every change touches both: pushing a request also marks
// it as queued, completing a request clears the mark.
struct RequestQueue {
    pending: BinaryHeap<PendingRead>,
    already_queued_ids: HashSet<AssetId>,
}

pub struct Streamer {
    shared: Arc<SharedState>,
    is_running: Arc<AtomicBool>,
    worker_handles: Vec<thread::JoinHandle<()>>,
}

impl Streamer {
    pub fn new(
        bundle_path: String,
        pool_capacity_bytes: usize,
        concurrent_reads: usize,
        decode: Box<dyn Fn(&[u8], usize) -> Vec<u8> + Send + Sync>,
    ) -> Self {
        let shared = Arc::new(SharedState {
            bundle_path,
            residency_pool: Mutex::new(ResidencyPool::new(pool_capacity_bytes)),
            request_queue: Mutex::new(RequestQueue {
                pending: BinaryHeap::new(),
                already_queued_ids: HashSet::new(),
            }),
            request_available: Condvar::new(),
            decode,
        });
        let is_running = Arc::new(AtomicBool::new(true));

        let mut worker_handles = Vec::new();
        for _ in 0..concurrent_reads {
            let shared = shared.clone();
            let is_running = is_running.clone();
            worker_handles.push(thread::spawn(move || worker_loop(shared, is_running)));
        }
        Self { shared, is_running, worker_handles }
    }

    /// Add a tile-load request. Silently dedupes against pending and resident sets.
    pub fn request(&self, incoming: PendingRead) {
        let mut queue = self.shared.request_queue.lock().unwrap();
        // Skip if already resident, or already queued.
        if self.shared.residency_pool.lock().unwrap()
            .entries.contains_key(&incoming.asset_id) {
            return;
        }
        if !queue.already_queued_ids.insert(incoming.asset_id) {
            return;
        }
        queue.pending.push(incoming);
        self.shared.request_available.notify_one();
    }

    /// Engine-facing accessor. Runs `use_asset` on the resident asset (bumping
    /// its recency) while the pool lock is held, or returns None if not resident.
    pub fn access<R>(
        &self,
        asset_id: AssetId,
        use_asset: impl FnOnce(&ResidentAsset) -> R,
    ) -> Option<R> {
        let mut pool = self.shared.residency_pool.lock().unwrap();
        let result = pool.touch(asset_id).map(use_asset);
        result
    }
}

// Stop and join the workers when the streamer goes away.
impl Drop for Streamer {
    fn drop(&mut self) {
        {
            // Flip the flag under the queue lock so a worker that has just
            // checked it cannot miss the wake-up below.
            let _queue = self.shared.request_queue.lock().unwrap();
            self.is_running.store(false, AtomicOrdering::Release);
        }
        self.shared.request_available.notify_all();
        for handle in self.worker_handles.drain(..) {
            let _ = handle.join();
        }
    }
}

// One of these runs per worker thread.
fn worker_loop(shared: Arc<SharedState>, is_running: Arc<AtomicBool>) {
    // Each worker keeps its own File handle so concurrent seeks don't fight
    // over a single file position.
    let mut per_worker_file = File::open(&shared.bundle_path).unwrap();

    while is_running.load(AtomicOrdering::Acquire) {
        // 1. Wait for a request, then pop the highest-priority one.
        let request = {
            let mut queue = shared.request_queue.lock().unwrap();
            while queue.pending.is_empty() && is_running.load(AtomicOrdering::Acquire) {
                queue = shared.request_available.wait(queue).unwrap();
            }
            if !is_running.load(AtomicOrdering::Acquire) {
                return;
            }
            queue.pending.pop().unwrap()
        };

        // 2. Read compressed bytes from the bundle.
        let mut compressed_bytes = vec![0u8; request.compressed_byte_count as usize];
        per_worker_file.seek(SeekFrom::Start(request.byte_offset_in_bundle)).unwrap();
        per_worker_file.read_exact(&mut compressed_bytes).unwrap();

        // 3. Decode. In production this would dispatch to GPU compute or hardware.
        let decompressed_bytes = (shared.decode)(
            &compressed_bytes, request.uncompressed_byte_count as usize);

        // 4. Publish to the pool and clear the dedup bit.
        let mut queue = shared.request_queue.lock().unwrap();
        shared.residency_pool.lock().unwrap()
            .insert(request.asset_id, ResidentAsset { decompressed_bytes });
        queue.already_queued_ids.remove(&request.asset_id);
    }
}
This implementation is meant to read clearly, not be the fastest possible.
- The loader uses synchronous, pread-style blocking reads on each worker. A real implementation would use io_uring / IoRing / DirectStorage to keep the device queue full from one thread.
- The decompression callback is CPU-side. The PC fast path is a GPU compute shader; the console fast path is fixed-function silicon.
- The residency pool is pure LRU. Production pools have pin lists, tiers, and heat tracking (see §8).
- There's no batching of nearby reads. A real streamer coalesces adjacent requests to amortize per-IO overhead.
- There's no eviction notification: callers can hold a stale pointer if their asset gets evicted. Production code returns ref-counted handles, not raw pointers.
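The missing batching step in that list can be sketched in a few lines. This is a minimal illustration, not the tutorial's streamer: `ReadRange`, `coalesce`, and the `max_gap` threshold are hypothetical names, and the right gap size depends on the device and has to be measured.

```rust
/// A byte range inside the bundle file. Hypothetical type for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ReadRange {
    pub offset: u64,
    pub len: u64,
}

/// Merge ranges that touch, overlap, or sit within `max_gap` bytes of each
/// other into one larger read. Reading a small gap of unwanted bytes is often
/// cheaper than paying the fixed cost of a second I/O.
pub fn coalesce(mut ranges: Vec<ReadRange>, max_gap: u64) -> Vec<ReadRange> {
    // Sort by file offset so neighbours in the file are neighbours in the list.
    ranges.sort_by_key(|r| r.offset);

    let mut merged: Vec<ReadRange> = Vec::new();
    for r in ranges {
        match merged.last_mut() {
            // Close enough to the previous merged range: extend it.
            Some(last) if r.offset <= last.offset + last.len + max_gap => {
                let end = (last.offset + last.len).max(r.offset + r.len);
                last.len = end - last.offset;
            }
            // Too far away (or first range): start a new read.
            _ => merged.push(r),
        }
    }
    merged
}
```

A worker would pop several pending requests, run them through `coalesce`, issue one read per merged range, and then slice the result back into per-asset buffers. The trade-off is the one described under read amplification: a larger `max_gap` means fewer I/Os but more wasted bytes.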
14Try it yourself
The playground below runs a JavaScript port of the streamer above. The library is exposed as MPGStream; you can drive a simulated player around a world, see tiles prioritized, loaded, and evicted, and tune cache size, bandwidth, and prefetch shape live. Hit Run (or Ctrl+Enter / Cmd+Enter). Output prints below; the world view animates on the right.
Drop bandwidthMBs to 10 and pop-in spikes: the streamer can't keep up with the player's traversal. Increase prefetchRadius to 6 and the cache thrashes because too many tiles compete for a small pool. The settings interact, which is the point.
15How Unreal does it
Unreal Engine 5 ships four overlapping streaming systems. Most projects use several at once.
- Texture Streaming.[34] The classic mip-based streamer: every texture has a set of "inline mips" (loaded with the level) and "streaming mips" (loaded on demand). The streamer ranks every visible texture by screen coverage, drops one mip at a time until the planned footprint fits in budget, and submits async loads to bring up the rest. This is what handles old-style materials.
- Streaming Virtual Texturing (SVT).[35] The SVT pattern from §10 wired into the material editor. A material can sample a virtual texture instead of a regular one; the engine builds the indirection table, runs the feedback pass, and pages 128 KB tiles in and out of a physical cache. Used for terrain, large detail textures, and runtime virtual textures.
- Nanite.[9] The cluster-streaming geometry pipeline from §11. 128-triangle clusters, 128 KB pages, GPU cluster culling, software rasterizer for micropolygons. The mesh equivalent of SVT.
- World Partition.[36] Replaces the legacy sub-level system. The world is a 2D grid of "cells"; actors are stored one-per-file under the grid; cells stream based on player position with per-data-layer load ranges. HLODs (Hierarchical Levels of Detail) supply a coarse representation of far cells so the world looks complete even when only the local cells are resident.
A typical Unreal frame might: page in actors via World Partition (large-grain spatial residency), sample materials through SVT (texel-level residency), render via Nanite (cluster-level residency), and use legacy mip streaming for whatever textures aren't virtual yet. Each layer is independent; each addresses a different scale. The win is that the engine has stopped trying to use one mechanism for all four problems.
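The "drop one mip at a time until the planned footprint fits" step of mip-based texture streaming can be sketched as follows. This is not Unreal's code: `TexturePlan`, `fit_to_budget`, the rule that each mip is a quarter the size of the one above it, and the choice to drop from the least-covered texture first are all assumptions made for illustration.

```rust
/// One visible texture and the mip level the streamer currently plans to keep.
/// Hypothetical type; field names are illustrative.
#[derive(Debug, Clone)]
pub struct TexturePlan {
    pub name: &'static str,
    pub screen_coverage: f32, // fraction of the screen covered, 0.0..=1.0
    pub top_mip_bytes: u64,   // size of mip 0 (full resolution)
    pub mip_count: u32,       // assumed small enough that the shift below is valid
    pub planned_mip: u32,     // 0 = full resolution, larger = blurrier
}

impl TexturePlan {
    /// Bytes resident if we keep `planned_mip` and every smaller mip.
    /// Assumes each mip is one quarter the size of the previous one.
    fn footprint(&self) -> u64 {
        (self.planned_mip..self.mip_count)
            .map(|m| self.top_mip_bytes >> (2 * m))
            .sum()
    }
}

/// Drop one mip at a time, least-covered textures first, until the total
/// planned footprint fits. Returns false if even the smallest mips don't fit.
pub fn fit_to_budget(plans: &mut [TexturePlan], budget_bytes: u64) -> bool {
    // Ascending coverage: textures the player barely sees lose detail first.
    plans.sort_by(|a, b| a.screen_coverage.total_cmp(&b.screen_coverage));
    loop {
        let mut dropped_any = false;
        for i in 0..plans.len() {
            let total: u64 = plans.iter().map(|p| p.footprint()).sum();
            if total <= budget_bytes {
                return true;
            }
            // Keep at least the last mip of every texture.
            if plans[i].planned_mip + 1 < plans[i].mip_count {
                plans[i].planned_mip += 1;
                dropped_any = true;
            }
        }
        if !dropped_any {
            let total: u64 = plans.iter().map(|p| p.footprint()).sum();
            return total <= budget_bytes;
        }
    }
}
```

After `fit_to_budget` returns, the difference between each texture's resident mip and its `planned_mip` is the list of async loads (or evictions) to submit.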
16How Unity does it
Unity's streaming story is less ambitious than Unreal's, and the difference reflects the engines' user bases. Where Unreal builds for AAA open worlds, Unity ships building blocks and lets the project assemble them.
- Mipmap Streaming.[37] The texture-streaming equivalent of Unreal's: only the mip levels the camera actually needs are kept resident. The streamer derives the right mip per renderer using distance and screen coverage; explicit overrides are available through Texture2D.requestedMipmapLevel for procedurally-loaded content.
- AssetBundles and Addressables. The bundle format from §7 wrapped in two APIs. AssetBundles is the low-level interface; Addressables is the higher-level system that handles dependency resolution, async loading, and reference counting.
- Subscenes (Entities / DOTS). The Entities package supports "subscenes": prebaked archetype chunks that load as a unit. The cell-based answer for ECS-heavy projects.
Unity doesn't currently ship a virtual-geometry equivalent of Nanite. There's a sparse-virtual-texturing prototype in HDRP and a community effort around mesh shaders, but the standard recommendation is: use Mipmap Streaming for textures, use Addressables for bundles, partition large worlds into subscenes, and bring your own priority logic.
17Pitfalls and how to spot them
Streaming bugs are usually visible. They show up as pop-in, hitches, or "the level took too long to load." Here are the classes of bug I've personally watched ship.
Synchronous I/O on the main thread
The cardinal sin. Any blocking read(), fopen(), or CreateFile() on the frame thread is a hitch waiting to happen. Even a fast NVMe takes about 70 µs to return a cache miss, so roughly 240 blocking reads eat the entire 16.7 ms budget of a 60 fps frame, and a single read that lands behind a deep device queue or on a slower drive can exhaust the frame's slack on its own. The fix is structural: route every read through the streamer, even tiny ones. Audit your codebase for fopen in any frame-path module.
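The structural fix can be sketched with nothing but the standard library: the frame thread submits a path and later polls for results, and only a background thread ever blocks. `BackgroundReader`, `submit`, and `poll` are hypothetical names for this illustration; a real streamer would add priorities and more than one worker, as in the listing earlier.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

pub struct ReadRequest {
    pub path: String,
}

pub struct ReadResult {
    pub path: String,
    pub bytes: std::io::Result<Vec<u8>>,
}

/// Owns one background thread that performs every blocking read.
pub struct BackgroundReader {
    to_worker: Sender<ReadRequest>,
    from_worker: Receiver<ReadResult>,
}

impl BackgroundReader {
    pub fn new() -> Self {
        let (to_worker, worker_inbox) = channel::<ReadRequest>();
        let (worker_outbox, from_worker) = channel::<ReadResult>();
        thread::spawn(move || {
            // The loop ends when the BackgroundReader (and its Sender) is dropped.
            for request in worker_inbox {
                // Blocking read, but off the frame thread.
                let bytes = std::fs::read(&request.path);
                let result = ReadResult { path: request.path, bytes };
                if worker_outbox.send(result).is_err() {
                    break;
                }
            }
        });
        Self { to_worker, from_worker }
    }

    /// Called from the frame thread. Never blocks on the disk.
    pub fn submit(&self, path: &str) {
        let _ = self.to_worker.send(ReadRequest { path: path.to_string() });
    }

    /// Called once per frame. Returns whatever finished since the last poll.
    pub fn poll(&self) -> Vec<ReadResult> {
        self.from_worker.try_iter().collect()
    }
}
```

The frame loop calls `submit` when it decides an asset is needed and `poll` once per frame; a read that takes 70 µs or 70 ms costs the frame thread the same near-zero amount.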
Read amplification
You needed 4 KB. You read 64 KB because that's your block size. You wasted 60 KB of bandwidth, a 16x amplification on that one read. Average over every read in your level load, including the ones that use most of their block, and you can easily sit at 5x amplification overall. The fix is to align your asset layout to your read granularity: if reads happen in 64 KB chunks, group the assets so 64 KB chunks usefully contain related data. SVT's tile sizing is the textbook example.
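The amplification figure is easy to measure from a request log. The sketch below assumes each request is rounded out to whole blocks of a fixed size; `bytes_transferred` and `amplification` are hypothetical helper names, and requests are `(offset, length)` pairs in bytes.

```rust
/// Bytes actually moved when a request is rounded out to whole blocks.
pub fn bytes_transferred(offset: u64, len: u64, block: u64) -> u64 {
    if len == 0 {
        return 0;
    }
    let first_block_start = offset / block * block;
    let last_block_end = (offset + len + block - 1) / block * block;
    last_block_end - first_block_start
}

/// Total bytes moved divided by total bytes wanted, over a list of
/// (offset, length) requests. 1.0 is perfect; higher is wasted bandwidth.
/// The result is NaN if the requests ask for zero bytes in total.
pub fn amplification(requests: &[(u64, u64)], block: u64) -> f64 {
    let wanted: u64 = requests.iter().map(|&(_, len)| len).sum();
    let moved: u64 = requests
        .iter()
        .map(|&(offset, len)| bytes_transferred(offset, len, block))
        .sum();
    moved as f64 / wanted as f64
}
```

For the 4 KB read in a 64 KB block, `amplification(&[(0, 4096)], 65536)` is 16.0. Note that a request straddling a block boundary pays for both blocks, which is one more reason to align assets to the read granularity.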
HDD versus SSD assumptions
A game tuned for SSD can be unplayable on HDD. Random-access patterns that work fine on flash collapse under seek time. Spider-Man's PS4 version (described in Insomniac's GDC postmortem) solved this with aggressive tile coalescing and a streaming cone matched to the worst-case HDD[5]; the PS5 version assumed an SSD throughout. If your engine supports HDD installs, test with one and profile with the seek-time penalty present. If it doesn't, say so in the system requirements.
Cache thrash from naïve LRU
A scanning workload (one-time access to many tiles) plus a hot working set (a small number of always-touched tiles) is the worst case for pure LRU: the scan evicts the working set, then the working-set accesses re-evict the scan. The fix is ARC, 2Q, or any policy that distinguishes "scanned once" from "accessed repeatedly." See §8 and the widget there.
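To make the "scanned once" versus "accessed repeatedly" distinction concrete, here is a deliberately simplified two-queue cache. It is neither full 2Q (there is no ghost list) nor ARC (nothing adapts); `TwoQueueCache` and its fields are hypothetical, lookups are linear scans for brevity, and both capacities are assumed to be at least 1.

```rust
use std::collections::VecDeque;

/// Tiles seen once wait in `probation`; a second touch promotes them to
/// `protected`. A one-pass scan can only ever evict other probation entries.
pub struct TwoQueueCache {
    probation: VecDeque<u32>, // FIFO, front = oldest
    protected: VecDeque<u32>, // LRU, front = least recently used
    probation_cap: usize,
    protected_cap: usize,
}

impl TwoQueueCache {
    pub fn new(probation_cap: usize, protected_cap: usize) -> Self {
        Self {
            probation: VecDeque::new(),
            protected: VecDeque::new(),
            probation_cap,
            protected_cap,
        }
    }

    /// Record an access to `id`. Returns true on a cache hit.
    pub fn touch(&mut self, id: u32) -> bool {
        // Hit in the protected queue: move to the most-recently-used end.
        if let Some(pos) = self.protected.iter().position(|&x| x == id) {
            self.protected.remove(pos);
            self.protected.push_back(id);
            return true;
        }
        // Second touch while on probation: promote.
        if let Some(pos) = self.probation.iter().position(|&x| x == id) {
            self.probation.remove(pos);
            if self.protected.len() == self.protected_cap {
                self.protected.pop_front();
            }
            self.protected.push_back(id);
            return true;
        }
        // Miss: admit on probation, evicting the oldest probation entry.
        if self.probation.len() == self.probation_cap {
            self.probation.pop_front();
        }
        self.probation.push_back(id);
        false
    }

    pub fn contains(&self, id: u32) -> bool {
        self.protected.contains(&id) || self.probation.contains(&id)
    }
}
```

Touch two hot tiles twice each, then scan a hundred cold tiles once: the hot tiles are still resident afterwards, where a pure LRU of the same total size would have lost both.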
Priority bugs cause pop-in
The streamer loads things in priority order, but the priority function might be wrong. Common bugs: forgetting to apply the velocity bonus, weighting screen-space size correctly only when the camera is moving, picking a frustum cone too tight so a quick camera turn leaves you with no resident tiles. Always test with a fast-turning camera and a fast-moving player. Pop-in events tracked in the widget above are the proxy metric.
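One way to keep the velocity term from being forgotten is to fold it into the distance itself: score each tile by its distance from where the player will be, not where the player is. The sketch below is an assumption-laden illustration, not the tutorial's priority function; `Vec2`, `tile_priority`, and `lookahead_seconds` are hypothetical, and screen-space size and frustum terms are left out.

```rust
#[derive(Debug, Clone, Copy)]
pub struct Vec2 {
    pub x: f32,
    pub y: f32,
}

/// Higher is more urgent. Distance is measured from the player's predicted
/// position `lookahead_seconds` from now, so tiles in the direction of travel
/// outrank tiles the player is leaving behind.
pub fn tile_priority(
    player: Vec2,
    velocity: Vec2,
    tile_center: Vec2,
    lookahead_seconds: f32,
) -> f32 {
    let predicted_x = player.x + velocity.x * lookahead_seconds;
    let predicted_y = player.y + velocity.y * lookahead_seconds;
    let dx = tile_center.x - predicted_x;
    let dy = tile_center.y - predicted_y;
    let distance = (dx * dx + dy * dy).sqrt();
    // The +1 keeps the priority finite when the prediction lands on the tile.
    1.0 / (1.0 + distance)
}
```

A unit test that compares a tile ahead of a fast-moving player against an equally distant tile behind them is a cheap guard against the first bug in the list above: with zero velocity the two must tie, and with nonzero velocity the tile ahead must win.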
Decompressor starvation
Reads arrive at 8 GB/s. Decompression runs at 1 GB/s. The CPU decompressor's input queue overflows; reads back-pressure; the device idles. This is the classic pre-DirectStorage symptom. Fix it by moving decompression to the GPU (GDeflate, Zstd on GPU), running more decompressor threads, or, as a last resort, using a faster codec at a smaller compression ratio.
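The arithmetic behind that symptom is a one-line bottleneck model. The sketch below assumes decode throughput scales linearly with thread count, which real codecs only approximate, and measures both rates in GB/s of compressed input; the function names are hypothetical.

```rust
/// Sustained pipeline rate: the slower of the device and the decoders.
pub fn pipeline_throughput_gbps(read_gbps: f64, decode_gbps_per_thread: f64, threads: u32) -> f64 {
    read_gbps.min(decode_gbps_per_thread * threads as f64)
}

/// Decoder threads needed before the device becomes the bottleneck again.
pub fn threads_to_saturate(read_gbps: f64, decode_gbps_per_thread: f64) -> u32 {
    (read_gbps / decode_gbps_per_thread).ceil() as u32
}
```

With the numbers above, one decoder thread caps the pipeline at 1 GB/s and it takes eight threads to keep an 8 GB/s device busy, which is eight cores not running gameplay. That cost is the argument for GPU or fixed-function decompression.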
Memory fragmentation in the pool
Variable-sized assets in a fixed-size pool will fragment over time. After enough churn, you can't fit a 2 MB texture in 8 MB of free space because the free space is scattered across 16 holes. The two answers: pool-per-size (allocate from buckets sized to common asset sizes) or pool-per-class (separate pools for textures, meshes, audio, etc.). Most engines do both.
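A pool-per-size allocator can be sketched as a set of free lists, one per slot size. Because every slot in a class is the same size, any free slot can serve any request for that class, so free space never scatters into unusable holes. `SizeClassPool` and `Slot` are hypothetical names; this sketch tracks slot indices only and leaves out the backing memory and the eviction that a real streamer would trigger when a class runs dry.

```rust
/// Identifies one fixed-size slot: which size class, and which slot in it.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Slot {
    pub class: usize,
    pub index: usize,
}

pub struct SizeClassPool {
    class_sizes: Vec<usize>,     // slot size per class, ascending
    free_slots: Vec<Vec<usize>>, // per class: indices of currently free slots
}

impl SizeClassPool {
    /// `classes` is a list of (slot_size_bytes, slot_count) pairs.
    pub fn new(classes: &[(usize, usize)]) -> Self {
        let mut sorted = classes.to_vec();
        sorted.sort();
        Self {
            class_sizes: sorted.iter().map(|&(size, _)| size).collect(),
            free_slots: sorted.iter().map(|&(_, count)| (0..count).collect()).collect(),
        }
    }

    /// Round the request up to the smallest class that fits and take a slot.
    /// Returns None if the asset is larger than every class or the class is full.
    pub fn alloc(&mut self, bytes: usize) -> Option<Slot> {
        let class = self.class_sizes.iter().position(|&size| size >= bytes)?;
        let index = self.free_slots[class].pop()?;
        Some(Slot { class, index })
    }

    /// Return a slot to its class's free list.
    pub fn free(&mut self, slot: Slot) {
        self.free_slots[slot.class].push(slot.index);
    }
}
```

The price is internal fragmentation: a 1.5 MB asset in a 2 MB slot wastes 0.5 MB. Choosing class sizes that match the asset-size histogram of the shipped content keeps that waste bounded.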
File handle exhaustion
Some streamers open one file per asset bundle. With 50 bundles loaded that's fine; with 5,000 it hits the per-process file handle limit and starts returning errors. The fix is to share handles across logical bundles or to use one giant bundle with offset-based requests.
18Where to go from here
Streaming, like all engine systems, gets specific fast. Once you have the pattern in your head, the practical learning is mostly reading other people's implementations and watching the production talks.
Read these libraries
- DirectStorage SDK samples: Microsoft's github.com/microsoft/DirectStorage includes a GpuDecompressionBenchmark sample you can run locally to see actual numbers on your hardware[38].
- liburing: github.com/axboe/liburing. The reference user-space wrapper around io_uring. Tiny, readable, the canonical example of how to wire async I/O.
- NVIDIA stdexec: github.com/NVIDIA/stdexec. The C++26 senders/receivers reference implementation, useful for composing async I/O with the rest of your engine's async work.
- Bgfx and Falcor: open-source renderers with full streaming implementations. Reading them after this tutorial is the most efficient way to see production-quality variants of every pattern here.
Read these papers
- Karis, Stubbe, Wihlidal, A Deep Dive into Nanite Virtualized Geometry, SIGGRAPH 2021[9]. The mesh-streaming reference.
- Barrett, Sparse Virtual Textures, GDC 2008[1]. The texture-streaming reference.
- Megiddo & Modha, ARC: A Self-Tuning, Low Overhead Replacement Cache, FAST 2003[27]. The eviction-policy reference.
- Axboe, Efficient IO with io_uring[16]. The async-I/O reference.
- van Waveren, Software Virtual Textures[2]. The id Tech 5 implementation deep-dive.
Talks
- Cerny, The Road to PS5[7]. The canonical console-I/O architecture talk.
- Ruskin, Marvel's Spider-Man: A Technical Postmortem, GDC 2019[5]. The peak HDD-era streaming postmortem.
- Ruskin, Streaming in Sunset Overdrive's Open World, GDC 2015[4]. The streaming-cone idea.
- Karis, Journey to Nanite, HPG 2022[39]. Brian Karis's retrospective on the design.
The final exam
Five questions covering the whole tutorial. If you can answer all five without scrolling back, you've got the fundamentals.
19Sources & further reading
Numbered citations refer to the superscripts above. Everything below is either freely available on the open web or linked from a GDC vault page.
The prose, code, CSS, and interactive demos on this page are original writing. The SVT design follows Barrett (2008) [1] and van Waveren's id Tech 5 paper [2], both attributed at the point of use. Architecture numbers for the PS5 I/O complex (5.5 GB/s raw, 8-9 GB/s effective post-Kraken) come from the Cerny "Road to PS5" talk [7]. The Xbox Velocity Architecture numbers come from the Microsoft Xbox Wire post [8]. The DirectStorage API descriptions are paraphrases of the linked Microsoft DevBlog posts. The "fixed-budget physical cache + indirection + feedback" framing tracks Barrett's original presentation; the cross-application of that pattern to geometry tracks Karis et al.'s Nanite talk [9].
- Barrett, S. (2008). Sparse Virtual Textures. GDC. silverspaceship.com/src/svt. The canonical SVT primary source, including a public-domain demo and slides.
- van Waveren, J.M.P. (2012). Software Virtual Textures. id Software. PDF. id Tech 5's CPU-side tile transcoding architecture and the 120k × 120k virtual-texture configuration shipped in Rage.
- Sanglard, F. SSD: Reboot Your Thinking. fabiensanglard.net/ssd. Sanglard's note on how Rage's tile streaming behaves dramatically differently on SSD vs HDD.
- Ruskin, E. (2015). Streaming in Sunset Overdrive's Open World. GDC. GDC Vault. The streaming-cone idea and pre-fetch heuristics for high-speed traversal.
- Ruskin, E. (2019). Marvel's Spider-Man: A Technical Postmortem. GDC. GDC Vault. Tile sizing and traversal lead time for HDD-era open-world streaming.
- Corbet, J. (2019). Ringing in a new asynchronous I/O API. LWN. lwn.net/Articles/776703. The canonical introduction to io_uring's SQ/CQ ring design.
- Cerny, M. (2020). The Road to PS5. Sony Interactive Entertainment. YouTube. Custom 12-channel NVMe, hardware Kraken decoder, 5.5 GB/s raw read, 8-9 GB/s effective.
- Microsoft. (2020). A Closer Look at Xbox Velocity Architecture. Xbox Wire. news.xbox.com. BCPack hardware codec, Sampler Feedback Streaming, 2.4 GB/s raw / 4.8 GB/s effective.
- Karis, B., Stubbe, R., & Wihlidal, G. (2021). A Deep Dive into Nanite Virtualized Geometry. SIGGRAPH Advances in Real-Time Rendering. PDF. Cluster DAG, 128-triangle cluster size, 128 KB pages, GPU cluster culling.
- Microsoft DirectX Team. (2022). DirectStorage 1.1 Now Available. Microsoft DevBlog. devblogs.microsoft.com. GDeflate introduction, 128 MiB staging buffer recommendation.
- Microsoft DirectX Team. (2025). DirectStorage 1.3 Is Now Available. Microsoft DevBlog. devblogs.microsoft.com. EnqueueRequests, D3D12 fence synchronization, multi-subresource destination ranges.
- Microsoft DirectX Team. (2026). DirectStorage 1.4 Release Adds Support for Zstandard. Microsoft DevBlog. devblogs.microsoft.com. Zstd codec on CPU and GPU paths, Game Asset Conditioning Library.
- Samsung Semiconductor. (2022). Samsung NVMe SSD 990 PRO Datasheet, Rev. 1.0. PDF. 7.45 GB/s sequential read, 1.4M random read IOPS at QD32.
- Tom's Hardware (2024). Crucial T705 2 TB SSD Review. tomshardware.com. Phison E26 controller, 14.5 GB/s sequential read.
- Dean, J. (2009). Numbers Everyone Should Know. Stanford CS295 keynote, archived by Brendan O'Connor. brenocon.com. SSD random read ≈ 150 µs; main memory ≈ 100 ns; disk seek ≈ 10 ms.
- Axboe, J. (2019). Efficient IO with io_uring. kernel.dk. PDF. Submission-queue and completion-queue ring design, zero-syscall fast path.
- Linux io_uring_setup(2) man page. man7.org. IORING_SETUP_SQPOLL, IORING_SETUP_IOPOLL.
- Microsoft Learn. I/O Completion Ports. learn.microsoft.com. The legacy Windows async-I/O API.
- Microsoft Learn. IoRing Win32 API. learn.microsoft.com. The Windows 11 io_uring-shaped API; pre-registered buffers, build-by-index requests.
- Crotty, A., Leis, V., & Pavlo, A. (2022). Are You Sure You Want to Use MMAP in Your Database Management System? CIDR. PDF. Page-table contention, single-threaded eviction, TLB shootdowns.
- Microsoft Learn. Texture Block Compression in Direct3D 11. learn.microsoft.com. BC1 through BC7, byte-per-block tables, use cases.
- RAD Game Tools / Epic Games. Oodle Texture. radgametools.com. Rate-distortion BCn re-encoder; 20-50% smaller output at perceptually equivalent quality.
- Pohl, T. (2022). Accelerating Load Times for DirectX Games and Apps with GDeflate for DirectStorage. NVIDIA Technical Blog. developer.nvidia.com. GDeflate design: 64 KiB tiles, 32-way sub-stream parallelism.
- RAD Game Tools / Epic Games. Oodle Kraken. radgametools.com. High-ratio LZ-family codec; PS5's hardware decompression block decodes Kraken.
- Bloom, C. (2020). How Oodle Kraken and Oodle Texture Combine for Vast Bandwidth on the PS5. cbloomrants. cbloomrants.blogspot.com. ~17 GB/s effective on BCn data with Oodle Texture pre-conditioning.
- RAD Game Tools / Epic Games. Oodle Data Compression Performance Chart. radgametools.com. Decode-speed-vs-ratio comparison across Selkie, Mermaid, Kraken, and Leviathan, with reference points for zlib and LZ4.
- Collet, Y. & Facebook. Zstandard Benchmark Page. github.com/facebook/zstd. Per-level decode throughput and ratio against the Silesia corpus and other content types.
- Johnson, T., & Shasha, D. (1994). 2Q: A Low Overhead High Performance Buffer Management Replacement Algorithm. VLDB. PDF. The hot/cold split that production caches still build on.
- Unity Technologies. AssetBundle File Format. docs.unity3d.com. Bundle layout: header + manifest + data segment.
- Megiddo, N., & Modha, D. S. (2003). ARC: A Self-Tuning, Low Overhead Replacement Cache. USENIX FAST. PDF. Two LRU lists, ghost-list adaptivity, constant-time per request.
- Andrews, C. (2019). Coming to DirectX 12: Sampler Feedback. Microsoft DevBlog. devblogs.microsoft.com. MinMip and MipRegionUsed feedback maps; 10× residency reduction on a tiled-resource scene.
- Intel. (2021). Applying DirectX Sampler Feedback: Texture Space Shading and Streaming. GDC. PDF. ~200 MB resident from a 1 GB heap referencing 350 GB of total texture data.
- Microsoft Learn. ID3D12Device::CreateReservedResource. learn.microsoft.com. D3D12 tiled resources, 64 KB tile size.
- Khronos Group. VkBindSparseInfo. Vulkan registry. registry.khronos.org. Vulkan's equivalent of D3D12 tiled resources.
- Microsoft Learn. ID3D12CommandQueue::UpdateTileMappings. learn.microsoft.com. The API call that binds physical memory to a tiled resource's tiles.
- Microsoft Learn. BypassIO. learn.microsoft.com. The Windows kernel I/O bypass used by DirectStorage; NVMe + NTFS + non-cached only.
- Epic Games. Texture Streaming Overview. Unreal Engine docs. dev.epicgames.com.
- Epic Games. Streaming Virtual Texturing. Unreal Engine docs. dev.epicgames.com. UE's SVT implementation; 128 KB tiles and a physical cache.
- Epic Games. World Partition in Unreal Engine. dev.epicgames.com. Grid-based actor streaming with per-data-layer loading ranges.
- Unity Technologies. The Mipmap Streaming System. docs.unity3d.com. Distance- and screen-coverage-driven mip residency; Texture2D.requestedMipmapLevel for manual control.
- Microsoft. DirectStorage SDK and Samples. GitHub. github.com/microsoft/DirectStorage. Including GpuDecompressionBenchmark.
- Karis, B. (2022). The Journey to Nanite. High Performance Graphics keynote. PDF.
- Microsoft. GDeflate Reference Implementation. GitHub. github.com/microsoft/DirectStorage. The open GDeflate spec; 32-way sub-stream swizzle, tile format, decompression rounds.
- Guerrilla Games. Streaming the World of Horizon Zero Dawn. guerrilla-games.com. Decima's streaming architecture across Horizon and Death Stranding.
- AMD. Radeon Memory Visualizer. gpuopen.com/rmv. AMD's tool for capturing VidMem allocation timelines and residency events; the canonical streaming-debug instrument on RDNA.
- Linux posix_fadvise(2) man page. man7.org. POSIX_FADV_SEQUENTIAL, POSIX_FADV_WILLNEED, POSIX_FADV_DONTNEED.