Build a Game Engine · Rendering

The 2D Renderer: Sprite Batching

Drawing one textured quad is easy; drawing ten thousand at sixty frames a second is the whole game. The trick is to stop issuing one draw call per sprite and start batching: pack many quads into one buffer and one draw. We build a batcher in C++ and Rust on top of the Vulkan scaffold, then handle the atlas, alpha blending, and sorting that a real 2D renderer needs.

Time: ~55 min
Level: Senior
Prereqs: The Vulkan Triangle (the draw loop), Textures (descriptor sets), and 3D Math (the ortho matrix).
Stack: C++ (Vulkan) · Rust (ash)

01 The draw-call problem

Each draw call and each state bind (pipeline, descriptor set, vertex buffer) carries a fixed CPU-side cost: the driver validates and translates state and encodes a command. Issue one draw per sprite and at a few thousand sprites the CPU can't keep up; the GPU sits idle waiting for commands. Wloka's classic talk put it bluntly: what matters is "how many batches per frame," not how many triangles^[1].

Batching speeds up the CPU, not the GPU

The common misframing. Batching cuts driver and per-draw overhead on the CPU; it does not make the GPU rasterize faster (it often does less total work, with fewer state changes). The numbers from Wloka's talk are D3D9-era and machine-specific; Vulkan and D3D12 lowered per-call overhead a lot, so don't quote one figure as current. But per-draw cost is still nonzero, so batching still pays^[1].


02 The quad & camera

Every sprite is a quad: two triangles, four vertices, a six-index list (0,1,2, 2,3,0) so the diagonal vertices are shared. Per-vertex you carry position, a UV into the texture, and a tint color. Sprites live in world space; a 2D camera and an orthographic projection map them to the screen.
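
As a minimal sketch of the index pattern above, the shared index buffer can be generated once for the largest batch you plan to draw. The helper name buildQuadIndices and its maxQuads parameter are illustrative, not part of the tutorial's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: builds a static index list for up to maxQuads quads.
// Quad q owns vertices 4q..4q+3 and repeats the pattern (0,1,2, 2,3,0).
std::vector<uint16_t> buildQuadIndices(uint32_t maxQuads) {
    // 16-bit indices address at most 65,536 vertices, i.e. 16,384 quads.
    assert(maxQuads <= 16384);
    static constexpr uint16_t pattern[6] = {0, 1, 2, 2, 3, 0};
    std::vector<uint16_t> indices;
    indices.reserve(static_cast<std::size_t>(maxQuads) * 6);
    for (uint32_t q = 0; q < maxQuads; ++q) {
        for (uint16_t p : pattern) {
            indices.push_back(static_cast<uint16_t>(q * 4 + p));
        }
    }
    return indices;
}
```

Because quad q always uses vertices 4q to 4q+3, this list never changes: build it at startup, upload it to a static index buffer, and reuse it for every batch.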

Orthographic projection maps a world box straight to clip space with no perspective foreshortening. A pixel-space ortho (left 0, right width, top 0, bottom height) makes one world unit one pixel. The camera's view matrix is pan (translate) and zoom (scale); the view-projection is ortho · view, computed once per frame.
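
One way to write the pixel-space ortho and the pan/zoom view, as a sketch. Mat4, ortho, view2D, mul, and transformPoint are hypothetical names; the layout is column-major, and the matrix assumes Vulkan's Y-down NDC and 0..1 depth, so top = 0 lands at the top of the screen.

```cpp
#include <array>
#include <cassert>

struct Vec2 { float x, y; };
struct Mat4 { std::array<float, 16> m{}; };  // column-major: m[col * 4 + row], zero-initialized

// Maps [left,right] x [top,bottom] x [zNear,zFar] to x,y in [-1,1] and z in [0,1].
Mat4 ortho(float left, float right, float top, float bottom, float zNear, float zFar) {
    assert(right != left && bottom != top && zFar != zNear);
    Mat4 r;
    r.m[0]  = 2.0f / (right - left);
    r.m[5]  = 2.0f / (bottom - top);
    r.m[10] = 1.0f / (zFar - zNear);
    r.m[12] = -(right + left) / (right - left);
    r.m[13] = -(bottom + top) / (bottom - top);
    r.m[14] = -zNear / (zFar - zNear);
    r.m[15] = 1.0f;
    return r;
}

// View = zoom (scale) applied after pan (translate by -camPos): screen = zoom * (world - camPos).
Mat4 view2D(Vec2 camPos, float zoom) {
    Mat4 r;
    r.m[0] = zoom;
    r.m[5] = zoom;
    r.m[10] = 1.0f;
    r.m[12] = -zoom * camPos.x;
    r.m[13] = -zoom * camPos.y;
    r.m[15] = 1.0f;
    return r;
}

// Standard 4x4 product a * b for column-major storage.
Mat4 mul(const Mat4& a, const Mat4& b) {
    Mat4 r;
    for (int col = 0; col < 4; ++col)
        for (int row = 0; row < 4; ++row)
            for (int k = 0; k < 4; ++k)
                r.m[col * 4 + row] += a.m[k * 4 + row] * b.m[col * 4 + k];
    return r;
}

// Transforms a 2D point (z = 0, w = 1) and returns the x,y result.
Vec2 transformPoint(const Mat4& m, Vec2 p) {
    return {m.m[0] * p.x + m.m[4] * p.y + m.m[12],
            m.m[1] * p.x + m.m[5] * p.y + m.m[13]};
}
```

Computing mul(ortho(0, width, 0, height, 0, 1), view2D(camPos, zoom)) once per frame gives the single matrix the shader needs; with an 800×600 target and an untouched camera, pixel (0, 0) lands at NDC (-1, -1) and the screen center lands at the origin.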

It's the Y-flip, not "Vulkan is left-handed"

Vulkan's NDC is right-handed; what trips people is the Y-down framebuffer mapping and the 0..1 depth range, both separate from world handedness (the 3D Math tutorial untangles this). Handle Y by flipping the ortho matrix's Y, swapping top/bottom, or a negative viewport height. Two more: zoom about the screen center (or cursor), not the world origin, or the view drifts; and don't bake the camera into per-sprite vertices on the CPU, keep world-space vertices and transform in the shader, or every camera move forces a full rebuild.
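
The zoom-about-a-point advice comes down to a few lines of arithmetic. This sketch assumes the convention screen = zoom * (world - pos) with the screen origin at the top-left; Camera2D, screenToWorld, and zoomAbout are hypothetical names.

```cpp
#include <cassert>

struct Vec2 { float x, y; };
struct Camera2D { Vec2 pos; float zoom; };  // assumed mapping: screen = zoom * (world - pos)

Vec2 screenToWorld(const Camera2D& c, Vec2 s) {
    return {c.pos.x + s.x / c.zoom, c.pos.y + s.y / c.zoom};
}

Vec2 worldToScreen(const Camera2D& c, Vec2 w) {
    return {(w.x - c.pos.x) * c.zoom, (w.y - c.pos.y) * c.zoom};
}

// Change the zoom while keeping the world point under anchorScreen fixed on screen.
void zoomAbout(Camera2D& c, Vec2 anchorScreen, float newZoom) {
    assert(newZoom > 0.0f);
    const Vec2 anchorWorld = screenToWorld(c, anchorScreen);  // world point under the anchor now
    c.zoom = newZoom;
    // Solve anchorScreen = newZoom * (anchorWorld - pos) for pos.
    c.pos = {anchorWorld.x - anchorScreen.x / newZoom,
             anchorWorld.y - anchorScreen.y / newZoom};
}
```

Pass the cursor position (or the screen center) as anchorScreen. Setting only the zoom and leaving pos alone scales about the camera's own origin instead, which is the drift the text warns about.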

03 Sprite batching

The batcher keeps a CPU-side vertex array. Each frame: clear it, append four vertices per visible sprite (world-space corners, atlas UVs, tint), copy the array into a per-frame dynamic vertex buffer, and issue one vkCmdDrawIndexed per batch. This is the MonoGame SpriteBatch design: accumulate, then flush^[11]^[2].

Per-frame-in-flight, and what breaks a batch

The dynamic vertex buffer must be per-frame-in-flight: you can't overwrite a buffer the GPU is still reading from the previous frame, so allocate one per frame in flight and index by the current frame (exactly like the command buffers and sync objects). And a batch breaks on a texture change, a pipeline change, or a blend-state change: you flush the current batch, bind the new state, and start a fresh one^[2]. There's no universal "max sprites per batch"; it's bounded by the buffer size and the index width (16-bit indices cap one draw at 16,384 quads).
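
To make the batch-break rules concrete, here is a small CPU-only sketch that counts how many draws a submission order produces. It models only two of the break conditions (a texture change and the 16-bit index cap); countDraws and kMaxQuadsPerDraw are illustrative names, and texture identity is reduced to an integer id.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// 16-bit indices address 65,536 vertices; at 4 vertices per quad that is 16,384 quads.
constexpr uint32_t kMaxQuadsPerDraw = 65536 / 4;

// textureIds holds one entry per sprite, in submission order.
uint32_t countDraws(const std::vector<uint32_t>& textureIds) {
    uint32_t draws = 0;
    uint32_t quadsInBatch = 0;
    uint32_t current = 0;
    for (uint32_t tex : textureIds) {
        const bool breaks = quadsInBatch > 0 &&
                            (tex != current || quadsInBatch == kMaxQuadsPerDraw);
        if (breaks) {          // flush the open batch before starting a new one
            ++draws;
            quadsInBatch = 0;
        }
        current = tex;
        ++quadsInBatch;
    }
    if (quadsInBatch > 0) ++draws;  // final flush at end of frame
    return draws;
}
```

Five sprites submitted with textures 1, 1, 2, 2, 1 cost three draws, while the same five sprites on one atlas cost one. Pipeline and blend-state changes would add further break conditions in the same place.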

The batcher: accumulate, flush on texture change

struct SpriteVertex { float pos[2]; float uv[2]; uint32_t rgba; };  // 20 bytes, packed color

VkDeviceSize writeOffset = 0;   // byte cursor into this frame's vertex buffer; reset to 0 at frame start

void draw(const Sprite& s) {
    if (s.texture != batchTexture) flush();    // texture change breaks the batch
    batchTexture = s.texture;
    appendQuad(verts, s);                       // 4 world-space verts, atlas UVs, tint
}

void flush() {
    if (verts.empty()) return;
    const VkDeviceSize bytes = verts.size() * sizeof(SpriteVertex);
    // PER frame-in-flight: never overwrite an in-use buffer. Within a frame, append each batch
    // at writeOffset: earlier draws are only recorded, not yet executed, so their data must survive.
    char* dst = static_cast<char*>(mappedVertexBuffer[frameIndex]) + writeOffset;
    memcpy(dst, verts.data(), bytes);
    vkCmdBindDescriptorSets(cmd, ..., &batchTexture.set, ...);  // the combined image sampler
    vkCmdBindVertexBuffers(cmd, 0, 1, &vertexBuffer[frameIndex], &writeOffset);
    vkCmdBindIndexBuffer(cmd, quadIndexBuffer, 0, VK_INDEX_TYPE_UINT16);
    vkCmdDrawIndexed(cmd, static_cast<uint32_t>(verts.size() / 4) * 6, 1, 0, 0, 0);  // one draw for the whole batch
    writeOffset += bytes;       // the next batch this frame lands after this one
    verts.clear();
}

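#[repr(C)]               // C layout so the bytes match the vertex input description
#[derive(Clone, Copy)]
struct SpriteVertex { pos: [f32; 2], uv: [f32; 2], rgba: u32 }   // 20 bytes

fn draw(&mut self, s: &Sprite) {
    if s.texture != self.batch_texture { self.flush(); }   // texture change breaks the batch
    self.batch_texture = s.texture;
    self.append_quad(s);                            // 4 world-space verts, atlas UVs, tint
}

fn flush(&mut self) {
    if self.verts.is_empty() { return; }
    // PER frame-in-flight. Within a frame, append each batch at vertex_cursor (counted in
    // vertices, reset to 0 at frame start): earlier draws are recorded but not yet executed.
    let byte_offset = (self.vertex_cursor * std::mem::size_of::<SpriteVertex>()) as vk::DeviceSize;
    unsafe {
        let dst = self.mapped_vertex_buffer[self.frame_index].add(self.vertex_cursor);  // *mut SpriteVertex
        std::ptr::copy_nonoverlapping(self.verts.as_ptr(), dst, self.verts.len());
        device.cmd_bind_descriptor_sets(cmd, ..., &[self.batch_texture.set], &[]);
        device.cmd_bind_vertex_buffers(cmd, 0, &[self.vertex_buffer[self.frame_index]], &[byte_offset]);
        device.cmd_bind_index_buffer(cmd, self.quad_index_buffer, 0, vk::IndexType::UINT16);
        device.cmd_draw_indexed(cmd, (self.verts.len() / 4 * 6) as u32, 1, 0, 0, 0);  // one draw
    }
    self.vertex_cursor += self.verts.len();   // the next batch this frame lands after this one
    self.verts.clear();
}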

What's intentionally missing

A staging copy to device-local memory (we map host-visible, which is standard for per-frame streaming but can be slower to read on discrete GPUs); buffer growth / ring-buffering when the batch exceeds capacity; the 32-bit index fallback past 16k quads; frustum culling of off-screen sprites; sort-key generation; multi-threaded batch building with secondary command buffers; and descriptor-set caching.

04 Instancing

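The alternative to rebuilding all the vertices each frame: upload the quad's four vertices once, and put per-sprite data (position, scale, UV rect, color) in a second buffer bound at VK_VERTEX_INPUT_RATE_INSTANCE. Draw with vkCmdDrawIndexed(cmd, 6, instanceCount, 0, 0, 0), reusing the six-index quad list (a non-indexed vkCmdDraw of 6 vertices would need six corner vertices, not four). The shader reads the static corner plus the per-instance attributes, which advance once per instance; gl_InstanceIndex is the matching index if you fetch the per-sprite data from a storage buffer instead^[3].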

Both are valid; neither is universally best. Dynamic-vertex batching re-uploads all vertex data each frame (CPU rebuild plus bandwidth scaling with sprite count), but handles arbitrary per-sprite geometry and interleaved sorting easily. Instancing uploads far less when sprites are uniform quads (particles, tilemaps), but interleaving across textures is more awkward. Instancing still breaks the batch on a texture change: it cuts vertex bandwidth, not the texture-binding constraint.
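
The bandwidth difference is easy to put numbers on. The sketch below assumes one possible per-instance layout (SpriteInstance, with the UV rect stored as origin plus size) and emulates on the CPU what the vertex shader would do with a unit-quad corner; none of these names come from the tutorial's code.

```cpp
#include <cassert>
#include <cstdint>

struct SpriteVertex   { float pos[2]; float uv[2]; uint32_t rgba; };                   // per vertex
struct SpriteInstance { float pos[2]; float scale[2]; float uvRect[4]; uint32_t rgba; }; // per sprite

static_assert(sizeof(SpriteVertex) == 20, "dynamic batching uploads 4 x 20 = 80 bytes per sprite");
static_assert(sizeof(SpriteInstance) == 36, "instancing uploads 36 bytes per sprite");

struct Corner { float x, y; };                        // unit-quad corner, each component 0 or 1
struct ExpandedVertex { float pos[2]; float uv[2]; };

// CPU emulation of the vertex shader: static corner + per-instance data -> world position and UV.
// uvRect is assumed to be (u0, v0, width, height) in normalized atlas coordinates.
ExpandedVertex expand(const SpriteInstance& inst, Corner c) {
    return {{inst.pos[0] + c.x * inst.scale[0], inst.pos[1] + c.y * inst.scale[1]},
            {inst.uvRect[0] + c.x * inst.uvRect[2], inst.uvRect[1] + c.y * inst.uvRect[3]}};
}
```

With these layouts instancing streams 36 bytes per sprite instead of 80, at the cost of restricting every sprite to an axis-aligned scaled quad unless more per-instance fields (rotation, for example) are added.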

05 Atlas & bleeding

A texture atlas packs many sprites into one image so they share a descriptor and don't break the batch; sprites reference sub-rects via UVs^[4]. (A texture array of same-size layers, selected by index, avoids the UV math; bindless indexes a big descriptor array, but is feature-gated.)

Atlas bleeding, and why mips make it worse

Bilinear filtering samples the four nearest texels, so at a sprite's edge it pulls in the neighbor sprite's texels: a visible seam. The standard fixes are a half-texel UV inset, a padding gutter between sprites (2px or more), and edge extrusion^[8]. Mipmaps make it worse: each level halves resolution, so the bleed crosses farther, and a gutter must be wide enough at the coarsest mip you use. The clean fix is per-sprite mips packed into the atlas, or a texture array, rather than mipping the whole atlas. (Wrapping/tiling also can't work inside an atlas: a sub-rect can't repeat on itself.)
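
A sketch of the half-texel inset, assuming sprite rectangles are stored in atlas pixels; UvRect, insetUv, and minGutterTexels are hypothetical helpers. The gutter rule in minGutterTexels (one texel at mip level n covers 2^n base texels, so pad by at least that much) is a common rule of thumb stated here as an assumption, not a figure from the cited sources.

```cpp
#include <cassert>

struct UvRect { float u0, v0, u1, v1; };

// Sprite occupies pixels [x, x+w) x [y, y+h) of an atlasW x atlasH atlas.
// Pulling each edge in by half a texel keeps bilinear taps inside the sprite at mip 0.
UvRect insetUv(int x, int y, int w, int h, int atlasW, int atlasH) {
    assert(w > 0 && h > 0 && atlasW > 0 && atlasH > 0);
    return {(x + 0.5f) / atlasW,
            (y + 0.5f) / atlasH,
            (x + w - 0.5f) / atlasW,
            (y + h - 0.5f) / atlasH};
}

// Assumed rule of thumb: base-level padding needed to keep one texel of separation at a given mip.
int minGutterTexels(int coarsestMip) {
    assert(coarsestMip >= 0 && coarsestMip < 16);
    return 1 << coarsestMip;
}
```

The inset trims half a texel of the sprite's own border, which is usually invisible; the gutter and edge extrusion cover the cases the inset alone cannot, such as minification through mips.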


06 The view-projection

The view-projection is one mat4 (64 bytes), and the smallest way to hand it to the shader each frame is a push constant: the guaranteed minimum maxPushConstantsSize is 128 bytes, so one matrix fits with room to spare^[9]. Push constants have no backing memory and are fast; a UBO (via a descriptor) is the alternative when you need more.

128 bytes is the floor, not the ceiling

Many desktop GPUs expose 256, some mobile only the 128 minimum. A full model + view + projection (3 × mat4 = 192 bytes) overflows the floor, so push the premultiplied view-projection (64 bytes), not all three, and query maxPushConstantsSize rather than assuming. The VkPushConstantRange stages must match the shader stages that read it.
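
The size arithmetic can be checked at compile time. In this sketch the struct names and fitsPushConstants are illustrative; in a real renderer the limit argument would come from VkPhysicalDeviceLimits::maxPushConstantsSize rather than a hard-coded constant.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Mat4 { float m[16]; };                 // 16 floats = 64 bytes

struct PushVP  { Mat4 viewProj; };            // premultiplied view-projection
struct PushMVP { Mat4 model, view, proj; };   // three separate matrices

constexpr uint32_t kGuaranteedPushBytes = 128;  // spec minimum for maxPushConstantsSize

static_assert(sizeof(PushVP) == 64, "one mat4 fits the 128-byte floor");
static_assert(sizeof(PushMVP) == 192, "three mat4s exceed the 128-byte floor");

// limit: the device's reported maxPushConstantsSize.
constexpr bool fitsPushConstants(std::size_t bytes, uint32_t limit) {
    return bytes <= limit;
}
```

PushMVP only fits on devices that report 256 bytes or more, which is why querying the limit matters if you ever go beyond the single premultiplied matrix.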

07 Alpha blending

Transparency is a per-attachment blend: the framebuffer result is src·S_f + dst·D_f (where S_f and D_f are the source and destination blend factors) with the "over" operator^[5]. There are two conventions, and the choice matters more than it looks.

Straight alpha: src = SRC_ALPHA, dst = ONE_MINUS_SRC_ALPHA. The texture stores un-multiplied color.
Premultiplied alpha: src = ONE, dst = ONE_MINUS_SRC_ALPHA. The texture's RGB is already multiplied by its alpha.

Premultiplied composites correctly under filtering and mips

The reason to prefer premultiplied: the GPU samples four texels and interpolates, then blends, and it can't un-multiply first. With straight alpha, a fully transparent texel still contributes its stored RGB to the interpolation, so a bilinear or mip blend pulls in garbage color, the dark or colored fringe halo around cutouts. Premultiplied blending commutes with interpolation, so filtering introduces no error^[6]^[7]. It also unifies "over" and additive in one blend state (additive is premultiplied with alpha 0), which is why particle systems use it. Scope: the correctness win is specifically under filtering/mips/interpolation.
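
A worked example of the fringe, as a CPU sketch: two neighboring texels, one opaque red and one fully transparent whose stored RGB happens to be green, sampled halfway between them and blended over a blue background. Rgba, lerp, premultiply, and the two blend functions are illustrative names that mirror the blend factors listed above.

```cpp
#include <cassert>

struct Rgba { float r, g, b, a; };

// Linear interpolation of all four channels, as a texture filter does.
Rgba lerp(Rgba x, Rgba y, float t) {
    return {x.r + (y.r - x.r) * t, x.g + (y.g - x.g) * t,
            x.b + (y.b - x.b) * t, x.a + (y.a - x.a) * t};
}

Rgba premultiply(Rgba c) { return {c.r * c.a, c.g * c.a, c.b * c.a, c.a}; }

// Straight alpha: src = SRC_ALPHA, dst = ONE_MINUS_SRC_ALPHA (color channels).
Rgba blendStraight(Rgba s, Rgba d) {
    const float k = 1.0f - s.a;
    return {s.r * s.a + d.r * k, s.g * s.a + d.g * k, s.b * s.a + d.b * k, s.a + d.a * k};
}

// Premultiplied alpha: src = ONE, dst = ONE_MINUS_SRC_ALPHA.
Rgba blendPremultiplied(Rgba s, Rgba d) {
    const float k = 1.0f - s.a;
    return {s.r + d.r * k, s.g + d.g * k, s.b + d.b * k, s.a + d.a * k};
}
```

With opaque red {1,0,0,1}, transparent green {0,1,0,0}, and t = 0.5, the straight-alpha path filters to {0.5,0.5,0,0.5} and blends over blue to {0.25,0.25,0.5}: a quarter of the invisible texel's green leaks in and the red is too dark. Premultiplying first filters to {0.5,0,0,0.5} and blends to {0.5,0,0.5}, which is exactly half-coverage red over blue.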

Premultiplied-alpha blend state

VkPipelineColorBlendAttachmentState blend{};
blend.blendEnable = VK_TRUE;
blend.srcColorBlendFactor = VK_BLEND_FACTOR_ONE;                  // premultiplied: RGB already × alpha
blend.dstColorBlendFactor = VK_BLEND_FACTOR_ONE_MINUS_SRC_ALPHA;
blend.colorBlendOp = VK_BLEND_OP_ADD;
blend.srcAlphaBlendFactor = VK_BLEND_FACTOR_ONE;
blend.dstAlphaBlendFactor = VK_BLEND_FACTOR_ONE_MINUS_SRC_ALPHA;
blend.alphaBlendOp = VK_BLEND_OP_ADD;
blend.colorWriteMask = VK_COLOR_COMPONENT_R_BIT | VK_COLOR_COMPONENT_G_BIT | VK_COLOR_COMPONENT_B_BIT | VK_COLOR_COMPONENT_A_BIT;
// straight alpha: set srcColorBlendFactor = VK_BLEND_FACTOR_SRC_ALPHA instead

let blend = vk::PipelineColorBlendAttachmentState::default()
    .blend_enable(true)
    .src_color_blend_factor(vk::BlendFactor::ONE)                  // premultiplied
    .dst_color_blend_factor(vk::BlendFactor::ONE_MINUS_SRC_ALPHA)
    .color_blend_op(vk::BlendOp::ADD)
    .src_alpha_blend_factor(vk::BlendFactor::ONE)
    .dst_alpha_blend_factor(vk::BlendFactor::ONE_MINUS_SRC_ALPHA)
    .alpha_blend_op(vk::BlendOp::ADD)
    .color_write_mask(vk::ColorComponentFlags::RGBA);
// straight alpha: .src_color_blend_factor(vk::BlendFactor::SRC_ALPHA) instead

08 Draw order

Opaque sprites can draw in any order with the depth test on; the z-buffer resolves occlusion. Transparent sprites can't: the depth buffer stores one depth per pixel and can't represent "see-through," so blended sprites must be sorted back-to-front (painter's algorithm) and drawn after the opaques, usually with depth test on but depth write off^[10].

Sorting fights batching, and can't always win

Back-to-front breaks for intersecting or mutually overlapping transparents: no single correct order exists. Order-independent transparency (OIT) approximations exist for that case, but they are an advanced escape hatch, not a 2D default. Sorting also fights batching: sorting purely by depth can force texture switches (batch breaks), so 2D renderers sort within a layer and lean hard on atlases to keep the batch intact. MonoGame's SpriteSortMode exposes exactly this tension^[11].
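
One common way to express "sort within a layer" is a packed sort key with the layer in the high bits and the texture in the low bits. This is a sketch with hypothetical names (SpriteRef, sortKey, sortForBatching, countTextureBatches), and it assumes sprites inside one layer may be reordered freely, which only holds if they don't overlap or their relative order doesn't matter.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct SpriteRef { uint32_t layer; uint32_t texture; };

// Layer dominates (back-to-front across layers); texture groups sprites within a layer.
uint64_t sortKey(const SpriteRef& s) {
    return (static_cast<uint64_t>(s.layer) << 32) | s.texture;
}

void sortForBatching(std::vector<SpriteRef>& sprites) {
    // stable_sort keeps submission order among sprites with equal keys.
    std::stable_sort(sprites.begin(), sprites.end(),
                     [](const SpriteRef& a, const SpriteRef& b) { return sortKey(a) < sortKey(b); });
}

// Number of draws if the only batch break is a texture change.
uint32_t countTextureBatches(const std::vector<SpriteRef>& sprites) {
    uint32_t draws = 0;
    for (std::size_t i = 0; i < sprites.size(); ++i) {
        if (i == 0 || sprites[i].texture != sprites[i - 1].texture) ++draws;
    }
    return draws;
}
```

Six sprites that alternate between two textures across two layers cost six draws in submission order and four after sorting; putting both textures on one atlas would bring that to one without touching the order at all.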

09 Text rendering

Text is just the batcher again: a bitmap font is a glyph atlas plus per-glyph metrics, and you draw one quad per glyph sampling its sub-rect, so all glyphs share the font atlas and batch into one draw. Advance and kerning come from the metrics.
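
A minimal layout loop for that scheme might look like the sketch below. The Glyph, BitmapFont, and GlyphQuad types are assumptions (real font formats carry more fields, including the atlas UVs omitted here), coordinates are Y-down with bearingY measured up from the baseline, and the text is treated as single-byte characters.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Glyph { float advance, bearingX, bearingY, w, h; };  // metrics in pixels; atlas UVs omitted
struct GlyphQuad { float x, y, w, h; char ch; };             // top-left corner and size

struct BitmapFont {
    std::map<char, Glyph> glyphs;
    std::map<std::pair<char, char>, float> kerning;          // adjustment for a (left, right) pair
};

std::vector<GlyphQuad> layoutText(const BitmapFont& font, const std::string& text,
                                  float penX, float baselineY) {
    std::vector<GlyphQuad> quads;
    char prev = 0;
    for (char ch : text) {
        const auto it = font.glyphs.find(ch);
        if (it == font.glyphs.end()) { prev = 0; continue; }  // skip glyphs the font lacks
        if (prev != 0) {
            const auto k = font.kerning.find(std::make_pair(prev, ch));
            if (k != font.kerning.end()) penX += k->second;   // kerning shifts the pen first
        }
        const Glyph& g = it->second;
        quads.push_back({penX + g.bearingX, baselineY - g.bearingY, g.w, g.h, ch});
        penX += g.advance;                                     // advance moves to the next glyph
        prev = ch;
    }
    return quads;
}
```

Each GlyphQuad then goes through the same appendQuad path as a sprite, with UVs taken from the glyph's sub-rect in the font atlas, so a whole string stays in one batch.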

Bitmap glyphs blur when scaled past their baked size; the scalable answer is a signed distance field (SDF), which stores a distance instead of pixels so text stays crisp at any scale from a small atlas (MSDF preserves sharp corners). That's a tradeoff (a shader and its own corner artifacts), not free. Full text shaping (bidirectional text, ligatures, complex scripts) is a separate problem handled by a shaping library.


10 Pitfalls

"Batching speeds up the GPU": It cuts CPU/driver draw-call overhead. The GPU does the same (or less) work.

Draws still high after batching: Texture changes break the batch. Pack into an atlas or array.

Overwriting the in-flight VB: The dynamic vertex buffer must be per-frame-in-flight.

Atlas edge seams: Bilinear bleed; add a half-texel inset and a padding gutter.

Dark halo on cutouts: Straight alpha under filtering; use premultiplied alpha.

Transparents show through wrong: Sort transparents back-to-front; the depth buffer can't sort them.

Pushing MVP as 3 matrices: 192 B overflows the 128 B push-constant floor; push the premultiplied VP.

16k-quad wall: 16-bit indices cap a draw at 16,384 quads; use 32-bit or split.

11 What's next

The renderer can now put thousands of sprites and glyphs on screen in a handful of draws. The 2D rendering track is complete. Next is The Audio Engine, the last subsystem before the 2D-game capstone wires the loop, renderer, input, audio, physics, and assets into a playable game. The full path is on the series hub.

[1] Matthias Wloka. "Batch, Batch, Batch: What Does It Really Mean?" GDC 2003. nvidia.com. Draw calls are CPU-bound; what matters is batches per frame, not triangles.
[2] Matt DesLauriers. "Sprite Batching" (lwjgl-basics wiki). github.com/mattdesl. Accumulate quad vertices; flush on texture change, buffer full, or end of frame.
[3] Joey de Vries. LearnOpenGL, "Instancing." learnopengl.com. Per-instance attributes and one instanced draw replacing thousands of per-object calls.
[4] Joey de Vries. LearnOpenGL, 2D Game "Final thoughts." learnopengl.com. Batching quads and sprite sheets to cut texture-state switches.
[5] The Khronos Group. Vulkan Specification, "The Framebuffer" (blending). docs.vulkan.org. The blend equation and VkPipelineColorBlendAttachmentState factors.
[6] Inigo Quilez. "Premultiplied alpha." iquilezles.org. Premultiplied blending commutes with scaling/interpolation; straight-alpha error peaks at texel edges.
[7] Eric Haines. "GPUs prefer premultiplication" (Real-Time Rendering blog). realtimerendering.com. The GPU interpolates then blends and can't un-premultiply, so straight alpha fringes under filtering.
[8] CodeAndWeb. TexturePacker "Texture Settings." codeandweb.com. Shape padding, extrude, and alpha-bleed to kill atlas bleeding and dark halos.
[9] Victor Blanco. Vulkan Guide, "Push Constants." vkguide.dev. Push constants for small per-frame data; the 128-byte guaranteed minimum.
[10] Tomas Akenine-Möller, Eric Haines, Naty Hoffman, et al. Real-Time Rendering, 4th ed., transparency chapter. realtimerendering.com. Back-to-front "over" blending and why overlapping transparents resist sorting.
[11] MonoGame. SpriteBatch documentation and source. docs.monogame.net. The canonical deferred batcher and SpriteSortMode (the sort-vs-batch tension).
[12] ash (Rust). docs.rs/ash. Version 0.38: VK_VERTEX_INPUT_RATE_INSTANCE, the blend-state setters, cmd_push_constants, cmd_draw_indexed.