Build a Game Engine · 3D Rendering

Deferred Rendering & the G-buffer

A naive forward renderer shades every fragment of every object against every light, including fragments that get overwritten. Deferred rendering breaks that: rasterize the scene once into a G-buffer of surface attributes, then light each visible pixel exactly once. It decouples shading from object count, at the price of bandwidth, and it's where the whole architecture conversation (MSAA, transparency, tiled vs clustered, Forward+) lives.

Time: ~55 min · Level: Senior · Prereqs: The Going 3D tutorial (MRT-capable dynamic rendering, depth), PBR (the BRDF the lighting pass runs), and Shadows · Stack: C++ & Rust (Vulkan) · GLSL

01 Why forward struggles

A naive forward renderer loops every light inside the shading of every fragment, so it does work proportional to (fragments shaded × lights). And because the fragment shader runs on hidden fragments too, it shades pixels the depth test later discards: overdraw. As dynamic light count climbs, this gets expensive^[1].

Scope it: "naive" forward, and overdraw is wasted shading

Real forward renderers cull: per-object light lists, light volumes, and a depth pre-pass that kills overdraw before the expensive shading. The bad case is specifically naive forward (every light evaluated for every shaded fragment, no pre-pass). And "overdraw" here means redundant shading of fragments that get overwritten, not the cheap vertex/raster cost of extra triangles. Forward eventually got light culling too (Forward+, §8), so the claim isn't "forward can't do many lights"; it's that the naive version scales poorly.

02 The deferred idea

Deferred rendering splits rendering in two. The geometry pass rasterizes the scene once and writes per-pixel surface attributes (material parameters, normal, depth) into the G-buffer, a set of render targets, via multiple-render-target (MRT) output. The lighting pass then runs once per screen pixel: read the G-buffer, accumulate all lights for that one visible surface. Shading becomes proportional to (visible pixels × lights), decoupled from object count and from overdraw, because only the front-most surface survives into the G-buffer^[1].

Decoupled, but not free, and "once per visible pixel" not "once per geometry"

Deferred trades the overdraw/object-count win for G-buffer bandwidth: you write N targets in the geometry pass and read them all in the lighting pass. It's a bandwidth-for-shading trade, not a free speedup (§6). The geometry pass still rasterizes every triangle, and the depth test keeps only the nearest fragment's attributes; what's eliminated is redundant lighting of overdrawn fragments. (Aside: this is "deferred shading", which stores full material params and runs the whole BRDF later. The older "deferred lighting / light pre-pass" stored only normals and accumulated light, then re-rendered geometry to apply albedo: two geometry passes, a pre-MRT-era workaround^[3].)

03 The G-buffer

The G-buffer stores everything the lighting pass needs about the one visible surface per pixel: base color, normal, metallic + roughness (the glTF params from PBR), depth (the depth attachment, which also serves as the position source), and emissive. Channel count and precision are a budget: a fat G-buffer stores more at higher precision (flexible, more bandwidth); a thin one packs aggressively. DOOM 2016 stores normals in R16G16 (octahedral) and specular in R8G8B8A8^[8].

On Vulkan 1.3, you only get 4 color attachments guaranteed

The Vulkan Core/1.3 guaranteed minimum for maxColorAttachments is 4, not 8 (the ≥8 guarantee arrives with Vulkan 1.4 / Roadmap 2024)^[11]. In practice almost all desktop GPUs report 8, but a portable design fits the G-buffer in ≤4 color attachments + depth (or queries the limit). Also: base color and emissive are sRGB-encoded, while metallic, roughness, and the packed normal are linear data, so don't put data channels through an sRGB view. "Fat vs thin" is a bandwidth tradeoff, not a quality dial.
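The budget above can be put into numbers with a small sketch. Everything here is illustrative: the `Format` enum, `GBufferLayout`, and the two example layouts are hypothetical helpers (not Vulkan or ash types), and the traffic estimate assumes one write and one read per pixel with no overdraw and no hardware compression.

```rust
/// Attachment formats for this sketch, named after their Vulkan counterparts.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Format {
    Rgba8Srgb,
    Rgba8Unorm,
    Rg16Snorm,
    Rgba16Sfloat,
    D32Sfloat,
}

impl Format {
    /// Storage size of one texel in bytes.
    fn bytes_per_texel(self) -> u64 {
        match self {
            Format::Rgba8Srgb | Format::Rgba8Unorm | Format::Rg16Snorm | Format::D32Sfloat => 4,
            Format::Rgba16Sfloat => 8,
        }
    }
}

/// A G-buffer described as a list of color targets plus one depth target.
struct GBufferLayout {
    color: Vec<Format>,
    depth: Format,
}

impl GBufferLayout {
    /// Bytes stored per pixel across all targets, depth included.
    fn bytes_per_pixel(&self) -> u64 {
        let color: u64 = self.color.iter().map(|f| f.bytes_per_texel()).sum();
        color + self.depth.bytes_per_texel()
    }

    /// True if the color target count fits the device's maxColorAttachments.
    fn fits(&self, max_color_attachments: usize) -> bool {
        self.color.len() <= max_color_attachments
    }

    /// Rough per-frame traffic: one write (geometry pass) + one read (lighting pass).
    fn frame_traffic_bytes(&self, width: u64, height: u64) -> u64 {
        2 * self.bytes_per_pixel() * width * height
    }
}

/// Hypothetical thin layout: sRGB albedo, octahedral normal, packed material.
fn thin_layout() -> GBufferLayout {
    GBufferLayout {
        color: vec![Format::Rgba8Srgb, Format::Rg16Snorm, Format::Rgba8Unorm],
        depth: Format::D32Sfloat,
    }
}

/// Hypothetical fat layout: five half-float targets, too many for the 1.3 minimum.
fn fat_layout() -> GBufferLayout {
    GBufferLayout {
        color: vec![Format::Rgba16Sfloat; 5],
        depth: Format::D32Sfloat,
    }
}
```

Under these assumptions the thin layout costs 16 bytes per pixel and fits the guaranteed 4 attachments, while the fat one costs 44 bytes per pixel and needs a device that reports at least 5.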

[Interactive demo: toggle the G-buffer channels. The stored channels stay fixed while the light orbits; only the final composite changes.]

04 Position & normals

Two packing decisions save a lot of bandwidth. Don't store a position target; reconstruct position from the depth buffer. And store the normal in two channels, not three.

Position from depth: the Vulkan-0..1 reconstruction

Build the NDC position from the screen UV and the sampled depth, multiply by the inverse projection (for view space) or the inverse view-projection (for world space), and divide by w^[4]. The Vulkan catch (same as Going 3D and Shadows): NDC depth is already 0..1, so set ndc.z = depthSample directly, not depthSample*2−1 (the GL-style remap is the classic reconstruction bug). Use the right inverse matrix for the space you want. Storing a position target works but is redundant: it burns a high-precision target's bandwidth for data the depth buffer already holds.
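The unprojection can be checked on the CPU with a round trip: project a view-space point, keep only its UV and depth, and reconstruct it. The sketch below is a minimal illustration under stated assumptions: a right-handed view space looking down −Z, a 0..1 depth range, no Vulkan Y flip, and hand-written helper names (`perspective_zo`, `inverse_perspective_zo`, `view_pos_from_depth`) that are hypothetical and not from any library.

```rust
type Vec4 = [f64; 4];
type Mat4 = [[f64; 4]; 4]; // row-major storage, multiplies column vectors

fn mul(m: &Mat4, v: Vec4) -> Vec4 {
    let mut out = [0.0; 4];
    for r in 0..4 {
        for c in 0..4 {
            out[r] += m[r][c] * v[c];
        }
    }
    out
}

/// Perspective projection mapping view-space z in [-near, -far] to depth [0, 1].
fn perspective_zo(fovy: f64, aspect: f64, near: f64, far: f64) -> Mat4 {
    let f = 1.0 / (fovy * 0.5).tan();
    let c = far / (near - far);
    let d = near * far / (near - far);
    [
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, c, d],
        [0.0, 0.0, -1.0, 0.0],
    ]
}

/// Closed-form inverse of `perspective_zo` for the same parameters.
fn inverse_perspective_zo(fovy: f64, aspect: f64, near: f64, far: f64) -> Mat4 {
    let f = 1.0 / (fovy * 0.5).tan();
    let c = far / (near - far);
    let d = near * far / (near - far);
    [
        [aspect / f, 0.0, 0.0, 0.0],
        [0.0, 1.0 / f, 0.0, 0.0],
        [0.0, 0.0, 0.0, -1.0],
        [0.0, 0.0, 1.0 / d, c / d],
    ]
}

/// What the geometry pass effectively stores: screen UV and 0..1 depth.
fn project_to_uv_depth(proj: &Mat4, p: [f64; 3]) -> ([f64; 2], f64) {
    let clip = mul(proj, [p[0], p[1], p[2], 1.0]);
    let ndc = [clip[0] / clip[3], clip[1] / clip[3], clip[2] / clip[3]];
    ([ndc[0] * 0.5 + 0.5, ndc[1] * 0.5 + 0.5], ndc[2])
}

/// What the lighting pass does: rebuild NDC, unproject, divide by w.
fn view_pos_from_depth(inv_proj: &Mat4, uv: [f64; 2], depth: f64) -> [f64; 3] {
    // XY: [0,1] -> [-1,1]; Z: used as-is because depth is already in [0,1].
    let ndc = [uv[0] * 2.0 - 1.0, uv[1] * 2.0 - 1.0, depth, 1.0];
    let v = mul(inv_proj, ndc);
    [v[0] / v[3], v[1] / v[3], v[2] / v[3]]
}
```

Feeding `depth * 2.0 - 1.0` into `view_pos_from_depth` instead of `depth` reproduces the GL-style bug: the reconstructed point lands far closer to the camera than the original.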

Position from depth + octahedral normals (GLSL)

// --- position from depth (Vulkan 0..1) ---
float depth = texture(gDepth, uv).r;        // Vulkan: already in [0,1]
vec4 ndc = vec4(uv * 2.0 - 1.0, depth, 1.0);  // XY: [0,1]->[-1,1]; Z: NO *2-1 on Vulkan
vec4 world = invViewProj * ndc;             // inverse view-projection -> world
world /= world.w;                            // perspective divide

// --- octahedral normal: unit vector <-> 2 channels (Cigolle et al.) ---
vec2 signNotZero(vec2 v) { return vec2(v.x >= 0.0 ? 1.0 : -1.0, v.y >= 0.0 ? 1.0 : -1.0); }
vec2 octEncode(vec3 n) {
    n /= (abs(n.x) + abs(n.y) + abs(n.z));   // project onto the octahedron
    vec2 e = n.xy;
    if (n.z < 0.0) e = (1.0 - abs(e.yx)) * signNotZero(e);  // fold lower hemisphere
    return e;
}
// ...and the inverse the lighting pass calls: 2 channels -> unit vector
vec3 octDecode(vec2 e) {
    vec3 n = vec3(e.xy, 1.0 - abs(e.x) - abs(e.y));    // unfold the octahedron
    if (n.z < 0.0) n.xy = (1.0 - abs(n.yx)) * signNotZero(n.xy);  // restore lower hemisphere
    return normalize(n);
}

Octahedral encoding is the standard 2-channel packing, more accurate per bit than a naive spheremap, and unlike the "store X,Y, reconstruct Z" hemisphere trick it represents the full sphere with no sign ambiguity^[5].
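A CPU-side port makes the round trip easy to test, including the lower hemisphere that the "reconstruct Z" trick cannot represent. This is a sketch that mirrors the GLSL above; the 16-bit SNORM quantization helpers are an assumption about the target format (an R16G16_SNORM-style target), added only to show that precision survives storage.

```rust
fn sign_not_zero(v: f32) -> f32 {
    if v >= 0.0 { 1.0 } else { -1.0 }
}

/// Unit vector -> two values in [-1, 1].
fn oct_encode(n: [f32; 3]) -> [f32; 2] {
    // Project onto the octahedron |x| + |y| + |z| = 1.
    let l1 = n[0].abs() + n[1].abs() + n[2].abs();
    let (x, y, z) = (n[0] / l1, n[1] / l1, n[2] / l1);
    if z < 0.0 {
        // Fold the lower hemisphere over the diagonals.
        [(1.0 - y.abs()) * sign_not_zero(x), (1.0 - x.abs()) * sign_not_zero(y)]
    } else {
        [x, y]
    }
}

/// Two values in [-1, 1] -> unit vector.
fn oct_decode(e: [f32; 2]) -> [f32; 3] {
    let z = 1.0 - e[0].abs() - e[1].abs();
    let (mut x, mut y) = (e[0], e[1]);
    if z < 0.0 {
        // Undo the fold for the lower hemisphere.
        let (ox, oy) = (x, y);
        x = (1.0 - oy.abs()) * sign_not_zero(ox);
        y = (1.0 - ox.abs()) * sign_not_zero(oy);
    }
    let len = (x * x + y * y + z * z).sqrt();
    [x / len, y / len, z / len]
}

/// Quantize to a signed 16-bit normalized value (assumed storage format).
fn to_snorm16(v: f32) -> i16 {
    (v.clamp(-1.0, 1.0) * 32767.0).round() as i16
}

/// Expand a signed 16-bit normalized value back to [-1, 1].
fn from_snorm16(q: i16) -> f32 {
    (q as f32 / 32767.0).max(-1.0)
}
```

If the target were UNORM rather than SNORM or float, the encoded pair would additionally need a `* 0.5 + 0.5` remap on write and the inverse on read.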

05 The two passes

The geometry pass binds N color attachments plus depth (MRT) and the fragment shader writes one output per target. Under dynamic rendering, the pipeline (VkPipelineRenderingCreateInfo) lists the formats and the rendering info (VkRenderingInfo) lists the attachments; the count and order must match the shader's out locations^[11].

MRT geometry pass (Rust · ash) + the G-buffer fragment outputs (GLSL)

// pipeline: list the G-buffer color formats (<=4) + depth (extends Going 3D)
let color_formats = [albedo_format, normal_format, material_format];  // order must match the shader
let mut rendering = vk::PipelineRenderingCreateInfo::default()
    .color_attachment_formats(&color_formats)
    .depth_attachment_format(depth_format);
// record: one RenderingAttachmentInfo per target (CLEAR/STORE), plus depth
let color_attachments = [gbuf(albedo_view), gbuf(normal_view), gbuf(material_view)];
let info = vk::RenderingInfo::default()
    .color_attachments(&color_attachments)        // MRT
    .depth_attachment(&depth_attachment);

// geometry-pass fragment shader: one output per G-buffer target
layout(location = 0) out vec4 gAlbedo;    // rgb base color (sRGB target)
layout(location = 1) out vec2 gNormal;    // octahedral-packed world normal
layout(location = 2) out vec4 gMaterial;  // r=metallic g=roughness b=occlusion (linear)
void main() {
    gAlbedo   = vec4(baseColor, 1.0);
    gNormal   = octEncode(normalize(worldNormal));
    gMaterial = vec4(metallic, perceptualRoughness, occlusion, 1.0);
}

The lighting pass draws one full-screen primitive, samples the G-buffer, reconstructs position, decodes the normal, and runs the PBR BRDF once per light. Prefer a single oversized triangle over a two-triangle quad: the quad's shared diagonal makes the GPU run the fragment shader twice on the seam's 2×2 quads, and a single triangle has better cache behavior; generate it attributelessly from gl_VertexIndex^[13].
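The attributeless index trick is easy to verify on the CPU. The sketch below reproduces the bit arithmetic in Rust and checks, with edge functions, that the resulting triangle covers the whole [−1, 1]² clip-space square. Function names are hypothetical, and the winding/Y orientation is treated as plain math coordinates rather than any particular Vulkan viewport setup.

```rust
/// Same arithmetic as the vertex shader: index 0, 1, 2 -> (uv, clip position).
fn fullscreen_triangle_vertex(vertex_index: u32) -> ([f32; 2], [f32; 4]) {
    let uv = [
        ((vertex_index << 1) & 2) as f32,
        (vertex_index & 2) as f32,
    ];
    let pos = [uv[0] * 2.0 - 1.0, uv[1] * 2.0 - 1.0, 0.0, 1.0];
    (uv, pos)
}

/// Signed area test: >= 0 when p is on the inner side of edge a->b.
fn edge(a: [f32; 2], b: [f32; 2], p: [f32; 2]) -> f32 {
    (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
}

/// True if the clip-space point lies inside (or on the border of) the triangle.
fn triangle_covers(p: [f32; 2]) -> bool {
    let mut v = [[0.0f32; 2]; 3];
    for i in 0..3 {
        let (_, pos) = fullscreen_triangle_vertex(i as u32);
        v[i] = [pos[0], pos[1]];
    }
    edge(v[0], v[1], p) >= 0.0 && edge(v[1], v[2], p) >= 0.0 && edge(v[2], v[0], p) >= 0.0
}
```

The three vertices land at (−1, −1), (3, −1), and (−1, 3); the parts outside the viewport are clipped, and the UVs interpolate to exactly 0..1 across the visible square.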

The lighting pass: full-screen, read G-buffer, loop lights (GLSL)

// full-screen triangle: 3 verts, no vertex buffer
// vec2 uv = vec2((gl_VertexIndex << 1) & 2, gl_VertexIndex & 2);
// gl_Position = vec4(uv * 2.0 - 1.0, 0.0, 1.0);
void main() {
    vec3  albedo   = texture(gAlbedo, uv).rgb;
    vec3  N        = octDecode(texture(gNormal, uv).rg);
    vec4  mat      = texture(gMaterial, uv);
    float metallic = mat.r, perceptualRoughness = mat.g;
    vec3  worldPos = positionFromDepth(uv);          // reconstruct (no position target)
    vec3  V        = normalize(cameraPos - worldPos);

    vec3 Lo = vec3(0.0);
    for (int i = 0; i < lightCount; ++i) {            // naive: all lights (culled in §7)
        vec3  L = normalize(lights[i].pos - worldPos);
        vec3  radiance = lights[i].color * attenuation(lights[i], worldPos);
        float NdotL = max(dot(N, L), 0.0);
        Lo += cookTorrance(N, V, L, albedo, metallic, perceptualRoughness)
              * radiance * NdotL;                       // cosine OUTSIDE the BRDF (from PBR)
    }
    outColor = vec4(Lo + emissive + ambient, 1.0);  // HDR target -> tonemap later
}

The barrier between the passes (and the mobile shortcut)

Between the geometry and lighting passes, transition each G-buffer color image from COLOR_ATTACHMENT_OPTIMAL to SHADER_READ_ONLY_OPTIMAL, and the depth image from its depth-attachment layout to a read-only layout such as DEPTH_STENCIL_READ_ONLY_OPTIMAL. This is a write-then-read barrier, the same hazard the Textures upload covered. On tile-based mobile GPUs, this DRAM round-trip is the expensive path: Vulkan subpasses (or dynamic-rendering local reads) let the lighting subpass read the G-buffer straight from on-chip tile memory, which Arm measures at roughly 45% fewer reads and 56% fewer writes^[12]. The separate-pass-plus-barrier version isn't universally optimal.

06 Costs & limits

Deferred's four well-known costs^[2]:

Bandwidth: the G-buffer is written by every geometry-pass fragment and read by every lighting-pass pixel. On bandwidth-limited GPUs this can dominate. It's a trade, not a flaw.
MSAA is hard: you can't trivially antialias the G-buffer; correct MSAA needs per-sample G-buffers (multiplied storage) or edge detection with per-sample shading only at edges. This drove many engines to post-AA (FXAA/SMAA/TAA) or to Forward+.
Transparency doesn't work: a G-buffer holds one surface per pixel, so translucent surfaces (which blend several layers) must be drawn in a separate forward pass after the deferred composite. This is fundamental, not an implementation gap.
Material variety is constrained: every pixel is shaded by the one model the G-buffer layout encodes; exotic BRDFs need extra channels or material-ID branching.
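The bandwidth-for-shading trade in the first cost can be made concrete with a toy cost model. This is not a measurement: the struct, its fields, and every constant are hypothetical knobs in arbitrary units, chosen only to show why deferred loses at low light counts and wins once overdraw times lights outgrows the fixed G-buffer cost.

```rust
/// Toy per-frame cost model in arbitrary units (all values are assumptions).
#[derive(Clone, Copy)]
struct CostModel {
    pixels: f64,        // visible screen pixels
    overdraw: f64,      // average fragments rasterized per pixel (>= 1)
    light_cost: f64,    // cost of evaluating one light for one fragment
    gbuffer_write: f64, // cost of writing the G-buffer for one fragment
    gbuffer_read: f64,  // cost of reading the G-buffer back for one pixel
}

impl CostModel {
    /// Naive forward: every rasterized fragment evaluates every light.
    fn naive_forward(&self, lights: f64) -> f64 {
        self.pixels * self.overdraw * lights * self.light_cost
    }

    /// Deferred: pay G-buffer write per fragment, then read + light once per pixel.
    fn deferred(&self, lights: f64) -> f64 {
        let geometry = self.pixels * self.overdraw * self.gbuffer_write;
        let lighting = self.pixels * (self.gbuffer_read + lights * self.light_cost);
        geometry + lighting
    }

    /// Light count where both costs are equal; None if deferred never catches up.
    fn crossover_lights(&self) -> Option<f64> {
        if self.overdraw <= 1.0 {
            return None; // no overdraw to save, so the G-buffer is pure overhead here
        }
        let fixed = self.overdraw * self.gbuffer_write + self.gbuffer_read;
        Some(fixed / ((self.overdraw - 1.0) * self.light_cost))
    }
}
```

With overdraw 2, write and read costs of 3, and a per-light cost of 1, the model crosses over at 9 lights; with overdraw 1 it never does, which is the "few lights, little overdraw" case where forward wins.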

[Interactive demo: drag the light and object counts. Deferred pays a fixed G-buffer cost up front and then grows only with light count, independent of overdraw and object count; naive forward climbs steeply. Watch for the crossover, and for deferred losing at low light counts.]

07 Tiled & clustered

The naive lighting pass still loops all lights for every pixel, so it scales as (pixels × lights). To go further you cull lights spatially. Tiled shading divides the screen into 2D tiles and, in a compute pass, builds each tile's list of overlapping lights (bounded by the tile's min/max depth). Clustered shading adds a third, depth dimension: 3D clusters (screen tiles × exponential depth slices)^[6].
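Tile binning can be sketched on the CPU as a loop over tiles and lights. The version below is a simplified illustration, not a compute-shader implementation: it assumes each light has already been reduced to a screen-space circle plus a view-depth interval, and the `Light` struct and `build_tile_lists` function are hypothetical names.

```rust
/// A light pre-projected to screen space (assumed representation for this sketch).
struct Light {
    cx: f32,     // circle center in pixels
    cy: f32,
    radius: f32, // circle radius in pixels
    z_min: f32,  // nearest view depth the light reaches
    z_max: f32,  // farthest view depth the light reaches
}

/// Circle vs axis-aligned rectangle overlap via the closest point on the rectangle.
fn circle_overlaps_rect(cx: f32, cy: f32, r: f32, x0: f32, y0: f32, x1: f32, y1: f32) -> bool {
    let nx = cx.clamp(x0, x1);
    let ny = cy.clamp(y0, y1);
    let (dx, dy) = (cx - nx, cy - ny);
    dx * dx + dy * dy <= r * r
}

/// Returns, for each tile (row-major), the indices of lights that overlap it.
/// `tile_depth` holds each tile's (min, max) view depth from the depth buffer.
fn build_tile_lists(
    width: u32,
    height: u32,
    tile_size: u32,
    tile_depth: &[(f32, f32)],
    lights: &[Light],
) -> Vec<Vec<usize>> {
    let tiles_x = (width + tile_size - 1) / tile_size;
    let tiles_y = (height + tile_size - 1) / tile_size;
    assert_eq!(tile_depth.len(), (tiles_x * tiles_y) as usize);

    let mut lists: Vec<Vec<usize>> = vec![Vec::new(); (tiles_x * tiles_y) as usize];
    for ty in 0..tiles_y {
        for tx in 0..tiles_x {
            let idx = (ty * tiles_x + tx) as usize;
            let (x0, y0) = ((tx * tile_size) as f32, (ty * tile_size) as f32);
            let (x1, y1) = (x0 + tile_size as f32, y0 + tile_size as f32);
            let (d_min, d_max) = tile_depth[idx];
            for (i, l) in lights.iter().enumerate() {
                // Reject lights outside the tile's depth bounds, then test the 2D footprint.
                let depth_overlap = l.z_min <= d_max && l.z_max >= d_min;
                if depth_overlap && circle_overlaps_rect(l.cx, l.cy, l.radius, x0, y0, x1, y1) {
                    lists[idx].push(i);
                }
            }
        }
    }
    lists
}
```

The shading pass then loops only `lists[tile_of(pixel)]` instead of every light. A GPU version would typically run one workgroup per tile and test against the tile frustum rather than a 2D circle.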

Clustered beats tiled on depth discontinuities, and these are orthogonal to deferred

A flat 2D tile spanning a near rail and a far wall gets every light touching either depth assigned to every pixel in the tile. Clusters separate those depths, so a pixel only sees lights in its own cluster; that robustness under high-frequency depth is clustered's whole point^[6]. The depth slices are exponential (Z_slice = near · (far/near)^(slice/N), where N is the slice count) to counter NDC's nonlinearity^[7]. And tiled/clustered are orthogonal to deferred vs forward: you can do tiled deferred (Battlefield 3^[10]), clustered deferred, or clustered forward (DOOM 2016^[8]). Don't equate "clustered" with "deferred", and don't quote a universal "max lights" (DOOM allows 256 per cluster across 3072 clusters, a per-title config).
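The exponential slicing and its inverse (view depth to slice index) fit in two small functions. This is a sketch with hypothetical function names; it assumes positive view-space depth measured from the camera and clamps anything outside [near, far] to the first or last slice.

```rust
/// Near boundary of slice `slice`: near * (far / near)^(slice / slice_count).
fn slice_near(slice: u32, slice_count: u32, near: f64, far: f64) -> f64 {
    near * (far / near).powf(slice as f64 / slice_count as f64)
}

/// Inverse mapping: which slice contains a given positive view depth.
fn slice_index(view_depth: f64, slice_count: u32, near: f64, far: f64) -> u32 {
    // Solve view_depth = near * (far/near)^t for t, then scale to the slice count.
    let t = (view_depth / near).ln() / (far / near).ln();
    let idx = (t * slice_count as f64).floor();
    idx.clamp(0.0, (slice_count - 1) as f64) as u32
}
```

With near = 1, far = 1000, and 3 slices, the boundaries fall at roughly 1, 10, 100, and 1000. A rail at depth 2 and a wall at depth 500 land in slices 0 and 2, so lights near one are never assigned to pixels of the other.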

[Interactive demo: lights drift over a tile grid; each tile shows only its overlapping lights. Toggle culling and the depth split.]

08 Forward+ & the verdict

Forward+ (tiled forward) is the other branch: a depth pre-pass, a compute light-culling pass (the same tile culling as §7), then a normal forward shading pass where each fragment loops only its tile's lights^[9]. It keeps forward's strengths (hardware MSAA, transparency, per-material shaders) while getting deferred-like light scaling, because the light count is decoupled via the tile list rather than via a G-buffer.

No architecture is universally best

Few lights → forward/Forward+ wins (no G-buffer bandwidth, hardware MSAA, transparency just works). Many dynamic lights → deferred or clustered wins (shading amortized to once per visible pixel, light count culled). The industry moved toward clustered forward / Forward+ / hybrids because they recover deferred's light scaling while keeping MSAA, transparency, and material flexibility (DOOM 2016: clustered forward; Battlefield 3: tiled deferred + forward transparency), but plenty of engines remain deferred or hybrid^[8]^[10]. The one safe universal: transparency is always a separate forward pass, whatever the opaque architecture.

Two misconceptions worth correcting: transparency fails because of the one-surface-per-pixel limit (fixed by a forward pass, not an alpha channel); and tile over-assignment is a depth-discontinuity problem fixed by clustered depth slices, independent of deferred vs forward.

09 Pitfalls

"Deferred is just faster": It's a bandwidth-for-shading trade; at few lights, forward wins.

Storing a position target: Redundant. Reconstruct from depth (Vulkan z is already 0..1, no *2−1).

Assuming 8 color attachments: Vulkan 1.3 guarantees only 4. Fit the G-buffer in ≤4 + depth, or query.

Data channels as sRGB: Normal/metallic/roughness are linear; only base color and emissive are sRGB.

Transparency in the G-buffer: One surface per pixel. Translucents go in a separate forward pass.

Forgetting MSAA cost: Deferred MSAA needs per-sample G-buffers; many engines use post-AA instead.

"Clustered = deferred": Culling is orthogonal; DOOM 2016 is clustered forward.

Linear depth slices: Use exponential slices to counter NDC nonlinearity.

10 What's next

The renderer can light many surfaces efficiently, but the ambient term is still a flat constant. The next module, Ambient Occlusion & Global Illumination, darkens that ambient in crevices (SSAO, reusing this G-buffer) and surveys how engines fake indirect bounce light, then post-processing and the 3D capstone. The full 3D path is on the series hub.

[1] Joey de Vries. LearnOpenGL, "Deferred Shading." learnopengl.com. The two-pass structure, the MRT G-buffer, the lighting loop, light volumes, and the disadvantages.
[2] "Deferred shading." Wikipedia. en.wikipedia.org. History, the deferred-shading vs light-pre-pass distinction, and the MSAA/transparency/material limits.
[3] Eric Haines. "Deferred lighting approaches" (Real-Time Rendering). realtimerendering.com. Deferred shading vs light pre-pass and G-buffer channel packing (RTR4 ch. 20 is the book anchor).
[4] Matt Pettineo. "Reconstructing Position From Depth." therealmjp.github.io. The inverse-projection unproject and perspective divide (mind the Vulkan 0..1 depth convention).
[5] Zina Cigolle et al. "A Survey of Efficient Representations for Independent Unit Vectors." JCGT 2014. jcgt.org. Octahedral as the recommended 2-channel normal packing, with accuracy comparisons.
[6] Ola Olsson, Markus Billeter, Ulf Assarsson. "Clustered Deferred and Forward Shading." HPG 2012. cse.chalmers.se. Clustered shading: depth partitioning, robustness over tiled, and that it serves both forward and deferred.
[7] Angel Ortiz. "A Primer on Efficient Rendering Algorithms & Clustered Shading." aortiz.me. The exponential depth-slice formula and the two-stage compute culling pipeline.
[8] Adrian Courrèges. "DOOM (2016) Graphics Study." adriancourreges.com. Clustered forward (16×8×24 clusters, log Z, 256 lights/cluster), R16G16 octahedral normals, and the forward transparency pass.
[9] Takahiro Harada, Jay McKee, Jason Yang. "Forward+: Bringing Deferred Lighting to the Next Level." 2012. takahiroharada.github.io. Tiled forward: light culling while keeping MSAA, transparency, and material flexibility.
[10] Johan Andersson. "DirectX 11 Rendering in Battlefield 3." GDC 2011. slideshare.net. Compute-based tiled deferred plus a forward transparency pass (the hybrid).
[11] The Khronos Group. Vulkan Required Limits, and VkPipelineRenderingCreateInfo. docs.vulkan.org. maxColorAttachments guaranteed minimum is 4 in Vulkan 1.3 (8 only at Roadmap 2024 / 1.4); the MRT format wiring.
[12] Arm. "Deferred shading on mobile" and "Vulkan subpasses." developer.arm.com. Bandwidth as the dominant cost; keeping the G-buffer in tile memory via subpasses (~45%/56% reduction).
[13] Chris Wallis. "Optimizing Triangles for a Full-screen Pass." wallisc.github.io. The full-screen triangle vs quad (the 2×2-quad seam double-shading).