Build a Game Engine · Rendering

The GPU & the Graphics Pipeline

Before writing a thousand lines of Vulkan, build the mental model the API is configuring: how a triangle in clip space becomes lit pixels in a framebuffer. We cover why a GPU is built for throughput, the stages a triangle flows through, how the rasterizer decides which pixels are inside, and the one piece of math () that everyone gets wrong first. No graphics API yet, just a software rasterizer in C++ and Rust.

Time~55 min LevelMid PrereqsThe 3D Math tutorial (we reuse the model→view→projection→clip→NDC flow). You can read C++ or Rust. StackC++ & Rust

01Why a GPU is shaped this way

A CPU spends transistors fighting latency: deep caches, branch prediction, out-of-order execution, all to finish one thread's next instruction as soon as possible. A GPU spends them on throughput: thousands of ALUs running the same program over huge batches of data, hiding memory latency by keeping many groups of work in flight and switching between them^[1]. The pipeline is an assembly line: stages run in sequence on any one triangle, but all stages are busy on different work at once.

The parallelism has a specific shape: . NVIDIA runs threads in groups of 32 called warps; "a warp executes one common instruction at a time," and when threads in a warp take different branches the warp "executes each branch path taken, disabling threads that are not on that path"^[2]. AMD's equivalent, a wavefront, is 32 or 64 threads depending on the architecture (RDNA can do either; older GCN was 64)^[3].

SIMT is not SIMD

They share the "one instruction, many data elements" idea, but the difference matters: SIMD exposes the vector width to your code (you write an 8-wide add), while SIMT presents a scalar-looking per-thread program and the hardware handles divergence by masking lanes^[2]. And the warp/wavefront size is vendor- and architecture-specific (32 on NVIDIA; 32 or 64 on AMD), so don't hard-code "32." Divergence costs by serializing the taken paths, not a fixed multiplier.

02The logical pipeline

A triangle flows through a fixed sequence of stages. Some are programmable (you write a shader), some are fixed-function (you configure state). In order^[4]:

vertex shader→ clip / cull→ rasterize→ fragment shader→ depth / stencil→ blend→ framebuffer

The vertex shader runs once per vertex and outputs a clip-space position. After clipping, the GPU does the perspective divide (xyz / w → NDC) and the viewport transform (NDC → framebuffer pixels), both fixed-function, both built in the 3D Math tutorial, so we won't re-derive them. Vulkan's clip volume is 0 ≤ z_c ≤ w_c (depth range [0,1]), and a negative viewport height performs the Y-flip^[5].

The logical order is an as-if contract

This is the order the API pretends things happen in. Real hardware reorders for speed (early-Z, §7), but always in a way that's "100% functionally consistent" with this model^[10]. Also: a "fragment" is a candidate pixel that depth, stencil, and blend may still kill before it's written; OpenGL and Vulkan say fragment, Direct3D says pixel shader, same thing. Only the vertex and fragment shaders are mandatory; tessellation and geometry stages are optional.

03Rasterization

Rasterization decides which pixels a triangle covers. The trick, from Pineda's 1988 paper, is the : each edge defines a linear function E(x,y) that is zero on the edge, positive on one side, negative on the other^[6]. A sample point is inside the triangle when all three edge functions agree in sign^[7].

Normalize the three edge values by their sum (the doubled triangle area) and you get the barycentric coordinates, which interpolate any per-vertex attribute. Edge functions are computed incrementally with adds, which is exactly why they map to cheap hardware^[6].

The top-left rule is about determinism, not anti-aliasing

When a sample lands exactly on an edge shared by two triangles, a tie-break decides which triangle owns it, so the edge is drawn exactly once: no gaps, no double-shading. Giesen: this "boils down to subtracting 1 from the constant term on some edges," which makes integer rasterizers "guaranteed watertight"^[7]. And "inside" means the sample point (the pixel center by default) passes the test, not "the triangle overlaps the pixel."

Drag the vertices. Covered pixels fill live via the edge test, colored by barycentric interpolation:

04The 2×2 quad

Hardware doesn't shade pixels one at a time; it shades them in 2×2 quads. The reason is texture filtering: choosing a mip level needs the screen-space derivatives of the texture coordinates (ddx/ddy), and the cheapest way to estimate those is to subtract a pixel's value from its neighbor's within the quad^[8].

Because all four lanes are needed for that finite difference, even pixels outside the triangle in an edge quad get shaded; their output is discarded. Giesen calls them helper pixels, and notes that "between 25 and 75 percent of the shading work for quads generated for triangle edges is wasted"^[8]. This is why sub-pixel-thin triangles are expensive, and a real reason engines manage triangle density.

05Perspective-correct interpolation

Perspective projection keeps straight lines straight but does not preserve distances, so a value that's linear across a surface in 3D is not linear across its projected image in screen space. Interpolating an attribute with screen-space barycentrics directly is wrong for anything that isn't already screen-linear^[9].

The fix: interpolate attribute / w and 1 / w linearly (those are screen-linear), then divide at each pixel: attr = (interp of attr/w) / (interp of 1/w). This is the load-bearing idea, and it's the centerpiece of the software rasterizer below:

A software triangle rasterizer (perspective-correct)

struct Vertex {            // already in screen space (post viewport transform)
    float screenX, screenY;   // pixel coordinates
    float oneOverW;           // 1 / clip-space w, precomputed per vertex
    float uOverW, vOverW;     // the attribute (u,v) already divided by w
};

// Edge function: signed doubled area of (a, b, p). Pineda's half-plane test.
static float edge(const Vertex& a, const Vertex& b, float px, float py) {
    return (px - a.screenX) * (b.screenY - a.screenY)
         - (py - a.screenY) * (b.screenX - a.screenX);
}

void rasterize(Vertex v0, Vertex v1, Vertex v2, uint32_t* fb, int W, int H) {
    float area = edge(v0, v1, v2.screenX, v2.screenY);   // doubled area; sign = winding
    if (area == 0) return;                              // degenerate
    float invArea = 1.0f / area;

    for (int y = 0; y < H; ++y)
    for (int x = 0; x < W; ++x) {
        float sx = x + 0.5f, sy = y + 0.5f;     // sample the pixel center
        float w0 = edge(v1, v2, sx, sy);             // the three edge functions
        float w1 = edge(v2, v0, sx, sy);
        float w2 = edge(v0, v1, sx, sy);
        if (w0 < 0 || w1 < 0 || w2 < 0) continue;  // inside iff all agree

        float l0 = w0 * invArea, l1 = w1 * invArea, l2 = w2 * invArea;  // barycentrics

        // Interpolate 1/w and attr/w linearly (these ARE screen-linear)...
        float invW = l0*v0.oneOverW + l1*v1.oneOverW + l2*v2.oneOverW;
        float uW   = l0*v0.uOverW   + l1*v1.uOverW   + l2*v2.uOverW;
        float vW   = l0*v0.vOverW   + l1*v1.vOverW   + l2*v2.vOverW;
        // ...then divide at the pixel to recover the true attribute.
        float u = uW / invW, v = vW / invW;
        fb[y * W + x] = shade(u, v);
    }
}

#[derive(Clone, Copy)]
struct Vertex {            // already in screen space (post viewport transform)
    screen_x: f32, screen_y: f32,   // pixel coordinates
    one_over_w: f32,               // 1 / clip-space w, precomputed per vertex
    u_over_w: f32, v_over_w: f32,   // attribute (u,v) already divided by w
}

// Edge function: signed doubled area of (a, b, p). Pineda's half-plane test.
fn edge(a: &Vertex, b: &Vertex, px: f32, py: f32) -> f32 {
    (px - a.screen_x) * (b.screen_y - a.screen_y)
        - (py - a.screen_y) * (b.screen_x - a.screen_x)
}

fn rasterize(v0: &Vertex, v1: &Vertex, v2: &Vertex, fb: &mut [u32], w: usize, h: usize) {
    let area = edge(v0, v1, v2.screen_x, v2.screen_y);   // doubled area; sign = winding
    if area == 0.0 { return; }                            // degenerate
    let inv_area = 1.0 / area;

    for y in 0..h {
        for x in 0..w {
            let (sx, sy) = (x as f32 + 0.5, y as f32 + 0.5);   // pixel center
            let w0 = edge(v1, v2, sx, sy);
            let w1 = edge(v2, v0, sx, sy);
            let w2 = edge(v0, v1, sx, sy);
            if w0 < 0.0 || w1 < 0.0 || w2 < 0.0 { continue; }   // inside iff all agree

            let (l0, l1, l2) = (w0 * inv_area, w1 * inv_area, w2 * inv_area);   // barycentrics
            // Interpolate 1/w and attr/w linearly, then divide at the pixel.
            let inv_w = l0*v0.one_over_w + l1*v1.one_over_w + l2*v2.one_over_w;
            let u_w   = l0*v0.u_over_w   + l1*v1.u_over_w   + l2*v2.u_over_w;
            let v_w   = l0*v0.v_over_w   + l1*v1.v_over_w   + l2*v2.v_over_w;
            let (u, v) = (u_w / inv_w, v_w / inv_w);
            fb[y * w + x] = shade(u, v);
        }
    }
}

What's intentionally missing

No depth buffer (every covered pixel is written; add an interpolated depth and a z-buffer for occlusion, §6). No 2×2-quad granularity or derivatives, so no real mip selection (§4). No near-plane clipping (vertices with w ≤ 0 produce garbage; the real pipeline clips first, §2). No sub-pixel fixed-point or the proper integer top-left tie-break. No MSAA, single-threaded and scalar (no SIMT, no tiling). The viewport transform that produced screenX/Y and oneOverW, and shade(), are assumed done upstream.

One subtlety the accuracy bar demands: we interpolate over w, and depth (z in NDC) is the special case that is screen-linear after projection, so the depth buffer interpolates linearly with no divide. It's the attributes that need the perspective-correct divide.

06The depth buffer

A per-pixel stores the closest depth seen so far; each fragment compares its depth to the stored value and is kept or discarded. This resolves visibility independent of draw order, which is why you don't have to sort opaque geometry back-to-front.

Reversed-Z, and what it's not

A plain [0,1] depth buffer wastes precision: the 1/z mapping bunches values near the camera, so far geometry z-fights. The fix is reversed-Z (map near→1, far→0) with a float depth buffer, whose distribution nearly cancels the 1/z nonlinearity; Reed's advice is blunt: "in any perspective projection situation, just use a floating-point depth buffer with reversed-Z"^[11]. Reversed-Z is orthogonal to Vulkan's [0,1] range (which you already have): it's the additional near↔far swap, plus flipping the depth compare to GREATER and clearing to 0. And z-fighting has several causes (equal depths, too little precision, coplanar decals), so it's not "a bug in the depth test."

Two overlapping triangles. Toggle the z-test, and slide their depths together to make them fight:

07Early-Z

The API says depth testing happens after the fragment shader, but shading a pixel only to discard it for being behind a wall is wasteful. So hardware runs the depth test early, before the shader, whenever it can prove that's equivalent^[10].

What turns early-Z off

Early-Z is an optimization, not a stage you can rely on. It's forced back to late-Z when the fragment shader writes depth (gl_FragDepth), uses discard or alpha-test, or uses alpha-to-coverage, because the shader might change whether or what depth gets written^[10]. (A texture read or an ordinary branch does not disable it.) There's an escape hatch: conservative depth output (layout(depth_greater)) lets a shader write depth and keep early rejection by promising the change is monotonic.

08Immediate-mode vs tile-based

There are two broad GPU architectures, and the split decides how you structure a frame on mobile.

Immediate-mode (IMR), desktop NVIDIA and AMD: primitives stream straight through the pipeline to a framebuffer in main memory, so every blend and depth test touches external memory.
Tile-based (TBR/TBDR), mobile Arm Mali, Imagination PowerVR, Apple, Adreno: the GPU bins geometry, then shades one screen tile at a time in fast on-chip memory, writing only the resolved color back. Mali uses "a distinct two-pass rendering algorithm" with 16×16 tiles, keeping the whole tile's color, depth, and stencil on-chip^[12]. TBDR adds hardware hidden-surface removal, deferring shading until visibility is resolved per tile^[13].

The engine consequence on mobile

On a tile-based GPU, reading the framebuffer mid-render-pass forces the tile to flush to main memory and throws away the bandwidth win, so you structure passes to avoid it (Vulkan subpasses with input attachments are the sanctioned on-tile read). Two scoping notes: TBR/TBDR is a mobile design, not how all GPUs work; and TBDR's "deferred" (defer shading until visibility) is a different thing from deferred shading (the G-buffer renderer, a later tutorial).

09The baked pipeline

Everything above, which shaders run, the rasterizer state, the depth compare op, the blend equation, the viewport, is configuration. The explicit APIs freeze it into one object. Vulkan: "each pipeline is controlled by a monolithic object created from a description of all of the shader stages and any relevant fixed-function stages"^[4], which it can then optimize as a whole and bind in one call.

That's the mental model for the next tutorial: you don't set pipeline state one stage at a time, you describe the whole pipeline once, compile it, and bind it. (Modern Vulkan adds dynamic state and shader objects that relax the bake-everything rule, and OpenGL's global state machine is the counterexample, so this is the explicit-API model, not a universal law.) The widget walks one vertex through the stages so the data at each step is concrete:

Wrong answers, and why: the seam warp is an interpolation error (not z-fighting or winding); and SIMT vs SIMD is about software-visible width and divergence handling, not a fixed lane count or an inability to branch.

10Pitfalls

Texture warps along a seamLinear screen-space interpolation; interpolate attr/w and 1/w, divide per pixel.

Hard-coding warp = 32Vendor-specific: 32 on NVIDIA, 32 or 64 on AMD. Query it; don't assume.

"Depth needs perspective correction"NDC depth is screen-linear; it's attributes that need the divide.

discard kills performancediscard / alpha-test / depth-write disable early-Z. Use opaque paths where you can.

Thin triangles are slow2×2-quad helper lanes waste 25–75% of edge shading; manage triangle density.

Far-plane z-fighting1/z bunches precision near the camera; use reversed-Z with a float depth buffer.

Reading the framebuffer on mobileForces a tile flush on TBDR; use subpasses/input attachments instead.

11What's next

With the model in hand, the next tutorial writes it for real: Your First Triangle in Vulkan, where this baked pipeline becomes a VkPipeline, the rasterizer and depth test become pipeline state, and the surface from the Platform & Window tutorial gets a swapchain. Then textures, then a real 2D renderer. The full path is on the series hub.

Tomas Akenine-Möller, Eric Haines, Naty Hoffman, et al. Real-Time Rendering, 4th ed., ch. 2–3. realtimerendering.com. The conceptual four-stage pipeline and the throughput-oriented GPU.
NVIDIA. CUDA C++ Programming Guide, "SIMT Architecture." docs.nvidia.com. Warp = 32 threads, branch divergence serializes paths, and the SIMT-vs-SIMD distinction (software-visible width).
AMD. ROCm/HIP hardware docs and the RDNA Architecture whitepaper. gpuopen.com. Wavefront size is 32 or 64 depending on architecture (RDNA vs GCN/CDNA).
The Khronos Group. Vulkan Specification, "Pipelines." docs.vulkan.org. The stage order and the "monolithic" pipeline object baking shader and fixed-function state.
The Khronos Group. Vulkan Specification, vertex post-processing and VkViewport. docs.vulkan.org. The clip volume 0 ≤ z ≤ w, the perspective divide, and the negative-height Y-flip.
Juan Pineda. "A Parallel Algorithm for Polygon Rasterization." SIGGRAPH 1988. dl.acm.org. The linear edge function / half-plane test, computed incrementally.
Fabian Giesen. "A trip through the Graphics Pipeline 2011, part 6." fgiesen.wordpress.com. Edge-test coverage, the watertight top-left rule, and 2×2 quads.
Fabian Giesen. "A trip through the Graphics Pipeline 2011, part 7." fgiesen.wordpress.com. Early-Z as an as-if-consistent optimization and the features that disable it.
Fabian Giesen. "A trip through the Graphics Pipeline 2011, part 8." fgiesen.wordpress.com. Helper pixels, derivatives via finite differencing, and 25–75% edge-quad waste.
Scratchapixel. "Perspective-Correct Interpolation and Vertex Attributes." scratchapixel.com. Why screen-linear interpolation is wrong and the interpolate-attr/w-then-divide fix.
Nathan Reed. "Depth Precision Visualized." reedbeta.com. The 1/z precision problem and reversed-Z with a float depth buffer.
Arm. "The Mali GPU: An Abstract Machine, Part 2 - Tile-Based Rendering." developer.arm.com. The two-pass tile-based architecture, 16×16 tiles, on-chip working set.
Imagination Technologies. "Tile-Based Deferred Rendering (TBDR)." docs.imgtec.com. TBDR defers shading until per-tile visibility is resolved; hidden-surface removal.