Build a Game Engine · Tooling

Engine Tooling: Dev UI & Profiling

You can't fix what you can't see. The last engine-systems module is the tools that turn "the game feels slow" into "the shadow pass took 6 ms this frame." A dev UI to inspect and tweak live, a profiler to measure where time goes, and a frame debugger to dissect the GPU. Two ideas carry it: immediate-mode UI for tools, and the rule that you cannot CPU-time the GPU.

Time: ~55 min
Level: Senior
Prereqs: The Game Loop (the monotonic clock, vsync), Job Systems (per-thread tracks), and the GPU Pipeline (why GPU work is async).
Stack: C++ & Rust · Vulkan

01 Seeing the engine

Tooling is what makes an engine iterable: it is the difference between guessing and knowing^[1]. Before any GUI, the three tools with the best effort-to-value ratio are also the cheapest:

Logging / tracing: timestamped, categorized, filterable, the first thing you reach for.
Asserts: an invariant written as executable code; a failed assert localizes a bug to the line that broke the assumption.
Debug draw: "printf for geometry", draw the navmesh, the AI paths, the collision shapes, the light bounds, right in the world.
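
As a minimal sketch of the first two items, here is a categorized, filterable log line and an assert that reports the expression, file, and line that broke. The names (LogCategory, g_enabledCategories, logLine, ENGINE_ASSERT) are hypothetical and not taken from any particular engine; the sketch assumes C++17 for inline variables.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Hypothetical categories; a real engine would have many more.
enum class LogCategory : std::uint32_t { Render = 1u << 0, Physics = 1u << 1, Audio = 1u << 2 };

// Bitmask of enabled categories: this is the "filterable" part.
inline std::uint32_t g_enabledCategories = ~0u;

inline bool logEnabled(LogCategory c) {
    return (g_enabledCategories & static_cast<std::uint32_t>(c)) != 0;
}

// Prints one line tagged with a timestamp and category; returns whether it was emitted.
inline bool logLine(LogCategory c, double timeSec, const char* msg) {
    if (!logEnabled(c)) return false;
    std::fprintf(stderr, "[%9.3f][cat %u] %s\n", timeSec, static_cast<unsigned>(c), msg);
    return true;
}

// An invariant as executable code: on failure, report the expression and location, then stop.
#define ENGINE_ASSERT(expr)                                                  \
    do {                                                                     \
        if (!(expr)) {                                                       \
            std::fprintf(stderr, "ASSERT FAILED: %s (%s:%d)\n",              \
                         #expr, __FILE__, __LINE__);                         \
            std::abort();                                                    \
        }                                                                    \
    } while (0)
```

Setting g_enabledCategories to a single category silences the rest, which is usually the first filter you want when one subsystem floods the log.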

The rest of this module is the next tier: an in-engine UI to tweak live state, a profiler to measure where time goes, and frame capture to dissect the GPU. Tooling is a spectrum, not just one library.

02 Immediate vs retained

The core UI paradigm split. Retained-mode keeps a persistent tree of widget objects you mutate through callbacks, and you must keep that tree in sync with your application's own state. Immediate-mode (IMGUI, a term coined by Casey Muratori; Dear ImGui is by Omar Cornut) rebuilds the whole UI every frame straight from your live data: no retained widget objects, just the if (Button()) doThing() model^[2].

It's about where UI state lives, not about being "faster"

Immediate mode is "a statement about the interface of the library, not the internals"^[3]. Dear ImGui rebuilds its geometry every frame, which costs CPU; its win is programmer iteration speed and the absence of state-sync bugs, not raw runtime performance. It doesn't "retain nothing" either (it caches layout, widget IDs, etc. internally). And IMGUI is the paradigm; Dear ImGui is one implementation. The dominant industry pattern is immediate-mode for debug/tools UI and retained/custom for the shipping player HUD (Dear ImGui targets "content creation tools and visualization / debug tools, as opposed to UI for the average end-user"^[4]), though some games do ship it.

Retained mode keeps two stores you must sync (forget to, and they drift); immediate mode rebuilds from one store, so that desync cannot happen.
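
The two-stores-versus-one point can be sketched without any UI library. This is an illustration only: Settings, RetainedSlider, and immediateSlider are hypothetical stand-ins, not Dear ImGui types, and the "widget" just returns the text it would draw.

```cpp
#include <string>

// Application state: the single source of truth.
struct Settings { float volume = 0.5f; };

// Retained style: the widget object owns a COPY of the value it displays.
struct RetainedSlider {
    float shown = 0.0f;                      // second store, kept in sync by hand
    void setValue(float v) { shown = v; }    // the app must remember to call this
};

// Immediate style: no widget object. The "widget" is a function called every
// frame that reads the live value directly, so there is nothing to fall out of sync.
inline std::string immediateSlider(const char* label, const float* value) {
    return std::string(label) + ": " + std::to_string(*value);
}
```

If the application changes Settings::volume and forgets to call setValue, the retained slider keeps showing the old number, while the immediate version picks up the new value on the next frame because it never stored one.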

03 Dear ImGui & panels

Dear ImGui "outputs optimized vertex buffers that you can render anytime in your 3D-pipeline-enabled application"^[4]: it produces draw data, and you render it with your own backend. The architecture is three layers: a platform backend (window/input: GLFW, SDL), the core (API-agnostic), and a renderer backend (Vulkan, D3D, Metal). ImGui's draw data is just another set of draw calls in your command buffer.

Debug panels are sliders bound to live variables, an in-game console, and entity inspectors (iterate an entity's components, draw a widget per field):

A debug panel in the frame loop

ImGui::NewFrame();                              // begin the UI frame
ImGui::Begin("Debug");                          // a panel
ImGui::SliderFloat("Sun angle", &sunAngle, 0.0f, 3.14159f);  // bound to the live variable
ImGui::Checkbox("Wireframe", &showWireframe);   // edits write through now
if (ImGui::Button("Reload shaders"))             // the if(Button()) action model
    reloadShaders();                          // fires this frame, no callback to register
ImGui::Text("Frame: %.2f ms", frameTimeMs);    // a watch value
ImGui::End();

ImGui::Render();                               // build the vertex/index buffers (ImDrawData)
ImGui_ImplVulkan_RenderDrawData(ImGui::GetDrawData(), cmd);  // you draw it, in your render pass

// egui: the idiomatic Rust immediate-mode GUI (imgui-rs is the Dear ImGui binding)
egui::Window::new("Debug").show(ctx, |ui| {        // a panel
    ui.add(egui::Slider::new(&mut sun_angle, 0.0..=PI).text("Sun angle"));  // bound live
    ui.checkbox(&mut show_wireframe, "Wireframe");
    if ui.button("Reload shaders").clicked() {       // same if(Button()) model
        reload_shaders();
    }
    ui.label(format!("Frame: {:.2} ms", frame_time_ms));
});  // egui returns shapes -> your backend tessellates and draws them

04 CPU profiling

Two ways to find where CPU time goes, and they're a real tradeoff^[5]:

Sampling: interrupt the program periodically and record the call stack. Low overhead, statistical, but can miss short rare events and gives no exact call counts (perf, VTune).
Instrumentation: explicit scope markers in code give exact per-scope timing, but you only see what you marked and the markers add overhead (Tracy, Optick).

Neither is universally better; modern tools blend both (Tracy is "a hybrid frame and sampling profiler")^[6]. The instrumentation primitive is a RAII scoped timer that records a named scope's begin/end into a per-thread buffer, drawn as a flame graph (nested bars, width = duration, stacking = call depth), one track per worker thread.

A scoped CPU timer (RAII, ZoneScoped-style)

struct ScopedZone {                                // RAII: times the enclosing scope
    const char* name;
    std::chrono::steady_clock::time_point start;   // MONOTONIC clock (cross-ref Game Loop)
    ScopedZone(const char* n) : name(n), start(std::chrono::steady_clock::now()) {}
    ~ScopedZone() {
        auto ms = std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - start).count();
        profilerRecord(name, ms);                  // append to a PER-THREAD buffer (no lock)
    }
};
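// Two-level concat so __LINE__ expands before pasting (a bare _zone##__LINE__ yields the literal name _zone__LINE__).
#define ZONE_CONCAT_(a, b) a##b
#define ZONE_CONCAT(a, b) ZONE_CONCAT_(a, b)
#define ZONE_SCOPED(n) ScopedZone ZONE_CONCAT(_zone, __LINE__)(n)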

void updatePhysics() { ZONE_SCOPED("updatePhysics"); /* work */ }  // records on scope exit

struct ScopedZone { name: &'static str, start: std::time::Instant }   // Drop = RAII end
impl ScopedZone {
    fn new(name: &'static str) -> Self { Self { name, start: std::time::Instant::now() } }  // monotonic
}
impl Drop for ScopedZone {
    fn drop(&mut self) { profiler_record(self.name, self.start.elapsed().as_secs_f64() * 1000.0); }
}
macro_rules! zone_scoped { ($n:expr) => { let _zone = ScopedZone::new($n); }; }

fn update_physics() { zone_scoped!("update_physics"); /* work */ }   // recorded on drop
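
The snippets above call profilerRecord without defining it. One possible shape for the per-thread buffer, as a sketch under assumptions: ZoneRecord, threadZones, g_frameZones, and profilerFlushThread are hypothetical names, and the hand-off at a frame boundary is just one way to merge the tracks (C++17 for inline variables).

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

struct ZoneRecord { const char* name; double ms; };

// One buffer per thread: appends on the hot path touch only thread-local memory.
inline std::vector<ZoneRecord>& threadZones() {
    thread_local std::vector<ZoneRecord> zones;
    return zones;
}

inline void profilerRecord(const char* name, double ms) {
    threadZones().push_back({name, ms});     // no lock: only this thread writes here
}

// Shared per-frame collection, filled when a thread hands its records over.
inline std::mutex g_frameMutex;
inline std::vector<ZoneRecord> g_frameZones;

// Called by each thread at a frame boundary (hypothetical hand-off point).
// The lock is taken once per thread per frame, not once per zone.
inline std::size_t profilerFlushThread() {
    std::vector<ZoneRecord>& local = threadZones();
    const std::size_t n = local.size();
    std::lock_guard<std::mutex> lock(g_frameMutex);
    g_frameZones.insert(g_frameZones.end(), local.begin(), local.end());
    local.clear();
    return n;
}
```

The design choice is that contention is paid once per thread per frame at the flush, so the cost of a zone on the hot path stays close to two clock reads and a vector append.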


05 GPU profiling

The single biggest trap: you cannot CPU-time GPU work. GPU work is asynchronous: recording and submitting a command buffer returns to the CPU immediately, and the GPU executes later (the GPU Pipeline async model). Wrapping vkCmdDraw in a CPU timer measures the CPU cost of recording the command, not its execution on the GPU.

Use timestamp queries, and read them back later

Ask the GPU to timestamp itself: vkCmdWriteTimestamp writes ticks into a query pool after prior commands reach a stage; read back with vkGetQueryPoolResults and convert via timestampPeriod (ns per tick)^[7]. The catch: don't read back in the same frame. VK_QUERY_RESULT_WAIT_BIT blocks the CPU on the GPU and makes your profiler the bottleneck. Ring the pool by frames-in-flight and read a result a few frames old^[8]. Also: use VK_QUERY_RESULT_64_BIT (a 32-bit counter at a 1 ns tick overflows in ~4.3 s), mask to timestampValidBits, and reset the pool before reuse. (Under Vulkan 1.3 / synchronization2, vkCmdWriteTimestamp2 takes precise stages and deprecates TOP/BOTTOM_OF_PIPE in that context.)

A GPU timestamp pair around a pass (async readback)

unsafe {
    device.cmd_reset_query_pool(cmd, pool, base, 2);   // MUST reset before writing
    device.cmd_write_timestamp(cmd, vk::PipelineStageFlags::TOP_OF_PIPE,    pool, base);
    // ...begin rendering, draw the pass (e.g. the deferred G-buffer)...
    device.cmd_write_timestamp(cmd, vk::PipelineStageFlags::BOTTOM_OF_PIPE, pool, base + 1);
}
// A FEW FRAMES LATER (not this frame): read without stalling.
let mut ticks = [0u64; 2];
unsafe { device.get_query_pool_results(pool, old_base, &mut ticks, vk::QueryResultFlags::TYPE_64)?; }
let mask = if valid_bits == 64 { u64::MAX } else { (1u64 << valid_bits) - 1 };
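// Subtract first, then mask: a plain `-` on masked values panics in debug builds if the counter wrapped.
let gpu_ms = (ticks[1].wrapping_sub(ticks[0]) & mask) as f64 * timestamp_period as f64 / 1.0e6;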

vkCmdResetQueryPool(cmd, pool, base, 2);              // MUST reset before writing
vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,    pool, base);
// ...draw the pass...
vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, pool, base + 1);

// A FEW FRAMES LATER: read without stalling.
uint64_t ticks[2];
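VkResult r = vkGetQueryPoolResults(device, pool, oldBase, 2, sizeof(ticks), ticks, sizeof(uint64_t),
                                   VK_QUERY_RESULT_64_BIT);   // 32-bit overflows in ~4.3 s at a 1 ns tick
if (r == VK_SUCCESS) {                                    // VK_NOT_READY: not available yet, skip this sample
    uint64_t mask = validBits == 64 ? ~0ull : ((1ull << validBits) - 1);
    // Subtract first, then mask: unsigned wraparound keeps the delta correct across one counter wrap.
    double gpuMs = double((ticks[1] - ticks[0]) & mask) * timestampPeriod / 1e6;
}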
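
The ring indexing and tick arithmetic described above can be sketched in plain C++ with no Vulkan calls. Everything here is an assumption for illustration: kFramesInFlight, kQueriesPerFrame, queryBase, ticksToMs, and wrapSeconds are hypothetical names, and the sketch assumes the frame that last used a slot has already been waited on (for example via its frame fence) before the slot is read and reset.

```cpp
#include <cmath>
#include <cstdint>

constexpr std::uint32_t kFramesInFlight  = 3;  // assumed value
constexpr std::uint32_t kQueriesPerFrame = 2;  // begin + end of one pass

// First query index of the ring slot used by a frame. At frame f the slot still
// holds frame (f - kFramesInFlight)'s results: read those, reset, then write.
constexpr std::uint32_t queryBase(std::uint64_t frame) {
    return static_cast<std::uint32_t>(frame % kFramesInFlight) * kQueriesPerFrame;
}

// Tick pair to milliseconds. Subtracting before masking keeps the result correct
// if the counter wrapped once between the two timestamps.
inline double ticksToMs(std::uint64_t begin, std::uint64_t end,
                        std::uint32_t validBits, float timestampPeriodNs) {
    const std::uint64_t mask = (validBits >= 64) ? ~0ull : ((1ull << validBits) - 1);
    const std::uint64_t delta = (end - begin) & mask;
    return static_cast<double>(delta) * static_cast<double>(timestampPeriodNs) / 1.0e6;
}

// Seconds until a counter of the given width wraps at the given tick period.
inline double wrapSeconds(int bits, double periodNs) {
    return std::ldexp(1.0, bits) * periodNs * 1.0e-9;
}
```

For instance, wrapSeconds(32, 1.0) is about 4.29, which is the "~4.3 s" figure quoted for 32-bit results at a 1 ns tick.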

Frame capture is a different tool

Frame debuggers such as RenderDoc (free, MIT), PIX (D3D12/Xbox), and Nsight (NVIDIA) capture a single frame and let you inspect every draw call, the pipeline state and bound objects at each draw, texture/buffer contents, and shaders, with no code changes^[9]. That's offline single-frame dissection ("what exactly did draw #4012 do"), distinct from continuous timestamp timing ("how long is the shadow pass, every frame"). Pipeline-statistics queries (vertex/fragment invocations) sit between them and help answer "why is this pass expensive" (for example, overdraw).

06 CPU- vs GPU-bound

Being CPU- or GPU-bound is about which timeline is the critical path. If the GPU's queue has visible gaps (idle, waiting on CPU submission), you're CPU-bound; if the CPU blocks at present/fence waiting for the GPU, you're GPU-bound^[10]. A quick test: halve the render resolution. If the frame rate jumps, you were GPU-bound on fragment/fill work (fewer pixels to shade); if it barely moves, you're CPU-bound or bound on GPU geometry/draw submission, which fewer pixels don't change.

Vsync masquerades as GPU-bound

With vsync on, both the CPU and GPU can finish early and then wait for the display refresh, so both timelines show idle time that is waiting for scanout, not for each other, and the app can look CPU- or GPU-bound when it's actually neither. Diagnose with vsync off. Don't read a frame-capped frame as a bottleneck.
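
A toy model makes the three cases concrete. It is a deliberate simplification, not a diagnostic tool: it assumes CPU and GPU work overlap fully, and it ignores that with vsync on a frame longer than one refresh snaps to the next refresh boundary. Bound, FrameVerdict, and classifyFrame are hypothetical names.

```cpp
#include <algorithm>

enum class Bound { Cpu, Gpu, Vsync };

struct FrameVerdict { double frameMs; Bound bound; };

// The frame takes as long as the slower timeline, unless vsync stretches a
// frame that finished early out to the refresh interval.
inline FrameVerdict classifyFrame(double cpuMs, double gpuMs, bool vsync, double refreshMs) {
    const double work = std::max(cpuMs, gpuMs);
    if (vsync && work < refreshMs) {
        return {refreshMs, Bound::Vsync};    // both timelines idle, waiting on the display
    }
    return {work, cpuMs >= gpuMs ? Bound::Cpu : Bound::Gpu};
}
```

With 4 ms of CPU work and 9 ms of GPU work, the model reports GPU-bound at 9 ms with vsync off, and vsync-capped at the refresh interval with vsync on at 60 Hz, which is why the capped reading says nothing about the real bottleneck.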


07 The mindset

Measure, don't guess. Intuition about hotspots is famously wrong, so profile first, then optimize the actual bottleneck. Amdahl's law caps the payoff: speeding up a function that's 2% of the frame can't help much, so find the part that dominates. Knuth, in full (the truncated version is routinely misused)^[11]:

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."

Profile a representative workload

Profile the busy combat scene, not the title screen: the bottleneck on an empty menu is not the bottleneck in gameplay. And don't blame a frame spike on one cause by reflex: pop-in, an allocation hitch, a fat physics frame, OS preemption, and shader-compile stutter are distinct causes; the profiler tells you which. Micro-optimizing a non-bottleneck is wasted work.
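
To put numbers on the Amdahl's-law point from this section, here is a small worked sketch. The helper names (amdahlSpeedup, frameAfterMs) and the 16 ms frame are assumptions chosen for illustration.

```cpp
// Amdahl's law: overall speedup when a fraction p of the work becomes s times faster.
constexpr double amdahlSpeedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}

// Frame time after the optimization, in milliseconds.
constexpr double frameAfterMs(double frameMs, double p, double s) {
    return frameMs * ((1.0 - p) + p / s);
}
```

Making a function that is 2% of a 16 ms frame ten times faster gives frameAfterMs(16.0, 0.02, 10.0), about 15.71 ms, and even an infinite speedup of that function cannot beat 1 / 0.98, roughly a 2% gain. Doubling the speed of a pass that is half the frame gives frameAfterMs(16.0, 0.5, 2.0) = 12 ms.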

08 Hot reload

Hot reload closes the iteration loop: a file-watcher detects an edit, the asset is re-cooked and re-read, and the live handle is swapped (the Asset Pipeline owns the cook/swap machinery). For a renderer, shader hot-reload (recompile to SPIR-V, rebuild the pipeline object, swap it in) is the highest-value reload: you tune a shader and see the result without restarting.

An iteration tool, with edge cases

Hot reload is a dev-build feature, not free or transparent: in-flight GPU work may still reference the old resource (swap behind a fence), layout/format changes invalidate descriptors, and serialized state can stop matching the new code. Treat it as a powerful iteration accelerator with real correctness concerns, not a guarantee that everything reloads cleanly.
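
"Swap behind a fence" is often implemented as deferred destruction: the new resource goes live immediately, and the old one is destroyed only once the last frame that could reference it has finished on the GPU. A minimal sketch, assuming a hypothetical RetireQueue, frame indices that only increase, and resources retired in frame order:

```cpp
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>

// Keeps old resources alive until the GPU can no longer be using them.
class RetireQueue {
public:
    // Schedule `destroy` to run once frame `lastUsedFrame` has completed on the GPU.
    void retire(std::uint64_t lastUsedFrame, std::function<void()> destroy) {
        pending_.push_back({lastUsedFrame, std::move(destroy)});
    }

    // Call once per frame with the newest frame index whose fence has signaled.
    // Returns how many resources were destroyed.
    int collect(std::uint64_t completedFrame) {
        int destroyed = 0;
        while (!pending_.empty() && pending_.front().frame <= completedFrame) {
            pending_.front().destroy();
            pending_.pop_front();
            ++destroyed;
        }
        return destroyed;
    }

private:
    struct Entry { std::uint64_t frame; std::function<void()> destroy; };
    std::deque<Entry> pending_;                  // oldest retirement at the front
};
```

A reloaded pipeline retired at frame 10 survives collect(9) and is destroyed by collect(10), so command buffers still in flight never see a dangling handle.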


09 Pitfalls

"Immediate mode is faster": It's about where state lives. The win is iteration speed + no desync, not throughput.

Shipping ImGui as the player HUD: It targets tools/debug UI. Retained/custom dominates shipping UI.

CPU-timing GPU work: Async; you time submission. Use timestamp queries.

Reading the timestamp the same frame: WAIT_BIT stalls the CPU. Ring the pool, read a few frames old.

32-bit timestamps: Overflow in ~4.3 s at a 1 ns tick. Use 64-bit; mask to timestampValidBits.

Reading vsync as GPU-bound: Both idle to the refresh line. Diagnose with vsync off.

Optimizing without profiling: Intuition is wrong. Measure, then fix the actual bottleneck.

Profiling the title screen: Profile a representative busy workload, not the menu.

10 What's next

That's the last engine subsystem: you can render, animate, simulate, network, and now see and tune the whole thing. Everything the series built is on the table. The finale assembles it: the 3D-game capstone, a small but complete game that ties every module together, with this profiler overlay watching its own frame. The full path is on the series hub.

[1] Jason Gregory. Game Engine Architecture, 3rd ed., "Tools for Debugging and Development." gameenginebook.com. Logging/tracing, debug drawing, in-game menus, the console, and the in-game profiler; every team builds a tool suite.
[2] Casey Muratori. "Immediate-Mode Graphical User Interfaces" (2005). caseymuratori.com. The origin of the IMGUI concept and the retained-graphics-drawbacks motivation.
[3] Dear ImGui wiki. "About the IMGUI paradigm." github.com/ocornut/imgui. "Immediate mode is a statement about the interface of the library, not the internals"; internal state is normal.
[4] Omar Cornut. Dear ImGui (README). github.com/ocornut/imgui. Battle-tested in the game industry; targets content-creation and debug tools; outputs vertex buffers you render; the platform/renderer backend split.
[5] HPC Wiki. "Runtime profiling." hpc-wiki.info. Sampling vs instrumentation: "instrumentation produces more accurate results but introduces more overhead; sampling has less overhead but produces less accurate results."
[6] Bartosz Taudul. Tracy Profiler. github.com/wolfpld/tracy. A real-time, nanosecond-resolution hybrid frame and sampling profiler; CPU zones + automatic sampling + GPU zones (Vulkan/D3D/OpenGL).
[7] The Khronos Group. Vulkan Specification, Queries. docs.vulkan.org. vkCmdWriteTimestamp, timestampPeriod (ns/tick), timestampValidBits, and queue support.
[8] The Khronos Group. Vulkan Samples, "Timestamp queries." docs.vulkan.org/samples. Why GPU timing isn't CPU timing; the WAIT-stall vs availability readback; the 32-bit overflow; reset-before-write.
[9] Baldur Karlsson. RenderDoc documentation. renderdoc.org. A free MIT stand-alone graphics debugger; single-frame capture, inspect every draw, resource, bound state, and shader with no code changes.
[10] Intel. Graphics Performance Analyzers, "Identify GPU-CPU Bound Scenarios." intel.com. An idle hardware queue means CPU-bound; vsync can make an app appear bound when it isn't.
[11] Donald Knuth. "Structured Programming with Go To Statements." ACM Computing Surveys 6:4 (1974), p. 268. dl.acm.org. The full "premature optimization is the root of all evil... yet we should not pass up our opportunities in that critical 3%."