Build a Game Engine · Audio

Audio DSP & the FFT

The Audio Engine module built the engine, mixing, the ring buffer, spatialization. This is the signal processing that runs inside it: the FFT to see sound as frequencies, windows to read a spectrum honestly, filters to shape it, and convolution to put a sound in a room. The recurring discipline: all of it runs in the hard-real-time callback, so it has to be cheap and allocation-free.

Time~55 min LevelSenior PrereqsThe Audio Engine (PCM, Nyquist, the callback rule), and helpful: SIMD (vectorizing DSP loops). StackC++ & Rust

01Two domains

A sampled signal has two equivalent representations: a sequence of amplitude samples over time (the time domain), and a set of sinusoidal components, amplitude and phase at each frequency, that sum to it (the ). The DFT is the exact, invertible bridge between them. Frequency analysis is what lets you build an equalizer, spectral effects, and any analyzer or visualizer.

Same information, different coordinates

The two domains carry the same information for a finite block, the transform is lossless and invertible; the frequency domain is a re-coordinatization, not a lossy summary. (An aside: frequency-domain analysis is the conceptual basis of perceptual codecs like MP3/AAC, though those use an MDCT filterbank, not a literal FFT.)

02The DFT & the FFT

The DFT of N samples produces N complex bins; each bin's magnitude = √(re²+im²) is that frequency's amplitude and its phase = atan2(im, re) its offset. Computed directly it's O(N²). The (Cooley-Tukey) computes the same result in O(N log N) by recursively splitting the transform into even/odd halves combined with twiddle-factor butterflies^[1]^[2].

Need a refresher on complex numbers and e^−i2πkn/N?

A complex number has two parts, a real part and an imaginary part: z = re + i·im, where i is the unit defined by i² = −1. Picture it as a point on a 2D plane (real horizontal, imaginary vertical). That point has a length (its magnitude, √(re²+im²)) and an angle from the positive real axis (its phase, atan2(im, re)). Each DFT bin is one such number: length = how strong that frequency is, angle = where in its cycle it starts.

The exponential e^−i2πkn/N looks scary but is just a unit-length arrow that rotates. By Euler's formula, e^iθ = cos θ + i·sin θ: feeding an angle into e^i(…) returns the point on the unit circle at that angle. Here e ≈ 2.718 (the natural-exponential base), π ≈ 3.14159 (half a turn), and the exponent −2πkn/N is an angle that grows with both the bin index k and the sample index n. So the DFT compares your signal against a set of spinning arrows, one per frequency, and reports how much each one matches.

The FFT is exact, not an approximation, and a real signal gives N/2+1 useful bins

The FFT computes the DFT exactly, it's an algorithm, not an estimate, and because it does far fewer operations it's typically more numerically precise than the naive DFT (less accumulated round-off)^[1]. Bin k maps to frequency k·fs/N, so the resolution is fs/N. For a real input only N/2+1 bins are independent (indices 0 to N/2); the upper half is the conjugate mirror, real-FFT routines return exactly N/2+1 values. Radix-2 needs N a power of two; mixed-radix and Bluestein handle any size, still O(N log N). Don't claim N independent bins, and don't throw away phase if you're processing (a visualizer can; reconstruction can't).

Sweep a sine across frequency (and detune it between bins) or switch to a square wave; the live magnitude spectrum is below. The square wave shows the odd-harmonic series:

A teaching radix-2 FFT (production: use a library)

// In-place radix-2 decimation-in-time. N MUST be a power of two.
void fftRadix2(std::vector<std::complex<float>>& samples) {
    const size_t n = samples.size();
    for (size_t i = 1, j = 0; i < n; ++i) {        // bit-reversal permutation
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit; if (i < j) std::swap(samples[i], samples[j]);
    }
    for (size_t len = 2; len <= n; len <<= 1) {          // stages: 2, 4, 8, ..., N
        float ang = -2.0f * 3.14159265f / (float)len;     // -2pi/len (forward)
        std::complex<float> step(cosf(ang), sinf(ang));
        for (size_t s = 0; s < n; s += len) {
            std::complex<float> tw(1, 0);
            for (size_t k = 0; k < len/2; ++k) {           // the butterfly
                auto even = samples[s+k], odd = samples[s+k+len/2] * tw;
                samples[s+k] = even + odd; samples[s+k+len/2] = even - odd;
                tw *= step;
            }
        }
    }
}

use num_complex::Complex32;
// In-place radix-2 DIT. n MUST be a power of two. Production: the `rustfft` crate.
fn fft_radix2(samples: &mut [Complex32]) {
    let n = samples.len();
    let mut j = 0usize;
    for i in 1..n {                                  // bit-reversal permutation
        let mut bit = n >> 1;
        while j & bit != 0 { j ^= bit; bit >>= 1; }
        j ^= bit; if i < j { samples.swap(i, j); }
    }
    let mut len = 2;
    while len <= n {                                // stages
        let ang = -2.0 * std::f32::consts::PI / len as f32;
        let step = Complex32::new(ang.cos(), ang.sin());
        let mut s = 0;
        while s < n {
            let mut tw = Complex32::new(1.0, 0.0);
            for k in 0..len/2 {                       // the butterfly
                let (even, odd) = (samples[s+k], samples[s+k+len/2] * tw);
                samples[s+k] = even + odd; samples[s+k+len/2] = even - odd;
                tw *= step;
            }
            s += len;
        }
        len <<= 1;
    }
}

What's intentionally missing

This is a learning aid, ship FFTW, KissFFT, or rustfft^[9]. It does a full complex FFT on real data (a real-FFT does ~half the work, returns N/2+1 bins), recomputes cos/sin per stage instead of a twiddle table, has no SIMD, and supports only power-of-two sizes (no mixed-radix/Bluestein).

03Windowing & leakage

Pulling a finite N-sample block from a longer signal implicitly multiplies by a rectangular window. Unless the signal is exactly periodic in the block, the edges are discontinuous, and that smears each frequency's energy across many bins: . The rectangular window's first side lobe is only 13 dB down, so leakage is severe. A window function (Hann, Hamming, Blackman) tapers the block toward zero at the edges, applied before the FFT, cutting the side lobes^[4].

There is no best window, it's a tradeoff

Lower side lobes (less leakage) come at the cost of a wider main lobe (worse resolution of nearby frequencies). Julius Smith's figures: rectangular first side lobe −13 dB (narrowest main lobe); Hann ≈ −31 dB (main lobe ~2× rectangular); Hamming ≈ −43 dB; Blackman ≈ −58 dB (widest main lobe)^[4]^[5]. Hann is a reasonable default, not universally best. One subtlety: for analysis use the periodic ("DFT-even", denominator N) window, not the symmetric form (that's for FIR design). And windowing reduces leakage, it never eliminates it.

A single sine whose frequency you can push between bins; switching windows drops the leakage as the peak widens:

04The STFT & spectrogram

One FFT is a snapshot. The Short-Time Fourier Transform slides a windowed FFT along the signal, taking a transform of each chunk, producing a 2D time × frequency map; plot magnitude (in dB) as brightness and you have a spectrogram. The hop size is how far the window advances between frames; typical analysis hops are 25 to 50% of the window (so windows overlap and you don't miss transients).

The time-frequency tradeoff is a hard limit

A short window gives good time resolution but poor frequency resolution; a long window gives fine frequency resolution but poor time resolution. You cannot maximize both, the product of time and frequency spread is bounded below (a Fourier/uncertainty relation, minimized by a Gaussian window). This is mathematics, not an implementation limit. And more overlap gives smoother time evolution, not finer frequency resolution, that's set by window length.

05Digital filters

A linear filter is a difference equation, each output sample a weighted sum of recent inputs and recent outputs:

eq. 1 \cdot the filter difference equation y [n] = b 0 x [n] + b 1 x [n-1] + \dots - a 1 y [n-1] - a 2 y [n-2] - \dots

Read it as a mixing desk on a sliding window. The new output y[n] is a weighted blend of the current and past inputs (the b terms, the feed-forward path) minus a weighted blend of past outputs (the a terms, the feedback path)^[7]. A subscript like b₁ just labels a coefficient; [n−1] means "one sample ago". With no feedback (no a terms) it is an FIR filter; with feedback it is IIR. Hover any symbol to see what it stands for.

FIR (no feedback): always stable, can be exactly linear-phase (no phase distortion), but needs many taps, more compute.
IIR (has feedback): achieves a given response with far fewer coefficients (cheap), but can be unstable and has nonlinear phase.

The is a second-order IIR section (two poles, two zeros), the standard workhorse. The RBJ "Audio EQ Cookbook" gives ready coefficients for low/high/band-pass, notch, peak, and shelf filters^[6]. Conceptually, zeros pull the response down, poles push it up (resonance), and a pole on or outside the unit circle is unstable.

Neither FIR nor IIR wins; cascade biquads for higher order

It's a tradeoff: FIR is stable + linear-phase + expensive; IIR/biquad is cheap + can be unstable + nonlinear-phase. Per-voice game EQ is almost always biquads; phase-critical mastering may use linear-phase FIR. A biquad is second-order, build higher-order filters by cascading biquads (second-order sections), which is far less sensitive to coefficient quantization than one high-order section (a small error there can push a pole across the unit circle and blow up). Watch denormals in the feedback path (decaying tails hit subnormal floats, a CPU cliff on x86, set FTZ/DAZ, cross-ref SIMD).

A biquad (RBJ low-pass, Direct Form I), block-processed in the audio thread

struct Biquad {
    float b0,b1,b2,a1,a2;                 // coefficients (a0 normalized to 1)
    float x1=0,x2=0,y1=0,y2=0;            // state: last two inputs/outputs
    void setLowpass(float cutoffHz, float q, float sampleRate) {
        float w0 = 2.0f*3.14159265f*cutoffHz/sampleRate;
        float cw = cosf(w0), alpha = sinf(w0)/(2.0f*q);
        float a0 = 1.0f + alpha;                // normalize everything by a0
        b0 = ((1-cw)*0.5f)/a0; b1 = (1-cw)/a0; b2 = ((1-cw)*0.5f)/a0;
        a1 = (-2*cw)/a0; a2 = (1-alpha)/a0;
    }
    void processBlock(float* s, int count) {
        for (int i = 0; i < count; ++i) {
            float x0 = s[i];                  // the difference equation
            float y0 = b0*x0 + b1*x1 + b2*x2 - a1*y1 - a2*y2;
            x2=x1; x1=x0; y2=y1; y1=y0;       // shift history (y feedback is serial)
            s[i] = y0;
        }
    }
};

struct Biquad { b0:f32,b1:f32,b2:f32,a1:f32,a2:f32, x1:f32,x2:f32,y1:f32,y2:f32 }
impl Biquad {
    fn set_lowpass(&mut self, cutoff_hz: f32, q: f32, sample_rate: f32) {
        let w0 = 2.0*std::f32::consts::PI*cutoff_hz/sample_rate;
        let (cw, alpha) = (w0.cos(), w0.sin()/(2.0*q));
        let a0 = 1.0 + alpha;                 // normalize by a0
        self.b0 = ((1.0-cw)*0.5)/a0; self.b1 = (1.0-cw)/a0; self.b2 = ((1.0-cw)*0.5)/a0;
        self.a1 = (-2.0*cw)/a0; self.a2 = (1.0-alpha)/a0;
    }
    fn process_block(&mut self, s: &mut [f32]) {
        for v in s.iter_mut() {
            let x0 = *v;                       // difference equation
            let y0 = self.b0*x0 + self.b1*self.x1 + self.b2*self.x2 - self.a1*self.y1 - self.a2*self.y2;
            self.x2=self.x1; self.x1=x0; self.y2=self.y1; self.y1=y0;   // shift history
            *v = y0;
        }
    }
}

The biquad's frequency response, recomputed live from the RBJ coefficients; sweep cutoff, Q, and type, and push Q high to see the resonant peak:

06Convolution reverb

Convolution reverb convolves the dry signal with a measured room impulse response (the recorded response of a real space to an impulse); the output sounds as if played there. Direct convolution is O(N·M), far too slow for a seconds-long IR. The convolution theorem, time-domain convolution equals frequency-domain multiplication, is why FFT convolution is fast: FFT both, multiply the spectra, inverse-FFT, O(N log N)^[1].

Plain FFT convolution adds latency; real-time uses partitioned convolution

A single big FFT convolution needs the whole block, adding latency equal to the block length, unacceptable in real time. The fix is partitioned convolution: chop the IR into blocks and overlap-add, with small early blocks (low latency) and larger tail blocks (efficiency). FFT convolution also beats direct only for long kernels, a short FIR is faster done directly. The alternative, algorithmic reverb (Schroeder/Moorer, Freeverb = parallel comb filters for echo density + series all-pass filters for diffusion)^[8], is cheap, low-latency, and fully parameterizable (size, decay, damping) but sounds synthetic next to a measured IR. More realistic vs cheaper-and-tweakable, a tradeoff, not a ranking.

07Real-time game DSP

All of this runs inside the hard-real-time audio callback (or worker threads feeding it): no heap alloc, no locks, no I/O, no unbounded work. Filters and reverb are pre-allocated with fixed state; parameter changes arrive via the lock-free ring, not a mutex. Process whole blocks per effect (cache- and SIMD-friendly).

The DSP graph, and where SIMD does and doesn't help

Cheap per-voice filters (a distance/occlusion low-pass) run per voice; expensive shared effects (reverb, master EQ, a limiter) run once on a bus. This is the DSP graph middleware exposes, FMOD and Wwise are directed graphs of DSP nodes from generators to the output^[10]. SIMD the parallelizable loops (the magnitude loop, FFT butterflies, batches of voices), but the biquad recurrence is serial in time (y[n] depends on y[n−1]), so vectorize across voices/channels, not within one filter. And ramp coefficient changes between blocks, a hard jump clicks (zipper noise).

Wrong answers, and why: the FFT is exact (not approximate) and real input gives N/2+1 bins (not N); and windowing trades leakage for resolution (no best window) and is applied before the FFT.

08Pitfalls

"The FFT is approximate"It's exact, and more precise than the naive DFT (fewer ops).

N independent binsA real signal gives N/2+1; the rest is the conjugate mirror.

"Hann is the best window"No best window. Side lobes vs main-lobe width is a tradeoff.

Longer FFT is strictly betterFiner frequency but coarser time, the hard tradeoff.

Crowning FIR or IIRFIR: stable/linear-phase/expensive. IIR: cheap/can-ring. Tradeoff.

One high-order IIR sectionQuantization can blow it up. Cascade biquads.

"FFT convolution is always faster"Only for long kernels; short FIRs win direct. And it adds latency.

Alloc/lock in the callbackPre-allocate; params via the lock-free ring. Mind denormals.

09What's next

That's the signal processing under the audio engine: see sound as frequencies, read the spectrum honestly, shape it with filters, and place it in a room. Back to the series hub for the rest of the engine, or the Audio Engine module this builds on.

Steven W. Smith. The Scientist and Engineer's Guide to Digital Signal Processing, ch. 12 (the FFT), 9 (the real DFT, N/2+1), 18 (FFT convolution). dspguide.com. The FFT is exact and more precise; bin counts; the convolution theorem.
James Cooley and John Tukey. "An Algorithm for the Machine Calculation of Complex Fourier Series." Math. of Computation 19(90), 1965. overview. The O(N log N) divide-and-conquer FFT.
Julius O. Smith III. Mathematics of the DFT (with Audio Applications). CCRMA, Stanford. ccrma.stanford.edu. The DFT definition, magnitude and phase, for audio.
Julius O. Smith III. Spectral Audio Signal Processing, "Spectrum Analysis Windows." dsprelated.com. Window side-lobe/main-lobe figures (rect −13, Hann −31, Hamming −43, Blackman −58 dB) and the tradeoff.
Fredric J. Harris. "On the Use of Windows for Harmonic Analysis with the Discrete Fourier Transform." Proc. IEEE 66(1), 1978. doi.org. The canonical window catalog and performance parameters.
Robert Bristow-Johnson. "Cookbook formulae for audio EQ biquad filter coefficients" (Audio EQ Cookbook). w3.org/TR/audio-eq-cookbook. The RBJ biquad coefficients for every filter type; Direct Form I recommendation.
Julius O. Smith III. Introduction to Digital Filters (with Audio Applications). CCRMA. ccrma.stanford.edu. FIR vs IIR, poles and zeros, stability, the difference equation.
Julius O. Smith III. Physical Audio Signal Processing, "Freeverb" / "Artificial Reverberation." ccrma.stanford.edu. Schroeder/Moorer comb+allpass reverb; Freeverb's 8 combs + 4 all-passes.
Matteo Frigo and Steven Johnson. FFTW ("Fastest Fourier Transform in the West"). fftw.org. The production FFT: any size, O(N log N), Cooley-Tukey + Rader/Bluestein. (KissFFT, rustfft are lighter alternatives.)
Firelight Technologies. FMOD DSP Architecture. fmod.com. Game audio as a directed graph of DSP effect nodes, per-voice and per-bus.