The Audio Engine
Rendering is soft real-time: a dropped frame stutters. Audio is hard real-time: a single missed buffer is a pop everyone hears. That one constraint, the callback that must finish on time, shapes the whole subsystem: no locks, no allocation, a lock-free ring to hand work in, and a mixer that lives or dies by its deadline. We build it in C++ and Rust.
01 The real-time callback
The audio device driver runs a dedicated high-priority thread and calls a function you registered, handing it a buffer to fill before the hardware needs it. It's a pull model: the system pulls samples from you, on its schedule, on a thread you don't control. cpal describes it as "a dedicated, high-priority thread responsible for delivering audio data to the system's audio device in a timely manner"[10]. A 128-frame buffer at 48 kHz gives you about 2.67 ms to fill it.
Inside the callback: no heap allocation (malloc/free/new/delete), no mutex that a non-real-time thread can hold, no file or network I/O or other blocking system calls, and no unbounded work. The reason isn't that these are "slow"; it's that their worst case is unbounded: the allocator may take a contended lock or fault a page in from disk, and a mutex can cause priority inversion[3]. Bencina's rule: be deterministically fast, and think in worst-case, not amortized, time[1]. And try_lock is not a loophole: even releasing the lock interacts with the OS scheduler and isn't real-time-safe[2]. The fix is a lock-free queue, not clever locking.
In C++ the traps hide in plain sight: a std::shared_ptr copy can delete on last release, std::function may heap-allocate, vector::push_back can reallocate. Any of those is an allocation in disguise. The mixing callback below is allocation- and lock-free:
// miniaudio 0.11.x. Runs on the high-priority audio thread.
// RULE: no malloc/free, no mutex the game thread holds, no I/O, no unbounded loops.
void dataCallback(ma_device* device, void* pOutput, const void*, ma_uint32 frameCount) {
    Mixer* mixer = (Mixer*)device->pUserData;   // pre-allocated, owned elsewhere
    float* output = (float*)pOutput;            // interleaved stereo: L,R,L,R,...
    drainPlayCommands(mixer);                   // wait-free pop from the SPSC ring (§4)
    memset(output, 0, sizeof(float) * frameCount * 2); // start from silence (bounded)
    for (Voice& voice : mixer->voices) {        // fixed-capacity array, no allocation
        if (!voice.active) continue;
        for (ma_uint32 frame = 0; frame < frameCount; ++frame) {
            float sample = voice.data[voice.position];      // mono source
            output[frame*2 + 0] += sample * voice.gainLeft;  // mixing == summation
            output[frame*2 + 1] += sample * voice.gainRight;
            if (++voice.position >= voice.length) { voice.active = false; break; }
        }
    }
    // the sum may exceed [-1,1]; clamp/limit on the master bus, not here (§5)
}
// cpal 0.18.x. The closure runs on cpal's high-priority audio thread.
let stream = device.build_output_stream(
    &config,
    move |output: &mut [f32], _: &cpal::OutputCallbackInfo| {
        drain_play_commands(&mut mixer);          // wait-free pop from the SPSC ring (§4)
        for s in output.iter_mut() { *s = 0.0; }  // silence
        for voice in mixer.voices.iter_mut() {    // fixed-capacity, no allocation
            if !voice.active { continue; }
            for frame in output.chunks_mut(2) {   // 2 samples == one stereo frame
                let sample = voice.data[voice.position]; // mono source
                frame[0] += sample * voice.gain_left;    // summation
                frame[1] += sample * voice.gain_right;
                voice.position += 1;
                if voice.position >= voice.len { voice.active = false; break; }
            }
        }
        // sum may exceed [-1,1]; limit on the master bus, not here (§5)
    },
    move |err| eprintln!("audio error: {err}"),
    None, // timeout (cpal 0.15+ param)
)?;
What this sketch leaves out: device-format negotiation and conversion (the engine converts at the edges); a master-bus limiter (§5); voice stealing when all slots are full; a retire ring to free finished buffers off the audio thread (freeing them here would break the no-allocation rule); sample-accurate command scheduling (commands apply at block boundaries, not at exact sample offsets); and a declick ramp on voice start/stop (instant gain changes click).
02 PCM fundamentals
Pulse-code modulation (PCM) represents a waveform as evenly spaced amplitude samples[4]. Four terms you must keep straight:
- Sample rate (Hz) sets the bandwidth (Nyquist = rate/2). 48 kHz is the common game/OS rate, 44.1 kHz is CD.
- Bit depth sets dynamic range for integer formats (16-bit ≈ 96 dB). float32 samples are zero-centered, nominally in [−1, 1].
- Channels are concurrent streams (L/R, 5.1).
- A frame is one sample per channel at one instant (a stereo frame is two samples).
An audio frame is not a video frame, it's one sample per channel. That naming collision near the game loop bites everyone, so say which you mean. And float32 in [−1, 1] is the internal processing format (what miniaudio and cpal's f32 hand you); device and file formats are frequently 16- or 24-bit integer PCM, and the engine converts at the edges. The win from float isn't more resolution at the speaker, it's headroom during mixing, where the sum can exceed ±1 without losing data until the final clamp. Layout is interleaved (L,R,L,R) for most device APIs and WAV, planar (LLL…RRR) for Web Audio.
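The "convert at the edges" step and the two layouts can be sketched in a few lines. This is a minimal illustration, not how miniaudio or cpal convert internally: the function names are hypothetical, the 32768/32767 scaling is one common convention among several, and no dither is applied.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>

// 16-bit PCM -> float32. Dividing by 32768 maps -32768 to exactly -1.0.
// (Assumed convention; some codebases divide by 32767 instead.)
inline float int16ToFloat(std::int16_t s) {
    return static_cast<float>(s) / 32768.0f;
}

// float32 -> 16-bit PCM. This clamp is the "final clamp" at the device edge:
// anything the mix pushed past +/-1 is hard-clipped here.
inline std::int16_t floatToInt16(float x) {
    const float clamped = std::max(-1.0f, std::min(1.0f, x));
    return static_cast<std::int16_t>(std::lround(clamped * 32767.0f));
}

// Interleaved stereo (L,R,L,R,...) -> planar (LLL..., RRR...).
// `left` and `right` must each hold `frames` samples.
inline void deinterleaveStereo(const float* interleaved, std::size_t frames,
                               float* left, float* right) {
    for (std::size_t f = 0; f < frames; ++f) {
        left[f]  = interleaved[f * 2 + 0];
        right[f] = interleaved[f * 2 + 1];
    }
}
```

Note that `frames` counts audio frames, so the interleaved buffer holds `frames * 2` samples: the naming collision from above, in code.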
03 Buffer & latency
The callback buffer size sets two things at once: the latency floor (you can't react faster than one buffer) and how tight the deadline is. 256 frames at 48 kHz is about 5.3 ms; 64 frames about 1.3 ms. Smaller buffers mean lower latency but a shorter window to finish in, so any jitter, a USB stall, a background flush, an over-budget mix, causes an underrun (a starved output, heard as a click or dropout).
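The deadline arithmetic is just frames divided by sample rate. A small sketch, with hypothetical helper names, that you could use to budget a callback:

```cpp
#include <cstdint>

// Time the hardware takes to play one buffer, in milliseconds.
// This is both the latency floor and the callback's deadline.
inline double bufferDurationMs(std::uint32_t frames, std::uint32_t sampleRateHz) {
    return 1000.0 * static_cast<double>(frames) / static_cast<double>(sampleRateHz);
}

// Fraction of the deadline a callback used (hypothetical metric).
// Values approaching 1.0 leave no slack for jitter; above 1.0 is an underrun.
inline double callbackLoad(double callbackMs, std::uint32_t frames,
                           std::uint32_t sampleRateHz) {
    return callbackMs / bufferDurationMs(frames, sampleRateHz);
}
```

For example, a mix that takes 2 ms in the worst case uses three quarters of a 128-frame buffer at 48 kHz but would overrun a 64-frame buffer. Measure the worst case, not the average, before shrinking the buffer.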
04 The ring to the audio thread
The game thread can't call into the audio thread directly and can't share a mutex with it, so the hand-off is a single-producer, single-consumer lock-free ring: the game thread enqueues "play this sound" commands, the callback dequeues them at the top of each block. This is exactly the SPSC ring from the Lock-free Queues tutorial. PortAudio's ring buffer states it "only works when there is a single reader and a single writer" with a power-of-two capacity[5].
The consumer side must be wait-free: bounded steps, never blocks, never allocates. The subtle trap is lifetime: if you pass an owning pointer (or an Arc/shared_ptr) and the audio thread drops the last reference, that free runs on the audio thread, which breaks the no-allocation rule. Either the game thread owns the buffer's lifetime (the audio thread only borrows) or you hand finished voices back on a second ring for the game thread to free. And the ring can fill, so define a policy (drop the newest, or size for the worst case); the audio thread must never block waiting for space.
struct PlayCommand { const float* data; uint32_t length; float gainLeft, gainRight; };
// Game thread (NOT real-time): enqueue may fail if the ring is full.
void requestPlay(SpscRing<PlayCommand>& ring, const PlayCommand& cmd) {
    if (!ring.tryPush(cmd)) { /* drop or count an overflow; never block audio */ }
}
// Audio thread: wait-free drain. Bounded (the ring is finite). Never frees.
void drainPlayCommands(Mixer* mixer) {
    PlayCommand cmd;
    while (mixer->pendingCommands.tryPop(cmd)) {
        Voice* voice = findFreeVoice(mixer);   // scan the fixed voice array
        if (voice) {
            voice->data = cmd.data;
            voice->length = cmd.length;
            voice->gainLeft = cmd.gainLeft;
            voice->gainRight = cmd.gainRight;
            voice->position = 0;
            voice->active = true;
        }
    }
}
// e.g. the `rtrb` or `ringbuf` SPSC crate. Arc: game thread owns the samples;
// the audio thread holds a ref-count (moving an Arc is cheap) but must not drop the LAST one.
use std::sync::Arc;
struct PlayCommand { data: Arc<[f32]>, gain_left: f32, gain_right: f32 }
// Game thread: push may fail when full. Never block the audio thread.
fn request_play(producer: &mut Producer<PlayCommand>, cmd: PlayCommand) {
    let _ = producer.push(cmd); // drop on full; do not spin
}
// Audio thread: wait-free drain. pop() returns Err(Empty) when drained.
fn drain_play_commands(mixer: &mut Mixer) {
    while let Ok(cmd) = mixer.consumer.pop() {
        if let Some(voice) = mixer.find_free_voice() {
            voice.len = cmd.data.len();  // set before the Arc is moved; the callback checks it
            // Overwriting voice.data drops the slot's previous Arc, so the game
            // thread must still hold a reference to that buffer.
            voice.data = cmd.data;
            voice.gain_left = cmd.gain_left;
            voice.gain_right = cmd.gain_right;
            voice.position = 0;
            voice.active = true;
        }
        // If no voice is free, `cmd` drops here; same rule: not the last reference.
    }
}
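The listings above use an SPSC ring without showing it. Here is a minimal sketch of one, assuming a compile-time power-of-two capacity (so this `SpscRing` takes a second template parameter the listing above does not) and an element type whose copy assignment never allocates. It is an illustration of the acquire/release hand-off, not the ring from the Lock-free Queues tutorial or from PortAudio.

```cpp
#include <atomic>
#include <cstddef>

// Single-producer, single-consumer ring. Exactly one thread may call tryPush
// and exactly one (other) thread may call tryPop.
template <typename T, std::size_t Capacity>
class SpscRing {
    static_assert(Capacity >= 2 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two so wrapping is a bit mask");
public:
    // Producer side. Returns false when the ring is full; never blocks.
    bool tryPush(const T& item) {
        const std::size_t w = writeIndex_.load(std::memory_order_relaxed);
        const std::size_t r = readIndex_.load(std::memory_order_acquire);
        if (w - r == Capacity) return false;                  // full
        slots_[w & (Capacity - 1)] = item;
        writeIndex_.store(w + 1, std::memory_order_release);  // publish the slot
        return true;
    }

    // Consumer side. Returns false when the ring is empty; never blocks.
    bool tryPop(T& out) {
        const std::size_t r = readIndex_.load(std::memory_order_relaxed);
        const std::size_t w = writeIndex_.load(std::memory_order_acquire);
        if (r == w) return false;                             // empty
        out = slots_[r & (Capacity - 1)];
        readIndex_.store(r + 1, std::memory_order_release);   // hand the slot back
        return true;
    }

private:
    T slots_[Capacity];
    std::atomic<std::size_t> writeIndex_{0};  // only the producer writes this
    std::atomic<std::size_t> readIndex_{0};   // only the consumer writes this
};
```

Both calls do a fixed number of steps with no loops, locks, or allocation, which is what makes the consumer side safe to call from the callback. The indices count up forever and are masked on access, so full and empty are distinguishable without wasting a slot. The same class, pointed the other way, serves as the retire ring for handing finished buffers back to the game thread.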
05 Mixing
Mixing N voices is summation: add the (gain-scaled) samples, per channel, per frame. Because independent signals add, the sum routinely leaves [−1, 1]. In float that's non-destructive until output, but the final clamp to the device range will hard-clip anything past ±1, squaring off the peaks into harmonic distortion.
The fix is headroom (mix with peaks below the ceiling) plus a limiter (smoothly reduce gain near the ceiling, preserving waveform shape)[6]. Two wrong "fixes": dividing the sum by N attenuates everything (one quiet voice shouldn't get quieter because another exists); and relying on hard-clipping distorts. The limiter isn't free either, pushed hard it pumps and softens transients. Float not overflowing mid-sum does not mean you can ignore levels: the clamp at the device boundary still clips.
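To make the difference between clipping and gentler gain reduction concrete, here is a sketch of a hard clip next to a soft clip. Note the swap: this is a static waveshaper, not the limiter described above (a real limiter tracks level over time with attack and release). The `threshold` parameter, its 0.8 default, and the tanh knee are illustrative assumptions.

```cpp
#include <cmath>

// What the device boundary does to an over-range sum: flat-topped peaks.
inline float hardClip(float x) {
    return x < -1.0f ? -1.0f : (x > 1.0f ? 1.0f : x);
}

// Soft clip: samples at or below `threshold` pass unchanged; above it, the
// excess is squeezed through tanh so the output bends toward the ceiling and
// stays within [-1, 1]. The curve is continuous, with slope 1 at the threshold.
inline float softClip(float x, float threshold = 0.8f) {
    const float magnitude = std::fabs(x);
    if (magnitude <= threshold) return x;
    const float knee = 1.0f - threshold;
    const float shaped = threshold + knee * std::tanh((magnitude - threshold) / knee);
    return x < 0.0f ? -shaped : shaped;
}
```

Applied per sample on the master bus after the sum, `softClip` leaves quiet material untouched and rounds off only the peaks. It still distorts what it touches, just with fewer harsh high harmonics than the hard clip, so headroom remains the first line of defense.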
06 Resampling
Playing a sound at a different pitch, or matching a 44.1 kHz asset to a 48 kHz device, means computing sample values at fractional positions: resampling. The quality ladder runs nearest-neighbor → linear → windowed-sinc.
Naive drop/duplicate and even linear interpolation introduce aliasing. Linear interpolation's frequency response is sinc²(fT), a weak low-pass whose first sidelobe suppresses spectral images by only about 26 dB and which isn't flat in the passband, so high-frequency images fold back as audible alias tones[7]. Production resamplers use windowed-sinc / polyphase filters, trading taps for quality. Linear is acceptable only if the signal is already heavily oversampled. One more: pitch-by-resampling changes duration and timbre together (the chipmunk effect), it's not a formant-preserving pitch shifter.
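For reference, the linear rung of that ladder looks like this. It is a sketch of the technique the paragraph warns about, with no anti-aliasing filter, and the function name and `step` parameter are assumptions. It returns a `std::vector`, so it allocates and belongs on the loading path, not in the callback.

```cpp
#include <cstddef>
#include <vector>

// Linear-interpolation resampler. `step` is how far the read position advances
// in the input per output sample: inputRate / outputRate for rate conversion
// (44100.0 / 48000.0), or the pitch ratio for pitch-by-resampling (2.0 = octave up).
inline std::vector<float> resampleLinear(const std::vector<float>& input, double step) {
    std::vector<float> output;
    if (input.size() < 2 || step <= 0.0) return output;
    const double lastIndex = static_cast<double>(input.size() - 1);
    for (std::size_t n = 0;; ++n) {
        // Multiply rather than accumulate, so rounding error does not drift.
        const double pos = static_cast<double>(n) * step;
        if (pos > lastIndex) break;
        const std::size_t i = static_cast<std::size_t>(pos);           // integer part
        const float frac = static_cast<float>(pos - static_cast<double>(i));
        const float a = input[i];
        const float b = (i + 1 < input.size()) ? input[i + 1] : a;     // clamp at the end
        output.push_back(a + (b - a) * frac);  // weighted average of the two neighbors
    }
    return output;
}
```

A windowed-sinc resampler has the same outer loop; it replaces the two-sample weighted average with a weighted sum over many neighboring samples, which is where the extra taps and the extra quality come from.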
07 Spatial audio
Placing a sound in space starts with stereo panning. Equal-power panning maps the pan angle θ ∈ [0, π/2] to L = cos θ, R = sin θ. Because loudness tracks power (amplitude²) and cos²θ + sin²θ = 1, the total power is flat across the pan; at center each channel is 0.707 (−3 dB)[6]. Linear panning instead dips to −3 dB total at center, the audible "hole in the middle."
"Constant power is flat" assumes the two channels combine incoherently (power-additive), which holds for two loudspeakers in a room. If they combine coherently (a true stereo-to-mono fold-down, or a listener exactly equidistant), equal-power's center is +3 dB hot while linear's is flat, which is exactly why the −4.5 dB compromise pan law (the geometric mean of the two) exists and why pro tools expose selectable pan laws[6]. Equal-power is the right default for game stereo; just know the assumption.
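The three pan laws discussed above fit in a few lines. A sketch with hypothetical names, taking pan as a 0..1 position rather than the angle θ directly:

```cpp
#include <algorithm>
#include <cmath>

struct StereoGain { float left; float right; };

// pan: 0 = hard left, 0.5 = center, 1 = hard right.

// Equal-power: theta = pan * pi/2, L = cos(theta), R = sin(theta).
inline StereoGain panEqualPower(float pan) {
    const float theta = pan * 1.57079632679f;
    // max() guards against a tiny negative cosine from float rounding at pan = 1.
    return { std::max(0.0f, std::cos(theta)), std::max(0.0f, std::sin(theta)) };
}

// Linear: gains sum to 1, so total power dips at center.
inline StereoGain panLinear(float pan) {
    return { 1.0f - pan, pan };
}

// Compromise: per-channel geometric mean of the two laws (-4.5 dB at center).
inline StereoGain panCompromise(float pan) {
    const StereoGain e = panEqualPower(pan);
    const StereoGain l = panLinear(pan);
    return { std::sqrt(e.left * l.left), std::sqrt(e.right * l.right) };
}

// Total power under the incoherent (power-additive) assumption.
inline float totalPower(StereoGain g) {
    return g.left * g.left + g.right * g.right;
}
```

At center, `panEqualPower` gives about 0.707 per channel and a total power of 1; `panLinear` gives 0.5 per channel and a total power of 0.5, the −3 dB hole; `panCompromise` lands at about 0.595 per channel. In the mixer, these gains are what you would write into a voice's gainLeft and gainRight.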
Beyond panning: distance attenuation makes far sounds quieter. Physics says intensity falls as 1/r², but games don't ship pure 1/r² (it blows up at r→0); OpenAL's default is an inverse-distance model with a reference distance, a tunable rolloff, and the distance clamped to a max[8]. Doppler shifts pitch from relative velocity (computed from velocities, then applied as a resample). And HRTF (head-related transfer function) rendering convolves a mono source with the head-related impulse responses for a direction to place it in full 3D over headphones; it's what Steam Audio and similar spatializers do[12]. HRTF is binaural (headphone) and individual, so generic sets have front/back confusion.
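The inverse-distance-clamped model is short enough to write out. This sketch follows the formula in the OpenAL 1.1 specification[8]; the function and parameter names are mine, standing in for the spec's reference distance, rolloff factor, and max distance.

```cpp
#include <algorithm>

// Amplitude gain for a source at `distance` from the listener.
// The distance is clamped to [referenceDistance, maxDistance] first, so the
// gain never exceeds 1 up close and stops falling beyond maxDistance.
inline float inverseDistanceClampedGain(float distance, float referenceDistance,
                                        float rolloffFactor, float maxDistance) {
    float d = std::max(distance, referenceDistance);
    d = std::min(d, maxDistance);
    return referenceDistance /
           (referenceDistance + rolloffFactor * (d - referenceDistance));
}
```

With a reference distance of 1 and a rolloff of 1 this reduces to 1/d on the amplitude, which is the physical 1/r² on intensity, since power goes as amplitude squared. The clamp at the reference distance is what removes the blow-up at r→0, and the rolloff factor is the designer's knob for exaggerating or flattening the falloff.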
08 APIs & libraries
Each OS has a native low-level API that ultimately calls you on a real-time thread to move a buffer: WASAPI on Windows (you write a rendering endpoint buffer; the engine mixes app streams; shared vs exclusive mode)[9], CoreAudio / Audio Units on macOS/iOS (a pull callback), ALSA on Linux (a PCM device, a ring split into periods). Same model, materially different setup.
In practice you use a cross-platform layer. C++: miniaudio (a single-file C library, v0.11.25), one data callback, backends to all of the above[11]. Rust: cpal (v0.18.1, the low-level device/stream layer) and rodio (v0.22.2, higher-level playback on top)[10]. For a shipped game most studios use middleware (FMOD, Wwise); for 3D audio, Steam Audio and similar. The from-scratch mixer here is what those wrap.
09 Pitfalls
10 What's next
That's the last subsystem. The engine can now run a loop, draw sprites, take input, load assets, simulate physics, and make sound. The next module is the payoff: the 2D-game capstone, where all of it wires together into a small playable game in C++ and Rust. After that, the series turns to 3D. The full path is on the series hub.
- Ross Bencina. "Real-time audio programming 101: time waits for nothing" (2011). rossbencina.com. The canonical no-allocate / no-lock / no-block callback rules; be deterministically fast (worst-case, not amortized).
- Timur Doumler. "Using locks in real-time audio processing, safely." timur.audio. The audio thread must avoid anything of unknown duration; even try_lock's unlock touches the scheduler.
- Android Open Source Project. "Avoiding Priority Inversion." source.android.com. Priority inversion in audio manifests as glitches/dropouts; the canonical mechanism.
- "Pulse-code modulation." Wikipedia. en.wikipedia.org. Sampling and quantization: a continuous waveform becomes integer samples at a fixed sample rate and bit depth. (A frame is one sample per channel; stereo interleaves left and right.)
- PortAudio. pa_ringbuffer.h reference. portaudio.com. The lock-free ring works only single-reader / single-writer, with a power-of-two capacity.
- Anders Øland and Roger Dannenberg. "Loudness Concepts & Pan Laws" (CMU). cs.cmu.edu. Equal-power panning (cos/sin, flat power, −3 dB center), the linear hole-in-middle, and the −4.5 dB compromise law; headroom.
- Julius O. Smith III (CCRMA). "Linear Interpolation Frequency Response." ccrma.stanford.edu. Linear interpolation is sinc²(fT), about 26 dB image suppression, so it aliases; windowed-sinc is the fix.
- The OpenAL 1.1 Specification. openal.org. The inverse-distance-clamped default attenuation model and the Doppler formula (speed of sound 343.3, factor 1.0).
- Microsoft. "About WASAPI." learn.microsoft.com. The app writes a rendering endpoint buffer; the audio engine mixes app streams; shared vs exclusive mode.
- cpal (Rust). docs.rs/cpal (v0.18.1) and rodio (v0.22.2). The high-priority audio thread; build_output_stream with the timeout parameter.
- David Reid. miniaudio. miniaud.io (v0.11.25). A single-file C audio library; the data callback shape; "never start/stop the device in the callback."
- Valve. Steam Audio. valvesoftware.github.io/steam-audio. HRTF-based binaural spatialization (the headphone 3D-audio technique), with occlusion and reverb.