Tutorial 12 · Engine Programming

SIMD
from Scratch

Sixteen XMM registers since 2003, sixteen 256-bit YMMs since AVX, thirty-two 512-bit ZMMs and eight predicate masks on AVX-512 hardware. One vaddps zmm0, zmm1, zmm2 retires sixteen single-precision adds in one cycle of throughput on a Sapphire Rapids core. The trick that gets a renderer from "60 fps at 100k particles" to "60 fps at 1M particles" is almost never a better algorithm; it is laying the data out so the CPU's vector unit can chew it in lanes. We work from the lane model up through a SoA frustum cull and a 4×4 matrix batch transform, on x86 SSE / AVX / AVX-512 and ARM NEON, with every claim cited to a vendor manual or a measured benchmark.

Time~65 min LevelEngine programmer, intermediate to senior PrereqsYou can read C or C++ comfortably and know what a cache line is. The Assembly tutorial pairs naturally; the Memory Model tutorial is useful for §10. HardwareAny x86-64 chip from the last decade, or an Apple Silicon / Switch / Android device for the ARM sections

01What SIMD buys you, and what it costs

A modern desktop CPU core can issue two 256-bit floating-point multiply-adds per cycle. At 4 GHz, one core can produce 64 billion single-precision FMAs per second from a single hardware thread^[1]. The same core, running the same kernel one scalar element at a time, peaks two orders of magnitude below that. The gap is what a vectorized inner loop unlocks, and what a scalar inner loop leaves on the floor.

In a shipping engine the gap usually shows up not as a peak throughput number but as headroom. A scalar particle tick that costs several milliseconds at a hundred thousand particles can fall to a small fraction of that once the data is in SoA form and the integration is one vfmadd231ps per axis per eight particles; the same is true of skinning, instance-transform passes, frustum and occlusion culling, and the math-heavy parts of physics narrow-phase. The realized speedup varies with the kernel and the layout, but the shape is consistent: the lane-vector form processes 8 or 16 records per instruction, the surrounding loop overhead is roughly fixed, and the frame budget that was tight in scalar suddenly has room for the next system or the next thousand entities.

The cost is the part most introductions skip. SIMD only pays out when:

Your data is laid out for it. A vector load picks up sixteen contiguous bytes (SSE) or thirty-two (AVX); it does not pick up four scattered x components from four scattered Particle records. The single biggest determinant of whether a SIMD rewrite is a win is whether the inputs are in or form. §4 is the long answer.
The work has lane-level parallelism. A reduction with a four-cycle latency chain (each iteration depends on the previous) does not vectorize directly. You have to break the chain into independent partial sums first. §7.
Branches in the inner loop are rare or vectorizable. Lane-divergent branches stall the whole vector, and the SIMD form has to do the work for both sides of the branch and blend the result. AVX-512's mask registers are the first x86 SIMD with first-class predication and are part of why it remains worth hand-writing for cache-resident workloads. §8.
The compiler did not already do it for you. Modern autovectorizers handle clean array maps and reductions on contiguous data. Hand-rolled intrinsics are the right answer when the compiler bailed out on the pattern, not when it didn't. §13.

What you'll have by the end

A working ability to read and write x86 SIMD intrinsics across SSE, AVX2, FMA, and the AVX-512 mask model. Concrete knowledge of AoS vs SoA vs AoSoA layouts and the cost picture of each. The shuffle, blend, broadcast, and movemask toolkit that bridges scalar control flow and vector data flow. Gather and scatter, including when they're a trap. Two worked engine kernels (a batched 4×4 matrix-vector transform and an 8-at-a-time frustum cull). A clear-eyed picture of the AVX-512 downclocking story across Skylake-SP, Ice Lake, Sapphire Rapids, Zen 4, and Zen 5. Enough of ARM NEON to port your math. Six live, in-browser widgets you can step through.

The shape of the speedup, fixed at the architectural level, is the lane count of your vector divided by the per-lane overhead. For a perfectly clean reduction on contiguous float input, that ceiling is 8× on AVX/AVX2 and 16× on AVX-512. For most engine kernels the realized number is half of that, because the data layout cost or the tail handling burns part of the budget. The job of the rest of this tutorial is to make those costs visible and pay them down.

02A short history of x86 vector extensions

x86 SIMD is the carrying state of nearly thirty years of architectural additions, each layered on top of the previous one. Decoding what your compiler emits in 2026 means knowing which generation a given mnemonic comes from and what hardware it requires.

1997

MMX. Intel's first SIMD: 64-bit integer vectors, aliased onto the x87 floating-point register stack. Killed by SSE2's wider, non-aliased integer ops; deprecated in most engine codebases by 2005.

1999

SSE. Eight new 128-bit XMM registers (XMM0–XMM7), packed single-precision floats, and the first explicit prefetch instructions^[2]. AMD's K7 implemented it on the Athlon XP.

2001

SSE2. Packed double-precision floats and the full integer SIMD ISA (the part MMX was supposed to be) on the same XMM registers. AMD64 made SSE2 part of the mandatory baseline, which is why every x86-64 compiler emits XMM-based floating point^[3]. SSE2 is what a portable x86-64 binary can assume.

2004–2008

SSE3, SSSE3, SSE4. Three follow-on bumps. SSE3 (2004) added horizontal adds and the movddup broadcast. SSSE3 (2006) added pshufb, the byte-shuffle instruction that turned out to be the single most useful SIMD primitive for string and lookup work^[4]. SSE4.1 (2007) added pblendvb, insertps, and the dot-product instruction. SSE4.2 (2008) added string-comparison opcodes and CRC32. Console-era PS3/PS4 and Xbox 360/One titles standardized on the SSE3/SSSE3 baseline.

2011

AVX. 256-bit YMM registers (the lower 128 bits alias the XMM registers), a new three-operand VEX encoding that lets the destination be separate from both sources^[2], and packed single/double float ops at 256 bits. Integer ops stayed at 128 bits on AVX; the 256-bit integer expansion arrived with AVX2 in 2013 on Haswell. The separately-introduced FMA3 extension on Haswell brought fused multiply-add (vfmadd231ps and friends), a single μop that does a += b*c with one rounding step instead of two^[5].

2013

Consoles get AVX. The PS4 and Xbox One ship AMD Jaguar cores, which implement AVX (not AVX2). PS5 and Xbox Series X/S (AMD Zen 2, 2020) ship AVX2 and FMA3. AVX2 is the practical SIMD baseline most current PC and console titles target^[6].

2016–2022

AVX-512. 512-bit ZMM registers, thirty-two of them, with eight separate mask (predicate) registers K0–K7^[7]. First shipped on Knights Landing (2016) and Skylake-X (2017); reached client mobile on Ice Lake (2019). Intel disabled it on client P-cores from Alder Lake (2021) through Arrow Lake (2024) once the chips shipped with E-cores that lack the unit. AMD shipped AVX-512 on Zen 4 (2022, via two 256-bit pumps) and on Zen 5 (2024, with full 512-bit datapaths)^[6]. AVX-512 is the only x86 SIMD generation with first-class lane masking, which is why it stays worth hand-writing for cache-resident workloads. §14 covers the frequency-license story in detail.

2023

AVX10 (announced).^[8] Intel's proposal to converge the AVX-512 feature surface onto a unified ISA versioned by AVX10.1 / AVX10.2, with hardware free to ship a 256-bit-max implementation or a 512-bit one. The goal is to end the AVX-512 fragmentation that left client Intel parts without it from Alder Lake onward. Shipping silicon expected mid-decade; mentioned here so the timeline stays current.

The practical state in 2026: a portable x86-64 binary can assume SSE2. -march=x86-64-v3 (the GCC microarchitecture level for AVX2 / FMA / BMI2) is what most current shipping games target on PC, matching the PS5 / Xbox Series baseline. AVX-512 stays opt-in on x86 client, ubiquitous on server. ARM NEON is the parallel story for Switch, Switch 2, every modern Android phone, Apple Silicon, and Windows-on-Arm; §15 is the bridge.

What's the difference between SSE, AVX, AVX2, AVX-512, FMA, BMI?

Each name is a CPUID feature flag. The compiler enables instructions for each independently.

SSE / SSE2 / SSE3 / SSSE3 / SSE4.1 / SSE4.2 are the 128-bit XMM extensions. SSE2 is the AMD64 baseline; SSE4.2 is the rough baseline for any chip from 2010.

AVX is the 256-bit float SIMD and the VEX three-operand encoding. AVX2 is the 256-bit integer SIMD and gather instructions. FMA (specifically FMA3) is fused multiply-add. Haswell (2013) and later have all three; the PS5 / Xbox Series X/S have all three.

AVX-512 is the 512-bit ZMM SIMD and the mask-register predication model, sub-divided into many feature flags (AVX-512F is the foundation; AVX-512BW, DQ, VL, CD, VNNI, GFNI, VBMI2 each add specific instructions). Server-mainly on client Intel through 2024; AMD Zen 4 / Zen 5 across the lineup.

BMI1 / BMI2 are not SIMD but ship in the same generation: bit-manipulation instructions like blsr, tzcnt, pdep, pext. The compiler emits them under -mbmi2 or -march=haswell and later.

03The lane model: how a SIMD register actually works

A SIMD register holds several values of the same type, in fixed positions called lanes. One instruction operates on every lane at the same time. The instruction's mnemonic encodes the lane element format: ps for packed single-precision floats, pd for packed doubles, b/w/d/q for packed 8 / 16 / 32 / 64-bit integers. The scalar variants are ss (scalar single) and sd (scalar double); they touch only the low lane and leave the rest unchanged^[9].

The same physical register file appears at three widths, and each width is a strict prefix of the next:

Register	Width	Lanes (float)	Lanes (double)	Lanes (int32)	Lanes (byte)
`xmm0`–`xmm15`	128 bits	4	2	4	16
`ymm0`–`ymm15`	256 bits	8	4	8	32
`zmm0`–`zmm31`	512 bits	16	8	16	64

The lower 128 bits of ymm0 is xmm0. The lower 256 bits of zmm0 is ymm0. AVX-512-capable cores extend the architectural register count to thirty-two (xmm0–xmm31, etc.); pre-AVX-512 cores see sixteen^[2].

The widget below animates one instruction at the three widths. vaddps on a YMM register adds eight pairs of single-precision floats in one μop; on a ZMM register it adds sixteen. The animation paces the lanes in series for readability, but the real hardware retires all of them in the same cycle:

Two facts about the lane model that beginners get wrong, and that the rest of this tutorial depends on:

The CPU doesn't care what's in the upper lanes when it executes a scalar op. vaddss xmm0, xmm0, xmm1 adds the low single-precision float; the upper three lanes of xmm0 come along unchanged (the VEX-encoded form merges from a third operand; legacy SSE merges from the destination). On Skylake the latency and throughput of a scalar addss are identical to a packed addps, because the hardware computes all four lanes either way and writes only the low one^[1]. Scalar SSE math is not "cheaper" than packed SSE math; it is the same μop with three lanes wasted.
Lane index is positional, not by name. A 256-bit YMM register is not a struct with named fields. Whether lane 0 holds x, lane 1 holds y, etc. is your convention. Most SoA code dedicates a whole register to one component (a register full of x values from many particles), not to one record with four components. That is the deepest difference between vector and scalar programming.

The intrinsic-level vocabulary, since the compiler hides the register names behind type-tagged values^[9]:

simd_types.cpp · the intrinsic types and their loads

#include <immintrin.h>   // pulls in every Intel intrinsic header

// SSE / SSE2: 128-bit registers. Type tag carries the element format.
__m128   vec4Floats;   // 4 × float
__m128d  vec2Doubles;  // 2 × double
__m128i  vec16Bytes;   // 16 × int8, or 8 × int16, or 4 × int32, or 2 × int64

// AVX / AVX2: 256-bit registers.
__m256   vec8Floats;
__m256d  vec4Doubles;
__m256i  vec32Bytes;

// AVX-512: 512-bit registers.
__m512   vec16Floats;
__m512d  vec8Doubles;
__m512i  vec64Bytes;

// Mask register (AVX-512). One bit per lane; eight for __m512d, sixteen for __m512.
__mmask8  doubleMask;
__mmask16 floatMask;

// Load from memory. _ps = packed single, _pd = packed double, _si* = integer.
float* sourcePointer = ...;
__m256 loaded = _mm256_loadu_ps(sourcePointer);   // unaligned load: any address
__m256 loadedAligned = _mm256_load_ps(sourcePointer);  // requires 32-byte aligned source
_mm256_storeu_ps(destinationPointer, loaded);     // unaligned store

Aligned and unaligned loads have identical performance on every Intel core from Nehalem (2008) onward when the address actually is aligned; the difference only surfaces when the load crosses a cache-line boundary, which adds a few cycles regardless of which mnemonic was used^[10]. Modern code uses _mm256_loadu_ps by default and accepts the rare cache-line crossing as the cost of not having to manually align every buffer.

04AoS, SoA, and AoSoA: the layout decision that dominates everything

A scalar Particle in an engine is often laid out as:

particle_aos.cpp · the natural object layout

struct Particle {     // 32 bytes per particle
  float positionX, positionY, positionZ;
  float velocityX, velocityY, velocityZ;
  float lifetime;
  float sizePixels;
};

Particle particles[100000];

// Update step. Scalar, one particle at a time.
void tickScalar(float deltaTime) {
  for (int i = 0; i < 100000; ++i) {
    particles[i].positionX += particles[i].velocityX * deltaTime;
    particles[i].positionY += particles[i].velocityY * deltaTime;
    particles[i].positionZ += particles[i].velocityZ * deltaTime;
    particles[i].lifetime  -= deltaTime;
  }
}

This is : one record at a time, every field of one particle contiguous. It is what a C++ programmer reaches for by default. It is also nearly unvectorizable in its natural form. To produce one SIMD register of positionX values, the CPU has to read positionX from particle 0, skip 28 bytes, read positionX from particle 1, skip 28 bytes, and so on. A 32-byte vector load picks up one usable element (positionX of particle 0) and seven that have to be shuffled into place or discarded. The remaining 28 bytes pulled into the L1 line are velocity, lifetime, size, and the start of the next particle, none of which the inner loop wants this iteration.

The cure is to transpose:

particle_soa.cpp · structure of arrays

struct ParticleSoa {
  float* positionX;   // N floats, contiguous
  float* positionY;
  float* positionZ;
  float* velocityX;
  float* velocityY;
  float* velocityZ;
  float* lifetime;
  float* sizePixels;
  int    particleCount;
};

// Update step, AVX2. Processes 8 particles per loop iteration.
void tickVector(ParticleSoa& particles, float deltaTime) {
  __m256 deltaTimeVector = _mm256_set1_ps(deltaTime);  // broadcast scalar to all 8 lanes
  for (int i = 0; i < particles.particleCount; i += 8) {
    // One load per stream, one FMA per stream. Eight particles updated per iteration.
    __m256 posX = _mm256_loadu_ps(particles.positionX + i);
    __m256 velX = _mm256_loadu_ps(particles.velocityX + i);
    posX = _mm256_fmadd_ps(velX, deltaTimeVector, posX);          // posX += velX * dt
    _mm256_storeu_ps(particles.positionX + i, posX);

    __m256 posY = _mm256_loadu_ps(particles.positionY + i);
    __m256 velY = _mm256_loadu_ps(particles.velocityY + i);
    posY = _mm256_fmadd_ps(velY, deltaTimeVector, posY);
    _mm256_storeu_ps(particles.positionY + i, posY);

    __m256 posZ = _mm256_loadu_ps(particles.positionZ + i);
    __m256 velZ = _mm256_loadu_ps(particles.velocityZ + i);
    posZ = _mm256_fmadd_ps(velZ, deltaTimeVector, posZ);
    _mm256_storeu_ps(particles.positionZ + i, posZ);

    __m256 lifeRemaining = _mm256_loadu_ps(particles.lifetime + i);
    lifeRemaining = _mm256_sub_ps(lifeRemaining, deltaTimeVector);
    _mm256_storeu_ps(particles.lifetime + i, lifeRemaining);
  }
}

Same arithmetic, eight particles per iteration. Every load pulls in eight useful floats per cache half-line. Every FMA does eight floating-point operations. The cache traffic and the compute both scale with the actual work the simulator wants done.

The widget below shows what each load actually picks up. Toggle between AoS and SoA; in AoS the 32-byte load fetches one Particle worth of mixed fields and only one of them is the field you wanted; in SoA the same load fetches eight consecutive positionX values. The animation moves a cursor over the array; the highlighted cells are the bytes a single vmovups brings in:

The cache cost is the gap between "bytes loaded" and "bytes used" each iteration. In the AoS form, 32 bytes of L1 bandwidth produces 4 bytes of useful positionX (the other 28 are velocity, size, lifetime, fields the inner loop will load again next iteration). In SoA, all 32 bytes feed the inner loop. The DRAM-to-L1 read traffic for the same arithmetic is roughly 8× lower in SoA, which is why the layout switch is usually a bigger win than the SIMD switch itself.

The third option, used widely in shipping engines, is : array of "blocks" each holding SIMD-width-worth of records in SoA form. A block of 8 particles' xs, then 8 ys, then 8 zs, then move on to the next 8 particles. This keeps the lane-friendly access pattern inside a block, while giving up a little of the long-stream locality. Unity's Burst-compiled jobs^[11], Unreal's FVector4-based math^[12], and Insomniac's data-oriented design talks^[13] all describe variants of this layout. The common case in engine ECS frameworks is SoA inside a chunk, AoSoA across chunks; the chunk is the unit of streaming and the SoA is the unit of vectorization.

Three rules to remember when picking a layout:

SoA pays where a kernel touches few fields of many records. The particle tick uses position and velocity; the cull pass uses just position and boundsRadius. SoA wins decisively here because the unused fields don't get pulled into the cache at all.
AoS pays where a kernel touches many fields of few records. Inserting an item into a sorted container, dereferencing a single picked entity for inspection, serializing one record to disk. Don't fight the layout for these.
The decision is per-system, not per-codebase. A renderer that has one cold pass over many entities and one hot pass over a few selected entities reasonably stores both: a SoA "transform pool" feeding the renderer, with a thin AoS handle for game logic. The translation lives at the system boundary, runs once per frame, and is cheap if the conversion itself is SIMD-vectorized^[13].

05Alignment, loads, and the cache-line crossing

Three load instructions cover almost all SIMD memory access^[9]:

_mm256_loadu_ps (vmovups in the disassembly): load eight floats from any address. No alignment requirement.
_mm256_load_ps (vmovaps): load eight floats from a 32-byte-aligned address. Faults if the address isn't aligned; the segfault is a useful guarantee that the layout invariant is being respected, but on AVX-encoded hardware the two forms have identical performance when the address actually is aligned^[10].
_mm256_stream_ps (vmovntps): a non-temporal store that bypasses the cache. Used when writing a large block that you know won't be read again before it's evicted (frame buffer composition, large texture uploads, particle render-output buffers). Wrong by default; right occasionally.

The performance cost that does show up in practice is the cache-line crossing. An x86 L1 cache line is 64 bytes. A 32-byte load that starts at offset 48 within a line straddles the boundary and triggers a second L1 access. On Skylake-class cores a split-line load occupies a load port for two accesses instead of one, effectively halving the load throughput for that μop, and adds several cycles of extra latency^[10]. On a hot loop that issues two loads per cycle, a stream of split-line loads cuts inner-loop throughput nearly in half. Sunny Cove (Ice Lake client, 2019) and later have reduced the cost further but not eliminated it. The penalty is the same for vmovaps on a deliberately-misaligned address (which then faults) and vmovups (which doesn't): the cost is the line crossing itself, not the mnemonic.

Tools to control alignment in C++^[14]:

alignment.cpp · the toolkit

// 1) Type-level alignment: every variable of this type gets the wider alignment.
struct alignas(32) ParticleBlock {
  float positionX[8];
  float positionY[8];
  float positionZ[8];
};

// 2) Variable-level: a single declaration gets the wider alignment.
alignas(64) float buffer[1024];

// 3) Heap allocation: ordinary malloc/new only promises alignof(max_align_t),
//    which is 16 bytes on x86-64 (System V and Windows). For wider alignment,
//    use the aligned allocator. C++17 added the over-aligned new operator.
float* page = static_cast<float*>(std::aligned_alloc(32, 1024 * sizeof(float)));
// or, equivalently in C++17 with operator new:
float* page2 = new(std::align_val_t{32}) float[1024];

// 4) Telling the compiler an existing pointer is aligned. Compiler-specific:
//    GCC/Clang: __builtin_assume_aligned. MSVC: __assume(ptr % 32 == 0).
float* alignedPointer = static_cast<float*>(__builtin_assume_aligned(rawPointer, 32));
// Now the autovectorizer can emit vmovaps and skip the runtime alignment check.

Three practical alignment rules:

Align your big arrays to the SIMD width you target. 32 bytes for AVX/AVX2, 64 for AVX-512. Use alignas on the type or the over-aligned new. Don't hand-pad with extra bytes; the compiler will silently pick a different layout if you change the struct, and the manual padding becomes a bug.
Use loadu by default. The cost on aligned data is zero (the two mnemonics emit identical μops). The cost on misaligned data is the cache-line-crossing penalty, paid only on the loads that actually cross a line, not on every loadu. Reach for load only when the alignment is an enforced invariant the compiler can use for codegen reasoning, or when you want the segfault as a runtime alignment check.
Pad your arrays to a multiple of the SIMD width. A loop over a 503-element array running 8 floats at a time has to handle the last 7 elements specially: either a scalar tail, an AVX-512 masked load, or a deliberate over-read into a padded buffer. Padding the array to 504 (a multiple of 8) and zeroing the tail is the cheapest of the three, at the cost of a few bytes of extra memory.

What's a "split-line load" penalty, and why don't I hear about it on modern hardware?

On pre-Nehalem Intel and pre-Bulldozer AMD (so, anything before about 2008), an unaligned SSE load that crossed a 16-byte boundary stalled the pipeline for many cycles. The folklore around "always use movaps, never movups" comes from that era.

Nehalem (2008) made movups as fast as movaps when the address actually is aligned: the two mnemonics retire identical μops on aligned data. Every Intel and AMD core since has carried that improvement. The remaining "split-line" cost on modern hardware is the line-crossing itself (a second L1 access plus a few cycles of latency), and it applies to movups and movaps equally. The compiler will sometimes still prefer movaps for the segfault-on-misaligned property, which catches alignment bugs at the load site rather than letting them silently corrupt later iterations.

One real cliff: a load that crosses a 4 KB page boundary on most Intel cores triggers a TLB miss for the second page and can cost an extra 10–20 cycles. Pad your hot buffers so the data range doesn't straddle a 4 KB boundary; this matters more than 16- or 32-byte alignment in modern code^[10].

06Your first kernel: dot product of two arrays

A reduction that turns up in renderer, physics, and ML code: given two float arrays of length N, compute their dot product. The scalar form is one multiply-add per element:

dot_scalar.cpp

float dotScalar(const float* aArray, const float* bArray, int elementCount) {
  float accumulator = 0.0f;
  for (int i = 0; i < elementCount; ++i) {
    accumulator += aArray[i] * bArray[i];
  }
  return accumulator;
}

The natural AVX2 form processes eight elements per iteration, accumulating into a single 256-bit vector and reducing to a scalar at the end:

dot_avx2_naive.cpp · one accumulator

float dotAvx2Naive(const float* aArray, const float* bArray, int elementCount) {
  __m256 accumulator = _mm256_setzero_ps();
  int i = 0;
  for (; i + 8 <= elementCount; i += 8) {
    __m256 aVector = _mm256_loadu_ps(aArray + i);
    __m256 bVector = _mm256_loadu_ps(bArray + i);
    // Fused multiply-add: accumulator += aVector * bVector, one rounding step.
    accumulator = _mm256_fmadd_ps(aVector, bVector, accumulator);
  }
  // Horizontal sum of the 8 lanes in `accumulator`. Reduce-and-shuffle tree.
  __m128 lowHalf  = _mm256_castps256_ps128(accumulator);            // lanes 0..3
  __m128 highHalf = _mm256_extractf128_ps(accumulator, 1);          // lanes 4..7
  __m128 sum128 = _mm_add_ps(lowHalf, highHalf);                    // 4 sums
  sum128 = _mm_hadd_ps(sum128, sum128);                             // 2 sums in low two lanes
  sum128 = _mm_hadd_ps(sum128, sum128);                             // 1 sum in low lane
  float totalSum = _mm_cvtss_f32(sum128);

  // Scalar tail for the last 0..7 elements that didn't fit a vector iteration.
  for (; i < elementCount; ++i) totalSum += aArray[i] * bArray[i];
  return totalSum;
}

Three working parts here, each of which is a SIMD pattern you'll reuse:

The main loop. One FMA per iteration. The compiler loads both inputs from memory in the FMA's memory-operand form on most uarchs, so the issued μop count is two loads plus one FMA per eight elements^[1].
The horizontal sum. Reducing an 8-lane vector to one scalar with a shuffle-and-add tree. vhaddps is the SSE3 horizontal-add instruction; it's used twice to fold four lanes into one. The widget in §7 visualizes this.
The scalar tail. Loop iterations that don't fit a full vector. For elements past the last i + 8 <= elementCount boundary, fall back to scalar. AVX-512 has an alternative using a masked load (§8) that handles the tail without a separate loop.

On Skylake this kernel runs roughly one iteration every 4–5 cycles in steady state, capped by the loop-carried dependency on accumulator. vfmadd231ps has a 4-cycle latency on Skylake; each iteration's FMA depends on the previous, so the loop is latency-bound, not throughput-bound^[1]. That is the same trap from §1: there are two FMA ports sitting idle while the single dependency chain serializes.

The fix is in §7: use multiple accumulators to break the chain.

Why a shuffle-and-add tree instead of just summing each lane scalar-style?

Extracting each of the eight lanes individually and adding them in scalar would be eight extracts and seven adds, all serially dependent: vpextrd-style extracts at 3-cycle latency plus seven addsses at 4-cycle latency, on the order of 30 cycles minimum. The shuffle-and-add tree finishes in log₂(8) = 3 rounds; each round costs a shuffle (1 cycle) plus an add (4 cycles), so about 15 cycles total for the horizontal sum. The full kernel including the horizontal sum amortizes that one-time cost over the whole main loop.

vhaddps is one of the few horizontal SIMD instructions (it adds adjacent lanes within a register). On Skylake the YMM form is 6-cycle latency, 2-cycle reciprocal throughput, decoded as three μops^[1]. Fine when it's used twice at the end of a reduction; bad inside an inner loop, which is one of the reasons hot SIMD code uses a manual vshufps + vaddps tree instead.

07Breaking the dependency chain

The bottleneck in the §6 kernel is that every iteration's FMA depends on the previous iteration's accumulator. The fix is to keep several independent accumulators and sum them at the end. How many you need is set by Skylake's FMA pipeline: 4-cycle latency, two-per-cycle throughput. Little's law gives the number of independent FMAs that have to be in flight at once to keep both ports busy: latency × throughput = 4 × 2 = 8. With eight independent accumulator chains, every cycle both FMA ports issue a new FMA whose inputs are already ready, and the loop runs at the throughput ceiling instead of the latency floor^[1].

dot_avx2_unrolled.cpp · eight independent accumulators

float dotAvx2Unrolled(const float* aArray, const float* bArray, int elementCount) {
  // Eight independent partial sums. Each is its own dependency chain;
  // with FMA latency 4 and throughput 2, eight in-flight FMAs saturate the ports.
  __m256 sum0 = _mm256_setzero_ps(), sum1 = _mm256_setzero_ps();
  __m256 sum2 = _mm256_setzero_ps(), sum3 = _mm256_setzero_ps();
  __m256 sum4 = _mm256_setzero_ps(), sum5 = _mm256_setzero_ps();
  __m256 sum6 = _mm256_setzero_ps(), sum7 = _mm256_setzero_ps();

  int i = 0;
  for (; i + 64 <= elementCount; i += 64) {
    // 8 independent FMAs per iteration; the CPU schedules across both FMA ports.
    sum0 = _mm256_fmadd_ps(_mm256_loadu_ps(aArray + i +  0), _mm256_loadu_ps(bArray + i +  0), sum0);
    sum1 = _mm256_fmadd_ps(_mm256_loadu_ps(aArray + i +  8), _mm256_loadu_ps(bArray + i +  8), sum1);
    sum2 = _mm256_fmadd_ps(_mm256_loadu_ps(aArray + i + 16), _mm256_loadu_ps(bArray + i + 16), sum2);
    sum3 = _mm256_fmadd_ps(_mm256_loadu_ps(aArray + i + 24), _mm256_loadu_ps(bArray + i + 24), sum3);
    sum4 = _mm256_fmadd_ps(_mm256_loadu_ps(aArray + i + 32), _mm256_loadu_ps(bArray + i + 32), sum4);
    sum5 = _mm256_fmadd_ps(_mm256_loadu_ps(aArray + i + 40), _mm256_loadu_ps(bArray + i + 40), sum5);
    sum6 = _mm256_fmadd_ps(_mm256_loadu_ps(aArray + i + 48), _mm256_loadu_ps(bArray + i + 48), sum6);
    sum7 = _mm256_fmadd_ps(_mm256_loadu_ps(aArray + i + 56), _mm256_loadu_ps(bArray + i + 56), sum7);
  }
  // Fold the eight accumulators together in pairs, then a horizontal sum.
  __m256 fold01 = _mm256_add_ps(sum0, sum1);
  __m256 fold23 = _mm256_add_ps(sum2, sum3);
  __m256 fold45 = _mm256_add_ps(sum4, sum5);
  __m256 fold67 = _mm256_add_ps(sum6, sum7);
  __m256 fold0123 = _mm256_add_ps(fold01, fold23);
  __m256 fold4567 = _mm256_add_ps(fold45, fold67);
  __m256 totalVector = _mm256_add_ps(fold0123, fold4567);

  __m128 sum128 = _mm_add_ps(_mm256_castps256_ps128(totalVector),
                             _mm256_extractf128_ps(totalVector, 1));
  sum128 = _mm_hadd_ps(sum128, sum128);
  sum128 = _mm_hadd_ps(sum128, sum128);
  float totalSum = _mm_cvtss_f32(sum128);
  for (; i < elementCount; ++i) totalSum += aArray[i] * bArray[i];
  return totalSum;
}

Each iteration now issues eight independent FMAs. The CPU's scheduler dispatches two of them per cycle on ports 0 and 1; after four cycles of warm-up, both FMA ports issue every cycle and the loop runs at two FMAs per cycle, which is 16 floats reduced per cycle. Against the single-accumulator version at 0.25 FMAs per cycle (one FMA per 4-cycle latency window), the speedup is 8×. Same algorithm, same arithmetic count, an order of magnitude faster on long arrays.

Why eight, and what about fewer? Two or four accumulators help (the loop becomes partially unrolled and some of the latency is hidden), but they don't fully saturate the FMA ports. Four accumulators leave the loop running at one FMA per cycle, or half the throughput peak; many production kernels stop at four because the marginal gain past 50% port saturation is usually swamped by L1 load bandwidth, retirement throughput, or the surrounding loop overhead. Sixteen accumulators rarely help on Skylake because the register file is only sixteen YMMs wide and the extra accumulators start to spill. On Ice Lake / Sunny Cove the FMA latency and throughput are the same, so the same eight-accumulator rule applies. On AMD Zen 4 (256-bit YMM FMAs at similar 4-cycle latency, 2-per-cycle throughput), the rule is the same again^[6].

The widget runs both kernels on a 1024-element dot product and shows where the cycle budget actually goes:

cycles (modelled)

···

FMAs per cycle

···

speedup vs single

···

A simplified Skylake model: vfmadd231ps ymm has 4-cycle latency, two-per-cycle throughput on ports 0 and 1^[1]. The single-accumulator loop is latency-bound at one FMA per 4 cycles (0.25 FMA/cycle). The eight-accumulator loop hits the throughput ceiling: two FMAs per cycle, 16 floats reduced per cycle, an 8× speedup. The horizontal-sum step at the end is the same in both; on an array of 1024 floats, it's a one-off cost amortized into the loop body.

-O3 -ffast-math (or specifically -fassociative-math) is what lets the autovectorizer perform this transformation for you. Without it, floating-point addition's non-associativity prevents the compiler from regrouping the sum^[15]. The compiler's autovectorization report (-fopt-info-vec on GCC, -Rpass=loop-vectorize on Clang) tells you when it succeeded.

A caveat: the reordered sum produces a slightly different result than the strict left-to-right sum. The two answers differ in the last few ULPs^[15]; both are within the per-operation rounding bound, but they're not bit-identical. If your engine needs deterministic, cross-platform reproducible results (replays, multiplayer lockstep, deterministic physics), the reordered sum can be a problem. The fix is to fix the reduction tree shape (always the same tree, regardless of array length) and never let the compiler re-associate, which means leaving -fassociative-math off and writing the SIMD form explicitly.

08Masking and predication: conditionals without branches

A pre-AVX-512 vector ALU has no conditional execution. vaddps ymm0, ymm1, ymm2 operates on every lane; there is no "operate on lanes where some condition holds." When the source has a lane-divergent branch, the SIMD form has to do the work for both sides and blend the result, with the blend driven by a comparison-produced mask. The compiler does it, by hand it looks like this:

clamp_avx2.cpp · clamp(x, 0, 1) eight at a time, no branches

// Clamp every element of `inputArray` to [0, 1]. Scalar version branches twice per element;
// the vector version computes both branches' results and selects with mask-blends.
void clampZeroOne(float* inputArray, int elementCount) {
  __m256 zeroVector = _mm256_setzero_ps();
  __m256 oneVector  = _mm256_set1_ps(1.0f);
  for (int i = 0; i < elementCount; i += 8) {
    __m256 inputVector  = _mm256_loadu_ps(inputArray + i);
    // min(input, 1) then max(result, 0). Two SIMD min/max ops, no branches.
    __m256 belowOne  = _mm256_min_ps(inputVector, oneVector);
    __m256 inRange   = _mm256_max_ps(belowOne, zeroVector);
    _mm256_storeu_ps(inputArray + i, inRange);
  }
}

vminps and vmaxps are the cleanest case of vectorized branching: there's a hardware instruction for the exact lane-wise min and max, so no comparison-and-blend is needed. The general case (a predicate that isn't a comparison the SIMD ISA has a direct op for) uses a comparison followed by a blend:

condition_avx2.cpp · branchless lerp-or-skip

// Apply an alpha-blended override to every element where `mask[i]` is non-zero,
// keep the original value otherwise. In scalar, this is one branch per element.
void conditionalLerp(float* destination, const float* overrideValue,
                       const int* maskArray, int elementCount, float alpha) {
  __m256 alphaVector = _mm256_set1_ps(alpha);
  __m256 oneMinusAlpha = _mm256_set1_ps(1.0f - alpha);
  for (int i = 0; i < elementCount; i += 8) {
    __m256i predicateBits   = _mm256_loadu_si256((__m256i*)(maskArray + i));
    // Compare each lane against zero; produces all-ones in the lanes where mask[i] != 0.
    __m256 predicateLanes = _mm256_castsi256_ps(
      _mm256_cmpgt_epi32(predicateBits, _mm256_setzero_si256()));

    __m256 originalLanes  = _mm256_loadu_ps(destination + i);
    __m256 overrideLanes  = _mm256_loadu_ps(overrideValue + i);
    __m256 blendedLanes   = _mm256_fmadd_ps(originalLanes, oneMinusAlpha,
                                             _mm256_mul_ps(overrideLanes, alphaVector));
    // Pick blended where predicate is true, original where predicate is false.
    __m256 selectedLanes  = _mm256_blendv_ps(originalLanes, blendedLanes, predicateLanes);
    _mm256_storeu_ps(destination + i, selectedLanes);
  }
}

Two SIMD primitives here:

vcmpgt (and its eq/lt/ge variants) produces a mask of all-ones in lanes where the predicate holds, all-zeros where it doesn't^[9]. In SSE/AVX, the mask occupies a regular vector register; in AVX-512 it lands in a separate K mask register (more on that below).
vblendvps selects between two vectors lane-by-lane based on the sign bit of a third vector. A floating-point version of "if (predicate) take A else take B." Internally a one-μop instruction on Skylake and Zen^[1].

This is doing strictly more work than the scalar version: every lane computes the blended result even when the predicate is false, then the blend throws those results away. The win comes from never paying a branch-mispredict, and from processing eight elements per iteration. The tradeoff is identical to the branchless-vs-branchy story from the Assembly tutorial: it's a win when the predicate is data-dependent and the predictor can't learn it.

AVX-512: first-class lane masking

The story changes substantially on AVX-512. Every arithmetic instruction has a masked variant that operates only on lanes where a mask register is set; lanes where the mask is clear either keep their old value (merge-masking) or get zeroed (zero-masking)^[7]. There are eight mask registers, K0 through K7. The EVEX encoding has a 3-bit mask-selector field; the value 0 in that field is reserved to mean "no mask" (every lane active), which makes K0 unusable as an actual writemask. K0 still exists as a general mask register for KMOV, KAND, and the other mask-manipulation instructions; you just can't name it as the writemask on a masked vector op^[35].

clamp_avx512.cpp · the same kernel with native masking

void clampZeroOneAvx512(float* inputArray, int elementCount) {
  __m512 zeroVector = _mm512_setzero_ps();
  __m512 oneVector  = _mm512_set1_ps(1.0f);

  int i = 0;
  for (; i + 16 <= elementCount; i += 16) {
    __m512 inputVector = _mm512_loadu_ps(inputArray + i);
    __m512 inRange = _mm512_min_ps(_mm512_max_ps(inputVector, zeroVector), oneVector);
    _mm512_storeu_ps(inputArray + i, inRange);
  }

  // Tail: no scalar loop needed. A mask of (1 << remaining) - 1 enables only the live lanes.
  if (i < elementCount) {
    __mmask16 tailMask = (__mmask16)((1u << (elementCount - i)) - 1u);
    __m512 inputVector = _mm512_maskz_loadu_ps(tailMask, inputArray + i);   // inactive lanes read as 0
    __m512 inRange     = _mm512_min_ps(_mm512_max_ps(inputVector, zeroVector), oneVector);
    _mm512_mask_storeu_ps(inputArray + i, tailMask, inRange);                // store only active lanes
  }
}

Three things AVX-512's mask model makes cheap that pre-AVX-512 code paid for explicitly:

Tail handling. The masked load and store make the scalar tail loop go away. Element counts that aren't multiples of the SIMD width stop being a special case.
Sparse updates. "Update only the entities where isActive" becomes a single masked store; lanes whose isActive bit is zero don't write to memory, so there is no need for the both-sides-and-blend dance.
Conditional FMA. vfmadd231ps zmm0 {k1}, zmm1, zmm2 updates only the lanes where k1 is set; the other lanes keep their old value. This is a true predicated SIMD instruction, the way ARM SVE works natively.

The widget below shows a masked FMA across sixteen lanes. The mask register is shown as a row of 0/1 bits above the operation; lanes where the bit is 1 update, lanes where it's 0 keep their old value:

Merge-masking is the default: lanes where the mask is 0 retain their old destination value. The zero-masking variant (intrinsic suffix _maskz_) instead writes 0 in inactive lanes. Both are one μop on Skylake-X and later; the mask register is consumed at the same point in the pipeline as the source vectors and adds no latency^[1].

09Shuffles, blends, broadcasts: rearranging the lanes

Half the time a SIMD kernel doesn't quite work, the fix is a shuffle. Half the time a SIMD kernel is two times slower than expected, the cause is a shuffle. The instruction family that rearranges lanes within a register (or across two registers) is the SIMD glue that bridges scalar control flow and packed data flow:

Mnemonic	What it does	Use it for
`vbroadcastss`	Replicate one float across every lane.	Splat a scalar into a vector for an FMA: "every lane gets the same `deltaTime`."
`vshufps`	Pick four output lanes from two input registers' lanes, controlled by an 8-bit immediate.	Within-128-bit rearrangement; transpose helpers; broadcast one component of a vec4.
`vpermps`	Eight output lanes, each picked from any of the eight input lanes via an index vector.	Arbitrary lane permutation within a 256-bit register; AVX2 only.
`vpshufb`	Sixteen output bytes (SSE) or thirty-two (AVX2), each picked from any of the input's 16 bytes via an index byte.	16-entry parallel lookup table per lane; UTF-8 validation; hex encoding; small case-table dispatch^[4].
`vblendps` / `vblendvps`	Pick lanes from two sources based on an immediate or a per-lane mask vector.	Branchless conditional select (§8); merging two halves of a reduction.
`vmovmskps` / `vpmovmskb`	Extract the sign bits of each lane into a scalar integer.	"Branch on whether any lane satisfied the predicate"; convert a SIMD comparison into a scalar test.
`vperm2f128`	Swap or duplicate the two 128-bit halves of a 256-bit register.	Crossing the 128-bit lane barrier; merging two halves of a reduction.

The instruction that turned out to be the single most useful SIMD primitive for non-arithmetic work is pshufb^[4]. It performs a parallel byte-granularity lookup: for each output byte position i, the output is table[indices[i] & 0x0F], where table is the first operand (a 16-byte table) and indices is the second. One instruction, sixteen 16-entry lookups in parallel. Used in shipping code for UTF-8 validation (Lemire & Keiser 2020^[16]), base64 encode/decode (Muła & Lemire 2018^[17]), hex-decoding, ASCII case conversion, and dozens of other workloads.

The widget animates one pshufb execution. Pick a table and an index pattern; the output bytes light up as the parallel lookup completes:

A single vpshufb xmm0, xmm1, xmm2 instruction. Sixteen lookups, all in parallel, in one μop with 1-cycle latency on Skylake and Zen^[1]. The 256-bit AVX2 form (vpshufb ymm) operates on the two 128-bit halves independently; indices in the upper half can only address the upper half's table. The 512-bit AVX-512 variant (vpshufb zmm) is the same, four-way independent. AVX-512_VBMI's vpermb instruction is the cross-lane version: each output byte can come from any of the 64 input bytes. VBMI is the Vector Byte Manipulation extension, introduced on Cannon Lake (2018) and shipped broadly from Ice Lake (2019).

Two reliable rules about shuffles:

Within-128-bit-lane shuffles are roughly free; cross-lane shuffles aren't. AVX's 256-bit register is two 128-bit "lanes" stitched together. vshufps, vpshufb, vpshufd all operate on each 128-bit lane independently. vpermps and vperm2f128 cross the lane boundary; on Skylake they cost 3-cycle latency and one per cycle throughput, against 1-cycle latency for the within-lane shuffles^[1]. The fastest 8-element transpose is the one that avoids cross-lane shuffles where possible.
Movmask is the bridge from vector to scalar. When you need to branch on whether any (or every) lane satisfied a predicate, vmovmskps extracts the sign bits into a 32-bit integer; tzcnt or popcnt on that integer answers "which lane was the first match" or "how many matched." This is the pattern for early-exit SIMD search loops: SIMD-compare a 256-bit chunk, extract the mask, branch on it^[4].

10Gather and scatter: when you have to load from N pointers

A vector load assumes contiguous addresses. A gather instruction (AVX2 and later) takes a base pointer and a vector of indices, and loads one element per lane from base[indices[lane]]. The hardware does N independent loads internally and assembles the result:

gather.cpp · per-vertex skinning indirection

// 8 vertices, each with a bone-index. Look up the 8 bone matrices' first entries.
float bonePositionX[MAX_BONES];        // SoA bone data, one float per bone
int boneIndices[8] = {3, 17, 3, 8, 42, 8, 17, 5};

__m256i indicesVector = _mm256_loadu_si256((__m256i*)boneIndices);
// Gather: each lane loads bonePositionX[indicesVector[lane]], 4-byte stride.
__m256 boneXVector = _mm256_i32gather_ps(bonePositionX, indicesVector, 4);

Gather is the only way to vectorize random-access patterns: skinning with per-vertex bone indices, lookup-table-driven shaders, ECS systems where the components for a job aren't laid out in the order the job processes entities. The cost picture is the part to understand.

Gather has both a latency and a throughput number, and they are very different. On Haswell (2013, the first AVX2 part), a vpgatherdd ymm (8 i32 lanes) had a result latency of around 22 cycles and a reciprocal throughput of about 11 cycles^[1]. That is, a back-to-back stream of gathers issues a new one every ~11 cycles. The instruction shipped on Haswell as a future-proofing measure rather than a performance win; eight scalar loads with software pipelining frequently beat a single Haswell gather. Skylake dropped the reciprocal throughput to roughly 5 cycles per gather while keeping the latency near 22 cycles. Ice Lake improved both numbers; Sapphire Rapids further. AMD Zen historically had slower gathers than Intel; Zen 4 narrowed the gap. In 2018 the Spectre / L1 Terminal Fault microcode updates briefly regressed gather performance on several generations.

A useful rule of thumb: gather beats scalar loads when the loaded data is already L1-resident and the indices have few duplicates. It loses when the indices hit cold lines (every duplicate becomes a separate cache miss in the underlying hardware) or when the indices are sequential (in which case a normal vector load is much faster).

Scatter (the reverse: write each lane to base[indices[lane]]) is AVX-512 only. It is rarer than gather in shipping code because its failure modes are sharp: if two lanes name the same destination, the result is implementation-defined for ordering; the hardware has to detect conflict, which on early implementations was slow. vpscatterdd's throughput on Skylake-X was on the order of one element per cycle for 16 lanes; Sapphire Rapids improved this, but scatter remains the slowest commonly-used AVX-512 primitive^[1].

The pragmatic answer in most engine code is to avoid gather and scatter by restructuring the data:

Sort by bone index before skinning; vertices with the same bone now load contiguously.
Sort by component-id in an ECS so the inner loop touches one chunk's worth of components per iteration.
If indices are stable across many frames, transpose once into a denser layout and reuse it.

When gather is unavoidable, the typical pattern is to gather a small amount of data (a single float per lane), then do many arithmetic ops in vector form before the next gather. This amortizes the gather's cycle cost across a lot of compute, the same way an L1 cache amortizes DRAM latency^[18].

11A worked engine kernel: batched 4×4 matrix-vector transform

Every renderer applies a 4×4 transform to every vertex. The scalar form is sixteen multiplies and twelve adds per vertex. The SIMD form depends on where in the layout dimension the parallelism comes from:

One vertex at a time, 128 bits. Each vec4 is one __m128; the 4×4 matrix-vector product is four scalar broadcasts and four FMAs (the form from the Assembly tutorial's §13). One vertex per kernel call; useful when vertices are needed individually for a UI or a single particle. Throughput-poor on AVX2 because the 256-bit unit is half-empty.
Eight vertices at a time, AoSoA, 256 bits. The data layout the throughput-tuned form needs: 8 vertices laid out as (x[0..7], y[0..7], z[0..7], w[0..7]). One YMM register per component. The transform broadcasts each matrix entry to all eight lanes and FMAs across all eight vertices in parallel.

The 8-vertices-at-once form is the workhorse pattern for character skinning and instance transform stages in shipping engines. The kernel is short:

matvec_batch.cpp · transform 8 vertices by one matrix per AVX2 iteration

// AoSoA block: 8 vertices laid out (x[0..7], y[0..7], z[0..7], w[0..7]).
struct alignas(32) VertexBlock8 {
  float xCoordinates[8];
  float yCoordinates[8];
  float zCoordinates[8];
  float wCoordinates[8];   // usually 1.0f for positions, 0.0f for direction vectors
};

struct Matrix4x4 { float m[16]; };

void transformVertexBlock(const Matrix4x4& transform, VertexBlock8& block) {
  // Load each input component (8 vertices' x in xmm0, etc.). One vmovaps each.
  __m256 xLanes = _mm256_load_ps(block.xCoordinates);
  __m256 yLanes = _mm256_load_ps(block.yCoordinates);
  __m256 zLanes = _mm256_load_ps(block.zCoordinates);
  __m256 wLanes = _mm256_load_ps(block.wCoordinates);

  // Output x = m00*x + m01*y + m02*z + m03*w, applied to all 8 vertices in parallel.
  // Each broadcast (vbroadcastss) replicates one matrix entry across all 8 lanes.
  __m256 outputX = _mm256_mul_ps  (_mm256_set1_ps(transform.m[ 0]), xLanes);
  outputX        = _mm256_fmadd_ps(_mm256_set1_ps(transform.m[ 1]), yLanes, outputX);
  outputX        = _mm256_fmadd_ps(_mm256_set1_ps(transform.m[ 2]), zLanes, outputX);
  outputX        = _mm256_fmadd_ps(_mm256_set1_ps(transform.m[ 3]), wLanes, outputX);

  __m256 outputY = _mm256_mul_ps  (_mm256_set1_ps(transform.m[ 4]), xLanes);
  outputY        = _mm256_fmadd_ps(_mm256_set1_ps(transform.m[ 5]), yLanes, outputY);
  outputY        = _mm256_fmadd_ps(_mm256_set1_ps(transform.m[ 6]), zLanes, outputY);
  outputY        = _mm256_fmadd_ps(_mm256_set1_ps(transform.m[ 7]), wLanes, outputY);

  __m256 outputZ = _mm256_mul_ps  (_mm256_set1_ps(transform.m[ 8]), xLanes);
  outputZ        = _mm256_fmadd_ps(_mm256_set1_ps(transform.m[ 9]), yLanes, outputZ);
  outputZ        = _mm256_fmadd_ps(_mm256_set1_ps(transform.m[10]), zLanes, outputZ);
  outputZ        = _mm256_fmadd_ps(_mm256_set1_ps(transform.m[11]), wLanes, outputZ);

  __m256 outputW = _mm256_mul_ps  (_mm256_set1_ps(transform.m[12]), xLanes);
  outputW        = _mm256_fmadd_ps(_mm256_set1_ps(transform.m[13]), yLanes, outputW);
  outputW        = _mm256_fmadd_ps(_mm256_set1_ps(transform.m[14]), zLanes, outputW);
  outputW        = _mm256_fmadd_ps(_mm256_set1_ps(transform.m[15]), wLanes, outputW);

  _mm256_store_ps(block.xCoordinates, outputX);
  _mm256_store_ps(block.yCoordinates, outputY);
  _mm256_store_ps(block.zCoordinates, outputZ);
  _mm256_store_ps(block.wCoordinates, outputW);
}

The work per vertex is exactly the same as the scalar form: four output components, each a four-FMA dot product against a matrix row. The difference is that every FMA does eight in parallel. The kernel as written has four independent output chains (one per output component), each four ops deep (one multiply + three FMAs). The critical-path latency is roughly 4 ops × 4-cycle FMA latency = 16 cycles. In a tight loop that processes many blocks back-to-back, the throughput limit kicks in instead: 16 FMAs per block / 2 FMAs per cycle = 8 cycles per block in steady state, with the four parallel chains hiding most of the latency. The realistic finish time is therefore in the 10–20 cycles per batch range depending on whether the matrix broadcasts are hoisted out of the loop, the L1 load bandwidth is saturated, and the surrounding code lets the four chains overlap. That works out to 1.2–2.5 cycles per vertex.

What this code intentionally skips. The matrix broadcasts are reloaded every call: a real engine hoists them out of the inner loop (broadcast the matrix into 16 registers once, transform many vertex blocks against it) or uses the FMA's memory-operand form to fuse the broadcast and the load. The horizontal vs vertical batching choice (per-matrix or per-vertex) depends on whether you're skinning (many matrices, few vertices each) or instancing (one matrix, many vertices each); the form above is the instancing form. Skinning's per-vertex bone weighting needs the masked-gather pattern from §10. The actual Unreal FMatrix44d and DirectXMath XMMatrixMultiplyVectorBatched implementations are close cousins of this kernel^[12]^[19].

12A worked engine kernel: 8 spheres against the frustum per iteration

The CPU side of every renderer culls bounding volumes against the camera frustum before submitting draw calls. The scalar form is one sphere-vs-six-planes test per object; the SIMD form does eight at once and lets the predicate pass through the lane-mask machinery from §8.

A frustum plane is the equation nx · x + ny · y + nz · z + d = 0 with n the outward normal and d the signed distance from origin. A point lies outside the plane when nx · px + ny · py + nz · pz + d > 0. For a bounding sphere centered at (px, py, pz) with radius r, the sphere is fully outside the plane when that same dot product is greater than r; it's fully inside when the dot product is less than -r; it straddles otherwise. The standard cull keeps any object that isn't fully outside any of the six planes^[20].

The SoA form has all eight spheres' xs in one register, all eight ys in the next, eight zs, eight radii. One plane test is four FMAs and one compare; six planes is twenty-four FMAs and six compares. Per iteration the kernel processes eight objects in fewer cycles than the scalar version takes to process one.

frustum_cull.cpp · 8 spheres × 6 planes per AVX2 iteration

struct alignas(32) SphereBatch8 {
  float centerX[8], centerY[8], centerZ[8], radius[8];
};

struct FrustumPlane { float normalX, normalY, normalZ, distance; };

// Returns an 8-bit mask: bit i is 1 if sphere i passed (was not culled), 0 if outside.
int cullBatchAgainstFrustum(const SphereBatch8& spheres, const FrustumPlane planes[6]) {
  __m256 centerXVector = _mm256_load_ps(spheres.centerX);
  __m256 centerYVector = _mm256_load_ps(spheres.centerY);
  __m256 centerZVector = _mm256_load_ps(spheres.centerZ);
  __m256 radiusVector  = _mm256_load_ps(spheres.radius);

  // Start with "all 8 visible." We will mask off lanes that fail any plane.
  __m256 visibleMask = _mm256_castsi256_ps(_mm256_set1_epi32(-1));

  for (int p = 0; p < 6; ++p) {
    // Broadcast one plane's coefficients to all 8 lanes.
    __m256 planeNormalX = _mm256_set1_ps(planes[p].normalX);
    __m256 planeNormalY = _mm256_set1_ps(planes[p].normalY);
    __m256 planeNormalZ = _mm256_set1_ps(planes[p].normalZ);
    __m256 planeOffset  = _mm256_set1_ps(planes[p].distance);

    // signedDistance[i] = n·center[i] + d, computed in parallel for 8 spheres.
    __m256 signedDistance = _mm256_fmadd_ps(planeNormalX, centerXVector, planeOffset);
    signedDistance        = _mm256_fmadd_ps(planeNormalY, centerYVector, signedDistance);
    signedDistance        = _mm256_fmadd_ps(planeNormalZ, centerZVector, signedDistance);

    // A sphere is fully outside this plane if signedDistance > radius. Mask off those lanes.
    __m256 outsidePlane = _mm256_cmp_ps(signedDistance, radiusVector, _CMP_GT_OQ);
    visibleMask = _mm256_andnot_ps(outsidePlane, visibleMask);    // clear lanes where outsidePlane
  }
  // Extract the 8 sign bits as a single int; bit i is 1 if sphere i is still visible.
  return _mm256_movemask_ps(visibleMask);
}

Six planes × three FMAs + one compare + one mask-AND ≈ thirty μops per iteration for eight spheres. At Skylake's two-FMA-per-cycle peak the dispatch cost is on the order of 15 cycles per batch of 8, or about 2 cycles per sphere. The scalar form costs 6 plane tests × ~10 cycles each ≈ 60 cycles per sphere with no early-out. The vector form is roughly 30× faster on a balanced workload, more when scalar code adds branch-mispredict overhead, less when scalar code can early-out after one plane culls the object^[1]. The widget runs both:

spheres processed

cycles (modelled)

cycles per sphere

···

Simplified Skylake model: each plane test is one FMA chain plus one compare, modelled as 4 cycles latency × 6 planes / 2 ports ≈ 12 cycles per batch of 8. The scalar form is six planes × ~10 cycles each. Real shipping kernels do a few more things this demo skips: early-out on the first plane that culls all eight lanes (a single vmovmsk + jnz), and a hierarchical cull above the per-object pass^[20].

13Autovectorization, intrinsics, and when each wins

Four ways to get vector code out of a C++ compiler, in roughly decreasing order of how often you should use them:

Plain scalar loops, autovectorized. Write the obvious scalar code, compile with -O3 -march=native, look at the autovectorization report (-fopt-info-vec on GCC, -Rpass=loop-vectorize on Clang). The autovectorizer handles clean maps and reductions on contiguous arrays of trivially-copyable types. When it works, this is the right answer.
Pragmas and attributes that nudge the autovectorizer. #pragma omp simd or #pragma GCC ivdep declare that an inner loop has no loop-carried dependencies and the compiler may vectorize it^[21]. __restrict declares that pointers don't alias. __builtin_assume_aligned gives the compiler an alignment guarantee. These are the highest-leverage tools you have: scalar-readable source, vector-quality output.
Portable SIMD libraries. Google's Highway^[22], the upcoming std::simd (C++26^[23]), xsimd, SIMDe (which header-translates <immintrin.h> calls to NEON or WebAssembly SIMD^[24]). The right answer when you want the SIMD-readable source style of intrinsics but need it to run on x86, ARM, and the web. Unity's Mathematics package and Burst compiler^[11] are a closely related design.
Architecture-specific intrinsics. <immintrin.h> on x86, <arm_neon.h> on ARM. The form used throughout this tutorial. The right answer when the autovectorizer bailed out on the pattern you need, when you have to use an instruction the compiler can't pick from source (a specific pshufb table, an AVX-512 mask predicate), or when you need bit-identical results across architectures and the compiler's autovec choices vary by version.

The mental model: scalar loops are for the parts of the codebase that don't profile hot. Pragmas and portable wrappers are for the parts that do profile hot but aren't the absolute bottleneck. Intrinsics are for the kernels that profile in the top five of frame budget and have stayed there after every other optimization. There's a fourth tier (hand-written assembly) that you reach for in roughly the same situation Unreal reaches for it: about five times per shipping engine, in code that's been stable for years^[12].

A practical heuristic from Mike Acton, restated for SIMD specifically: look at the data and the access pattern first, write the inner loop second, switch to intrinsics third^[13]. Most "the compiler won't autovectorize this" complaints are downstream of a layout problem the compiler isn't allowed to fix.

What does an autovectorization report look like, and how do I read it?

On Clang, -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize prints a one-line note per loop the compiler considered. A successful vectorization looks like:

remark: vectorized loop (vectorization width: 8, interleaved count: 4) [-Rpass=loop-vectorize]

A bailout looks like:

remark: loop not vectorized: cannot identify array bounds [-Rpass-missed=loop-vectorize]

The common failure modes are unknown loop trip count, pointer aliasing, loop-carried dependency, control flow that the compiler can't if-convert, and reduction that the compiler can't reorder under strict IEEE-754 semantics. The fixes map cleanly: __restrict for aliasing, -ffast-math for the reduction, an early-out converted to a mask-and-blend for the control flow, an explicit length parameter for the trip count^[25].

14AVX-512 in 2026: the downclocking story

The reason AVX-512 is treated with suspicion in many game-engine codebases isn't the instruction set; it's the frequency story on Skylake-SP. The lore goes: "AVX-512 instructions drop the CPU's clock, so the speedup is eaten by the clock drop, so don't use them." Like most performance lore, the kernel of truth got frozen at the version of Skylake-SP it was first measured on.

What actually happened, generation by generation^[26]:

Skylake-SP (Xeon Scalable Gen 1, 2017). The chip had license-based frequencies. Light AVX-512 code (no FP) ran at "license 1" (roughly the SSE base clock). Heavy AVX-512 FP code dropped to "license 2," which on a 2.7-GHz part was around 600 MHz below the base clock. The drop applied for at least 2 ms after the last AVX-512 instruction. The folklore originated here: a mixed workload with a few AVX-512 kernels could hold the whole core at license 2, hurting the surrounding scalar code by ~25%.
Cascade Lake (2019). Same license model. Slightly smaller drop magnitudes; same shape.
Ice Lake server (Sunny Cove, 2021) and onward. The license model was rebuilt: the frequency penalty became proportional to the duty cycle of heavy vector code, in 1 ms granularity, instead of a wholesale state transition. On Ice Lake and later, AVX-512 typically costs 0–4% of clock relative to scalar code on the same chip^[26].
Sapphire Rapids (2023). Further narrowed the gap. AVX-512 carries no measurable clock penalty in normal workloads.
Intel client (Alder Lake / Raptor Lake / Meteor Lake / Arrow Lake, 2021–2024). AVX-512 was disabled in firmware on consumer Intel parts because the E-cores lacked the unit; this is the source of "modern Intel desktops don't have AVX-512" in 2026.
AMD Zen 4 (2022). Implemented AVX-512 via two 256-bit pumps per instruction. Throughput is half of an equivalent Intel server core's, but there is no frequency penalty whatsoever^[6]. AMD's marketing called this "double-pumped AVX-512"; in benchmarks it's competitive with Intel server parts on cache-resident workloads and an outright win on mixed workloads.
AMD Zen 5 (2024). Native 512-bit datapath; one ZMM op per port. The Intel "license penalty" framing never applied.

The practical advice in 2026:

On server (Xeon Ice Lake or later, EPYC Zen 4 or later), use AVX-512 freely.
On consumer Intel from Alder Lake through Arrow Lake, AVX-512 isn't available; -march=x86-64-v3 (AVX2 / FMA / BMI2) is the right target for code that has to run on a player's home machine.
On consumer AMD from Zen 4 onward, AVX-512 is available and has no clock penalty.
For shipping game binaries, the standard pattern is multi-versioning: compile two or three versions of the hot kernels at different -march levels and pick at startup via function multiversioning or a manual dispatch table. ICC, GCC, and Clang all support the FMV attribute^[27].

The 2017-era advice to avoid AVX-512 because of frequency is, in 2026, advice tuned for hardware no game studio is targeting. The 2026-era advice is: use AVX-512 where it's available, fall back to AVX2 where it isn't, and dispatch at startup.

15ARM: NEON, SVE2, and porting the math

ARM's AArch64 (ARMv8 and onward) ships with NEON as its mandatory SIMD: thirty-two 128-bit registers (V0–V31), the same SIMD register names every C ABI on ARM expects. NEON is feature-comparable to SSE4 (128-bit, no native masking, no gather). For game programmers it's the Nintendo Switch, Switch 2, Apple Silicon Mac, every iPhone and most Android phones, and the AWS Graviton server line^[28].

The translation from x86 SSE/AVX2 intrinsics to NEON is mostly mechanical. _mm_add_ps ↔ vaddq_f32. _mm_mul_ps ↔ vmulq_f32. _mm_fmadd_ps ↔ vfmaq_f32. _mm_loadu_ps ↔ vld1q_f32. The dot-product kernel from §6 with NEON intrinsics:

dot_neon.cpp · the same kernel, NEON

#include <arm_neon.h>

float dotNeon(const float* aArray, const float* bArray, int elementCount) {
  float32x4_t accumulatorA = vdupq_n_f32(0.0f);    // 4-lane register of zeros
  float32x4_t accumulatorB = vdupq_n_f32(0.0f);
  int i = 0;
  for (; i + 8 <= elementCount; i += 8) {
    float32x4_t aLow  = vld1q_f32(aArray + i);
    float32x4_t bLow  = vld1q_f32(bArray + i);
    float32x4_t aHigh = vld1q_f32(aArray + i + 4);
    float32x4_t bHigh = vld1q_f32(bArray + i + 4);
    accumulatorA = vfmaq_f32(accumulatorA, aLow,  bLow);    // fma: dst + a*b, one rounding step
    accumulatorB = vfmaq_f32(accumulatorB, aHigh, bHigh);
  }
  float32x4_t combined = vaddq_f32(accumulatorA, accumulatorB);
  // Horizontal sum: ARMv8 has vaddvq_f32, a single-instruction reduce-add of all 4 lanes.
  float totalSum = vaddvq_f32(combined);
  for (; i < elementCount; ++i) totalSum += aArray[i] * bArray[i];
  return totalSum;
}

Two NEON conveniences x86 doesn't have. vaddvq_f32 is a single-instruction horizontal sum (faddp on the asm side). NEON's three-operand encoding has been the default since ARMv8, so there's no SSE-style "destination is one of the sources" baggage. Apple's documentation on the M-series and the ARM Cortex-A optimization guides are the practical references^[29]^[30].

SVE and SVE2: vector-length-agnostic

The longer-term ARM story is SVE (Scalable Vector Extension), which adds vector-length-agnostic instructions: the same binary runs on hardware with a 128-bit vector, a 256-bit vector, a 512-bit vector, or anything in between up to 2048 bits. SVE2 is an optional feature in the ARMv9-A profile (not mandatory), implemented in practice by the Neoverse server line and a slice of mobile silicon^[31]. The shipping examples worth knowing about:

Fujitsu A64FX in the Fugaku supercomputer. SVE (not SVE2) at 512 bits per vector. First major SVE deployment, 2020.
AWS Graviton 3 (Neoverse V1). SVE at 256 bits, implemented as two 256-bit pipes. The first cloud-server SVE rollout, 2022.
AWS Graviton 4 (Neoverse V2). SVE2 at 128 bits, four 128-bit pipes. AWS narrowed the per-instruction vector width going from V1 to V2; the per-core FP throughput is similar because the issue width doubled.
ARMv9 client parts. Mobile SoCs based on Cortex-X4 / X925 / A720 implement SVE2 at 128 bits. Apple's M4 (ARMv9.2-A, 2024) implements SME (Scalable Matrix Extension) but does not expose non-streaming SVE2 to user code; SVE-style instructions on M4 are only available inside an SME streaming-mode region^[34]. Code that wants to run on M4 has to use NEON or stay inside SME.

SVE has first-class predication: every instruction takes a predicate register, the way AVX-512 does, so the mask-and-blend dance from §8 disappears. Code written for SVE looks closer to AVX-512 with mask registers than to NEON. The price is that you can't reason about lane count at compile time; every loop has to be written in a "while there's still work" form (the SVE WHILELT instruction generates the predicate for "lanes within bounds").

For game code in 2026 the practical situation is: NEON is what every shipping ARM target supports, including Switch 2 and every consumer Apple Silicon Mac. SVE2 is a server-side story (Graviton, some Neoverse-based instances) plus a slice of Android mobile. The Highway library^[22] already has working SVE and NEON backends, and most engine codebases get cross-architecture support for free as soon as they switch to a portable SIMD library. The cost of writing pure NEON intrinsics in 2026 is roughly the same as writing pure SSE2 intrinsics: portable to a slice of hardware, but not future-proof for the wide-vector successors.

16Pitfalls

Mixing legacy SSE encodings with VEX-encoded AVX. Switching from a non-VEX SSE instruction (movaps xmm0, ...) into a VEX AVX instruction (vmulps ymm0, ...) without an intervening vzeroupper can incur a state-save penalty of tens of cycles on Skylake-class cores^[32]. The compiler emits vzeroupper at AVX/SSE boundaries automatically; hand-written intrinsic code is fine because everything goes through the VEX encoding. Inline assembly that mixes the two has to insert vzeroupper manually.
Forgetting the scalar tail. A loop processing 8 floats at a time over an array of 503 elements leaves 7 elements undone. Common bug patterns: off-by-eight (the SIMD loop reads past the end), off-by-one (the scalar tail reads one element too many), data races (the SIMD loop and the tail loop write to overlapping memory). AVX-512's masked load and store make the tail a non-issue; pre-AVX-512 code should pad arrays to a multiple of the SIMD width and zero the tail, or write the scalar tail with a clear loop bound.
Treating scalar SSE math as cheaper than packed. addss and addps have identical latency and throughput on every Intel and AMD core since 2009^[1]. Three lanes of an XMM register sitting unused doesn't make the instruction faster. The win is in using those lanes.
Cross-lane shuffles in inner loops. vperm2f128, vpermps, vextractf128, and the cross-lane gather/scatter variants are all 3-cycle latency on Skylake, against 1 cycle for the within-lane shuffles^[1]. A horizontal sum at the end of a long reduction amortizes this; using a cross-lane shuffle every iteration triples the per-iteration cost.
Gather on cold data. Vector gather is faster than scalar loads only when the loaded data is already L1-resident. On a 1 MB working set with random gather indices, every load is a cache miss, and the gather serializes those misses inside one instruction. The branch-prediction trick that makes scalar code tolerate cache misses (lots of outstanding loads in flight, scheduler hides the latency) doesn't apply when the loads are inside a single μop. Restructure the data instead.
Compiling once and running anywhere. A binary compiled with -march=skylake-avx512 will SIGILL on a chip without AVX-512. Game shipping binaries either pick the lowest common denominator (often x86-64-v3 = AVX2 / FMA / BMI2 in 2026) or use function multiversioning to dispatch at startup. Test the dispatch path on hardware that lacks the higher-tier instructions; the failure mode is "works on the dev machine, crashes at boot on a player's machine"^[27].
Floating-point determinism. A reduction reordered for SIMD doesn't produce the same bit pattern as a scalar left-to-right reduction. For replays, lockstep multiplayer, deterministic physics, write the SIMD reduction tree explicitly (and the same tree every time, regardless of length) and don't enable -fassociative-math^[15].
Denormals. A particle that has decayed to a near-zero velocity can produce subnormal numbers, which on some Intel pre-Sapphire-Rapids parts incur a 100-cycle microcode penalty per operation. The fix is to enable Flush-to-Zero (FTZ) and Denormals-Are-Zero (DAZ) in MXCSR at engine startup; physics and audio codebases routinely set this^[32].

17What's next

Where to go from here:

Pick a hot kernel in your engine and rewrite it twice. Once as plain scalar with __restrict and #pragma omp simd, once with intrinsics. Compare the disassembly and the measured frame cost. The gap is the size of the autovec-vs-intrinsics decision in your codebase.
Read the Intel Intrinsics Guide^[9] for the instructions you're using. Every intrinsic has a latency / throughput badge per micro-architecture, plus a pseudo-C description of the lane semantics. The reference is searchable by mnemonic, by intrinsic name, or by feature flag.
Set up llvm-mca on a small slice of your inner loop. A few minutes of llvm-mca -mcpu=skylake on a 30-instruction kernel tells you which port is saturated, where the dependency chain pins the throughput, and whether reordering loads-before-stores helps. uica.uops.info is the same tool with a richer front-end model^[33].
Skim Wojciech Muła's blog and Daniel Lemire's blog. Both at 0x80.pl and lemire.me/blog. Together they are the most up-to-date practitioner-grade resource on modern SIMD, with hundreds of small worked examples and microbenchmarks on x86 and ARM.
Pair with the Assembly tutorial and the Memory Model tutorial. SIMD reductions interact directly with cache coherence and store-buffer ordering. Once a SIMD loop is fast enough to be memory-bandwidth-bound, the next set of optimizations is in those two tutorials' territory.

18Sources & further reading

Numbered citations refer to the superscripts above. Most are freely available; a few are vendor manuals that require a no-cost registration.

A note on originality

The prose, code samples, CSS, and interactive widgets on this page are original writing. The lane-format and instruction-encoding details follow the Intel SDM Volume 1 (chapters 11–14 on SSE/AVX/AVX-512) [2] and the Intel Intrinsics Guide [9]. The dot-product reduction pattern and the multi-accumulator rule trace to the AMD Software Optimization Guide and the Intel optimization manual [6][32]. The pshufb-as-lookup-table examples follow Wojciech Muła's blog and the Lemire–Keiser UTF-8 validation paper [4][16]. The AVX-512 frequency-license analysis in §14 follows Travis Downs's measurement series [26]; the SoA/AoSoA framing in §4 echoes Mike Acton's CppCon 2014 talk on data-oriented design [13]. The frustum cull kernel in §12 follows the standard formulation from Akenine-Möller et al. Real-Time Rendering [20] with the SIMD batching adapted from the DirectXMath internal layout [19].

Abel, A., & Reineke, J. (2019). uops.info: Characterizing Latency, Throughput, and Port Usage of Instructions on Intel Microarchitectures. ASPLOS. Project home: uops.info. Machine-measured latency, throughput, and port-usage tables for nearly every x86-64 instruction on every recent micro-architecture, including the AVX-512 mask-register forms.
Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual. Order Numbers 253665–253669. intel.com. Volume 1 chapters 11–14 cover SSE / AVX / AVX-512 conceptual layout; Volume 2 has the per-instruction encoding tables and pseudo-C semantics.
Advanced Micro Devices. AMD64 Architecture Programmer's Manual, Volume 1: Application Programming. Publication #24592. amd.com. The original AMD64 spec, including the mandatory SSE2 baseline that every x86-64 compiler may assume.
Muła, W. Practical SIMD and bit-twiddling notes. 0x80.pl. The reference body of work on pshufb-driven tricks, including byte-shuffle parallel lookups, UTF-8 validation, base64 codec routines, and hundreds of related microbenchmarks across SSE, AVX2, AVX-512, and NEON.
Muller, J.-M., Brisebarre, N., de Dinechin, F., Jeannerod, C.-P., Lefèvre, V., Melquiond, G., Revol, N., Stehlé, D., & Torres, S. (2018). Handbook of Floating-Point Arithmetic (2nd ed.). Birkhäuser. The reference for FMA rounding semantics: a single rounding step after the multiply-then-add, instead of two.
Advanced Micro Devices. Software Optimization Guide for AMD Family 19h Processors (Zen 3 and Zen 4). Publication #56665 / #57487. amd.com. AMD's counterpart to the Intel optimization manual; covers the Zen 4 double-pumped AVX-512 implementation and the latency/throughput numbers for SIMD instructions.
Intel Corporation. Intel® Architecture Instruction Set Extensions Programming Reference. intel.com. The normative document for AVX-512 mask registers, the EVEX prefix, AVX-512F/BW/DQ/VL feature splits, and the masked memory operations used in §8.
Intel Corporation. (2023). Intel® Advanced Vector Extensions 10 (Intel® AVX10) Architecture Specification. intel.com. The proposal to unify the AVX-512 feature surface and allow hardware to ship at either 256-bit or 512-bit max vector width.
Intel Corporation. Intel® Intrinsics Guide. intel.com/intrinsics-guide. The searchable reference for every intrinsic in the <immintrin.h> family, with mnemonic, encoding, latency, and a per-architecture availability badge.
Fog, A. The microarchitecture of Intel, AMD and VIA CPUs and Instruction tables. agner.org/optimize. Manuals 3 and 4 of the agner.org/optimize series. The 2024 edition covers Zen 4, Raptor Lake, and Arrow Lake; periodically updated since 1996.
Unity Technologies. The Burst Compiler and the Mathematics package. docs.unity3d.com/Packages/com.unity.burst. Unity's job-compiler that translates a C# subset to NEON or AVX2; the Unity.Mathematics package provides the float4 / float4x4 types it vectorizes.
Epic Games. Unreal Engine source, Math/VectorRegister.h family. github.com/EpicGames/UnrealEngine (access requires a linked Epic account). Unreal's portable SIMD wrapper: a VectorRegister4Float type that lowers to __m128 on x86, float32x4_t on ARM, and similar elsewhere. The internal math library (FMatrix44d, FQuat) is built on top of it.
Acton, M. (2014). Data-Oriented Design and C++. CppCon. YouTube. The canonical talk on the SoA / AoS / chunked-layout decision in engine code; Insomniac's framing is the reference most subsequent ECS designs cite.
ISO/IEC. C++ standard, [basic.align] and [expr.alignof]. The normative spec for alignas, alignof, std::aligned_alloc, and the C++17 over-aligned operator new. cppreference's overview at en.cppreference.com/w/cpp/language/alignas is the practical entry point.
Muller, J.-M., et al. (2018). Handbook of Floating-Point Arithmetic (2nd ed.). Birkhäuser. Chapter 3 covers the non-associativity of FP addition and the rounding bounds for reordered reductions. The reference for "is the SIMD-reordered sum still numerically correct" questions.
Keiser, J., & Lemire, D. (2020). Validating UTF-8 In Less Than One Instruction Per Byte. Software: Practice and Experience. arXiv:2010.03090. The simdutf algorithm: a fully branchless UTF-8 validator built on pshufb as a parallel lookup. Used in Node.js, Chromium, and the simdjson library.
Muła, W., & Lemire, D. (2018). Faster Base64 Encoding and Decoding using AVX2 Instructions. ACM Transactions on the Web. arXiv:1704.00605. The canonical SIMD base64 codec; a small pshufb-driven kernel that runs at roughly 10× the throughput of a portable scalar implementation.
Drepper, U. (2007). What Every Programmer Should Know About Memory. PDF. The long-form treatment of the memory hierarchy, prefetch, cache effects, and bandwidth limits. Older than this tutorial's other references but still the clearest single document on the memory side of SIMD performance.
Microsoft. DirectXMath. learn.microsoft.com and github.com/microsoft/DirectXMath. Microsoft's SIMD math library that ships with the Windows SDK; the source has worked-out 4×4 transforms in SSE2, SSE4, AVX, AVX2, and AVX-512 variants.
Akenine-Möller, T., Haines, E., Hoffman, N., Pesce, A., Iwanicki, M., & Hillaire, S. (2018). Real-Time Rendering (4th ed.). CRC Press. Chapter 19 on culling has the canonical plane-sphere test and the standard hierarchical-cull pipeline assumed by §12.
OpenMP Architecture Review Board. OpenMP Application Programming Interface, Version 5.2. openmp.org. The normative spec for #pragma omp simd, the standardized way to declare "this loop has no loop-carried dependencies and may be vectorized."
Wassenberg, J., et al. Highway: A C++ library for SIMD. github.com/google/highway. Portable SIMD library with backends for SSE/AVX/AVX-512/NEON/SVE/RVV and runtime dispatch; used in Chromium's JPEG XL decoder, image-encoding libraries, and several Google internal codebases.
Kretz, M. (2018). P0214R9 — Data-parallel vector library. WG21. open-std.org. The C++ standardization proposal that became std::experimental::simd (TS 19570) and is on the path to std::simd in C++26. Follow-up papers (P1928 and later) track the integration into the working draft.
SIMDe Contributors. SIMDe: SIMD Everywhere. github.com/simd-everywhere/simde. Header-only translation of x86 SIMD intrinsics to NEON, WebAssembly SIMD, AltiVec, and a fallback portable C. Used by codebases that want to keep <immintrin.h> source but ship on ARM and the web.
LLVM Project. The LLVM Loop Vectorizer. llvm.org/docs/Vectorizers.html. The reference for what the autovectorizer can and can't do, the diagnostics it emits, and the pragmas (#pragma clang loop vectorize_width(N)) that override its decisions.
Downs, T. Performance Matters. travisdowns.github.io and the related Gathering Intel on Intel AVX-512 Transitions measurement series at travisdowns.github.io/blog/2020/01/17/avxfreq1.html. The canonical measurement of Skylake-SP, Ice Lake, and Sapphire Rapids AVX-512 frequency behavior, with reproducible benchmarks.
Free Software Foundation. Function Multiversioning. gcc.gnu.org. GCC's mechanism for emitting several versions of a function (one per __attribute__((target("avx2"))), one per ("avx512f"), etc.) and dispatching at startup via the IFUNC resolver. Clang supports the same syntax for binary compatibility.
ARM Limited. Arm® Architecture Reference Manual for A-profile architecture. Document DDI 0487. developer.arm.com. The normative reference for AArch64 NEON (Advanced SIMD), SVE, and SVE2; the register file, the instruction encoding, the C ABI conventions for V0–V31.
ARM Limited. Arm Cortex-A optimization guides. developer.arm.com. Per-microarchitecture latency and throughput tables (Cortex-A78, A715, Neoverse N1/N2/V1/V2). The ARM-side counterparts to Agner Fog's manuals.
Johnson, D. Apple Silicon CPU Optimization Guide. developer.apple.com. Apple's official guide for the M-series cores; covers the wide front end, the cluster topology (P-cores and E-cores), and the NEON throughput characteristics that differ from a typical ARM Cortex part.
ARM Limited. Arm SVE2 Architecture. developer.arm.com. The scalable-vector ISA implemented on the A64FX (Fugaku) and across the Neoverse line (V1/V2/N2). SVE2 is an optional feature in the ARMv9-A profile; the spec does not mandate it for V9-A conformance. Covers vector-length-agnostic loop forms, predicate registers, and gather/scatter.
Apple Developer Forums and LLVM. SVE availability on Apple Silicon. developer.apple.com/forums. The M4 (ARMv9.2-A) ships SME (Scalable Matrix Extension) but does not expose non-streaming SVE2 to user mode; SVE-class instructions execute only inside SME streaming-mode regions. Confirmed by LLVM backend behavior and developer-forum reports of SIGILL on plain SVE instructions outside SME.
Downs, T. (2019). A Note on Mask Registers. travisdowns.github.io/blog/2019/12/05/kreg-facts.html. The reference for the K0 / EVEX-encoding-0 distinction and the practical implications for AVX-512 mask register allocation.
Intel Corporation. Intel® 64 and IA-32 Architectures Optimization Reference Manual. Order Number 248966. intel.com. The normative source for Intel microarchitecture detail: the SSE-AVX transition penalty, the recommended idioms (vzeroupper at boundaries), the MXCSR flush-to-zero / denormals-are-zero flags.
Abel, A., & Reineke, J. (2022). uiCA: Accurate Throughput Prediction of Basic Blocks on Recent Intel Microarchitectures. ICS. Project home: uica.uops.info. A simulator with a richer model of the front end, μop cache, and back-end ports than llvm-mca; useful when you need a more precise prediction of a kernel's steady-state IPC.
Reinders, J., Jeffers, J., & Sodani, A. (2016). Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition. Morgan Kaufmann. The reference for the first AVX-512 hardware (the 2016 Knights Landing); the chapters on masked operations and gather/scatter are still the clearest single source on the AVX-512 mask model.
Geffroy, J. (2020). Doom Eternal: Devil is in the Details. SIGGRAPH Advances in Real-Time Rendering. advances.realtimerendering.com. Talk on the idTech 7 renderer's hot paths, including the SIMD-batched culling and instance-transform stages assumed in §11 and §12.
Sousa, T. (2013). The Rendering Technology of Killzone 2. GDC. The classic talk on PS3-era SIMD math kernels (SPU vector code), useful for context on how the SoA discipline became universal in console renderers.
Giesen, F. (Ryg). The ryg blog. fgiesen.wordpress.com. Practitioner-grade writeups on SIMD, codec implementation, the pixel pipeline, and the dependency-chain analyses that make this tutorial's §7 picture concrete on a real workload.
Lemire, D. Daniel Lemire's blog. lemire.me/blog. Microbenchmarks, SIMD analyses, and a steady stream of evidence-based posts on modern x86 and ARM performance; the source for many of the canonical "how fast does this actually go on Skylake / Ice Lake / Zen 4" measurements.
Hennessy, J. L., & Patterson, D. A. (2017). Computer Architecture: A Quantitative Approach (6th ed.). Morgan Kaufmann. The textbook on pipelining, out-of-order execution, and cache hierarchies. Chapter 4 (data-level parallelism) is the long-form version of §3 and §7 of this tutorial.
Intel Corporation. Intel® VTune™ Profiler User Guide. intel.com. The reference profiler for Intel hardware; includes the AVX/AVX-512 frequency-license dashboard that visualises the §14 picture on a running workload.