Skeletal Animation & Skinning
A character mesh is one rigid blob until a skeleton drives it. Skeletal animation binds each vertex to a few joints, animates the joints, and deforms the mesh to follow. The data structure is a tree of transforms; the one equation that matters is the skinning matrix, animated-joint times inverse-bind; and the failure modes (the candy-wrapper twist, lerping the wrong way around a rotation) are where the rigor lives.
01 Why skeletal animation
Authoring a separate mesh per animation frame doesn't scale (that's vertex animation, useful for crowds, see the VAT tutorial). Skeletal animation instead stores a small skeleton of joints, animates the joints over time, and computes each vertex's deformed position from the joints it's attached to. A 60-joint skeleton drives a 50,000-vertex character; you animate 60 transforms, not 50,000 positions.
Skeleton (a joint tree) → bind pose + inverse bind matrices (the mesh's rest state) → a per-vertex skinning matrix (animated joint × inverse bind) → a deformation method (linear blend or dual-quaternion) → animation clips (keyframes sampled over time) → blending (mixing clips) → the GPU (a matrix palette, skinned in the vertex shader). This tutorial builds it in that order: structure, then deformation, then time, then hardware.
02 The skeleton
A skeleton is a tree of joints, each storing a local transform (translation, rotation, scale) relative to its parent. A joint's model-space (global) transform is its parent's global transform times its own local transform, accumulated from the root down, which is exactly the scene-graph propagation from Going 3D[1]:
$$G_j = G_{p(j)} \, L_j, \qquad G_{\text{root}} = L_{\text{root}}$$
Here $G_j$ is the global transform of joint $j$, $L_j$ its local transform, and $p(j)$ its parent. Walk the skeleton parent-first and each joint just multiplies onto its parent, which is why the root has to be processed before its children.
glTF and most of the math call the node a joint; a "bone" loosely means the visible segment between a joint and its parent. The transform lives on the joint. A parent's global transform must be computed before its children's: walk the node tree, or sort the joints so parents precede children. Iterating the joint array in storage order without that guarantee reads a stale parent. The skeleton can have any number of joints; that is separate from the per-vertex influence cap (§4).
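One way to get that parent-first order is a depth-first walk from the roots. The following is a minimal sketch, assuming each joint stores its parent's index with -1 marking a root; `parent_first_order` is a hypothetical helper name, not part of any library.

```rust
/// Returns joint indices ordered so every parent appears before its children.
/// `parent_index[j]` is the parent of joint `j`, or -1 for a root (assumed convention).
fn parent_first_order(parent_index: &[i32]) -> Vec<usize> {
    let joint_count = parent_index.len();
    // Build child lists once so the walk is linear in the joint count.
    let mut children: Vec<Vec<usize>> = vec![Vec::new(); joint_count];
    let mut stack: Vec<usize> = Vec::new();
    for (joint, &parent) in parent_index.iter().enumerate() {
        if parent < 0 {
            stack.push(joint); // roots seed the traversal
        } else {
            children[parent as usize].push(joint);
        }
    }
    let mut order = Vec::with_capacity(joint_count);
    // A joint is emitted before any of its children are pushed,
    // so a parent always lands earlier in `order` than its descendants.
    while let Some(joint) = stack.pop() {
        order.push(joint);
        stack.extend(children[joint].iter().copied());
    }
    order
}
```

Doing this once at load time and remapping the joint arrays into that order lets the per-frame loop stay a plain linear pass.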
03 The inverse bind matrix
Mesh vertices are authored once, in model space, in the bind pose (the rest pose the mesh was modeled in, often a T-pose). The inverse bind matrix for joint j is the inverse of that joint's global transform in the bind pose:
$$B_j^{-1} = \left(G_j^{\text{bind}}\right)^{-1}$$
It maps a vertex from model space into joint j's local space, so the joint's animated transform can then carry it to its deformed position. Without that step the joint's rest pose is never cancelled, and the bind transform gets counted twice.
Forgetting the inverse bind (or applying it on the wrong side) makes the mesh explode the instant animation starts, because the animated joint transform gets applied to a vertex that's already in model space, double-counting the bind pose. glTF stores inverseBindMatrices directly (one per joint); use the stored values rather than recomputing them with a convention mismatch. The whole skinning idea is: undo the bind pose, then redo the current pose. Keep globalBindTransform (fixed) and globalPose (animated) clearly separate.
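The "undo the bind pose, then redo the current pose" idea can be checked on a tiny case. This is a sketch only: it uses a hypothetical 2D rigid transform (`Rigid2`) in place of a mat4 so the inverse has a closed form, and all names here are illustrative assumptions.

```rust
/// A 2D rigid transform (rotate by `angle`, then translate), standing in for a mat4.
#[derive(Clone, Copy)]
struct Rigid2 {
    angle: f32,
    tx: f32,
    ty: f32,
}

impl Rigid2 {
    fn apply(self, p: (f32, f32)) -> (f32, f32) {
        let (s, c) = self.angle.sin_cos();
        (c * p.0 - s * p.1 + self.tx, s * p.0 + c * p.1 + self.ty)
    }

    /// `self * rhs`: `rhs` is applied first, then `self`.
    fn compose(self, rhs: Rigid2) -> Rigid2 {
        let (tx, ty) = self.apply((rhs.tx, rhs.ty));
        Rigid2 { angle: self.angle + rhs.angle, tx, ty }
    }

    fn inverse(self) -> Rigid2 {
        let (s, c) = (-self.angle).sin_cos();
        // The inverse translation is -(R^-1 * t).
        Rigid2 {
            angle: -self.angle,
            tx: -(c * self.tx - s * self.ty),
            ty: -(s * self.tx + c * self.ty),
        }
    }
}

/// Skinning transform for one joint: undo the bind pose, then apply the current pose.
fn skinning_transform(global_pose: Rigid2, global_bind: Rigid2) -> Rigid2 {
    global_pose.compose(global_bind.inverse())
}
```

With a joint bound at (2, 0) and a vertex at (3, 0), a pose equal to the bind pose leaves the vertex where it is, and rotating the joint 90° in place moves the vertex to (2, 1). Applying the animated pose directly, without the inverse bind, would put it at (2, 3): the bind offset counted twice.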
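04 The skinning matrix
Each vertex is bound to up to N joints (glTF: 4 per set) with weights that sum to 1. Each influencing joint contributes its skinning transform, jointGlobalAnimated[j] · inverseBind[j], and the vertex's skinning matrix is the weighted sum of those transforms:
$$v' = \Big( \sum_i w_i \, G_{j_i} \, B_{j_i}^{-1} \Big) \, v$$
where $j_i$ is the vertex's i-th joint index, $w_i$ its weight, $G_{j_i}$ that joint's animated global transform (jointGlobalAnimated), and $B_{j_i}^{-1}$ its inverse bind matrix (inverseBind).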
Each vertex is bound to a few joints. For every influence: undo that joint's rest pose (inverseBind), apply where the joint is now (jointGlobalAnimated), and scale by its weight. Σi sums those weighted matrices, and the blend transforms the vertex. Because the weights sum to 1, a vertex at the knee bends smoothly between thigh and shin instead of tearing.
The order is jointGlobalAnimated · inverseBind: the inverse bind on the right (applied first, to a model-space vertex), the animated global on the left. Weights must be non-negative and sum to 1, or the mesh shrinks or swells toward the weighted joint mean. Four influences is glTF's per-set cap, not a universal law; a vertex that needs more uses a second JOINTS_1/WEIGHTS_1 set. Transform the normal too: under non-uniform scale it needs the inverse-transpose of the skinning matrix (the 3D Math rule), and the cheap mat3(skin) · normal path is correct only when the joints are rigid or uniformly scaled (with the result renormalized).
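Because the weight constraint is easy to violate in exported or hand-edited data, a loader may want to repair weights once at import. A minimal sketch, assuming four influences per vertex; `normalize_weights` and its all-zero fallback are illustrative choices, not something the source or glTF prescribes.

```rust
/// Rescales four influence weights so they are non-negative and sum to 1.
/// If every weight is zero (or negative), this sketch binds the vertex fully
/// to its first joint; that fallback is an assumption, pick what suits your data.
fn normalize_weights(weights: [f32; 4]) -> [f32; 4] {
    // Negative weights are invalid, so clamp them away before summing.
    let clamped = weights.map(|w| w.max(0.0));
    let sum: f32 = clamped.iter().sum();
    if sum <= f32::EPSILON {
        return [1.0, 0.0, 0.0, 0.0];
    }
    clamped.map(|w| w / sum)
}
```

For example, weights of (2, 1, 1, 0) become (0.5, 0.25, 0.25, 0), so the blended matrix stays an affine combination of the joint transforms.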
[Interactive demo: an animated limb with each vertex colored by its bone weights; breaking the weights distorts the mesh.]
Building the palette on the CPU, in C++ (glm):

    // joints sorted so a parent always precedes its children
    for (size_t joint = 0; joint < jointCount; ++joint) {
        int parent = parentIndex[joint];                // -1 for the root
        glm::mat4 local = localPose[joint].toMatrix();  // animated TRS this frame
        globalPose[joint] = (parent < 0) ? local
                                         : globalPose[parent] * local;  // parent.global * local
    }
    for (size_t joint = 0; joint < jointCount; ++joint)
        skinningMatrix[joint] = globalPose[joint] * inverseBind[joint];  // ORDER: animated * invBind

The same in Rust:

    for joint in 0..joint_count {
        let parent = parent_index[joint];            // -1 (i32) for the root
        let local = local_pose[joint].to_matrix();   // animated TRS this frame
        global_pose[joint] = if parent < 0 { local }
            else { global_pose[parent as usize] * local };  // parent.global * local
    }
    for joint in 0..joint_count {
        skinning_matrix[joint] = global_pose[joint] * inverse_bind[joint];  // animated * inv_bind
    }
05 LBS & DQS
Linear blend skinning (LBS, a.k.a. matrix palette skinning) is the equation above verbatim: blend the joint matrices by the weights, then transform. It's the default in essentially every engine and is exactly what glTF mandates[4]. But a weighted sum of matrices is not a rigid transform, so it has two artifacts:
- Candy-wrapper (twist): at a joint twisted toward 180°, the blend collapses volume. With one rotation the identity and the other a 180° twist, 0.5·I + 0.5·R is a rank-1 matrix that projects space onto the axis, pinching the cross-section to a point[6].
- Volume loss (bend): at a sharply bent elbow, linearly interpolating between two rotated positions cuts the corner, shrinking the inner volume.
Dual-quaternion skinning (DQS, Kavan et al.) blends rigid transforms instead of matrices, so it preserves volume through twists with no candy-wrapper, at cost comparable to LBS[5]. DQS has its own costs: joint bulging on sharp bends (it over-preserves volume), no way to represent non-uniform scale or shear, and a need for antipodality handling (the q vs −q double cover from 3D Math). LBS stays the default; the common production fix for its artifacts is twist helper joints and corrective blend shapes, not necessarily switching to DQS. Don't call DQS "the modern replacement."
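The candy-wrapper collapse can be reproduced numerically. The sketch below assumes the twist axis is X and uses the determinant of the blended 3×3 matrix as a measure of how much volume survives; the helper names are illustrative.

```rust
type Mat3 = [[f32; 3]; 3];

/// Rotation about the X axis (taken as the bone's twist axis in this sketch).
fn rotation_x(angle: f32) -> Mat3 {
    let (s, c) = angle.sin_cos();
    [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]
}

/// Component-wise weighted sum of two matrices, which is what LBS does per vertex.
fn blend(a: &Mat3, b: &Mat3, weight_b: f32) -> Mat3 {
    let mut out = [[0.0; 3]; 3];
    for r in 0..3 {
        for c in 0..3 {
            out[r][c] = (1.0 - weight_b) * a[r][c] + weight_b * b[r][c];
        }
    }
    out
}

/// Determinant: the factor by which the matrix scales volume.
fn determinant(m: &Mat3) -> f32 {
    m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
}
```

Blending the identity with a 180° twist at equal weights gives a matrix close to diag(1, 0, 0), whose determinant is effectively zero: everything off the axis is flattened. Even a 90° twist already halves the determinant, which is why twist helper joints spread the rotation over several smaller steps.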
[Interactive demo: twisting the forearm to compare LBS (collapses) with DQS (holds volume, then bulges on bend).]
06 The animation clip
A clip is a set of per-joint keyframe tracks, with separate tracks for translation, rotation, and scale (glTF samples exactly these three per node). Sampling at time t: find the bracketing keyframe pair at times t0 and t1, compute the local parameter α = (t − t0) / (t1 − t0), and interpolate. Rotation interpolates with quaternion slerp or nlerp; translation and scale lerp.
Euler lerp takes a wrong arc and hits gimbal-lock degeneracies; component-wise matrix lerp gives a non-rigid in-between (the same problem as LBS). Quaternion slerp/nlerp is the correct cheap path[8]. The slerp-vs-nlerp trade is a triad where you get two of three: slerp is constant-velocity and torque-minimal but not commutative; nlerp is commutative and torque-minimal but not constant-velocity. For sampling one clip, slerp's constant velocity is nice; for blending several poses (§7), nlerp's commutativity matters. Either way, check the sign of the quaternion dot product and negate one operand if it is negative before interpolating (the double cover), or the rotation goes the long way around.
Sampling one track, in C++ (glm):

    Transform sampleTrack(const Track& track, float time) {
        auto [k0, k1, alpha] = track.bracket(time);  // bracketing keys + [0,1] factor
        Transform out;
        out.translation = glm::mix(k0.translation, k1.translation, alpha);  // lerp position
        out.scale       = glm::mix(k0.scale, k1.scale, alpha);              // lerp scale
        out.rotation    = glm::slerp(k0.rotation, k1.rotation, alpha);      // SLERP; glm::slerp handles the shortest path
        return out;
    }

The same in Rust (glam):

    fn sample_track(track: &Track, time: f32) -> Transform {
        let (k0, k1, alpha) = track.bracket(time);  // bracketing keys + [0,1] factor
        Transform {
            translation: k0.translation.lerp(k1.translation, alpha),  // lerp position
            scale: k0.scale.lerp(k1.scale, alpha),                    // lerp scale
            rotation: k0.rotation.slerp(k1.rotation, alpha),          // SLERP (glam: shortest path)
        }
    }
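The `bracket` step used by the sampling code is not spelled out in the text. One possible shape, sketched here over a bare slice of key times rather than the tutorial's `Track` type (so the signature is an assumption), is a binary search that clamps outside the track:

```rust
/// Finds the keyframe pair around `time` and the interpolation factor between them.
/// `key_times` is assumed non-empty and sorted ascending.
/// Returns (index of earlier key, index of later key, alpha in [0, 1]).
/// Times before the first key or after the last key clamp to that key.
fn bracket(key_times: &[f32], time: f32) -> (usize, usize, f32) {
    let last = key_times.len() - 1;
    if time <= key_times[0] {
        return (0, 0, 0.0);
    }
    if time >= key_times[last] {
        return (last, last, 0.0);
    }
    // Index of the first key strictly after `time`; the key before it is the lower bracket.
    let upper = key_times.partition_point(|&t| t <= time);
    let lower = upper - 1;
    let span = key_times[upper] - key_times[lower];
    let alpha = (time - key_times[lower]) / span;
    (lower, upper, alpha)
}
```

Clamping is one policy among several; a looping clip would wrap `time` into the clip's duration before calling this.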
07 Blending
Blending combines poses per joint, in local space, then recomputes globals[1]. A cross-fade interpolates each joint's local rotation by slerp and translation by lerp over the fade. Blend trees parametrically mix clips by 1D (speed) or 2D (speed × direction) parameters, the standard authoring model in Unreal and Unity. Additive blending applies a pose difference (a clip authored relative to a reference) on top of a base, for aim offsets and hit reactions layered over locomotion.
Never blend global joint matrices: that gives non-rigid, broken in-betweens (the matrix-averaging problem again). Blend local TRS per joint (slerp the rotations), then walk the hierarchy. Additive is base + delta per joint (compose the delta rotation, add the delta translation), not a second clip played on top; that's why one aim-offset clip composes over walk, run, and crouch without authoring every combination.
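A cross-fade between two local-space poses might look like the following. It is a sketch under stated assumptions: the `Quat` and `JointPose` types are minimal stand-ins defined here (scale is omitted for brevity), and rotation uses nlerp with the hemisphere check rather than slerp.

```rust
#[derive(Clone, Copy)]
struct Quat {
    x: f32,
    y: f32,
    z: f32,
    w: f32,
}

/// One joint's local pose; scale is left out of this sketch.
#[derive(Clone, Copy)]
struct JointPose {
    translation: [f32; 3],
    rotation: Quat,
}

/// Normalized lerp along the shorter arc.
fn nlerp(a: Quat, b: Quat, t: f32) -> Quat {
    let dot = a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
    // q and -q encode the same rotation; flip b into a's hemisphere first.
    let sign = if dot < 0.0 { -1.0 } else { 1.0 };
    let x = a.x + t * (sign * b.x - a.x);
    let y = a.y + t * (sign * b.y - a.y);
    let z = a.z + t * (sign * b.z - a.z);
    let w = a.w + t * (sign * b.w - a.w);
    let inv_len = 1.0 / (x * x + y * y + z * z + w * w).sqrt();
    Quat { x: x * inv_len, y: y * inv_len, z: z * inv_len, w: w * inv_len }
}

/// Blends two poses joint by joint, in local space. Globals are rebuilt afterwards.
fn blend_local_poses(a: &[JointPose], b: &[JointPose], t: f32) -> Vec<JointPose> {
    a.iter()
        .zip(b.iter())
        .map(|(pa, pb)| {
            let mut translation = [0.0f32; 3];
            for axis in 0..3 {
                translation[axis] =
                    pa.translation[axis] + t * (pb.translation[axis] - pa.translation[axis]);
            }
            JointPose { translation, rotation: nlerp(pa.rotation, pb.rotation, t) }
        })
        .collect()
}
```

The blended local poses then go through the same parent-first walk as any sampled pose; the blend never touches a global matrix.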
[Interactive demo: blending two poses under different interpolation modes; the wrong modes take the wrong path or shrink the limb.]
08 The GPU side & glTF
Each frame the CPU builds the matrix palette (one skinning matrix per joint) and uploads it; the vertex shader reads the per-vertex joint indices and weights, sums the four palette matrices, and transforms position and normal. Skinning happens in the vertex shader.
Vulkan's guaranteed-minimum maxUniformBufferRange is 16 KiB on 1.3, and a mat4 is 64 bytes, so a portable UBO palette holds 256 matrices[11]. Fine for one ~100-joint skeleton, but many skinned instances in one buffer need an SSBO (minimum range 128 MiB), indexed by instanceID · jointCount + joint. This is the exact UBO-vs-SSBO trade from Going 3D, and the boundary where VAT starts to win for crowds.
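The sizing arithmetic and the flat indexing scheme are small enough to write down. A sketch, with hypothetical helper names:

```rust
/// Size of one mat4 in bytes: 16 floats of 4 bytes each.
const MAT4_BYTES: usize = 64;

/// How many mat4 palette entries fit in a buffer range of `range_bytes`.
fn palette_capacity(range_bytes: usize) -> usize {
    range_bytes / MAT4_BYTES
}

/// Flat index of one joint's matrix when many instances share one storage buffer,
/// assuming every instance uses the same `joint_count`.
fn palette_index(instance_id: usize, joint_count: usize, joint: usize) -> usize {
    instance_id * joint_count + joint
}
```

With the guaranteed minimums quoted above, a 16 KiB uniform range holds 256 matrices, enough for two whole 100-joint skeletons, while a 128 MiB storage range holds 2,097,152 matrices, or 20,971 such skeletons.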
The vertex shader (GLSL, Vulkan):

    #version 450
    layout(location = 0) in vec3 in_position;
    layout(location = 1) in vec3 in_normal;
    layout(location = 2) in uvec4 in_jointIndices;  // glTF JOINTS_0 (4 per set, unsigned)
    layout(location = 3) in vec4 in_jointWeights;   // glTF WEIGHTS_0 (sum to 1)
    layout(location = 0) out vec3 out_normal;
    layout(std430, set = 0, binding = 0) readonly buffer Palette {
        mat4 skinningMatrix[];  // globalPose[j] * inverseBind[j], one per joint
    } palette;
    // Camera block: the set/binding numbers are assumptions, match your own descriptor layout.
    layout(set = 0, binding = 1) uniform Camera {
        mat4 viewProj;
    } camera;
    void main() {
        mat4 skin = in_jointWeights.x * palette.skinningMatrix[in_jointIndices.x]
                  + in_jointWeights.y * palette.skinningMatrix[in_jointIndices.y]
                  + in_jointWeights.z * palette.skinningMatrix[in_jointIndices.z]
                  + in_jointWeights.w * palette.skinningMatrix[in_jointIndices.w];  // LBS
        vec4 skinnedPos = skin * vec4(in_position, 1.0);
        out_normal = normalize(mat3(skin) * in_normal);  // rigid approx; invT under non-uniform scale
        gl_Position = camera.viewProj * skinnedPos;
    }
A glTF skin has a joints array (node indices), an inverseBindMatrices accessor (one mat4 per joint, same order), and an optional skeleton root hint (it doesn't change the math)[2]. The skinned primitive carries JOINTS_0 (a VEC4 of indices into skin.joints, not into the node array; mixing the two up is a classic indexing bug) and WEIGHTS_0. The full joint matrix includes inverse(meshNodeGlobal) on the left, but that's identity in the common case, which is why most engines implement just globalJoint · inverseBind[3]. LearnOpenGL's GLSL uses the cheap mat3 normal path[7].
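The two levels of indirection are easy to confuse, so it can help to make the lookup explicit. In this sketch `skin_joints` stands for the skin's `joints` array and `joint_node` is a hypothetical helper:

```rust
/// Resolves one JOINTS_0 component to the scene node it refers to.
/// `joint_slot` indexes INTO `skin_joints` (and likewise into the inverse bind
/// matrices and the palette); the value stored there is the node index.
/// Returns None if the slot is out of range for this skin.
fn joint_node(skin_joints: &[usize], joint_slot: u16) -> Option<usize> {
    skin_joints.get(joint_slot as usize).copied()
}
```

For a skin whose joints array is [7, 3, 12], a vertex with joint slot 1 is driven by node 3, and its palette entry is palette[1], not palette[3].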
09 IK & root motion
Forward kinematics computes positions from joint angles (everything above). Inverse kinematics is the reverse: given a target (put the foot on the ground, the hand on the doorknob), find the joint angles that reach it. IK modifies the pose before the skinning palette is built.
- Two-bone (limb) IK is analytic: the interior elbow angle comes from the law of cosines, acos((l1² + l2² − D²) / (2·l1·l2)) for target distance D, then orient the triangle toward the target[9]. Because cos has two solutions (elbow up/down), a pole/bend vector picks the plane.
- General chains use iterative solvers: FABRIK repositions joints along the lines between neighbors in two passes, a forward pass that pins the end effector to the target and works back toward the base, then a backward pass that re-pins the base and works out toward the tip, converging in a few iterations with no matrices[10]; CCD rotates one joint at a time from the end inward.
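The law-of-cosines step for the two-bone case can be sketched as below. The clamping of unreachable targets is an added assumption (the text does not say how to handle them), and `elbow_interior_angle` is an illustrative name.

```rust
/// Interior elbow angle in radians for a two-bone limb with bone lengths
/// `l1` and `l2` (assumed positive) reaching a target at distance `d`.
/// The distance is clamped to the reachable range so `acos` stays in its
/// domain when the target is too far away or too close.
fn elbow_interior_angle(l1: f32, l2: f32, d: f32) -> f32 {
    let d = d.clamp((l1 - l2).abs(), l1 + l2);
    let cos_angle = (l1 * l1 + l2 * l2 - d * d) / (2.0 * l1 * l2);
    // Guard against tiny floating-point overshoot past [-1, 1].
    cos_angle.clamp(-1.0, 1.0).acos()
}
```

A 3-4-5 triangle gives a right angle at the elbow, and any target at or beyond the full reach of 7 gives π, a straight arm. Choosing which side the elbow bends toward is still the job of the pole vector.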
In-place: the root stays at the origin and gameplay code moves the character. Root motion: the animation's root bone translates within the clip, and the engine drives the character's world transform from that root delta, so the visual stride matches the displacement[12]. A common cause of foot sliding is an in-place clip whose authored stride speed differs from the code-driven movement speed; root motion fixes that mismatch (foot-lock IK fixes the rest). Root motion isn't strictly better, it complicates networking, so engines often use in-place for locomotion and root motion for discrete actions.
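Extracting the root delta each frame might look like this. It is a sketch that assumes a looping clip and a caller-supplied sampler for the root's translation in clip space; the wrap handling and all names are illustrative, and rotating the delta into the character's facing is left out.

```rust
type Vec3 = [f32; 3];

fn sub(a: Vec3, b: Vec3) -> Vec3 {
    [a[0] - b[0], a[1] - b[1], a[2] - b[2]]
}

fn add(a: Vec3, b: Vec3) -> Vec3 {
    [a[0] + b[0], a[1] + b[1], a[2] + b[2]]
}

/// Root displacement between two clip times, given a sampler for the root's
/// translation. If the clip looped between frames (`time < prev_time`), the
/// delta is the tail of the old cycle plus the head of the new one.
fn root_delta(
    sample_root: impl Fn(f32) -> Vec3,
    prev_time: f32,
    time: f32,
    duration: f32,
) -> Vec3 {
    if time >= prev_time {
        sub(sample_root(time), sample_root(prev_time))
    } else {
        let tail = sub(sample_root(duration), sample_root(prev_time));
        let head = sub(sample_root(time), sample_root(0.0));
        add(tail, head)
    }
}
```

The engine adds this delta to the character's world transform and then strips the root translation from the pose, so the mesh does not move twice.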
10 Pitfalls
Two symptoms are commonly misdiagnosed. An exploding mesh on playback points to the inverse bind matrix being omitted or multiplied on the wrong side, not to the shader stage or the UBO size. Shrinkage in the middle of a blend points to the wrong rotation interpolation or to blending global matrices, not to the blend factor and not to DQS.
11 What's next
Characters now move. The series turns from rendering and animation to how a game is structured: the next module is The Gameplay Layer (the object/component model, events, and embedded scripting), then AI (navmesh and steering, building on the existing pathfinding and behavior-tree tutorials), networking, tooling, and the 3D-game capstone. The full path is on the series hub.
References
- Jason Gregory. Game Engine Architecture, 3rd ed., ch. 12 (Animation Systems): skeletons, poses, clips, the matrix palette, and blending. gameenginebook.com. The canonical engine treatment.
- The Khronos Group. glTF 2.0 Specification, §3.7.3 Skins. registry.khronos.org. The mandated skinning data model: joints, inverseBindMatrices, JOINTS_0/WEIGHTS_0, and the weight constraints.
- The Khronos Group. glTF Tutorial, "Skins." github.khronos.org. The joint-matrix formula and the exact GLSL weighted sum (and that the mesh-node transform is ignored).
- Tomas Akenine-Möller, Eric Haines, Naty Hoffman, et al. Real-Time Rendering, 4th ed., §4.4 Vertex Blending. realtimerendering.com. LBS as the most common method; the weighted-average formula; DQS and corrective shapes.
- Ladislav Kavan, Steven Collins, Jiří Žára, Carol O'Sullivan. "Geometric Skinning with Approximate Dual Quaternion Blending." ACM TOG 2008. users.cs.utah.edu. DQS preserves rigidity (no candy-wrapper) at LBS-comparable cost, with a slight bulge at joints.
- Alec Jacobson, Zhigang Deng, Ladislav Kavan, J.P. Lewis. "Skinning: Real-time Shape Deformation." SIGGRAPH 2014 Course. skinning.org. The rank-1-projection derivation of the candy-wrapper collapse.
- Joey de Vries. LearnOpenGL, "Skeletal Animation." learnopengl.com. A hands-on vertex-shader skinning walkthrough (bone IDs, weights, the offset/inverse-bind matrix).
- Jonathan Blow. "Understanding Slerp, Then Not Using It." number-none.com. The slerp/nlerp property triad: constant-velocity vs commutative vs torque-minimal.
- Daniel Holden. "Simple Two Joint IK." theorangeduck.com. Closed-form two-bone IK via the law of cosines, with the bend-axis vector.
- Andreas Aristidou and Joan Lasenby. "FABRIK: A fast, iterative solver for the Inverse Kinematics problem." Graphical Models 2011. andreasaristidou.com. The forward/backward point-on-line iterative chain solver.
- The Khronos Group. Vulkan Specification, Required Limits. docs.vulkan.org. maxUniformBufferRange guaranteed minimum 16384 bytes (256 mat4); maxStorageBufferRange minimum 128 MiB.
- Unity Technologies. "How Root Motion works." docs.unity3d.com. The root bone driving the character transform; the in-place speed-mismatch foot-sliding cause.