Tutorial 01 · Machine Learning

Neural Networks
from Scratch

A working classifier, built from a single artificial neuron up to a network that draws decision boundaries through 2D data, trained live in your browser. No frameworks, no hand-waving. The math, the code, and the demos all in one place.

Time~35 min LevelBeginner-friendly PrereqsComfortable with any programming language MathHigh-school algebra; calculus and basic linear-algebra notation help, but the prose carries you if not

01Why neural networks?

Most of the code you've written tells a computer exactly what to do, step by step, branch by branch. Neural networks flip that around. You don't write the rules. You write a flexible function with millions of dials, then turn the dials automatically until the function does what you want. The shift from programming rules to fitting them is what this whole field is about.

Take an example that's notoriously hard to hand-code: deciding whether a photo contains a cat. You can't write if (pixels look cat-shaped). There's no rule that survives contact with kittens, sphinxes, plush toys, or weird camera angles. But you can collect a few thousand labeled photos, define a function with a few million tunable parameters, and let an optimizer find the parameters that classify the training set correctly. With enough capacity and data, it generalizes to photos it has never seen.

The earliest version of this idea was Frank Rosenblatt's perceptron in 1958^[1]. It worked on toy problems, hit a wall when researchers proved it couldn't even compute XOR, and the field went quiet for a decade. Then in 1986 Rumelhart, Hinton, and Williams published Learning representations by back-propagating errors^[2], which is the paper most credit with launching the modern era of deep learning.

A footnote on the standard story: backpropagation had already been worked out by Seppo Linnainmaa in 1970 (as reverse-mode automatic differentiation)^[16], and Paul Werbos applied it to neural networks in 1974. Rumelhart-Hinton-Williams was not the invention; it was the moment the idea finally landed, in a venue people actually read, with experiments people couldn't ignore. That gap, from "the math exists" to "the field cares," shows up over and over in this story.

What you'll have by the end

A ~100-line classifier that learns the spirals, circles, and XOR datasets, written four times in parallel: JavaScript, Python, C++, and Rust. Pick whichever language you prefer and follow along. It trains live on this page, with controls for learning rate, hidden layer size, and activation function.

02The neuron: a tiny decision maker

An artificial does three things, in this order: it takes some numbers in, mixes them together with a weighted sum, and decides whether to "fire" based on whether the sum crosses a threshold.

Concretely: given input numbers x₁, x₂, …, x_n, the neuron has a matching set of weights w₁, w₂, …, w_n and a bias b. The neuron computes a weighted sum:

eq. 1 \cdot pre-activation z = w 1 x 1 + w 2 x 2 + \dots + w n x n + b

Take each input, multiply it by its weight, add all those products together, then add the bias. The result z is a single number. Hover any letter above to see what it means.

In vector form, z = w · x + b. That's just the dot product of two arrays plus a constant. Three lines of code in any language.

Need a refresher on weighted sums & dot products?

A weighted sum is exactly what it sounds like: a sum where each thing being added gets multiplied by its own number first. If a student's final grade is 0.6 × exam + 0.3 × homework + 0.1 × attendance, those 0.6, 0.3, 0.1 are weights.

The dot product is the weighted sum written compactly. Given two arrays w = [0.6, 0.3, 0.1] and x = [85, 92, 100], the dot product is 0.6×85 + 0.3×92 + 0.1×100 = 88.6. In JavaScript: w.reduce((s, wi, i) => s + wi * x[i], 0).

For a neuron, w is the array of weights, x is the array of inputs, and the dot product plus the bias gives the pre-activation z.

Then the neuron applies an activation function to z to decide its output. Rosenblatt's perceptron used the simplest one possible, a hard threshold:

eq. 2 \cdot perceptron output output = 1 if z \geq 0 0 if z < 0

If the pre-activation z is zero or positive, the neuron fires and outputs 1. If z is negative, it stays silent and outputs 0. That's the original perceptron rule, in one if-statement.

In 2D this has a clean geometric meaning. The equation w₁x₁ + w₂x₂ + b = 0 describes a straight line. Points on one side give z > 0 and the neuron fires; points on the other side give z < 0 and it stays silent. The weights tilt and rotate the line; the bias shifts it.

Play with the controls. Every move changes the line, which changes how the point is classified:

w₁ · weight on x₁ 1.00

w₂ · weight on x₂ 1.00

b · bias −0.30

z = w·x + b

0.000

output

fires (1)

One neuron, one straight line. That is the entire expressive power of a single perceptron. It can separate points that are linearly separable, and nothing else. No choice of weights makes a perceptron compute XOR^[3]. That limitation is why we have to stack neurons. But before we get there, we need a fix for a more immediate problem.

03Smoothing it out: the sigmoid neuron

A perceptron has a problem: its output jumps. Push z from −0.001 to +0.001 and the output flips from 0 to 1. There's no notion of "almost firing" or "barely firing." That makes the neuron impossible to nudge: a tiny weight change either changes nothing at all, or changes everything.

Learning requires gentle nudges: small, predictable changes in weights producing small, predictable changes in output. So Rumelhart and friends replaced the hard threshold with a smooth S-shaped curve called the sigmoid:

eq. 3 \cdot sigmoid σ (z) = 1 1 + e - z

Plug z into this and you always get a number between 0 and 1. Very large z → answer near 1. Very negative z → answer near 0. Right at z = 0, the answer is exactly 0.5.

The sigmoid takes any real number and squashes it into the range (0, 1). When z is very positive, σ(z) is near 1. When z is very negative, it's near 0. In the middle, around z = 0, the curve is gently sloped, which means nudging z by a little nudges the output by a little. That gentle sloping is what makes learning possible.

The slope itself matters a lot; we'll need it for backpropagation in a few sections. The derivative of the sigmoid has a tidy closed form:

eq. 4 \cdot sigmoid derivative σ' (z) = σ (z) \cdot (1 - σ (z))

The slope of the sigmoid at any input equals the sigmoid's value at that input, times one minus that value. Handy trick: if we already know σ(z), we get its slope for free.

It's maximal (0.25) right at the inflection point z = 0 and decays toward zero in either direction. Try it: toggle "show derivative" and slide the input.

Wait, what's a derivative again?

A derivative answers one question: if I nudge the input a tiny bit, how much does the output change? It's the slope of the function's graph at that point. Steep = big change, flat = small change.

For f(x) = x², the derivative is f′(x) = 2x. At x = 3, the slope is 6, meaning bumping x from 3.000 to 3.001 bumps the output by about 0.006.

For learning, derivatives are how we know which direction to move the weights to reduce the loss. Without them, we'd be guessing blindly.

The prime symbol (′) is shorthand for "the derivative of." So σ′(z) reads as "sigma-prime of z," meaning "the slope of the sigmoid at z."

input z 0.00

show derivative f′(z)

f(z)

0.5000

f′(z)

0.2500

f′(z) is the slope of the curve: how much the output changes per unit of input. We need it to know which way to nudge the weights.

04Stacking neurons into networks

One neuron can carve the input space with one straight line. Two neurons in parallel can carve it with two lines. Stack a second layer of neurons that takes the first layer's outputs as their inputs, and you can carve out arbitrarily complex regions: circles, spirals, anything.

A feedforward neural network is exactly that: layers of neurons, where every neuron in layer l takes inputs from every neuron in layer l − 1. The first layer is the input layer, the last is the output layer, and the ones in between are hidden layers.

The diagram above is a 2-3-2 network: two inputs, three hidden neurons, two outputs. Every line is one weight. Every circle (except the inputs) holds one bias. The whole thing is a function. Input vector in, output vector out.

We'll use a compact notation that makes the math much easier to write. For layer l, collect all that layer's weights into a matrix W^(l), all the biases into a vector b^(l), and all the activations into a vector a^(l). The forward pass through one layer is then:

eq. 5 \cdot layer forward pass z (l) = W (l) a (l-1) + b (l) a (l) = σ (z (l))

For each layer: (1) multiply the previous layer's outputs by this layer's weight matrix, (2) add this layer's biases, getting the pre-activations z. Then (3) apply the activation function σ to each pre-activation, getting this layer's outputs. The superscript (l) just labels which layer we're talking about.

Two operations per layer: a matrix-vector multiply plus a vector add, then an element-wise activation. The whole forward pass is a few of those in a row, a function that turns an input vector into an output vector with the network's parameters baked in.

Refresher: matrix-vector multiplication

A matrix is a rectangular grid of numbers. A vector is a column (or row) of numbers. Multiplying a matrix by a vector produces a new vector.

If W has 3 rows and 2 columns, and x is a 2-element vector, then W × x is a 3-element vector. Each output element is the dot product of one row of W with x.

For neural networks, that's exactly what's happening: each row of the weight matrix corresponds to one neuron, and dotting it with the previous layer's outputs gives that neuron's pre-activation. Three rows = three neurons in this layer.

In JavaScript: z[i] = W[i].reduce((s, w, j) => s + w * x[j], 0) + b[i].

Universality

A network with just one hidden layer can approximate any continuous function to arbitrary precision, given enough neurons. This is the universal approximation theorem (Cybenko 1989, Hornik 1991). In practice, deeper networks with fewer neurons per layer learn far better than shallow wide ones. That's most of why "deep learning" is a thing.

05Forward propagation

Forward propagation is just running the equations above from input to output. Nothing tricky. We compute z^(l) and a^(l) one layer at a time and pass the activations forward.

Step through it on the network below. Set the inputs, then click STEP to advance one phase at a time, or RUN to animate the whole pass. The output layer applies softmax instead of sigmoid because we want a probability distribution over two classes (more on this in the next section):

x₁ · input 1

x₂ · input 2

click STEP to advance, or RUN to animate.

In code, forward propagation is barely more than a couple of nested loops:

forward.js · pseudo-implementation

function forward(network, x) {
  let a = x;                              // activations of the previous layer
  const activations = [a];                // keep every layer's activations for backprop

  for (let l = 0; l < network.W.length; l++) {
    const W = network.W[l];               // weight matrix for this layer
    const b = network.B[l];               // bias vector for this layer
    const z = matVecMul(W, a).map((zi, i) => zi + b[i]);
    a = z.map(sigmoid);                  // element-wise activation
    activations.push(a);
  }
  return activations;                      // last entry is the network's output
}

activations[activations.length - 1] is the prediction. We keep the intermediate values around because we'll need them when we compute gradients.

06Measuring wrongness: loss functions

A network with random weights makes random predictions. To improve it, we need a single number telling us how wrong it is. Then we can do calculus on that number to figure out which way to nudge each weight to make it smaller. That single number is called the loss (or cost).

Mean squared error

The simplest loss function, used in Nielsen's book^[4] and most introductory examples, is mean squared error. For an example with true output y and prediction ŷ:

eq. 6 \cdot MSE C = 1 2 n \sum x ‖ y(x) - ŷ(x) ‖ 2

For each training example, compute how far the prediction is from the truth, square that distance, then average across all examples (the ½ is a cosmetic factor that makes the derivative tidier). Bigger loss = more wrong.

Average the squared distance between prediction and target, over all training examples. The ½ is there for convenience: it cancels out when we take the derivative, so we get (ŷ − y) instead of 2(ŷ − y).

What does Σ (the big capital sigma) actually mean?

Σ is shorthand for "sum up the following expression for every value of the index." The little symbol underneath says what to vary; the expression to its right is what gets added.

Plain English: ∑_{i=1}^{3} i² means "add up i² for i from 1 to 3," which is 1 + 4 + 9 = 14. In JS that's [1,2,3].reduce((s, i) => s + i*i, 0).

So ∑_x ‖y(x) − ŷ(x)‖² reads as: "loop over every training example x; for each one, compute the squared error; add them all up."

Cross-entropy: the better choice for classification

MSE works, but for classification problems there's a better loss called cross-entropy. The intuition: instead of measuring numeric distance, measure how surprised the model is by the correct answer.

First we turn the raw output numbers (called logits) into a probability distribution using softmax:

eq. 7 \cdot softmax softmax (z) i = e z i \sum j e z j

Raise e to the power of each logit (this makes them positive), then divide each by the total. The result is a list of probabilities that sum to 1 and reflect the relative size of the original logits.

Exponentiate each logit (forces it positive), then divide by the sum so the whole vector adds to 1. Now we have probabilities. The cross-entropy loss is then:

eq. 8 \cdot cross-entropy C = - \sum i y i log (ŷ i)

Only one term in the sum is nonzero (the one for the correct class, where y_i = 1). The loss is simply −log(prediction for the correct class). If the model assigns high probability to the right answer, log is near 0 and the loss is small. If it assigns near 0, log goes to −∞ and the loss explodes.

What's "log"? Why use it for loss?

log here is the natural logarithm (sometimes written ln): the inverse of e^x. So log(1) = 0, log(0.5) ≈ −0.69, log(0.1) ≈ −2.30, and log approaches −∞ as the input approaches 0.

Why use it for loss? Three reasons. (1) It turns multiplications into additions, which is mathematically convenient. (2) It punishes "confidently wrong" answers very heavily. Predicting 0.001 for the correct class gives a loss around 7, while predicting 0.1 gives only 2.3. (3) Paired with softmax, the gradient comes out to prediction − truth, the cleanest possible signal.

If the true class is class 0 (so y = [1, 0]) and the model predicts ŷ = [0.9, 0.1], the loss is −log(0.9) ≈ 0.105. If the model says ŷ = [0.1, 0.9], the loss is −log(0.1) ≈ 2.303. Confident wrong answers are punished heavily; confident right ones are rewarded. That asymmetry is what we want during training.

Softmax + cross-entropy

When you pair softmax with cross-entropy and work through the calculus, almost everything cancels. You're left with one line: the gradient of the loss with respect to the logits is (ŷ − y). Predicted minus actual. The expression is short and its gradient never vanishes, which is why every modern classifier uses this pairing.

07Learning by descent

Now we have a function (the loss) that takes the network's weights and returns a single number. Training means finding the weights that make that number small. With millions of weights, brute force is hopeless; we need a strategy. The strategy is gradient descent.

Imagine you're standing on a hilly landscape in dense fog. You can't see the bottom of the valley. But you can feel which way is downhill under your feet. That direction is the gradient. Take a small step in that direction, look again, take another step. Repeat. Eventually you reach a low point.

The mathematical version is exactly that. For each weight w, we compute ∂C / ∂w, which is how much the loss changes per unit of weight change. (If the curly-d notation is new to you, expand the box below before reading on.) Then we nudge:

What's a partial derivative (∂)?

A regular derivative df/dx answers: "if I change x by a tiny bit, how does f change?" It works when f is a function of just one variable.

When f depends on many variables (like a loss that depends on every weight in the network), we need to ask: "if I change just w by a tiny bit and hold everything else fixed, how does f change?" That's a partial derivative, written with a curly ∂ instead of a straight d.

So ∂C / ∂w reads as: "the partial derivative of the loss C with respect to the weight w." It's a single number telling us how sensitive the loss is to this particular weight, treating all other weights as constants.

The full vector of all these partials is the gradient. It points in the direction of steepest increase, so we move in the opposite direction to go downhill.

eq. 9 \cdot gradient descent update w \leftarrow w - η \cdot \partialC \partialw

The arrow ← means "assign." This line replaces the old weight with: old weight, minus a small fraction (η) of how much the loss responds to changes in that weight. Do this for every weight, every step. The loss goes down.

The constant η ("eta") is the learning rate, how big a step to take. Too small and training crawls; too big and you overshoot the valley and oscillate forever. Picking it well is half the art of training neural networks.

Click anywhere on the loss surface below to drop a starting point, then hit DESCEND. Watch the orange path. Try cranking the learning rate up until the trajectory goes berserk, then back down. Switch surfaces to see what happens when there's a valley or a saddle:

loss surface

learning rate η 0.18

steps

loss

···

Click on the surface to place a starting point. Bright regions = low loss, dark = high. Green dots mark the minima; on the saddle surface a starting point exactly on the b = 0 line will stall there before tipping into one of the two wells.

Stochastic gradient descent

Computing the gradient over the entire training set is expensive. The variant that makes neural nets practical is stochastic gradient descent (SGD): estimate the gradient from a small random batch of examples (a "mini-batch"), update the weights, then grab another batch. The noisy gradient is fine, and it can even help by knocking the optimizer out of bad local minima.

The same landscape, in three dimensions

The contour map above is what we usually draw on paper. But the actual loss surface is a 3D object, with the loss value as height. Drag the surface below to rotate it. Hit Descend and watch the path slide down the side of the hill in real space:

surface

learning rate η 0.150

Drag the surface to rotate it. The brighter the color, the higher the loss. The yellow trace shows the path of gradient descent.

08Beyond SGD: smarter optimizers

Plain SGD has problems. On a long narrow valley it zigzags between the walls instead of running down the floor. At a saddle point it slows almost to a stop. On a flat region it crawls. Real loss landscapes have all three. Modern deep learning training rarely uses raw SGD anymore. It uses adaptive variants that take the gradient and process it through a small running memory before applying it.

Momentum: a ball with mass

Momentum keeps a running average of the recent gradient direction and uses that average to update the weights. The intuition: vanilla SGD is a marble that stops the instant you stop pushing it. Momentum SGD is a ball with mass, it picks up speed in directions where the gradient consistently points the same way, and dampens directions where the gradient keeps flipping sign.

Momentum SGD v \leftarrow β v - η \cdot \nablaC w \leftarrow w + v

Each step, blend β (~90%) of the previous velocity with a fresh nudge from the current gradient. Then move along that blended velocity. In a long valley, the gradient repeatedly points down the valley floor; momentum accumulates speed in that direction. In zig-zag directions, alternating gradients cancel each other out.

RMSProp: adapt the step size per parameter

Different parameters can have wildly different gradient scales. A weight in the output layer might routinely see gradients of size 1.0; a weight five layers deep might see 0.001. Using the same learning rate for both is a compromise neither one wants. RMSProp^[12] tracks a running average of the squared gradient for each parameter and divides each update by the square root of that average. Big-gradient parameters get smaller steps; small-gradient parameters get bigger ones.

RMSProp s \leftarrow ρ s + (1 - ρ) \cdot (\nabla C) 2 w \leftarrow w - η \cdot \nabla C s + ε

Maintain an exponential moving average of how big each parameter's gradient has been recently. Then scale each update by 1 over the square root of that. The tiny ε (typically 10⁻⁸) prevents division by zero when a parameter has never moved.

Adam: momentum + RMSProp + bias correction

Adam^[13] combines the previous two ideas. It keeps both a momentum (first-moment) running average and an RMSProp-style (second-moment) running average. Then it applies bias correction so the early-training estimates aren't biased toward zero. Adam is the default optimizer in nearly every modern deep learning project.

Adam m \leftarrow β 1 m + (1 - β 1) \cdot \nabla C v \leftarrow β 2 v + (1 - β 2) \cdot (\nabla C) 2 m̂ = m 1 - β 1 t, v̂ = v 1 - β 2 t w \leftarrow w - η \cdot m̂ v̂ + ε

Track both how the gradient has been pointing on average (m) and how big it has been on average (v). Correct for the fact that both averages start at zero (the 1 − β^t terms). Take a momentum-style step, scaled per-parameter by the inverse of the running gradient magnitude.

Race the four optimizers down the same surface and the differences are obvious. On the Rosenbrock function (a banana-shaped valley that is a standard stress test), plain SGD struggles to follow the curve, Momentum overshoots, and Adam glides through.

surface

learning rate η 0.050

SGD Momentum RMSProp Adam step 0

09Backpropagation: how we actually get the gradients

Gradient descent needs ∂C/∂w for every weight in the network, potentially millions of them. Naively, you'd think we have to recompute the whole forward pass once per weight to estimate each partial derivative. That would be hopeless. is the algorithm that gets us all the gradients in one efficient sweep, by reusing intermediate values from the forward pass and chaining derivatives layer by layer.

Define δ_j^(l) (pronounced "delta-j-l") as the error at neuron j in layer l: how much the loss changes per unit of the neuron's pre-activation.

Nielsen distills backprop into four equations^[4]:

BP1 \cdot output layer error (general form) δ (L) = \nabla a C ⊙ σ'(z (L))

Output error = (how loss responds to output activations) ⊙ (activation slope at the output). This is the textbook form, written for any output activation + any loss. It's what you'd use for sigmoid + mean-squared error.

For us, the output layer has no activation (the code emits raw logits) and the loss is softmax + cross-entropy. Run those two through the chain rule and the σ′ piece, the softmax Jacobian, and most of the algebra cancel. What you're left with is a one-liner that we already met in §6:

BP1' \cdot softmax + cross-entropy shortcut δ (L) = ŷ - y

Predicted probabilities minus the one-hot label. No σ′ to evaluate, no Jacobian to assemble. This shortcut is the reason almost every modern classifier pairs softmax with cross-entropy. The code in §12 uses exactly this line.

BP2 \cdot propagate error backward δ (l) = ((W (l+1)) ⊤ δ (l+1)) ⊙ σ'(z (l))

Error at layer l = (next layer's weights transposed × next layer's error), times this layer's activation slope. The transpose "flips" the weights so we can send the error back through them. This is the heart of backprop: we already computed δ for layer l+1, so we can use it to compute δ for layer l. Then repeat all the way to the input.

BP3 \cdot gradient wrt. bias \partialC \partialb j (l) = δ j (l)

The bias gradient is just the error itself. No multiplication needed. The bias adds 1 unit per 1 unit of change, so its sensitivity is exactly the error of its neuron.

BP4 \cdot gradient wrt. weight \partialC \partialw jk (l) = a k (l-1) δ j (l)

Weight gradient = (upstream activation) × (downstream error). A weight only matters as much as the input flowing through it (activation of the source neuron) times how off-target its destination neuron is (the error δ at the destination).

The symbol ⊙ is element-wise (Hadamard) product: multiply two equal-length vectors slot by slot. ∇_aC in BP1 is the gradient of the loss with respect to the output post-activations, and depends on which loss you use; for the softmax + cross-entropy combination we use here, the whole BP1 expression collapses to BP1′ above.

The chain rule, in one minute

The chain rule is the calculus law that lets us differentiate composed functions. If y = f(g(x)), then dy/dx = f′(g(x)) × g′(x). In words: multiply the slopes through the chain.

A neural network is a long chain of compositions: input → weights → activation → weights → activation → … → loss. To get the slope of the loss with respect to an early weight, the chain rule says: multiply together the slopes of every step in between.

That multiplication chain is exactly what BP2 is doing: pulling the error δ through the network step by step, multiplying by each layer's activation slope on the way. Backpropagation is just the chain rule, applied carefully so we reuse computation instead of redoing it for every weight.

Here's what each equation is actually saying:

BP1. The textbook form, valid for any output activation and loss. With softmax + cross-entropy, BP1 collapses to BP1′: the output error is simply predicted probabilities minus the one-hot target, with no activation-slope factor at all.
BP2. To get the error at the layer below, take the next layer's error, project it back through the weights (transpose multiplication), and weight by that layer's activation slope.
BP3. The bias gradient is the error itself, because the bias affects only that one neuron's pre-activation, with derivative 1.
BP4. The weight gradient is the error of the downstream neuron times the activation of the upstream neuron. Credit assignment between two neurons.

The four equations are all consequences of the chain rule from multivariable calculus^[4]. Once you have them, the algorithm writes itself:

Run a forward pass; cache every z^(l) and a^(l).
Compute δ^(L) at the output using BP1.
For l = L − 1 down to 1: compute δ^(l) using BP2.
For each layer, accumulate gradients using BP3 and BP4.
Update weights: W^(l) ← W^(l) − η · ∂C/∂W^(l) and similarly for biases.

The whole pass is two matrix-multiplies per layer (one forward, one backward), no matter how big the network. That's the reason this scales.

10Why nobody uses sigmoid anymore

Now that we have backprop in front of us, look back at BP2: every layer multiplies the error by σ′(z). The maximum value of σ′ is 0.25, at the inflection point. Anywhere else, it's much smaller.

Now imagine a 10-layer network. Each layer multiplies the error by something at most 0.25, typically less. After 10 layers: 0.25¹⁰ ≈ 10⁻⁶. The gradient that reaches the first layer is a millionth of the gradient at the output. Those early weights barely move. This is the vanishing gradient problem^[5], and for over a decade it kept deep networks effectively un-trainable.

The fix, popularized around 2010, is to replace sigmoid with the rectified linear unit:

eq. 10 \cdot ReLU ReLU (z) = max (0, z)

If z is positive, output z. If z is zero or negative, output 0. It's a one-line if-statement: return z > 0 ? z : 0. That simplicity is the whole point.

The derivative is 1 for positive inputs and 0 for negative ones. No squashing, no vanishing. The gradient passes through unchanged for any neuron that's "on." Networks that were impossible to train with sigmoid suddenly work fine with ReLU.

ReLU has a known failure mode: a neuron whose pre-activation becomes very negative produces zero output and zero gradient, so it stops learning forever ("dying ReLU"). Leaky ReLU patches this with a small slope below zero (0.1z instead of 0). GELU smooths it out with a Gaussian curve, and is the activation of choice in modern transformers including GPT-2^[6]. Try them in the activation widget above to see the shapes.

Initialization matters too

If all your weights start at zero, every neuron in a layer computes the same thing and learns the same thing. That's a symmetry the gradient can never break, so we initialize randomly. But how big should the random numbers be?

Xavier Glorot showed in 2010^[7] that for sigmoid/tanh networks, weights should be drawn with variance 1/n_in, which keeps activation magnitudes roughly constant from layer to layer. For ReLU networks, Kaiming He showed in 2015^[8] that you need twice that, variance 2/n_in, to compensate for ReLU killing half the activations. Get the initialization wrong and your network either explodes or sits still.

Our code below uses He initialization when the activation is ReLU, Xavier otherwise. One line of code per case.

11The training loop, end to end

Everything above bolts together into a short program. In pseudocode:

training_loop · pseudocode

net = makeNetwork([2, 8, 8, 2])              // shape: 2 in, 2 hidden layers of 8, 2 out
data = loadTrainingData()

for epoch in 1..N:
  shuffle(data)
  for batch in batches(data, size=32):
    grads = zeros_like(net)
    for (x, y) in batch:
      a = forward(net, x)
      g = backprop(net, x, y)         // uses cached forward values
      accumulate(grads, g)
    applyUpdate(net, grads / len(batch), lr=η)
  evaluate(net, validation_set)

The same five concepts (forward pass, loss, gradient via backprop, parameter update, repeat) sit behind everything from LeCun's USPS digit classifier in 1989^[14] to GPT-4 in 2023. The pieces don't get harder; they get bigger. Karpathy's Zero-to-Hero series^[15] uses the same framing: the loop stays the loop, the layers in the middle change.

12A working network, in your language

Below is a complete, runnable implementation in four languages: JavaScript, Python, C++, and Rust. Pick whichever you're most comfortable with. The algorithm is identical in all four; only the syntax changes. Every variable has a descriptive name, every non-obvious line gets a comment.

Following along in an editor?

Copy any version into a file with the matching extension. The JavaScript version works in Node.js or any browser. The Python version is pure stdlib (no NumPy). The C++ version compiles with -std=c++11 or later. The Rust version uses only std with a tiny hand-rolled PRNG (in real code, use the rand crate). Each is around 120 to 170 lines.

network · pick a language ↓

// ──────────────────────────────────────────────────────────────
// network.js · feed-forward neural net trained with backprop + SGD
// ──────────────────────────────────────────────────────────────

// Activation functions, paired with their derivatives.
// Note: each derivative is expressed in terms of the *post-activation*
// value `a`, not the pre-activation `z`. For our three choices this
// makes the derivative cheap to compute from values we already have.
const ACTIVATIONS = {
  relu: {
    fn:                       z => z > 0 ? z : 0,
    derivativeFromActivation: a => a > 0 ? 1 : 0,
  },
  tanh: {
    fn:                       z => Math.tanh(z),
    derivativeFromActivation: a => 1 - a * a,
  },
  sigmoid: {
    fn:                       z => 1 / (1 + Math.exp(-z)),
    derivativeFromActivation: a => a * (1 - a),
  },
};

// Build a fresh network. `layerSizes` is an array of integers,
// e.g. [2, 8, 8, 2] = 2 inputs, two hidden layers of 8, 2 outputs.
function makeNetwork(layerSizes, activationName = 'relu') {
  const weights = [];   // weights[layer][neuron][input]
  const biases  = [];   // biases[layer][neuron]

  for (let layer = 1; layer < layerSizes.length; layer++) {
    const fanIn  = layerSizes[layer - 1];   // inputs coming in
    const fanOut = layerSizes[layer];       // neurons in this layer

    // He init for ReLU, Xavier for everything else. Both keep activations
    // from blowing up or vanishing as data flows through deep nets.
    const scale = activationName === 'relu'
      ? Math.sqrt(2 / fanIn)
      : Math.sqrt(1 / fanIn);

    const layerWeights = [];
    for (let neuron = 0; neuron < fanOut; neuron++) {
      const weightRow = [];
      for (let input = 0; input < fanIn; input++) {
        weightRow.push((Math.random() * 2 - 1) * scale);
      }
      layerWeights.push(weightRow);
    }
    weights.push(layerWeights);
    biases.push(new Array(fanOut).fill(0));   // start biases at zero
  }

  return { layerSizes, weights, biases, activationName };
}

// Forward pass: turn an input vector into an output prediction.
// Returns both the activations (per layer) and the pre-activations,
// because we'll need both for backprop.
function forward(network, inputVector) {
  const activationsPerLayer    = [inputVector.slice()];
  const preActivationsPerLayer = [];
  const activation = ACTIVATIONS[network.activationName];

  for (let layer = 0; layer < network.weights.length; layer++) {
    const layerWeights        = network.weights[layer];
    const layerBiases         = network.biases[layer];
    const previousActivations = activationsPerLayer[activationsPerLayer.length - 1];

    const preActivations    = [];
    const currentActivations = [];
    const isOutputLayer = (layer === network.weights.length - 1);

    for (let neuron = 0; neuron < layerWeights.length; neuron++) {
      // Weighted sum + bias = pre-activation `z`.
      let z = layerBiases[neuron];
      const weightRow = layerWeights[neuron];
      for (let input = 0; input < weightRow.length; input++) {
        z += weightRow[input] * previousActivations[input];
      }
      preActivations.push(z);

      // Output layer stays as raw logits, we apply softmax separately
      // when we need probabilities. Hidden layers use the activation fn.
      currentActivations.push(isOutputLayer ? z : activation.fn(z));
    }

    preActivationsPerLayer.push(preActivations);
    activationsPerLayer.push(currentActivations);
  }

  return { activationsPerLayer, preActivationsPerLayer };
}

// Softmax: turn the raw output logits into a probability distribution.
// Subtracting the max first is a numerical-stability trick, same answer,
// no overflow.
function softmax(logits) {
  const maxLogit = Math.max(...logits);
  const exponentials = logits.map(z => Math.exp(z - maxLogit));
  const total = exponentials.reduce((sum, value) => sum + value, 0);
  return exponentials.map(value => value / total);
}

// Backpropagation for softmax + cross-entropy loss.
// Returns the gradient of the loss with respect to every weight and
// bias in the network, plus the loss itself.
function backprop(network, inputVector, trueClassIndex) {
  const { activationsPerLayer } = forward(network, inputVector);
  const numLayers = network.weights.length;
  const activation = ACTIVATIONS[network.activationName];

  // Allocate zero-initialized gradient arrays with the same shape
  // as the weights and biases.
  const weightGradients = network.weights.map(layer =>
    layer.map(row => row.map(() => 0))
  );
  const biasGradients = network.biases.map(layer => layer.map(() => 0));

  // BP1: at the output, the error simplifies (for softmax+CE) to
  //      (predicted probability) − (one-hot true label).
  const outputLogits     = activationsPerLayer[numLayers];
  const outputProbabilities = softmax(outputLogits);
  let errorAtCurrentLayer = outputProbabilities.slice();
  errorAtCurrentLayer[trueClassIndex] -= 1;

  // Walk backward through the layers, computing gradients as we go.
  for (let layer = numLayers - 1; layer >= 0; layer--) {
    const previousLayerActivations = activationsPerLayer[layer];

    // BP3 + BP4: gradient of the loss wrt this layer's biases and weights.
    for (let neuron = 0; neuron < network.weights[layer].length; neuron++) {
      biasGradients[layer][neuron] = errorAtCurrentLayer[neuron];
      for (let input = 0; input < previousLayerActivations.length; input++) {
        weightGradients[layer][neuron][input] =
          errorAtCurrentLayer[neuron] * previousLayerActivations[input];
      }
    }

    // BP2: propagate the error backward to the previous layer,
    // scaled by the activation function's derivative.
    if (layer > 0) {
      const errorAtPreviousLayer = new Array(network.weights[layer - 1].length).fill(0);
      for (let source = 0; source < errorAtPreviousLayer.length; source++) {
        let weightedError = 0;
        for (let target = 0; target < network.weights[layer].length; target++) {
          weightedError += network.weights[layer][target][source] * errorAtCurrentLayer[target];
        }
        errorAtPreviousLayer[source] =
          weightedError * activation.derivativeFromActivation(activationsPerLayer[layer][source]);
      }
      errorAtCurrentLayer = errorAtPreviousLayer;
    }
  }

  // Cross-entropy loss for this example.
  const loss = -Math.log(Math.max(1e-9, outputProbabilities[trueClassIndex]));
  return { weightGradients, biasGradients, loss, outputProbabilities };
}

// One SGD step: subtract a small multiple of each gradient.
function applyGradients(network, gradients, learningRate) {
  for (let layer = 0; layer < network.weights.length; layer++) {
    for (let neuron = 0; neuron < network.weights[layer].length; neuron++) {
      network.biases[layer][neuron] -= learningRate * gradients.biasGradients[layer][neuron];
      for (let input = 0; input < network.weights[layer][neuron].length; input++) {
        network.weights[layer][neuron][input] -=
          learningRate * gradients.weightGradients[layer][neuron][input];
      }
    }
  }
}

# ───────────────────────────────────────────────────────────────
# network.py · feed-forward neural net trained with backprop + SGD
#
# Pure stdlib, no NumPy. For real work you would use NumPy or PyTorch,
# but the loops here make every operation explicit and easy to follow.
# ───────────────────────────────────────────────────────────────
import math
import random

# Activation functions paired with their derivatives. Each derivative
# is expressed in terms of the post-activation `a`, not the pre-activation `z`,
# so we can compute it cheaply from values we already have.
def relu(z):                                   return z if z > 0 else 0.0
def relu_derivative(a):                        return 1.0 if a > 0 else 0.0
def tanh(z):                                   return math.tanh(z)
def tanh_derivative(a):                        return 1.0 - a * a
def sigmoid(z):                                return 1.0 / (1.0 + math.exp(-z))
def sigmoid_derivative(a):                     return a * (1.0 - a)

ACTIVATIONS = {
    'relu':    (relu,    relu_derivative),
    'tanh':    (tanh,    tanh_derivative),
    'sigmoid': (sigmoid, sigmoid_derivative),
}

def make_network(layer_sizes, activation_name='relu'):
    """Build a network. layer_sizes = [n_in, n_hidden, ..., n_out]."""
    weights = []   # weights[layer][neuron][input]
    biases  = []   # biases[layer][neuron]

    for layer in range(1, len(layer_sizes)):
        fan_in  = layer_sizes[layer - 1]
        fan_out = layer_sizes[layer]

        # He init for ReLU, Xavier for sigmoid/tanh. Both keep activations
        # from exploding or vanishing as signals propagate through deep nets.
        scale = math.sqrt(2 / fan_in) if activation_name == 'relu' else math.sqrt(1 / fan_in)

        layer_weights = [
            [(random.random() * 2 - 1) * scale for _ in range(fan_in)]
            for _ in range(fan_out)
        ]
        weights.append(layer_weights)
        biases.append([0.0] * fan_out)   # biases start at zero

    return {
        'layer_sizes':     layer_sizes,
        'weights':         weights,
        'biases':          biases,
        'activation_name': activation_name,
    }

def forward(network, input_vector):
    """Run one example through the network. Returns activations and pre-activations
    per layer, we need both for backprop."""
    activations_per_layer     = [list(input_vector)]
    pre_activations_per_layer = []
    activation_fn, _ = ACTIVATIONS[network['activation_name']]

    for layer in range(len(network['weights'])):
        layer_weights        = network['weights'][layer]
        layer_biases         = network['biases'][layer]
        previous_activations = activations_per_layer[-1]
        is_output_layer      = (layer == len(network['weights']) - 1)

        pre_activations = []
        new_activations = []
        for neuron_index in range(len(layer_weights)):
            # Weighted sum + bias = pre-activation z.
            z = layer_biases[neuron_index]
            for input_index in range(len(layer_weights[neuron_index])):
                z += layer_weights[neuron_index][input_index] * previous_activations[input_index]
            pre_activations.append(z)

            # Output layer keeps raw logits; hidden layers apply the activation fn.
            new_activations.append(z if is_output_layer else activation_fn(z))

        pre_activations_per_layer.append(pre_activations)
        activations_per_layer.append(new_activations)

    return activations_per_layer, pre_activations_per_layer

def softmax(logits):
    """Turn raw scores into probabilities. Subtracting the max is a stability trick."""
    max_logit    = max(logits)
    exponentials = [math.exp(z - max_logit) for z in logits]
    total        = sum(exponentials)
    return [value / total for value in exponentials]

def backprop(network, input_vector, true_class_index):
    """Compute gradients of the cross-entropy loss with respect to all weights and biases."""
    activations_per_layer, _ = forward(network, input_vector)
    num_layers = len(network['weights'])
    _, activation_derivative = ACTIVATIONS[network['activation_name']]

    # Allocate zero gradients with the same shape as the parameters.
    weight_gradients = [
        [[0.0] * len(row) for row in layer]
        for layer in network['weights']
    ]
    bias_gradients = [[0.0] * len(layer) for layer in network['biases']]

    # BP1: output error for softmax + cross-entropy is just (probs − one_hot).
    output_logits        = activations_per_layer[num_layers]
    output_probabilities = softmax(output_logits)
    error_at_current_layer = list(output_probabilities)
    error_at_current_layer[true_class_index] -= 1.0

    for layer in range(num_layers - 1, -1, -1):
        previous_layer_activations = activations_per_layer[layer]

        # BP3 + BP4: bias and weight gradients for this layer.
        for neuron in range(len(network['weights'][layer])):
            bias_gradients[layer][neuron] = error_at_current_layer[neuron]
            for input_idx in range(len(previous_layer_activations)):
                weight_gradients[layer][neuron][input_idx] = (
                    error_at_current_layer[neuron] * previous_layer_activations[input_idx]
                )

        # BP2: send the error one layer further back, scaled by the activation slope.
        if layer > 0:
            error_at_previous_layer = [0.0] * len(network['weights'][layer - 1])
            for source in range(len(error_at_previous_layer)):
                weighted_error = 0.0
                for target in range(len(network['weights'][layer])):
                    weighted_error += network['weights'][layer][target][source] * error_at_current_layer[target]
                error_at_previous_layer[source] = (
                    weighted_error
                    * activation_derivative(activations_per_layer[layer][source])
                )
            error_at_current_layer = error_at_previous_layer

    loss = -math.log(max(1e-9, output_probabilities[true_class_index]))
    return weight_gradients, bias_gradients, loss, output_probabilities

def apply_gradients(network, weight_gradients, bias_gradients, learning_rate):
    """One SGD step: subtract a small multiple of each gradient."""
    for layer in range(len(network['weights'])):
        for neuron in range(len(network['weights'][layer])):
            network['biases'][layer][neuron] -= learning_rate * bias_gradients[layer][neuron]
            for input_idx in range(len(network['weights'][layer][neuron])):
                network['weights'][layer][neuron][input_idx] -= (
                    learning_rate * weight_gradients[layer][neuron][input_idx]
                )

// ─────────────────────────────────────────────────────────────────
// network.cpp · feed-forward neural net trained with backprop + SGD
// Requires C++11 or later (uses <random>, range-for, auto, etc.).
// Compile: g++ -std=c++11 -O2 network.cpp -o network
// ─────────────────────────────────────────────────────────────────
#include <vector>
#include <cmath>
#include <random>
#include <algorithm>
#include <string>
#include <cstddef>

// Pair an activation function with its derivative-from-activation.
struct Activation {
    double (*apply)(double z);
    double (*derivative_from_activation)(double a);
};

static double relu_fn(double z) { return z > 0.0 ? z : 0.0; }
static double relu_d (double a) { return a > 0.0 ? 1.0 : 0.0; }
static double tanh_fn(double z) { return std::tanh(z); }
static double tanh_d (double a) { return 1.0 - a * a; }
static double sigmoid_fn(double z) { return 1.0 / (1.0 + std::exp(-z)); }
static double sigmoid_d (double a) { return a * (1.0 - a); }

static Activation choose_activation(const std::string& name) {
    if (name == "relu")    return { relu_fn,    relu_d };
    if (name == "tanh")    return { tanh_fn,    tanh_d };
    return { sigmoid_fn, sigmoid_d };
}

struct Network {
    std::vector<int> layer_sizes;
    std::vector<std::vector<std::vector<double>>> weights;  // weights[layer][neuron][input]
    std::vector<std::vector<double>> biases;                 // biases[layer][neuron]
    std::string activation_name;
};

struct ForwardResult {
    std::vector<std::vector<double>> activations;       // includes input as layer 0
    std::vector<std::vector<double>> pre_activations;
};

struct Gradients {
    std::vector<std::vector<std::vector<double>>> weight_gradients;
    std::vector<std::vector<double>> bias_gradients;
    double loss;
    std::vector<double> output_probabilities;
};

Network make_network(const std::vector<int>& layer_sizes,
                     const std::string& activation_name = "relu") {
    Network net;
    net.layer_sizes     = layer_sizes;
    net.activation_name = activation_name;

    std::mt19937 rng(std::random_device{}());
    std::uniform_real_distribution<double> uniform(-1.0, 1.0);

    for (std::size_t layer = 1; layer < layer_sizes.size(); ++layer) {
        int fan_in  = layer_sizes[layer - 1];
        int fan_out = layer_sizes[layer];

        // He for ReLU, Xavier for everything else.
        double scale = (activation_name == "relu")
            ? std::sqrt(2.0 / fan_in)
            : std::sqrt(1.0 / fan_in);

        std::vector<std::vector<double>> layer_weights(fan_out, std::vector<double>(fan_in));
        for (int neuron = 0; neuron < fan_out; ++neuron)
            for (int input = 0; input < fan_in; ++input)
                layer_weights[neuron][input] = uniform(rng) * scale;

        net.weights.push_back(layer_weights);
        net.biases.push_back(std::vector<double>(fan_out, 0.0));
    }
    return net;
}

ForwardResult forward(const Network& net, const std::vector<double>& input_vector) {
    ForwardResult result;
    result.activations.push_back(input_vector);
    Activation activation = choose_activation(net.activation_name);

    for (std::size_t layer = 0; layer < net.weights.size(); ++layer) {
        const auto& layer_weights = net.weights[layer];
        const auto& layer_biases  = net.biases[layer];
        const auto& previous       = result.activations.back();
        const bool is_output_layer = (layer == net.weights.size() - 1);

        std::vector<double> pre_activations(layer_weights.size());
        std::vector<double> activations    (layer_weights.size());

        for (std::size_t neuron = 0; neuron < layer_weights.size(); ++neuron) {
            double z = layer_biases[neuron];
            for (std::size_t input = 0; input < layer_weights[neuron].size(); ++input)
                z += layer_weights[neuron][input] * previous[input];
            pre_activations[neuron] = z;
            activations[neuron]     = is_output_layer ? z : activation.apply(z);
        }

        result.pre_activations.push_back(pre_activations);
        result.activations.push_back(activations);
    }
    return result;
}

std::vector<double> softmax(const std::vector<double>& logits) {
    double max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<double> exponentials(logits.size());
    double total = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        exponentials[i]  = std::exp(logits[i] - max_logit);
        total           += exponentials[i];
    }
    for (double& value : exponentials) value /= total;
    return exponentials;
}

Gradients backprop(const Network& net,
                    const std::vector<double>& input_vector,
                    int true_class_index) {
    ForwardResult fwd = forward(net, input_vector);
    Activation activation = choose_activation(net.activation_name);
    std::size_t num_layers = net.weights.size();

    // Zero-shaped gradient containers.
    Gradients grads;
    grads.weight_gradients.resize(num_layers);
    grads.bias_gradients.resize(num_layers);
    for (std::size_t layer = 0; layer < num_layers; ++layer) {
        grads.weight_gradients[layer].assign(net.weights[layer].size(), {});
        for (std::size_t neuron = 0; neuron < net.weights[layer].size(); ++neuron)
            grads.weight_gradients[layer][neuron].assign(net.weights[layer][neuron].size(), 0.0);
        grads.bias_gradients[layer].assign(net.biases[layer].size(), 0.0);
    }

    // BP1: output error for softmax + cross-entropy is just (probs - one_hot).
    std::vector<double> output_probabilities = softmax(fwd.activations.back());
    std::vector<double> error_at_current_layer = output_probabilities;
    error_at_current_layer[true_class_index] -= 1.0;

    for (int layer = static_cast<int>(num_layers) - 1; layer >= 0; --layer) {
        const auto& previous_layer_activations = fwd.activations[layer];

        // BP3 + BP4: bias and weight gradients for this layer.
        for (std::size_t neuron = 0; neuron < net.weights[layer].size(); ++neuron) {
            grads.bias_gradients[layer][neuron] = error_at_current_layer[neuron];
            for (std::size_t input = 0; input < previous_layer_activations.size(); ++input)
                grads.weight_gradients[layer][neuron][input] =
                    error_at_current_layer[neuron] * previous_layer_activations[input];
        }

        // BP2: propagate error one layer further back.
        if (layer > 0) {
            std::vector<double> error_at_previous_layer(net.weights[layer - 1].size(), 0.0);
            for (std::size_t source = 0; source < error_at_previous_layer.size(); ++source) {
                double weighted_error = 0.0;
                for (std::size_t target = 0; target < net.weights[layer].size(); ++target)
                    weighted_error += net.weights[layer][target][source] * error_at_current_layer[target];
                error_at_previous_layer[source] =
                    weighted_error * activation.derivative_from_activation(fwd.activations[layer][source]);
            }
            error_at_current_layer = error_at_previous_layer;
        }
    }

    grads.loss = -std::log(std::max(1e-9, output_probabilities[true_class_index]));
    grads.output_probabilities = output_probabilities;
    return grads;
}

void apply_gradients(Network& net, const Gradients& grads, double learning_rate) {
    for (std::size_t layer = 0; layer < net.weights.size(); ++layer) {
        for (std::size_t neuron = 0; neuron < net.weights[layer].size(); ++neuron) {
            net.biases[layer][neuron] -= learning_rate * grads.bias_gradients[layer][neuron];
            for (std::size_t input = 0; input < net.weights[layer][neuron].size(); ++input)
                net.weights[layer][neuron][input] -= learning_rate * grads.weight_gradients[layer][neuron][input];
        }
    }
}

// ───────────────────────────────────────────────────────────────────
// network.rs · feed-forward neural net trained with backprop + SGD.
// Uses only `std`. For real work, pull in the `rand` crate and replace
// the tiny PRNG at the bottom of this file.
// ───────────────────────────────────────────────────────────────────

use std::f64;

#[derive(Clone, Copy)]
pub enum Activation { ReLU, Tanh, Sigmoid }

impl Activation {
    /// Apply the activation function to a pre-activation `z`.
    pub fn apply(self, z: f64) -> f64 {
        match self {
            Activation::ReLU    => if z > 0.0 { z } else { 0.0 },
            Activation::Tanh    => z.tanh(),
            Activation::Sigmoid => 1.0 / (1.0 + (-z).exp()),
        }
    }

    /// Derivative expressed in terms of the *post-activation* value `a`.
    /// Cheap to compute from values we already have.
    pub fn derivative_from_activation(self, a: f64) -> f64 {
        match self {
            Activation::ReLU    => if a > 0.0 { 1.0 } else { 0.0 },
            Activation::Tanh    => 1.0 - a * a,
            Activation::Sigmoid => a * (1.0 - a),
        }
    }
}

pub struct Network {
    pub layer_sizes: Vec<usize>,
    pub weights:     Vec<Vec<Vec<f64>>>,  // [layer][neuron][input]
    pub biases:      Vec<Vec<f64>>,
    pub activation:  Activation,
}

pub struct Gradients {
    pub weight_gradients:     Vec<Vec<Vec<f64>>>,
    pub bias_gradients:       Vec<Vec<f64>>,
    pub loss:                 f64,
    pub output_probabilities: Vec<f64>,
}

impl Network {
    /// Build a fresh network with the given shape.
    pub fn new(layer_sizes: &[usize], activation: Activation) -> Self {
        let mut weights: Vec<Vec<Vec<f64>>> = Vec::new();
        let mut biases:  Vec<Vec<f64>>      = Vec::new();

        for layer in 1..layer_sizes.len() {
            let fan_in  = layer_sizes[layer - 1];
            let fan_out = layer_sizes[layer];

            // He init for ReLU, Xavier for everything else.
            let scale = match activation {
                Activation::ReLU => (2.0 / fan_in as f64).sqrt(),
                _                  => (1.0 / fan_in as f64).sqrt(),
            };

            let layer_weights: Vec<Vec<f64>> = (0..fan_out)
                .map(|_| (0..fan_in)
                    .map(|_| (rand_uniform() * 2.0 - 1.0) * scale)
                    .collect())
                .collect();
            weights.push(layer_weights);
            biases.push(vec![0.0; fan_out]);
        }

        Self { layer_sizes: layer_sizes.to_vec(), weights, biases, activation }
    }

    /// Run one input vector through the network. Returns (activations, pre-activations)
    /// per layer; the input is included as `activations[0]`.
    pub fn forward(&self, input_vector: &[f64]) -> (Vec<Vec<f64>>, Vec<Vec<f64>>) {
        let mut activations_per_layer: Vec<Vec<f64>> = vec![input_vector.to_vec()];
        let mut pre_activations_per_layer: Vec<Vec<f64>> = Vec::new();

        for (layer, layer_weights) in self.weights.iter().enumerate() {
            let layer_biases = &self.biases[layer];
            let previous     = activations_per_layer.last().unwrap().clone();
            let is_output_layer = layer == self.weights.len() - 1;

            let mut pre_activations = Vec::with_capacity(layer_weights.len());
            let mut activations     = Vec::with_capacity(layer_weights.len());

            for (neuron, weight_row) in layer_weights.iter().enumerate() {
                let mut z = layer_biases[neuron];
                for input in 0..weight_row.len() {
                    z += weight_row[input] * previous[input];
                }
                pre_activations.push(z);
                activations.push(if is_output_layer { z } else { self.activation.apply(z) });
            }

            pre_activations_per_layer.push(pre_activations);
            activations_per_layer.push(activations);
        }

        (activations_per_layer, pre_activations_per_layer)
    }

    /// Backpropagation for cross-entropy + softmax.
    pub fn backprop(&self, input_vector: &[f64], true_class: usize) -> Gradients {
        let (activations_per_layer, _) = self.forward(input_vector);

        let mut weight_gradients: Vec<Vec<Vec<f64>>> = self.weights.iter()
            .map(|layer| layer.iter().map(|row| vec![0.0; row.len()]).collect())
            .collect();
        let mut bias_gradients: Vec<Vec<f64>> = self.biases.iter()
            .map(|b| vec![0.0; b.len()])
            .collect();

        // BP1: softmax(logits) − one_hot(true_class).
        let output_probabilities = softmax(activations_per_layer.last().unwrap());
        let mut error_at_current_layer: Vec<f64> = output_probabilities.clone();
        error_at_current_layer[true_class] -= 1.0;

        for layer in (0..self.weights.len()).rev() {
            let previous_layer_activations = &activations_per_layer[layer];

            // BP3 + BP4: bias and weight gradients for this layer.
            for neuron in 0..self.weights[layer].len() {
                bias_gradients[layer][neuron] = error_at_current_layer[neuron];
                for input in 0..previous_layer_activations.len() {
                    weight_gradients[layer][neuron][input] =
                        error_at_current_layer[neuron] * previous_layer_activations[input];
                }
            }

            // BP2: send error one layer further back.
            if layer > 0 {
                let mut error_at_previous_layer = vec![0.0; self.weights[layer - 1].len()];
                for source in 0..error_at_previous_layer.len() {
                    let mut weighted_error = 0.0;
                    for target in 0..self.weights[layer].len() {
                        weighted_error += self.weights[layer][target][source] * error_at_current_layer[target];
                    }
                    error_at_previous_layer[source] = weighted_error
                        * self.activation.derivative_from_activation(activations_per_layer[layer][source]);
                }
                error_at_current_layer = error_at_previous_layer;
            }
        }

        Gradients {
            weight_gradients,
            bias_gradients,
            loss: -output_probabilities[true_class].max(1e-9).ln(),
            output_probabilities,
        }
    }

    /// One SGD step: subtract a small multiple of each gradient.
    pub fn apply_gradients(&mut self, grads: &Gradients, learning_rate: f64) {
        for layer in 0..self.weights.len() {
            for neuron in 0..self.weights[layer].len() {
                self.biases[layer][neuron] -= learning_rate * grads.bias_gradients[layer][neuron];
                for input in 0..self.weights[layer][neuron].len() {
                    self.weights[layer][neuron][input] -=
                        learning_rate * grads.weight_gradients[layer][neuron][input];
                }
            }
        }
    }
}

fn softmax(logits: &[f64]) -> Vec<f64> {
    let max_logit = logits.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let exponentials: Vec<f64> = logits.iter().map(|&z| (z - max_logit).exp()).collect();
    let total: f64 = exponentials.iter().sum();
    exponentials.iter().map(|e| e / total).collect()
}

// Tiny xorshift64 PRNG. Single-threaded only: `static mut` is unsynchronized,
// so concurrent calls would be undefined behavior. For real (multi-threaded
// or production) code, drop this and use the `rand` crate.
fn rand_uniform() -> f64 {
    use std::time::{SystemTime, UNIX_EPOCH};
    static mut STATE: u64 = 0;
    unsafe {
        if STATE == 0 {
            STATE = SystemTime::now()
                .duration_since(UNIX_EPOCH).unwrap()
                .subsec_nanos() as u64 | 1;
        }
        STATE ^= STATE << 13;
        STATE ^= STATE >> 7;
        STATE ^= STATE << 17;
        (STATE as f64) / (u64::MAX as f64)
    }
}

One file, no frameworks, every line traceable to an equation we walked through. The same algorithm, scaled up with matrix libraries, GPUs, attention, and trillions of parameters, is what runs ChatGPT.

13Train it yourself

The widget below is running the code above in your browser. Pick a dataset, tweak the architecture, hit TRAIN. The left panel shows the predicted class probability over the input space (the decision boundary); the right panel plots the loss per epoch.

INPUT SPACE · decision boundary forms as the network learns

dataset

activation

hidden units / layer 8

learning rate η 0.030

data noise 0.10

epoch

loss

···

accuracy

···

throughput

···

Things to try

Push the learning rate up to 0.2 on the spiral. Watch the loss curve thrash. SGD with too-large η can't settle into the valley.
Drop the learning rate to 0.001. Loss decreases monotonically but takes forever. Pick a middle ground.
Use only 2 hidden units on the spiral. The network doesn't have the capacity to bend the boundary enough. Bump it to 16 and it fits the curve cleanly.
Switch from ReLU to sigmoid with a deep network (e.g. add hidden units to the max). Learning slows dramatically. That's the vanishing gradient problem from §10.
Try XOR with no hidden layer at all. You can't, but the closest is hidden=2, sigmoid. Watch how it barely separates the classes. That was Minsky's whole point.

14Overfitting and how to spot it

There's a problem hiding in everything we've done so far. A network with enough capacity can memorize any training set. Give it 100 noisy points and a network with 1,000 weights will contort itself to pass through every one, noise included. On the training data, it looks perfect. On new data, it falls apart.

This is : confusing the noise in the training set for signal in the world. The standard practical fix is to hold out some of your data, never let the network train on it, and measure performance separately on the held-out set during training. The gap between the two numbers tells you whether the model is generalizing.

Train / validation / test

The standard split is three sets. Training set: what the network sees and learns from. Validation set: held out, used to pick hyperparameters and decide when to stop training. Test set: held out even from you the engineer. Touch it once at the end to report final performance. Use it more than that and you're effectively training on it through the feedback loop.

The widget below trains an over-parameterized network (16 + 16 hidden units, around 350 weights) on a small noisy dataset. The solid green curve is loss on the training set. The dashed red curve is loss on a held-out validation set the network never sees. After a few hundred epochs the two curves diverge.

filled = training · outlined = validation · ◼ class 0 · ● class 1

training loss (solid) validation loss (dashed)

training set size 40

data noise 0.10

weight decay (regularization) 0.0000

epoch

train loss

···

val loss

···

What overfitting looks like, in three pictures

The diagnostic pattern is:

Training loss keeps decreasing. The network is finding patterns in the training data.
Validation loss decreases for a while, then turns around and starts increasing. Past this point, every additional epoch makes the network worse at the actual problem.
The decision boundary develops weird bulges around individual training points, trying to capture each one's noisy position.

How to fix it

Four standard cures, all of which work and are usually combined:

More data. The cleanest fix. With a thousand points, the noise averages out and the network can't memorize individual examples even if it wanted to. Often not an option in practice.
Smaller network. Fewer parameters means less capacity to memorize. Counter-intuitively, the simplest cure is a smaller model.
Weight decay (L2 regularization). Penalize the network for having large weights. Drive it toward simpler functions. Slide the regularization slider above and watch the validation loss recover.
Early stopping. Watch validation loss; the moment it starts rising, stop training and use the previous best checkpoint.

There are more advanced techniques (dropout, which randomly zeros out neurons during training; batch normalization; data augmentation) but the four above are the ones every modern training run uses.

15Your turn: the code playground

The playground below runs the JavaScript version of the network you just read. The Python, C++, and Rust ports from §12 do the same thing; copy any of them into your own editor to run there. The library is exposed as the MPGNet object so you can build your own network, train it, and predict from the editor. Hit RUN (or Ctrl+Enter / Cmd+Enter) to execute; output prints on the right.

▸ playground.js · live JavaScript, in your browser

// Build a network: 2 inputs → 6 hidden → 6 hidden → 2 outputs
const net = MPGNet.makeNetwork([2, 6, 6, 2], { activation: 'relu' });

// Tiny XOR-like dataset (corner labels)
const data = [
  [-0.8, -0.8, 0], [ 0.8,  0.8, 0],
  [-0.8,  0.8, 1], [ 0.8, -0.8, 1],
];

// Train for 800 epochs
const lr = 0.05;
for (let epoch = 0; epoch < 800; epoch++) {
  let totalLoss = 0;
  for (const [x, y, label] of data) {
    const g = MPGNet.backprop(net, [x, y], label);
    MPGNet.applyGrads(net, g, lr);
    totalLoss += g.loss;
  }
  if (epoch % 100 === 0) {
    console.log(`epoch ${epoch.toString().padStart(3)} · loss ${(totalLoss / 4).toFixed(4)}`);
  }
}

console.ok('training complete');

// Test on the training points
for (const [x, y, label] of data) {
  const probs = MPGNet.predict(net, [x, y]);
  const pred = probs[0] > probs[1] ? 0 : 1;
  const tag = pred === label ? '✓' : '✗';
  console.log(`${tag}  (${x}, ${y}) → class ${pred}  · probs [${probs[0].toFixed(2)}, ${probs[1].toFixed(2)}]  (true: ${label})`);
}

// Test a held-out point
const test = MPGNet.predict(net, [0.5, -0.5]);
console.log(`\nheld-out (0.5, -0.5) → class ${test[0] > test[1] ? 0 : 1}`);

Edit anything. Try changing the data, the architecture, the learning rate. Add a held-out test set. Plot the predictions. The whole library (makeNetwork, forward, predict, backprop, applyGrads, softmaxVec) is documented in the comments of the file you saw earlier and lives at window.MPGNet.

16Where to go from here

What we just built is called a multilayer perceptron (MLP) or feedforward neural network. It's the foundational architecture from which everything else descends. It works well on small tabular data and 2D toys. It struggles on the things that made deep learning famous.

Images: convolutions

A 256×256 image has 65,536 pixels. Connecting every pixel to every neuron in the first hidden layer of an MLP gives you tens of millions of parameters per layer, which is wasteful because neighboring pixels are correlated. Convolutional neural networks (CNNs) exploit that structure: a small filter slides over the image, producing a feature map. The same algorithm we built (backprop on a chain of differentiable operations) still drives training; only the layer types change.

Sequences: attention

For text, audio, or any data that comes in a sequence, MLPs can't relate token i to token i+50 without dragging information through every position in between. Vaswani et al. 2017's Attention is All You Need^[9] solves this with scaled dot-product attention:

scaled dot-product attention Attention (Q, K, V) = softmax (QK ⊤ \sqrtd k) V

Every token in the sequence has three roles: it asks (Q), it advertises (K), and it has content (V). Compute the similarity between every query and every key, scale down, softmax to get attention weights, and use them to mix the values. The result: each token gets a customized summary of every other token.

Every token gets to attend to every other token. Stack a few of these blocks and you have a , the architecture behind every major language model since 2018. Karpathy's nanoGPT^[10] is the cleanest implementation of a GPT-style transformer around: about 300 lines for the model, 300 for training.

You don't need GPT-4 to do interesting things

Eldan and Li's TinyStories paper^[11] showed that transformers with under 10 million parameters (small enough to train on a laptop) can produce fluent, coherent children's stories if you give them a carefully constrained vocabulary. The lesson is that you can replicate the entire training pipeline end-to-end on a small budget and still get something that genuinely works.

That's what the next tutorial in this series will walk through: building a tiny transformer from the same first principles we used here. Same forward/backward/update loop, more interesting layers in the middle.

You finished

You now have a working, code-level picture of how every modern neural network learns. The depth and the data change. The mechanism does not.

The final exam

Five questions covering the whole tutorial. If you can answer all five without scrolling back up, you've got the fundamentals.

17Sources & further reading

Numbered citations refer to the superscripts above. Everything below is either freely available on the open web or linked from arXiv.

A note on originality

The prose in this tutorial is original writing. The mathematical equations themselves are canonical statements that appear identically across every neural-network textbook (there's only one way to write the sigmoid function or the four backpropagation equations), and they are attributed at the points where they are introduced. The labeling convention BP1–BP4 follows Nielsen [4], who derives all four in detail. The scaled dot-product attention formula appears verbatim in Vaswani et al. [9]. The interactive demos, code, and CSS on this page are written from scratch for this tutorial.

Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386–408. The original artificial neuron.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536. Brought backpropagation into the mainstream.
Minsky, M. & Papert, S. (1969). Perceptrons. MIT Press. The XOR critique that triggered the first AI winter.
Nielsen, M. A. (2015). Neural Networks and Deep Learning. Determination Press. Free online: neuralnetworksanddeeplearning.com. The clearest derivation of the four backprop equations anywhere.
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TU Munich. The first formal analysis of the vanishing-gradient problem.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners. OpenAI. PDF. The GPT-2 paper.
Glorot, X. & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. AISTATS. The Xavier initialization paper.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV. The He / Kaiming initialization paper.
Vaswani, A., Shazeer, N., Parmar, N. et al. (2017). Attention Is All You Need. NeurIPS. arXiv:1706.03762. The Transformer paper.
Karpathy, A. nanoGPT. GitHub: github.com/karpathy/nanoGPT. The pedagogical reference implementation of GPT.
Eldan, R. & Li, Y. (2023). TinyStories: How Small Can Language Models Be and Still Speak Coherent English? arXiv:2305.07759. The "tiny LMs work" paper.
Hinton, G. (2012). Lecture 6e: rmsprop, Divide the gradient by a running average of its recent magnitude. Slide deck from the Coursera Neural Networks course. RMSProp's first public appearance.
Kingma, D. P. & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv:1412.6980. The optimizer that became the deep-learning default.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541–551. The first end-to-end-trained convolutional network deployed at scale; the MNIST work in Gradient-based learning applied to document recognition (LeCun et al., 1998) followed nine years later.
Karpathy, A. Neural Networks: Zero to Hero. YouTube series, 2022–2023. karpathy.ai/zero-to-hero.html. The pedagogy of "forward, loss, backward, update, scale" as the through-line from a single neuron to a small GPT.
Linnainmaa, S. (1970). The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. M.Sc. thesis, University of Helsinki. Reverse-mode automatic differentiation, which is the algorithm we now call backpropagation, sixteen years before Rumelhart-Hinton-Williams.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Math. of Control, Signals, and Systems. The original universal approximation theorem.
Pournis, A. LLM from Scratch. GitHub: github.com/angelos-p/llm-from-scratch. A hands-on workshop building a small GPT trainable on a laptop.