Explain Forward and Back Propagation in the context of deep learning.

Career • Asked over 2 years ago by Ashley

Ashley Commented on Feb 22, 2023:

Forward Propagation: Each hidden layer, sitting between the input layer and the output layer of the network, receives a weighted combination of the previous layer's outputs. At every node we apply an activation function to that weighted sum, and the result propagates to the next layer until we reach the final output layer. Moving from the inputs through the layers to the final output is what's known as forward propagation.
Back Propagation: Backpropagation sends error information from the network's last layer back to all of the weights in the network. It's a technique for fine-tuning the weights of a neural network based on the error rate from the previous epoch (i.e., iteration). By fine-tuning the weights, you lower the error rate and improve the model's generalization, making it more dependable. The process can be broken down into the following steps:
- Propagate the training data through the network to generate an output.
- Compute the error derivative with respect to the output activations, using the target and output values.
- Backpropagate to compute the derivative of the error with respect to the previous layer's output activations, and repeat for all hidden layers.
- Compute the error derivatives for the weights using the derivatives obtained so far.
- Update the weights based on these error derivatives.

Jun Commented on Jun 21, 2025:

Ashley’s on point. A slightly tighter, mathy view:

  • Forward pass (compute predictions and cache intermediates):

    • For layer l: z[l] = W[l] a[l-1] + b[l], a[l] = f(z[l]) (a[0] = input x)
    • Output: ŷ = g(z[L]) and loss L(ŷ, y). Cache a[l], z[l] for backprop.
  • Backprop (chain rule over the computational graph, reusing those caches; a numpy sketch follows this list):

    • δ[L] = ∂L/∂z[L]. For softmax + cross-entropy, δ[L] = ŷ − y.
    • For l = L … 1:
      • dW[l] = δ[l] a[l−1]ᵀ
      • db[l] = sum over batch of δ[l]
      • δ[l−1] = (W[l])ᵀ δ[l] ⊙ f′(z[l−1])
    • Update parameters with SGD/Adam: W[l] ← W[l] − η dW[l], b[l] ← b[l] − η db[l]
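
To make that recursion concrete, here's a minimal numpy sketch of the backward loop, assuming the forward pass cached every a[l] and z[l], ReLU hidden activations, and that δ[L] is already computed (ŷ − y for softmax + cross-entropy). The names (`Ws`, `acts`, `zs`) are illustrative; examples sit in columns here to match the formulas above (with row-major data the transposes flip sides):

```python
import numpy as np

def relu_prime(z):
    # Derivative of ReLU, evaluated at the cached pre-activation z
    return (z > 0).astype(z.dtype)

def backward(Ws, acts, zs, delta_L):
    """Backprop through L dense layers.

    Ws:      weight matrices, Ws[l] has shape (n_out, n_in)
    acts:    cached activations a[0..L] from the forward pass (acts[0] = x)
    zs:      cached pre-activations z[1..L]
    delta_L: dL/dz at the output, e.g. yhat - y for softmax + cross-entropy
    Returns per-layer gradients (dWs, dbs).
    """
    L = len(Ws)
    dWs, dbs = [None] * L, [None] * L
    delta = delta_L                          # error signal for the current layer
    for l in reversed(range(L)):             # l = L-1 ... 0 (0-indexed)
        dWs[l] = delta @ acts[l].T           # dW[l] = delta[l] a[l-1]^T
        dbs[l] = delta.sum(axis=1, keepdims=True)  # sum over the batch
        if l > 0:                            # propagate error to the layer below
            delta = (Ws[l].T @ delta) * relu_prime(zs[l - 1])
    return dWs, dbs
```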

Notes:
- Vectorize across the batch for speed; automatic differentiation frameworks build this computational graph and do the chain rule for you.
- Numerical stability matters (softmax on logits, not probabilities; use a fused cross-entropy; a stable sketch follows these notes).
- Vanishing/exploding gradients are common; use good init (He/Xavier), residuals, normalization, and gradient clipping.
- Like “N Points Near Origin,” caching/reusing computed pieces is everything: forward caches make backprop O(params), not O(params × depth²).
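
On the stability note, here's a minimal sketch of a “fused” softmax + cross-entropy that works directly on logits (the function name and shapes are illustrative):

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Numerically stable softmax + cross-entropy on raw logits.

    logits: (batch, classes) raw scores; labels: (batch,) integer class ids.
    Subtracting the row max before exponentiating avoids overflow, and
    log-sum-exp avoids taking the log of a tiny probability.
    Returns (mean loss, dL/dlogits); the gradient is softmax(logits) - onehot.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)   # stability shift
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    batch = logits.shape[0]
    loss = -log_probs[np.arange(batch), labels].mean()
    grad = np.exp(log_probs)                               # softmax probabilities
    grad[np.arange(batch), labels] -= 1.0                  # yhat - y (one-hot)
    grad /= batch                                          # average over the batch
    return loss, grad
```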

Team AlgoDaily Commented on Jun 25, 2025:

Ashley and Jun nailed the mechanics. A compact intuition to round it out:

  • Forward pass: you compose simple functions (affine + nonlinearity) layer by layer to get predictions and a scalar loss. You cache what you’ll need for derivatives later (z’s for f′, a’s for dW).
  • Backprop: you start from ∂L/∂output and apply the chain rule backward through those same functions. For common softmax + cross-entropy, the output error is δ[L] = ŷ − y. Each layer uses
    • dW[l] = δ[l] a[l−1]ᵀ, db[l] = sum(δ[l] over batch)
    • δ[l−1] = (W[l])ᵀ δ[l] ⊙ f′(z[l−1])
  • Update: apply an optimizer step (SGD/Adam): W ← W − η dW, b ← b − η db. That’s the whole training loop.

Mental model: forward “computes,” backward “assigns blame” for the loss to each parameter via chain rule. It’s just reverse‑mode autodiff over the computation graph; frameworks cache intermediates on the forward so backward is one additional pass.
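
You can see that caching in action with a tiny PyTorch snippet (values here are arbitrary): the forward line records the graph, and backward is one extra pass over it:

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = (x ** 2).sum()   # forward: computes y and records the graph, caching x
y.backward()         # backward: one extra pass applying the chain rule
print(x.grad)        # dy/dx = 2x -> tensor([2., 4.])
```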

Common gotchas:
- Wrong activation derivative or forgetting to sum db across the batch.
- Broadcasting mistakes in dW shapes.
- Numerical stability: use fused softmax‑cross‑entropy on logits.
- Vanishing/exploding gradients: prefer He/Xavier init, ReLU/variants, residuals, normalization, and gradient clipping.

Minimal PyTorch training step: yhat = model(x); loss = criterion(yhat, y); loss.backward(); optimizer.step(); optimizer.zero_grad().
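
Spelled out as a runnable sketch (the model, data, and hyperparameters are placeholders for a generic classification setup; the clipping line is optional and ties back to the gotchas above):

```python
import torch
import torch.nn as nn

# Placeholder model and data, just to make the step runnable.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 5))
criterion = nn.CrossEntropyLoss()          # fused log-softmax + NLL on logits
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 20), torch.randint(0, 5, (32,))

for step in range(100):
    yhat = model(x)                        # forward pass (graph is recorded)
    loss = criterion(yhat, y)              # scalar loss
    loss.backward()                        # backward pass: reverse-mode autodiff
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # optional clipping
    optimizer.step()                       # parameter update
    optimizer.zero_grad()                  # clear grads for the next iteration
```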

Sarah Chen Commented on Jun 30, 2025:

Ashley, Jun, and AlgoDaily covered the gist. A compact “from scratch” view that ties it together:

Pocket example (2‑layer MLP, batch size B; numpy sketch after this list)
- Forward:
  - z1 = X W1 + b1, a1 = ReLU(z1)
  - z2 = a1 W2 + b2, ŷ = softmax(z2)
  - Loss L = crossentropy(ŷ, y)
- Backward:
  - δ2 = ŷ − y [softmax + CE]
  - dW2 = a1ᵀ δ2, db2 = sum_B(δ2)
  - δ1 = (δ2 W2ᵀ) ⊙ ReLU′(z1) [ReLU′ = 1 where z1 > 0, else 0]
  - dW1 = Xᵀ δ1, db1 = sum_B(δ1)
- Update (SGD): W ← W − η dW, b ← b − η db
  - Usually average grads over the batch: dW /= B, db /= B
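
Here's that pocket example as a minimal numpy sketch (sizes are arbitrary; shapes follow the [B,d] convention from the notes below):

```python
import numpy as np

rng = np.random.default_rng(0)
B, d, h, c = 32, 20, 64, 5                       # batch, input, hidden, classes
X = rng.normal(size=(B, d))
y = np.eye(c)[rng.integers(0, c, size=B)]        # one-hot targets, shape [B, c]
W1, b1 = rng.normal(size=(d, h)) * np.sqrt(2 / d), np.zeros(h)   # He init
W2, b2 = rng.normal(size=(h, c)) * np.sqrt(2 / h), np.zeros(c)

# Forward (cache z1, a1 for the backward pass)
z1 = X @ W1 + b1
a1 = np.maximum(z1, 0)                           # ReLU
z2 = a1 @ W2 + b2
p = np.exp(z2 - z2.max(axis=1, keepdims=True))   # stable softmax on logits
yhat = p / p.sum(axis=1, keepdims=True)
loss = -np.mean(np.sum(y * np.log(yhat + 1e-12), axis=1))

# Backward (gradients averaged over the batch)
d2 = (yhat - y) / B                              # softmax + CE shortcut
dW2, db2 = a1.T @ d2, d2.sum(axis=0)
d1 = (d2 @ W2.T) * (z1 > 0)                      # ReLU'(z1)
dW1, db1 = X.T @ d1, d1.sum(axis=0)

# SGD update
eta = 0.1
W1 -= eta * dW1; b1 -= eta * db1
W2 -= eta * dW2; b2 -= eta * db2
```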

Practical notes
- Cache z’s (pre‑activations) for derivatives; f′(z), not f′(a).
- Bias grads sum over the batch; watch broadcasting and shapes:
  - X: [B,d], W1: [d,h], b1: [h], W2: [h,c], etc.
- Regularization: L2 adds λW to dW; dropout only active in train mode; BatchNorm changes its behavior between train/eval.
- Numerical stability: feed logits to a fused softmax‑cross‑entropy; mask padded tokens in sequence tasks.
- Gradient health: exploding/vanishing → use He/Xavier init, ReLU/variants, residuals/normalization, gradient clipping.
- Sanity checks: monitor that the loss decreases and track grad norms; when implementing by hand, run a small finite‑diff gradient check (sketch after this list).
- Same idea for convs: δ flows through convs/pools via chain rule; weight grads are correlations of inputs with δ.
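
And the finite‑diff gradient check as a hedged sketch; `loss_fn` is assumed to be a closure that recomputes the scalar loss from the parameter array:

```python
import numpy as np

def grad_check(loss_fn, W, analytic_dW, eps=1e-5, samples=10):
    """Compare analytic gradients to centered finite differences at a few entries.

    loss_fn:     callable taking the parameter array W, returning the scalar loss.
    analytic_dW: your backprop gradient for W, same shape as W.
    """
    rng = np.random.default_rng(0)
    for _ in range(samples):
        idx = tuple(rng.integers(0, s) for s in W.shape)  # random entry of W
        old = W[idx]
        W[idx] = old + eps; plus = loss_fn(W)             # loss at W + eps
        W[idx] = old - eps; minus = loss_fn(W)            # loss at W - eps
        W[idx] = old                                      # restore the entry
        numeric = (plus - minus) / (2 * eps)              # centered difference
        rel_err = abs(numeric - analytic_dW[idx]) / max(
            1e-12, abs(numeric) + abs(analytic_dW[idx]))
        print(idx, numeric, analytic_dW[idx], rel_err)    # want rel_err small, ~< 1e-5
```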

Frameworks (PyTorch/TF/JAX) build this graph and run reverse‑mode autodiff, but understanding the above is what lets you debug when shapes, derivatives, or numerics go sideways.