
Residual Connections + LayerNorm

Every sub-layer (attention or feed-forward) is wrapped in the same way (a code sketch follows the list):

  1. Take the input x.
  2. Run the sub-layer to get Sublayer(x).
  3. Add them: x + Sublayer(x) (this is a residual connection).
  4. Apply layer normalization.
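
A minimal PyTorch-style sketch of this post-norm wrapping, assuming a hidden size d_model; the ResidualLayerNorm class name and the feed-forward example are hypothetical and for illustration only:

```python
import torch
import torch.nn as nn

class ResidualLayerNorm(nn.Module):
    """Hypothetical post-norm wrapper: LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        # Steps 1-3: run the sub-layer on x and add its output back to x.
        residual = x + sublayer(x)
        # Step 4: apply layer normalization over the hidden dimension.
        return self.norm(residual)

# Usage: wrap any sub-layer, e.g. a small feed-forward block.
d_model = 512
ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
block = ResidualLayerNorm(d_model)
x = torch.randn(2, 10, d_model)   # (batch, seq_len, d_model)
out = block(x, ffn)               # same shape as x
```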

Why:

  • Residuals help gradients flow in deep networks.
  • LayerNorm stabilizes training by normalizing each position's feature vector across the hidden dimension (see the snippet after this list).
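
To make "normalizing across the hidden dimension" concrete, here is a small sketch (assuming d_model = 512): after nn.LayerNorm(d_model), each position's feature vector has roughly zero mean and unit variance.

```python
import torch
import torch.nn as nn

d_model = 512
norm = nn.LayerNorm(d_model)        # normalizes over the last (hidden) dimension
x = torch.randn(2, 10, d_model)     # (batch, seq_len, d_model)
y = norm(x)

# Every (batch, position) feature vector is normalized independently.
print(y.mean(dim=-1).abs().max())   # ~0
print(y.std(dim=-1).mean())         # ~1
```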

This structure repeats in every encoder and decoder layer.