Residual Connections + LayerNorm
Every sub-layer (attention or feed-forward) is wrapped the same way (a code sketch follows this list):
- Take the input `x`.
- Run the sub-layer to get `Sublayer(x)`.
- Add them: `x + Sublayer(x)` (this is a residual connection).
- Apply layer normalization to the sum.
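
A minimal PyTorch-style sketch of this wrapping, assuming the post-norm order described above and a hidden size `d_model` (the class name `SublayerConnection` is illustrative, not from the original):

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Residual connection followed by LayerNorm (post-norm order)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        # Add the sub-layer's output to its input, then normalize the sum.
        return self.norm(x + sublayer(x))
```

The same wrapper works for any sub-layer that maps a tensor of shape `(batch, seq, d_model)` back to the same shape.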
Why:
- Residual connections give gradients a direct path around each sub-layer, which helps them flow through deep networks.
- LayerNorm stabilizes training by normalizing each token's activations across the hidden dimension (demonstrated below).
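
To see what "across the hidden dimension" means concretely, this small check (hypothetical shapes) shows that LayerNorm gives each token's feature vector roughly zero mean and unit variance:

```python
import torch
import torch.nn as nn

d_model = 8
norm = nn.LayerNorm(d_model)
x = torch.randn(2, 5, d_model)            # (batch, sequence, hidden)
y = norm(x)

# Statistics are computed per token position, over the last (hidden) axis.
print(y.mean(dim=-1))                     # approximately 0 for every token
print(y.std(dim=-1, unbiased=False))      # approximately 1 for every token
```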
This structure repeats in every encoder and decoder layer.
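
Continuing the hypothetical sketch above (and reusing `SublayerConnection` from it), one encoder layer applies the same wrap twice, once around self-attention and once around the feed-forward block:

```python
class EncoderLayer(nn.Module):
    """One encoder layer: each sub-layer gets its own residual + LayerNorm wrap."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.wrap_attn = SublayerConnection(d_model)
        self.wrap_ff = SublayerConnection(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sub-layer 1: self-attention, wrapped in residual + LayerNorm.
        x = self.wrap_attn(x, lambda t: self.self_attn(t, t, t, need_weights=False)[0])
        # Sub-layer 2: position-wise feed-forward, wrapped the same way.
        x = self.wrap_ff(x, self.ff)
        return x
```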


