Encoder Stack: What Happens Inside
The Transformer encoder is a stack of N = 6 identical layers.
Each layer has:
- A multi-head self-attention sub-layer.
- A position-wise feed-forward network sub-layer (a tiny 2-layer MLP applied independently to each position's vector; see the formula just below).
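For reference, the feed-forward sub-layer in the original paper is just two linear transformations with a ReLU in between, applied to each position separately:

FFN(x) = max(0, x·W1 + b1)·W2 + b2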
Each sub-layer is wrapped with a residual connection (add the input back to the output of the sub-layer) and layer normalization (normalize activations so training stays stable).
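In formula form, each sub-layer computes LayerNorm(x + Sublayer(x)), where Sublayer is either the self-attention or the feed-forward network.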
So the pattern is basically:
x -> Attention -> Add&Norm -> FeedForward -> Add&Norm
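Here is a minimal PyTorch sketch of one encoder layer following that pattern. It is illustrative, not the paper's reference implementation: the hyperparameters (d_model=512, n_heads=8, d_ff=2048, dropout=0.1) match the base model from the original paper, and nn.MultiheadAttention stands in for a from-scratch attention implementation.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # Sub-layer 1: multi-head self-attention
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        # Sub-layer 2: position-wise feed-forward network (2-layer MLP)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        # Attention -> Add & Norm
        attn_out, _ = self.attn(x, x, x)            # self-attention: q, k, v all come from x
        x = self.norm1(x + self.dropout(attn_out))  # residual connection, then layer norm
        # FeedForward -> Add & Norm
        ffn_out = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_out))
        return x

# The full encoder stacks N = 6 identical layers:
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
out = encoder(torch.randn(2, 10, 512))  # (batch=2, seq_len=10, d_model=512)
print(out.shape)                        # torch.Size([2, 10, 512])
```

The input and output of every layer have the same shape (seq_len, d_model), which is what makes stacking six of them (and the residual additions) possible.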

