
Encoder Stack: What Happens Inside

The Transformer encoder is a stack of N = 6 identical layers.

Each layer has:

  1. A multi-head self-attention sub-layer.
  2. A position-wise feed-forward network sub-layer: a small two-layer MLP applied independently to each position's vector (a minimal sketch of this MLP follows the list).
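Here is a minimal PyTorch sketch of that position-wise feed-forward network. The sizes d_model = 512 and d_ff = 2048 are the ones used in the original paper; the class and variable names are just illustrative.

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """Two-layer MLP applied independently to each position's vector."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # expand: d_model -> d_ff
        self.linear2 = nn.Linear(d_ff, d_model)   # project back: d_ff -> d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, seq_len, d_model); the same weights
        # are applied at every position along seq_len.
        return self.linear2(torch.relu(self.linear1(x)))
```

Because the two linear layers act on the last dimension only, every position is transformed with the same weights but without mixing information across positions; that mixing is the attention sub-layer's job.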

Each sub-layer is wrapped with a residual connection (add the input back to the output of the sub-layer) and layer normalization (normalize activations so training stays stable).
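A sketch of what that wrapping might look like in code, using the post-norm ordering from the original paper (dropout omitted to keep the pattern visible):

```python
import torch.nn as nn

class AddAndNorm(nn.Module):
    """Residual connection followed by layer normalization (post-norm)."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer_output):
        # Add the sub-layer's input back onto its output, then normalize.
        return self.norm(x + sublayer_output)
```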

So each encoder layer follows the pattern: x -> Self-Attention -> Add & Norm -> Feed-Forward -> Add & Norm
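Putting the pieces together, a minimal encoder layer and the stack of N = 6 of them might look like the sketch below. It leans on PyTorch's built-in nn.MultiheadAttention, and it leaves out padding masks and dropout so the x -> Attention -> Add & Norm -> Feed-Forward -> Add & Norm pattern stays visible.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # x -> Attention -> Add & Norm
        # Self-attention: queries, keys, and values all come from x.
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        # -> Feed-Forward -> Add & Norm
        x = self.norm2(x + self.ffn(x))
        return x

# The encoder is simply N = 6 of these identical layers stacked.
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])

tokens = torch.randn(2, 10, 512)   # (batch, seq_len, d_model) dummy embeddings
print(encoder(tokens).shape)       # torch.Size([2, 10, 512])
```

Note that the output shape matches the input shape; that is what lets the layers stack, since each layer's output can feed directly into the next.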