Position-Wise Feed-Forward Networks
After attention, each position’s embedding is run through a tiny feed-forward network:
- A linear layer (W1, b1),
- A ReLU,
- Then another linear layer (W2, b2).
It’s applied identically to every position within a layer, but each layer in the stack has its own parameters. You can think of this as: attention mixes information across tokens, and then the feed-forward block “transforms” each token’s channel representation.
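As a minimal sketch in PyTorch (the class name `PositionWiseFFN` is just illustrative, not from any particular library), the block is two linear layers with a ReLU in between. Because `nn.Linear` acts on the last dimension, the same weights are reused at every position:

```python
import torch
import torch.nn as nn

class PositionWiseFFN(nn.Module):
    """Sketch of the position-wise feed-forward block:
    FFN(x) = max(0, x W1 + b1) W2 + b2, applied at every position."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # W1, b1: expand channels
        self.linear2 = nn.Linear(d_ff, d_model)   # W2, b2: contract back
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); the linear layers act on the last
        # dimension, so every position is transformed with the same weights.
        return self.linear2(self.relu(self.linear1(x)))
```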
In the base Transformer:
- d_model = 512,
- The inner feed-forward dimension d_ff = 2048 (so it expands, then contracts).
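Continuing the sketch above with those base sizes (the batch and sequence lengths here are arbitrary, just to show the shapes):

```python
ffn = PositionWiseFFN(d_model=512, d_ff=2048)
x = torch.randn(2, 10, 512)   # (batch, seq_len, d_model)
print(ffn(x).shape)           # torch.Size([2, 10, 512]); inner width was 2048
```

The output has the same shape as the input, so the block can be stacked with residual connections; only the hidden width expands to 2048 in between.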

