Position-Wise Feed-Forward Networks

After attention, each position’s embedding is run through a tiny feed-forward network:

  • A linear layer (W1, b1),
  • A ReLU,
  • Then another linear layer (W2, b2).

It’s applied identically at every position within a layer, though each layer has its own parameters. You can think of it this way: attention mixes information across tokens, and the feed-forward block then “transforms” each token’s representation independently.
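
A minimal sketch of this block in PyTorch, just to make the shape of the computation concrete; the class name `FeedForward` and the argument names `d_model` and `d_ff` are illustrative, not from the original paper’s code:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: Linear -> ReLU -> Linear, applied to each token independently."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # W1, b1: expand
        self.linear2 = nn.Linear(d_ff, d_model)   # W2, b2: contract
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, seq_len, d_model). Because nn.Linear acts on the
        # last dimension, the same weights are applied at every position.
        return self.linear2(self.relu(self.linear1(x)))
```

Because the two linear layers only act on the last (channel) dimension, there is no interaction between positions inside this block; all cross-token mixing happens in the attention step.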

In the base Transformer:

  • d_model = 512
  • The inner feed-forward dimension d_ff = 2048, so the block expands each token’s representation by a factor of 4, then contracts it back to d_model.
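
With those base dimensions, each feed-forward block holds roughly 2.1M parameters. A quick check of that arithmetic, reusing the sketch above:

```python
d_model, d_ff = 512, 2048
ffn = FeedForward(d_model, d_ff)
n_params = sum(p.numel() for p in ffn.parameters())
# 512*2048 + 2048 (W1, b1) + 2048*512 + 512 (W2, b2) = 2,099,712
print(n_params)  # 2099712
```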