Masked Self-Attention in the Decoder

The decoder must not look into the future when predicting the next word. To enforce this, the Transformer applies a mask inside the decoder's self-attention layers.

Mechanically:

  • Any attention score from position i to a future position j > i is set to -∞ before the softmax.
  • After the softmax, those future positions get weight exactly 0 (since exp(-∞) = 0).
  • So the output at position i uses only positions ≤ i; with the decoder's right-shifted inputs, the prediction of token y_i therefore depends only on the tokens before it.

This keeps generation left-to-right, like an autoregressive language model. The sketch below shows the masking step in code.
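
Here is a minimal PyTorch sketch of causal masking inside scaled dot-product attention. The function name, shapes, and the absence of batch/head dimensions are illustrative simplifications, not the actual library implementation.

```python
import torch
import torch.nn.functional as F

def masked_self_attention(q, k, v):
    """Scaled dot-product attention with a causal (look-ahead) mask.

    q, k, v: tensors of shape (seq_len, d_k). Real implementations
    also carry batch and head dimensions; omitted here for clarity.
    """
    seq_len, d_k = q.shape

    # Raw attention scores: entry [i, j] says how much position i
    # attends to position j. Shape: (seq_len, seq_len).
    scores = (q @ k.T) / d_k ** 0.5

    # Causal mask: True strictly above the diagonal, i.e. wherever j > i.
    future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

    # Set scores for future positions to -inf before the softmax...
    scores = scores.masked_fill(future, float("-inf"))

    # ...so their attention weights come out exactly 0 after it.
    weights = F.softmax(scores, dim=-1)
    return weights @ v

# Tiny usage example: position 0 attends only to itself,
# and position i attends to positions 0..i.
q = k = v = torch.randn(4, 8)
out = masked_self_attention(q, k, v)
```

Because the mask is applied to the scores rather than the weights, the remaining (non-future) positions still sum to 1 after the softmax, so each row stays a valid attention distribution.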