Masked Self-Attention in the Decoder
The decoder can’t look into the future when predicting the next word. To enforce that, the Transformer applies a mask inside the decoder’s self-attention.
Mechanically:
- Any attention score from position i to a future position j > i is set to −∞ before the softmax.
- After the softmax, those future positions receive probability ~0.
- So token y_i depends only on tokens at positions < i.
This keeps generation left-to-right, like a language model.
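
To make the mechanics concrete, here is a minimal single-head sketch in NumPy. The function name, shapes, and single-head setup are illustrative assumptions, not the exact Transformer implementation; the point is only to show the −∞ masking and the resulting ~0 weights on future positions.

```python
# Minimal sketch of masked (causal) self-attention in NumPy.
# Single head, no projections, no batching -- illustrative only.
import numpy as np

def masked_self_attention(Q, K, V):
    """Scaled dot-product attention where position i only attends to j <= i.

    Q, K, V: arrays of shape (seq_len, d_k).
    Returns an array of shape (seq_len, d_k).
    """
    seq_len, d_k = Q.shape

    # Raw scores: entry [i, j] is how strongly position i attends to position j.
    scores = Q @ K.T / np.sqrt(d_k)

    # Causal mask: True wherever j > i (a future position).
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

    # Set future scores to -inf so softmax assigns them ~0 probability.
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax over the allowed (current and past) positions.
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ V

# Quick check: with Q = K = V = x (self-attention), the weight matrix is
# lower-triangular, so output i cannot depend on tokens after position i.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))            # 5 tokens, d_k = 8
out = masked_self_attention(x, x, x)
print(out.shape)                       # (5, 8)
```

Because the mask is applied to the scores before the softmax rather than to the output, the remaining (past) positions still sum to 1, so each token gets a proper probability distribution over the tokens it is allowed to see.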


