Decoder Stack: What’s Different
The Transformer decoder is also a stack of 6 identical layers, but each layer has 3 sub-layers:
- masked multi-head self-attention over the already-generated output tokens. (“Masked” here means a token can’t peek at future tokens. This keeps generation auto-regressive; see the mask sketch after this list.)
- encoder-decoder attention, where the decoder attends to all encoder outputs.
- The same position-wise feed-forward network used in the encoder.
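To make “masked” concrete, here is a minimal sketch, assuming PyTorch, of the causal mask the decoder’s self-attention would use; the variable names are illustrative, not from the source.

```python
import torch

# Causal mask for 5 decoder positions: position i may attend to
# positions <= i, never to future positions.
seq_len = 5
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(mask)
# tensor([[False,  True,  True,  True,  True],
#         [False, False,  True,  True,  True],
#         [False, False, False,  True,  True],
#         [False, False, False, False,  True],
#         [False, False, False, False, False]])
# True entries are blocked (their attention scores are set to -inf
# before the softmax), so a token never attends to later tokens.
```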
Each sub-layer is again wrapped in a residual connection followed by layer normalization.
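Putting the three sub-layers together, here is a minimal sketch of one decoder layer, assuming PyTorch and the paper’s base dimensions (d_model=512, 8 heads, d_ff=2048); the class and argument names are illustrative, not from the source.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder layer: masked self-attention, encoder-decoder attention,
    and a position-wise feed-forward network, each wrapped in a residual
    connection followed by LayerNorm."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, tgt, memory, causal_mask):
        # 1) Masked self-attention over the already-generated output tokens.
        x, _ = self.self_attn(tgt, tgt, tgt, attn_mask=causal_mask)
        tgt = self.norm1(tgt + x)
        # 2) Encoder-decoder attention: queries come from the decoder,
        #    keys/values from the encoder outputs (memory).
        x, _ = self.cross_attn(tgt, memory, memory)
        tgt = self.norm2(tgt + x)
        # 3) Position-wise feed-forward network.
        tgt = self.norm3(tgt + self.ffn(tgt))
        return tgt
```

In the full model, six of these layers would be stacked, each one reusing the causal mask from the previous sketch for its self-attention sub-layer.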

