Decoder Stack: What’s Different

The Transformer decoder is also a stack of 6 identical layers, but each layer has 3 sub-layers:

  1. Masked multi-head self-attention over the already-generated output tokens. ("Masked" means a token cannot attend to future tokens, which keeps generation auto-regressive; see the mask sketch after this list.)
  2. Encoder-decoder attention, where the decoder's queries attend to all encoder outputs (the keys and values come from the encoder).
  3. The same position-wise feed-forward network as in the encoder.
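Here is a minimal sketch of the causal mask from step 1, using PyTorch conventions (a boolean mask where `True` marks positions a token is not allowed to attend to, which is what `nn.MultiheadAttention` expects); the helper name `causal_mask` is just illustrative:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True marks disallowed positions: everything strictly above the
    # diagonal, i.e. all future tokens relative to each position.
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

print(causal_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```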

Each sub-layer is again wrapped in a residual connection followed by layer normalization, as sketched below.
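One way these three sub-layers might fit together in PyTorch is sketched below. This is not the reference implementation; `DecoderLayer` is an illustrative name, and the sizes (`d_model=512`, `n_heads=8`, `d_ff=2048`) are assumed to match the paper's base model:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder layer: masked self-attention, encoder-decoder attention,
    and a position-wise feed-forward network, each followed by
    residual connection + layer normalization (post-norm, as in the paper)."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out, causal_mask):
        # 1. Masked self-attention over the already-generated tokens.
        a, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + a)                      # residual + layer norm
        # 2. Encoder-decoder attention: queries come from the decoder,
        #    keys and values from the encoder output.
        a, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + a)
        # 3. Position-wise feed-forward network.
        x = self.norm3(x + self.ffn(x))
        return x

# Example usage with made-up batch and sequence lengths:
layer = DecoderLayer()
tgt = torch.randn(2, 10, 512)                   # (batch, target length, d_model)
memory = torch.randn(2, 16, 512)                # encoder stack output
mask = torch.triu(torch.ones(10, 10, dtype=torch.bool), diagonal=1)
out = layer(tgt, memory, mask)                  # shape: (2, 10, 512)
```

The full decoder simply stacks 6 of these layers, feeding each layer's output into the next while every layer attends to the same encoder output.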