Model Variations: What Matters Most
They tried a range of variants on the base model:
Changing the number of attention heads (1, 4, 8, 16, 32):
- 8 heads worked best (see the sketch after this list).
- Too few heads lose expressiveness.
- Too many heads also hurt a bit.
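A minimal PyTorch sketch of why the head count matters: with d_model held fixed (512 in the base model), the per-head dimension is d_model/h, so 32 heads leave each head only a 16-dimensional subspace. The class and variable names here are illustrative, not the paper's reference code.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Scaled dot-product attention split across h heads (sketch)."""
    def __init__(self, d_model: int = 512, h: int = 8):
        super().__init__()
        assert d_model % h == 0, "d_model must be divisible by the head count"
        self.h = h
        self.d_k = d_model // h  # per-head dim: 64 for h=8, but only 16 for h=32
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        # Project, then split the width into (heads, d_k).
        q = self.w_q(x).view(b, t, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(x).view(b, t, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(x).view(b, t, self.h, self.d_k).transpose(1, 2)
        # Scaled dot-product attention within each head's subspace.
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        out = torch.softmax(scores, dim=-1) @ v
        # Concatenate heads back to (batch, time, d_model) and mix.
        return self.w_o(out.transpose(1, 2).reshape(b, t, d))
```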
Scaling up model width (d_model), depth (N layers), and feed-forward size (d_ff):
- Bigger models → better BLEU (unsurprising). The base vs. big configurations are sketched below.
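For context, the paper's base and big configurations, with a rough per-stack weight estimate (embeddings, biases, and layer norms ignored). The TransformerConfig helper and its approx_params_per_layer method are assumptions for illustration, not part of the paper.

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    d_model: int   # model width
    n_layers: int  # encoder/decoder depth (N)
    d_ff: int      # feed-forward inner size
    h: int         # attention heads

    def approx_params_per_layer(self) -> int:
        """Rough weight count per layer: 4 attention projection
        matrices plus the 2 feed-forward matrices."""
        attn = 4 * self.d_model * self.d_model
        ffn = 2 * self.d_model * self.d_ff
        return attn + ffn

base = TransformerConfig(d_model=512,  n_layers=6, d_ff=2048, h=8)
big  = TransformerConfig(d_model=1024, n_layers=6, d_ff=4096, h=16)
for cfg in (base, big):
    total = cfg.n_layers * cfg.approx_params_per_layer()
    print(cfg, f"~{total / 1e6:.0f}M weights per stack")
```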
Changing dropout:
- Removing dropout caused overfitting and hurt BLEU (see the sketch below).
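A sketch of where that dropout sits, assuming the paper's post-norm residual wiring: dropout is applied to each sub-layer's output before the residual add, with P_drop = 0.1 for the base model. SublayerConnection is an illustrative name, not an official API.

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Residual connection with dropout on the sub-layer output."""
    def __init__(self, d_model: int = 512, p_drop: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p_drop)

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        # Dropout regularizes each sub-layer's contribution; setting
        # p_drop=0.0 reproduces the "no dropout" ablation that overfit.
        return self.norm(x + self.dropout(sublayer(x)))
```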
Replacing sinusoidal positional encodings with learned positional embeddings:
- Performance was basically the same.
- They kept sinusoids because fixed sinusoids may extrapolate to sequence lengths longer than any seen during training (contrasted in the sketch below).
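A sketch contrasting the two choices (PyTorch; the length and width constants are illustrative): a learned table is capped at its num_embeddings, while the sinusoid formula can be evaluated at any position, which is the extrapolation argument.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_pe(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed sinusoids: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = torch.arange(max_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# Learned alternative: a lookup table, bounded by its training length.
learned_pe = nn.Embedding(num_embeddings=512, embedding_dim=512)

# Sinusoids extrapolate: just compute a longer table at inference time.
longer = sinusoidal_pe(max_len=4096, d_model=512)  # no retraining needed
```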

