Model Variations: What Matters Most

The authors tried lots of variants of the base model:

  • Changing the number of attention heads (1, 4, 8, 16, 32), scaling the per-head dimensions so total compute stays roughly constant:

    • 8 heads worked best.
    • Too few heads lose expressiveness (a single head was about 0.9 BLEU worse than the best setting).
    • Too many heads also hurt a bit; see the first sketch after this list.
  • Scaling up model width (d_model), depth (N layers), and feed-forward size (d_ff):

    • Bigger models → better BLEU (unsurprising).
  • Changing dropout:

    • Removing dropout caused overfitting and hurt BLEU; the second sketch after this list shows where dropout sits in each layer.
  • Replacing sinusoidal positional encodings with learned positional embeddings:

    • Performance was basically the same.
    • They kept sinusoids because a fixed function of position can extrapolate to sequences longer than any seen during training (third sketch below).
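
To make the head-count trade-off concrete, here is a minimal sketch (an illustration, not the paper's code). With d_model fixed, the paper sets d_k = d_v = d_model / h to keep total computation roughly constant, so adding heads shrinks the subspace each head works in:

```python
def per_head_dims(d_model=512):
    """Per-head dimension for each head count tried in the ablation.

    With d_model fixed, d_k = d_v = d_model / h, so at h=32 each head
    is left with only a 16-dimensional subspace -- one intuition for
    why very many heads starts to hurt.
    """
    for h in (1, 4, 8, 16, 32):
        print(f"h={h:2d}  d_k = d_v = {d_model // h}")

per_head_dims()  # h=8 gives d_k=64, the base-model setting
```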
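
For the dropout row, a hedged sketch of where dropout enters each layer, following the paper's description: it is applied to each sub-layer's output before the residual addition, with P_drop = 0.1 in the base model. The `sublayer` and `norm` callables below are placeholders, not names from the paper's code:

```python
import numpy as np

P_DROP = 0.1  # base-model rate reported in the paper

def dropout(x, p=P_DROP, rng=np.random.default_rng(0), train=True):
    """Inverted dropout: zero each activation with prob p, rescale the rest."""
    if not train or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def residual_sublayer(x, sublayer, norm):
    """The paper's pattern: LayerNorm(x + Dropout(Sublayer(x)))."""
    return norm(x + dropout(sublayer(x)))
```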
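
Finally, a sketch of the sinusoidal encoding itself. Because it is a fixed function of position rather than a learned lookup table, it can be evaluated at positions far beyond any training length, which is the extrapolation argument above:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model=512):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); cosine on odd indices."""
    pos = np.arange(max_len)[:, None]      # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]  # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# A learned embedding table trained up to some fixed length has no row
# for position 9999; the sinusoid just evaluates there:
print(sinusoidal_positional_encoding(10_000).shape)  # (10000, 512)
```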