Regularization Tricks They Used

They used three main stabilizers:

  1. Dropout

    • Applied to the output of each sub-layer, before it is added to the residual input and normalized (see the first sketch after this list)
    • Also applied to the sums of the embeddings and the positional encodings
    • The base model used a dropout rate of 0.1
  2. Label smoothing

    • Instead of training on one-hot targets, they softened the targets slightly (ε_ls = 0.1); a sketch of the computation follows this list.
    • This hurts raw perplexity (the model learns to be less sure) but improves accuracy and BLEU score.
  3. Averaging checkpoints

    • At inference, they averaged the weights of the last several checkpoints (5 for the base model, 20 for the big model) to stabilize predictions; see the last sketch below.
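
Here is a minimal PyTorch sketch of the dropout placement. The post-norm wiring (`LayerNorm(x + Dropout(Sublayer(x)))`) follows the paper's description, but the module name `SublayerConnection` and the exact structure are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Residual wrapper: dropout hits the sub-layer output *before*
    it is added back to the input and layer-normalized (post-norm)."""

    def __init__(self, d_model: int, p_drop: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p_drop)

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        # LayerNorm(x + Dropout(Sublayer(x)))
        return self.norm(x + self.dropout(sublayer(x)))

# The same dropout is also applied to the sum of token embeddings and
# positional encodings at the bottom of the encoder and decoder stacks:
# x = nn.Dropout(0.1)(token_embedding + positional_encoding)
```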
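Label smoothing can be expressed with PyTorch's built-in `label_smoothing` argument; the manual version below makes the softened target explicit. Note that PyTorch spreads ε uniformly over all classes (so the true class keeps 1 − ε + ε/V); implementations vary slightly in this convention:

```python
import torch
import torch.nn as nn

eps, vocab_size, batch = 0.1, 8, 4
logits = torch.randn(batch, vocab_size)
targets = torch.randint(0, vocab_size, (batch,))

# Built-in: cross-entropy against softened targets.
loss = nn.CrossEntropyLoss(label_smoothing=eps)(logits, targets)

# By hand: (1 - eps) on the one-hot target plus eps spread uniformly.
one_hot = torch.zeros(batch, vocab_size).scatter_(1, targets.unsqueeze(1), 1.0)
smoothed = (1.0 - eps) * one_hot + eps / vocab_size
manual = -(smoothed * logits.log_softmax(dim=-1)).sum(dim=-1).mean()

assert torch.allclose(loss, manual, atol=1e-6)
```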
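Checkpoint averaging is just a parameter-wise mean over saved weights. This sketch assumes PyTorch state dicts with identical keys; the helper name `average_checkpoints` and the file names are hypothetical:

```python
import torch

def average_checkpoints(paths):
    """Return a state dict whose tensors are the element-wise mean
    of the corresponding tensors in each checkpoint."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            # Cast to float so integer buffers survive the division below.
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# Hypothetical usage, mirroring the base model's last-5 average:
# model.load_state_dict(average_checkpoints(
#     [f"checkpoint_{i}.pt" for i in range(5)]))
```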