Regularization Tricks They Used
They used three main stabilizers:
Dropout
- Applied to the output of each sub-layer, before it is added to the residual (the sub-layer input) and normalized
- Also applied to the sum of the embeddings and the positional encodings
- Base model used a dropout rate of 0.1
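To make the placement concrete, here is a minimal PyTorch-style sketch of the residual dropout; the module and argument names are illustrative, not taken from the paper's code:

```python
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Residual block: dropout is applied to the sub-layer output
    before it is added to the input and layer-normalized (post-norm)."""
    def __init__(self, d_model: int, p_drop: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p_drop)

    def forward(self, x, sublayer):
        # sublayer is e.g. the self-attention or feed-forward module
        return self.norm(x + self.dropout(sublayer(x)))

# At the model input, dropout is applied to the summed embeddings:
# x = nn.Dropout(0.1)(token_embeddings + positional_encodings)
```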
Label smoothing
- Instead of training on one-hot targets, they softened the targets slightly (ε_ls = 0.1).
- This hurts raw perplexity (the model learns to be less sure) but improves accuracy and BLEU.
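A minimal sketch of a label-smoothed loss, assuming the standard formulation in which the target distribution mixes the one-hot vector with a uniform distribution (weight ε_ls = 0.1); the function name is illustrative:

```python
import torch
import torch.nn.functional as F

def label_smoothed_nll(logits, targets, eps=0.1):
    """Cross entropy against softened targets:
    (1 - eps) on the true class, eps spread uniformly over all classes."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # one-hot term
    uniform = -log_probs.mean(dim=-1)                               # uniform term
    return ((1.0 - eps) * nll + eps * uniform).mean()

# Recent PyTorch versions expose the same idea directly:
# loss = F.cross_entropy(logits, targets, label_smoothing=0.1)
```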
Averaging checkpoints
- At inference, they averaged the weights of the last several checkpoints to stabilize predictions.
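A minimal sketch of checkpoint averaging, assuming the checkpoints were saved as PyTorch state dicts; the file paths and the number of checkpoints averaged here are illustrative:

```python
import torch

def average_checkpoints(paths):
    """Element-wise average of the parameters in several saved state dicts."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# Hypothetical usage: average the last few checkpoints before decoding
# model.load_state_dict(average_checkpoints(["ckpt_96.pt", "ckpt_97.pt", "ckpt_98.pt"]))
```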

