Optimizer, Learning Rate, and Warmup
They used the Adam optimizer with custom hyperparameters: β1 = 0.9, β2 = 0.98, ε = 1e-9.
They used a special learning rate schedule:
- Warmup: linearly increase the learning rate over the first 4000 steps.
- After warmup: decay it proportionally to the inverse square root of the step number, with the whole schedule scaled by the inverse square root of the model dimension (d_model). See the sketch after this list.
- Intuition: don’t blast the model with a huge LR at the start; gradually ramp up, then cool down.
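
Below is a minimal sketch of that schedule, following the formula from the paper (lrate = d_model^-0.5 · min(step^-0.5, step · warmup_steps^-1.5)). The function name is illustrative, and the d_model=512 default assumes the base model configuration.

```python
import math

def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given training step.

    Linearly ramps up for the first warmup_steps, then decays
    proportionally to 1/sqrt(step); the whole curve is scaled
    by 1/sqrt(d_model).
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The peak learning rate lands exactly at step == warmup_steps:
for s in [1, 1000, 4000, 10000, 100000]:
    print(f"step {s:>6}: lr = {transformer_lr(s):.6f}")
```

In a PyTorch training loop, a function like this is typically plugged into `torch.optim.lr_scheduler.LambdaLR` as the `lr_lambda` (with the optimizer's base LR set to 1.0 so the lambda's output is used directly).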

