Optimizer, Learning Rate, and Warmup

They used the Adam optimizer with custom hyperparameters:

  • β1 = 0.9, β2 = 0.98, ϵ = 1e-9.
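As a rough illustration, here is how these hyperparameters could be passed to PyTorch's Adam optimizer. The placeholder model and the base learning rate of 1.0 are assumptions made for this sketch; the schedule described below rescales that base rate every step.

```python
import torch

# Placeholder model, only so there are parameters to optimize (assumption for this sketch).
model = torch.nn.Linear(512, 512)

# Adam with the hyperparameters listed above: β1 = 0.9, β2 = 0.98, ϵ = 1e-9.
# lr=1.0 is a base value; the warm-up/decay schedule multiplies it at each step.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1.0,
    betas=(0.9, 0.98),
    eps=1e-9,
)
```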

They also used a custom learning rate schedule:

  • Warm-up: linearly increase the learning rate over the first 4000 training steps.
  • Decay: after warm-up, decrease it proportionally to the inverse square root of the step number; the whole schedule is also scaled by the inverse square root of the model dimension (a sketch follows this list).
  • Intuition: don’t blast the model with a huge LR at the start; gradually ramp up, then cool down.
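A minimal sketch of this schedule, following the formula from the paper, lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)), with d_model = 512 and 4000 warm-up steps. The function name and the guard against step 0 are choices made for this sketch.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given step:
    d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5)).
    Rises linearly during warm-up, then decays as 1/sqrt(step).
    """
    step = max(step, 1)  # guard against step 0
    return (d_model ** -0.5) * min(step ** -0.5, step * (warmup_steps ** -1.5))
```

With a base learning rate of 1.0 on the optimizer above, this function could be wired in via torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=transformer_lr), so its return value becomes the effective learning rate at each step.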