Training Setup (Translation Tasks)
They trained on:
- WMT 2014 English→German: ~4.5M sentence pairs; shared source–target subword vocabulary of about 37k tokens, built with byte-pair encoding (BPE).
- WMT 2014 English→French: ~36M sentence pairs; 32k word-piece vocabulary.
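The paper doesn't say which tokenization tooling was used. As an illustration only, a shared BPE vocabulary of roughly this size could be built with SentencePiece; the file names and the choice of library here are assumptions, not from the paper:

```python
import sentencepiece as spm

# Train one BPE model on the concatenated English + German training text,
# so source and target share a single ~37k-token subword vocabulary.
# (File names and library choice are illustrative, not from the paper.)
spm.SentencePieceTrainer.train(
    "--input=train.en,train.de "
    "--model_prefix=ende_bpe "
    "--vocab_size=37000 "
    "--model_type=bpe"
)

sp = spm.SentencePieceProcessor(model_file="ende_bpe.model")
print(sp.encode("The Transformer needs no recurrence.", out_type=str))
```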
Sentence pairs were batched by approximate sequence length for training efficiency; each batch contained roughly 25k source tokens and 25k target tokens.
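A minimal sketch of this kind of token-budget batching (the function and parameter names are mine, not from the paper): sort pairs by length so each batch holds similarly sized sequences, then fill batches until the ~25k-token-per-side budget is hit.

```python
def batch_by_length(pairs, max_tokens_per_side=25_000):
    """Group (src_tokens, tgt_tokens) pairs into batches of similar length,
    capping both source and target token counts per batch at the budget."""
    # Sorting by length keeps padding waste low within each batch.
    pairs = sorted(pairs, key=lambda p: (len(p[0]), len(p[1])))
    batches, batch, src_count, tgt_count = [], [], 0, 0
    for src, tgt in pairs:
        if batch and (src_count + len(src) > max_tokens_per_side
                      or tgt_count + len(tgt) > max_tokens_per_side):
            batches.append(batch)
            batch, src_count, tgt_count = [], 0, 0
        batch.append((src, tgt))
        src_count += len(src)
        tgt_count += len(tgt)
    if batch:
        batches.append(batch)
    return batches
```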
Hardware: a single machine with 8 NVIDIA P100 GPUs.
Base model:
- ~12 hours of training (100k steps at ~0.4 s/step).
Big model:
- ~3.5 days of training (300k steps at ~1.0 s/step).
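The wall-clock figures follow directly from the step counts and per-step times:

```python
base_seconds = 100_000 * 0.4   # 40,000 s ≈ 11 h, quoted as ~12 hours
big_seconds = 300_000 * 1.0    # 300,000 s ≈ 3.5 days
print(base_seconds / 3600, big_seconds / 86400)  # ~11.1 hours, ~3.47 days
```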


