
Training Setup (Translation Tasks)

The authors trained on two WMT 2014 translation benchmarks:

  • WMT 2014 English→German: ~4.5M sentence pairs, with a shared source-target subword vocabulary of about 37k tokens built with byte-pair encoding (BPE; see the sketch after this list).
  • WMT 2014 English→French: ~36M sentence pairs, with a 32k word-piece vocabulary.
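
Both vocabularies were learned from the training corpora. Below is a minimal sketch of building a shared BPE vocabulary with the Hugging Face tokenizers library; the original authors used their own BPE/word-piece tooling, and the file names train.en / train.de are hypothetical:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# One shared source/target vocabulary, as in the En->De setup.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=37_000,  # ~37k subword tokens shared across both languages
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)

# Training on the concatenation of both sides yields one shared vocab.
tokenizer.train(files=["train.en", "train.de"], trainer=trainer)
tokenizer.save("shared_bpe_37k.json")
```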

Each batch contained roughly 25k source tokens and 25k target tokens. Sentence pairs were grouped by approximate length, which keeps padding (and wasted compute) low; one common way to do this is sketched below.
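
The paper does not spell out the exact batching algorithm, so this sketch (the function name and token cap are illustrative) shows one straightforward implementation of length-sorted, token-count batching:

```python
def make_token_batches(pairs, max_tokens=25_000):
    """Group (src_ids, tgt_ids) pairs into batches capped at roughly
    max_tokens per side once padding is accounted for."""
    # Sort by length so sequences in a batch have similar sizes,
    # keeping padding waste small.
    pairs = sorted(pairs, key=lambda p: (len(p[0]), len(p[1])))
    batches, batch, src_max, tgt_max = [], [], 0, 0
    for src, tgt in pairs:
        new_src_max = max(src_max, len(src))
        new_tgt_max = max(tgt_max, len(tgt))
        rows = len(batch) + 1
        # Padded size per side = rows x longest sequence in the batch.
        if batch and (rows * new_src_max > max_tokens
                      or rows * new_tgt_max > max_tokens):
            batches.append(batch)
            batch, new_src_max, new_tgt_max = [], len(src), len(tgt)
        batch.append((src, tgt))
        src_max, tgt_max = new_src_max, new_tgt_max
    if batch:
        batches.append(batch)
    return batches
```

Because the pairs are sorted first, each batch holds sequences of similar length, so the padded rows-by-max-length tensors stay close to the ~25k-token budget on each side.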

Hardware: a single machine with 8 NVIDIA P100 GPUs.

Base model:

  • ~12 hours of training (100k steps at ~0.4 s/step).

Big model:

  • ~3.5 days of training (300k steps at ~1.0 s/step).
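
These figures are self-consistent; a quick back-of-the-envelope check:

```python
# Sanity-check the reported wall-clock times from steps x step time.
base_seconds = 100_000 * 0.4   # 40,000 s
big_seconds = 300_000 * 1.0    # 300,000 s

print(f"base: {base_seconds / 3600:.1f} h")     # 11.1 h  -> "~12 hours"
print(f"big:  {big_seconds / 86400:.2f} days")  # 3.47 d  -> "~3.5 days"
```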