Inference: How It Generates Translations
At inference:
- The decoder generates tokens one at a time.
- They use **beam search** (beam size ~4 for translation), which keeps multiple candidate sequences in parallel and chooses the best-scoring one.
- They apply a **length penalty** so the model doesn't unfairly prefer too-short outputs.
- They also cap the maximum output length at input_length + 50, but stop early if the model predicts an end-of-sentence token. (A minimal decoding sketch follows below.)
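
To make the decoding loop concrete, here is a minimal sketch in plain Python. The names `step_log_probs`, `beam_search`, and `length_penalty` are illustrative, not from the paper, and `step_log_probs` is a toy stand-in for one decoder step (a real model would attend to the encoder output). It keeps a beam of partial hypotheses, expands each by one token per step, rescores finished hypotheses with the Wu et al. (2016) length penalty (α ≈ 0.6, the value typically used with the Transformer), and stops at input_length + 50 or once every hypothesis has emitted end-of-sentence.

```python
import math

# Toy token ids and vocabulary size for illustration only.
BOS, EOS = 0, 1
VOCAB_SIZE = 8

def step_log_probs(prefix):
    """Stand-in for one decoder step: log-probs over the vocabulary given the
    tokens generated so far. This toy version just favors EOS as the prefix grows."""
    eos_bias = min(0.9, 0.15 * len(prefix))
    probs = [(1.0 - eos_bias) / (VOCAB_SIZE - 1)] * VOCAB_SIZE
    probs[EOS] = eos_bias
    return [math.log(p) for p in probs]

def length_penalty(length, alpha=0.6):
    """Length normalization from Wu et al. (2016): ((5 + |Y|) / 6) ** alpha."""
    return ((5 + length) / 6) ** alpha

def beam_search(input_length, beam_size=4, alpha=0.6):
    max_len = input_length + 50          # cap output at input length + 50
    beams = [([BOS], 0.0)]               # (tokens, summed log-prob)
    finished = []

    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for tokens, score in beams:
            for tok, lp in enumerate(step_log_probs(tokens)):
                candidates.append((tokens + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)

        # Keep the best candidates; hypotheses that emit EOS stop early.
        beams = []
        for tokens, score in candidates[: beam_size * 2]:
            if tokens[-1] == EOS:
                finished.append((tokens, score / length_penalty(len(tokens), alpha)))
            else:
                beams.append((tokens, score))
            if len(beams) == beam_size:
                break
        if not beams:                    # every hypothesis has finished
            break

    if not finished:                     # fall back to unfinished beams at the length cap
        finished = [(t, s / length_penalty(len(t), alpha)) for t, s in beams]
    return max(finished, key=lambda f: f[1])[0]

print(beam_search(input_length=6))
```

Note that the length penalty is applied when comparing finished hypotheses, which is why longer translations are not unfairly penalized relative to short ones that happen to have higher raw log-probability.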

