
Positional Encoding (Because There’s No Recurrence)

Transformers have no built-in notion of token order, because self-attention treats its input as an unordered set. To fix that, a positional encoding is added to each token embedding.

They use fixed sinusoidal encodings:

  • Each dimension of the positional encoding is a sine or cosine at a different frequency; the wavelengths form a geometric progression from 2π to 10000·2π.
  • These encodings are added to the token embeddings at the bottoms of both the encoder and decoder stacks (a minimal sketch follows this list).
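
Concretely, the original Transformer paper defines PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). Here is a minimal NumPy sketch of that scheme; the function name and array shapes are illustrative choices, not any library's API.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of fixed sinusoidal position encodings.

    Even dimensions use sine, odd dimensions use cosine; each (sin, cos) pair
    shares a frequency that decreases geometrically from 1 down to 1/10000.
    """
    positions = np.arange(max_len)[:, np.newaxis]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # (1, d_model/2)
    angle_rates = 1.0 / np.power(10000.0, dims / d_model)    # one frequency per pair
    angles = positions * angle_rates                          # (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even indices: sine
    pe[:, 1::2] = np.cos(angles)   # odd indices: cosine
    return pe

# The encodings are simply added to the token embeddings:
# x = token_embeddings + pe[:seq_len]
```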

Why sinusoids?

  • They let the model use both absolute and relative positions: for any fixed offset k, the encoding at position pos + k is a linear function of the encoding at pos, so attending by relative offsets is easy to learn (see the numerical check after this list).
  • They can, in principle, generalize to sequences longer than any seen in training, because the encoding is a continuous function of position rather than a learned lookup table.
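
To see the relative-position property concretely, here is a quick numerical check, reusing the sinusoidal_positional_encoding sketch above (the particular offset and position are arbitrary): for a fixed offset k, each (sin, cos) pair at position pos + k is just a rotation of the pair at pos.

```python
import numpy as np

# Reuses sinusoidal_positional_encoding from the sketch above.
d_model, k, pos = 8, 5, 10
pe = sinusoidal_positional_encoding(max_len=64, d_model=d_model)

rotated = np.zeros(d_model)
for i in range(0, d_model, 2):
    theta = k / np.power(10000.0, i / d_model)  # rotation angle for this frequency pair
    c, s = np.cos(theta), np.sin(theta)
    # Rotate the (sin, cos) pair at position `pos` by theta ...
    rotated[i]     = pe[pos, i] * c + pe[pos, i + 1] * s
    rotated[i + 1] = -pe[pos, i] * s + pe[pos, i + 1] * c

# ... and it matches the encoding at position pos + k.
assert np.allclose(rotated, pe[pos + k])
```

The same function-based definition is what makes extrapolation possible: nothing prevents evaluating the encoding at positions beyond the longest training sequence, although how well a trained model actually uses those unseen positions is a separate question.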