
Attention, at a High Level

Attention is a mechanism where the model asks: “For this position I’m generating or encoding, which other positions are relevant, and how relevant are they?”

Instead of passing information strictly left→right, attention can directly link any word to any other word. So the path between distant words like “making … more difficult” becomes effectively 1 hop, not 20 hops.
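
To make that concrete, here is a minimal NumPy sketch of single-head dot-product attention. The shapes, toy inputs, and the `attention` name are illustrative assumptions, not code from the paper: each position scores every other position, turns those scores into weights, and blends the values accordingly.

```python
# Minimal sketch of single-head dot-product self-attention (illustrative only).
import numpy as np

def attention(Q, K, V):
    """For each query position, weigh every key position by relevance."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # relevance of every position to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over positions
    return weights @ V                                    # blend values by relevance

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))     # 4 tokens, 8-dim embeddings
out = attention(x, x, x)        # self-attention: every token attends to every token
print(out.shape)                # (4, 8)
```

Note that the score matrix is full: token 1 can attend to token 20 directly, which is exactly the “1 hop, not 20 hops” point above.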

The paper’s move: build an entire encoder-decoder model that uses attention everywhere, and only attention. No recurrent layers. No convolutional layers. Just attention + simple feed-forward layers.
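
As a rough sketch of what “attention + feed-forward, nothing else” means, here is a stripped-down encoder block that reuses the `attention()` helper (and `rng`, `x`) from the snippet above. Multi-head attention, layer normalization, and positional encodings from the paper are omitted for brevity, so treat this as an assumption-laden outline, not the paper’s architecture.

```python
# Stripped-down encoder block: self-attention sublayer + position-wise feed-forward
# sublayer, with residual connections. Normalization and multi-head details omitted.
def feed_forward(x, W1, b1, W2, b2):
    """Two-layer MLP applied to each position independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_block(x, W1, b1, W2, b2):
    x = x + attention(x, x, x)                 # every token looks at every token
    x = x + feed_forward(x, W1, b1, W2, b2)    # simple per-position transformation
    return x

d_model, d_ff = 8, 32
W1 = rng.normal(size=(d_model, d_ff)); b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)); b2 = np.zeros(d_model)
y = encoder_block(x, W1, b1, W2, b2)           # (4, 8): same shape in, same shape out
```

No recurrence, no convolution: the block is just the two sublayers, which is the core structural claim of the paper.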