Why Self-Attention Wins (Speed + Quality)
The paper compares self-attention to recurrent and convolutional layers on three criteria: parallelism, path length for long-range dependencies, and per-layer computational cost.
Parallelism:
- Self-attention processes all positions in parallel: only O(1) sequential operations per layer.
- RNNs require O(n) sequential steps, because each hidden state depends on the previous one (contrast sketched below).
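
To make the contrast concrete, here is a minimal NumPy sketch (illustrative shapes and weight names, not the paper's code): the RNN update is an explicit loop over time steps, while self-attention computes all pairwise interactions in one batched matrix multiply.

```python
import numpy as np

n, d = 6, 8                      # sequence length, hidden size (illustrative values)
x = np.random.randn(n, d)        # input embeddings, one row per position

# RNN-style recurrence: n sequential steps, each depending on the previous state.
W_h, W_x = np.random.randn(d, d), np.random.randn(d, d)
h = np.zeros(d)
for t in range(n):               # this loop cannot be parallelized across t
    h = np.tanh(x[t] @ W_x + h @ W_h)

# Self-attention: every position attends to every other in one batched matmul,
# so there are no sequential dependencies across positions.
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d)                        # (n, n): all pairs at once
scores -= scores.max(-1, keepdims=True)              # numerical stability for softmax
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
out = weights @ V                                    # all positions updated in parallel
```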
Path Length for Long-Range Dependencies:
- Self-attention: any position can attend to any other in a single step (maximum path length 1).
- RNN: information must traverse up to n time steps, so long-range signals degrade along the way.
- CNN: a single layer only connects positions within the kernel width k, so a stack of roughly n/k layers (or log_k n with dilated convolutions) is needed to link distant positions; see the sketch after this list.
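
A small arithmetic sketch of the maximum path length between the first and last position under each architecture (the sequence length n and kernel width k below are assumed example values):

```python
import math

n, k = 100, 3          # sequence length, convolution kernel width (assumed)

self_attention_path = 1                        # direct attention link, one step
rnn_path = n - 1                               # signal must cross every time step
conv_path = math.ceil((n - 1) / (k - 1))       # contiguous kernels: layers needed
dilated_conv_path = math.ceil(math.log(n, k))  # dilated kernels: O(log_k n) layers

print(self_attention_path, rnn_path, conv_path, dilated_conv_path)
# -> 1 99 50 5 (exact values depend on the padding/dilation scheme)
```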
Computational Cost:
- Self-attention is O(n² · d) per layer, where n = sequence length and d = representation size.
- Recurrent layers are O(n · d²) per layer. Since typical sentence lengths (n in the tens) are smaller than typical hidden sizes (d ≈ 512), self-attention is competitive or faster; a quick comparison follows below.
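
A back-of-the-envelope check of the two cost terms, assuming a roughly 50-token sentence and a hidden size of 512 (values chosen for illustration):

```python
n, d = 50, 512                      # e.g. a ~50-token sentence, d = 512 hidden units

self_attention_ops = n * n * d      # n^2 * d = 1,280,000
rnn_ops = n * d * d                 # n * d^2 = 13,107,200

print(self_attention_ops, rnn_ops)  # self-attention is ~10x cheaper here because n < d
```

The ratio is d / n, so self-attention's advantage grows as the hidden size outpaces the sequence length; for very long sequences (n >> d) the n² term dominates instead.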

