Why Self-Attention Wins (Speed + Quality)

The paper compares self-attention with recurrent and convolutional layers along three factors:

  1. Parallelism:

    • Self-attention processes all positions in parallel: a layer needs only O(1) sequential steps.
    • RNNs require O(n) sequential steps because each hidden state depends on the previous one (see the first sketch after this list).
  2. Path Length for Long-Range Dependencies:

    • Self-attention: any position can attend to any other in a single step.
    • RNN: information must flow through up to n time steps, so long-range signals weaken.
    • CNN: needs a stack of convolution layers to connect distant positions, roughly n/k layers with contiguous kernels of width k or log_k(n) layers with dilated convolutions (see the path-length sketch below).
  3. Computational Cost:

    • Self-attention per layer is O(n² · d), where n = sequence length and d = hidden size.
    • RNNs are O(n · d²) per layer. When n is smaller than d, as is typical for sentence-length inputs with d of 512 or more, self-attention is competitive or cheaper (see the cost comparison after this list).
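
To make the parallelism point concrete, here is a minimal NumPy sketch (shapes, weight names, and initialization are illustrative assumptions, not the paper's setup). The RNN loop has to visit positions one at a time because each hidden state feeds the next, while the self-attention update covers all n positions with a handful of matrix multiplies.

```python
import numpy as np

n, d = 8, 16                      # sequence length, hidden size (assumed toy values)
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))   # one sequence of n token embeddings

# RNN: O(n) sequential steps; each hidden state needs the previous one.
W_xh = rng.standard_normal((d, d)) * 0.1
W_hh = rng.standard_normal((d, d)) * 0.1
h = np.zeros(d)
rnn_states = []
for t in range(n):                # this loop cannot be parallelized across t
    h = np.tanh(X[t] @ W_xh + h @ W_hh)
    rnn_states.append(h)

# Self-attention: O(1) sequential steps; all positions handled at once.
W_q = rng.standard_normal((d, d)) * 0.1
W_k = rng.standard_normal((d, d)) * 0.1
W_v = rng.standard_normal((d, d)) * 0.1
Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d)                    # (n, n): every pair of positions in one multiply
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
attn_out = weights @ V                           # (n, d) outputs for all positions together
```

There is no loop over positions on the attention path; that absence is exactly what the O(1) sequential-steps claim refers to.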
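
The path-length argument reduces to simple arithmetic. The sketch below uses an assumed sequence length n and kernel width k and prints how many steps a signal needs to connect the two ends of the sequence; the CNN layer counts follow the paper's O(n/k) and O(log_k(n)) estimates, rounded up.

```python
import math

n, k = 1024, 3   # assumed sequence length and convolution kernel width

print("self-attention:", 1)                                       # any two positions connect in one step
print("RNN:", n - 1)                                               # signal crosses n - 1 recurrent updates
print("CNN (contiguous kernels):", math.ceil((n - 1) / (k - 1)))   # receptive field grows by k - 1 per layer
print("CNN (dilated kernels):", math.ceil(math.log(n, k)))         # receptive field grows by a factor of k per layer
```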
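
Finally, a rough per-layer multiply count shows why the quadratic term is acceptable at sentence scale. The (n, d) pairs below are assumptions chosen to bracket typical and long-sequence cases, and constant factors are dropped.

```python
# Self-attention: all n^2 position pairs do a d-dimensional dot product -> n^2 * d multiplies.
# RNN: a d x d matrix multiply at each of n time steps                  -> n * d^2 multiplies.
for n, d in [(50, 512), (512, 512), (4096, 512)]:
    attn_cost = n * n * d
    rnn_cost = n * d * d
    print(f"n={n:5d} d={d}: attention ~{attn_cost:.1e}, rnn ~{rnn_cost:.1e}, "
          f"attention/rnn = {attn_cost / rnn_cost:.2f}")
```

At n = 50 the attention layer does roughly a tenth of the multiplies; the two costs only meet at n = d, and beyond that the quadratic term starts to dominate.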