The Bottleneck With Old Sequence Models
RNNs (like LSTMs and GRUs) compute a hidden state one token at a time, and each state depends on the previous one. On long sequences that means many steps that must run strictly in order, so you can’t parallelize across time steps during training.
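Here’s a minimal sketch of that dependency in plain NumPy, with toy dimensions and made-up weights (all names here are illustrative, not from any real model):

```python
import numpy as np

# Toy RNN step: h_t = tanh(W_h @ h_{t-1} + W_x @ x_t)
hidden_size, input_size, seq_len = 8, 4, 100

rng = np.random.default_rng(0)
W_h = rng.normal(size=(hidden_size, hidden_size)) * 0.1  # hidden-to-hidden weights
W_x = rng.normal(size=(hidden_size, input_size)) * 0.1   # input-to-hidden weights
x = rng.normal(size=(seq_len, input_size))               # one input sequence

h = np.zeros(hidden_size)
for t in range(seq_len):
    # Step t needs the result of step t-1, so this loop cannot be
    # parallelized across time: 100 tokens means 100 ordered updates.
    h = np.tanh(W_h @ h + W_x @ x[t])
```

Nothing inside the loop is expensive; the cost is the loop itself, because no step can start before the previous one finishes.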
CNN-based sequence models improved parallelism by using convolutions over windows of words. But they still only connect distant tokens indirectly, through many stacked layers. The longer the span between two related words, the more layers you have to stack.
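A rough back-of-the-envelope helper makes the stacking cost concrete. This assumes plain stride-1, non-dilated convolutions (dilated variants grow the receptive field faster), and `layers_to_connect` is just an illustrative name:

```python
import math

def layers_to_connect(distance: int, kernel_size: int = 3) -> int:
    """Stacked (non-dilated, stride-1) conv layers needed before two tokens
    `distance` positions apart share one receptive field.
    Receptive field after L layers is L * (kernel_size - 1) + 1."""
    return math.ceil(distance / (kernel_size - 1))

for d in (2, 10, 100, 1000):
    print(d, layers_to_connect(d))  # 1, 5, 50, 500 layers for kernel size 3
```

With kernel size 3, relating tokens a thousand positions apart takes hundreds of layers, which is exactly the indirect-connection problem described above.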
The core pain: both approaches handle long-range dependencies inefficiently. The farther apart two related words are, the more sequential steps or stacked layers stand between them, and the harder it is for the model to learn how they relate.

