One Pager Cheat Sheet
- The paper introduced the Transformer, a sequence transduction model that ditches recurrence and convolutions and uses only attention (specifically self-attention) to be fully parallelizable and learn long-range dependencies in fewer steps, enabling it to train faster, scale better, and set new state-of-the-art translation results (measured by BLEU) more cheaply than RNN- or CNN-based systems.
- Older sequence models—RNNs (e.g., LSTMs, GRUs) that update a hidden state token by token, and CNN-based models that stack local convolutions—suffer from sequential hidden-state updates, limited parallelism, and indirect, depth-dependent connections, so both struggle to capture long-range dependencies efficiently.
- False — because the recurrence h_t = f(h_{t-1}, x_t) creates a sequential dependency: you cannot compute h_t until h_{t-1} is known, forcing step-by-step computation in both training (via Backpropagation Through Time, BPTT) and autoregressive inference, even though you can parallelize across batches and within-step matrix operations (e.g., in LSTM/GRU kernels), whereas convolutional sequence models and Transformers compute all positions simultaneously (a toy loop illustrating this follows below).
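A minimal stdlib-only Python sketch (my illustration, not from the paper) of that sequential dependency: a toy scalar recurrence in which every h_t needs h_{t-1}, so the loop over time steps cannot be parallelized, however fast each individual step is.

```python
import math

# Toy scalar "RNN" (made-up weights): each state depends on the previous one,
# so this loop over time steps cannot be parallelized, unlike attention.
def rnn_forward(inputs, w_h=0.5, w_x=1.0):
    h = 0.0
    states = []
    for x_t in inputs:                      # h_t = f(h_{t-1}, x_t)
        h = math.tanh(w_h * h + w_x * x_t)  # must wait for the previous h
        states.append(h)
    return states

print(rnn_forward([0.1, 0.4, -0.2, 0.7]))
```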
- The paper builds a full encoder-decoder model that uses only attention (plus simple feed-forward parts), creating an all-attention architecture that directly links any word to any other, so long-range dependencies become effectively one hop and there are no recurrent or convolutional layers.
- Because attention computes normalized relevance scores between the query of a position and the keys of every other position and uses them to form a weighted sum of values, the model can, in a single layer, directly connect to every token, adaptively focus on any token regardless of distance, and evaluate those connections in parallel.
- The Transformer retains the classic encoder-decoder structure: the encoder reads the input and produces contextual vector representations, the decoder generates the output one symbol at a time using previous outputs plus the encoded input, and both are built from repeated layers of multi-head self-attention and feed-forward networks rather than stacks of RNN cells.
- The Transformer encoder is a stack of N = 6 identical layers; each layer has a multi-head self-attention sub-layer and a position-wise feed-forward network sub-layer, and every sub-layer is wrapped with residual connections and layer normalization, yielding the pattern x -> Attention -> Add&Norm -> FeedForward -> Add&Norm.
- The "Add&Norm" phrase describes the combination of an identity skip (shortcut) path implemented as a residual connection (i.e., y = x + F(x), where F(x) is the sub-layer such as multi-head self-attention or the feed-forward MLP), which improves gradient flow and mitigates vanishing gradients, with layer normalization (applied as LayerNorm after the addition) to stabilize activations — together, residual connection + layer normalization (a minimal sketch follows below).
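A minimal sketch of the Add&Norm pattern, assuming plain Python lists and omitting the learnable gain and bias that real LayerNorm carries: the sub-layer output is added back onto its input and the sum is layer-normalized.

```python
import math

def layer_norm(x, eps=1e-6):
    # Normalize one position's feature vector to zero mean / unit variance
    # (the learnable gain and bias of real LayerNorm are omitted here).
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def add_and_norm(x, sublayer_out):
    # The Add&Norm pattern: LayerNorm(x + Sublayer(x)).
    return layer_norm([xi + si for xi, si in zip(x, sublayer_out)])

x = [0.5, -1.2, 3.0, 0.1]             # input to the sub-layer
sublayer_out = [0.2, 0.4, -0.5, 1.0]  # e.g. what self-attention returned for this position
print(add_and_norm(x, sublayer_out))
```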
- The Transformer decoder is a stack of 6 identical layers where each layer has three sub-layers—masked multi-head self-attention (preventing future-token peeking to keep generation auto-regressive), encoder-decoder attention (attending to all encoder outputs), and a position-wise feed-forward network—and each sub-layer uses residual connections + layer normalization.
- False: the decoder enforces causality via masked multi-head self-attention—a triangular mask added as −∞ inside softmax((QK^T)/sqrt(d_k) + mask) makes attention to positions j > i effectively zero—while encoder-decoder attention only sees encoder outputs (so it can't leak future decoder tokens), and the same causal masking is applied during training (often with teacher forcing) and inference, thus preserving the autoregressive property (a mask sketch follows below).
- Self-attention (also called intra-attention) relates every position in a sequence to every other position, then weights and blends their information to compute an updated representation for each position, enabling the model to capture word–word relationships regardless of distance.
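A small stdlib-only sketch of the causal-mask idea: adding −∞ to the scores of future positions before the softmax drives their attention weights to zero. The score values are made up; only the masking pattern matters.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    # mask[i][j] = 0 for j <= i (visible past/self), -inf for j > i (future).
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def masked_softmax_row(scores, mask_row):
    masked = [s + m for s, m in zip(scores, mask_row)]
    top = max(masked)                           # finite: position i always sees itself
    exps = [math.exp(v - top) for v in masked]  # exp(-inf) underflows to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw attention scores for a 4-token decoder prefix:
scores = [[1.2, 0.3, -0.5, 2.0],
          [0.7, 1.1, 0.2, -0.3],
          [0.0, 0.9, 1.5, 0.4],
          [1.0, 0.2, 0.3, 0.8]]
mask = causal_mask(4)
for row, m in zip(scores, mask):
    print([round(w, 3) for w in masked_softmax_row(row, m)])  # zeros above the diagonal
```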
- In scaled dot-product attention, for each query Q you compute dot-product scores with the keys K, scale them by dividing by sqrt(d_k), apply softmax to produce weights, and use those weights to mix the values V; the scaling prevents very large dot products that would make the softmax too peaky and the gradients tiny.
- Dividing by sqrt(d_k) keeps the typical size of the dot-product inputs to the softmax roughly constant as d_k changes, which prevents softmax saturation / excessively peaky outputs, avoids vanishing gradients, and improves numerical and optimization stability (a quick numeric check follows below).
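A quick numeric check (my own illustration, not from the paper) of why the 1/sqrt(d_k) factor helps: for random vectors with unit-variance components, the raw dot product spreads out roughly like sqrt(d_k), while the scaled score stays near 1 for any head size.

```python
import math, random, statistics

random.seed(0)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# For random q, k with unit-variance components, q.k spreads out like sqrt(d_k);
# dividing by sqrt(d_k) keeps the softmax inputs in a similar range for any head size.
for d_k in (16, 64, 256):
    raw = [dot([random.gauss(0, 1) for _ in range(d_k)],
               [random.gauss(0, 1) for _ in range(d_k)])
           for _ in range(2000)]
    scaled = [s / math.sqrt(d_k) for s in raw]
    print(d_k, round(statistics.stdev(raw), 2), round(statistics.stdev(scaled), 2))
```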
- The Transformer performs parallel attention by running attention h times with different learned projections—each a head—where the inputs are linearly projected into smaller d_k/d_v spaces, scaled dot-product attention is applied independently per head, and the outputs are concatenated and projected back to d_model, allowing different heads to specialize (e.g., subject-verb agreement, coreference); the shape bookkeeping is sketched below.
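A tiny bookkeeping sketch of the base model's multi-head shapes (h = 8, d_k = d_v = d_model / h = 64); the dictionary of projection shapes is just an illustration, not the paper's code.

```python
# Shape bookkeeping for the base model's multi-head attention (illustrative only).
d_model, h = 512, 8
d_k = d_v = d_model // h                     # 64 dimensions per head

# Each head gets its own learned projections from d_model down to d_k / d_v:
per_head_projections = {"W_Q": (d_model, d_k), "W_K": (d_model, d_k), "W_V": (d_model, d_v)}

# Scaled dot-product attention runs independently in each of the h heads; the h
# outputs (each d_v wide) are concatenated and projected back to d_model by W_O:
concat_width = h * d_v                       # 8 * 64 = 512
W_O_shape = (concat_width, d_model)
print(per_head_projections, concat_width, W_O_shape)
```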
- The decoder can't look into the future: the Transformer enforces this with a mask inside attention that sets scores to -∞ for positions j > i before the softmax, giving those positions ~0 probability, so token y_i depends only on tokens < i, preserving left-to-right generation like a language model.
- An auto-regressive model factorizes sequence probability as a product of conditionals (P(y_1..y_T) = ∏_t P(y_t | y_<t)), and the decoder enforces this with causal (no-peek) masking—replacing any attention score from position i to a future position j > i with -inf before the softmax (so those positions get ≈0 probability)—ensuring each token's representation and prediction depend only on y_<i (even during teacher forcing) and enabling sequential inference (sampling or beam search) without using future ground truth.
- After attention, each position's embedding is passed through a position-wise feed-forward network—W1, b1 → ReLU → W2, b2—applied identically at every position (with different parameters at each layer depth), so that attention mixes information across tokens while the feed-forward block transforms each token's channel representation; in the base Transformer d_model = 512 and d_ff = 2048 (it expands, then contracts; a minimal sketch follows below).
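A minimal sketch of that position-wise block, FFN(x) = max(0, xW1 + b1)W2 + b2, using tiny stand-in dimensions and random placeholder weights instead of the real d_model = 512 and d_ff = 2048.

```python
import random

random.seed(0)

def position_wise_ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to a single position's vector.
    hidden = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]
    return [sum(hj * W2[j][k] for j, hj in enumerate(hidden)) + b2[k]
            for k in range(len(b2))]

# Tiny stand-ins for the base model's d_model = 512, d_ff = 2048 (expand, then contract):
d_model, d_ff = 4, 16
W1 = [[random.gauss(0, 0.1) for _ in range(d_ff)] for _ in range(d_model)]
b1 = [0.0] * d_ff
W2 = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
b2 = [0.0] * d_model
print(position_wise_ffn([0.5, -1.0, 0.3, 2.0], W1, b1, W2, b2))
```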
- Because Transformers have no built-in notion of order, they inject a positional encoding (added to the token embeddings at the bottom of the encoder and decoder stacks) using fixed sinusoidal encodings, which let the model infer absolute and relative positions and, in principle, generalize to longer sequences (sketched below).
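A small sketch of the fixed sinusoidal encoding, following PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(same angle); the interleaved sin/cos layout here is one common arrangement.

```python
import math

def sinusoidal_positional_encoding(pos, d_model):
    # PE(pos, 2i) = sin(pos / 10000**(2i / d_model)); PE(pos, 2i+1) = cos(same angle).
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

# The vector for each position is simply added to that position's token embedding.
for pos in range(3):
    print(pos, [round(v, 3) for v in sinusoidal_positional_encoding(pos, 8)])
```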
- Each sub-layer takes input x, computes Sublayer(x), adds them as x + Sublayer(x) (a residual connection), and then applies layer normalization; the residual connections help gradients flow in deep networks while LayerNorm stabilizes training, and this structure repeats in every encoder and decoder layer.
- By replacing y = H(x) with y = x + F(x), a residual connection provides an identity shortcut that gives a direct path for gradients during backpropagation, preventing vanishing gradients and making it easier to learn the residual function F(x) = H(x) - x, thereby improving optimization and enabling stable training of very deep networks.
- The paper shows that self-attention wins because it provides maximal parallelism (all positions processed in O(1) sequential steps vs. RNNs' O(n)), enables one-step long-range access to any position (unlike RNNs or multi-layer CNNs), and offers a competitive computational cost (~O(n²·d) per layer vs. RNNs' O(n·d²)), yielding faster and higher-quality results for typical sentence lengths and hidden sizes.
- Self-attention delivers practical speed and better modeling quality because it enables massive parallelism (only O(1) sequential steps per layer and hardware-friendly dense ops), provides a short path length (path length = 1) for direct long-range dependencies and improved gradient/representation flow, and has a practical computational profile—O(n^2 * d) vs O(n * d^2)—that is efficient for typical n and d on modern accelerators (a back-of-the-envelope comparison follows below).
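A back-of-the-envelope comparison, using illustrative values for n and d and ignoring constant factors, of the two per-layer cost terms mentioned above.

```python
# Back-of-the-envelope per-layer operation counts for a typical sentence length
# and model width (illustrative constants, ignoring constant factors):
n, d = 50, 512
self_attention_ops = n * n * d   # O(n^2 * d): score every pair of positions with d-dim dots
recurrent_ops = n * d * d        # O(n * d^2): a d x d state update per position, in sequence
print(self_attention_ops, recurrent_ops)   # 1,280,000 vs 13,107,200
```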
- Training used large WMT datasets (WMT 2014 English→German, ~4.5M sentence pairs with a shared ~37k-token subword vocabulary via byte-pair encoding, and WMT 2014 English→French, ~36M pairs with a 32k word-piece vocabulary) and length-based batching (~25k source + ~25k target tokens per batch), run on a single machine with 8 NVIDIA P100 GPUs; the base model trained for ~12 hours (100k steps, ~0.4 s/step) and the big model for ~3.5 days (300k steps, ~1.0 s/step).
- They used the Adam optimizer with custom hyperparameters (β1 = 0.9, β2 = 0.98, ϵ = 1e-9) and a learning-rate schedule that warms up linearly for the first 4000 steps and then decays proportionally to the inverse square root of the step number, scaled by the inverse square root of the model dimension — the intuition being to avoid a huge initial LR by ramping up and then cooling down.
- The correct term is warmup: gradually increasing the learning rate during an initial warmup period so that early noisy gradients and immature optimizer statistics (m, v, especially with Adam) don't cause huge parameter updates, letting the model build reliable gradient statistics and achieve more stable optimization and better final performance (e.g., the linear rise for 4k steps followed by inverse-square-root decay used in Transformers; sketched below).
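A sketch of that schedule as it is usually written for the paper's setup, lrate = d_model^-0.5 · min(step^-0.5, step · warmup_steps^-1.5): linear warmup for the first 4000 steps, then inverse-square-root decay.

```python
def transformer_lrate(step, d_model=512, warmup_steps=4000):
    # lrate = d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5):
    # linear warmup for the first warmup_steps, then inverse-square-root decay.
    step = max(step, 1)
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1, 1000, 4000, 10000, 100000):
    print(s, round(transformer_lrate(s), 6))
```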
- They used three main stabilizers: dropout, applied to the output of each sub-layer (before it is added to the residual) and to the sums of the embeddings and positional encodings (base rate 0.1); label smoothing, which softens the one-hot targets with ε_ls = 0.1 (hurting raw perplexity but improving BLEU); and checkpoint averaging, which averages the weights of the last several checkpoints at inference time to stabilize predictions.
- True: label smoothing trains against a softened target distribution q instead of a one-hot (minimizing cross-entropy to q), which acts as a confidence penalty / regularizer that lowers p(gold) and therefore raises perplexity; yet by improving calibration, reducing overfitting, and making decoding (e.g., beam search) less brittle, it often improves BLEU, illustrating that perplexity and BLEU measure different things and that smoothing intentionally trades likelihood for better sequence-level performance (a small sketch follows below).
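A minimal sketch of one common way to build the smoothed target distribution with ε_ls = 0.1 (implementations differ in exactly how the leftover mass is spread over the vocabulary).

```python
def smoothed_targets(gold_index, vocab_size, eps=0.1):
    # One common formulation: keep 1 - eps on the gold token and spread the
    # remaining eps uniformly over the other vocab entries (eps_ls = 0.1 here).
    off_value = eps / (vocab_size - 1)
    return [1.0 - eps if i == gold_index else off_value for i in range(vocab_size)]

print(smoothed_targets(gold_index=2, vocab_size=5))   # [0.025, 0.025, 0.9, 0.025, 0.025]
```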
- At inference the decoder generates tokens one at a time, using beam search (beam size ~4) to keep multiple candidate sequences and a length penalty to avoid favoring too-short outputs, while capping the maximum output length at input_length + 50 and stopping early when an end-of-sentence token is predicted (a toy sketch follows below).
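A toy beam-search sketch, stdlib only, with a made-up next-token model and a simple length normalization in place of the exact GNMT-style penalty: hypotheses are extended in parallel, pruned to the beam size, and length-corrected when they emit end-of-sentence.

```python
import math

def beam_search(next_log_probs, beam_size=4, max_len=10, eos=0, alpha=0.6):
    """Minimal beam search: next_log_probs(prefix) -> {token: log-probability}."""
    beams = [((), 0.0)]                       # (prefix, summed log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in next_log_probs(prefix).items():
                candidates.append((prefix + (tok,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos:             # hypothesis ends: apply length normalization
                finished.append((prefix, score / len(prefix) ** alpha))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend((p, s / len(p) ** alpha) for p, s in beams)  # length cap reached
    return max(finished, key=lambda c: c[1])[0]

# Made-up next-token model (ignores the prefix): prefers token 1, sometimes emits EOS (= 0).
def toy_model(prefix):
    return {1: math.log(0.6), 2: math.log(0.3), 0: math.log(0.1)}

print(beam_search(toy_model, beam_size=2, max_len=5))
```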
- On English→German the Transformer (big) achieved BLEU ≈ 28.4 and on English→French BLEU ≈ 41.8 (~41.0 depending on presentation), establishing a new single-model state of the art while delivering higher quality with drastically less compute (much lower training FLOPs and wall-clock time than older systems like GNMT).
- Because the Transformer replaces sequential RNN/LSTM recurrence with massive parallelism via self-attention (and multi-head attention), shortens the information path between tokens (shorter path length), packs more capacity into parallelizable GEMM-style ops that map to highly optimized dense operations, and—together with residual connections, layer normalization, and modern recipes like Adam with warmup and label smoothing—yields better representational efficiency, faster convergence, stable training, and better gradient propagation, it needs drastically less compute (i.e., fewer total FLOPs and much less wall-clock time) to reach equal or higher BLEU than previous top systems.
- Across many variants, changing the number of attention heads showed 8 heads worked best (too few lost expressiveness, too many slightly hurt), scaling d_model, N, and d_ff showed that larger models improved BLEU, removing dropout caused overfitting and worse BLEU, and swapping sinusoidal positional encodings for learned positional embeddings made little difference, so they kept sinusoids for extrapolation.
- A 4-layer Transformer (with d_model = 1024) trained on English constituency parsing—using only the Penn Treebank WSJ (~40K sentences) and, in a semi-supervised setup, millions more high-confidence parse trees—was competitive with strong parsers in the limited-data setting and, with the semi-supervised data, surpassed many previous approaches, demonstrating that the architecture generalizes beyond translation.
- By inspecting attention maps, we find that different attention heads specialize in linguistic tasks—tracking long-distance dependencies, linking pronouns to antecedents (its → product), and capturing phrase structure—so they act as relation detectors the model learns without explicit syntax labels.
- A tiny runnable demo implements scaled dot-product attention for one attention head using only the standard library, computing attention weights from Q and K, applying a scaled softmax, and using those weights to mix the values V.
- It blends V into the outputs by computing similarity via Q·K^T (scaled by sqrt(d_k) and normalized with softmax to produce attention weights) and combining them as a weighted sum (weights · V).
- A second hands-on code example shows a one-encoder-block forward pass implemented with only the Python standard library: it fakes multi-head attention with just one head, includes a feed-forward network and residual connections plus layer norm, and is not a full Transformer but mirrors how data flows through a single encoder layer (a minimal attention-head reconstruction follows below).
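Those demos are not reproduced in this cheat sheet, so here is a minimal stdlib-only reconstruction in the spirit of the single-head scaled dot-product attention demo described above; the toy vectors are arbitrary.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def single_head_attention(Q, K, V):
    """Scaled dot-product attention for one head; Q, K, V are lists of vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)                       # attention distribution over positions
        outputs.append([sum(w * v[i] for w, v in zip(weights, V))
                        for i in range(len(V[0]))])     # weighted sum of the values
    return outputs

# Three toy token vectors standing in for Q = K = V (self-attention):
X = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 2.0],
     [1.0, 1.0, 1.0, 1.0]]
for row in single_head_attention(X, X, X):
    print([round(v, 3) for v in row])
```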

