One Pager Cheat Sheet
- The paper introduced the Transformer, a sequence transduction model that ditches recurrence and convolutions and uses only attention (specifically self-attention) to be fully parallelizable and learn long-range dependencies in fewer steps, enabling it to train faster, scale better, and set new state-of-the-art translation results (measured by BLEU) more cheaply than RNN- or CNN-based systems.
- Older sequence models—RNNs (e.g., LSTMs, GRUs) that update a hidden state token by token, and CNN-based models that stack local convolutions—suffer from sequential hidden-state updates, limited parallelism, and indirect, depth-dependent connections, so both struggle to capture long-range dependencies efficiently.
- False — because the recurrence h_t = f(h_{t-1}, x_t) creates a sequential dependency: you cannot compute h_t until h_{t-1} is known, forcing step-by-step computation in both training (via Backpropagation Through Time, BPTT) and autoregressive inference, even though you can parallelize across batches and within-step matrix operations (e.g., in LSTM/GRU kernels), whereas convolutional sequence models and Transformers compute all positions simultaneously (a toy loop illustrating this follows below).
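A minimal stdlib-only Python sketch (my illustration, not from the paper) of that sequential dependency: a toy scalar recurrence in which every h_t needs h_{t-1}, so the loop over time steps cannot be parallelized, however fast each individual step is.

```python
import math

# Toy scalar "RNN" (made-up weights): each state depends on the previous one,
# so this loop over time steps cannot be parallelized, unlike attention.
def rnn_forward(inputs, w_h=0.5, w_x=1.0):
    h = 0.0
    states = []
    for x_t in inputs:                      # h_t = f(h_{t-1}, x_t)
        h = math.tanh(w_h * h + w_x * x_t)  # must wait for the previous h
        states.append(h)
    return states

print(rnn_forward([0.1, 0.4, -0.2, 0.7]))
```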
- The paper builds a full encoder-decoder model that uses only attention (plus simple feed-forward parts), creating an all-attention architecture that directly links any word to any other, so long-range dependencies become effectively one hop and there are no recurrent or convolutional layers.
- Because attention computes normalized relevance scores between the query of a position and the keys of every other position and uses them to form a weighted sum of values, the model can, in a single layer, directly connect to every token, adaptively focus on any token regardless of distance, and evaluate those connections in parallel.
- The Transformer retains the classic encoder-decoder structure: the encoder reads the input and produces contextual vector representations, the decoder generates the output one symbol at a time using previous outputs plus the encoded input, and both are built from repeated layers of multi-head self-attention and feed-forward networks rather than stacks of RNN cells.
- The Transformer encoder is a stack of N = 6 identical layers; each layer has a multi-head self-attention sub-layer and a position-wise feed-forward network sub-layer, and every sub-layer is wrapped with residual connections and layer normalization, yielding the pattern x -> Attention -> Add&Norm -> FeedForward -> Add&Norm.
- The "Add&Norm" phrase describes the combination of an identity skip (shortcut) path implemented as a residual connection (i.e., y = x + F(x), where F(x) is the sub-layer such as multi-head self-attention or the feed-forward MLP), which improves gradient flow and mitigates vanishing gradients, with layer normalization (applied as LayerNorm after the addition) to stabilize activations — together, residual connection + layer normalization (a minimal sketch follows below).
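A minimal sketch of the Add&Norm pattern, assuming plain Python lists and omitting the learnable gain and bias that real LayerNorm carries: the sub-layer output is added back onto its input and the sum is layer-normalized.

```python
import math

def layer_norm(x, eps=1e-6):
    # Normalize one position's feature vector to zero mean / unit variance
    # (the learnable gain and bias of real LayerNorm are omitted here).
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def add_and_norm(x, sublayer_out):
    # The Add&Norm pattern: LayerNorm(x + Sublayer(x)).
    return layer_norm([xi + si for xi, si in zip(x, sublayer_out)])

x = [0.5, -1.2, 3.0, 0.1]             # input to the sub-layer
sublayer_out = [0.2, 0.4, -0.5, 1.0]  # e.g. what self-attention returned for this position
print(add_and_norm(x, sublayer_out))
```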
- The Transformer decoder is a stack of 6 identical layers where each layer has three sub-layers—masked multi-head self-attention (preventing future-token peeking to keep generation auto-regressive), encoder-decoder attention (attending to all encoder outputs), and a position-wise feed-forward network—and each sub-layer uses residual connections + layer normalization.
- False: the decoder enforces causality via masked multi-head self-attention—a triangular mask added as −∞ inside softmax((QK^T)/sqrt(d_k) + mask) makes attention to positions j > i effectively zero—while encoder-decoder attention only sees encoder outputs (so it can't leak future decoder tokens), and the same causal masking is applied during training (often with teacher forcing) and inference, thus preserving the autoregressive property (a mask sketch follows below).
- Self-attention (also called intra-attention) relates every position in a sequence to every other position, then weights and blends their information to compute an updated representation for each position, enabling the model to capture word–word relationships regardless of distance.
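A small stdlib-only sketch of the causal-mask idea: adding −∞ to the scores of future positions before the softmax drives their attention weights to zero. The score values are made up; only the masking pattern matters.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    # mask[i][j] = 0 for j <= i (visible past/self), -inf for j > i (future).
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def masked_softmax_row(scores, mask_row):
    masked = [s + m for s, m in zip(scores, mask_row)]
    top = max(masked)                           # finite: position i always sees itself
    exps = [math.exp(v - top) for v in masked]  # exp(-inf) underflows to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw attention scores for a 4-token decoder prefix:
scores = [[1.2, 0.3, -0.5, 2.0],
          [0.7, 1.1, 0.2, -0.3],
          [0.0, 0.9, 1.5, 0.4],
          [1.0, 0.2, 0.3, 0.8]]
mask = causal_mask(4)
for row, m in zip(scores, mask):
    print([round(w, 3) for w in masked_softmax_row(row, m)])  # zeros above the diagonal
```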
- In scaled dot-product attention, for each query Q you compute dot-product scores with the keys K, scale them by dividing by sqrt(d_k), apply softmax to produce weights, and use those weights to mix the values V; the scaling prevents very large dot products that would make the softmax too peaky and the gradients tiny.
- Dividing by sqrt(d_k) keeps the typical size of the dot-product inputs to the softmax roughly constant as d_k changes, which prevents softmax saturation / excessively peaky outputs, avoids vanishing gradients, and improves numerical and optimization stability (a quick numeric check follows below).
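A quick numeric check (my own illustration, not from the paper) of why the 1/sqrt(d_k) factor helps: for random vectors with unit-variance components, the raw dot product spreads out roughly like sqrt(d_k), while the scaled score stays near 1 for any head size.

```python
import math, random, statistics

random.seed(0)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# For random q, k with unit-variance components, q.k spreads out like sqrt(d_k);
# dividing by sqrt(d_k) keeps the softmax inputs in a similar range for any head size.
for d_k in (16, 64, 256):
    raw = [dot([random.gauss(0, 1) for _ in range(d_k)],
               [random.gauss(0, 1) for _ in range(d_k)])
           for _ in range(2000)]
    scaled = [s / math.sqrt(d_k) for s in raw]
    print(d_k, round(statistics.stdev(raw), 2), round(statistics.stdev(scaled), 2))
```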
- The Transformer performs parallel attention by running attention h times with different learned projections—each a head—where the inputs are linearly projected into smaller d_k/d_v spaces, scaled dot-product attention is applied independently per head, and the outputs are concatenated and projected back to d_model, allowing different heads to specialize (e.g., subject-verb agreement, coreference); the shape bookkeeping is sketched below.
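A tiny bookkeeping sketch of the base model's multi-head shapes (h = 8, d_k = d_v = d_model / h = 64); the dictionary of projection shapes is just an illustration, not the paper's code.

```python
# Shape bookkeeping for the base model's multi-head attention (illustrative only).
d_model, h = 512, 8
d_k = d_v = d_model // h                     # 64 dimensions per head

# Each head gets its own learned projections from d_model down to d_k / d_v:
per_head_projections = {"W_Q": (d_model, d_k), "W_K": (d_model, d_k), "W_V": (d_model, d_v)}

# Scaled dot-product attention runs independently in each of the h heads; the h
# outputs (each d_v wide) are concatenated and projected back to d_model by W_O:
concat_width = h * d_v                       # 8 * 64 = 512
W_O_shape = (concat_width, d_model)
print(per_head_projections, concat_width, W_O_shape)
```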
- The decoder can't look into the future: the Transformer enforces this with a mask inside attention that sets scores to -∞ for positions j > i before the softmax, giving those positions ~0 probability, so token y_i depends only on tokens < i, preserving left-to-right generation like a language model.
- An auto-regressive model factorizes sequence probability as a product of conditionals (P(y_1..y_T) = ∏_t P(y_t | y_<t)), and the decoder enforces this with causal (no-peek) masking—replacing any attention score from position i to a future position j > i with -inf before the softmax (so those positions get ≈0 probability)—ensuring each token's representation and prediction depend only on y_<i (even during teacher forcing) and enabling sequential inference (sampling or beam search) without using future ground truth.
- After attention, each position's embedding is passed through a position-wise feed-forward network—W1, b1 → ReLU → W2, b2—applied identically at every position (with different parameters at each layer depth), so that attention mixes information across tokens while the feed-forward block transforms each token's channel representation; in the base Transformer d_model = 512 and d_ff = 2048 (it expands, then contracts; a minimal sketch follows below).
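A minimal sketch of that position-wise block, FFN(x) = max(0, xW1 + b1)W2 + b2, using tiny stand-in dimensions and random placeholder weights instead of the real d_model = 512 and d_ff = 2048.

```python
import random

random.seed(0)

def position_wise_ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to a single position's vector.
    hidden = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]
    return [sum(hj * W2[j][k] for j, hj in enumerate(hidden)) + b2[k]
            for k in range(len(b2))]

# Tiny stand-ins for the base model's d_model = 512, d_ff = 2048 (expand, then contract):
d_model, d_ff = 4, 16
W1 = [[random.gauss(0, 0.1) for _ in range(d_ff)] for _ in range(d_model)]
b1 = [0.0] * d_ff
W2 = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
b2 = [0.0] * d_model
print(position_wise_ffn([0.5, -1.0, 0.3, 2.0], W1, b1, W2, b2))
```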
- Because Transformers have no built-in notion of order, they inject a positional encoding (added to the token embeddings at the bottom of the encoder and decoder stacks) using fixed sinusoidal encodings, which let the model infer absolute and relative positions and, in principle, generalize to longer sequences (sketched below).
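A small sketch of the fixed sinusoidal encoding, following PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(same angle); the interleaved sin/cos layout here is one common arrangement.

```python
import math

def sinusoidal_positional_encoding(pos, d_model):
    # PE(pos, 2i) = sin(pos / 10000**(2i / d_model)); PE(pos, 2i+1) = cos(same angle).
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

# The vector for each position is simply added to that position's token embedding.
for pos in range(3):
    print(pos, [round(v, 3) for v in sinusoidal_positional_encoding(pos, 8)])
```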
- Each sub-layer takes input x, computes Sublayer(x), adds them as x + Sublayer(x) (a residual connection), and then applies layer normalization; the residual connections help gradients flow in deep networks while LayerNorm stabilizes training, and this structure repeats in every encoder and decoder layer.
- By replacing y = H(x) with y = x + F(x), a residual connection provides an identity shortcut that gives a direct path for gradients during backpropagation, preventing vanishing gradients and making it easier to learn the residual function F(x) = H(x) - x, thereby improving optimization and enabling stable training of very deep networks.
- The paper shows that self-attention wins because it provides maximal parallelism (all positions processed in O(1) sequential steps vs. RNNs' O(n)), enables one-step long-range access to any position (unlike RNNs or multi-layer CNNs), and offers a competitive computational cost (~O(n²·d) per layer vs. RNNs' O(n·d²)), yielding faster and higher-quality results for typical sentence lengths and hidden sizes.
- Self-attention delivers practical speed and better modeling quality because it enables massive parallelism (only O(1) sequential steps per layer and hardware-friendly dense ops), provides a short path length (path length = 1) for direct long-range dependencies and improved gradient/representation flow, and has a practical computational profile—O(n^2 * d) vs O(n * d^2)—that is efficient for typical n and d on modern accelerators (a back-of-the-envelope comparison follows below).
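A back-of-the-envelope comparison, using illustrative values for n and d and ignoring constant factors, of the two per-layer cost terms mentioned above.

```python
# Back-of-the-envelope per-layer operation counts for a typical sentence length
# and model width (illustrative constants, ignoring constant factors):
n, d = 50, 512
self_attention_ops = n * n * d   # O(n^2 * d): score every pair of positions with d-dim dots
recurrent_ops = n * d * d        # O(n * d^2): a d x d state update per position, in sequence
print(self_attention_ops, recurrent_ops)   # 1,280,000 vs 13,107,200
```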
- Training used large WMT datasets (WMT 2014 English→German, ~4.5M sentence pairs with a shared ~37k-token subword vocabulary via byte-pair encoding, and WMT 2014 English→French, ~36M pairs with a 32k word-piece vocabulary) and length-based batching (~25k source + ~25k target tokens per batch), run on a single machine with 8 NVIDIA P100 GPUs; the base model trained for ~12 hours (100k steps, ~0.4 s/step) and the big model for ~3.5 days (300k steps, ~1.0 s/step).
- They used the Adam optimizer with custom hyperparameters (β1 = 0.9, β2 = 0.98, ϵ = 1e-9) and a learning-rate schedule that warms up linearly for the first 4000 steps and then decays proportionally to the inverse square root of the step number, scaled by the inverse square root of the model dimension — the intuition being to avoid a huge initial LR by ramping up and then cooling down.
- The correct term is warmup: gradually increasing the learning rate during an initial warmup period so that early noisy gradients and immature optimizer statistics (m, v, especially with Adam) don't cause huge parameter updates, letting the model build reliable gradient statistics and achieve more stable optimization and better final performance (e.g., the linear rise for 4k steps followed by inverse-square-root decay used in Transformers; sketched below).
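A sketch of that schedule as it is usually written for the paper's setup, lrate = d_model^-0.5 · min(step^-0.5, step · warmup_steps^-1.5): linear warmup for the first 4000 steps, then inverse-square-root decay.

```python
def transformer_lrate(step, d_model=512, warmup_steps=4000):
    # lrate = d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5):
    # linear warmup for the first warmup_steps, then inverse-square-root decay.
    step = max(step, 1)
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1, 1000, 4000, 10000, 100000):
    print(s, round(transformer_lrate(s), 6))
```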
- They used three main stabilizers: dropout, applied to the output of each sub-layer (before it is added to the residual) and to the sums of the embeddings and positional encodings (base rate 0.1); label smoothing, which softens the one-hot targets with ε_ls = 0.1 (hurting raw perplexity but improving BLEU); and checkpoint averaging, which averages the weights of the last several checkpoints at inference time to stabilize predictions.
- True: label smoothing trains against a softened target distribution q instead of a one-hot (minimizing cross-entropy to q), which acts as a confidence penalty / regularizer that lowers p(gold) and therefore raises perplexity; yet by improving calibration, reducing overfitting, and making decoding (e.g., beam search) less brittle, it often improves BLEU, illustrating that perplexity and BLEU measure different things and that smoothing intentionally trades likelihood for better sequence-level performance (a small sketch follows below).
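A minimal sketch of one common way to build the smoothed target distribution with ε_ls = 0.1 (implementations differ in exactly how the leftover mass is spread over the vocabulary).

```python
def smoothed_targets(gold_index, vocab_size, eps=0.1):
    # One common formulation: keep 1 - eps on the gold token and spread the
    # remaining eps uniformly over the other vocab entries (eps_ls = 0.1 here).
    off_value = eps / (vocab_size - 1)
    return [1.0 - eps if i == gold_index else off_value for i in range(vocab_size)]

print(smoothed_targets(gold_index=2, vocab_size=5))   # [0.025, 0.025, 0.9, 0.025, 0.025]
```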
- At inference the decoder generates tokens one at a time, using beam search (beam size ~4) to keep multiple candidate sequences and a length penalty to avoid favoring too-short outputs, while capping the maximum output length at input_length + 50 and stopping early when an end-of-sentence token is predicted (a toy sketch follows below).
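A toy beam-search sketch, stdlib only, with a made-up next-token model and a simple length normalization in place of the exact GNMT-style penalty: hypotheses are extended in parallel, pruned to the beam size, and length-corrected when they emit end-of-sentence.

```python
import math

def beam_search(next_log_probs, beam_size=4, max_len=10, eos=0, alpha=0.6):
    """Minimal beam search: next_log_probs(prefix) -> {token: log-probability}."""
    beams = [((), 0.0)]                       # (prefix, summed log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in next_log_probs(prefix).items():
                candidates.append((prefix + (tok,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos:             # hypothesis ends: apply length normalization
                finished.append((prefix, score / len(prefix) ** alpha))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend((p, s / len(p) ** alpha) for p, s in beams)  # length cap reached
    return max(finished, key=lambda c: c[1])[0]

# Made-up next-token model (ignores the prefix): prefers token 1, sometimes emits EOS (= 0).
def toy_model(prefix):
    return {1: math.log(0.6), 2: math.log(0.3), 0: math.log(0.1)}

print(beam_search(toy_model, beam_size=2, max_len=5))
```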
- On English→German the Transformer (big) achieved BLEU ≈ 28.4 and on English→French BLEU ≈ 41.8 (~41.0 depending on presentation), establishing a new single-model state of the art while delivering higher quality with drastically less compute (much lower training FLOPs and wall-clock time than older systems like GNMT).
- Because the Transformer replaces sequential RNN/LSTM recurrence with massive parallelism via self-attention (and multi-head attention), shortens the information path between tokens (shorter path length), packs more capacity into parallelizable GEMM-style ops that map to highly optimized dense operations, and—together with residual connections, layer normalization, and modern recipes like Adam with warmup and label smoothing—yields better representational efficiency, faster convergence, stable training, and better gradient propagation, it needs drastically less compute (i.e., fewer total FLOPs and much less wall-clock time) to reach equal or higher BLEU than previous top systems.
- Across many variants, changing the number of attention heads showed 8 heads worked best (too few lost expressiveness, too many slightly hurt), scaling d_model, N, and d_ff showed that larger models improved BLEU, removing dropout caused overfitting and worse BLEU, and swapping sinusoidal positional encodings for learned positional embeddings made little difference, so they kept sinusoids for extrapolation.
- A 4-layer Transformer (with d_model = 1024) trained on English constituency parsing—using only the Penn Treebank WSJ (~40K sentences) and, in a semi-supervised setup, millions more high-confidence parse trees—was competitive with strong parsers in the limited-data setting and, with the semi-supervised data, surpassed many previous approaches, demonstrating that the architecture generalizes beyond translation.
- By inspecting attention maps, we find that different attention heads specialize in linguistic tasks—tracking long-distance dependencies, linking pronouns to antecedents (its → product), and capturing phrase structure—so they act as relation detectors the model learns without explicit syntax labels.
- A tiny runnable demo implements scaled dot-product attention for one attention head using only the standard library, computing attention weights from Q and K, applying a scaled softmax, and using those weights to mix the values V.
- It blends V into the outputs by computing similarity via Q·K^T (scaled by sqrt(d_k) and normalized with softmax to produce attention weights) and combining them as a weighted sum (weights · V).
- A second hands-on code example shows a one-encoder-block forward pass implemented with only the Python standard library: it fakes multi-head attention with just one head, includes a feed-forward network and residual connections plus layer norm, and is not a full Transformer but mirrors how data flows through a single encoder layer (a minimal attention-head reconstruction follows below).
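Those demos are not reproduced in this cheat sheet, so here is a minimal stdlib-only reconstruction in the spirit of the single-head scaled dot-product attention demo described above; the toy vectors are arbitrary.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def single_head_attention(Q, K, V):
    """Scaled dot-product attention for one head; Q, K, V are lists of vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)                       # attention distribution over positions
        outputs.append([sum(w * v[i] for w, v in zip(weights, V))
                        for i in range(len(V[0]))])     # weighted sum of the values
    return outputs

# Three toy token vectors standing in for Q = K = V (self-attention):
X = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 2.0],
     [1.0, 1.0, 1.0, 1.0]]
for row in single_head_attention(X, X, X):
    print([round(v, 3) for v in row])
```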

