"Attention Is All You Need" — The Transformer Architecture
This is a summary of the original Attention Is All You Need paper (Vaswani et al., 2017), available at https://arxiv.org/abs/1706.03762.
The Transformer is a neural network that ditches recurrence and convolutions and instead uses only attention to understand and generate sequences. It trains faster, scales better, and set new state-of-the-art translation results while being cheaper to train.
Why This Paper Mattered
The paper introduces the Transformer, a sequence transduction model (a model that maps one sequence to another, like English → German) that uses only self-attention instead of RNNs (recurrent neural networks) or CNNs (convolutional neural networks).
Before this, translation systems relied on stacked recurrent layers or convolutions to encode and decode sentences. Those systems were accurate but slow to train because they processed tokens mostly in sequence.
The Transformer is fully parallelizable across all positions in a sentence and still learns long-range dependencies (like "the dog … it") with fewer steps. This let it beat previous translation systems in BLEU score (a standard accuracy metric for translation quality) while training in a fraction of the time.

The Bottleneck With Old Sequence Models
RNNs (like LSTMs and GRUs) generate a hidden state one token at a time. Each state depends on the previous one. That means on long sequences, you do lots of steps in order, and you can’t parallelize those steps well during training.
CNN-based sequence models improved parallelism by using convolutions over windows of words. But they still only connect distant tokens indirectly, through many stacked layers. The longer the span between two related words, the more layers you have to stack.
The core pain: both approaches struggle with long-range dependencies efficiently. The farther two words are apart, the harder it is for the model to learn how they relate.
Are you sure you're getting this? Is this statement true or false?
RNNs naturally parallelize all time steps of a sentence at once.
Press true if you believe the statement is correct, or false otherwise.
Attention, at a High Level
Attention is a mechanism where the model asks: “For this position I’m generating or encoding, which other positions are relevant, and how relevant are they?”
Instead of passing information strictly left→right, attention can directly link any word to any other word. So the distance between “making … more difficult” can become effectively 1 hop, not 20 hops.
The paper’s move: build an entire encoder-decoder model that uses attention everywhere, and only attention. No recurrent layers. No convolution layers. Just attention + simple feed-forward parts.
Try this exercise. Click the correct answer from the options.
Attention lets the model:
Click the option that best answers the question.
- focus on all other tokens directly
- store data to disk
- change the optimizer
- act like a database
Transformer = Encoder + Decoder (Still)
The Transformer keeps the classic encoder-decoder structure common in translation:
- The encoder reads the input sentence (e.g. English) and produces contextual vector representations.
- The decoder generates the output sentence (e.g. German) one symbol at a time, using what it has produced so far plus the encoded input.
But both encoder and decoder are now built out of repeated layers of multi-head self-attention plus small feed-forward networks, instead of stacks of RNN cells.

Encoder Stack: What Happens Inside
The Transformer encoder is a stack of N = 6 identical layers.
Each layer has:
- A multi-head self-attention sub-layer.
- A position-wise feed-forward network sub-layer (a tiny 2-layer MLP applied independently to each position's vector).
Each sub-layer is wrapped with a residual connection (add the input back to the output of the sub-layer) and layer normalization (normalize activations so training stays stable).
So the pattern is basically:
x -> Attention -> Add&Norm -> FeedForward -> Add&Norm
Are you sure you're getting this? Fill in the missing part by typing it in.
Wrapping a sub-layer’s output with “add the original input, then normalize” is called a __________ connection + layer normalization pattern.
Write the missing line below.
Decoder Stack: What’s Different
The Transformer decoder is also a stack of 6 identical layers, but each layer has 3 sub-layers:
- Masked multi-head self-attention over the already-generated output tokens. (“Masked” here means a token can’t peek at future tokens. This keeps generation auto-regressive.)
- Encoder-decoder attention, where the decoder attends to all encoder outputs.
- The same position-wise feed-forward network.
Each sub-layer again uses residual connections + layer normalization.
Try this exercise. Is this statement true or false?
During generation, position i in the decoder is allowed to attend to future positions (i+1, i+2, …).
Press true if you believe the statement is correct, or false otherwise.
Self-Attention: The Core Operation
self-attention (also called intra-attention) relates every position in a sequence to every other position in that same sequence, to compute a new representation for each position.
Intuition:
- For each word, ask: which other words in this sentence matter for me?
- Weight them.
- Blend their info into my updated representation.
This lets the model capture word-word relationships without caring how far apart they are in the sentence.
Scaled Dot-Product Attention (Math but Friendly)
Attention takes three things:
- Q (“queries”)
- K (“keys”)
- V (“values”)
For each query, we score how well it matches each key (using a dot product), scale it, apply softmax to get weights, and use those weights to mix the values.
The paper’s version is scaled dot-product attention:
Attention(Q, K, V) = softmax( (Q * K^T) / sqrt(d_k) ) * V

Where:

- d_k is the dimensionality of the keys.
- We divide by sqrt(d_k) to prevent very large dot products that would make the softmax too peaky and gradients tiny.
Build your intuition. Click the correct answer from the options.
Why divide by sqrt(d_k)?
Click the option that best answers the question.
- noise suppression
- numerical stability in softmax
- to save memory
- to enforce sparsity
Multi-Head Attention
Instead of doing attention once, the Transformer does it in parallel h times with different learned projections. Each parallel attention is called a head.
Steps:
- Linearly project inputs into multiple smaller (d_k, d_v) spaces.
- Run scaled dot-product attention independently in each head.
- Concatenate all heads’ outputs.
- Project back to d_model.
Why? Different heads can specialize. One head might track subject-verb agreement. Another might track coreference (“its” → which noun?). The model can attend to multiple types of relationships at once.
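To make the head bookkeeping concrete, here is a minimal sketch in plain Python (stdlib only, in the same spirit as the hands-on demos at the end of this summary). The random projection matrices and tiny dimensions are placeholders for learned weights, and the final learned output projection (W^O in the paper) is only noted in a comment, not implemented.

```python
# multi_head_sketch.py -- illustrative only, not the paper's implementation.
# Shows the head bookkeeping: project, attend per head, concatenate, project back.
import math, random

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def matmul(A, B):  # [n x d] @ [d x m] -> [n x m]
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def attention(Q, K, V):  # scaled dot-product attention for one head
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])                    # Q @ K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)                                             # weights @ V

def multi_head(X, h=2, d_model=8):
    d_k = d_model // h
    out_heads = []
    for _ in range(h):
        # random per-head projections (placeholders for learned weight matrices)
        Wq = [[random.uniform(-1, 1) for _ in range(d_k)] for _ in range(d_model)]
        Wk = [[random.uniform(-1, 1) for _ in range(d_k)] for _ in range(d_model)]
        Wv = [[random.uniform(-1, 1) for _ in range(d_k)] for _ in range(d_model)]
        Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
        out_heads.append(attention(Q, K, V))
    # concatenate the heads position by position: back to d_model columns
    concat = [sum((head[i] for head in out_heads), []) for i in range(len(X))]
    return concat  # a final learned projection W^O would follow here

if __name__ == "__main__":
    random.seed(0)
    X = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(3)]  # 3 tokens, d_model = 8
    print(multi_head(X))
```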

Masked Self-Attention in the Decoder
The decoder can’t look into the future when predicting the next word. To enforce that, the Transformer uses a mask inside attention.
Mechanically:
- Any attention score from position i to a future position j > i is set to -∞ before softmax.
- After softmax, those future positions get probability ~0.
- So token y_i only depends on tokens < i.
This keeps generation left-to-right, like a language model.
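Here is a minimal stdlib-only sketch of that masking step, assuming a small hand-written score matrix; it shows how setting future scores to -∞ before softmax drives their weights to zero.

```python
# causal_mask_sketch.py -- toy illustration of masked self-attention scores.
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

NEG_INF = float("-inf")

def apply_causal_mask(scores):
    # scores[i][j] is how much position i wants to attend to position j.
    # Set every future position (j > i) to -inf so softmax gives it ~0 weight.
    n = len(scores)
    return [[scores[i][j] if j <= i else NEG_INF for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    raw = [[0.2, 1.5, -0.3],
           [0.7, 0.1,  2.0],
           [1.1, 0.4,  0.9]]
    for row in apply_causal_mask(raw):
        print([round(w, 3) for w in softmax(row)])
    # Row 0 attends only to position 0; row 1 to positions 0-1; row 2 to all three.
```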
Let's test your knowledge. Fill in the missing part by typing it in.
Masking future positions in the decoder preserves the model’s __________ property (it generates one token at a time, using only what’s already generated).
Write the missing line below.
Position-Wise Feed-Forward Networks
After attention, each position’s embedding is run through a tiny feed-forward network:
- A linear layer (W1, b1),
- A ReLU,
- Then another linear layer (W2, b2).
It’s applied identically to every position, but with different parameters per layer depth. You can think of this as: attention mixes information across tokens, and then the feed-forward block “transforms” each token’s channel representation.
In the base Transformer:
- d_model = 512
- The inner feed-forward dimension d_ff = 2048 (so it expands, then contracts).
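A minimal sketch of that expand-then-contract step, using tiny dimensions and random placeholder weights instead of the real d_model = 512 / d_ff = 2048 and learned parameters:

```python
# ffn_sketch.py -- position-wise feed-forward network applied to one token vector.
# FFN(x) = W2 @ relu(W1 @ x + b1) + b2; in the base model d_model = 512, d_ff = 2048.
# Tiny dimensions and random weights here, just to show the expand-then-contract shape.
import random

def linear(x, W, b):  # W: [d_out x d_in]
    return [sum(xi * wi for xi, wi in zip(x, row)) + bias for row, bias in zip(W, b)]

def relu(v):
    return [max(0.0, x) for x in v]

def ffn(x, d_model=8, d_ff=32):
    W1 = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
    b1 = [0.0] * d_ff
    W2 = [[random.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
    b2 = [0.0] * d_model
    hidden = relu(linear(x, W1, b1))   # expand: d_model -> d_ff
    return linear(hidden, W2, b2)      # contract: d_ff -> d_model

if __name__ == "__main__":
    random.seed(0)
    token = [random.uniform(-1, 1) for _ in range(8)]
    print(ffn(token))
```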
Positional Encoding (Because There’s No Recurrence)
Transformers have no built-in notion of order. To fix that, they inject a positional encoding into each token embedding.
They use fixed sinusoidal encodings:
- Each dimension of the position encoding is a sine or cosine with a different frequency.
- These encodings are added to the token embeddings at the bottom of the encoder/decoder.
Why sinusoids?
- They let the model infer both absolute and relative positions.
- They can, in principle, generalize to longer sequences than seen in training, because the pattern is continuous.
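The encoding itself is cheap to compute; the sketch below follows the paper's formula, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), printed for a few positions of a toy 8-dimensional model.

```python
# positional_encoding_sketch.py -- the paper's sinusoidal encoding, stdlib only.
# PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
# PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
import math

def positional_encoding(num_positions, d_model):
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((i // 2 * 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

if __name__ == "__main__":
    for pos, row in enumerate(positional_encoding(4, 8)):
        print(pos, [round(v, 3) for v in row])
    # These vectors are added to the token embeddings before the first layer.
```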

Residual Connections + LayerNorm
Every sub-layer (attention, or feed-forward) is wrapped like this:
- Take the input x.
- Run the sub-layer to get Sublayer(x).
- Add them: x + Sublayer(x) (this is a residual connection).
- Apply layer normalization.
Why:
- Residuals help gradients flow in deep networks.
- LayerNorm stabilizes training by normalizing across the hidden dimension.
This structure repeats in every encoder and decoder layer.
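As a minimal sketch, the whole wrapper fits in a few lines; the "double" function here is a hypothetical stand-in for attention or the feed-forward block.

```python
# add_and_norm_sketch.py -- the "residual + layer norm" wrapper around any sub-layer.
import math

def layer_norm(vec, eps=1e-6):
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def add_and_norm(x, sublayer):
    # x + Sublayer(x), then normalize -- used around attention and feed-forward alike.
    y = sublayer(x)
    return layer_norm([xi + yi for xi, yi in zip(x, y)])

if __name__ == "__main__":
    x = [0.5, -1.0, 2.0, 0.0]
    double = lambda v: [2 * t for t in v]   # stand-in for a real sub-layer
    print(add_and_norm(x, double))
```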
Try this exercise. Is this statement true or false?
Residual connections are mainly there to help very deep models train without gradients vanishing.
Press true if you believe the statement is correct, or false otherwise.
Why Self-Attention Wins (Speed + Quality)
The paper analyzes three factors compared to RNNs and CNNs:
Parallelism:
- Self-attention can process all positions in parallel (O(1) sequential steps).
- RNNs require O(n) sequential steps because each hidden state depends on the previous one.
Path Length for Long-Range Dependencies:
- Self-attention: any position can attend to any other in 1 step.
- RNN: info must flow through many time steps, so long-range info weakens.
- CNN: needs multiple convolution layers to connect distant positions.
Computational Cost:
- Self-attention per layer is ~O(n² · d) where n = sequence length, d = hidden size.
- RNNs are O(n · d²). For typical sentence lengths and hidden sizes, self-attention is competitive or faster.
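A rough back-of-envelope comparison, assuming an example sentence length n = 50 and hidden size d = 512 (values chosen only for illustration):

```python
# complexity_sketch.py -- back-of-envelope per-layer cost comparison.
n, d = 50, 512                 # example sentence length and hidden size

self_attention = n * n * d     # O(n^2 * d)
recurrent      = n * d * d     # O(n * d^2)

print(f"self-attention ~ {self_attention:,} ops per layer")   # ~1.3M
print(f"recurrent      ~ {recurrent:,} ops per layer")        # ~13.1M
# When n < d (typical for sentences), the n^2 * d term is the smaller one,
# and all n positions of self-attention can also run in parallel.
```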
Try this exercise. Click the correct answer from the options.
Self-attention offers:
Click the option that best answers the question.
- short paths between any two tokens
- parallel training
- both
- neither
Training Setup (Translation Tasks)
They trained on:
- WMT 2014 English→German (~4.5M sentence pairs; shared subword vocab using byte-pair encoding, around 37k tokens).
- WMT 2014 English→French (~36M sentence pairs; 32k word-piece vocab).
Batches roughly contained ~25k source tokens + 25k target tokens. Sentences were batched by similar length to make training efficient.
Hardware: a single machine with 8 NVIDIA P100 GPUs.
Base model:
- ~12 hours of training (100k steps, ~0.4s/step on base config).
Big model:
- ~3.5 days of training (300k steps, ~1.0s/step).
Optimizer, Learning Rate, and Warmup
They used the Adam optimizer with custom hyperparameters:
- β1 = 0.9, β2 = 0.98, ϵ = 1e-9.
They did a special learning rate schedule:
- Warm up: linearly increase learning rate for the first 4000 steps.
- Then decay it proportionally to the inverse square root of the step number and model dimension.
- Intuition: don’t blast the model with a huge LR at the start; gradually ramp up, then cool down.
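In code, the paper's schedule is lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5)). A tiny sketch:

```python
# lr_schedule_sketch.py -- the paper's warmup + inverse-square-root schedule.
# lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5))

def lrate(step, d_model=512, warmup_steps=4000):
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

if __name__ == "__main__":
    for step in (1, 1000, 4000, 10000, 100000):
        print(step, f"{lrate(step):.2e}")
    # Rises linearly up to step 4000, then decays like 1/sqrt(step).
```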
Let's test your knowledge. Fill in the missing part by typing it in.
Slowly increasing the learning rate at the start of training is called __________.
Write the missing line below.
Regularization Tricks They Used
They used three main stabilizers:
Dropout
- Applied after each sub-layer (before adding the residual).
- Also applied to the sum of embeddings + positional encodings.
- Base model used dropout rate 0.1.

Label smoothing
- Instead of training on one-hot targets, they softened targets slightly (ε_ls = 0.1).
- This hurts raw perplexity (model becomes “less sure”) but improves BLEU accuracy.

Averaging checkpoints
- At inference, they averaged the weights from the last several checkpoints to stabilize predictions.
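A minimal sketch of how a one-hot target gets softened, using one common formulation (spreading ε uniformly over the vocabulary); exact details can vary by implementation:

```python
# label_smoothing_sketch.py -- softening a one-hot target (one common formulation).
def smooth_one_hot(correct_index, vocab_size, eps=0.1):
    # Spread eps uniformly over the vocab; the gold token keeps the remaining mass.
    base = eps / vocab_size
    target = [base] * vocab_size
    target[correct_index] += 1.0 - eps
    return target

if __name__ == "__main__":
    t = smooth_one_hot(correct_index=2, vocab_size=5, eps=0.1)
    print([round(p, 3) for p in t], "sums to", round(sum(t), 3))
    # [0.02, 0.02, 0.92, 0.02, 0.02] -- the model is trained to be slightly less confident.
```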
Let's test your knowledge. Is this statement true or false?
Label smoothing can improve BLEU even if it slightly worsens perplexity.
Press true if you believe the statement is correct, or false otherwise.
Inference: How It Generates Translations
At inference:
- The decoder generates tokens one at a time.
- They use beam search (beam size ~4 for translation), which keeps multiple candidate sequences in parallel and chooses the best-scoring one.
- They apply a length penalty so the model doesn’t unfairly prefer too-short outputs.
They also cap max output length to input_length + 50, but will stop early if it predicts an end-of-sentence token.
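As a rough sketch of length-normalized scoring: the paper follows Wu et al.'s length penalty with α = 0.6, and the formula below is that GNMT-style normalization, so treat the exact constants as an assumption rather than something spelled out in this paper.

```python
# length_penalty_sketch.py -- length-normalized beam scoring (GNMT-style formula,
# which the paper cites; the exact constants here are an assumption).
def length_penalty(length, alpha=0.6):
    return ((5 + length) ** alpha) / ((5 + 1) ** alpha)

def normalized_score(log_prob, length, alpha=0.6):
    # Without this, longer hypotheses accumulate more negative log-prob and lose unfairly.
    return log_prob / length_penalty(length, alpha)

if __name__ == "__main__":
    short = normalized_score(log_prob=-4.0, length=5)
    long_ = normalized_score(log_prob=-7.0, length=12)
    print(round(short, 3), round(long_, 3))
```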
Results: Translation Quality and Cost
On English→German:
- Transformer (big) reached BLEU ≈ 28.4, beating previous state-of-the-art (including ensembles) by more than 2 BLEU.
- It became the new SOTA with a single model.
On English→French:
- Transformer (big) hit BLEU ≈ 41.8 (or ~41.0, depending on the table version), also state-of-the-art for single models.
- Training cost (in FLOPs and wall-clock time) was dramatically lower than older systems like GNMT or convolutional seq2seq.
Key story: higher quality, drastically less compute time to reach that quality.
Try this exercise. Click the correct answer from the options.
The Transformer matched or beat previous top translation systems while:
Click the option that best answers the question.
- taking more compute
- taking about the same compute
- taking drastically less compute
- not using GPUs
Model Variations: What Matters Most
They tried lots of variants:
Changing the number of attention heads (1, 4, 8, 16, 32):
- 8 heads worked best.
- Too few heads lose expressiveness.
- Too many heads also hurt a bit.

Scaling up model width (d_model), depth (N layers), and feed-forward size (d_ff):
- Bigger models → better BLEU (unsurprising).

Changing dropout:
- Removing dropout overfit and hurt BLEU.

Replacing sinusoidal positional encodings with learned positional embeddings:
- Performance was basically the same.
- They kept sinusoids for extrapolation reasons.
Beyond Translation: Parsing
They tested English constituency parsing (turning a sentence into a full syntax tree). This task has tricky long-range structure.
They trained a 4-layer Transformer (with d_model = 1024) on:
- Just the Penn Treebank WSJ (~40K sentences), and
- A semi-supervised setup with millions more high-confidence parse trees.
Result:
- Even with limited data, the Transformer was competitive with strong parsers.
- With semi-supervised data, it surpassed many previous approaches, showing the architecture generalizes beyond translation.
Interpretability: Heads Learn Linguistic Jobs
One cool side effect: you can inspect attention maps.
Different attention heads specialize:
- Some heads follow long-distance dependencies (“making … more difficult”).
- Some heads link pronouns to the right noun (its → product, for example).
- Some seem to track syntactic structure (which words group into a phrase).
That means different heads act like different “relation detectors.” The model learns these roles without explicit syntax labels.

Hands-On Code: Scaled Dot-Product Attention
Below is a tiny runnable demo of scaled dot-product attention for one attention head, using only the standard library.
It:
- Computes attention weights from Q and K.
- Applies softmax with scaling.
- Uses those weights to mix the values V.
```python
# file: scaled_dot_attention.py
# Minimal scaled dot-product attention.
# Only standard library. Run with `python scaled_dot_attention.py`.
import math
import random

def softmax(xs):
    # subtract max for numerical stability
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(a, b):
    # a: [n x d], b: [d x m] => [n x m]
    n = len(a)
    d = len(a[0])
    m = len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for k in range(d):
                s += a[i][k] * b[k][j]
            out[i][j] = s
    return out

def transpose(m):
    # [n x d] -> [d x n]
    return [list(col) for col in zip(*m)]

def scaled_dot_attention(Q, K, V):
    # scores = Q * K^T / sqrt(d_k); weights = softmax(scores); output = weights * V
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

def main():
    random.seed(42)
    n, d_k, d_v = 3, 4, 4  # 3 tokens, small head size
    Q = [[random.uniform(-1, 1) for _ in range(d_k)] for _ in range(n)]
    K = [[random.uniform(-1, 1) for _ in range(d_k)] for _ in range(n)]
    V = [[random.uniform(-1, 1) for _ in range(d_v)] for _ in range(n)]
    out, weights = scaled_dot_attention(Q, K, V)
    print("attention weights (each row sums to ~1):")
    for row in weights:
        print([round(w, 3) for w in row])
    print("output vectors (weighted mixes of V):")
    for row in out:
        print([round(x, 3) for x in row])

main()
```

Build your intuition. Click the correct answer from the options.
scaled_dot_attention mainly:
Click the option that best answers the question.
- blends V using weights from Q·K^T
- stores gradients
- loads data from disk
- compresses vocab
Hands-On Code: One Encoder Block Forward Pass (Python, stdlib only)
Below is a toy “encoder layer” forward pass. We’ll fake:
- multi-head attention with just one head,
- a feed-forward network,
- residual + layer norm.
This is not a full Transformer, but it mirrors how data flows in a single encoder layer.
```python
# file: tiny_encoder_block.py
# A super-simplified encoder layer forward pass with:
# attention -> add&norm -> feedforward -> add&norm
# No external libraries. Run with `python tiny_encoder_block.py`.
import math
import random

def layer_norm(vec):
    # simple per-vector layer norm
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    eps = 1e-6
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def linear(x, W, b):
    # x: [d_in], W: [d_out x d_in], b: [d_out] -> [d_out]
    out = []
    for row, bias in zip(W, b):
        s = 0.0
        for xi, wi in zip(x, row):
            s += xi * wi
        out.append(s + bias)
    return out

def relu(v):
    return [max(0.0, x) for x in v]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    # single-head attention with Q = K = V = X (no learned projections, for simplicity)
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)]
        out.append(mixed)
    return out

def encoder_block(X, W1, b1, W2, b2):
    # attention -> add&norm -> feedforward -> add&norm, position by position
    attn = self_attention(X)
    after_attn = [layer_norm([xi + ai for xi, ai in zip(x, a)]) for x, a in zip(X, attn)]
    ffn = [linear(relu(linear(h, W1, b1)), W2, b2) for h in after_attn]
    return [layer_norm([hi + fi for hi, fi in zip(h, f)]) for h, f in zip(after_attn, ffn)]

def main():
    random.seed(0)
    d_model, d_ff, n = 4, 8, 3
    X = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(n)]
    W1 = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
    W2 = [[random.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
    for i, vec in enumerate(encoder_block(X, W1, [0.0] * d_ff, W2, [0.0] * d_model)):
        print(f"token {i}:", [round(v, 3) for v in vec])

main()
```

One Pager Cheat Sheet
- The paper introduced the Transformer, a sequence transduction model that ditches recurrence and convolutions and uses only attention (specifically self-attention) to be fully parallelizable and learn long-range dependencies with fewer steps, enabling it to train faster, scale better, and set new state-of-the-art translation results (measured by BLEU) more cheaply than RNN- or CNN-based systems.
- Older sequence models — RNNs (e.g., LSTMs, GRUs) that update a hidden state token-by-token and CNN-based models that stack local convolutions — suffer from sequential hidden-state updates, limited parallelism, and indirect, depth-dependent connections, so both struggle to capture long-range dependencies efficiently.
- False — because the recurrence h_t = f(h_{t-1}, x_t) creates a sequential dependency, so you cannot compute h_t until h_{t-1} is known, forcing step-by-step computation in both training (via Backpropagation Through Time (BPTT)) and autoregressive inference, even though you can parallelize across batches and within-step matrix operations (e.g., in LSTM/GRU kernels), whereas convolutional sequence models and Transformers compute positions simultaneously.
- The paper builds a full encoder-decoder model that uses only attention (plus simple feed-forward parts), creating an all-attention architecture that directly links any word to any other, so long-range dependencies become effectively one hop and there are no recurrent or convolutional layers.
- Because attention computes normalized relevance scores between the query of a position and the keys of every other position and uses them to form a weighted sum of values, the model can in a single layer directly connect to every token, adaptively focus on any token regardless of distance, and evaluate those connections in parallel.
- The Transformer retains the classic encoder-decoder structure: the encoder reads the input and produces contextual vector representations, the decoder generates the output one symbol at a time using previous outputs plus the encoded input, and both are built from repeated layers of multi-head self-attention and feed-forward networks rather than stacks of RNN cells.
- The Transformer encoder is a stack of N = 6 identical layers; each layer has a multi-head self-attention sub-layer and a position-wise feed-forward network sub-layer, and every sub-layer is wrapped with residual connections and layer normalization, yielding the pattern x -> Attention -> Add&Norm -> FeedForward -> Add&Norm.
- The phrase describes the combination of an identity skip (or shortcut) path implemented as a residual connection (i.e., y = x + F(x) where F(x) is the sub-layer such as multi-head self-attention or the feed-forward MLP), which creates an identity path that improves gradient flow and mitigates vanishing gradients, with layer normalization (applied as LayerNorm after the addition) to stabilize activations — together forming the Transformer's Add&Norm pattern, called a residual connection + layer normalization.
- The Transformer decoder is a stack of 6 identical layers where each layer has three sub-layers — masked multi-head self-attention (preventing future-token peeking to keep generation auto-regressive), encoder-decoder attention (attending to all encoder outputs), and a position-wise feed-forward network — and each sub-layer uses residual connections + layer normalization.
- False: the decoder enforces causality via masked multi-head self-attention — a triangular mask added as −∞ in softmax((QK^T)/sqrt(d_k) + mask) makes attention to j > i effectively zero, encoder-decoder attention only sees encoder outputs (so it can't leak future decoder tokens), and the same causal masking is applied during training (often with teacher forcing) and inference, thus preserving the autoregressive property.
- Self-attention (also called intra-attention) relates every position in a sequence to every other position, weights and blends their information to compute updated representations for each position, enabling the model to capture word–word relationships regardless of distance.
- In scaled dot-product attention, for each Q you compute dot-product scores with K, scale them by dividing by sqrt(d_k), apply softmax to produce weights, and use those weights to mix V, preventing very large dot products that would make the softmax too peaky and gradients tiny.
- Dividing by sqrt(d_k) keeps the typical size of the dot product inputs to the softmax roughly constant as d_k changes, which prevents softmax saturation / excessively peaky outputs, avoids vanishing gradients, and improves numerical and optimization stability.
- The Transformer performs parallel attention by running attention h times with different learned projections — each a head — where inputs are linearly projected into smaller d_k/d_v spaces, scaled dot-product attention is applied independently per head, and the outputs are concatenated and projected back to d_model, allowing different heads to specialize (e.g., subject-verb agreement, coreference).
- The decoder can't look into the future: the Transformer enforces this with a mask inside attention that sets scores to -∞ for positions j > i before softmax, giving those positions ~0 probability so token y_i depends only on tokens < i, preserving left-to-right generation like a language model.
- An auto-regressive model factorizes sequence probability as a product of conditionals (P(y_1..y_T) = ∏_t P(y_t | y_<t)) and the decoder enforces this with causal (or no-peek) masking — replacing any attention score from position i to a future j > i with -inf before softmax (so those positions get ≈0 probability) — ensuring each token's representation and prediction depend only on y_<i (even during teacher forcing) and enabling sequential inference (sampling or beam search) without using future ground truth.
- After attention, each position's embedding is passed through a position-wise feed-forward network — W1, b1 → ReLU → W2, b2 — applied identically to every position (with different parameters per layer depth), so that attention mixes information across tokens while the feed-forward block transforms each token's channel representation; in the base Transformer d_model = 512 and d_ff = 2048 (it expands then contracts).
- Because Transformers have no built-in notion of order, they inject a positional encoding (added to token embeddings at the bottom of the encoder/decoder) using fixed sinusoidal encodings, which let the model infer absolute and relative positions and, in principle, generalize to longer sequences.
- Each sub-layer takes input x, computes Sublayer(x), adds them as x + Sublayer(x) (a residual connection) and then applies layer normalization, so the residual connections help gradients flow in deep networks while LayerNorm stabilizes training, and this structure repeats in every encoder and decoder layer.
- By replacing y = H(x) with y = x + F(x), a residual connection provides an identity shortcut that gives a direct path for gradients during backpropagation, preventing vanishing gradients and making it easier to learn the residual function F(x) = H(x) - x, thereby improving optimization and enabling stable training of very deep networks.
- The paper shows that self-attention wins because it provides maximal parallelism (all positions processed in O(1) sequential steps vs RNNs' O(n)), enables one-step long-range access to any position (unlike RNNs or multi-layer CNNs), and offers a competitive computational cost (~O(n²·d) per layer vs RNNs' O(n·d²)), yielding faster and higher-quality results for typical sentence lengths and hidden sizes.
- Self-attention delivers practical speed and better modeling quality because it enables massive parallelism (only O(1) sequential steps per layer and hardware-friendly dense ops), provides short path length (path length = 1) for direct long-range dependencies and improved gradient/representation flow, and has a practical computational profile — O(n^2 * d) vs O(n * d^2) — that is efficient for typical n and d on modern accelerators.
- Training used large WMT datasets (WMT 2014 English→German, ~4.5M pairs with a shared subword vocab via byte-pair encoding of ~37k tokens, and WMT 2014 English→French, ~36M pairs with a 32k word-piece vocab), length-based batching (~25k source + 25k target tokens per batch), run on a single machine with 8 × NVIDIA P100 GPUs, with the base model trained for ~12 hours (100k steps, ~0.4s/step) and the big model for ~3.5 days (300k steps, ~1.0s/step).
- They used the Adam optimizer with custom hyperparameters (β1 = 0.9, β2 = 0.98, ϵ = 1e-9) and a learning rate schedule that warms up linearly for the first 4000 steps then decays proportional to the inverse square root of the step number and model dimension — the intuition being to avoid a huge initial LR by ramping up and then cooling down.
- The correct term is warmup, meaning gradually increasing the learning rate during an initial warmup period so early noisy gradients and immature optimizer statistics (m, v, especially with Adam) don't cause huge parameter updates, allowing the model to build reliable gradient statistics and achieve more stable optimization and better final performance (e.g., linear rise for 4k steps then inverse-square-root decay used in Transformers).
- They used three main stabilizers — Dropout, applied after each sub-layer (before adding the residual) and to the sum of embeddings + positional encodings (base rate 0.1); Label smoothing, which softens one-hot targets with ε_ls = 0.1 (hurting raw perplexity but improving BLEU); and Averaging checkpoints, which averages the weights from the last several checkpoints at inference to stabilize predictions.
- True: label smoothing trains against a softened target instead of a one-hot (minimizing cross-entropy to q), which acts as a confidence penalty / regularizer that lowers p(gold) and therefore raises perplexity, yet by improving calibration, reducing overfitting, and making decoding (e.g., beam search) less brittle it often improves BLEU, illustrating that perplexity and BLEU measure different things and that smoothing intentionally trades likelihood for better sequence-level performance.
- At inference the decoder generates tokens one at a time, using beam search (beam size ~4) to keep multiple candidate sequences and a length penalty to avoid favoring too-short outputs, while they cap max output length to input_length + 50 and stop early if it predicts an end-of-sentence token.
- On English→German, Transformer (big) achieved BLEU ≈ 28.4 and on English→French BLEU ≈ 41.8 (~41.0 depending on presentation), establishing it as a new SOTA single model while delivering higher quality with drastically less compute (much lower FLOPs and wall-clock training time than older systems like GNMT).
- Because the Transformer replaces sequential RNN/LSTM recurrence with massive parallelism via self-attention (and multi-head attention), shortens the information path between tokens (shorter path length), packs more capacity into parallelizable GEMM-style ops that map to highly optimized dense operations, and — together with residual connections, layer normalization, and modern recipes like Adam with warmup and label smoothing — yields better representational efficiency, faster convergence, stable training, and better gradient propagation, it achieves drastically less compute (i.e., fewer total FLOPs and much less wall-clock time) to reach equal or higher BLEU than previous top systems.
- Across many variants, changing the number of attention heads showed 8 heads worked best (too few lost expressiveness, too many slightly hurt), scaling d_model, N, and d_ff showed larger models improved BLEU, removing dropout caused overfitting and worse BLEU, and swapping sinusoidal positional encodings for learned positional embeddings made little difference, so they kept sinusoids for extrapolation.
- A 4-layer Transformer (with d_model = 1024) trained on English constituency parsing — using only the Penn Treebank WSJ (~40K sentences) and, in a semi-supervised setup, millions more high-confidence trees — was competitive with strong parsers with limited data and, with semi-supervised data, surpassed many previous approaches, demonstrating the architecture generalizes beyond translation.
- By inspecting attention maps, we find that different attention heads specialize in linguistic tasks — tracking long-distance dependencies, linking pronouns to antecedents (its → product), and capturing phrase structure — so they act as relation detectors the model learns without explicit syntax labels.
- A tiny runnable demo implements scaled dot-product attention for one attention head using only the standard library, computing attention weights from Q and K, applying scaled softmax, and using those weights to mix the values V.
- It blends V into outputs by computing similarity via Q·K^T (scaled by sqrt(d_k) and normalized with softmax to produce attention weights) and combining them as a weighted sum (weights · V).
- This is a hands-on code example showing a single encoder block forward pass implemented in stdlib-only Python that fakes multi-head attention with just one head, includes a feed-forward network and residual connections plus layer norm, and is not a full Transformer but mirrors how data flows in a single encoder layer.


