"Attention Is All You Need" — The Transformer Architecture
This is a summary of the original Attention Is All You Need paper (Vaswani et al., 2017), available at https://arxiv.org/abs/1706.03762.
The Transformer is a neural network that ditches recurrence and convolutions and instead uses only attention to understand and generate sequences. It trains faster, scales better, and set new state-of-the-art translation results while being cheaper to train.
Why This Paper Mattered
The paper introduces the Transformer, a sequence transduction model (a model that maps one sequence to another, like English → German) that uses only self-attention instead of RNNs (recurrent neural networks) or CNNs (convolutional neural networks).
Before this, translation systems relied on stacked recurrent layers or convolutions to encode and decode sentences. Those systems were accurate but slow to train because they processed tokens mostly in sequence.
The Transformer is fully parallelizable across all positions in a sentence and still learns long-range dependencies (like "the dog … it") with fewer steps. This let it beat previous translation systems in BLEU score (a standard accuracy metric for translation quality) while training in a fraction of the time.

The Bottleneck With Old Sequence Models
RNNs (like LSTMs and GRUs) generate a hidden state one token at a time. Each state depends on the previous one. That means on long sequences, you do lots of steps in order, and you can’t parallelize those steps well during training.
CNN-based sequence models improved parallelism by using convolutions over windows of words. But they still only connect distant tokens indirectly, through many stacked layers. The longer the span between two related words, the more layers you have to stack.
The core pain: both approaches struggle with long-range dependencies efficiently. The farther two words are apart, the harder it is for the model to learn how they relate.
Are you sure you're getting this? Is this statement true or false?
RNNs naturally parallelize all time steps of a sentence at once.
Press true if you believe the statement is correct, or false otherwise.
Attention, at a High Level
Attention is a mechanism where the model asks: “For this position I’m generating or encoding, which other positions are relevant, and how relevant are they?”
Instead of passing information strictly left→right, attention can directly link any word to any other word. So the distance between “making … more difficult” can become effectively 1 hop, not 20 hops.
The paper’s move: build an entire encoder-decoder model that uses attention everywhere, and only attention. No recurrent layers. No convolution layers. Just attention + simple feed-forward parts.
Try this exercise. Click the correct answer from the options.
Attention lets the model:
Click the option that best answers the question.
- focus on all other tokens directly
- store data to disk
- change the optimizer
- act like a database
Transformer = Encoder + Decoder (Still)
The Transformer keeps the classic encoder-decoder structure common in translation:
- The encoder reads the input sentence (e.g. English) and produces contextual vector representations.
- The decoder generates the output sentence (e.g. German) one symbol at a time, using what it has produced so far plus the encoded input.
But both encoder and decoder are now built out of repeated layers of multi-head self-attention plus small feed-forward networks, instead of stacks of RNN cells.

Encoder Stack: What Happens Inside
The Transformer encoder is a stack of N = 6 identical layers.
Each layer has:
- A multi-head self-attention sub-layer.
- A position-wise feed-forward network sub-layer (a tiny 2-layer MLP applied independently to each position's vector).
Each sub-layer is wrapped with a residual connection (add the input back to the output of the sub-layer) and layer normalization (normalize activations so training stays stable).
So the pattern is basically:
x -> Attention -> Add&Norm -> FeedForward -> Add&Norm
Are you sure you're getting this? Fill in the missing part by typing it in.
Wrapping a sub-layer’s output with “add the original input, then normalize” is called a __________ connection + layer normalization pattern.
Write the missing line below.
Decoder Stack: What’s Different
The Transformer decoder is also a stack of 6 identical layers, but each layer has 3 sub-layers:
- Masked multi-head self-attention over the already-generated output tokens. (“Masked” here means a token can’t peek at future tokens. This keeps generation auto-regressive.)
- Encoder-decoder attention, where the decoder attends to all encoder outputs.
- The same position-wise feed-forward network.
Each sub-layer again uses residual connections + layer normalization.
Try this exercise. Is this statement true or false?
During generation, position i in the decoder is allowed to attend to future positions (i+1, i+2, …).
Press true if you believe the statement is correct, or false otherwise.
Self-Attention: The Core Operation
self-attention (also called intra-attention) relates every position in a sequence to every other position in that same sequence, to compute a new representation for each position.
Intuition:
- For each word, ask: which other words in this sentence matter for me?
- Weight them.
- Blend their info into my updated representation.
This lets the model capture word-word relationships without caring how far apart they are in the sentence.
Scaled Dot-Product Attention (Math but Friendly)
Attention takes three things:
- Q (“queries”)
- K (“keys”)
- V (“values”)
For each query, we score how well it matches each key (using a dot product), scale it, apply softmax to get weights, and use those weights to mix the values.
The paper’s version is scaled dot-product attention:
Attention(Q, K, V) = softmax( (Q * K^T) / sqrt(d_k) ) * V

Where:

- d_k is the dimensionality of the keys.
- We divide by sqrt(d_k) to prevent very large dot products that would make the softmax too peaky and gradients tiny.
Build your intuition. Click the correct answer from the options.
Why divide by sqrt(d_k)?
Click the option that best answers the question.
- noise suppression
- numerical stability in softmax
- to save memory
- to enforce sparsity
Multi-Head Attention
Instead of doing attention once, the Transformer does it in parallel h times with different learned projections. Each parallel attention is called a head.
Steps:
- Linearly project inputs into multiple smaller (d_k, d_v) spaces.
- Run scaled dot-product attention independently in each head.
- Concatenate all heads’ outputs.
- Project back to d_model.
Why? Different heads can specialize. One head might track subject-verb agreement. Another might track coreference (“its” → which noun?). The model can attend to multiple types of relationships at once.
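To make the head bookkeeping concrete, here is a minimal sketch in plain Python (stdlib only, in the same spirit as the hands-on demos at the end of this summary). The random projection matrices and tiny dimensions are placeholders for learned weights, and the final learned output projection (W^O in the paper) is only noted in a comment, not implemented.

```python
# multi_head_sketch.py -- illustrative only, not the paper's implementation.
# Shows the head bookkeeping: project, attend per head, concatenate, project back.
import math, random

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def matmul(A, B):  # [n x d] @ [d x m] -> [n x m]
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def attention(Q, K, V):  # scaled dot-product attention for one head
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])                    # Q @ K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)                                             # weights @ V

def multi_head(X, h=2, d_model=8):
    d_k = d_model // h
    out_heads = []
    for _ in range(h):
        # random per-head projections (placeholders for learned weight matrices)
        Wq = [[random.uniform(-1, 1) for _ in range(d_k)] for _ in range(d_model)]
        Wk = [[random.uniform(-1, 1) for _ in range(d_k)] for _ in range(d_model)]
        Wv = [[random.uniform(-1, 1) for _ in range(d_k)] for _ in range(d_model)]
        Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
        out_heads.append(attention(Q, K, V))
    # concatenate the heads position by position: back to d_model columns
    concat = [sum((head[i] for head in out_heads), []) for i in range(len(X))]
    return concat  # a final learned projection W^O would follow here

if __name__ == "__main__":
    random.seed(0)
    X = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(3)]  # 3 tokens, d_model = 8
    print(multi_head(X))
```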

Masked Self-Attention in the Decoder
The decoder can’t look into the future when predicting the next word. To enforce that, the Transformer uses a mask inside attention.
Mechanically:
- Any attention score from position i to a future position j > i is set to -∞ before softmax.
- After softmax, those future positions get probability ~0.
- So token y_i only depends on tokens < i.
This keeps generation left-to-right, like a language model.
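Here is a minimal stdlib-only sketch of that masking step, assuming a small hand-written score matrix; it shows how setting future scores to -∞ before softmax drives their weights to zero.

```python
# causal_mask_sketch.py -- toy illustration of masked self-attention scores.
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

NEG_INF = float("-inf")

def apply_causal_mask(scores):
    # scores[i][j] is how much position i wants to attend to position j.
    # Set every future position (j > i) to -inf so softmax gives it ~0 weight.
    n = len(scores)
    return [[scores[i][j] if j <= i else NEG_INF for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    raw = [[0.2, 1.5, -0.3],
           [0.7, 0.1,  2.0],
           [1.1, 0.4,  0.9]]
    for row in apply_causal_mask(raw):
        print([round(w, 3) for w in softmax(row)])
    # Row 0 attends only to position 0; row 1 to positions 0-1; row 2 to all three.
```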
Let's test your knowledge. Fill in the missing part by typing it in.
Masking future positions in the decoder preserves the model’s __________ property (it generates one token at a time, using only what’s already generated).
Write the missing line below.
Position-Wise Feed-Forward Networks
After attention, each position’s embedding is run through a tiny feed-forward network:
- A linear layer (W1, b1),
- A ReLU,
- Then another linear layer (W2, b2).
It’s applied identically to every position, but with different parameters per layer depth. You can think of this as: attention mixes information across tokens, and then the feed-forward block “transforms” each token’s channel representation.
In the base Transformer:
- d_model = 512
- The inner feed-forward dimension d_ff = 2048 (so it expands, then contracts).
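A minimal sketch of that expand-then-contract step, using tiny dimensions and random placeholder weights instead of the real d_model = 512 / d_ff = 2048 and learned parameters:

```python
# ffn_sketch.py -- position-wise feed-forward network applied to one token vector.
# FFN(x) = W2 @ relu(W1 @ x + b1) + b2; in the base model d_model = 512, d_ff = 2048.
# Tiny dimensions and random weights here, just to show the expand-then-contract shape.
import random

def linear(x, W, b):  # W: [d_out x d_in]
    return [sum(xi * wi for xi, wi in zip(x, row)) + bias for row, bias in zip(W, b)]

def relu(v):
    return [max(0.0, x) for x in v]

def ffn(x, d_model=8, d_ff=32):
    W1 = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
    b1 = [0.0] * d_ff
    W2 = [[random.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
    b2 = [0.0] * d_model
    hidden = relu(linear(x, W1, b1))   # expand: d_model -> d_ff
    return linear(hidden, W2, b2)      # contract: d_ff -> d_model

if __name__ == "__main__":
    random.seed(0)
    token = [random.uniform(-1, 1) for _ in range(8)]
    print(ffn(token))
```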
Positional Encoding (Because There’s No Recurrence)
Transformers have no built-in notion of order. To fix that, they inject a positional encoding into each token embedding.
They use fixed sinusoidal encodings:
- Each dimension of the position encoding is a sine or cosine with a different frequency.
- These encodings are added to the token embeddings at the bottom of the encoder/decoder.
Why sinusoids?
- They let the model infer both absolute and relative positions.
- They can, in principle, generalize to longer sequences than seen in training, because the pattern is continuous.
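The encoding itself is cheap to compute; the sketch below follows the paper's formula, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), printed for a few positions of a toy 8-dimensional model.

```python
# positional_encoding_sketch.py -- the paper's sinusoidal encoding, stdlib only.
# PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
# PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
import math

def positional_encoding(num_positions, d_model):
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((i // 2 * 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

if __name__ == "__main__":
    for pos, row in enumerate(positional_encoding(4, 8)):
        print(pos, [round(v, 3) for v in row])
    # These vectors are added to the token embeddings before the first layer.
```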

Residual Connections + LayerNorm
Every sub-layer (attention, or feed-forward) is wrapped like this:
- Take the input x.
- Run the sub-layer to get Sublayer(x).
- Add them: x + Sublayer(x) (this is a residual connection).
- Apply layer normalization.
Why:
- Residuals help gradients flow in deep networks.
- LayerNorm stabilizes training by normalizing across the hidden dimension.
This structure repeats in every encoder and decoder layer.
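As a minimal sketch, the whole wrapper fits in a few lines; the "double" function here is a hypothetical stand-in for attention or the feed-forward block.

```python
# add_and_norm_sketch.py -- the "residual + layer norm" wrapper around any sub-layer.
import math

def layer_norm(vec, eps=1e-6):
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def add_and_norm(x, sublayer):
    # x + Sublayer(x), then normalize -- used around attention and feed-forward alike.
    y = sublayer(x)
    return layer_norm([xi + yi for xi, yi in zip(x, y)])

if __name__ == "__main__":
    x = [0.5, -1.0, 2.0, 0.0]
    double = lambda v: [2 * t for t in v]   # stand-in for a real sub-layer
    print(add_and_norm(x, double))
```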
Try this exercise. Is this statement true or false?
Residual connections are mainly there to help very deep models train without gradients vanishing.
Press true if you believe the statement is correct, or false otherwise.
Why Self-Attention Wins (Speed + Quality)
The paper analyzes three factors compared to RNNs and CNNs:
Parallelism:
- Self-attention can process all positions in parallel (O(1) sequential steps).
- RNNs require O(n) sequential steps because each hidden state depends on the previous one.
Path Length for Long-Range Dependencies:
- Self-attention: any position can attend to any other in 1 step.
- RNN: info must flow through many time steps, so long-range info weakens.
- CNN: needs multiple convolution layers to connect distant positions.
Computational Cost:
- Self-attention per layer is ~O(n² · d) where n = sequence length, d = hidden size.
- RNNs are O(n · d²). For typical sentence lengths and hidden sizes, self-attention is competitive or faster.
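A rough back-of-envelope comparison, assuming an example sentence length n = 50 and hidden size d = 512 (values chosen only for illustration):

```python
# complexity_sketch.py -- back-of-envelope per-layer cost comparison.
n, d = 50, 512                 # example sentence length and hidden size

self_attention = n * n * d     # O(n^2 * d)
recurrent      = n * d * d     # O(n * d^2)

print(f"self-attention ~ {self_attention:,} ops per layer")   # ~1.3M
print(f"recurrent      ~ {recurrent:,} ops per layer")        # ~13.1M
# When n < d (typical for sentences), the n^2 * d term is the smaller one,
# and all n positions of self-attention can also run in parallel.
```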
Try this exercise. Click the correct answer from the options.
Self-attention offers:
Click the option that best answers the question.
- short paths between any two tokens
- parallel training
- both
- neither
Training Setup (Translation Tasks)
They trained on:
- WMT 2014 English→German (~4.5M sentence pairs; shared subword vocab using byte-pair encoding, around 37k tokens).
- WMT 2014 English→French (~36M sentence pairs; 32k word-piece vocab).
Batches roughly contained ~25k source tokens + 25k target tokens. Sentences were batched by similar length to make training efficient.
Hardware: a single machine with 8 NVIDIA P100 GPUs.
Base model:
- ~12 hours of training (100k steps, ~0.4s/step on base config).
Big model:
- ~3.5 days of training (300k steps, ~1.0s/step).
Optimizer, Learning Rate, and Warmup
They used the Adam optimizer with custom hyperparameters:
- β1 = 0.9, β2 = 0.98, ϵ = 1e-9.
They did a special learning rate schedule:
- Warm up: linearly increase learning rate for the first 4000 steps.
- Then decay it proportionally to the inverse square root of the step number and model dimension.
- Intuition: don’t blast the model with a huge LR at the start; gradually ramp up, then cool down.
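In code, the paper's schedule is lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5)). A tiny sketch:

```python
# lr_schedule_sketch.py -- the paper's warmup + inverse-square-root schedule.
# lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5))

def lrate(step, d_model=512, warmup_steps=4000):
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

if __name__ == "__main__":
    for step in (1, 1000, 4000, 10000, 100000):
        print(step, f"{lrate(step):.2e}")
    # Rises linearly up to step 4000, then decays like 1/sqrt(step).
```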
Let's test your knowledge. Fill in the missing part by typing it in.
Slowly increasing the learning rate at the start of training is called __________.
Write the missing line below.
Regularization Tricks They Used
They used three main stabilizers:
Dropout
- Applied after each sub-layer (before adding the residual).
- Also applied to the sum of embeddings + positional encodings.
- Base model used dropout rate 0.1.

Label smoothing
- Instead of training on one-hot targets, they softened targets slightly (ε_ls = 0.1).
- This hurts raw perplexity (model becomes “less sure”) but improves BLEU accuracy.

Averaging checkpoints
- At inference, they averaged the weights from the last several checkpoints to stabilize predictions.
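A minimal sketch of how a one-hot target gets softened, using one common formulation (spreading ε uniformly over the vocabulary); exact details can vary by implementation:

```python
# label_smoothing_sketch.py -- softening a one-hot target (one common formulation).
def smooth_one_hot(correct_index, vocab_size, eps=0.1):
    # Spread eps uniformly over the vocab; the gold token keeps the remaining mass.
    base = eps / vocab_size
    target = [base] * vocab_size
    target[correct_index] += 1.0 - eps
    return target

if __name__ == "__main__":
    t = smooth_one_hot(correct_index=2, vocab_size=5, eps=0.1)
    print([round(p, 3) for p in t], "sums to", round(sum(t), 3))
    # [0.02, 0.02, 0.92, 0.02, 0.02] -- the model is trained to be slightly less confident.
```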
Let's test your knowledge. Is this statement true or false?
Label smoothing can improve BLEU even if it slightly worsens perplexity.
Press true if you believe the statement is correct, or false otherwise.
Inference: How It Generates Translations
At inference:
- The decoder generates tokens one at a time.
- They use beam search (beam size ~4 for translation), which keeps multiple candidate sequences in parallel and chooses the best-scoring one.
- They apply a length penalty so the model doesn’t unfairly prefer too-short outputs.
They also cap max output length to input_length + 50, but will stop early if it predicts an end-of-sentence token.
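As a rough sketch of length-normalized scoring: the paper follows Wu et al.'s length penalty with α = 0.6, and the formula below is that GNMT-style normalization, so treat the exact constants as an assumption rather than something spelled out in this paper.

```python
# length_penalty_sketch.py -- length-normalized beam scoring (GNMT-style formula,
# which the paper cites; the exact constants here are an assumption).
def length_penalty(length, alpha=0.6):
    return ((5 + length) ** alpha) / ((5 + 1) ** alpha)

def normalized_score(log_prob, length, alpha=0.6):
    # Without this, longer hypotheses accumulate more negative log-prob and lose unfairly.
    return log_prob / length_penalty(length, alpha)

if __name__ == "__main__":
    short = normalized_score(log_prob=-4.0, length=5)
    long_ = normalized_score(log_prob=-7.0, length=12)
    print(round(short, 3), round(long_, 3))
```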
Results: Translation Quality and Cost
On English→German:
- Transformer (big) reached BLEU ≈ 28.4, beating previous state-of-the-art (including ensembles) by more than 2 BLEU.
- It became the new SOTA with a single model.
On English→French:
- Transformer (big) hit BLEU ≈ 41.8 (or ~41.0, depending on the table version), also state-of-the-art for single models.
- Training cost (in FLOPs and wall-clock time) was dramatically lower than older systems like GNMT or convolutional seq2seq.
Key story: higher quality, drastically less compute time to reach that quality.
Try this exercise. Click the correct answer from the options.
The Transformer matched or beat previous top translation systems while:
Click the option that best answers the question.
- taking more compute
- taking about the same compute
- taking drastically less compute
- not using GPUs
Model Variations: What Matters Most
They tried lots of variants:
Changing the number of attention heads (1, 4, 8, 16, 32):
- 8 heads worked best.
- Too few heads lose expressiveness.
- Too many heads also hurt a bit.

Scaling up model width (d_model), depth (N layers), and feed-forward size (d_ff):
- Bigger models → better BLEU (unsurprising).

Changing dropout:
- Removing dropout overfit and hurt BLEU.

Replacing sinusoidal positional encodings with learned positional embeddings:
- Performance was basically the same.
- They kept sinusoids for extrapolation reasons.
Beyond Translation: Parsing
They tested English constituency parsing (turning a sentence into a full syntax tree). This task has tricky long-range structure.
They trained a 4-layer Transformer (with d_model = 1024) on:
- Just the Penn Treebank WSJ (~40K sentences), and
- A semi-supervised setup with millions more high-confidence parse trees.
Result:
- Even with limited data, the Transformer was competitive with strong parsers.
- With semi-supervised data, it surpassed many previous approaches, showing the architecture generalizes beyond translation.
Interpretability: Heads Learn Linguistic Jobs
One cool side effect: you can inspect attention maps.
Different attention heads specialize:
- Some heads follow long-distance dependencies (“making … more difficult”).
- Some heads link pronouns to the right noun (its → product, for example).
- Some seem to track syntactic structure (which words group into a phrase).
That means different heads act like different “relation detectors.” The model learns these roles without explicit syntax labels.

Hands-On Code: Scaled Dot-Product Attention
Below is a tiny runnable demo of scaled dot-product attention for one attention head, using only the standard library.
It:
- Computes attention weights from Q and K.
- Applies softmax with scaling.
- Uses those weights to mix the values V.
```python
# file: scaled_dot_attention.py
# Minimal scaled dot-product attention.
# Only standard library. Run with `python scaled_dot_attention.py`.
import math
import random

def softmax(xs):
    # subtract max for numerical stability
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(a, b):
    # a: [n x d], b: [d x m] => [n x m]
    n = len(a)
    d = len(a[0])
    m = len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for k in range(d):
                s += a[i][k] * b[k][j]
            out[i][j] = s
    return out

def transpose(m):
    # [n x d] -> [d x n]
    return [list(col) for col in zip(*m)]

def scaled_dot_attention(Q, K, V):
    # scores = Q * K^T / sqrt(d_k); weights = softmax(scores); output = weights * V
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

def main():
    random.seed(42)
    n, d_k, d_v = 3, 4, 4  # 3 tokens, small head size
    Q = [[random.uniform(-1, 1) for _ in range(d_k)] for _ in range(n)]
    K = [[random.uniform(-1, 1) for _ in range(d_k)] for _ in range(n)]
    V = [[random.uniform(-1, 1) for _ in range(d_v)] for _ in range(n)]
    out, weights = scaled_dot_attention(Q, K, V)
    print("attention weights (each row sums to ~1):")
    for row in weights:
        print([round(w, 3) for w in row])
    print("output vectors (weighted mixes of V):")
    for row in out:
        print([round(x, 3) for x in row])

main()
```

Build your intuition. Click the correct answer from the options.
scaled_dot_attention mainly:
Click the option that best answers the question.
- blends V using weights from Q·K^T
- stores gradients
- loads data from disk
- compresses vocab
Hands-On Code: One Encoder Block Forward Pass (Python, stdlib only)
Below is a toy “encoder layer” forward pass. We’ll fake:
- multi-head attention with just one head,
- a feed-forward network,
- residual + layer norm.
This is not a full Transformer, but it mirrors how data flows in a single encoder layer.
```python
# file: tiny_encoder_block.py
# A super-simplified encoder layer forward pass with:
# attention -> add&norm -> feedforward -> add&norm
# No external libraries. Run with `python tiny_encoder_block.py`.
import math
import random

def layer_norm(vec):
    # simple per-vector layer norm
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    eps = 1e-6
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def linear(x, W, b):
    # x: [d_in], W: [d_out x d_in], b: [d_out] -> [d_out]
    out = []
    for row, bias in zip(W, b):
        s = 0.0
        for xi, wi in zip(x, row):
            s += xi * wi
        out.append(s + bias)
    return out

def relu(v):
    return [max(0.0, x) for x in v]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    # single-head attention with Q = K = V = X (no learned projections, for simplicity)
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)]
        out.append(mixed)
    return out

def encoder_block(X, W1, b1, W2, b2):
    # attention -> add&norm -> feedforward -> add&norm, position by position
    attn = self_attention(X)
    after_attn = [layer_norm([xi + ai for xi, ai in zip(x, a)]) for x, a in zip(X, attn)]
    ffn = [linear(relu(linear(h, W1, b1)), W2, b2) for h in after_attn]
    return [layer_norm([hi + fi for hi, fi in zip(h, f)]) for h, f in zip(after_attn, ffn)]

def main():
    random.seed(0)
    d_model, d_ff, n = 4, 8, 3
    X = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(n)]
    W1 = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
    W2 = [[random.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
    for i, vec in enumerate(encoder_block(X, W1, [0.0] * d_ff, W2, [0.0] * d_model)):
        print(f"token {i}:", [round(v, 3) for v in vec])

main()
```

One Pager Cheat Sheet
- The paper introduced the Transformer, a sequence transduction model that ditches recurrence and convolutions and uses only attention (specifically self-attention) to be fully parallelizable and learn long-range dependencies with fewer steps, enabling it to train faster, scale better, and set new state-of-the-art translation results (measured by BLEU) more cheaply than RNN- or CNN-based systems.
- Older sequence models — RNNs (e.g., LSTMs, GRUs) that update a hidden state token-by-token and CNN-based models that stack local convolutions — suffer from sequential hidden-state updates, limited parallelism, and indirect, depth-dependent connections, so both struggle to capture long-range dependencies efficiently.
- False — because the recurrence h_t = f(h_{t-1}, x_t) creates a sequential dependency, so you cannot compute h_t until h_{t-1} is known, forcing step-by-step computation in both training (via Backpropagation Through Time (BPTT)) and autoregressive inference, even though you can parallelize across batches and within-step matrix operations (e.g., in LSTM/GRU kernels), whereas convolutional sequence models and Transformers compute positions simultaneously.
- The paper builds a full encoder-decoder model that uses only attention (plus simple feed-forward parts), creating an all-attention architecture that directly links any word to any other, so long-range dependencies become effectively one hop and there are no recurrent or convolutional layers.
- Because attention computes normalized relevance scores between the query of a position and the keys of every other position and uses them to form a weighted sum of values, the model can in a single layer directly connect to every token, adaptively focus on any token regardless of distance, and evaluate those connections in parallel.
- The Transformer retains the classic encoder-decoder structure: the encoder reads the input and produces contextual vector representations, the decoder generates the output one symbol at a time using previous outputs plus the encoded input, and both are built from repeated layers of multi-head self-attention and feed-forward networks rather than stacks of RNN cells.
- The Transformer encoder is a stack of N = 6 identical layers; each layer has a multi-head self-attention sub-layer and a position-wise feed-forward network sub-layer, and every sub-layer is wrapped with residual connections and layer normalization, yielding the pattern x -> Attention -> Add&Norm -> FeedForward -> Add&Norm.
- The phrase describes the combination of an identity skip (or shortcut) path implemented as a residual connection (i.e., y = x + F(x) where F(x) is the sub-layer such as multi-head self-attention or the feed-forward MLP), which creates an identity path that improves gradient flow and mitigates vanishing gradients, with layer normalization (applied as LayerNorm after the addition) to stabilize activations — together forming the Transformer's Add&Norm pattern, called a residual connection + layer normalization.
- The Transformer decoder is a stack of 6 identical layers where each layer has three sub-layers — masked multi-head self-attention (preventing future-token peeking to keep generation auto-regressive), encoder-decoder attention (attending to all encoder outputs), and a position-wise feed-forward network — and each sub-layer uses residual connections + layer normalization.
- False: the decoder enforces causality via masked multi-head self-attention — a triangular mask added as −∞ in softmax((QK^T)/sqrt(d_k) + mask) makes attention to j > i effectively zero, encoder-decoder attention only sees encoder outputs (so it can't leak future decoder tokens), and the same causal masking is applied during training (often with teacher forcing) and inference, thus preserving the autoregressive property.
- Self-attention (also called intra-attention) relates every position in a sequence to every other position, weights and blends their information to compute updated representations for each position, enabling the model to capture word–word relationships regardless of distance.
- In scaled dot-product attention, for each Q you compute dot-product scores with K, scale them by dividing by sqrt(d_k), apply softmax to produce weights, and use those weights to mix V, preventing very large dot products that would make the softmax too peaky and gradients tiny.
- Dividing by sqrt(d_k) keeps the typical size of the dot product inputs to the softmax roughly constant as d_k changes, which prevents softmax saturation / excessively peaky outputs, avoids vanishing gradients, and improves numerical and optimization stability.
- The Transformer performs parallel attention by running attention h times with different learned projections — each a head — where inputs are linearly projected into smaller d_k/d_v spaces, scaled dot-product attention is applied independently per head, and the outputs are concatenated and projected back to d_model, allowing different heads to specialize (e.g., subject-verb agreement, coreference).
- The decoder can't look into the future: the Transformer enforces this with a mask inside attention that sets scores to -∞ for positions j > i before softmax, giving those positions ~0 probability so token y_i depends only on tokens < i, preserving left-to-right generation like a language model.
- An auto-regressive model factorizes sequence probability as a product of conditionals (P(y_1..y_T) = ∏_t P(y_t | y_<t)) and the decoder enforces this with causal (or no-peek) masking — replacing any attention score from position i to a future j > i with -inf before softmax (so those positions get ≈0 probability) — ensuring each token's representation and prediction depend only on y_<i (even during teacher forcing) and enabling sequential inference (sampling or beam search) without using future ground truth.
- After attention, each position's embedding is passed through a position-wise feed-forward network — W1, b1 → ReLU → W2, b2 — applied identically to every position (with different parameters per layer depth), so that attention mixes information across tokens while the feed-forward block transforms each token's channel representation; in the base Transformer d_model = 512 and d_ff = 2048 (it expands then contracts).
- Because Transformers have no built-in notion of order, they inject a positional encoding (added to token embeddings at the bottom of the encoder/decoder) using fixed sinusoidal encodings, which let the model infer absolute and relative positions and, in principle, generalize to longer sequences.
- Each sub-layer takes input x, computes Sublayer(x), adds them as x + Sublayer(x) (a residual connection) and then applies layer normalization, so the residual connections help gradients flow in deep networks while LayerNorm stabilizes training, and this structure repeats in every encoder and decoder layer.
- By replacing y = H(x) with y = x + F(x), a residual connection provides an identity shortcut that gives a direct path for gradients during backpropagation, preventing vanishing gradients and making it easier to learn the residual function F(x) = H(x) - x, thereby improving optimization and enabling stable training of very deep networks.
- The paper shows that self-attention wins because it provides maximal parallelism (all positions processed in O(1) sequential steps vs RNNs' O(n)), enables one-step long-range access to any position (unlike RNNs or multi-layer CNNs), and offers a competitive computational cost (~O(n²·d) per layer vs RNNs' O(n·d²)), yielding faster and higher-quality results for typical sentence lengths and hidden sizes.
- Self-attention delivers practical speed and better modeling quality because it enables massive parallelism (only O(1) sequential steps per layer and hardware-friendly dense ops), provides short path length (path length = 1) for direct long-range dependencies and improved gradient/representation flow, and has a practical computational profile — O(n^2 * d) vs O(n * d^2) — that is efficient for typical n and d on modern accelerators.
- Training used large WMT datasets (WMT 2014 English→German, ~4.5M pairs with a shared subword vocab via byte-pair encoding of ~37k tokens, and WMT 2014 English→French, ~36M pairs with a 32k word-piece vocab), length-based batching (~25k source + 25k target tokens per batch), run on a single machine with 8 × NVIDIA P100 GPUs, with the base model trained for ~12 hours (100k steps, ~0.4s/step) and the big model for ~3.5 days (300k steps, ~1.0s/step).
- They used the Adam optimizer with custom hyperparameters (β1 = 0.9, β2 = 0.98, ϵ = 1e-9) and a learning rate schedule that warms up linearly for the first 4000 steps then decays proportional to the inverse square root of the step number and model dimension — the intuition being to avoid a huge initial LR by ramping up and then cooling down.
- The correct term is warmup, meaning gradually increasing the learning rate during an initial warmup period so early noisy gradients and immature optimizer statistics (m, v, especially with Adam) don't cause huge parameter updates, allowing the model to build reliable gradient statistics and achieve more stable optimization and better final performance (e.g., linear rise for 4k steps then inverse-square-root decay used in Transformers).
- They used three main stabilizers — Dropout, applied after each sub-layer (before adding the residual) and to the sum of embeddings + positional encodings (base rate 0.1); Label smoothing, which softens one-hot targets with ε_ls = 0.1 (hurting raw perplexity but improving BLEU); and Averaging checkpoints, which averages the weights from the last several checkpoints at inference to stabilize predictions.
- True: label smoothing trains against a softened target instead of a one-hot (minimizing cross-entropy to q), which acts as a confidence penalty / regularizer that lowers p(gold) and therefore raises perplexity, yet by improving calibration, reducing overfitting, and making decoding (e.g., beam search) less brittle it often improves BLEU, illustrating that perplexity and BLEU measure different things and that smoothing intentionally trades likelihood for better sequence-level performance.
- At inference the decoder generates tokens one at a time, using beam search (beam size ~4) to keep multiple candidate sequences and a length penalty to avoid favoring too-short outputs, while they cap max output length to input_length + 50 and stop early if it predicts an end-of-sentence token.
- On English→German, Transformer (big) achieved BLEU ≈ 28.4 and on English→French BLEU ≈ 41.8 (~41.0 depending on presentation), establishing it as a new SOTA single model while delivering higher quality with drastically less compute (much lower FLOPs and wall-clock training time than older systems like GNMT).
- Because the Transformer replaces sequential RNN/LSTM recurrence with massive parallelism via self-attention (and multi-head attention), shortens the information path between tokens (shorter path length), packs more capacity into parallelizable GEMM-style ops that map to highly optimized dense operations, and — together with residual connections, layer normalization, and modern recipes like Adam with warmup and label smoothing — yields better representational efficiency, faster convergence, stable training, and better gradient propagation, it achieves drastically less compute (i.e., fewer total FLOPs and much less wall-clock time) to reach equal or higher BLEU than previous top systems.
- Across many variants, changing the number of attention heads showed 8 heads worked best (too few lost expressiveness, too many slightly hurt), scaling d_model, N, and d_ff showed larger models improved BLEU, removing dropout caused overfitting and worse BLEU, and swapping sinusoidal positional encodings for learned positional embeddings made little difference, so they kept sinusoids for extrapolation.
- A 4-layer Transformer (with d_model = 1024) trained on English constituency parsing — using only the Penn Treebank WSJ (~40K sentences) and, in a semi-supervised setup, millions more high-confidence trees — was competitive with strong parsers with limited data and, with semi-supervised data, surpassed many previous approaches, demonstrating the architecture generalizes beyond translation.
- By inspecting attention maps, we find that different attention heads specialize in linguistic tasks — tracking long-distance dependencies, linking pronouns to antecedents (its → product), and capturing phrase structure — so they act as relation detectors the model learns without explicit syntax labels.
- A tiny runnable demo implements scaled dot-product attention for one attention head using only the standard library, computing attention weights from Q and K, applying scaled softmax, and using those weights to mix the values V.
- It blends V into outputs by computing similarity via Q·K^T (scaled by sqrt(d_k) and normalized with softmax to produce attention weights) and combining them as a weighted sum (weights · V).
- This is a hands-on code example showing a single encoder block forward pass implemented in stdlib-only Python that fakes multi-head attention with just one head, includes a feed-forward network and residual connections plus layer norm, and is not a full Transformer but mirrors how data flows in a single encoder layer.


