Multi-Head Attention
Instead of performing attention once, the Transformer runs it h times in parallel, each with its own learned projections. Each of these parallel attention computations is called a head.
Steps:
- Linearly project the inputs into multiple smaller (d_k, d_v) spaces.
- Run scaled dot-product attention independently in each head.
- Concatenate all heads’ outputs.
- Project back to d_model.
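
Putting the steps together, here is a minimal NumPy sketch of multi-head self-attention. The function and variable names (multi_head_attention, W_q, W_k, W_v, W_o) and the toy sizes (d_model = 8, h = 2) are illustrative assumptions, not reference code; it assumes d_v = d_k = d_model / h, as in the original Transformer.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # (h, seq, seq)
    return softmax(scores) @ V                      # (h, seq, d_k)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    seq_len, d_model = X.shape
    d_k = d_model // h  # assumes d_v = d_k = d_model / h
    # 1. Project inputs into h smaller (d_k, d_v) spaces.
    Q = (X @ W_q).reshape(seq_len, h, d_k).transpose(1, 0, 2)  # (h, seq, d_k)
    K = (X @ W_k).reshape(seq_len, h, d_k).transpose(1, 0, 2)
    V = (X @ W_v).reshape(seq_len, h, d_k).transpose(1, 0, 2)
    # 2. Run scaled dot-product attention independently in each head.
    heads = scaled_dot_product_attention(Q, K, V)               # (h, seq, d_k)
    # 3. Concatenate all heads' outputs.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model) # (seq, h*d_k)
    # 4. Project back to d_model.
    return concat @ W_o                                         # (seq, d_model)

# Toy usage: 4 tokens, d_model = 8, h = 2 heads (illustrative sizes only).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, h=2)
print(out.shape)  # (4, 8) — same shape as the input, so layers can be stacked
```

Note that the output has the same shape as the input, which is what lets attention layers stack: the final W_o projection maps the concatenated heads back to d_model.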
Why? Different heads can specialize. One head might track subject-verb agreement. Another might track coreference (“its” → which noun?). The model can attend to multiple types of relationships at once.


