
Multi-Head Attention

Instead of computing attention once, the Transformer runs it h times in parallel, each time with different learned projections. Each parallel attention computation is called a head.

Steps:

  • Linearly project the queries, keys, and values into h smaller subspaces of dimension d_k (queries and keys) and d_v (values).
  • Run scaled dot-product attention independently in each head.
  • Concatenate all heads’ outputs.
  • Project the concatenation back to d_model with a final linear layer (see the sketch after this list).

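Here is a minimal NumPy sketch of these steps for a single, unbatched sequence with no masking. The function names and the per-head weight lists (W_q, W_k, W_v) and output projection (W_o) are illustrative choices for this example, not the API of any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # (seq_len, d_v)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    # X: (seq_len, d_model); W_q/W_k/W_v: lists of h per-head projection matrices.
    heads = []
    for i in range(h):
        Q = X @ W_q[i]                   # (seq_len, d_k)
        K = X @ W_k[i]                   # (seq_len, d_k)
        V = X @ W_v[i]                   # (seq_len, d_v)
        heads.append(scaled_dot_product_attention(Q, K, V))
    concat = np.concatenate(heads, axis=-1)  # (seq_len, h * d_v)
    return concat @ W_o                      # project back to (seq_len, d_model)

# Toy usage with random weights: d_model = 8, h = 2, so d_k = d_v = 4.
rng = np.random.default_rng(0)
seq_len, d_model, h = 5, 8, 2
d_k = d_v = d_model // h
X = rng.normal(size=(seq_len, d_model))
W_q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_v)) for _ in range(h)]
W_o = rng.normal(size=(h * d_v, d_model))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
print(out.shape)  # (5, 8): same shape as the input, ready for the next layer
```

Note that each head works in a smaller subspace (d_k = d_v = d_model / h), so the total cost is comparable to a single full-width attention.
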
Why? Different heads can specialize. One head might track subject-verb agreement. Another might track coreference (“its” → which noun?). The model can attend to multiple types of relationships at once.
