
Scaled Dot-Product Attention (Math but Friendly)

Attention takes three things:

  • Q (“queries”)
  • K (“keys”)
  • V (“values”)

For each query, we score how well it matches each key (with a dot product), scale those scores, apply a softmax to turn them into weights, and use the weights to take a weighted average of the values.

The paper’s version is scaled dot-product attention:

TEXT
Attention(Q, K, V) = softmax( (Q * K^T) / sqrt(d_k) ) * V
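
To make the formula concrete, here is a minimal NumPy sketch (the function and variable names are ours, not from the paper):

PYTHON
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = K.shape[-1]
    # Raw match score between every query and every key: shape (n_q, n_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key axis (subtract the row max for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value rows: (n_q, d_v)
    return weights @ V

# Tiny usage example: 2 queries attending over 3 key/value pairs
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # d_k = 4
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 5))   # d_v = 5
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 5)

Note that each output row is a convex combination of the rows of V: the weights are non-negative and sum to 1, so a query’s output lives inside the span of the values it attended to.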

Where:

  • d_k is the dimensionality of the keys (and of the queries, which must match it for the dot product to be defined).
  • We divide by sqrt(d_k) because dot products grow with dimension: without the scaling, large scores push the softmax toward a near-one-hot distribution (too peaky), where gradients become tiny. The quick sketch below shows the effect.
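
Why sqrt(d_k) specifically? If the components of a query and a key are independent with zero mean and unit variance, their dot product has variance d_k, so raw scores spread out like sqrt(d_k) as the dimension grows; dividing by sqrt(d_k) keeps that spread near 1. Here is a quick sketch of the effect, assuming random unit-variance inputs (our own illustration, not from the paper):

PYTHON
import numpy as np

rng = np.random.default_rng(0)
for d_k in (4, 64, 1024):
    # 10,000 random query/key pairs with unit-variance components
    q = rng.normal(size=(10_000, d_k))
    k = rng.normal(size=(10_000, d_k))
    raw = (q * k).sum(axis=-1)       # unscaled dot products
    scaled = raw / np.sqrt(d_k)      # the paper's scaling
    print(f"d_k={d_k:5d}  raw std={raw.std():6.1f}  scaled std={scaled.std():.2f}")
# The raw std grows like sqrt(d_k) (~2, ~8, ~32); the scaled std stays near 1.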