Scaled Dot-Product Attention (Math but Friendly)
Attention takes three things:
- Q (“queries”)
- K (“keys”)
- V (“values”)
For each query, we score how well it matches each key (using a dot product), scale it, apply softmax to get weights, and use those weights to mix the values.
The paper’s version is scaled dot-product attention:
```text
Attention(Q, K, V) = softmax( (Q * K^T) / sqrt(d_k) ) * V
```

Where:

- `d_k` is the dimensionality of the keys.
- We divide by `sqrt(d_k)` to prevent very large dot products that would make the softmax too peaky and the gradients tiny.
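To make the formula concrete, here is a minimal sketch in NumPy. The function name, the single-head/unbatched setup, and the shapes are my own assumptions for illustration, not the paper’s reference code.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k) queries, K: (n_k, d_k) keys, V: (n_k, d_v) values.
    d_k = K.shape[-1]
    # Raw similarity scores between each query and each key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)                              # (n_q, n_k)
    # Softmax over the key dimension turns scores into mixing weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value vectors.
    return weights @ V                                           # (n_q, d_v)

# Tiny usage example with made-up numbers.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 queries, d_k = 4
K = rng.normal(size=(3, 4))   # 3 keys
V = rng.normal(size=(3, 8))   # 3 values, d_v = 8
print(scaled_dot_product_attention(Q, K, V).shape)   # (2, 8)
```

Each output row is just a softmax-weighted blend of the value vectors, with the weights decided by how strongly that query matched each key.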


