One Pager Cheat Sheet
- Deep learning learns complex input→output mappings by stacking layers of simple units into a **neural network** that is a composable function (e.g., `output = layer_L(...layer_2(layer_1(input)))`), where many layers provide depth and the model’s numeric knobs—the **weights** and **biases**—are tuned to minimize a **loss**.
- **Machine learning (ML)** learns patterns from data, **representation learning** learns useful features automatically, and **deep learning (DL)** is representation learning with many layers of differentiable transformations that excels on large datasets, high-dimensional inputs (images, audio, text), and when end-to-end learning is required.
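A minimal sketch of the composed-function view, with two hypothetical toy layers standing in for learned ones:

```python
# A tiny network is just function composition: output = layer_2(layer_1(input)).
def layer_1(x):
    return 2 * x + 1          # a linear map

def layer_2(x):
    return max(0, x)          # a ReLU non-linearity

def network(x):
    return layer_2(layer_1(x))

print(network(3))    # max(0, 2*3 + 1) = 7
print(network(-2))   # max(0, -3) = 0
```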
- A **perceptron** is a mathematical model of a biological neuron that originally produced a **binary output** via `y = step(w·x + b)`, while modern neurons compute `z = w·x + b` then `a = φ(z)` with an **activation function** (e.g., **ReLU**, **sigmoid**, **tanh**), and stacking layers of such neurons produces neural networks.
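A sketch of both variants under the definitions above; the weights `w`, bias `b`, and input `x` are arbitrary illustrative values:

```python
import math

def step(z):
    return 1 if z > 0 else 0            # classic perceptron: binary output

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def neuron(w, b, x, phi):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b   # z = w·x + b
    return phi(z)                                   # a = phi(z)

w, b, x = [0.5, -0.4], 0.1, [1.0, 2.0]
print(neuron(w, b, x, step))     # 0   (z = -0.2 is not > 0)
print(neuron(w, b, x, relu))     # 0.0
print(neuron(w, b, x, sigmoid))  # ~0.45
```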
- Because the composition of linear maps is itself a linear map, stacking layers that compute `z = W x + b` (with identity activations) simply collapses to a single equivalent layer with `W_eq = W^(L) ... W^(1)` and a combined bias, so without a non-linear activation (e.g., **ReLU**, **sigmoid**, **tanh**) depth does not increase a network's representational power and cannot produce non-linear decision boundaries.
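A small numerical check of the collapse, assuming hand-picked 2×2 matrices and pure-Python helpers in place of a linear-algebra library:

```python
# Two stacked linear layers collapse to one: f2(f1(x)) = (W2 W1) x + (W2 b1 + b2).
def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

W1, b1 = [[1, 2], [0, 1]], [1, -1]
W2, b2 = [[2, 0], [1, 1]], [0, 3]
x = [3, 4]

# Layer by layer, with identity activations
h = [hi + bi for hi, bi in zip(matvec(W1, x), b1)]
y_stacked = [yi + bi for yi, bi in zip(matvec(W2, h), b2)]

# Single equivalent layer: W_eq = W2 W1, b_eq = W2 b1 + b2
W_eq = matmul(W2, W1)
b_eq = [yi + bi for yi, bi in zip(matvec(W2, b1), b2)]
y_collapsed = [yi + bi for yi, bi in zip(matvec(W_eq, x), b_eq)]

print(y_stacked == y_collapsed)  # True: depth without non-linearity adds nothing
```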
- Neural networks learn **weights** (`W`) and **biases** (`b`)—the parameters—apply an **activation function** `φ` (e.g., `ReLU(x) = max(0, x)`) to add non-linearity, measure error with a **loss** (e.g., **MSE** or **cross-entropy**), compute the **gradient** (partial derivatives) as the direction in which to improve the parameters, and update with **gradient descent** via the rule `θ ← θ − η ∇θ L`, where the **learning rate** `η` sets the step size.
- A tiny implementation of a single neuron with **ReLU** activation can be trained with **gradient descent** to learn `y ≈ 2*x + 1` on synthetic data using the standard library only; a sketch follows below.
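A minimal sketch of such a neuron plus training loop; the data range, learning rate, and epoch count are illustrative choices, not the lesson's exact values:

```python
import random

random.seed(0)
# Synthetic data: y = 2x + 1, with positive x so the ReLU stays in its active region.
xs = [random.uniform(0.1, 2.0) for _ in range(50)]
data = [(x, 2 * x + 1) for x in xs]

w, b, lr = random.uniform(0, 1), 0.0, 0.05
for epoch in range(200):
    for x, y in data:
        z = w * x + b
        y_hat = max(0.0, z)                                  # forward: ReLU(w*x + b)
        grad_z = 2 * (y_hat - y) * (1.0 if z > 0 else 0.0)   # backward through MSE and ReLU
        w -= lr * grad_z * x                                 # update: theta <- theta - lr * grad
        b -= lr * grad_z

print(round(w, 2), round(b, 2))   # should approach 2 and 1
```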
- ReLU `f(x) = max(0, x)` is **differentiable** for `x ≠ 0` but not differentiable at 0 because the **left-hand derivative** is 0 and the **right-hand derivative** is 1, though it is continuous at 0, has **subgradients** in `[0, 1]` there, and is therefore almost-everywhere differentiable, so gradient-based training remains practical.
- The core training loop—**Forward** (compute predictions), **Loss** (measure error), **Backward** (compute gradients via **backpropagation**), and **Update** (adjust weights using **gradient descent** or other optimizers)—repeats many times to reduce error and improve the model.
- The steps must occur in order: forward pass to compute `y_hat` and cache activations, then compute the loss to get a scalar `L(y_hat, y)`, then backpropagate gradients to obtain `∂L/∂θ`, and finally update parameters with an **optimizer** (e.g., SGD), because each step depends on the previous step's outputs.
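The ordering in miniature, on a hypothetical one-parameter model `y_hat = w * x`, so each phase's dependency on the previous one is visible:

```python
# The four phases in their required order: forward -> loss -> backward -> update.
w, lr = 0.0, 0.1
dataset = [(1.0, 3.0), (2.0, 6.0)]          # learn w = 3
for step in range(100):
    for x, y in dataset:
        y_hat = w * x                       # 1) forward: needs the current parameters
        loss = (y_hat - y) ** 2             # 2) loss: needs y_hat from the forward pass
        grad_w = 2 * (y_hat - y) * x        # 3) backward: needs the loss (and cached x)
        w -= lr * grad_w                    # 4) update: needs the gradient
print(round(w, 3))                          # ~3.0
```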
- This is a minimal implementation of a Two-Layer Network: a **2-layer MLP** performing **binary classification** on a toy dataset using the standard library only.
- The missing word is **softmax**, a mapping from raw **logits** via `p_i = exp(z_i) / sum_j exp(z_j)` that produces non-negative outputs which sum to 1 (forming a proper probability distribution), preserves ordering (so the **argmax** is unchanged), is invariant to additive constants (enabling **numerical stability** by subtracting the max), supports **temperature** scaling to control peakiness (→ one-hot as temp→0, uniform as temp→∞), reduces to the **sigmoid** for two classes, and has Jacobian `∂p_i/∂z_j = p_i(δ_ij − p_j)`, which with **cross-entropy** and a **one-hot** target yields the simple gradient `p − y`.
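A sketch of a numerically stable softmax with optional temperature, following the formulas above:

```python
import math

def softmax(z, temp=1.0):
    # Subtracting the max is safe because softmax is invariant to additive constants.
    m = max(z)
    exps = [math.exp((zi - m) / temp) for zi in z]
    s = sum(exps)
    return [e / s for e in exps]

z = [2.0, 1.0, 0.1]
print(softmax(z))               # non-negative, sums to 1, argmax preserved
print(softmax(z, temp=0.1))     # low temperature: near one-hot
print(softmax(z, temp=100.0))   # high temperature: near uniform
```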
- Multiclass heads compute a vector of **logits** `z ∈ ℝ^K` for `K` classes, convert them to probabilities with softmax, `softmax(z)_k = e^{z_k} / Σ_j e^{z_j}`, and optimize using the cross-entropy loss `L = − Σ_k y_k log(softmax(z)_k)`, where `y` is a one-hot label.
- The pipeline `Linear → softmax → cross-entropy` is standard because the final **Linear** produces unconstrained real-valued **logits** that **softmax** turns into a probability distribution, and **cross-entropy** (the negative log-likelihood) trains those probabilities with simple, stable gradients (`∂L/∂z = p − y`), a clear probabilistic interpretation, and numerically stable fused implementations, while for multi-label problems one should instead use **sigmoid** + **binary cross-entropy**.
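A quick check of the `p − y` gradient, comparing the analytic formula against a finite-difference approximation on arbitrary logits:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(zi - m) for zi in z]
    s = sum(e)
    return [ei / s for ei in e]

def cross_entropy(p, y):
    # y is one-hot, so this is the negative log-likelihood of the true class.
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p))

z = [1.5, -0.3, 0.8]
y = [1, 0, 0]
p = softmax(z)
analytic = [pi - yi for pi, yi in zip(p, y)]    # dL/dz = p - y

# Finite-difference approximation of dL/dz_0
eps = 1e-6
z_plus = [z[0] + eps, z[1], z[2]]
numeric = (cross_entropy(softmax(z_plus), y) - cross_entropy(p, y)) / eps
print(round(analytic[0], 5), round(numeric, 5))  # nearly identical
```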
- Overfitting (low training loss, high validation loss) versus underfitting (high training and validation loss): regularization aims to improve generalization using techniques like **L2** (weight decay), **early stopping**, **dropout**, and **data augmentation**.
- Add L2 Weight Decay: illustrates adding an **L2** penalty to the **loss** inside the training loop.
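One way such a penalty might look in a scalar training loop; `lam` (the strength `lambda`) and the other constants are illustrative:

```python
# MSE plus an L2 penalty on a single weight; lam controls the decay strength.
w, b, lr, lam = 5.0, 0.0, 0.1, 0.01
data = [(1.0, 2.0), (2.0, 4.0)]                   # underlying rule: y = 2x
for _ in range(100):
    for x, y in data:
        y_hat = w * x + b
        # loss = (y_hat - y)^2 + lam * w^2, so d(loss)/dw gains a 2*lam*w term
        grad_w = 2 * (y_hat - y) * x + 2 * lam * w
        grad_b = 2 * (y_hat - y)
        w -= lr * grad_w
        b -= lr * grad_b
print(round(w, 3), round(b, 3))   # w is pulled slightly below 2; b compensates
```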
- The statement is true: unlike **RNN/LSTM** models that use recurrence, the **transformer** uses **self-attention**—computing **queries**, **keys**, and **values** and combining them via `softmax(Q K^T / sqrt(d_k)) V`—so each layer yields direct, learnable, parallel connections between all positions (thereby eliminating recurrence, providing a short path length for dependencies, and enabling parallel processing across sequence positions), while practical additions like **positional encoding**, **multi-head attention**, and **masked attention** supply order information, richer relations, and autoregressive causality, at the cost of an O(n^2) trade-off in memory and compute.
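A single-head, unmasked sketch of scaled dot-product attention on toy 2-dimensional queries, keys, and values (no positional encoding or multi-head logic):

```python
import math

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V for one head, no mask.
    d_k = len(K[0])
    out = []
    for q in Q:   # every query attends to every key: the O(n^2) cost
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Three positions, d_k = 2: each output row is a weighted mix of all value vectors.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
for row in attention(Q, K, V):
    print([round(v, 3) for v in row])
```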
- For problems with a tiny dataset and easily engineered features, try simpler **ML** (e.g., **linear** or **tree-based** models); when you need perfect interpretability or strict guarantees, **DL** is hard to justify; and with low compute or tight latency constraints, a smaller model is preferable—start simple and scale up when the problem/data demands it.
- This provides a minimal **2-layer MLP** that implements the **XOR** function using the standard library only.
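One way to realize XOR with two layers, using hand-picked rather than learned weights (training would find an equivalent solution):

```python
def step(z):
    return 1 if z > 0 else 0

def xor_mlp(x1, x2):
    # Hidden layer: one unit computes OR, the other AND (hand-picked thresholds).
    h_or  = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    # Output layer: OR and not AND, i.e. XOR; impossible for a single linear layer.
    return step(h_or - h_and - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_mlp(a, b))   # 0, 1, 1, 0
```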
- Training cost grows with data size, model size, and sequence/image resolution; **batch size** (samples per gradient step) and **epoch** (one full pass over the data) affect memory and training dynamics, and while the typical accelerators are **GPUs/TPUs**, conceptually you only need the underlying math.
- Neural nets learn what they see, so to mitigate biased training data you should perform **dataset curation** and **evaluation on diverse slices**, use explainability tools such as **feature attributions** and **probes** to audit behavior, and adopt safety measures like **rate limits**, **human review**, and **domain constraints** to avoid harmful outputs.
- Run a sanity check by confirming the model can overfit a tiny subset (e.g., 10 samples); if the loss is not decreasing, lower the **lr** and inspect **gradient** signs/shapes; if the loss explodes, clip **gradients**, reduce the **lr**, and check for **NaNs**; if validation is worse than training, add regularization or gather more data.
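A sketch of one of those fixes, gradient-norm clipping; the gradient list here is hypothetical, standing in for whatever backprop produced:

```python
import math

def clip_by_norm(grads, max_norm=1.0):
    # Rescale the whole gradient vector if its L2 norm exceeds max_norm.
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        return [g * max_norm / norm for g in grads]
    return grads

print(clip_by_norm([3.0, 4.0], max_norm=1.0))  # norm 5 -> rescaled to norm 1
```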
- Because overfitting is primarily a high-variance problem, adding an L2 penalty (a **weight decay** term like `lambda * ||w||^2` that shrinks weights) and using early stopping (monitoring `val_loss` and halting after `patience` epochs without improvement) both primarily reduce variance—the former by constraining parameter magnitudes and the latter by limiting optimization time—and together act complementarily to improve generalization.
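A minimal early-stopping skeleton, assuming a synthetic `val_losses` sequence in place of real validation metrics:

```python
# Stop when val_loss hasn't improved for `patience` consecutive epochs.
val_losses = [0.9, 0.7, 0.6, 0.55, 0.56, 0.57, 0.58]   # synthetic history
best, patience, bad_epochs = float("inf"), 2, 0
for epoch, val_loss in enumerate(val_losses):
    if val_loss < best:
        best, bad_epochs = val_loss, 0       # improvement: reset the counter
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"stopping at epoch {epoch}, best val_loss {best}")
            break
```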
- The correct fill-in is **epoch**: a single pass through the entire training dataset (aka a **pass**), which differs from a **batch/mini-batch** and an **iteration**—one **iteration** updates parameters using one **batch**—and because the number of **epochs** controls how often the model sees the full data, training for too many **epochs** can cause overfitting (mitigate with a **validation set**, **early stopping**, fewer **epochs**, or regularization).
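The batch/iteration/epoch arithmetic in code, with illustrative sizes:

```python
import math

n_samples, batch_size = 1000, 32
iterations_per_epoch = math.ceil(n_samples / batch_size)   # one iteration = one batch
epochs = 5
total_iterations = epochs * iterations_per_epoch
print(iterations_per_epoch, total_iterations)   # 32 batches/epoch, 160 updates total
```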
- The composition of **linear layers** of the form `f(x) = W x + b` is itself a single **linear transformation**—e.g., `f2(f1(x)) = (W2 W1) x + (W2 b1 + b2)`—so stacking layers without **non-linear activations** adds no expressive power, though hidden dimensions can impose a **rank** constraint on the resulting matrix.
- You’ve learned what deep learning is and why it works, implemented **tiny nets** from scratch, and are ready to port them to a proper **framework**—now knowing exactly what the framework is doing under the hood.


