Deep Learning Defined
Deep learning is a way to learn functions by stacking layers of simple units (neurons) so that the whole network can approximate very complex input→output mappings. A neural network is just a composable function: `output = layer_L(...layer_2(layer_1(input)))`.
Why “deep”? Because there are many layers (depth). Why “learning”? Because the network’s numeric knobs (its weights and biases) are tuned to minimize a loss: a number that measures how wrong the network is on your data.
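To make the "composable function" idea concrete, here is a minimal sketch in Python. The three toy layers and the `compose` helper are hypothetical names made up for this illustration; real layers act on vectors and have learned parameters.

```python
from functools import reduce

def compose(*layers):
    """Return a function that applies the given layers from first to last."""
    return lambda x: reduce(lambda value, layer: layer(value), layers, x)

# Hypothetical toy layers acting on a single number (illustration only).
def layer_1(x):
    return 2.0 * x + 1.0      # affine step

def layer_2(x):
    return max(0.0, x)        # ReLU non-linearity

def layer_3(x):
    return -3.0 * x + 0.5     # another affine step

network = compose(layer_1, layer_2, layer_3)
output = network(1.0)         # same as layer_3(layer_2(layer_1(1.0)))
print(output)
```

Calling `network(x)` is exactly the nested call `layer_3(layer_2(layer_1(x)))`; depth just means more functions in the chain.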

Where It Fits
- Machine learning (ML): learn patterns from data.
- Representation learning: learn useful features automatically (instead of hand-crafting them).
- Deep learning (DL): representation learning with many layers of differentiable transformations.
DL shines when you have large datasets, high-dimensional inputs (images, audio, text), and the need for end-to-end learning.
From Perceptron to Neuron
A perceptron is a mathematical model of a biological neuron that takes numerical inputs, applies weights, adds a bias, and uses an activation function to produce a binary output, classifying data into two categories.
The original perceptron computed `y = step(w·x + b)`. Modern neurons do `z = w·x + b`, then `a = φ(z)`, where φ is an activation function (e.g., ReLU, sigmoid, tanh). Stacking many neurons gives you a layer; stacking layers gives you a network.
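Here is a small sketch of the difference between the two, using only the standard library. The weights, bias, and input are made-up values, and the `layer` helper is just one simple way to express "many neurons sharing the same input".

```python
import math

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def step(z):
    return 1 if z >= 0 else 0

def perceptron(w, x, b):
    """Original perceptron: hard threshold on the weighted sum."""
    return step(dot(w, x) + b)

def neuron(w, x, b, phi):
    """Modern neuron: weighted sum z, then an activation phi(z)."""
    z = dot(w, x) + b
    return phi(z)

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(W, x, bs, phi):
    """A layer is several neurons applied to the same input."""
    return [neuron(w, x, b, phi) for w, b in zip(W, bs)]

# Made-up numbers: z = 0.5*2.0 + (-1.0)*1.0 + 0.25 = 0.25
w, b = [0.5, -1.0], 0.25
x = [2.0, 1.0]

print(perceptron(w, x, b))        # binary decision
print(neuron(w, x, b, relu))      # graded output
print(neuron(w, x, b, sigmoid))   # value between 0 and 1
print(layer([w, [1.0, 1.0]], x, [b, 0.0], math.tanh))
```

The perceptron can only say 0 or 1, while the neuron passes a graded value on to the next layer, which is what makes gradient-based training possible.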

Are you sure you're getting this? Click the correct answer from the options.
Which statement is most accurate?
- Deep learning requires non-differentiable activations to be expressive.
- Deep learning stacks linear layers; without non-linear activations this equals one big linear map.
- Deep learning is a rule-based expert system with no training.
- Deep learning can’t model images.
The Math You Really Need
Here are the mathematical terms at play:
- Weights (W) and biases (b): the parameters we learn.
- Activation function φ: adds non-linearity (e.g., `ReLU(x) = max(0, x)`).
- Loss: a scalar measuring error, e.g., MSE for regression, cross-entropy for classification.
- Gradient: vector of partial derivatives of the loss with respect to the parameters; it tells us how to tweak the parameters to reduce the loss.
- Gradient descent: the update rule `θ ← θ − η ∇θ L` with learning rate η.
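To see the update rule in action, here is a tiny worked example on a made-up one-parameter loss, L(θ) = (θ − 3)², whose gradient is 2(θ − 3). The loss, starting point, and learning rate are assumptions chosen only to keep the arithmetic easy to follow.

```python
def loss(theta):
    return (theta - 3.0) ** 2

def grad(theta):
    # dL/dtheta for the loss above
    return 2.0 * (theta - 3.0)

theta = 0.0   # starting guess
eta = 0.1     # learning rate

for _ in range(100):
    # theta <- theta - eta * gradient
    theta = theta - eta * grad(theta)

print(theta, loss(theta))
```

Each step shrinks the distance to the minimum at θ = 3 by a factor of 0.8 (that is, 1 − 2η), so θ ends up very close to 3. A learning rate that is too large would overshoot instead of converging.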
A Tiny Neuron
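Here is a tiny neuron implementation: a single neuron with ReLU activation, trained with plain gradient descent to learn y ≈ 2*x + 1 on synthetic data, using the standard library only. Because ReLU never outputs a negative value, the neuron can only match targets where 2*x + 1 ≥ 0.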
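import random

def relu(x):
    return x if x > 0 else 0.0

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

def train_single_neuron(epochs=2000, lr=0.01, seed=42):
    random.seed(seed)
    # Generate simple 1D data: y = 2x + 1 + noise
    xs = [random.uniform(-2.0, 2.0) for _ in range(200)]
    ys = [2.0 * x + 1.0 + random.gauss(0, 0.1) for x in xs]
    # Parameters of a 1D neuron: w and b
    w = random.uniform(-1.0, 1.0)
    b = 0.0
    for epoch in range(epochs):
        dw = 0.0
        db = 0.0
        loss = 0.0
        for x, y in zip(xs, ys):
            z = w * x + b
            a = relu(z)
            # Squared error (per sample)
            diff = a - y
            loss += 0.5 * diff * diff
            # The source listing is cut off here. What follows is a plausible
            # completion, not the original code: backprop through ReLU.
            dz = diff * relu_grad(z)
            dw += dz * x
            db += dz
        # One gradient-descent step per epoch, using the averaged gradients
        n = len(xs)
        w -= lr * dw / n
        b -= lr * db / n
        if epoch % 500 == 0:
            print(f"epoch {epoch}: loss={loss / n:.4f} w={w:.3f} b={b:.3f}")
    return w, b

train_single_neuron()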
Let's test your knowledge. Is this statement true or false?
`ReLU(x) = max(0, x)` is differentiable everywhere, including at `x = 0`.
Press true if you believe the statement is correct, or false otherwise.
Forward, Loss, Backprop: The Loop
The Forward, Loss, Backprop loop is the core training process for a neural network: a forward pass makes a prediction, a loss function measures how wrong it is, and backpropagation computes the gradients used to update the model's weights. Repeated over many iterations, this reduces the error and improves future predictions.
- Forward: compute predictions from inputs via layers and activations.
- Loss: compare predictions to targets.
- Backward: compute gradients of loss w.r.t. each parameter (backpropagation).
- Update: adjust parameters with gradient descent (or a fancier optimizer).
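The four steps can be sketched on the simplest possible model, a line `w*x + b` fitted with mean squared error. The function name `train_linear`, the data, and the hyperparameters are assumptions made for this illustration.

```python
def train_linear(xs, ys, lr=0.1, steps=500):
    w, b = 0.0, 0.0
    n = len(xs)
    history = []
    for _ in range(steps):
        # 1. Forward: compute predictions
        preds = [w * x + b for x in xs]
        # 2. Loss: mean of 0.5 * squared error
        loss = sum(0.5 * (p - y) ** 2 for p, y in zip(preds, ys)) / n
        # 3. Backward: gradients of the loss w.r.t. w and b
        dw = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        db = sum((p - y) for p, y in zip(preds, ys)) / n
        # 4. Update: plain gradient descent
        w -= lr * dw
        b -= lr * db
        history.append(loss)
    return w, b, history

# Noise-free toy data generated from y = 3x - 1
xs = [-1.0, 0.0, 1.0, 2.0]
ys = [3.0 * x - 1.0 for x in xs]
w, b, history = train_linear(xs, ys)
print(w, b, history[0], history[-1])
```

On this data the loop recovers values close to w = 3 and b = −1. A deeper network changes what happens inside steps 1 and 3, but the order of the four steps stays the same.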

Are you sure you're getting this? Could you figure out the right sequence for this list?
Put the training steps in the correct order:
Press the below buttons in the order in which they should occur. Click on them again to un-select.
Options:
- Compute loss on predictions
- Update parameters
- Run forward pass
- Backpropagate gradients
Two-Layer Network Implementation
Here is a minimal 2-layer MLP for binary classification on a toy dataset using the standard library only.
import random
import math

def sigmoid(x):
    # Activation for the last layer (outputs a probability)
    return 1.0 / (1.0 + math.exp(-x))

def dsigmoid(y):
    # Derivative of sigmoid, given its output y = sigmoid(x)
    return y * (1.0 - y)

def relu(x):
    return x if x > 0 else 0.0

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, v):
    # W: list of rows, v: vector
    return [dot(row, v) for row in W]

def add(v, b):
    return [x + y for x, y in zip(v, b)]

def outer(u, v):
    # Returns the matrix u * v^T
    return [[ui * vj for vj in v] for ui in u]

# The listing ends here in the source: train_mlp(), which the page
# calls to run the example, is not shown.
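One way helpers like these might be assembled into a training loop is sketched below. This is not the original `train_mlp`: the toy dataset (label 1 when x0 + x1 > 0), the binary cross-entropy loss, the hidden size, and the per-sample updates are all assumptions made for this illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return x if x > 0 else 0.0

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train_mlp(epochs=200, lr=0.1, hidden=8, seed=0):
    rng = random.Random(seed)
    # Assumed toy data: 2D points, label 1 when x0 + x1 > 0
    X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(100)]
    Y = [1.0 if x[0] + x[1] > 0 else 0.0 for x in X]

    # Layer 1: hidden x 2 weights; layer 2: one output unit
    W1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = 0.0

    def forward(x):
        z1 = [dot(row, x) + bj for row, bj in zip(W1, b1)]
        a1 = [relu(z) for z in z1]
        p = sigmoid(dot(W2, a1) + b2)
        return z1, a1, p

    history = []
    for _ in range(epochs):
        total = 0.0
        for x, y in zip(X, Y):
            z1, a1, p = forward(x)
            # Binary cross-entropy, clipped to avoid log(0)
            p_safe = min(max(p, 1e-12), 1.0 - 1e-12)
            total += -(y * math.log(p_safe) + (1 - y) * math.log(1 - p_safe))
            # For sigmoid + cross-entropy, dL/d(output logit) = p - y
            dz2 = p - y
            for j in range(hidden):
                # Use W2[j] before it is updated
                dz1 = dz2 * W2[j] * relu_grad(z1[j])
                W2[j] -= lr * dz2 * a1[j]
                for i in range(2):
                    W1[j][i] -= lr * dz1 * x[i]
                b1[j] -= lr * dz1
            b2 -= lr * dz2
        history.append(total / len(X))

    def predict(x):
        return forward(x)[2]

    return predict, history

predict, history = train_mlp()
print(history[0], history[-1], predict([0.8, 0.8]), predict([-0.8, -0.8]))
```

The average loss per epoch is returned so you can check that it goes down. Exact numbers depend on the seed and the initial weights.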
Build your intuition. Fill in the missing part by typing it in.
A function used to map raw logits to probabilities over multiple classes is called ________. It ensures outputs are non-negative and sum to 1.
Write the missing line below.
Multiclass Heads & Cross-Entropy
For K classes, we compute a vector of logits `z ∈ ℝ^K`, then apply `softmax(z)_k = e^{z_k} / Σ_j e^{z_j}`. Use cross-entropy loss:
`L = − Σ_k y_k log(softmax(z)_k)`
where y is a one-hot label.
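Here is a minimal sketch of both formulas in plain Python. Subtracting the largest logit before exponentiating is a common numerical-stability trick (it does not change the result); the example logits and label are made up.

```python
import math

def softmax(z):
    # Subtract the max logit so exp() never overflows; the ratios are unchanged
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    """L = -sum_k y_k * log(softmax(z)_k) for a one-hot label y."""
    p = softmax(z)
    return -sum(yk * math.log(pk) for yk, pk in zip(y, p) if yk > 0)

logits = [2.0, 1.0, 0.1]   # made-up logits for K = 3 classes
label = [1, 0, 0]          # one-hot: the true class is class 0

probs = softmax(logits)    # roughly [0.66, 0.24, 0.10]
loss = cross_entropy(logits, label)
print(probs, loss)
```

With a one-hot label, the sum collapses to minus the log of the probability assigned to the true class, so the loss is small only when that probability is close to 1.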
Try this exercise. Click the correct answer from the options.
Which combination is typical for multiclass classification?
- `Linear` → `sigmoid` → `MSE`
- `Linear` → `softmax` → `cross-entropy`
- `Linear` → `ReLU` → `hinge loss`
- `Linear` → `tanh` → `MAE`
Regularization & Generalization
- Overfitting: model learns noise; low training loss, high validation loss.
- Underfitting: model too simple; high training and validation loss.
- Regularization: techniques to improve generalization:
  - L2 (weight decay): penalize large weights.
  - Early stopping: stop when validation loss worsens.
  - Dropout: randomly drop units during training (simulated in code by masking).
  - Data augmentation: alter inputs (flips/crops/noise) to create variety.
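Two of these techniques are easy to sketch in a few lines. The dropout below uses the common "inverted" variant, which rescales the kept units so the expected activation stays the same; that rescaling, the `patience` rule for early stopping, and both function names are assumptions for this illustration.

```python
import random

def dropout(activations, p_drop, rng, training=True):
    """Mask units at random during training; do nothing at evaluation time."""
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    # Kept units are scaled by 1/keep (inverted dropout)
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index where training would stop, or None if it never does."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None

rng = random.Random(0)
masked = dropout([1.0] * 1000, p_drop=0.5, rng=rng)
unmasked = dropout([1.0, 2.0], p_drop=0.5, rng=rng, training=False)

# Made-up validation losses: no improvement on 0.7 for three epochs in a row
stop_at = early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.72, 0.9, 0.6], patience=3)
print(sum(1 for a in masked if a == 0.0), unmasked, stop_at)
```

Here training would stop at epoch index 5, before the later improvement to 0.6 is ever seen. That is the trade-off a small `patience` value makes.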
Add L2 Weight Decay
This illustrates adding an L2 penalty to the loss inside the training loop.
# Suppose total_loss already accumulates the data loss:
#   L_total = L_data + (lambda / 2) * sum(W^2)
# W1, W2, dW1, dW2, hidden, input_dim and total_loss come from the
# surrounding training loop (W1 is a matrix, W2 is a vector).

def l2_penalty_mats(mats):
    # Sum of squared entries over a list of matrices
    return sum(w * w for M in mats for row in M for w in row)

def l2_penalty_vecs(vecs):
    # Sum of squared entries of one vector or of a list of vectors
    if vecs and isinstance(vecs[0], list):
        return sum(w * w for v in vecs for w in v)
    return sum(w * w for w in vecs)

# Example inside training, after accumulating gradients:
lam = 1e-3
# Add the penalty to total_loss (W1 and W2 shown)
total_loss += 0.5 * lam * (l2_penalty_mats([W1]) + l2_penalty_vecs(W2))
# When updating grads, add the lam * W terms (weight decay)
for j in range(hidden):
    for i in range(input_dim):
        dW1[j][i] += lam * W1[j][i]
for j in range(hidden):
    dW2[j] += lam * W2[j]
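The fragment above relies on variables from a surrounding training loop. As a self-contained sketch of the same idea, the example below applies the penalty and one update to a small made-up weight matrix; the data gradient is set to zero so that only the decay effect is visible. The names `l2_penalty` and `sgd_step_with_l2` are hypothetical.

```python
def l2_penalty(W):
    # Sum of squared weights of a matrix stored as a list of rows
    return sum(w * w for row in W for w in row)

def sgd_step_with_l2(W, dW, lr, lam):
    """One gradient step where each gradient gets an extra lam * w term."""
    return [[w - lr * (g + lam * w) for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(W, dW)]

W = [[1.0, -2.0], [0.5, 0.0]]   # made-up weights
dW = [[0.0, 0.0], [0.0, 0.0]]   # pretend the data gradient is zero
lam, lr = 1e-3, 0.1
data_loss = 0.25                # made-up data loss

# L_total = L_data + (lambda / 2) * sum(W^2)
total_loss = data_loss + 0.5 * lam * l2_penalty(W)
W_new = sgd_step_with_l2(W, dW, lr, lam)
print(total_loss, W_new)
```

With a zero data gradient, every weight is simply multiplied by `1 - lr * lam`, which is why L2 regularization is also called weight decay.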
Are you sure you're getting this? Is this statement true or false?
Transformers eliminate the need for recurrence by using attention to connect positions in a sequence directly.
Press true if you believe the statement is correct, or false otherwise.
When NOT to Use Deep Learning
- Tiny dataset with easily engineered features? Try simpler ML (like linear or tree-based models).
- Need perfect interpretability or strict guarantees? DL may be harder to justify.
- Low compute budget or latency constraints? A smaller model may be better.
Rule of thumb: start simple, scale up when the problem/data demands it.
Build Your Own MLP
Here's a minimal 2-layer MLP for XOR using standard libraries only.
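function randu(a, b) { return a + (b - a) * Math.random(); }
function sigmoid(x) { return 1 / (1 + Math.exp(-x)); }
function dsigmoid(y) { return y * (1 - y); } // derivative given y = sigmoid(x)
function relu(x) { return x > 0 ? x : 0; }
function reluGrad(x) { return x > 0 ? 1 : 0; }
function matvec(W, v) {
  const out = new Array(W.length).fill(0);
  for (let r = 0; r < W.length; r++) {
    let s = 0;
    for (let c = 0; c < v.length; c++) s += W[r][c] * v[c];
    out[r] = s;
  }
  return out;
}
function addv(a, b) { return a.map((x, i) => x + b[i]); }
function trainXOR(epochs = 5000, lr = 0.1, hidden = 4) {
  // XOR dataset
  const X = [[0,0],[0,1],[1,0],[1,1]];
  const Y = [0,1,1,0];
  // Params
  const inputDim = 2;
  let W1 = Array.from({length: hidden}, () => Array.from({length: inputDim}, () => randu(-1,1)));
  let b1 = Array.from({length: hidden}, () => 0);
  let W2 = Array.from({length: hidden}, () => randu(-1,1));
  let b2 = 0;
  // The source listing is cut off here. The rest is an assumed completion
  // (squared-error loss, per-sample updates), not the original code.
  const forward = (x) => {
    const z1 = addv(matvec(W1, x), b1);
    const a1 = z1.map(relu);
    const p = sigmoid(a1.reduce((s, a, j) => s + W2[j] * a, b2));
    return { z1, a1, p };
  };
  for (let epoch = 0; epoch < epochs; epoch++) {
    for (let n = 0; n < X.length; n++) {
      const { z1, a1, p } = forward(X[n]);
      const dz2 = (p - Y[n]) * dsigmoid(p); // gradient at the output pre-activation
      for (let j = 0; j < hidden; j++) {
        const dz1 = dz2 * W2[j] * reluGrad(z1[j]); // read W2[j] before updating it
        W2[j] -= lr * dz2 * a1[j];
        for (let i = 0; i < inputDim; i++) W1[j][i] -= lr * dz1 * X[n][i];
        b1[j] -= lr * dz1;
      }
      b2 -= lr * dz2;
    }
  }
  return X.map((x) => forward(x).p); // predicted probability for each XOR input
}
console.log(trainXOR());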
Hardware and Complexity
- Training cost grows with data size, model size, and sequence/image resolution.
- Batch size: how many samples per gradient step. Larger batches use more memory.
- Epoch: one full pass over training data.
- Typical accelerators: GPUs/TPUs; but conceptually all you need is the math we wrote.
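The relationship between batch size and epoch can be sketched with a small generator. The `iterate_minibatches` helper and the ten-item dataset are hypothetical, chosen so the counts are easy to check by hand.

```python
import random

def iterate_minibatches(data, batch_size, rng):
    """Yield shuffled mini-batches that together cover the data exactly once."""
    indices = list(range(len(data)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

data = list(range(10))        # stand-in for 10 training samples
rng = random.Random(0)

steps_per_epoch = 0
seen = []
for batch in iterate_minibatches(data, batch_size=4, rng=rng):
    steps_per_epoch += 1      # one gradient step per batch
    seen.extend(batch)

print(steps_per_epoch, sorted(seen) == data)
```

Ten samples with a batch size of 4 give three gradient steps per epoch (batches of 4, 4, and 2). A larger batch means fewer steps per epoch but more samples held in memory at once.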
Ethics, Safety, and Bias
Neural nets learn what they see. If training data is biased, the model may be biased. Key ideas:
- Dataset curation and evaluation on diverse slices.
- Explainability tools (feature attributions, probes) to audit behavior.
- Safety: avoid harmful outputs; consider rate limits, human review, domain constraints.
Quick Debugging Playbook
- Sanity check: can the model overfit a tiny subset (e.g., 10 samples)?
- Loss not decreasing? Lower lr, check gradient signs and shapes.
- Exploding loss? Clip gradients, reduce lr, check for NaNs.
- Validation worse than training? Add regularization or more data.
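Two of these checks are easy to automate. The sketch below compares a hand-derived gradient against a finite-difference estimate (one way to "check gradient signs") and clips a gradient by its norm. The one-sample linear loss and all function names are assumptions for this illustration.

```python
import math

def numerical_grad(f, params, eps=1e-5):
    """Central finite-difference estimate of the gradient of f at params."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    return [g * max_norm / norm for g in grads]

# Made-up single sample for the model w*x + b with loss 0.5 * (w*x + b - y)^2
x, y = 2.0, 1.0

def loss(params):
    w, b = params
    return 0.5 * (w * x + b - y) ** 2

def analytic_grad(params):
    w, b = params
    diff = w * x + b - y
    return [diff * x, diff]

params = [0.5, -0.5]
numeric = numerical_grad(loss, params)
analytic = analytic_grad(params)
clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)
print(numeric, analytic, clipped)
```

If the two gradients disagree in sign or by more than a small tolerance, the backward pass is the first place to look for a bug.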
Try this exercise. Click the correct answer from the options.
Which change most directly combats overfitting?
- Increase learning rate dramatically
- Add L2 penalty and use early stopping
- Remove validation set
- Train forever
Are you sure you're getting this? Fill in the missing part by typing it in.
A single run through the entire training dataset is called an ________.
Write the missing line below.
Are you sure you're getting this? Is this statement true or false?
Without non-linear activations, stacking multiple linear layers is equivalent to a single linear transformation.
Press true if you believe the statement is correct, or false otherwise.
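If you want to test a claim like this numerically, a sketch along the following lines works. It builds two affine layers with made-up weights and compares them against the single layer obtained by multiplying the matrices; `matmul`, `two_layers`, and `one_layer` are hypothetical helper names.

```python
def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def addv(a, b):
    return [x + y for x, y in zip(a, b)]

# Made-up parameters: layer 1 maps 2 -> 3, layer 2 maps 3 -> 2
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
b1 = [1.0, 0.0, -1.0]
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]
b2 = [0.5, -0.5]

def two_layers(x):
    # No activation between the layers
    h = addv(matvec(W1, x), b1)
    return addv(matvec(W2, h), b2)

# Candidate single layer: W = W2 W1, b = W2 b1 + b2
W = matmul(W2, W1)
b = addv(matvec(W2, b1), b2)

def one_layer(x):
    return addv(matvec(W, x), b)

x = [2.0, -1.0]
print(two_layers(x), one_layer(x))
```

Putting a ReLU between the two layers changes the comparison, because the combined function is then no longer of the form `W x + b`.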
You’ve seen what deep learning is, why it works, and you’ve implemented tiny nets from scratch. When you’re ready, port these to a proper framework—but now you’ll know exactly what the framework is doing under the hood.
One Pager Cheat Sheet
- Deep learning approximates complex input-output mappings by stacking many layers (depth) of simple units (neurons) in a neural network and tuning the network's numeric knobs (weights and biases) to minimize a loss function.
- Deep learning (DL) is a subset of machine learning (ML) that focuses on representation learning with many layers of differentiable transformations; it excels with large datasets, high-dimensional inputs, and the need for end-to-end learning.
- From Perceptron to Neuron: The perceptron, a mathematical model of a biological neuron, evolved into modern neurons that compute `z = w·x + b`, apply an activation function `φ(z)`, and can be stacked to create layers and networks in deep learning.
- Deep learning stacks linear layers; without non-linear activations this equals one big linear map.
- In neural networks, the key terms include weights (W), biases (b), activation function φ, loss functions like MSE and cross-entropy, gradients, and gradient descent with a learning rate η.
- Tiny neuron implemented with ReLU activation, trained using plain gradient descent to learn y ≈ 2*x + 1 on synthetic data using only the standard library.
- The statement is false: ReLU is not differentiable at 0, because the left derivative (0) and the right derivative (1) disagree at that point.
- The Forward, Loss, Backprop loop is the core training process for a neural network: a forward pass predicts, a loss function measures the error, backpropagation computes gradients, and gradient descent uses them to adjust the weights and improve future predictions.
- In the training loop, the forward pass computes predictions, the loss is calculated on those predictions, backpropagation is used to compute gradients, and finally parameters are updated, each step depending on the results of the previous one.
- Two-layer MLP implemented for binary classification on a toy dataset using the standard library only.
- The softmax function converts raw logits to a probability distribution by exponentiating and normalizing values, making it a "soft" argmax that is differentiable and pairs well with cross-entropy loss for multi-class classification training.
- Multiclass heads involve computing a vector of logits for K classes, applying the softmax function to obtain probabilities, and calculating the cross-entropy loss L using the formula `L = − Σ_k y_k log(softmax(z)_k)` with y as a one-hot label.
- In multiclass classification, the typical approach involves a Linear layer for class logits, followed by a softmax function to generate a probability distribution, and cross-entropy loss, which maximizes the likelihood of the true class.
- Regularization techniques like L2 weight decay, early stopping, dropout, and data augmentation are used to prevent overfitting and improve the generalization of machine learning models.
- L2 Weight Decay is added as an L2 penalty to the loss function within the training loop.
- Self-attention in transformers allows for direct connections between positions, enabling parallel computation and modeling of long-range dependencies without the need for recurrent steps.
- When NOT to Use Deep Learning: For a tiny dataset with easily engineered features, a need for perfect interpretability or strict guarantees, or a low compute budget or latency constraints, consider simpler ML models (like linear or tree-based) and start simple, scaling up only when necessary.
- Build Your Own MLP for XOR using a 2-layer MLP with standard libraries only.
- Training cost increases with data size, model size, and sequence/image resolution; batch size sets how many samples go into each gradient step (larger batches use more memory), an epoch is one full pass over the training data, and GPUs/TPUs are the typical accelerators.
- Neural nets learn from data and may exhibit bias if the training data is biased, requiring dataset curation, evaluation on diverse slices, and explainability tools such as feature attributions and probes to audit behavior while also ensuring safety through measures like rate limits, human review, and domain constraints.
- The Quick Debugging Playbook includes steps like checking that the model can overfit a tiny subset, lowering lr and checking gradients if loss isn't decreasing, clipping gradients for exploding loss, and adding regularization or more data when validation performs worse than training.
- Overfitting occurs when a model fits training data too closely but fails to generalize; an L2 penalty combats it by shrinking weights to reduce complexity, and early stopping by halting training before the model memorizes noise, both directly reducing the training/validation gap.
- The correct term is epoch, referring to a single pass through the entire training dataset; training usually runs for multiple epochs to refine the parameters.
- The composition of linear maps is itself a linear map, allowing multiple linear layers without activation functions to be collapsed into a single linear transformation.
- You have gained an understanding of deep learning, implemented small networks from scratch, and are prepared to transfer them to a framework where you will have a clear understanding of the underlying operations.