A Story to Start: From “Do Math” to “Build a Machine”
Imagine you’re teaching a robot to recognize cats. You don’t write rules like “if pointy ears then cat.” Instead, you build a machine that learns rules from data. Under the hood, that machine is a stack of layers wired into a computational graph. TensorFlow is the toolkit that builds the graph, runs it efficiently on CPUs/GPUs/TPUs, and optimizes the machine using gradients.
TensorFlow = Tensors (multi-dimensional arrays) + Flow (they flow through operations) + Autodiff (automatic differentiation) + Execution engines (graph runtime).

The Big Picture: What TensorFlow Does
- Represents computations as a graph of ops connected by tensors.
- Uses automatic differentiation to compute gradients of a loss with respect to variables.
- Executes efficiently on different devices (CPU/GPU/TPU), scheduling and batching work.
- Provides optimizers (e.g., SGD, Adam) to update model parameters.
The magic: You describe the forward pass (how outputs are computed). TensorFlow builds enough metadata to compute the backward pass (gradients) automatically.

Tensors: The Things That Flow
A tensor is a container for numbers with a shape (like [batch, height, width, channels] for images).
- rank: number of dimensions (scalar=0D, vector=1D, matrix=2D, etc.).
- dtype: numeric type (float32, int32, …).
Mental model: a tensor is a spreadsheet that can have many tabs (dimensions). Ops know how to broadcast and combine them.
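A quick illustration (a minimal sketch, assuming TensorFlow 2 is installed and imported as tf): inspecting the shape, rank, and dtype of a couple of tensors.

import tensorflow as tf

image_batch = tf.zeros([32, 64, 64, 3])   # [batch, height, width, channels]
print(image_batch.shape)                  # (32, 64, 64, 3)
print(len(image_batch.shape))             # rank 4
print(image_batch.dtype)                  # <dtype: 'float32'>

v = tf.constant([1, 2, 3])                # rank-1 tensor; dtype inferred as int32
print(v.shape, v.dtype)                   # (3,) <dtype: 'int32'>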

Graphs and Ops: Blueprint vs. Execution
A computational graph is a directed acyclic graph (DAG) where nodes are ops (like MatMul, Add, Conv2D) and edges carry tensors. In TensorFlow 2, you usually write imperative Python (eager mode) and optionally decorate functions with tf.function to trace them into graphs for speed.
Think: you sketch a blueprint (graph). The TensorFlow runtime is a construction crew that reads the blueprint and builds results quickly, using all the hardware lanes.
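To make the blueprint concrete, here is a minimal sketch (assuming TensorFlow 2): the same tiny computation run eagerly op-by-op, then traced into a reusable graph with tf.function.

import tensorflow as tf

x = tf.constant([[1.0, 2.0]])            # a 1x2 tensor flows into the ops below
w = tf.constant([[3.0], [4.0]])          # a 2x1 tensor
y = tf.add(tf.matmul(x, w), 5.0)         # nodes: MatMul -> Add; edges carry tensors
print(y.numpy())                         # [[16.]]  (1*3 + 2*4 + 5)

@tf.function                             # trace the same Python into a reusable graph
def affine(x, w, b):
    return tf.matmul(x, w) + b

print(affine(x, w, 5.0).numpy())         # same result, executed from the traced graph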
Autodiff: Why Gradients Matter
Automatic differentiation (autodiff) uses the chain rule to compute how changing each weight will change the loss. Forward: compute predictions. Backward: propagate ∂loss/∂node from outputs to inputs, accumulating gradients.
Without gradients, your model can’t learn. With them, an optimizer updates variables:
w ← w − η * ∂L/∂w
where η is the learning rate.
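Here is that update done with TensorFlow’s GradientTape (a minimal sketch, assuming TensorFlow 2; the numbers are illustrative):

import tensorflow as tf

w = tf.Variable(3.0)
x, y_true = tf.constant(2.0), tf.constant(10.0)

with tf.GradientTape() as tape:          # forward: record how the loss was computed
    y_pred = w * x
    loss = (y_pred - y_true) ** 2

grad = tape.gradient(loss, w)            # backward: dL/dw = 2*(y_pred - y_true)*x = -16.0
w.assign_sub(0.1 * grad)                 # w ← w − η * dL/dw
print(w.numpy())                         # 4.6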

Hands-On Analogy: Build a Mini Autodiff Engine
Using raw Python (no libraries), we’ll mirror TensorFlow’s core idea: a Tensor that remembers how it was made, so we can backprop. The minimal scalar autodiff engine below is not TensorFlow, but it rhymes with its GradientTape: track ops → compute loss → call backward() → update weights.
class Tensor:
    def __init__(self, data, parents=(), op="leaf"):
        self.data = data                # a plain Python number (scalar engine for clarity)
        self.grad = 0.0                 # dL/dThis
        self.parents = parents          # upstream nodes
        self.op = op                    # for debug
        self._backward = lambda: None   # leaves have nothing to propagate
    def __add__(self, other):
        other = other if isinstance(other, Tensor) else Tensor(other)
        out = Tensor(self.data + other.data, (self, other), "add")
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out
    def __mul__(self, other):
        other = other if isinstance(other, Tensor) else Tensor(other)
        out = Tensor(self.data * other.data, (self, other), "mul")
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out
    def relu(self):
        out = Tensor(self.data if self.data > 0 else 0.0, (self,), "relu")
        def _backward():
            self.grad += (1.0 if self.data > 0 else 0.0) * out.grad
        out._backward = _backward
        return out
    def backward(self):
        # topological order: visit parents before children, then run _backward in reverse
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for p in node.parents:
                    visit(p)
                order.append(node)
        visit(self)
        self.grad = 1.0                 # dL/dL = 1
        for node in reversed(order):
            node._backward()

# tiny demo: fit pred = w*x + b to the point (x=3, y=7) with a squared-error loss
w, b = Tensor(0.0), Tensor(0.0)
for step in range(100):
    w.grad, b.grad = 0.0, 0.0
    pred = w * 3.0 + b
    err = pred + (-7.0)                 # pred - y
    loss = err * err
    loss.backward()
    w.data -= 0.01 * w.grad             # gradient descent: w ← w − η * dL/dw
    b.data -= 0.01 * b.grad
    if step % 20 == 0:
        print(f"step={step} loss={loss.data:.4f} w={w.data:.3f} b={b.data:.3f}")
Try this exercise. Click the correct answer from the options.
Which ingredient is essential for automatic differentiation to work?
Click the option that best answers the question.
- Storing how each tensor was computed (the graph)
- Randomly updating weights without gradients
- Running the backward pass before the forward pass
Eager vs Graph: Two Modes, One Goal
- Eager execution: ops run immediately (great for debugging).
- Graph mode: trace Python functions into a static graph for optimizations like kernel fusion, device placement, and parallelism.
Think: sketch first (eager), then commit blueprints for speed (graph). TF 2 encourages eager by default, with optional graph compilation for production.
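A small sketch of the two modes (assuming TensorFlow 2): the Python body runs once while tf.function traces it; later calls with the same input signature reuse the cached graph.

import tensorflow as tf

@tf.function
def square(x):
    print("tracing")                     # plain Python side effect: runs only during tracing
    return x * x

square(tf.constant(2.0))                 # prints "tracing", then runs the traced graph
square(tf.constant(3.0))                 # same signature: cached graph is reused, no print
print(square.get_concrete_function(tf.constant(1.0)).graph)   # the underlying graph object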

Devices & Kernels: How It Gets Fast
A kernel is the device-specific implementation of an op. TensorFlow schedules ops on devices:
- CPU: general-purpose; great for control-heavy tasks.
- GPU: massive parallel math (matmuls, convs).
- TPU: matrix math ASIC for large-scale training.
TensorFlow picks placements, manages memory copies, and runs kernels in parallel streams when possible.
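If you want to see or influence placement, tf.device works as a hint (a sketch, assuming TensorFlow 2; with no GPU available, everything lands on the CPU anyway):

import tensorflow as tf

print(tf.config.list_physical_devices())    # the devices the runtime can see

with tf.device("/CPU:0"):                   # explicit placement; normally TF decides for you
    a = tf.random.normal([1024, 1024])
    b = tf.random.normal([1024, 1024])
    c = tf.matmul(a, b)                     # this op's kernel runs on the CPU
print(c.device)                             # e.g. ".../device:CPU:0"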
Shape & Dtype: Contracts Between Ops
Every tensor has a shape and dtype. Mismatches cause errors or implicit broadcasting. TensorFlow validates compatibility so MatMul doesn’t try to multiply [3,4] by [5,6].
Rule of thumb: when something breaks, check shapes first, then types.
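A quick shape sanity check (a sketch, assuming TensorFlow 2; the exact exception type can vary by version, so the except clause is deliberately broad):

import tensorflow as tf

a = tf.zeros([3, 4])
b = tf.zeros([4, 2])
print(tf.matmul(a, b).shape)                # (3, 2): inner dimensions agree (4 and 4)

row = tf.constant([1.0, 2.0, 3.0, 4.0])     # shape (4,)
print((a + row).shape)                      # (3, 4): the row broadcasts over 3 rows

try:
    tf.matmul(a, tf.zeros([5, 6]))          # inner dims 4 vs 5: rejected
except (tf.errors.InvalidArgumentError, ValueError) as e:
    print("shape mismatch:", type(e).__name__)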
Optimizers: How Variables Learn
Variables are tensors you can update. Optimizers compute updates from gradients:
- SGD: w ← w − η g
- Momentum: adds velocity to smooth updates
- Adam: adaptive learning rates per parameter (mean + variance estimates)
Under the hood, these are additional ops in the graph that read gradients and write new values.
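In code, that loop is simply: compute gradients, hand them to the optimizer (a minimal sketch, assuming TensorFlow 2):

import tensorflow as tf

w = tf.Variable([2.0, -1.0])
opt = tf.keras.optimizers.SGD(learning_rate=0.1)

with tf.GradientTape() as tape:
    loss = tf.reduce_sum(w ** 2)            # simple convex loss, minimum at w = 0

grads = tape.gradient(loss, [w])
opt.apply_gradients(zip(grads, [w]))        # w ← w − 0.1 * 2w
print(w.numpy())                            # [ 1.6 -0.8]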
Are you sure you're getting this? Is this statement true or false?
In TensorFlow, variables are the trainable parameters whose values persist across steps.
Press true if you believe the statement is correct, or false otherwise.
Mini-Model From Scratch: Two-Layer MLP
Here is a two-layer MLP for binary classification, implemented on toy data with no external libraries. We’ll mimic TensorFlow logic: forward pass + backward gradients + updates. (This helps you understand what TF automates.)
TensorFlow would: (1) define layers, (2) run forward, (3) use a GradientTape to get grads, (4) call optimizer.apply_gradients. Everything else (placement, kernels, shapes) comes “for free.”
import random, math
def sigmoid(x): return 1/(1+math.exp(-x))
def dsigmoid(y): return y*(1-y)  # unused below: sigmoid + cross-entropy simplifies to p - y
def relu(x): return x if x>0 else 0.0
def drelu(x): return 1.0 if x>0 else 0.0
def dot(w, x): return sum(wi*xi for wi,xi in zip(w,x))
def make_blobs(n=200, seed=0):
    random.seed(seed)
    X, Y = [], []
    for _ in range(n//2):
        X.append([random.gauss(-1,0.5), random.gauss(0,0.5)]); Y.append(0.0)
    for _ in range(n//2):
        X.append([random.gauss( 1,0.5), random.gauss(0,0.5)]); Y.append(1.0)
    return X, Y
def train(hidden=8, epochs=800, lr=0.05):
    X, Y = make_blobs()
    in_dim = 2
    W1 = [[random.uniform(-0.5,0.5) for _ in range(in_dim)] for _ in range(hidden)]
    b1 = [0.0]*hidden
    W2 = [random.uniform(-0.5,0.5) for _ in range(hidden)]
    b2 = 0.0
    for ep in range(epochs):
        dW1 = [[0.0]*in_dim for _ in range(hidden)]
        db1 = [0.0]*hidden
        dW2 = [0.0]*hidden
        db2 = 0.0
        loss = 0.0
        for x, y in zip(X, Y):
            # forward pass: ReLU hidden layer, then sigmoid output
            z1 = [dot(W1[j], x) + b1[j] for j in range(hidden)]
            h = [relu(z) for z in z1]
            p = sigmoid(dot(W2, h) + b2)
            loss += -(y*math.log(p+1e-9) + (1-y)*math.log(1-p+1e-9))
            # backward pass: chain rule from the loss back to every parameter
            dlogit = p - y
            db2 += dlogit
            for j in range(hidden):
                dW2[j] += dlogit * h[j]
                dh = dlogit * W2[j] * drelu(z1[j])
                db1[j] += dh
                for i in range(in_dim):
                    dW1[j][i] += dh * x[i]
        # update: w ← w − lr * dL/dw, averaged over the dataset
        n = len(X)
        b2 -= lr * db2 / n
        for j in range(hidden):
            W2[j] -= lr * dW2[j] / n
            b1[j] -= lr * db1[j] / n
            for i in range(in_dim):
                W1[j][i] -= lr * dW1[j][i] / n
        if ep % 100 == 0: print(f"epoch={ep} loss={loss/n:.4f}")
train()
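For comparison, here is roughly how the same training step looks with TensorFlow 2 and Keras (a sketch, not a drop-in equivalent of the code above):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu"),       # hidden layer (built on first call)
    tf.keras.layers.Dense(1, activation="sigmoid"),    # output probability
])
opt = tf.keras.optimizers.SGD(learning_rate=0.05)
loss_fn = tf.keras.losses.BinaryCrossentropy()

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        p = model(x, training=True)                    # forward pass
        loss = loss_fn(y, p)
    grads = tape.gradient(loss, model.trainable_variables)        # backward pass
    opt.apply_gradients(zip(grads, model.trainable_variables))    # parameter update
    return loss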
Let's test your knowledge. Click the correct answer from the options.
Why is eager mode helpful during model development?
Click the option that best answers the question.
- It hides errors until compile time
- It executes ops immediately so you can print tensors and inspect shapes
- It prevents you from building graphs
Input Pipelines: Feeding the Beast
Large models are hungry. tf.data pipelines let you read, shuffle, batch, prefetch, and parallelize input work so your accelerators never starve. Conceptually: a conveyor belt that keeps the GPU fed while the CPU prepares the next batch.
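A minimal pipeline sketch (assuming TensorFlow 2; the data here is synthetic):

import tensorflow as tf

xs = tf.random.normal([1000, 2])
ys = tf.cast(xs[:, 0] > 0, tf.float32)

ds = (tf.data.Dataset.from_tensor_slices((xs, ys))
      .shuffle(1000)                         # randomize example order
      .batch(32)                             # group examples into batches
      .prefetch(tf.data.AUTOTUNE))           # prepare the next batch while the model trains

for batch_x, batch_y in ds.take(1):
    print(batch_x.shape, batch_y.shape)      # (32, 2) (32,)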
Distribution Strategies: Use All the Compute
tf.distribute.Strategy scales training across multiple GPUs/TPUs or machines. It shards your batches, runs replicas in parallel, and reduces gradients correctly (e.g., all-reduce).
Mental model: many workers push identical sleds uphill (same model) on different snow lanes (data shards), then share what they learned at checkpoints.
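A sketch of the most common case, MirroredStrategy for multiple GPUs on one machine (assuming TensorFlow 2; with a single device it still runs, just with one replica):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created in this scope are mirrored on every replica.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
    model.compile(optimizer="sgd", loss="binary_crossentropy")
# model.fit(dataset) would now shard each batch across replicas and all-reduce the gradients.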

Wrap-Up
TensorFlow’s secret sauce is representing math as graphs of tensor ops, then using autodiff and device kernels to train models at scale. You write the math; TensorFlow handles the gradients, speed, and hardware. Once this mental model clicks, using high-level APIs (Keras) feels like calling a well-tuned orchestra: you conduct, it performs.
One Pager Cheat Sheet
- Instead of hand-coding rules, you build a machine that learns rules: a stack of layers organized as a computational graph. TensorFlow builds the graph, runs it efficiently on CPUs/GPUs/TPUs, and optimizes the machine using gradients, combining Tensors, Flow, Autodiff, and execution engines.
- TensorFlow represents computations as a graph of ops connected by tensors, uses automatic differentiation to compute gradients of a loss with respect to variables (automating the backward pass from your forward pass), executes efficiently across devices (CPU/GPU/TPU) with scheduling and batching, and provides optimizers like SGD and Adam to update model parameters.
- A tensor is a container for numbers with a shape (e.g., [batch, height, width, channels]), described by rank (number of dimensions) and dtype (numeric type like float32); think of it as a spreadsheet with many tabs that operations broadcast and combine.
- TensorFlow lets you write imperative Python (eager mode) that you can tf.function-trace into a computational graph, a DAG of ops (e.g., MatMul, Add, Conv2D) with tensors on the edges, so the TensorFlow runtime acts like a construction crew that reads the blueprint and executes it efficiently across hardware.
- Autodiff uses the chain rule to compute how changing each weight affects the loss: a forward pass computes predictions, a backward pass propagates ∂loss/∂node, and the resulting gradients let an optimizer update variables via w ← w − η * ∂L/∂w, with η the learning rate; without gradients, the model can’t learn.
- Using raw Python (no libraries), we build a mini autodiff engine whose Tensor remembers how it was made so we can backprop, much like TensorFlow’s GradientTape: track ops → compute loss → call backward() → update weights.
- Automatic differentiation must record the computational graph, i.e., the history of how each tensor was computed (the op, its inputs and output, and the node’s grad_fn), because it applies the chain rule to composed operations and needs that structure to backpropagate and accumulate the local derivatives.
- TensorFlow offers eager execution (ops run immediately for easy debugging) and graph mode (which traces Python into a static graph to enable optimizations like kernel fusion, device placement, and parallelism); TF 2 defaults to eager with optional graph compilation for production.
- TensorFlow treats a kernel as the device-specific implementation of an op, assigns work to CPU (control-heavy), GPU (massive parallel math), or TPU (matrix-math ASIC), and picks placements, manages memory copies, and runs kernels in parallel streams to maximize speed.
- Tensors have shape and dtype as contracts: mismatches lead to errors or implicit broadcasting, and frameworks like TensorFlow validate compatibility (so MatMul won’t try to multiply [3,4] by [5,6]); rule of thumb: check shapes first, then types, when something breaks.
- Variables are updatable tensors, and optimizers compute updates from gradients, e.g., SGD (w ← w − η g), Momentum (adds velocity to smooth updates), and Adam (adaptive per-parameter rates using mean and variance estimates); under the hood these are additional ops in the graph that read gradients and write new values.
- A tf.Variable is the mutable, persisted storage for model parameters (unlike a tf.Tensor, which is immutable), so it persists across training steps; it can be marked trainable (e.g., tf.Variable(..., trainable=True)), its gradients (collected via tf.trainable_variables() or model.trainable_weights) are applied by optimizers through assignment ops (e.g., v.assign_sub(...)) that may create slot tf.Variables for optimizer state, and all variables (including non-trainable ones like batch-norm moving averages) are saved and restored with checkpoints (e.g., tf.train.Checkpoint / tf.train.Saver).
- A two-layer MLP for binary classification, implemented from scratch with no external libraries, explicitly performs the forward pass, computes backward gradients, and applies parameter updates, the same steps TensorFlow automates with GradientTape and optimizer.apply_gradients.
- Because eager execution runs operations immediately, it gives you inspectable values and natural Python control flow, which makes debugging, shape/gradient verification, and rapid iteration much easier (e.g., with GradientTape, tensor.numpy(), print) and supports a smooth path to production via @tf.function, though it is slower than graph mode and still requires final testing under tracing.
- Because large models are hungry, tf.data pipelines (reading, shuffling, batching, prefetching, and parallelizing) act as a conveyor belt that keeps the GPU fed so your accelerators never starve.
- tf.distribute.Strategy scales training across multiple GPUs/TPUs or machines by sharding batches, running replicas in parallel, and reducing gradients correctly (e.g., all-reduce), so workers train identical models on different data shards and share what they learn.
- TensorFlow represents math as graphs of tensor ops and uses autodiff and device kernels to handle gradients, performance, and hardware; once this mental model clicks, using Keras feels like conducting a well-tuned orchestra.