One Pager Cheat Sheet
- Instead of hand-coding rules, you build a machine that learns rules—a stack of `layers` organized as a `computational graph`—and `TensorFlow` builds the graph, runs it efficiently on CPUs/GPUs/TPUs, and optimizes the machine using gradients, combining `Tensors`, `Flow`, `Autodiff`, and execution engines.
- TensorFlow represents computations as a `graph` of `ops` connected by `tensors`, uses `automatic differentiation` to compute `gradients` of a `loss` with respect to `variables` (automating the backward pass from your forward pass), executes efficiently across `devices` (CPU/GPU/TPU) with scheduling and batching, and provides optimizers like `SGD` and `Adam` to update model parameters.
- A `tensor` is a container for numbers with a `shape` (e.g., `[batch, height, width, channels]`), described by `rank` (number of dimensions) and `dtype` (numeric type like `float32`), and can be thought of as a spreadsheet with many tabs that operations broadcast and combine.
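
  A quick illustration of shape, rank, and dtype (a sketch; the sizes are arbitrary):

  ```python
  import tensorflow as tf

  # A rank-4 tensor shaped [batch, height, width, channels] with dtype float32.
  images = tf.zeros([32, 28, 28, 3], dtype=tf.float32)

  print(images.shape)     # (32, 28, 28, 3)
  print(tf.rank(images))  # tf.Tensor(4, ...), i.e. the number of dimensions
  print(images.dtype)     # <dtype: 'float32'>
  ```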
- TensorFlow lets you write imperative Python (eager mode) that you can `tf.function`-trace into a `computational graph`—a DAG of `ops` (e.g., `MatMul`, `Add`, `Conv2D`) with `tensors` on the edges—so the TensorFlow runtime acts like a construction crew that reads the blueprint and executes it efficiently across hardware.
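
  A minimal sketch of tracing eager Python into a graph with `tf.function` (the layer sizes are arbitrary):

  ```python
  import tensorflow as tf

  @tf.function  # traces the Python body into a graph of ops (MatMul, Add, Relu, ...)
  def dense_layer(x, w, b):
      return tf.nn.relu(tf.matmul(x, w) + b)

  x = tf.random.normal([8, 4])
  w = tf.random.normal([4, 16])
  b = tf.zeros([16])
  print(dense_layer(x, w, b).shape)  # (8, 16); first call traces, later calls reuse the graph
  ```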
- `Autodiff` uses the `chain rule` to compute how changing each weight affects the `loss`, performing a forward pass to get predictions and a backward pass propagating `∂loss/∂node`, so the resulting gradients let an optimizer update variables via `w ← w − η * ∂L/∂w` with `η` the `learning rate`, without which the model can't learn.
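
  A worked single step of that update rule using `GradientTape` (the toy loss and learning rate are illustrative):

  ```python
  import tensorflow as tf

  w = tf.Variable(3.0)
  eta = 0.1                      # learning rate η

  with tf.GradientTape() as tape:
      loss = (w - 1.0) ** 2      # forward pass

  grad = tape.gradient(loss, w)  # backward pass: ∂L/∂w = 2*(w - 1) = 4.0
  w.assign_sub(eta * grad)       # w ← w − η * ∂L/∂w, i.e. 3.0 − 0.1*4.0 = 2.6
  print(w.numpy())               # 2.6
  ```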
- Using raw Python (no libraries), we build a mini autodiff engine whose `Tensor` remembers how it was made so we can `backprop` like TensorFlow's `GradientTape`—track ops → compute loss → call `backward()` → update weights.
- Automatic differentiation must record the computational graph — i.e. the history of how each tensor was computed (the `op`, its `inputs` and `output`, and the node's `grad_fn`) — because it applies the chain rule to composed operations and needs that structure to perform backpropagation and compute/accumulate the local derivatives.
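
  A toy sketch of both points above (illustrative, not the lesson's exact engine): each `Tensor` stores its parent inputs and a `grad_fn`, so `backward()` can walk the recorded graph with the chain rule.

  ```python
  class Tensor:
      """Toy scalar tensor that records how it was produced."""
      def __init__(self, value, inputs=(), grad_fn=None):
          self.value = value      # forward-pass result
          self.inputs = inputs    # parent Tensors: the recorded graph structure
          self.grad_fn = grad_fn  # local chain-rule step for the backward pass
          self.grad = 0.0

      def __add__(self, other):
          return Tensor(self.value + other.value, (self, other),
                        lambda g: [(self, g), (other, g)])

      def __mul__(self, other):
          return Tensor(self.value * other.value, (self, other),
                        lambda g: [(self, g * other.value), (other, g * self.value)])

      def backward(self, grad=1.0):
          self.grad += grad
          if self.grad_fn is not None:
              for parent, local_grad in self.grad_fn(grad):
                  parent.backward(local_grad)

  # y = w * x + b; backprop fills in ∂y/∂w, ∂y/∂x, ∂y/∂b
  w, x, b = Tensor(2.0), Tensor(3.0), Tensor(1.0)
  y = w * x + b
  y.backward()
  print(w.grad, x.grad, b.grad)   # 3.0 2.0 1.0
  ```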
- TensorFlow offers Eager execution (where `ops` run immediately for easy debugging) and Graph mode (which traces Python into a static `graph` to enable optimizations like kernel fusion, device placement, and parallelism), and TF 2 defaults to eager with optional graph compilation for production.
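
  A small comparison, assuming a trivial reduction as the workload: the same function run eagerly and as a traced graph.

  ```python
  import tensorflow as tf

  def step(x):
      return tf.reduce_sum(x * x)

  graph_step = tf.function(step)  # trace once, then reuse the compiled graph

  x = tf.random.normal([1000])
  print(step(x).numpy())          # eager: each op runs immediately, easy to inspect
  print(graph_step(x).numpy())    # graph mode: same math, optimized execution
  ```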
- `TensorFlow` treats a `kernel` as the device-specific implementation of an `op`, assigns work to CPU (control-heavy), GPU (massive parallel math), or TPU (matrix-math ASIC), and picks placements, manages memory copies, and runs kernels in parallel streams to maximize speed.
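
  A sketch of asking the runtime what devices it has and pinning a kernel to one of them (the matrix sizes are arbitrary):

  ```python
  import tensorflow as tf

  print(tf.config.list_physical_devices())  # e.g. CPU:0, GPU:0, ...

  with tf.device("/CPU:0"):                 # pin these ops; otherwise TF picks placements
      a = tf.random.normal([1024, 1024])
      b = tf.random.normal([1024, 1024])
      c = tf.matmul(a, b)                   # runs the CPU MatMul kernel

  print(c.device)                           # where the result lives
  ```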
- Tensors have `shape` and `dtype` as contracts—mismatches lead to errors or implicit `broadcasting`, and frameworks like TensorFlow validate compatibility (so `MatMul` won't try to multiply `[3,4]` by `[5,6]`); rule of thumb: check shapes first, then types when something breaks.
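
  A quick shape-contract check, sketched with small tensors:

  ```python
  import tensorflow as tf

  # Broadcasting: [3, 1] combined with [1, 4] yields [3, 4].
  a = tf.ones([3, 1])
  b = tf.ones([1, 4])
  print((a + b).shape)          # (3, 4)

  # MatMul contract: inner dimensions must agree, [3, 4] @ [4, 2] -> [3, 2].
  x = tf.ones([3, 4])
  y = tf.ones([4, 2])
  print(tf.matmul(x, y).shape)  # (3, 2)

  # tf.matmul(tf.ones([3, 4]), tf.ones([5, 6]))  # raises: inner dimensions 4 and 5 differ
  ```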
- `Variables` are updatable `tensors`, and `optimizers` compute updates from gradients—e.g. `SGD` (`w ← w − η g`), `Momentum` (adds velocity), and `Adam` (adaptive per-parameter rates using mean and variance estimates)—updates that are implemented as additional ops in the graph that read gradients and write new values.
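
  The same update written by hand and via built-in optimizers (the gradient value here is made up for illustration):

  ```python
  import tensorflow as tf

  w = tf.Variable(5.0)
  g = tf.constant(2.0)                   # pretend gradient, just for the sketch

  w.assign_sub(0.1 * g)                  # hand-rolled SGD: w ← w − η g

  sgd = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)  # SGD plus a velocity term
  sgd.apply_gradients([(g, w)])

  adam = tf.keras.optimizers.Adam(learning_rate=0.01)             # adaptive per-parameter rates
  adam.apply_gradients([(g, w)])
  print(w.numpy())
  ```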
- A `tf.Variable` is the mutable, persisted storage for model parameters—unlike a `tf.Tensor`, which is immutable—so it persists across training steps and can be marked as a trainable parameter (e.g. `tf.Variable(..., trainable=True)`) whose gradients (collected via `tf.trainable_variables()` or `model.trainable_weights`) are used by optimizers that read gradients and write new variable values via assignment ops (e.g. `v.assign_sub(...)`) and may create slot `tf.Variable`s for optimizer state, while all variables (including non-trainable ones like batch-norm moving averages) are saved and restored with checkpoints (e.g. `tf.train.Checkpoint`/`tf.train.Saver`).
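
  A minimal sketch of variable mutation and checkpointing (the checkpoint path and the name `my_var` are arbitrary):

  ```python
  import tensorflow as tf

  v = tf.Variable([1.0, 2.0], trainable=True)  # mutable, persists across steps
  t = tf.constant([1.0, 2.0])                  # immutable value

  v.assign_sub([0.1, 0.1])                     # in-place update via an assignment op

  ckpt = tf.train.Checkpoint(my_var=v)         # tracks the variable for saving
  path = ckpt.save("/tmp/demo_ckpt")           # illustrative path
  v.assign([0.0, 0.0])
  ckpt.restore(path)
  print(v.numpy())                             # restored to the saved values: [0.9 1.9]
  ```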
- A two-layer MLP for binary classification, implemented from scratch with no external libs, explicitly performs the `forward pass`, computes `backward gradients`, and makes parameter `updates` to mimic TensorFlow logic—the same steps TensorFlow automates using `GradientTape` and `optimizer.apply_gradients`.
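
  The TensorFlow side of that loop, sketched with assumed sizes (4 → 8 → 1) and random stand-in data:

  ```python
  import tensorflow as tf

  # Tiny two-layer MLP for binary classification.
  W1 = tf.Variable(tf.random.normal([4, 8], stddev=0.1))
  b1 = tf.Variable(tf.zeros([8]))
  W2 = tf.Variable(tf.random.normal([8, 1], stddev=0.1))
  b2 = tf.Variable(tf.zeros([1]))
  opt = tf.keras.optimizers.SGD(learning_rate=0.1)

  x = tf.random.normal([16, 4])                      # fake batch
  y = tf.cast(tf.random.uniform([16, 1]) > 0.5, tf.float32)

  with tf.GradientTape() as tape:                    # forward pass, recorded
      h = tf.nn.relu(tf.matmul(x, W1) + b1)
      logits = tf.matmul(h, W2) + b2
      loss = tf.reduce_mean(
          tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits))

  grads = tape.gradient(loss, [W1, b1, W2, b2])      # backward pass
  opt.apply_gradients(zip(grads, [W1, b1, W2, b2]))  # parameter updates
  ```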
- Because `eager execution` runs operations immediately, it provides immediate, inspectable values and natural Python control flow, which makes debugging, shape/gradient verification, and rapid iteration much easier (e.g., with `GradientTape`, `tensor.numpy()`, `print`) and supports a smooth transition to production via `@tf.function`, though it's slower than graph mode and still requires final testing under tracing.
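
  A small example of the kind of inline inspection eager mode allows:

  ```python
  import tensorflow as tf

  x = tf.random.normal([2, 3])
  w = tf.Variable(tf.ones([3, 1]))

  with tf.GradientTape() as tape:
      y = tf.matmul(x, w)
      print("y shape:", y.shape)    # inspect shapes mid-computation
      loss = tf.reduce_sum(y ** 2)

  print("loss:", loss.numpy())      # pull the concrete value into Python
  print("grad:", tape.gradient(loss, w).numpy())
  ```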
- Because large models are hungry, `tf.data` pipelines—by reading, shuffling, batching, prefetching, and parallelizing—act as a conveyor belt that keeps the `GPU` fed so your accelerators never starve.
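
  A sketch of such a pipeline over random stand-in data (the map step and sizes are illustrative):

  ```python
  import tensorflow as tf

  features = tf.random.normal([1000, 4])
  labels = tf.random.uniform([1000], maxval=2, dtype=tf.int32)

  dataset = (
      tf.data.Dataset.from_tensor_slices((features, labels))
      .shuffle(buffer_size=1000)                 # randomize example order
      .map(lambda x, y: (x * 2.0, y),
           num_parallel_calls=tf.data.AUTOTUNE)  # parallel preprocessing
      .batch(32)
      .prefetch(tf.data.AUTOTUNE)                # overlap input prep with training
  )

  for batch_x, batch_y in dataset.take(1):
      print(batch_x.shape, batch_y.shape)        # (32, 4) (32,)
  ```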
- `tf.distribute.Strategy` scales training across multiple GPUs/TPUs or machines by sharding batches, running replicas in parallel, and reducing gradients correctly (e.g., `all-reduce`), so workers train identical models on different data shards and share updates at synchronization points.
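
  A minimal sketch using `MirroredStrategy` (one common Strategy; the model and sizes are placeholders):

  ```python
  import tensorflow as tf

  strategy = tf.distribute.MirroredStrategy()    # replicates across local GPUs (or CPU)
  print("replicas:", strategy.num_replicas_in_sync)

  with strategy.scope():                         # variables created here are mirrored
      model = tf.keras.Sequential([
          tf.keras.Input(shape=(4,)),
          tf.keras.layers.Dense(8, activation="relu"),
          tf.keras.layers.Dense(1, activation="sigmoid"),
      ])
      model.compile(optimizer="sgd", loss="binary_crossentropy")

  # model.fit(...) then shards each batch across replicas and all-reduces gradients.
  ```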
- `TensorFlow` represents math as graphs of tensor ops, uses `autodiff` and `device kernels` to handle gradients, performance, and hardware, and so once this model clicks, using `Keras` feels like conducting a well-tuned orchestra.