One Pager Cheat Sheet
- Instead of hand-coding rules, you build a machine that learns rules: a stack of layers organized as a computational graph. TensorFlow builds the graph, runs it efficiently on CPUs/GPUs/TPUs, and optimizes the machine using gradients, combining Tensors, Flow, Autodiff, and execution engines.
- TensorFlow represents computations as a graph of ops connected by tensors, uses automatic differentiation to compute gradients of a loss with respect to variables (automating the backward pass from your forward pass), executes efficiently across devices (CPU/GPU/TPU) with scheduling and batching, and provides optimizers like SGD and Adam to update model parameters.
- A tensor is a container for numbers with a shape (e.g., `[batch, height, width, channels]`), described by its rank (number of dimensions) and dtype (numeric type like `float32`); think of it as a spreadsheet with many tabs that operations broadcast and combine.
- TensorFlow lets you write imperative Python (eager mode) that you can `tf.function`-trace into a computational graph, a DAG of ops (e.g., `MatMul`, `Add`, `Conv2D`) with tensors on the edges, so the TensorFlow runtime acts like a construction crew that reads the blueprint and executes it efficiently across hardware (see the tracing sketch after this list).
- Autodiff uses the chain rule to compute how changing each weight affects the loss: a forward pass produces predictions, a backward pass propagates ∂loss/∂node, and the resulting gradients let an optimizer update variables via w ← w − η * ∂L/∂w, with η the learning rate, without which the model can't learn (see the `GradientTape` sketch after this list).
- Using raw Python (no libraries), we build a mini autodiff engine whose `Tensor` remembers how it was made so we can backprop like TensorFlow's `GradientTape`: track ops → compute loss → call `backward()` → update weights (a minimal version is sketched after this list).
- Automatic differentiation must record the computational graph, i.e. the history of how each tensor was computed (the op, its inputs and output, and the node's `grad_fn`), because it applies the chain rule to composed operations and needs that structure to perform backpropagation and compute/accumulate the local derivatives.
- TensorFlow offers eager execution (where ops run immediately for easy debugging) and graph mode (which traces Python into a static graph to enable optimizations like kernel fusion, device placement, and parallelism); TF 2 defaults to eager with optional graph compilation for production.
- TensorFlow treats a kernel as the device-specific implementation of an op, assigns work to the CPU (control-heavy), GPU (massively parallel math), or TPU (matrix-math ASIC), and picks placements, manages memory copies, and runs kernels in parallel streams to maximize speed.
- Tensors have shape and dtype as contracts: mismatches lead to errors or implicit broadcasting, and frameworks like TensorFlow validate compatibility (so `MatMul` won't try to multiply `[3,4]` by `[5,6]`); rule of thumb: check shapes first, then types, when something breaks (see the shape/broadcasting sketch after this list).
- Variables are updatable tensors, and optimizers compute updates from gradients, e.g. SGD (w ← w − η g), Momentum (adds velocity), and Adam (adaptive per-parameter rates using mean and variance estimates); these updates are implemented as additional ops in the graph that read gradients and write new values (the update rules are sketched after this list).
- A `tf.Variable` is the mutable, persisted storage for model parameters, unlike a `tf.Tensor`, which is immutable. It persists across training steps and can be marked as a trainable parameter (e.g. `tf.Variable(..., trainable=True)`) whose gradient (collected via `tf.trainable_variables()` or `model.trainable_weights`) is used by optimizers; optimizers read gradients and write new variable values via assignment ops (e.g. `v.assign_sub(...)`) and may create slot `tf.Variable`s for optimizer state, while all variables (including non-trainable ones like batch-norm moving averages) are saved and restored with checkpoints (e.g. `tf.train.Checkpoint` / `tf.train.Saver`); see the variable/checkpoint sketch after this list.
- A two-layer MLP for binary classification, implemented from scratch with no external libraries, explicitly performs the forward pass, computes backward gradients, and applies parameter updates to mimic TensorFlow's logic, the same steps TensorFlow automates with `GradientTape` and `optimizer.apply_gradients` (a TensorFlow training-step sketch follows this list).
- Because eager execution runs operations immediately, it provides immediate, inspectable values and natural Python control flow, which makes debugging, shape/gradient verification, and rapid iteration much easier (e.g., with `GradientTape`, `tensor.numpy()`, `print`) and supports a smooth transition to production via `@tf.function`, though it's slower than graph mode and still requires final testing under tracing.
- Because large models are hungry, `tf.data` pipelines (reading, shuffling, batching, prefetching, and parallelizing) act as a conveyor belt that keeps the GPU fed so your accelerators never starve (a pipeline sketch follows this list).
- `tf.distribute.Strategy` scales training across multiple GPUs/TPUs or machines by sharding batches, running replicas in parallel, and reducing gradients correctly (e.g., all-reduce), so workers train identical models on different data shards and share updates at checkpoints (a strategy sketch follows this list).
- TensorFlow represents math as graphs of tensor ops and uses autodiff and device kernels to handle gradients, performance, and hardware; once this mental model clicks, using Keras feels like conducting a well-tuned orchestra.
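A minimal sketch of `tf.function` tracing referenced above; the function name `dense_relu` and the input shapes are arbitrary illustrations, not from the lessons.

```python
import tensorflow as tf

# Eager-looking Python traced into a graph: the body is traced once per input
# signature, then later calls run the compiled graph of MatMul/Add/Relu ops.
@tf.function
def dense_relu(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal([2, 3])        # arbitrary demo shapes
w = tf.random.normal([3, 4])
b = tf.zeros([4])
print(dense_relu(x, w, b).shape)    # (2, 4)
```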
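A minimal `GradientTape` sketch of the w ← w − η * ∂L/∂w update on a single variable; the toy quadratic loss is an assumed example.

```python
import tensorflow as tf

w = tf.Variable(3.0)                 # a single trainable parameter
eta = 0.1                            # learning rate

with tf.GradientTape() as tape:
    loss = (w - 1.0) ** 2            # toy loss, minimized at w = 1

grad = tape.gradient(loss, w)        # dL/dw = 2 * (w - 1) = 4.0
w.assign_sub(eta * grad)             # w <- w - eta * dL/dw

print(w.numpy())                     # 2.6, eager values are directly inspectable
```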
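A minimal plain-Python sketch of a `Tensor` that remembers how it was made; it supports only add and multiply (no topological ordering or broadcasting), so treat it as an illustration of the idea rather than the full engine built in the lessons.

```python
# A scalar Tensor that records its history (parent tensors plus a grad_fn),
# so calling backward() can apply the chain rule back through that history.
class Tensor:
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents      # tensors this one was computed from
        self.grad_fn = None         # maps upstream grad to (parent, local grad) pairs

    def __add__(self, other):
        out = Tensor(self.value + other.value, (self, other))
        out.grad_fn = lambda g: [(self, g), (other, g)]
        return out

    def __mul__(self, other):
        out = Tensor(self.value * other.value, (self, other))
        out.grad_fn = lambda g: [(self, g * other.value), (other, g * self.value)]
        return out

    def backward(self, grad=1.0):
        # Accumulate d(output)/d(self), then push gradients to the parents.
        self.grad += grad
        if self.grad_fn is not None:
            for parent, g in self.grad_fn(grad):
                parent.backward(g)

# Forward pass: y = w * x + b, then the backward pass fills in the gradients.
w, x, b = Tensor(2.0), Tensor(3.0), Tensor(1.0)
y = w * x + b
y.backward()
print(y.value, w.grad, x.grad, b.grad)   # 7.0 3.0 2.0 1.0
```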
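A short sketch of the shape/dtype contract and broadcasting behaviour described above; the shapes are arbitrary.

```python
import tensorflow as tf

a = tf.ones([3, 4])                      # shape [3, 4], dtype float32
b = tf.ones([4, 5])
c = tf.matmul(a, b)                      # OK: inner dims match, result shape (3, 5)

bias = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0])   # shape [5]
print((c + bias).shape)                  # (3, 5): the [5] bias broadcasts over rows

# Contract violations fail loudly instead of guessing:
# tf.matmul(a, tf.ones([5, 6]))          # shape error: inner dims 4 != 5
# a + tf.ones([3, 4], dtype=tf.int32)    # dtype error: float32 vs int32
```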
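Plain-Python sketches of the named update rules for a single scalar parameter, using the textbook formulas rather than TensorFlow's internal implementations.

```python
def sgd(w, g, eta=0.01):
    # w <- w - eta * g
    return w - eta * g

def momentum(w, g, v, eta=0.01, beta=0.9):
    # velocity accumulates a decaying sum of past gradients
    v = beta * v + g
    return w - eta * v, v

def adam(w, g, m, v, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # adaptive per-parameter rate from running mean (m) and variance (v) estimates
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)            # bias correction; t is the step count (>= 1)
    v_hat = v / (1 - b2 ** t)
    return w - eta * m_hat / (v_hat ** 0.5 + eps), m, v
```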
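A sketch of `tf.Variable` versus `tf.Tensor`, an in-place update, and checkpointing; the `Dense` layer and the checkpoint path are arbitrary examples.

```python
import tensorflow as tf

t = tf.constant([1.0, 2.0])        # tf.Tensor: an immutable value
v = tf.Variable([1.0, 2.0])        # tf.Variable: mutable, persists across steps
v.assign_sub([0.1, 0.1])           # in-place update op, what optimizers do

layer = tf.keras.layers.Dense(4)
layer.build([None, 3])             # creates kernel [3, 4] and bias [4]
print([w.shape for w in layer.trainable_weights])

# Checkpoints save every tracked variable, trainable or not.
ckpt = tf.train.Checkpoint(v=v, layer=layer)
path = ckpt.save("/tmp/demo_ckpt")   # arbitrary demo prefix
ckpt.restore(path)
```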
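A sketch of the automated counterpart to the from-scratch MLP: a custom training step with `GradientTape` and `optimizer.apply_gradients`; the layer sizes, loss, and random data are assumptions for illustration.

```python
import tensorflow as tf

# The same forward -> backward -> update steps as the from-scratch MLP,
# but automated by GradientTape and an optimizer.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
loss_fn = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        pred = model(x, training=True)                   # forward pass
        loss = loss_fn(y, pred)
    grads = tape.gradient(loss, model.trainable_variables)            # backward pass
    optimizer.apply_gradients(zip(grads, model.trainable_variables))  # update
    return loss

x = tf.random.normal([32, 8])                            # fake batch
y = tf.cast(tf.random.uniform([32, 1]) > 0.5, tf.float32)
print(train_step(x, y).numpy())
```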
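A toy `tf.data` pipeline showing the shuffle/batch/prefetch conveyor belt; the in-memory fake dataset stands in for real file reading.

```python
import tensorflow as tf

# Read -> shuffle -> batch -> prefetch, so data prep overlaps with training
# and the accelerator is never waiting on the host.
features = tf.random.normal([1000, 8])                   # arbitrary fake dataset
labels = tf.random.uniform([1000], maxval=2, dtype=tf.int32)

ds = (tf.data.Dataset.from_tensor_slices((features, labels))
      .shuffle(buffer_size=1000)
      .batch(32)
      .prefetch(tf.data.AUTOTUNE))

for x, y in ds.take(1):
    print(x.shape, y.shape)                              # (32, 8) (32,)
```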
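A sketch of data-parallel training with `tf.distribute.MirroredStrategy` (one concrete `Strategy`); on a machine without multiple GPUs it simply runs with a single replica.

```python
import tensorflow as tf

# A model built inside the strategy scope is replicated on every visible GPU;
# fit() shards each batch across replicas and all-reduces the gradients.
strategy = tf.distribute.MirroredStrategy()
print("replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="sgd", loss="mse")

x = tf.random.normal([256, 8])                           # arbitrary fake data
y = tf.random.normal([256, 1])
model.fit(x, y, batch_size=64, epochs=1, verbose=0)
```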


