
One Pager Cheat Sheet

  • Instead of hand-coding rules, you build a machine that learns rules—a stack of layers organized as a computational graph—and TensorFlow builds the graph, runs it efficiently on CPUs/GPUs/TPUs, and optimizes the machine using gradients, combining Tensors, Flow, Autodiff, and execution engines.
  • TensorFlow represents computations as a graph of ops connected by tensors, uses automatic differentiation to compute gradients of a loss with respect to variables (automating the backward pass from your forward pass), executes efficiently across devices (CPU/GPU/TPU) with scheduling and batching, and provides optimizers like SGD and Adam to update model parameters.
  • A tensor is a container for numbers with a shape (e.g., [batch, height, width, channels]), described by rank (number of dimensions) and dtype (numeric type like float32), and can be thought of as a spreadsheet with many tabs that operations broadcast and combine (see the shapes-and-dtypes sketch after this list).
  • TensorFlow lets you write imperative Python (eager mode) that you can tf.function-trace into a computational graph—a DAG of ops (e.g., MatMul, Add, Conv2D) with tensors on the edges—so the TensorFlow runtime acts like a construction crew that reads the blueprint and executes it efficiently across hardware (see the eager-vs-graph sketch after this list).
  • Autodiff uses the chain rule to compute how changing each weight affects the loss, performing a forward pass to get predictions and a backward pass propagating ∂loss/∂node, so the resulting gradients let an optimizer update variables via w ← w − η * ∂L/∂w with η the learning rate; without gradients, the model can’t learn (see the GradientTape sketch after this list).
  • Using raw Python (no libraries), we build a mini autodiff engine whose Tensor remembers how it was made so we can backprop like TensorFlow's GradientTape—track ops → compute loss → call backward() → update weights (see the mini autodiff sketch after this list).
  • Automatic differentiation must record the computational graph — the history of how each tensor was computed (the op, its inputs and output, and the node’s grad_fn) — because it applies the chain rule to composed operations and needs that structure to perform backpropagation and compute/accumulate the local derivatives.
  • TensorFlow offers Eager execution (where ops run immediately for easy debugging) and Graph mode (which traces Python into a static graph to enable optimizations like kernel fusion, device placement, and parallelism), and TF 2 defaults to eager with optional graph compilation for production.
  • TensorFlow treats a kernel as the device-specific implementation of an op, assigns work to CPU (control-heavy), GPU (massive parallel math), or TPU (matrix-math ASIC), and picks placements, manages memory copies, and runs kernels in parallel streams to maximize speed.
  • Tensors have shape and dtype as contracts—mismatches lead to errors or implicit broadcasting, and frameworks like TensorFlow validate compatibility (so MatMul won’t try to multiply [3,4] by [5,6]); rule of thumb: check shapes first, then types when something breaks.
  • Variables are updatable tensors and optimizers compute updates from gradients—e.g. SGD (w ← w − η g), Momentum (adds velocity), and Adam (adaptive per-parameter rates using mean and variance estimates)—and are implemented as additional ops in the graph that read gradients and write new values.
  • A tf.Variable is the mutable, persisted storage for model parameters—unlike a tf.Tensor, which is immutable—so it persists across training steps and can be marked trainable (e.g. tf.Variable(..., trainable=True)); optimizers read the gradients of trainable variables (collected via model.trainable_weights or the legacy tf.trainable_variables()) and write new values with assignment ops (e.g. v.assign_sub(...)), possibly creating slot tf.Variables for optimizer state, while all variables (including non-trainable ones like batch-norm moving averages) are saved and restored with checkpoints (e.g. tf.train.Checkpoint, or tf.train.Saver in TF 1).
  • A two-layer MLP for binary classification, implemented from scratch with no external libs, explicitly performs the forward pass, computes backward gradients, and makes parameter updates to mimic TensorFlow logic—the same steps TensorFlow automates using GradientTape and optimizer.apply_gradients (see the from-scratch MLP sketch after this list).
  • Because eager execution runs operations immediately, it provides immediate, inspectable values and natural Python control flow, which makes debugging, shape/gradient verification, and rapid iteration much easier (e.g., with GradientTape, tensor.numpy(), print) and supports a smooth transition to production via @tf.function, though it's slower than graph mode and still requires final testing under tracing.
  • Because large models are hungry, tf.data pipelines—by reading, shuffling, batching, prefetching, and parallelizing—act as a conveyor belt that keeps the GPU fed so your accelerators never starve (see the tf.data sketch after this list).
  • tf.distribute.Strategy scales training across multiple GPUs/TPUs or machines by sharding batches, running replicas in parallel, and reducing gradients correctly (e.g., all-reduce), so workers train identical models on different data shards and share updates at each synchronization step (see the MirroredStrategy sketch after this list).
  • TensorFlow represents math as graphs of tensor ops, uses autodiff and device kernels to handle gradients, performance, and hardware, and so once this model clicks, using Keras feels like conducting a well-tuned orchestra.
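
The short sketches below illustrate several of the bullets above. They assume TensorFlow 2.x, and every shape, value, and file path in them is a made-up example rather than code from the lesson. First, shape, rank, dtype, broadcasting, and the shape contract that MatMul enforces:

```python
import tensorflow as tf

# A rank-4 tensor shaped like an image batch: [batch, height, width, channels].
images = tf.zeros([32, 28, 28, 3], dtype=tf.float32)
print(images.shape)        # (32, 28, 28, 3)
print(images.shape.rank)   # 4  (rank = number of dimensions)
print(images.dtype)        # <dtype: 'float32'>

# Broadcasting: a per-channel bias of shape [3] is stretched across the whole batch.
bias = tf.constant([0.1, 0.2, 0.3])
shifted = images + bias    # result keeps shape (32, 28, 28, 3)

# Shape contracts are enforced: MatMul refuses [3, 4] x [5, 6] because 4 != 5.
a, b = tf.ones([3, 4]), tf.ones([5, 6])
try:
    tf.matmul(a, b)
except (tf.errors.InvalidArgumentError, ValueError) as err:
    print("shape mismatch:", type(err).__name__)
```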
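
Next, eager execution versus a tf.function-traced graph, plus optional explicit device placement; the tiny dense_layer function here is a hypothetical example, not part of the lesson's code:

```python
import tensorflow as tf

# Eager mode: ops run immediately, so you can print and debug like normal Python.
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
print(tf.reduce_sum(x).numpy())      # 10.0

# tf.function traces the Python body into a graph the runtime can optimize
# (kernel fusion, device placement, parallel execution of independent ops).
@tf.function
def dense_layer(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

w = tf.random.normal([2, 3])
b = tf.zeros([3])
print(dense_layer(x, w, b).shape)    # (2, 3)

# You can pin work to a device explicitly; otherwise TensorFlow places ops itself.
with tf.device("/CPU:0"):
    y = tf.matmul(x, x)
```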
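
A single training step with GradientTape, tf.Variable, and an optimizer, ending with a checkpoint save; the linear model, learning rate, and save path are illustrative choices:

```python
import tensorflow as tf

# Trainable parameters live in tf.Variable (mutable); inputs are plain tensors.
w = tf.Variable(tf.random.normal([3, 1]), trainable=True)
b = tf.Variable(tf.zeros([1]))
x = tf.random.normal([8, 3])                 # a toy batch of 8 examples
y_true = tf.random.normal([8, 1])

optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)   # or Adam, Momentum, ...

# Forward pass under the tape, then ask for d(loss)/d(variable).
with tf.GradientTape() as tape:
    y_pred = tf.matmul(x, w) + b
    loss = tf.reduce_mean(tf.square(y_pred - y_true))
grads = tape.gradient(loss, [w, b])

# Either hand the gradients to an optimizer...
optimizer.apply_gradients(zip(grads, [w, b]))
# ...or apply w <- w - lr * grad yourself with an assignment op:
# w.assign_sub(0.1 * grads[0])

# Variables (trainable or not) are what checkpoints save and restore.
ckpt = tf.train.Checkpoint(w=w, b=b, optimizer=optimizer)
ckpt.save("/tmp/tf_demo_ckpt")               # illustrative path
print("loss:", loss.numpy())
```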
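
A toy pure-Python autodiff engine in the spirit of the from-scratch exercise; the class name Value and the scalar-only ops are my simplifications, not the lesson's exact code:

```python
# Each Value remembers the op that produced it and its inputs,
# so backward() can apply the chain rule over the recorded graph.
class Value:
    def __init__(self, data, parents=(), backward_fn=lambda: None):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # the recorded computational graph
        self._backward = backward_fn     # local chain-rule step (grad_fn)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward_fn():
            self.grad += out.grad        # d(a+b)/da = 1
            other.grad += out.grad       # d(a+b)/db = 1
        out._backward = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward_fn():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = backward_fn
        return out

    def backward(self):
        # Topologically order the graph, then propagate d(loss)/d(node) backwards.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

# track ops -> compute loss -> backward() -> update weights
w, x = Value(2.0), Value(3.0)
loss = w * x + w                 # d(loss)/dw = x + 1 = 4
loss.backward()
w.data -= 0.1 * w.grad           # SGD step, like w.assign_sub in TensorFlow
print(w.grad, w.data)            # 4.0 1.6
```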
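
A compact two-layer MLP trained with hand-derived gradients, again in raw Python; the 2-2-1 layer sizes, the OR-gate dataset, and the learning rate are arbitrary toy choices:

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    # Hidden layer (2 sigmoid units), then a single sigmoid output.
    h = [sigmoid(W1[j][0] * x[0] + W1[j][1] * x[1] + b1[j]) for j in range(2)]
    p = sigmoid(W2[0] * h[0] + W2[1] * h[1] + b2)
    return h, p

random.seed(0)
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
b1 = [0.0, 0.0]
W2 = [random.uniform(-1, 1) for _ in range(2)]
b2 = 0.0
lr = 0.5
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]   # learn the OR gate

for _ in range(2000):
    for x, y in data:
        h, p = forward(x, W1, b1, W2, b2)      # forward pass
        # Backward pass: with sigmoid + cross-entropy loss, dL/d(output pre-act) = p - y.
        d_out = p - y
        d_hid = [d_out * W2[j] * h[j] * (1 - h[j]) for j in range(2)]   # chain rule
        # Parameter updates (plain SGD).
        for j in range(2):
            W2[j] -= lr * d_out * h[j]
            b1[j] -= lr * d_hid[j]
            W1[j][0] -= lr * d_hid[j] * x[0]
            W1[j][1] -= lr * d_hid[j] * x[1]
        b2 -= lr * d_out

print([round(forward(x, W1, b1, W2, b2)[1], 2) for x, _ in data])   # close to [0, 1, 1, 1]
```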
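
A toy tf.data input pipeline showing the read → shuffle → batch → prefetch conveyor belt; the in-memory random features and the map step are stand-ins for real file reading and preprocessing:

```python
import tensorflow as tf

features = tf.random.normal([1000, 32])                      # pretend these came from disk
labels = tf.random.uniform([1000], maxval=2, dtype=tf.int32)

dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(buffer_size=1000)                               # randomize example order
    .map(lambda x, y: (x * 2.0, y),                          # stand-in for real preprocessing
         num_parallel_calls=tf.data.AUTOTUNE)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)                              # prepare batches ahead of the GPU
)

for batch_x, batch_y in dataset.take(2):
    print(batch_x.shape, batch_y.shape)                      # (64, 32) (64,)
```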
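
Finally, a MirroredStrategy sketch for single-machine multi-GPU data parallelism (it falls back to one replica on CPU-only machines); the tiny Keras model and random data are placeholders:

```python
import tensorflow as tf

# One model replica per local GPU; each batch is sharded across replicas and
# gradients are combined with all-reduce so every replica stays in sync.
strategy = tf.distribute.MirroredStrategy()
print("replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():                       # variables created here are mirrored
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

x = tf.random.normal([256, 32])
y = tf.cast(tf.random.uniform([256, 1], maxval=2, dtype=tf.int32), tf.float32)
model.fit(x, y, batch_size=64, epochs=1)     # Keras splits each batch across replicas
```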