A Story to Start: From “Do Math” to “Build a Machine”
Imagine you’re teaching a robot to recognize cats. You don’t write rules like “if pointy ears then cat.” Instead, you build a machine that learns rules from data. Under the hood, that machine is a stack of layers wired into a computational graph. TensorFlow is the toolkit that builds the graph, runs it efficiently on CPUs/GPUs/TPUs, and optimizes the machine using gradients.
TensorFlow = Tensors (multi-dimensional arrays) + Flow (they flow through operations) + Autodiff (automatic differentiation) + Execution engines (graph runtime).

The Big Picture: What TensorFlow Does
- Represents computations as a graph of ops connected by tensors.
- Uses automatic differentiation to compute gradients of a loss with respect to variables.
- Executes efficiently on different devices (CPU/GPU/TPU), scheduling and batching work.
- Provides optimizers (e.g., SGD, Adam) to update model parameters.
The magic: You describe the forward pass (how outputs are computed). TensorFlow builds enough metadata to compute the backward pass (gradients) automatically.

Tensors: The Things That Flow
A tensor is a container for numbers with a shape (like [batch, height, width, channels] for images).
- rank: number of dimensions (scalar=0D, vector=1D, matrix=2D, etc.).
- dtype: numeric type (float32, int32, …).
Mental model: a tensor is a spreadsheet that can have many tabs (dimensions). Ops know how to broadcast and combine them.
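A quick illustration (a minimal sketch, assuming TensorFlow 2 is installed and imported as tf): inspecting the shape, rank, and dtype of a couple of tensors.

import tensorflow as tf

image_batch = tf.zeros([32, 64, 64, 3])   # [batch, height, width, channels]
print(image_batch.shape)                  # (32, 64, 64, 3)
print(len(image_batch.shape))             # rank 4
print(image_batch.dtype)                  # <dtype: 'float32'>

v = tf.constant([1, 2, 3])                # rank-1 tensor; dtype inferred as int32
print(v.shape, v.dtype)                   # (3,) <dtype: 'int32'>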

Graphs and Ops: Blueprint vs. Execution
A computational graph is a directed acyclic graph (DAG) where nodes are ops (like MatMul, Add, Conv2D) and edges carry tensors. In TensorFlow 2, you usually write imperative Python (eager mode) and optionally decorate functions with tf.function to trace them into graphs for speed.
Think: you sketch a blueprint (graph). The TensorFlow runtime is a construction crew that reads the blueprint and builds results quickly, using all the hardware lanes.
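To make the blueprint concrete, here is a minimal sketch (assuming TensorFlow 2): the same tiny computation run eagerly op-by-op, then traced into a reusable graph with tf.function.

import tensorflow as tf

x = tf.constant([[1.0, 2.0]])            # a 1x2 tensor flows into the ops below
w = tf.constant([[3.0], [4.0]])          # a 2x1 tensor
y = tf.add(tf.matmul(x, w), 5.0)         # nodes: MatMul -> Add; edges carry tensors
print(y.numpy())                         # [[16.]]  (1*3 + 2*4 + 5)

@tf.function                             # trace the same Python into a reusable graph
def affine(x, w, b):
    return tf.matmul(x, w) + b

print(affine(x, w, 5.0).numpy())         # same result, executed from the traced graph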
Autodiff: Why Gradients Matter
Automatic differentiation (autodiff) uses the chain rule to compute how changing each weight will change the loss. Forward: compute predictions. Backward: propagate ∂loss/∂node from outputs to inputs, accumulating gradients.
Without gradients, your model can’t learn. With them, an optimizer updates variables:
w ← w − η * ∂L/∂w
where η is the learning rate.
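Here is that update done with TensorFlow’s GradientTape (a minimal sketch, assuming TensorFlow 2; the numbers are illustrative):

import tensorflow as tf

w = tf.Variable(3.0)
x, y_true = tf.constant(2.0), tf.constant(10.0)

with tf.GradientTape() as tape:          # forward: record how the loss was computed
    y_pred = w * x
    loss = (y_pred - y_true) ** 2

grad = tape.gradient(loss, w)            # backward: dL/dw = 2*(y_pred - y_true)*x = -16.0
w.assign_sub(0.1 * grad)                 # w ← w − η * dL/dw
print(w.numpy())                         # 4.6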

Hands-On Analogy: Build a Mini Autodiff Engine
Using raw Python (no libraries), we’ll mirror TensorFlow’s core idea: a Tensor that remembers how it was made, so we can backprop. The minimal scalar autodiff engine below is not TensorFlow, but it rhymes with its GradientTape: track ops → compute loss → call backward() → update weights.
class Tensor:
    def __init__(self, data, parents=(), op="leaf"):
        self.data = data                # a plain Python number (scalar engine for clarity)
        self.grad = 0.0                 # dL/dThis
        self.parents = parents          # upstream nodes
        self.op = op                    # for debug
        self._backward = lambda: None   # leaves have nothing to propagate
    def __add__(self, other):
        other = other if isinstance(other, Tensor) else Tensor(other)
        out = Tensor(self.data + other.data, (self, other), "add")
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out
    def __mul__(self, other):
        other = other if isinstance(other, Tensor) else Tensor(other)
        out = Tensor(self.data * other.data, (self, other), "mul")
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out
    def relu(self):
        out = Tensor(self.data if self.data > 0 else 0.0, (self,), "relu")
        def _backward():
            self.grad += (1.0 if self.data > 0 else 0.0) * out.grad
        out._backward = _backward
        return out
    def backward(self):
        # topological order: visit parents before children, then run _backward in reverse
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for p in node.parents:
                    visit(p)
                order.append(node)
        visit(self)
        self.grad = 1.0                 # dL/dL = 1
        for node in reversed(order):
            node._backward()

# tiny demo: fit pred = w*x + b to the point (x=3, y=7) with a squared-error loss
w, b = Tensor(0.0), Tensor(0.0)
for step in range(100):
    w.grad, b.grad = 0.0, 0.0
    pred = w * 3.0 + b
    err = pred + (-7.0)                 # pred - y
    loss = err * err
    loss.backward()
    w.data -= 0.01 * w.grad             # gradient descent: w ← w − η * dL/dw
    b.data -= 0.01 * b.grad
    if step % 20 == 0:
        print(f"step={step} loss={loss.data:.4f} w={w.data:.3f} b={b.data:.3f}")
Try this exercise. Click the correct answer from the options.
Which ingredient is essential for automatic differentiation to work?
Click the option that best answers the question.
- Storing how each tensor was computed (the graph)
- Randomly updating weights without gradients
- Running the backward pass before the forward pass
Eager vs Graph: Two Modes, One Goal
- Eager execution: ops run immediately (great for debugging).
- Graph mode: trace Python functions into a static graph for optimizations like kernel fusion, device placement, and parallelism.
Think: sketch first (eager), then commit blueprints for speed (graph). TF 2 encourages eager by default, with optional graph compilation for production.
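A small sketch of the two modes (assuming TensorFlow 2): the Python body runs once while tf.function traces it; later calls with the same input signature reuse the cached graph.

import tensorflow as tf

@tf.function
def square(x):
    print("tracing")                     # plain Python side effect: runs only during tracing
    return x * x

square(tf.constant(2.0))                 # prints "tracing", then runs the traced graph
square(tf.constant(3.0))                 # same signature: cached graph is reused, no print
print(square.get_concrete_function(tf.constant(1.0)).graph)   # the underlying graph object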

Devices & Kernels: How It Gets Fast
A kernel is the device-specific implementation of an op. TensorFlow schedules ops on devices:
- CPU: general-purpose; great for control-heavy tasks.
- GPU: massive parallel math (matmuls, convs).
- TPU: matrix math ASIC for large-scale training.
TensorFlow picks placements, manages memory copies, and runs kernels in parallel streams when possible.
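If you want to see or influence placement, tf.device works as a hint (a sketch, assuming TensorFlow 2; with no GPU available, everything lands on the CPU anyway):

import tensorflow as tf

print(tf.config.list_physical_devices())    # the devices the runtime can see

with tf.device("/CPU:0"):                   # explicit placement; normally TF decides for you
    a = tf.random.normal([1024, 1024])
    b = tf.random.normal([1024, 1024])
    c = tf.matmul(a, b)                     # this op's kernel runs on the CPU
print(c.device)                             # e.g. ".../device:CPU:0"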
Shape & Dtype: Contracts Between Ops
Every tensor has a shape and dtype. Mismatches cause errors or implicit broadcasting. TensorFlow validates compatibility so MatMul doesn’t try to multiply [3,4] by [5,6].
Rule of thumb: when something breaks, check shapes first, then types.
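A quick shape sanity check (a sketch, assuming TensorFlow 2; the exact exception type can vary by version, so the except clause is deliberately broad):

import tensorflow as tf

a = tf.zeros([3, 4])
b = tf.zeros([4, 2])
print(tf.matmul(a, b).shape)                # (3, 2): inner dimensions agree (4 and 4)

row = tf.constant([1.0, 2.0, 3.0, 4.0])     # shape (4,)
print((a + row).shape)                      # (3, 4): the row broadcasts over 3 rows

try:
    tf.matmul(a, tf.zeros([5, 6]))          # inner dims 4 vs 5: rejected
except (tf.errors.InvalidArgumentError, ValueError) as e:
    print("shape mismatch:", type(e).__name__)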
Optimizers: How Variables Learn
Variables are tensors you can update. Optimizers compute updates from gradients:
- SGD: w ← w − η g
- Momentum: adds velocity to smooth updates
- Adam: adaptive learning rates per parameter (mean + variance estimates)
Under the hood, these are additional ops in the graph that read gradients and write new values.
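In code, that loop is simply: compute gradients, hand them to the optimizer (a minimal sketch, assuming TensorFlow 2):

import tensorflow as tf

w = tf.Variable([2.0, -1.0])
opt = tf.keras.optimizers.SGD(learning_rate=0.1)

with tf.GradientTape() as tape:
    loss = tf.reduce_sum(w ** 2)            # simple convex loss, minimum at w = 0

grads = tape.gradient(loss, [w])
opt.apply_gradients(zip(grads, [w]))        # w ← w − 0.1 * 2w
print(w.numpy())                            # [ 1.6 -0.8]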
Are you sure you're getting this? Is this statement true or false?
In TensorFlow, variables are the trainable parameters whose values persist across steps.
Press true if you believe the statement is correct, or false otherwise.
Mini-Model From Scratch: Two-Layer MLP
Here is a two-layer MLP for binary classification, implemented on toy data with no external libraries. We’ll mimic TensorFlow logic: forward pass + backward gradients + updates. (This helps you understand what TF automates.)
TensorFlow would: (1) define layers, (2) run forward, (3) use a GradientTape to get grads, (4) call optimizer.apply_gradients. Everything else (placement, kernels, shapes) comes “for free.”
import random, math
def sigmoid(x): return 1/(1+math.exp(-x))
def dsigmoid(y): return y*(1-y)  # unused below: sigmoid + cross-entropy simplifies to p - y
def relu(x): return x if x>0 else 0.0
def drelu(x): return 1.0 if x>0 else 0.0
def dot(w, x): return sum(wi*xi for wi,xi in zip(w,x))
def make_blobs(n=200, seed=0):
    random.seed(seed)
    X, Y = [], []
    for _ in range(n//2):
        X.append([random.gauss(-1,0.5), random.gauss(0,0.5)]); Y.append(0.0)
    for _ in range(n//2):
        X.append([random.gauss( 1,0.5), random.gauss(0,0.5)]); Y.append(1.0)
    return X, Y
def train(hidden=8, epochs=800, lr=0.05):
    X, Y = make_blobs()
    in_dim = 2
    W1 = [[random.uniform(-0.5,0.5) for _ in range(in_dim)] for _ in range(hidden)]
    b1 = [0.0]*hidden
    W2 = [random.uniform(-0.5,0.5) for _ in range(hidden)]
    b2 = 0.0
    for ep in range(epochs):
        dW1 = [[0.0]*in_dim for _ in range(hidden)]
        db1 = [0.0]*hidden
        dW2 = [0.0]*hidden
        db2 = 0.0
        loss = 0.0
        for x, y in zip(X, Y):
            # forward pass: ReLU hidden layer, then sigmoid output
            z1 = [dot(W1[j], x) + b1[j] for j in range(hidden)]
            h = [relu(z) for z in z1]
            p = sigmoid(dot(W2, h) + b2)
            loss += -(y*math.log(p+1e-9) + (1-y)*math.log(1-p+1e-9))
            # backward pass: chain rule from the loss back to every parameter
            dlogit = p - y
            db2 += dlogit
            for j in range(hidden):
                dW2[j] += dlogit * h[j]
                dh = dlogit * W2[j] * drelu(z1[j])
                db1[j] += dh
                for i in range(in_dim):
                    dW1[j][i] += dh * x[i]
        # update: w ← w − lr * dL/dw, averaged over the dataset
        n = len(X)
        b2 -= lr * db2 / n
        for j in range(hidden):
            W2[j] -= lr * dW2[j] / n
            b1[j] -= lr * db1[j] / n
            for i in range(in_dim):
                W1[j][i] -= lr * dW1[j][i] / n
        if ep % 100 == 0: print(f"epoch={ep} loss={loss/n:.4f}")
train()
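For comparison, here is roughly how the same training step looks with TensorFlow 2 and Keras (a sketch, not a drop-in equivalent of the code above):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu"),       # hidden layer (built on first call)
    tf.keras.layers.Dense(1, activation="sigmoid"),    # output probability
])
opt = tf.keras.optimizers.SGD(learning_rate=0.05)
loss_fn = tf.keras.losses.BinaryCrossentropy()

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        p = model(x, training=True)                    # forward pass
        loss = loss_fn(y, p)
    grads = tape.gradient(loss, model.trainable_variables)        # backward pass
    opt.apply_gradients(zip(grads, model.trainable_variables))    # parameter update
    return loss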
Let's test your knowledge. Click the correct answer from the options.
Why is eager mode helpful during model development?
Click the option that best answers the question.
- It hides errors until compile time
- It executes ops immediately so you can print tensors and inspect shapes
- It prevents you from building graphs
Input Pipelines: Feeding the Beast
Large models are hungry. tf.data pipelines let you read, shuffle, batch, prefetch, and parallelize input work so your accelerators never starve. Conceptually: a conveyor belt that keeps the GPU fed while the CPU prepares the next batch.
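A minimal pipeline sketch (assuming TensorFlow 2; the data here is synthetic):

import tensorflow as tf

xs = tf.random.normal([1000, 2])
ys = tf.cast(xs[:, 0] > 0, tf.float32)

ds = (tf.data.Dataset.from_tensor_slices((xs, ys))
      .shuffle(1000)                         # randomize example order
      .batch(32)                             # group examples into batches
      .prefetch(tf.data.AUTOTUNE))           # prepare the next batch while the model trains

for batch_x, batch_y in ds.take(1):
    print(batch_x.shape, batch_y.shape)      # (32, 2) (32,)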
Distribution Strategies: Use All the Compute
tf.distribute.Strategy scales training across multiple GPUs/TPUs or machines. It shards your batches, runs replicas in parallel, and reduces gradients correctly (e.g., all-reduce).
Mental model: many workers push identical sleds uphill (same model) on different snow lanes (data shards), then share what they learned at checkpoints.
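A sketch of the most common case, MirroredStrategy for multiple GPUs on one machine (assuming TensorFlow 2; with a single device it still runs, just with one replica):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created in this scope are mirrored on every replica.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
    model.compile(optimizer="sgd", loss="binary_crossentropy")
# model.fit(dataset) would now shard each batch across replicas and all-reduce the gradients.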

Wrap-Up
TensorFlow’s secret sauce is representing math as graphs of tensor ops, then using autodiff and device kernels to train models at scale. You write the math; TensorFlow handles the gradients, speed, and hardware. Once this mental model clicks, using high-level APIs (Keras) feels like calling a well-tuned orchestra: you conduct, it performs.
One Pager Cheat Sheet
- Instead of hand-coding rules, you build a machine that learns rules: a stack of layers organized as a computational graph. TensorFlow builds the graph, runs it efficiently on CPUs/GPUs/TPUs, and optimizes the machine using gradients, combining Tensors, Flow, Autodiff, and execution engines.
- TensorFlow represents computations as a graph of ops connected by tensors, uses automatic differentiation to compute gradients of a loss with respect to variables (automating the backward pass from your forward pass), executes efficiently across devices (CPU/GPU/TPU) with scheduling and batching, and provides optimizers like SGD and Adam to update model parameters.
- A tensor is a container for numbers with a shape (e.g., [batch, height, width, channels]), described by rank (number of dimensions) and dtype (numeric type like float32); think of it as a spreadsheet with many tabs that operations broadcast and combine.
- TensorFlow lets you write imperative Python (eager mode) that you can tf.function-trace into a computational graph, a DAG of ops (e.g., MatMul, Add, Conv2D) with tensors on the edges, so the TensorFlow runtime acts like a construction crew that reads the blueprint and executes it efficiently across hardware.
- Autodiff uses the chain rule to compute how changing each weight affects the loss: a forward pass computes predictions, a backward pass propagates ∂loss/∂node, and the resulting gradients let an optimizer update variables via w ← w − η * ∂L/∂w, with η the learning rate; without gradients, the model can’t learn.
- Using raw Python (no libraries), we build a mini autodiff engine whose Tensor remembers how it was made so we can backprop, much like TensorFlow’s GradientTape: track ops → compute loss → call backward() → update weights.
- Automatic differentiation must record the computational graph, i.e., the history of how each tensor was computed (the op, its inputs and output, and the node’s grad_fn), because it applies the chain rule to composed operations and needs that structure to backpropagate and accumulate the local derivatives.
- TensorFlow offers eager execution (ops run immediately for easy debugging) and graph mode (which traces Python into a static graph to enable optimizations like kernel fusion, device placement, and parallelism); TF 2 defaults to eager with optional graph compilation for production.
- TensorFlow treats a kernel as the device-specific implementation of an op, assigns work to CPU (control-heavy), GPU (massive parallel math), or TPU (matrix-math ASIC), and picks placements, manages memory copies, and runs kernels in parallel streams to maximize speed.
- Tensors have shape and dtype as contracts: mismatches lead to errors or implicit broadcasting, and frameworks like TensorFlow validate compatibility (so MatMul won’t try to multiply [3,4] by [5,6]); rule of thumb: check shapes first, then types, when something breaks.
- Variables are updatable tensors, and optimizers compute updates from gradients, e.g., SGD (w ← w − η g), Momentum (adds velocity to smooth updates), and Adam (adaptive per-parameter rates using mean and variance estimates); under the hood these are additional ops in the graph that read gradients and write new values.
- A tf.Variable is the mutable, persisted storage for model parameters (unlike a tf.Tensor, which is immutable), so it persists across training steps; it can be marked trainable (e.g., tf.Variable(..., trainable=True)), its gradients (collected via tf.trainable_variables() or model.trainable_weights) are applied by optimizers through assignment ops (e.g., v.assign_sub(...)) that may create slot tf.Variables for optimizer state, and all variables (including non-trainable ones like batch-norm moving averages) are saved and restored with checkpoints (e.g., tf.train.Checkpoint / tf.train.Saver).
- A two-layer MLP for binary classification, implemented from scratch with no external libraries, explicitly performs the forward pass, computes backward gradients, and applies parameter updates, the same steps TensorFlow automates with GradientTape and optimizer.apply_gradients.
- Because eager execution runs operations immediately, it gives you inspectable values and natural Python control flow, which makes debugging, shape/gradient verification, and rapid iteration much easier (e.g., with GradientTape, tensor.numpy(), print) and supports a smooth path to production via @tf.function, though it is slower than graph mode and still requires final testing under tracing.
- Because large models are hungry, tf.data pipelines (reading, shuffling, batching, prefetching, and parallelizing) act as a conveyor belt that keeps the GPU fed so your accelerators never starve.
- tf.distribute.Strategy scales training across multiple GPUs/TPUs or machines by sharding batches, running replicas in parallel, and reducing gradients correctly (e.g., all-reduce), so workers train identical models on different data shards and share what they learn.
- TensorFlow represents math as graphs of tensor ops and uses autodiff and device kernels to handle gradients, performance, and hardware; once this mental model clicks, using Keras feels like conducting a well-tuned orchestra.