Measuring Latency and Throughput
Why this matters for HFT engineers (beginner-friendly)
- In HFT, the difference between 1,500 ns and 2,500 ns per tick can change whether your order wins a trade or not. Think of latency like a fast break in basketball: a small delay can be the difference between an easy layup and a contested shot.
- Throughput (`ops/sec`) is how many ticks your system can handle per second, like how many possessions a team can run in a game.
Quick ASCII diagram: where measurement fits in the pipeline
```
[Market feed NIC] --(packets)--> [Capture / Handler] --(parse)--> [Strategy inner-loop] --(orders)--> [Exchange gateway]
        ^                                                                                                   |
        |------------------------------------ instrument (timestamps) --------------------------------------|
```
- The critical path (where latency matters) is from packet arrival to order emission.
- We measure: per-event latency (ns) and overall throughput (ops/sec).
Core approaches and tools (what to reach for)
- Software timers: use `std::chrono::steady_clock` in C++, `time.perf_counter()` in Python, `System.nanoTime()` in Java, `clock_gettime(CLOCK_MONOTONIC_RAW)` in C, `performance.now()` in JS. These give you program-side timings.
- Kernel/hardware timestamps: NICs and kernel support (`SO_TIMESTAMPING` or PTP). These give lower-level absolute times and remove user-space scheduling jitter.
- Packet capture: `tcpdump -tt -i eth0 -w out.pcap`, then analyze timestamps with Wireshark. Use hardware timestamping where available.
- Profilers and counters: `perf record` / `perf stat` for CPU metrics and hotspots. `perf` helps you find the hot function you should optimize.
Commands (beginner-safe examples)
- Capture packets (software timestamps): `sudo tcpdump -i eth0 -w feed.pcap`
- Profile CPU to find hotspots: `sudo perf record -F 99 -- ./your_binary`, then `sudo perf report --stdio`
- Check NIC timestamping capability: `ethtool -T eth0`
How to interpret measurements (simple rules)
- Look at percentiles, not just the average: `p95` and `p99` show the tail latency that kills HFT performance.
- Correlate throughput and latency: higher throughput often raises latency (queueing).
- Watch for long tails caused by GC, page faults, IRQs, or CPU frequency scaling.
Analogy to basketball (keep it intuitive)
- Average latency = team's average shot time.
- p99 latency = worst possession in the last 100 possessions (the play that cost you the game).
- Throughput = possessions per minute.
The supplied C++ example (in the code block) shows a reproducible microbenchmark:
- It builds deterministic ticks (`vector<Tick>`) so results are reproducible.
- It measures per-tick latency (nanoseconds) and computes `min`, `avg`, `p50`, `p95`, `p99`, `max`, and `ops/sec`.
- It prints SLO breaches for a simple service-level check.
Beginner challenges (try these after running the code)
- Change `ITERATIONS` to `10000` and then `500000`. How do `ops/sec` and `p99` change?
- Toggle the `heavy` boolean to `true` to simulate a slower inner loop (like an unoptimized Python hotspot migrated to C++). What happens to throughput?
- Replace the synthetic price generator with a replay from CSV: read timestamps and prices into `ticks` and rerun the benchmark.
- Implement the same microbenchmark in Python using `time.perf_counter()` and compare `ops/sec`. (Hint: Python will be much slower per-op; that's why we migrate hotspots.)
Practical next steps and what to measure in the field
- For network I/O benchmarks, use pcap with hardware timestamps when possible and compute hop-to-order latency.
- Use `perf` to see if allocations, syscalls, or branch mispredictions dominate the time.
- Establish SLOs early (e.g., p99 < 5 µs) and continuously measure against them; alert when breached.
Try a small modification now (exercise):
- Edit the C++ example and:
  - increase `ITERATIONS` by 10x,
  - or add `std::this_thread::sleep_for(std::chrono::nanoseconds(2000));` inside the loop to simulate NIC queueing jitter,
  - or switch to the `heavy` workload.

Observe how the numbers change (`min`, `p95`, `p99`, and `ops/sec`). Understanding how these metrics move when you change the workload or environment is the key skill here.
#include <chrono>
#include <cstdint>
#include <vector>
using namespace std;
using ns = std::chrono::nanoseconds;
using Clock = std::chrono::steady_clock;
// Simple deterministic PRNG for reproducible "ticks" (no <random> overhead)
uint32_t lcg(uint32_t &state) {
state = state * 1664525u + 1013904223u;
return state;
}
// Synthetic tick: timestamp + price
struct Tick { uint64_t ts_ns; double price; };
// Simulated processing workload: a small amount of math per tick
inline double process_tick_fast(const Tick &t) {
// cheap arithmetic that an HFT inner loop might do
double p = t.price;
// combine a few ops to simulate feature extraction
return (p * 1.0001 + p / 123.456 - (p > 100.0 ? 0.42 : 0.21));
}
inline double process_tick_heavy(const Tick &t) {