Measuring Latency and Throughput
Why this matters for HFT engineers (beginner-friendly)
- In HFT the difference between `1,500 ns` and `2,500 ns` per tick can change whether your order wins a trade or not. Think of latency like a fast-break in basketball: a small delay can be the difference between an easy layup and a contested shot.
- Throughput (`ops/sec`) is how many ticks your system can handle per second, like how many possessions a team can run in a game.
Quick ASCII diagram: where measurement fits in the pipeline
```
[Market feed NIC] --(packets)--> [Capture / Handler] --(parse)--> [Strategy inner-loop] --(orders)--> [Exchange gateway]
                                         ^                                                                  |
                                         |--------------------- instrument (timestamps) -------------------|
```
- The critical path (where latency matters) is from packet arrival to order emission.
- We measure: per-event latency (ns) and overall throughput (ops/sec).
Core approaches and tools (what to reach for)
- Software timers: use `std::chrono::steady_clock` in C++, `time.perf_counter()` in Python, `System.nanoTime()` in Java, `clock_gettime(CLOCK_MONOTONIC_RAW)` in C, `performance.now()` in JS. These give you program-side timings (see the sketch after this list).
- Kernel/hardware timestamps: NICs and kernel support (`SO_TIMESTAMPING` or PTP). These give lower-level absolute times and remove user-space scheduling jitter.
- Packet capture: `tcpdump -tt -i eth0 -w out.pcap`, then analyze timestamps with Wireshark. Use hardware timestamping where available.
- Profilers and counters: `perf record` / `perf stat` for CPU metrics and hotspots. `perf` helps find the hot function you should optimize.
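As a concrete example of program-side timing, here is a minimal C++ sketch that brackets a code region with `std::chrono::steady_clock` and reports the elapsed nanoseconds. `handle_tick` is a hypothetical stand-in for your real inner-loop work, not part of any library:

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical stand-in for the real inner-loop work.
static double handle_tick(double price) {
    return price * 1.0001 + price / 123.456;
}

int main() {
    using Clock = std::chrono::steady_clock;   // monotonic: safe for measuring intervals

    auto t0 = Clock::now();                    // timestamp before the region
    double v = handle_tick(101.25);
    auto t1 = Clock::now();                    // timestamp after the region

    auto elapsed_ns =
        std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    std::printf("handle_tick took %lld ns (result=%f)\n",
                static_cast<long long>(elapsed_ns), v);
}
```

In a real benchmark you would run this in a loop and collect the per-iteration deltas rather than timing a single call.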
Commands (beginner-safe examples)
- Capture packets (software timestamps): `sudo tcpdump -i eth0 -w feed.pcap`
- Profile CPU to find hotspots: `sudo perf record -F 99 -- ./your_binary`, then `sudo perf report --stdio`
- Check NIC timestamping capability: `ethtool -T eth0`
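If `ethtool -T` shows hardware receive timestamping, you can ask the kernel for per-packet timestamps via `SO_TIMESTAMPING`. Below is a minimal Linux sketch, not a production receiver: the UDP port `12345` is a hypothetical feed port, and error handling is trimmed for brevity:

```cpp
#include <arpa/inet.h>
#include <cstdio>
#include <cstring>
#include <linux/net_tstamp.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    // Request software and (if the NIC supports it) raw hardware RX timestamps.
    int flags = SOF_TIMESTAMPING_RX_HARDWARE | SOF_TIMESTAMPING_RX_SOFTWARE |
                SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_RAW_HARDWARE;
    setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(12345);            // hypothetical feed port
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(fd, (sockaddr *)&addr, sizeof(addr));

    char data[2048], ctrl[512];
    iovec iov{data, sizeof(data)};
    msghdr msg{};
    msg.msg_iov = &iov; msg.msg_iovlen = 1;
    msg.msg_control = ctrl; msg.msg_controllen = sizeof(ctrl);

    if (recvmsg(fd, &msg, 0) >= 0) {
        for (cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
            if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_TIMESTAMPING) {
                // Payload is three timespecs: ts[0] = software, ts[2] = raw hardware.
                timespec ts[3];
                memcpy(ts, CMSG_DATA(c), sizeof(ts));
                printf("sw=%lld.%09ld hw=%lld.%09ld\n",
                       (long long)ts[0].tv_sec, ts[0].tv_nsec,
                       (long long)ts[2].tv_sec, ts[2].tv_nsec);
            }
        }
    }
    close(fd);
}
```

Comparing the software and hardware stamps for the same packet shows how much latency the kernel path adds before your code even runs.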
How to interpret measurements (simple rules)
- Look at percentiles, not just average: `p95` and `p99` show tail latency, which kills HFT performance.
- Correlate throughput and latency: higher throughput often raises latency (queueing).
- Watch for long tails caused by GC, page faults, IRQs, or CPU frequency scaling.
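To make the percentile rule concrete, here is a minimal sketch using the nearest-rank method over a sorted sample; the toy latency values are made up for illustration:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Nearest-rank percentile over a sorted sample (p in [0, 1]).
long long percentile(const std::vector<long long> &sorted, double p) {
    size_t idx = static_cast<size_t>(p * sorted.size());
    if (idx >= sorted.size()) idx = sorted.size() - 1;
    return sorted[idx];
}

int main() {
    // Toy per-event latencies in ns; in practice, fill this from your measurement loop.
    std::vector<long long> lat = {1500, 1600, 1550, 9000, 1580, 1620, 1570, 25000};
    std::sort(lat.begin(), lat.end());
    std::printf("p50=%lldns p95=%lldns p99=%lldns\n",
                percentile(lat, 0.50), percentile(lat, 0.95), percentile(lat, 0.99));
}
```

Note how the two outliers barely move the average but dominate the high percentiles; that is exactly why tails matter.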
Analogy to basketball (keep it intuitive)
- Average latency = team's average shot time.
- p99 latency = worst possession in the last 100 possessions (the play that cost you the game).
- Throughput = possessions per minute.
The supplied C++ example (in the code block) shows a reproducible microbenchmark:
- It builds deterministic `ticks` (`vector<Tick>`) so results are reproducible.
- It measures per-tick latency (nanoseconds) and computes `min`, `avg`, `p50`, `p95`, `p99`, `max`, and `ops/sec`.
- It prints SLO breaches for a simple service-level check.
Beginner challenges (try these after running the code)
- Change `ITERATIONS` to `10000` and `500000`. How do `ops/sec` and `p99` change?
- Toggle the `heavy` boolean to `true` to simulate a slower inner loop (like an unoptimized Python hotspot migrated to C++). What happens to throughput?
- Replace the synthetic price generator with a replay from CSV: read timestamps and prices into `ticks` and rerun the benchmark (see the loader sketch after this list).
- Implement the same microbenchmark in Python using `time.perf_counter()` and compare `ops/sec`. (Hint: Python will be much slower per-op; that's why we migrate hotspots.)
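For the CSV replay challenge, here is a minimal loader sketch. The `feed.csv` format (`ts_ns,price` per line, no header) is an assumption, not from the original:

```cpp
#include <cstdint>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Same shape as the benchmark's Tick.
struct Tick { uint64_t ts_ns; double price; };

// Parse "ts_ns,price" lines into ticks; malformed lines are skipped.
std::vector<Tick> load_ticks_csv(const std::string &path) {
    std::vector<Tick> ticks;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ss(line);
        std::string ts_field, price_field;
        if (std::getline(ss, ts_field, ',') && std::getline(ss, price_field)) {
            try {
                ticks.push_back(Tick{std::stoull(ts_field), std::stod(price_field)});
            } catch (...) { /* skip malformed line */ }
        }
    }
    return ticks;
}
```

Swap this in for the synthetic generator, rerun, and compare percentiles between synthetic and replayed data.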
Practical next steps and what to measure in the field
- For network I/O benchmarks, use pcap with hardware timestamps when possible and compute hop-to-order latency.
- Use `perf` to see if allocations, syscalls, or branch mispredictions dominate the time.
- Establish SLOs early (e.g., p99 < 5 us) and continuously measure against them; alert when breached.
Try a small modification now (exercise):
- Edit the C++ example and:
  - increase `ITERATIONS` by 10x,
  - or add `std::this_thread::sleep_for(std::chrono::nanoseconds(2000));` inside the loop to simulate NIC queueing jitter,
  - or switch to the `heavy` workload.
Observe how the numbers change (min, p95, p99 and ops/sec). Understanding how these metrics move when you change workload or environment is the key skill here.
```cpp
#include <chrono>
#include <cstdint>

using namespace std;
using ns = std::chrono::nanoseconds;
using Clock = std::chrono::steady_clock;

// Simple deterministic PRNG for reproducible "ticks" (no <random> overhead)
uint32_t lcg(uint32_t &state) {
    state = state * 1664525u + 1013904223u;
    return state;
}

// Synthetic tick: timestamp + price
struct Tick { uint64_t ts_ns; double price; };

// Simulated processing workload: a small amount of math per tick
inline double process_tick_fast(const Tick &t) {
    // cheap arithmetic that an HFT inner loop might do
    double p = t.price;
    // combine a few ops to simulate feature extraction
    return (p * 1.0001 + p / 123.456 - (p > 100.0 ? 0.42 : 0.21));
}

// Heavier workload: a short floating-point loop to simulate an
// unoptimized hotspot (e.g., logic migrated from Python)
inline double process_tick_heavy(const Tick &t) {
    double p = t.price;
    for (int i = 0; i < 64; ++i) p = p * 1.0000001 + 0.0001;
    return p;
}
```
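The listing above cuts off before the driver. Below is a minimal `main()` sketch consistent with the behavior described earlier (deterministic `ticks`, per-tick latencies, `min`/`avg`/`p50`/`p95`/`p99`/`max`, `ops/sec`, and an SLO breach count). The constants `ITERATIONS = 100000` and the 5 us SLO are assumed defaults, not taken from the original; append this to the block above:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    const size_t ITERATIONS = 100000;   // assumed default; try 10000 and 500000
    const bool heavy = false;           // toggle to simulate the slow inner loop
    const long long SLO_NS = 5000;      // assumed SLO: per-tick latency under 5 us

    // Deterministic input so runs are reproducible
    uint32_t seed = 42;
    vector<Tick> ticks(ITERATIONS);
    for (size_t i = 0; i < ITERATIONS; ++i)
        ticks[i] = Tick{i * 1000ull, 100.0 + (lcg(seed) % 1000) / 100.0};

    vector<long long> lat(ITERATIONS);
    double sink = 0.0;                  // accumulate results to defeat dead-code elimination
    auto start = Clock::now();
    for (size_t i = 0; i < ITERATIONS; ++i) {
        auto t0 = Clock::now();
        sink += heavy ? process_tick_heavy(ticks[i]) : process_tick_fast(ticks[i]);
        lat[i] = chrono::duration_cast<ns>(Clock::now() - t0).count();
    }
    double secs = chrono::duration_cast<ns>(Clock::now() - start).count() / 1e9;

    sort(lat.begin(), lat.end());
    auto pct = [&](double p) {
        size_t idx = static_cast<size_t>(p * ITERATIONS);
        return lat[idx < ITERATIONS ? idx : ITERATIONS - 1];
    };
    long long sum = 0, breaches = 0;
    for (long long v : lat) { sum += v; if (v > SLO_NS) ++breaches; }

    printf("min=%lld avg=%lld p50=%lld p95=%lld p99=%lld max=%lld (ns)\n",
           lat.front(), sum / (long long)ITERATIONS, pct(0.50), pct(0.95), pct(0.99), lat.back());
    printf("ops/sec=%.0f  SLO breaches (>%lldns)=%lld  sink=%.3f\n",
           ITERATIONS / secs, SLO_NS, breaches, sink);
}
```

Compile with optimizations (e.g., `g++ -O2`) so the fast path is not artificially slow, and run it a few times: the percentiles will wobble with scheduling noise, which is itself a useful lesson.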

