Measuring Latency and Throughput
Why this matters for HFT engineers (beginner-friendly)
- In HFT, the difference between 1,500 ns and 2,500 ns per tick can change whether your order wins a trade or not. Think of latency like a fast break in basketball: a small delay can be the difference between an easy layup and a contested shot.
- Throughput (`ops/sec`) is how many ticks your system can handle per second, like how many possessions a team can run in a game.
Quick ASCII diagram: where measurement fits in the pipeline
```
[Market feed NIC] --(packets)--> [Capture / Handler] --(parse)--> [Strategy inner-loop] --(orders)--> [Exchange gateway]
        ^                                                                                                   |
        |------------------------------------ instrument (timestamps) --------------------------------------|
```
- The critical path (where latency matters) is from packet arrival to order emission.
- We measure: per-event latency (ns) and overall throughput (ops/sec).
Core approaches and tools (what to reach for)
- Software timers: use `std::chrono::steady_clock` in C++, `time.perf_counter()` in Python, `System.nanoTime()` in Java, `clock_gettime(CLOCK_MONOTONIC_RAW)` in C, `performance.now()` in JS. These give you program-side timings.
- Kernel/hardware timestamps: NICs and kernel support (`SO_TIMESTAMPING` or PTP). These give lower-level absolute times and remove user-space scheduling jitter.
- Packet capture: `tcpdump -tt -i eth0 -w out.pcap`, then analyze timestamps with Wireshark. Use hardware timestamping where available.
- Profilers and counters: `perf record` / `perf stat` for CPU metrics and hotspots. `perf` helps you find the hot function you should optimize.
Commands (beginner-safe examples)
- Capture packets (software timestamps): `sudo tcpdump -i eth0 -w feed.pcap`
- Profile CPU to find hotspots: `sudo perf record -F 99 -- ./your_binary`, then `sudo perf report --stdio`
- Check NIC timestamping capability: `ethtool -T eth0`
How to interpret measurements (simple rules)
- Look at percentiles, not just the average: `p95` and `p99` show the tail latency that kills HFT performance.
- Correlate throughput and latency: higher throughput often raises latency (queueing).
- Watch for long tails caused by GC, page faults, IRQs, or CPU frequency scaling.
Analogy to basketball (keep it intuitive)
- Average latency = team's average shot time.
- p99 latency = worst possession in the last 100 possessions (the play that cost you the game).
- Throughput = possessions per minute.
The supplied C++ example (in the code block) shows a reproducible microbenchmark:
- It builds deterministic ticks (`vector<Tick>`) so results are reproducible.
- It measures per-tick latency (nanoseconds) and computes `min`, `avg`, `p50`, `p95`, `p99`, `max`, and `ops/sec`.
- It prints SLO breaches for a simple service-level check.
Beginner challenges (try these after running the code)
- Change `ITERATIONS` to `10000` and then `500000`. How do `ops/sec` and `p99` change?
- Toggle the `heavy` boolean to `true` to simulate a slower inner loop (like an unoptimized Python hotspot migrated to C++). What happens to throughput?
- Replace the synthetic price generator with a replay from CSV: read timestamps and prices into `ticks` and rerun the benchmark.
- Implement the same microbenchmark in Python using `time.perf_counter()` and compare `ops/sec`. (Hint: Python will be much slower per-op; that's why we migrate hotspots.)
Practical next steps and what to measure in the field
- For network I/O benchmarks, use pcap with hardware timestamps when possible and compute hop-to-order latency.
- Use `perf` to see if allocations, syscalls, or branch mispredictions dominate the time.
- Establish SLOs early (e.g., p99 < 5 µs) and continuously measure against them; alert when breached.
Try a small modification now (exercise):
- Edit the C++ example and:
  - increase `ITERATIONS` by 10x,
  - or add `std::this_thread::sleep_for(std::chrono::nanoseconds(2000));` inside the loop to simulate NIC queueing jitter,
  - or switch to the `heavy` workload.

Observe how the numbers change (`min`, `p95`, `p99`, and `ops/sec`). Understanding how these metrics move when you change the workload or environment is the key skill here.
#include <chrono>
#include <cstdint>
#include <vector>
using namespace std;
using ns = std::chrono::nanoseconds;
using Clock = std::chrono::steady_clock;
// Simple deterministic PRNG for reproducible "ticks" (no <random> overhead)
uint32_t lcg(uint32_t &state) {
state = state * 1664525u + 1013904223u;
return state;
}
// Synthetic tick: timestamp + price
struct Tick { uint64_t ts_ns; double price; };
// Simulated processing workload: a small amount of math per tick
inline double process_tick_fast(const Tick &t) {
// cheap arithmetic that an HFT inner loop might do
double p = t.price;
// combine a few ops to simulate feature extraction
return (p * 1.0001 + p / 123.456 - (p > 100.0 ? 0.42 : 0.21));
}
inline double process_tick_heavy(const Tick &t) {