
Kernel and Network Stack Tuning for Minimal Latency

When building HFT systems for algorithmic trading, every microsecond counts. The kernel and network stack are the stage crew moving the packets from the wire to your strategy code — if they fumble, your execution timing (and P&L) suffers.

  • Goal (this screen): practical knobs you can change safely, plus a tiny C++ experiment that demonstrates why CPU affinity and polling vs. kernel wakeups matter. You're coming from Java/C/Python/JS — think of isolcpus and IRQ affinity as telling the OS "don't interrupt my star player during the buzzer-beater".

Quick mental model (ASCII)

[NIC] --hw-ts--> (NIC ring RX) --> (NIC IRQ) --> [Kernel softirq / NAPI] --> [socket / user app]
                                        |
                                        v
                                   (CPU core)

Important places to tune:

  • IRQ affinity — bind NIC interrupts to specific CPU cores by writing to /proc/irq/<irq>/smp_affinity or using irqbalance carefully.
  • isolcpus — kernel boot parameter to isolate cores from the scheduler (good for dedicating cores to latency sensitive threads).
  • PREEMPT / real-time kernels — CONFIG_PREEMPT, CONFIG_PREEMPT_RT reduce scheduling latency.
  • RX/TX ring sizes — ethtool -g <iface> shows the NIC ring buffers and ethtool -G <iface> rx <count> tx <count> resizes them.
  • Offloads — disable GRO/GSO/TSO for accurate per-packet timing with ethtool -K <iface> gro off gso off tso off.
  • Socket & kernel knobs — net.core.rmem_max, net.core.netdev_max_backlog, net.core.busy_poll and SO_BUSY_POLL for polling sockets (see the sketch after this list).
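
Two of these knobs can be applied from inside your application on Linux. The sketch below is illustrative only: it pins the calling thread to a core (ideally one you reserved with isolcpus=) and requests socket busy polling via SO_BUSY_POLL. The core number (2) and the 50-microsecond budget are placeholder values, so adapt them to your machine.

// pin_and_poll.cpp: illustrative sketch. Core 2 and the 50 us budget are placeholders.
// Build: g++ -O2 -std=c++17 -pthread pin_and_poll.cpp -o pin_and_poll
#include <cerrno>
#include <cstring>
#include <iostream>
#include <pthread.h>
#include <sched.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    // 1) Pin this thread to core 2 (ideally a core listed in isolcpus=).
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0)
        std::cerr << "affinity failed: " << std::strerror(rc) << "\n";

    // 2) Ask the kernel to busy-poll this socket's receive queue for up to
    //    50 microseconds before sleeping (needs driver support and, on older
    //    kernels, CAP_NET_ADMIN).
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int busy_poll_us = 50;
    if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &busy_poll_us, sizeof(busy_poll_us)) != 0)
        std::cerr << "SO_BUSY_POLL failed: " << std::strerror(errno) << "\n";

    // ... bind() and recv() your market data here, on the pinned core ...
    close(fd);
    return 0;
}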

Why this matters in HFT terms:

  • Polling (busy-spin) is like having a guard constantly watching the scoreboard — you pay CPU (power) for ultra-low and deterministic latency.
  • Kernel wakeups (condvars, epoll) are energy efficient but introduce jitter — like waiting for the PA announcer to tell you the buzzer sounded.

Practical safe-testing rules:

  • Test on a dedicated lab box (do not change kernel settings on prod network appliances).
  • Keep a remote admin session and a recovery plan (rescue kernel, reboot). Use sysctl -w for transient changes.
  • Record baselines before each change. Use ethtool -T, ptp4l -m (if PTP), tcpdump -tt, perf record/perf top.

Commands you will use often:

  • Check timestamping/offloads: ethtool -T eth0, ethtool -k eth0
  • Resize rings: ethtool -G eth0 rx 4096 tx 512 (readable from code too; see the ioctl sketch after this list)
  • Disable offloads: ethtool -K eth0 gro off gso off tso off
  • Pin an IRQ to a CPU: echo 2 > /proc/irq/<irq>/smp_affinity (the mask is hex: 2 = CPU 1, 4 = CPU 2, f = CPUs 0-3; be careful)
  • Transient sysctl: sysctl -w net.core.busy_poll=50
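
If you want to record ring-size baselines programmatically, the parameters that ethtool -g prints are also available through the SIOCETHTOOL ioctl. A minimal sketch, assuming a Linux box and using eth0 as a placeholder interface name:

// ring_baseline.cpp: read NIC RX/TX ring sizes (what ethtool -g shows).
// "eth0" is a placeholder; pass your interface name as the first argument instead.
// Build: g++ -O2 -std=c++17 ring_baseline.cpp -o ring_baseline
#include <cstring>
#include <iostream>
#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

int main(int argc, char** argv) {
    const char* ifname = (argc > 1) ? argv[1] : "eth0";

    int fd = socket(AF_INET, SOCK_DGRAM, 0);      // any socket works as an ioctl handle

    ethtool_ringparam ring{};
    ring.cmd = ETHTOOL_GRINGPARAM;                // "get ring parameters"

    ifreq ifr{};
    std::strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_data = reinterpret_cast<char*>(&ring);

    if (ioctl(fd, SIOCETHTOOL, &ifr) != 0) {
        std::cerr << "SIOCETHTOOL failed for " << ifname << "\n";
        close(fd);
        return 1;
    }

    std::cout << ifname
              << "  rx " << ring.rx_pending << "/" << ring.rx_max_pending
              << "  tx " << ring.tx_pending << "/" << ring.tx_max_pending << "\n";
    close(fd);
    return 0;
}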

Tiny experiment (run locally)

Below is a C++ program that simulates a simple producer (market-data) and consumer (strategy) pair and measures notification latency in three scenarios:

  • unpinned threads (default scheduler)
  • pinned to the same core (bad)
  • pinned to different cores (good)

This will help you reason about isolcpus and thread pinning effects. It includes both condition_variable (kernel wake) and polling (busy-spin) modes. Try it on a multi-core Linux VM and change the CPU numbers (or run with isolcpus= kernel param) to see the difference.

Note: This is a simulation — it doesn't change kernel IRQ routing or NIC offloads. Run real network tests separately with pktgen and ethtool once you're comfortable.

Challenge: Run the program, then:

  • Change prod_cpu/cons_cpu values to match cores on your machine (try 0 and 1).
  • Switch between use_polling = true and false.
  • Observe mean and max latencies. Relate improvements to what you'd expect if you used isolcpus and bound the NIC IRQ to a nearby core.

Now the code — save it as main.cpp, compile with g++ -O2 -std=c++17 -pthread main.cpp -o tune_test, and run ./tune_test.
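
The listing here is a minimal sketch that follows the description above: a producer/consumer pair with the prod_cpu, cons_cpu and use_polling knobs near the top. The 100-microsecond publish gap and the 10,000-iteration count are arbitrary choices, so feel free to change them.

// main.cpp: a sketch of the experiment described above. The 100 us publish gap
// and the 10,000-iteration count are arbitrary choices, not anything canonical.
// Build: g++ -O2 -std=c++17 -pthread main.cpp -o tune_test
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

// --- knobs to edit for the challenge below ---------------------------------
int  prod_cpu    = 0;      // core for the "market data" producer (-1 = unpinned)
int  cons_cpu    = 1;      // core for the "strategy" consumer    (-1 = unpinned)
bool use_polling = true;   // true = busy-spin, false = condition_variable wake
constexpr int kIters = 10000;
// ----------------------------------------------------------------------------

using Clock = std::chrono::steady_clock;

int64_t now_ns() {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
        Clock::now().time_since_epoch()).count();
}

void pin_to_cpu(int cpu) {
    if (cpu < 0) return;                      // leave the thread unpinned
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);  // errors ignored here
}

std::atomic<int>     seq{0};         // publish counter, 0 = nothing published yet
std::atomic<int64_t> publish_ns{0};  // timestamp taken just before each publish
std::mutex mtx;
std::condition_variable cv;

int main() {
    std::vector<double> lat_us;
    lat_us.reserve(kIters);

    std::thread producer([] {
        pin_to_cpu(prod_cpu);
        for (int i = 1; i <= kIters; ++i) {
            std::this_thread::sleep_for(std::chrono::microseconds(100)); // fake packet gap
            publish_ns.store(now_ns(), std::memory_order_relaxed);
            if (use_polling) {
                seq.store(i, std::memory_order_release);
            } else {
                {
                    std::lock_guard<std::mutex> lk(mtx);      // store under the lock
                    seq.store(i, std::memory_order_release);  // so no wakeup is lost
                }
                cv.notify_one();
            }
        }
    });

    std::thread consumer([&lat_us] {
        pin_to_cpu(cons_cpu);
        int last = 0;
        while (last < kIters) {
            int cur = 0;
            if (use_polling) {
                // busy-spin until a new sequence number appears
                while ((cur = seq.load(std::memory_order_acquire)) == last) {}
            } else {
                // sleep until the producer notifies us
                std::unique_lock<std::mutex> lk(mtx);
                cv.wait(lk, [&] {
                    cur = seq.load(std::memory_order_acquire);
                    return cur != last;
                });
            }
            lat_us.push_back((now_ns() - publish_ns.load(std::memory_order_relaxed)) / 1000.0);
            last = cur;
        }
    });

    producer.join();
    consumer.join();

    double sum = 0, mx = 0;
    for (double v : lat_us) { sum += v; if (v > mx) mx = v; }
    std::cout << (use_polling ? "polling" : "condvar")
              << "  prod_cpu=" << prod_cpu << " cons_cpu=" << cons_cpu
              << "  samples=" << lat_us.size()
              << "  mean=" << sum / lat_us.size() << " us"
              << "  max=" << mx << " us\n";
    return 0;
}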
