Kernel and Network Stack Tuning for Minimal Latency
When building HFT systems for algorithmic trading, every microsecond counts. The kernel and network stack are the stage crew moving the packets from the wire to your strategy code — if they fumble, your execution timing (and P&L) suffers.
- Goal (this screen): give practical knobs you can change safely and a tiny C++ experiment that demonstrates why CPU affinity and polling vs. kernel wakeups matter. You're coming from Java/C/Python/JS — think of `isolcpus` and IRQ affinity like telling the OS "don't interrupt my star player during the buzzer-beater".
Quick mental model (ASCII)
```
[NIC] --hw-ts--> (NIC ring RX) --> (NIC IRQ) --> [Kernel softirq / NAPI] --> [socket / user app]
                                       |
                                       v
                                  (CPU core)
```
Important places to tune:
- IRQ affinity — bind NIC interrupts to specific CPU cores by writing to `/proc/irq/<irq>/smp_affinity`, or use `irqbalance` carefully.
- `isolcpus` — kernel boot parameter that isolates cores from the scheduler (good for dedicating cores to latency-sensitive threads).
- `PREEMPT` / real-time kernels — `CONFIG_PREEMPT` and `CONFIG_PREEMPT_RT` reduce scheduling latency.
- RX/TX ring sizes — `ethtool -g <iface>` shows and `ethtool -G <iface> rx <count> tx <count>` sets the NIC ring buffers.
- Offloads — disable GRO/GSO/TSO for accurate per-packet timing with `ethtool -K <iface> gro off gso off tso off`.
- Socket & kernel knobs — `net.core.rmem_max`, `net.core.netdev_max_backlog`, `net.core.busy_poll`, and `SO_BUSY_POLL` for polling sockets (see the sketch after this list).
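To make the socket-knob bullet concrete, here is a minimal sketch of enabling `SO_BUSY_POLL` on a plain UDP socket (an illustration, not part of the lesson's experiment). The 50 µs value simply mirrors the `sysctl -w net.core.busy_poll=50` example further down; setting this option may require `CAP_NET_ADMIN`.

```cpp
// Sketch: ask the kernel to busy-poll the NIC queue on blocking reads of this
// socket instead of sleeping until the interrupt/softirq path wakes it up.
#include <sys/socket.h>
#include <netinet/in.h>
#include <cstdio>
#include <unistd.h>

int main() {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    int busy_poll_us = 50;  // illustrative value, in microseconds
    if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                   &busy_poll_us, sizeof(busy_poll_us)) != 0) {
        perror("setsockopt(SO_BUSY_POLL)");  // e.g. EPERM without CAP_NET_ADMIN
    }

    close(fd);
    return 0;
}
```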
Why this matters in HFT terms:
- Polling (`busy-spin`) is like having a guard constantly watching the scoreboard — you pay CPU (power) for ultra-low and deterministic latency.
- Kernel wakeups (condvars, epoll) are energy efficient but introduce jitter — like waiting for the PA announcer to tell you the buzzer sounded. The sketch below contrasts the two wait patterns.
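Here is a stripped-down sketch of those two wait styles (busy-spinning on an atomic flag vs. sleeping in a `condition_variable`). It is only a rough illustration; the full experiment below measures the same contrast properly, with pinning and repeated samples. Compile it like the main program (`g++ -O2 -std=c++17 -pthread`).

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

int main() {
    using clk = std::chrono::steady_clock;
    auto ns = [](clk::duration d) {
        return (long long)std::chrono::duration_cast<std::chrono::nanoseconds>(d).count();
    };

    // 1) Busy-spin: the consumer burns a core watching an atomic flag.
    std::atomic<bool> flag{false};
    clk::time_point t_set, t_seen;
    std::thread spinner([&] {
        while (!flag.load(std::memory_order_acquire)) { /* spin */ }
        t_seen = clk::now();
    });
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    t_set = clk::now();
    flag.store(true, std::memory_order_release);
    spinner.join();
    std::printf("busy-spin notification latency: %lld ns\n", ns(t_seen - t_set));

    // 2) Kernel wakeup: the consumer sleeps in condition_variable::wait (futex),
    //    so the wake-up goes through the scheduler and costs more, with jitter.
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;
    clk::time_point t_notify, t_woken;
    std::thread sleeper([&] {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return ready; });
        t_woken = clk::now();
    });
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    {
        std::lock_guard<std::mutex> lk(m);
        ready = true;
        t_notify = clk::now();
    }
    cv.notify_one();
    sleeper.join();
    std::printf("condvar notification latency:   %lld ns\n", ns(t_woken - t_notify));
    return 0;
}
```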
Practical safe-testing rules:
- Test on a dedicated lab box (do not change kernel settings on prod network appliances).
- Keep a remote admin session and a recovery plan (rescue kernel, reboot). Use `sysctl -w` for transient changes.
- Record baselines before each change. Use `ethtool -T`, `ptp4l -m` (if PTP), `tcpdump -tt`, `perf record` / `perf top`.
Commands you will use often:
- Check timestamping/offloads: `ethtool -T eth0`, `ethtool -k eth0`
- Resize rings: `ethtool -G eth0 rx 4096 tx 512`
- Disable offloads: `ethtool -K eth0 gro off gso off tso off`
- Pin an IRQ to a CPU mask: `echo 2 > /proc/irq/<irq>/smp_affinity` (the mask is hex, so `2` selects CPU 1; be careful; see the sketch after this list)
- Transient sysctl: `sysctl -w net.core.busy_poll=50`
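Because the `smp_affinity` hex mask is an easy place to make mistakes, here is a small hypothetical helper (not one of the commands above) that builds the mask from CPU ids so you can sanity-check a value before echoing it. It assumes at most 64 CPUs; larger boxes use comma-separated 32-bit groups in `smp_affinity`.

```cpp
// Hypothetical helper: bit N of the smp_affinity mask selects CPU N.
#include <cstdint>
#include <cstdio>
#include <initializer_list>

uint64_t cpu_mask(std::initializer_list<int> cpus) {
    uint64_t mask = 0;
    for (int c : cpus) {
        if (c >= 0 && c < 64) mask |= (1ULL << c);
    }
    return mask;
}

int main() {
    std::printf("CPU 1    -> %llx\n", (unsigned long long)cpu_mask({1}));     // 2
    std::printf("CPUs 2,3 -> %llx\n", (unsigned long long)cpu_mask({2, 3}));  // c
    return 0;
}
```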
Tiny experiment (run locally)
Below is a C++ program that simulates a simple producer (market-data) and consumer (strategy) pair and measures notification latency in three scenarios:
- unpinned threads (default scheduler)
- pinned to the same core (bad)
- pinned to different cores (good)
This will help you reason about `isolcpus` and thread-pinning effects. It includes both `condition_variable` (kernel wake) and polling (busy-spin) modes. Try it on a multi-core Linux VM and change the CPU numbers (or run with the `isolcpus=` kernel param) to see the difference.
Note: This is a simulation — it doesn't change kernel IRQ routing or NIC offloads. Run real network tests separately with `pktgen` and `ethtool` once you're comfortable.
Challenge: Run the program, then:
- Change `prod_cpu` / `cons_cpu` values to match cores on your machine (try `0` and `1`).
- Switch between `use_polling = true` and `false`.
- Observe mean and max latencies. Relate improvements to what you'd expect if you used `isolcpus` and bound the NIC IRQ to a nearby core.
Now the code — save it as `main.cpp`, compile with `g++ -O2 -std=c++17 -pthread main.cpp -o tune_test`, and run `./tune_test`.
// Linux CPU-affinity APIs (pthread_setaffinity_np, cpu_set_t) plus standard headers
#include <pthread.h>
#include <sched.h>

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>
using namespace std;
using namespace std::chrono;
// Pin a std::thread to a CPU core (returns true on success)
bool pin_thread_to_cpu(std::thread &t, int cpu) {
if (cpu < 0) return true; // -1 means leave unpinned
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(cpu, &cpuset);
int rc = pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &cpuset);
return rc == 0;
}
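// Latency statistics collected for one scenario (values in nanoseconds)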
struct Results {
double mean_ns;
uint64_t max_ns;
};