Hardware and Operating System Choices for Low Latency
Low-latency algorithmic trading depends as much on hardware and OS choices as on your code. Think of the stack like a basketball team: the hardware is your roster (big, fast players) and the OS is your playbook and coach; both must be tuned so every play runs in microseconds, not milliseconds. You're coming from Java/C/JS and are a beginner in C++ and Python, so I'll keep the analogies concrete and give you a small, runnable C++ helper you can tweak.
Quick visual: data path (simplified)
```
[NIC] -> (HW timestamp) -> [Kernel / Bypass Layer]
   |                               |
   v                               v
(packets)                (DPDK / PF_RING / Onload)
   |                               |
   v                               v
[Feed Handler] -> [Strategy Hot Path] -> [Order Gateway]
```
Critical low-latency touches: the NIC (hardware timestamping, RX queue), kernel bypass (DPDK, PF_RING), CPU locality (NUMA), and BIOS/NIC options (interrupt moderation, power states).
Key hardware concepts (what to look for)
CPU
- Prefer high single-thread performance (higher clock / lower uop latency) for hot-path logic. For HFT, few fast cores often beat many slow ones.
- Disable power-saving features for predictable latency: set CPU P-states/C-states appropriately in BIOS or via `intel_pstate`/`cpupower` (a quick sanity check is sketched after this list).
- Hyperthreading: can help throughput but sometimes hurts worst-case latency due to shared execution ports; test with your workload.
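A quick way to sanity-check the power-saving point from user space is to read the cpufreq governor the kernel exposes under sysfs. This is a minimal sketch, assuming a Linux box that exposes `/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor` (paths vary by distro and driver); on a tuned low-latency host you want to see `performance` rather than `powersave` or `ondemand`.

```cpp
// check_governor.cpp: read the cpufreq scaling governor for each CPU (Linux sysfs).
// Assumption: the standard cpufreq sysfs layout is present; adjust paths for your distro.
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

int main() {
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 1;  // fallback if the library cannot detect core count
    for (unsigned cpu = 0; cpu < n; ++cpu) {
        std::string path = "/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                           "/cpufreq/scaling_governor";
        std::ifstream f(path);
        std::string governor;
        if (f >> governor) {
            std::cout << "cpu" << cpu << ": " << governor
                      << (governor == "performance" ? "" : "   <- consider 'performance' for hot-path cores")
                      << "\n";
        } else {
            std::cout << "cpu" << cpu << ": cpufreq interface not available\n";
        }
    }
    return 0;
}
```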
Cache & Memory
- Large L1/L2 is valuable. Watch cache-coherency traffic between cores — design hot-paths to be cache-local.
- NUMA: make sure your NIC and the feed-processing thread are on the same NUMA node. Cross-NUMA memory access can add tens to hundreds of nanoseconds.
NICs
- Enterprise NICs (Solarflare/Xilinx/Mellanox/Intel) have hardware timestamping, large ring buffers, and good driver tooling.
- Look for features: `RX/TX queue steering`, `RSS`, `hardware timestamping`, `SR-IOV`, and `flow director` (enabling hardware RX timestamps from a plain socket is sketched after this list).
- Consider kernel-bypass options: DPDK gives the lowest latency but adds complexity; PF_RING is easier to start with; OpenOnload helps on some hardware.
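Even before you reach for kernel bypass, Linux lets a normal socket request hardware RX timestamps via the `SO_TIMESTAMPING` socket option. The sketch below only shows enabling the option; it assumes a modern Linux kernel and a NIC/driver that support hardware timestamping and that the driver has already been configured for it (e.g. via `hwstamp_ctl`/ethtool). Timestamps then arrive as `SCM_TIMESTAMPING` control messages on `recvmsg()`, which is not shown here.

```cpp
// hw_timestamp.cpp: request hardware RX timestamps on a UDP socket (Linux-specific).
// Assumption: the NIC driver has hardware timestamping enabled; otherwise setsockopt
// may succeed but only software timestamps (or none) will be delivered.
#include <cstdio>
#include <iostream>
#include <sys/socket.h>
#include <linux/net_tstamp.h>
#include <unistd.h>

int main() {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { std::perror("socket"); return 1; }

    // Ask for raw hardware RX timestamps; add SOF_TIMESTAMPING_RX_SOFTWARE as a
    // fallback if your NIC lacks hardware support.
    int flags = SOF_TIMESTAMPING_RX_HARDWARE | SOF_TIMESTAMPING_RAW_HARDWARE;
    if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags)) < 0) {
        std::perror("setsockopt(SO_TIMESTAMPING)");  // no kernel/NIC support
    } else {
        std::cout << "SO_TIMESTAMPING requested; read timestamps from recvmsg() cmsgs\n";
    }
    close(fd);
    return 0;
}
```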
Storage/IO
- Most hot-paths avoid disk. If you must log, use asynchronous, non-blocking appenders or dedicated logging cores.
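The "dedicated logging core" idea is just a producer/consumer split: the hot path enqueues a message and returns immediately, while a background thread does the slow formatting and disk I/O. Below is a minimal sketch (the `orders.log` filename is illustrative); it uses a mutex-guarded queue for clarity, whereas production systems typically use a lock-free ring buffer and pin the consumer to a non-critical core. Build with `g++ -O2 -pthread`.

```cpp
// async_logger.cpp: hot path enqueues, a background thread writes to disk.
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

class AsyncLogger {
public:
    explicit AsyncLogger(const std::string& path)
        : out_(path), worker_([this] { drain(); }) {}

    ~AsyncLogger() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();                  // drains remaining messages before exit
    }

    // Called from the hot path: no disk I/O here, just an enqueue and a notify.
    void log(std::string msg) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(msg)); }
        cv_.notify_one();
    }

private:
    void drain() {                       // runs on the dedicated logging thread
        std::unique_lock<std::mutex> lk(m_);
        while (!done_ || !q_.empty()) {
            cv_.wait(lk, [this] { return done_ || !q_.empty(); });
            while (!q_.empty()) {
                std::string msg = std::move(q_.front());
                q_.pop();
                lk.unlock();             // write without holding the lock
                out_ << msg << '\n';
                lk.lock();
            }
        }
        out_.flush();
    }

    std::ofstream out_;
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> q_;
    bool done_ = false;
    std::thread worker_;                 // declared last so it starts after other members
};

int main() {
    AsyncLogger logger("orders.log");
    logger.log("order sent id=42");
    logger.log("fill received id=42");
    return 0;                            // destructor flushes and joins
}
```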
OS and distro choices
- Linux is the standard for HFT. Popular distros and notes:
  - `Ubuntu LTS`: friendly, modern kernels; good for development.
  - `CentOS/RHEL` or `Rocky`: often used in production, stable enterprise kernels.
  - `Debian`: stable and conservative.
- Kernel options and tuning (start on a dev box, test in staging):
  - IRQ affinity (`/proc/irq/*/smp_affinity`) and `irqbalance`: pin NIC interrupts to specific cores, and stop `irqbalance` from moving them.
  - `isolcpus=...` kernel parameter to isolate cores for real-time threads (a run-time check is sketched after this list).
  - `PREEMPT`/`PREEMPT_RT`: real-time patches can help but add complexity.
  - Network stack: tune `rx/tx` ring sizes, disable offloads selectively (`ethtool --offload`), and enable hardware timestamping if available.
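To confirm that `isolcpus` (or a cpuset) actually took effect, read the kernel's view back at run time. This is a small sketch, assuming a reasonably recent kernel that exposes `/sys/devices/system/cpu/isolated` plus the standard `sched_getaffinity(2)` call.

```cpp
// check_isolation.cpp: show isolated CPUs and this process's allowed CPU mask (Linux).
#include <fstream>
#include <iostream>
#include <sched.h>
#include <string>
#include <thread>

int main() {
    // Kernels with CPU isolation list the isolated cores here (empty if none configured).
    std::ifstream iso("/sys/devices/system/cpu/isolated");
    std::string isolated;
    std::getline(iso, isolated);
    std::cout << "isolated cpus: " << (isolated.empty() ? "(none)" : isolated) << "\n";

    // CPUs this process is currently allowed to run on.
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        std::cout << "allowed cpus:";
        for (unsigned cpu = 0; cpu < std::thread::hardware_concurrency(); ++cpu)
            if (CPU_ISSET(cpu, &mask)) std::cout << ' ' << cpu;
        std::cout << "\n";
    }
    return 0;
}
```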
BIOS / NIC tuning checklist
BIOS
- Disable C-states beyond C1 (or turn C-states off) for stable latency.
- Disable turbo if you require predictable performance (turbo can shift frequency unpredictably).
- Ensure NUMA is enabled and documented in BIOS.
NIC (ethtool and driver)
- Set `rx/tx` ring sizes to match traffic patterns (they can also be read programmatically, as sketched below).
- Use `ethtool -K` to enable/disable offloads (GSO/TSO/LRO); sometimes disabling helps latency.
- Configure IRQ affinity: pin NIC queues to the CPU cores that run your feed handlers.
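The same ring parameters that `ethtool -g`/`-G` show and set are available programmatically through the `SIOCETHTOOL` ioctl, which is handy for start-up sanity checks in your trading process. A sketch, assuming a Linux host and an interface named `eth0` (swap in your NIC's name); reading usually requires a real NIC and often root.

```cpp
// ring_params.cpp: read NIC RX/TX ring sizes via the ethtool ioctl (Linux).
#include <cstdio>
#include <cstring>
#include <iostream>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <unistd.h>

int main() {
    const char* ifname = "eth0";                  // assumption: change to your interface
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { std::perror("socket"); return 1; }

    ethtool_ringparam ring{};
    ring.cmd = ETHTOOL_GRINGPARAM;                // "get ring parameters", like ethtool -g

    ifreq ifr{};
    std::strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_data = reinterpret_cast<char*>(&ring);

    if (ioctl(fd, SIOCETHTOOL, &ifr) == 0) {
        std::cout << ifname << " rx ring: " << ring.rx_pending
                  << " / max " << ring.rx_max_pending
                  << ", tx ring: " << ring.tx_pending
                  << " / max " << ring.tx_max_pending << "\n";
    } else {
        std::perror("ioctl(SIOCETHTOOL)");        // interface missing or insufficient rights
    }
    close(fd);
    return 0;
}
```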
Colocation vs Cloud
- Colocated (on-prem or exchange colocation):
  - Best for absolute lowest latency: access to specialized NICs, direct exchange connectivity, and physical proximity.
  - You control BIOS, kernel, and hardware.
- Cloud:
  - Easier to iterate, but noisy neighbors and virtualization often add jitter.
  - Use bare-metal instances when possible (some clouds offer SR-IOV / dedicated NICs). Test end-to-end latency; don't assume advertised instance specs guarantee low tail latency.
NUMA: hands-on rule of thumb
- Keep memory and CPU on the same NUMA node as the NIC. Use `numactl --hardware` and `lscpu` to inspect the layout.
- Pin threads with `pthread_setaffinity_np` (C/C++), or use `taskset` for quick experiments (see the sketch after this list).
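Pinning a feed-handler thread in code looks like the sketch below: build a CPU mask and hand it to `pthread_setaffinity_np`. Core 2 here is just an assumption; check `numactl --hardware` to pick a core on your NIC's NUMA node. Build with `g++ -O2 -pthread`.

```cpp
// pin_thread.cpp: pin a worker thread to one core with pthread_setaffinity_np (Linux).
#include <iostream>
#include <pthread.h>
#include <sched.h>
#include <thread>

int main() {
    const int target_core = 2;             // assumption: a core on the NIC's NUMA node

    std::thread feed_handler([target_core] {
        // Pin the calling thread itself, so the hot-path work below already runs pinned.
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(target_core, &mask);
        if (pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask) != 0) {
            std::cerr << "pthread_setaffinity_np failed\n";
            return;
        }
        // ... feed-handler hot path would run here ...
        std::cout << "feed handler running on cpu " << sched_getcpu() << "\n";
    });

    feed_handler.join();
    return 0;
}
```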
Practical checklist before you deploy to production
- Verify hardware timestamps end-to-end.
- Measure tail latency, not just mean latency (the 99.9th percentile matters); a small percentile helper is sketched after this checklist.
- Build repeatable lab tests: replay market data into your stack and measure processing and send latencies.
- Keep a small config matrix and change one setting at a time — rollbacks are your friend.
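Computing the 99.9th percentile is just "sort the samples and index in", but it helps to have the helper ready before the lab. A minimal sketch with synthetic numbers standing in for measured wire-to-wire latencies (the distribution parameters are made up for illustration):

```cpp
// tail_latency.cpp: compute p50 / p99 / p99.9 from latency samples (nanoseconds).
#include <algorithm>
#include <iostream>
#include <random>
#include <vector>

// p in [0, 100]; takes the vector by value so the caller's ordering is untouched.
double percentile(std::vector<double> samples, double p) {
    size_t idx = static_cast<size_t>(p / 100.0 * (samples.size() - 1));
    std::nth_element(samples.begin(), samples.begin() + idx, samples.end());
    return samples[idx];
}

int main() {
    // Synthetic stand-in for measured latencies: mostly ~2 us with a slow tail.
    std::mt19937 rng(42);
    std::lognormal_distribution<double> dist(7.6, 0.4);   // ns, median around 2000
    std::vector<double> samples(100000);
    for (auto& s : samples) s = dist(rng);

    std::cout << "p50:   " << percentile(samples, 50.0) << " ns\n";
    std::cout << "p99:   " << percentile(samples, 99.0) << " ns\n";
    std::cout << "p99.9: " << percentile(samples, 99.9) << " ns\n";
    std::cout << "Tune, re-run, and watch p99.9; the mean will lie to you.\n";
    return 0;
}
```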
Challenge (try it — edit the C++ below)
- Run the C++ helper program below. It models `CPU`, `NIC`, and `NUMA` weights (plus a hyperthreading flag) and prints a simple score for each candidate machine.
- Try these experiments:
  - Increase `cpu_weight` if you care more about single-thread speed (typical for many trading strategies).
  - Toggle `hyperthreading` to see how it affects the recommendation string.
  - Add a new candidate for a cloud `bare-metal` instance and see how it scores.
This exercise is friendly to your Java/C/JS background: the code is plain C++ I/O and struct use — think of it like a typed version of a JSON object you might manipulate in JS or a simple Java POJO.
```cpp
// replicate this code into main.cpp and run it

#include <iostream>
#include <string>
#include <vector>

using namespace std;

struct Machine {
    string name;
    int cpu_score;     // single-thread perf (0-100)
    int nic_score;     // NIC features & hw timestamp (0-100)
    int numa_penalty;  // penalty for cross-NUMA (0-100, higher worse)
    bool hyperthreading;
};

int main() {
    // personalized touch (you like basketball? change this!)
    string favorite_player = "Kobe Bryant";

    vector<Machine> candidates = {
        {"Colo-Baremetal-1", 95, 95, 5, false},
        {"Cloud-Baremetal-XL", 88, 85, 10, true},
        {"Dev-Workstation", 80, 60, 20, true}
    };

    // Tunable weights: increase cpu_weight if single-thread matters more
    double cpu_weight = 0.45;
    double nic_weight = 0.40;
    double numa_weight = -0.15; // negative because higher penalty reduces score

    cout << "HFT Hardware Quick Scorer — tuned for low-latency strategy\n";
    cout << "Favorite player for vibes: " << favorite_player << "\n\n";

    for (const auto &m : candidates) {
        double score = m.cpu_score * cpu_weight + m.nic_score * nic_weight + m.numa_penalty * numa_weight;
        cout << "Machine: " << m.name << "\n";
        cout << "  CPU:" << m.cpu_score << " NIC:" << m.nic_score << " NUMA_penalty:" << m.numa_penalty << " HT:" << (m.hyperthreading ? "on" : "off") << "\n";
        cout << "  Composite score: " << int(score + 0.5) << "\n";

        if (m.hyperthreading && m.cpu_score > 85) {
            cout << "  Note: HT enabled on fast CPU — test for tail-latency degradation.\n";
        }

        cout << "\n";
    }

    cout << "Tips: change cpu_weight/nic_weight/numa_weight to see different trade-offs.\n";
    cout << "Try moving the feed handler to the NIC's NUMA node and re-run the scoring.\n";
    return 0;
}
```
If you're coming from Java: treat CPU pinning and NUMA the way you treat thread pools and data locality; they determine where your thread runs and which memory it touches cheaply. From JS: think of kernel bypass (DPDK) as moving from an interpreted runtime to code with direct hardware access: faster, but more responsibility.
Next step: in the lab, we'll measure baseline latency on an un-tuned VM, then apply each tuning step and watch the 99.9th percentile move. Ready to tweak the C++ weights and simulate real-world choices?