Introduction to the Course and Goals
Welcome! If you're a software engineer curious about algorithmic trading and have a beginner background in C++, Python, Java, C, or JavaScript, this short orientation will get you grounded in what to expect.
- Course goal: Build practical, low-latency HFT components and concepts — from market data ingestion to a minimal order gateway — implemented in both C++ (performance-critical) and Python (rapid prototyping). Think: "from notebook prototype to a tiny, fast microservice."
- Target outcomes: Be able to design and implement latency-sensitive pieces of a trading stack, understand kernel/network tweaks, measure latency, and move hot paths from Python to C++ safely.
- Who this is for: Engineers with basic familiarity in languages like C++, Python, Java, C, or JS who want to learn HFT systems engineering.
- Time commitment: Plan ~6–8 hours/week for ~8–12 weeks (labs + reading). Labs are incremental: each builds on the previous — market data -> strategy -> execution -> backtesting.
High-level structure (quick map)
ASCII diagram of the minimal HFT data-flow you'll build and test:
[Exchange Multicast] --> [Market Data Handler] --> [Strategy / Alpha] --> [Order Gateway] --> [Exchange TCP]
                              |                          |                      |
                           (parser)                  (decision)        (risk/serialization)
- Market Data Handler: UDP/multicast ingestion, sequence recovery, parsing binary messages.
- Strategy / Alpha: simple momentum or spread logic (prototype in Python, migrate hot loops to C++).
- Order Gateway: TCP binary protocols or FIX glue for real order submission.
What you'll learn (concrete):
- Read and parse exchange message formats (ITCH, OUCH), implement a simple UDP multicast listener in C++.
- Prototype strategies in Python with numpy/pandas, profile, then move hotspots to C++ via pybind11.
- Measure latency with hardware timestamps and pcap traces; microbenchmark I/O and serialization.
- Kernel & NIC tuning basics: IRQ affinity, isolcpus, RX/TX rings, and offloads.
A micro-analogy (for Java/C/JS folks and even basketball fans)
Building an HFT system is like coaching a basketball team:
- Market data is the crowd noise and scoreboard — the raw sensory input.
- Strategy is your playbook; a short, decisive play is like a low-latency hot path.
- Order gateway is the point guard executing the shot — timing and coordination matter.
If your favorite player is LeBron James or Kobe Bryant, think of moving a play from a slow walk-on to an instant alley-oop — that's the jump from prototype to optimized C++.
Quick practical expectation
- Labs: small, focused; each lab includes a runnable C++ binary and a Python notebook.
- Assessments: functional correctness + latency/throughput baseline comparisons.
- Safety: all network/NIC tuning steps are demonstrated with rollback tips and VM-friendly alternatives.
Try this now
Below is a tiny C++ program that prints a course micro-plan, shows weekly hours, and prints an ASCII HFT diagram. Run it, then try the challenge at the end of the program by editing the code.
Challenge: Change module durations, add your favorite module (e.g., FPGA intro or Advanced SIMD), or replace the favorite_player with your own sports analogy to personalize output.
#include <iostream>
#include <vector>
#include <string>
#include <iomanip>

using namespace std;

struct Module {
    string name;
    int hours_per_week;
    int weeks;
};

int main() {
    // Personal touch for engineers who like basketball
    const string favorite_player = "LeBron James"; // change this to your favorite

    vector<Module> modules = {
        {"Intro & Env Setup", 4, 1},
        {"Market Data Handler (C++)", 6, 2},
        {"Strategy Prototyping (Python)", 6, 2},
        {"Order Gateway & Risk", 5, 2},
        {"Backtesting & Simulation", 4, 2},
        {"Profiling & Optimization", 5, 2}
    };

    cout << "Course: Algorithmic Trading for HFTs using C++ and Python\n";
    cout << "Goal: Build latency-sensitive trading components.\n\n";

    int total_hours = 0;
    cout << left << setw(35) << "Module" << setw(10) << "hrs/wk" << setw(8) << "weeks" << "\n";
    cout << string(60, '-') << "\n";
    for (auto &m : modules) {
        cout << left << setw(35) << m.name << setw(10) << m.hours_per_week << setw(8) << m.weeks << "\n";
        total_hours += m.hours_per_week * m.weeks;
    }

    cout << "\nEstimated total commitment: " << total_hours << " hours (spread over the course).\n";

    cout << "\nMinimal HFT stack diagram:\n";
    cout << "[Exchange Multicast] --> [Market Data Handler] --> [Strategy] --> [Order Gateway] --> [Exchange TCP]\n";
    cout << "        (parser)                 (decision)           (serialize & risk)\n";

    cout << "\nA quick tip: Prototype logic in Python, then move hot loops to C++ (pybind11).\n";
    cout << "Analogy: Make plays as quickly as " << favorite_player << " does alley-oops!\n";

    cout << "\nTry this change: edit the array of modules to add a module named 'FPGA intro' or change hours_per_week.\n";
    return 0;
}
Happy hacking — when you're ready, continue to the next screen where we'll set up the C++ toolchain and a Python virtual environment. Don't forget to modify the code above and run it to make the plan your own!
Let's test your knowledge. Click the correct answer from the options.
Which of the following statements is NOT true about the "Algorithmic Trading for HFTs using C++ and Python" course?
Click the option that best answers the question.
- You will implement latency-sensitive components (e.g., a UDP multicast market data handler) in C++.
- The course expects prior familiarity with languages like C++ or Python and basic systems/network knowledge.
- Labs are incremental and build on each other: market data → strategy → execution → backtesting.
- There is no time commitment — you can complete the entire course in one day without studying.
Who Should Take This and Prerequisites
This course is aimed at engineers who want to learn practical, low-latency algorithmic trading systems (HFT) implemented in C++ and Python. If you're a beginner in C++ & Python and have some experience in Java, C, or JavaScript, you'll fit right in — you'll reuse many concepts (threads, memory, async I/O) while learning new, performance-focused patterns.
Quick summary — should you take this?
- Yes if you: want to build latency-sensitive services, enjoy systems programming, and like squeezing performance out of code.
- Helpful background: basic programming in any language (C++, Python, Java, C, JS) — we map transferable skills for each.
- Not required but recommended: prior exposure to Linux, basic networking, and undergraduate-level probability/statistics.
Required foundations (the must-haves)
- C++ fundamentals: types, functions, classes/RAII, basic STL (vector, string), building with CMake.
- Python fundamentals: virtual environments, numpy, pandas, and writing/reading small scripts.
- Math & statistics: basic probability, expectations, variance, simple time series intuition (moving averages).
- Operating systems: familiarity with Linux commands, processes/threads, and the idea of system calls.
- Networking basics: TCP vs UDP, sockets, and the concept of multicast (exchange market-data commonly uses UDP multicast); see the listener sketch after this list.
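To make the multicast idea concrete, here is a minimal sketch of a Linux UDP multicast listener in C++. The group address 239.0.0.1 and port 30001 are illustrative placeholders, not a real feed, and error handling is kept to a bare minimum:

// multicast_listen.cpp - build: g++ -std=c++17 multicast_listen.cpp -o mlisten
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(30001);                        // placeholder port
    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof addr) < 0) {
        perror("bind"); return 1;
    }

    ip_mreq mreq{};                                      // join the multicast group
    mreq.imr_multiaddr.s_addr = inet_addr("239.0.0.1");  // placeholder group
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    if (setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof mreq) < 0) {
        perror("IP_ADD_MEMBERSHIP"); return 1;
    }

    char buf[2048];
    ssize_t n = recv(fd, buf, sizeof buf, 0);            // blocks until one datagram arrives
    std::printf("received %zd bytes\n", n);
    close(fd);
    return 0;
}

Pair it with a small UDP sender aimed at the same group and port to watch a datagram arrive.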
Transferable skills from your background
- From Java: concurrency models, threads, and JVM-managed memory — useful when learning C++ thread safety and memory management.
- From C: manual memory handling and low-level systems thinking — great prep for cache-awareness in C++.
- From JavaScript: async/event-driven patterns map nicely to event-loop-based market-data handlers.
Recommended readings & quick primers (2–10 hour windows)
- C++: "A Tour of C++" (Bjarne Stroustrup) or a short C++ crash course covering unique_ptr, move, std::vector.
- Python: Official tutorial + "Python Data Science Handbook" (Jake VanderPlas) chapters on numpy basics.
- Math/Stats: Khan Academy or a short refresher on probability & statistics (expectation, variance, conditional probability).
- OS/Networking: "Linux Basics for Hackers" (for CLI comfort) + simple socket tutorial (create a TCP and UDP echo server/client).
Preparatory exercises (small, hands-on; try 2–4 of these)
- Build and run a "Hello, build system" C++ program using CMake.
- In Python, load a CSV into pandas and compute a rolling mean and standard deviation.
- Write a small UDP sender and receiver (two programs) on your machine and observe packets with tcpdump/wireshark.
- Do a short probability exercise: compute expectation and variance of a discrete distribution.
Visual checklist (edit and track!)
[✔] `Python` scripting & `numpy`
[ ] `C++` basics: types, RAII, build with `CMake`
[ ] Math & stats: probability, variance, moving averages
[ ] Linux: basic shell, processes, top/htop
[ ] Networking: sockets, UDP/TCP, pcap/tcpdump
Short analogy to keep things friendly
- Think of Python as your playbook writer — quick to prototype plays (strategies). C++ is the point guard who finishes the alley-oop — fast and disciplined.
- Networking/OS knowledge is the stadium and court: if it's noisy or misconfigured, even the best play loses.
Challenge (interactive)
We've included a tiny, editable C++ program below. Edit the boolean flags at the top to reflect your current skills (flip false to true), recompile, and run it. The program prints a personalized prep plan and a checklist tuned to what you still need to study. If you're into basketball, change the favorite_player string to your favorite athlete (e.g., Kobe Bryant, LeBron James) for a bit of fun personalization.
Happy prepping — when you've completed 2–3 preparatory exercises from above, you'll be ready to jump into the first lab (environment setup and a minimal multicast listener).
#include <iostream>
#include <string>
#include <vector>

using namespace std;

int main() {
    // EDIT THESE: set true when you feel comfortable with the topic
    const bool know_cpp = false;        // types, RAII, STL, basic CMake
    const bool know_python = true;      // venv, numpy, pandas
    const bool know_java = true;        // transferable concurrency/OO skills
    const bool know_c = false;          // low-level memory understanding
    const bool know_js = true;          // async/event-driven patterns
    const bool know_os = false;         // Linux basics, processes, threads
    const bool know_networking = false; // sockets, UDP/TCP, multicast
    const bool know_math_stats = false; // prob., expectation, variance
    const string favorite_player = "Kobe Bryant"; // change for fun

    cout << "HFT Course — Prerequisite Self-Check\n";
    cout << "Favorite player (fun): " << favorite_player << "\n\n";

    vector<string> todo;
    if (!know_cpp) todo.push_back("C++ crash course: types, RAII, smart pointers, basic STL, CMake setup");
    if (!know_python) todo.push_back("Python: venv, numpy, pandas, small data processing scripts");
    if (!know_math_stats) todo.push_back("Math/stats refresher: expectation, variance, simple time-series concepts");
    if (!know_os) todo.push_back("Linux & OS basics: shell, processes, threads, profiling with top/htop");
    if (!know_networking) todo.push_back("Networking: sockets (TCP/UDP), pcap/tcpdump, multicast basics");
    if (know_java || know_c || know_js) {
        cout << "Good news: your Java/C/JS background transfers (threads, memory, async I/O).\n\n";
    }

    cout << "Your prep plan (" << todo.size() << " items):\n";
    for (const auto &item : todo) cout << "  [ ] " << item << "\n";
    return 0;
}
Try this exercise. Click the correct answer from the options.
Who is the ideal target audience for this course based on the prerequisites described on the previous screen?
Click the option that best answers the question.
- Complete beginners with no programming experience
- Engineers who want to build latency-sensitive systems using C++ and Python
- Purely financial traders who won't write code
- Casual Python scriptwriters who do not plan to learn C++
High-Level Architecture of an HFT System
Understand the big picture first — then we dive into code. This screen shows the main components you'll meet when building a tiny HFT service and explains the latency-critical path you must shrink. You're a beginner in C++ & Python (and have background in Java, C, JS) — so I'll point out where those languages typically live in this stack.
ASCII diagram (simple, left-to-right data flow):
[Exchange multicast / TCP] --> [NIC / Hardware timestamp] --> [Kernel / Driver] --> [Market Data Feed Handler]
                                                                                             |
                                                                                             v
                                          [Strategy Engine (decision)] --> [Order Gateway] --> [Exchange]
                                                     |                            |
                                                     v                            v
                                                  [Risk]                [Logging / Telemetry]
Key components (short, practical notes):
Market Data Feed Handler
- Role: receive, parse, and sequence-recover exchange messages (often UDP multicast / binary protocols like ITCH).
- Typical implementation: C++ for lowest latency (tight parsing, zero-copy), or Python for prototyping (slow path).
- Things to watch: copying, memory allocation, and parse branching.
Strategy Engine
- Role: use parsed market data to decide orders. Could be simple rules (crossing SMA) or complex signals.
- Typical flow: prototype algorithm quickly in Python (numpy, pandas), then move hot code paths to C++ (or bind with pybind11).
- Keep decision logic in-memory and branch-minimal for microseconds.
Order Gateway
- Role: serialize orders and send to exchange; track acknowledgements and resend logic.
- Typical implementation: low-level C++ for performance and strict socket handling.
Risk and Logging
- Risk checks should be inline and extremely fast (pre-trade); heavy risk policies are off the hot-path.
- Logging must not block: use async/batched writers, ring buffers, or route logs off-thread (see the sketch below).
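As a concrete version of "route logs off-thread", here is a minimal sketch of a single-producer/single-consumer (SPSC) ring buffer: the hot path writes into a pre-allocated slot and never blocks on I/O, while a background thread drains and formats. The event layout and sizes are illustrative, not a production logger:

// ring_log.cpp - build: g++ -std=c++17 -pthread ring_log.cpp -o ring_log
#include <array>
#include <atomic>
#include <cstdio>
#include <thread>

struct LogEvent { long seq; double price; };

template <size_t N>                       // N must be a power of two
class SpscRing {
    std::array<LogEvent, N> buf_{};
    std::atomic<size_t> head_{0}, tail_{0};
public:
    bool push(const LogEvent &e) {        // hot path: no locks, no allocation
        size_t h = head_.load(std::memory_order_relaxed);
        if (h - tail_.load(std::memory_order_acquire) == N) return false; // full: drop
        buf_[h & (N - 1)] = e;
        head_.store(h + 1, std::memory_order_release);
        return true;
    }
    bool pop(LogEvent &e) {               // consumer thread
        size_t t = tail_.load(std::memory_order_relaxed);
        if (t == head_.load(std::memory_order_acquire)) return false;     // empty
        e = buf_[t & (N - 1)];
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }
};

int main() {
    SpscRing<1024> ring;
    std::thread drain([&] {               // slow path: formatting + I/O off-thread
        LogEvent e;
        for (int got = 0; got < 100; ) {
            if (ring.pop(e)) { std::printf("seq=%ld px=%.2f\n", e.seq, e.price); ++got; }
        }
    });
    for (long i = 0; i < 100; ++i) ring.push({i, 100.0 + i * 0.01});      // hot path
    drain.join();
    return 0;
}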
Latency-critical path (what to optimize first):
- From the NIC timestamp to the bytes on the wire back to exchange: NIC -> Kernel -> Feed handler -> Strategy -> Order Gateway -> NIC.
- Focus on: zero/allocation-free parsing, cache-friendly data layout, avoiding syscalls in the hot path, and hardware timestamping (a small parsing sketch follows this list).
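A minimal sketch of the allocation-free parsing idea, assuming a hypothetical fixed-layout message (not any real exchange format): one bounded memcpy into a packed struct, with a static_assert guarding the bit-exact wire size:

// parse_sketch.cpp - build: g++ -std=c++17 parse_sketch.cpp -o parse_sketch
#include <cstdint>
#include <cstring>
#include <cstdio>

#pragma pack(push, 1)
struct FeedMsg {           // hypothetical wire layout: 8 + 8 + 4 + 1 = 21 bytes
    uint64_t seq;
    uint64_t ts_ns;
    uint32_t price_ticks;
    char     side;         // 'B' or 'S'
};
#pragma pack(pop)
static_assert(sizeof(FeedMsg) == 21, "wire layout must be bit-exact");

int main() {
    unsigned char wire[sizeof(FeedMsg)] = {0};  // pretend these bytes came off the NIC
    FeedMsg msg;
    std::memcpy(&msg, wire, sizeof msg);        // one bounded copy, no heap allocation
    std::printf("seq=%llu ticks=%u\n", (unsigned long long)msg.seq, msg.price_ticks);
    return 0;
}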
Language mapping and analogies for your background:
- If you come from Java: think of C++ here as Java without the GC — you must manage memory, but you get predictable latency with no GC pauses.
- If you come from C: same low-level control, plus modern tools (std::vector, RAII) to avoid bugs.
- If you come from JS: imagine the market feed as events on an event loop — but instead of a single-threaded loop, we design threads and lockless queues for microsecond latencies.
- Python is your rapid-prototyping notebook — don't ship it on the hot path without moving bottlenecks to C++.
Quick checklist (visual):
[ ] NIC hardware timestamping enabled
[ ] Feed handler: zero-copy parsing
[ ] Strategy: branch-light, cache-friendly data
[ ] Order gateway: async socket send, minimal syscalls
[ ] Risk: pre-trade checks inline
[ ] Logging: non-blocking, batched
Hands-on challenge (run the C++ program below):
- The C++ snippet simulates the component chain and prints per-stage and total microsecond latencies. It's a model — not a real network stack — but it helps you reason about which stages dominate.
- Try these experiments:
- Change stage latencies to see which component pushes you past the critical threshold.
- Replace the Strategy stage with a smaller value to simulate migrating Python logic to C++.
- Edit favorite_player to your favorite athlete (or coder) — a tiny personalization tie-in to keep learning playful.
Below is an executable C++ snippet that models this pipeline. Modify the stage times and rerun to explore the latency profile.
#include <chrono>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

using namespace std;
using us = chrono::microseconds;

int main() {
    // Personalize this (change to your favorite player or coder):
    string favorite_player = "Kobe Bryant"; // change for fun

    // Each pair is: (stage name, simulated latency in microseconds)
    // These numbers are coarse simulations to help you reason about hotspots.
    vector<pair<string,int>> stages = {
        {"NIC/hardware rx (hw ts)", 30},
        {"Kernel / driver copy", 20},
        {"Feed handler parse (zero-copy)", 60},
        {"Strategy (in-memory decision)", 120},
        {"Risk check (inline)", 40},
        {"Order serialization", 30},
        {"Socket send / NIC tx", 50},
        {"Exchange ack RTT (mock)", 300}
    };

    us critical_threshold(500); // microseconds: quick example threshold

    long long total = 0;
    cout << "Latency model (" << favorite_player << " edition)\n";
    for (const auto &s : stages) {
        total += s.second;
        cout << "  " << s.first << ": " << s.second << " us (cumulative " << total << " us)\n";
    }
    cout << "\nTotal modeled latency: " << total << " us\n";
    cout << (us(total) > critical_threshold ? "Over" : "Under")
         << " the " << critical_threshold.count() << " us threshold.\n";
    return 0;
}
Let's test your knowledge. Is this statement true or false?
True or false: The latency-critical path in an HFT system runs from NIC hardware timestamp → kernel/driver → market data feed handler → strategy engine → order gateway → NIC.
Press true if you believe the statement is correct, or false otherwise.
Learning Path and Hands-on Labs
A clear roadmap helps you go from small, safe experiments to a full HFT microservice. Think of the labs like basketball drills: start with dribbling (market data ingestion), add shooting form (simple strategy), then play scrimmages (execution + backtesting) — every exercise builds toward game-ready performance.
ASCII roadmap (left → right):
[Lab 1]
Market Data Ingestion
        |
        v
[Lab 2] ------------> [Lab 3]
Simple Strategy       Execution Gateway
        |                   |
        v                   v
[Lab 4] ------------> [Lab 5]
Backtesting           Integration & Microservice
Key modules and practical labs (what you'll actually code):
Lab 1: Market Data Ingestion (UDP multicast / binary parsing)
- Goal: receive, parse, and sequence-recover simple mock messages.
- Languages: prototype in Python to parse, implement production parser in C++ for low latency.
Lab 2: Simple Strategy (stateless decision)
- Goal: implement a moving-average crossover or RSI rule.
- Languages: fast prototyping in Python (numpy), then optionally port hot function to C++ using pybind11.
Lab 3: Execution Gateway (order serialization + TCP/UDP sends)
- Goal: build a robust order sender with resend/ack tracking.
- Languages: C++ recommended for socket control and minimal syscalls.
Lab 4: Backtesting & Replay Engine
- Goal: deterministic market replay and strategy validation with offline metrics.
- Languages: Python for analysis (pandas) and C++ for heavy replay if needed.
Lab 5: Integration — Build Your First Microservice
- Goal: combine ingestion, strategy, and gateway into a runnable microservice with logging and simple risk checks.
- Languages: mixed — C++ for hot path, Python for orchestration or analytics.
How labs build on each other (dependency rules):
- Each lab produces a contract (simple API): parsed MarketMessage → StrategyInput → Order (sketched below).
- Later labs reuse earlier outputs: backtesting uses the same MarketMessage format you implement in Lab 1; the Execution Gateway reuses Order serialization from your strategy.
- This incremental approach makes debugging and evaluation tractable.
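A minimal sketch of what those contract types might look like in C++. The field names and layouts here are illustrative assumptions; your Lab 1 format is the source of truth:

// contracts.cpp - illustrative lab-to-lab contract types
#include <cstdint>
#include <cstdio>

struct MarketMessage {     // Lab 1 output: one parsed feed update
    uint64_t seq;          // exchange sequence number (for gap detection)
    uint64_t ts_ns;        // receive timestamp, nanoseconds
    double   bid, ask;
};

struct StrategyInput {     // Lab 2 input: what the decision logic consumes
    double mid;            // (bid + ask) / 2
    double spread;
};

struct Order {             // Lab 3 input: what the gateway serializes
    uint64_t client_id;
    char     side;         // 'B' or 'S'
    double   price;
    uint32_t qty;
};

StrategyInput to_input(const MarketMessage &m) {
    return { (m.bid + m.ask) / 2.0, m.ask - m.bid };
}

int main() {
    MarketMessage m{1, 0, 100.10, 100.12};
    StrategyInput s = to_input(m);
    Order o{42, 'B', s.mid, 100};          // toy decision: buy at mid
    std::printf("mid=%.4f spread=%.4f order side=%c\n", s.mid, s.spread, o.side);
    return 0;
}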
Evaluation criteria (how you'll be graded / measure success):
- Correctness: unit tests for parsing and serialization (pytest for Python, Catch2/googletest for C++).
- Determinism & Reproducibility: replay outputs must match between runs.
- Performance targets: micro-benchmarks for hot-paths (latency budgets per stage). Start with coarse goals (e.g., <1ms per stage) and tighten.
- Code hygiene: clear interfaces, CI build with CMake, and reproducible dependency management (conan/pip/venv).
- Observability: logs non-blocking, simple metrics (events/sec, avg latency).
Tailored notes for you (beginner in C++ & Python, familiar with Java, C, JS):
- Prototype quickly in Python (like playing 3-on-3 pickup) — iterate rules.
- Move hot inner-loops to C++ (the pro league): small functions, well-tested, and expose via pybind11 when you want to orchestrate in Python.
- If you come from Java: think of C++ as Java without a GC — you will manage memory and must watch allocations on the hot path.
- If you come from JS: event-driven logic maps well to feed handlers; translate callbacks into lock-free queues for low-latency C++ flows.
Practical timeline (recommended pacing):
- Lab 1: 6–10 hours
- Lab 2: 4–8 hours
- Lab 3: 6–10 hours
- Lab 4: 6–12 hours
- Lab 5: 8–16 hours
(Do these part-time over several weeks — adjust per prior experience.)
Hands-on challenge (run the C++ helper below):
- The C++ program prints a suggested lab sequence, per-lab estimated hours, and the total. It's a tiny planner you can edit. Try these experiments:
- Shorten Market Data hours to simulate moving faster from Python -> C++.
- Add a new lab Hardware Timestamping and set its hours.
- Change favorite_player to your favorite athlete or coder to personalize output.
Small tips while coding labs:
- Keep MarketMessage layouts explicit and test bit-exact parsing.
- Avoid dynamic allocation on the hot path — prefer pre-allocated buffers (see the sketch after this list).
- Write small, focused unit tests for each lab before integration.
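A tiny sketch of the pre-allocation tip above: allocate a pool once at startup, then reuse slots in the hot loop so it never touches the heap. The Tick layout and pool size are illustrative:

// prealloc.cpp - build: g++ -std=c++17 prealloc.cpp -o prealloc
#include <cstdint>
#include <cstdio>
#include <vector>

struct Tick { uint64_t seq; double px; };

int main() {
    std::vector<Tick> pool(1024);        // allocated once, up front (power-of-two size)
    size_t next = 0;

    for (uint64_t i = 0; i < 10000; ++i) {
        Tick &t = pool[next];            // reuse a slot: no new/delete in the loop
        t.seq = i;
        t.px  = 100.0 + (i % 100) * 0.01;
        next = (next + 1) & (pool.size() - 1);
    }
    std::printf("filled %zu-slot pool, next index %zu\n", pool.size(), next);
    return 0;
}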
Now open the C++ file below (main.cpp), run it, and try the small edits above. When you're done, reflect: which lab took longest? Which one forced you to rewrite code in C++ instead of Python?
#include <iostream>
#include <string>
#include <vector>

using namespace std;

struct Lab {
    string name;
    int hours;
    string recommended_language;
};

int main() {
    // Personalize this!
    string favorite_player = "Kobe Bryant"; // change to your favorite coder or athlete

    vector<Lab> roadmap = {
        {"Market Data Ingestion", 8, "Prototype: Python -> Prod: C++"},
        {"Simple Strategy", 6, "Python (fast iterate), move hot parts to C++"},
        {"Execution Gateway", 8, "C++ (low-level sockets)"},
        {"Backtesting & Replay", 10, "Python for analysis, C++ for fast replay"},
        {"Integration: Microservice", 12, "C++ with thin Python orchestration"}
    };

    cout << "Learning Path Planner — Algorithmic Trading (HFT)\n";
    cout << "Hi " << favorite_player << "! Here's a suggested sequence of hands-on labs:\n\n";

    int total = 0;
    for (size_t i = 0; i < roadmap.size(); ++i) {
        cout << "Lab " << (i + 1) << ": " << roadmap[i].name
             << " (~" << roadmap[i].hours << "h) — " << roadmap[i].recommended_language << "\n";
        total += roadmap[i].hours;
    }
    cout << "\nEstimated total: " << total << " hours.\n";
    return 0;
}
Are you sure you're getting this? Is this statement true or false?
Each lab produces a contract (parsed MarketMessage → StrategyInput → Order) that later labs reuse, so implementing Lab 1's MarketMessage format first helps avoid incompatible formats and rewrites.
Press true if you believe the statement is correct, or false otherwise.
Hardware and Operating System Choices for Low Latency
Low-latency algorithmic trading depends as much on hardware and OS choices as on your code. Think of the stack like a basketball team: the hardware is your roster (big, fast players), the OS is your playbook and coach — both must be tuned to execute in a split second. You're coming from Java/C/JS and are a beginner in C++ & Python — so I'll keep analogies concrete and give you a small, runnable C++ helper you can tweak.
Quick visual: data path (simplified)
[NIC] -> (HW timestamp) -> [Kernel / Bypass Layer]
  |                                 |
  v                                 v
(packets)                (DPDK / PF_RING / Onload)
  |                                 |
  v                                 v
[Feed Handler] -> [Strategy Hot Path] -> [Order Gateway]
Critical low-latency touches: the NIC (hardware timestamping, RX queue), kernel bypass (DPDK, PF_RING), CPU locality (NUMA), and BIOS/NIC options (interrupt moderation, power states).
Key hardware concepts (what to look for)
CPU
- Prefer high single-thread performance (higher clock / lower uop latency) for hot-path logic. For HFT, few fast cores often beat many slow ones.
- Disable power-saving features for predictable latency: set CPU P-states/C-states appropriately in BIOS or via intel_pstate/cpupower.
- Hyperthreading: can help throughput but sometimes hurts worst-case latency due to shared execution ports — test with your workload.
Cache & Memory
- Large L1/L2 is valuable. Watch cache-coherency traffic between cores — design hot-paths to be cache-local.
- NUMA: make sure your NIC and the feed-processing thread are on the same NUMA node. Cross-NUMA memory access can add tens to hundreds of nanoseconds.
NICs
- Enterprise NICs (Solarflare/Xilinx/Mellanox/Intel) have hardware timestamping, large ring buffers, and good driver tooling.
- Look for features: RX/TX queue steering, RSS, hardware timestamping, SR-IOV, and flow director.
- Consider kernel-bypass options: DPDK gives lowest latency but adds complexity; PF_RING is easier to start with; OpenOnload helps on some hardware.
Storage/IO
- Most hot-paths avoid disk. If you must log, use asynchronous, non-blocking appenders or dedicated logging cores.
OS and distro choices
- Linux is the standard for HFT. Popular distros and notes:
  - Ubuntu LTS: friendly, modern kernels — good for development.
  - CentOS/RHEL or Rocky: often used in production, stable enterprise kernels.
  - Debian: stable and conservative.
- Kernel options and tuning (start on a dev box, test in staging):
  - IRQ affinity / irqbalance — pin NIC interrupts to specific cores.
  - isolcpus=... kernel parameter to isolate cores for real-time threads.
  - PREEMPT / PREEMPT_RT — real-time patches can help but add complexity.
  - Network stack: tune rx/tx ring sizes, disable offloads selectively (ethtool --offload), enable hardware timestamping if available.
BIOS / NIC tuning checklist
BIOS
- Disable C-states beyond C1 (or set C-states off) for stable latency.
- Disable turbo if you require predictable performance (turbo can shift frequency unpredictably).
- Ensure NUMA is enabled and documented in BIOS.
NIC (ethtool and driver)
- Set rx/tx ring sizes to match traffic patterns.
- Use ethtool -K to enable/disable offloads (GSO/TSO/LRO) — sometimes disabling helps latency.
- Configure IRQ affinity: pin NIC queues to CPU cores that run your feed handlers.
Colocation vs Cloud
- Colocated (on-prem or exchange colocated):
  - Best for absolute lowest latency. Access to specialized NICs, direct exchange connectivity, and physical proximity.
  - You control BIOS, kernel, and hardware.
- Cloud:
  - Easier to iterate, but often noisy neighbors and virtualization add jitter.
  - Use bare-metal instances when possible (some clouds offer SR-IOV / dedicated NICs). Test end-to-end latency — don't assume advertised instance specs guarantee low tail-latency.
NUMA: hands-on rule of thumb
- Keep memory and CPU on the same NUMA node as the NIC. Use numactl --hardware and lscpu to inspect layout.
- Pin threads with pthread_setaffinity_np (C/C++), or use taskset for quick experiments (see the pinning sketch below).
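For a quick start on pinning, here is a minimal sketch that pins the calling thread with pthread_setaffinity_np. Core 2 is an arbitrary example; pick a core on the NIC's NUMA node after checking lscpu:

// pin_self.cpp - build: g++ -std=c++17 -pthread pin_self.cpp -o pin_self
#include <pthread.h>
#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);  // example core; choose one on the NIC's NUMA node
    int rc = pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    if (rc != 0) { std::fprintf(stderr, "pin failed: %d\n", rc); return 1; }
    std::printf("now running on CPU %d\n", sched_getcpu());
    return 0;
}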
Practical checklist before you deploy to production
- Verify hardware timestamps end-to-end.
- Measure tail latency, not just mean latency (99.9th percentile matters).
- Build repeatable lab tests: replay market data into your stack and measure processing and send latencies.
- Keep a small config matrix and change one setting at a time — rollbacks are your friend.
Challenge (try it — edit the C++ below)
- Run the C++ helper program below. It models CPU, NIC, NUMA, and OS weights and prints a simple score for a candidate machine.
- Try these experiments:
  - Increase cpu_weight if you care more about single-thread speed (typical for many trading strategies).
  - Toggle hyperthreading to see how it affects the recommendation string.
  - Add a new candidate for a cloud bare-metal instance and see how it scores.
This exercise is friendly to your Java/C/JS background: the code is plain C++ I/O and struct use — think of it like a typed version of a JSON object you might manipulate in JS or a simple Java POJO.
// replicate this code into main.cpp and run it

#include <iostream>
#include <string>
#include <vector>

using namespace std;

struct Machine {
    string name;
    int cpu_score;     // single-thread perf (0-100)
    int nic_score;     // NIC features & hw timestamp (0-100)
    int numa_penalty;  // penalty for cross-NUMA (0-100, higher worse)
    bool hyperthreading;
};

int main() {
    // personalized touch (you like basketball? change this!)
    string favorite_player = "Kobe Bryant";

    vector<Machine> candidates = {
        {"Colo-Baremetal-1", 95, 95, 5, false},
        {"Cloud-Baremetal-XL", 88, 85, 10, true},
        {"Dev-Workstation", 80, 60, 20, true}
    };

    // Tunable weights: increase cpu_weight if single-thread matters more
    double cpu_weight = 0.45;
    double nic_weight = 0.40;
    double numa_weight = -0.15; // negative because higher penalty reduces score

    cout << "HFT Hardware Quick Scorer — tuned for low-latency strategy\n";
    cout << "Favorite player for vibes: " << favorite_player << "\n\n";

    for (const auto &m : candidates) {
        double score = m.cpu_score * cpu_weight + m.nic_score * nic_weight + m.numa_penalty * numa_weight;
        cout << "Machine: " << m.name << "\n";
        cout << "  CPU:" << m.cpu_score << " NIC:" << m.nic_score << " NUMA_penalty:" << m.numa_penalty << " HT:" << (m.hyperthreading ? "on" : "off") << "\n";
        cout << "  Composite score: " << int(score + 0.5) << "\n";

        if (m.hyperthreading && m.cpu_score > 85) {
            cout << "  Note: HT enabled on fast CPU — test for tail-latency degradation.\n";
        }

        cout << "\n";
    }

    cout << "Tips: change cpu_weight/nic_weight/numa_weight to see different trade-offs.\n";
    cout << "Try moving the feed handler to the NIC's NUMA node and re-run the scoring.\n";
    return 0;
}
If you're coming from Java: treat CPU pinning & NUMA as you would thread pools and locality — they determine where your thread runs and what memory it's allowed to touch. From JS: think of kernel bypass (DPDK) as moving from an interpreted runtime into a native socket with direct access — faster but more responsibility.
Next step: in the lab, we'll measure baseline latency on an un-tuned VM, then apply each tuning step and watch the 99.9th percentile move. Ready to tweak the C++ weights and simulate real-world choices?
Build your intuition. Fill in the missing part by typing it in.
To minimize cross-socket memory latency in an HFT feed handler, always place the NIC and the feed-processing thread on the same ___. Use tools like numactl and lscpu to verify placement.
Write the missing line below.
Time Synchronization and High-Resolution Timing
Accurate time is the referee in HFT — if your clocks disagree, your order/market-data timestamps lie, audits fail, and latency measurements become meaningless. Think of PTP/GPS as the league office keeping all courts' clocks in sync so your shot-clock (order timestamps) and the official game clock (exchange time) agree.
Why it matters for HFT (short)
- Trading decisions, order sequencing, and regulatory audit trails all depend on consistent timestamps across machines and network cards.
- Tail-latency debugging uses timestamps from NIC hardware and application logs — if the clocks drift, you can't correlate events correctly.
Key primitives & jargon
- PTP (Precision Time Protocol) — network time sync with sub-microsecond accuracy when using hardware timestamping.
- PHC (PTP Hardware Clock) / phc2sys — the NIC's hardware clock exposed to the kernel; phc2sys disciplines the system clock to it.
- TSC (Timestamp Counter) — very fast CPU cycle counter; great resolution but needs careful handling (invariant TSC, constant rate, pinned cores).
- SO_TIMESTAMPING / hardware timestamping — get timestamps from NIC hardware (preferred for accurate packet timing).
- clock_gettime(CLOCK_REALTIME) vs CLOCK_MONOTONIC/steady_clock — pick the right clock for measuring intervals vs wall-time.
ASCII visual: packet timestamp flow (simplified)
[Exchange NIC HW] ---hw-ts---> (packet on wire) ---> [Your NIC hw-ts]
                                                           |
                                                           v
                                             (kernel / socket timestamp)
                                                           |
                                                           v
                                                  (app capture time)
The important offsets: wire delay + NIC hardware timestamp offset + kernel/syscall latency + app capture jitter.
Common tools to inspect & verify
- ptp4l -m and phc2sys (show PTP status and sync progress)
- ethtool -T eth0 (check NIC hardware timestamping support)
- tcpdump -tt -n -i eth0 with hardware timestamps (or pcap with hw ts)
- chronyc tracking / timedatectl for NTP info
Practical notes for you (beginner in C++/Python, coming from Java/C/JS):
- Use CLOCK_MONOTONIC or std::chrono::steady_clock to measure durations (like latency). Use system_clock only for wall-clock labeling (logs, audits); see the clock sketch after this list.
- If you prototype in Python, still rely on the NIC's hardware timestamp (via socket options) when you need accuracy — user-space timestamps are noisy.
- TSC is like reading the CPU cycle counter directly (ultra-fast). It is great for microbenchmarks, but requires the system to guarantee the TSC rate is constant across cores. If you treat TSC like a simple Java System.nanoTime() replacement, test it carefully on your hardware.
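A minimal sketch contrasting the two clock roles described above: steady_clock for intervals (monotonic, never jumps) and system_clock for wall-clock labels (can step when NTP/PTP adjusts time):

// clocks.cpp - build: g++ -std=c++17 clocks.cpp -o clocks
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
    using namespace std::chrono;

    auto t0 = steady_clock::now();          // interval measurement: monotonic
    std::this_thread::sleep_for(milliseconds(5));
    auto t1 = steady_clock::now();
    std::printf("elapsed: %lld us\n",
                (long long)duration_cast<microseconds>(t1 - t0).count());

    auto wall = system_clock::now();        // label only: epoch-based, can be stepped
    std::printf("wall clock (s since epoch): %lld\n",
                (long long)duration_cast<seconds>(wall.time_since_epoch()).count());
    return 0;
}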
Challenge: run the C++ helper below. It simulates a hardware timestamp (earlier) and an application timestamp and prints the offset in nanoseconds and an estimated TSC-cycle count using an input CPU frequency. Tweak cpu_freq_ghz and simulated_skew_ns to see how skew and CPU frequency affect cycle-counts and perceived offsets. Try to relate the printed offsets to the ptp4l offsets you'd see on a real machine.
Hands-on checks to attempt after this screen
- On a lab box with a PTP-capable NIC: run ptp4l -m and note the reported offset (should be sub-microsecond when synced).
- Use ethtool -T <iface> to confirm hardware timestamping.
- Replay a pcap into your stack (or use a packet generator) and compare NIC hw timestamps vs application timestamps.
Ready? Tweak the code: change simulated_skew_ns and cpu_freq_ghz, or add a simulated jitter loop (like a busy spin) to see how application capture time moves relative to the NIC hw-ts. If you're into basketball analogies: try increasing simulated_skew_ns as if the scorekeepers in two arenas started one second apart — you'd lose the ability to compare who hit a buzzer-beater first.
#include <chrono>
#include <iostream>
#include <string>
#include <thread>

using namespace std;
using namespace std::chrono;

int main() {
    // Friendly personalization (you mentioned basketball earlier!)
    const string favorite_player = "Kobe Bryant"; // change for fun

    // Simulation knobs (edit these to experiment):
    double cpu_freq_ghz = 3.0;               // set approximate CPU freq (GHz)
    long long simulated_skew_ns = 250;       // positive => app clock is *later* than hw-ts (ns)
    long long simulated_processing_ns = 120; // app capture latency after hw timestamp (ns)

    cout << "Time Sync Helper — tuned for HFT learning (" << favorite_player << ")\n\n";

    // Simulate NIC hardware timestamp (we pretend the NIC stamped packet arrival earlier)
    auto hw_ts = steady_clock::now();

    // Simulate wire + kernel + NIC processing by sleeping a tiny amount.
    // In real systems you would obtain hw_ts from the NIC or kernel (SO_TIMESTAMPING).
    this_thread::sleep_for(nanoseconds(simulated_processing_ns));

    // Application capture: add the simulated skew to model a clock that runs ahead
    auto app_ts = steady_clock::now() + nanoseconds(simulated_skew_ns);

    auto offset_ns = duration_cast<nanoseconds>(app_ts - hw_ts).count();
    double cycles = offset_ns * cpu_freq_ghz; // ns * cycles-per-ns

    cout << "Observed app-vs-hw offset: " << offset_ns << " ns\n";
    cout << "Approx. TSC cycles at " << cpu_freq_ghz << " GHz: " << (long long)cycles << "\n";
    return 0;
}
Are you sure you're getting this? Is this statement true or false?
You can rely on the CPU TSC (Timestamp Counter) as a portable, synchronized wall-clock across multiple machines in an HFT deployment without using PTP/GPS or NIC hardware timestamping.
Press true if you believe the statement is correct, or false otherwise.
Kernel and Network Stack Tuning for Minimal Latency
When building HFT systems for algorithmic trading, every microsecond counts. The kernel and network stack are the stage crew moving the packets from the wire to your strategy code — if they fumble, your execution timing (and P&L) suffers.
- Goal (this screen): give practical knobs you can change safely and a tiny C++ experiment that demonstrates why CPU affinity and polling vs. kernel wakeups matter. You're coming from Java/C/Python/JS — think of isolcpus and IRQ affinity like telling the OS "don't interrupt my star player during the buzzer-beater".
Quick mental model (ASCII)
[NIC] --hw-ts--> (NIC ring RX) --> (NIC IRQ) --> [Kernel softirq / NAPI] --> [socket / user app]
                                                          |
                                                          v
                                                      (CPU core)
Important places to tune:
- IRQ affinity — bind NIC interrupts to specific CPU cores by writing to /proc/irq/<irq>/smp_affinity or using irqbalance carefully.
- isolcpus — kernel boot parameter to isolate cores from the scheduler (good for dedicating cores to latency-sensitive threads).
- PREEMPT / real-time kernels — CONFIG_PREEMPT, CONFIG_PREEMPT_RT reduce scheduling latency.
- RX/TX ring sizes — ethtool -g <iface> and ethtool -G <iface> rx <count> tx <count> adjust NIC buffers.
- Offloads — disable GRO/GSO/TSO for accurate per-packet timing with ethtool -K <iface> gro off gso off tso off.
- Socket & kernel knobs — net.core.rmem_max, net.core.netdev_max_backlog, net.core.busy_poll and SO_BUSY_POLL for polling sockets (a socket-level sketch follows this list).
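As a socket-level illustration of SO_BUSY_POLL, here is a minimal Linux-only sketch. The 50-microsecond budget is an arbitrary example, and on some kernels setting this option requires CAP_NET_ADMIN:

// busy_poll.cpp - build: g++ -std=c++17 busy_poll.cpp -o busy_poll
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    // Ask the kernel to busy-poll the device queue for up to 50 us on blocking
    // reads of this socket, trading CPU for lower wakeup jitter.
    int busy_usec = 50;
    if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &busy_usec, sizeof busy_usec) < 0) {
        perror("SO_BUSY_POLL (may need CAP_NET_ADMIN)");
    } else {
        std::printf("SO_BUSY_POLL set to %d us\n", busy_usec);
    }
    close(fd);
    return 0;
}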
Why this matters in HFT terms:
- Polling (busy-spin) is like having a guard constantly watching the scoreboard — you pay CPU (power) for ultra-low and deterministic latency.
- Kernel wakeups (condvars, epoll) are energy efficient but introduce jitter — like waiting for the PA announcer to tell you the buzzer sounded.
Practical safe-testing rules:
- Test on a dedicated lab box (do not change kernel settings on prod network appliances).
- Keep a remote admin session and a recovery plan (rescue kernel, reboot). Use sysctl -w for transient changes.
- Record baselines before each change. Use ethtool -T, ptp4l -m (if PTP), tcpdump -tt, perf record / perf top.
Commands you will use often:
- Check timestamping/offloads: ethtool -T eth0, ethtool -k eth0
- Resize rings: ethtool -G eth0 rx 4096 tx 512
- Disable offloads: ethtool -K eth0 gro off gso off tso off
- Affix IRQ to CPU mask: echo 2 > /proc/irq/<irq>/smp_affinity (mask is hex; be careful)
- Transient sysctl: sysctl -w net.core.busy_poll=50
Tiny experiment (run locally)
Below is a C++ program that simulates a simple producer (market-data) and consumer (strategy) pair and measures notification latency in three scenarios:
- unpinned threads (default scheduler)
- pinned to the same core (bad)
- pinned to different cores (good)
This will help you reason about isolcpus and thread-pinning effects. It includes both condition_variable (kernel wake) and polling (busy-spin) modes. Try it on a multi-core Linux VM and change the CPU numbers (or run with the isolcpus= kernel param) to see the difference.
Note: This is a simulation — it doesn't change kernel IRQ routing or NIC offloads. Run real network tests separately with pktgen and ethtool once you're comfortable.
Challenge: Run the program, then:
- Change prod_cpu/cons_cpu values to match cores on your machine (try 0 and 1).
- Switch between use_polling = true and false.
- Observe mean and max latencies. Relate improvements to what you'd expect if you used isolcpus and bound the NIC IRQ to a nearby core.
Now the code — save as main.cpp, compile with g++ -O2 -std=c++17 -pthread main.cpp -o tune_test, and run ./tune_test.
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <pthread.h>
#include <sched.h>
#include <thread>

using namespace std;
using namespace std::chrono;

// Pin a std::thread to a CPU core (returns true on success; cpu = -1 leaves it unpinned)
bool pin_thread_to_cpu(std::thread &t, int cpu) {
    if (cpu < 0) return true;
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu, &cpuset);
    return pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &cpuset) == 0;
}

int main() {
    // Edit these: cores for producer/consumer (-1 = unpinned; try same vs different cores)
    int prod_cpu = 0, cons_cpu = 1;
    bool use_polling = true; // true = busy-spin, false = condition_variable (kernel wake)

    atomic<long long> ts_ns{0};
    mutex mtx; condition_variable cv; bool ready = false;

    thread cons([&] {
        if (use_polling) {
            while (ts_ns.load(memory_order_acquire) == 0) { } // busy-poll
        } else {
            unique_lock<mutex> lk(mtx);
            cv.wait(lk, [&] { return ready; });               // kernel wakeup
        }
        long long now = duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count();
        // One-shot measurement: run several times (or add a loop) to estimate mean/max.
        cout << "notification latency: " << (now - ts_ns.load()) << " ns\n";
    });
    thread prod([&] {
        this_thread::sleep_for(milliseconds(10)); // let the consumer settle
        ts_ns.store(duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count(), memory_order_release);
        if (!use_polling) { { lock_guard<mutex> lk(mtx); ready = true; } cv.notify_one(); }
    });
    pin_thread_to_cpu(cons, cons_cpu);
    pin_thread_to_cpu(prod, prod_cpu);
    prod.join(); cons.join();
    return 0;
}
Are you sure you're getting this? Is this statement true or false?
Disabling NIC offloads (for example GRO, GSO, and TSO) improves per-packet timing accuracy and reduces packet aggregation at the kernel level, thereby lowering jitter for latency-critical HFT workloads.
Press true if you believe the statement is correct, or false otherwise.
Choosing Development Tools and Workflow
A pragmatic toolkit and repeatable workflow are the difference between a hobby algo and a deployable HFT component. Think of your toolchain like a basketball team: the IDE is the coach drawing plays, the build system is your training plan, the compiler is the athlete whose performance you tune, and the debugger/benchmarks are the film room where you analyze every microsecond. If your favorite player is Kobe Bryant, the goal is to give him the best practice, shoes, and playbook — same idea for code.
High-level workflow (ASCII):
Editor/IDE --> Build System (CMake + deps) --> Local Tests & Linters
    |                                                 |
    v                                                 v
Debug/Run <-- Compiler (gcc/clang) <-- Profilers/Benchmarks --> CI/CD
Quick recommendations for a beginner who's familiar with Java, C, JS and starting C++/Python:
Editors / IDEs
- VS Code: lightweight, great extensions for C++ (ms-vscode.cpptools), Python, and Git.
- CLion: excellent CMake integration (commercial), great for stepping through C++ code with the debugger.
- Neovim/Emacs: if you like keyboard-driven workflows — pair with LSP (clangd, pyright).
Build systems & package management
- CMake — the de facto C/C++ cross-platform build generator. If you used Maven or npm, think of CMake as that for native builds.
- Conan or vcpkg — dependency managers for C++ (similar role to pip/npm).
- Python: use venv or conda for reproducible environments.
Compilers
- gcc and clang are the main choices. clang often gives nicer diagnostics; gcc is widely used in prod HFT stacks.
- Use -O2/-O3, -march=native, and -flto for performance builds; use -g for debug builds. Keep separate Debug and Release CMake targets.
Debugging & profiling
- gdb/lldb for source-level debugging.
- perf/VTune/hotspot for profiling CPU hotspots.
- Use sanitizers during dev: -fsanitize=address,undefined to catch memory errors early (disable in performance builds); a tiny demo follows this list.
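To see the sanitizers in action, here is a deliberately buggy sketch: the off-by-one write is exactly the kind of error -fsanitize=address reports immediately at runtime:

// sanitizer_demo.cpp - build: g++ -g -O0 -fsanitize=address,undefined sanitizer_demo.cpp -o demo
#include <vector>

int main() {
    std::vector<int> v(4, 0);
    // Off-by-one: the last iteration writes v[4], one element past the end of
    // the allocation; AddressSanitizer aborts with a heap-buffer-overflow report.
    for (int i = 0; i <= 4; ++i) v[i] = i;
    return v[0];
}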
Linters, formatters & CI
- clang-format and clang-tidy for C++ style and static checks.
- black, flake8, isort for Python.
- Pre-commit hooks + pull-request template: require tests, lint pass, and performance notes (expected budget) on PRs.
Recommended small rules for HFT codebases
- Small, focused commits and code reviews that check algorithmic complexity, not just style.
- Add microbenchmarks for performance-critical changes and record baselines.
- Reproducible builds and pinned dependency versions (Conan lockfiles, pip requirements.txt).
Why this matters for a beginner:
- If you come from Java (mvn) or JS (npm), the surprise is that native builds are multi-stage: configure (CMake) → compile (gcc/clang) → link. Learning CMakeLists.txt is worth the time.
- Python is great for rapid prototyping. Use pybind11 to move a hot function to C++ later — keep the Python layer small and well-tested.
Practical challenge (below): a tiny C++ microbenchmark you can compile with different flags to see how the compiler transforms code. Try compiling with:
- g++ -O0 main.cpp -o main_dbg (debug)
- g++ -O3 -march=native -flto main.cpp -o main_opt (optimized)
Run both and compare runtimes. Also try the same with clang++ and observe differences.
Change suggestions
- In the code: adjust loop size N to suit your machine (smaller on laptops). Try adding/removing volatile to see how optimizers behave.
- In your workflow: set up a simple CMakeLists.txt, a .clang-format, and a GitHub Actions CI that runs clang-tidy, unit tests, and the microbenchmark in a permissive mode.
Now: compile and run the C++ program in the code pane. Notice how compiler flags change the runtime — this is the first step toward understanding how build choices affect HFT latency.
#include <chrono>
#include <iostream>

using namespace std;

int main() {
    // Quick environment info (conditional on the compiler actually used)
#if defined(__clang__)
    cout << "Compiler: clang\n";
#elif defined(__GNUC__)
    cout << "Compiler: gcc/clang-compatible\n";
#else
    cout << "Compiler: unknown\n";
#endif
    cout << "__cplusplus: " << __cplusplus << "\n";

    // Microbenchmark: tight math loop
    // Adjust N if your machine is small. On a modern laptop try 20'000'000.
    const long N = 20000000L;
    volatile double sink = 0.0; // volatile prevents some optimizations that would remove the loop

    {
        auto t0 = chrono::high_resolution_clock::now();
        double x = 1.0000001;
        for (long i = 0; i < N; ++i) {
            x = x * 1.000000001 + 0.0000000001;
        }
        sink += x;
        auto t1 = chrono::high_resolution_clock::now();
        auto us = chrono::duration_cast<chrono::microseconds>(t1 - t0).count();
        cout << "Tight math loop: " << us << " us (sink=" << sink << ")\n";
    }
    return 0;
}
Let's test your knowledge. Is this statement true or false?
Using separate Debug and Release CMake targets — where Debug builds include -g and sanitizers and Release builds use -O3, -march=native and link-time optimizations — is the recommended approach to balance debuggability and peak performance in HFT development.
Press true if you believe the statement is correct, or false otherwise.
Setting Up the C++ Development Environment
Welcome — you're stepping from Java/C/JS into native C++ land, with the specific goal of building low-latency HFT components. Think of this setup like assembling a race car: the engine (compiler), the chassis (build system), the pit tools (package manager), and the telemetry (logging/profiling libs). If you played point guard in basketball, the tools are your teammates — each must know its role and pass the ball cleanly.
Quick checklist (what we'll install & why)
- Compilers: gcc/clang — the engines. Use clang for nicer diagnostics, gcc in many production HFT stacks.
- Build system: CMake — the cross-platform playbook that generates builds for different toolchains.
- Package managers: Conan or vcpkg — like maven/npm for native libraries.
- Key libraries: Boost (utilities), fmt (fast formatting), spdlog (low-latency logging), Eigen (linear algebra for numeric work).
- Project skeleton & recommended flags for reproducible, high-performance builds.
Install commands (Ubuntu / macOS shortcuts)
Ubuntu (Debian-based) — install compilers + cmake:

sudo apt update
sudo apt install -y build-essential cmake clang ninja-build python3-pip

Conan (Python-based):

python3 -m pip install --user conan

macOS (Homebrew):

brew install cmake clang-format ninja conan
Tip: if you used mvn/npm, think of CMake as the build generator and Conan/vcpkg as dependency managers (like pom.xml/package.json).
Recommended compiler flags (two build profiles)
- Debug (dev): -g -O0 -fsanitize=address,undefined — safe, catches errors.
- Release (perf): -O3 -march=native -flto -ffast-math -DNDEBUG -fno-plt — aggressive optimizations for latency-critical code.
Why keep them separate? Debug builds are your practice sessions; Release builds are game day. Never run sanitizers in high-frequency production builds.
Minimal CMakeLists (project skeleton)
cmake_minimum_required(VERSION 3.16)
project(hft_microservice VERSION 0.1 LANGUAGES CXX)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
# Debug config
set(CMAKE_CXX_FLAGS_DEBUG "-g -O0")
# Release config
set(CMAKE_CXX_FLAGS_RELEASE "-O3 -march=native -flto -DNDEBUG")
add_executable(hft_demo src/main.cpp)
# Example: use Conan to inject dependencies
# find_package(fmt CONFIG REQUIRED)
# target_link_libraries(hft_demo PRIVATE fmt::fmt)
ASCII project layout (quick visual):
hft_microservice/
├─ CMakeLists.txt
├─ conanfile.txt   (optional)
├─ src/
│  └─ main.cpp
└─ tests/
Libraries — quick notes
- Boost: broad utility belt (asio, lockfree, containers). Use only required modules.
- fmt: printf-style formatting but type-safe and fast — replace std::ostringstream in hot paths.
- spdlog: builds on fmt, supports async sinks for lower-impact logging (see the sketch after this list).
- Eigen: header-only, excellent for small-matrix math (used in model computations).
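A minimal sketch of spdlog's async logger, assuming spdlog (and its fmt dependency) are installed via Conan or your package manager. The queue size, thread count, and file name are arbitrary examples:

// async_log.cpp - link flags vary by install, e.g. g++ -std=c++17 async_log.cpp -lspdlog -lfmt -o async_log
#include <spdlog/async.h>
#include <spdlog/sinks/basic_file_sink.h>

int main() {
    // One background thread drains a queue of up to 8192 pending messages,
    // so the calling (hot) thread does not block on file I/O.
    spdlog::init_thread_pool(8192, 1);
    auto logger = spdlog::basic_logger_mt<spdlog::async_factory>("hft", "hft.log");

    for (int i = 0; i < 1000; ++i) {
        logger->info("tick seq={} px={:.2f}", i, 100.0 + i * 0.01);  // fmt-style formatting
    }
    spdlog::shutdown();  // flush remaining messages before exit
    return 0;
}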
Use Conan to pin library versions and create reproducible lockfiles — this prevents "works on my laptop" surprises.
Practical tips for someone coming from Java/C/JS
- No single package manager: you will mix system packages (apt/brew), CMake, and Conan/vcpkg. Think of CMake as the project POM and Conan as your private registry.
- Linking matters: native linking is explicit and can silently fail if you forget -l flags. Always do a small test run after adding a dependency.
- Build caches: CMake + Ninja is faster than plain Make for iterative development.
What to try now (challenge)
- Create the project skeleton above.
- Put the code pane main.cpp into src/main.cpp.
- Compile twice:
  - Debug: cmake -S . -B build -DCMAKE_BUILD_TYPE=Debug && cmake --build build
  - Release: cmake -S . -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build
- Run both builds and compare timings.
Questions to explore:
- How does changing -O0 -> -O3 affect runtime? (You'll see differences in the microbenchmark below.)
- Try reducing N if you're on a laptop. Try adding/removing volatile in the code to see optimizer effects.
Happy building — think of your first working build like hitting your first clean 3-pointer: small, satisfying, and the first step toward consistently scoring under pressure.
#include <chrono>
#include <iostream>
#include <string>

using namespace std;

int main() {
    // Tiny loop microbenchmark to show how compiler flags change runtime.
    // Try: compile with -O0 (debug) and -O3 -march=native (release) and compare.
    const long long N = 50000000; // reduce if this is too big on your machine
    volatile long long sink = 0;  // prevent optimizer from removing the loop

    auto t0 = chrono::high_resolution_clock::now();
    for (long long i = 0; i < N; ++i) {
        sink += i & 0xFF; // cheap work with a bitwise op
    }
    auto t1 = chrono::high_resolution_clock::now();
    auto us = chrono::duration_cast<chrono::microseconds>(t1 - t0).count();

    // Compiler identification (works for GCC/Clang)
    cout << "Compiler version macro: " << __VERSION__ << "\n";
    cout << "N = " << N << "\n";
    cout << "Sink (mod 1000) = " << (sink % 1000) << "\n";
    cout << "Elapsed = " << us << " us\n";

    string player = "Kobe Bryant"; // a nod to your basketball analogy
    cout << "Go-to player: " << player << "\n";
    return 0;
}
Let's test your knowledge. Click the correct answer from the options.
Which of the following compiler flag sets is the recommended "Release (perf)" configuration for latency-critical HFT components as described in the C++ environment setup?
Click the option that best answers the question.
- `-g -O0 -fsanitize=address,undefined` — full debug with sanitizers
- `-O3 -march=native -flto -ffast-math -DNDEBUG -fno-plt` — aggressive optimizations for performance
- `-O2 -pipe -static -s` — conservative optimizations with static linking and strip symbols
- `-Ofast -Og -fno-exceptions -funroll-loops` — mixed optimization flags (fast + debug)
Setting Up the Python Environment
Welcome — this screen gets your Python workspace ready for prototyping HFT strategies and for migrating hotspots to C++. You're a multi-language beginner (C++, Python, Java, C, JS): think of Python as your fast sketchpad (like a REPL version of javac + quick scripts) and C++ as the production engine you call when speed matters.
Why a dedicated Python env?
- Isolation: a venv/conda prevents library-version clashes (like keeping node_modules for different JS projects separate).
- Reproducibility: pin numpy/pandas/numba/cython/pybind11 versions so your backtests don't silently change behavior across machines.
- Iterate fast: prototype a strategy in Python, profile it, then move the hot loop to C++ (via pybind11) if needed.
Quick visual: Prototype -> Profile -> Push to C++
Prototype (Python) ---> Profile (cProfile / line_profiler / numba) ---> C++ (pybind11) ---> Deploy
ASCII flow:
[Python REPL / Jupyter]
         |
         v
[Prototype: pandas + numpy]
         |
         v
[Profile: find hot loop]
         |
         v
[C++ function exposed with pybind11]
         |
         v
[Import extension in Python]
Create an environment (venv)
venv (lightweight, stdlib):
python3 -m venv .venv
source .venv/bin/activate    # macOS / Linux
.\.venv\Scripts\activate     # Windows (PowerShell)
python -m pip install --upgrade pip
pip install numpy pandas numba cython pybind11
conda (easier binary packages on some systems):
conda create -n hft_py python=3.10 -y
conda activate hft_py
conda install -c conda-forge numpy pandas numba cython pybind11 -y
Tip: For HFT work, prefer conda or pip wheels built for your CPU to avoid long compile times for packages like numba/cython.
Install list (minimum for this course)
- numpy — numeric arrays (like std::vector<double> but with fast vectorized ops)
- pandas — dataframes for tick/bar data processing
- numba — JIT speedups for numerical loops (great before deciding to rewrite in C++)
- cython — compile Python-like code to C for intermediate speed gains
- pybind11 — clean bridge to call C++ from Python
Pin them in requirements.txt or a conda YAML for reproducible setups.
Pybind11 workflow (short)
- Prototype in Python with numpy.
- Profile to find the hot loop (e.g., computing a moving average over millions of ticks).
- Reimplement the hot function in C++ and expose it with pybind11.
- Build the extension, import it from Python, and compare results and timings.
A tiny conceptual pybind11 binding looks like:
// (concept only) expose `fast_sma(prices, window) -> ndarray` to Python
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>

namespace py = pybind11;

py::array_t<double> fast_sma(py::array_t<double> prices, int window) {
    // C++ implementation using raw pointers for speed (rolling-sum SMA)
    auto buf = prices.request();
    const double *in = static_cast<const double *>(buf.ptr);
    const py::ssize_t n = buf.size - window + 1;
    py::array_t<double> out(n);
    double *o = static_cast<double *>(out.request().ptr);
    double sum = 0.0;
    for (int i = 0; i < window; ++i) sum += in[i];
    o[0] = sum / window;
    for (py::ssize_t i = 1; i < n; ++i) {
        sum += in[i + window - 1] - in[i - 1];  // slide the window
        o[i] = sum / window;
    }
    return out;
}

PYBIND11_MODULE(myhft, m) {
    m.def("fast_sma", &fast_sma);
}
(You will later compile this into a Python extension; for now, focus on environment and prototyping.)
Rapid prototyping vs production
- Rapid: use pandas + numpy or numba in a venv; iterate in Jupyter.
- Production: compile C++ components with pinned compiler flags, link via pybind11 or run them as a separate microservice (RPC). Use CI to build wheels or containers.
Challenge (try this now)
- Create a venv and install the packages above.
- Run the C++ example in the code pane (compile + run). It computes a simple moving average (SMA) on a small price array — the same logic you'd first write in Python.
- Then implement the same SMA in Python using numpy.convolve and compare outputs and readability.
Questions to reflect on:
- Where does Python make iteration easy but slow? (Answer: per-element Python loops.)
- When does numba make sense vs jumping straight to C++ with pybind11? (Answer: if JIT gives enough speed-up and you want faster iteration without C++ build complexity.)
Next step: after running the C++ example, we'll show a short pybind11 binding and the setup.py/CMake recipe to build it so you can import it directly into Python.
#include <iomanip>
#include <iostream>
#include <vector>

using namespace std;

// Simple moving average (SMA) over a window. This mirrors what you'd first
// prototype in Python with numpy, then port when it's a hotspot.
double compute_sma_window(const vector<double>& prices, int start, int window) {
    double sum = 0.0;
    for (int i = start; i < start + window; ++i) {
        sum += prices[i];
    }
    return sum / window;
}

int main() {
    // Example tick prices (think: small simulated price stream)
    vector<double> prices = {100.5, 100.7, 100.2, 100.9, 101.1, 100.8, 101.3};
    int window = 3;

    cout << fixed << setprecision(4);
    cout << "Prices: ";
    for (double p : prices) cout << p << " ";
    cout << "\nWindow: " << window << "\n";
    cout << "SMA results:\n";
    for (size_t i = 0; i + window <= prices.size(); ++i) {
        cout << "  SMA[" << i << ".." << (i + window - 1) << "] = "
             << compute_sma_window(prices, static_cast<int>(i), window) << "\n";
    }
    return 0;
}
Try this exercise. Is this statement true or false?
True or false: The primary reason to create a Python virtual environment (venv or conda) for HFT development is to improve the runtime performance of your Python programs.
Press true if you believe the statement is correct, or false otherwise.
Low-Latency Networking Libraries and Frameworks
Welcome — this screen gives you a practical overview of the common kernel-bypass and kernel-based networking options used in HFT, and a small C++ playground that simulates a common performance trade-off: extra copies vs direct parsing. You're an engineer learning algorithmic trading with a mixed background (C++, Python, Java, C, JS) — think of this as learning the difference between playing pickup basketball (raw sockets) and running a pro training session with the best coaches and gear (DPDK).
Why this matters for HFT
- Market data and order traffic arrive at huge rates — microseconds matter. Choosing the right I/O layer affects latency, throughput, and complexity.
- The trade-offs are: complexity (how hard to set up and maintain) vs performance (latency, throughput) vs portability (works across distros/NICs).
ASCII diagram (data flow)
Market -> Fiber -> NIC (hardware) ------------------------------+
                                                                |
                                                                v
                  Kernel network stack -> sockets -> user process
                                      (kernel path: easier, slower)

NIC -> kernel bypass -> user-space poll (PF_RING / DPDK / Onload)
                                               (complex, fastest)
Short overview of stacks
raw sockets
- What: standard BSD sockets read with recvfrom/recvmsg.
- Pros: simplest to try, portable, easy to prototype in Python/Java/C++.
- Cons: kernel overhead, context switches, copy from kernel to user memory — higher latency.
- Analogy: pickup game at a public court — accessible but noisy.
PF_RING
(and ZC/AF_PACKET enhancements)- What: a packet capture and RX improvement layer;
PF_RING ZC
supports zero-copy. - Pros: lower CPU cost than raw sockets; can be simpler than full DPDK.
- Cons: NIC/driver support varies; still some complexity.
- Use when: you want better perf than raw sockets but not full DPDK complexity.
- What: a packet capture and RX improvement layer;
DPDK
(Data Plane Development Kit)- What: full user-space networking stack with NIC drivers, hugepages, polling, and zero-copy.
- Pros: best throughput/lowest packet-processing latency; fine-grained control (RSS, queues, batching).
- Cons: heavy setup (hugepages, binding NICs, custom drivers), less portable, requires careful memory/pinning.
- Analogy: pro training center with bespoke gear and coaches — maximum speed at highest cost.
Solarflare / OpenOnload
- What: vendor-specific kernel-bypass (NIC-offload) solutions. Often provide socket semantics with kernel-bypass.
- Pros: easier port of socket-based apps to bypass; vendor tested for low latency.
- Cons: vendor lock-in, driver quirks.
Key trade-offs summary
- Complexity: raw sockets < PF_RING < OpenOnload < DPDK
- Performance: raw sockets < PF_RING < OpenOnload < DPDK (general trend)
- Portability: raw sockets > PF_RING > OpenOnload > DPDK
Practical tips for a beginner
- Prototype in Python/C++ with raw sockets to understand message parsing and sequencing (see the minimal listener sketch below).
- When you need production latency, move to PF_RING or DPDK. Expect an engineering effort: NUMA, hugepages, IRQ affinity.
- Use hardware timestamping and measure: theory won't replace benchmarks.
- If your team is small and needs portability, prefer PF_RING or a vendor offload over DPDK unless you can maintain the extra complexity.
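To show what the raw-socket starting point looks like, here is a minimal UDP multicast listener sketch using standard BSD sockets (Linux). The group `239.1.1.1` and port `12345` are placeholders; real feeds publish their own addresses:

// Minimal UDP multicast listener sketch (Linux, BSD sockets).
#include <arpa/inet.h>
#include <cstdio>
#include <iostream>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(12345);                        // placeholder port
    if (bind(fd, (sockaddr*)&addr, sizeof(addr)) < 0) { perror("bind"); return 1; }
    ip_mreq mreq{};
    mreq.imr_multiaddr.s_addr = inet_addr("239.1.1.1");  // placeholder group
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    if (setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) < 0)
        perror("setsockopt");                            // joining the multicast group
    char buf[2048];
    for (int i = 0; i < 10; ++i) {                       // read a few datagrams, then exit
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, nullptr, nullptr);
        if (n > 0) std::cout << "got " << n << " bytes\n";
    }
    close(fd);
    return 0;
}

Every read here pays a kernel-to-user copy and a syscall — exactly the cost the kernel-bypass stacks above are designed to remove.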
Challenge for you (after running the code):
- Change the number of simulated messages (`N`) in the C++ code. Does the extra-copy approach scale worse?
- Try increasing the per-packet work (e.g., additional math or conditional logic) — does the relative gap change?
- If you program in Python: imagine the same loop in Python — where would the overhead be? (Answer: the interpreter loop and allocations.)
Remember the analogy: in basketball terms, if you want predictable split-second plays (HFT strategies), you eventually need a pro facility (DPDK or vendor kernel-bypass), but you start learning playbook and fundamentals with a pickup game (raw sockets).
Now compile and run the C++ playground in the code pane below. It simulates many tiny binary packets and measures two approaches: an extra copy (simulating a user-space copy from kernel buffers) vs a direct read from a contiguous ring buffer (simulating zero-copy/parsing from pre-mapped memory). Try editing `N`, batch sizes, or the simulated packet contents to see how costs change.
#include <chrono>
#include <cstring>
#include <iostream>
#include <random>
#include <vector>
using namespace std;
using Clock = chrono::high_resolution_clock;
// A tiny synthetic "market packet" -- real NIC frames are binary blobs like this.
struct Packet {
    uint64_t seq;
    double price;
    char side; // 'B' or 'S'
};
int main() {
    // Tweak this to simulate more/less load (try e.g. 100000, 1000000, 5000000)
    const size_t N = 1000000;
    const size_t pkt_size = sizeof(Packet);
    // Build a contiguous buffer that simulates a pre-filled ring (zero-copy friendly)
    vector<uint8_t> ring(N * pkt_size);
    // Fill with synthetic packets (deterministic pseudo-random prices)
    mt19937_64 rng(42);
    uniform_real_distribution<double> price_d(100.0, 101.0);
    for (size_t i = 0; i < N; ++i) {
        Packet p{i, price_d(rng), (i & 1) ? 'B' : 'S'};
        memcpy(ring.data() + i * pkt_size, &p, pkt_size);
    }
    // Approach 1: extra copy into a staging buffer (simulates the kernel->user copy)
    vector<uint8_t> staging(pkt_size);
    double sum1 = 0.0, sum2 = 0.0;
    auto t0 = Clock::now();
    for (size_t i = 0; i < N; ++i) {
        memcpy(staging.data(), ring.data() + i * pkt_size, pkt_size);
        Packet p; memcpy(&p, staging.data(), pkt_size);
        sum1 += p.price;
    }
    auto t1 = Clock::now();
    // Approach 2: parse in place from the pre-mapped ring (simulates zero-copy)
    for (size_t i = 0; i < N; ++i) {
        Packet p; memcpy(&p, ring.data() + i * pkt_size, pkt_size);
        sum2 += p.price;
    }
    auto t2 = Clock::now();
    cout << "extra-copy: " << chrono::duration_cast<chrono::microseconds>(t1 - t0).count()
         << " us (sum " << sum1 << ")\n";
    cout << "in-place:   " << chrono::duration_cast<chrono::microseconds>(t2 - t1).count()
         << " us (sum " << sum2 << ")\n";
    return 0;
}
Are you sure you're getting this? Click the correct answer from the options.
Which networking option typically delivers the lowest packet-processing latency but requires hugepages, binding NICs to user-space drivers, and a heavier setup effort?
Click the option that best answers the question.
- Standard BSD `raw sockets` (e.g., `recvfrom` / `recvmsg`)
- Packet-capture / RX improvements like `PF_RING` (with zero-copy variants)
- Vendor kernel-bypass solutions such as `Solarflare` / `OpenOnload`
- `DPDK` (Data Plane Development Kit) with user-space NIC drivers and hugepages
Exchange Connectivity and Protocols
Understanding how your system connects to exchanges is foundational for any HFT engineer — like knowing the court, the ref, and the scoreboard before you run plays. This screen gives a practical intro to the most common protocols (`FIX`, `OUCH`, `ITCH`, and binary multicast), how messages are parsed/serialized, and which tools/libraries help you test connectivity.
Why this matters (microsecond mindset)
- Exchanges speak different "languages": some send text-based order messages (`FIX`), others push high-rate market data as binary multicast (`ITCH`).
- Parsing/serializing correctly and recovering from gaps (sequence numbers) is critical: a missed packet = a missed trade.
- For beginners in `C++`, `Python`, `Java`, `C`, `JS`: start by understanding message shape and invariants (lengths, checksums, sequence fields) before optimizing for latency.
Quick protocol cheat-sheet

`FIX` (Financial Information eXchange)
- Text protocol: `tag=value` pairs separated by ASCII SOH (0x01).
- Common for orders/trades over TCP. Libraries: `QuickFIX` (C++), `QuickFIX/J` (Java), `quickfix` Python wrappers.
- Analogy: a referee announcing plays over a PA system — human-readable, reliable, and standardized.

`OUCH`
- Exchange-specific binary order protocol (example: NASDAQ OUCH for order entry).
- Binary, fixed-length fields, compact and fast — think of a coach's shorthand playbook.

`ITCH` / binary multicast
- High-throughput, low-latency multicast for market data updates (adds, trades, deletes).
- Messages are compact binary records; you often map a memory buffer and parse in place.
- Analogy: a fast live video feed — many frames per second; you must keep up or fall behind.

Binary multicast, general notes
- No retransmit: if you miss a multicast packet, you must detect gaps (sequence numbers) and request a resend from a TCP replay or use snapshots.
- NIC features (hardware timestamping, RSS queues) and kernel bypass (DPDK, PF_RING) become relevant here.
Parsing & serialization: practical rules of thumb
- Always validate lengths and sequence fields before trusting the payload.
- For `FIX`:
  - Split on SOH (0x01). The `9=` (BodyLength) and `10=` (CheckSum) fields help detect corruption (see the checksum sketch below).
  - Implement a small, robust parser first (proof-of-concept in `C++` or `Python`) before pushing to low-latency optimizations.
- For binary protocols (`OUCH`/`ITCH`):
  - Define the exact struct layout; prefer reading fixed fields (no string ops in the hot path).
  - Handle big-endian vs little-endian correctly (the spec doc will say which).
- Sequence recovery: maintain the last seen sequence ID, detect gaps, and trigger snapshot/recovery logic.
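To make the `10=` rule concrete, here is a small sketch of the FIX checksum convention: the byte sum of everything up to and including the SOH before `10=`, modulo 256, printed as three digits. The sample message body is made up for illustration:

// Sketch: computing the FIX CheckSum (tag 10) for a message body.
#include <cstdio>
#include <string>

int fix_checksum(const std::string& body) {
    unsigned sum = 0;
    for (unsigned char c : body) sum += c;   // sum of all bytes before "10="
    return sum % 256;
}

int main() {
    // Everything up to (and including) the SOH that precedes "10="
    std::string body = "8=FIX.4.2\x01" "35=D\x01" "55=KB24\x01";
    std::printf("CheckSum field: 10=%03d\n", fix_checksum(body));
    return 0;
}

A receiver recomputes this sum over the received bytes and rejects the message if it disagrees with the `10=` value.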
Tools & libraries to test connectivity
- FIX: `QuickFIX` (C++/Python/Java); test with `tcpdump`, `Wireshark` (FIX dissector), and simple test clients.
- Binary multicast: use `tcpreplay`, `pktgen`, `nping`, or vendor simulators to replay captures; use `Wireshark` with an ITCH dissector to inspect.
- Generic: `tcpdump`, `tshark`, `pcap`, `netcat`/`socat` for simple TCP tests; `iperf`/`netperf` for bandwidth; `strace`/`perf` for profiling.
ASCII diagram (data-flow simplified)
Exchange multicast (ITCH) --> Fiber --> NIC --\
                                               > Your market-data handler (C++/DPDK/PF_RING)
Exchange TCP (FIX/OUCH)   --> Fiber --> NIC --/   (parsing, seq. recovery, order gateway)
Hands-on: C++ playground (parse a FIX string and a simulated binary multicast)
Below is a small, beginner-friendly `C++` program that:
- Parses a simple `FIX` message (splits tag=value by SOH).
- Builds and parses a small simulated binary multicast packet (an `ITCH`-like layout).

Notes for you coming from `Python`, `Java`, `C`, or `JS`:
- This is intentionally simple: it shows the core idea of tokenizing (like JavaScript's `split`) and byte decoding (like reading an ArrayBuffer in JS).
- Modify the sample messages (change the symbol from the playful `KB24` — a Kobe Bryant nod — to your favorite symbol) and recompile to see how parsing changes.
Challenge for you:
- Add checksum verification for the `FIX` message (the `10=` field) in the C++ code.
- Simulate a missing sequence number in the binary packet and print a warning for gap detection.
- If you prefer Python: quickly reimplement `parse_fix` in Python to feel the contrast in allocation/cost.
Now compile and run the C++ code in the code pane. After running, try the challenges above.
#include <cstdint>
#include <cstring>
#include <iostream>
#include <string>
#include <vector>
using namespace std;
void parse_fix(const string& msg) {
    cout << "=== Parsing FIX (tag=value separated by SOH) ===\n";
    size_t start = 0;
    while (start < msg.size()) {
        size_t pos = msg.find('\x01', start);
        string field = msg.substr(start, pos == string::npos ? string::npos : pos - start);
        size_t eq = field.find('=');
        if (eq != string::npos) {
            string tag = field.substr(0, eq);
            string val = field.substr(eq + 1);
            cout << "Tag " << tag << " => " << val << "\n";
        } else if (!field.empty()) {
            cout << "Malformed field: " << field << "\n";
        }
        if (pos == string::npos) break;
        start = pos + 1;
    }
}
void parse_simple_binary(const vector<uint8_t>& pkt) {
    cout << "=== Parsing simple binary multicast (simulated ITCH-like) ===\n";
    // Layout: [8 bytes ts][4 bytes price][4 bytes size][1 byte symlen][symlen bytes symbol ascii]
    if (pkt.size() < 17) { cout << "packet too short\n"; return; }
    uint64_t ts; uint32_t price_c, sz;
    memcpy(&ts, pkt.data(), 8);
    memcpy(&price_c, pkt.data() + 8, 4);
    memcpy(&sz, pkt.data() + 12, 4);
    uint8_t symlen = pkt[16];
    if (pkt.size() < 17u + symlen) { cout << "bad symlen\n"; return; }
    string sym(pkt.begin() + 17, pkt.begin() + 17 + symlen);
    cout << "ts=" << ts << " price=" << price_c / 100.0 << " size=" << sz << " sym=" << sym << "\n";
}
int main() {
    parse_fix("8=FIX.4.2\x01" "35=D\x01" "55=KB24\x01" "10=123\x01");
    // Build a matching binary packet (host endianness kept for simplicity)
    vector<uint8_t> pkt(17 + 4);
    uint64_t ts = 1234567890ULL; uint32_t price_c = 10123, sz = 100;
    memcpy(pkt.data(), &ts, 8);
    memcpy(pkt.data() + 8, &price_c, 4);
    memcpy(pkt.data() + 12, &sz, 4);
    pkt[16] = 4;
    memcpy(pkt.data() + 17, "KB24", 4);
    parse_simple_binary(pkt);
    return 0;
}
Try this exercise. Is this statement true or false?
Binary multicast market data feeds (for example, ITCH) automatically retransmit any packets a receiver misses, so your market-data handler does not need to detect sequence gaps or request a replay.
Press true if you believe the statement is correct, or false otherwise.
Market Data Handler and Order Gateway: Initial Implementation
Hands-on goal: design a tiny, practical skeleton that shows how a UDP multicast market-data listener and a TCP/UDP order gateway fit together. We'll simulate both sides so you (a beginner in `C++`, `Python`, `Java`, `C`, `JS`) can see the core ideas without needing exchange credentials or NIC tuning yet.
Why this matters (short):
- Market data (multicast) is usually UDP-style: fire-and-forget, no retransmit. If you miss a packet you must detect gaps (sequence numbers) and recover via snapshot or TCP replay.
- Orders (gateway) are typically TCP (or a binary TCP protocol like `OUCH`): reliable, ordered; you must acknowledge and validate.
- A clean separation helps: `MarketDataHandler` (parse, detect gaps, publish) and `OrderGateway` (validate orders, send, ack).
Core concepts you should walk away with:
- Sequence recovery: maintain `last_seq` and detect `seq != last_seq + 1`. Missing packet -> request replay/snapshot (see the gap-detector sketch below the diagram).
- Parsing: binary parsing (map bytes to fields) vs text parsing (`FIX` `tag=value` with SOH `\x01`).
- Minimal API design: a fast-in path for market updates, a slightly heavier path for order validation.
ASCII data-flow (simplified)
Exchange multicast (ITCH-like) ---> NIC ---> [MarketDataHandler: parse, seq-check] ---> Strategy ---> [OrderGateway: validate, TCP submit] ---> Exchange TCP
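Here is what that `last_seq + 1` bookkeeping looks like as a tiny, self-contained sketch. The `GapDetector` class is ours, invented for illustration, not a library type:

// Sketch: minimal sequence-gap detector, the core of MarketDataHandler bookkeeping.
#include <cstdint>
#include <iostream>

struct GapDetector {
    uint32_t last_seq = 0;
    // Returns true if the packet is in order; false means a gap (trigger recovery).
    bool on_packet(uint32_t seq) {
        bool ok = (last_seq == 0) || (seq == last_seq + 1);
        if (!ok)
            std::cout << "[GAP] expected " << last_seq + 1 << ", got " << seq
                      << " -> request replay/snapshot\n";
        last_seq = seq;
        return ok;
    }
};

int main() {
    GapDetector gd;
    for (uint32_t s : {1001u, 1002u, 1004u}) gd.on_packet(s); // 1003 is missing
    return 0;
}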
A few notes tailored to your background:
- If you're comfortable in `Python`, reimplement the C++ demo in `Python` to feel the ergonomic differences (strings, slicing, map ops). For `Java`, `C`, or `JS`, the same primitives apply: buffer handling + sequence bookkeeping.
- For basketball fans: we use the symbol `KB24` in the demo — change it to your favourite player ticker to make it fun while you learn.
What the included C++ example does (run it in the code pane):
- Simulates two binary multicast packets (big-endian sequence number + fixed 8-byte symbol + price + size).
- Intentionally sends a gap (1001 then 1003) to show gap detection and a simulated recovery trigger.
- Parses a tiny `FIX`-like order and prints parsed `tag => value` pairs.
Guided challenges (pick one or more):
- Add checksum verification for the `FIX` message (tag `10`) and reject orders with bad checksums.
- Simulate a replay server: when a gap is detected, fetch a synthetic snapshot (in C++ or Python) and patch the state.
- Reimplement the multicast parser in Python using `struct.unpack` and compare code size and readability.
- Replace the simulated packets with a real UDP socket receiver (Linux) and test with a local pcap replay tool like `tcpreplay` (only after you understand sandbox/network safety).
Practical tips before you move on:
- Keep the hot path minimal: avoid per-packet allocations in production. This demo uses strings/vectors for clarity.
- Validate lengths and fields before trusting values — never trust network input (see the sketch below).
- Log sequence gaps and add metrics (a counter of gaps, last good seq) — useful when you graduate to profiling with `perf`.
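A minimal sketch of the "validate before trusting" rule. The `safe_parse` helper and its 8-byte layout are invented for illustration:

// Sketch: length-validate before parsing — never index past the buffer.
#include <cstdint>
#include <cstring>
#include <iostream>
#include <optional>
#include <vector>

struct Update { uint32_t seq; uint32_t price; };

std::optional<Update> safe_parse(const std::vector<uint8_t>& p) {
    if (p.size() < 8) return std::nullopt;   // reject truncated packets up front
    Update u;
    std::memcpy(&u.seq, p.data(), 4);
    std::memcpy(&u.price, p.data() + 4, 4);
    return u;
}

int main() {
    std::vector<uint8_t> truncated = {0x01, 0x02};   // too short on purpose
    std::cout << (safe_parse(truncated) ? "parsed\n" : "rejected: too short\n");
    return 0;
}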
Try this next: edit the C++ code to (1) change `KB24` to your favorite ticker/player, (2) simulate a second gap, and (3) print how many gaps were detected. Or reimplement the `parse_fix` function in Python to compare parsing convenience.
Happy hacking — this small demo is the seed of a real market-data handler and a safe order gateway skeleton. Build on it, then we’ll add sockets, recovery protocols, and latency measurements in later labs.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>
using namespace std;
// Simple helpers to simulate incoming binary multicast packets (big-endian)
uint32_t read_be32(const vector<uint8_t>& b, size_t off) {
    return (uint32_t(b[off]) << 24) | (uint32_t(b[off+1]) << 16) | (uint32_t(b[off+2]) << 8) | uint32_t(b[off+3]);
}
vector<uint8_t> make_packet(uint32_t seq, const string& sym, uint32_t price, uint32_t size) {
    vector<uint8_t> p;
    // 4-byte sequence (big-endian)
    for (int sh : {24, 16, 8, 0}) p.push_back((seq >> sh) & 0xFF);
    // fixed 8-byte symbol (padded with \0)
    string s = sym;
    s.resize(8, '\0');
    for (char c : s) p.push_back((uint8_t)c);
    // 4-byte price and 4-byte size (both big-endian)
    for (int sh : {24, 16, 8, 0}) p.push_back((price >> sh) & 0xFF);
    for (int sh : {24, 16, 8, 0}) p.push_back((size >> sh) & 0xFF);
    return p;
}
int main() {
    // Two packets with an intentional gap: 1001 then 1003 (1002 is "lost")
    vector<vector<uint8_t>> feed = { make_packet(1001, "KB24", 10123, 100),
                                     make_packet(1003, "KB24", 10125, 50) };
    uint32_t last_seq = 1000;
    for (const auto& p : feed) {
        uint32_t seq = read_be32(p, 0);
        if (seq != last_seq + 1)
            cout << "[GAP] expected " << last_seq + 1 << " got " << seq << " -> trigger recovery\n";
        string sym((const char*)p.data() + 4, 8);
        cout << "seq=" << seq << " sym=" << sym.c_str()
             << " price=" << read_be32(p, 12) / 100.0 << " size=" << read_be32(p, 16) << "\n";
        last_seq = seq;
    }
    // Tiny FIX-like order: tag=value pairs separated by SOH (\x01)
    string order = "35=D\x01" "55=KB24\x01" "38=100\x01";
    for (size_t start = 0; start < order.size(); ) {
        size_t pos = order.find('\x01', start);
        if (pos == string::npos) pos = order.size();
        string f = order.substr(start, pos - start);
        size_t eq = f.find('=');
        if (eq != string::npos) cout << "Tag " << f.substr(0, eq) << " => " << f.substr(eq + 1) << "\n";
        start = pos + 1;
    }
    return 0;
}
Are you sure you're getting this? Fill in the missing part by typing it in.
In a UDP multicast market-data handler you must compare each incoming packet's `seq` value to `last_seq + 1` to detect gaps. The packet field you check to determine ordering and missing packets is the ___.
Write the missing line below.
Backtesting and Simulation Environment
Design goal: build reproducible backtests and a small market-replay + synthetic exchange so you can validate strategies offline. This screen gives a compact mental model and a tiny C++ playground you can run and edit — perfect for beginners in `C++`, `Python`, `Java`, `C`, or `JS` who want to see the whole pipeline.
Why this matters
- Reproducible backtests let you compare strategy changes deterministically (same ticks -> same trades).
- A market replay engine feeds your strategy with historical `ticks` and `sequence` numbers so you can test gap handling.
- A synthetic exchange / matching engine implements a minimal order book to check execution logic and slippage.
Core components (ASCII diagram)
Historical ticks (file/array) --> Market Replay --> Strategy --> Order Gateway --> Matching Engine --> Trades/Logs
Key concepts
- `Tick` = (timestamp, seq, price, size). Use `seq` to detect dropped packets / gaps.
- Replayer: emits ticks in order (optionally with controllable timing) so your algorithm sees the same stream each run (see the sketch below).
- Matching engine: a tiny `order book` that matches buys vs sells deterministically — good for unit tests.
- Deterministic randomness: if you must use randomness, seed it (e.g., `std::mt19937 rng(42)`).
Analogy for beginners: think of the replay as a basketball practice tape — you can replay Kobe Bryant's moves frame-by-frame to refine your passes (orders) and reactions (strategy). Replace `Kobe Bryant` with your favorite player and watch how changing one pass timing changes the play outcome.
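Here is a minimal sketch of a seeded replayer (the function name `replay_stream` is illustrative): run it twice with the same seed and the streams match exactly, which is the whole point of deterministic backtesting.

// Sketch: deterministic replayer — same seed, same tick stream, every run.
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

struct Tick { int timestamp; int seq; double price; int size; };

std::vector<Tick> replay_stream(uint64_t seed, int n) {
    std::mt19937 rng(seed);                              // fixed seed => reproducible
    std::uniform_real_distribution<double> d(100.0, 101.0);
    std::vector<Tick> out;
    for (int i = 0; i < n; ++i) out.push_back({i, i + 1, d(rng), 10});
    return out;
}

int main() {
    auto a = replay_stream(42, 5), b = replay_stream(42, 5);
    std::cout << (a[4].price == b[4].price ? "identical streams\n" : "mismatch!\n");
    return 0;
}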
What the included C++ demo does
- Simulates a short tick stream with an intentional sequence gap to show detection.
- Runs a trivial strategy: place a limit buy when price drops below a threshold.
- Implements a tiny matching engine that prints executed trades and remaining order-book state.
Try these challenges
- Change the favorite player string (currently `Kobe Bryant`) to yours and print it with each trade.
- Add latency emulation: when matching, add a small sleep to simulate network delays and measure missed fills.
- Reimplement the replay part in Python (use `struct.unpack` and lists) and compare code size and ergonomics.
- Add a VWAP calculation for each simulated trade batch and log it.
Now run the C++ example below and then try modifying it: change tick prices, insert another sequence gap, or make the strategy more aggressive.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
using namespace std;
struct Tick {
    int timestamp;
    int seq;
    double price;
    int size;
};
struct Order {
    int id;
    bool is_buy;
    double price; // limit price
    int size;
};
struct Trade {
    int buy_id;
    int sell_id;
    double price;
    int size;
};
// Very small, naive matching: match incoming order with opposite book
vector<Order> sell_book; // resting sell orders
vector<Trade> match_buy(const Order& buy) {
    vector<Trade> trades;
    for (auto& s : sell_book) {
        if (s.size > 0 && s.price <= buy.price) {
            int qty = min(buy.size, s.size);
            trades.push_back({buy.id, s.id, s.price, qty});
            s.size -= qty;
            break; // naive: at most one fill per incoming order
        }
    }
    return trades;
}
int main() {
    const string favorite_player = "Kobe Bryant"; // change me
    // Short tick stream with an intentional gap (seq 3 is missing)
    vector<Tick> ticks = {{0,1,100.5,10},{1,2,100.2,10},{2,4,99.8,10},{3,5,99.7,10}};
    sell_book.push_back({1, false, 99.75, 10}); // one resting sell
    int last_seq = 0, next_id = 2;
    for (const auto& t : ticks) {
        if (t.seq != last_seq + 1) cout << "[GAP] missing seq " << last_seq + 1 << "\n";
        last_seq = t.seq;
        if (t.price < 100.0) { // trivial strategy: limit buy on dips
            Order buy{next_id++, true, t.price, 5};
            for (const auto& tr : match_buy(buy))
                cout << favorite_player << " trade: buy#" << tr.buy_id << " x sell#" << tr.sell_id
                     << " @ " << tr.price << " size " << tr.size << "\n";
        }
    }
    cout << "Remaining sell book:\n";
    for (const auto& s : sell_book)
        cout << "  sell#" << s.id << " " << s.size << " @ " << s.price << "\n";
    return 0;
}
Build your intuition. Click the correct answer from the options.
In a deterministic backtest that replays historical market ticks, you want to detect dropped packets or gaps in the tick stream so the strategy can handle missing data. Each tick is represented as `(timestamp, seq, price, size)`. Which field should you check to reliably detect sequence gaps or dropped packets?
Click the option that best answers the question.
- timestamp
- `seq` (sequence number)
- price
- size
Strategy Prototyping: From Python to C++
Why this screen matters
- You will normally prototype quickly in Python (pandas/NumPy) to validate strategy logic. When the inner loop becomes a bottleneck, you migrate just that hotspot to C++ for speed and deterministic performance.
- Think of Python as your whiteboard sketch and C++ as the high-performance court — the plays are the same, but execution is faster and more precise.
High-level workflow (ASCII diagram)
Python prototype (fast iterate)
  -> Profile (cProfile / line_profiler / pyinstrument)
  -> Identify hot function(s)
  -> Reimplement hot function(s) in C++ (pybind11 or RPC)
  -> Integrate & benchmark
  -> Deploy
Analogy for beginners (basketball)
- Prototyping in Python = film study with `Kobe Bryant` highlights: you find the play that scores most often.
- Hotspot = the quick cut that wins the game (the micro-ops inside your loop).
- Migrating to C++ = sending in your best shooter, who always hits under pressure.
Concrete tips for a beginner in C++, Python, Java, C, JS
- Prototype quickly in Python: use small, readable code and synthetic ticks (lists of `(timestamp, price, size)`).
- Profile early: find the exact function (not the file) that takes most of the time — `funcA` doing rolling sums? That's your candidate.
- Reimplement minimally: keep the same inputs/outputs. Start with a small, well-tested C++ function that computes, e.g., a rolling average or VWAP.
- Expose to Python: start with `pybind11` (a thin wrapper). If deployment needs process isolation, use an RPC boundary (nanomsg, gRPC, or raw TCP).
What to migrate (common hotspots)
- Inner loops that process every tick (aggregation, feature extraction, order decision logic).
- Parsing of heavy binary formats (market `ITCH`/`OUCH`) — low-level parsers in C++ can drastically reduce CPU and copies.
- Memory-allocation hot spots — reuse buffers in C++ and avoid per-tick malloc.
Quick checklist before migrating
- Can I vectorize this in NumPy? If yes, you may not need C++.
- Is the function called millions of times per second? If yes, it's a prime candidate.
- Are allocations and copies dominating CPU? Move to a C++ ring buffer.
Mini-exercise (what the C++ code below demonstrates)
- Generates a stream of synthetic `prices` (deterministic seed so results are reproducible).
- Implements two ways to compute a rolling simple moving average (SMA):
  - `naive_sma`: recompute the sum each tick (like a straightforward Python loop).
  - `incremental_sma`: maintain an incremental sum (how you'd implement it in C++ for speed).
- Compares timings so you can see why migrating the inner loop matters.
Try these challenges after running the example
- Change the window size (`WINDOW`) and re-run. How does the speed gap evolve?
- Replace the random tick generator with a small histogram or a real CSV replay (simulate `Kobe Bryant` moments by injecting spikes).
- Wrap `incremental_sma` in `pybind11` and call it from Python for a real prototype -> production path.
Now run the C++ example below (it prints timings and a few sample buy decisions). Then try the challenges!
#include <chrono>
#include <iostream>
#include <vector>
using namespace std;
using Clock = chrono::high_resolution_clock; // renamed: clock_t collides with the C library type
// Small struct to look like a tick: (timestamp, price, size)
struct Tick { long long ts; double price; double size; };
// Naive SMA: recompute sum every time (like a simple Python loop over a list slice)
vector<double> naive_sma(const vector<Tick>& ticks, size_t window) {
    vector<double> out;
    out.reserve(ticks.size());
    for (size_t i = 0; i < ticks.size(); ++i) {
        if (i + 1 < window) { out.push_back(0.0); continue; }
        double s = 0.0;
        for (size_t j = i + 1 - window; j <= i; ++j) s += ticks[j].price;
        out.push_back(s / double(window));
    }
    return out;
}
// Incremental SMA: maintain a running sum (the typical C++ hotspot implementation)
vector<double> incremental_sma(const vector<Tick>& ticks, size_t window) {
    vector<double> out;
    out.reserve(ticks.size());
    double s = 0.0;
    for (size_t i = 0; i < ticks.size(); ++i) {
        s += ticks[i].price;
        if (i >= window) s -= ticks[i - window].price;
        out.push_back(i + 1 < window ? 0.0 : s / double(window));
    }
    return out;
}
int main() {
    const size_t WINDOW = 64;
    vector<Tick> ticks; // deterministic synthetic prices
    for (long long i = 0; i < 200000; ++i)
        ticks.push_back({i, 100.0 + (i % 100) * 0.01, 1.0});
    auto t0 = Clock::now();
    auto a = naive_sma(ticks, WINDOW);
    auto t1 = Clock::now();
    auto b = incremental_sma(ticks, WINDOW);
    auto t2 = Clock::now();
    auto ms = [](auto x, auto y) { return chrono::duration_cast<chrono::milliseconds>(y - x).count(); };
    cout << "naive: " << ms(t0, t1) << " ms, incremental: " << ms(t1, t2) << " ms\n";
    cout << "last SMA: " << a.back() << " vs " << b.back()
         << (b.back() > 100.0 ? " -> sample BUY decision\n" : "\n");
    return 0;
}
Try this exercise. Is this statement true or false?
If profiling shows that a small, well-defined function is responsible for most runtime in per-tick processing, reimplementing only that function in C++ and exposing it to Python (for example with pybind11) is an appropriate next step to reduce latency.
Press true if you believe the statement is correct, or false otherwise.
Measuring Latency and Throughput
Why this matters for HFT engineers (beginner-friendly)
- In HFT the difference between `1,500 ns` and `2,500 ns` per tick can change whether your order wins a trade. Think of latency like a fast break in basketball: a small delay is the difference between an easy layup and a contested shot.
- Throughput (`ops/sec`) is how many ticks your system can handle per second — like how many possessions a team can run in a game.
Quick ASCII diagram: where measurement fits in the pipeline
[Market feed NIC] --(packets)--> [Capture / Handler] --(parse)--> [Strategy inner loop] --(orders)--> [Exchange gateway]
        ^                                                                                                   |
        |------------------------------------- instrument (timestamps) -------------------------------------|
- The critical path (where latency matters) is from packet arrival to order emission.
- We measure: per-event latency (ns) and overall throughput (ops/sec).
Core approaches and tools (what to reach for)
- Software timers: use `std::chrono::steady_clock` in C++, `time.perf_counter()` in Python, `System.nanoTime()` in Java, `clock_gettime(CLOCK_MONOTONIC_RAW)` in C, `performance.now()` in JS. These give you program-side timings.
- Kernel/hardware timestamps: NICs and the kernel support timestamping (SO_TIMESTAMPING or PTP). These give lower-level absolute times and remove user-space scheduling jitter.
- Packet capture: `tcpdump -tt -i eth0 -w out.pcap`, then analyze the timestamps with Wireshark. Use hardware timestamping where available.
- Profilers and counters: `perf record`/`perf stat` for CPU metrics and hotspots. `perf` helps find the hot function you should optimize.
Commands (beginner-safe examples)
- Capture packets (software timestamps): `sudo tcpdump -i eth0 -w feed.pcap`
- Profile CPU to find hotspots: `sudo perf record -F 99 -- ./your_binary`, then `sudo perf report --stdio`
- Check NIC timestamping capability: `ethtool -T eth0`
How to interpret measurements (simple rules)
- Look at percentiles, not just the average: `p95` and `p99` show the tail latency that kills HFT performance (see the percentile sketch below).
- Correlate throughput and latency: higher throughput often raises latency (queueing).
- Watch for long tails caused by GC, page faults, IRQs, or CPU frequency scaling.
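For concreteness, here is a small sketch of computing `p50`/`p95`/`p99` from collected samples using a simple nearest-rank-style index. The sample latency values are made up:

// Sketch: computing latency percentiles from a vector of per-event samples.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

uint64_t percentile(std::vector<uint64_t> v, double p) {
    std::sort(v.begin(), v.end());
    size_t idx = static_cast<size_t>(p * (v.size() - 1)); // nearest-rank-style index
    return v[idx];
}

int main() {
    std::vector<uint64_t> lat_ns = {900, 950, 980, 1000, 1050, 1100, 1200, 5000};
    std::cout << "p50=" << percentile(lat_ns, 0.50) << " ns, "
              << "p95=" << percentile(lat_ns, 0.95) << " ns, "
              << "p99=" << percentile(lat_ns, 0.99) << " ns\n";
    return 0;
}

Notice how the single 5000 ns outlier dominates the upper percentiles while barely moving the average — that is exactly why tails matter.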
Analogy to basketball (keep it intuitive)
- Average latency = team's average shot time.
- p99 latency = worst possession in the last 100 possessions (the play that cost you the game).
- Throughput = possessions per minute.
The supplied C++ example (in the code block) shows a reproducible microbenchmark:
- It builds deterministic `ticks` (`vector<Tick>`) so results are reproducible.
- It measures per-tick latency (nanoseconds) and computes `min`, `avg`, `p50`, `p95`, `p99`, `max`, and `ops/sec`.
- It prints SLO breaches for a simple service-level check.
Beginner challenges (try these after running the code)
- Change `ITERATIONS` to `10000` and `500000`. How do `ops/sec` and `p99` change?
- Toggle the `heavy` boolean to `true` to simulate a slower inner loop (like an unoptimized Python hotspot migrated to C++). What happens to throughput?
- Replace the synthetic price generator with a replay from CSV: read timestamps and prices into `ticks` and rerun the benchmark.
- Implement the same microbenchmark in Python using `time.perf_counter()` and compare `ops/sec`. (Hint: Python will be much slower per op; that's why we migrate hotspots.)
Practical next steps and what to measure in the field
- For network I/O benchmarks, use pcap with hardware timestamps when possible and compute hop-to-order latency.
- Use `perf` to see whether allocations, syscalls, or branch mispredictions dominate the time.
- Establish SLOs early (e.g., p99 < 5 µs) and continuously measure against them; alert when breached.
Try a small modification now (exercise):
- Edit the C++ example and:
  - increase `ITERATIONS` by 10x,
  - or add `std::this_thread::sleep_for(std::chrono::nanoseconds(2000));` inside the loop to simulate NIC queueing jitter,
  - or switch to the `heavy` workload.
Observe how the numbers change (min, p95, p99 and ops/sec). Understanding how these metrics move when you change workload or environment is the key skill here.
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>
using namespace std;
using ns = std::chrono::nanoseconds;
using Clock = std::chrono::steady_clock;
// Simple deterministic PRNG for reproducible "ticks" (no <random> overhead)
uint32_t lcg(uint32_t &state) {
    state = state * 1664525u + 1013904223u;
    return state;
}
// Synthetic tick: timestamp + price
struct Tick { uint64_t ts_ns; double price; };
// Simulated processing workload: a small amount of math per tick
inline double process_tick_fast(const Tick &t) {
    double p = t.price;
    return (p * 1.0001 + p / 123.456 - (p > 100.0 ? 0.42 : 0.21));
}
// Heavier workload: mimics an unoptimized inner loop (toggle `heavy` below)
inline double process_tick_heavy(const Tick &t) {
    double acc = t.price;
    for (int i = 0; i < 64; ++i) acc += std::sin(acc) * 1e-3;
    return acc;
}
int main() {
    const size_t ITERATIONS = 100000;  // try 10000 and 500000
    const bool heavy = false;          // simulate a slow hot path
    const uint64_t SLO_P99_NS = 5000;  // simple SLO: p99 under 5 us
    uint32_t state = 42;
    vector<Tick> ticks(ITERATIONS);
    for (size_t i = 0; i < ITERATIONS; ++i)
        ticks[i] = {i * 1000, 100.0 + (lcg(state) % 1000) * 0.001};
    vector<uint64_t> lat(ITERATIONS);
    double sink = 0.0;
    auto start = Clock::now();
    for (size_t i = 0; i < ITERATIONS; ++i) {
        auto t0 = Clock::now();
        sink += heavy ? process_tick_heavy(ticks[i]) : process_tick_fast(ticks[i]);
        lat[i] = (uint64_t)chrono::duration_cast<ns>(Clock::now() - t0).count();
    }
    double secs = chrono::duration<double>(Clock::now() - start).count();
    sort(lat.begin(), lat.end());
    auto pct = [&](double p) { return lat[(size_t)(p * (lat.size() - 1))]; };
    uint64_t sum = 0; for (uint64_t v : lat) sum += v;
    cout << "min=" << lat.front() << " avg=" << (sum / lat.size())
         << " p50=" << pct(0.50) << " p95=" << pct(0.95)
         << " p99=" << pct(0.99) << " max=" << lat.back() << " ns\n";
    cout << "ops/sec=" << (uint64_t)(ITERATIONS / secs) << " (checksum " << sink << ")\n";
    if (pct(0.99) > SLO_P99_NS) cout << "[SLO BREACH] p99 exceeds " << SLO_P99_NS << " ns\n";
    return 0;
}
Let's test your knowledge. Fill in the missing part by typing it in.
When measuring latency for an HFT critical path, you should examine percentiles such as `p95` and `p99` rather than relying only on the average, because these percentiles reveal the system's _ which often determines whether you meet your SLOs.
Write the missing line below.
Profiling, Performance Optimization and Vectorization
Why this matters for HFT engineers (beginner-friendly)
- In algorithmic trading you often process millions of `ticks` per second. Small inefficiencies in loops or data layout become huge latency and throughput problems.
- Think of optimization like a fast break in basketball: you want the ball (data) to move in straight lines with no unnecessary stops. Poor data layout is like zig-zag dribbling that wastes time.
Quick ASCII visuals — memory layout and cache friendliness
AoS (Array of Structs) — awkward for per-field hot loops:
[ Tick{price,size,ts} ][ Tick{price,size,ts} ][ Tick{...} ]
   ^ accessing price touches the other fields too
SoA (Struct of Arrays) — cache-friendly when you only need one field:
prices: [p0, p1, p2, p3, ...]
sizes: [s0, s1, s2, s3, ...]
ts: [t0, t1, t2, t3, ...]
^ sequential memory for prices
-> better L1/L2 prefetching
Core concepts to remember
- Profilers: use `perf` (Linux) and Intel VTune to find hot functions and cache-miss hotspots.
- Algorithmic choices: a better algorithm beats micro-optimizations (O(n) vs O(n log n)).
- Data layout: `SoA` often outperforms `AoS` for tight numeric loops.
- Memory pools: avoid frequent small allocations; use pools/arenas to reduce allocator overhead and fragmentation (see the pool sketch below).
- Vectorization (SIMD): modern compilers auto-vectorize loops when the code is simple and memory aliasing is clear. You can also use intrinsics later.
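To make the memory-pool idea concrete, here is a minimal fixed-size pool sketch. The `FixedPool` class is ours, invented for illustration; production pools additionally handle alignment, thread safety, and exhaustion policies:

// Minimal fixed-size pool sketch: preallocate a slab and hand out slots,
// avoiding per-tick malloc/free on the hot path.
#include <cstddef>
#include <iostream>
#include <vector>

template <typename T>
class FixedPool {
    std::vector<T> slab_;     // one up-front allocation
    std::vector<T*> free_;    // stack of free slots
public:
    explicit FixedPool(size_t n) : slab_(n) {
        free_.reserve(n);
        for (auto& obj : slab_) free_.push_back(&obj);
    }
    T* acquire() {            // O(1), no heap allocation
        if (free_.empty()) return nullptr;
        T* p = free_.back();
        free_.pop_back();
        return p;
    }
    void release(T* p) { free_.push_back(p); }
};

struct Msg { double price; int size; };

int main() {
    FixedPool<Msg> pool(1024);
    Msg* m = pool.acquire();
    m->price = 100.25; m->size = 10;
    std::cout << "msg slot at " << m << " price=" << m->price << "\n";
    pool.release(m);          // slot is reused, never freed to the heap
    return 0;
}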
Commands & profiling tips (beginner-safe)
- Quick perf run: `sudo perf record -F 99 -- ./main`, then `sudo perf report --stdio`
- See whether the compiler vectorized a loop (GCC/Clang): compile with `-O3 -ftree-vectorize -fopt-info-vec` and check the messages.
- Use `perf stat -e cache-references,cache-misses ./main` to get cache-miss rates.
How this ties to languages you know
- From Python/NumPy: moving inner loops to contiguous `numpy` arrays or C++ gives big speedups. In Python, prefer `numpy` vector ops to Python loops.
- From Java: the HotSpot JIT vectorizes too — the same principles (contiguous arrays, simple loops) apply.
- From C/JS: memory layout and cache behavior still matter — in JS, typed arrays are faster for numeric tight loops.
Try the interactive C++ microbenchmark below. It demonstrates:
- Generating reproducible `ticks` (deterministic RNG) — similar to replaying market data.
- Two implementations of a simple moving-average workload: `AoS` vs `SoA`.
- Timings using `std::chrono` and a small checksum to keep the results honest.
Compile hints (try these locally):
- `g++ -O3 -march=native -std=c++17 main.cpp -o main`
- Run `perf stat -e cache-references,cache-misses ./main` and compare cache misses between AoS and SoA.
- If you're curious about vectorization, add `-fopt-info-vec` (GCC/Clang) to see which loops the compiler vectorized.
Beginner challenges to try after running the code
- Re-run with different `N` (e.g., 100k, 10M). How does `ops/sec` scale? Which version scales better?
- Compile with and without `-O3`. Inspect the perf differences and run `objdump -d` to see the generated assembly.
- Implement the same logic in Python with `numpy` arrays; compare the runtime and think about why `numpy` can sometimes match C++ for vectorizable ops.
- (Stretch) Try rewriting the hot loop with explicit intrinsics (`<immintrin.h>`) and compare — only after you're comfortable with the auto-vectorized result.
Small motivational analogy: if `Kobe Bryant` is the best at taking the direct straight-to-the-basket path, think of an SoA + vectorized loop as the most direct path your code can take — fewer stops, fewer wasted cycles.
Now run the example below and experiment with the challenges above. The code is self-contained and prints timings and simple checksums so you can verify correctness while measuring performance.
#include <chrono>
#include <iostream>
#include <random>
#include <vector>
using namespace std;
using steady = chrono::steady_clock;
struct Tick {
    double price;
    double size;
    long ts;
};
// Generate deterministic ticks so runs are reproducible
void gen_ticks_aos(vector<Tick>& ticks, size_t N) {
    mt19937_64 rng(42);
    uniform_real_distribution<double> price_d(100.0, 101.0);
    uniform_real_distribution<double> size_d(1.0, 10.0);
    ticks.resize(N);
    for (size_t i = 0; i < N; ++i) {
        ticks[i].price = price_d(rng);
        ticks[i].size = size_d(rng);
        ticks[i].ts = static_cast<long>(i);
    }
}
int main() {
    const size_t N = 1000000, WINDOW = 16;
    vector<Tick> aos;
    gen_ticks_aos(aos, N);
    // SoA view: prices alone in one contiguous array
    vector<double> prices(N);
    for (size_t i = 0; i < N; ++i) prices[i] = aos[i].price;
    // AoS pass: strided access (price sits next to size/ts in memory)
    auto t0 = steady::now();
    double sum_aos = 0.0, s = 0.0;
    for (size_t i = 0; i < N; ++i) {
        s += aos[i].price;
        if (i >= WINDOW) s -= aos[i - WINDOW].price;
        sum_aos += s;
    }
    auto t1 = steady::now();
    // SoA pass: sequential access over one tightly packed array
    double sum_soa = 0.0; s = 0.0;
    for (size_t i = 0; i < N; ++i) {
        s += prices[i];
        if (i >= WINDOW) s -= prices[i - WINDOW];
        sum_soa += s;
    }
    auto t2 = steady::now();
    auto ms = [](steady::time_point a, steady::time_point b) {
        return chrono::duration_cast<chrono::milliseconds>(b - a).count();
    };
    cout << "AoS: " << ms(t0, t1) << " ms (checksum " << sum_aos << ")\n";
    cout << "SoA: " << ms(t1, t2) << " ms (checksum " << sum_soa << ")\n";
    return 0;
}
Let's test your knowledge. Click the correct answer from the options.
You have an Array-of-Structs (AoS) layout and a tight numeric loop that computes a moving average over `ticks`:
1#include <vector>
2struct Tick { double price; int size; uint64_t ts; };
3
4void process(std::vector<Tick>& ticks) {
5 double sum = 0.0;
6 for (size_t i = 0; i < ticks.size(); ++i) {
7 // hot inner loop touching only `price`
8 sum += ticks[i].price;
9 }
10 (void)sum; // keep result to avoid optimizing away
11}
Which of the following is the best first action to determine whether poor data locality (AoS vs SoA) and cache behavior are causing a performance problem?
Click the option that best answers the question.
- Run a low-level profiler (e.g., perf) to measure cache-references/cache-misses and find the hot spots before changing layout.
- Immediately refactor the data into a Struct-of-Arrays (SoA) and compare wall-clock times.
- Increase the number of threads to hide cache misses by parallelizing the loop.
- Just compile with `-O3 -march=native` and assume the compiler will auto-fix layout and vectorize the loop.
- Use a source-level debugger (gdb) to step through the loop and inspect memory addresses.
Testing, Reliability and Deterministic Builds
Why this matters for HFT engineers (beginner-friendly)
- In HFT, a tiny bug in market data parsing or an unreproducible build can cause real money loss or missed trades. Tests + deterministic builds are your safety net.
- Think of tests as pre-game practice drills (free throws, fast-breaks). Deterministic builds are like running the same playbook each time — no surprises at game time.
Key concepts at a glance
- Unit tests: small, fast checks for pure functions (e.g., `VWAP`, message parsers).
- Integration tests: run components together (feed handler → matching logic → order gateway) in a sandbox.
- Fuzz testing: throw random / malformed packets at parsers to find crashes or undefined behavior.
- Deterministic builds: produce byte-for-byte reproducible binaries/artifacts so CI artifacts are trustworthy.
- CI pipelines: automate tests, static analysis, fuzzing, and artifact signing on every commit.
ASCII diagram — a minimal CI flow for HFT microservice
[push to git] -> [CI: compile (deterministic)] -> [unit tests + linters]
                       |-> [integration tests w/ replay]
                       |-> [fuzzing harness (sanitizers)]
                       `--> [artifact: signed, reproducible .tar.gz]
Practical tips — testing and determinism for a beginner coming from C++, Python, Java, JS
- Start with unit tests in both languages: `gtest` or a tiny home-grown harness in C++; `pytest` in Python.
- Keep pure logic (math, parsing) in small, testable functions. If you can test `VWAP` in isolation, you avoid whole-system runs early.
- For message parsing, add golden-file tests: store a known binary multicast packet and assert the parsed fields match expected values (see the sketch after the checklist below).
- Fuzzing path: begin with property-based tests (Hypothesis for Python; libFuzzer/oss-fuzz for C++ when you scale). Run sanitizers (`-fsanitize=address,undefined`) in CI to catch UB.
- Deterministic runtime: avoid calling `rand()` without a seed. Use `std::mt19937_64` with a fixed seed for deterministic replays (see code).
- Deterministic builds: set `SOURCE_DATE_EPOCH`, avoid embedding build timestamps, and strip or fix the linker `--build-id`. Build with reproducible flags in CI.
Concrete checklist for your repo
- Unit tests for parsing, VWAP, and order-serialization.
- Integration replay tests using deterministic tick generator / pcap replay.
- Fuzz harness for parsers and message handling.
- CI job that sets reproducible env vars and produces signed artifacts.
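To make the golden-file idea concrete, here is a tiny sketch: a known-good packet stored alongside the test, parsed, and asserted field by field. The layout and the byte values are invented for illustration:

// Sketch: a golden-packet test — parse a stored binary packet, assert fields.
#include <cassert>
#include <cstdint>
#include <vector>

struct Parsed { uint32_t seq; uint32_t price; };

Parsed parse(const std::vector<uint8_t>& p) {
    auto be32 = [&](size_t off) {
        return (uint32_t(p[off]) << 24) | (uint32_t(p[off+1]) << 16) |
               (uint32_t(p[off+2]) << 8) | uint32_t(p[off+3]);
    };
    return {be32(0), be32(4)};
}

int main() {
    // Golden packet captured once and committed with the test: seq=1001, price=10123
    std::vector<uint8_t> golden = {0x00,0x00,0x03,0xE9, 0x00,0x00,0x27,0x8B};
    Parsed r = parse(golden);
    assert(r.seq == 1001 && r.price == 10123);  // fails loudly if the parser regresses
    return 0;
}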
Challenge — try this now
Run the C++ example below. It:
- generates deterministic `ticks` with `std::mt19937_64`,
- computes a `VWAP` and a checksum,
- verifies deterministic behavior (same seed → same checksum),
- runs a tiny fuzz loop to ensure no NaNs/crashes across many seeds.
Modify the seed and `N` (the number of ticks) to see when floating-point differences appear — it's like changing the game tempo.
The code below is the runnable test harness. After running it locally, try integrating it into your CI as a `unit` job.
1#include <iostream>
2#include <vector>
3#include <numeric>
4#include <random>
5#include <cmath>
6
7using namespace std;
8
9struct Tick { double price; int size; uint64_t ts; };
10
11vector<Tick> gen_ticks(size_t n, uint64_t seed=12345) {
12 std::mt19937_64 rng(seed);
13 std::uniform_real_distribution<double> price_dist(100.0, 101.0);
14 std::uniform_int_distribution<int> size_dist(1, 10);
15 vector<Tick> ticks; ticks.reserve(n);
16 uint64_t ts = 0;
17 for (size_t i = 0; i < n; ++i) {
18 ticks.push_back({price_dist(rng), size_dist(rng), ts++});
19 }
20 return ticks;
21}
22
23// Method A: straightforward VWAP
24double vwap_a(const vector<Tick>& ticks) {
25 double pv = 0.0; double vol = 0.0;
26 for (auto &t : ticks) { pv += t.price * t.size; vol += t.size; }
27 return vol ? pv / vol : 0.0;
28}
29
30// Method B: use std::accumulate with lambdas (same result expected)
31double vwap_b(const vector<Tick>& ticks) {
32 double pv = std::accumulate(ticks.begin(), ticks.end(), 0.0,
33 [](double acc, const Tick &t){ return acc + t.price * t.size; });
34 double vol = std::accumulate(ticks.begin(), ticks.end(), 0.0,
35 [](double acc, const Tick &t){ return acc + t.size; });
36 return vol ? pv / vol : 0.0;
37}
38
39int main() {
40 const size_t N = 1000;
41 const uint64_t seed = 424242ULL; // change this to experiment
42
43 auto ticks1 = gen_ticks(N, seed);
44 auto ticks2 = gen_ticks(N, seed); // regenerate to prove determinism
45
46 double pv1 = vwap_a(ticks1);
47 double pv2 = vwap_b(ticks2);
48
49 // checksum = sum(price * size) to quickly compare streams
50 double checksum1 = std::accumulate(ticks1.begin(), ticks1.end(), 0.0,
51 [](double acc, const Tick &t){ return acc + t.price * t.size; });
52 double checksum2 = std::accumulate(ticks2.begin(), ticks2.end(), 0.0,
53 [](double acc, const Tick &t){ return acc + t.price * t.size; });
54
55 cout << "VWAP method A: " << pv1 << "\n";
56 cout << "VWAP method B: " << pv2 << "\n";
57 cout << "Checksums: " << checksum1 << " " << checksum2 << "\n";
58
59 bool deterministic = fabs(checksum1 - checksum2) < 1e-12;
60 bool agree = fabs(pv1 - pv2) < 1e-12;
61
62 cout << (deterministic ? "[PASS] deterministic replay" : "[FAIL] non-deterministic") << "\n";
63 cout << (agree ? "[PASS] VWAP agreement" : "[FAIL] VWAP mismatch") << "\n";
64
65 // tiny fuzz loop: make sure we never get NaN or inf for many seeds
66 int bad = 0;
67 for (uint64_t s = 0; s < 500; ++s) {
68 auto t = gen_ticks(200, s);
69 double v = vwap_a(t);
70 if (!std::isfinite(v)) ++bad;
71 }
72 cout << "Fuzz checks (NaN/inf count): " << bad << "\n";
73
74 if (!deterministic || !agree || bad > 0) {
75 cout << "One or more tests failed.\n";
76 return 1;
77 }
78
79 cout << "All basic tests passed. Integrate into CI as a unit job.\n";
80 return 0;
81}
Next steps
- Add this harness as a `unit` job in CI and gate merges on it.
- Replace the tiny harness with `gtest` for readable test reports when you grow the suite.
- For deterministic builds: set `SOURCE_DATE_EPOCH`, avoid embedding timestamps, and have your CI produce a signed tarball stored as a release artifact.
Quick reading suggestions
- `GoogleTest` (C++) and `pytest` (Python) guides
- `libFuzzer`/`oss-fuzz` for C++ fuzzing
- the Reproducible Builds project for concrete build flags and CI recipes
Now run the example and try changing `seed` and `N`. If you like basketball, imagine tweaking the tempo of a Kobe-era fast break: small changes in rhythm can expose weaknesses — same with seeds and test inputs.
Build your intuition. Click the correct answer from the options.
Which of the following is the most important practice to produce deterministic, reproducible build artifacts in a CI pipeline for an HFT microservice?
Click the option that best answers the question.
- Set reproducibility-friendly environment variables (e.g., SOURCE_DATE_EPOCH), avoid embedding timestamps/build-ids, and sign the produced artifacts
- Allow the build system to embed timestamps and random build IDs so each artifact is uniquely identifiable
- Use unseeded global RNGs (e.g., `rand()`) during test data generation so CI runs exercise varied inputs
- Disable unit tests in CI to speed up artifact creation and run tests only locally before release
Logging, Observability and Incident Response
Why this matters for HFT engineers (beginner-friendly)
- In algorithmic trading, especially HFT, a missing or slow log can hide a latency spike that costs money. Logs are your breadcrumbs, metrics are your heartbeat, and traces are your map when something goes wrong.
- Think of your system like a fast-break basketball play: the ball (market data) flies through different players (feed handler → strategy → order gateway). If one player hesitates 2ms, the play fails. Logging must show who hesitated and why — without slowing the play.
Primary goals for this section
- Design low-latency, non-blocking logging that doesn't add jitter.
- Collect lightweight metrics (counters, gauges, histograms) and export them.
- Add simple tracing IDs to tie together a market tick's path through the system.
- Create alert rules and a minimal runbook to act when SLOs break.
Quick ASCII diagram — where to tap logs/metrics
[Exchange NIC] -> [Feed Handler] -> [Strategy] -> [Order Gateway] -> [Exchange]
                       |                |               |
                 logs/metrics      logs/trace      logs/metrics
Key patterns and trade-offs (for a multi-language view)
- Use a lock-free or bounded ring buffer for logs in C++ (`spdlog::sinks::ringbuffer_sink_mt` or a hand-rolled SPSC ring) to avoid allocations in the hot path.
- In Python/Java/C/JS prototypes, prefer structured logging and metrics via `json` lines. But beware: garbage + allocations can add latency — profile!
- Emit minimal data on the hot path: `timestamp`, `trace_id`, `event_type`, `latency_ns` — push heavy context to async uploaders.
- Batch and flush: coalesce many small log writes into a single I/O operation off the critical path.
- Metrics: counters (events/sec), gauges (queue length), histograms (latency distribution) — keep them lightweight. Use prom-client / Prometheus exporters in non-latency-critical threads.
Incident response basics (mini runbook)
- An alert fires: e.g., 99.9th-percentile latency > 1 ms for 1 minute.
- Check quick health endpoints: `/metrics`, process CPU, queue sizes, NIC errors.
- Look at recent trace IDs logged around the spike and replay those ticks locally.
- If the cause is a config/kernel change, roll back and escalate.
Hands-on example (C++):
- A tiny, runnable demo showing a bounded ring-buffer logger, a producer simulating ticks (with occasional latency spikes), and a consumer that drains logs and creates metrics.
- This is an educational prototype — in production you'd replace strings with preallocated structures and avoid std::string allocations (see the record sketch below).
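As a taste of that production direction, here is a minimal sketch of a fixed-size, POD log record. The field names follow the hot-path list above; the 64-byte bound is a common cache-line assumption, not a universal rule:

// Sketch: a fixed-size, POD log record — what "replace strings with
// preallocated structures" looks like on the hot path.
#include <cstdint>

struct LogRecord {            // plain data, no heap allocation per event
    uint64_t timestamp_ns;
    uint64_t trace_id;
    uint32_t event_type;
    uint32_t latency_ns;
};
static_assert(sizeof(LogRecord) <= 64, "keep one record within a cache line");

int main() {
    LogRecord r{123456789ULL, 42ULL, 1u, 950u};
    (void)r;  // in a real logger this slot would live inside a preallocated ring
    return 0;
}

Writing such a record is a couple of stores; formatting it into text happens later, on the consumer thread, off the critical path.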
Study tasks / Challenges
- Run the example below and change `BUFFER_SIZE` to `8` and then to `128`. Observe the `dropped` log counts and the max latency reported.
- Change `LATENCY_ALERT_NS` to a lower value and see how the simulated spike triggers an alert.
- Extend the logger to pre-allocate a pool of fixed-size char arrays to avoid heap allocations (advanced).
Code (compile as main.cpp). Try changing buffer size and alert threshold.
1#include <iostream>
2#include <vector>
3#include <string>
4#include <atomic>
5#include <thread>
6#include <chrono>
7#include <random>
8#include <sstream>
9#include <iomanip>
10
11using namespace std;
12using namespace std::chrono;
13
14struct SimpleRingLogger {
15 vector<string> buf;
16 size_t capacity;
17 atomic<size_t> head{0}; // next write index
18 atomic<size_t> tail{0}; // next read index
19 atomic<size_t> dropped{0};
20
21 SimpleRingLogger(size_t cap) : buf(cap), capacity(cap) {}
22
23 // Non-blocking push: returns false if buffer is full
24 bool push(string msg) {
25 size_t h = head.load(memory_order_relaxed);
26 size_t t = tail.load(memory_order_acquire);
27 if (h - t >= capacity) { // full
28 dropped.fetch_add(1, memory_order_relaxed);
29 return false;
30 }
31 buf[h % capacity] = move(msg);
32 head.store(h + 1, memory_order_release);
33 return true;
34 }
35
36 // Non-blocking pop: returns true if there was an item
37 bool pop(string &out) {
38 size_t t = tail.load(memory_order_relaxed);
39 size_t h = head.load(memory_order_acquire);
40 if (t >= h) return false; // empty
41 out = move(buf[t % capacity]);
42 tail.store(t + 1, memory_order_release);
43 return true;
44 }
45};
46
47int main() {
48 const size_t BUFFER_SIZE = 16; // try 8 / 128 as experiments
49 const size_t TOTAL_TICKS = 500; // how many simulated ticks
50 const long long LATENCY_ALERT_NS = 1'000'000; // 1 ms in ns
51
52 SimpleRingLogger logger(BUFFER_SIZE);
53
54 atomic<bool> done{false};
55
56 atomic<uint64_t> total_logged{0};
57 atomic<uint64_t> max_latency_ns{0};
58 atomic<uint64_t> events_processed{0};
59
60 // Consumer: drains logs and updates metrics
61 thread consumer([&]() {
62 string item;
63 while (!done.load() || logger.head.load() != logger.tail.load()) {
64 while (logger.pop(item)) {
65 // parse simple "trace_id,seq,latency_ns,timestamp"
66 for (auto &ch : item) if (ch == ',') ch = ' '; // commas -> spaces so the stream splits fields
67 stringstream ss(item);
68 string trace; uint64_t seq; uint64_t lat; uint64_t ts;
69 if ((ss >> trace >> seq >> lat >> ts)) {
70 events_processed.fetch_add(1);
71 uint64_t prev_max = max_latency_ns.load();
72 while (lat > prev_max && !max_latency_ns.compare_exchange_weak(prev_max, lat)) {}
73 // simulate exporting to disk/network in batches (not blocking producer)
74 if (lat > (uint64_t)LATENCY_ALERT_NS) {
75 cout << "[ALERT] High latency detected trace=" << trace << " seq=" << seq
76 << " lat_ns=" << lat << "\n";
77 }
78 }
79 total_logged.fetch_add(1);
80 }
81 // small sleep to avoid busy spin in this demo
82 this_thread::sleep_for(milliseconds(1));
83 }
84 });
85
86 // Producer: simulates handling incoming ticks and logs latency
87 thread producer([&]() {
88 mt19937_64 rng(424242);
89 uniform_int_distribution<int> base_ns(100, 800); // normal path 100-800 ns
90 for (uint64_t i = 0; i < TOTAL_TICKS; ++i) {
91 // simulate work
92 int simulated = base_ns(rng);
93
94 // simulate a rare spike every 120 ticks
95 if (i % 120 == 0 && i != 0) {
96 simulated += 2'000'000; // +2 ms spike
97 this_thread::sleep_for(milliseconds(2));
98 }
99
100 // timestamp and record latency
101 auto t0 = high_resolution_clock::now();
102 // (work would happen here)
103 auto t1 = high_resolution_clock::now();
104
105 uint64_t observed_ns = (uint64_t)duration_cast<nanoseconds>(t1 - t0).count() + simulated;
106
107 // build lightweight structured log: trace_id,seq,lat_ns,timestamp_ns
108 stringstream ss;
109 ss << "T" << setw(6) << setfill('0') << (i % 999999) << "," << i << "," << observed_ns << ","
110 << duration_cast<nanoseconds>(t1.time_since_epoch()).count();
111 string msg = ss.str();
112
113 if (!logger.push(move(msg))) {
114 // in a real system you might increment a metric and continue
115 // keep the hot path fast and avoid blocking
116 }
117
118 // pacing: very small sleep to emulate incoming tick rate
119 this_thread::sleep_for(microseconds(100));
120 }
121 done.store(true);
122 });
123
124 producer.join();
125 consumer.join();
126
127 cout << "\n--- Summary ---\n";
128 cout << "Total ticks generated: " << TOTAL_TICKS << "\n";
129 cout << "Total logs consumed: " << total_logged.load() << "\n";
130 cout << "Events processed (metrics): " << events_processed.load() << "\n";
131 cout << "Dropped logs (ring full): " << logger.dropped.load() << "\n";
132 cout << "Max observed latency (ns): " << max_latency_ns.load() << "\n";
133 cout << "(Change BUFFER_SIZE and LATENCY_ALERT_NS in code to experiment)\n";
134
135 return 0;
136}
Let's test your knowledge. Is this statement true or false?
Using a bounded, non-blocking ring buffer for logging in the HFT hot path is a good pattern because it avoids heap allocations and blocking I/O, even if that means occasionally dropping log entries under extreme load.
Press true if you believe the statement is correct, or false otherwise.
Security, Compliance and Risk Controls
Why this matters for HFT engineers (beginner-friendly)
- In HFT, an unchecked order can cause market, financial, or regulatory harm in milliseconds. Think of your system like a basketball play: the `feed handler` passes the ball, the `strategy` drives to the rim, and the `order gateway` must be the coach saying "no" when the shot is bad. A pre-trade check is that coach.
High-level flow (ASCII diagram)
[Exchange] ---> [NIC] ---> [Feed Handler] ---> [Strategy] ---> [Order Gateway]
                                                                     |
                                                      pre-trade risk checks,
                                                      throttling, auditing
Core controls you'll implement and test in labs
- Pre-trade risk checks: `max_order_size`, `max_notional`, `allowed_symbols`, `only_market_hours`.
- Order throttling / rate limiting: a per-client token bucket or leaky bucket to prevent bursts.
- Auditing: immutable, append-only logs of every order decision (`accept`/`reject` + reason + trace id).
- Circuit breakers: disable live trading on severe rule breaches or exchange errors.
- Compliance hooks: exportable audit events and sequence numbers for regulators.
Trade-offs and pragmatic advice (for folks with C++, Python, Java, C, JS backgrounds)
- C++: implement fast, allocation-free checks on the hot path. Use preallocated structures, plain arrays, and `enum` reasons.
- Python/JS: great for prototyping checks quickly, but watch allocations and the GIL/event loop — keep the hot path tiny and push heavy work to background threads/processes.
- Java/C#: a good middle ground — use non-blocking queues and careful GC tuning.
- Always keep the decision (accept/reject) cheap and deterministic.
What an audit entry should include (minimal hot-path fields)
`timestamp_ns`, `client_id`, `order_id`, `symbol`, `size`, `price`, `decision`, `reason`, `trace_id`
Hands-on demo (C++):
- The code below simulates a tiny `Order Gateway` implementing simple pre-trade checks, a per-client token-bucket throttler, and an append-only audit log.
- It prints decisions and a summary so you can tinker with thresholds and see the effects immediately.
Try these challenges after running the demo:
- Change `MAX_ORDER_SIZE` to a smaller value and re-run — how many orders are rejected?
- Lower `TOKEN_RATE` to throttle clients more; simulate a burst by increasing `BURST_ORDERS`.
- Replace the C++ token-bucket logic with a Python `asyncio` coroutine (an exercise for Python practice).
- Add a `blacklist` of `client_id`s and ensure blacklisted clients are always rejected with reason `blacklisted`.
Short notes on compliance and production hardening
- Make audit logs tamper-evident: append-only files with rotation, checksums, and offsite replication.
- Expose health and safety endpoints (read-only): `GET /health`, `GET /stats`, `POST /pause` (an operator-controlled circuit breaker).
- Unit test the rule set and simulate time drift / replays in your backtest environment.
Ready? Run the C++ example below (main.cpp). Modify the constants to explore behaviors and think how you'd implement the same in Python or Java.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>
using namespace std;
using steady = chrono::steady_clock;
struct Order {
    string client;
    string symbol;
    int size;
    double price;
    uint64_t order_id;
};
// Simple token bucket for rate limiting (tokens measured as "orders")
struct TokenBucket {
    double tokens = 0.0;
    double rate_per_sec = 1.0; // tokens added per second
    double capacity = 10.0;
    steady::time_point last = steady::now();
    bool allow() {
        auto now = steady::now();
        tokens = min(capacity, tokens + rate_per_sec * chrono::duration<double>(now - last).count());
        last = now;
        if (tokens >= 1.0) { tokens -= 1.0; return true; }
        return false;
    }
};
int main() {
    const int MAX_ORDER_SIZE = 500;
    const double TOKEN_RATE = 5.0;   // tokens/sec; lower this to throttle harder
    const int BURST_ORDERS = 20;     // raise this to simulate a burst
    TokenBucket bucket;
    bucket.rate_per_sec = TOKEN_RATE;
    bucket.tokens = 10.0;            // start full
    vector<string> audit;            // tiny append-only audit log
    for (uint64_t id = 1; id <= (uint64_t)BURST_ORDERS; ++id) {
        // every 7th order is oversized on purpose, to exercise the size check
        Order o{"clientA", "KB24", (id % 7 == 0) ? 1000 : 100, 100.25, id};
        string reason = "ok";
        if (o.size > MAX_ORDER_SIZE) reason = "max_order_size";
        else if (!bucket.allow()) reason = "throttled";
        string decision = (reason == "ok") ? "accept" : "reject";
        audit.push_back(decision + " order=" + to_string(id) + " reason=" + reason);
        cout << audit.back() << "\n";
    }
    cout << "audit entries: " << audit.size() << "\n";
    return 0;
}
Let's test your knowledge. Fill in the missing part by typing it in.
Security, Compliance and Risk Controls Fill In
Complete the audit-entry example by filling the blank below. An audit entry should include `timestamp_ns`, `client_id`, `order_id`, `symbol`, `size`, `price`, `decision`, `reason`, and `_____________`.
Hint: This field lets you correlate events across services and logs (useful for tracing and post-incident analysis).
Write the missing line below.
Project Skeleton: Build Your First HFT Microservice
Quick goal: assemble a tiny, end-to-end microservice that replays market data, runs a simple strategy, submits orders through a minimal gateway, logs decisions, and reports backtest PnL — all locally and reproducibly.
- Target reader: you — an engineer into algorithmic trading with beginner familiarity in `C++`, `Python`, `Java`, `C`, and `JS`. This screen gives you a low-friction C++ starting point and clear follow-ups for your other languages.
ASCII architecture (what we'll simulate locally):
[Synthetic Multicast Feed] --> [Feed Handler / Replay] --> [Strategy] --> [Order Gateway] --> [Simulated Exchange]
                                       |                                                             |
                                       `--------------> [Logger / Backtest Recorder] <--------------'
Think of it like a small pit crew: the feed handler hands the tire (price) to the mechanic (strategy); the mechanic decides whether to pit (trade) and the pit-box (order gateway) enforces safety checks.
Why C++ here?
- C++ shows the hot-path structure (tight loops, low allocation). Beginners: treat this as a clear, opinionated starting point; later you can prototype in `Python` for fast experiments or rewrite hotspots back into C++.
What the provided C++ program does (run it as main.cpp):
- Generates a deterministic synthetic feed (`generate_feed`) — reproducible like a unit test.
- Computes a `simple_sma` over an `SMA_WINDOW` and runs a tiny mean-reversion strategy: when the price deviates from the SMA by `THRESH`, it places a `BUY` or `SELL` order.
- Submits orders to a naive `submit_order` gateway which enforces `MAX_ORDER_SIZE` and a simple rate limit `MAX_ORDERS_PER_SEC`.
- Executes orders immediately (simulated exchange), updates `position` and `cash`, and logs events.
- Prints a final backtest summary (final PnL) so you can iterate quickly.
Why this is useful to you (language crosswalks):
- C++: shows how to keep the hot path allocation-light — a `deque` for the SMA window, plain structs for `Tick`/`Order`.
- Python: port the same logic into `pandas` or a tight loop with `numpy` for fast prototyping; keep the same knobs (`SMA_WINDOW`, `THRESH`) so results are comparable.
- Java/C: a similar structure applies — use arrays/pools for low-allocation paths.
- JS: great for visualization and teaching — replay the same tick vector in a browser and draw live PnL charts.
Exercises & challenges (try these after running the C++ program):
- Tweak `SMA_WINDOW` and `THRESH` to see how trade frequency and PnL change.
- Reduce `MAX_ORDERS_PER_SEC` to simulate an exchange throttling you — watch the rejections.
- Port the strategy loop to Python (keep the random seed the same) and compare final PnL — are they identical?
- Replace immediate execution with a simple matching engine: keep an `order_book` vector and match orders at the best price.
- Add an `audit` event (append-only) that records `timestamp_ns`, `decision`, `reason`, `trace_id` — then export to CSV.
- For fun: change the strategy to a momentum rule (buy when price > SMA + x) — which performs better on this synthetic feed?
Mini-challenges tailored to your background:
- If you like Java: implement the `Order` and `Feed` as small POJOs and run the replay loop in a `ScheduledExecutorService`.
- If you like Python: re-implement `generate_feed` with `numpy.random.default_rng(42)` and vectorize the SMA via `numpy.convolve`.
- If you like JS: visualize the replay and PnL using `d3` or a simple HTML canvas — great for demoing to teammates.
Next steps after this skeleton
- Replace the synthetic feed with a real multicast capture (a later lab covers `PF_RING`/`DPDK`).
- Harden the gateway: add pre-trade rules, per-client token buckets, and immutable audit logs.
- Add unit tests and a deterministic CI job that compiles the C++ and runs the backtest with fixed seeds.
Try this now
- Run the C++ program below. Then try one change: halve `SMA_WINDOW` and re-run — what happens to the order count and PnL?
Happy hacking — this tiny microservice is your playground for moving from prototypes to production-ready HFT components. Feel free to port pieces to Python/Java/JS to learn tradeoffs and iterate fast.
#include <chrono>
#include <deque>
#include <iostream>
#include <random>
#include <string>
#include <vector>
using namespace std;
using ns = chrono::nanoseconds;
using clk = chrono::high_resolution_clock;
struct Tick {
    ns ts;
    double price;
};
struct Order {
    string side; // "BUY" or "SELL"
    int size;
    double price;
    ns ts;
};
// Simple console logger (hot-path should be lighter in real HFT)
void log_event(const string &s) {
    auto now = chrono::duration_cast<ns>(clk::now().time_since_epoch()).count();
    cout << "[" << now << "] " << s << "\n";
}
// Deterministic synthetic feed — reproducible like a unit test
vector<Tick> generate_feed(size_t n) {
    mt19937_64 rng(42);
    normal_distribution<double> step(0.0, 0.05);
    vector<Tick> feed; double px = 100.0;
    for (size_t i = 0; i < n; ++i) { px += step(rng); feed.push_back({ns((long long)i * 1000), px}); }
    return feed;
}
// Naive gateway: enforce max size and a crude rate limit (counts the whole replay)
bool submit_order(const Order &o, int &orders_sent, int MAX_ORDER_SIZE, int MAX_ORDERS_PER_SEC) {
    if (o.size > MAX_ORDER_SIZE) { log_event("REJECT size " + to_string(o.size)); return false; }
    if (++orders_sent > MAX_ORDERS_PER_SEC) { log_event("REJECT rate limit"); return false; }
    log_event(o.side + " " + to_string(o.size) + " @ " + to_string(o.price));
    return true;
}
int main() {
    const size_t SMA_WINDOW = 20; const double THRESH = 0.15;
    const int MAX_ORDER_SIZE = 100, MAX_ORDERS_PER_SEC = 50;
    auto feed = generate_feed(1000);
    deque<double> win; double sum = 0.0, cash = 0.0;
    int position = 0, orders_sent = 0;
    for (const auto &t : feed) {
        win.push_back(t.price); sum += t.price;
        if (win.size() > SMA_WINDOW) { sum -= win.front(); win.pop_front(); }
        if (win.size() < SMA_WINDOW) continue;
        double simple_sma = sum / SMA_WINDOW, dev = t.price - simple_sma;
        if (dev < -THRESH) {                       // mean reversion: buy below SMA
            Order o{"BUY", 10, t.price, t.ts};
            if (submit_order(o, orders_sent, MAX_ORDER_SIZE, MAX_ORDERS_PER_SEC)) {
                position += o.size; cash -= o.size * o.price;  // immediate simulated fill
            }
        } else if (dev > THRESH && position > 0) { // sell above SMA
            Order o{"SELL", 10, t.price, t.ts};
            if (submit_order(o, orders_sent, MAX_ORDER_SIZE, MAX_ORDERS_PER_SEC)) {
                position -= o.size; cash += o.size * o.price;
            }
        }
    }
    double pnl = cash + position * feed.back().price;
    cout << "FINAL: position=" << position << " cash=" << cash << " PnL=" << pnl << "\n";
    return 0;
}
Build your intuition. Click the correct answer from the options.
In the provided C++ microservice skeleton (synthetic feed generator, SMA-based mean-reversion strategy, and a simple order gateway), which statement best describes how the example handles order execution?
Click the option that best answers the question.
- Orders are sent to a real exchange over a production FIX/TCP connection and await actual fills.
- Orders are executed immediately by the simulated exchange in-process; position and cash are updated deterministically.
- Orders are written to an on-disk persistent order book and matched asynchronously by a background thread.
- Orders are never executed — the program only logs decisions for offline analysis and does not change position or cash.
Course Wrap-up and Next Steps
A quick, actionable finale. You've built a tiny end-to-end HFT microservice, learned how to set up fast C++ and Python toolchains, and seen the latency-sensitive pieces that matter most in production. Below is a compact recap, a visual roadmap, immediate next projects, and career/practice tips — tailored for you (a beginner in `C++`, `Python`, `Java`, `C`, and `JS`).
====================================
ASCII Roadmap (what you built → where to go):

[Synthetic Feed] --> [Feed Handler (C++/DPDK)] --> [Strategy (Python prototype)] --> [Order Gateway (C++)] --> [Simulated Exchange]
                                                        |
                                                        `--> [Logger / Backtester] (CSV / SQLite)
====================================
What you learned (recap):
- Core components: `market data feed handler`, `strategy`, `order gateway`, `simulated exchange`, `logger/backtest`.
- Tooling: `CMake`, `g++`/`clang`, `venv`/`conda`, `pybind11` for C++/Python bridges. (A minimal bridge sketch follows this list.)
- Low-latency basics: kernel tuning knobs, IRQ affinity, `TX/RX` ring sizing, hardware timestamping, `TSC` caveats.
- Testing & reproducibility: deterministic feeds, unit tests for parsing & matching, CI for deterministic builds.
- Measurement: using `perf`, `tcpdump`/`pcap`, hardware timestamps, and microbenchmarks to find hotspots.
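As a reminder of the smallest possible C++/Python bridge, here is an illustrative `pybind11` module; the module name `sma_bridge` and the `sma` function are examples for this sketch, not the lab's exact symbols.

// sma_bridge.cpp — a typical build command looks like:
//   c++ -O3 -shared -std=c++17 -fPIC $(python3 -m pybind11 --includes) \
//       sma_bridge.cpp -o sma_bridge$(python3-config --extension-suffix)
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include <algorithm>
#include <vector>

// Hot loop moved from Python: simple moving average of the trailing `window` prices.
double sma(const std::vector<double> &prices, std::size_t window) {
    if (prices.empty() || window == 0) return 0.0;
    std::size_t n = std::min(window, prices.size());
    double sum = 0.0;
    for (std::size_t i = prices.size() - n; i < prices.size(); ++i) sum += prices[i];
    return sum / static_cast<double>(n);
}

PYBIND11_MODULE(sma_bridge, m) {
    m.def("sma", &sma, "SMA of the trailing window of prices");
}

// Python side: import sma_bridge; sma_bridge.sma([100.0, 100.2, 99.8], 3)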
Deeper topics to pick next (recommended order):
- Kernel-bypass networking & frameworks: `DPDK`, `PF_RING`, `Solarflare/OpenOnload` — a great next step if you liked the feed handler lab.
- Profiling & optimization: `perf`, `VTune`, cache-aware data structures, memory pools, and `SIMD` vectorization.
- Concurrency & OS internals: `isolcpus`, IRQ affinity, lock-free queues, `NUMA` placement.
- Hardware accelerators: FPGA basics for order-book / matching offload (start with reading & simulated examples).
- Time sync & accuracy: `PTP`, hardware timestamping, and handling clock skew in backtests.
Immediate project ideas (pick one; 1–4 weeks each depending on depth):
1) Implement a minimal matching engine (orders, book, match loop) — language: C++. (A sketch of the match loop follows this list.)
2) Replace the synthetic feed with a local pcap replay and parse a binary multicast format — language: C++ or Python.
3) Prototype a strategy in Python, profile it, then move the hot parts to C++ via `pybind11`.
4) Do a kernel-bypass lab: capture & replay packets with `DPDK` (read the tutorials first).
5) Build a visualizer in JS that consumes your backtest CSV and plots PnL and orders in real time.
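Here is a sketch of what project idea 1's match loop can look like with price-time priority, assuming a toy book with FIFO queues per price level and no cancel/replace handling. Globals keep the sketch short; a real engine would encapsulate the book.

#include <algorithm>
#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <queue>

// Toy price-time-priority book: one side per map, FIFO queue per price level.
struct BookOrder { std::int64_t ts; int size; };

std::map<double, std::queue<BookOrder>, std::greater<double>> bids; // best bid = highest price
std::map<double, std::queue<BookOrder>> asks;                       // best ask = lowest price

// Match an incoming BUY against resting asks at or below `limit_px`;
// any unfilled remainder rests on the bid side.
void match_buy(int size, double limit_px, std::int64_t ts) {
    while (size > 0 && !asks.empty() && asks.begin()->first <= limit_px) {
        auto &level = asks.begin()->second;
        BookOrder &resting = level.front();
        int fill = std::min(size, resting.size);
        std::cout << "FILL " << fill << " @ " << asks.begin()->first << "\n";
        size -= fill;
        resting.size -= fill;
        if (resting.size == 0) level.pop();
        if (level.empty()) asks.erase(asks.begin());
    }
    if (size > 0) bids[limit_px].push({ts, size});
}

int main() {
    asks[100.5].push({1, 200});
    asks[100.7].push({2, 100});
    match_buy(250, 100.6, 3); // fills 200 @ 100.5, rests 50 @ 100.6 on the bid side
}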
Career & practice advice (practical, non-fluffy):
- Build small, reproducible demos — one repo per project with README, deterministic seeds, and a sample dataset.
- Practice systems-design questions that focus on throughput & latency: be ready to describe trade-offs (complexity vs latency, reliability vs throughput).
- Contribute to open-source tooling (network libs, parsers) — practical code review experience matters.
- Prepare for interviews: expect questions on concurrency, TCP vs UDP tradeoffs, and how you measured/optimized latency in a past project.
Further reading & tooling (start here):
- Books/Papers: Advances in Financial Machine Learning (Marcos López de Prado), High-Frequency Trading (Irene Aldridge), and research on `kernel-bypass` networking and FPGAs in finance.
- Tools: `perf`, `gdb`/`lldb`, `Wireshark`/`tcpdump`, `pktgen`, `VTune`, `strace`, `bcc/ebpf`.
- Libraries to explore: `pybind11`, `spdlog`, `fmt`, `Eigen`, `DPDK`, `PF_RING`.
Challenges — pick one and try now (hands-on):
- Short: Re-run the provided C++ microservice and halve `SMA_WINDOW`. Do trades increase? What's the PnL direction?
- Medium: Port the strategy loop to Python using the same random seed — do the results match? If not, why?
- Long: Replace immediate execution with a matching engine and record latency per order (use a simple timestamp pair; see the sketch after this list).
Try editing the small C++ helper below: change the hours/targets and the `advanced` flag, recompile, and use the printed checklist as your personal sprint plan.
Happy hacking — think like an engineer and a detective: measure, hypothesize, change one thing, and measure again. Like a pick-and-roll in basketball (channel your inner `Kobe Bryant` energy): set the screen (infrastructure), make the move (strategy), and finish at the rim (reliable execution).
#include <iostream>
#include <string>
#include <utility>
#include <vector>

using namespace std;

int main() {
    // Personalize these constants to make a mini-study plan
    const bool advanced = false; // set true to include DPDK/FPGA in the sprint

    vector<pair<string, int>> roadmap = {
        {"Feed Handler (C++)", 8},
        {"Order Gateway (C++)", 6},
        {"Strategy Prototype (Python)", 5},
        {"pybind11 Bridge", 4}
    };
    if (advanced) {
        roadmap.push_back({"Kernel-bypass (DPDK)", 12});
        roadmap.push_back({"FPGA reading & simulation", 10});
    }

    cout << "Course Wrap-up checklist:\n";
    cout << "- You have a reproducible microservice and a list of next projects.\n";
    cout << "- Pick 1 small project and 1 deep topic (e.g., profiling or DPDK).\n\n";

    cout << "Personal roadmap (estimated hours):\n";
    int total_hours = 0;
    for (auto &p : roadmap) {
        cout << "  - " << p.first << " : " << p.second << "h\n";
        total_hours += p.second;
    }
    cout << "Total: " << total_hours << "h\n";
    return 0;
}
Try this exercise. Is this statement true or false?
Completing this course equips you to build a minimal end-to-end HFT microservice that performs multicast market data ingestion, runs a simple strategy (prototyped in Python), and submits orders through an order gateway.
Press true if you believe the statement is correct, or false otherwise.