Profiling, Performance Optimization and Vectorization
Why this matters for HFT engineers (beginner-friendly)
- In algorithmic trading you often process millions of ticks per second. Small inefficiencies in loops or data layout become huge latency and throughput problems.
- Think of optimization like a fast-break in basketball: you want the ball (data) to move in straight lines with no unnecessary stops. Poor data layout is like zig-zag dribbling that wastes time.
Quick ASCII visuals — memory layout and cache friendliness
AoS (Array of Structs) — awkward for per-field hot loops:
[ Tick{price,size,ts} ][ Tick{price,size,ts} ] [ Tick{...} ]
^ accessing price
touches other fields too
SoA (Struct of Arrays) — cache-friendly when you only need one field:
prices: [p0, p1, p2, p3, ...]
sizes: [s0, s1, s2, s3, ...]
ts: [t0, t1, t2, t3, ...]
^ sequential memory for prices
-> better L1/L2 prefetching
Core concepts to remember
- Profilers: use perf (Linux) and Intel VTune to find hot functions and cache-miss hotspots.
- Algorithmic choices: a better algorithm beats micro-optimizations (O(n) vs O(n log n)).
- Data layout: SoA often outperforms AoS for tight numeric loops.
- Memory pools: avoid frequent small allocations; use pools/arenas to reduce allocator overhead and fragmentation.
- Vectorization (SIMD): modern compilers auto-vectorize loops when the code is simple and memory-aliasing is clear. You can also use intrinsics later.
Commands & profiling tips (beginner-safe)
- Quick perf run: sudo perf record -F 99 -- ./main then sudo perf report --stdio
- See whether the compiler vectorized a loop (GCC/Clang): compile with -O3 -ftree-vectorize -fopt-info-vec and check the messages.
- Use perf stat -e cache-references,cache-misses ./main to get cache miss rates.
How this ties to languages you know
- From Python/NumPy: moving inner loops to contiguous numpy arrays or C++ gives big speedups. In Python, prefer numpy vector ops to Python loops.
- From Java: Java's HotSpot JIT also vectorizes; the same principles (contiguous arrays, simple loops) apply.
- From C/JS: memory layout and cache behavior still matter; in JS, typed arrays are faster for tight numeric loops.
Try this interactive C++ microbenchmark (below). It demonstrates:
- Generating reproducible ticks (deterministic RNG), similar to replaying market data.
- Two implementations of a simple moving-average workload: AoS vs SoA.
- Timings using std::chrono and a small checksum to keep results honest.
Compile hints (try these locally):
- g++ -O3 -march=native -std=c++17 main.cpp -o main
- Run perf stat -e cache-references,cache-misses ./main and compare cache misses between AoS and SoA.
- If you're curious about vectorization, add -fopt-info-vec (GCC/Clang) to see what loops the compiler vectorized.
Beginner challenges to try after running the code
- Re-run with different N (e.g., 100k, 10M). How does ops/sec scale? Which version scales better?
- Compile with and without -O3. Inspect perf differences and run objdump -d to see the generated assembly.
- Implement the same logic in Python with numpy arrays; compare runtime and think about why numpy can sometimes match C++ for vectorizable ops.
- (Stretch) Try rewriting the hot loop with explicit intrinsics (<immintrin.h>) and compare, but only after you're comfortable with the auto-vectorized result.
Small motivational analogy: if Kobe Bryant is the best at taking a direct straight-to-the-basket path, think of SoA + a vectorized loop as the most direct path your code can take: fewer stops, fewer wasted cycles.
Now run the example below and experiment with the challenges above. The code is self-contained and prints timings and simple checksums so you can verify correctness while measuring performance.
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>
using namespace std;
using steady = chrono::steady_clock;
struct Tick {
double price;
double size;
long ts;
};
// Generate deterministic ticks so runs are reproducible
void gen_ticks_aos(vector<Tick>& ticks, size_t N) {
mt19937_64 rng(42);
uniform_real_distribution<double> price_d(100.0, 101.0);
uniform_real_distribution<double> size_d(1.0, 10.0);
ticks.resize(N);
for (size_t i = 0; i < N; ++i) {
ticks[i].price = price_d(rng);
ticks[i].size = size_d(rng);
ticks[i].ts = static_cast<long>(i);
}
}