Profiling, Performance Optimization and Vectorization
Why this matters for HFT engineers (beginner-friendly)
- In algorithmic trading you often process millions of ticks per second. Small inefficiencies in loops or data layout become huge latency and throughput problems.
- Think of optimization like a fast-break in basketball: you want the ball (data) to move in straight lines with no unnecessary stops. Poor data layout is like zig-zag dribbling that wastes time.
Quick ASCII visuals — memory layout and cache friendliness
AoS (Array of Structs) — awkward for per-field hot loops:
[ Tick{price,size,ts} ][ Tick{price,size,ts} ] [ Tick{...} ]
^ accessing price
touches other fields too
SoA (Struct of Arrays) — cache-friendly when you only need one field:
prices: [p0, p1, p2, p3, ...]
sizes: [s0, s1, s2, s3, ...]
ts: [t0, t1, t2, t3, ...]
^ sequential memory for prices
-> better L1/L2 prefetching
Core concepts to remember
- Profilers: use perf (Linux) and Intel VTune to find hot functions and cache-miss hotspots.
- Algorithmic choices: a better algorithm beats micro-optimizations (O(n) vs O(n log n)).
- Data layout: SoA often outperforms AoS for tight numeric loops.
- Memory pools: avoid frequent small allocations; use pools/arenas to reduce allocator overhead and fragmentation.
- Vectorization (SIMD): modern compilers auto-vectorize loops when the code is simple and memory-aliasing is clear. You can also use intrinsics later.
Commands & profiling tips (beginner-safe)
- Quick perf run: sudo perf record -F 99 -- ./main then sudo perf report --stdio
- See whether the compiler vectorized a loop (GCC/Clang): compile with -O3 -ftree-vectorize -fopt-info-vec and check the messages.
- Use perf stat -e cache-references,cache-misses ./main to get cache miss rates.
How this ties to languages you know
- From Python/NumPy: moving inner loops to contiguous numpy arrays or C++ gives big speedups. In Python, prefer numpy vector ops to Python loops.
- From Java: Java's HotSpot JIT also vectorizes; the same principles (contiguous arrays, simple loops) apply.
- From C/JS: memory layout and cache behavior still matter; in JS, typed arrays are faster for tight numeric loops.
Try this interactive C++ microbenchmark (below). It demonstrates:
- Generating reproducible ticks (deterministic RNG), similar to replaying market data.
- Two implementations of a simple moving-average workload: AoS vs SoA.
- Timings using std::chrono and a small checksum to keep results honest.
Compile hints (try these locally):
- g++ -O3 -march=native -std=c++17 main.cpp -o main
- Run perf stat -e cache-references,cache-misses ./main and compare cache misses between AoS and SoA.
- If you're curious about vectorization, add -fopt-info-vec (GCC/Clang) to see what loops the compiler vectorized.
Beginner challenges to try after running the code
- Re-run with different N (e.g., 100k, 10M). How does ops/sec scale? Which version scales better?
- Compile with and without -O3. Inspect perf differences and run objdump -d to see the generated assembly.
- Implement the same logic in Python with numpy arrays; compare runtime and think about why numpy can sometimes match C++ for vectorizable ops.
- (Stretch) Try rewriting the hot loop with explicit intrinsics (<immintrin.h>) and compare, but only after you're comfortable with the auto-vectorized result.
Small motivational analogy: if Kobe Bryant is the best at taking a direct straight-to-the-basket path, think of SoA + a vectorized loop as the most direct path your code can take: fewer stops, fewer wasted cycles.
Now run the example below and experiment with the challenges above. The code is self-contained and prints timings and simple checksums so you can verify correctness while measuring performance.
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>
using namespace std;
using steady = chrono::steady_clock;
struct Tick {
double price;
double size;
long ts;
};
// Generate deterministic ticks so runs are reproducible
void gen_ticks_aos(vector<Tick>& ticks, size_t N) {
mt19937_64 rng(42);
uniform_real_distribution<double> price_d(100.0, 101.0);
uniform_real_distribution<double> size_d(1.0, 10.0);
ticks.resize(N);
for (size_t i = 0; i < N; ++i) {
ticks[i].price = price_d(rng);
ticks[i].size = size_d(rng);
ticks[i].ts = static_cast<long>(i);
}
}