Logging, Observability and Incident Response
Why this matters for HFT engineers (beginner-friendly)
- In algorithmic trading, especially HFT, a missing or slow log can hide a latency spike that costs money. Logs are your breadcrumbs, metrics are your heartbeat, and traces are your map when something goes wrong.
- Think of your system like a fast-break basketball play: the ball (market data) flies through different players (feed handler → strategy → order gateway). If one player hesitates 2ms, the play fails. Logging must show who hesitated and why — without slowing the play.
Primary goals for this section
- Design low-latency, non-blocking logging that doesn't add jitter.
- Collect lightweight metrics (counters, gauges, histograms) and export them.
- Add simple tracing IDs to tie together a market tick's path through the system.
- Create alert rules and a minimal runbook to act when SLOs break.
Quick ASCII diagram — where to tap logs/metrics
[Exchange NIC] -> [Feed Handler] -> [Strategy] -> [Order Gateway] -> [Exchange]
                       |                |               |
                 logs/metrics      logs/trace     logs/metrics
Key patterns and trade-offs (for a multi-language view)
- Use a lock-free or bounded ring buffer for logs in C++ (spdlog::sinks::ringbuffer_sink_mt, or a hand-rolled SPSC ring) to avoid allocations in the hot path.
- In Python/Java/C/JS prototypes, prefer structured logging and metrics via JSON lines. But beware: garbage collection and allocations can add latency, so profile!
- Emit minimal data on the hot path: timestamp, trace_id, event_type, latency_ns. Push heavy context to async uploaders.
- Batch and flush: coalesce many small log writes into a single I/O operation off the critical path.
- Metrics: counters (events/sec), gauges (queue length), histograms (latency distribution). Keep them lightweight, and run prom-client / Prometheus exporters in non-latency-critical threads.
Incident response basics (mini runbook)
- Alert fires: e.g., 99.9th percentile latency > 1ms for 1 minute.
- Check quick health endpoints: /metrics, process CPU, queue sizes, NIC errors.
- Look at recent trace IDs logged around the spike and replay those ticks locally.
- If the cause is config/kernel change, roll back and escalate.
Hands-on example (C++):
- A tiny, runnable demo that shows a bounded ring buffer logger, a producer simulating ticks (with occasional latency spikes), a consumer that drains logs and creates metrics.
- This is an educational prototype — in production you'd replace strings with preallocated structures and avoid std::string allocations.
Study tasks / Challenges
- Run the example below and change BUFFER_SIZE to 8 and then to 128. Observe the dropped log counts and the max latency reported.
- Change LATENCY_ALERT_NS to a lower value and see how the simulated spike triggers an alert.
- Extend the logger to pre-allocate a pool of fixed-size char arrays to avoid heap allocations (advanced).
Code (compile as main.cpp). Try changing buffer size and alert threshold.
```cpp
#include <iostream>
#include <vector>
#include <string>
#include <atomic>
#include <thread>
#include <chrono>
#include <cstdint>
#include <random>
#include <sstream>
#include <iomanip>

using namespace std;
using namespace std::chrono;

struct SimpleRingLogger {
    vector<string> buf;
    size_t capacity;
    atomic<size_t> head{0};    // next write index
    atomic<size_t> tail{0};    // next read index
    atomic<size_t> dropped{0};

    SimpleRingLogger(size_t cap) : buf(cap), capacity(cap) {}

    // Non-blocking push: returns false if buffer is full
    bool push(string msg) {
        size_t h = head.load(memory_order_relaxed);
        size_t t = tail.load(memory_order_acquire);
        if (h - t >= capacity) { // full
            dropped.fetch_add(1, memory_order_relaxed);
            return false;
        }
        buf[h % capacity] = move(msg);
        head.store(h + 1, memory_order_release);
        return true;
    }

    // Non-blocking pop: returns true if there was an item
    bool pop(string &out) {
        size_t t = tail.load(memory_order_relaxed);
        size_t h = head.load(memory_order_acquire);
        if (t >= h) return false; // empty
        out = move(buf[t % capacity]);
        tail.store(t + 1, memory_order_release);
        return true;
    }
};

int main() {
    const size_t BUFFER_SIZE = 16;                // try 8 / 128 as experiments
    const size_t TOTAL_TICKS = 500;               // how many simulated ticks
    const long long LATENCY_ALERT_NS = 1'000'000; // 1 ms in ns

    SimpleRingLogger logger(BUFFER_SIZE);

    atomic<bool> done{false};

    atomic<uint64_t> total_logged{0};
    atomic<uint64_t> max_latency_ns{0};
    atomic<uint64_t> events_processed{0};

    // Consumer: drains logs and updates metrics
    thread consumer([&]() {
        string item;
        while (!done.load() || logger.head.load() != logger.tail.load()) {
            while (logger.pop(item)) {
                // parse the comma-delimited record "trace_id,seq,latency_ns,timestamp"
                stringstream ss(item);
                string trace, field;
                if (getline(ss, trace, ',') && getline(ss, field, ',')) {
                    uint64_t seq = stoull(field);
                    getline(ss, field, ',');
                    uint64_t lat = stoull(field);
                    getline(ss, field); // timestamp_ns, available for batch export

                    events_processed.fetch_add(1);
                    uint64_t prev_max = max_latency_ns.load();
                    while (lat > prev_max && !max_latency_ns.compare_exchange_weak(prev_max, lat)) {}
                    // simulate exporting to disk/network in batches (not blocking producer)
                    if (lat > (uint64_t)LATENCY_ALERT_NS) {
                        cout << "[ALERT] High latency detected trace=" << trace << " seq=" << seq
                             << " lat_ns=" << lat << "\n";
                    }
                }
                total_logged.fetch_add(1);
            }
            // small sleep to avoid busy spin in this demo
            this_thread::sleep_for(milliseconds(1));
        }
    });

    // Producer: simulates handling incoming ticks and logs latency
    thread producer([&]() {
        mt19937_64 rng(424242);
        uniform_int_distribution<int> base_ns(100, 800); // normal path 100-800 ns
        for (uint64_t i = 0; i < TOTAL_TICKS; ++i) {
            // simulate work
            int simulated = base_ns(rng);

            // simulate a rare spike every 120 ticks
            if (i % 120 == 0 && i != 0) {
                simulated += 2'000'000; // +2 ms spike
                this_thread::sleep_for(milliseconds(2));
            }

            // timestamp and record latency
            auto t0 = high_resolution_clock::now();
            // (work would happen here)
            auto t1 = high_resolution_clock::now();

            uint64_t observed_ns = (uint64_t)duration_cast<nanoseconds>(t1 - t0).count() + simulated;

            // build lightweight structured log: trace_id,seq,lat_ns,timestamp_ns
            stringstream ss;
            ss << "T" << setw(6) << setfill('0') << (i % 999999) << "," << i << "," << observed_ns << ","
               << duration_cast<nanoseconds>(t1.time_since_epoch()).count();
            string msg = ss.str();

            if (!logger.push(move(msg))) {
                // dropped: in a real system you might increment a metric and continue;
                // keep the hot path fast and never block here
            }

            // pacing: very small sleep to emulate incoming tick rate
            this_thread::sleep_for(microseconds(100));
        }
        done.store(true);
    });

    producer.join();
    consumer.join();

    cout << "\n--- Summary ---\n";
    cout << "Total ticks generated: " << TOTAL_TICKS << "\n";
    cout << "Total logs consumed: " << total_logged.load() << "\n";
    cout << "Events processed (metrics): " << events_processed.load() << "\n";
    cout << "Dropped logs (ring full): " << logger.dropped.load() << "\n";
    cout << "Max observed latency (ns): " << max_latency_ns.load() << "\n";
    cout << "(Change BUFFER_SIZE and LATENCY_ALERT_NS in code to experiment)\n";

    return 0;
}
```