
Logging, Observability and Incident Response

Why this matters for HFT engineers (beginner-friendly)

  • In algorithmic trading, especially HFT, a missing or slow log can hide a latency spike that costs money. Logs are your breadcrumbs, metrics are your heartbeat, and traces are your map when something goes wrong.
  • Think of your system like a fast-break basketball play: the ball (market data) flies through different players (feed handler → strategy → order gateway). If one player hesitates for 2 ms, the play fails. Logging must show who hesitated and why — without slowing the play.

Primary goals for this section

  • Design low-latency, non-blocking logging that doesn't add jitter.
  • Collect lightweight metrics (counters, gauges, histograms) and export them.
  • Add simple tracing IDs to tie together a market tick's path through the system.
  • Create alert rules and a minimal runbook to act when SLOs break.

Quick ASCII diagram — where to tap logs/metrics

[Exchange NIC] -> [Feed Handler] -> [Strategy] -> [Order Gateway] -> [Exchange]
                        |               |                |
                  logs/metrics      logs/trace      logs/metrics

Key patterns and trade-offs (for a multi-language view)

  • Use a lock-free or bounded ring buffer for logs in C++ (spdlog::sinks::ringbuffer_sink_mt or a hand-rolled SPSC ring) to avoid allocations in the hot path.
  • In Python/Java/C/JS prototypes, prefer structured logging and metrics via JSON Lines. But beware: garbage collection and allocations can add latency — profile!
  • Emit minimal data on the hot path: timestamp, trace_id, event_type, latency_ns — push heavy context to async uploaders (see the record sketch after this list).
  • Batch and flush: coalesce many small log writes into a single I/O operation off the critical path.
  • Metrics: counters (events/sec), gauges (queue length), histograms (latency distribution) — keep them lightweight. Use prom-client / Prometheus exporters in non-latency-critical threads.
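
To make the "minimal data on hot path" bullet concrete, here is a minimal C++ sketch, assuming an illustrative record layout (HotLogRecord and its fields are hypothetical, not from any library): the hot path copies one fixed-size POD into a preallocated slot, and heavy context is reconstructed later, off the critical path, keyed by trace_id.

#include <cstdint>
#include <cstring>

// Illustrative fixed-size record: everything the hot path emits fits in
// one small POD, so pushing it is a memcpy, never a heap allocation.
struct HotLogRecord {
  uint64_t timestamp_ns;  // when the event happened
  uint64_t trace_id;      // ties this event to one tick's journey
  uint32_t event_type;    // small enum: TICK_IN, ORDER_OUT, ...
  uint32_t latency_ns;    // measured hot-path latency
};
static_assert(sizeof(HotLogRecord) == 24, "keep the record small and fixed-size");

// Hot path: copy the record into a preallocated slot (ring omitted here).
inline void emit(HotLogRecord* slot, const HotLogRecord& rec) {
  std::memcpy(slot, &rec, sizeof(rec));
}

int main() {
  HotLogRecord slot{};                             // stands in for one ring-buffer slot
  HotLogRecord rec{123456789ull, 42ull, 1u, 350u}; // hypothetical tick event
  emit(&slot, rec);                                // a 24-byte copy, no allocation
  return slot.latency_ns == 350 ? 0 : 1;
}

The demo at the end of this section deliberately uses std::string for readability; the advanced study task asks you to swap in fixed-size records like this.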

Incident response basics (mini runbook)

  1. Alert fires: e.g., 99.9th percentile latency > 1 ms for 1 minute (see the histogram sketch after this runbook).
  2. Check quick health endpoints: /metrics, process CPU, queue sizes, NIC errors.
  3. Look at recent trace IDs logged around the spike and replay those ticks locally.
  4. If the cause is config/kernel change, roll back and escalate.
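
For step 1, here is a minimal sketch of how a p99.9 check might be driven from a bucketed latency histogram. The bucket bounds and sample numbers are illustrative assumptions, not a real alerting pipeline:

#include <array>
#include <cstdint>
#include <cstdio>

// Illustrative bucketed latency histogram; bounds in ns are assumptions.
constexpr std::array<uint64_t, 6> kBounds = {500, 1'000, 10'000, 100'000, 1'000'000, 10'000'000};
std::array<uint64_t, 6> counts{}; // counts[i] = samples landing in bucket i

// Record one latency sample; values above the last bound clamp into it.
void record(uint64_t lat_ns) {
  size_t i = 0;
  while (i + 1 < kBounds.size() && lat_ns > kBounds[i]) ++i;
  ++counts[i];
}

// Upper bound of the bucket containing the q-th quantile (e.g. q = 0.999).
uint64_t percentile_upper_bound(double q) {
  uint64_t total = 0;
  for (uint64_t c : counts) total += c;
  uint64_t rank = (uint64_t)(q * (double)total), seen = 0;
  for (size_t i = 0; i < counts.size(); ++i) {
    seen += counts[i];
    if (seen >= rank) return kBounds[i];
  }
  return kBounds.back();
}

int main() {
  // 998 fast ticks plus 2 simulated 2 ms spikes (0.2% of traffic)
  for (int i = 0; i < 1000; ++i) record(i < 998 ? 400 : 2'000'000);
  if (percentile_upper_bound(0.999) > 1'000'000)
    std::puts("[ALERT] p99.9 latency exceeds 1 ms");
  return 0;
}

In practice you would export the buckets to a system like Prometheus and let the alert rule do this math; the point is that a handful of counters is enough to detect a tail-latency breach.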

Hands-on example (C++):

  • A tiny, runnable demo that shows a bounded ring buffer logger, a producer simulating ticks (with occasional latency spikes), and a consumer that drains logs and derives metrics.
  • This is an educational prototype — in production you'd replace strings with preallocated structures and avoid std::string allocations.

Study tasks / Challenges

  • Run the example below and change BUFFER_SIZE to 8 and then to 128. Observe the dropped-log count and the reported max latency.
  • Change LATENCY_ALERT_NS to a lower value and see how the simulated spike triggers an alert.
  • Extend the logger to pre-allocate a pool of fixed-size char arrays to avoid heap allocations (advanced).

Code (compile as main.cpp). Try changing buffer size and alert threshold.

#include <iostream>
#include <vector>
#include <string>
#include <atomic>
#include <thread>
#include <chrono>
#include <random>
#include <sstream>
#include <iomanip>
#include <utility>

using namespace std;
using namespace std::chrono;

// Single-producer/single-consumer bounded ring buffer. Educational only:
// production loggers preallocate fixed-size slots instead of std::string.
struct SimpleRingLogger {
  vector<string> buf;
  size_t capacity;
  atomic<size_t> head{0}; // next write index (touched only by producer)
  atomic<size_t> tail{0}; // next read index (touched only by consumer)
  atomic<size_t> dropped{0};

  SimpleRingLogger(size_t cap) : buf(cap), capacity(cap) {}

  // Non-blocking push: counts a drop and returns false if buffer is full
  bool push(string msg) {
    size_t h = head.load(memory_order_relaxed);
    size_t t = tail.load(memory_order_acquire);
    if (h - t >= capacity) { // full
      dropped.fetch_add(1, memory_order_relaxed);
      return false;
    }
    buf[h % capacity] = move(msg);
    head.store(h + 1, memory_order_release);
    return true;
  }

  // Non-blocking pop: returns true if there was an item
  bool pop(string &out) {
    size_t t = tail.load(memory_order_relaxed);
    size_t h = head.load(memory_order_acquire);
    if (t >= h) return false; // empty
    out = move(buf[t % capacity]);
    tail.store(t + 1, memory_order_release);
    return true;
  }
};

int main() {
  const size_t BUFFER_SIZE = 16;          // try 8 / 128 as experiments
  const size_t TOTAL_TICKS = 500;         // how many simulated ticks
  const long long LATENCY_ALERT_NS = 1'000'000; // 1 ms in ns

  SimpleRingLogger logger(BUFFER_SIZE);

  atomic<bool> done{false};

  atomic<uint64_t> total_logged{0};
  atomic<uint64_t> max_latency_ns{0};
  atomic<uint64_t> events_processed{0};

  // Consumer: drains logs and updates metrics
  thread consumer([&]() {
    string item;
    while (!done.load() || logger.head.load() != logger.tail.load()) {
      while (logger.pop(item)) {
        // parse the comma-separated record "trace_id,seq,latency_ns,timestamp"
        stringstream ss(item);
        string trace, seq_s, lat_s, ts_s;
        if (getline(ss, trace, ',') && getline(ss, seq_s, ',') &&
            getline(ss, lat_s, ',') && getline(ss, ts_s)) {
          uint64_t seq = stoull(seq_s);
          uint64_t lat = stoull(lat_s); // timestamp kept as text here
          events_processed.fetch_add(1);
          // lock-free running maximum
          uint64_t prev_max = max_latency_ns.load();
          while (lat > prev_max && !max_latency_ns.compare_exchange_weak(prev_max, lat)) {}
          // alerting/exporting happens here, off the producer's critical path
          if (lat > (uint64_t)LATENCY_ALERT_NS) {
            cout << "[ALERT] High latency detected trace=" << trace << " seq=" << seq
                 << " lat_ns=" << lat << "\n";
          }
        }
        total_logged.fetch_add(1);
      }
      // small sleep to avoid busy spin in this demo
      this_thread::sleep_for(milliseconds(1));
    }
  });

  // Producer: simulates handling incoming ticks and logs latency
  thread producer([&]() {
    mt19937_64 rng(424242);
    uniform_int_distribution<int> base_ns(100, 800); // normal path 100-800 ns
    for (uint64_t i = 0; i < TOTAL_TICKS; ++i) {
      // simulate work
      int simulated = base_ns(rng);

      // simulate a rare spike every 120 ticks
      if (i % 120 == 0 && i != 0) {
        simulated += 2'000'000; // +2 ms spike
        this_thread::sleep_for(milliseconds(2));
      }

      // timestamp and record latency
      auto t0 = high_resolution_clock::now();
      // (work would happen here)
      auto t1 = high_resolution_clock::now();

      uint64_t observed_ns = (uint64_t)duration_cast<nanoseconds>(t1 - t0).count() + simulated;

      // build lightweight structured log: trace_id,seq,lat_ns,timestamp_ns
      stringstream ss;
      ss << "T" << setw(6) << setfill('0') << (i % 999999) << "," << i << "," << observed_ns << ","
         << duration_cast<nanoseconds>(t1.time_since_epoch()).count();
      string msg = ss.str();

      if (!logger.push(move(msg))) {
        // SimpleRingLogger already counted the drop; never block or
        // allocate here — keep the hot path fast and move on
      }

      // pacing: very small sleep to emulate incoming tick rate
      this_thread::sleep_for(microseconds(100));
    }
    done.store(true);
  });

  producer.join();
  consumer.join();

  cout << "\n--- Summary ---\n";
  cout << "Total ticks generated: " << TOTAL_TICKS << "\n";
  cout << "Total logs consumed:  " << total_logged.load() << "\n";
  cout << "Events processed (metrics): " << events_processed.load() << "\n";
  cout << "Dropped logs (ring full): " << logger.dropped.load() << "\n";
  cout << "Max observed latency (ns): " << max_latency_ns.load() << "\n";
  cout << "(Change BUFFER_SIZE and LATENCY_ALERT_NS in code to experiment)\n";

  return 0;
}