Welcome to Agents Academy

Module 12 · Production · ~8 min

Monitoring.

By the end of this module, you can replay any decision your agent made: what it saw, what it thought, what it did. That is the difference between “something went wrong” and “I know exactly what went wrong.”

To get there, you’ll emit a stable NDJSON log line per step and persist every reasoning trace so you can rebuild any decision after the fact. This is the observability layer you’ll rely on every morning, and the raw material Modules 11, 12, and 13 query when something goes wrong.

Production tier · Reference card
Quick answer

How do you monitor an LLM trading agent’s decisions?

Log six things on every run, emit one NDJSON line per step, persist reasoning traces to SQLite, and skim them in a 10-minute daily audit. That is what lets you replay any decision your agent made: what it saw, what it thought, what it did. The six essentials: an ISO-8601 UTC timestamp, the model and version, the full prompt at the moment of the call, every tool call and result in order, the final action, and the reasoning trace. The logger writes one JSON object per line to stdout with a stable schema (ts, run_id, step, level, event); pipe it into $ACADEMY_DATA_DIR/agent.log.ndjson and the Module 02 operator panel renders new lines as they appear. For cross-run questions, traces land in a SQLite traces.db indexed by run_id, ts, and kind. Alert on leading metrics (cost-per-run, error rate, decision rate), not just PnL.

Verified 2026-06-09 where it touches Limitless; the rest is illustrative agent-runtime teaching.

Section 01

What to log.

Every agent run generates a lot of text, and it is tempting to log everything. That gets you logs that nobody reads. The opposite is tempting too: log only orders, and you end up with traces that cannot answer “why”. Six things, every run. No more, no less.

01 · Timestamp

ISO-8601, UTC. Sortable, timezone-proof.

02 · Model + version

e.g. claude-opus-4-8. Provider upgrades will change behaviour.

03 · Full prompt

System + messages at the moment of the call. Essential for reproducing decisions.

04 · Tool calls + results

Every call, every result, in order. Redact PII but never the inputs.

05 · Final action

Placed order? Closed position? Did nothing? The single “output” of the run.

06 · Reasoning trace

The LLM’s text content per step. The “why”, the bit you will actually read at 3am.
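
The six essentials above can be pinned down as a type, so a run that is missing one of them is caught before it is written. This is a minimal sketch: the names RunRecord, ToolExchange, and missingEssentials are hypothetical and do not come from the course code.

```typescript
// Hypothetical shape for one agent run; field names are illustrative only.
interface ToolExchange {
  name: string;
  input: unknown;  // kept verbatim so the call can be replayed
  result: unknown;
}

interface RunRecord {
  ts: string;                                       // 01: ISO-8601 UTC timestamp
  model: string;                                    // 02: model + version
  prompt: { system: string; messages: unknown[] };  // 03: full prompt
  tools: ToolExchange[];                            // 04: calls + results, in call order
  action: string;                                   // 05: final action, e.g. 'skip'
  reasoning: string[];                              // 06: the model's text per step
}

// Returns the names of any essentials that are missing or empty.
function missingEssentials(r: Partial<RunRecord>): string[] {
  const missing: string[] = [];
  const tsOk = !!r.ts && !Number.isNaN(Date.parse(r.ts)) && r.ts.endsWith('Z');
  if (!tsOk) missing.push('ts');
  if (!r.model) missing.push('model');
  if (!r.prompt) missing.push('prompt');
  if (!Array.isArray(r.tools)) missing.push('tools');
  if (!r.action) missing.push('action');
  if (!Array.isArray(r.reasoning) || r.reasoning.length === 0) missing.push('reasoning');
  return missing;
}
```

An empty tools array is accepted on purpose: a run that called no tools is still a complete record, whereas a run with no reasoning text cannot answer “why”.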

Section 02

Structured logging.

One JSON object per line. Write to stdout in production (your container runtime picks it up) and to a file on disk in development. Do not use a logging framework that colours output or adds formatting; both break NDJSON parsing. Simpler is better.

How to run this

  1. No env vars are required, because the logger only writes to stdout (DRY_RUN is optional and only sets a flag on the first line). To capture the output in the file Module 02’s dashboard tails, pipe with | tee -a $ACADEMY_DATA_DIR/agent.log.ndjson.
  2. Save the snippet as logger.ts, then run npx tsx logger.ts from that folder.
  3. You see five one-line JSON objects on stdout, each with run_id, step, and an event. Pipe through jq .event to confirm it parses as valid NDJSON.

// Module 12: Minimal structured logger (NDJSON to stdout)
import { randomUUID } from 'node:crypto';

type Level = 'info' | 'warn' | 'error';

export class Logger {
  readonly runId = randomUUID();
  private  step  = 0;

  private emit(level: Level, event: string, extra: Record<string, unknown> = {}) {
    const line = JSON.stringify({
      ts:     new Date().toISOString(),
      run_id: this.runId,
      step:   this.step++,
      level,
      event,
      ...extra,
    });
    process.stdout.write(line + '\n');
  }

  info(event: string, extra?: Record<string, unknown>)  { this.emit('info',  event, extra); }
  warn(event: string, extra?: Record<string, unknown>)  { this.emit('warn',  event, extra); }
  error(event: string, extra?: Record<string, unknown>) { this.emit('error', event, extra); }
}

// Usage
const log = new Logger();
log.info('agent_start',  { model: 'claude-opus-4-8', dry_run: process.env.DRY_RUN === 'true' });
log.info('tool_call',    { name: 'browse_markets', input: { limit: 20 } });
log.info('tool_result',  { name: 'browse_markets', n_markets: 20, latency_ms: 142 });
log.info('decision',     { action: 'skip', reason: 'no markets above edge threshold' });
log.info('agent_finish', { iters: 3, orders_placed: 0 });

// One NDJSON line per call. Easy to tail, grep, pipe into jq.
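
Reading the lines back is the other half of the job. The sketch below parses NDJSON text and pulls out one run in step order. It assumes the five-field schema emitted by the Logger above; parseNdjson and linesForRun are hypothetical helper names, not part of the course code.

```typescript
// Shape of one line as emitted by the Logger above; extras stay untyped.
type LogLine = {
  ts: string; run_id: string; step: number; level: string; event: string;
} & Record<string, unknown>;

// Parses NDJSON text. Blank lines are skipped; unparseable or malformed
// lines are collected rather than thrown, so one corrupt line does not
// hide the rest of the log.
function parseNdjson(text: string): { lines: LogLine[]; bad: string[] } {
  const lines: LogLine[] = [];
  const bad: string[] = [];
  for (const raw of text.split('\n')) {
    if (raw.trim() === '') continue;
    try {
      const obj = JSON.parse(raw);
      const ok = obj !== null && typeof obj === 'object'
        && typeof obj.run_id === 'string' && typeof obj.step === 'number';
      if (ok) lines.push(obj as LogLine);
      else bad.push(raw);
    } catch {
      bad.push(raw);
    }
  }
  return { lines, bad };
}

// Returns one run's lines in step order.
function linesForRun(all: LogLine[], runId: string): LogLine[] {
  return all.filter((l) => l.run_id === runId).sort((a, b) => a.step - b.step);
}
```

A non-empty bad list is itself worth a look: it usually means something other than the logger wrote to stdout.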

Want to see this NDJSON live in a browser? Module 02 covers the operator dashboard.

The structured logs you wire up in this module are the raw material for a real operator surface: collapsible reasoning blocks, kill-switch toggle, manual override, cost meter. Module 02 (Your Dashboard) tails the same $ACADEMY_DATA_DIR/agent.log.ndjson file; pipe this logger’s stdout into it and the panel renders new lines as they appear.

Section 03

Reasoning trace storage.

NDJSON is fine for short-term triage. For anything you want to query across many runs (“how often did the agent skip markets last week, and why?”), put the traces into SQLite. One table, one row per step, indexed by run_id. It is boring, cheap, and you can query it from a notebook.

// Module 12: Persist agent reasoning to SQLite
import Database from 'better-sqlite3';
import { mkdirSync } from 'node:fs';
import path from 'node:path';

const DATA_DIR = process.env.ACADEMY_DATA_DIR ?? './data';
const DB_DIR   = path.join(DATA_DIR, 'state');
mkdirSync(DB_DIR, { recursive: true });
const db = new Database(path.join(DB_DIR, 'traces.db'));

db.exec(`
  CREATE TABLE IF NOT EXISTS traces (
    run_id     TEXT NOT NULL,
    step       INTEGER NOT NULL,
    ts         TEXT NOT NULL,
    kind       TEXT NOT NULL,
    model      TEXT,
    content    TEXT,          -- JSON blob
    PRIMARY KEY (run_id, step)
  );
  CREATE INDEX IF NOT EXISTS idx_traces_ts   ON traces(ts);
  CREATE INDEX IF NOT EXISTS idx_traces_kind ON traces(kind);
`);

const insert = db.prepare(
  'INSERT INTO traces (run_id, step, ts, kind, model, content) VALUES (?, ?, ?, ?, ?, ?)'
);

export function recordTrace(runId: string, step: number, kind: string, model: string | null, content: unknown) {
  insert.run(runId, step, new Date().toISOString(), kind, model, JSON.stringify(content));
}

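// Usage. The run id is hard-coded for the demo, so a second run against the same traces.db
// fails on the (run_id, step) primary key; delete the file or change the id before re-running.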
recordTrace('run-42', 0, 'prompt',       'claude-opus-4-8', { system: '...', user: 'scan markets' });
recordTrace('run-42', 1, 'assistant',    'claude-opus-4-8', { text: 'I will first call browse_markets' });
recordTrace('run-42', 2, 'tool_call',    null,              { name: 'browse_markets', input: { limit: 20 } });
recordTrace('run-42', 3, 'tool_result',  null,              { output: { n: 20 } });
recordTrace('run-42', 4, 'assistant',    'claude-opus-4-8', { text: 'Nothing interesting, stopping.' });

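// Example daily-audit query. ts is stored as an ISO-8601 string, so compare it against a
// cutoff in the same format; date('now', '-1 day') returns only a date and would reach back
// to the previous UTC midnight instead of exactly 24 hours.
//   SELECT kind, COUNT(*) FROM traces
//   WHERE ts > strftime('%Y-%m-%dT%H:%M:%fZ', 'now', '-1 day') GROUP BY kind;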

How to run this

  1. Set ACADEMY_DATA_DIR to the persistent volume from Module 01 (defaults to ./data for local runs). The SQLite file lands at $ACADEMY_DATA_DIR/state/traces.db and survives redeploys; the snippet creates the directory on first run.
  2. Install the driver with npm i better-sqlite3, save the snippet as trace-store.ts, then run npx tsx trace-store.ts.
  3. Confirm with sqlite3 $ACADEMY_DATA_DIR/state/traces.db 'SELECT kind, COUNT(*) FROM traces GROUP BY kind;'. You see four rows, one per kind (assistant 2, prompt 1, tool_call 1, tool_result 1), covering the five steps recorded for run-42.
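
Once rows are in the table, replaying a run is a filter and a sort. The sketch below works on plain row objects so it runs without a database; in practice the rows would presumably come from something like db.prepare('SELECT * FROM traces WHERE run_id = ?').all(runId). TraceRow and replayRun are hypothetical names chosen for illustration.

```typescript
// Row shape matching the columns of the traces table above.
interface TraceRow {
  run_id: string;
  step: number;
  ts: string;
  kind: string;
  model: string | null;
  content: string | null; // JSON blob, as stored by recordTrace
}

// Renders one run as readable lines, ordered by step.
function replayRun(rows: TraceRow[], runId: string): string[] {
  return rows
    .filter((r) => r.run_id === runId)
    .sort((a, b) => a.step - b.step)
    .map((r) => {
      const who = r.model ? `${r.kind} (${r.model})` : r.kind;
      return `#${r.step} ${r.ts} ${who}: ${r.content ?? ''}`;
    });
}

// Usage with two rows shaped like the run-42 example, deliberately out of order.
const demoRows: TraceRow[] = [
  { run_id: 'run-42', step: 1, ts: '2026-01-01T00:00:01.000Z', kind: 'assistant',
    model: 'claude-opus-4-8', content: '{"text":"I will first call browse_markets"}' },
  { run_id: 'run-42', step: 0, ts: '2026-01-01T00:00:00.000Z', kind: 'prompt',
    model: 'claude-opus-4-8', content: '{"system":"...","user":"scan markets"}' },
];
for (const line of replayRun(demoRows, 'run-42')) console.log(line);
```

Sorting by step rather than ts is deliberate: step is part of the primary key, while two steps can share a millisecond timestamp.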

Section 04

The daily audit.

Monitoring is worthless if nobody looks. Block out 10 minutes every morning (coffee, terminal, trace store) and skim the previous 24 hours. You are not trying to catch every mistake. You are trying to catch the pattern that tells you something is off before it becomes expensive.

Hallucinated trades

Any tool call referencing a market_id that does not exist in browse_markets results. Always a bug, always worth chasing.
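
This check reduces to a set difference. The sketch below assumes tool-call inputs carry a market_id string field and that you can collect the ids browse_markets returned during the same run; ToolCallInput and hallucinatedMarketIds are hypothetical names.

```typescript
// One recorded tool call; input is whatever JSON the model supplied.
interface ToolCallInput {
  name: string;
  input: Record<string, unknown>;
}

// Returns every distinct market_id that a tool call referenced but that
// never appeared in the browse_markets results for the run.
function hallucinatedMarketIds(seenMarketIds: Iterable<string>, calls: ToolCallInput[]): string[] {
  const seen = new Set(seenMarketIds);
  const unknown = new Set<string>();
  for (const call of calls) {
    const id = call.input['market_id'];
    if (typeof id === 'string' && !seen.has(id)) unknown.add(id);
  }
  return Array.from(unknown);
}
```

Calls without a market_id (such as browse_markets itself) are ignored rather than flagged.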

Cost spikes

LLM token usage 3× above the 7-day median. Usually means the model got stuck in a loop and your max-iters cap saved you, but double-check.
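
The 3× rule is simple arithmetic over daily totals. A minimal sketch, assuming you have already summed token usage per day for the previous seven days; median and isCostSpike are hypothetical helpers.

```typescript
// Median of a list of numbers; NaN for an empty list.
function median(xs: number[]): number {
  if (xs.length === 0) return NaN;
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// True when today's usage exceeds `factor` times the median of the baseline days.
// With no baseline, or a zero baseline, it reports no spike rather than guessing.
function isCostSpike(baselineDailyTokens: number[], todayTokens: number, factor = 3): boolean {
  const m = median(baselineDailyTokens);
  return Number.isFinite(m) && m > 0 && todayTokens > factor * m;
}
```

The median is used instead of the mean so that one earlier bad day does not raise the baseline and mask the next spike.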

Repeated failed tool calls

Same tool, same error, multiple runs. Schema drift, rate limit, or a downstream API change.
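
“Same tool, same error, multiple runs” is a group-by with a distinct count. The sketch below assumes failed tool results have been extracted into flat records; ToolFailure and repeatedFailures are hypothetical names.

```typescript
// One failed tool call, flattened from the trace store or the NDJSON log.
interface ToolFailure {
  run_id: string;
  tool: string;
  error: string;
}

// Groups failures by (tool, error) and keeps the pairs seen in at least
// `minRuns` distinct runs, most widespread first.
function repeatedFailures(
  failures: ToolFailure[],
  minRuns = 2,
): { tool: string; error: string; runs: number }[] {
  const runsByKey = new Map<string, Set<string>>();
  for (const f of failures) {
    const key = JSON.stringify([f.tool, f.error]);
    const runs = runsByKey.get(key) ?? new Set<string>();
    runs.add(f.run_id);
    runsByKey.set(key, runs);
  }
  const out: { tool: string; error: string; runs: number }[] = [];
  runsByKey.forEach((runs, key) => {
    if (runs.size >= minRuns) {
      const [tool, error] = JSON.parse(key) as [string, string];
      out.push({ tool, error, runs: runs.size });
    }
  });
  return out.sort((a, b) => b.runs - a.runs);
}
```

Counting distinct runs rather than raw failures keeps a single looping run from looking like a systemic problem.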

Drawdown watermarks

Compare current P&L to the kill-switch threshold from Module 14. If you are halfway there, tighten limits before the switch fires.
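
“Halfway there” can be made a number. The sketch below assumes the kill-switch threshold is expressed as a maximum drawdown from peak P&L in the same currency units; Module 14 may define it differently, so treat drawdownFraction and its parameters as hypothetical.

```typescript
// Fraction of the way from zero drawdown to the kill-switch threshold:
// 0 means at or above the peak, 1 means the switch would fire.
// Assumption: killSwitchDrawdown is a positive amount measured from peak P&L.
function drawdownFraction(peakPnl: number, currentPnl: number, killSwitchDrawdown: number): number {
  if (killSwitchDrawdown <= 0) throw new Error('killSwitchDrawdown must be positive');
  const drawdown = Math.max(0, peakPnl - currentPnl);
  return drawdown / killSwitchDrawdown;
}

// The audit rule from the text: tighten limits once you are halfway to the switch.
function shouldTightenLimits(fraction: number): boolean {
  return fraction >= 0.5;
}
```

For example, a peak of 100, a current P&L of 75, and a threshold of 50 gives a fraction of 0.5, which is the point at which the text suggests tightening limits.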

Wire monitoring into your dashboard.

Two payoffs land in Module 02’s panel as soon as this module’s logging is in place. Log streaming: every NDJSON line you write here surfaces in the panel’s reasoning view via the SSE endpoint, collapsible per iteration. Cost meter: the tokens_in, tokens_out, and cost_usd fields you record now drive the iters/hr, tokens/hr, and cost/hr cells in the panel’s cost meter. The cost meter is the canary, not the budget; Module 14 has the hard cap that stops the runaway.
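
The three cost fields can travel as extras on an ordinary log line. A minimal sketch: the prices below are placeholders, not real provider rates, and usageFields and the llm_usage event name are hypothetical.

```typescript
// Placeholder per-million-token prices in USD; substitute your provider's real rates.
const PRICE_PER_MTOK = { input: 3, output: 15 };

// Builds the extras object for a usage log line, with cost rounded to six decimals.
function usageFields(tokensIn: number, tokensOut: number): {
  tokens_in: number; tokens_out: number; cost_usd: number;
} {
  const cost =
    (tokensIn * PRICE_PER_MTOK.input + tokensOut * PRICE_PER_MTOK.output) / 1_000_000;
  return { tokens_in: tokensIn, tokens_out: tokensOut, cost_usd: Number(cost.toFixed(6)) };
}

// With the Logger from Section 02 this would be passed as the extras argument,
// e.g. log.info('llm_usage', usageFields(1000, 500)).
console.log(JSON.stringify({ event: 'llm_usage', ...usageFields(1000, 500) }));
```

Recording cost at write time, rather than recomputing it later from tokens, keeps historical lines correct after a price change.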

Common questions

Agent monitoring: what people ask


  1. What should a trading agent log on every run?
    Six things, no more, no less: an ISO-8601 UTC timestamp (sortable, timezone-proof); the model and version, because provider upgrades change behaviour; the full prompt (system + messages) at the moment of the call; every tool call and result, in order, redacting PII but never the inputs; the final action (placed an order, closed a position, did nothing); and the reasoning trace, the “why” you will actually read at 3am.
  2. Why log agent output as NDJSON?
    One JSON object per line on stdout is trivially machine-readable: tail it, grep it, pipe it through jq, and let the container runtime collect it. Keep a stable schema per line: ts, run_id, step, level, event, plus event-specific extras. Avoid logging frameworks that colour or reformat output, because they break NDJSON parsing. Pipe stdout into $ACADEMY_DATA_DIR/agent.log.ndjson and the Module 02 dashboard tails the same file over SSE.
  3. Where should reasoning traces live long-term?
    SQLite. NDJSON is fine for short-term triage, but anything you want to query across many runs goes into a traces table: one row per step (prompt, assistant message, tool call, tool result), primary key (run_id, step), indexed by ts and kind, stored at $ACADEMY_DATA_DIR/state/traces.db on the persistent volume so it survives redeploys. It is boring, cheap, and queryable from a notebook with one SELECT.
  4. What does a daily agent audit check?
    Ten minutes every morning against the last 24 hours of traces, looking for four patterns: hallucinated trades (any tool call referencing a market_id that does not exist in browse_markets results, always a bug); cost spikes (token usage 3× the 7-day median, usually a loop your max-iters cap caught); repeated failed tool calls (schema drift, a rate limit, or a downstream API change); and drawdown watermarks creeping toward the kill-switch threshold, in which case you tighten limits before the switch fires.
  5. Why is watching only PnL not enough?
    PnL is the lagging measure of the agent’s decisions; decision rate, error rate, latency, and cost-per-run lead it. PnL-only alerting misses runaway loops (cost spikes before PnL moves), tool-error storms (errors spike while PnL stays flat because nothing fills), and model drift (decision rate halves overnight, PnL bleeds slowly). Wire alarms on cost-per-run, tool-calls-per-run, latency p95, error rate, decision rate, and kill-switch state, each thresholded against a 7-day baseline.


Module 12 complete

Eyes on the agent.

Your agent can be debugged in the past tense. When something looks off the next morning, you don’t have to guess: you can pull up the exact prompt, tool call, and reasoning step that produced the trade and decide from evidence whether the agent or the prompt was at fault.

Concretely, you can reconstruct any run of your agent (prompt, tool calls, decision, reasoning) from disk.

01

A Logger class that emits NDJSON to stdout with a stable five-field schema (ts, run_id, step, level, event), plus whatever extras the event needs.

02

A queryable traces.db SQLite file with a traces table indexed by run_id, ts, and kind: one row per prompt, assistant message, tool call, and tool result.

03

A daily-audit checklist (hallucinated market_ids, cost spikes, repeated tool errors, drawdown watermarks) you can run in 10 minutes against the last 24 hours of traces.

Next up: turning those traces into tests. Golden prompts, tool-call assertions, and replay harnesses catch regressions before they reach production.
