Production tier · Reference card · Agents Academy
Production.
Production.
One sheet.
Deploy gate, the metrics every running agent emits, kill-switch shapes, prompt-injection defense layers, the trace-replay audit pattern, and the recovery decision tree. Pin it; print it; come back to it.
- Tests green. Unit + integration + 1 replay test. See card 04.
- Risk caps in code, not prompt. Verify
place_limit_orderrejects oversize even if prompt says yes. - Kill switch verified. Touch
$ACADEMY_DATA_DIR/kill_switch.flag, watch loop refuse to act within 1 cycle. - Trace redaction tested. Run with a fake API key; confirm it doesn’t appear in
$ACADEMY_DATA_DIR/state/traces/. - Health monitor wired. Cost / latency / decision-rate / error-rate page on threshold.
- Cancel-on-disconnect on. Drop the WS; resting orders are removed exchange-side.
- Run-book exists. 1 page: how to stop, how to flatten, who to call. Pinned in the team channel.
| Metric | Page on |
|---|---|
| cost_per_run | 2× baseline |
| tool_calls_per_run | > 30 |
| latency_p95 | 2× baseline |
| error_rate_5m | > 1% |
| kill_switch_active | any |
| chain_rest_drift | > dust |
| Unit | Each tool, in isolation, with mocked deps |
| Integration | Loop + tools against testnet / mock exchange |
| Replay | Re-run an NDJSON trace; assert decisions match |
| Adversarial | Inject prompt-injection payloads; confirm no escape |
| Smoke | 30s live run on testnet; confirm zero crashes |
// Daily cron
1. read state/traces/yesterday
2. for each runId:
- replay against same
model + tools (mocked)
- assert decisions match
- flag drift > 0
3. write replay-report.md
Catches model-version drift, prompt regressions, and silent strategy breakage.
| Shape | Trigger | Effect |
|---|---|---|
| File flag | touch $ACADEMY_DATA_DIR/kill_switch.flag (or Module 02 panel tap) | Loop top check; refuse to act in next iteration |
| Daily-loss | NAV drop > 3% | Cancel resting, flatten, halt 24h |
| Cost cap | $X/day spent on inference | Halt; alert |
| Tool error rate | > 5% over 5m | Halt; alert |
| Chain drift | REST ≠ chain > dust | Halt; reconcile by hand |
All five wire to the SAME halt path, cancel resting, flag $ACADEMY_DATA_DIR/kill_switch.flag, alert. One halt path = one thing to test.
- Layer 1. Sanitise tool output before feeding back, strip control sequences, <script>, “ignore previous instructions”.
- Layer 2. Risk caps in code, not prompt. The model can’t talk its way past
maxSizeinplace_limit_order. - Layer 3. Allowlist tools per loop. The agent that browses markets shouldn’t have
place_limit_orderavailable. - Layer 4. Human approval for new market types / oversize trades.
- Crash mid-run. Restart from atomic state. Replay last 60s of trace before resuming.
- Hung tool. 30s timeout aborts. Skip iteration; alert.
- Cost spike. Halt. Inspect last trace for runaway loop.
- Suspicious output. Halt. Rotate any key that appeared in trace.
- Chain drift. Halt new orders. Source of truth = chain.
08
Production pitfalls
Cross-module- Runaway loops. Agent gets stuck calling tools forever. Always set
maxSteps; alert on cap. - Kill-switch race conditions. Loop checks
$ACADEMY_DATA_DIR/kill_switch.flagat the top, but the order is already in-flight. Cancel-on-disconnect + idempotent retries close the gap. - Prompt injection via tool output. A scraped market description containing “ignore previous instructions”. Sanitise.
- Monitoring gaps that hide bad behavior. If you only watch P&L, the agent can lose money slowly and pass cost caps. Watch decision rate, error rate, latency.
- Trace logs leak secrets. Pre-redact; keep
$ACADEMY_DATA_DIR/state/traces/off the deploy image and out of shared backups. - State drift between dev and prod. Two writers, atomic-rename isn’t enough. Add
flock; only one process holds the file. - Model drift. Same prompt, different decisions next month. Replay audit catches it; freeze model version in config.