Production tier · Reference card · Agents Academy

Production.
One sheet.

Deploy gate, the metrics every running agent emits, kill-switch shapes, prompt-injection defense layers, the trace-replay audit pattern, and the recovery decision tree. Pin it; print it; come back to it.

01

Deploy gate

M11
  1. Tests green. Unit + integration + 1 replay test. See card 04.
  2. Risk caps in code, not prompt. Verify place_limit_order rejects oversize even if prompt says yes.
  3. Kill switch verified. Touch $ACADEMY_DATA_DIR/kill_switch.flag, watch loop refuse to act within 1 cycle.
  4. Trace redaction tested. Run with a fake API key; confirm it doesn’t appear in $ACADEMY_DATA_DIR/state/traces/.
  5. Health monitor wired. Cost / latency / decision-rate / error-rate page on threshold.
  6. Cancel-on-disconnect on. Drop the WS; resting orders are removed exchange-side.
  7. Run-book exists. 1 page: how to stop, how to flatten, who to call. Pinned in the team channel.
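
Item 2 can be sketched as a cap enforced inside the tool itself. This is a minimal sketch under stated assumptions: the `place_limit_order` signature, the `Order` type, `RiskCapError`, and the `MAX_SIZE` value are all hypothetical, since the card names the tool but not its interface.

```python
from dataclasses import dataclass

MAX_SIZE = 100.0  # hypothetical hard cap; loaded from config, never from the prompt


class RiskCapError(ValueError):
    """Raised when an order breaches a code-level risk cap."""


@dataclass(frozen=True)
class Order:
    market: str
    side: str
    price: float
    size: float


def place_limit_order(market: str, side: str, price: float, size: float) -> Order:
    """Validate against code-level caps before anything reaches the exchange."""
    if side not in ("buy", "sell"):
        raise ValueError(f"unknown side: {side!r}")
    if not (0 < size <= MAX_SIZE):
        # Unconditional: nothing the model writes can bypass this branch.
        raise RiskCapError(f"size {size} outside (0, {MAX_SIZE}]")
    # A real implementation would submit to the exchange here (omitted).
    return Order(market, side, price, size)
```

The gate check is then a test that calls the tool directly with an oversize order and expects `RiskCapError`, independent of whatever the prompt says.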
02

Monitoring metrics

M12
Metric               Page on
cost_per_run         2× baseline
tool_calls_per_run   > 30
latency_p95          2× baseline
error_rate_5m        > 1%
kill_switch_active   any
chain_rest_drift     > dust
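
The thresholds above can be evaluated in one function. This is a sketch only: the dictionary layout, the `breached` helper, and the `DUST` value are assumptions, and "2× baseline" is read here as "at or above twice the baseline".

```python
DUST = 1e-6  # hypothetical tolerance; the card does not define "dust"


def breached(metrics: dict, baseline: dict, dust: float = DUST) -> list:
    """Return the names of metrics that should page, in table order."""
    rules = {
        "cost_per_run": metrics["cost_per_run"] >= 2 * baseline["cost_per_run"],
        "tool_calls_per_run": metrics["tool_calls_per_run"] > 30,
        "latency_p95": metrics["latency_p95"] >= 2 * baseline["latency_p95"],
        "error_rate_5m": metrics["error_rate_5m"] > 0.01,  # > 1%
        "kill_switch_active": bool(metrics["kill_switch_active"]),
        "chain_rest_drift": abs(metrics["chain_rest_drift"]) > dust,
    }
    return [name for name, hit in rules.items() if hit]
```

An empty list means nothing pages; any non-empty result goes to whatever pager the team already uses.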
03

Test types

M13
Unit          Each tool, in isolation, with mocked deps
Integration   Loop + tools against testnet / mock exchange
Replay        Re-run an NDJSON trace; assert decisions match
Adversarial   Inject prompt-injection payloads; confirm no escape
Smoke         30s live run on testnet; confirm zero crashes
04

Replay audit

M13
// Daily cron
1. read state/traces/yesterday
2. for each runId:
   - replay against same model + tools (mocked)
   - assert decisions match
   - flag drift > 0
3. write replay-report.md

Catches model-version drift, prompt regressions, and silent strategy breakage.
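
The cron steps might look like this in code. It is a sketch, assuming each NDJSON trace line carries hypothetical `runId`, `input`, and `decision` fields, and that `decide` wraps the pinned model plus mocked tools; the card does not specify the trace schema.

```python
import json
from typing import Callable, Iterable


def replay_audit(ndjson_lines: Iterable[str], decide: Callable[[dict], str]) -> list:
    """Replay each recorded run and collect the ones whose decision changed."""
    drift = []
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the trace file
        rec = json.loads(line)
        replayed = decide(rec["input"])
        if replayed != rec["decision"]:
            drift.append(
                {"runId": rec["runId"], "recorded": rec["decision"], "replayed": replayed}
            )
    return drift


def render_report(drift: list) -> str:
    """Render the body of replay-report.md."""
    lines = ["# Replay report", f"drifted runs: {len(drift)}"]
    lines += [
        f"- {d['runId']}: recorded={d['recorded']} replayed={d['replayed']}"
        for d in drift
    ]
    return "\n".join(lines) + "\n"
```

Any non-empty drift list is the signal to investigate before the next live run.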

05

Kill switch shapes

M14
Shape            Trigger                                                            Effect
File flag        touch $ACADEMY_DATA_DIR/kill_switch.flag (or Module 02 panel tap)  Loop top check; refuse to act in next iteration
Daily-loss       NAV drop > 3%                                                      Cancel resting, flatten, halt 24h
Cost cap         $X/day spent on inference                                          Halt; alert
Tool error rate  > 5% over 5m                                                       Halt; alert
Chain drift      REST ≠ chain > dust                                                Halt; reconcile by hand

All five wire to the SAME halt path: cancel resting orders, set $ACADEMY_DATA_DIR/kill_switch.flag, alert. One halt path = one thing to test.
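
A single halt path could be sketched as below. The function names and the `cancel_resting` / `alert` callables are hypothetical stand-ins for the exchange client and pager; only the flag location comes from this card.

```python
import os
from pathlib import Path
from typing import Callable


def kill_switch_path() -> Path:
    """Flag location from the card: $ACADEMY_DATA_DIR/kill_switch.flag."""
    return Path(os.environ.get("ACADEMY_DATA_DIR", ".")) / "kill_switch.flag"


def kill_switch_active() -> bool:
    """The loop calls this at the top of every iteration."""
    return kill_switch_path().exists()


def halt(reason: str, cancel_resting: Callable[[], None], alert: Callable[[str], None]) -> None:
    """Shared by every trigger: cancel resting orders, set the flag, alert."""
    try:
        cancel_resting()
    finally:
        # Set the flag and alert even if the cancel call raised.
        flag = kill_switch_path()
        flag.parent.mkdir(parents=True, exist_ok=True)
        flag.write_text(reason + "\n")
        alert(f"HALT: {reason}")
```

Each trigger (daily loss, cost cap, tool error rate, chain drift) then reduces to a condition followed by `halt("<reason>", ...)`, so only this one function needs an end-to-end test.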

06

Prompt-injection defense

M15
  • Layer 1. Sanitise tool output before feeding it back to the model: strip control sequences, <script> tags, and phrases such as “ignore previous instructions”.
  • Layer 2. Risk caps in code, not prompt. The model can’t talk its way past maxSize in place_limit_order.
  • Layer 3. Allowlist tools per loop. The agent that browses markets shouldn’t have place_limit_order available.
  • Layer 4. Human approval for new market types / oversize trades.
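
Layers 1 and 3 can be illustrated together. This is a sketch with assumptions: the regex list is illustrative and far from complete (pattern filters are easy to evade, which is why the code-level caps and approvals in the other layers matter), and the loop and tool names in `TOOL_ALLOWLIST`, other than `place_limit_order`, are hypothetical.

```python
import re

# ANSI escape sequences first, then remaining control characters (keeps \t, \n, \r).
_CONTROL = re.compile(r"\x1b\[[0-9;]*[A-Za-z]|[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
# Whole <script>...</script> elements, then any stray opening or closing tag.
_SCRIPT = re.compile(
    r"<script\b[^>]*>.*?</script\s*>|</?script\b[^>]*>", re.IGNORECASE | re.DOTALL
)
_INJECTION = re.compile(r"ignore\s+(?:all\s+)?previous\s+instructions", re.IGNORECASE)


def sanitise(tool_output: str) -> str:
    """Layer 1: scrub tool output before it is fed back to the model."""
    text = _CONTROL.sub("", tool_output)
    text = _SCRIPT.sub("", text)
    return _INJECTION.sub("[removed]", text)


# Layer 3: each loop sees only the tools it needs (names are hypothetical).
TOOL_ALLOWLIST = {
    "browse": {"get_markets", "get_orderbook"},
    "trade": {"get_orderbook", "place_limit_order"},
}


def tools_for(loop_name: str) -> frozenset:
    """Unknown loops raise KeyError, so nothing is exposed by default."""
    return frozenset(TOOL_ALLOWLIST[loop_name])
```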
07

Recovery tree

M16
  • Crash mid-run. Restart from atomic state. Replay last 60s of trace before resuming.
  • Hung tool. 30s timeout aborts. Skip iteration; alert.
  • Cost spike. Halt. Inspect last trace for runaway loop.
  • Suspicious output. Halt. Rotate any key that appeared in trace.
  • Chain drift. Halt new orders. Source of truth = chain.
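
The hung-tool branch can be sketched with `asyncio.wait_for`. It assumes the tools are coroutines; `call_tool_with_timeout`, the `SKIP` sentinel, and the `alert` callable are hypothetical names, and only the 30s timeout comes from the card.

```python
import asyncio

SKIP = object()  # sentinel: the caller should skip this iteration


async def call_tool_with_timeout(tool, *args, timeout_s: float = 30.0, alert=print):
    """Await a tool coroutine; on timeout, cancel it, alert, and signal a skip."""
    try:
        return await asyncio.wait_for(tool(*args), timeout=timeout_s)
    except asyncio.TimeoutError:
        name = getattr(tool, "__name__", "tool")
        alert(f"{name} exceeded {timeout_s}s; skipping iteration")
        return SKIP
```

`asyncio.wait_for` cancels the awaited task on timeout, so a hung call does not keep running in the background. A blocking (non-async) tool would need a different approach, such as running it in a subprocess.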
08

Production pitfalls

Cross-module
  1. Runaway loops. Agent gets stuck calling tools forever. Always set maxSteps; alert on cap.
  2. Kill-switch race conditions. Loop checks $ACADEMY_DATA_DIR/kill_switch.flag at the top, but the order is already in-flight. Cancel-on-disconnect + idempotent retries close the gap.
  3. Prompt injection via tool output. A scraped market description containing “ignore previous instructions”. Sanitise.
  4. Monitoring gaps that hide bad behavior. If you only watch P&L, the agent can lose money slowly and pass cost caps. Watch decision rate, error rate, latency.
  5. Trace logs leak secrets. Pre-redact; keep $ACADEMY_DATA_DIR/state/traces/ off the deploy image and out of shared backups.
  6. State drift between dev and prod. With two writers, atomic rename isn’t enough. Add flock so only one process holds the file.
  7. Model drift. Same prompt, different decisions next month. Replay audit catches it; freeze model version in config.
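
Pitfall 6 combines two mechanisms: an exclusive lock to keep a second writer out, and an atomic rename so readers never see a half-written file. The sketch below is Unix-only (`fcntl`), and the helper names and the `.lock` sidecar file are assumptions, not something this card prescribes.

```python
import fcntl
import json
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def exclusive_state(path: str):
    """Hold an exclusive flock on a sidecar lock file; fail fast if already held."""
    lock_fd = os.open(path + ".lock", os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # LOCK_NB raises BlockingIOError instead of waiting for the other writer.
        fcntl.flock(lock_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield
    finally:
        os.close(lock_fd)  # closing the descriptor releases the lock


def write_state_atomic(path: str, state: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

A writer wraps its whole read-modify-write cycle in `with exclusive_state(path):`. A second process that hits `BlockingIOError` should exit rather than retry silently.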