Welcome to API Academy

Module 18 · Production · ~30 min · Final module

Production bot.

Q: How do you deploy a trading bot to production?

One Dockerfile, env-var config, and a /health endpoint, then monitoring and a kill switch; resist every urge to stand up Kubernetes. The image carries a HEALTHCHECK that polls /health every 30 seconds and ships unchanged to a $5–20/month VPS, Railway, or Fly. Config comes in through environment variables like LIMITLESS_API_KEY, NAV_USD, and KILL_SWITCH_FILE. Pipe eight metrics to a free Grafana Cloud tier with an alert threshold on every one: heartbeat, fill rate, error rate, PnL, inventory, latency, rate-limit signals, and on-chain balance. Finally, wire a three-layer kill switch (a file flag, a drawdown threshold, and an HTTP panic endpoint) that exits non-zero the moment any layer trips and never re-arms without a human. Then kill the bot once on purpose and confirm it stays dead.

By the end of this module, you’ll have a live trading bot you trust enough to leave running: one that deploys cleanly, watches itself, and shuts itself down before a bad day becomes a bad year. This is the final exam of API Academy.

To get there, you’ll follow the three rules of a bot that stays alive: deploy, monitor, kill.

Production tier · Reference card

No new Limitless endpoint claims; this wires the earlier tiers together. Verified 2026-06-09.

Section 01

Architecture overview.

A one-person trading bot is five boxes in a row, not a microservices diagram. Each box does one thing, every arrow is a clean hand-off, and the monitor sees everything.

Signal ingest → Strategy → Risk filter → Order executor → Limitless API

↺ Monitor & kill switch
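The five boxes map onto one loop. Here is a minimal Python sketch of that shape, not the course's implementation. Every name in it (Signal, Order, strategy, risk_filter, run_once, MAX_ORDER_USD, the 0.02 edge floor) is hypothetical, and the executor is whatever callable you pass in rather than a real Limitless API call.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_ORDER_USD = 100.0  # hypothetical per-order cap, for illustration only


@dataclass
class Signal:
    market: str
    fair_value: float  # your model's estimate of fair price
    price: float       # current market price


@dataclass
class Order:
    market: str
    side: str          # "buy" or "sell"
    size_usd: float


def strategy(sig: Signal) -> Optional[Order]:
    """Box 2: turn a signal into an order intent, or None if the edge is too small."""
    edge = sig.fair_value - sig.price
    if abs(edge) < 0.02:  # hypothetical minimum edge
        return None
    return Order(sig.market, "buy" if edge > 0 else "sell", size_usd=250.0)


def risk_filter(order: Order) -> Order:
    """Box 3: clamp size to the hard cap before anything reaches the executor."""
    return Order(order.market, order.side, min(order.size_usd, MAX_ORDER_USD))


def run_once(sig: Signal,
             execute: Callable[[Order], None],
             monitor: Callable[[str, object], None]) -> None:
    """One pass through the boxes; the monitor is told about every hand-off."""
    monitor("signal", sig)
    intent = strategy(sig)
    if intent is None:
        monitor("skip", sig)
        return
    order = risk_filter(intent)
    monitor("order", order)
    execute(order)  # Box 4: a real bot would place the order here
```

Passing the executor and monitor in as plain callables keeps each box testable on its own: in a test, `execute` can be `list.append` and `monitor` can record event names.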

Honest sizing guide (one person)

Compute

A single $5–20/mo VPS. Two cores, 2 GB RAM, SSD. You’re not running a model farm.

Storage

SQLite on disk for state, Parquet for historical data, a cheap S3 bucket for backups.

Observability

A Grafana Cloud free tier, a Discord webhook for alerts, a Telegram bot for kill-switch control. No enterprise stack.

Common pitfall

Logs without alerting are a trace nobody reads.

A bot that writes everything to a logfile but never pages anyone is one degraded run away from disaster. The error rate creeps up, the cost-per-trade doubles, the PnL decays, all visible in the log, none of it visible to you. By the time you check, you’re a week into a regime you don’t recognise. Monitoring without alerting is a comfortable illusion of safety.

Fix: page on PnL spikes, latency spikes, error spikes, and cost spikes. Threshold each metric against a 7-day baseline and route the alert to a channel that gets your attention (phone, on-call rotation). Alert noise is solvable; a silent failure isn’t.
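One way to threshold against a 7-day baseline is a simple deviation test. This is a sketch under assumptions: the function name is_spike, the choice of standard deviations as the yardstick, and the default k = 3.0 are all illustrative, not values from this module.

```python
from statistics import mean, pstdev


def is_spike(history: list[float], latest: float, k: float = 3.0) -> bool:
    """True when `latest` sits more than k standard deviations above the baseline mean.

    `history` is the trailing 7 days of samples for one metric (for example
    hourly error counts). k = 3.0 is an illustrative default, not a recommendation.
    """
    if len(history) < 2:
        return False  # not enough baseline to judge
    base = mean(history)
    spread = pstdev(history)
    if spread == 0:
        return latest > base  # flat baseline: any increase counts as a spike
    return (latest - base) / spread > k
```

The test is one-sided (it only fires on increases), which suits error rate, latency, and cost. For a metric where a drop is the bad direction, such as PnL, pass negated values.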

Section 02

Deployment.

One Dockerfile, env-var config, a /health endpoint. That’s it. Resist every urge to stand up Kubernetes.

Dockerfile + health

# Module 18, Dockerfile + env-config + health endpoint (TypeScript / Node)

# ---------- Dockerfile ----------
FROM node:20-alpine AS build
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM node:20-alpine
WORKDIR /app
COPY --from=build /app/dist  ./dist
COPY --from=build /app/node_modules ./node_modules
COPY --from=build /app/package.json ./
ENV NODE_ENV=production
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
  CMD wget -qO- http://localhost:8080/health || exit 1
CMD ["node", "dist/bot.js"]

// ---------- src/bot.ts (compiled to dist/bot.js by npm run build) ----------
import express from 'express';
import { runBot } from './strategy';

const cfg = {
  apiKey:  process.env.LIMITLESS_API_KEY!,
  navUsd:  Number(process.env.NAV_USD ?? 10_000),
  kill:    process.env.KILL_SWITCH_FILE ?? '/tmp/kill.flag',
};

const app = express();
app.get('/health', (_req, res) => res.json({ ok: true, ts: Date.now() }));
app.listen(8080);

runBot(cfg).catch(err => { console.error(err); process.exit(1); });

# Module 18, Dockerfile + env-config + health endpoint (Python / FastAPI)

# ---------- Dockerfile ----------
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENV PYTHONUNBUFFERED=1
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')" || exit 1
CMD ["uvicorn", "bot:app", "--host", "0.0.0.0", "--port", "8080"]

# ---------- bot.py ----------
import asyncio
import os
import time

from fastapi import FastAPI
from strategy import run_bot  # your strategy module

cfg = {
    "api_key":  os.environ["LIMITLESS_API_KEY"],
    "nav_usd":  float(os.environ.get("NAV_USD", 10_000)),
    "kill":     os.environ.get("KILL_SWITCH_FILE", "/tmp/kill.flag"),
}

app = FastAPI()


@app.get("/health")
def health() -> dict:
    return {"ok": True, "ts": int(time.time())}


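_bot_task = None  # module-level reference so the task is not garbage-collected


@app.on_event("startup")
async def _boot() -> None:
    global _bot_task
    _bot_task = asyncio.create_task(run_bot(cfg))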

# Module 18, Dockerfile + env-config + health endpoint (Go)

# ---------- Dockerfile ----------
FROM golang:1.22-alpine AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /bot ./cmd/bot

FROM alpine:3.19
COPY --from=build /bot /bot
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
  CMD wget -qO- http://localhost:8080/health || exit 1
ENTRYPOINT ["/bot"]

// ---------- cmd/bot/main.go ----------
package main

import (
    "encoding/json"
    "log"
    "net/http"
    "os"
    "strconv"
    "time"

    "example.com/bot/strategy"
)

type Config struct {
    APIKey string
    NAVUsd float64
    Kill   string
}

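func mustFloat(env string, def float64) float64 {
    v := os.Getenv(env)
    if v == "" {
        return def
    }
    f, err := strconv.ParseFloat(v, 64)
    if err != nil {
        // Fail fast: a silently ignored parse error would leave NAV at 0.
        log.Fatalf("%s is not a number: %q", env, v)
    }
    return f
}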

func main() {
    cfg := Config{
        APIKey: os.Getenv("LIMITLESS_API_KEY"),
        NAVUsd: mustFloat("NAV_USD", 10_000),
        Kill:   os.Getenv("KILL_SWITCH_FILE"),
    }

    http.HandleFunc("/health", func(w http.ResponseWriter, _ *http.Request) {
        _ = json.NewEncoder(w).Encode(map[string]any{"ok": true, "ts": time.Now().Unix()})
    })
    go func() { log.Fatal(http.ListenAndServe(":8080", nil)) }()

    if err := strategy.Run(cfg.APIKey, cfg.NAVUsd, cfg.Kill); err != nil {
        log.Fatal(err)
    }
}

How to run this

Split the snippet above into its two blocks: the Dockerfile at your project root, the entrypoint next to it. Drop your real strategy code into the strategy module so runBot / run_bot / strategy.Run has something to do.
TypeScript: save the entrypoint as src/bot.ts, then docker build -t bot . && docker run --rm -p 8080:8080 -e LIMITLESS_API_KEY=$LIMITLESS_API_KEY bot.
Python: save the entrypoint as bot.py, then docker build -t bot . && docker run --rm -p 8080:8080 -e LIMITLESS_API_KEY=$LIMITLESS_API_KEY bot.
Go: save the entrypoint as cmd/bot/main.go inside a Go module, then docker build -t bot . && docker run --rm -p 8080:8080 -e LIMITLESS_API_KEY=$LIMITLESS_API_KEY bot.
In every case, curl localhost:8080/health returns {"ok": true, "ts": …} and docker ps shows the container as healthy. Ship the same image to Fly, Railway, or any VPS with Docker installed.
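Env-var config is only safe if bad values stop the bot at boot instead of mid-trade. Here is a hedged sketch of a fail-fast loader for the three variables used above. BotConfig and load_config are hypothetical names, and the "NAV must be positive" rule is an assumption added for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class BotConfig:
    api_key: str
    nav_usd: float
    kill_file: str


def load_config(env: Mapping[str, str] = os.environ) -> BotConfig:
    """Read config from environment variables, raising ValueError on bad input."""
    api_key = env.get("LIMITLESS_API_KEY", "")
    if not api_key:
        raise ValueError("LIMITLESS_API_KEY is not set")

    raw_nav = env.get("NAV_USD", "10000")  # same default as the entrypoints above
    try:
        nav_usd = float(raw_nav)
    except ValueError:
        raise ValueError(f"NAV_USD is not a number: {raw_nav!r}") from None
    if nav_usd <= 0:
        raise ValueError("NAV_USD must be positive")  # assumed rule, not from the module

    kill_file = env.get("KILL_SWITCH_FILE", "/tmp/kill.flag")
    return BotConfig(api_key, nav_usd, kill_file)
```

Taking the environment as a parameter means you can test the loader with a plain dict, without touching the real process environment.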

Section 03

Monitoring & alerting.

A bot you can’t see is a bot that’s already lost you money. Pipe these eight metrics to Grafana Cloud or whatever free tier you prefer, and wire alerts to a Discord or Telegram webhook. Every metric needs a threshold: a number without a trigger is a decoration.

Heartbeat

Last successful loop timestamp. Alert if it is older than 60 seconds: your bot has died silently.

Fill rate

% of orders that fill within 30 s. Collapsing fill rate = stale quotes or sudden toxic flow.

Error rate

4xx/5xx per minute. Burst of 401s = key rotated. Burst of 5xx = call off new orders.

PnL

Realised + unrealised, per strategy. Alert on daily loss threshold and peak-to-trough drawdown.

Inventory

Signed position per market. Alert on breach of per-market cap or total gross exposure.

Latency

p50/p95/p99 of signal → fill. Slow creep is the canary for everything else.

Rate-limit signals

Track your own 429 response count + client-side limiter utilisation. Alert when 429 rate exceeds a threshold.

On-chain balance

Actual wallet balance + collateral locked. Reconcile against REST. Discrepancy = stop immediately.
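Three of the eight checks above reduce to a few lines each. The sketch below shows heartbeat age, latency percentiles, and balance reconciliation. The function names and the one-cent BALANCE_TOLERANCE_USD are assumptions for illustration. The 60-second heartbeat limit comes from the list above.

```python
import time
from statistics import quantiles
from typing import Optional

HEARTBEAT_MAX_AGE_S = 60.0     # from the heartbeat rule above
BALANCE_TOLERANCE_USD = 0.01   # hypothetical rounding tolerance


def heartbeat_stale(last_loop_ts: float, now: Optional[float] = None) -> bool:
    """True when the last successful loop is more than 60 seconds old."""
    now = time.time() if now is None else now
    return now - last_loop_ts > HEARTBEAT_MAX_AGE_S


def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """p50/p95/p99 of signal-to-fill latency; needs at least two samples."""
    cuts = quantiles(samples_ms, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def balances_match(on_chain_usd: float, rest_usd: float) -> bool:
    """False means the wallet and the REST view disagree: stop immediately."""
    return abs(on_chain_usd - rest_usd) <= BALANCE_TOLERANCE_USD
```

Each function returns a plain value, so the alerting layer (Grafana rule, Discord webhook, or a loop in the bot itself) stays separate from the measurement.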

Section 04

The kill switch.

Three layers, all enforced inside the main loop. A file flag you can touch from SSH, a drawdown threshold the bot enforces automatically, and an HTTP panic endpoint for when you’re on your phone. The kill switch is never re-armed automatically: human review only.

How to run this

Set KILL_SWITCH_FILE (e.g. /tmp/kill.flag) so the file-flag layer knows where to look. Call checkKill(equity) from your main trading loop and expose the admin port (8081) separately from your health port.
TypeScript: save the snippet below as kill-switch.ts, import it from run-bot.ts, then npx tsx run-bot.ts.
Python: save the snippet below as kill_switch.py, import check_kill from your bot loop, and expose admin via uvicorn kill_switch:admin --port 8081. Run the bot loop inside that same process, because os._exit(1) only terminates the process that calls it.
Go: save the snippet below as kill.go, call StartAdmin() once at boot, and invoke CheckKill(equity) every loop.
Run touch /tmp/kill.flag (or curl -X POST localhost:8081/panic) and your bot logs [KILL] and exits non-zero within one loop tick. That’s your final checklist item: kill it on purpose, confirm it stays dead.

Kill switch

// Module 18, Three-layer kill switch.
// 1) file flag, 2) drawdown threshold, 3) HTTP panic endpoint.

import fs from 'fs';
import express from 'express';

const KILL_FILE = process.env.KILL_SWITCH_FILE ?? '/tmp/kill.flag';
const MAX_DD    = 0.10;  // 10% drawdown from peak

let peak = 1;
let dead = false;

export function panic(reason: string): void {
  console.error('[KILL]', reason);
  dead = true;
  // TODO: cancel all open orders before exit
  process.exit(1);
}

export function checkKill(equity: number): void {
  if (dead) panic('already dead');

  // layer 1, file flag
  if (fs.existsSync(KILL_FILE)) panic(`file flag: ${KILL_FILE}`);

  // layer 2, drawdown
  peak = Math.max(peak, equity);
  const dd = (equity - peak) / peak;
  if (dd < -MAX_DD) panic(`drawdown ${(dd * 100).toFixed(2)}%`);
}

// layer 3, HTTP panic endpoint
const admin = express();
admin.post('/panic', (req, res) => {
  res.json({ ok: true });
  panic(`http panic: ${req.ip}`);
});
admin.listen(8081);

# Module 18, Three-layer kill switch.
# 1) file flag, 2) drawdown threshold, 3) HTTP panic endpoint.

import os
import sys

from fastapi import FastAPI

KILL_FILE = os.environ.get("KILL_SWITCH_FILE", "/tmp/kill.flag")
MAX_DD    = 0.10  # 10% drawdown from peak

state = {"peak": 1.0, "dead": False}


def panic(reason: str) -> None:
    print(f"[KILL] {reason}", file=sys.stderr, flush=True)
    state["dead"] = True
    # TODO: cancel all open orders before exit
    os._exit(1)


def check_kill(equity: float) -> None:
    if state["dead"]:
        panic("already dead")

    # layer 1, file flag
    if os.path.exists(KILL_FILE):
        panic(f"file flag: {KILL_FILE}")

    # layer 2, drawdown
    state["peak"] = max(state["peak"], equity)
    dd = (equity - state["peak"]) / state["peak"]
    if dd < -MAX_DD:
        panic(f"drawdown {dd * 100:.2f}%")


# layer 3, HTTP panic endpoint
admin = FastAPI()


@admin.post("/panic")
def http_panic() -> dict:
    panic("http panic")
    return {"ok": True}

// Module 18, Three-layer kill switch.
// 1) file flag, 2) drawdown threshold, 3) HTTP panic endpoint.

package main

import (
    "log"
    "net/http"
    "os"
)

const (
    KillEnv = "KILL_SWITCH_FILE"
    MaxDD   = 0.10 // 10% drawdown from peak
)

var (
    peak     = 1.0
    dead     = false
    killFile = os.Getenv(KillEnv)
)

func Panic(reason string) {
    log.Printf("[KILL] %s", reason)
    dead = true
    // TODO: cancel all open orders before exit
    os.Exit(1)
}

func CheckKill(equity float64) {
    if dead {
        Panic("already dead")
    }

    // layer 1, file flag
    if killFile != "" {
        if _, err := os.Stat(killFile); err == nil {
            Panic("file flag: " + killFile)
        }
    }

    // layer 2, drawdown
    if equity > peak {
        peak = equity
    }
    dd := (equity - peak) / peak
    if dd < -MaxDD {
        Panic("drawdown breach")
    }
}

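// layer 3, HTTP panic endpoint
func StartAdmin() {
    // Dedicated mux: on the default mux, /panic would also be served on the health port.
    mux := http.NewServeMux()
    mux.HandleFunc("/panic", func(w http.ResponseWriter, r *http.Request) {
        if r.Method != http.MethodPost {
            http.Error(w, "POST only", http.StatusMethodNotAllowed)
            return
        }
        w.Write([]byte(`{"ok":true}`))
        Panic("http panic: " + r.RemoteAddr)
    })
    go func() { log.Fatal(http.ListenAndServe(":8081", mux)) }()
}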
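Because every layer ends in a process exit, the drawdown rule is awkward to test in place. One option is to pull the arithmetic into a pure function and replay an equity curve through it. This is a sketch: drawdown and first_breach are hypothetical helper names, and unlike the snippets above it seeds the peak from the first equity value rather than from 1.

```python
from typing import Optional

MAX_DD = 0.10  # same 10% threshold as the examples above


def drawdown(peak: float, equity: float) -> float:
    """Fractional change from peak: 0.0 at a new high, negative below it."""
    return (equity - peak) / peak


def first_breach(equity_curve: list[float], max_dd: float = MAX_DD) -> Optional[int]:
    """Index of the first point whose drawdown from the running peak exceeds max_dd."""
    if not equity_curve:
        return None
    peak = equity_curve[0]
    for i, equity in enumerate(equity_curve):
        peak = max(peak, equity)
        if drawdown(peak, equity) < -max_dd:  # same strict comparison as checkKill
            return i
    return None
```

With a curve of 10,000 → 10,500 → 9,600 → 9,400, the peak is 10,500. At 9,600 the drawdown is about 8.6% and nothing happens. At 9,400 it is about 10.5% and the switch trips. The comparison is strict, so a drawdown of exactly 10% does not trip it.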

Harden the control panel before going live.

The panel from Module 02 has been running on seed data, then on your dev bot. Before flipping DRY_RUN=false, walk these four items: (1) rotate PANEL_TOKEN so anything that ever leaked is dead; (2) confirm that the cancel button on every open-order row raises a confirmation dialog (an accidental click should not cancel); (3) trip the kill switch from your phone with the bot running and confirm it halts on the next iteration; (4) verify the audit log captures both panel cancels and bot fills with the same shape so the “who did this” question is answerable. Anything that fails here is a production blocker.
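Item (4) is easier to verify if panel cancels and bot fills are written through one record type. The sketch below shows one way to do that. AuditEvent, its fields, and audit_line are hypothetical, since the module does not specify the audit log format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditEvent:
    ts: float        # unix seconds
    actor: str       # "panel" or "bot"
    action: str      # for example "cancel" or "fill"
    market: str
    order_id: str
    detail: str = ""


def audit_line(event: AuditEvent) -> str:
    """Serialise one event as a single JSON line with a stable key order."""
    return json.dumps(asdict(event), sort_keys=True)
```

Since both writers go through the same dataclass, every line carries the same keys, and "who did this" becomes a filter on the actor field.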

Common questions

Production trading bots: what people ask

What infrastructure does a solo trading bot actually need?

A single $5–20/month VPS with two cores, 2 GB RAM, and an SSD; you’re not running a model farm. SQLite on disk for state, Parquet for historical data, a cheap S3 bucket for backups. For observability, a Grafana Cloud free tier, a Discord webhook for alerts, and a Telegram bot for kill-switch control. The architecture is five boxes in a row (signal ingest, strategy, risk filter, order executor, monitor), not a microservices diagram.

What metrics should you monitor on a live trading bot?

Eight, each with an alert threshold, because a number without a trigger is a decoration: heartbeat (alert if the last successful loop is over 60 seconds old); fill rate (% of orders filling within 30 s); error rate (a burst of 401s means a rotated key, a burst of 5xx means call off new orders); PnL per strategy; signed inventory per market; latency p50/p95/p99 from signal to fill; your own 429 count plus limiter utilisation; and on-chain balance reconciled against REST, where any discrepancy means stop immediately.

How does a trading bot kill switch work?

Three layers, all enforced inside the main loop. A file flag you can touch over SSH at the path in KILL_SWITCH_FILE; a drawdown threshold the bot enforces automatically (the example halts at 10% from peak); and an HTTP /panic endpoint on a separate admin port (8081) for when you’re on your phone. Any layer tripping logs [KILL] and exits non-zero within one loop tick, and the switch is never re-armed automatically: human review only.

Why is logging not enough for a production bot?

Because logs without alerting are a trace nobody reads. Error rates creep, cost-per-trade doubles, and PnL decays: all visible in the logfile and none of it visible to you, until you’re a week into a regime you don’t recognise. Page on PnL spikes, latency spikes, error spikes, and cost spikes, thresholding each metric against a 7-day baseline, and route alerts to a channel that actually gets your attention. Alert noise is solvable; a silent failure isn’t.

How do you health-check a Dockerised trading bot?

Expose a /health endpoint that returns {ok: true, ts} and wire a Docker HEALTHCHECK with --interval=30s --timeout=3s --retries=3 so the container is marked unhealthy when the bot stops answering. Verify with curl localhost:8080/health and confirm docker ps shows the container as healthy. Keep the health port (8080) separate from the admin panic port (8081), and pair the endpoint with a heartbeat metric so a silently dead loop still pages you.

Section 05

Module checklist.

Five checks before continuing. Completing all five unlocks the Continue to Module 02 button below, where you wrap the production bot in an operator surface.

My bot’s architecture is five boxes in a row and I can point at each one in my code

I deploy via Dockerfile with env-var config and a working /health endpoint

All eight metrics are piped to a dashboard and every one has an alert threshold

My kill switch has a file flag, drawdown threshold, and HTTP panic endpoint

I’ve killed my bot at least once on purpose and verified it shuts down cleanly

API Academy · complete

Congratulations, graduate.

You shipped a real trading bot. One that deploys, monitors itself, and pulls its own plug when something goes wrong: the difference between a script you babysit and a service that runs without you in the room.

Concretely, eighteen modules in, your bot is shipped: Dockerfile, monitoring contract, three-layer kill switch, and an exit-non-zero discipline that never re-arms without a human. You walk away from this module with three artifacts:

A deployable Dockerfile + entrypoint with env-var config, a /health endpoint, and HEALTHCHECK wiring: one image you can ship to a $5 VPS, Railway, or Fly without changes.

A monitoring contract (heartbeat, fill rate, error rate, PnL, inventory, latency, rate-limit signals, on-chain balance), each metric with an alert threshold, piped to a free Grafana / Discord / Telegram stack.

A three-layer kill switch (file flag, drawdown threshold, HTTP /panic) that exits non-zero the moment any layer trips and never re-arms without a human.

Quick recall

Without scrolling back, can you answer these?

Five questions across the Production tier. The test is whether you can answer each one before reading the answer.

Market-making spread + skew. What does spread cover, and what does skew cover?

Spread covers risk + fees: you need to make at least the round-trip cost back on every cycle, plus a buffer for adverse moves. Skew covers inventory: when you’re long, you tilt both quotes lower to encourage selling and discourage more buying. Spread without skew quotes blindly and accumulates inventory until you blow risk caps. Skew without spread can quote inside cost, so every fill loses money. Both, every cycle.

Triangular arb shows a 0.5% edge. Three reasons it might still lose money.

(a) Liquidity isn’t there for all three legs simultaneously: you fill the first two, and the third moves before you cross. (b) Costs eat the edge: 0.5% gross with 0.3% in fees + slippage is 0.2% with execution risk attached. (c) The 0.5% was measured against quotes, not depth: the first share fills at the quote, the next 99 walk down the book. Walk depth, fill IOC, and hedge if any leg misses.

Why don’t signal-based strategies need latency as low as market-making?

Signal strategies trigger on events with discrete time horizons (news, model output, fill prints), and the information ages on minute-to-hour timescales. Market-making quotes update tick by tick; every 50 ms a stale MM is off the wave while someone else is on it. Different strategies, different latency floors. If your signal strategy needs sub-50 ms RTT, it’s really an MM strategy in disguise.

Risk caps live in code, NOT in the strategy parameters file. Why?

Strategy parameters get tuned. Every time you nudge a knob to chase backtest numbers, you’re one keystroke from disabling your stop-loss. Risk caps in code (a maxSize constant, a daily-loss check that halts the loop, a hard positionLimit on every place_order) cannot be tuned away: they’re a fence, not a parameter. The fence outlives the strategy author.

Your bot’s been live 48 hours. PnL is fine but average fee-per-trade has doubled. Where do you look first?

You’re probably trading thinner markets than the backtest assumed. Fees are spread crossed + maker-taker schedule + slippage; if your strategy migrated to lower-volume markets (because the high-volume ones got crowded and your edge shrank there), spread + slippage spike even when the gross win rate doesn’t move. Compare your live fill-distribution-by-market vs the backtest’s. If they diverge, your backtest’s universe assumption broke and the strategy is operating off-spec.
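The spread-and-skew answer above can be made concrete in a few lines. This is a sketch under assumptions: make_quotes and its parameters are hypothetical, and the linear skew clamped at max_skew is just one simple choice, not the method taught in the market-making module.

```python
def make_quotes(mid: float, half_spread: float, inventory: float,
                max_inventory: float, max_skew: float) -> tuple[float, float]:
    """Return (bid, ask) around mid, with both quotes shifted against inventory.

    inventory > 0 (long) shifts both quotes down; inventory < 0 shifts them up.
    The spread width never changes, only where the pair is centred.
    """
    ratio = max(-1.0, min(1.0, inventory / max_inventory))  # clamp to [-1, 1]
    skew = -ratio * max_skew
    return (mid + skew - half_spread, mid + skew + half_spread)
```

With mid 0.50, half-spread 0.02, and max skew 0.01, a flat book quotes roughly 0.48 / 0.52. At half the inventory cap long, both quotes drop by 0.005 to roughly 0.475 / 0.515. The width stays at 0.04 so fills still cover costs, and the lower ask is more likely to be lifted, which reduces the long.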

You finished API Academy. Eighteen modules from your first authenticated GET request to a deployed bot with kill switches and an operator surface. Trade live on Limitless, share your work, and ship.
