Cost safety
Auto-termination, idle watchdog, price caps, exit guards, and reconciliation — so a forgotten box never burns your budget.
Cost safety is built into the core, not bolted on. A rented box is torn down through multiple independent paths, so it's hard to leave one running by accident.
Termination paths
- On completion and on failure. When a job reaches a terminal state, its instance is terminated (auto_terminate, on by default).
- Idle watchdog. A box with no log activity for idle_timeout_seconds (default 1800) is killed automatically.
- On exit / Ctrl-C. An atexit guard terminates active boxes when the control-plane process exits (terminate_on_exit, on by default).
- On startup reconciliation. When the control plane starts, it terminates any active instances orphaned by a previously crashed process (reconcile_on_start).
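
The idle-watchdog rule above can be sketched as a simple timestamp comparison. This is a minimal illustration, not the library's implementation: the Box record and the idle_boxes helper are hypothetical names, and it assumes the watchdog compares the time of the last log line against idle_timeout_seconds.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical record of a rented instance (not the library's own type)."""
    instance_id: str
    last_log_at: float  # Unix timestamp of the most recent log line


def idle_boxes(boxes, now, idle_timeout_seconds=1800):
    """Return the boxes whose last log activity is older than the timeout."""
    return [b for b in boxes if now - b.last_log_at > idle_timeout_seconds]


boxes = [Box("a", last_log_at=1000.0), Box("b", last_log_at=2500.0)]

# Box "a" has been silent for 2000 s (over the 1800 s default); box "b" for 500 s.
stale = idle_boxes(boxes, now=3000.0)
print([b.instance_id for b in stale])
```

A real watchdog would run a check like this on a timer and terminate each box it returns; how the control plane actually schedules that check is not specified here.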
price_cap
Refuse to provision above a $/hr ceiling:
run(task={"entrypoint": "train.py"}, provider="runpod", gpus="A100:8", price_cap=12.0)

If the cheapest matching offer (Vast.ai) or the pod's hourly cost (RunPod)
exceeds the cap, the box is destroyed and a PriceCapExceededError is raised
before training starts.
price_cap requires the provider to expose pricing. RunPod and Vast.ai do; the
GCP client does not, so price_cap is not enforced for gcp (and cost_usd
stays 0). Use the idle watchdog and auto-terminate there. colab is your own
already-running runtime, so there's nothing to bill or cap.
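
The cap comparison itself can be pictured as follows. This is a sketch under stated assumptions: the PriceCapExceededError class below is a local stand-in for the library's error of the same name, check_price_cap is a hypothetical helper, and treating price_cap=None as "no cap" is an assumption rather than documented behavior.

```python
class PriceCapExceededError(Exception):
    """Local stand-in for the library's error of the same name."""


def check_price_cap(price_per_hr, price_cap):
    """Raise if the quoted hourly price is above the cap.

    Assumption: price_cap=None means no cap is applied.
    """
    if price_cap is not None and price_per_hr > price_cap:
        raise PriceCapExceededError(
            f"${price_per_hr:.2f}/hr exceeds cap of ${price_cap:.2f}/hr"
        )
    return price_per_hr


# Within the cap: returns normally.
check_price_cap(10.50, price_cap=12.0)

# Above the cap: the error is raised before any training would start.
try:
    check_price_cap(14.00, price_cap=12.0)
except PriceCapExceededError as exc:
    message = str(exc)
    print(message)
```

For a provider that exposes no pricing, such as gcp, there is no price to compare, which is why the cap cannot be enforced there.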
Manual escape hatches
gpu.kill(job.id) # cancel one job + terminate its instance
gpu.kill_all() # terminate every active instance (the panic button)
gpu.reconcile() # sweep and terminate orphaned instances

From the CLI:
gpu-train kill <job-id>
gpu-train kill --all
gpu-train reconcile
Cost tracking
Each job records price_per_hr and a computed cost_usd (duration × hourly
rate). The dashboard aggregates these into a per-day and per-provider
cost report.
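
As a rough illustration of the arithmetic, the sketch below computes cost_usd per job and rolls it up by day and by provider. The job records, the duration_seconds and started_at field names, and the cost_report helper are hypothetical; only price_per_hr and cost_usd come from the text above, and the dashboard's real aggregation may differ.

```python
from collections import defaultdict
from datetime import datetime, timezone


def cost_usd(duration_seconds, price_per_hr):
    """Duration (converted to hours) times the hourly rate."""
    return duration_seconds / 3600 * price_per_hr


def cost_report(jobs):
    """Sum job costs per calendar day (UTC) and per provider."""
    by_day = defaultdict(float)
    by_provider = defaultdict(float)
    for job in jobs:
        cost = cost_usd(job["duration_seconds"], job["price_per_hr"])
        by_day[job["started_at"].date().isoformat()] += cost
        by_provider[job["provider"]] += cost
    return dict(by_day), dict(by_provider)


# Hypothetical job records; the gcp job has no pricing, so its rate is 0.
jobs = [
    {"provider": "runpod", "price_per_hr": 12.0, "duration_seconds": 7200,
     "started_at": datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc)},
    {"provider": "vast", "price_per_hr": 4.0, "duration_seconds": 1800,
     "started_at": datetime(2024, 5, 1, 15, 0, tzinfo=timezone.utc)},
    {"provider": "gcp", "price_per_hr": 0.0, "duration_seconds": 3600,
     "started_at": datetime(2024, 5, 2, 8, 0, tzinfo=timezone.utc)},
]

by_day, by_provider = cost_report(jobs)
print(by_day)
print(by_provider)
```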