Cost safety

Auto-termination, an idle watchdog, price caps, exit guards, and reconciliation, so a forgotten box never burns your budget.

Cost safety is built into the core, not bolted on. A rented box is torn down through multiple independent paths, so it's hard to leave one running by accident.

Termination paths

  • On completion and on failure. When a job reaches a terminal state, its instance is terminated (auto_terminate, on by default).
  • Idle watchdog. A box with no log activity for idle_timeout_seconds (default 1800, i.e. 30 minutes) is terminated automatically.
  • On exit / Ctrl-C. An atexit guard terminates active boxes when the control-plane process exits (terminate_on_exit, on by default).
  • On startup reconciliation. When the control plane starts, it terminates any active instances orphaned by a previously crashed process (reconcile_on_start).
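
The idle watchdog and the exit guard described above can be pictured with a small, self-contained sketch. This is not gpu-train's implementation: the active_boxes registry, terminate, sweep_idle, and terminate_all are hypothetical names used only for illustration, and the real control plane would call the provider's API instead of editing a dictionary. Only the 1800-second default and the use of an atexit guard come from the text.

```python
import atexit
import time

IDLE_TIMEOUT_SECONDS = 1800  # documented default for idle_timeout_seconds

# Hypothetical in-memory registry: instance id -> timestamp of its last log line.
active_boxes = {}


def terminate(instance_id):
    """Stand-in for a provider API call that destroys the instance."""
    active_boxes.pop(instance_id, None)


def sweep_idle(now=None, timeout=IDLE_TIMEOUT_SECONDS):
    """Terminate every box whose last log activity is older than `timeout` seconds."""
    now = time.time() if now is None else now
    idle = [box for box, last_log in active_boxes.items() if now - last_log > timeout]
    for box in idle:
        terminate(box)
    return idle


def terminate_all():
    """Exit guard: tear down whatever is still active."""
    for box in list(active_boxes):
        terminate(box)


# Run the guard when the interpreter exits normally (including after Ctrl-C).
atexit.register(terminate_all)
```

Python does not run atexit handlers when the process is killed by an unhandled signal or crashes hard, which is the kind of gap a separate startup reconciliation pass is suited to cover.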

price_cap

Refuse to provision above a $/hr ceiling:

run(task={"entrypoint": "train.py"}, provider="runpod", gpus="A100:8", price_cap=12.0)

If the cheapest matching offer (Vast.ai) or the pod's hourly cost (RunPod) exceeds the cap, the box is destroyed and a PriceCapExceededError is raised before training starts.
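
As a rough sketch of the check itself, assuming the comparison is a strict "greater than" as the wording "exceeds" suggests: the class below is a local stand-in for the error named above rather than an import from gpu-train, check_price_cap is a hypothetical helper, and treating a cap of None as "no cap" is an assumption.

```python
class PriceCapExceededError(Exception):
    """Local stand-in for the error named above; not imported from gpu-train."""


def check_price_cap(price_per_hr, price_cap):
    """Return the price if it is within the cap, otherwise raise.

    A cap of None is assumed to mean "no cap configured".
    """
    if price_cap is not None and price_per_hr > price_cap:
        raise PriceCapExceededError(
            f"${price_per_hr:.2f}/hr exceeds price_cap of ${price_cap:.2f}/hr"
        )
    return price_per_hr
```

With price_cap=12.0, a quoted price of exactly 12.0 passes under this reading, while 12.5 raises.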

price_cap is provider-dependent

price_cap requires the provider to expose pricing. RunPod and Vast.ai do; the GCP client does not, so price_cap is not enforced for gcp (and cost_usd stays 0). Use the idle watchdog and auto-terminate there. colab is your own already-running runtime, so there's nothing to bill or cap.

Manual escape hatches

gpu.kill(job.id)     # cancel one job + terminate its instance
gpu.kill_all()       # terminate every active instance (the panic button)
gpu.reconcile()      # sweep and terminate orphaned instances

From the CLI:

gpu-train kill <job-id>
gpu-train kill --all
gpu-train reconcile

Cost tracking

Each job records price_per_hr and a computed cost_usd (run duration in hours × hourly rate). The dashboard aggregates these into a per-day and per-provider cost report.
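
The arithmetic can be illustrated as follows. The function name and the choice to measure duration in seconds are assumptions made for this example; the library may store duration differently or round the result.

```python
def compute_cost_usd(duration_seconds, price_per_hr):
    """Cost = duration in hours multiplied by the hourly rate."""
    return duration_seconds / 3600 * price_per_hr


# A 90-minute job on a $12.00/hr box.
print(compute_cost_usd(90 * 60, 12.0))  # 18.0

# A provider that exposes no pricing reports price_per_hr as 0, so the cost is 0.
print(compute_cost_usd(90 * 60, 0.0))   # 0.0
```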