gpu-train
A local control plane for training on rented production GPUs — develop locally, launch a distributed run with one call, stream logs back, and tear the box down automatically when you're done.
gpu-train lets you develop locally and train on rented production GPUs
(RunPod, Vast.ai, GCP, Colab) with one Python call, no infra to set up,
and automatic cost teardown.
Your local machine is the control plane: it provisions a co-located GPU box at a
single provider, ships your code, launches the run (torchrun or Ray), streams
logs back, and terminates the box when you're done. Distributed training stays
inside one provider over its fast interconnect — your laptop never sits in the
training hot path.
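The lifecycle above can be pictured as a `try`/`finally` wrapped around a provider handle. The following is only a minimal sketch of that shape, not gpu-train's actual implementation: `FakeProvider`, its methods (`provision`, `sync_code`, `launch`, `terminate`), and `run_lifecycle` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FakeProvider:
    """Hypothetical in-memory stand-in for a GPU provider adapter."""

    events: list = field(default_factory=list)

    def provision(self, gpus: str) -> str:
        self.events.append(f"provision {gpus}")
        return "box-1"

    def sync_code(self, box_id: str, path: str) -> None:
        self.events.append(f"sync {path} -> {box_id}")

    def launch(self, box_id: str, entrypoint: str):
        self.events.append(f"launch {entrypoint} on {box_id}")
        # A real adapter would stream remote stdout; here we yield canned lines.
        yield "epoch 1 done"
        yield "epoch 2 done"

    def terminate(self, box_id: str) -> None:
        self.events.append(f"terminate {box_id}")


def run_lifecycle(provider, gpus: str, path: str, entrypoint: str) -> list:
    """Provision, ship code, launch, relay logs, and always tear down."""
    box_id = provider.provision(gpus)
    logs = []
    try:
        provider.sync_code(box_id, path)
        for line in provider.launch(box_id, entrypoint):
            logs.append(line)  # the control plane only relays log lines
    finally:
        # Runs on success, on exceptions, and on KeyboardInterrupt (Ctrl-C).
        provider.terminate(box_id)
    return logs
```

Putting `terminate` in the `finally` clause is what keeps a failed code sync or a crashed run from leaving a billed box behind in this sketch.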
Why it exists
Renting GPUs is cheap; the glue around them is not. You end up re-writing the
same provisioning, SSH, code-sync, log-streaming, and — most importantly —
teardown logic for every project, and a single forgotten box can quietly
burn your budget overnight. gpu-train is that glue, hardened and reusable, with
cost safety built into the core.
- One call, many providers. local, runpod, vastai, gcp, and colab behind the same run(...).
- Develop locally. Iterate on your laptop; launch on an A100 box only when you're ready. The zero-cost local provider runs the exact same path as a subprocess.
- Automatic cost teardown. Boxes are terminated on completion, on failure, on idle timeout, on Ctrl-C, and on control-plane exit. A price_cap refuses to provision above your $/hr ceiling.
- Distributed by default. torchrun (DDP) or Ray, single- or multi-node, all inside the rented provider.
- Experiment tracking. The control plane mints a Weights & Biases run and injects WANDB_* into every job automatically.
- A local dashboard. gpu-train serve opens a branded UI to start runs, watch live logs, connect providers, and track cost; see Dashboard (UI).
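To make the cost-safety bullet concrete, here is a minimal sketch of a price-cap check plus an exit-time teardown hook. It is an illustration under stated assumptions rather than gpu-train's internals: `PriceCapExceeded`, `check_price_cap`, and `TeardownRegistry` are hypothetical names, and only `atexit.register` is a real standard-library call.

```python
import atexit
from typing import Callable, Optional


class PriceCapExceeded(RuntimeError):
    """Raised when an offer costs more than the caller's $/hr ceiling."""


def check_price_cap(offer_usd_per_hr: float, price_cap: Optional[float]) -> None:
    """Refuse to provision when the offer is above the ceiling."""
    if price_cap is not None and offer_usd_per_hr > price_cap:
        raise PriceCapExceeded(
            f"offer ${offer_usd_per_hr:.2f}/hr exceeds cap ${price_cap:.2f}/hr"
        )


class TeardownRegistry:
    """Tracks live boxes and terminates each one at most once."""

    def __init__(self, terminate: Callable[[str], None]):
        self._terminate = terminate  # hypothetical callable: box_id -> None
        self._live = set()
        # Runs kill_all at normal interpreter exit.
        atexit.register(self.kill_all)

    def track(self, box_id: str) -> None:
        self._live.add(box_id)

    def kill_all(self) -> None:
        # Pop before terminating so a repeated call never double-terminates.
        while self._live:
            self._terminate(self._live.pop())
```

An `atexit` hook does not run if the process is killed by an unhandled signal such as SIGKILL or exits through `os._exit()`, which is one reason an idle timeout is a useful backstop alongside control-plane teardown.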
It is a sibling of llm-rotate: the same
configure(registry=..., use=[...]) ergonomics, the same secret_ref="env://VAR"
pattern, and the same project conventions.
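The `secret_ref="env://VAR"` pattern keeps credentials out of the registry itself: the registry stores only a pointer, and the value is read from the environment when needed. A minimal sketch of how such a reference could be resolved follows; `resolve_secret_ref` is a hypothetical helper, and the assumption that `env` is the only supported scheme is made purely for illustration.

```python
import os
from typing import Mapping, Optional


def resolve_secret_ref(secret_ref: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Resolve an env://VAR reference to the value of that variable."""
    env = os.environ if environ is None else environ
    scheme, sep, name = secret_ref.partition("://")
    # Assumption: only the env:// scheme is handled in this sketch.
    if sep != "://" or scheme != "env" or not name:
        raise ValueError(f"unsupported secret_ref: {secret_ref!r}")
    if name not in env:
        raise KeyError(f"environment variable {name} is not set")
    return env[name]
```

For example, `resolve_secret_ref("env://RUNPOD_API_KEY")` would return whatever `RUNPOD_API_KEY` is set to in the calling process, and raise if it is unset.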
At a glance
from gpu_train import configure, run, gpu

configure(
    registry={
        "credentials": [
            {"cred_id": "runpod-1", "provider": "runpod",
             "secret_ref": "env://RUNPOD_API_KEY"},
        ],
        "tracking": {"wandb": {"secret_ref": "env://WANDB_API_KEY", "project": "my-proj"}},
    },
    use=["runpod-1"],
)

job = run(
    task={"entrypoint": "train.py", "args": ["--epochs", "3"]},
    provider="runpod",
    gpus="A100:4",
    price_cap=2.50,  # $/hr ceiling; refuse to provision above it
)
gpu.logs(job.id) # print logs
gpu.kill_all() # terminate everything (the panic button)

gpu-train is currently alpha (0.0.x). It is not yet published to PyPI;
install it from the GitHub Releases wheel (see
Installation). The cloud adapters are covered by
mock-based unit tests but have not been validated against live billing accounts;
Colab in particular is a best-effort SSH-over-tunnel connector. See the
roadmap for what's next.
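As a rough picture of what a task and a `gpus="A100:4"` spec might turn into on a single node, the sketch below parses the spec and builds a `torchrun` command line. The helper names `parse_gpus` and `torchrun_command` are hypothetical, and treating a bare `"A100"` as one GPU is an assumption; `--standalone` and `--nproc_per_node` are real `torchrun` flags, but whether gpu-train uses them this way is not stated here.

```python
from typing import List, Tuple


def parse_gpus(spec: str) -> Tuple[str, int]:
    """Split 'A100:4' into ('A100', 4). Assumption: bare 'A100' means one GPU."""
    model, sep, count = spec.partition(":")
    n = int(count) if sep else 1
    if not model or n < 1:
        raise ValueError(f"invalid gpus spec: {spec!r}")
    return model, n


def torchrun_command(entrypoint: str, args: List[str], gpus: str) -> List[str]:
    """Build a single-node torchrun argv with one worker process per GPU."""
    _, n = parse_gpus(gpus)
    return ["torchrun", "--standalone", f"--nproc_per_node={n}", entrypoint, *args]
```

With the task from the example above, `torchrun_command("train.py", ["--epochs", "3"], "A100:4")` yields an argv equivalent to `torchrun --standalone --nproc_per_node=4 train.py --epochs 3`.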
Where to next
- New here? Start with Installation and the Quickstart.
- Wiring it into a project? Read Configuration and Providers.
- Care about the bill? Read Cost safety.
- Prefer a UI? See the Dashboard.