Research CommonsResearch Commons
gpu-train/Overview

gpu-train

A local control plane for training on rented production GPUs — develop locally, launch a distributed run with one call, stream logs back, and tear the box down automatically when you're done.

gpu-train lets you develop locally and train on rented production GPUs (RunPod, Vast.ai, GCP, Colab) with one Python call, no infra to set up, and automatic cost teardown.

Your local machine is the control plane: it provisions a co-located GPU box at a single provider, ships your code, launches the run (torchrun or Ray), streams logs back, and terminates the box when you're done. Distributed training stays inside one provider over its fast interconnect — your laptop never sits in the training hot path.

Why it exists

Renting GPUs is cheap; the glue around them is not. You end up re-writing the same provisioning, SSH, code-sync, log-streaming, and — most importantly — teardown logic for every project, and a single forgotten box can quietly burn your budget overnight. gpu-train is that glue, hardened and reusable, with cost safety built into the core.

  • One call, many providers. local, runpod, vastai, gcp, and colab behind the same run(...).
  • Develop locally. Iterate on your laptop; launch on an A100 box only when you're ready. The zero-cost local provider runs the exact same path as a subprocess.
  • Automatic cost teardown. Boxes are terminated on completion, on failure, on idle timeout, on Ctrl-C, and on control-plane exit. A price_cap refuses to provision above your $/hr ceiling.
  • Distributed by default. torchrun (DDP) or Ray, single- or multi-node, all inside the rented provider.
  • Experiment tracking. The control plane mints a Weights & Biases run and injects WANDB_* into every job automatically.
  • A local dashboard. gpu-train serve opens a branded UI to start runs, watch live logs, connect providers, and track cost — see Dashboard (UI).

It is a sibling of llm-rotate: the same configure(registry=..., use=[...]) ergonomics, the same secret_ref="env://VAR" pattern, and the same project conventions.

At a glance

from gpu_train import configure, run, gpu
 
configure(
    registry={
        "credentials": [
            {"cred_id": "runpod-1", "provider": "runpod",
             "secret_ref": "env://RUNPOD_API_KEY"},
        ],
        "tracking": {"wandb": {"secret_ref": "env://WANDB_API_KEY", "project": "my-proj"}},
    },
    use=["runpod-1"],
)
 
job = run(
    task={"entrypoint": "train.py", "args": ["--epochs", "3"]},
    provider="runpod",
    gpus="A100:4",
    price_cap=2.50,   # $/hr ceiling; refuse to provision above it
)
 
gpu.logs(job.id)   # print logs
gpu.kill_all()     # terminate everything (the panic button)
Status

gpu-train is currently alpha (0.0.x). It is not yet published to PyPI — install it from the GitHub Releases wheel (see Installation). The cloud adapters are covered by mock-based unit tests but have not been validated against live billing accounts; Colab in particular is a best-effort SSH-over-tunnel connector. See the roadmap for what's next.

Where to next