Roadmap
What's shipped, what's planned, and the known limitations of the alpha.
gpu-train is alpha (0.0.x). The control plane, the local provider, and the
cost-safety machinery are implemented; the cloud adapters are wired end-to-end and
unit-tested with mocks, but have not been validated against live billing accounts.
Shipped today
- One-call `run(...)` with a non-blocking background lifecycle (provision → SSH → push → install → launch → stream → terminate)
- Five providers: `local`, `runpod`, `vastai`, `gcp`, `colab` (shared SSH runtime mixin for the cloud ones)
- `torchrun`, `ray`, and `local` runtimes; single- and multi-node
- Cost safety: auto-terminate on completion/failure, idle watchdog, `price_cap`, exit/Ctrl-C guard, and startup reconciliation of orphaned boxes
- Pluggable secret resolution (`env://`, `file://`, `store://`, `literal://`)
- Local credential store (`~/.gpu-train/credentials.json`, `chmod 600`) with log redaction and masked API reads
- Automatic Weights & Biases tracking (run minting + `WANDB_*` injection)
- A branded local dashboard (`gpu-train serve`) for starting runs, following live logs, connecting providers, and viewing cost reports, with the prebuilt UI bundled in the wheel
- `gpu-train` CLI (`jobs`, `logs -f`, `kill`, `reconcile`, `serve`)
- SQLite job/instance/log registry; fully typed (`py.typed`, strict mypy)
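
To make the secret-resolution schemes in the list above concrete, here is a minimal sketch of how a `<scheme>://<value>` reference could be resolved. Only the four scheme names come from this page. The function `resolve_secret`, its `store` argument (a plain dict standing in for the credential store), and the exact behaviour of each scheme are assumptions for illustration, not gpu-train's actual implementation.

```python
from __future__ import annotations

import os
from pathlib import Path


def resolve_secret(ref: str, store: dict[str, str] | None = None) -> str:
    """Resolve a '<scheme>://<value>' secret reference (hypothetical helper)."""
    scheme, sep, rest = ref.partition("://")
    if not sep:
        raise ValueError(f"not a secret reference: {ref!r}")

    if scheme == "env":
        # Assumed meaning: env://NAME reads the environment variable NAME.
        if rest not in os.environ:
            raise KeyError(f"environment variable {rest!r} is not set")
        return os.environ[rest]

    if scheme == "file":
        # Assumed meaning: file://PATH reads the file and strips whitespace.
        return Path(rest).expanduser().read_text(encoding="utf-8").strip()

    if scheme == "store":
        # Assumed meaning: store://KEY looks KEY up in the credential store.
        # A dict stands in for the real store here.
        if store is None or rest not in store:
            raise KeyError(f"no stored credential named {rest!r}")
        return store[rest]

    if scheme == "literal":
        # Assumed meaning: literal://VALUE is the value itself.
        return rest

    raise ValueError(f"unsupported secret scheme: {scheme!r}")
```

Under these assumptions, `resolve_secret("literal://hunter2")` returns `"hunter2"`, and `resolve_secret("env://WANDB_API_KEY")` returns whatever that variable holds in the current process. Raising on an unknown scheme, rather than returning the reference unchanged, keeps a mistyped reference from being used as a credential.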
Known limitations
| Area | Notes |
|---|---|
| Not on PyPI | Install from the GitHub Releases wheel — see Installation. |
| Live validation | Cloud adapters are mock-tested; treat first real runs as smoke tests and watch the bill. |
| GCP pricing | The GCE client doesn't expose pricing, so `price_cap` isn't enforced and `cost_usd` stays 0; rely on auto-terminate / idle timeout. |
| Colab | Best-effort SSH over an ngrok tunnel; may bump into Colab's ToS, and `terminate()` is a no-op. |
| Dashboard auth | Loopback-only, no authentication — don't expose it directly to a network. |
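
Because the reported cost stays at 0 on GCP, it can help to budget those runs by hand. The sketch below is plain arithmetic (hourly rate × nodes × hours) and is not part of gpu-train. Both function names are hypothetical, and the hourly rate in the usage note is a made-up figure that you would replace with the real price of your machine type.

```python
def estimate_cost_usd(hourly_rate_usd: float, elapsed_seconds: float, nodes: int = 1) -> float:
    """Estimate spend for a run from a user-supplied hourly rate (hypothetical helper)."""
    if hourly_rate_usd < 0 or elapsed_seconds < 0 or nodes < 1:
        raise ValueError("rate and elapsed time must be >= 0, nodes must be >= 1")
    # Cost = rate per node-hour * number of nodes * hours elapsed.
    return hourly_rate_usd * nodes * elapsed_seconds / 3600.0


def max_runtime_seconds(budget_usd: float, hourly_rate_usd: float, nodes: int = 1) -> float:
    """Return how long a run can last before it exceeds a manual budget."""
    if budget_usd < 0 or hourly_rate_usd <= 0 or nodes < 1:
        raise ValueError("budget must be >= 0, rate must be > 0, nodes must be >= 1")
    # Invert the cost formula: seconds = budget * 3600 / (rate * nodes).
    return budget_usd * 3600.0 / (hourly_rate_usd * nodes)
```

For example, with an assumed rate of 3.00 USD per node-hour, two nodes running for 90 minutes come to `estimate_cost_usd(3.0, 5400, nodes=2)`, about 9 USD, and a 20 USD budget allows roughly `max_runtime_seconds(20, 3.0, nodes=2)` seconds, a little over three hours. A figure like that is one way to pick an idle timeout or a manual kill deadline on GCP.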
Planned
| Area | Status | Notes |
|---|---|---|
| PyPI publishing | planned | Trusted-publishing release flow like llm-rotate. |
| More providers | exploring | Additional clouds as demand warrants. |
| Spot / preemptible | exploring | Cheaper capacity tiers with checkpointing. |
| Live billing tests | planned | End-to-end validation against real accounts. |
Contributing
Install the dev extras and make sure the gate passes locally:

    git clone https://github.com/Research-Commons/gpu-train
    cd gpu-train
    python -m pip install -e ".[dev,server]"
    ruff check src tests
    mypy src
    pytest --cov --cov-fail-under=90

Releases are cut by bumping `_version.py`, tagging `vX.Y.Z`, building the sdist + wheel, and attaching them to a GitHub Release. See the repository for full contributor docs.
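
One easy mistake in a release flow like this is tagging a version that doesn't match the bumped file. A pre-tag check could look like the sketch below. It assumes `_version.py` contains a line of the form `__version__ = "X.Y.Z"`, which this page does not confirm, and the helper names `read_version` and `tag_matches_version` are hypothetical.

```python
from __future__ import annotations

import re

# Assumption: the version file holds a line like  __version__ = "0.0.3"
_VERSION_RE = re.compile(r"""^__version__\s*=\s*["']([^"']+)["']""", re.MULTILINE)


def read_version(source: str) -> str:
    """Extract the version string from the text of a version file."""
    match = _VERSION_RE.search(source)
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group(1)


def tag_matches_version(tag: str, source: str) -> bool:
    """Return True when the tag is exactly 'v' followed by the file's version."""
    return tag == f"v{read_version(source)}"
```

Passing the file's text rather than a path keeps the check easy to test. In practice you would read the version file from wherever it lives in the repository and compare it with the tag you are about to push.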