Roadmap

What's shipped, what's planned, and the known limitations of the alpha.

gpu-train is alpha (0.0.x). The control plane, the local provider, and the cost-safety machinery are implemented; the cloud adapters are wired end-to-end and unit-tested with mocks, but have not been validated against live billing accounts.

Shipped today

  • One-call run(...) with a non-blocking background lifecycle (provision → SSH → push → install → launch → stream → terminate)
  • Five providers: local, runpod, vastai, gcp, colab (shared SSH runtime mixin for the cloud ones)
  • torchrun, ray, and local runtimes; single- and multi-node
  • Cost safety: auto-terminate on completion/failure, idle watchdog, price_cap, exit/Ctrl-C guard, and startup reconciliation of orphaned boxes
  • Pluggable secret resolution (env://, file://, store://, literal://)
  • Local credential store (~/.gpu-train/credentials.json, chmod 600) with log redaction and masked API reads
  • Automatic Weights & Biases tracking (run minting + WANDB_* injection)
  • A branded local dashboard (gpu-train serve) — start runs, live logs, connect providers, cost reports — with the prebuilt UI bundled in the wheel
  • gpu-train CLI (jobs, logs -f, kill, reconcile, serve)
  • SQLite job/instance/log registry; fully typed (py.typed, strict mypy)
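The pluggable secret resolution listed above can be pictured with a minimal sketch. This is not gpu-train's actual resolver: the function name `resolve_secret`, its signature, and the in-memory mapping standing in for the credential store are all assumptions made for illustration. Only the four scheme names (env://, file://, store://, literal://) come from the list above.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_secret(ref: str, store: Optional[Mapping[str, str]] = None) -> str:
    """Resolve a secret reference such as 'env://WANDB_API_KEY' (hypothetical helper).

    `store` is a stand-in for the local credential store; the real store's
    layout and lookup API are not described in this document.
    """
    scheme, sep, rest = ref.partition("://")
    if not sep:
        raise ValueError(f"secret reference {ref!r} has no scheme")

    if scheme == "env":
        # Read the value from an environment variable.
        try:
            return os.environ[rest]
        except KeyError:
            raise KeyError(f"environment variable {rest!r} is not set") from None
    if scheme == "file":
        # Read the value from a file, dropping the trailing newline.
        return Path(rest).expanduser().read_text(encoding="utf-8").strip()
    if scheme == "store":
        # Look the value up by name in the (assumed) credential mapping.
        if store is None or rest not in store:
            raise KeyError(f"no stored credential named {rest!r}")
        return store[rest]
    if scheme == "literal":
        # The value is embedded directly in the reference.
        return rest
    raise ValueError(f"unsupported secret scheme {scheme!r}")
```

Dispatching on the scheme prefix keeps each backend independent, so a further scheme would only need one more branch; how the real package registers its backends may well differ.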

Known limitations

Area             Notes
Not on PyPI      Install from the GitHub Releases wheel; see Installation.
Live validation  Cloud adapters are mock-tested; treat first real runs as smoke tests and watch the bill.
GCP pricing      The GCE client doesn't expose pricing, so price_cap isn't enforced and cost_usd stays 0; rely on auto-terminate / idle timeout.
Colab            Best-effort SSH over an ngrok tunnel; may bump into Colab's ToS, and terminate() is a no-op.
Dashboard auth   Loopback-only, no authentication; don't expose it directly to a network.
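The GCP pricing limitation is easier to see in a small sketch of how a cost guard might behave when the hourly price is unknown. Everything here is hypothetical: `GuardConfig`, `accrued_cost_usd`, and `termination_reason` are illustrative names, and treating price_cap as a ceiling on the hourly rate is an assumption, since this page does not define its exact semantics.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GuardConfig:
    # Assumption: price_cap is the maximum acceptable price in USD per hour.
    price_cap: Optional[float]
    # Seconds without activity before the idle watchdog fires.
    idle_timeout_s: float


def accrued_cost_usd(hourly_price: Optional[float], elapsed_s: float) -> float:
    """Cost so far; stays 0.0 when the provider reports no price."""
    if hourly_price is None:
        return 0.0
    return hourly_price * elapsed_s / 3600.0


def termination_reason(
    cfg: GuardConfig, hourly_price: Optional[float], idle_s: float
) -> Optional[str]:
    """Return why an instance should be terminated, or None to keep it running."""
    # The cap can only be checked when both a cap and a price are known.
    if (
        cfg.price_cap is not None
        and hourly_price is not None
        and hourly_price > cfg.price_cap
    ):
        return "price_cap"
    # The idle timeout needs no pricing data, so it still applies.
    if idle_s >= cfg.idle_timeout_s:
        return "idle"
    return None
```

With `hourly_price=None`, the cap branch is skipped and the accrued cost stays at zero, which is why the idle timeout and auto-terminate are the backstops to rely on for that provider.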

Planned

Area                Status     Notes
PyPI publishing     planned    Trusted-publishing release flow like llm-rotate.
More providers      exploring  Additional clouds as demand warrants.
Spot / preemptible  exploring  Cheaper capacity tiers with checkpointing.
Live billing tests  planned    End-to-end validation against real accounts.

Contributing

Install the dev and server extras, then make sure the lint, type-check, and coverage gate passes locally:

git clone https://github.com/Research-Commons/gpu-train
cd gpu-train
python -m pip install -e ".[dev,server]"
ruff check src tests
mypy src
pytest --cov --cov-fail-under=90

Releases are cut by bumping _version.py, tagging vX.Y.Z, building the sdist + wheel, and attaching them to a GitHub Release. See the repository for full contributor docs.