Quickstart
Configure gpu-train, run on a rented GPU, and stream logs — or run the zero-cost local path.
This walks through a first run end-to-end. You'll configure the control plane, launch a job on a rented GPU, follow its logs, and let it tear the box down automatically.
1. Configure
configure() must be called once, before the first use of run or gpu
(same contract as llm-rotate). It takes a registry dict and the list of
credential ids to activate.
from gpu_train import configure, run, gpu
configure(
registry={
"credentials": [
{"cred_id": "runpod-1", "provider": "runpod",
"secret_ref": "env://RUNPOD_API_KEY"},
],
"tracking": {"wandb": {"secret_ref": "env://WANDB_API_KEY", "project": "my-proj"}},
},
use=["runpod-1"],
)The secret itself never goes in the registry — only a secret_ref. See
Credentials & secrets.
2. Launch a run
job = run(
task={"entrypoint": "train.py", "args": ["--epochs", "3"]},
provider="runpod",
gpus="A100:4", # "<type>:<count>", or "cpu"
price_cap=2.50, # $/hr ceiling; refuse to provision above it
project_dir=".", # synced to the box with rsync
)
print(job.id, job.status)run() is non-blocking by default: it registers the job, kicks off a
background driver thread (provision → wait for SSH → push code → install deps →
launch → stream), and returns the JobRecord immediately. Pass wait=True to
block until the job reaches a terminal state.
3. Follow it
gpu.wait(job.id) # block until terminal
for line in gpu.stream(job.id): # tail logs until the job ends
print(line)
gpu.job(job.id) # latest JobRecord (status, cost, exit code)
gpu.jobs() # list recent jobsThe instance is terminated automatically when the job finishes. To stop things early:
gpu.kill(job.id) # cancel one job + terminate its instance
gpu.kill_all() # terminate every active instance (the panic button)Zero-cost local path
No credentials, no cloud, no cost — great for wiring up your train.py and CI:
from gpu_train import configure, run
configure(registry={}, use=[])
job = run(task={"entrypoint": "train.py"}, provider="local", gpus="cpu", wait=True)
print(job.status, job.exit_code)The local provider runs your entrypoint as a subprocess with the same
lifecycle (push → launch → stream → finalize) as a rented box.
Prefer a UI?
Install the dashboard and drive everything from the browser:
pip install "gpu-train[server]"
gpu-train serveSee Dashboard (UI).