
Quickstart

Configure gpu-train, run a job on a rented GPU, and stream its logs, or take the zero-cost local path.

This walks through a first run end-to-end. You'll configure the control plane, launch a job on a rented GPU, follow its logs, and let it tear the box down automatically.

1. Configure

configure() must be called once, before the first use of run or gpu (same contract as llm-rotate). It takes two arguments: registry, a dict describing credentials and tracking, and use, the list of credential ids to activate.

from gpu_train import configure, run, gpu

configure(
    registry={
        "credentials": [
            {"cred_id": "runpod-1", "provider": "runpod",
             "secret_ref": "env://RUNPOD_API_KEY"},
        ],
        "tracking": {"wandb": {"secret_ref": "env://WANDB_API_KEY", "project": "my-proj"}},
    },
    use=["runpod-1"],
)

The secret itself never goes in the registry — only a secret_ref. See Credentials & secrets.
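
To make the secret_ref idea concrete, here is a minimal sketch of how an env:// reference could be turned into a value at the moment it is needed. The helper name resolve_secret_ref is hypothetical and is not part of gpu-train. How the library resolves references internally is not described here, and only the env:// scheme shown above is handled.

```python
import os


def resolve_secret_ref(ref: str) -> str:
    """Hypothetical helper: map 'env://NAME' to the value of $NAME.

    This is an illustration of the pattern, not gpu-train's own resolver.
    """
    scheme, sep, name = ref.partition("://")
    if sep != "://" or scheme != "env" or not name:
        raise ValueError(f"unsupported secret_ref: {ref!r}")
    try:
        # The secret is read from the environment only when requested,
        # so it never has to be stored in the registry dict.
        return os.environ[name]
    except KeyError:
        raise KeyError(f"environment variable {name} is not set") from None
```

The point of the indirection is that the registry can be logged, committed, or shared without exposing the key.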

2. Launch a run

job = run(
    task={"entrypoint": "train.py", "args": ["--epochs", "3"]},
    provider="runpod",
    gpus="A100:4",      # "<type>:<count>", or "cpu"
    price_cap=2.50,     # $/hr ceiling; refuse to provision above it
    project_dir=".",    # synced to the box with rsync
)

print(job.id, job.status)
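
The gpus string and price_cap in the call above can be read as a small parse-then-check step. The sketch below is an assumption about how that could look. GpuSpec, parse_gpus, and within_price_cap are hypothetical names, and whether the real cap applies per GPU or per instance is not stated in this guide, so the sketch compares a whole-instance hourly price.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GpuSpec:
    gpu_type: Optional[str]  # None means CPU-only
    count: int


def parse_gpus(spec: str) -> GpuSpec:
    """Parse '<type>:<count>' or 'cpu' (hypothetical helper)."""
    if spec == "cpu":
        return GpuSpec(gpu_type=None, count=0)
    gpu_type, sep, count = spec.partition(":")
    if not sep or not gpu_type or not count.isdigit() or int(count) < 1:
        raise ValueError(f"expected '<type>:<count>' or 'cpu', got {spec!r}")
    return GpuSpec(gpu_type=gpu_type, count=int(count))


def within_price_cap(offer_per_hour: float, price_cap: float) -> bool:
    """Return True if an offer's hourly price does not exceed the cap."""
    return offer_per_hour <= price_cap
```

For example, parse_gpus("A100:4") yields a spec with gpu_type "A100" and count 4, and an offer at 3.00 $/hr would be rejected under a cap of 2.50.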

run() is non-blocking by default: it registers the job, kicks off a background driver thread (provision → wait for SSH → push code → install deps → launch → stream), and returns the JobRecord immediately. Pass wait=True to block until the job reaches a terminal state.
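
The register, start a driver thread, and return immediately pattern can be sketched with the standard library alone. Everything below is illustrative: DemoJob, demo_run, the step names, and the "queued" and "succeeded" status strings are assumptions for the example, not gpu-train's JobRecord or its actual states.

```python
import threading
import time
from dataclasses import dataclass, field

# Stand-ins for the stages listed above; the real driver does real work here.
STEPS = ["provision", "wait_for_ssh", "push_code", "install_deps", "launch", "stream"]


@dataclass
class DemoJob:
    id: str
    status: str = "queued"
    steps_done: list = field(default_factory=list)
    done: threading.Event = field(default_factory=threading.Event)


def _drive(job: DemoJob) -> None:
    """Background driver: walk through each stage, then mark the job terminal."""
    try:
        for step in STEPS:
            job.status = step
            time.sleep(0.01)  # placeholder for the actual stage
            job.steps_done.append(step)
        job.status = "succeeded"
    except Exception:
        job.status = "failed"
    finally:
        job.done.set()  # signal that a terminal state was reached


def demo_run(job_id: str, wait: bool = False) -> DemoJob:
    """Return the job record immediately unless wait=True."""
    job = DemoJob(id=job_id)
    threading.Thread(target=_drive, args=(job,), daemon=True).start()
    if wait:
        job.done.wait()  # block until the driver finishes
    return job
```

With wait=False the caller gets the record back while the stages are still in progress. With wait=True the same call returns only after the terminal status is set.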

3. Follow it

gpu.wait(job.id)                       # block until terminal
for line in gpu.stream(job.id):        # tail logs until the job ends
    print(line)

gpu.job(job.id)                        # latest JobRecord (status, cost, exit code)
gpu.jobs()                             # list recent jobs

The instance is terminated automatically when the job finishes. To stop things early:

gpu.kill(job.id)    # cancel one job + terminate its instance
gpu.kill_all()      # terminate every active instance (the panic button)

Zero-cost local path

No credentials, no cloud, no cost — great for wiring up your train.py and CI:

from gpu_train import configure, run

configure(registry={}, use=[])
job = run(task={"entrypoint": "train.py"}, provider="local", gpus="cpu", wait=True)
print(job.status, job.exit_code)

The local provider runs your entrypoint as a subprocess with the same lifecycle (push → launch → stream → finalize) as a rented box.
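
As a rough picture of the launch, stream, and finalize part of that lifecycle, the sketch below runs a command as a subprocess, forwards its output line by line, and returns the exit code. It uses only the standard library. run_local is a hypothetical name, and this is not the local provider's actual implementation.

```python
import subprocess
from typing import Callable, List


def run_local(cmd: List[str], on_line: Callable[[str], None] = print) -> int:
    """Launch cmd, stream its combined stdout/stderr, and return the exit code."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        text=True,
    )
    assert proc.stdout is not None
    for line in proc.stdout:          # yields lines as the child writes them
        on_line(line.rstrip("\n"))
    return proc.wait()                # finalize: collect the exit code
```

Calling run_local([sys.executable, "train.py"]) after import sys would print each line of output from the script and return its exit code, which mirrors what job.exit_code reports in the example above.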

Prefer a UI?

Install the dashboard and drive everything from the browser:

pip install "gpu-train[server]"
gpu-train serve

See Dashboard (UI).