API reference
The public Python surface — configure, run, the gpu singleton, and the core data types.
Everything below is importable from the top-level gpu_train package.
from gpu_train import configure, run, gpu
configure
configure(
use: list[str] | None = None,
*,
registry: dict | None = None,
config: GpuTrainConfig | None = None,
) -> None
Initialise the singleton once, before the first use of gpu/run. Pass
either a registry dict (+ use ids to activate) or a prebuilt config. Raises
ConfigurationError if called twice or with neither. See
Configuration.
Related: configure_from_dict(registry, use) builds a GpuTrainConfig without
touching the singleton; reset_singleton() clears it (tests only).
run
run(
task: Task | dict | str,
*,
provider: str | None = None,
gpus: str | None = None, # "A100:4", "H100:8", "cpu", …
label: str | None = None,
runtime: str | Runtime | None = None, # "torchrun" | "ray" | "local"
price_cap: float | None = None, # $/hr ceiling
disk_gb: int | None = None,
region: str | None = None,
nodes: int = 1,
project_dir: str | Path = ".", # rsynced to the box
wait: bool = False, # block until terminal if True
) -> JobRecord
Launch a training job. Non-blocking by default: returns the JobRecord
immediately and drives the lifecycle in a background thread. A convenience
wrapper over gpu.run(...).
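To make the non-blocking behaviour concrete, here is a minimal sketch of the pattern: return a record immediately, advance its status from a background thread, and block only when `wait=True`. The `JobRecord` stand-in, its `done` event, the `_drive` helper, and the fixed status sequence are hypothetical; the real lifecycle (provisioning a box, rsyncing the project, running the task) is not modelled.

```python
import threading
import time
from dataclasses import dataclass, field


@dataclass
class JobRecord:
    """Stand-in record; the real JobRecord has many more fields."""
    id: str
    status: str = "queued"
    done: threading.Event = field(default_factory=threading.Event, repr=False)


def _drive(job):
    # Simulated lifecycle: each sleep stands in for real provisioning/training work.
    for status in ("provisioning", "running", "succeeded"):
        time.sleep(0.01)
        job.status = status
    job.done.set()


def run(task, *, wait=False):
    job = JobRecord(id="job-1")
    threading.Thread(target=_drive, args=(job,), daemon=True).start()
    if wait:
        job.done.wait()  # block until the job reaches a terminal status
    return job


job = run("train.py")             # returns right away, usually before the job finishes
job.done.wait(timeout=5)          # the caller decides when (and whether) to block
blocking = run("train.py", wait=True)
print(job.status, blocking.status)
```

The design choice to note is that the caller always gets a record back, so the job id is available for logs or cancellation even while the job is still provisioning.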
The gpu singleton
gpu is a proxy to the configured GpuTrain control plane:
| Method | Returns | Purpose |
|---|---|---|
| gpu.run(task, …) | JobRecord | Same as top-level run. |
| gpu.jobs(status=None, limit=100) | list[JobRecord] | List jobs. |
| gpu.job(job_id) | JobRecord | Fetch one (raises JobNotFoundError). |
| gpu.logs(job_id, limit=1000) | list[str] | Stored log lines. |
| gpu.stream(job_id, follow=True) | Iterator[str] | Tail logs until the job ends. |
| gpu.wait(job_id, timeout=None) | JobRecord | Block until terminal. |
| gpu.kill(job_id) | JobRecord | Cancel a job + terminate its box. |
| gpu.kill_all() | int | Terminate every active instance. |
| gpu.reconcile() | int | Terminate orphaned instances. |
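The methods in the table compose naturally. The sketch below launches a job, waits up to a deadline, and cancels the job if it has not finished. It takes the `gpu` object as a parameter so it can be exercised without a configured control plane. Several things here are assumptions rather than documented behaviour: how `gpu.wait` signals a timeout (both raising `TimeoutError` and returning a non-terminal record are handled), which statuses count as terminal, and the `FakeGpu` stand-in, which exists only to make the example runnable.

```python
from types import SimpleNamespace

# Assumed terminal subset of JobStatus.
TERMINAL = {"succeeded", "failed", "cancelled"}


def _status_name(job):
    # JobStatus may be an enum or a plain string; normalise to its string value.
    return str(getattr(job.status, "value", job.status))


def run_with_deadline(gpu, task, timeout_s, **run_kwargs):
    """Launch a task and cancel it if it is not terminal after timeout_s seconds."""
    job = gpu.run(task, **run_kwargs)
    try:
        job = gpu.wait(job.id, timeout=timeout_s)
    except TimeoutError:
        # Assumption: wait() may signal a timeout by raising TimeoutError.
        return gpu.kill(job.id)
    if _status_name(job) not in TERMINAL:
        # Assumption: wait() may instead return a non-terminal record on timeout.
        return gpu.kill(job.id)
    return job


class FakeGpu:
    """Hypothetical stand-in that mimics run/wait/kill for local experimentation."""

    def __init__(self, finishes):
        self.finishes = finishes
        self.killed = []

    def run(self, task, **kwargs):
        return SimpleNamespace(id="job-1", status="running")

    def wait(self, job_id, timeout=None):
        if not self.finishes:
            raise TimeoutError(job_id)
        return SimpleNamespace(id=job_id, status="succeeded")

    def kill(self, job_id):
        self.killed.append(job_id)
        return SimpleNamespace(id=job_id, status="cancelled")


print(run_with_deadline(FakeGpu(finishes=True), "train.py", 60).status)
print(run_with_deadline(FakeGpu(finishes=False), "train.py", 60).status)
```

With the real singleton, the call would look like `run_with_deadline(gpu, "train.py", 3600, gpus="A100:4")`, using only the `run` keyword arguments documented above.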
Core types
All types are pydantic v2 models, importable from gpu_train.
Task
What to run. Accepts a str (entrypoint), a dict, or a Task:
Task(
entrypoint="train.py",
args=["--epochs", "3"],
working_dir=".",
env={"HF_HOME": "/workspace/hf"},
deps=Deps(requirements=["torch==2.4.0"], setup=["apt-get install -y git"]),
)
JobRecord
The persisted record of a run: id, provider, status (JobStatus), spec
(ResourceSpec), task, runtime, instance_id, exit_code, cost_usd,
price_per_hr, wandb_run_id / wandb_url, timestamps, and error.
Other types
ResourceSpec: hardware request (gpu_type, count, disk_gb, image, region, price_cap, nodes).
GpuSpec: a parsed "A100:4"-style accelerator request.
Deps: environment setup (requirements, requirements_file, setup, python).
Instance: a rented (or local) box.
Handle: a reference to a launched process.
JobStatus: queued · provisioning · running · succeeded · failed · cancelled.
Runtime: torchrun · ray · local.
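As an illustration of what parsing an "A100:4"-style request involves, the sketch below splits the string into a type and a count. It is not the library's GpuSpec: the `ParsedGpus` dataclass and `parse_gpus` function are hypothetical names, and the handling of a missing count (defaulting to 1) and of "cpu" (no accelerator, count 0) are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParsedGpus:
    """Hypothetical stand-in for a parsed accelerator request."""
    gpu_type: Optional[str]  # None means CPU-only
    count: int


def parse_gpus(text):
    """Parse strings such as "A100:4", "H100:8", or "cpu"."""
    text = text.strip()
    if text.lower() == "cpu":
        return ParsedGpus(gpu_type=None, count=0)
    name, sep, count = text.partition(":")
    if not name:
        raise ValueError(f"missing GPU type in {text!r}")
    # Assumption: a bare type such as "H100" means a single GPU.
    n = int(count) if sep else 1  # int("") raises ValueError for "A100:"
    if n < 1:
        raise ValueError(f"GPU count must be at least 1 in {text!r}")
    return ParsedGpus(gpu_type=name, count=n)


print(parse_gpus("A100:4"))
print(parse_gpus("cpu"))
```

Rejecting malformed strings at parse time means a typo in `gpus` fails before any instance is rented.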
Config types
GpuTrainConfig, CredentialConfig, TrackingConfig, WandbConfig — the typed
shapes behind the registry dict (see Configuration).
For the exceptions these raise, see Errors.