API reference

The public Python surface — configure, run, the gpu singleton, and the core data types.

Everything below is importable from the top-level gpu_train package.

from gpu_train import configure, run, gpu

configure

configure(
    use: list[str] | None = None,
    *,
    registry: dict | None = None,
    config: GpuTrainConfig | None = None,
) -> None

Initialise the singleton once, before the first use of gpu or run. Pass either a registry dict (plus the use ids to activate) or a prebuilt config. Raises ConfigurationError if called a second time, or if neither registry nor config is given. See Configuration.

Related: configure_from_dict(registry, use) builds a GpuTrainConfig without touching the singleton; reset_singleton() clears it (tests only).

run

run(
    task: Task | dict | str,
    *,
    provider: str | None = None,
    gpus: str | None = None,           # "A100:4", "H100:8", "cpu", …
    label: str | None = None,
    runtime: str | Runtime | None = None,   # "torchrun" | "ray" | "local"
    price_cap: float | None = None,    # $/hr ceiling
    disk_gb: int | None = None,
    region: str | None = None,
    nodes: int = 1,
    project_dir: str | Path = ".",     # rsynced to the box
    wait: bool = False,                # block until terminal if True
) -> JobRecord

Launch a training job. Non-blocking by default: run returns the JobRecord immediately and drives the lifecycle in a background thread. Pass wait=True to block until the job reaches a terminal status. This is a convenience wrapper over gpu.run(...).
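The difference between the default and wait=True can be sketched with plain threads. Everything here is illustrative: `Record` and `launch` are hypothetical stand-ins for JobRecord and run, and the real lifecycle (provisioning, syncing the project, and so on) is reduced to a single callable.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical stand-in for JobRecord, reduced to what the sketch needs."""
    status: str = "queued"
    done: threading.Event = field(default_factory=threading.Event, repr=False)


def launch(work, *, wait=False):
    """Start `work` in a background thread and return a record right away."""
    record = Record()

    def drive():
        record.status = "running"
        try:
            work()
            record.status = "succeeded"
        except Exception:
            record.status = "failed"
        finally:
            record.done.set()  # signal that a terminal status was reached

    threading.Thread(target=drive, daemon=True).start()
    if wait:
        record.done.wait()  # block the caller until the job is terminal
    return record
```

With the default, the caller gets a record whose status is still changing and has to check back later; with wait=True the returned record is already terminal.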

The `gpu` singleton

gpu is a proxy to the configured GpuTrain control plane:

| Method | Returns | Purpose |
| --- | --- | --- |
| `gpu.run(task, …)` | `JobRecord` | Same as top-level `run`. |
| `gpu.jobs(status=None, limit=100)` | `list[JobRecord]` | List jobs. |
| `gpu.job(job_id)` | `JobRecord` | Fetch one (raises `JobNotFoundError`). |
| `gpu.logs(job_id, limit=1000)` | `list[str]` | Stored log lines. |
| `gpu.stream(job_id, follow=True)` | `Iterator[str]` | Tail logs until the job ends. |
| `gpu.wait(job_id, timeout=None)` | `JobRecord` | Block until terminal. |
| `gpu.kill(job_id)` | `JobRecord` | Cancel a job + terminate its box. |
| `gpu.kill_all()` | `int` | Terminate every active instance. |
| `gpu.reconcile()` | `int` | Terminate orphaned instances. |
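One common way to build a module-level proxy like this is an object that forwards attribute access to whatever backend has been installed. The sketch below shows that pattern only; `_Proxy`, `_install`, and `FakeControlPlane` are hypothetical names, the library's real proxy may work differently, and the error raised before configuration is assumed to be a plain RuntimeError here.

```python
class _Proxy:
    """Forward attribute access to whatever object is currently installed."""

    def __init__(self):
        self._target = None

    def _install(self, target):
        self._target = target

    def __getattr__(self, name):
        # Only reached for names that are not defined on the proxy itself.
        target = self.__dict__.get("_target")
        if target is None:
            raise RuntimeError("not configured yet; call configure() first")
        return getattr(target, name)


class FakeControlPlane:
    """Hypothetical backend exposing two of the methods from the table."""

    def __init__(self):
        self._jobs = {"job-1": "running"}

    def jobs(self, status=None, limit=100):
        ids = [j for j, s in self._jobs.items() if status in (None, s)]
        return ids[:limit]

    def kill_all(self):
        active = [j for j, s in self._jobs.items() if s == "running"]
        for job_id in active:
            self._jobs[job_id] = "cancelled"
        return len(active)  # number of instances terminated


gpu = _Proxy()
```

Because the proxy resolves its target on every call, code that did `from ... import gpu` before configuration still sees the configured backend afterwards.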

Core types

All types are pydantic v2 models, importable from gpu_train.

Task

What to run. Accepts a str (entrypoint), a dict, or a Task:

Task(
    entrypoint="train.py",
    args=["--epochs", "3"],
    working_dir=".",
    env={"HF_HOME": "/workspace/hf"},
    deps=Deps(requirements=["torch==2.4.0"], setup=["apt-get install -y git"]),
)
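Accepting a str, a dict, or a Task usually comes down to a small normalisation step. The following is a sketch under assumptions: `as_task` is a hypothetical helper, and the `Task` below is a dataclass stand-in with a subset of the fields shown above, whereas the real type is a pydantic model.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Stand-in with a subset of the documented fields."""
    entrypoint: str
    args: list = field(default_factory=list)
    working_dir: str = "."
    env: dict = field(default_factory=dict)


def as_task(value):
    """Normalise a str, dict, or Task into a Task."""
    if isinstance(value, Task):
        return value
    if isinstance(value, str):
        return Task(entrypoint=value)  # a bare string is the entrypoint
    if isinstance(value, dict):
        return Task(**value)  # keys must match the field names
    raise TypeError(f"cannot build a Task from {type(value).__name__}")
```

Under this reading, `"train.py"` and `{"entrypoint": "train.py"}` describe the same task.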

JobRecord

The persisted record of a run: id, provider, status (JobStatus), spec (ResourceSpec), task, runtime, instance_id, exit_code, cost_usd, price_per_hr, wandb_run_id / wandb_url, timestamps, and error.

Other types

ResourceSpec — hardware request (gpu_type, count, disk_gb, image, region, price_cap, nodes).
GpuSpec — a parsed "A100:4"-style accelerator request.
Deps — environment setup (requirements, requirements_file, setup, python).
Instance — a rented (or local) box.
Handle — a reference to a launched process.
JobStatus — queued · provisioning · running · succeeded · failed · cancelled.
Runtime — torchrun · ray · local.
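To make the "A100:4"-style request concrete, here is a sketch of how such a string could be parsed. It is not the library's parser: `parse_gpus` is a hypothetical helper, the `GpuSpec` below is a stand-in whose field names are borrowed from ResourceSpec, and the treatment of "cpu" (no accelerator) and of a bare type such as "A100" (one GPU) are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GpuSpec:
    """Stand-in; the real GpuSpec's fields are not listed on this page."""
    gpu_type: Optional[str]
    count: int


def parse_gpus(text):
    """Parse "A100:4"-style strings into a GpuSpec."""
    text = text.strip()
    if text.lower() == "cpu":
        # Assumption: "cpu" means no accelerator at all.
        return GpuSpec(gpu_type=None, count=0)
    name, sep, count = text.partition(":")
    if not name:
        raise ValueError(f"missing GPU type in {text!r}")
    # Assumption: a bare type such as "A100" requests a single GPU.
    n = int(count) if sep else 1
    if n < 1:
        raise ValueError(f"GPU count must be positive in {text!r}")
    return GpuSpec(gpu_type=name, count=n)
```

Validating the count up front means a typo like "A100:0" fails before any instance is requested.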

Config types

GpuTrainConfig, CredentialConfig, TrackingConfig, WandbConfig — the typed shapes behind the registry dict (see Configuration).

For the exceptions these raise, see Errors.
