API reference

The public Python surface — configure, run, the gpu singleton, and the core data types.

Everything below is importable from the top-level gpu_train package.

from gpu_train import configure, run, gpu

configure

configure(
    use: list[str] | None = None,
    *,
    registry: dict | None = None,
    config: GpuTrainConfig | None = None,
) -> None

Initialise the singleton once, before the first use of gpu or run. Pass either a registry dict (plus the use ids to activate) or a prebuilt config. Raises ConfigurationError if called more than once, or if neither registry nor config is given. See Configuration.

Related: configure_from_dict(registry, use) builds a GpuTrainConfig without touching the singleton; reset_singleton() clears it (tests only).
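Because a second configure call raises ConfigurationError, code that can be reached from several entry points may want a guard around it. The following is a minimal sketch of such a guard, not part of gpu_train: configure_once is a hypothetical helper, and it takes the configure callable as an argument so it can be exercised without the package installed.

```python
import threading

# Hypothetical helper, not part of gpu_train.
_lock = threading.Lock()
_configured = False


def configure_once(configure_fn, *, registry, use):
    """Call configure_fn(use, registry=registry) at most once per process.

    Returns True if this call performed the configuration and False if an
    earlier call already had. If configure_fn raises, the flag stays unset,
    so a later call can retry.
    """
    global _configured
    with _lock:
        if _configured:
            return False
        configure_fn(use, registry=registry)
        _configured = True
        return True
```

With the real package this would presumably be called as configure_once(configure, registry=my_registry, use=my_ids), where my_registry follows the shape described in Configuration.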

run

run(
    task: Task | dict | str,
    *,
    provider: str | None = None,
    gpus: str | None = None,           # "A100:4", "H100:8", "cpu", …
    label: str | None = None,
    runtime: str | Runtime | None = None,   # "torchrun" | "ray" | "local"
    price_cap: float | None = None,    # $/hr ceiling
    disk_gb: int | None = None,
    region: str | None = None,
    nodes: int = 1,
    project_dir: str | Path = ".",     # rsynced to the box
    wait: bool = False,                # block until terminal if True
) -> JobRecord

Launches a training job. Non-blocking by default: it returns the JobRecord immediately and drives the job lifecycle in a background thread. Pass wait=True to block until the job reaches a terminal state. This is a convenience wrapper over gpu.run(...).
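The non-blocking launch can be paired with an explicit wait later on. The sketch below assumes only what this page documents (gpu.run, gpu.wait, and the id and provider fields of JobRecord); launch_and_wait is a hypothetical helper, and the gpu object is passed in as a parameter so a stand-in can be used in tests.

```python
def launch_and_wait(gpu, task, *, timeout=None, **run_kwargs):
    """Start a job without blocking, then block until it is terminal.

    `gpu` is expected to expose run() and wait() as listed on this page.
    Any extra keyword arguments (gpus, price_cap, region, ...) are
    forwarded to gpu.run unchanged.
    """
    job = gpu.run(task, **run_kwargs)  # returns right away by default
    print(f"launched job {job.id} on {job.provider}")
    # Other work could happen here while the job runs in the background.
    return gpu.wait(job.id, timeout=timeout)  # blocks until terminal
```

A call might look like launch_and_wait(gpu, "train.py", gpus="A100:4", price_cap=12.0). What gpu.wait does when the timeout elapses is not stated here, so the sketch does not handle that case.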

The gpu singleton

gpu is a proxy to the configured GpuTrain control plane:

Method                             Returns           Purpose
gpu.run(task, …)                   JobRecord         Same as top-level run.
gpu.jobs(status=None, limit=100)   list[JobRecord]   List jobs.
gpu.job(job_id)                    JobRecord         Fetch one (raises JobNotFoundError).
gpu.logs(job_id, limit=1000)       list[str]         Stored log lines.
gpu.stream(job_id, follow=True)    Iterator[str]     Tail logs until the job ends.
gpu.wait(job_id, timeout=None)     JobRecord         Block until terminal.
gpu.kill(job_id)                   JobRecord         Cancel a job + terminate its box.
gpu.kill_all()                     int               Terminate every active instance.
gpu.reconcile()                    int               Terminate orphaned instances.
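As an illustration of how these methods combine, the sketch below tails a job's logs and cancels the job if the user interrupts the stream. It uses only gpu.stream, gpu.kill, and gpu.job as listed above; stream_with_cleanup is a hypothetical helper, and killing on Ctrl-C is a design choice of this example rather than documented library behaviour.

```python
def stream_with_cleanup(gpu, job_id, sink=print):
    """Forward each log line to `sink`; cancel the job if interrupted.

    Returns the job's record, fetched once the stream has ended.
    """
    try:
        for line in gpu.stream(job_id, follow=True):
            sink(line)
    except KeyboardInterrupt:
        # Avoid leaving a paid instance running after Ctrl-C.
        gpu.kill(job_id)
        raise
    return gpu.job(job_id)
```

Passing sink=some_list.append collects the lines instead of printing them.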

Core types

All types are pydantic v2 models, importable from gpu_train.

Task

What to run. Accepts a str (entrypoint), a dict, or a Task:

Task(
    entrypoint="train.py",
    args=["--epochs", "3"],
    working_dir=".",
    env={"HF_HOME": "/workspace/hf"},
    deps=Deps(requirements=["torch==2.4.0"], setup=["apt-get install -y git"]),
)
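To make the three accepted forms concrete, here is a rough sketch of how a str, a dict, or a Task could be normalised into one shape. It is an assumption about the coercion, not the library's code: TaskSketch is a plain dataclass standing in for the real pydantic Task, its defaults are guesses, deps is left out, and as_task is a hypothetical name.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSketch:
    """Stand-in for gpu_train.Task (hypothetical; defaults are assumed)."""
    entrypoint: str
    args: list = field(default_factory=list)
    working_dir: str = "."
    env: dict = field(default_factory=dict)


def as_task(task):
    """Normalise a str, dict, or TaskSketch into a TaskSketch."""
    if isinstance(task, TaskSketch):
        return task                          # already the right type
    if isinstance(task, str):
        return TaskSketch(entrypoint=task)   # bare entrypoint
    if isinstance(task, dict):
        return TaskSketch(**task)            # keys map onto fields
    raise TypeError(f"unsupported task type: {type(task).__name__}")
```

Under these assumptions, as_task("train.py") and as_task({"entrypoint": "train.py"}) produce equal objects.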

JobRecord

The persisted record of a run: id, provider, status (JobStatus), spec (ResourceSpec), task, runtime, instance_id, exit_code, cost_usd, price_per_hr, wandb_run_id / wandb_url, timestamps, and error.
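A short sketch of reading these fields follows. It touches only id, status, exit_code, cost_usd, and error from the list above, and assumes that unset optional fields are None and that status is either a plain string or an enum with a value attribute. summarize_job is a hypothetical helper.

```python
def summarize_job(job):
    """Build a one-line, human-readable summary of a job record."""
    # Accept either an enum member or a plain string for the status.
    status = getattr(job.status, "value", job.status)
    parts = [f"{job.id} [{status}]"]
    if job.exit_code is not None:
        parts.append(f"exit={job.exit_code}")
    if job.cost_usd is not None:
        parts.append(f"cost=${job.cost_usd:.2f}")
    if job.error:
        parts.append(f"error={job.error}")
    return " ".join(parts)
```

This could be applied to each record returned by gpu.jobs() to print a compact status listing.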

Other types

  • ResourceSpec — hardware request (gpu_type, count, disk_gb, image, region, price_cap, nodes).
  • GpuSpec — a parsed "A100:4"-style accelerator request.
  • Deps — environment setup (requirements, requirements_file, setup, python).
  • Instance — a rented (or local) box.
  • Handle — a reference to a launched process.
  • JobStatus: queued · provisioning · running · succeeded · failed · cancelled.
  • Runtime: torchrun · ray · local.
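The "A100:4"-style string that GpuSpec represents can be illustrated with a small parser. This is a sketch under stated assumptions, not the library's parser: GpuSpecSketch and parse_gpus are hypothetical names, a bare type such as "A100" is assumed to mean one accelerator, and "cpu" is assumed to mean zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSpecSketch:
    """Stand-in for gpu_train.GpuSpec (hypothetical)."""
    gpu_type: str
    count: int


def parse_gpus(spec: str) -> GpuSpecSketch:
    """Parse "TYPE:COUNT", "TYPE", or "cpu" into a GpuSpecSketch."""
    name, sep, count = spec.strip().partition(":")
    if not name:
        raise ValueError(f"missing accelerator type in {spec!r}")
    if name.lower() == "cpu":
        if sep:
            raise ValueError("'cpu' does not take a count")
        return GpuSpecSketch("cpu", 0)       # assumption: cpu means 0 GPUs
    n = int(count) if sep else 1             # assumption: bare type means 1
    if n < 1:
        raise ValueError(f"count must be at least 1, got {n}")
    return GpuSpecSketch(name, n)
```

Malformed input such as "A100:" or "A100:0" raises ValueError in this sketch; how the real GpuSpec reports bad input is covered under Errors.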

Config types

GpuTrainConfig, CredentialConfig, TrackingConfig, WandbConfig — the typed shapes behind the registry dict (see Configuration).

For the exceptions these raise, see Errors.