Configuration
The registry, configure(), provider defaults, and where gpu-train keeps its state.
gpu-train is configured once with a registry — a dict describing your
credentials, optional per-provider overrides, tracking, and runtime defaults.
This mirrors llm-rotate's configure() model.
configure()
from gpu_train import configure
configure(
    use=["runpod-1"],  # which credential ids to activate
    registry={...},    # the registry dict (below)
)
configure() accepts either a registry dict or a prebuilt config object:
from gpu_train import configure, configure_from_dict
cfg = configure_from_dict(registry, use=["runpod-1"])
configure(config=cfg)
configure() must be called before the first use of gpu/run, and only once per
process. In tests, call gpu_train.reset_singleton() to reconfigure.
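The once-per-process rule can be pictured as a guarded module-level singleton. The sketch below is a minimal standalone illustration, not gpu-train's implementation: `configure_once` and `reset_for_tests` are hypothetical names standing in for configure() and reset_singleton(), and raising RuntimeError on a second call is an assumption.

```python
from __future__ import annotations

from typing import Any, Optional

# Process-wide configuration slot (hypothetical stand-in for the library's state).
_config: Optional[dict[str, Any]] = None


def configure_once(registry: dict[str, Any], use: list[str]) -> dict[str, Any]:
    """Store the configuration; refuse a second call in the same process."""
    global _config
    if _config is not None:
        # Assumption: a repeated call is treated as an error.
        raise RuntimeError("already configured; reset before reconfiguring")
    _config = {"registry": registry, "use": list(use)}
    return _config


def reset_for_tests() -> None:
    """Clear the stored configuration so a test can configure again."""
    global _config
    _config = None
```

In a test suite, a reset like this would typically run in a fixture's teardown so that each test starts unconfigured.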
The registry
registry = {
    "credentials": [
        {"cred_id": "runpod-1", "provider": "runpod",
         "secret_ref": "env://RUNPOD_API_KEY"},
    ],
    "providers": {  # optional per-provider catalog overrides
        "runpod": {"default_image": "runpod/pytorch:2.4.0-cu124"},
    },
    "tracking": {
        "wandb": {"secret_ref": "env://WANDB_API_KEY", "project": "my-proj",
                  "entity": "my-team", "enabled": True},
    },
    "defaults": {
        "runtime": "torchrun",
        "auto_terminate": True,
        "terminate_on_exit": True,
        "reconcile_on_start": True,
        "idle_timeout_seconds": 1800,
    },
}
credentials
A list of credential entries. Each has a cred_id, a provider, and a
secret_ref (see Credentials & secrets). Some
providers carry extra fields (ssh_key_ref, region, extra) — see
Providers. Only the ids listed in use=[...] are
activated.
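To make the activation rule concrete, here is a minimal sketch of selecting credentials by id and resolving an env:// reference. Both helpers are hypothetical and not part of gpu-train; the sketch assumes that env://NAME means "read the environment variable NAME" and that an unknown id is an error.

```python
from __future__ import annotations

import os
from typing import Mapping


def resolve_secret(secret_ref: str, env: Mapping[str, str] = os.environ) -> str:
    """Resolve an env:// reference by reading the named environment variable."""
    scheme, sep, name = secret_ref.partition("://")
    if sep != "://" or scheme != "env" or not name:
        raise ValueError(f"unsupported secret_ref: {secret_ref!r}")
    if name not in env:
        raise KeyError(f"environment variable {name} is not set")
    return env[name]


def active_credentials(registry: dict, use: list[str]) -> list[dict]:
    """Return only the credential entries whose cred_id is listed in use."""
    by_id = {cred["cred_id"]: cred for cred in registry.get("credentials", [])}
    missing = [cred_id for cred_id in use if cred_id not in by_id]
    if missing:
        # Assumption: asking for an id that is not in the registry is an error.
        raise KeyError(f"unknown cred_id(s): {missing}")
    return [by_id[cred_id] for cred_id in use]
```

Entries that are present in the registry but absent from `use` are simply ignored, which lets one registry describe more credentials than any single process activates.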
providers
Optional per-name overrides layered on top of the built-in provider catalog (default image, disk, ssh user, runtime, etc.). Providers you don't mention keep their built-in defaults.
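The layering can be sketched as a per-provider dictionary merge. The catalog contents below are placeholders invented for illustration (the real built-in values live inside gpu-train), and `effective_catalog` is a hypothetical helper.

```python
from __future__ import annotations

# Hypothetical built-in catalog; field values are placeholders, not real defaults.
BUILTIN_CATALOG = {
    "runpod": {"default_image": "example/base:latest", "ssh_user": "root"},
    "vast": {"default_image": "example/base:latest", "ssh_user": "root"},
}


def effective_catalog(overrides: dict[str, dict]) -> dict[str, dict]:
    """Layer per-provider overrides on top of a copy of the built-in catalog."""
    merged = {name: dict(fields) for name, fields in BUILTIN_CATALOG.items()}
    for name, fields in overrides.items():
        # Only the mentioned fields change; everything else keeps its default.
        merged.setdefault(name, {}).update(fields)
    return merged
```

With `{"runpod": {"default_image": "runpod/pytorch:2.4.0-cu124"}}` as the override, only that one field of the runpod entry changes and the vast entry is untouched.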
tracking
A wandb block enables automatic experiment tracking — see
Monitoring.
defaults
| Key | Default | Meaning |
|---|---|---|
| runtime | torchrun | Execution backend when not overridden per-run. |
| auto_terminate | true | Terminate the box when a job finishes. |
| terminate_on_exit | true | Terminate active boxes on control-plane exit / Ctrl-C. |
| reconcile_on_start | true | On startup, terminate boxes orphaned by a prior crash. |
| idle_timeout_seconds | 1800 | Kill a box after this many seconds with no log activity. |
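Together, these settings power cost safety: boxes are terminated when a job finishes, when the control plane exits, after a crash, or when they sit idle.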
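As a rough illustration of how these keys combine, the sketch below merges a partial defaults block over the documented values and evaluates the idle rule. The helper names are hypothetical, and measuring "no log activity" as seconds since the last log line is an assumption.

```python
from __future__ import annotations

# Documented default values from the table above.
BUILTIN_DEFAULTS = {
    "runtime": "torchrun",
    "auto_terminate": True,
    "terminate_on_exit": True,
    "reconcile_on_start": True,
    "idle_timeout_seconds": 1800,
}


def effective_defaults(user_defaults: dict) -> dict:
    """Return the built-in defaults with any user-supplied keys applied on top."""
    return {**BUILTIN_DEFAULTS, **user_defaults}


def is_idle(last_log_at: float, now: float, idle_timeout_seconds: int) -> bool:
    """True once the box has produced no log output for the full timeout."""
    return now - last_log_at >= idle_timeout_seconds
```

For example, `effective_defaults({"idle_timeout_seconds": 600})` shortens the idle window to ten minutes while leaving the other four keys at their defaults.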
Standalone / environment config
The CLI and dashboard build their registry from the environment so they work
without a config file. RUNPOD_API_KEY, VAST_API_KEY,
GOOGLE_APPLICATION_CREDENTIALS (+ GOOGLE_CLOUD_PROJECT / CLOUDSDK_COMPUTE_ZONE),
and WANDB_API_KEY are picked up automatically and merged with any keys saved
from the dashboard. Environment variables take precedence over stored keys.
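The precedence rule can be sketched as a two-layer merge in which the environment is applied last. This is an illustration only: `merged_keys` is a hypothetical helper, and it assumes stored keys are held under the same names as the environment variables, which the actual storage layout may not do.

```python
from __future__ import annotations

import os
from typing import Mapping, Optional

# Variable names listed in the text above.
KNOWN_VARS = (
    "RUNPOD_API_KEY",
    "VAST_API_KEY",
    "GOOGLE_APPLICATION_CREDENTIALS",
    "WANDB_API_KEY",
)


def merged_keys(stored: Mapping[str, str],
                env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Merge stored keys with the environment; environment values win."""
    env = os.environ if env is None else env
    # Start from non-empty stored keys, then let the environment override them.
    merged = {name: value for name, value in stored.items()
              if name in KNOWN_VARS and value}
    for name in KNOWN_VARS:
        value = env.get(name)
        if value:
            merged[name] = value
    return merged
```

A key saved from the dashboard is therefore used only when the matching environment variable is unset or empty.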
State location
The SQLite registry of jobs, instances, and logs lives in ~/.gpu-train
(override with the GPU_TRAIN_HOME environment variable). The same directory
holds the dashboard-managed credentials.json (chmod 600).
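A minimal sketch of resolving the state directory and writing an owner-only file follows. `state_dir` and `write_private_json` are hypothetical helpers written with the standard library; gpu-train's own path handling may differ.

```python
from __future__ import annotations

import os
from pathlib import Path
from typing import Mapping, Optional


def state_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return GPU_TRAIN_HOME if set, otherwise ~/.gpu-train."""
    env = os.environ if env is None else env
    override = env.get("GPU_TRAIN_HOME")
    return Path(override).expanduser() if override else Path.home() / ".gpu-train"


def write_private_json(path: Path, text: str) -> None:
    """Write text to path with mode 600 (owner read/write only)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the file with restrictive permissions from the start.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(text)
    # Enforce the mode even if the file already existed with looser permissions.
    os.chmod(path, 0o600)
```

Pointing GPU_TRAIN_HOME at a temporary directory is a convenient way to keep test runs from touching the real state in the home directory.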
Logging
As a library, gpu-train stays silent (a NullHandler is attached). Opt in
with gpu_train.setup_logging(), or set GPU_TRAIN_LOG_LEVEL=DEBUG for the CLI
and server, which auto-configure handlers.
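The silent-by-default behavior follows the standard library convention for libraries. The sketch below shows that pattern with the logging module alone; the logger name "gpu_train" and the `enable_logging` helper are assumptions for illustration and are not the library's setup_logging().

```python
import logging

# Library side: a NullHandler means no output unless the application opts in.
logger = logging.getLogger("gpu_train")  # logger name is an assumption
logger.addHandler(logging.NullHandler())


def enable_logging(level_name: str = "INFO") -> None:
    """Application side: attach a stream handler and set the level by name."""
    level = getattr(logging, level_name.upper(), None)
    if not isinstance(level, int):
        raise ValueError(f"unknown log level: {level_name!r}")
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(level)
```

An application that already configures the root logger could skip the extra handler and only set the level on this logger.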