Configuration

The registry, configure(), provider defaults, and where gpu-train keeps its state.

gpu-train is configured once with a registry — a dict describing your credentials, optional per-provider overrides, tracking, and runtime defaults. This mirrors llm-rotate's configure() model.

configure()

from gpu_train import configure
 
configure(
    use=["runpod-1"],            # which credential ids to activate
    registry={...},              # the registry dict (below)
)

configure() accepts either a registry dict or a prebuilt config object:

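from gpu_train import configure, configure_from_dict

# `registry` is the dict described under "The registry"
cfg = configure_from_dict(registry, use=["runpod-1"])  # build the config object first
configure(config=cfg)                                  # then install it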

configure() must be called before the first use of gpu/run, and only once per process. In tests, call gpu_train.reset_singleton() first if you need to reconfigure.

The registry

registry = {
    "credentials": [
        {"cred_id": "runpod-1", "provider": "runpod",
         "secret_ref": "env://RUNPOD_API_KEY"},
    ],
    "providers": {                      # optional per-provider catalog overrides
        "runpod": {"default_image": "runpod/pytorch:2.4.0-cu124"},
    },
    "tracking": {
        "wandb": {"secret_ref": "env://WANDB_API_KEY", "project": "my-proj",
                  "entity": "my-team", "enabled": True},
    },
    "defaults": {
        "runtime": "torchrun",
        "auto_terminate": True,
        "terminate_on_exit": True,
        "reconcile_on_start": True,
        "idle_timeout_seconds": 1800,
    },
}

credentials

A list of credential entries. Each has a cred_id, a provider, and a secret_ref (see Credentials & secrets). Some providers carry extra fields (ssh_key_ref, region, extra) — see Providers. Only the ids listed in use=[...] are activated.
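To make the activation rule concrete, here is a minimal sketch in plain Python that filters a registry's credentials by the ids passed in use=[...] and resolves an env:// reference. The helpers select_credentials and resolve_secret_ref are hypothetical and not part of gpu-train. The "vast-1" entry and its "vast" provider label are illustrative, and reading env://NAME as "the environment variable NAME" is an assumption inferred from the registry example.

```python
import os


def select_credentials(registry, use):
    """Return the credential entries whose cred_id appears in `use` (hypothetical helper)."""
    by_id = {c["cred_id"]: c for c in registry.get("credentials", [])}
    missing = [cid for cid in use if cid not in by_id]
    if missing:
        raise KeyError(f"unknown credential ids: {missing}")
    return [by_id[cid] for cid in use]


def resolve_secret_ref(secret_ref, environ=None):
    """Resolve an env:// reference (assumed semantics: read that environment variable)."""
    environ = os.environ if environ is None else environ
    prefix = "env://"
    if not secret_ref.startswith(prefix):
        raise ValueError(f"unsupported secret_ref scheme: {secret_ref!r}")
    name = secret_ref[len(prefix):]
    if name not in environ:
        raise KeyError(f"environment variable {name} is not set")
    return environ[name]


registry = {
    "credentials": [
        {"cred_id": "runpod-1", "provider": "runpod",
         "secret_ref": "env://RUNPOD_API_KEY"},
        # Illustrative second entry; the provider label is an assumption.
        {"cred_id": "vast-1", "provider": "vast",
         "secret_ref": "env://VAST_API_KEY"},
    ],
}

# Only the ids named in `use` are returned; "vast-1" stays inactive.
active = select_credentials(registry, use=["runpod-1"])
print([c["cred_id"] for c in active])
```

Passing an explicit environ mapping to resolve_secret_ref keeps the sketch easy to exercise in tests without touching the real environment.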

providers

Optional overrides, keyed by provider name, layered on top of the built-in provider catalog (default image, disk, ssh user, runtime, etc.). Providers you don't mention keep their built-in defaults.
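The layering can be pictured as a per-provider dictionary merge. The sketch below is an illustration only: BUILTIN_CATALOG is a stand-in with invented values (the real catalog is not reproduced here), effective_catalog is a hypothetical helper, and a shallow key-by-key merge is an assumption about how overrides combine.

```python
import copy

# Stand-in for the built-in catalog. These values are invented for illustration only.
BUILTIN_CATALOG = {
    "runpod": {"default_image": "example/base:latest", "ssh_user": "root"},
    "vast": {"default_image": "example/base:latest", "ssh_user": "root"},
}


def effective_catalog(builtin, overrides):
    """Layer per-provider overrides on top of a built-in catalog (hypothetical helper)."""
    merged = copy.deepcopy(builtin)  # never mutate the built-in defaults
    for name, settings in overrides.items():
        merged.setdefault(name, {}).update(settings)
    return merged


# Same shape as the "providers" block of the registry.
overrides = {"runpod": {"default_image": "runpod/pytorch:2.4.0-cu124"}}
catalog = effective_catalog(BUILTIN_CATALOG, overrides)

print(catalog["runpod"]["default_image"])          # the overridden image
print(catalog["runpod"]["ssh_user"])               # unmentioned keys are kept
print(catalog["vast"] == BUILTIN_CATALOG["vast"])  # unmentioned provider is unchanged
```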

tracking

A wandb block enables automatic experiment tracking — see Monitoring.

defaults

Key                    Default    Meaning
runtime                torchrun   Execution backend when not overridden per-run.
auto_terminate         true       Terminate the box when a job finishes.
terminate_on_exit      true       Terminate active boxes on control-plane exit / Ctrl-C.
reconcile_on_start     true       On startup, terminate boxes orphaned by a prior crash.
idle_timeout_seconds   1800       Kill a box after this many seconds with no log activity.

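Together, these settings provide cost safety: a box is terminated when its job finishes, when the control plane exits, when it is found orphaned at startup, or when it sits idle past the timeout.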
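As a rough illustration of how these keys behave, the sketch below fills in omitted keys from the documented default values and shows one way an idle check could be expressed. resolve_defaults and is_idle are hypothetical helpers, not gpu-train functions. Treating a box as idle once the elapsed time reaches the timeout is an assumption about the exact comparison.

```python
# Default values as listed in the table above.
DOCUMENTED_DEFAULTS = {
    "runtime": "torchrun",
    "auto_terminate": True,
    "terminate_on_exit": True,
    "reconcile_on_start": True,
    "idle_timeout_seconds": 1800,
}


def resolve_defaults(user_defaults):
    """Fill in any key the user omitted with its documented default (hypothetical helper)."""
    return {**DOCUMENTED_DEFAULTS, **user_defaults}


def is_idle(last_log_time, now, idle_timeout_seconds):
    """True once no log activity has been seen for the timeout (assumed comparison)."""
    return (now - last_log_time) >= idle_timeout_seconds


# A registry that only shortens the idle timeout to ten minutes.
resolved = resolve_defaults({"idle_timeout_seconds": 600})
print(resolved["runtime"])               # falls back to the documented default
print(resolved["idle_timeout_seconds"])  # the user's value wins

# Timestamps are plain seconds, e.g. from time.monotonic().
print(is_idle(last_log_time=0.0, now=599.0,
              idle_timeout_seconds=resolved["idle_timeout_seconds"]))
print(is_idle(last_log_time=0.0, now=600.0,
              idle_timeout_seconds=resolved["idle_timeout_seconds"]))
```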

Standalone / environment config

The CLI and dashboard build their registry from the environment so they work without a config file. RUNPOD_API_KEY, VAST_API_KEY, GOOGLE_APPLICATION_CREDENTIALS (+ GOOGLE_CLOUD_PROJECT / CLOUDSDK_COMPUTE_ZONE), and WANDB_API_KEY are picked up automatically, and merged with any keys saved from the dashboard. Environment variables take precedence over stored keys.
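The precedence rule (environment over stored keys) can be sketched as a simple merge. This is an illustration under stated assumptions: merge_keys is a hypothetical helper, and the stored keys are modelled as a flat dict keyed by variable name, which may not match the real on-disk format.

```python
import os

# API-key environment variables named in the text above.
ENV_VARS = ("RUNPOD_API_KEY", "VAST_API_KEY", "WANDB_API_KEY")


def merge_keys(stored, environ=None):
    """Combine stored keys with environment keys; the environment wins on conflict."""
    environ = os.environ if environ is None else environ
    merged = dict(stored)  # start from the stored keys
    for var in ENV_VARS:
        value = environ.get(var)
        if value:          # unset or empty variables do not override
            merged[var] = value
    return merged


stored = {"RUNPOD_API_KEY": "stored-runpod", "WANDB_API_KEY": "stored-wandb"}
environ = {"RUNPOD_API_KEY": "env-runpod", "VAST_API_KEY": "env-vast"}

keys = merge_keys(stored, environ)
print(keys["RUNPOD_API_KEY"])  # environment value replaces the stored one
print(keys["WANDB_API_KEY"])   # only stored, so the stored value is used
print(keys["VAST_API_KEY"])    # only in the environment
```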

State location

The SQLite registry of jobs, instances, and logs lives in ~/.gpu-train (override with the GPU_TRAIN_HOME environment variable). The same directory holds the dashboard-managed credentials.json (chmod 600).
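If a script needs to locate this directory itself, the lookup described above can be mirrored with the standard library. The function names state_dir and is_owner_only are hypothetical. The permission check assumes a POSIX file system and simply tests that no group or other bits are set, which is what chmod 600 implies.

```python
import os
import stat
from pathlib import Path


def state_dir(environ=None):
    """Return GPU_TRAIN_HOME if set, otherwise ~/.gpu-train (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    override = environ.get("GPU_TRAIN_HOME")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".gpu-train"


def is_owner_only(path):
    """True if the file grants no permissions to group or others (POSIX only)."""
    mode = stat.S_IMODE(Path(path).stat().st_mode)
    return mode & 0o077 == 0


print(state_dir({"GPU_TRAIN_HOME": "/tmp/gpu-train-test"}))
print(state_dir({}) == Path.home() / ".gpu-train")

creds = state_dir() / "credentials.json"
if creds.exists():
    print("credentials.json is owner-only:", is_owner_only(creds))
```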

Logging

As a library, gpu-train stays silent (a NullHandler is attached). Opt in with gpu_train.setup_logging(), or set GPU_TRAIN_LOG_LEVEL=DEBUG for the CLI and server, which auto-configure handlers.
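Applications that already manage their own logging may prefer to attach a handler with the standard library rather than call setup_logging(). The sketch below uses only Python's logging module. It assumes the library logs under a logger named "gpu_train", which is a guess based on the package name, and enable_library_logs is a hypothetical helper.

```python
import logging

# Assumed logger name, derived from the package name; not confirmed by the docs.
logger = logging.getLogger("gpu_train")


def enable_library_logs(level=logging.INFO):
    """Attach a stream handler to the assumed library logger and set its level."""
    handler = logging.StreamHandler()  # writes to stderr by default
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(level)
    return handler


handler = enable_library_logs(logging.DEBUG)
logger.debug("library logging enabled")
```

Keeping the returned handler lets the caller remove it later with logger.removeHandler(handler).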