Research Commons

Errors

The gpu-train exception hierarchy and when each is raised.

All exceptions inherit from GpuTrainError and are importable from gpu_train.

from gpu_train import GpuTrainError, ConfigurationError, ProvisionError

Hierarchy

GpuTrainError
├── ConfigurationError
├── ProviderError
│   ├── ProvisionError
│   │   └── PriceCapExceededError
│   └── LaunchError
└── JobNotFoundError
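
Because the hierarchy is nested, an except clause for a parent class also catches its children, so handlers should be ordered from most to least specific. The sketch below mirrors the tree with locally defined stand-in classes. The real classes live in gpu_train and their constructors may differ. The classify helper is hypothetical and exists only to show which handler would run.

```python
# Stand-in classes that mirror the documented tree. These are local
# placeholders for illustration, not the library's actual definitions.
class GpuTrainError(Exception): ...
class ConfigurationError(GpuTrainError): ...
class ProviderError(GpuTrainError): ...
class ProvisionError(ProviderError): ...
class PriceCapExceededError(ProvisionError): ...
class LaunchError(ProviderError): ...
class JobNotFoundError(GpuTrainError): ...


def classify(exc: GpuTrainError) -> str:
    """Return the name of the first handler that matches, most specific first."""
    try:
        raise exc
    except PriceCapExceededError:
        return "price-cap"
    except ProvisionError:
        return "provision"
    except ProviderError:
        return "provider"
    except GpuTrainError:
        return "generic"


print(classify(PriceCapExceededError("over cap")))   # price-cap
print(classify(LaunchError("launch failed")))        # provider
print(classify(JobNotFoundError("no such job")))     # generic
```

Catching ProviderError is enough if the caller only needs to know that something went wrong on the provider side; catching GpuTrainError covers everything the library raises.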

When each is raised

| Exception | Raised when |
| --- | --- |
| ConfigurationError | configure() is missing or duplicated, a credential is required but not active, an unknown provider is requested, or a secret_ref can't be resolved. |
| ProviderError | A provider API call fails. |
| ProvisionError | A box can't be acquired or readied (no capacity, never became SSH-ready, missing credential fields). |
| PriceCapExceededError | The cheapest matching offer or pod cost exceeds price_cap. Carries price and cap. |
| LaunchError | The training command fails to launch on the box. |
| JobNotFoundError | A job id isn't present in the registry. |

GpuTrainError carries an optional provider attribute, included in its string form (e.g. … | provider=runpod).
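
One way such a base class could be written is sketched below. This is an assumption about the shape of the class, not the library's actual source: the constructor signature, the message attribute, and the example message "no capacity" are all hypothetical. Only the provider attribute and the trailing "| provider=..." suffix come from the description above.

```python
from typing import Optional


class GpuTrainError(Exception):
    """Hypothetical stand-in; the real class lives in gpu_train."""

    def __init__(self, message: str, provider: Optional[str] = None):
        super().__init__(message)
        self.message = message
        self.provider = provider  # optional, None when not provider-specific

    def __str__(self) -> str:
        # Append the provider suffix only when a provider was supplied.
        if self.provider is None:
            return self.message
        return f"{self.message} | provider={self.provider}"


err = GpuTrainError("no capacity", provider="runpod")
print(err)           # no capacity | provider=runpod
print(err.provider)  # runpod
```

Reading err.provider directly is more robust than parsing the string form when the caller needs to branch on the provider.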

In the driver

Inside a running job, errors are caught by the lifecycle driver: the job is marked failed with the message stored on JobRecord.error, an event is emitted to the log stream, and the box is always torn down in a finally block. A provider failure therefore never leaks a running instance; see Cost safety.

from gpu_train import run, PriceCapExceededError

try:
    job = run(task="train.py", provider="vastai", gpus="H100:8",
              price_cap=1.0, wait=True)
except PriceCapExceededError as e:
    # The exception carries the offending price and the configured cap.
    print(f"too expensive: ${e.price}/hr > ${e.cap}/hr cap")
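
The driver behaviour described in this section (mark the job failed, store the message, emit an event, always tear down) follows a plain try/except/finally pattern. The sketch below illustrates that pattern with stand-in types. The drive function, the status and events fields, and the teardown callback are hypothetical names chosen for the example; only JobRecord.error is named in the documentation, and the real driver may catch a different set of exceptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


class GpuTrainError(Exception):
    """Stand-in for the library's base exception."""


@dataclass
class JobRecord:
    """Hypothetical record; only the `error` field is named in the docs."""
    status: str = "running"
    error: Optional[str] = None
    events: List[str] = field(default_factory=list)


def drive(record: JobRecord, work: Callable[[], None],
          teardown: Callable[[], None]) -> JobRecord:
    """Run `work`, record any failure, and always call `teardown`."""
    try:
        work()
        record.status = "succeeded"
    except GpuTrainError as exc:
        record.status = "failed"
        record.error = str(exc)                      # message kept on the record
        record.events.append(f"job failed: {exc}")   # stand-in for the log stream
    finally:
        teardown()  # runs on both success and failure
    return record


torn_down: List[bool] = []


def failing_work() -> None:
    raise GpuTrainError("provider API call failed")


record = drive(JobRecord(), failing_work, lambda: torn_down.append(True))
print(record.status, "|", record.error, "|", torn_down)
```

Putting the teardown in finally rather than in the except branch is what guarantees the box is released even when the work succeeds or raises something unexpected.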