Errors
The gpu-train exception hierarchy and when each exception is raised.
All exceptions inherit from GpuTrainError and are importable from gpu_train.
from gpu_train import GpuTrainError, ConfigurationError, ProvisionError
Hierarchy
GpuTrainError
├── ConfigurationError
├── ProviderError
│   ├── ProvisionError
│   │   └── PriceCapExceededError
│   └── LaunchError
└── JobNotFoundError
When each is raised
| Exception | Raised when |
|---|---|
| ConfigurationError | configure() is missing/duplicated, a credential is required but not active, an unknown provider is requested, or a secret_ref can't be resolved. |
| ProviderError | A provider API call fails. |
| ProvisionError | A box can't be acquired or readied (no capacity, never became SSH-ready, missing credential fields). |
| PriceCapExceededError | The cheapest matching offer / pod cost exceeds price_cap. Carries price and cap. |
| LaunchError | The training command fails to launch on the box. |
| JobNotFoundError | A job id isn't present in the registry. |
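Because the exceptions form a tree, an except clause for a parent class also catches its children, so handlers are usually ordered from most specific to most general. The sketch below illustrates that ordering with stand-in classes that mirror the documented hierarchy (real code would import them from gpu_train). The describe helper and its category labels are hypothetical and not part of the library.

```python
# Stand-in classes mirroring the documented hierarchy; in real code,
# import these from gpu_train instead of defining them here.
class GpuTrainError(Exception): pass
class ConfigurationError(GpuTrainError): pass
class ProviderError(GpuTrainError): pass
class ProvisionError(ProviderError): pass
class PriceCapExceededError(ProvisionError): pass
class LaunchError(ProviderError): pass
class JobNotFoundError(GpuTrainError): pass


def describe(exc: Exception) -> str:
    """Hypothetical helper: map an exception to a coarse category label."""
    try:
        raise exc
    except PriceCapExceededError:
        # Most specific first: otherwise ProvisionError would swallow it.
        return "price cap exceeded"
    except ProvisionError:
        return "provisioning failed"
    except ProviderError:
        # Also catches LaunchError, which has no handler of its own here.
        return "provider failure"
    except GpuTrainError:
        # Catch-all for ConfigurationError, JobNotFoundError, and the base class.
        return "other gpu-train error"


print(describe(PriceCapExceededError("offer too expensive")))
print(describe(LaunchError("command exited early")))
print(describe(JobNotFoundError("no such job")))
```

If the PriceCapExceededError clause were placed after the ProvisionError clause, it would never run, since the parent clause matches first.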
GpuTrainError carries an optional provider attribute, included in its string
form (e.g. … | provider=runpod).
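As a rough illustration of how an optional provider attribute could appear in the string form, here is a minimal sketch using stand-in classes. The constructor signatures, the message text, and the exact formatting are assumptions made for this example; only the provider, price, and cap attributes and the "| provider=..." suffix come from the description above. For brevity the stand-in PriceCapExceededError derives directly from the base class rather than from ProvisionError.

```python
from typing import Optional


class GpuTrainError(Exception):
    """Stand-in base class; the real one is imported from gpu_train."""

    def __init__(self, message: str, provider: Optional[str] = None):
        super().__init__(message)
        self.message = message
        self.provider = provider

    def __str__(self) -> str:
        # Assumed format: append " | provider=<name>" only when a provider is set.
        if self.provider is None:
            return self.message
        return f"{self.message} | provider={self.provider}"


class PriceCapExceededError(GpuTrainError):
    """Stand-in that carries the offending price and the configured cap."""

    def __init__(self, price: float, cap: float, provider: Optional[str] = None):
        # Hypothetical message wording, chosen only for this sketch.
        super().__init__(f"price {price} exceeds cap {cap}", provider=provider)
        self.price = price
        self.cap = cap


err = PriceCapExceededError(price=2.5, cap=1.0, provider="runpod")
print(err)
print(err.provider, err.price, err.cap)
```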
In the driver
Inside a running job, errors are caught by the lifecycle driver: the job is
marked failed with the message stored on JobRecord.error, an event is
emitted to the log stream, and the box is always torn down in a finally
block. So a provider failure never leaks a running instance — see
Cost safety.
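The catch, record, and tear-down pattern described above can be sketched roughly as follows. This is not the library's driver: JobRecord here is a stand-in with only the error field taken from the text, and Box, drive, failing_launch, the status strings, and the event list are hypothetical names introduced for the example. Whether the real driver catches only GpuTrainError or broader exceptions is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


class GpuTrainError(Exception):
    """Stand-in for gpu_train.GpuTrainError."""


@dataclass
class JobRecord:
    """Stand-in record; only the error field is named in the docs."""
    job_id: str
    status: str = "running"
    error: Optional[str] = None


@dataclass
class Box:
    """Hypothetical handle for a provisioned instance."""
    torn_down: bool = False


def drive(job: JobRecord, box: Box, launch: Callable[[Box], None],
          events: List[str]) -> JobRecord:
    """Run one job, recording failures and always releasing the box."""
    try:
        launch(box)
        job.status = "succeeded"
    except GpuTrainError as exc:
        # Mark the job failed and keep the message on the record.
        job.status = "failed"
        job.error = str(exc)
        events.append(f"job {job.job_id} failed: {exc}")
    finally:
        # Runs on success, on failure, and on unexpected exceptions alike.
        box.torn_down = True
    return job


def failing_launch(box: Box) -> None:
    """Simulate a launch that fails on the provider side."""
    raise GpuTrainError("training command failed to launch")


events: List[str] = []
box = Box()
job = drive(JobRecord(job_id="job-1"), box, failing_launch, events)
print(job.status, job.error, box.torn_down)
```

Putting the teardown in the finally clause is what ensures the box is released on every exit path, including exceptions the except clause does not handle.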
from gpu_train import run, PriceCapExceededError

try:
    job = run(task="train.py", provider="vastai", gpus="H100:8",
              price_cap=1.0, wait=True)
except PriceCapExceededError as e:
    print(f"too expensive: ${e.price}/hr > ${e.cap}/hr cap")