Errors

All exceptions inherit from GpuTrainError and are importable from gpu_train.

from gpu_train import GpuTrainError, ConfigurationError, ProvisionError

Hierarchy

GpuTrainError
├── ConfigurationError
├── ProviderError
│   ├── ProvisionError
│   │   └── PriceCapExceededError
│   └── LaunchError
└── JobNotFoundError
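
Because the tree is an ordinary class hierarchy, an except clause for a parent class also catches its children. The sketch below illustrates this with local stand-in classes that mirror the tree above. They are assumptions made so the snippet runs without gpu_train installed, not the library's real definitions, and the `catches` helper is hypothetical.

```python
# Stand-in classes mirroring the documented tree; NOT the real gpu_train definitions.
class GpuTrainError(Exception): ...
class ConfigurationError(GpuTrainError): ...
class ProviderError(GpuTrainError): ...
class ProvisionError(ProviderError): ...
class PriceCapExceededError(ProvisionError): ...
class LaunchError(ProviderError): ...
class JobNotFoundError(GpuTrainError): ...


def catches(handler: type, raised: type) -> bool:
    """Return True if `except handler` would catch an instance of `raised`."""
    try:
        raise raised("demo")
    except handler:
        return True
    except Exception:
        return False


print(catches(ProviderError, PriceCapExceededError))  # True
print(catches(ProviderError, JobNotFoundError))       # False
print(catches(GpuTrainError, LaunchError))            # True
```

In practice this means handlers should be ordered from most specific to least specific, for example PriceCapExceededError before ProvisionError before GpuTrainError.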

When each is raised

| Exception | Raised when |
| --- | --- |
| `ConfigurationError` | `configure()` is missing/duplicated, a credential is required but not active, an unknown provider is requested, or a `secret_ref` can't be resolved. |
| `ProviderError` | A provider API call fails. |
| `ProvisionError` | A box can't be acquired or readied (no capacity, never became SSH-ready, missing credential fields). |
| `PriceCapExceededError` | The cheapest matching offer / pod cost exceeds `price_cap`. Carries `price` and `cap`. |
| `LaunchError` | The training command fails to launch on the box. |
| `JobNotFoundError` | A job id isn't present in the registry. |

GpuTrainError carries an optional provider attribute, included in its string form (e.g. … | provider=runpod).
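
One way such a base class could be written is sketched below. This is an assumption about the shape, not the library's actual code: the constructor signature, the `message` attribute, and the exact string layout are hypothetical, and the "no capacity" message is made up for the demo.

```python
from typing import Optional


class GpuTrainError(Exception):
    """Stand-in base class; the real gpu_train implementation may differ."""

    def __init__(self, message: str, provider: Optional[str] = None) -> None:
        super().__init__(message)
        self.message = message
        self.provider = provider  # optional, e.g. "runpod"

    def __str__(self) -> str:
        # Append the provider only when one was given (assumed format).
        if self.provider is None:
            return self.message
        return f"{self.message} | provider={self.provider}"


err = GpuTrainError("no capacity", provider="runpod")
print(err)           # no capacity | provider=runpod
print(err.provider)  # runpod
```

Reading `provider` from the attribute is more robust than parsing it back out of the string form.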

In the driver

Inside a running job, the lifecycle driver catches errors. It marks the job failed, stores the message on JobRecord.error, emits an event to the log stream, and always tears down the box in a finally block. A provider failure therefore never leaks a running instance (see Cost safety).

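from gpu_train import run, PriceCapExceededError

try:
    # Request H100:8 on vastai with a price cap of $1.00/hr.
    job = run(task="train.py", provider="vastai", gpus="H100:8",
              price_cap=1.0, wait=True)
except PriceCapExceededError as e:
    # e.price is the cheapest matching offer; e.cap is the price_cap that was set.
    print(f"too expensive: ${e.price}/hr > ${e.cap}/hr cap")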
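
The driver behaviour described in this section (mark the job failed, store the message, emit an event, always tear down) follows a standard try/except/finally pattern. The following is a minimal sketch of that pattern under stated assumptions, not the library's driver: `drive`, `work`, `teardown`, and the `status` and `events` fields are hypothetical, and only the `error` field of JobRecord is named in the text above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class JobRecord:
    """Stand-in record; only `error` is named in the docs, the rest is assumed."""
    status: str = "running"
    error: Optional[str] = None
    events: List[str] = field(default_factory=list)


def drive(job: JobRecord, work: Callable[[], None],
          teardown: Callable[[], None]) -> JobRecord:
    """Run `work`, record any failure on the job, and always call `teardown`."""
    try:
        work()
        job.status = "succeeded"
    except Exception as exc:  # the real driver may catch a narrower type
        job.status = "failed"
        job.error = str(exc)
        job.events.append(f"job failed: {exc}")  # stand-in for the log stream
    finally:
        teardown()  # runs on success and on failure
    return job


def failing_work() -> None:
    raise RuntimeError("provider API call failed")


torn_down: List[bool] = []
record = drive(JobRecord(), failing_work, lambda: torn_down.append(True))
print(record.status, record.error, torn_down)
```

Putting the teardown call in the finally block is what guarantees it runs whether `work` returns normally or raises.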
