Distributed training
torchrun and Ray runtimes, single- and multi-node, all inside one provider.
gpu-train turns your task into a launch command using a runtime backend.
Distributed training stays entirely inside the rented provider over its fast
interconnect — the control plane only drives it over SSH and never sits in the
data path.
Runtimes
| Runtime | When | Notes |
|---|---|---|
| torchrun | default | Standard PyTorch DDP. nproc_per_node = your GPU count. |
| ray | [ray] extra | Starts a Ray cluster inside the provider and runs the task. |
| local | local / CPU | Plain python subprocess; the zero-cost path. |
Pick one per run, or set a default in defaults.runtime:

    run(
        task={"entrypoint": "train.py"},
        provider="runpod",
        gpus="A100:8",
        runtime="torchrun",  # "ray" | "local" | provider default if omitted
    )

torchrun (DDP)
With gpus="A100:8", gpu-train launches torchrun with
--nproc_per_node=8 on the box, so your script runs one process per GPU. Write
train.py as a normal DDP entrypoint that reads LOCAL_RANK / RANK /
WORLD_SIZE from the environment.
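
As a rough illustration of such an entrypoint, the sketch below reads the three variables that torchrun exports and uses them to set up DDP. The names DistEnv, read_dist_env and setup_ddp are hypothetical helpers invented for this example, not part of gpu-train; the PyTorch calls are standard, and the "nccl" backend assumes CUDA GPUs are present (use "gloo" on CPU).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    """Process coordinates that torchrun exports to every worker."""
    local_rank: int   # index of this process on its own node (selects the GPU)
    rank: int         # index of this process across all nodes
    world_size: int   # total number of processes in the job


def read_dist_env(env=None):
    """Parse LOCAL_RANK / RANK / WORLD_SIZE, falling back to a 1-process job."""
    env = os.environ if env is None else env
    return DistEnv(
        local_rank=int(env.get("LOCAL_RANK", 0)),
        rank=int(env.get("RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


def setup_ddp(model):
    """Join the process group and wrap the model in DistributedDataParallel.

    torch is imported lazily so read_dist_env stays usable without it.
    """
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    cfg = read_dist_env()
    # The default env:// init reads RANK, WORLD_SIZE, MASTER_ADDR and
    # MASTER_PORT from the environment that torchrun prepared.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(cfg.local_rank)
    device = torch.device("cuda", cfg.local_rank)
    return DDP(model.to(device), device_ids=[cfg.local_rank])
```

In train.py you would call setup_ddp(model) once before the training loop, and typically restrict logging and checkpointing to the process where rank is 0.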
Ray
Install the [ray] extra. The Ray backend starts a head node inside the
provider and submits your task to it — useful for non-DDP parallelism or Ray-native
workloads.
Multi-node
Request more than one node with nodes:

    run(task={"entrypoint": "train.py"}, provider="runpod", gpus="A100:8", nodes=2)

Multi-node support depends on the provider's catalog entry advertising
supports_multinode. The rendezvous (master address/port, node rank) is wired
up for you across the rented nodes.
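
To make the rendezvous wiring concrete, here is a minimal sketch of the kind of per-node torchrun invocation it amounts to. This is an assumption for illustration, not gpu-train's actual launch command: torchrun_argv and global_rank are hypothetical helpers, the address 10.0.0.1 and port 29500 are placeholders, and the example assumes 8 processes on each of 2 nodes. The flags themselves are standard torchrun options.

```python
def torchrun_argv(entrypoint, nproc_per_node, nnodes, node_rank,
                  master_addr, master_port=29500):
    """Build the torchrun argument list for one node of a static multi-node job."""
    return [
        "torchrun",
        f"--nnodes={nnodes}",                  # total number of nodes
        f"--nproc_per_node={nproc_per_node}",  # one process per GPU on this node
        f"--node_rank={node_rank}",            # 0 on the master node, then 1, 2, ...
        f"--master_addr={master_addr}",        # address every node can reach
        f"--master_port={master_port}",
        entrypoint,
    ]


def global_rank(node_rank, nproc_per_node, local_rank):
    """RANK as a worker sees it, when every node runs the same process count."""
    return node_rank * nproc_per_node + local_rank


# Every node gets the same command apart from --node_rank.
commands = [
    torchrun_argv("train.py", nproc_per_node=8, nnodes=2,
                  node_rank=r, master_addr="10.0.0.1")
    for r in range(2)
]
world_size = 2 * 8  # nnodes * nproc_per_node
```

Under these assumptions WORLD_SIZE is 16, and the process with LOCAL_RANK 3 on the second node sees RANK 11.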
Dependencies
Declare environment setup in the task's deps so the box is ready before launch:

    run(
        task={
            "entrypoint": "train.py",
            "deps": {
                "requirements": ["torch==2.4.0", "transformers"],
                # or: "requirements_file": "requirements.txt",
                "setup": ["apt-get install -y git"],
            },
        },
        provider="runpod",
        gpus="A100:4",
    )

gpu-train rsyncs your project_dir to the box, runs setup commands, then
pip installs your requirements before launching the entrypoint.
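
The ordering described above can be sketched as a small planning function. This is only a model of the sequence (sync, setup, install, launch) under stated assumptions: plan_bootstrap is a hypothetical helper, the "rsync" and "launch" entries are descriptive labels rather than real commands, and the exact pip invocation gpu-train uses may differ.

```python
import shlex


def plan_bootstrap(deps, entrypoint):
    """Return the ordered steps that prepare a box before the entrypoint runs."""
    steps = ["rsync project_dir to box"]       # label only, not a real command
    steps += list(deps.get("setup", []))       # setup commands run first, in order
    if "requirements_file" in deps:
        steps.append("pip install -r " + shlex.quote(deps["requirements_file"]))
    elif deps.get("requirements"):
        pkgs = " ".join(shlex.quote(r) for r in deps["requirements"])
        steps.append("pip install " + pkgs)
    steps.append("launch " + entrypoint)       # label only; the runtime decides how
    return steps


plan = plan_bootstrap(
    {
        "requirements": ["torch==2.4.0", "transformers"],
        "setup": ["apt-get install -y git"],
    },
    "train.py",
)
```

Because setup commands run before the pip step, system packages that a Python dependency needs at install time (for example git for VCS requirements) belong in setup.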