Research Commons

Distributed training

torchrun and Ray runtimes, single- and multi-node, all inside one provider.

gpu-train turns your task into a launch command using a runtime backend. Distributed training stays entirely inside the rented provider over its fast interconnect — the control plane only drives it over SSH and never sits in the data path.

Runtimes

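| Runtime  | When        | Notes                                                       |
|----------|-------------|-------------------------------------------------------------|
| torchrun | default     | Standard PyTorch DDP. nproc_per_node = your GPU count.      |
| ray      | [ray] extra | Starts a Ray cluster inside the provider and runs the task. |
| local    | local / CPU | Plain Python subprocess; the zero-cost path.                |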

Pick one per run, or set a default in defaults.runtime:

run(
    task={"entrypoint": "train.py"},
    provider="runpod",
    gpus="A100:8",
    runtime="torchrun",   # "ray" | "local" | provider default if omitted
)

torchrun (DDP)

With gpus="A100:8", gpu-train launches torchrun with --nproc_per_node=8 on the box, so your script runs one process per GPU. Write train.py as a normal DDP entrypoint that reads LOCAL_RANK / RANK / WORLD_SIZE from the environment.
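
As a rough illustration of such an entrypoint, the sketch below reads the three variables and wraps a placeholder model in DDP. The helper name read_dist_env, the toy Linear model, and the training loop are assumptions made for this example and are not part of gpu-train; the nccl backend assumes NVIDIA GPUs on the box.

```python
import os


def read_dist_env(env=os.environ):
    """Return (rank, local_rank, world_size) as exported by torchrun.

    The fallbacks let the same script run as a single plain process.
    """
    rank = int(env.get("RANK", 0))
    local_rank = int(env.get("LOCAL_RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return rank, local_rank, world_size


def main():
    # torch is imported here so the helper above stays usable without it.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    rank, local_rank, world_size = read_dist_env()

    # The default env:// init method reads MASTER_ADDR / MASTER_PORT,
    # which torchrun also exports.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # Placeholder model (assumption): substitute your own.
    model = torch.nn.Linear(16, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for step in range(10):
        x = torch.randn(32, 16, device=f"cuda:{local_rank}")
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()  # DDP all-reduces gradients across processes here
        optimizer.step()
        if rank == 0:  # log from one process only
            print(f"step {step} loss {loss.item():.4f} world_size {world_size}")

    dist.destroy_process_group()
```

In a real train.py you would finish the file with the usual `if __name__ == "__main__": main()` guard so that each torchrun worker process calls main().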

Ray

Install the [ray] extra. The Ray backend starts a head node inside the provider and submits your task to it — useful for non-DDP parallelism or Ray-native workloads.
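
To make "non-DDP parallelism" concrete, here is a minimal sketch of a Ray-native entrypoint that fans a small hyperparameter grid out as independent tasks. It assumes the script attaches to the already-running cluster with ray.init(address="auto"); whether gpu-train expects that call in your script is not stated above. The names make_grid and train_one and the fake score are illustrative only.

```python
import itertools


def make_grid(learning_rates, batch_sizes):
    """Cartesian product of hyperparameters as a list of config dicts."""
    return [
        {"lr": lr, "batch_size": bs}
        for lr, bs in itertools.product(learning_rates, batch_sizes)
    ]


def train_one(config):
    """Placeholder trial (assumption): replace with real training."""
    return {"config": config, "score": config["lr"] * config["batch_size"]}


def main():
    # Imported here so the helpers above work without Ray installed.
    import ray

    # address="auto" connects to an existing cluster instead of starting one.
    ray.init(address="auto")

    # Wrap the plain function as a Ray task that reserves one GPU per trial.
    remote_train = ray.remote(num_gpus=1)(train_one)

    grid = make_grid([1e-3, 1e-4], [32, 64])
    futures = [remote_train.remote(cfg) for cfg in grid]
    results = ray.get(futures)  # blocks until every trial has finished

    best = max(results, key=lambda r: r["score"])
    print("best config:", best["config"])
```

Each trial runs as its own task with no gradient synchronisation between them, which is the kind of workload that does not fit the torchrun process-per-GPU model.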

Multi-node

Request more than one node with nodes:

run(task={"entrypoint": "train.py"}, provider="runpod", gpus="A100:8", nodes=2)

Multi-node support depends on the provider's catalog entry advertising supports_multinode. The rendezvous (master address/port, node rank) is wired up for you across the rented nodes.
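
For intuition about what that wiring amounts to, the sketch below builds one torchrun command per node using torchrun's static rendezvous flags. It is not gpu-train's actual implementation: the function names, the example addresses, port 29500, and the rule that a spec without a count means one GPU are all assumptions.

```python
def parse_gpus(spec):
    """Split a spec such as "A100:8" into (gpu_type, count)."""
    gpu_type, _, count = spec.partition(":")
    # Assumption: a spec with no ":N" suffix means a single GPU.
    return gpu_type, (int(count) if count else 1)


def torchrun_commands(entrypoint, gpus, node_addrs, master_port=29500):
    """Return one torchrun argv per node.

    node_addrs[0] acts as the master; every node gets its own --node_rank.
    """
    _, nproc = parse_gpus(gpus)
    master_addr = node_addrs[0]
    commands = []
    for node_rank in range(len(node_addrs)):
        commands.append([
            "torchrun",
            f"--nnodes={len(node_addrs)}",
            f"--nproc_per_node={nproc}",
            f"--node_rank={node_rank}",
            f"--master_addr={master_addr}",
            f"--master_port={master_port}",
            entrypoint,
        ])
    return commands


# Two hypothetical nodes with 8 GPUs each: 16 processes in total.
cmds = torchrun_commands("train.py", "A100:8", ["10.0.0.1", "10.0.0.2"])
for cmd in cmds:
    print(" ".join(cmd))
```

Every node receives the same master address and port and differs only in --node_rank, so the script sees WORLD_SIZE equal to nodes times GPUs per node.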

Dependencies

Declare environment setup in the task's deps so the box is ready before launch:

run(
    task={
        "entrypoint": "train.py",
        "deps": {
            "requirements": ["torch==2.4.0", "transformers"],
            # or: "requirements_file": "requirements.txt",
            "setup": ["apt-get install -y git"],
        },
    },
    provider="runpod",
    gpus="A100:4",
)

gpu-train rsyncs your project_dir to the box, runs setup commands, then pip installs your requirements before launching the entrypoint.
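
The ordering can be pictured with a small sketch that turns a deps dict into an ordered plan. This is only an illustration of the sequence described above. The function name, the host alias "box", the remote directory /workspace, and the exact rsync and pip command lines are assumptions, not gpu-train's real commands.

```python
import shlex


def provision_plan(deps, project_dir=".", remote="box", remote_dir="/workspace"):
    """Return (where, command) pairs in order: sync, setup, then pip."""
    src = shlex.quote(project_dir.rstrip("/") + "/")
    plan = [("local", f"rsync -az {src} {remote}:{remote_dir}/")]

    # Setup commands run before pip so system packages are already present.
    for cmd in deps.get("setup", []):
        plan.append(("remote", cmd))

    if "requirements" in deps:
        pkgs = " ".join(shlex.quote(r) for r in deps["requirements"])
        plan.append(("remote", f"pip install {pkgs}"))
    elif "requirements_file" in deps:
        path = shlex.quote(deps["requirements_file"])
        plan.append(("remote", f"pip install -r {path}"))

    return plan


plan = provision_plan({
    "requirements": ["torch==2.4.0", "transformers"],
    "setup": ["apt-get install -y git"],
})
for where, command in plan:
    print(f"[{where}] {command}")
```

Putting setup before pip matters when a requirement needs a system tool to install, such as git for a package pinned to a repository URL.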