
Multi-Node Training

Multi-node training allows you to scale your deep learning workloads across multiple machines, each with multiple GPUs, for training very large models or processing massive datasets.

Overview

TensorPool makes it easy to deploy multi-node GPU clusters with high-speed interconnects. All nodes in a cluster are connected with low-latency networking optimized for distributed training.

Supported Instance Types

Currently, multi-node deployments are supported for:
  • 8xB200 - 2 or more nodes, each with 8 B200 GPUs
  • 8xH200 - 2 or more nodes, each with 8 H200 GPUs
More instance types will support multi-node configurations soon.

Creating a Multi-Node Cluster

Create a multi-node cluster by specifying the number of nodes with the -n flag:
# 2-node cluster with 8xH200 each (16 GPUs total)
tp cluster create -i ~/.ssh/id_ed25519.pub -t 8xH200 -n 2 --name my-multinode-cluster

# 4-node cluster with 8xB200 each (32 GPUs total)
tp cluster create -i ~/.ssh/id_ed25519.pub -t 8xB200 -n 4 --name large-training
Multi-node deployments are only available for 8xH200 and 8xB200 instance types. Single H100 instances are available, but multi-node H100 clusters are not supported.
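Once the cluster is created, you can confirm that all nodes came up using the same commands referenced under SSH Access below:
# List your clusters and the nodes in each one
tp cluster list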

Cluster Configuration

Network Setup

Each node in your cluster:
  • Has its own IP address for SSH access
  • Is connected to other nodes via high-speed networking
  • Can communicate with all other nodes in the cluster
  • Has access to shared NFS storage (if attached)

SSH Access

You can SSH into any node in your cluster. When you run tp cluster list or tp cluster info, you’ll see the instance IDs for all nodes.
# SSH into the first node
tp ssh connect <node-1-instance-id>

# SSH into the second node
tp ssh connect <node-2-instance-id>

Distributed Training Frameworks

Multi-node training requires a distributed training framework. Popular options include:

PyTorch Distributed Data Parallel (DDP)

PyTorch’s built-in distributed training:
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize the process group (use the NCCL backend for GPU training)
dist.init_process_group(backend="nccl")

# Pin this process to its local GPU (torchrun sets LOCAL_RANK for each process)
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Wrap your model
model = YourModel().to(local_rank)
model = DDP(model, device_ids=[local_rank])

# Train as usual
Launch with torchrun on every node, setting --node_rank to that node's index (the command below is for the first node):
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=<node-1-ip> --master_port=29500 train.py
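On the second node, the same command runs with --node_rank=1, still pointing at the first node's address:
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=<node-1-ip> --master_port=29500 train.py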

DeepSpeed

Microsoft’s DeepSpeed for training very large models:
import deepspeed

# DeepSpeed configuration
ds_config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2}
}

# Initialize the DeepSpeed engine; pass the model's parameters so DeepSpeed
# can build and wrap the optimizer (or pass an existing optimizer instead)
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config
)
Launch with the DeepSpeed launcher, pointing it at a hostfile that describes your nodes:
deepspeed --num_gpus=8 --num_nodes=2 --hostfile=hostfile train.py
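The hostfile lists each node and how many GPU slots it exposes; a minimal example for this 2-node, 8-GPU-per-node cluster (with placeholder addresses) is:
<node-1-ip> slots=8
<node-2-ip> slots=8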

Horovod

Uber’s Horovod for distributed training:
import horovod.torch as hvd

# Initialize Horovod
hvd.init()

# Pin each process to the GPU matching its local rank
torch.cuda.set_device(hvd.local_rank())

# Broadcast the initial model state from rank 0 so all workers start identically
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

# Wrap the optimizer so gradients are averaged across all workers
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
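Launch with the Horovod launcher, giving it the total process count and each node's address with its GPU slot count (addresses below are placeholders; Horovod starts the remote workers over SSH or MPI):
horovodrun -np 16 -H <node-1-ip>:8,<node-2-ip>:8 python train.py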

Best Practices

1. Use High-Speed Interconnects

TensorPool clusters come with high-speed networking; use the NCCL backend with PyTorch to take full advantage of it:
dist.init_process_group(backend="nccl")

2. Batch Size Scaling

When scaling to more nodes, keep the per-GPU batch size constant and let the global (effective) batch size grow with the total GPU count (see the sketch after this list):
  • 1 node (8 GPUs): global batch_size = 256
  • 2 nodes (16 GPUs): global batch_size = 512
  • 4 nodes (32 GPUs): global batch_size = 1024
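The relationship is just a multiplication; a minimal sketch (the per-GPU batch size of 32 is an illustrative value, and the process group is assumed to be initialized as in the DDP examples above):
import torch.distributed as dist

per_gpu_batch_size = 32                          # illustrative per-GPU micro-batch
world_size = dist.get_world_size()               # total processes across all nodes, e.g. 16 for 2 x 8 GPUs
global_batch_size = per_gpu_batch_size * world_size
print(f"effective global batch size: {global_batch_size}")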

3. Gradient Accumulation

If memory is limited, use gradient accumulation instead of increasing batch size:
for i, (data, target) in enumerate(dataloader):
    output = model(data)
    # Scale the loss so the accumulated gradients match a single large batch
    loss = criterion(output, target) / accumulation_steps
    loss.backward()

    # Step the optimizer only once every accumulation_steps micro-batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
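When the model is wrapped in DDP, you can also skip the cross-node gradient all-reduce on the intermediate accumulation steps with no_sync(); a minimal sketch under the same assumptions as the loop above:
from contextlib import nullcontext

for i, (data, target) in enumerate(dataloader):
    # Synchronize gradients only on the step where optimizer.step() runs
    is_update_step = (i + 1) % accumulation_steps == 0
    sync_ctx = nullcontext() if is_update_step else model.no_sync()
    with sync_ctx:
        loss = criterion(model(data), target) / accumulation_steps
        loss.backward()
    if is_update_step:
        optimizer.step()
        optimizer.zero_grad()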

4. Save Checkpoints on One Node

Only save checkpoints from the main process (rank 0) so multiple ranks don't write the same file:
if dist.get_rank() == 0:
    # Unwrap the DDP wrapper so the saved keys don't carry a "module." prefix
    torch.save(model.module.state_dict(), "checkpoint.pt")
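If other ranks need to read that checkpoint later (for example when resuming), a barrier keeps them from touching the file before rank 0 has finished writing it; a minimal sketch:
if dist.get_rank() == 0:
    torch.save(model.module.state_dict(), "checkpoint.pt")

# Every rank blocks here until all ranks, including rank 0, have reached this point
dist.barrier()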

5. Use Shared Storage

For multi-node clusters, use NFS storage to share data and checkpoints across nodes:
# Create NFS volume
tp nfs create -s 1000 --name shared-data

# Attach to your cluster
tp nfs attach <storage_id> <cluster_id>
Access the shared storage at ~/nfs-<storage_id>/ from any node.
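Because every node mounts the same volume, rank 0 can write checkpoints there and any other node can read them back; a minimal sketch (the path follows the ~/nfs-<storage_id>/ pattern above, with <storage_id> as a placeholder):
import os

ckpt_dir = os.path.expanduser("~/nfs-<storage_id>/checkpoints")  # placeholder storage ID
os.makedirs(ckpt_dir, exist_ok=True)

if dist.get_rank() == 0:
    torch.save(model.module.state_dict(), os.path.join(ckpt_dir, "epoch_010.pt"))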

Troubleshooting

Communication Issues

If nodes can’t communicate:
  1. Check that all nodes can ping each other
  2. Verify firewall rules aren’t blocking traffic
  3. Ensure you’re using the correct master address and port (the NCCL debug output below can help confirm this)
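NCCL's own debug logging often shows exactly which address or interface the ranks are trying to use; NCCL_DEBUG is a standard NCCL environment variable, not a TensorPool-specific setting:
# Print NCCL initialization and transport details to stderr, then relaunch training
export NCCL_DEBUG=INFO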

Out of Memory Errors

If you run out of memory with multi-node training:
  1. Reduce batch size per GPU
  2. Use gradient accumulation
  3. Enable mixed precision training (FP16/BF16); see the sketch after this list
  4. Use memory-efficient optimizers (like DeepSpeed ZeRO)
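For mixed precision (item 3), PyTorch's automatic mixed precision is a common approach; a minimal sketch of the standard torch.cuda.amp pattern, assuming model, optimizer, criterion, and dataloader are already set up:
import torch

scaler = torch.cuda.amp.GradScaler()

for data, target in dataloader:
    optimizer.zero_grad()
    # Run the forward pass in reduced precision where it is numerically safe
    with torch.cuda.amp.autocast():
        loss = criterion(model(data), target)
    # Scale the loss to avoid FP16 gradient underflow, then step the optimizer
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()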

Slow Training

If training is slower than expected:
  1. Verify you’re using NCCL backend
  2. Check network bandwidth between nodes
  3. Profile your training loop to find bottlenecks
  4. Ensure data loading isn’t the bottleneck (use enough workers; see the DataLoader sketch below)
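For data loading (item 4), the usual levers are the DataLoader settings; a minimal sketch (the worker count is an illustrative starting point, not a tuned value):
from torch.utils.data import DataLoader

train_loader = DataLoader(
    dataset,
    sampler=train_sampler,       # DistributedSampler, as in the DDP example below
    batch_size=32,
    num_workers=8,               # several CPU workers per GPU process; tune per machine
    pin_memory=True,             # faster host-to-GPU transfers
    persistent_workers=True,     # keep workers alive between epochs
)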

Example: PyTorch DDP Training

Here’s a complete example of multi-node training with PyTorch DDP:
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def setup():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank

def cleanup():
    dist.destroy_process_group()

def train():
    local_rank = setup()
    rank = dist.get_rank()              # global rank across all nodes
    world_size = dist.get_world_size()  # total processes across all nodes

    # Create model and move it to this process's GPU
    model = YourModel().to(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Shard the dataset so each process sees a distinct slice
    train_sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    train_loader = DataLoader(dataset, sampler=train_sampler, batch_size=32)

    # Training loop
    for epoch in range(num_epochs):
        train_sampler.set_epoch(epoch)  # reshuffle differently each epoch
        for batch in train_loader:
            # Training code here
            pass

    cleanup()

if __name__ == "__main__":
    train()
Launch with torchrun on every node, changing --node_rank for each node (0 shown here for the first node):
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=<node-1-ip> --master_port=29500 train.py

Next Steps
