
Multi-Node Training

Multi-node training allows you to scale your deep learning workloads across multiple machines, each with multiple GPUs, for training very large models or processing massive datasets.

Overview

TensorPool makes it easy to deploy multi-node GPU clusters with high-speed interconnects. All nodes in a cluster are connected with low-latency networking optimized for distributed training.

Supported Instance Types

Currently, multi-node deployments are supported for:
  • 8xB200 - 2 or more nodes, each with 8 B200 GPUs
  • 8xH200 - 2 or more nodes, each with 8 H200 GPUs

Creating a Multi-Node Cluster

Create a multi-node cluster by specifying the number of nodes with the -n flag:
# 2-node cluster with 8xH200 each (16 GPUs)
tp cluster create -i ~/.ssh/id_ed25519.pub -t 8xH200 -n 2 --name my-multinode-cluster

# 4-node cluster with 8xB200 each (32 GPUs)
tp cluster create -i ~/.ssh/id_ed25519.pub -t 8xB200 -n 4 --name large-training
Multi-node deployments are only available for 8xH200 and 8xB200 instance types. Single H100 instances are available, but multi-node H100 clusters are not supported.

Cluster Configuration

Cluster Architecture

Multi-node clusters use a jumphost architecture:
  • Jumphost: {cluster_id}-jumphost - The SLURM login/controller node with a public IP address
  • Worker Nodes: {cluster_id}-0, {cluster_id}-1, etc. - Compute nodes with private IP addresses only

Network Setup

  • Jumphost: Has a public IP address for direct SSH access
  • Worker Nodes: Have private IP addresses and are only accessible from within the cluster network
  • Inter-node Communication: All nodes are connected via high-speed networking optimized for distributed training
  • Shared Storage: All nodes have access to shared NFS storage (if attached)

SSH Access

  1. Get cluster information to see all nodes and their instance IDs:
tp cluster info <cluster_id>
  2. SSH into the jumphost (this is the only node with direct public access):
tp ssh <jumphost-instance-id>
  3. Access worker nodes from the jumphost using either the instance name or private IP:
# Using instance name (replace <cluster_id> with your actual cluster ID)
ssh <cluster_id>-0
ssh <cluster_id>-1

# Or using the private IP address (found in cluster info)
ssh <worker-node-private-ip>
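If you prefer reaching workers directly from your local machine, a ProxyJump entry in your local ~/.ssh/config routes the connection through the jumphost. A sketch with illustrative host aliases; substitute the IPs shown by tp cluster info:

```
Host tp-worker-0
    HostName <worker-node-private-ip>
    ProxyJump <jumphost-public-ip>
```

With this entry in place, ssh tp-worker-0 from your laptop hops through the jumphost transparently.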

SLURM Job Scheduling

All multi-node clusters come with SLURM (Simple Linux Utility for Resource Management) preinstalled and configured. SLURM manages job scheduling and resource allocation across your cluster.

Basic SLURM Commands

Submit a job:
# Submit a batch job script
sbatch train.sh

# Submit with specific resource requirements
sbatch --nodes=2 --ntasks-per-node=8 --gres=gpu:8 train.sh
Check job status:
# View all jobs in the queue
squeue

# View your jobs only
squeue -u $USER

# View detailed information about a specific job
scontrol show job <job_id>
Cancel a job:
# Cancel a specific job
scancel <job_id>

# Cancel all your jobs
scancel -u $USER
View cluster information:
# Show cluster nodes and their status
sinfo

# Show detailed node information
sinfo -N -l

SLURM Job Script Example

Create a job script (e.g., train.sh):
#!/bin/bash
#SBATCH --job-name=distributed-training
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=22
#SBATCH --time=24:00:00

# Rendezvous on the first node in the allocation
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# Launch one torchrun per node; each spawns 8 training processes
srun --ntasks-per-node=1 torchrun --nproc_per_node=8 --nnodes=2 \
    --rdzv_backend=c10d --rdzv_endpoint="$MASTER_ADDR:29500" train.py
Then submit it:
sbatch train.sh

SLURM Resource Allocation

  • --nodes=N: Number of nodes to use (e.g., --nodes=2 for a 2-node job)
  • --ntasks-per-node=N: Tasks per node (match the GPU count when srun launches one task per GPU; use 1 when a launcher such as torchrun spawns the per-GPU processes itself)
  • --gres=gpu:8: Request 8 GPUs per node
  • --cpus-per-task=N: CPUs per task (adjust based on your workload)
  • --time=HH:MM:SS: Maximum job runtime (e.g., --time=24:00:00 for 24 hours)

Distributed Training Frameworks

Multi-node training requires using a distributed training framework. Popular options include:

PyTorch Distributed Data Parallel (DDP)

PyTorch’s built-in distributed training:
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize the process group (torchrun supplies rank and world size via the environment)
dist.init_process_group(backend="nccl")

# Pin this process to its GPU and wrap the model
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
model = YourModel().to(local_rank)
model = DDP(model, device_ids=[local_rank])

# Train as usual
Launch with torchrun, running the same command on every node with its own --node_rank:
# On the first node; on the second node, use --node_rank=1
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=<node-1-ip> --master_port=29500 train.py

DeepSpeed

Microsoft’s DeepSpeed for training very large models:
import deepspeed

# DeepSpeed configuration
ds_config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2}
}

# Initialize
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    config=ds_config
)
Launch with DeepSpeed launcher:
deepspeed --num_gpus=8 --num_nodes=2 --hostfile=hostfile train.py
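The --hostfile argument points at a file listing each node and how many GPU slots it offers, one node per line. A sketch assuming the cluster's default node names; substitute your own:

```
<cluster_id>-0 slots=8
<cluster_id>-1 slots=8
```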

Horovod

Uber’s Horovod for distributed training:
import horovod.torch as hvd

# Initialize Horovod
hvd.init()

# Pin each process to the GPU matching its local rank
torch.cuda.set_device(hvd.local_rank())

# Wrap the optimizer and synchronize initial state from rank 0
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

Best Practices

1. Use High-Speed Interconnects

TensorPool clusters come with high-speed networking. Use NCCL backend for PyTorch for best performance:
dist.init_process_group(backend="nccl")

2. Batch Size Scaling

When scaling to multiple nodes, scale your batch size accordingly:
  • 1 node (8 GPUs): batch_size = 256
  • 2 nodes (16 GPUs): batch_size = 512
  • 4 nodes (32 GPUs): batch_size = 1024
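The pattern above is the linear scaling rule: grow the global batch size with the total GPU count and scale the learning rate by the same factor. A minimal sketch; scale_hyperparams is an illustrative helper, not a TensorPool API:

```python
def scale_hyperparams(base_batch_size, base_lr, gpus_per_node, num_nodes, base_gpus=8):
    """Scale global batch size with total GPU count, and learning rate proportionally."""
    total_gpus = gpus_per_node * num_nodes
    scale = total_gpus / base_gpus
    return int(base_batch_size * scale), base_lr * scale

print(scale_hyperparams(256, 1e-3, 8, 2))  # 2 nodes: batch 512, learning rate doubled
print(scale_hyperparams(256, 1e-3, 8, 4))  # 4 nodes: batch 1024, learning rate quadrupled
```

In practice, large scaled learning rates often need a warmup period at the start of training.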

3. Gradient Accumulation

If memory is limited, use gradient accumulation instead of increasing batch size:
for i, (data, target) in enumerate(dataloader):
    output = model(data)
    loss = criterion(output, target) / accumulation_steps
    loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
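Dividing each micro-batch loss by accumulation_steps is what makes the accumulated gradient match the gradient of one large batch. A toy check with an analytic gradient (no framework needed), for loss = mean((w*x - y)^2):

```python
# dL/dw = mean(2 * x * (w*x - y)); accumulating per-micro-batch gradients
# scaled by 1/accumulation_steps reproduces the full-batch gradient.
def grad(w, xs, ys):
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

full = grad(w, xs, ys)  # gradient over the full batch of 4

accumulation_steps = 2
acc = 0.0
for i in range(0, len(xs), 2):  # micro-batches of size 2
    acc += grad(w, xs[i:i + 2], ys[i:i + 2]) / accumulation_steps

print(abs(full - acc) < 1e-12)  # prints True: the two gradients match
```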

4. Save Checkpoints on One Node

Only save checkpoints from the main process to avoid conflicts:
if dist.get_rank() == 0:
    torch.save(model.state_dict(), "checkpoint.pt")

5. Use Shared Storage

For multi-node clusters, attach a shared storage volume so datasets and checkpoints are visible from every node:
# Create storage volume
tp storage create -t fast -s 1000 --name shared-data

# Attach to your cluster
tp cluster attach <cluster_id> <storage_id>
Access the shared storage at /mnt/fast-<storage_id>/ from any node.

Troubleshooting

Communication Issues

If nodes can’t communicate:
  1. Check that all nodes can ping each other
  2. Verify firewall rules aren’t blocking traffic
  3. Ensure you’re using the correct master address and port
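To see what is failing during setup, NCCL's standard debug environment variables (these are NCCL settings, not TensorPool-specific) make each rank log the interfaces and transports it selects:

```shell
# Surface NCCL's initialization and network setup in the job logs
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
# Then relaunch the job (sbatch/torchrun) and inspect the logs for the
# interface and transport each rank chooses
echo "NCCL_DEBUG=$NCCL_DEBUG"
```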

Out of Memory Errors

If you run out of memory with multi-node training:
  1. Reduce batch size per GPU
  2. Use gradient accumulation
  3. Enable mixed precision training (FP16/BF16)
  4. Use memory-efficient optimizers (like DeepSpeed ZeRO)

Slow Training

If training is slower than expected:
  1. Verify you’re using NCCL backend
  2. Check network bandwidth between nodes
  3. Profile your training loop to find bottlenecks
  4. Ensure data loading isn’t the bottleneck (use enough workers)

Example: PyTorch DDP Training

Here’s a complete example of multi-node training with PyTorch DDP:
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def setup(local_rank):
    # torchrun supplies RANK and WORLD_SIZE via the environment
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

def cleanup():
    dist.destroy_process_group()

def train(local_rank, rank, world_size):
    setup(local_rank)

    # Create model and move it to this process's GPU
    model = YourModel().to(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Shard the dataset across all processes in the job
    train_sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    train_loader = DataLoader(dataset, sampler=train_sampler, batch_size=32)

    # Training loop
    for epoch in range(num_epochs):
        train_sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for batch in train_loader:
            # Training code here
            pass

    cleanup()

if __name__ == "__main__":
    # torchrun sets these environment variables for every process it spawns
    local_rank = int(os.environ["LOCAL_RANK"])
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    train(local_rank, rank, world_size)
Launch on each node (on the second node, change --node_rank to 1):
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=<node-1-ip> train.py

Next Steps