> ## Documentation Index
> Fetch the complete documentation index at: https://docs.tensorpool.dev/llms.txt
> Use this file to discover all available pages before exploring further.

# Multi-Node Training

> Guide to distributed training across multiple nodes

# Multi-Node Training

Multi-node training allows you to scale your deep learning workloads across multiple machines, each with multiple GPUs, for training very large models or processing massive datasets.

## Overview

TensorPool makes it easy to deploy multi-node GPU clusters with high-speed interconnects. All nodes in a cluster are connected with low-latency networking optimized for distributed training.

## Supported Instance Types

Currently, multi-node deployments are supported for:

* **8xB200** - 2 or more nodes, each with 8 B200 GPUs
* **8xH200** - 2 or more nodes, each with 8 H200 GPUs

## Creating a Multi-Node Cluster

Create a multi-node cluster by specifying the number of nodes with the `-n` flag:

```bash theme={null}
# 4-node cluster with 8xB200 each (32 GPUs)
tp cluster create 8xB200 -n 4 --name training
```

<Note>
  Multi-node deployments are only available for **8xH200** and **8xB200** instance types. Single H100 instances are available, but multi-node H100 clusters are not supported.
</Note>

## Cluster Configuration

### Cluster Architecture

Multi-node clusters use a jumphost architecture:

* **Jumphost**: `{cluster_id}-jumphost` - The SLURM login/controller node with a public IP address
* **Worker Nodes**: `{cluster_id}-0`, `{cluster_id}-1`, etc. - Compute nodes with private IP addresses only

### Network Setup

* **Jumphost**: Has a public IP address for direct SSH access
* **Worker Nodes**: Have private IP addresses and are only accessible from within the cluster network
* **Inter-node Communication**: All nodes are connected via high-speed networking optimized for distributed training
* **Shared Storage**: All nodes have access to shared storage volumes (if attached)

### SSH Access

1. **Get cluster information** to see all nodes and their instance IDs:

```bash theme={null}
tp cluster info <cluster_id>
```

2. **SSH into the jumphost** (this is the only node with direct public access):

```bash theme={null}
tp ssh <jumphost-instance-id>
```

3. **Access worker nodes** from the jumphost using either the instance name or private IP:

```bash theme={null}
# Using instance name (replace <cluster_id> with your actual cluster ID)
ssh <cluster_id>-0
ssh <cluster_id>-1

# Or using the private IP address (found in cluster info)
ssh <worker-node-private-ip>
```

### SLURM Job Scheduling

All multi-node clusters come with **SLURM** (Simple Linux Utility for Resource Management) preinstalled and configured. SLURM manages job scheduling and resource allocation across your cluster.

#### Basic SLURM Commands

**Submit a job:**

```bash theme={null}
# Submit a batch job script
sbatch train.sh

# Submit with specific resource requirements
sbatch --nodes=2 --ntasks-per-node=8 --gres=gpu:8 train.sh
```

**Check job status:**

```bash theme={null}
# View all jobs in the queue
squeue

# View your jobs only
squeue -u $USER

# View detailed information about a specific job
scontrol show job <job_id>
```

**Cancel a job:**

```bash theme={null}
# Cancel a specific job
scancel <job_id>

# Cancel all your jobs
scancel -u $USER
```

**View cluster information:**

```bash theme={null}
# Show cluster nodes and their status
sinfo

# Show detailed node information
sinfo -N -l
```

#### SLURM Job Script Example

Create a job script (e.g., `train.sh`):

```bash theme={null}
#!/bin/bash
#SBATCH --job-name=distributed-training
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=22
#SBATCH --time=24:00:00

# Your training command here
torchrun --nproc_per_node=8 --nnodes=2 train.py
```

Then submit it:

```bash theme={null}
sbatch train.sh
```

#### SLURM Resource Allocation

* **`--nodes=N`**: Number of nodes to use (e.g., `--nodes=2` for a 2-node job)
* **`--ntasks-per-node=8`**: Number of tasks per node (typically matches GPU count)
* **`--gres=gpu:8`**: Request 8 GPUs per node
* **`--cpus-per-task=N`**: CPUs per task (adjust based on your workload)
* **`--time=HH:MM:SS`**: Maximum job runtime (e.g., `--time=24:00:00` for 24 hours)

## Distributed Training Frameworks

Multi-node training requires using a distributed training framework. Popular options include:

### PyTorch Distributed Data Parallel (DDP)

PyTorch's built-in distributed training:

```python theme={null}
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize the process group
dist.init_process_group(backend="nccl")

# Wrap your model
model = YourModel()
model = DDP(model)

# Train as usual
```

Launch with `torchrun`:

```bash theme={null}
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=<node-1-ip> --master_port=29500 train.py
```

### DeepSpeed

Microsoft's DeepSpeed for training very large models:

```python theme={null}
import deepspeed

# DeepSpeed configuration
ds_config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2}
}

# Initialize
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    config=ds_config
)
```

Launch with DeepSpeed launcher:

```bash theme={null}
deepspeed --num_gpus=8 --num_nodes=2 --hostfile=hostfile train.py
```

### Horovod

Uber's Horovod for distributed training:

```python theme={null}
import horovod.torch as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank
torch.cuda.set_device(hvd.local_rank())

# Wrap optimizer with Horovod
optimizer = hvd.DistributedOptimizer(optimizer)
```

## Best Practices

### 1. Use High-Speed Interconnects

TensorPool clusters come with high-speed networking. Use NCCL backend for PyTorch for best performance:

```python theme={null}
dist.init_process_group(backend="nccl")
```

### 2. Batch Size Scaling

When scaling to multiple nodes, scale your batch size accordingly:

* 1 node (8 GPUs): batch\_size = 256
* 2 nodes (16 GPUs): batch\_size = 512
* 4 nodes (32 GPUs): batch\_size = 1024

### 3. Gradient Accumulation

If memory is limited, use gradient accumulation instead of increasing batch size:

```python theme={null}
for i, (data, target) in enumerate(dataloader):
    output = model(data)
    loss = criterion(output, target) / accumulation_steps
    loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

### 4. Save Checkpoints on One Node

Only save checkpoints from the main process to avoid conflicts:

```python theme={null}
if dist.get_rank() == 0:
    torch.save(model.state_dict(), "checkpoint.pt")
```

### 5. Use Shared Storage

For multi-node clusters, use storage to share data and checkpoints across nodes:

```bash theme={null}
# Create storage volume
tp storage create 1000 --name shared-data

# Attach to your cluster
tp cluster attach <cluster_id> <storage_id>
```

Access the shared storage at `/mnt/<storage_id>/` from any node.

## Troubleshooting

### Communication Issues

If nodes can't communicate:

1. Check that all nodes can ping each other
2. Verify firewall rules aren't blocking traffic
3. Ensure you're using the correct master address and port

### Out of Memory Errors

If you run out of memory with multi-node training:

1. Reduce batch size per GPU
2. Use gradient accumulation
3. Enable mixed precision training (FP16/BF16)
4. Use memory-efficient optimizers (like DeepSpeed ZeRO)

### Slow Training

If training is slower than expected:

1. Verify you're using NCCL backend
2. Check network bandwidth between nodes
3. Profile your training loop to find bottlenecks
4. Ensure data loading isn't the bottleneck (use enough workers)

## Example: PyTorch DDP Training

Here's a complete example of multi-node training with PyTorch DDP:

```python theme={null}
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler

def setup(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def train(rank, world_size):
    setup(rank, world_size)

    # Create model and move to GPU
    model = YourModel().to(rank)
    model = DDP(model, device_ids=[rank])

    # Create distributed sampler
    train_sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    train_loader = DataLoader(dataset, sampler=train_sampler, batch_size=32)

    # Training loop
    for epoch in range(num_epochs):
        train_sampler.set_epoch(epoch)
        for batch in train_loader:
            # Training code here
            pass

    cleanup()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    rank = int(os.environ["LOCAL_RANK"])
    train(rank, world_size)
```

Launch with:

```bash theme={null}
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=<node-1-ip> train.py
```

## Next Steps

* Learn about [storage volumes](/features/storage) for sharing data
* Review [best practices](/guides/best-practices)
