Multi-Node Training
Multi-node training allows you to scale your deep learning workloads across multiple machines, each with multiple GPUs, for training very large models or processing massive datasets.

Overview
TensorPool makes it easy to deploy multi-node GPU clusters with high-speed interconnects. All nodes in a cluster are connected with low-latency networking optimized for distributed training.

Supported Instance Types
Currently, multi-node deployments are supported for:
- 8xB200 - 2 or more nodes, each with 8 B200 GPUs
- 8xH200 - 2 or more nodes, each with 8 H200 GPUs
Creating a Multi-Node Cluster
Create a multi-node cluster by specifying the number of nodes with the -n flag:
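For example, a two-node 8xH200 cluster might be created like this. Only the -n node-count flag is documented here; the subcommand name and instance-type flag below are assumptions, so check `tp --help` on your installed CLI version for the exact syntax.

```shell
# Hypothetical invocation -- verify the subcommand and flags with `tp --help`.
# -n sets the number of nodes; the instance-type flag is an assumption.
tp deploy -i 8xH200 -n 2
```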
Multi-node deployments are only available for 8xH200 and 8xB200 instance types. Single H100 instances are available, but multi-node H100 clusters are not supported.
Cluster Configuration
Cluster Architecture
Multi-node clusters use a jumphost architecture:
- Jumphost: {cluster_id}-jumphost - The SLURM login/controller node with a public IP address
- Worker Nodes: {cluster_id}-0, {cluster_id}-1, etc. - Compute nodes with private IP addresses only
Network Setup
- Jumphost: Has a public IP address for direct SSH access
- Worker Nodes: Have private IP addresses and are only accessible from within the cluster network
- Inter-node Communication: All nodes are connected via high-speed networking optimized for distributed training
- Shared Storage: All nodes have access to shared NFS storage (if attached)
SSH Access
- Get cluster information to see all nodes and their instance IDs.
- SSH into the jumphost (the only node with direct public access).
- From the jumphost, access worker nodes using either the instance name or the private IP.
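A typical session might look like the following; the `ubuntu` user, the `tp cluster list` subcommand, and the example private IP are assumptions - substitute the values shown for your cluster.

```shell
# 1. List cluster nodes and instance IDs (hypothetical subcommand -- check `tp --help`)
tp cluster list

# 2. SSH into the jumphost (the only node with a public IP)
ssh ubuntu@<jumphost-public-ip>

# 3. From the jumphost, hop to a worker node by name or private IP
ssh {cluster_id}-0        # by instance name
ssh ubuntu@10.0.0.11      # or by private IP (example address)
```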
SLURM Job Scheduling
All multi-node clusters come with SLURM (Simple Linux Utility for Resource Management) preinstalled and configured. SLURM manages job scheduling and resource allocation across your cluster.

Basic SLURM Commands
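The day-to-day commands are standard SLURM; the script name train.sh is a placeholder:

```shell
sinfo                 # show partitions and node state
sbatch train.sh       # submit a batch job script
squeue                # list queued and running jobs
scancel <job_id>      # cancel a job by ID
```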
Submit a job with sbatch.

SLURM Job Script Example
Create a job script (e.g., train.sh):
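A minimal train.sh for a 2-node, 16-GPU job could look like this, built from the resource-allocation flags described below; the training command itself is a placeholder.

```shell
#!/bin/bash
#SBATCH --job-name=multinode-train
#SBATCH --nodes=2                 # two 8-GPU nodes
#SBATCH --ntasks-per-node=8       # one task per GPU
#SBATCH --gres=gpu:8              # request all 8 GPUs on each node
#SBATCH --cpus-per-task=8         # adjust for your dataloader workload
#SBATCH --time=24:00:00           # wall-clock limit

# srun launches one task per GPU across both nodes
srun python train.py
```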
SLURM Resource Allocation
- --nodes=N: Number of nodes to use (e.g., --nodes=2 for a 2-node job)
- --ntasks-per-node=8: Number of tasks per node (typically matches GPU count)
- --gres=gpu:8: Request 8 GPUs per node
- --cpus-per-task=N: CPUs per task (adjust based on your workload)
- --time=HH:MM:SS: Maximum job runtime (e.g., --time=24:00:00 for 24 hours)
Distributed Training Frameworks
Multi-node training requires using a distributed training framework. Popular options include:

PyTorch Distributed Data Parallel (DDP)
Use torchrun to launch PyTorch’s built-in distributed training across nodes.
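For example, run the following on each node; the script name train.py, the port, and the master address are placeholders for your own values.

```shell
# Run on every node; node_rank is 0 on the first node, 1 on the second, etc.
torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --node_rank=$NODE_RANK \
  --master_addr=<node0-private-ip> \
  --master_port=29500 \
  train.py
```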
DeepSpeed
Use Microsoft’s DeepSpeed for training very large models.

Horovod
Use Uber’s Horovod for distributed training.

Best Practices
1. Use High-Speed Interconnects
TensorPool clusters come with high-speed networking. Use the NCCL backend with PyTorch for best performance.

2. Batch Size Scaling
When scaling to multiple nodes, scale your batch size accordingly:
- 1 node (8 GPUs): batch_size = 256
- 2 nodes (16 GPUs): batch_size = 512
- 4 nodes (32 GPUs): batch_size = 1024
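The two practices above (NCCL initialization and batch-size scaling) can be sketched as follows. The per-GPU batch of 32 matches the table above, and `LOCAL_RANK` assumes a torchrun-style launcher that sets the usual environment variables.

```python
import os

def scaled_global_batch(per_gpu_batch: int, gpus_per_node: int, nodes: int) -> int:
    """Keep the per-GPU batch fixed and let the global batch grow with the cluster."""
    return per_gpu_batch * gpus_per_node * nodes

def init_nccl():
    """Initialize the NCCL process group (imports kept local so the helper
    above stays importable without PyTorch installed)."""
    import torch
    import torch.distributed as dist
    dist.init_process_group(backend="nccl")  # fastest backend for GPU collectives
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# 32 per GPU -> 256 on 1 node, 512 on 2 nodes, 1024 on 4 nodes
assert scaled_global_batch(32, 8, 2) == 512
```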
3. Gradient Accumulation
If memory is limited, use gradient accumulation instead of increasing the batch size.

4. Save Checkpoints on One Node
Only save checkpoints from the main process to avoid conflicts.

5. Use Shared Storage
For multi-node clusters, use shared NFS storage to share data and checkpoints across nodes: it is accessible at /mnt/fast-<storage_id>/ from any node.
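Points 3 and 4, with the checkpoint written to shared storage (point 5), can be sketched together like this. The accumulation factor of 4 and the checkpoint filename are assumptions, and `<storage_id>` stays a placeholder for your storage ID.

```python
def should_step(micro_batch_idx: int, accum_steps: int) -> bool:
    """Step the optimizer only every `accum_steps` micro-batches."""
    return (micro_batch_idx + 1) % accum_steps == 0

def train_steps(model, optimizer, loader, loss_fn, accum_steps=4):
    """Gradient accumulation: 4 micro-batches -> 4x effective batch size."""
    optimizer.zero_grad()
    for i, (x, y) in enumerate(loader):
        loss = loss_fn(model(x), y) / accum_steps  # scale so gradients average
        loss.backward()
        if should_step(i, accum_steps):
            optimizer.step()
            optimizer.zero_grad()

def save_checkpoint(model, path="/mnt/fast-<storage_id>/ckpt.pt"):
    """Only rank 0 writes; a barrier makes all ranks wait until the file exists.
    (Imports kept local so the pure helpers above work without PyTorch.)"""
    import torch
    import torch.distributed as dist
    if dist.get_rank() == 0:
        torch.save(model.state_dict(), path)
    dist.barrier()
```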
Troubleshooting
Communication Issues
If nodes can’t communicate:
- Check that all nodes can ping each other
- Verify firewall rules aren’t blocking traffic
- Ensure you’re using the correct master address and port
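These checks can be run from any node, for example (the worker name, master address, and port are illustrative):

```shell
ping -c 3 {cluster_id}-0          # basic reachability between nodes
nc -zv <master_addr> 29500        # is the rendezvous port reachable?
NCCL_DEBUG=INFO python train.py   # verbose NCCL logs to spot init failures
```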
Out of Memory Errors
If you run out of memory with multi-node training:
- Reduce batch size per GPU
- Use gradient accumulation
- Enable mixed precision training (FP16/BF16)
- Use memory-efficient optimizers (like DeepSpeed ZeRO)
Slow Training
If training is slower than expected:
- Verify you’re using the NCCL backend
- Check network bandwidth between nodes
- Profile your training loop to find bottlenecks
- Ensure data loading isn’t the bottleneck (use enough workers)
Example: PyTorch DDP Training
A complete multi-node run combines the pieces above: NCCL initialization, DDP model wrapping, sharded data loading, and rank-0 checkpointing.

Next Steps
- Learn about NFS storage for sharing data
- See storage management guide
- Review best practices