Multi-Node Training
Multi-node training allows you to scale your deep learning workloads across multiple machines, each with multiple GPUs, for training very large models or processing massive datasets.
Overview
TensorPool makes it easy to deploy multi-node GPU clusters with high-speed interconnects. All nodes in a cluster are connected with low-latency networking optimized for distributed training.
Supported Instance Types
Currently, multi-node deployments are supported for:
- 8xB200 - 2 or more nodes, each with 8 B200 GPUs
- 8xH200 - 2 or more nodes, each with 8 H200 GPUs
Creating a Multi-Node Cluster
Create a multi-node cluster by specifying the number of nodes with the -n flag:
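For example, a two-node 8xH200 deployment might look like this (a sketch: only the -n node-count flag is documented above; the create subcommand name and the instance-type flag are assumptions):

```bash
# Sketch: create a 2-node 8xH200 cluster. The "create" subcommand and the -i
# instance-type flag are assumptions; only -n (number of nodes) comes from this guide.
tp cluster create -i 8xH200 -n 2
```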
Multi-node deployments are only available for 8xH200 and 8xB200 instance types. Single H100 instances are available, but multi-node H100 clusters are not supported.
Cluster Configuration
Network Setup
Each node in your cluster:
- Has its own IP address for SSH access
- Is connected to other nodes via high-speed networking
- Can communicate with all other nodes in the cluster
- Has access to shared NFS storage (if attached)
SSH Access
You can SSH into any node in your cluster. When you run tp cluster list or tp cluster info, you’ll see the instance IDs for all nodes.
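For example (the exact arguments and output format are not covered here):

```bash
# List your clusters and the instance IDs of each node
tp cluster list

# Show details for a cluster
tp cluster info
```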
Distributed Training Frameworks
Multi-node training requires using a distributed training framework. Popular options include:
PyTorch Distributed Data Parallel (DDP)
PyTorch’s built-in distributed training, launched with torchrun:
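For example, a two-node, 16-GPU launch might look like this (a sketch; the script name, master IP, and port are placeholders):

```bash
# Run this on every node, changing --node_rank (0 on the master node, 1 on the other).
# 10.0.0.1 and train.py are placeholders; use your master node's IP and your own script.
torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --node_rank=0 \
  --master_addr=10.0.0.1 \
  --master_port=29500 \
  train.py
```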
DeepSpeed
Microsoft’s DeepSpeed for training very large models:
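A launch sketch, assuming a hostfile that lists both nodes and hypothetical train.py and ds_config.json files (a typical config enables ZeRO to shard optimizer state across GPUs):

```bash
# hostfile: one line per node with its GPU slot count (hostnames are placeholders)
#   node-1 slots=8
#   node-2 slots=8

# Launch on all nodes in the hostfile; train.py and ds_config.json are hypothetical.
deepspeed --hostfile=hostfile train.py --deepspeed --deepspeed_config ds_config.json
```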
Horovod
Uber’s Horovod for distributed training:
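A launch sketch, assuming two 8-GPU nodes and a hypothetical train.py:

```bash
# 16 processes total, 8 per node; hostnames and the script name are placeholders.
horovodrun -np 16 -H node-1:8,node-2:8 python train.py
```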
Best Practices
1. Use High-Speed Interconnects
TensorPool clusters come with high-speed networking. Use the NCCL backend with PyTorch for best performance:
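A minimal sketch of process-group initialization with NCCL, assuming the script is launched with torchrun (which sets the rendezvous environment variables):

```python
import os

import torch
import torch.distributed as dist

# NCCL is the GPU-optimized backend for collective communication.
# torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
```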
2. Batch Size Scaling
When scaling to multiple nodes, scale your global batch size accordingly (see the sketch after this list):
- 1 node (8 GPUs): batch_size = 256
- 2 nodes (16 GPUs): batch_size = 512
- 4 nodes (32 GPUs): batch_size = 1024
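A sketch of the scaling rule above, assuming a fixed per-GPU batch size of 32 and an already-initialized process group:

```python
import torch.distributed as dist

# 32 per GPU matches 256 across 8 GPUs; the global batch size grows with the cluster.
per_gpu_batch_size = 32
global_batch_size = per_gpu_batch_size * dist.get_world_size()
```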
3. Gradient Accumulation
If memory is limited, use gradient accumulation instead of increasing the per-GPU batch size:
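A minimal sketch, assuming a toy model and random micro-batches; four micro-batches are accumulated before each optimizer step, giving a 4x larger effective batch size without extra activation memory:

```python
import torch
import torch.nn as nn

# Accumulate gradients over 4 micro-batches, then take one optimizer step.
# The tiny model and random micro-batches are placeholders.
accumulation_steps = 4
model = nn.Linear(1024, 10).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

optimizer.zero_grad()
for step, batch in enumerate(torch.randn(16, 32, 1024)):
    loss = model(batch.cuda()).mean()
    (loss / accumulation_steps).backward()  # Scale so accumulated gradients average correctly.
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

With DDP, wrapping the non-final micro-batches in model.no_sync() also skips the redundant gradient all-reduces.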
4. Save Checkpoints on One Node
Only save checkpoints from the main process (rank 0) to avoid conflicts:
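A sketch of rank-0 checkpointing; the helper name and path are hypothetical, and the model is assumed to be wrapped in DDP:

```python
import torch
import torch.distributed as dist


def save_checkpoint(model, path="checkpoint.pt"):
    """Save only from rank 0; with DDP, unwrap via model.module so state_dict keys
    aren't prefixed with 'module.'. The path is a placeholder."""
    if dist.get_rank() == 0:
        torch.save(model.module.state_dict(), path)
    dist.barrier()  # Other ranks wait until the checkpoint is written before moving on.
```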
5. Use Shared Storage
For multi-node clusters, use NFS storage to share data and checkpoints across nodes. Attached NFS volumes are accessible at ~/nfs-<storage_id>/ from any node.
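For example, a checkpoint directory on the shared mount might be set up like this (a sketch; replace <storage_id> with your storage ID):

```python
import os

# Point checkpoint and data paths at the shared NFS mount so every node sees
# the same files. <storage_id> is a placeholder for your actual storage ID.
nfs_root = os.path.expanduser("~/nfs-<storage_id>")
checkpoint_dir = os.path.join(nfs_root, "checkpoints")
os.makedirs(checkpoint_dir, exist_ok=True)
```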
Troubleshooting
Communication Issues
If nodes can’t communicate:
- Check that all nodes can ping each other
- Verify firewall rules aren’t blocking traffic
- Ensure you’re using the correct master address and port
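A quick sanity check (a sketch): launch this on every node with torchrun; if it hangs or errors, the ranks cannot reach the master address and port:

```python
import os

import torch
import torch.distributed as dist

# Each rank contributes 1.0; after all_reduce every rank should print the world size.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
x = torch.ones(1, device="cuda")
dist.all_reduce(x)
print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")
dist.destroy_process_group()
```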
Out of Memory Errors
If you run out of memory with multi-node training:
- Reduce batch size per GPU
- Use gradient accumulation
- Enable mixed precision training (FP16/BF16)
- Use memory-efficient optimizers (like DeepSpeed ZeRO)
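A minimal sketch of BF16 mixed precision with torch.autocast, one of the memory-saving options listed above (the tiny model and random batch are placeholders):

```python
import torch
import torch.nn as nn

# Running the forward pass in BF16 reduces activation memory; the backward pass
# then uses the gradients produced by autocast.
model = nn.Linear(1024, 1024).cuda()
batch = torch.randn(32, 1024, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(batch).mean()
loss.backward()
```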
Slow Training
If training is slower than expected:
- Verify you’re using the NCCL backend
- Check network bandwidth between nodes
- Profile your training loop to find bottlenecks
- Ensure data loading isn’t the bottleneck (use enough workers)
Example: PyTorch DDP Training
Here’s an example of multi-node training with PyTorch DDP:
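A minimal sketch, assuming a toy model and synthetic data; a real script would load your own model and dataset. Launch it on every node with the torchrun command shown earlier so that RANK, LOCAL_RANK, and WORLD_SIZE are set:

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main():
    # torchrun provides these environment variables on every process.
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # Synthetic dataset: 4096 samples of 1024 features with 10 classes (placeholder).
    features = torch.randn(4096, 1024)
    labels = torch.randint(0, 10, (4096,))
    dataset = TensorDataset(features, labels)

    # DistributedSampler gives each rank a disjoint shard of the data.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler,
                        num_workers=4, pin_memory=True)

    model = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 10)).cuda()
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # Reshuffle shards each epoch.
        for x, y in loader:
            x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()   # DDP averages gradients across all GPUs here.
            optimizer.step()

        # Save checkpoints from rank 0 only (see best practice 4 above).
        if dist.get_rank() == 0:
            torch.save(model.module.state_dict(), f"checkpoint_epoch{epoch}.pt")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```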
Next Steps
- Learn about NFS storage for sharing data
- See storage management guide
- Review best practices