Best Practices
Follow these best practices to get the most out of TensorPool while optimizing costs and performance.
SSH Key Management
Keep Your Private Keys Secure
- Never share your private key - Only share the public key (.pub file)
- Use strong passphrases - Protect your private keys with passphrases
- Proper permissions - Set correct permissions on your private key:
chmod 600 ~/.ssh/id_ed25519
- Backup your keys - Keep secure backups of your SSH keys
Organize Your Keys
If you use multiple SSH keys, organize them clearly:
~/.ssh/
├── tensorpool_id_ed25519
├── tensorpool_id_ed25519.pub
├── personal_id_ed25519
└── personal_id_ed25519.pub
Configure SSH to use specific keys:
# ~/.ssh/config
Host *.tensorpool.dev
    IdentityFile ~/.ssh/tensorpool_id_ed25519
    User tensorpool
Cluster Naming
Use Descriptive Names
Use clear, descriptive names that indicate the purpose:
# Good names
tp cluster create -t 8xH100 -n 4 --name llama-70b-training
tp cluster create -t 1xH100 --name dev-testing
tp cluster create -t 8xH200 -n 2 --name research-experiments
# Avoid generic names
tp cluster create -t 8xH100 --name cluster1 # Not descriptive
Consider including:
- Project name
- Model being trained
- Purpose (training, inference, dev)
- Team or user
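Combining those pieces into one string keeps names consistent across a team. A minimal sketch with hypothetical project/model/purpose values, emitting the same create command shown above:
# Hypothetical naming helper - project, model, and purpose are illustrative values
project, model, purpose = "llama", "70b", "training"
cluster_name = f"{project}-{model}-{purpose}"          # -> "llama-70b-training"
print(f"tp cluster create -t 8xH100 -n 4 --name {cluster_name}")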
Cost Management
Destroy Clusters When Not in Use
The most important cost-saving practice:
# When you're done training
tp cluster destroy <cluster_id>
Set reminders to check for unused clusters:
# Check your active clusters
tp cluster list
# Destroy unused clusters
tp cluster destroy cls_abc123
Monitor Your Resources
Regularly check your active resources:
# List all clusters
tp cluster list
# List all NFS volumes
tp nfs list
# Check account usage
tp me
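To automate these checks, a small script can shell out to the CLI. A minimal sketch assuming only the tp cluster list, tp nfs list, and tp me commands shown above (the output is printed as a reminder, not parsed):
# check_resources.py - periodic reminder of active TensorPool resources
import subprocess

# Assumes the `tp` CLI is installed and authenticated
for cmd in (["tp", "cluster", "list"], ["tp", "nfs", "list"], ["tp", "me"]):
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(f"$ {' '.join(cmd)}")
    print(result.stdout or result.stderr)
Run it from cron or a scheduled CI job so idle clusters surface before they accumulate cost.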
Right-Size Your Instances
Choose the appropriate instance type for your workload:
- Development/Debugging: Start with 1xH100 or 2xH100
- Small Models (< 13B): 1xH100 or 2xH100
- Medium Models (13B-70B): 4xH100 or 8xH100
- Large Models (70B+): 8xH100 multi-node or 8xH200
Use Spot/Interruptible Instances
Spot instance support coming soon! This will allow significant cost savings for fault-tolerant workloads.
Data Persistence
Use NFS Volumes for Important Data
Never rely solely on cluster local storage for important data:
# Create NFS volume for datasets
tp nfs create -s 1000 --name training-data
# Attach to your cluster
tp nfs attach <storage_id> <cluster_id>
# Store datasets, checkpoints, and results on NFS
~/nfs-<storage_id>/
Checkpoint Regularly
Save checkpoints frequently to NFS storage:
# Save every N epochs
if epoch % 5 == 0:
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, os.path.expanduser(f"~/nfs-{storage_id}/checkpoints/epoch_{epoch}.pt"))  # expanduser: torch.save does not expand "~"
Download Results Before Cleanup
Before destroying NFS volumes, download important results:
# Download to local machine
rsync -avz tensorpool@<cluster_ip>:~/nfs-abc123/results/ ./results/
Multi-Node Training
Use appropriate distributed training frameworks:
# PyTorch DDP
import torch.distributed as dist
dist.init_process_group(backend="nccl")  # Use NCCL for best performance on NVIDIA GPUs

# DeepSpeed (needs the model and its parameters in addition to the config)
import deepspeed
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
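Building on the DDP import above, a minimal multi-node setup might look like the sketch below. It assumes the script is launched with torchrun (which sets RANK, LOCAL_RANK, and WORLD_SIZE for each process); model stands in for your own network.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")      # NCCL for GPU-to-GPU communication
local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
torch.cuda.set_device(local_rank)

model = model.to(local_rank)
model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced across ranks automatically

# ... training loop ...

dist.destroy_process_group()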
Scale Batch Size with Nodes
When scaling to multiple nodes, keep the per-GPU batch size fixed; the global batch size grows with the total number of GPUs:
# Global batch size with a per-GPU batch size of 32:
# Single node (8 GPUs): batch_size = 256
# 2 nodes (16 GPUs): batch_size = 512
# 4 nodes (32 GPUs): batch_size = 1024
world_size = dist.get_world_size()
batch_size = base_batch_size * world_size  # base_batch_size is the per-GPU batch size
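Each rank should also see a distinct shard of the data, so pair the larger global batch with a DistributedSampler. A minimal sketch assuming the dataset and per-GPU base_batch_size from above:
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(dataset)        # shards the dataset across ranks
dataloader = DataLoader(dataset, batch_size=base_batch_size, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)                 # reshuffle the shards each epoch
    for batch in dataloader:
        ...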
Use Shared Storage for Multi-Node
For multi-node clusters, use NFS to share data and checkpoints:
# All nodes access the same data
cd ~/nfs-<storage_id>/dataset
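When several nodes share one NFS mount, have a single rank write checkpoints so processes don't overwrite each other's files. A minimal sketch using the torch.distributed rank, with a hypothetical checkpoint_path on the NFS volume:
import torch
import torch.distributed as dist

if dist.get_rank() == 0:       # only rank 0 writes to the shared volume
    torch.save(model.state_dict(), checkpoint_path)
dist.barrier()                 # other ranks wait until the checkpoint exists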
Use Mixed Precision Training
Enable FP16 or BF16 for faster training:
# PyTorch AMP
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
with autocast():
    output = model(input)
    loss = criterion(output, target)
scaler.scale(loss).backward()  # scale the loss to avoid FP16 gradient underflow
scaler.step(optimizer)         # unscales gradients, then steps the optimizer
scaler.update()
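On H100/H200 GPUs, BF16 is often preferable because it keeps FP32's dynamic range and needs no gradient scaler. A minimal variant of the loop above:
import torch

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    output = model(input)
    loss = criterion(output, target)
loss.backward()    # no GradScaler needed with BF16
optimizer.step()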
Optimize Data Loading
Use efficient data loading:
from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,            # Multiple workers
    pin_memory=True,          # Faster GPU transfer
    persistent_workers=True,  # Keep workers alive between epochs
)
Profile Your Code
Find bottlenecks with profiling:
import torch.profiler

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as prof:
    # Your training code
    pass

print(prof.key_averages().table())
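To focus on the slowest GPU operations, the summary table can be sorted and truncated, and the full timeline exported for the Chrome trace viewer; both calls are part of the torch.profiler API used above:
# Top 10 operators by total CUDA time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

# Detailed timeline, viewable at chrome://tracing
prof.export_chrome_trace("trace.json")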
Monitoring
Regularly Check Cluster List
Keep track of your active resources:
# Set up an alias for quick checking
echo "alias tpl='tp cluster list'" >> ~/.bashrc
# Use it frequently
tpl
Monitor GPU Usage
When SSH’d into your cluster, monitor GPU usage:
# Watch GPU usage
watch -n 1 nvidia-smi
# Or use gpustat
pip install gpustat
gpustat -i 1
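You can also sample utilization from inside your training script via NVML. A minimal sketch using the pynvml bindings (installable as the nvidia-ml-py package), reading the first GPU only:
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent busy since the last sample
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
print(f"GPU 0: {util.gpu}% util, {mem.used / 1e9:.1f}/{mem.total / 1e9:.1f} GB")
pynvml.nvmlShutdown()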
Track Training Metrics
Use tools like Weights & Biases or TensorBoard:
import wandb
wandb.init(project="my-training")
wandb.log({"loss": loss, "accuracy": acc})
Security
API Key Security
Keep your API key secure:
# Use environment variables
export TENSORPOOL_API_KEY="your_key_here"

# Never commit API keys to git - keep them in an untracked .env file
echo ".env" >> .gitignore
Network Security
- Only open necessary ports
- Use SSH key authentication (no passwords)
- Regularly rotate SSH keys
- Monitor for unauthorized access
Workflow Recommendations
Development Workflow
- Start small: Use 1xH100 for development
- Test your code: Verify everything works on small scale
- Scale up: Move to larger instances once tested
- Monitor: Watch metrics and GPU utilization
- Clean up: Destroy clusters when done
Production Workflow
- Use NFS: Store all important data on NFS volumes
- Checkpoint frequently: Save progress regularly
- Monitor costs: Track usage and spending
- Automate: Script common workflows
- Document: Keep notes on experiments and configurations
Common Mistakes to Avoid
❌ Leaving Clusters Running
Don’t forget to destroy clusters when you’re done:
# Always clean up
tp cluster destroy <cluster_id>
❌ Not Using NFS for Important Data
Don’t store critical data only on cluster local storage:
# Use NFS for persistent data
tp nfs create -s 500 --name important-data
❌ Choosing Wrong Instance Type
Don’t use oversized instances for small tasks:
# For development, start small
tp cluster create -t 1xH100 --name dev # ✓ Good
tp cluster create -t 8xH200 -n 4 --name dev # ❌ Overkill
❌ Ignoring Checkpoints
Don’t train without saving checkpoints:
# Save regularly
if epoch % checkpoint_interval == 0:
    torch.save(model.state_dict(), checkpoint_path)
❌ Not Monitoring Usage
Don’t let resources run without monitoring:
# Check regularly
tp cluster list
tp me
Next Steps