Best Practices

Follow these best practices to get the most out of TensorPool while optimizing costs and performance.

SSH Key Management

Keep Your Private Keys Secure

  • Never share your private key - Only share the public key (.pub file)
  • Use strong passphrases - Protect your private keys with passphrases
  • Proper permissions - Set correct permissions on your private key:
    chmod 600 ~/.ssh/id_ed25519
    
  • Backup your keys - Keep secure backups of your SSH keys

Organize Your Keys

If you use multiple SSH keys, organize them clearly:
~/.ssh/
├── tensorpool_id_ed25519
├── tensorpool_id_ed25519.pub
├── personal_id_ed25519
└── personal_id_ed25519.pub
Configure SSH to use specific keys:
# ~/.ssh/config
Host *.tensorpool.dev
    IdentityFile ~/.ssh/tensorpool_id_ed25519
    User tensorpool

Cluster Naming

Use Descriptive Names

Use clear, descriptive names that indicate the purpose:
# Good names
tp cluster create -t 8xH100 -n 4 --name llama-70b-training
tp cluster create -t 1xH100 --name dev-testing
tp cluster create -t 8xH200 -n 2 --name research-experiments

# Avoid generic names
tp cluster create -t 8xH100 --name cluster1  # Not descriptive

Include Key Information

Consider including:
  • Project name
  • Model being trained
  • Purpose (training, inference, dev)
  • Team or user

Cost Management

Destroy Clusters When Not in Use

The most important cost-saving practice:
# When you're done training
tp cluster destroy <cluster_id>
Set reminders to check for unused clusters:
# Check your active clusters
tp cluster list

# Destroy unused clusters
tp cluster destroy cls_abc123

Monitor Your Resources

Regularly check your active resources:
# List all clusters
tp cluster list

# List all NFS volumes
tp nfs list

# Check account usage
tp me

Right-Size Your Instances

Choose the appropriate instance type for your workload:
  • Development/Debugging: Start with 1xH100 or 2xH100
  • Small Models (< 13B): 1xH100 or 2xH100
  • Medium Models (13B-70B): 4xH100 or 8xH100
  • Large Models (70B+): 8xH100 multi-node or 8xH200

Use Spot/Interruptible Instances

Spot instance support coming soon! This will allow significant cost savings for fault-tolerant workloads.

Data Persistence

Use NFS Volumes for Important Data

Never rely solely on cluster local storage for important data:
# Create NFS volume for datasets
tp nfs create -s 1000 --name training-data

# Attach to your cluster
tp nfs attach <storage_id> <cluster_id>

# Store datasets, checkpoints, and results on NFS
~/nfs-<storage_id>/

Checkpoint Regularly

Save checkpoints frequently to NFS storage:
import os

# Save every N epochs; "~" must be expanded explicitly, torch.save will not do it
if epoch % 5 == 0:
    checkpoint_path = os.path.expanduser(f"~/nfs-{storage_id}/checkpoints/epoch_{epoch}.pt")
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, checkpoint_path)

Download Results Before Cleanup

Before destroying NFS volumes, download important results:
# Download to local machine
rsync -avz tensorpool@<cluster_ip>:~/nfs-abc123/results/ ./results/

Multi-Node Training

Configure Distributed Training Properly

Use appropriate distributed training frameworks:
# PyTorch DDP
import torch.distributed as dist
dist.init_process_group(backend="nccl")  # Use NCCL for best performance

# DeepSpeed (initialize needs the model as well as the config)
import deepspeed
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
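
For DDP specifically, a minimal end-to-end setup looks roughly like this. It assumes a torchrun launch (which sets LOCAL_RANK); MyModel is a placeholder for your own model:
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
torch.cuda.set_device(local_rank)

model = MyModel().to(local_rank)             # placeholder model
model = DDP(model, device_ids=[local_rank])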

Scale Batch Size with Nodes

When scaling to multiple nodes, adjust your batch size:
# Single node (8 GPUs): batch_size = 256
# 2 nodes (16 GPUs): batch_size = 512
# 4 nodes (32 GPUs): batch_size = 1024

world_size = dist.get_world_size()
batch_size = base_batch_size * world_size  # global batch; base_batch_size is the per-GPU batch
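
Note that DataLoader's batch_size is the per-process (per-GPU) value; a DistributedSampler gives each rank its own shard of the data. A minimal sketch, assuming the distributed setup above (dataset, base_batch_size, and num_epochs come from your own training code):
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(dataset)                  # shards the dataset across ranks
dataloader = DataLoader(dataset, batch_size=base_batch_size, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffle with a different seed each epoch
    # ... training loop ...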

Use Shared Storage for Multi-Node

For multi-node clusters, use NFS to share data and checkpoints:
# All nodes access the same data
cd ~/nfs-<storage_id>/dataset
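
When checkpoints live on the shared NFS mount, have only one rank write them so the nodes do not overwrite each other. A minimal sketch, assuming the distributed setup above (checkpoint_path points at the NFS mount):
if dist.get_rank() == 0:                     # only the first rank writes
    torch.save(model.state_dict(), checkpoint_path)
dist.barrier()                               # keep all ranks in sync around the save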

Performance Optimization

Use Mixed Precision Training

Enable FP16 or BF16 for faster training:
# PyTorch AMP (FP16)
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
with autocast():
    output = model(input)
    loss = criterion(output, target)

# Scale the loss before backward, then step and update through the scaler
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
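
On H100/H200-class GPUs, BF16 is often the simpler choice because it needs no GradScaler. A minimal sketch using the built-in torch.autocast (model, criterion, and optimizer come from your own training code):
import torch

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    output = model(input)
    loss = criterion(output, target)

loss.backward()
optimizer.step()
optimizer.zero_grad()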

Optimize Data Loading

Use efficient data loading:
from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,           # parallel loading workers
    pin_memory=True,         # page-locked memory for faster host-to-GPU transfer
    persistent_workers=True  # keep workers alive between epochs
)

Profile Your Code

Find bottlenecks with profiling:
import torch.profiler

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as prof:
    # Your training code
    pass

print(prof.key_averages().table())
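
Sorting the table by CUDA time and exporting a Chrome trace usually make the output easier to act on:
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
prof.export_chrome_trace("trace.json")  # open in chrome://tracing or Perfetto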

Monitoring

Regularly Check Cluster List

Keep track of your active resources:
# Set up an alias for quick checking
echo "alias tpl='tp cluster list'" >> ~/.bashrc

# Use it frequently
tpl

Monitor GPU Usage

When SSH’d into your cluster, monitor GPU usage:
# Watch GPU usage
watch -n 1 nvidia-smi

# Or use gpustat
pip install gpustat
gpustat -i 1

Track Training Metrics

Use tools like Weights & Biases or TensorBoard:
import wandb

wandb.init(project="my-training")
wandb.log({"loss": loss, "accuracy": acc})
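
TensorBoard works similarly; pointing the writer at your NFS mount keeps logs after the cluster is destroyed (the log path, storage_id, loss, and step below are illustrative):
import os
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir=os.path.expanduser(f"~/nfs-{storage_id}/tb_logs"))
writer.add_scalar("loss", loss, global_step=step)
writer.flush()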

Security

API Key Security

Keep your API key secure:
# Use environment variables
export TENSORPOOL_API_KEY="your_key_here"

# Never commit API keys to git; keep them in an untracked .env file
echo ".env" >> .gitignore

Network Security

  • Only open necessary ports
  • Use SSH key authentication (no passwords)
  • Regularly rotate SSH keys
  • Monitor for unauthorized access

Workflow Recommendations

Development Workflow

  1. Start small: Use 1xH100 for development
  2. Test your code: Verify everything works on small scale
  3. Scale up: Move to larger instances once tested
  4. Monitor: Watch metrics and GPU utilization
  5. Clean up: Destroy clusters when done

Production Workflow

  1. Use NFS: Store all important data on NFS volumes
  2. Checkpoint frequently: Save progress regularly
  3. Monitor costs: Track usage and spending
  4. Automate: Script common workflows (see the sketch after this list)
  5. Document: Keep notes on experiments and configurations
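
A minimal sketch of the automation step, wrapping the CLI commands shown earlier in this guide:
import subprocess

# Quick daily audit: active clusters, NFS volumes, and account usage
for cmd in (["tp", "cluster", "list"], ["tp", "nfs", "list"], ["tp", "me"]):
    subprocess.run(cmd, check=True)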

Common Mistakes to Avoid

❌ Leaving Clusters Running

Don’t forget to destroy clusters when you’re done:
# Always clean up
tp cluster destroy <cluster_id>

❌ Not Using NFS for Important Data

Don’t store critical data only on cluster local storage:
# Use NFS for persistent data
tp nfs create -s 500 --name important-data

❌ Choosing Wrong Instance Type

Don’t use oversized instances for small tasks:
# For development, start small
tp cluster create -t 1xH100 --name dev  # ✓ Good
tp cluster create -t 8xH200 -n 4 --name dev  # ❌ Overkill

❌ Ignoring Checkpoints

Don’t train without saving checkpoints:
# Save regularly
if epoch % checkpoint_interval == 0:
    torch.save(model.state_dict(), checkpoint_path)

❌ Not Monitoring Usage

Don’t let resources run without monitoring:
# Check regularly
tp cluster list
tp me
