Best Practices
Follow these best practices to get the most out of TensorPool while optimizing costs and performance.
SSH Key Management
Keep Your Private Keys Secure
- Never share your private key - Only share the public key (.pub file)
- Use strong passphrases - Protect your private keys with passphrases
- Proper permissions - Set correct permissions on your private key:
chmod 600 ~/.ssh/id_ed25519
- Backup your keys - Keep secure backups of your SSH keys
Organize Your Keys
If you use multiple SSH keys, organize them clearly:
~/.ssh/
├── tensorpool_id_ed25519
├── tensorpool_id_ed25519.pub
├── personal_id_ed25519
└── personal_id_ed25519.pub
Configure SSH to use specific keys:
# ~/.ssh/config
Host *.tensorpool.dev
    IdentityFile ~/.ssh/tensorpool_id_ed25519
    User tensorpool
Cluster Naming
Use Descriptive Names
Use clear, descriptive names that indicate the purpose:
# Good names
tp cluster create -t 8xH100 -n 4 --name llama-70b-training
tp cluster create -t 1xH100 --name dev-testing
tp cluster create -t 8xH200 -n 2 --name research-experiments
# Avoid generic names
tp cluster create -t 8xH100 --name cluster1 # Not descriptive
Consider including:
- Project name
- Model being trained
- Purpose (training, inference, dev)
- Team or user
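Combining those pieces into one string keeps names consistent across a team. A minimal sketch with hypothetical project/model/purpose values, emitting the same create command shown above:
# Hypothetical naming helper - project, model, and purpose are illustrative values
project, model, purpose = "llama", "70b", "training"
cluster_name = f"{project}-{model}-{purpose}"          # -> "llama-70b-training"
print(f"tp cluster create -t 8xH100 -n 4 --name {cluster_name}")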
Cost Management
Destroy Clusters When Not in Use
The most important cost-saving practice:
# When you're done training
tp cluster destroy <cluster_id>
Set reminders to check for unused clusters:
# Check your active clusters
tp cluster list
# Destroy unused clusters
tp cluster destroy cls_abc123
Monitor Your Resources
Regularly check your active resources:
# List all clusters
tp cluster list
# List all NFS volumes
tp nfs list
# Check account usage
tp me
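To automate these checks, a small script can shell out to the CLI. A minimal sketch assuming only the tp cluster list, tp nfs list, and tp me commands shown above (the output is printed as a reminder, not parsed):
# check_resources.py - periodic reminder of active TensorPool resources
import subprocess

# Assumes the `tp` CLI is installed and authenticated
for cmd in (["tp", "cluster", "list"], ["tp", "nfs", "list"], ["tp", "me"]):
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(f"$ {' '.join(cmd)}")
    print(result.stdout or result.stderr)
Run it from cron or a scheduled CI job so idle clusters surface before they accumulate cost.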
Right-Size Your Instances
Choose the appropriate instance type for your workload:
- Development/Debugging: Start with 1xH100 or 2xH100
- Small Models (< 13B): 1xH100 or 2xH100
- Medium Models (13B-70B): 4xH100 or 8xH100
- Large Models (70B+): 8xH100 multi-node or 8xH200
Use Spot/Interruptible Instances
Spot instance support coming soon! This will allow significant cost savings for fault-tolerant workloads.
Data Persistence
Use NFS Volumes for Important Data
Never rely solely on cluster local storage for important data:
# Create NFS volume for datasets
tp nfs create -s 1000 --name training-data
# Attach to your cluster
tp nfs attach <storage_id> <cluster_id>
# Store datasets, checkpoints, and results on NFS
~/nfs-<storage_id>/
Checkpoint Regularly
Save checkpoints frequently to NFS storage:
# Save every N epochs
if epoch % 5 == 0:
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, os.path.expanduser(f"~/nfs-{storage_id}/checkpoints/epoch_{epoch}.pt"))  # expanduser: torch.save does not expand "~"
Download Results Before Cleanup
Before destroying NFS volumes, download important results:
# Download to local machine
rsync -avz tensorpool@<cluster_ip>:~/nfs-abc123/results/ ./results/
Multi-Node Training
Use appropriate distributed training frameworks:
# PyTorch DDP
import torch.distributed as dist
dist.init_process_group(backend="nccl")  # Use NCCL for best performance on NVIDIA GPUs

# DeepSpeed (needs the model and its parameters in addition to the config)
import deepspeed
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
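Building on the DDP import above, a minimal multi-node setup might look like the sketch below. It assumes the script is launched with torchrun (which sets RANK, LOCAL_RANK, and WORLD_SIZE for each process); model stands in for your own network.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")      # NCCL for GPU-to-GPU communication
local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
torch.cuda.set_device(local_rank)

model = model.to(local_rank)
model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced across ranks automatically

# ... training loop ...

dist.destroy_process_group()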
Scale Batch Size with Nodes
When scaling to multiple nodes, keep the per-GPU batch size fixed; the global batch size grows with the total number of GPUs:
# Global batch size with a per-GPU batch size of 32:
# Single node (8 GPUs): batch_size = 256
# 2 nodes (16 GPUs): batch_size = 512
# 4 nodes (32 GPUs): batch_size = 1024
world_size = dist.get_world_size()
batch_size = base_batch_size * world_size  # base_batch_size is the per-GPU batch size
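Each rank should also see a distinct shard of the data, so pair the larger global batch with a DistributedSampler. A minimal sketch assuming the dataset and per-GPU base_batch_size from above:
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(dataset)        # shards the dataset across ranks
dataloader = DataLoader(dataset, batch_size=base_batch_size, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)                 # reshuffle the shards each epoch
    for batch in dataloader:
        ...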
Use Shared Storage for Multi-Node
For multi-node clusters, use NFS to share data and checkpoints:
# All nodes access the same data
cd ~/nfs-<storage_id>/dataset
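When several nodes share one NFS mount, have a single rank write checkpoints so processes don't overwrite each other's files. A minimal sketch using the torch.distributed rank, with a hypothetical checkpoint_path on the NFS volume:
import torch
import torch.distributed as dist

if dist.get_rank() == 0:       # only rank 0 writes to the shared volume
    torch.save(model.state_dict(), checkpoint_path)
dist.barrier()                 # other ranks wait until the checkpoint exists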
Use Mixed Precision Training
Enable FP16 or BF16 for faster training:
# PyTorch AMP
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
with autocast():
    output = model(input)
    loss = criterion(output, target)
scaler.scale(loss).backward()  # scale the loss to avoid FP16 gradient underflow
scaler.step(optimizer)         # unscales gradients, then steps the optimizer
scaler.update()
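On H100/H200 GPUs, BF16 is often preferable because it keeps FP32's dynamic range and needs no gradient scaler. A minimal variant of the loop above:
import torch

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    output = model(input)
    loss = criterion(output, target)
loss.backward()    # no GradScaler needed with BF16
optimizer.step()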
Optimize Data Loading
Use efficient data loading:
from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,            # Multiple workers
    pin_memory=True,          # Faster GPU transfer
    persistent_workers=True,  # Keep workers alive between epochs
)
Profile Your Code
Find bottlenecks with profiling:
import torch.profiler

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as prof:
    # Your training code
    pass

print(prof.key_averages().table())
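To focus on the slowest GPU operations, the summary table can be sorted and truncated, and the full timeline exported for the Chrome trace viewer; both calls are part of the torch.profiler API used above:
# Top 10 operators by total CUDA time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

# Detailed timeline, viewable at chrome://tracing
prof.export_chrome_trace("trace.json")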
Monitoring
Regularly Check Cluster List
Keep track of your active resources:
# Set up an alias for quick checking
echo "alias tpl='tp cluster list'" >> ~/.bashrc
# Use it frequently
tpl
Monitor GPU Usage
When SSH’d into your cluster, monitor GPU usage:
# Watch GPU usage
watch -n 1 nvidia-smi
# Or use gpustat
pip install gpustat
gpustat -i 1
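You can also sample utilization from inside your training script via NVML. A minimal sketch using the pynvml bindings (installable as the nvidia-ml-py package), reading the first GPU only:
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent busy since the last sample
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
print(f"GPU 0: {util.gpu}% util, {mem.used / 1e9:.1f}/{mem.total / 1e9:.1f} GB")
pynvml.nvmlShutdown()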
Track Training Metrics
Use tools like Weights & Biases or TensorBoard:
import wandb
wandb.init(project="my-training")
wandb.log({"loss": loss, "accuracy": acc})
Security
API Key Security
Keep your API key secure:
# Use environment variables
export TENSORPOOL_API_KEY="your_key_here"

# Never commit API keys to git - keep them in an untracked .env file
echo ".env" >> .gitignore
Network Security
- Only open necessary ports
- Use SSH key authentication (no passwords)
- Regularly rotate SSH keys
- Monitor for unauthorized access
Workflow Recommendations
Development Workflow
- Start small: Use 1xH100 for development
- Test your code: Verify everything works on small scale
- Scale up: Move to larger instances once tested
- Monitor: Watch metrics and GPU utilization
- Clean up: Destroy clusters when done
Production Workflow
- Use NFS: Store all important data on NFS volumes
- Checkpoint frequently: Save progress regularly
- Monitor costs: Track usage and spending
- Automate: Script common workflows
- Document: Keep notes on experiments and configurations
Common Mistakes to Avoid
❌ Leaving Clusters Running
Don’t forget to destroy clusters when you’re done:
# Always clean up
tp cluster destroy <cluster_id>
❌ Not Using NFS for Important Data
Don’t store critical data only on cluster local storage:
# Use NFS for persistent data
tp nfs create -s 500 --name important-data
❌ Choosing Wrong Instance Type
Don’t use oversized instances for small tasks:
# For development, start small
tp cluster create -t 1xH100 --name dev # ✓ Good
tp cluster create -t 8xH200 -n 4 --name dev # ❌ Overkill
❌ Ignoring Checkpoints
Don’t train without saving checkpoints:
# Save regularly
if epoch % checkpoint_interval == 0:
    torch.save(model.state_dict(), checkpoint_path)
❌ Not Monitoring Usage
Don’t let resources run without monitoring:
# Check regularly
tp cluster list
tp me
Next Steps