
# Best Practices

> Tips and best practices for using TensorPool

Follow these best practices to get the most out of TensorPool while optimizing performance and cost.

## SSH Key Management

### Keep Your Private Keys Secure

* **Never share your private key** - Only share the public key (`.pub` file)
* **Use strong passphrases** - Protect your private keys with passphrases
* **Proper permissions** - Set correct permissions on your private key:
  ```bash theme={null}
  chmod 600 ~/.ssh/id_ed25519
  ```
* **Backup your keys** - Keep secure backups of your SSH keys
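
To audit the permissions advice above from a script, a minimal sketch along these lines may help. The helper name `key_permissions_ok` is hypothetical, and the check assumes a POSIX file system where mode `600` means no group or other access.

```python
import os
import stat


def key_permissions_ok(path: str) -> bool:
    """Return True if the file grants no access to group or others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # Any bit in 0o077 means someone other than the owner can touch the key.
    return mode & 0o077 == 0


if __name__ == "__main__":
    key_path = os.path.expanduser("~/.ssh/id_ed25519")
    if os.path.exists(key_path):
        status = "ok" if key_permissions_ok(key_path) else "too permissive, run chmod 600"
        print(f"{key_path}: {status}")
```

A key created with the default `ssh-keygen` settings normally passes this check already; the script is mainly useful after copying keys between machines.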

### Organize Your Keys

If you use multiple SSH keys, organize them clearly:

```text theme={null}
~/.ssh/
├── tensorpool_id_ed25519
├── tensorpool_id_ed25519.pub
├── personal_id_ed25519
└── personal_id_ed25519.pub
```

Configure SSH to use specific keys:

```bash theme={null}
# ~/.ssh/config
Host *.tensorpool.dev
    IdentityFile ~/.ssh/tensorpool_id_ed25519
    User tensorpool
```

## Cluster Naming

### Use Descriptive Names

Use clear, descriptive names that indicate the purpose:

```bash theme={null}
# Good names
tp cluster create 8xB200 -n 4 --name pretraining
tp cluster create 1xH100 --name joshua-workbench
tp cluster create 8xH200 -n 2 --name research-experiments

# Avoid generic names
tp cluster create 8xH100 --name cluster1  # Not descriptive
```

If you belong to a TensorPool Organization, other members can see your clusters, so descriptive names help avoid confusion about what each cluster is for.

## Cost Management

### Destroy Clusters When Not in Use

The most important cost-saving practice is to destroy a cluster as soon as you no longer need it:

```bash theme={null}
# When you're done training
tp cluster destroy <cluster_id>
```

Set reminders to check for unused clusters:

```bash theme={null}
# Check your active clusters
tp cluster list

# Destroy unused clusters
tp cluster destroy c-abc123
```

### Monitor Your Resources

Regularly check your active resources:

```bash theme={null}
# List all clusters
tp cluster list

# List all storage volumes
tp storage list

# Check account usage
tp me
```
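
If you prefer one command for this routine check, the three CLI calls above can be wrapped in a small script. This is only a sketch: `run_checks` is a hypothetical helper, it assumes `tp` is on your `PATH`, and it prints whatever the CLI returns without assuming any output format.

```python
import shutil
import subprocess

# Commands taken from this page; their output format is not assumed here.
COMMANDS = [
    ["tp", "cluster", "list"],
    ["tp", "storage", "list"],
    ["tp", "me"],
]


def run_checks(commands=COMMANDS):
    """Run each command and return {command string: stdout, or None if not found}."""
    results = {}
    for cmd in commands:
        label = " ".join(cmd)
        if shutil.which(cmd[0]) is None:
            results[label] = None  # executable is not installed or not on PATH
            continue
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[label] = proc.stdout.strip()
    return results


if __name__ == "__main__":
    for label, output in run_checks().items():
        print(f"== {label} ==")
        print(output if output is not None else "(command not found)")
```

Running it from a scheduled job or a shell startup file is one way to turn the reminder into a habit.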

## Data Persistence

### Use Storage Volumes for Important Data

Never rely solely on cluster local storage for important data:

```bash theme={null}
# Create storage volume for datasets
tp storage create 1000 --name training-data

# Attach to your cluster
tp cluster attach <cluster_id> <storage_id>

# Store datasets, checkpoints, and results under the volume's mount point:
#   /mnt/<storage_id>/
```

### Checkpoint Regularly

Save checkpoints frequently to shared storage:

```python theme={null}
import torch

# Save every 5 epochs; `storage_id` is the ID of your attached storage volume
if epoch % 5 == 0:
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, f"/mnt/{storage_id}/checkpoints/epoch_{epoch}.pt")
```
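
Checkpoints only pay off if a restarted job can find the newest one. The sketch below assumes the `epoch_<N>.pt` naming used above; `latest_checkpoint` is a hypothetical helper, not part of TensorPool or PyTorch.

```python
import re
from pathlib import Path

CHECKPOINT_PATTERN = re.compile(r"epoch_(\d+)\.pt$")


def latest_checkpoint(checkpoint_dir):
    """Return the Path of the highest-numbered epoch_<N>.pt file, or None."""
    best_epoch, best_path = -1, None
    for path in Path(checkpoint_dir).glob("epoch_*.pt"):
        match = CHECKPOINT_PATTERN.search(path.name)
        # Compare epochs numerically so epoch_10 sorts after epoch_5.
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_path = int(match.group(1)), path
    return best_path
```

The returned path can be passed to `torch.load`, after which the saved `model_state_dict` and `optimizer_state_dict` entries can be restored with `load_state_dict`.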

### Download Results Before Cleanup

Before destroying storage volumes, download important results:

```bash theme={null}
# Download to local machine
rsync -avz tensorpool@<cluster_ip>:/mnt/<storage_id>/results/ ./results/
```

## Multi-Node Training

### Configure Distributed Training Properly

Use appropriate distributed training frameworks:

```python theme={null}
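# PyTorch DDP
import torch.distributed as dist
dist.init_process_group(backend="nccl")  # NCCL is the recommended backend for GPU training

# DeepSpeed (pass your model and its parameters along with the config)
import deepspeed
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)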
```

### Scale Batch Size with Nodes

When scaling to multiple nodes, adjust your batch size:

```python theme={null}
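# The global batch size grows with the number of GPUs (one process per GPU):
# 1 node  (8 GPUs):  32 per GPU * 8  = 256
# 2 nodes (16 GPUs): 32 per GPU * 16 = 512
# 4 nodes (32 GPUs): 32 per GPU * 32 = 1024

world_size = dist.get_world_size()  # total number of processes, not nodes
global_batch_size = per_gpu_batch_size * world_size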
```
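
The arithmetic behind those numbers can be checked without a cluster. This sketch assumes one process per GPU and a per-GPU batch of 32, which is what 256 samples across 8 GPUs works out to; the function name is hypothetical.

```python
def global_batch_size(per_gpu_batch_size: int, gpus_per_node: int, num_nodes: int) -> int:
    """Effective batch size per optimizer step across all GPUs."""
    world_size = gpus_per_node * num_nodes  # one process per GPU
    return per_gpu_batch_size * world_size


for nodes in (1, 2, 4):
    print(f"{nodes} node(s): global batch size {global_batch_size(32, 8, nodes)}")
# 1 node(s): global batch size 256
# 2 node(s): global batch size 512
# 4 node(s): global batch size 1024
```

Note that with PyTorch DDP the `batch_size` given to each process's `DataLoader` is the per-GPU value, so the global batch grows automatically as processes are added.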

### Use Shared Storage for Multi-Node

For multi-node clusters, use shared storage to share data and checkpoints:

```bash theme={null}
# All nodes access the same data
cd /mnt/<storage_id>/dataset
```
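
Because every node sees the same files, it usually makes sense for only one process to write a given checkpoint. A minimal sketch, assuming the job is launched with `torchrun` (which sets the `RANK` environment variable); both helper names are hypothetical.

```python
import os


def is_main_process() -> bool:
    """True on global rank 0, or when RANK is unset (single-process run)."""
    return int(os.environ.get("RANK", "0")) == 0


def should_write_checkpoint(epoch: int, interval: int) -> bool:
    """Only rank 0 writes, and only every `interval` epochs."""
    return is_main_process() and epoch % interval == 0
```

This pattern fits plain DDP-style checkpoints, where each rank holds a full copy of the model. Sharded checkpoint formats, such as those written by DeepSpeed, typically expect every rank to take part in the save call instead.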

## Performance Optimization

### Use Mixed Precision Training

Enable FP16 or BF16 for faster training:

```python theme={null}
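# PyTorch AMP (requires a recent PyTorch; torch.amp.GradScaler was added in 2.3)
import torch

scaler = torch.amp.GradScaler("cuda")  # needed for FP16; BF16 usually does not need it
with torch.autocast(device_type="cuda", dtype=torch.float16):
    output = model(input)
    loss = criterion(output, target)

optimizer.zero_grad()
scaler.scale(loss).backward()  # scale the loss to avoid FP16 gradient underflow
scaler.step(optimizer)         # unscales gradients, then steps the optimizer
scaler.update()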
```

### Profile Your Code

Find bottlenecks with profiling:

```python theme={null}
import torch.profiler

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as prof:
    # Your training code
    pass

print(prof.key_averages().table())
```

## Monitoring

### Regularly Check Cluster List

Keep track of your active resources:

```bash theme={null}
# Set up an alias for quick checking
echo "alias tpl='tp cluster list'" >> ~/.bashrc
source ~/.bashrc  # reload so the alias is available in the current shell

# Use it frequently
tpl
```

### Monitor GPU Usage

When SSH'd into your cluster, monitor GPU usage:

```bash theme={null}
# Watch GPU usage
watch -n 1 nvidia-smi

# Or use gpustat
pip install gpustat
gpustat -i 1
```
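
To log GPU utilization from a script rather than watching it interactively, `nvidia-smi` can emit CSV through its `--query-gpu` option. The sketch below parses that output; the helper names are hypothetical, and it assumes every queried field is numeric (some GPUs report `[N/A]` for certain fields, which this code does not handle).

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Turn nvidia-smi CSV rows into a list of dicts, one per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, used, total = [field.strip() for field in line.split(",")]
        stats.append({
            "index": int(index),
            "util_pct": int(util),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
        })
    return stats


def read_gpu_stats():
    """Run nvidia-smi on this machine and parse its output."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(out)
```

A GPU that stays near 0% utilization during training often points to a data-loading bottleneck or a process that never moved its work to the GPU.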

### Track Training Metrics

Use tools like Weights & Biases or TensorBoard:

```python theme={null}
import wandb

wandb.init(project="my-training")
wandb.log({"loss": loss, "accuracy": acc})
```

## Security

### API Key Security

Keep your API key secure:

```bash theme={null}
# Use environment variables
export TENSORPOOL_KEY="your_key_here"

# Never commit API keys to git. If you keep the key in a file such as .env,
# add that file to .gitignore
echo ".env" >> .gitignore
```
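
In your own automation scripts, reading the key from the environment keeps it out of source files. This is a sketch for user code only and does not describe how the `tp` CLI itself loads credentials; `load_api_key` is a hypothetical helper.

```python
import os


def load_api_key(env_var: str = "TENSORPOOL_KEY") -> str:
    """Return the API key from the environment, failing loudly if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before running this script")
    return key
```

Failing early with a clear message avoids sending requests with an empty key, and avoids the temptation to hard-code a fallback value.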

### Network Security

* Only open necessary ports
* Use SSH key authentication (no passwords)
* Regularly rotate SSH keys
* Monitor for unauthorized access

## Workflow Recommendations

### Development Workflow

1. **Start small**: Use `1xH100` for development
2. **Test your code**: Verify everything works on small scale
3. **Scale up**: Move to larger instances once tested
4. **Monitor**: Watch metrics and GPU utilization
5. **Clean up**: Destroy clusters when done

### Production Workflow

1. **Use storage volumes**: Store all important data on storage volumes
2. **Checkpoint frequently**: Save progress regularly
3. **Monitor costs**: Track usage and spending
4. **Automate**: Script common workflows
5. **Document**: Keep notes on experiments and configurations

## Common Mistakes to Avoid

### Leaving Clusters Running

Don't forget to destroy clusters when you're done:

```bash theme={null}
# Always clean up
tp cluster destroy <cluster_id>
```

### Not Using Storage for Important Data

Don't store critical data only on cluster local storage:

```bash theme={null}
# Use storage for persistent data
tp storage create 500 --name important-data
```

### Choosing the Wrong Instance Type

Don't use oversized instances for small tasks:

```bash theme={null}
# For development, start small
tp cluster create 1xH100 --name dev  # Good
tp cluster create 8xH200 -n 4 --name dev  # Overkill
```

### Ignoring Checkpoints

Don't train without saving checkpoints:

```python theme={null}
# Save regularly
if epoch % checkpoint_interval == 0:
    torch.save(model.state_dict(), checkpoint_path)
```

### Not Monitoring Usage

Don't let resources run without monitoring:

```bash theme={null}
# Check regularly
tp cluster list
tp me
```

## Next Steps

* [Multi-node training guide](/guides/multi-node-training)
* [Storage volumes](/features/storage)
* [CLI reference](/cli/overview)
