TensorPool jobs provide a git-style interface for running, monitoring, and managing tasks across GPU clusters. Jobs are ideal for running many experiments because they are easy to kick off and you pay only for the time your job is running. Jobs are configured with TOML files that specify your training commands, GPU requirements, and output files.

Quick Start

  1. Initialize a job:
tp job init
This creates a tp.config.toml file in your current working directory.
  2. Configure your job by editing the TOML file:
commands = [
    "pip install -r requirements.txt",
    "python train.py",
]
instance_type = "1xH100"

outputs = [
    "checkpoints/",
    "weights.pth",
]

ignore = [
    ".venv",
    "__pycache__/",
    ".git",
]
  3. Submit your job:
tp job push tp.config.toml

Job Configuration

Commands

The commands array specifies shell commands to run sequentially. Each job starts from a fresh virtual environment:
commands = [
    "pip install torch torchvision",
    "python -m pip install -e .",
    "python train.py --epochs 100",
]
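Because you pay for the time a job runs, it can be worth rehearsing the command list locally first. The sketch below runs a list of shell commands in order with Python's `subprocess` module. The helper `run_sequentially` is hypothetical, and stopping at the first failing command is an assumption of this sketch; how TensorPool handles a failed command is not stated here. A local run also does not start from a fresh virtual environment the way a job does.

```python
import subprocess


def run_sequentially(commands: list[str]) -> list[int]:
    """Run each command in order through the shell and return the exit codes.

    Stops after the first non-zero exit code. That stop-on-failure behavior
    is an assumption of this sketch, not documented TensorPool behavior.
    """
    exit_codes = []
    for command in commands:
        result = subprocess.run(command, shell=True)
        exit_codes.append(result.returncode)
        if result.returncode != 0:
            break
    return exit_codes
```

For example, `run_sequentially(["python -m pip install -e .", "python train.py --epochs 1"])` rehearses a shortened version of the list above; the reduced epoch count is only an illustration.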

Instance Types

Specify the GPU for your job:
instance_type = "1xH200"
All instance types are supported.

Output Files

Define which files to save after the job completes. Glob patterns are supported:
outputs = [
    "checkpoints/",           # Entire directory
    "model_*.pth",           # Glob pattern
    "results.json",          # Single file
    "/logs/*",               # All files in logs/
]
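To get a rough idea of which files a set of output patterns would select, you can approximate the matching locally. The sketch below uses Python's `fnmatch` module. TensorPool's exact matching rules are not documented here, so the helper `matches_output` is hypothetical and makes two assumptions: a trailing slash selects everything under that directory, and a leading slash anchors the pattern at the project root.

```python
from fnmatch import fnmatchcase


def matches_output(path: str, patterns: list[str]) -> bool:
    """Return True if a relative file path is selected by any pattern."""
    for pattern in patterns:
        # Assumption: "/logs/*" is relative to the project root.
        pattern = pattern.lstrip("/")
        if pattern.endswith("/"):
            # Assumption: a directory pattern selects everything beneath it.
            if path.startswith(pattern):
                return True
        elif fnmatchcase(path, pattern):
            return True
    return False


patterns = ["checkpoints/", "model_*.pth", "results.json", "/logs/*"]
for candidate in ["checkpoints/epoch1.pt", "model_final.pth", "train.py"]:
    print(candidate, matches_output(candidate, patterns))
```

Note that in `fnmatch`, `*` also matches `/`, so `logs/*` here selects nested files as well; the real service may behave differently.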

Ignored Files

Exclude files from being uploaded with your job:
ignore = [
    ".venv",
    "venv/",
    "__pycache__/",
    ".git",
    "*.pyc",
    "data/",                 # Exclude large datasets
]
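A similar local approximation can show which files would remain after the ignore list is applied. This sketch matches each ignore entry against every component of a path, so `__pycache__/` excludes nested cache directories too. The helpers `is_ignored` and `files_to_upload` are hypothetical, and the component-wise matching is an assumption; TensorPool's actual rules may differ.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_ignored(path: str, ignore: list[str]) -> bool:
    """Return True if any component of the relative path matches an entry."""
    parts = PurePosixPath(path).parts
    for entry in ignore:
        name = entry.rstrip("/")  # "data/" and "data" are treated alike
        if any(fnmatchcase(part, name) for part in parts):
            return True
    return False


def files_to_upload(paths: list[str], ignore: list[str]) -> list[str]:
    """Keep only the paths that no ignore entry excludes."""
    return [path for path in paths if not is_ignored(path, ignore)]


ignore = [".venv", "venv/", "__pycache__/", ".git", "*.pyc", "data/"]
paths = [
    "train.py",
    "data/train.csv",
    ".venv/bin/python",
    "src/__pycache__/model.cpython-311.pyc",
    "src/model.py",
    ".git/config",
]
print(files_to_upload(paths, ignore))  # prints ['train.py', 'src/model.py']
```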

Managing Jobs

List Jobs

View all your jobs:
tp job list
List all jobs in your organization:
tp job list --org

Job Information

Get detailed information about a specific job:
tp job info <job_id>

Monitor Jobs

Stream real-time logs from a running job:
tp job listen <job_id>

Pull Output Files

Download output files from a completed job:
tp job pull <job_id>
Force overwrite existing local files:
tp job pull <job_id> --force

Cancel Jobs

Cancel a running job:
tp job cancel <job_id>
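If you script these commands, a small helper that builds the argument list keeps the subcommands and flags in one place. The sketch below covers only the subcommands and flags shown in this section; the function `tp_job_argv` is hypothetical and is not part of TensorPool.

```python
from typing import Optional

# Subcommands from this section that take a job ID.
NEEDS_JOB_ID = {"info", "listen", "pull", "cancel"}


def tp_job_argv(
    action: str,
    job_id: Optional[str] = None,
    *,
    org: bool = False,
    force: bool = False,
) -> list[str]:
    """Build the argument list for one of the `tp job` commands shown above."""
    if action != "list" and action not in NEEDS_JOB_ID:
        raise ValueError(f"unsupported action: {action}")
    argv = ["tp", "job", action]
    if action in NEEDS_JOB_ID:
        if not job_id:
            raise ValueError(f"{action} requires a job ID")
        argv.append(job_id)
    if org:
        if action != "list":
            raise ValueError("--org is only shown with 'list'")
        argv.append("--org")
    if force:
        if action != "pull":
            raise ValueError("--force is only shown with 'pull'")
        argv.append("--force")
    return argv


print(tp_job_argv("list", org=True))  # prints ['tp', 'job', 'list', '--org']
```

The returned list can be passed to `subprocess.run(argv, check=True)` on a machine where the `tp` CLI is installed.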

Multiple Configurations

You can create multiple configuration files for different experiments:
# Create named configs
tp job init  # Creates tp.config.toml
# Rename to tp.baseline.toml

tp job init  # Creates tp.config1.toml
# Rename to tp.experiment.toml

# Run specific configs
tp job push tp.baseline.toml
tp job push tp.experiment.toml

SSH Keys

Jobs require SSH keys for authentication. TensorPool uses your default SSH key (~/.ssh/id_ed25519) for job operations. Ensure your public key is registered:
tp ssh key create ~/.ssh/id_ed25519.pub
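Before registering, you may want to confirm that the default public key actually exists. This is a minimal sketch; the helper `default_public_key` is hypothetical and only checks for the file named above.

```python
from pathlib import Path
from typing import Optional


def default_public_key(ssh_dir: Optional[Path] = None) -> Optional[Path]:
    """Return the path to id_ed25519.pub in ssh_dir, or None if it is absent."""
    ssh_dir = ssh_dir if ssh_dir is not None else Path.home() / ".ssh"
    candidate = ssh_dir / "id_ed25519.pub"
    return candidate if candidate.is_file() else None


key = default_public_key()
if key is None:
    print("No ~/.ssh/id_ed25519.pub found; generate a key pair first, "
          "for example with: ssh-keygen -t ed25519")
else:
    print(f"Found {key}; register it with: tp ssh key create {key}")
```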
