TensorPool jobs provide a git-style interface for running, monitoring, and managing tasks across GPU clusters. Jobs are ideal for running many experiments because they are easy to kick off, and you pay only for the time your job is running. Jobs are configured with TOML files that specify your training commands, GPU requirements, and output files.

Job Configuration

Commands

The commands array specifies shell commands to run sequentially. Each job starts from a fresh virtual environment:
commands = [
    "pip install torch torchvision",
    "python -m pip install -e .",
    "python train.py --epochs 100",
]

Instance Types

Specify an instance type for the job:
instance_type = "1xH200"
All instance types are supported.

Output Files

Define which files to save after job completion. Supports glob patterns:
outputs = [
    "checkpoints/",           # Entire directory
    "model_*.pth",           # Glob pattern
    "results.json",          # Single file
    "/logs/*",               # All files in logs/
]
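
To get a rough idea of which files a set of `outputs` entries would select, the patterns can be tried locally before pushing. The sketch below uses Python's `fnmatch` module and treats an entry ending in `/` as "everything under that directory". Both choices are assumptions for illustration: TensorPool's own matching rules may differ, and `matches_output` is a hypothetical helper, not part of the `tp` CLI.

```python
from fnmatch import fnmatchcase


def matches_output(path: str, patterns: list[str]) -> bool:
    """Return True if a relative file path is selected by any outputs entry.

    Assumption: an entry ending in "/" selects everything below that
    directory; any other entry is matched as a case-sensitive glob.
    """
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory entry such as "checkpoints/"
            if path.startswith(pattern):
                return True
        elif fnmatchcase(path, pattern):
            # Glob or literal file name such as "model_*.pth" or "results.json"
            return True
    return False


patterns = ["checkpoints/", "model_*.pth", "results.json"]
for candidate in ["checkpoints/epoch_1.pt", "model_best.pth", "notes.txt"]:
    print(candidate, matches_output(candidate, patterns))
```

Note that `fnmatch`-style `*` also matches path separators, which is looser than shell globbing, so treat the result as a preview rather than a guarantee.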

Ignored Files

Exclude files from being uploaded with your job:
ignore = [
    ".venv",
    "venv/",
    "__pycache__/",
    ".git",
    "*.pyc",
    "data/",                 # Exclude large datasets
]

Job Statuses

Jobs progress through various statuses throughout their lifecycle:
| Status | Description |
| --- | --- |
| Pending | Job is uploading and waiting to be assigned to a cluster. |
| Running | Job commands are being executed. |
| Completed | All job commands have returned an exit code of 0 and output files have been saved. |
| Error | User-level problem: a command has returned a non-zero exit code. Check the logs for details. |
| Failed | System-level problem: the cluster executing the job has failed (e.g., node failure, GPU error). TensorPool will investigate. |
| Canceling | Job cancellation is in progress. The job outputs are being saved and the cluster is being shut down gracefully. |
| Canceled | Job was successfully canceled. |
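
For scripts that react to a job's status, the lifecycle above can be modeled as a small enum. This is a sketch under stated assumptions: the status names come from the table, but how the `tp` CLI prints or exposes them is not specified here, and treating Completed, Error, Failed, and Canceled as end states is an inference from the descriptions. `JobStatus` and `suggested_action` are hypothetical names.

```python
from enum import Enum


class JobStatus(Enum):
    """Job statuses, named as in the table above."""
    PENDING = "Pending"
    RUNNING = "Running"
    COMPLETED = "Completed"
    ERROR = "Error"
    FAILED = "Failed"
    CANCELING = "Canceling"
    CANCELED = "Canceled"


# Assumption: these four are end states; the table implies this but does not say so.
TERMINAL_STATUSES = frozenset(
    {JobStatus.COMPLETED, JobStatus.ERROR, JobStatus.FAILED, JobStatus.CANCELED}
)


def suggested_action(status: JobStatus) -> str:
    """Map a status to the follow-up the table's descriptions point to."""
    if status is JobStatus.COMPLETED:
        return "pull outputs with `tp job pull <job_id>`"
    if status is JobStatus.ERROR:
        return "check the logs and fix the failing command"
    if status is JobStatus.FAILED:
        return "system-level failure; no change to your code is implied"
    if status is JobStatus.CANCELED:
        return "no further state changes expected"
    return "wait; the job is still in progress"


print(suggested_action(JobStatus("Error")))
```

The distinction that matters most in practice is Error versus Failed: the first points at your commands, the second at the cluster.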

Managing Jobs

List Jobs

View all your jobs:
tp job list
List all jobs in your organization:
tp job list --org

Job Information

Get detailed information about a specific job:
tp job info <job_id>

Monitor Jobs

Stream real-time logs from a running job:
tp job listen <job_id>

Pull Output Files

Download output files from a completed job:
tp job pull <job_id>
Force overwrite existing local files:
tp job pull <job_id> --force

Cancel Jobs

Cancel a running job:
tp job cancel <job_id>
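
The management commands above can also be driven from a script. The following sketch wraps them with Python's `subprocess` module. It uses only the subcommands and flags shown in this section (`list`, `--org`, `info`, `listen`, `pull`, `--force`, `cancel`). The helper names and the job ID `abc123` are hypothetical, and it assumes `tp` is on your `PATH` and exits with a non-zero code on failure, which is not confirmed here.

```python
import subprocess
from typing import Optional, Sequence

# Subcommands from this section that take a job ID; `list` takes none.
NEEDS_JOB_ID = {"info", "listen", "pull", "cancel"}


def build_tp_job_args(
    action: str, job_id: Optional[str] = None, flags: Sequence[str] = ()
) -> list[str]:
    """Build the argument list for a `tp job ...` command without running it."""
    if action == "list":
        if job_id is not None:
            raise ValueError("`tp job list` does not take a job ID")
        return ["tp", "job", "list", *flags]
    if action not in NEEDS_JOB_ID:
        raise ValueError(f"unsupported action: {action}")
    if not job_id:
        raise ValueError(f"`tp job {action}` requires a job ID")
    # Flags follow the job ID, as in `tp job pull <job_id> --force`.
    return ["tp", "job", action, job_id, *flags]


def run_tp_job(
    action: str, job_id: Optional[str] = None, flags: Sequence[str] = ()
) -> subprocess.CompletedProcess:
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_tp_job_args(action, job_id, flags), check=True)


# "abc123" is a placeholder job ID used only for illustration.
print(build_tp_job_args("pull", "abc123", ["--force"]))
print(build_tp_job_args("list", flags=["--org"]))
```

Keeping argument construction separate from execution makes the wrapper easy to test without a TensorPool account or network access.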

Multiple Configurations

You can create multiple configuration files for different experiments:
# Create named configs
tp job init  # Creates tp.config.toml
# Rename to tp.baseline.toml

tp job init  # Creates tp.config1.toml
# Rename to tp.experiment.toml

# Run specific configs
tp job push tp.baseline.toml
tp job push tp.experiment.toml

Next Steps