Jobs

TensorPool jobs provide a git-style interface for running, monitoring, and managing tasks across GPU clusters. Jobs are ideal for running many experiments because they are easy to kick off and you pay only for the time your job is running. Jobs are configured with TOML files that specify your training commands, GPU requirements, and output files.

Job Configuration

Commands

The commands array specifies shell commands to run sequentially. Each job starts from a fresh virtual environment:

commands = [
    "pip install torch torchvision",
    "python -m pip install -e .",
    "python train.py --epochs 100",
]
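Before pushing a job, it can help to dry-run the same command list locally. The sketch below is a minimal local stand-in, not TensorPool's runner: it executes shell commands in order and stops at the first non-zero exit code. Stopping at the first failure is an assumption; the source only states that commands run sequentially. The function name `run_commands` is hypothetical.

```python
import subprocess


def run_commands(commands):
    """Run shell commands in order.

    Returns 0 if every command succeeds, otherwise the exit code of the
    first failing command. Stopping early is an assumption about how a
    sequential runner behaves, not documented TensorPool behavior.
    """
    for command in commands:
        result = subprocess.run(command, shell=True)
        if result.returncode != 0:
            return result.returncode
    return 0
```

For example, `run_commands(["python -m pip install -e .", "python train.py --epochs 1"])` gives a quick check that the install step and a short training run both exit cleanly on your machine.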

Instance Types

Specify an instance type for the job:

instance_type = "1xH200"

All instance types are supported.

Output Files

Define which files to save after job completion. Supports glob patterns:

outputs = [
    "checkpoints/",    # Entire directory
    "model_*.pth",     # Glob pattern
    "results.json",    # Single file
    "/logs/*",         # All files in logs/
]
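To preview which local files a set of output patterns would select, a rough approximation can be written with Python's `fnmatch` module. This is only a sketch: TensorPool's actual matching rules are not specified here, `matches_output` is a hypothetical helper, and treating a trailing `/` as "everything under this directory" is an assumption based on the comment in the example above. Note that `fnmatch` lets `*` match `/`, which may differ from the real behavior.

```python
import fnmatch
from pathlib import PurePosixPath


def matches_output(path, patterns):
    """Return True if a relative path is selected by any output pattern."""
    path = PurePosixPath(path).as_posix()
    for pattern in patterns:
        if pattern.endswith("/"):
            # Directory entry: assume it selects everything beneath it.
            if path.startswith(pattern):
                return True
        elif fnmatch.fnmatchcase(path, pattern):
            # Plain file name or glob pattern.
            return True
    return False


patterns = ["checkpoints/", "model_*.pth", "results.json"]
selected = [
    p for p in ["checkpoints/epoch_10.pt", "model_final.pth", "train.py"]
    if matches_output(p, patterns)
]
```

Here `selected` keeps the checkpoint and the model file and drops `train.py`, which none of the patterns cover.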

Ignored Files

Exclude files from being uploaded with your job:

ignore = [
    ".venv",
    "venv/",
    "__pycache__/",
    ".git",
    "*.pyc",
    "data/",                 # Exclude large datasets
]

Job Statuses

Jobs move through the following statuses during their lifecycle:

Status	Description
Pending	Job is uploading and waiting to be assigned to a cluster.
Running	Job commands are being executed.
Completed	All job commands have returned an exit code of 0 and output files have been saved.
Error	User-level problem: a command has returned a non-zero exit code. Check the logs for details.
Failed	System-level problem: the cluster executing the job has failed (e.g., node failure, GPU error). TensorPool will investigate.
Canceling	Job cancellation is in progress. The job outputs are being saved and the cluster is being shut down gracefully.
Canceled	Job was successfully canceled.
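When scripting around jobs, the statuses in the table can be modeled as an enumeration. The sketch below is illustrative only: the names `JobStatus`, `TERMINAL_STATUSES`, and `is_terminal` are hypothetical, and treating Completed, Error, Failed, and Canceled as terminal is an inference from the descriptions above rather than a documented guarantee.

```python
from enum import Enum


class JobStatus(Enum):
    PENDING = "Pending"
    RUNNING = "Running"
    COMPLETED = "Completed"
    ERROR = "Error"
    FAILED = "Failed"
    CANCELING = "Canceling"
    CANCELED = "Canceled"


# Assumption: a job in one of these statuses will not change status again.
TERMINAL_STATUSES = {
    JobStatus.COMPLETED,
    JobStatus.ERROR,
    JobStatus.FAILED,
    JobStatus.CANCELED,
}


def is_terminal(status):
    """Return True if the job is assumed to be finished in this status."""
    return status in TERMINAL_STATUSES
```

A polling script could, for instance, keep waiting while `is_terminal(status)` is false and inspect the logs when the final status is `JobStatus.ERROR`.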

Managing Jobs

List Jobs

View all your jobs:

tp job list

List all jobs in your organization:

tp job list --org

Job Information

Get detailed information about a specific job:

tp job info <job_id>

Monitor Jobs

Stream real-time logs from a running job:

tp job listen <job_id>

Pull Output Files

Download output files from a completed job:

tp job pull <job_id>

Force overwrite existing local files:

tp job pull <job_id> --force

Cancel Jobs

Cancel a running job:

tp job cancel <job_id>

Multiple Configurations

You can create multiple configuration files for different experiments:

# Create named configs
tp job init  # Creates tp.config.toml
# Rename to tp.baseline.toml

tp job init  # Creates tp.config1.toml
# Rename to tp.experiment.toml

# Run specific configs
tp job push tp.baseline.toml
tp job push tp.experiment.toml
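With several config files, the push step can be scripted. The sketch below wraps the `tp job push <file>` command shown above in a small Python helper. It assumes the `tp` CLI is on your `PATH` and that it exits with a non-zero code when a push fails, which the source does not confirm; `push_all` and its `run` parameter are hypothetical names.

```python
import subprocess


def push_all(config_paths, run=subprocess.run):
    """Run `tp job push <path>` for each config file.

    Returns the list of config paths whose push exited non-zero.
    `run` is injectable so the helper can be exercised without the CLI.
    """
    failed = []
    for path in config_paths:
        result = run(["tp", "job", "push", path])
        if result.returncode != 0:
            failed.append(path)
    return failed
```

Calling `push_all(["tp.baseline.toml", "tp.experiment.toml"])` submits both experiments in order and reports any config that did not push cleanly.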

Next Steps

Learn about job commands
Explore multi-node training for distributed workloads
Manage SSH keys for cluster access
