# Clusters

> Deploy and manage GPU clusters

TensorPool makes it easy to deploy and manage GPU clusters of any size, from single GPUs to large multi-node configurations.

## Core Commands

* `tp cluster create` - Deploy a new GPU cluster
* `tp cluster list` - View all your clusters
* `tp cluster info <cluster_id>` - Get detailed information about a cluster
* `tp cluster edit <cluster_id>` - Edit cluster settings
* `tp cluster attach <cluster_id> <storage_id>` - Attach a storage volume to a cluster
* `tp cluster detach <cluster_id> <storage_id>` - Detach a storage volume from a cluster
* `tp cluster destroy <cluster_id>` - Terminate a cluster

## Creating Clusters

Create clusters with `tp cluster create`. TensorPool supports both single-node and multi-node cluster configurations.

## Single-Node Clusters

Single-node clusters are ideal for development, experimentation, and smaller training workloads. They provide direct access to GPU resources without the complexity of distributed training.

### Supported Instance Types

Single-node clusters support a range of GPU configurations, for example:

```bash theme={null}
# Single H100
tp cluster create 1xH100 -i ~/.ssh/id_ed25519.pub

# Single node with 8x H200
tp cluster create 8xH200 -i ~/.ssh/id_ed25519.pub

# Single node with 8x B200
tp cluster create 8xB200 -i ~/.ssh/id_ed25519.pub
```

<Note>
  The `-i` flag is optional if you have SSH keys saved on your account via `tp me sshkey`.
</Note>

### Accessing Single-Node Clusters

Single-node clusters provide direct SSH access. Once your cluster is ready:

```bash theme={null}
# Get cluster information to find the instance ID
tp cluster info <cluster_id>

# SSH directly into the instance
tp ssh <instance_id>

# Run your training script directly on the node
python train.py
```

## Multi-Node Clusters

Multi-node clusters are designed for distributed training workloads that require scaling across multiple machines. All multi-node clusters come with **SLURM preinstalled** for job scheduling and resource management.

### Supported Instance Types

Multi-node support is currently available for:

* **8xH200** - 2 or more nodes, each with 8 H200 GPUs
* **8xB200** - 2 or more nodes, each with 8 B200 GPUs

### Creating Multi-Node Clusters

Create multi-node clusters by specifying the number of nodes with the `-n` flag:

```bash theme={null}
# 2-node cluster with 8xH200 each (16 GPUs total)
tp cluster create 8xH200 -i ~/.ssh/id_ed25519.pub -n 2

# 4-node cluster with 8xB200 each (32 GPUs total)
tp cluster create 8xB200 -i ~/.ssh/id_ed25519.pub -n 4
```

<Note>
  The `-i` flag is optional if you have SSH keys saved on your account via `tp me sshkey`.
</Note>

<Note>
  Multi-node support is currently available for **8xH200** and **8xB200** instance types only.
</Note>
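
The node-count rules above can be checked before any hardware is requested. The following is a minimal sketch, not part of the `tp` CLI: `build_create_cmd` and `total_gpus` are hypothetical helpers, and `build_create_cmd` only prints the command it would run, using the `-i` and `-n` flags shown above. It assumes instance types follow the `<count>x<GPU>` naming used on this page and treats a node count of 1 as a single-node cluster.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not part of the tp CLI): validate a request and print
# the matching `tp cluster create` command without executing it.
build_create_cmd() {
  local instance_type="$1"   # e.g. 8xH200
  local nodes="${2:-1}"      # number of nodes, defaults to 1 (single-node)
  local ssh_key="${3:-}"     # optional public key path for -i

  # Multi-node clusters are limited to the 8xH200 and 8xB200 instance types.
  if [ "$nodes" -gt 1 ]; then
    case "$instance_type" in
      8xH200|8xB200) ;;
      *)
        echo "error: multi-node is only available for 8xH200 and 8xB200" >&2
        return 1
        ;;
    esac
  fi

  local cmd="tp cluster create $instance_type"
  if [ -n "$ssh_key" ]; then
    cmd="$cmd -i $ssh_key"
  fi
  if [ "$nodes" -gt 1 ]; then
    cmd="$cmd -n $nodes"
  fi
  echo "$cmd"
}

# Total GPU count: GPUs per node (the number before the "x") times node count.
total_gpus() {
  local instance_type="$1"
  local nodes="${2:-1}"
  local per_node="${instance_type%%x*}"
  echo $(( per_node * nodes ))
}

build_create_cmd 8xH200 2 ~/.ssh/id_ed25519.pub
total_gpus 8xB200 4   # prints 32
```

Printing the command instead of running it keeps the check free of side effects; once the output looks right, it can be pasted into the terminal as-is.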

### Accessing Multi-Node Clusters

Work on multi-node clusters is scheduled through the preinstalled and configured **SLURM** setup. For detailed information about using SLURM for distributed training, see the [Multi-Node Training Guide](/guides/multi-node-training).

#### Cluster Architecture

Multi-node clusters use a jumphost architecture for network access. Each cluster consists of:

* **Jumphost**: `{cluster_id}-jumphost` - The SLURM login/controller node with a public IP address
* **Worker Nodes**: `{cluster_id}-0`, `{cluster_id}-1`, etc. - Compute nodes with private IP addresses only
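
The naming scheme above makes node names predictable from the cluster ID and node count. As a rough sketch, the hypothetical helper below (not part of the `tp` CLI) prints the expected names. It assumes worker indices run contiguously from `0` to `n-1`, and `my-cluster` is a placeholder ID; `tp cluster info <cluster_id>` remains the authoritative source.

```bash
#!/usr/bin/env bash

# Hypothetical helper: list the node names of a multi-node cluster,
# following the {cluster_id}-jumphost / {cluster_id}-N scheme.
cluster_node_names() {
  local cluster_id="$1"
  local nodes="$2"
  local i=0

  # The jumphost is the only node with a public IP address.
  echo "${cluster_id}-jumphost"

  # Workers are assumed to be numbered from 0 upward.
  while [ "$i" -lt "$nodes" ]; do
    echo "${cluster_id}-${i}"
    i=$(( i + 1 ))
  done
}

cluster_node_names my-cluster 2
```

For a 2-node cluster this prints the jumphost name followed by the two worker names, one per line.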

#### Accessing Your Cluster

Follow these steps to access your multi-node cluster:

1. **Get cluster information** to see all nodes and their instance IDs:

```bash theme={null}
tp cluster info <cluster_id>
```

2. **SSH into the jumphost** (this is the only node with direct public access):

```bash theme={null}
tp ssh <jumphost-instance-id>
```

3. **Access worker nodes** from the jumphost. You can use either the instance name or private IP:

```bash theme={null}
# Using instance name (replace <cluster_id> with your actual cluster ID)
ssh <cluster_id>-0
ssh <cluster_id>-1

# Or using the private IP address (found in cluster info)
ssh <worker-node-private-ip>
```

<Note>
  The jumphost serves as the SLURM login node where you submit distributed training jobs. Worker nodes are only accessible from within the cluster network.
</Note>

## Container Images

When creating a cluster, you can optionally specify a **container image** to run on your nodes. Container images are pre-built, GPU-ready Docker environments with CUDA, Python, and common ML tooling preinstalled, so you can start training immediately without manual setup.

### Using Container Images

Pass the `--container` flag when creating a cluster:

```bash theme={null}
# Minimal GPU environment
tp cluster create 1xH100 --container base

# Full PyTorch ML stack
tp cluster create 8xH200 --container pytorch
```

<Warning>
  Container images are only supported on single-node clusters. Multi-node clusters do not support `--container`.
</Warning>

### Available Images

| Image       | Flag Value | What's Included                                                                                                                                                    |
| ----------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Base**    | `base`     | CUDA 12.8 + cuDNN, Python 3.12, git, curl, rsync, rclone, unzip, build-essential                                                                                   |
| **PyTorch** | `pytorch`  | Everything in Base + PyTorch 2.6, torchvision, torchaudio, transformers, accelerate, datasets, peft, safetensors, wandb, tensorboard, scikit-learn, einops, opencv |

<Note>
  Clusters created without `--container` are unaffected: they provide bare-metal SSH access with no container layer.
</Note>
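
The constraints in this section (single-node only, and one of the two listed images) are easy to check up front. The snippet below is a minimal sketch of such a pre-flight check; `check_container` is a hypothetical helper rather than a `tp` subcommand, and the accepted values are simply the `base` and `pytorch` flag values from the table above.

```bash
#!/usr/bin/env bash

# Hypothetical pre-flight check for the --container flag.
# Returns 0 when the combination is allowed by the rules in this section.
check_container() {
  local image="$1"
  local nodes="${2:-1}"   # defaults to a single-node cluster

  # Container images are only supported on single-node clusters.
  if [ "$nodes" -gt 1 ]; then
    echo "error: --container is not supported on multi-node clusters" >&2
    return 1
  fi

  # Accept only the flag values listed in the Available Images table.
  case "$image" in
    base|pytorch)
      return 0
      ;;
    *)
      echo "error: unknown container image '$image'" >&2
      return 1
      ;;
  esac
}

# Print (not run) the create command only when the check passes.
check_container pytorch 1 && echo "tp cluster create 8xH200 --container pytorch"
```

If more images are added later, the `case` pattern is the only line that needs to change.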

## Cluster and Instance Statuses

A cluster's status is derived from the statuses of its individual instances. Each instance within a cluster progresses through its own lifecycle, and the cluster's displayed status reflects the highest-priority status among all its instances.

### Instance Status Lifecycle

Each instance in a cluster follows this lifecycle:

```mermaid theme={null}
stateDiagram-v2
    [*] --> PENDING: Instance requested
    PENDING --> PROVISIONING: Resources allocated
    PENDING --> FAILED: No capacity
    PROVISIONING --> CONFIGURING: Instance provisioned
    CONFIGURING --> CONTAINER_CREATING: Container image specified
    CONFIGURING --> RUNNING: No container image
    CONTAINER_CREATING --> RUNNING: Container ready
    CONTAINER_CREATING --> FAILED: Container error
    RUNNING --> CONFIGURING: Storage attached/detached
    RUNNING --> DESTROYING: User destroys instance
    PROVISIONING --> FAILED: Provisioning error
    CONFIGURING --> FAILED: Configuration error
    RUNNING --> FAILED: System failure
    DESTROYING --> DESTROYED: Cleanup complete
    DESTROYED --> [*]
    FAILED --> [*]
```
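
The diagram above can also be read as a lookup table of allowed transitions. As an illustrative sketch (the function name `valid_transition` is hypothetical and nothing here calls the TensorPool API), the following encodes exactly the edges drawn in the diagram, which can be handy when sanity-checking status changes observed while polling a cluster.

```bash
#!/usr/bin/env bash

# Return 0 if FROM -> TO is a transition shown in the lifecycle diagram.
valid_transition() {
  case "$1:$2" in
    PENDING:PROVISIONING|PENDING:FAILED) return 0 ;;
    PROVISIONING:CONFIGURING|PROVISIONING:FAILED) return 0 ;;
    CONFIGURING:CONTAINER_CREATING|CONFIGURING:RUNNING|CONFIGURING:FAILED) return 0 ;;
    CONTAINER_CREATING:RUNNING|CONTAINER_CREATING:FAILED) return 0 ;;
    RUNNING:CONFIGURING|RUNNING:DESTROYING|RUNNING:FAILED) return 0 ;;
    DESTROYING:DESTROYED) return 0 ;;
    *) return 1 ;;   # anything else is not drawn in the diagram
  esac
}

# RUNNING -> CONFIGURING happens when storage is attached or detached.
valid_transition RUNNING CONFIGURING && echo "allowed"
# An instance never jumps straight from PENDING to RUNNING.
valid_transition PENDING RUNNING || echo "not in diagram"
```

`DESTROYED` and `FAILED` have no outgoing edges, so any transition starting from them is rejected.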

### Status Definitions

| Status                  | Description                                                                                                          |
| ----------------------- | -------------------------------------------------------------------------------------------------------------------- |
| **PENDING**             | Instance creation request has been submitted and is queued for provisioning.                                         |
| **PROVISIONING**        | Instance has been allocated and is being provisioned.                                                                |
| **CONFIGURING**         | Instance is being configured with software, drivers, networking, and storage.                                        |
| **CONTAINER\_CREATING** | Container image is being bootstrapped on the instance. Only occurs when a cluster is created with a container image. |
| **RUNNING**             | Instance is ready for use.                                                                                           |
| **DESTROYING**          | Instance shutdown is in progress; resources are being deallocated.                                                   |
| **DESTROYED**           | Instance has been successfully terminated.                                                                           |
| **FAILED**              | Instance failed at some stage (e.g., no capacity, provisioning or configuration error, hardware failure).           |

### Cluster Status Priority

Priority order, from highest to lowest:

1. **FAILED** - Any failed instance causes the cluster to show as failed
2. **DESTROYING** - Cluster is being torn down
3. **PENDING** - Instances are waiting to be provisioned
4. **PROVISIONING** - Instances are being provisioned
5. **CONFIGURING** - Instances are being configured
6. **CONTAINER\_CREATING** - Container image is being bootstrapped
7. **RUNNING** - All instances are running
8. **DESTROYED** - All instances have been terminated

For example, if a cluster has 3 instances where 2 are `RUNNING` and 1 is `CONFIGURING`, the cluster status will show as `CONFIGURING`.
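
The priority rule amounts to scanning the list from the top and reporting the first status that any instance currently has. The sketch below illustrates that logic; `cluster_status` is a hypothetical helper for reasoning about the rule, not how TensorPool computes it internally, and it takes the instance statuses as plain arguments.

```bash
#!/usr/bin/env bash

# Print the cluster status implied by a list of instance statuses,
# using the priority order above (highest first).
cluster_status() {
  local candidate status
  for candidate in FAILED DESTROYING PENDING PROVISIONING CONFIGURING \
                   CONTAINER_CREATING RUNNING DESTROYED; do
    for status in "$@"; do
      if [ "$status" = "$candidate" ]; then
        echo "$candidate"
        return 0
      fi
    done
  done
  echo "error: no recognized instance status given" >&2
  return 1
}

cluster_status RUNNING RUNNING CONFIGURING   # prints CONFIGURING
```

Because `FAILED` is checked first, a single failed instance is enough for the whole cluster to be reported as `FAILED`, regardless of how many other instances are `RUNNING`.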

<Note>
  Clusters targeted by jobs with `--teardown` will be automatically destroyed after the job completes or is canceled.
</Note>

## Next Steps

* Explore the available [instance types](/resources/instance-types)
* Learn about [storage volumes](/features/storage) for persistent data
* Read the [CLI reference](/cli/cluster-commands) for detailed command options
