TensorPool provides high-performance, persistent storage that can be attached to your clusters.

Storage Types

TensorPool offers two storage volume types:
Feature | Fast Volumes | Flex Volumes
Cluster Support | Multi-node only (2+ nodes) | All cluster types
POSIX Compliant | Yes | No

Fast Storage Volumes

Fast storage volumes are high-performance NFS-based volumes designed for distributed training on multi-node clusters:
  • Multi-node clusters only: Requires clusters with 2 or more nodes
  • High aggregate performance: Up to 300 GB/s aggregate read throughput, 150 GB/s aggregate write throughput, 1.5M read IOPS, 750k write IOPS
  • Fixed volume size: Volume size must be specified at creation, but it can be increased at any time. See pricing for details.
  • Ideal for: datasets for distributed training, storing model checkpoints
Expected single client performance on a 100TB Fast Storage Volume:
Metric | Performance
Read Throughput | 6,000 MB/s
Write Throughput | 2,000 MB/s
Read IOPS | 6,000
Write IOPS | 2,000
Avg Read Latency | 5ms
Avg Write Latency | 15ms
p99 Read Latency | 9ms
p99 Write Latency | 30ms
Fast storage volume performance scales with volume size. Larger volumes provide higher throughput and IOPS.

Flex Storage Volumes

Flex storage volumes are object-storage-backed volumes designed for TensorPool clusters used as workbenches:
  • All cluster types: Works with all cluster types, not just multi-node clusters
  • Backed by object storage: Cost-effective for large datasets
  • Unlimited volume size: Billed on usage with no size limit. See pricing for details.
  • Ideal for: Data archival, researcher collaboration, general persistent storage
  • Not ideal for: performance-critical workloads, distributed training
Flex Storage Volumes are mounted object storage buckets exposed through an optimized FUSE mount, trading performance for flexibility. Peak single-client performance (cached reads, large files):
Metric | Performance
Read Throughput | 2,300 MB/s
Write Throughput | 3,700 MB/s
Read IOPS | 2,300
Write IOPS | 3,600
Avg Read Latency | 10ms
Avg Write Latency | 8ms
p99 Read Latency | 19ms
p99 Write Latency | 16ms
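These numbers can be sanity-checked on a live mount with a simple dd write test. The sketch below uses a stand-in path under /tmp; on a real cluster you would point it at the Flex mount point instead:

```shell
# Write 64 MiB of zeros and let dd report throughput on completion.
# TARGET is a stand-in path; on a cluster, use the Flex mount point instead.
TARGET=/tmp/flex-throughput-test
dd if=/dev/zero of="$TARGET" bs=1M count=64 conv=fsync

# Clean up the test file.
rm -f "$TARGET"
```

conv=fsync forces the data to be flushed before dd reports, so the measured time includes the actual write to the backing store rather than just the page cache.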
Flex Storage Volumes do not have the performance characteristics you may expect from traditional shared filesystems (like NFS or block storage). Due to the nature of object storage, every file operation incurs a fixed overhead regardless of file size:
  • FUSE overhead: User-space/kernel context switches per syscall
  • S3 API overhead: HTTP request/response cycle
For large files this overhead is negligible. For small files (under 100KB), the overhead dominates the operation time. For example, a simple touch file.txt translates to three S3 API calls (HeadObject, PutObject, ListObjectsV2) under the hood. Traditionally cheap operations like ls are also time-intensive: because object storage has no directory hierarchy, listing requires querying all objects with a matching prefix.
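One common mitigation is to bundle many small files into a single archive before writing them to the volume, so the per-file overhead is paid once rather than thousands of times. A minimal sketch (paths are illustrative, not TensorPool conventions):

```shell
# Create a few small sample files (stand-ins for a real dataset).
mkdir -p /tmp/small-files-demo/src
for i in 1 2 3; do
  echo "sample $i" > "/tmp/small-files-demo/src/file_$i.txt"
done

# Pack them into one archive: a single large object write
# instead of many small ones.
tar -czf /tmp/small-files-demo/dataset.tar.gz -C /tmp/small-files-demo/src .

# Unpack on the consuming side (e.g., to instance-local disk) before use.
mkdir -p /tmp/small-files-demo/unpacked
tar -xzf /tmp/small-files-demo/dataset.tar.gz -C /tmp/small-files-demo/unpacked
```

The archive is also friendlier to rclone-style parallel transfers, since throughput scales better over a few large objects than over many tiny ones.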
Flex storage volumes are not POSIX compliant. Unsupported features:
  • Hard links
  • Setting file permissions (chmod)
  • Sticky, set-user-ID (SUID), and set-group-ID (SGID) bits
  • Updating the modification timestamp (mtime)
  • Creating and using FIFOs (first-in-first-out) pipes
  • Creating and using Unix sockets
  • Obtaining exclusive file locks
  • Unlinking an open file while it is still readable
While symlinks are supported, their use is discouraged. Symlink targets may not exist across all clusters, which can cause unexpected behavior. The use of small files (under 100KB) is discouraged due to the request-based nature of object storage. Setting up Python virtual environments within a Flex volume is not recommended due to a virtual environment's use of symlinks and its large number (~1,000) of small files.
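If you need a Python environment on a workbench cluster, one approach is to create the virtual environment on instance-local disk and keep only datasets and artifacts on the Flex volume. A sketch, with an illustrative local path (not a TensorPool convention):

```shell
# Create the venv on local disk rather than on the Flex mount,
# avoiding symlinks and thousands of small-file writes to object storage.
python3 -m venv /tmp/demo-venv

# Confirm the interpreter works.
/tmp/demo-venv/bin/python -c 'import sys; print(sys.version_info[0])'   # prints: 3
```

The environment is then cheap to recreate on each cluster from a requirements file, while the data it operates on stays on the persistent volume.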

Moving Data Into Flex Storage Volumes

To maximize performance when copying data into a Flex Storage Volume, we recommend using rclone with parallelized transfers to take advantage of TensorPool's optimized FUSE mount:
rclone copy /path/to/source/ /path/to/destination/ \
  --transfers 8 \
  --checkers 8 \
  --ignore-checksum \
  --progress
With these parameters, rclone copy transfers multiple files concurrently, significantly improving throughput compared to cp or rsync, which transfer files sequentially. rclone is installed by default on all TensorPool clusters with Flex Storage Volumes.

Core Commands

  • tp storage create -t <type> [-s <size_gb>] - Create a new storage volume
  • tp storage list - View all your storage volumes
  • tp cluster attach <cluster_id> <storage_id> - Attach storage to a cluster
  • tp cluster detach <cluster_id> <storage_id> - Detach storage from a cluster
  • tp storage destroy <storage_id> - Delete a storage volume

Creating Storage Volumes

Create storage volumes by specifying type (fast or flex) and size:
# Create a 500GB fast volume
tp storage create -t fast -s 500 --name training-data

# Create a flex volume (size not required)
tp storage create -t flex --name models

Attaching and Detaching

Attach storage volumes to a cluster:
tp cluster attach <cluster_id> <storage_id>
Detach when you’re done:
tp cluster detach <cluster_id> <storage_id>
Fast storage volumes can only be attached to multi-node clusters (clusters with 2 or more nodes). Flex storage works with all cluster types.

Storage Locations

Volume Mount Points

When you attach a storage volume to your cluster, it will be mounted on each instance at:
/mnt/<storage-type>-<storage_id>
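For scripting against an attached volume, the mount path can be derived from the volume type and id. A sketch with hypothetical values:

```shell
# Hypothetical values; substitute your real volume type and id.
STORAGE_TYPE=fast
STORAGE_ID=abc123

# Mount paths follow the /mnt/<storage-type>-<storage_id> convention.
MOUNT_POINT="/mnt/${STORAGE_TYPE}-${STORAGE_ID}"
echo "$MOUNT_POINT"   # prints: /mnt/fast-abc123
```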

Example Workflow

# 1. Create a 1TB fast storage volume
tp storage create -t fast -s 1000 --name dataset

# 2. Attach the volume to a cluster
tp cluster attach <cluster_id> <storage_id>

# 3. SSH into your cluster and access the data
tp ssh <instance_id>
cd /mnt/fast-<storage_id>

# 4. When done, detach the volume
tp cluster detach <cluster_id> <storage_id>

# 5. Destroy the volume when no longer needed
tp storage destroy <storage_id>

Storage Statuses

Storage volumes move through the following statuses over their lifecycle:
Status | Description
PENDING | Storage creation request has been submitted and is queued for provisioning.
PROVISIONING | Storage has been allocated and is being provisioned.
READY | Storage is ready for use.
ATTACHING | Storage is being attached to a cluster.
DETACHING | Storage is being detached from a cluster.
DESTROYING | Storage deletion is in progress; resources are being deallocated.
DESTROYED | Storage has been successfully deleted.
FAILED | A system-level problem occurred (e.g., no capacity or hardware failure).
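Scripts that create a volume and then attach it typically wait for the READY status before proceeding. The sketch below polls a stand-in get_status function, since the exact output format of tp storage list is not shown here; in practice you would replace the stub with a parse of that command's output:

```shell
# get_status is a stub standing in for parsing `tp storage list` output;
# it is NOT a real TensorPool command.
get_status() {
  echo "READY"
}

# Poll until the volume reports READY, giving up after 60 attempts.
attempts=0
until [ "$(get_status)" = "READY" ]; do
  attempts=$((attempts + 1))
  [ "$attempts" -ge 60 ] && { echo "timed out waiting for READY" >&2; exit 1; }
  sleep 5
done
echo "volume is READY"
```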

Best Practices

  • Data Persistence: Use storage volumes for important data that needs to persist across cluster lifecycles
  • Shared Data: Attach the same storage volume to multiple clusters to share datasets
  • Choose the Right Type: Use fast storage for multi-node distributed training workloads; use flex for cost-effective persistent storage

Next Steps