The TensorPool Agent is currently in beta. We’d love your feedback!
Target Failures
The TensorPool Agent is designed to address runtime errors that occur deep into training:
- GPU hardware faults: Xid errors (79, 63, 48, etc.)
- Distributed communication failures, NCCL errors
- Infrastructure problems: hardware failures, kernel panics
- Storage problems: I/O errors, checkpoint corruption, S3 timeouts
- Network problems: mounted object storage bucket issues
- GPU memory problems: CUDA out of memory, memory leaks, gradient explosion
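
As a rough illustration of how log lines might map onto the categories above, here is a minimal sketch. The category names, the regular expressions, and the `classify_log_line` helper are assumptions made for this example; they are not the TensorPool Agent's actual detection rules.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only; real logs vary by driver,
# framework, and storage backend.
FAILURE_PATTERNS = [
    # NVIDIA driver Xid messages in kernel logs, e.g. "NVRM: Xid (PCI:...): 79, ..."
    ("gpu_hardware", re.compile(r"NVRM: Xid \(.*?\): \d+")),
    # NCCL failures as commonly surfaced in training logs
    ("distributed_communication", re.compile(r"NCCL (?:error|WARN)|ncclSystemError")),
    # Out-of-memory errors raised by CUDA
    ("gpu_memory", re.compile(r"CUDA out of memory|OutOfMemoryError")),
    # Low-level I/O failures reported by the operating system
    ("storage", re.compile(r"Input/output error")),
]


def classify_log_line(line: str) -> Optional[str]:
    """Return the first matching failure category, or None if nothing matches."""
    for category, pattern in FAILURE_PATTERNS:
        if pattern.search(line):
            return category
    return None
```

For example, `classify_log_line("RuntimeError: CUDA out of memory.")` falls into the `gpu_memory` bucket, while an ordinary progress line returns `None`.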
How It Works
- Registration: On the TensorPool Agent dashboard, provide credentials for your job scheduler of choice (Slurm, K8s, or TensorPool Jobs), then whitelist the actions the agent is allowed to take on your behalf.
- Monitoring: The training job is continuously monitored for failure.
- Recovery (if the job fails): The TensorPool Agent analyzes logs and attempts to diagnose and fix the issue. The job enters a recovering state.
- Resolution: If recovery succeeds, monitoring resumes. You’re alerted about the failure, actions taken, and recovery status. If the TensorPool Agent lacks permissions, it provides a list of the actions it attempted and those it would have tried.
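
The interplay between whitelisted permissions and the final report can be sketched as follows. This is a simplified model under stated assumptions: the action names, the `RecoveryReport` type, and the `attempt_recovery` function are hypothetical and do not reflect the agent's real interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Set


@dataclass
class RecoveryReport:
    """What the user would be told after a recovery attempt (hypothetical shape)."""
    attempted: List[str] = field(default_factory=list)
    would_have_tried: List[str] = field(default_factory=list)
    recovered: bool = False


def attempt_recovery(
    proposed_actions: Iterable[str],
    allowed_actions: Set[str],
    run_action: Callable[[str], bool],
) -> RecoveryReport:
    """Try proposed actions in order, skipping any that are not whitelisted."""
    report = RecoveryReport()
    for action in proposed_actions:
        if action not in allowed_actions:
            # Not permitted: record it so the user can see what was skipped.
            report.would_have_tried.append(action)
            continue
        report.attempted.append(action)
        if run_action(action):
            # Stop at the first action that brings the job back.
            report.recovered = True
            break
    return report
```

Calling `attempt_recovery(["restart_job", "drain_node"], {"restart_job"}, run)` with a hypothetical `run` callback would attempt only `restart_job` and list `drain_node` under `would_have_tried` if the restart fails.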
TensorPool Agent Status Lifecycle
| Status | Description |
|---|---|
| pending | TensorPool Agent created, credentials being validated |
| enabled | TensorPool Agent is monitoring the job |
| credential_error | Credential validation failed and the job is not accessible by the TensorPool Agent; fix the credentials and resubmit |
| recovering | Job failure detected, TensorPool Agent is attempting to recover it |
| completed | Job finished (succeeded or unrecoverable) |
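
One way to read the table is as a small state machine. The sketch below encodes the five statuses and a plausible set of transitions inferred from the descriptions; the transition map is an assumption, not a documented guarantee, and `AgentStatus` and `can_transition` are names invented for this example.

```python
from enum import Enum


class AgentStatus(Enum):
    PENDING = "pending"
    ENABLED = "enabled"
    CREDENTIAL_ERROR = "credential_error"
    RECOVERING = "recovering"
    COMPLETED = "completed"


# Assumed transitions, inferred from the status descriptions.
ALLOWED_TRANSITIONS = {
    # Credentials are validated, then either accepted or rejected.
    AgentStatus.PENDING: {AgentStatus.ENABLED, AgentStatus.CREDENTIAL_ERROR},
    # "Fix and resubmit" sends the agent back through validation.
    AgentStatus.CREDENTIAL_ERROR: {AgentStatus.PENDING},
    # A monitored job either fails (recovery starts) or finishes.
    AgentStatus.ENABLED: {AgentStatus.RECOVERING, AgentStatus.COMPLETED},
    # Recovery either succeeds (monitoring resumes) or the job is unrecoverable.
    AgentStatus.RECOVERING: {AgentStatus.ENABLED, AgentStatus.COMPLETED},
    # Terminal state.
    AgentStatus.COMPLETED: set(),
}


def can_transition(current: AgentStatus, target: AgentStatus) -> bool:
    """Return True if moving from `current` to `target` is allowed in this model."""
    return target in ALLOWED_TRANSITIONS[current]
```

In this model, a status string from the table can be parsed with `AgentStatus("recovering")` and checked against the map before updating any local bookkeeping.
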
Failure Detection
The TensorPool Agent has the following definitions of failure for each job scheduler:
- TensorPool Jobs
- Kubernetes
- Slurm
For TensorPool Jobs, only jobs in the ERROR state trigger the TensorPool Agent.
Setup Requirements
The information that must be provided for the TensorPool Agent to monitor a job depends on the job scheduler:
- TensorPool Jobs
- Kubernetes
- Slurm
TensorPool Jobs is the simplest option: just provide your TensorPool job ID.
| Field | Description |
|---|---|
| Job ID | Your TensorPool job ID |
Next Steps
- Set up the TensorPool Agent on the dashboard
- Learn about TensorPool Jobs for running training workloads