The TensorPool Agent is currently in beta. We’d love your feedback!
Target Failures
The TensorPool Agent is designed to address runtime errors that occur deep into training:
- GPU hardware faults: Xid errors (79, 63, 48, etc.)
- Distributed communication failures: NCCL errors
- Infrastructure problems: hardware failures, kernel panics
- Storage problems: I/O errors, checkpoint corruption, S3 timeouts
- Network problems: mounted object storage bucket issues
- GPU memory problems: CUDA out of memory, memory leaks, gradient explosion
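
To make the categories above concrete, here is a minimal sketch of how log lines might be bucketed into them. The category names and the `classify_log_line` helper are hypothetical, and the patterns are common log signatures (the NVIDIA driver's `NVRM: Xid` kernel messages, NCCL warnings, PyTorch's out-of-memory message, the `EIO` error string). This is an illustration, not the TensorPool Agent's actual detection logic.

```python
import re
from typing import Optional

# Hypothetical category names mirroring the failure list above.
# The patterns are common log signatures, NOT the agent's real rules.
FAILURE_PATTERNS = [
    # NVIDIA driver kernel message, e.g. "NVRM: Xid (PCI:0000:3b:00): 79, ..."
    ("gpu_hardware", re.compile(r"NVRM: Xid \(.*?\): \d+")),
    # NCCL warnings and error codes surfaced by distributed training
    ("distributed_communication",
     re.compile(r"NCCL (?:WARN|error)|ncclSystemError|ncclUnhandledCudaError",
                re.IGNORECASE)),
    # PyTorch's out-of-memory message
    ("gpu_memory", re.compile(r"CUDA out of memory")),
    # strerror text for EIO, typical of failing disks or mounts
    ("storage", re.compile(r"Input/output error")),
]


def classify_log_line(line: str) -> Optional[str]:
    """Return the first matching failure category, or None if nothing matches."""
    for category, pattern in FAILURE_PATTERNS:
        if pattern.search(line):
            return category
    return None
```

A line such as `OSError: [Errno 5] Input/output error` would fall into the assumed `storage` bucket, while ordinary training output matches nothing and returns `None`.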
How It Works
- Registration: On the TensorPool Agent dashboard, provide credentials to access your job scheduler of choice (Slurm, Kubernetes, or TensorPool Jobs). Whitelist the actions you allow the agent to take on your behalf.
- Monitoring: The training job is continuously monitored for failure.
- Recovery (if job fails): The TensorPool Agent analyzes logs and attempts to diagnose and fix the issue. The job enters a `recovering` state.
- Resolution: If recovery succeeds, monitoring resumes. You’re alerted about the failure, actions taken, and recovery status. If the TensorPool Agent lacks permissions, it provides a list of the actions it attempted and those it would have tried.
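
The whitelist behavior described in the registration and resolution steps can be sketched as a simple split of proposed actions into "attempted" and "would have tried". The `RecoveryReport` type, the `plan_recovery` function, and the action names are all hypothetical, chosen only for illustration. The real agent's action vocabulary is not documented here.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class RecoveryReport:
    # Actions covered by the whitelist, which the agent was allowed to run
    attempted: List[str] = field(default_factory=list)
    # Actions the agent wanted to run but lacked permission for
    would_have_tried: List[str] = field(default_factory=list)


def plan_recovery(proposed_actions: Iterable[str], whitelist: Set[str]) -> RecoveryReport:
    """Split proposed actions by whether the (assumed) whitelist permits them."""
    report = RecoveryReport()
    for action in proposed_actions:
        if action in whitelist:
            report.attempted.append(action)
        else:
            report.would_have_tried.append(action)
    return report


# Hypothetical action names, for illustration only
report = plan_recovery(
    ["resubmit_from_checkpoint", "drain_node"],
    whitelist={"resubmit_from_checkpoint"},
)
print(report.attempted)         # actions that were permitted
print(report.would_have_tried)  # actions blocked by missing permissions
```

Keeping the blocked actions in the report, rather than dropping them, is what lets the alert tell you which permissions to grant if you want the agent to go further next time.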
TensorPool Agent Status Lifecycle
| Status | Description |
|---|---|
| pending | TensorPool Agent created, credentials being validated |
| enabled | TensorPool Agent is monitoring the job |
| credential_error | Credential validation failed, so the job is not accessible by the TensorPool Agent. Fix the credentials and resubmit |
| recovering | Job failure detected, TensorPool Agent is attempting to recover it |
| completed | Job finished (succeeded or unrecoverable) |
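
One way to read the table above is as a small state machine. The sketch below encodes the five statuses as an enum and lists transitions inferred from the descriptions. The transition map is an assumption: in particular, the `credential_error` to `pending` edge reflects "fix and resubmit" and is not confirmed by the documentation.

```python
from enum import Enum


class AgentStatus(Enum):
    PENDING = "pending"
    ENABLED = "enabled"
    CREDENTIAL_ERROR = "credential_error"
    RECOVERING = "recovering"
    COMPLETED = "completed"


# Assumed transitions, inferred from the status descriptions (not an official spec)
ALLOWED_TRANSITIONS = {
    # Credential validation either passes or fails
    AgentStatus.PENDING: {AgentStatus.ENABLED, AgentStatus.CREDENTIAL_ERROR},
    # Assumption: resubmitting fixed credentials restarts validation
    AgentStatus.CREDENTIAL_ERROR: {AgentStatus.PENDING},
    # A monitored job either fails (recovery starts) or finishes
    AgentStatus.ENABLED: {AgentStatus.RECOVERING, AgentStatus.COMPLETED},
    # Recovery either succeeds (monitoring resumes) or the job is unrecoverable
    AgentStatus.RECOVERING: {AgentStatus.ENABLED, AgentStatus.COMPLETED},
    # Terminal state
    AgentStatus.COMPLETED: set(),
}


def can_transition(current: AgentStatus, target: AgentStatus) -> bool:
    """Return True if the assumed lifecycle permits moving from current to target."""
    return target in ALLOWED_TRANSITIONS[current]
```

Under these assumptions, `can_transition(AgentStatus.RECOVERING, AgentStatus.ENABLED)` holds because a successful recovery resumes monitoring, while nothing leaves `completed`.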
Failure Detection
The TensorPool Agent has the following definitions of failure for each job scheduler:
- TensorPool Jobs
- Kubernetes
- Slurm
For TensorPool Jobs, only jobs in the `ERROR` state trigger the TensorPool Agent.
Setup Requirements
The information the TensorPool Agent needs in order to monitor a job depends on the job scheduler:
- TensorPool Jobs
- Kubernetes
- Slurm
TensorPool Jobs is the simplest option: just provide your TensorPool job ID.
| Field | Description |
|---|---|
| Job ID | Your TensorPool job ID |
Next Steps
- Set up the TensorPool Agent on the dashboard
- Learn about TensorPool Jobs for running training workloads