
# Agent

> Autonomous monitoring and recovery for distributed training jobs

<Note>
  The TensorPool Agent is currently in **beta**. We'd love your feedback!
</Note>

The TensorPool Agent is an autonomous monitoring and recovery system for long-running distributed training jobs on Kubernetes, Slurm, or TensorPool Jobs. It's designed for large multi-node training jobs that run for days to weeks.

When the TensorPool Agent detects a runtime error, it attempts to autonomously recover your training job from its last checkpoint. You explicitly whitelist the actions the TensorPool Agent can take on your behalf.

**Best case:** The TensorPool Agent recovers your training job while you are away, so you get more iteration cycles and waste fewer GPU hours.

**Worst case:** The TensorPool Agent delivers a preliminary root cause analysis and the actions it would have taken.

## Target Failures

The TensorPool Agent is designed to address runtime errors that occur *deep into training*:

* GPU hardware faults: Xid errors (79, 63, 48, etc.)
* Distributed communication failures, NCCL errors
* Infrastructure problems: hardware failures, kernel panics
* Storage problems: I/O errors, checkpoint corruption, S3 timeouts
* Network problems: mounted object storage bucket issues
* GPU memory problems: CUDA out of memory, memory leaks, gradient explosion

<Warning>
  The TensorPool Agent is **not** intended to fix errors that occur early in training, such as dependency issues or distributed communication initialization failures. It's designed to solve issues that occur **after the first checkpoint**.
</Warning>

## How It Works

1. **Registration**: Provide credentials for your job scheduler of choice (Slurm, K8s, or TensorPool Jobs) on the [TensorPool Agent dashboard](https://tensorpool.dev/dashboard/agent). Whitelist the actions the agent may take on your behalf.

2. **Monitoring**: The training job is continuously monitored for failure.

3. **Recovery** (if the job fails): The TensorPool Agent analyzes the logs, then attempts to diagnose and fix the issue. The job enters a `recovering` state.

4. **Resolution**: If recovery succeeds, monitoring resumes. You're alerted about the failure, the actions taken, and the recovery status. If the TensorPool Agent lacks permissions, it provides a list of the actions it attempted and those it would have taken.

```mermaid theme={null}
sequenceDiagram
    participant User
    participant Agent
    participant Scheduler as Job Scheduler

    User->>Agent: Register credentials
    Agent->>Scheduler: Validate job accessible
    Agent->>Scheduler: Monitor job

    loop Monitoring
        Agent->>Scheduler: Check job status
    end

    Scheduler-->>Agent: Job failure detected
    Agent->>Agent: Analyze logs & diagnose
    Agent->>Scheduler: Attempt recovery
    Agent->>User: Alert with status
```
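The whitelist check in the registration and resolution steps can be pictured as a simple partition of proposed actions. This is a minimal sketch, not the agent's implementation: the function name `plan_recovery` and the action names are hypothetical and do not come from the TensorPool API.

```python
def plan_recovery(proposed: list[str], whitelist: set[str]) -> tuple[list[str], list[str]]:
    """Split proposed recovery actions into permitted and reported-only.

    Hypothetical helper: action names are illustrative placeholders.
    """
    # Actions the user whitelisted may be executed.
    allowed = [action for action in proposed if action in whitelist]
    # Everything else is only reported as "would have taken".
    blocked = [action for action in proposed if action not in whitelist]
    return allowed, blocked


allowed, blocked = plan_recovery(
    ["restart_from_checkpoint", "cordon_node"],  # hypothetical action names
    {"restart_from_checkpoint"},
)
```

In this sketch, `blocked` corresponds to the actions the agent would list in its alert when it lacks permissions.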

## TensorPool Agent Status Lifecycle

```mermaid theme={null}
stateDiagram-v2
    [*] --> pending: Agent created
    pending --> enabled: Credentials validated & job running
    pending --> credential_error: Validation failed
    credential_error --> pending: Fix & resubmit
    enabled --> recovering: Error detected
    recovering --> enabled: Issue fixed
    recovering --> completed: No permissions /\nunable to recover
    enabled --> completed: Job succeeded
    completed --> [*]
```

| Status                | Description                                                                                   |
| --------------------- | --------------------------------------------------------------------------------------------- |
| **pending**           | TensorPool Agent created, credentials being validated                                         |
| **enabled**           | TensorPool Agent is monitoring the job                                                        |
| **credential\_error** | Credential validation failed; the TensorPool Agent cannot access the job. Fix and resubmit.   |
| **recovering**        | Job failure detected, TensorPool Agent is attempting to recover it                            |
| **completed**         | Job finished (succeeded or unrecoverable)                                                     |

You will be notified via text or email whenever the TensorPool Agent enters the `recovering` state.
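If you track agent status in your own tooling, the lifecycle above can be encoded as a small transition table. The sketch below mirrors the state diagram; the names `AgentStatus`, `TRANSITIONS`, and `can_transition` are assumptions made for illustration and are not part of any TensorPool SDK.

```python
from enum import Enum


class AgentStatus(Enum):
    PENDING = "pending"
    ENABLED = "enabled"
    CREDENTIAL_ERROR = "credential_error"
    RECOVERING = "recovering"
    COMPLETED = "completed"


# Allowed transitions, copied from the state diagram on this page.
TRANSITIONS = {
    AgentStatus.PENDING: {AgentStatus.ENABLED, AgentStatus.CREDENTIAL_ERROR},
    AgentStatus.CREDENTIAL_ERROR: {AgentStatus.PENDING},
    AgentStatus.ENABLED: {AgentStatus.RECOVERING, AgentStatus.COMPLETED},
    AgentStatus.RECOVERING: {AgentStatus.ENABLED, AgentStatus.COMPLETED},
    AgentStatus.COMPLETED: set(),  # terminal state
}


def can_transition(current: AgentStatus, target: AgentStatus) -> bool:
    """Return True if the diagram allows moving from current to target."""
    return target in TRANSITIONS[current]
```

For example, `can_transition(AgentStatus.RECOVERING, AgentStatus.ENABLED)` is true, while no transition leaves `AgentStatus.COMPLETED`.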

## Failure Detection

The TensorPool Agent defines failure differently for each job scheduler:

<Tabs>
  <Tab title="TensorPool Jobs">
    Only jobs in `ERROR` state trigger the TensorPool Agent.
  </Tab>

  <Tab title="Kubernetes">
    | K8s Resource Kind                | Failure Condition                                             |
    | -------------------------------- | ------------------------------------------------------------- |
    | Job                              | `status.failed >= 1`                                          |
    | Deployment                       | `status.availableReplicas == 0`                               |
    | StatefulSet                      | `status.readyReplicas == 0`                                   |
    | DaemonSet                        | `status.numberReady == 0`                                     |
    | Pod                              | `status.phase` in (`Failed`, `Unknown`) or resource not found |
    | Kubeflow PyTorchJob/TFJob/MPIJob | Kubeflow `Failed` condition or no active/succeeded replicas   |
  </Tab>

  <Tab title="Slurm">
    **Failure states** (trigger the TensorPool Agent):

    * `FAILED`
    * `TIMEOUT`
    * `NODE_FAIL`
    * `OUT_OF_MEMORY`

    **Non-failure states** (do not trigger the TensorPool Agent):

    * `CANCELLED`
    * `PREEMPTED`
    * `DEADLINE`
  </Tab>
</Tabs>
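To check ahead of time whether a given job state would count as a failure, the Slurm and Kubernetes rules above can be written as plain predicates. This is a rough sketch under stated assumptions: the function names are hypothetical, `status` is assumed to be the resource's `status` object already parsed into a dict, and the Kubeflow kinds and the Pod "resource not found" case are left out.

```python
SLURM_FAILURE_STATES = {"FAILED", "TIMEOUT", "NODE_FAIL", "OUT_OF_MEMORY"}


def slurm_state_is_failure(state: str) -> bool:
    """True if a Slurm job state is one of the failure states listed above."""
    # Some Slurm tools append detail (e.g. "CANCELLED by 1001"); keep the first token.
    tokens = state.split()
    return bool(tokens) and tokens[0].upper() in SLURM_FAILURE_STATES


def k8s_resource_is_failure(kind: str, status: dict) -> bool:
    """Apply the per-kind failure condition from the Kubernetes table above."""
    # Kubernetes commonly omits zero-valued counters, so a missing field is treated as 0.
    if kind == "Job":
        return (status.get("failed") or 0) >= 1
    if kind == "Deployment":
        return (status.get("availableReplicas") or 0) == 0
    if kind == "StatefulSet":
        return (status.get("readyReplicas") or 0) == 0
    if kind == "DaemonSet":
        return (status.get("numberReady") or 0) == 0
    if kind == "Pod":
        return status.get("phase") in ("Failed", "Unknown")
    raise ValueError(f"kind not covered by this sketch: {kind}")
```

Note that the Deployment, StatefulSet, and DaemonSet conditions fire only when no replica is available or ready, so a partially degraded workload would not count as failed under these rules.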

## Setup Requirements

The information the TensorPool Agent needs to monitor a job depends on the job scheduler.

<Tabs>
  <Tab title="TensorPool Jobs">
    The simplest option: just provide your TensorPool job ID.

    | Field  | Description            |
    | ------ | ---------------------- |
    | Job ID | Your TensorPool job ID |
  </Tab>

  <Tab title="Kubernetes">
    | Field      | Description                        |
    | ---------- | ---------------------------------- |
    | Kubeconfig | Valid Kubernetes kubeconfig file   |
    | Job YAML   | Valid Kubernetes resource manifest |

    **Kubeconfig Requirements:**

    ```yaml theme={null}
    apiVersion: v1
    kind: Config
    clusters:
      - name: my-cluster
        cluster:
          server: https://...
    contexts:
      - name: my-context
        context:
          cluster: my-cluster
          user: my-user
    users:
      - name: my-user
        user:
          token: ...
    ```

    **Job YAML Requirements:**

    ```yaml theme={null}
    apiVersion: batch/v1  # Must match the kind: batch/v1 for Job, apps/v1 for Deployment, StatefulSet, DaemonSet
    kind: Job  # Supported: Job, Deployment, StatefulSet, DaemonSet, Pod, PyTorchJob, TFJob, MPIJob
    metadata:
      name: my-training-job  # generateName is NOT supported
      namespace: default
    spec:
      # ...
    ```
  </Tab>

  <Tab title="Slurm">
    | Field           | Description                                                |
    | --------------- | ---------------------------------------------------------- |
    | Login node IP   | IP address of the Slurm login node                         |
    | SSH port        | SSH port for the Slurm login node                          |
    | SSH username    | Username for SSH access to the Slurm login node            |
    | SSH private key | Private key for SSH authentication to the Slurm login node |
    | Slurm job ID    | ID of the job to monitor                                   |
  </Tab>
</Tabs>
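Before submitting a Kubernetes manifest, you may want to check it against the Job YAML requirements listed in the Kubernetes tab. The following is a minimal sketch that assumes the manifest has already been parsed into a Python dict (for example with a YAML library); `check_manifest` is a hypothetical helper, not a TensorPool function, and it checks only the rules stated on this page.

```python
SUPPORTED_KINDS = {
    "Job", "Deployment", "StatefulSet", "DaemonSet",
    "Pod", "PyTorchJob", "TFJob", "MPIJob",
}


def check_manifest(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []

    kind = manifest.get("kind")
    if kind not in SUPPORTED_KINDS:
        problems.append(f"unsupported kind: {kind!r}")

    metadata = manifest.get("metadata") or {}
    if not metadata.get("name"):
        if "generateName" in metadata:
            # The page states that generateName is not supported.
            problems.append("metadata.generateName is not supported; set metadata.name")
        else:
            problems.append("metadata.name is required")

    return problems


print(check_manifest({"kind": "Job", "metadata": {"name": "my-training-job"}}))
```

A manifest that passes these checks can still be rejected for other reasons, such as an `apiVersion` that does not match its `kind`.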

## Next Steps

* [Set up the TensorPool Agent](https://tensorpool.dev/dashboard/agent) on the dashboard
* Learn about [TensorPool Jobs](/features/jobs) for running training workloads
