
# Clusters

### Cluster Observability

The Health tab gives you a single-pane view of your entire cluster's operational state, covering GPU nodes, networking interfaces, SLURM, storage, and logs. It's designed to surface problems immediately so your team can act fast without having to dig through multiple tools.

At scale, visibility isn't optional. A single degraded node, a full storage volume, or a flapping network interface can silently stall a training run for hours if you don't catch it early. TensorWave's cluster observability is built around the idea that you should know about problems before they impact your workloads, not after.

#### Nodes

The Nodes section displays a grid of live status cards covering every category of node in your cluster and their current up/down state.

**GPU Nodes**

* **Up Cluster GPU Nodes:** nodes that are online and available for workloads
* **Down Cluster GPU Nodes:** nodes that are currently unavailable. Expanding this card shows the hostname of each affected node
* **Non-RMA GPU Nodes:** nodes that have not been flagged for Return Merchandise Authorization (RMA), meaning they are healthy and not being serviced
* **RMA GPU Nodes:** nodes that have been flagged for RMA and are out of rotation for hardware servicing
* **Nodes with 8 GPUs:** nodes running a full complement of 8 GPUs
* **Nodes with less than 8 GPUs:** nodes that are online but have fewer than 8 GPUs available, which may indicate that a GPU has failed or been taken offline
* **Nodes Missing GPUs:** nodes that are online but have no GPUs detected at all

Tracking partial GPU availability matters. A node that appears healthy but is missing GPUs can silently reduce the compute available to your jobs, causing slower runs or unexpected failures without an obvious cause.
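
To make the three GPU-count cards concrete, here is a minimal Python sketch of how online nodes could be grouped by detected GPU count. It is an illustration only, not TensorWave's implementation: the hostnames, the category names, and the choice to count a zero-GPU node only under "missing" (rather than also under "fewer than 8") are assumptions.

```python
from collections import Counter

FULL_GPU_COUNT = 8  # full complement per node, as described above


def classify_gpu_node(detected_gpus: int) -> str:
    """Return the category an online node falls under (hypothetical names)."""
    if detected_gpus < 0:
        raise ValueError("GPU count cannot be negative")
    if detected_gpus == 0:
        # Assumption: zero-GPU nodes are reported only as "missing".
        return "missing_gpus"
    if detected_gpus < FULL_GPU_COUNT:
        return "fewer_than_8"
    return "full_8"


def summarize(gpu_counts: dict[str, int]) -> Counter:
    """Count how many nodes fall into each category."""
    return Counter(classify_gpu_node(n) for n in gpu_counts.values())


# Hypothetical inventory: hostname -> GPUs detected on that node.
inventory = {"gpu-node-01": 8, "gpu-node-02": 7, "gpu-node-03": 0}
summary = summarize(inventory)
print(dict(summary))
```

In this sketch, `gpu-node-02` is the kind of node the paragraph above warns about: it is online, but any job scheduled on it has one fewer GPU than expected.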

**Head Nodes**

Head nodes manage job scheduling and act as the primary entry point for SLURM workloads on your cluster.

* **Up Cluster Head Nodes:** head nodes that are online and operational
* **Down Cluster Head Nodes:** head nodes that are currently unavailable, with hostnames listed on expansion

**Jump Nodes**

Jump nodes serve as secure access points into your cluster environment.

* **Up Cluster Jump Nodes:** jump nodes that are online and reachable
* **Down Cluster Jump Nodes:** jump nodes that are currently unavailable, with hostnames listed on expansion

**Frontend and Backend Interfaces**

* **Up/Down Frontend Interfaces:** frontend network interfaces handling inbound traffic to the cluster
* **Up/Down Backend Interfaces:** backend network interfaces handling internal cluster communication, including high-speed interconnects between nodes

Network interface health is especially important for distributed workloads. Degraded or down backend interfaces can bottleneck inter-node communication and tank GPU utilization across your entire cluster, even when the nodes themselves appear healthy.

Expanding any down card will list the hostnames of affected nodes, making it easy to pinpoint and isolate problems immediately.

#### Kubernetes

A summary of your Kubernetes environment at a glance, scoped to the selected namespace.

**Node Health** shows the count of ready and not-ready nodes alongside an overall readiness percentage and total node count.
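
As a rough illustration of the readiness figure, the percentage can be derived from the two counts. This is a sketch only: the function name, the rounding to one decimal place, and the handling of an empty cluster are assumptions, not documented behavior of the card.

```python
def readiness_percent(ready: int, not_ready: int) -> float:
    """Share of nodes that are ready, as a percentage of all nodes."""
    total = ready + not_ready
    if total == 0:
        # Assumption: report 0.0 when there are no nodes at all.
        return 0.0
    return round(100.0 * ready / total, 1)


print(readiness_percent(9, 1))
```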

**Pod Status** displays the total number of pods, broken down by state (Running, Succeeded, and others) and visualized as a live donut chart.

**Deployments** shows how many deployments are available out of the total, with a clear indicator when all deployments are healthy.

#### SLURM

**Slurm Node States** gives you a breakdown of all SLURM nodes by their current state, including Idle, Allocated, Mixed, Down, and Drained/Draining. A progress bar shows how many nodes are active relative to the total. Keeping an eye on node states helps you spot scheduling bottlenecks early, particularly when nodes are unexpectedly Down or stuck in a Drained state that reduces the capacity available to your jobs.
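
If you want to cross-check the card from a shell, a similar breakdown can be approximated from Slurm's own tooling. The sketch below tallies the text printed by `sinfo -h -o "%T %D"` (node state, then node count). The sample output is made up, and treating Allocated and Mixed as "active" is an assumption; the dashboard may define active differently.

```python
from collections import Counter

# Assumption: "active" means allocated or mixed; the card may differ.
ACTIVE_STATES = {"allocated", "mixed"}


def tally_node_states(sinfo_output: str) -> Counter:
    """Sum node counts per state from `sinfo -h -o "%T %D"` output."""
    counts = Counter()
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank or unexpected lines
        # Drop flag suffixes such as "*" (node not responding).
        state = parts[0].rstrip("*~#!%$@^+-").lower()
        counts[state] += int(parts[1])
    return counts


# Made-up sample output for illustration.
sample = "idle 4\nallocated 10\nmixed 2\ndown* 1\ndrained 1\n"
states = tally_node_states(sample)
total_nodes = sum(states.values())
active_nodes = sum(states[s] for s in ACTIVE_STATES)
print(dict(states), f"{active_nodes}/{total_nodes} active")
```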

**Slurm Job Queue** shows the total number of jobs currently in the queue, broken down by Running, Pending, and Other states. A growing Pending count can be an early indicator of scheduler pressure, resource contention, or a node issue that is quietly reducing cluster capacity.
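
The same Running / Pending / Other split can be approximated outside the dashboard. This sketch buckets the output of `squeue -h -o "%T"`, which prints one job state per line. The sample text is invented, and grouping every state other than `RUNNING` and `PENDING` under "other" is an assumption about how the card groups states.

```python
def tally_job_states(squeue_output: str) -> dict[str, int]:
    """Bucket job states from `squeue -h -o "%T"` output."""
    buckets = {"running": 0, "pending": 0, "other": 0}
    for line in squeue_output.splitlines():
        state = line.strip().upper()
        if not state:
            continue  # ignore blank lines
        if state == "RUNNING":
            buckets["running"] += 1
        elif state == "PENDING":
            buckets["pending"] += 1
        else:
            # Assumption: everything else (e.g. COMPLETING) is "other".
            buckets["other"] += 1
    return buckets


# Invented sample output for illustration.
queue = tally_job_states("RUNNING\nRUNNING\nPENDING\nCOMPLETING\n")
print(queue)
```

Sampling this periodically and watching whether `pending` keeps growing is one simple way to reproduce the early-warning signal described above.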

#### Storage

Keeping an eye on storage is critical for uninterrupted workloads. Full or near-full volumes can cause training jobs to fail, crash running pods, or corrupt checkpoints mid-run. The storage cards are designed to help you catch these issues before they become outages.

**Shared Storage Used** shows current utilization of your shared storage volume as a percentage, with a live progress bar and the mount path for reference. Monitoring this regularly helps ensure your jobs always have the space they need to write outputs, logs, and model checkpoints.

**Storage Volumes Used above 85%** flags any storage volumes approaching capacity. Catching volumes at this threshold gives you enough runway to free up space or expand capacity before a full volume takes down a workload. If all volumes are healthy, this card confirms that none have crossed the 85% threshold.
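
For teams that want the same guardrail inside their own job scripts, the 85% check is easy to approximate. The sketch below uses Python's standard `shutil.disk_usage`; the mount paths are hypothetical, and computing utilization as used bytes over total bytes is an assumption that may differ slightly from how the dashboard or `df` calculates it.

```python
import shutil

THRESHOLD_PERCENT = 85.0  # threshold named by the card above


def percent_used(total_bytes: int, used_bytes: int) -> float:
    """Utilization as a percentage of total capacity."""
    if total_bytes <= 0:
        return 0.0
    return 100.0 * used_bytes / total_bytes


def volumes_above_threshold(
    usage: dict[str, tuple[int, int]],
    threshold: float = THRESHOLD_PERCENT,
) -> list[str]:
    """Return mount paths whose utilization is strictly above the threshold."""
    return sorted(
        path
        for path, (total, used) in usage.items()
        if percent_used(total, used) > threshold
    )


def measure(path: str) -> tuple[int, int]:
    """Read (total, used) bytes for a mounted path from the OS."""
    stats = shutil.disk_usage(path)
    return stats.total, stats.used


# Hypothetical volumes: path -> (total bytes, used bytes).
example = {"/mnt/shared": (100, 90), "/mnt/scratch": (100, 85), "/mnt/data": (100, 10)}
print(volumes_above_threshold(example))
```

On a real system, `measure("/your/mount/path")` would supply the tuple for each volume you care about.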

#### Logs

**Log Trend** shows a rolling view of log activity over the last 30 minutes. Spikes in log volume are often the earliest signal of something going wrong across the cluster, whether that's a failing job, a misbehaving pod, or an infrastructure event, giving you a chance to investigate before the impact is felt.
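
To show what a "spike" in a 30-minute trend might look like in practice, here is a small sketch that counts log lines per minute and flags minutes well above the typical rate. Everything about the detection rule is an assumption made for illustration (the per-minute bucketing, the median baseline over non-empty minutes, and the factor of 3); the documentation does not describe how the Log Trend card is computed.

```python
from collections import Counter
from datetime import datetime, timedelta
from statistics import median

WINDOW = timedelta(minutes=30)  # rolling window described above


def per_minute_counts(timestamps: list[datetime], now: datetime) -> Counter:
    """Count log entries per minute within the last 30 minutes."""
    start = now - WINDOW
    counts = Counter()
    for ts in timestamps:
        if start < ts <= now:
            counts[ts.replace(second=0, microsecond=0)] += 1
    return counts


def spike_minutes(counts: Counter, factor: float = 3.0) -> list[datetime]:
    """Minutes whose count exceeds `factor` times the median minute."""
    if not counts:
        return []
    baseline = median(counts.values())
    return sorted(m for m, c in counts.items() if c > factor * baseline)


# Synthetic data: one entry per minute, plus a burst at 12:25.
now = datetime(2024, 1, 1, 12, 30)
logs = [now - timedelta(minutes=k) for k in range(1, 30)]
logs += [datetime(2024, 1, 1, 12, 25, 30)] * 10
logs.append(datetime(2024, 1, 1, 11, 0))  # outside the window, ignored
trend = per_minute_counts(logs, now)
print(spike_minutes(trend))
```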
