
# Nodes

### Node Insights

Clicking into any node from the Nodes table opens a detailed view of that node's performance and health. The **Insights** tab provides deep telemetry across GPU and system-level metrics, giving you the data you need to understand exactly what's happening on a node at any point in time.

TensorWave believes that the teams running the most demanding AI workloads deserve the same depth of observability that was previously only available to hyperscalers. Node Insights puts that data directly in your hands, so you can stop guessing why a job underperformed, catch hardware degradation before it causes a failure, and make informed decisions about your infrastructure without opening a support ticket first.

Use the **insights category** dropdown to switch between GPU Insights and System Insights, and the **time range** dropdown to adjust the window of data displayed.

***

#### GPU Insights

GPU Insights surfaces per-GPU telemetry across compute, memory, power, and interconnect health. This is the primary view for understanding how your GPUs are performing under a workload and catching hardware issues before they cause job failures. At scale, even a single GPU behaving unexpectedly can silently degrade the performance of an entire training run, making this level of per-device visibility critical.

**Summary cards** at the top of the view show:

* **GPUs**: total number of GPUs on the node.
* **Allocated GPUs**: number of GPUs currently allocated to a job.
* **Total ECC Counts**: correctable and uncorrectable ECC memory errors detected across all GPUs. Uncorrectable ECC errors in particular can indicate failing GPU memory and should be investigated promptly.

**Compute and memory**

* **Jobs by GPU Usage**: utilization percentage per GPU, making it easy to spot underutilized or completely idle GPUs during an active workload. Low utilization on an allocated GPU is often the first sign of a bottleneck elsewhere in your pipeline.
* **GPUs VRAM Used**: current VRAM consumption per GPU in MB.
* **Node GPU Usage**: a time-series chart of GPU utilization across all GPUs on the node over the selected time range.
* **Used VRAM per GPU (%)**: a time-series view of VRAM consumption per GPU, useful for spotting memory pressure or unexpected growth during a run. Catching VRAM exhaustion early can save you from an out-of-memory crash mid-training.
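
One way to act on a VRAM time series is to fit a linear trend to recent samples and estimate how long a GPU has before its memory fills. The sketch below is illustrative only: the `(time_s, used_mb)` sample format and the function name are assumptions made for this example, not a TensorWave API.

```python
from typing import Optional, Sequence, Tuple


def seconds_until_vram_full(
    samples: Sequence[Tuple[float, float]], capacity_mb: float
) -> Optional[float]:
    """Estimate seconds until VRAM is exhausted from (time_s, used_mb) samples.

    Fits an ordinary least-squares slope to the samples. Returns None when
    usage is flat or shrinking, or when a line cannot be fitted.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None  # no growth, so no projected exhaustion
    last_used = samples[-1][1]
    return max(capacity_mb - last_used, 0.0) / slope
```

A linear fit is a deliberately simple choice; workloads that allocate in steps will need a longer sample window before the estimate means much.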

**Power**

* **Package Power Usage**: current power draw per GPU in watts, with a time-series sparkline for each. Unexpected drops in power draw can indicate a GPU has gone idle or been taken offline mid-job.
* **GPU Power (W)**: a combined time-series chart of power consumption across all GPUs, making it easy to correlate power behavior with workload activity.
* **Power-Activity Delta (%)**: the relationship between power draw and GPU activity. A large delta can indicate a GPU is drawing power without doing useful work, which may signal a hardware or driver issue. This metric is particularly useful for identifying silent GPU failures that don't surface as outright errors.
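
This page does not state how the dashboard computes the Power-Activity Delta. One plausible reading, shown here purely as an assumption, is the gap between power draw expressed as a percentage of the GPU's power cap and the GPU's utilization percentage. The function and parameter names are hypothetical.

```python
def power_activity_delta(
    power_w: float, power_cap_w: float, utilization_pct: float
) -> float:
    """Assumed definition: power as a percent of the cap, minus utilization percent.

    A large positive value means the GPU draws power while reporting little
    activity. This is an illustrative formula, not the dashboard's.
    """
    if power_cap_w <= 0:
        raise ValueError("power_cap_w must be positive")
    power_pct = 100.0 * power_w / power_cap_w
    return power_pct - utilization_pct
```

Under this reading, a GPU drawing 600 W against a 750 W cap while reporting 20% utilization would show a delta of 60 percentage points.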

**Thermals**

* **Memory Temperature (°C)**: per-GPU memory temperature over time. Sustained high memory temperatures can throttle performance and shorten hardware lifespan.
* **GPU Sensor Temperatures (°C)**: per-GPU sensor temperature readings over time, useful for identifying thermal outliers across the node. A GPU running consistently hotter than its peers is worth investigating before it causes a thermal throttle or hardware fault mid-run.
* **GPU Memory Clock (MHz)**: memory clock frequency per GPU over time. Drops in memory clock speed can indicate thermal throttling that is silently reducing your effective memory bandwidth and, with it, workload throughput.
* **GPU System Clock (MHz)**: core clock frequency per GPU over time.
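
Spotting a GPU that runs hotter than its peers can be automated by comparing each reading to the node median. The sketch below assumes temperatures have already been collected into a dictionary keyed by a GPU identifier; the 10 °C margin and the function name are illustrative choices, not values from TensorWave.

```python
from statistics import median
from typing import Dict, List


def thermal_outliers(temps_c: Dict[str, float], margin_c: float = 10.0) -> List[str]:
    """Return ids of GPUs whose temperature exceeds the node median by more than margin_c."""
    if not temps_c:
        return []
    baseline = median(temps_c.values())
    return sorted(gpu for gpu, temp in temps_c.items() if temp - baseline > margin_c)
```

Using the median rather than the mean keeps a single very hot GPU from dragging the baseline up and hiding itself.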

**PCIe health**

* **PCIe Counts**: counts of Recovery, Replay, Replay Rollover, NACK Received, and NACK Sent events on the PCIe bus. Non-zero values, especially for Replay or NACK counts, can indicate PCIe instability that may affect GPU-to-CPU communication and overall node reliability.
* **PCIe Errors (ops/s)**: rate of PCIe errors over time.
* **PCIe Bandwidth (MB/s)**: per-GPU PCIe bandwidth over time, useful for identifying bottlenecks between GPUs and the host system.
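
Event counts and per-second rates are two views of the same data. Assuming the underlying counters are cumulative (this page does not say so explicitly), a rate series can be derived from successive samples as sketched below. The `(time_s, count)` sample format and the function name are hypothetical.

```python
from typing import List, Sequence, Tuple


def counter_rates(samples: Sequence[Tuple[float, int]]) -> List[float]:
    """Convert cumulative (time_s, count) samples into per-second rates.

    A drop in the counter is treated as a reset (for example after a driver
    reload), and the new value is taken as the increase since that reset.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order samples
        increase = c1 - c0 if c1 >= c0 else c1
        rates.append(increase / dt)
    return rates
```

A flat counter yields a rate of zero, so a sustained non-zero rate is what separates an ongoing problem from a one-time event.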

**Interconnect**

* **xGMI Transmission Rate (GB/s)**: data transmission rates across the GPU interconnect fabric. Degraded xGMI throughput can bottleneck multi-GPU communication and significantly reduce training efficiency on interconnect-heavy workloads. For large distributed runs, interconnect health is often the difference between hitting peak throughput and leaving performance on the table.

***

#### System Insights

System Insights provides host-level telemetry covering storage, memory, CPU, and networking for the node. GPU problems don't always start with the GPU. Storage contention, CPU bottlenecks, memory pressure, and network instability can all silently degrade workload performance in ways that are difficult to diagnose without system-level visibility.

**Storage**

* **Root FS Storage % Used**: current utilization of the root filesystem as a percentage, with a live progress bar.
* **Filesystem Available Space Over Time**: a time-series chart of available space across all mounted filesystems on the node. Watching this trend over time helps you catch storage being consumed faster than expected before a full filesystem crashes a job or corrupts a checkpoint.
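
For a quick cross-check from a shell on the node, root filesystem utilization can be approximated with Python's standard library. This is a minimal sketch, not the dashboard's implementation; the result may differ slightly from tools such as `df`, which account for reserved blocks. The function name is hypothetical.

```python
import shutil


def percent_used(total_bytes: int, used_bytes: int) -> float:
    """Return used space as a percentage of total space."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


if __name__ == "__main__":
    usage = shutil.disk_usage("/")  # named tuple: total, used, free (bytes)
    print(f"root filesystem: {percent_used(usage.total, usage.used):.1f}% used")
```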

**Memory**

* **Memory Used**: current system memory utilization as a percentage.
* **Memory Usage Over Time**: a time-series breakdown of RAM Cache and Buffer, RAM Free, RAM Used, and SWAP Used, giving you a full picture of how system memory is being consumed and whether the node is under memory pressure. Heavy SWAP usage in particular can be an early warning sign of a memory leak or an undersized allocation for your workload.
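
On a Linux host, a similar four-way breakdown can be derived from `/proc/meminfo`. The sketch below is a simplified calculation under stated assumptions: it treats "used" as total minus free, buffers, and page cache, which may not match how the dashboard defines its series. The function names and output keys are hypothetical.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: number} (kB for memory fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_breakdown_kb(info: Dict[str, int]) -> Dict[str, int]:
    """Split memory into free, cache and buffer, used, and swap used (all in kB)."""
    cache_and_buffer = info["Cached"] + info["Buffers"]
    return {
        "ram_free": info["MemFree"],
        "ram_cache_and_buffer": cache_and_buffer,
        "ram_used": info["MemTotal"] - info["MemFree"] - cache_and_buffer,
        "swap_used": info["SwapTotal"] - info["SwapFree"],
    }
```

To try it on a node, pass the contents of `/proc/meminfo` to `parse_meminfo` and feed the result to `memory_breakdown_kb`.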

**CPU**

* **Number of CPUs**: total CPU count on the node.
* **Sys Load**: current system load as a percentage.
* **CPU 1 min load avg**: short-term CPU load average, useful for catching sudden spikes.
* **CPU 5 min load avg**: medium-term load average, smoothing out short bursts to show sustained load.
* **CPU 15 min load avg**: long-term load average, the best indicator of whether the node is consistently under pressure over time. A persistently high 15-minute average often points to a systemic issue rather than a transient spike.
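
Load averages are easiest to interpret relative to the CPU count: on Linux, a load average roughly equal to the number of CPUs means the node is fully occupied, and values well above it mean tasks are queuing. The sketch below normalizes the three averages using the standard library. The helper name is hypothetical, and this is not necessarily how the dashboard derives **Sys Load**.

```python
import os
from typing import Tuple


def load_per_cpu(
    load_avgs: Tuple[float, float, float], cpu_count: int
) -> Tuple[float, ...]:
    """Divide the 1, 5, and 15 minute load averages by the CPU count."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return tuple(avg / cpu_count for avg in load_avgs)


if __name__ == "__main__":
    # os.getloadavg() is available on Unix-like systems only.
    one, five, fifteen = load_per_cpu(os.getloadavg(), os.cpu_count() or 1)
    print(f"load per CPU: 1m={one:.2f} 5m={five:.2f} 15m={fifteen:.2f}")
```

A 1-minute value far above the 15-minute value suggests a recent spike, while all three sitting above 1.0 per CPU suggests sustained saturation.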

**Thermals**

* **CPU Temperatures (°C)**: per-CPU temperature over time. Thermal throttling on the CPU can indirectly impact GPU workloads by creating bottlenecks in data preprocessing or job orchestration.
* **NIC Temperatures (°C)**: per-NIC temperature over time. Overheating network interface cards can degrade throughput and contribute to link instability.

**Networking**

* **Network Traffic Over Time**: inbound and outbound traffic on the node over the selected time range.
* **Link Flaps Over Time**: network link flap events, where a network interface briefly goes down and comes back up. Frequent link flaps are a strong early indicator of a failing NIC or unstable network connection and should be investigated before they impact distributed workloads. For multi-node training jobs, a single flapping link can stall an entire run.
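
The definition of a flap above (a link going down and then coming back up) can be expressed as a small state-transition counter. The sketch below assumes link state has been sampled into a chronological list of booleans; the function name and input format are hypothetical and unrelated to how the dashboard collects its data.

```python
from typing import Sequence


def count_link_flaps(states: Sequence[bool]) -> int:
    """Count flaps in a chronological series of link states (True = up).

    A flap is counted each time the link recovers after dropping. A link that
    goes down and stays down is an outage, not a flap, and is not counted.
    """
    flaps = 0
    went_down = False
    for prev, cur in zip(states, states[1:]):
        if prev and not cur:
            went_down = True  # link dropped
        elif not prev and cur and went_down:
            flaps += 1  # link came back: one full flap
            went_down = False
    return flaps
```

Sampling too slowly will miss short flaps entirely, so a count from polled state is a lower bound on what the interface actually experienced.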

