Healthchecks and SDebug
This page gives an overview of TensorWave's health checks and SDebug, a hardware debugging tool. SDebug is designed to make life easier for cluster admins who manage healthy and unhealthy nodes.
Healthchecks
Investigating Node Health with SDebug
slurm-login-skip-8566547b9c-qsnbh:~$ sdebug info
check-ecc  check-gpu-reset  check-link-flap  check-lldpd  check-nhc  check-rdma-links  check-ulimits  NODES
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
PASS       PASS             PASS             PASS         PASS       PASS              PASS           tus1-p13-g[1-6,8,9,11-23,25-26,28-30,32-43,45-50,52-54,56-59,62,64],tus1-p14-g[1-31,33-49,51-52,54-61,63-64],tus1-p16-g[1-6,9-17,19-20,24-30,33-36,39,42-50,52,54-62,64]
PASS       PASS             FAIL             PASS         PASS       PASS              PASS           tus1-p13-g[10,24,55],tus1-p16-g[21-22,37-38,53]
PASS       PASS             PASS             PASS         PASS       FAIL              PASS           tus1-p13-g[7,61],tus1-p16-g[7,23,41,51]
PASS       FAIL             PASS             PASS         PASS       PASS              PASS           tus1-p14-g50
PASS       PASS             PASS             PASS         PASS       PASS              FAIL           tus1-p14-g62,tus1-p16-g8

slurm-login-skip-8566547b9c-qsnbh:~$ sdebug info --node tus1-p14-g62
Health Check Report for Node: tus1-p14-g62
=======================================================
TEST              STATUS  RETCODE  TIMESTAMP
-------------------------------------------------------
check-nhc         PASS    0        2026-02-16T23:47:06Z
check-ecc         PASS    0        2026-02-16T23:47:07Z
check-rdma-links  PASS    0        2026-02-16T23:47:07Z
check-link-flap   PASS    0        2026-02-16T23:47:07Z
check-ulimits     FAIL    1        2026-02-16T23:47:07Z
check-gpu-reset   PASS    0        2026-02-16T23:47:07Z
check-lldpd       PASS    0        2026-02-16T23:47:07Z
=======================================================
Failed Test Logs:
================================================================================
Test: check-ulimits
Return Code: 1
Timestamp: 2026-02-16T23:47:07Z
Log:
--------------------------------------------------------------------------------
FAIL: ulimit check(s) failed:
- max_locked_memory: expected 'unlimited', got '8192'
max_locked_memory=8192
================================================================================
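The check-ulimits failure in the report concerns the locked-memory limit. To confirm the value by hand on a node, a minimal Python sketch along the following lines may help. It is not how SDebug implements the check. The function names `describe_memlock` and `check_memlock` are hypothetical, and the sketch assumes the logged value '8192' is in KiB, as `ulimit -l` reports it.

```python
import resource


def describe_memlock(limit: int) -> str:
    """Render an RLIMIT_MEMLOCK value the way `ulimit -l` prints it."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    # getrlimit() reports bytes, while `ulimit -l` reports KiB.
    return str(limit // 1024)


def check_memlock(expected: str = "unlimited") -> tuple[bool, str]:
    """Compare the soft locked-memory limit of this process with `expected`."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    got = describe_memlock(soft)
    return got == expected, got


if __name__ == "__main__":
    ok, got = check_memlock()
    if ok:
        print("PASS")
    else:
        # Message shape mirrors the log above; this is an illustration only.
        print(f"FAIL: max_locked_memory: expected 'unlimited', got '{got}'")
```

Run it in the same environment that jobs use (for example, inside a job step on the affected node), since limits can differ between a login shell and processes launched by the scheduler.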

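The NODES column of the `sdebug info` summary uses compressed, Slurm-style host ranges, which are awkward to feed into other scripts. The sketch below shows one way to turn a captured summary into a per-check list of failing nodes. It assumes the whitespace-separated layout shown in the sample output above, which is not a documented, stable format. The helper names `expand_hostlist` and `failing_nodes` are hypothetical. The expander does not handle zero-padded ranges. On a Slurm system, `scontrol show hostnames <hostlist>` does the expansion properly.

```python
import re


def expand_hostlist(hostlist: str) -> list[str]:
    """Expand a compressed hostlist such as 'a-g[1-3,7],b-g5' into hostnames."""
    hosts = []
    # Each piece is a prefix optionally followed by one bracketed range list;
    # commas inside the brackets stay with their piece.
    for part in re.findall(r"[^,\[]+(?:\[[^\]]*\])?", hostlist):
        match = re.fullmatch(r"([^\[]+)\[([^\]]*)\]", part)
        if match is None:
            hosts.append(part)  # plain hostname without a range
            continue
        prefix, ranges = match.groups()
        for item in ranges.split(","):
            lo, _, hi = item.partition("-")
            for n in range(int(lo), int(hi or lo) + 1):
                hosts.append(f"{prefix}{n}")
    return hosts


def failing_nodes(summary: str) -> dict[str, list[str]]:
    """Map each check name to the nodes reported as FAIL in a summary table."""
    # Drop blank lines and the dashed separator line.
    lines = [
        ln for ln in summary.splitlines()
        if ln.strip() and not set(ln.strip()) <= {"-"}
    ]
    checks = lines[0].split()[:-1]  # last header column is NODES
    result: dict[str, list[str]] = {}
    for row in lines[1:]:
        *statuses, nodes = row.split()
        for check, status in zip(checks, statuses):
            if status == "FAIL":
                result.setdefault(check, []).extend(expand_hostlist(nodes))
    return result
```

For example, passing the summary shown earlier to `failing_nodes` would give `tus1-p14-g62` and `tus1-p16-g8` under `check-ulimits`, which can then be inspected one at a time with `sdebug info --node`.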
