| Plugin | Collection | Analyzer Args | Collection Args | DataModel | Collector | Analyzer |
|---|---|---|---|---|---|---|
| AmdSmiPlugin | bad-pages firmware --json list --json metric -g all partition --json process --json ras --cper --folder={folder} ras --afid --cper-file {cper_file} static -g all --json static -g {gpu_id} --json topology version --json xgmi -l xgmi -m |
Analyzer Args: - check_static_data: bool — If True, run static data checks (e.g. driver version, partition mode).- expected_gpu_processes: Optional[int] — Expected number of GPU processes.- expected_max_power: Optional[int] — Expected maximum power value (e.g. watts).- expected_driver_version: Optional[str] — Expected AMD driver version string.- expected_memory_partition_mode: Optional[str] — Expected memory partition mode (e.g. sp3, dp).- expected_compute_partition_mode: Optional[str] — Expected compute partition mode.- expected_pldm_version: Optional[str] — Expected PLDM version string.- l0_to_recovery_count_error_threshold: Optional[int] — L0-to-recovery count above which an error is raised.- l0_to_recovery_count_warning_threshold: Optional[int] — L0-to-recovery count above which a warning is raised.- vendorid_ep: Optional[str] — Expected endpoint vendor ID (e.g. for PCIe).- vendorid_ep_vf: Optional[str] — Expected endpoint VF vendor ID.- devid_ep: Optional[str] — Expected endpoint device ID.- devid_ep_vf: Optional[str] — Expected endpoint VF device ID.- sku_name: Optional[str] — Expected SKU name string for GPU.- expected_xgmi_speed: Optional[list[float]] — Expected xGMI speed value(s) (e.g. link rate).- analysis_range_start: Optional[datetime.datetime] — Start of time range for time-windowed analysis.- analysis_range_end: Optional[datetime.datetime] — End of time range for time-windowed analysis. |
Collection Args: - cper_file_path: Optional[str] — Path to CPER folder or file for RAS AFID collection (ras --afid --cper-file). |
AmdSmiDataModel | AmdSmiCollector | AmdSmiAnalyzer |
| BiosPlugin | sh -c 'cat /sys/devices/virtual/dmi/id/bios_version' wmic bios get SMBIOSBIOSVersion /Value |
Analyzer Args: - exp_bios_version: list[str] — Expected BIOS version(s) to match against collected value (str or list).- regex_match: bool — If True, match exp_bios_version as regex; otherwise exact match. |
- | BiosDataModel | BiosCollector | BiosAnalyzer |
| CmdlinePlugin | cat /proc/cmdline | Analyzer Args: - required_cmdline: Union[str, List] — Command-line parameters that must be present (e.g. 'pci=bfsort').- banned_cmdline: Union[str, List] — Command-line parameters that must not be present.- os_overrides: Dict[str, nodescraper.plugins.inband.cmdline.cmdlineconfig.OverrideConfig] — Per-OS overrides for required_cmdline and banned_cmdline (keyed by OS identifier).- platform_overrides: Dict[str, nodescraper.plugins.inband.cmdline.cmdlineconfig.OverrideConfig] — Per-platform overrides for required_cmdline and banned_cmdline (keyed by platform). |
- | CmdlineDataModel | CmdlineCollector | CmdlineAnalyzer |
| DeviceEnumerationPlugin | powershell -Command "(Get-WmiObject -Class Win32_Processor | Measure-Object).Count" lspci -d {vendorid_ep}: | grep -i 'VGA|Display|3D' | wc -l powershell -Command "(wmic path win32_VideoController get name | findstr AMD | Measure-Object).Count" lscpu lshw lspci -d {vendorid_ep}: | grep -i 'Virtual Function' | wc -l powershell -Command "(Get-VMHostPartitionableGpu | Measure-Object).Count" |
Analyzer Args: - cpu_count: Optional[list[int]] — Expected CPU count(s); pass as int or list of ints. Analysis passes if actual is in list.- gpu_count: Optional[list[int]] — Expected GPU count(s); pass as int or list of ints. Analysis passes if actual is in list.- vf_count: Optional[list[int]] — Expected virtual function count(s); pass as int or list of ints. Analysis passes if actual is in list. |
- | DeviceEnumerationDataModel | DeviceEnumerationCollector | DeviceEnumerationAnalyzer |
| DimmPlugin | sh -c 'dmidecode -t 17 | tr -s " " | grep -v "Volatile|None|Module" | grep Size' 2>/dev/null dmidecode wmic memorychip get Capacity |
- | Collection Args: - skip_sudo: bool — If True, do not use sudo when running dmidecode or wmic for memory info. |
DimmDataModel | DimmCollector | - |
| DkmsPlugin | dkms status dkms --version |
Analyzer Args: - dkms_status: Union[str, list] — Expected dkms status string(s) to match (e.g. 'amd/1.0.0'). At least one of dkms_status or dkms_version required.- dkms_version: Union[str, list] — Expected dkms version string(s) to match. At least one of dkms_status or dkms_version required.- regex_match: bool — If True, match dkms_status and dkms_version as regex; otherwise exact match. |
- | DkmsDataModel | DkmsCollector | DkmsAnalyzer |
| DmesgPlugin | dmesg --time-format iso -x ls -1 /var/log/dmesg* 2>/dev/null | grep -E '^/var/log/dmesg(.[0-9]+(.gz)?)?$' || true |
Built-in Regexes: - Out of memory error: (?:oom_kill_process.*)|(?:Out of memory.*)- I/O Page Fault: IO_PAGE_FAULT- Kernel Panic: \bkernel panic\b.*- SQ Interrupt: sq_intr- SRAM ECC: sram_ecc.*- Failed to load driver. IP hardware init error.: \[amdgpu\]\] \*ERROR\* hw_init of IP block.*- Failed to load driver. IP software init error.: \[amdgpu\]\] \*ERROR\* sw_init of IP block.*- Real Time throttling activated: sched: RT throttling activated.*- RCU preempt detected stalls: rcu_preempt detected stalls.*- RCU preempt self-detected stall: rcu_preempt self-detected stall.*- QCM fence timeout: qcm fence wait loop timeout.*- General protection fault: (?:[\w-]+(?:\[[0-9.]+\])?\s+)?general protectio...- Segmentation fault: (?:segfault.*in .*\[)|(?:[Ss]egmentation [Ff]au...- Failed to disallow cf state: amdgpu: Failed to disallow cf state.*- Failed to terminate tmr: \*ERROR\* Failed to terminate tmr.*- Suspend of IP block failed: \*ERROR\* suspend of IP block <\w+> failed.*- amdgpu Page Fault: (amdgpu \w{4}:\w{2}:\w{2}\.\w:\s+amdgpu:\s+\[\S...- Page Fault: page fault for address.*- Fatal error during GPU init: (?:amdgpu)(.*Fatal error during GPU init)|(Fata...- PCIe AER Error Status: (pcieport [\w:.]+: AER: aer_status:[^\n]*(?:\n[...- PCIe AER Correctable Error Status: (.*aer_cor_status: 0x[0-9a-fA-F]+, aer_cor_mask...- PCIe AER Uncorrectable Error Status: (.*aer_uncor_status: 0x[0-9a-fA-F]+, aer_uncor_...- PCIe AER Uncorrectable Error Severity with TLP Header: (.*aer_uncor_severity: 0x[0-9a-fA-F]+.*)(\n.*TL...- Failed to read journal file: Failed to read journal file.*- Journal file corrupted or uncleanly shut down: journal corrupted or uncleanly shut down.*- ACPI BIOS Error: ACPI BIOS Error- ACPI Error: ACPI Error- Filesystem corrupted!: EXT4-fs error \(device .*\):- Error in buffered IO, check filesystem integrity: (Buffer I\/O error on dev)(?:ice)? (\w+)- PCIe card no longer present: pcieport (\w+:\w+:\w+\.\w+):\s+(\w+):\s+(Slot\(...- PCIe Link Down: pcieport (\w+:\w+:\w+\.\w+):\s+(\w+):\s+(Slot\(...- Mismatched clock configuration between PCIe device and host: pcieport (\w+:\w+:\w+\.\w+):\s+(\w+):\s+(curren...- RAS Correctable Error: (?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....- RAS Uncorrectable Error: (?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....- RAS Deferred Error: (?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....- RAS Corrected PCIe Error: ((?:\[Hardware Error\]:\s+)?event severity: cor...- GPU Reset: (?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....- GPU reset failed: (?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....- ACA Error: (Accelerator Check Architecture[^\n]*)(?:\n[^\n...- ACA Error: (Accelerator Check Architecture[^\n]*)(?:\n[^\n...- MCE Error: \[Hardware Error\]:.+MC\d+_STATUS.*(?:\n.*){0,5}- Mode 2 Reset Failed: (?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)? (...- RAS Corrected Error: (?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....- SGX Error: x86/cpu: SGX disabled by BIOS- MMP Error: Failed to load MMP firmware qat_4xxx_mmp.bin- GPU Throttled: amdgpu \w{4}:\w{2}:\w{2}.\w: amdgpu: WARN: GPU ...- RAS Poison Consumed: amdgpu[ 0-9a-fA-F:.]+:(?:\s*amdgpu:)?\s+(?:{\d+...- RAS Poison created: amdgpu[ 0-9a-fA-F:.]+:(?:\s*amdgpu:)?\s+(?:{\d+...- Bad page threshold exceeded: (amdgpu: Saved bad pages (\d+) reaches threshol...- RAS Hardware Error: Hardware error from APEI Generic Hardware Error...- Error Address: Error Address.*(?:\s.*)- RAS EDR Event: EDR: EDR event received- DPC Event: DPC: .*- LNet: ko2iblnd has no matching interfaces: (?:\[[^\]]+\]\s*)?LNetError:.*ko2iblnd:\s*No ma...- LNet: Error starting up LNI: (?:\[[^\]]+\]\s*)?LNetError:\s*.*Error\s*-?\d+\...- Lustre: network initialisation failed: LustreError:.*ptlrpc_init_portals\(\).*network ... |
Collection Args: - collect_rotated_logs: bool — If True, also collect rotated dmesg log files from /var/log/dmesg*.- skip_sudo: bool — If True, do not use sudo when running dmesg or listing log files.- log_dmesg_data: bool — If True, log the collected dmesg output in artifacts. |
DmesgData | DmesgCollector | DmesgAnalyzer |
| FabricsPlugin | lspci | grep -i cassini lsmod | grep cxi cxi_stat ibstat ibv_devinfo ls -l /sys/class/infiniband/*/device/net fi_info -p cxi mst start mst status -v ip link show ofed_info -s |
- | - | FabricsDataModel | FabricsCollector | - |
| JournalPlugin | journalctl --no-pager --system --output=short-iso journalctl --no-pager --system --output=json |
Analyzer Args: - analysis_range_start: Optional[datetime.datetime] — Start of time range for analysis (ISO format). Only events on or after this time are analyzed.- analysis_range_end: Optional[datetime.datetime] — End of time range for analysis (ISO format). Only events before this time are analyzed.- check_priority: Optional[int] — Check against journal log priority (0=emergency..7=debug). If an entry has priority <= check_priority, an ERROR event...- group: bool — If True, group entries that have the same priority and message. |
Collection Args: - boot: Optional[int] — Optional boot ID to limit journal collection to a specific boot. |
JournalData | JournalCollector | JournalAnalyzer |
| KernelPlugin | sh -c 'uname -a' sh -c 'cat /proc/sys/kernel/numa_balancing' wmic os get Version /Value |
Analyzer Args: - exp_kernel: Union[str, list] — Expected kernel version string(s) to match (e.g. from uname -a).- exp_numa: Optional[int] — Expected value for kernel.numa_balancing (e.g. 0 or 1).- regex_match: bool — If True, match exp_kernel as regex; otherwise exact match. |
- | KernelDataModel | KernelCollector | KernelAnalyzer |
| KernelModulePlugin | cat /proc/modules modinfo amdgpu wmic os get Version /Value |
Analyzer Args: - kernel_modules: dict[str, dict] — Expected kernel module name -> {version, etc.}. Analyzer checks collected modules match.- regex_filter: list[str] — List of regex patterns to filter which collected modules are checked (default: amd). |
- | KernelModuleDataModel | KernelModuleCollector | KernelModuleAnalyzer |
| MemoryPlugin | free -b lsmem numactl -H wmic OS get FreePhysicalMemory /Value; wmic ComputerSystem get TotalPhysicalMemory /Value |
Analyzer Args: - ratio: float — Required free-memory ratio (0-1). Analysis fails if free/total < ratio.- memory_threshold: str — Minimum free memory required (e.g. '30Gi', '1T'). Used when ratio is not sufficient. |
- | MemoryDataModel | MemoryCollector | MemoryAnalyzer |
| NetworkPlugin | ip addr show curl ethtool -S {interface} ethtool {interface} lldpcli show neighbor lldpctl ip neighbor show ping ip route show ip rule show wget |
- | Collection Args: - url: Optional[str] — Optional URL to probe for network connectivity (used with netprobe).- netprobe: Optional[Literal['ping', 'wget', 'curl']] — Tool to use for network connectivity probe: ping, wget, or curl. |
NetworkDataModel | NetworkCollector | - |
| NicPlugin | niccli --listdev niccli --list niccli --list_devices niccli -dev {device_num} nvm -getoption pcie_relaxed_ordering niccli --dev {device_num} nvm --getoption pcie_relaxed_ordering niccli -dev {device_num} nvm -getoption performance_profile niccli --dev {device_num} nvm --getoption performance_profile niccli -dev {device_num} nvm -getoption support_rdma -scope 0 niccli -dev {device_num} getqos niccli --dev {device_num} nvm --getoption support_rdma niccli --dev {device_num} qos --ets --show niccli --version nicctl show card nicctl --version nicctl show card flash partition --json nicctl show card interrupts --json nicctl show card logs --non-persistent nicctl show card logs --boot-fault nicctl show card logs --persistent nicctl show card profile --json nicctl show card time --json nicctl show card statistics packet-buffer summary --json nicctl show lif statistics --json nicctl show lif internal queue-to-ud-pinning nicctl show pipeline internal anomalies nicctl show pipeline internal rsq-ring nicctl show pipeline internal statistics memory nicctl show port fsm nicctl show port transceiver --json nicctl show port statistics --json nicctl show port internal mac nicctl show qos headroom --json nicctl show rdma queue --json nicctl show rdma queue-pair --detail --json nicctl show version firmware nicctl show dcqcn nicctl show environment nicctl show lif nicctl show pcie ats nicctl show port nicctl show qos nicctl show rdma statistics nicctl show version host-software nicctl show dcqcn --card {card_id} --json nicctl show card hardware-config --card {card_id} |
Analyzer Args: - expected_values: Optional[Dict[str, Dict[str, Any]]] — Per-command expected checks keyed by canonical key (see command_to_canonical_key).- performance_profile_expected: str — Expected Broadcom performance_profile value (case-insensitive). Default RoCE.- support_rdma_disabled_values: List[str] — Values that indicate RDMA is not supported (case-insensitive).- pcie_relaxed_ordering_expected: str — Expected Broadcom pcie_relaxed_ordering value (e.g. 'Relaxed ordering = enabled'); checked case-insensitively. Defaul...- expected_qos_prio_map: Optional[Dict[Any, Any]] — Expected priority-to-TC map (e.g. {0: 0, 1: 1}; keys may be int or str in config). Checked per device when set.- expected_qos_pfc_enabled: Optional[int] — Expected PFC enabled value (0/1 or bitmask). Checked per device when set.- expected_qos_tsa_map: Optional[Dict[Any, Any]] — Expected TSA map for ETS (e.g. {0: 'ets', 1: 'strict'}; keys may be int or str in config). Checked per device when set.- expected_qos_tc_bandwidth: Optional[List[int]] — Expected TC bandwidth percentages. Checked per device when set.- require_qos_consistent_across_adapters: bool — When True and no expected_qos_* are set, require all adapters to have the same prio_map, pfc_enabled, and tsa_map.- nicctl_log_error_regex: Optional[List[Dict[str, Any]]] — Optional list of error patterns for nicctl show card logs. |
Collection Args: - commands: Optional[List[str]] — Optional list of niccli/nicctl commands to run. When None, default command set is used.- use_sudo_niccli: bool — If True, run niccli commands with sudo when required.- use_sudo_nicctl: bool — If True, run nicctl commands with sudo when required. |
NicDataModel | NicCollector | NicAnalyzer |
| NvmePlugin | nvme smart-log {dev} nvme error-log {dev} --log-entries=256 nvme id-ctrl {dev} nvme id-ns {dev}{ns} nvme fw-log {dev} nvme self-test-log {dev} nvme get-log {dev} --log-id=6 --log-len=512 nvme telemetry-log {dev} --output-file={dev}_{f_name} nvme list -o json |
- | - | NvmeDataModel | NvmeCollector | - |
| OsPlugin | sh -c '( lsb_release -ds || (cat /etc/*release | grep PRETTY_NAME) || uname -om ) 2>/dev/null | head -n1' cat /etc/*release | grep VERSION_ID wmic os get Version /value wmic os get Caption /Value |
Analyzer Args: - exp_os: Union[str, list] — Expected OS name/version string(s) to match (e.g. from lsb_release or /etc/os-release).- exact_match: bool — If True, require exact match for exp_os; otherwise substring match. |
- | OsDataModel | OsCollector | OsAnalyzer |
| PackagePlugin | dnf list --installed dpkg-query -W pacman -Q cat /etc/*release wmic product get name,version |
Analyzer Args: - exp_package_ver: Dict[str, Optional[str]] — Map package name -> expected version (None = any version). Checked against installed packages.- regex_match: bool — If True, match package versions with regex; otherwise exact or prefix match.- rocm_regex: Optional[str] — Optional regex to identify ROCm package version (used when enable_rocm_regex is True).- enable_rocm_regex: bool — If True, use rocm_regex (or default pattern) to extract ROCm version for checks. |
- | PackageDataModel | PackageCollector | PackageAnalyzer |
| PciePlugin | lspci -d {vendor_id}: -nn lspci -x lspci -xxxx lspci -PP lspci -PP -d {vendor_id}:{dev_id} lspci -vvv lspci -vvvt |
Analyzer Args: - exp_speed: int — Expected PCIe link speed (generation 1–5).- exp_width: int — Expected PCIe link width in lanes (1–16).- exp_sriov_count: int — Expected SR-IOV virtual function count.- exp_gpu_count_override: Optional[int] — Override expected GPU count for validation.- exp_max_payload_size: Union[Dict[int, int], int, NoneType] — Expected max payload size: int for all devices, or dict keyed by device ID.- exp_max_rd_req_size: Union[Dict[int, int], int, NoneType] — Expected max read request size: int for all devices, or dict keyed by device ID.- exp_ten_bit_tag_req_en: Union[Dict[int, int], int, NoneType] — Expected 10-bit tag request enable: int for all devices, or dict keyed by device ID. |
- | PcieDataModel | PcieCollector | PcieAnalyzer |
| ProcessPlugin | top -b -n 1 rocm-smi --showpids top -b -n 1 -o %CPU |
Analyzer Args: - max_kfd_processes: int — Maximum allowed number of KFD (Kernel Fusion Driver) processes; 0 disables the check.- max_cpu_usage: float — Maximum allowed CPU usage (percent) for process checks. |
Collection Args: - top_n_process: int — Number of top processes by CPU usage to collect (e.g. for top -b -n 1 -o %CPU). |
ProcessDataModel | ProcessCollector | ProcessAnalyzer |
| RdmaPlugin | rdma link -j rdma dev rdma link rdma statistic -j |
- | - | RdmaDataModel | RdmaCollector | RdmaAnalyzer |
| RocmPlugin | {rocm_path}/opencl/bin//clinfo env | grep -Ei 'rocm|hsa|hip|mpi|openmp|ucx|miopen' ls /sys/class/kfd/kfd/proc/ grep -i -E 'rocm' /etc/ld.so.conf.d/ {rocm_path}/bin/rocminfo ls -v -d {rocm_path} ls -v -d {rocm_path}-[3-7] | tail -1 ldconfig -p | grep -i -E 'rocm' grep . -r {rocm_path}/.info/* |
Analyzer Args: - exp_rocm: Union[str, list] — Expected ROCm version string(s) to match (e.g. from rocminfo).- exp_rocm_latest: str — Expected 'latest' ROCm path or version string for versioned installs.- exp_rocm_sub_versions: dict[str, Union[str, list]] — Map sub-version name (e.g. version_rocm) to expected string or list of allowed strings. |
Collection Args: - rocm_path: str — Base path to ROCm installation (e.g. /opt/rocm). Used for rocminfo, clinfo, and version discovery. |
RocmDataModel | RocmCollector | RocmAnalyzer |
| StoragePlugin | sh -c 'df -lH -B1 | grep -v 'boot'' wmic LogicalDisk Where DriveType="3" Get DeviceId,Size,FreeSpace |
- | Collection Args: - skip_sudo: bool — If True, do not use sudo when running df and related storage commands. |
StorageDataModel | StorageCollector | StorageAnalyzer |
| SysSettingsPlugin | cat /sys/{} ls -1 /sys/{} ls -l /sys/{} |
Analyzer Args: - checks: Optional[list[nodescraper.plugins.inband.sys_settings.analyzer_args.SysfsCheck]] — List of sysfs checks (path, expected values or pattern, display name). |
Collection Args: - paths: list[str] — Sysfs paths to read (cat). Paths with '' are collected with ls -l (e.g. class/net//device).- directory_paths: list[str] — Sysfs paths to list (ls -1); used for checks that match entry names by regex. |
SysSettingsDataModel | SysSettingsCollector | SysSettingsAnalyzer |
| SysctlPlugin | sysctl -n | Analyzer Args: - exp_vm_swappiness: Optional[int] — Expected vm.swappiness value.- exp_vm_numa_balancing: Optional[int] — Expected vm.numa_balancing value.- exp_vm_oom_kill_allocating_task: Optional[int] — Expected vm.oom_kill_allocating_task value.- exp_vm_compaction_proactiveness: Optional[int] — Expected vm.compaction_proactiveness value.- exp_vm_compact_unevictable_allowed: Optional[int] — Expected vm.compact_unevictable_allowed value.- exp_vm_extfrag_threshold: Optional[int] — Expected vm.extfrag_threshold value.- exp_vm_zone_reclaim_mode: Optional[int] — Expected vm.zone_reclaim_mode value.- exp_vm_dirty_background_ratio: Optional[int] — Expected vm.dirty_background_ratio value.- exp_vm_dirty_ratio: Optional[int] — Expected vm.dirty_ratio value.- exp_vm_dirty_writeback_centisecs: Optional[int] — Expected vm.dirty_writeback_centisecs value.- exp_kernel_numa_balancing: Optional[int] — Expected kernel.numa_balancing value. |
- | SysctlDataModel | SysctlCollector | SysctlAnalyzer |
| SyslogPlugin | ls -1 /var/log/syslog* 2>/dev/null | grep -E '^/var/log/syslog(.[0-9]+(.gz)?)?$' || true | - | - | SyslogData | SyslogCollector | - |
| UptimePlugin | uptime | - | - | UptimeDataModel | UptimeCollector | - |
Class for collection of inband tool amd-smi data.
Bases: ['InBandDataCollector']
Link to code: amdsmi_collector.py
- AMD_SMI_EXE:
amd-smi - SUPPORTED_OS_FAMILY:
{<OSFamily.LINUX: 3>} - CMD_VERSION:
version --json - CMD_LIST:
list --json - CMD_PROCESS:
process --json - CMD_PARTITION:
partition --json - CMD_FIRMWARE:
firmware --json - CMD_STATIC:
static -g all --json - CMD_STATIC_GPU:
static -g {gpu_id} --json - CMD_TOPOLOGY:
topology - CMD_METRIC:
metric -g all - CMD_BAD_PAGES:
bad-pages - CMD_XGMI_METRIC:
xgmi -m - CMD_XGMI_LINK:
xgmi -l - CMD_RAS:
ras --cper --folder={folder} - CMD_RAS_AFID:
ras --afid --cper-file {cper_file}
AmdSmiDataModel
- bad-pages
- firmware --json
- list --json
- metric -g all
- partition --json
- process --json
- ras --cper --folder={folder}
- ras --afid --cper-file {cper_file}
- static -g all --json
- static -g {gpu_id} --json
- topology
- version --json
- xgmi -l
- xgmi -m
Collect BIOS details
Bases: ['InBandDataCollector']
Link to code: bios_collector.py
- CMD_WINDOWS:
wmic bios get SMBIOSBIOSVersion /Value - CMD:
sh -c 'cat /sys/devices/virtual/dmi/id/bios_version'
BiosDataModel
- sh -c 'cat /sys/devices/virtual/dmi/id/bios_version'
- wmic bios get SMBIOSBIOSVersion /Value
Read linux cmdline data
Bases: ['InBandDataCollector']
Link to code: cmdline_collector.py
- SUPPORTED_OS_FAMILY:
{<OSFamily.LINUX: 3>} - CMD:
cat /proc/cmdline
CmdlineDataModel
- cat /proc/cmdline
Collect CPU and GPU count
Bases: ['InBandDataCollector']
Link to code: device_enumeration_collector.py
- CMD_GPU_COUNT_LINUX:
lspci -d {vendorid_ep}: | grep -i 'VGA\|Display\|3D' | wc -l - CMD_VF_COUNT_LINUX:
lspci -d {vendorid_ep}: | grep -i 'Virtual Function' | wc -l - CMD_LSCPU_LINUX:
lscpu - CMD_LSHW_LINUX:
lshw - CMD_CPU_COUNT_WINDOWS:
powershell -Command "(Get-WmiObject -Class Win32_Processor | Measure-Object).Count" - CMD_GPU_COUNT_WINDOWS:
powershell -Command "(wmic path win32_VideoController get name | findstr AMD | Measure-Object).Count" - CMD_VF_COUNT_WINDOWS:
powershell -Command "(Get-VMHostPartitionableGpu | Measure-Object).Count"
DeviceEnumerationDataModel
- powershell -Command "(Get-WmiObject -Class Win32_Processor | Measure-Object).Count"
- lspci -d {vendorid_ep}: | grep -i 'VGA|Display|3D' | wc -l
- powershell -Command "(wmic path win32_VideoController get name | findstr AMD | Measure-Object).Count"
- lscpu
- lshw
- lspci -d {vendorid_ep}: | grep -i 'Virtual Function' | wc -l
- powershell -Command "(Get-VMHostPartitionableGpu | Measure-Object).Count"
Collect data on installed DIMMs
Bases: ['InBandDataCollector']
Link to code: dimm_collector.py
- CMD_WINDOWS:
wmic memorychip get Capacity - CMD:
sh -c 'dmidecode -t 17 | tr -s " " | grep -v "Volatile\|None\|Module" | grep Size' 2>/dev/null - CMD_DMIDECODE_FULL:
dmidecode
DimmDataModel
- sh -c 'dmidecode -t 17 | tr -s " " | grep -v "Volatile|None|Module" | grep Size' 2>/dev/null
- dmidecode
- wmic memorychip get Capacity
Collect DKMS status and version data
Bases: ['InBandDataCollector']
Link to code: dkms_collector.py
- SUPPORTED_OS_FAMILY:
{<OSFamily.LINUX: 3>} - CMD_STATUS:
dkms status - CMD_VERSION:
dkms --version
DkmsDataModel
- dkms status
- dkms --version
Read dmesg log
Bases: ['InBandDataCollector']
Link to code: dmesg_collector.py
- SUPPORTED_OS_FAMILY:
{<OSFamily.LINUX: 3>} - CMD:
dmesg --time-format iso -x - CMD_LOGS:
ls -1 /var/log/dmesg* 2>/dev/null | grep -E '^/var/log/dmesg(\.[0-9]+(\.gz)?)?$' || true
DmesgData
- dmesg --time-format iso -x
- ls -1 /var/log/dmesg* 2>/dev/null | grep -E '^/var/log/dmesg(.[0-9]+(.gz)?)?$' || true
Collect InfiniBand/RDMA fabrics configuration details
Bases: ['InBandDataCollector']
Link to code: fabrics_collector.py
- CMD_IBSTAT:
ibstat - CMD_IBV_DEVINFO:
ibv_devinfo - CMD_IB_DEV_NETDEVS:
ls -l /sys/class/infiniband/*/device/net - CMD_OFED_INFO:
ofed_info -s - CMD_MST_START:
mst start - CMD_MST_STATUS:
mst status -v - CMD_CASSINI_PCI:
lspci | grep -i cassini - CMD_NET_LINK:
ip link show - CMD_LIBFABRIC_INFO:
fi_info -p cxi - CMD_CXI_STAT:
cxi_stat - CMD_CXI_MODULES:
lsmod | grep cxi
FabricsDataModel
- lspci | grep -i cassini
- lsmod | grep cxi
- cxi_stat
- ibstat
- ibv_devinfo
- ls -l /sys/class/infiniband/*/device/net
- fi_info -p cxi
- mst start
- mst status -v
- ip link show
- ofed_info -s
Read journal log via journalctl.
Bases: ['InBandDataCollector']
Link to code: journal_collector.py
- SUPPORTED_OS_FAMILY:
{<OSFamily.LINUX: 3>} - CMD:
journalctl --no-pager --system --output=short-iso - CMD_JSON:
journalctl --no-pager --system --output=json
JournalData
- journalctl --no-pager --system --output=short-iso
- journalctl --no-pager --system --output=json
Read kernel version
Bases: ['InBandDataCollector']
Link to code: kernel_collector.py
- CMD_WINDOWS:
wmic os get Version /Value - CMD:
sh -c 'uname -a' - CMD_NUMA_BALANCING:
sh -c 'cat /proc/sys/kernel/numa_balancing'
KernelDataModel
- sh -c 'uname -a'
- sh -c 'cat /proc/sys/kernel/numa_balancing'
- wmic os get Version /Value
Read kernel modules and associated parameters
Bases: ['InBandDataCollector']
Link to code: kernel_module_collector.py
- CMD_WINDOWS:
wmic os get Version /Value - CMD:
cat /proc/modules - CMD_MODINFO_AMDGPU:
modinfo amdgpu
KernelModuleDataModel
- cat /proc/modules
- modinfo amdgpu
- wmic os get Version /Value
Collect memory usage details
Bases: ['InBandDataCollector']
Link to code: memory_collector.py
- CMD_WINDOWS:
wmic OS get FreePhysicalMemory /Value; wmic ComputerSystem get TotalPhysicalMemory /Value - CMD:
free -b - CMD_LSMEM:
lsmem - CMD_NUMACTL:
numactl -H
MemoryDataModel
- free -b
- lsmem
- numactl -H
- wmic OS get FreePhysicalMemory /Value; wmic ComputerSystem get TotalPhysicalMemory /Value
Collect network configuration details using ip command
Bases: ['InBandDataCollector']
Link to code: network_collector.py
- CMD_ADDR:
ip addr show - CMD_ROUTE:
ip route show - CMD_RULE:
ip rule show - CMD_NEIGHBOR:
ip neighbor show - CMD_ETHTOOL_TEMPLATE:
ethtool {interface} - CMD_ETHTOOL_S_TEMPLATE:
ethtool -S {interface} - CMD_PING:
ping - CMD_WGET:
wget - CMD_CURL:
curl - CMD_LLDPCLI_NEIGHBOR:
lldpcli show neighbor - CMD_LLDPCTL:
lldpctl
NetworkDataModel
- ip addr show
- curl
- ethtool -S {interface}
- ethtool {interface}
- lldpcli show neighbor
- lldpctl
- ip neighbor show
- ping
- ip route show
- ip rule show
- wget
Collect raw output from niccli (Broadcom) and nicctl (Pensando) commands.
Bases: ['InBandDataCollector']
Link to code: nic_collector.py
- CMD_NICCLI_VERSION:
niccli --version - CMD_NICCLI_LIST:
niccli --list - CMD_NICCLI_LIST_DEVICES:
niccli --list_devices - CMD_NICCLI_LIST_DEVICES_LEGACY:
niccli --listdev - CMD_NICCLI_DISCOVERY_LEGACY:
['niccli --listdev', 'niccli --list'] - CMD_NICCLI_DISCOVERY_NEW:
['niccli --list_devices', 'niccli --list'] - CMD_NICCLI_DISCOVERY:
['niccli --listdev', 'niccli --list'] - CMD_NICCLI_DISCOVERY_ALL:
frozenset({'niccli --listdev', 'niccli --list_devices', 'niccli --list'}) - CMD_NICCLI_SUPPORT_RDMA_TEMPLATE_LEGACY:
niccli -dev {device_num} nvm -getoption support_rdma -scope 0 - CMD_NICCLI_PERFORMANCE_PROFILE_TEMPLATE_LEGACY:
niccli -dev {device_num} nvm -getoption performance_profile - CMD_NICCLI_PCIE_RELAXED_ORDERING_TEMPLATE_LEGACY:
niccli -dev {device_num} nvm -getoption pcie_relaxed_ordering - CMD_NICCLI_QOS_TEMPLATE_LEGACY:
niccli -dev {device_num} getqos - CMD_NICCLI_PER_DEVICE_LEGACY:
[ niccli -dev {device_num} nvm -getoption support_rdma -scope 0, niccli -dev {device_num} nvm -getoption performance_profile, niccli -dev {device_num} nvm -getoption pcie_relaxed_ordering, niccli -dev {device_num} getqos ] - CMD_NICCLI_SUPPORT_RDMA_TEMPLATE_NEW:
niccli --dev {device_num} nvm --getoption support_rdma - CMD_NICCLI_PERFORMANCE_PROFILE_TEMPLATE_NEW:
niccli --dev {device_num} nvm --getoption performance_profile - CMD_NICCLI_PCIE_RELAXED_ORDERING_TEMPLATE_NEW:
niccli --dev {device_num} nvm --getoption pcie_relaxed_ordering - CMD_NICCLI_QOS_TEMPLATE_NEW:
niccli --dev {device_num} qos --ets --show - CMD_NICCLI_PER_DEVICE_NEW:
[ niccli --dev {device_num} nvm --getoption support_rdma, niccli --dev {device_num} nvm --getoption performance_profile, niccli --dev {device_num} nvm --getoption pcie_relaxed_ordering, niccli --dev {device_num} qos --ets --show ] - CMD_NICCLI_SUPPORT_RDMA_TEMPLATE:
niccli -dev {device_num} nvm -getoption support_rdma -scope 0 - CMD_NICCLI_PERFORMANCE_PROFILE_TEMPLATE:
niccli -dev {device_num} nvm -getoption performance_profile - CMD_NICCLI_PCIE_RELAXED_ORDERING_TEMPLATE:
niccli -dev {device_num} nvm -getoption pcie_relaxed_ordering - CMD_NICCLI_PER_DEVICE:
[ niccli -dev {device_num} nvm -getoption support_rdma -scope 0, niccli -dev {device_num} nvm -getoption performance_profile, niccli -dev {device_num} nvm -getoption pcie_relaxed_ordering, niccli -dev {device_num} getqos ] - CMD_NICCTL_CARD_TEXT:
nicctl show card - CMD_NICCTL_GLOBAL:
[ nicctl --version, nicctl show card flash partition --json, nicctl show card interrupts --json, nicctl show card logs --non-persistent, nicctl show card logs --boot-fault, nicctl show card logs --persistent, nicctl show card profile --json, nicctl show card time --json, nicctl show card statistics packet-buffer summary --json, nicctl show lif statistics --json, nicctl show lif internal queue-to-ud-pinning, nicctl show pipeline internal anomalies, nicctl show pipeline internal rsq-ring, nicctl show pipeline internal statistics memory, nicctl show port fsm, nicctl show port transceiver --json, nicctl show port statistics --json, nicctl show port internal mac, nicctl show qos headroom --json, nicctl show rdma queue --json, nicctl show rdma queue-pair --detail --json, nicctl show version firmware ] - CMD_NICCTL_PER_CARD:
['nicctl show dcqcn --card {card_id} --json', 'nicctl show card hardware-config --card {card_id}'] - CMD_NICCTL_LEGACY_TEXT:
[ nicctl show card, nicctl show dcqcn, nicctl show environment, nicctl show lif, nicctl show pcie ats, nicctl show port, nicctl show qos, nicctl show rdma statistics, nicctl show version host-software ]
NicDataModel
- niccli --listdev
- niccli --list
- niccli --list_devices
- niccli -dev {device_num} nvm -getoption pcie_relaxed_ordering
- niccli --dev {device_num} nvm --getoption pcie_relaxed_ordering
- niccli -dev {device_num} nvm -getoption performance_profile
- niccli --dev {device_num} nvm --getoption performance_profile
- niccli -dev {device_num} nvm -getoption support_rdma -scope 0
- niccli -dev {device_num} getqos
- niccli --dev {device_num} nvm --getoption support_rdma
- niccli --dev {device_num} qos --ets --show
- niccli --version
- nicctl show card
- nicctl --version
- nicctl show card flash partition --json
- nicctl show card interrupts --json
- nicctl show card logs --non-persistent
- nicctl show card logs --boot-fault
- nicctl show card logs --persistent
- nicctl show card profile --json
- nicctl show card time --json
- nicctl show card statistics packet-buffer summary --json
- nicctl show lif statistics --json
- nicctl show lif internal queue-to-ud-pinning
- nicctl show pipeline internal anomalies
- nicctl show pipeline internal rsq-ring
- nicctl show pipeline internal statistics memory
- nicctl show port fsm
- nicctl show port transceiver --json
- nicctl show port statistics --json
- nicctl show port internal mac
- nicctl show qos headroom --json
- nicctl show rdma queue --json
- nicctl show rdma queue-pair --detail --json
- nicctl show version firmware
- nicctl show dcqcn
- nicctl show environment
- nicctl show lif
- nicctl show pcie ats
- nicctl show port
- nicctl show qos
- nicctl show rdma statistics
- nicctl show version host-software
- nicctl show dcqcn --card {card_id} --json
- nicctl show card hardware-config --card {card_id}
Collect NVMe details from the system.
Bases: ['InBandDataCollector']
Link to code: nvme_collector.py
- CMD_LINUX_LIST_JSON:
nvme list -o json - CMD_LINUX:
{'smart_log': 'nvme smart-log {dev}', 'error_log': 'nvme error-log {dev} --log-entries=256', 'id_ctrl': 'nvme id-ctrl {dev}', 'id_ns': 'nvme id-ns {dev}{ns}', 'fw_log': 'nvme fw-log {dev}', 'self_test_log': 'nvme self-test-log {dev}', 'get_log': 'nvme get-log {dev} --log-id=6 --log-len=512', 'telemetry_log': 'nvme telemetry-log {dev} --output-file={dev}_{f_name}'} - CMD_TEMPLATES:
[ nvme smart-log {dev}, nvme error-log {dev} --log-entries=256, nvme id-ctrl {dev}, nvme id-ns {dev}{ns}, nvme fw-log {dev}, nvme self-test-log {dev}, nvme get-log {dev} --log-id=6 --log-len=512, nvme telemetry-log {dev} --output-file={dev}_{f_name} ] - TELEMETRY_FILENAME:
telemetry_log.bin
NvmeDataModel
- nvme smart-log {dev}
- nvme error-log {dev} --log-entries=256
- nvme id-ctrl {dev}
- nvme id-ns {dev}{ns}
- nvme fw-log {dev}
- nvme self-test-log {dev}
- nvme get-log {dev} --log-id=6 --log-len=512
- nvme telemetry-log {dev} --output-file={dev}_{f_name}
- nvme list -o json
Collect OS details
Bases: ['InBandDataCollector']
Link to code: os_collector.py
- CMD_VERSION_WINDOWS:
wmic os get Version /value - CMD_VERSION:
cat /etc/*release | grep VERSION_ID - CMD_WINDOWS:
wmic os get Caption /Value - PRETTY_STR:
PRETTY_NAME - CMD:
sh -c '( lsb_release -ds || (cat /etc/*release | grep PRETTY_NAME) || uname -om ) 2>/dev/null | head -n1'
OsDataModel
- sh -c '( lsb_release -ds || (cat /etc/*release | grep PRETTY_NAME) || uname -om ) 2>/dev/null | head -n1'
- cat /etc/*release | grep VERSION_ID
- wmic os get Version /value
- wmic os get Caption /Value
Collecting Package information from the system
Bases: ['InBandDataCollector']
Link to code: package_collector.py
- CMD_WINDOWS:
wmic product get name,version - CMD_RELEASE:
cat /etc/*release - CMD_DPKG:
dpkg-query -W - CMD_DNF:
dnf list --installed - CMD_PACMAN:
pacman -Q
PackageDataModel
- dnf list --installed
- dpkg-query -W
- pacman -Q
- cat /etc/*release
- wmic product get name,version
class for collection of PCIe data only supports Linux OS type.
This class collects the PCIE config space using the lspci hex dump and then parses the hex dump to get the
PCIe configuration space for the GPUs in the system. If the system interaction level is set to STANDARD or higher,
then the entire pcie configuration space is collected for the GPUs in the system. If the system interaction level
is set to SURFACE then, only the first 64 bytes of the pcie configuration space is collected for the GPUs in the system.
This class will collect important PCIe data from the system running the commands
- `lspci -vvv` : Verbose collection of PCIe data
- `lspci -vvvt`: Verbose tree view of PCIe data
- `lspci -PP`: Path view of PCIe data for the GPUs
- If system interaction level is set to STANDARD or higher, the following commands will be run with sudo:
- `lspci -xxxx`: Hex view of PCIe data for the GPUs
- otherwise the following commands will be run without sudo:
- `lspci -x`: Hex view of PCIe data for the GPUs
- `lspci -d <vendor_id>:<dev_id>` : Count the number of GPUs in the system with this command
- If system interaction level is set to STANDARD or higher, the following commands will be run with sudo:
- The sudo lspci -xxxx command is used to collect the PCIe configuration space for the GPUs in the system
- otherwise the following commands will be run without sudo:
- The lspci -x command is used to collect the PCIe configuration space for the GPUs in the system
Bases: ['InBandDataCollector']
Link to code: pcie_collector.py
- SUPPORTED_OS_FAMILY:
{<OSFamily.LINUX: 3>} - CMD_LSPCI_VERBOSE:
lspci -vvv - CMD_LSPCI_VERBOSE_TREE:
lspci -vvvt - CMD_LSPCI_PATH:
lspci -PP - CMD_LSPCI_HEX_SUDO:
lspci -xxxx - CMD_LSPCI_HEX:
lspci -x - CMD_LSPCI_AMD_DEVICES:
lspci -d {vendor_id}: -nn - CMD_LSPCI_PATH_DEVICE:
lspci -PP -d {vendor_id}:{dev_id}
PcieDataModel
- lspci -d {vendor_id}: -nn
- lspci -x
- lspci -xxxx
- lspci -PP
- lspci -PP -d {vendor_id}:{dev_id}
- lspci -vvv
- lspci -vvvt
Collect Process details
Bases: ['InBandDataCollector']
Link to code: process_collector.py
- SUPPORTED_OS_FAMILY:
{<OSFamily.LINUX: 3>} - CMD_KFD:
rocm-smi --showpids - CMD_CPU_USAGE:
top -b -n 1 - CMD_PROCESS:
top -b -n 1 -o %CPU
ProcessDataModel
- top -b -n 1
- rocm-smi --showpids
- top -b -n 1 -o %CPU
Collect RDMA status and statistics via rdma link and rdma statistic commands.
Bases: ['InBandDataCollector']
Link to code: rdma_collector.py
- SUPPORTED_OS_FAMILY:
{<OSFamily.LINUX: 3>} - CMD_LINK:
rdma link -j - CMD_STATISTIC:
rdma statistic -j - CMD_RDMA_DEV:
rdma dev - CMD_RDMA_LINK:
rdma link
RdmaDataModel
- rdma link -j
- rdma dev
- rdma link
- rdma statistic -j
Collect ROCm version data
Bases: ['InBandDataCollector']
Link to code: rocm_collector.py
- SUPPORTED_OS_FAMILY:
{<OSFamily.LINUX: 3>} - CMD_ROCM_SUB_VERSIONS_TMPL:
grep . -r {rocm_path}/.info/* - CMD_ROCMINFO_TMPL:
{rocm_path}/bin/rocminfo - CMD_ROCM_LATEST_TMPL:
ls -v -d {rocm_path}-[3-7]* | tail -1 - CMD_ROCM_DIRS_TMPL:
ls -v -d {rocm_path}* - CMD_LD_CONF:
grep -i -E 'rocm' /etc/ld.so.conf.d/* - CMD_ROCM_LIBS:
ldconfig -p | grep -i -E 'rocm' - CMD_ENV_VARS:
env | grep -Ei 'rocm|hsa|hip|mpi|openmp|ucx|miopen' - CMD_CLINFO_TMPL:
{rocm_path}/opencl/bin/*/clinfo - CMD_KFD_PROC:
ls /sys/class/kfd/kfd/proc/
RocmDataModel
- {rocm_path}/opencl/bin/*/clinfo
- env | grep -Ei 'rocm|hsa|hip|mpi|openmp|ucx|miopen'
- ls /sys/class/kfd/kfd/proc/
- grep -i -E 'rocm' /etc/ld.so.conf.d/*
- {rocm_path}/bin/rocminfo
- ls -v -d {rocm_path}*
- ls -v -d {rocm_path}-[3-7]* | tail -1
- ldconfig -p | grep -i -E 'rocm'
- grep . -r {rocm_path}/.info/*
Collect disk usage details
Bases: ['InBandDataCollector']
Link to code: storage_collector.py
- CMD_WINDOWS:
wmic LogicalDisk Where DriveType="3" Get DeviceId,Size,FreeSpace - CMD:
sh -c 'df -lH -B1 | grep -v 'boot''
StorageDataModel
- sh -c 'df -lH -B1 | grep -v 'boot''
- wmic LogicalDisk Where DriveType="3" Get DeviceId,Size,FreeSpace
Collect sysfs settings from user-specified paths.
Bases: ['InBandDataCollector']
Link to code: sys_settings_collector.py
- SUPPORTED_OS_FAMILY:
{<OSFamily.LINUX: 3>} - CMD:
cat /sys/{} - CMD_LS:
ls -1 /sys/{} - CMD_LS_LONG:
ls -l /sys/{}
SysSettingsDataModel
- cat /sys/{}
- ls -1 /sys/{}
- ls -l /sys/{}
Collect sysctl kernel VM settings.
Bases: ['InBandDataCollector']
Link to code: sysctl_collector.py
- CMD:
sysctl -n
SysctlDataModel
- sysctl -n
Read syslog log
Bases: ['InBandDataCollector']
Link to code: syslog_collector.py
- SUPPORTED_OS_FAMILY:
{<OSFamily.LINUX: 3>} - CMD:
ls -1 /var/log/syslog* 2>/dev/null | grep -E '^/var/log/syslog(\.[0-9]+(\.gz)?)?$' || true
SyslogData
- ls -1 /var/log/syslog* 2>/dev/null | grep -E '^/var/log/syslog(.[0-9]+(.gz)?)?$' || true
Collect last boot time and uptime from uptime command
Bases: ['InBandDataCollector']
Link to code: uptime_collector.py
- SUPPORTED_OS_FAMILY:
{<OSFamily.LINUX: 3>} - CMD:
uptime
UptimeDataModel
- uptime
Data model for amd-smi data.
Optionals are used to allow for the data to be missing,
This makes the data class more flexible for the analyzer
which consumes only the required data. If any more data is
required for the analyzer then they should not be set to
default.
Link to code: amdsmidata.py
Bases: ['DataModel']
- version:
Optional[nodescraper.plugins.inband.amdsmi.amdsmidata.AmdSmiVersion] - gpu_list:
Optional[list[nodescraper.plugins.inband.amdsmi.amdsmidata.AmdSmiListItem]] - partition:
Optional[nodescraper.plugins.inband.amdsmi.amdsmidata.Partition] - process:
Optional[list[nodescraper.plugins.inband.amdsmi.amdsmidata.Processes]] - topology:
Optional[list[nodescraper.plugins.inband.amdsmi.amdsmidata.Topo]] - firmware:
Optional[list[nodescraper.plugins.inband.amdsmi.amdsmidata.Fw]] - bad_pages:
Optional[list[nodescraper.plugins.inband.amdsmi.amdsmidata.BadPages]] - static:
Optional[list[nodescraper.plugins.inband.amdsmi.amdsmidata.AmdSmiStatic]] - metric:
Optional[list[nodescraper.plugins.inband.amdsmi.amdsmidata.AmdSmiMetric]] - xgmi_metric:
Optional[list[nodescraper.plugins.inband.amdsmi.amdsmidata.XgmiMetrics]] - xgmi_link:
Optional[list[nodescraper.plugins.inband.amdsmi.amdsmidata.XgmiLinks]] - cper_data:
Optional[list[nodescraper.models.datamodel.FileModel]] - cper_afids:
dict[str, int]
Link to code: biosdata.py
Bases: ['DataModel']
- bios_version:
str
Link to code: cmdlinedata.py
Bases: ['DataModel']
- cmdline:
str
Link to code: deviceenumdata.py
Bases: ['DataModel']
- cpu_count:
Optional[int] - gpu_count:
Optional[int] - vf_count:
Optional[int] - lscpu_output:
Optional[str] - lshw_output:
Optional[str]
Link to code: dimmdata.py
Bases: ['DataModel']
- dimms:
str
Link to code: dkmsdata.py
Bases: ['DataModel']
- status:
Optional[str] - version:
Optional[str]
Data model for in band dmesg log
Link to code: dmesgdata.py
Bases: ['DataModel']
- dmesg_content:
str - skip_log_file:
bool
Complete InfiniBand/RDMA fabrics configuration data
Link to code: fabricsdata.py
Bases: ['DataModel']
- ibstat_devices:
List[nodescraper.plugins.inband.fabrics.fabricsdata.IbstatDevice] - ibv_devices:
List[nodescraper.plugins.inband.fabrics.fabricsdata.IbvDeviceInfo] - ibdev_netdev_mappings:
List[nodescraper.plugins.inband.fabrics.fabricsdata.IbdevNetdevMapping] - ofed_info:
Optional[nodescraper.plugins.inband.fabrics.fabricsdata.OfedInfo] - mst_status:
Optional[nodescraper.plugins.inband.fabrics.fabricsdata.MstStatus] - slingshot_data:
Optional[nodescraper.plugins.inband.fabrics.fabricsdata.SlingshotData]
Data model for journal logs
Link to code: journaldata.py
Bases: ['DataModel']
- journal_log:
str - journal_content_json:
list[nodescraper.plugins.inband.journal.journaldata.JournalJsonEntry]
Link to code: kerneldata.py
Bases: ['DataModel']
- kernel_info:
str - kernel_version:
str - numa_balancing:
Optional[int]
Link to code: kernel_module_data.py
Bases: ['DataModel']
- kernel_modules:
dict - amdgpu_modinfo:
Optional[nodescraper.plugins.inband.kernel_module.kernel_module_data.ModuleInfo]
Memory data model
Link to code: memorydata.py
Bases: ['DataModel']
- mem_free:
str - mem_total:
str - lsmem_data:
Optional[nodescraper.plugins.inband.memory.memorydata.LsmemData] - numa_topology:
Optional[nodescraper.plugins.inband.memory.memorydata.NumaTopology]
Complete network configuration data
Link to code: networkdata.py
Bases: ['DataModel']
- interfaces:
List[nodescraper.plugins.inband.network.networkdata.NetworkInterface] - routes:
List[nodescraper.plugins.inband.network.networkdata.Route] - rules:
List[nodescraper.plugins.inband.network.networkdata.RoutingRule] - neighbors:
List[nodescraper.plugins.inband.network.networkdata.Neighbor] - ethtool_info:
Dict[str, nodescraper.plugins.inband.network.networkdata.EthtoolInfo] - accessible:
Optional[bool]
Collected output of niccli (Broadcom) and nicctl (Pensando) commands.
Link to code: nic_data.py
Bases: ['DataModel']
- results:
Dict[str, nodescraper.plugins.inband.nic.nic_data.NicCommandResult] - card_show:
Optional[nodescraper.plugins.inband.nic.nic_data.NicCtlCardShow] - cards:
List[nodescraper.plugins.inband.nic.nic_data.NicCtlCard] - port:
Optional[nodescraper.plugins.inband.nic.nic_data.NicCtlPort] - lif:
Optional[nodescraper.plugins.inband.nic.nic_data.NicCtlLif] - qos:
Optional[nodescraper.plugins.inband.nic.nic_data.NicCtlQos] - rdma:
Optional[nodescraper.plugins.inband.nic.nic_data.NicCtlRdma] - dcqcn:
Optional[nodescraper.plugins.inband.nic.nic_data.NicCtlDcqcn] - environment:
Optional[nodescraper.plugins.inband.nic.nic_data.NicCtlEnvironment] - version:
Optional[nodescraper.plugins.inband.nic.nic_data.NicCtlVersion] - broadcom_nic_devices:
List[nodescraper.plugins.inband.nic.nic_data.NicCliDevice] - broadcom_nic_qos:
Dict[int, nodescraper.plugins.inband.nic.nic_data.NicCliQos] - broadcom_nic_support_rdma:
Dict[int, str] - broadcom_nic_performance_profile:
Dict[int, str] - broadcom_nic_pcie_relaxed_ordering:
Dict[int, str] - pensando_nic_cards:
List[nodescraper.plugins.inband.nic.nic_data.PensandoNicCard] - pensando_nic_dcqcn:
List[nodescraper.plugins.inband.nic.nic_data.PensandoNicDcqcn] - pensando_nic_environment:
List[nodescraper.plugins.inband.nic.nic_data.PensandoNicEnvironment] - pensando_nic_lif:
List[nodescraper.plugins.inband.nic.nic_data.PensandoNicLif] - pensando_nic_pcie_ats:
List[nodescraper.plugins.inband.nic.nic_data.PensandoNicPcieAts] - pensando_nic_ports:
List[nodescraper.plugins.inband.nic.nic_data.PensandoNicPort] - pensando_nic_qos:
List[nodescraper.plugins.inband.nic.nic_data.PensandoNicQos] - pensando_nic_rdma_statistics:
List[nodescraper.plugins.inband.nic.nic_data.PensandoNicRdmaStatistics] - pensando_nic_version_host_software:
Optional[nodescraper.plugins.inband.nic.nic_data.PensandoNicVersionHostSoftware] - pensando_nic_version_firmware:
List[nodescraper.plugins.inband.nic.nic_data.PensandoNicVersionFirmware] - nicctl_card_logs:
Optional[Dict[str, str]]
NVMe collection output: parsed 'nvme list' entries and per-device command outputs.
Link to code: nvmedata.py
Bases: ['DataModel']
- nvme_list:
Optional[list[nodescraper.plugins.inband.nvme.nvmedata.NvmeListEntry]] - devices:
dict[str, nodescraper.plugins.inband.nvme.nvmedata.DeviceNvmeData]
Link to code: osdata.py
Bases: ['DataModel']
- os_name:
str - os_version:
str
Pacakge data contains the package data for the system
Link to code: packagedata.py
Bases: ['DataModel']
- version_info:
dict[str, str] - rocm_regex:
str - enable_rocm_regex:
bool
class for collection of PCIe data.
Optionals are used to allow for the data to be missing,
This makes the data class more flexible for the analyzer
which consumes only the required data. If any more data is
required for the analyzer then they should not be set to
default.
- pcie_cfg_space: A dictionary of PCIe cfg space for the GPUs obtained with setpci command
- lspci_verbose: Verbose collection of PCIe data
- lspci_verbose_tree: Tree view of PCIe data
- lspci_path: Path view of PCIe data for the GPUs
- lspci_hex: Hex view of PCIe data for the GPUs
Link to code: pcie_data.py
Bases: ['DataModel']
- pcie_cfg_space:
Dict[Annotated[str, AfterValidator(func=validate_bdf)], nodescraper.plugins.inband.pcie.pcie_data.PcieCfgSpace] - vf_pcie_cfg_space:
Optional[Dict[Annotated[str, AfterValidator(func=validate_bdf)], nodescraper.plugins.inband.pcie.pcie_data.PcieCfgSpace]]
Link to code: processdata.py
Bases: ['DataModel']
- kfd_process:
Optional[int] - cpu_usage:
Optional[float] - processes:
Optional[list[tuple[str, str]]]
Data model for RDMA (Remote Direct Memory Access) statistics and link information.
Link to code: rdmadata.py
Bases: ['DataModel']
- link_list:
list[nodescraper.plugins.inband.rdma.rdmadata.RdmaLink] - statistic_list:
list[nodescraper.plugins.inband.rdma.rdmadata.RdmaStatistics] - dev_list:
list[nodescraper.plugins.inband.rdma.rdmadata.RdmaDevice] - link_list_text:
list[nodescraper.plugins.inband.rdma.rdmadata.RdmaLinkText]
Link to code: rocmdata.py
Bases: ['DataModel']
- rocm_version:
str - rocm_sub_versions:
dict[str, str] - rocminfo:
List[str] - rocm_latest_versioned_path:
str - rocm_all_paths:
List[str] - ld_conf_rocm:
List[str] - rocm_libs:
List[str] - env_vars:
List[str] - clinfo:
List[str] - kfd_proc:
List[str]
Link to code: storagedata.py
Bases: ['DataModel']
- storage_data:
dict[str, nodescraper.plugins.inband.storage.storagedata.DeviceStorageData]
Data model for sysfs settings: path -> parsed value.
Values are parsed from user-specified sysfs paths (bracketed value extracted
when present, e.g. '[always] madvise never' -> 'always').
Link to code: sys_settings_data.py
Bases: ['DataModel']
- readings:
dict[str, str]
Link to code: sysctldata.py
Bases: ['DataModel']
- vm_swappiness:
Optional[int] - vm_numa_balancing:
Optional[int] - vm_oom_kill_allocating_task:
Optional[int] - vm_compaction_proactiveness:
Optional[int] - vm_compact_unevictable_allowed:
Optional[int] - vm_extfrag_threshold:
Optional[int] - vm_zone_reclaim_mode:
Optional[int] - vm_dirty_background_ratio:
Optional[int] - vm_dirty_ratio:
Optional[int] - vm_dirty_writeback_centisecs:
Optional[int] - kernel_numa_balancing:
Optional[int]
Data model for in band syslog logs
Link to code: syslogdata.py
Bases: ['DataModel']
- syslog_logs:
list[nodescraper.connection.inband.inband.TextFileArtifact]
Link to code: uptimedata.py
Bases: ['DataModel']
- current_time:
str - uptime:
str
Check AMD SMI Application data for PCIe, ECC errors, and CPER data.
Bases: ['CperAnalysisTaskMixin', 'DataAnalyzer']
Link to code: amdsmi_analyzer.py
Check bios matches expected bios details
Bases: ['DataAnalyzer']
Link to code: bios_analyzer.py
Check cmdline matches expected kernel cmdline
Bases: ['DataAnalyzer']
Link to code: cmdline_analyzer.py
Check Device Enumeration matches expected cpu and gpu count supported by all OSs, SKUs, and platforms.
Bases: ['DataAnalyzer']
Link to code: device_enumeration_analyzer.py
Check dkms matches expected status and version
Bases: ['DataAnalyzer']
Link to code: dkms_analyzer.py
Check dmesg for errors
Bases: ['RegexAnalyzer']
Link to code: dmesg_analyzer.py
- ERROR_REGEX:
[ regex=re.compile('(?:oom_kill_process.*)|(?:Out of memory.*)') message='Out of memory error' event_category=<EventCategory.SW_DRIVER: 'SW_DRIVER'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('IO_PAGE_FAULT') message='I/O Page Fault' event_category=<EventCategory.SW_DRIVER: 'SW_DRIVER'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('\\bkernel panic\\b.*', re.IGNORECASE) message='Kernel Panic' event_category=<EventCategory.SW_DRIVER: 'SW_DRIVER'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('sq_intr') message='SQ Interrupt' event_category=<EventCategory.SW_DRIVER: 'SW_DRIVER'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('sram_ecc.*') message='SRAM ECC' event_category=<EventCategory.SW_DRIVER: 'SW_DRIVER'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('\\[amdgpu\\]\\] \\*ERROR\\* hw_init of IP block.*') message='Failed to load driver. IP hardware init error.' event_category=<EventCategory.SW_DRIVER: 'SW_DRIVER'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('\\[amdgpu\\]\\] \\*ERROR\\* sw_init of IP block.*') message='Failed to load driver. IP software init error.' event_category=<EventCategory.SW_DRIVER: 'SW_DRIVER'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('sched: RT throttling activated.*') message='Real Time throttling activated' event_category=<EventCategory.SW_DRIVER: 'SW_DRIVER'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('rcu_preempt detected stalls.*') message='RCU preempt detected stalls' event_category=<EventCategory.SW_DRIVER: 'SW_DRIVER'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('rcu_preempt self-detected stall.*') message='RCU preempt self-detected stall' event_category=<EventCategory.SW_DRIVER: 'SW_DRIVER'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('qcm fence wait loop timeout.*') message='QCM fence timeout' event_category=<EventCategory.SW_DRIVER: 'SW_DRIVER'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('(?:[\\w-]+(?:\\[[0-9.]+\\])?\\s+)?general protection fault[^\\n]*') message='General protection fault' event_category=<EventCategory.SW_DRIVER: 'SW_DRIVER'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('(?:segfault.*in .*\\[)|(?:[Ss]egmentation [Ff]ault.*)|(?:[Ss]egfault.*)') message='Segmentation fault' event_category=<EventCategory.SW_DRIVER: 'SW_DRIVER'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('amdgpu: Failed to disallow cf state.*') message='Failed to disallow cf state' event_category=<EventCategory.SW_DRIVER: 'SW_DRIVER'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('\\*ERROR\\* Failed to terminate tmr.*') message='Failed to terminate tmr' event_category=<EventCategory.SW_DRIVER: 'SW_DRIVER'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('\\*ERROR\\* suspend of IP block <\\w+> failed.*') message='Suspend of IP block failed' event_category=<EventCategory.SW_DRIVER: 'SW_DRIVER'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('(amdgpu \\w{4}:\\w{2}:\\w{2}\\.\\w:\\s+amdgpu:\\s+\\[\\S+\\]\\s*(?:retry|no-retry)? page fault[^\\n]*)(?:\\n[^\\n]*(amdgpu \\w{4}:\\w{2}:\\w{2}\\.\\w:\\s+amdgpu:[^\\n]*))?(?:\\n[^\\n]*(amdgpu \\w{4}:, re.MULTILINE) message='amdgpu Page Fault' event_category=<EventCategory.SW_DRIVER: 'SW_DRIVER'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('page fault for address.*') message='Page Fault' event_category=<EventCategory.OS: 'OS'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('(?:amdgpu)(.*Fatal error during GPU init)|(Fatal error during GPU init)') message='Fatal error during GPU init' event_category=<EventCategory.SW_DRIVER: 'SW_DRIVER'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('(pcieport [\\w:.]+: AER: aer_status:[^\\n]*(?:\\n[^\\n]*){0,32}?pcieport [\\w:.]+: AER: aer_layer=[^\\n]*)', re.MULTILINE) message='PCIe AER Error Status' event_category=<EventCategory.SW_DRIVER: 'SW_DRIVER'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('(.*aer_cor_status: 0x[0-9a-fA-F]+, aer_cor_mask: 0x[0-9a-fA-F]+.*)') message='PCIe AER Correctable Error Status' event_category=<EventCategory.SW_DRIVER: 'SW_DRIVER'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('(.*aer_uncor_status: 0x[0-9a-fA-F]+, aer_uncor_mask: 0x[0-9a-fA-F]+.*)') message='PCIe AER Uncorrectable Error Status' event_category=<EventCategory.SW_DRIVER: 'SW_DRIVER'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('(.*aer_uncor_severity: 0x[0-9a-fA-F]+.*)(\\n.*TLP Header: (?:0x)?[0-9a-fA-F]+(?: (?:0x)?[0-9a-fA-F]+){3}.*)', re.MULTILINE) message='PCIe AER Uncorrectable Error Severity with TLP Header' event_category=<EventCategory.SW_DRIVER: 'SW_DRIVER'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('Failed to read journal file.*') message='Failed to read journal file' event_category=<EventCategory.OS: 'OS'> event_priority=<EventPriority.WARNING: 2>, regex=re.compile('journal corrupted or uncleanly shut down.*') message='Journal file corrupted or uncleanly shut down' event_category=<EventCategory.OS: 'OS'> event_priority=<EventPriority.WARNING: 2>, regex=re.compile('ACPI BIOS Error') message='ACPI BIOS Error' event_category=<EventCategory.BIOS: 'BIOS'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('ACPI Error') message='ACPI Error' event_category=<EventCategory.BIOS: 'BIOS'> event_priority=<EventPriority.WARNING: 2>, regex=re.compile('EXT4-fs error \\(device .*\\):') message='Filesystem corrupted!' event_category=<EventCategory.OS: 'OS'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('(Buffer I\\/O error on dev)(?:ice)? (\\w+)') message='Error in buffered IO, check filesystem integrity' event_category=<EventCategory.IO: 'IO'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('pcieport (\\w+:\\w+:\\w+\\.\\w+):\\s+(\\w+):\\s+(Slot\\(\\d+\\)):\\s+(Card not present)') message='PCIe card no longer present' event_category=<EventCategory.IO: 'IO'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('pcieport (\\w+:\\w+:\\w+\\.\\w+):\\s+(\\w+):\\s+(Slot\\(\\d+\\)):\\s+(Link Down)') message='PCIe Link Down' event_category=<EventCategory.IO: 'IO'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('pcieport (\\w+:\\w+:\\w+\\.\\w+):\\s+(\\w+):\\s+(current common clock configuration is inconsistent, reconfiguring)') message='Mismatched clock configuration between PCIe device and host' event_category=<EventCategory.IO: 'IO'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('(?:\\d{4}-\\d+-\\d+T\\d+:\\d+:\\d+,\\d+[+-]\\d+:\\d+)?(.* correctable hardware errors detected in total in \\w+ block.*)') message='RAS Correctable Error' event_category=<EventCategory.RAS: 'RAS'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('(?:\\d{4}-\\d+-\\d+T\\d+:\\d+:\\d+,\\d+[+-]\\d+:\\d+)?(.* uncorrectable hardware errors detected in \\w+ block.*)') message='RAS Uncorrectable Error' event_category=<EventCategory.RAS: 'RAS'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('(?:\\d{4}-\\d+-\\d+T\\d+:\\d+:\\d+,\\d+[+-]\\d+:\\d+)?(.* deferred hardware errors detected in \\w+ block.*)') message='RAS Deferred Error' event_category=<EventCategory.RAS: 'RAS'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('((?:\\[Hardware Error\\]:\\s+)?event severity: corrected.*)\\n.*(\\[Hardware Error\\]:\\s+Error \\d+, type: corrected.*)\\n.*(\\[Hardware Error\\]:\\s+section_type: PCIe error.*)') message='RAS Corrected PCIe Error' event_category=<EventCategory.RAS: 'RAS'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('(?:\\d{4}-\\d+-\\d+T\\d+:\\d+:\\d+,\\d+[+-]\\d+:\\d+)?(.*GPU reset begin.*)') message='GPU Reset' event_category=<EventCategory.RAS: 'RAS'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('(?:\\d{4}-\\d+-\\d+T\\d+:\\d+:\\d+,\\d+[+-]\\d+:\\d+)?(.*GPU reset(?:\\(\\d+\\))? failed.*)') message='GPU reset failed' event_category=<EventCategory.RAS: 'RAS'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('(Accelerator Check Architecture[^\\n]*)(?:\\n[^\\n]*){0,10}?(amdgpu[ 0-9a-fA-F:.]+:? [^\\n]*entry\\[\\d+\\]\\.STATUS=0x[0-9a-fA-F]+)(?:\\n[^\\n]*){0,5}?(amdgpu[ 0-9a-fA-F:.]+:? [^\\n]*entry\\[\\d+\\], re.MULTILINE) message='ACA Error' event_category=<EventCategory.RAS: 'RAS'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('(Accelerator Check Architecture[^\\n]*)(?:\\n[^\\n]*){0,10}?(amdgpu[ 0-9a-fA-F:.]+:? [^\\n]*CONTROL=0x[0-9a-fA-F]+)(?:\\n[^\\n]*){0,5}?(amdgpu[ 0-9a-fA-F:.]+:? [^\\n]*STATUS=0x[0-9a-fA-F]+)(?:\\n[^\\, re.MULTILINE) message='ACA Error' event_category=<EventCategory.RAS: 'RAS'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('\\[Hardware Error\\]:.+MC\\d+_STATUS.*(?:\\n.*){0,5}') message='MCE Error' event_category=<EventCategory.RAS: 'RAS'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('(?:\\d{4}-\\d+-\\d+T\\d+:\\d+:\\d+,\\d+[+-]\\d+:\\d+)? (.*Mode2 reset failed.*)') message='Mode 2 Reset Failed' event_category=<EventCategory.RAS: 'RAS'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('(?:\\d{4}-\\d+-\\d+T\\d+:\\d+:\\d+,\\d+[+-]\\d+:\\d+)?(.*\\[Hardware Error\\]: Corrected error.*)') message='RAS Corrected Error' event_category=<EventCategory.RAS: 'RAS'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('x86/cpu: SGX disabled by BIOS') message='SGX Error' event_category=<EventCategory.BIOS: 'BIOS'> event_priority=<EventPriority.WARNING: 2>, regex=re.compile('Failed to load MMP firmware qat_4xxx_mmp.bin') message='MMP Error' event_category=<EventCategory.BIOS: 'BIOS'> event_priority=<EventPriority.WARNING: 2>, regex=re.compile('amdgpu \\w{4}:\\w{2}:\\w{2}.\\w: amdgpu: WARN: GPU is throttled.*') message='GPU Throttled' event_category=<EventCategory.SW_DRIVER: 'SW_DRIVER'> event_priority=<EventPriority.WARNING: 2>, regex=re.compile('amdgpu[ 0-9a-fA-F:.]+:(?:\\s*amdgpu:)?\\s+(?:{\\d+})?poison is consumed by client \\d+, kick off gpu reset flow') message='RAS Poison Consumed' event_category=<EventCategory.RAS: 'RAS'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('amdgpu[ 0-9a-fA-F:.]+:(?:\\s*amdgpu:)?\\s+(?:{\\d+})?Poison is created') message='RAS Poison created' event_category=<EventCategory.RAS: 'RAS'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('(amdgpu: Saved bad pages (\\d+) reaches threshold value 128)') message='Bad page threshold exceeded' event_category=<EventCategory.RAS: 'RAS'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('Hardware error from APEI Generic Hardware Error Source:.*(?:\\n.*){0,14}') message='RAS Hardware Error' event_category=<EventCategory.RAS: 'RAS'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('Error Address.*(?:\\s.*)') message='Error Address' event_category=<EventCategory.RAS: 'RAS'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('EDR: EDR event received') message='RAS EDR Event' event_category=<EventCategory.RAS: 'RAS'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('DPC: .*') message='DPC Event' event_category=<EventCategory.RAS: 'RAS'> event_priority=<EventPriority.ERROR: 3>, regex=re.compile('(?:\\[[^\\]]+\\]\\s*)?LNetError:.*ko2iblnd:\\s*No matching interfaces', re.IGNORECASE) message='LNet: ko2iblnd has no matching interfaces' event_category=<EventCategory.IO: 'IO'> event_priority=<EventPriority.WARNING: 2>, regex=re.compile('(?:\\[[^\\]]+\\]\\s*)?LNetError:\\s*.*Error\\s*-?\\d+\\s+starting up LNI\\s+\\w+', re.IGNORECASE) message='LNet: Error starting up LNI' event_category=<EventCategory.IO: 'IO'> event_priority=<EventPriority.WARNING: 2>, regex=re.compile('LustreError:.*ptlrpc_init_portals\\(\\).*network initiali[sz]ation failed', re.IGNORECASE) message='Lustre: network initialisation failed' event_category=<EventCategory.IO: 'IO'> event_priority=<EventPriority.WARNING: 2> ]
57 items defined
- Built-in Regexes:
-
- Out of memory error:
(?:oom_kill_process.*)|(?:Out of memory.*)
- Out of memory error:
-
- I/O Page Fault:
IO_PAGE_FAULT
- I/O Page Fault:
-
- Kernel Panic:
\bkernel panic\b.*
- Kernel Panic:
-
- SQ Interrupt:
sq_intr
- SQ Interrupt:
-
- SRAM ECC:
sram_ecc.*
- SRAM ECC:
-
- Failed to load driver. IP hardware init error.:
\[amdgpu\]\] \*ERROR\* hw_init of IP block.*
- Failed to load driver. IP hardware init error.:
-
- Failed to load driver. IP software init error.:
\[amdgpu\]\] \*ERROR\* sw_init of IP block.*
- Failed to load driver. IP software init error.:
-
- Real Time throttling activated:
sched: RT throttling activated.*
- Real Time throttling activated:
-
- RCU preempt detected stalls:
rcu_preempt detected stalls.*
- RCU preempt detected stalls:
-
- RCU preempt self-detected stall:
rcu_preempt self-detected stall.*
- RCU preempt self-detected stall:
-
- QCM fence timeout:
qcm fence wait loop timeout.*
- QCM fence timeout:
-
- General protection fault:
(?:[\w-]+(?:\[[0-9.]+\])?\s+)?general protectio...
- General protection fault:
-
- Segmentation fault:
(?:segfault.*in .*\[)|(?:[Ss]egmentation [Ff]au...
- Segmentation fault:
-
- Failed to disallow cf state:
amdgpu: Failed to disallow cf state.*
- Failed to disallow cf state:
-
- Failed to terminate tmr:
\*ERROR\* Failed to terminate tmr.*
- Failed to terminate tmr:
-
- Suspend of IP block failed:
\*ERROR\* suspend of IP block <\w+> failed.*
- Suspend of IP block failed:
-
- amdgpu Page Fault:
(amdgpu \w{4}:\w{2}:\w{2}\.\w:\s+amdgpu:\s+\[\S...
- amdgpu Page Fault:
-
- Page Fault:
page fault for address.*
- Page Fault:
-
- Fatal error during GPU init:
(?:amdgpu)(.*Fatal error during GPU init)|(Fata...
- Fatal error during GPU init:
-
- PCIe AER Error Status:
(pcieport [\w:.]+: AER: aer_status:[^\n]*(?:\n[...
- PCIe AER Error Status:
-
- PCIe AER Correctable Error Status:
(.*aer_cor_status: 0x[0-9a-fA-F]+, aer_cor_mask...
- PCIe AER Correctable Error Status:
-
- PCIe AER Uncorrectable Error Status:
(.*aer_uncor_status: 0x[0-9a-fA-F]+, aer_uncor_...
- PCIe AER Uncorrectable Error Status:
-
- PCIe AER Uncorrectable Error Severity with TLP Header:
(.*aer_uncor_severity: 0x[0-9a-fA-F]+.*)(\n.*TL...
- PCIe AER Uncorrectable Error Severity with TLP Header:
-
- Failed to read journal file:
Failed to read journal file.*
- Failed to read journal file:
-
- Journal file corrupted or uncleanly shut down:
journal corrupted or uncleanly shut down.*
- Journal file corrupted or uncleanly shut down:
-
- ACPI BIOS Error:
ACPI BIOS Error
- ACPI BIOS Error:
-
- ACPI Error:
ACPI Error
- ACPI Error:
-
- Filesystem corrupted!:
EXT4-fs error \(device .*\):
- Filesystem corrupted!:
-
- Error in buffered IO, check filesystem integrity:
(Buffer I\/O error on dev)(?:ice)? (\w+)
- Error in buffered IO, check filesystem integrity:
-
- PCIe card no longer present:
pcieport (\w+:\w+:\w+\.\w+):\s+(\w+):\s+(Slot\(...
- PCIe card no longer present:
-
- PCIe Link Down:
pcieport (\w+:\w+:\w+\.\w+):\s+(\w+):\s+(Slot\(...
- PCIe Link Down:
-
- Mismatched clock configuration between PCIe device and host:
pcieport (\w+:\w+:\w+\.\w+):\s+(\w+):\s+(curren...
- Mismatched clock configuration between PCIe device and host:
-
- RAS Correctable Error:
(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....
- RAS Correctable Error:
-
- RAS Uncorrectable Error:
(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....
- RAS Uncorrectable Error:
-
- RAS Deferred Error:
(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....
- RAS Deferred Error:
-
- RAS Corrected PCIe Error:
((?:\[Hardware Error\]:\s+)?event severity: cor...
- RAS Corrected PCIe Error:
-
- GPU Reset:
(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....
- GPU Reset:
-
- GPU reset failed:
(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....
- GPU reset failed:
-
- ACA Error:
(Accelerator Check Architecture[^\n]*)(?:\n[^\n...
- ACA Error:
-
- ACA Error:
(Accelerator Check Architecture[^\n]*)(?:\n[^\n...
- ACA Error:
-
- MCE Error:
\[Hardware Error\]:.+MC\d+_STATUS.*(?:\n.*){0,5}
- MCE Error:
-
- Mode 2 Reset Failed:
(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)? (...
- Mode 2 Reset Failed:
-
- RAS Corrected Error:
(?:\d{4}-\d+-\d+T\d+:\d+:\d+,\d+[+-]\d+:\d+)?(....
- RAS Corrected Error:
-
- SGX Error:
x86/cpu: SGX disabled by BIOS
- SGX Error:
-
- MMP Error:
Failed to load MMP firmware qat_4xxx_mmp.bin
- MMP Error:
-
- GPU Throttled:
amdgpu \w{4}:\w{2}:\w{2}.\w: amdgpu: WARN: GPU ...
- GPU Throttled:
-
- RAS Poison Consumed:
amdgpu[ 0-9a-fA-F:.]+:(?:\s*amdgpu:)?\s+(?:{\d+...
- RAS Poison Consumed:
-
- RAS Poison created:
amdgpu[ 0-9a-fA-F:.]+:(?:\s*amdgpu:)?\s+(?:{\d+...
- RAS Poison created:
-
- Bad page threshold exceeded:
(amdgpu: Saved bad pages (\d+) reaches threshol...
- Bad page threshold exceeded:
-
- RAS Hardware Error:
Hardware error from APEI Generic Hardware Error...
- RAS Hardware Error:
-
- Error Address:
Error Address.*(?:\s.*)
- Error Address:
-
- RAS EDR Event:
EDR: EDR event received
- RAS EDR Event:
-
- DPC Event:
DPC: .*
- DPC Event:
-
- LNet: ko2iblnd has no matching interfaces:
(?:\[[^\]]+\]\s*)?LNetError:.*ko2iblnd:\s*No ma...
- LNet: ko2iblnd has no matching interfaces:
-
- LNet: Error starting up LNI:
(?:\[[^\]]+\]\s*)?LNetError:\s*.*Error\s*-?\d+\...
- LNet: Error starting up LNI:
-
- Lustre: network initialisation failed:
LustreError:.*ptlrpc_init_portals\(\).*network ...
- Lustre: network initialisation failed:
Check journalctl for errors
Bases: ['DataAnalyzer']
Link to code: journal_analyzer.py
Check kernel matches expected versions
Bases: ['DataAnalyzer']
Link to code: kernel_analyzer.py
Check kernel matches expected versions
Bases: ['DataAnalyzer']
Link to code: kernel_module_analyzer.py
Check memory usage is within the maximum allowed used memory
Bases: ['DataAnalyzer']
Link to code: memory_analyzer.py
Analyze niccli/nicctl data; checks Broadcom support_rdma, performance_profile (RoCE), pcie_relaxed_ordering (enabled), and getqos (expected QoS across adapters).
Bases: ['DataAnalyzer']
Link to code: nic_analyzer.py
Check os matches expected versions
Bases: ['DataAnalyzer']
Link to code: os_analyzer.py
Check the package version data against the expected package version data
Bases: ['DataAnalyzer']
Link to code: package_analyzer.py
Check PCIe Data for errors
This calls checks the following:
- PCIe link status for each BDF
- This checks if the link speed and width are as expected
- AER uncorrectable errors
- Checks PCIe AER uncorrectable error registers UNCORR_ERR_STAT_REG and reports any errors
- AER correctable errors
- Checks the AERs correctable error registers CORR_ERR_STAT_REG and reports any errors
- PCIe device status errors
- Checks PCIe device status errors reported in fields `CORR_ERR_DET` `NON_FATAL_ERR_DET` `FATAL_ERR_DET` `UR_DET`
- PCIe status errors
- Checks PCIe status errors reported in fields `MSTR_DATA_PAR_ERR` `SIGNALED_TARGET_ABORT` `RCVD_TARGET_ABORT`
`RCVD_MSTR_ABORT` `SIGNALED_SYS_ERR` `DET_PARITY_ERR`
Bases: ['DataAnalyzer']
Link to code: pcie_analyzer.py
- GPU_BRIDGE_USP_ID:
0x1501 - GPU_BRIDGE_DSP_ID:
0x1500
Check cpu and kfd processes are within allowed maximum cpu and gpu usage
Bases: ['DataAnalyzer']
Link to code: process_analyzer.py
Check RDMA statistics for errors (RoCE and other RDMA error counters).
Bases: ['DataAnalyzer']
Link to code: rdma_analyzer.py
- ERROR_FIELDS:
[ recoverable_errors, tx_roce_errors, tx_roce_discards, rx_roce_errors, rx_roce_discards, local_ack_timeout_err, packet_seq_err, max_retry_exceeded, rnr_nak_retry_err, implied_nak_seq_err, unrecoverable_err, bad_resp_err, local_qp_op_err, local_protection_err, mem_mgmt_op_err, req_remote_invalid_request, req_remote_access_errors, remote_op_err, duplicate_request, res_exceed_max, resp_local_length_error, res_exceeds_wqe, res_opcode_err, res_rx_invalid_rkey, res_rx_domain_err, res_rx_no_perm, res_rx_range_err, res_tx_invalid_rkey, res_tx_domain_err, res_tx_no_perm, res_tx_range_err, res_irrq_oflow, res_unsup_opcode, res_unaligned_atomic, res_rem_inv_err, res_mem_err, res_srq_err, res_cmp_err, res_invalid_dup_rkey, res_wqe_format_err, res_cq_load_err, res_srq_load_err, res_tx_pci_err, res_rx_pci_err, out_of_buffer, out_of_sequence, req_cqe_error, req_cqe_flush_error, resp_cqe_error, resp_cqe_flush_error, resp_remote_access_errors, req_rx_pkt_seq_err, req_rx_rnr_retry_err, req_rx_rmt_acc_err, req_rx_rmt_req_err, req_rx_oper_err, req_rx_impl_nak_seq_err, req_rx_cqe_err, req_rx_cqe_flush, req_rx_dup_response, req_rx_inval_pkts, req_tx_loc_acc_err, req_tx_loc_oper_err, req_tx_mem_mgmt_err, req_tx_retry_excd_err, req_tx_loc_sgl_inv_err, resp_rx_dup_request, resp_rx_outof_buf, resp_rx_outouf_seq, resp_rx_cqe_err, resp_rx_cqe_flush, resp_rx_loc_len_err, resp_rx_inval_request, resp_rx_loc_oper_err, resp_rx_outof_atomic, resp_tx_pkt_seq_err, resp_tx_rmt_inval_req_err, resp_tx_rmt_acc_err, resp_tx_rmt_oper_err, resp_tx_rnr_retry_err, resp_tx_loc_sgl_inv_err, resp_rx_s0_table_err, resp_rx_ccl_cts_outouf_seq, tx_rdma_ack_timeout, tx_rdma_ccl_cts_ack_timeout, rx_rdma_mtu_discard_pkts ] - CRITICAL_ERROR_FIELDS:
['unrecoverable_err', 'res_tx_pci_err', 'res_rx_pci_err', 'res_mem_err']
Check ROCm matches expected versions.
The expected ROCm version (exp_rocm) can be a string or a list of allowed strings.
Sub-versions (exp_rocm_sub_versions) are a dict: each value can be a string or
a list of allowed strings for that key.
Bases: ['DataAnalyzer']
Link to code: rocm_analyzer.py
Check storage usage
Bases: ['DataAnalyzer']
Link to code: storage_analyzer.py
Check sysfs settings against expected values from the checks list.
Bases: ['DataAnalyzer']
Link to code: sys_settings_analyzer.py
Check sysctl matches expected sysctl details
Bases: ['DataAnalyzer']
Link to code: sysctl_analyzer.py
Bases: ['AnalyzerArgs']
Link to code: analyzer_args.py
- check_static_data:
bool— If True, run static data checks (e.g. driver version, partition mode). - expected_gpu_processes:
Optional[int]— Expected number of GPU processes. - expected_max_power:
Optional[int]— Expected maximum power value (e.g. watts). - expected_driver_version:
Optional[str]— Expected AMD driver version string. - expected_memory_partition_mode:
Optional[str]— Expected memory partition mode (e.g. sp3, dp). - expected_compute_partition_mode:
Optional[str]— Expected compute partition mode. - expected_pldm_version:
Optional[str]— Expected PLDM version string. - l0_to_recovery_count_error_threshold:
Optional[int]— L0-to-recovery count above which an error is raised. - l0_to_recovery_count_warning_threshold:
Optional[int]— L0-to-recovery count above which a warning is raised. - vendorid_ep:
Optional[str]— Expected endpoint vendor ID (e.g. for PCIe). - vendorid_ep_vf:
Optional[str]— Expected endpoint VF vendor ID. - devid_ep:
Optional[str]— Expected endpoint device ID. - devid_ep_vf:
Optional[str]— Expected endpoint VF device ID. - sku_name:
Optional[str]— Expected SKU name string for GPU. - expected_xgmi_speed:
Optional[list[float]]— Expected xGMI speed value(s) (e.g. link rate). - analysis_range_start:
Optional[datetime.datetime]— Start of time range for time-windowed analysis. - analysis_range_end:
Optional[datetime.datetime]— End of time range for time-windowed analysis.
Bases: ['AnalyzerArgs']
Link to code: analyzer_args.py
- exp_bios_version:
list[str]— Expected BIOS version(s) to match against collected value (str or list). - regex_match:
bool— If True, match exp_bios_version as regex; otherwise exact match.
Bases: ['AnalyzerArgs']
Link to code: analyzer_args.py
- required_cmdline:
Union[str, List]— Command-line parameters that must be present (e.g. 'pci=bfsort'). - banned_cmdline:
Union[str, List]— Command-line parameters that must not be present. - os_overrides:
Dict[str, nodescraper.plugins.inband.cmdline.cmdlineconfig.OverrideConfig]— Per-OS overrides for required_cmdline and banned_cmdline (keyed by OS identifier). - platform_overrides:
Dict[str, nodescraper.plugins.inband.cmdline.cmdlineconfig.OverrideConfig]— Per-platform overrides for required_cmdline and banned_cmdline (keyed by platform).
Bases: ['AnalyzerArgs']
Link to code: analyzer_args.py
- cpu_count:
Optional[list[int]]— Expected CPU count(s); pass as int or list of ints. Analysis passes if actual is in list. - gpu_count:
Optional[list[int]]— Expected GPU count(s); pass as int or list of ints. Analysis passes if actual is in list. - vf_count:
Optional[list[int]]— Expected virtual function count(s); pass as int or list of ints. Analysis passes if actual is in list.
Bases: ['AnalyzerArgs']
Link to code: analyzer_args.py
- dkms_status:
Union[str, list]— Expected dkms status string(s) to match (e.g. 'amd/1.0.0'). At least one of dkms_status or dkms_version required. - dkms_version:
Union[str, list]— Expected dkms version string(s) to match. At least one of dkms_status or dkms_version required. - regex_match:
bool— If True, match dkms_status and dkms_version as regex; otherwise exact match.
Arguments for journal analyzer
Bases: ['TimeRangeAnalysisArgs']
Link to code: analyzer_args.py
- analysis_range_start:
Optional[datetime.datetime]— Start of time range for analysis (ISO format). Only events on or after this time are analyzed. - analysis_range_end:
Optional[datetime.datetime]— End of time range for analysis (ISO format). Only events before this time are analyzed. - check_priority:
Optional[int]— Check against journal log priority (0=emergency..7=debug). If an entry has priority <= check_priority, an ERROR event is raised. - group:
bool— If True, group entries that have the same priority and message.
Bases: ['AnalyzerArgs']
Link to code: analyzer_args.py
- exp_kernel:
Union[str, list]— Expected kernel version string(s) to match (e.g. from uname -a). - exp_numa:
Optional[int]— Expected value for kernel.numa_balancing (e.g. 0 or 1). - regex_match:
bool— If True, match exp_kernel as regex; otherwise exact match.
Bases: ['AnalyzerArgs']
Link to code: analyzer_args.py
- kernel_modules:
dict[str, dict]— Expected kernel module name -> {version, etc.}. Analyzer checks collected modules match. - regex_filter:
list[str]— List of regex patterns to filter which collected modules are checked (default: amd).
Bases: ['AnalyzerArgs']
Link to code: analyzer_args.py
- ratio:
float— Required free-memory ratio (0-1). Analysis fails if free/total < ratio. - memory_threshold:
str— Minimum free memory required (e.g. '30Gi', '1T'). Used when ratio is not sufficient.
Analyzer args for niccli/nicctl data, with expected_values keyed by canonical command key.
Bases: ['AnalyzerArgs']
Link to code: analyzer_args.py
- expected_values:
Optional[Dict[str, Dict[str, Any]]]— Per-command expected checks keyed by canonical key (see command_to_canonical_key). - performance_profile_expected:
str— Expected Broadcom performance_profile value (case-insensitive). Default RoCE. - support_rdma_disabled_values:
List[str]— Values that indicate RDMA is not supported (case-insensitive). - pcie_relaxed_ordering_expected:
str— Expected Broadcom pcie_relaxed_ordering value (e.g. 'Relaxed ordering = enabled'); checked case-insensitively. Default enabled. - expected_qos_prio_map:
Optional[Dict[Any, Any]]— Expected priority-to-TC map (e.g. {0: 0, 1: 1}; keys may be int or str in config). Checked per device when set. - expected_qos_pfc_enabled:
Optional[int]— Expected PFC enabled value (0/1 or bitmask). Checked per device when set. - expected_qos_tsa_map:
Optional[Dict[Any, Any]]— Expected TSA map for ETS (e.g. {0: 'ets', 1: 'strict'}; keys may be int or str in config). Checked per device when set. - expected_qos_tc_bandwidth:
Optional[List[int]]— Expected TC bandwidth percentages. Checked per device when set. - require_qos_consistent_across_adapters:
bool— When True and no expected_qos_* are set, require all adapters to have the same prio_map, pfc_enabled, and tsa_map. - nicctl_log_error_regex:
Optional[List[Dict[str, Any]]]— Optional list of error patterns for nicctl show card logs.
Bases: ['AnalyzerArgs']
Link to code: analyzer_args.py
- exp_os:
Union[str, list]— Expected OS name/version string(s) to match (e.g. from lsb_release or /etc/os-release). - exact_match:
bool— If True, require exact match for exp_os; otherwise substring match.
Bases: ['AnalyzerArgs']
Link to code: analyzer_args.py
- exp_package_ver:
Dict[str, Optional[str]]— Map package name -> expected version (None = any version). Checked against installed packages. - regex_match:
bool— If True, match package versions with regex; otherwise exact or prefix match. - rocm_regex:
Optional[str]— Optional regex to identify ROCm package version (used when enable_rocm_regex is True). - enable_rocm_regex:
bool— If True, use rocm_regex (or default pattern) to extract ROCm version for checks.
Arguments for PCIe analyzer
Bases: ['AnalyzerArgs']
Link to code: analyzer_args.py
- exp_speed:
int— Expected PCIe link speed (generation 1–5). - exp_width:
int— Expected PCIe link width in lanes (1–16). - exp_sriov_count:
int— Expected SR-IOV virtual function count. - exp_gpu_count_override:
Optional[int]— Override expected GPU count for validation. - exp_max_payload_size:
Union[Dict[int, int], int, NoneType]— Expected max payload size: int for all devices, or dict keyed by device ID. - exp_max_rd_req_size:
Union[Dict[int, int], int, NoneType]— Expected max read request size: int for all devices, or dict keyed by device ID. - exp_ten_bit_tag_req_en:
Union[Dict[int, int], int, NoneType]— Expected 10-bit tag request enable: int for all devices, or dict keyed by device ID.
Bases: ['AnalyzerArgs']
Link to code: analyzer_args.py
- max_kfd_processes:
int— Maximum allowed number of KFD (Kernel Fusion Driver) processes; 0 disables the check. - max_cpu_usage:
float— Maximum allowed CPU usage (percent) for process checks.
Bases: ['AnalyzerArgs']
Link to code: analyzer_args.py
- exp_rocm:
Union[str, list]— Expected ROCm version string(s) to match (e.g. from rocminfo). - exp_rocm_latest:
str— Expected 'latest' ROCm path or version string for versioned installs. - exp_rocm_sub_versions:
dict[str, Union[str, list]]— Map sub-version name (e.g. version_rocm) to expected string or list of allowed strings.
Sysfs settings for analysis via a list of checks (path, expected values, name).
The path in each check is the sysfs path to read; the collector uses these paths
when collection_args is derived from analysis_args (e.g. by the plugin).
Bases: ['AnalyzerArgs']
Link to code: analyzer_args.py
- checks:
Optional[list[nodescraper.plugins.inband.sys_settings.analyzer_args.SysfsCheck]]— List of sysfs checks (path, expected values or pattern, display name).
Bases: ['AnalyzerArgs']
Link to code: analyzer_args.py
- exp_vm_swappiness:
Optional[int]— Expected vm.swappiness value. - exp_vm_numa_balancing:
Optional[int]— Expected vm.numa_balancing value. - exp_vm_oom_kill_allocating_task:
Optional[int]— Expected vm.oom_kill_allocating_task value. - exp_vm_compaction_proactiveness:
Optional[int]— Expected vm.compaction_proactiveness value. - exp_vm_compact_unevictable_allowed:
Optional[int]— Expected vm.compact_unevictable_allowed value. - exp_vm_extfrag_threshold:
Optional[int]— Expected vm.extfrag_threshold value. - exp_vm_zone_reclaim_mode:
Optional[int]— Expected vm.zone_reclaim_mode value. - exp_vm_dirty_background_ratio:
Optional[int]— Expected vm.dirty_background_ratio value. - exp_vm_dirty_ratio:
Optional[int]— Expected vm.dirty_ratio value. - exp_vm_dirty_writeback_centisecs:
Optional[int]— Expected vm.dirty_writeback_centisecs value. - exp_kernel_numa_balancing:
Optional[int]— Expected kernel.numa_balancing value.