Locating the Faulty ECC Memory Slot on a Lenovo ThinkServer RD450X

WHAT

On one of our Lenovo ThinkServer RD450X machines, the SEL log reports memory ECC errors but carries no memory slot information:

$ dmidecode -t 1
# dmidecode 2.12-dmifs
SMBIOS 3.0 present.
# SMBIOS implementations newer than version 2.8 are not
# fully supported by this version of dmidecode.

Handle 0x0001, DMI type 1, 27 bytes
System Information
        Manufacturer: Lenovo
        Product Name: RD450X
        Version: 70FRR156CN
        Serial Number: PC0FZ
        UUID: 422F93DA-1A8C-E611-BC31-6C0B
        Wake-up Type: Power Switch
        SKU Number: 0
        Family: ThinkServer RD450X

$ ipmitool mc info|grep ^Firm
Firmware Revision         : 4.11

$ ipmitool sel elist last 15|grep -i memory
 c50 | 05/11/2021 | 10:07:21 | Memory #0x08 | Uncorrectable ECC | Asserted
 c52 | 05/11/2021 | 10:09:02 | Memory #0x08 | Uncorrectable ECC | Asserted

HOW

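A Google search turned up the article Diagnosing memory errors with IPMI, which mentions that ipmiutil can show details of ECC memory faults.

After installing ipmiutil from the EPEL repository, it does indeed report the faulty memory slot: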

$ ipmiutil sel -e -l 15|grep -i memory
0c52 05/11/21 18:09:02 MAJ SMI  Memory #08 CPU1 DIMM-AB VR Uncorrectable ECC, _Node0_Channel0_Dimm0/CPU1 DIMM A0 6f [a1 03 00]
0c50 05/11/21 18:07:21 MAJ SMI  Memory #08 CPU1 DIMM-AB VR Uncorrectable ECC, _Node0_Channel0_Dimm0/CPU1 DIMM A0 6f [a1 02 00]

$ dmidecode -t memory|egrep '^\s+(Manufacturer|Serial|Locator)'|awk 'ORS=NR%3?FS:RS'|grep -v NO
        Locator: CPU1 DIMM A0   Manufacturer: Samsung   Serial Number: 3304349     <-- faulty DIMM
        Locator: CPU1 DIMM B0   Manufacturer: Samsung   Serial Number: 330430D
        Locator: CPU1 DIMM C0   Manufacturer: Samsung   Serial Number: 3304335
        Locator: CPU1 DIMM D0   Manufacturer: Samsung   Serial Number: 3304306
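
The egrep/awk one-liner above can also be sketched in Python, which is easier to extend when the serial number has to be looked up by slot name. This is only a sketch: `populated_dimms` and `SAMPLE` are hypothetical names, and the sample text is a hand-written approximation of `dmidecode -t memory` output (the `Bank Locator` value and the `NO DIMM` placeholder for empty slots are assumptions, not copied from this machine).

```python
import re

# Same three fields the egrep pattern keeps; "Bank Locator" is skipped
# because the field name must follow the leading whitespace directly.
FIELD = re.compile(r"^\s+(Locator|Manufacturer|Serial Number):\s*(.*)$")

# Hand-written sample, assumed to resemble `dmidecode -t memory` output.
SAMPLE = (
    "Memory Device\n"
    "\tLocator: CPU1 DIMM A0\n"
    "\tBank Locator: _Node0_Channel0_Dimm0\n"
    "\tManufacturer: Samsung\n"
    "\tSerial Number: 3304349\n"
    "Memory Device\n"
    "\tLocator: CPU1 DIMM A1\n"
    "\tBank Locator: _Node0_Channel0_Dimm1\n"
    "\tManufacturer: NO DIMM\n"
    "\tSerial Number: NO DIMM\n"
)

def populated_dimms(dmidecode_text):
    """Return one dict per populated DIMM with Locator, Manufacturer, Serial Number."""
    dimms, current = [], None
    for line in dmidecode_text.splitlines():
        if line.startswith("Memory Device"):
            current = {}          # start a new record for each device block
            dimms.append(current)
            continue
        match = FIELD.match(line)
        if match and current is not None:
            current[match.group(1)] = match.group(2).strip()
    # Mirror `grep -v NO`: drop any record where a value contains "NO".
    return [d for d in dimms if d and not any("NO" in v for v in d.values())]

for dimm in populated_dimms(SAMPLE):
    print(dimm["Locator"], dimm["Manufacturer"], dimm["Serial Number"])
```

On a real machine the text would come from running `dmidecode -t memory` as root instead of from `SAMPLE`.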

WHY

ipmiutil derives the DIMM slot from the Event Data field:

$ ipmitool sel elist last 20 -v|grep -B2 ECC
Running Get PICMG Properties my_addr 0x20, transit 0, target 0
Error Response 0xc1 from Get PICMG Properities
No PICMG Extenstion discovered
 Event Direction       : Assertion Event
 Event Data            : a10200
 Description           : Uncorrectable ECC
--
 Event Direction       : Assertion Event
 Event Data            : a10300
 Description           : Uncorrectable ECC
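
For scripting, the Event Data values can be pulled out of this verbose output with a few lines of Python. The sketch below is illustrative only: `ecc_event_data` is a hypothetical helper, `SAMPLE` is the grep output shown above, and the parser assumes that `Event Data` precedes `Description` within each record, as it does here.

```python
# The records printed by `ipmitool sel elist last 20 -v | grep -B2 ECC` above.
SAMPLE = """\
 Event Direction       : Assertion Event
 Event Data            : a10200
 Description           : Uncorrectable ECC
--
 Event Direction       : Assertion Event
 Event Data            : a10300
 Description           : Uncorrectable ECC
"""

def ecc_event_data(verbose_text):
    """Return the Event Data hex string of every record whose description mentions ECC."""
    values = []
    pending = None  # Event Data seen for the record currently being read
    for line in verbose_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Event Data":
            pending = value
        elif key == "Description":
            if "ECC" in value and pending is not None:
                values.append(pending)
            pending = None  # the record is finished either way
    return values

print(ecc_event_data(SAMPLE))  # ['a10200', 'a10300']
```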

Convert the hexadecimal Event Data values a10200 and a10300 to binary:

$ python3
Python 3.8.5 (default, Jan 27 2021, 15:41:15)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> format(int("a10200", 16),"040b")
'0000000000000000101000010000001000000000'
>>> format(int("a10300", 16),"040b")
'0000000000000000101000010000001100000000'
>>> f'{0xa10200:0>42b}'
'000000000000000000101000010000001000000000'
>>> f'{0xa10300:0>42b}'
'000000000000000000101000010000001100000000'

a10200: 1010 0001 0000 0010 0000 0000
a10300: 1010 0001 0000 0011 0000 0000
                            ^^^^ ^^^^
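
The nibble-grouped layout above can be produced directly, without the extra zero padding of the interactive session. This is a minimal sketch; `event_data_bits` is a hypothetical helper name.

```python
def event_data_bits(event_data_hex):
    """Return each Event Data byte as an 8-character binary string."""
    raw = bytes.fromhex(event_data_hex)
    return [format(byte, "08b") for byte in raw]

for value in ("a10200", "a10300"):
    # Split every byte into two 4-bit groups for readability.
    groups = " ".join(f"{bits[:4]} {bits[4:]}" for bits in event_data_bits(value))
    print(f"{value}: {groups}")
# a10200: 1010 0001 0000 0010 0000 0000
# a10300: 1010 0001 0000 0011 0000 0000
```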

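The article Diagnosing memory errors with IPMI gives the mapping between Event Data and memory location:

[image: Event Data to memory location mapping]

It also gives two worked examples of decoding Event Data into a memory slot:

[image: decoding example 1]

[image: decoding example 2]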

The third byte of the hexadecimal Event Data a1 02 00 is 00, which is 0000 0000 as 8 binary bits:

0000 0000
===...---
 |  |  |
 |  |  |
 |  |  +----- 000  bits 0-2: DIMM     --> DIMM 0
 |  |
 |  +-------- 00   bits 3-4: Channel  --> channel A
 |
 +----------- 000  bits 5-7: CPU ID   --> CPU1

This mapping agrees with the slot that ipmiutil decoded: both give CPU1 DIMM A0.
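
The bit layout described above can be written as a small decoder. This is a sketch under stated assumptions, not ipmiutil's actual implementation: `decode_dimm_location` and `slot_label` are hypothetical names, and the label mapping (CPU ID 0 shown as CPU1, channel index 0 to 3 shown as A to D) is inferred from the slot names on this RD450X and may differ on other boards.

```python
def decode_dimm_location(event_data_hex):
    """Split the third Event Data byte into (cpu_id, channel, dimm) indices."""
    data = bytes.fromhex(event_data_hex)  # accepts "a10200" and "a1 02 00"
    if len(data) != 3:
        raise ValueError("expected exactly 3 Event Data bytes")
    byte3 = data[2]
    dimm = byte3 & 0b111            # bits 0-2
    channel = (byte3 >> 3) & 0b11   # bits 3-4
    cpu_id = (byte3 >> 5) & 0b111   # bits 5-7
    return cpu_id, channel, dimm

def slot_label(cpu_id, channel, dimm):
    """Format indices like this board's slot names (assumed naming scheme)."""
    return f"CPU{cpu_id + 1} DIMM {'ABCD'[channel]}{dimm}"

for value in ("a1 02 00", "a1 03 00"):
    print(value, "->", slot_label(*decode_dimm_location(value)))
# a1 02 00 -> CPU1 DIMM A0
# a1 03 00 -> CPU1 DIMM A0
```

Both events decode to the same slot because they differ only in the second byte, while the location is carried in the third.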

$ ipmiutil sel -e -l 15|grep -i memory
0c52 05/11/21 18:09:02 MAJ SMI  Memory #08 CPU1 DIMM-AB VR Uncorrectable ECC, _Node0_Channel0_Dimm0/CPU1 DIMM A0 6f [a1 03 00]
0c50 05/11/21 18:07:21 MAJ SMI  Memory #08 CPU1 DIMM-AB VR Uncorrectable ECC, _Node0_Channel0_Dimm0/CPU1 DIMM A0 6f [a1 02 00]
                                                                                                    ^^^^ ^^^^ ^^           ^^

REFERENCE

Diagnosing memory errors with IPMI

HP Decode SEL Errors