Locating the faulty ECC memory slot on a Lenovo ThinkServer RD450X
Category: Server Tags: Ipmi Memory Ecc
WHAT
A Lenovo ThinkServer RD450X at our plant logs memory ECC errors in its SEL, but the entries carry no memory slot information:
$ dmidecode -t 1
# dmidecode 2.12-dmifs
SMBIOS 3.0 present.
# SMBIOS implementations newer than version 2.8 are not
# fully supported by this version of dmidecode.
Handle 0x0001, DMI type 1, 27 bytes
System Information
Manufacturer: Lenovo
Product Name: RD450X
Version: 70FRR156CN
Serial Number: PC0FZ
UUID: 422F93DA-1A8C-E611-BC31-6C0B
Wake-up Type: Power Switch
SKU Number: 0
Family: ThinkServer RD450X
$ ipmitool mc info|grep ^Firm
Firmware Revision : 4.11
$ ipmitool sel elist last 15|grep -i memory
c50 | 05/11/2021 | 10:07:21 | Memory #0x08 | Uncorrectable ECC | Asserted
c52 | 05/11/2021 | 10:09:02 | Memory #0x08 | Uncorrectable ECC | Asserted
HOW
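A Google search turned up Diagnosing memory errors with IPMI, which mentions that ipmiutil can show details of ECC memory faults.
After installing ipmiutil from the EPEL repository, it does indeed report the faulty memory slot: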
$ ipmiutil sel -e -l 15|grep -i memory
0c52 05/11/21 18:09:02 MAJ SMI Memory #08 CPU1 DIMM-AB VR Uncorrectable ECC, _Node0_Channel0_Dimm0/CPU1 DIMM A0 6f [a1 03 00]
0c50 05/11/21 18:07:21 MAJ SMI Memory #08 CPU1 DIMM-AB VR Uncorrectable ECC, _Node0_Channel0_Dimm0/CPU1 DIMM A0 6f [a1 02 00]
$ dmidecode -t memory|egrep '^\s+(Manufacturer|Serial|Locator)'|awk 'ORS=NR%3?FS:RS'|grep -v NO
Locator: CPU1 DIMM A0 Manufacturer: Samsung Serial Number: 3304349 <-- faulty DIMM
Locator: CPU1 DIMM B0 Manufacturer: Samsung Serial Number: 330430D
Locator: CPU1 DIMM C0 Manufacturer: Samsung Serial Number: 3304335
Locator: CPU1 DIMM D0 Manufacturer: Samsung Serial Number: 3304306
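The egrep/awk pipeline above can also be sketched in Python, which may be easier to extend when the serial number has to be looked up by locator. This is a minimal sketch: the function name `dimm_inventory` and the abbreviated `SAMPLE` text are illustrative assumptions (real `dmidecode -t memory` output has many more fields per device); the locators and serial numbers are the ones shown in this post.

```python
# Abbreviated, illustrative stand-in for `dmidecode -t memory` output.
SAMPLE = """\
Memory Device
\tLocator: CPU1 DIMM A0
\tBank Locator: Not Specified
\tManufacturer: Samsung
\tSerial Number: 3304349
Memory Device
\tLocator: CPU1 DIMM B0
\tBank Locator: Not Specified
\tManufacturer: Samsung
\tSerial Number: 330430D
"""


def dimm_inventory(dmidecode_text):
    """Map each DIMM locator to its manufacturer and serial number."""
    inventory = {}
    locator = None
    for line in dmidecode_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # section headers such as "Memory Device" have no colon
        key, value = key.strip(), value.strip()
        if key == "Locator":  # "Bank Locator" does not match and is skipped
            locator = value
            inventory[locator] = {}
        elif key in ("Manufacturer", "Serial Number") and locator is not None:
            inventory[locator][key] = value
    return inventory


inventory = dimm_inventory(SAMPLE)
print(inventory["CPU1 DIMM A0"]["Serial Number"])  # 3304349
```

On a live system the text could come from `subprocess.run(["dmidecode", "-t", "memory"], capture_output=True, text=True).stdout`, which requires root.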
WHY
ipmiutil works out the DIMM slot by decoding the Event Data field of each SEL entry:
$ ipmitool sel elist last 20 -v|grep -B2 ECC
Running Get PICMG Properties my_addr 0x20, transit 0, target 0
Error Response 0xc1 from Get PICMG Properities
No PICMG Extenstion discovered
Event Direction : Assertion Event
Event Data : a10200
Description : Uncorrectable ECC
--
Event Direction : Assertion Event
Event Data : a10300
Description : Uncorrectable ECC
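Pulling the Event Data values out of the verbose listing can be scripted. The following is a rough sketch that assumes the `Event Data : <hex>` line format seen in the output above; `extract_event_data` and `SAMPLE` are hypothetical names introduced only for illustration.

```python
import re

# Lines copied from the verbose ipmitool output shown above.
SAMPLE = """\
Event Direction : Assertion Event
Event Data : a10200
Description : Uncorrectable ECC
--
Event Direction : Assertion Event
Event Data : a10300
Description : Uncorrectable ECC
"""

# Tolerates the column padding that ipmitool puts around the colon.
EVENT_DATA = re.compile(r"^\s*Event Data\s*:\s*([0-9a-fA-F]+)\s*$", re.MULTILINE)


def extract_event_data(text):
    """Return every Event Data hex string found in the text, in order."""
    return EVENT_DATA.findall(text)


print(extract_event_data(SAMPLE))  # ['a10200', 'a10300']
```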
Convert the hexadecimal Event Data values a10200 and a10300 to binary:
$ python3
Python 3.8.5 (default, Jan 27 2021, 15:41:15)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> format(int("a10200", 16), "024b")
'101000010000001000000000'
>>> format(int("a10300", 16), "024b")
'101000010000001100000000'
>>> f'{0xa10200:024b}'
'101000010000001000000000'
>>> f'{0xa10300:024b}'
'101000010000001100000000'
a10200: 1010 0001 0000 0010 0000 0000
a10300: 1010 0001 0000 0011 0000 0000
                            ^^^^ ^^^^
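The nibble-grouped layout above can be produced directly instead of being split by hand. A small sketch, where `event_data_bits` is a hypothetical helper name:

```python
def event_data_bits(hex_string):
    """Render Event Data hex as binary, grouped into 4-bit nibbles."""
    data = bytes.fromhex(hex_string)
    return " ".join(f"{byte >> 4:04b} {byte & 0x0F:04b}" for byte in data)


for raw in ("a10200", "a10300"):
    print(f"{raw}: {event_data_bits(raw)}")
# a10200: 1010 0001 0000 0010 0000 0000
# a10300: 1010 0001 0000 0011 0000 0000
```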
Diagnosing memory errors with IPMI documents how Event Data maps to a memory slot and gives two worked examples of the decoding.
Applying that mapping to the hexadecimal Event Data a1 02 00: the third byte, 00, is 0000 0000 in 8-bit binary:
0000 0000
===...---
 |  |  |
 |  |  |
 |  |  +----- 000  bits 0-2: DIMM    --> DIMM 0
 |  |
 |  +-------- 00   bits 3-4: channel --> channel A
 |
 +----------- 000  bits 5-7: CPU ID  --> CPU1
This mapping agrees with the slot that ipmiutil decodes; both give CPU1 DIMM A0:
$ ipmiutil sel -e -l 15|grep -i memory
0c52 05/11/21 18:09:02 MAJ SMI Memory #08 CPU1 DIMM-AB VR Uncorrectable ECC, _Node0_Channel0_Dimm0/CPU1 DIMM A0 6f [a1 03 00]
0c50 05/11/21 18:07:21 MAJ SMI Memory #08 CPU1 DIMM-AB VR Uncorrectable ECC, _Node0_Channel0_Dimm0/CPU1 DIMM A0 6f [a1 02 00]
                                                                                                   ^^^^ ^^^^ ^^           ^^
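The bit layout of the third Event Data byte can be sketched as a small decoder. This is not ipmiutil's actual implementation, only an illustration of the mapping described above. Two points are assumptions inferred from the single example in this post: the CPU number is the 3-bit CPU ID plus one, and channel values 0 to 3 correspond to the letters A to D. `DimmLocation` and `decode_dimm` are hypothetical names.

```python
from typing import NamedTuple


class DimmLocation(NamedTuple):
    cpu: int      # 1-based CPU number (assumed: CPU ID field + 1)
    channel: str  # channel letter (assumed: 0..3 -> A..D)
    dimm: int     # DIMM index within the channel

    def locator(self):
        return f"CPU{self.cpu} DIMM {self.channel}{self.dimm}"


def decode_dimm(event_data_hex):
    """Decode the DIMM slot from 3 bytes of SEL Event Data given as hex."""
    data = bytes.fromhex(event_data_hex)  # spaces between bytes are accepted
    if len(data) != 3:
        raise ValueError("expected exactly 3 bytes of Event Data")
    byte3 = data[2]
    cpu_id = (byte3 >> 5) & 0b111   # bits 5-7
    channel = (byte3 >> 3) & 0b11   # bits 3-4
    dimm = byte3 & 0b111            # bits 0-2
    return DimmLocation(cpu_id + 1, "ABCD"[channel], dimm)


print(decode_dimm("a1 02 00").locator())  # CPU1 DIMM A0
print(decode_dimm("a10300").locator())    # CPU1 DIMM A0
```

Both events decode to the same slot because they differ only in the second byte, which this mapping does not use.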