Who would win? 15 working DDR4 DIMMs, or 1 single DDR4 DIMM that ECC errored so hard the system decided it was not worth getting to the point of even telling me at startup which DIMM had gone bad
lmierzwa@mastodon.so..
replied 13 Apr 2024 19:28 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/861YGNyGc4t6rR9614
benjojo
replied 13 Apr 2024 19:39 +0000
in reply to: https://mastodon.social/users/lmierzwa/statuses/112265559326098530
MissAemilia@mastodon..
replied 13 Apr 2024 20:05 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/861YGNyGc4t6rR9614
@benjojo Pretty much every BMC I know of should be able to point to the failed DIMM. Worth looking in that direction.
benjojo
replied 14 Apr 2024 15:07 +0000
in reply to: https://mastodon.gamedev.place/users/MissAemilia/statuses/112265705142783377
@MissAemilia Yeah, the issue was twofold. First, this was a blade that had not yet had its IPMI reset, so I needed to boot it in order to see those messages, and the serial console/VGA console could not init before the bad DIMM would take the system down. Second, the chassis/firmware/whatever had a limit on how many ECC correctables can happen in a short time. This DIMM seemed to have DDR4 trained just fine, but instantly blew past this limit to the point where the CPU CATERR'd
flangey@chaos.social
replied 13 Apr 2024 20:11 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/861YGNyGc4t6rR9614
FritzAdalis@infosec...
replied 13 Apr 2024 20:15 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/861YGNyGc4t6rR9614
cks@mastodon.social
replied 13 Apr 2024 21:08 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/861YGNyGc4t6rR9614
@benjojo Apparently modern memory systems have to be 'trained' on system boot to establish their specific characteristics and timing and so on. I'd assume that this is separate for each DIMM so that a bad DIMM can't contaminate this process for the rest via some shared line, but now I'm wondering if I was too optimistic there. (Modern memory is kind of scary if I think about it too much, but this applies to basically all elements of a modern system. My storage runs OSes!)
benjojo
replied 14 Apr 2024 15:09 +0000
in reply to: https://mastodon.social/users/cks/statuses/112265954062793153
@cks I think the DIMM trains just fine (at least looking at the BMC seems to imply so), it's just that when the DIMM then "enters the ring" it triggers so many correctable errors so quickly that the CPU just CATERRs out. The whole memory system is magic, but I'm kind of surprised that the system is not smart enough to "kick out" a DIMM that is partially bad (trains fine, can't reliably remember things)