
benjojo replied 22 Feb 2024 12:21 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/467Rj2M8KMYH4wtF1T

There is also a mystery QSFP port connector sitting on the side of the switch... with no connector on the other side.

Wonder what it is used for. There seem to be plenty of other programming pins on the board, so I doubt it's a factory programming connector.

A QSFP connector without a cage, sitting on the side of the board where there would not be a slot for it on the case

benjojo replied 22 Feb 2024 15:51 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/qsjm5YQ9lqGP7G13hp

Well that was painful (firmware versions mismatching and not upgrading etc)

but we got there!

# ip l | grep BROADCAST | wc -l
23

# sensors
mlxsw-pci-0100
Adapter: PCI adapter
fan1:            7004 RPM
fan2:            7437 RPM
fan3:            7234 RPM
fan4:            7079 RPM
temp1:            +32.0°C  (highest = +32.0°C)
front panel 001:   +0.0°C  (crit =  +0.0°C, emerg =  +0.0°C)
...
front panel 022:   +0.0°C  (crit =  +0.0°C, emerg =  +0.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +18.0°C  (high = +98.0°C, crit = +98.0°C)
Core 1:       +18.0°C  (high = +98.0°C, crit = +98.0°C)
Core 2:       +19.0°C  (high = +98.0°C, crit = +98.0°C)
Core 3:       +19.0°C  (high = +98.0°C, crit = +98.0°C)

# uname -a
Linux bgptools-switch 6.1.78-2fast2benjojo-2 #1 SMP PREEMPT_DYNAMIC Thu Feb 22 13:43:28 UTC 2024 x86_64 GNU/Linux

A working 25G/100G switch running boring Debian

zev@honk.bewilderbee.. replied 22 Feb 2024 22:00 +0000
in reply to: https://mastodon.social/users/cks/statuses/111977230760054856

@cks @benjojo There's unfortunately often a delay between when the BMC powers on the host processor and when the interfaces by which the BMC can read the temperature of that processor (e.g. PECI on Intel platforms or SB-TSI for AMD) actually come fully online. The fans are usually on the same 12V power rail as the host and hence turn on when it does, and lacking a valid temperature reading from the host CPU, going into failsafe mode is the...well, safe option. Logic like "if we just turned it on right now it's probably not very hot" runs into problems if it had recently been on and is still holding a lot of residual heat. You could potentially get into tracking more history to disambiguate that in turn, but then you're suddenly a lot more stateful than you were, which gets messy and fragile (especially considering that the BMC and the host can both reboot independently of each other), and it's ultimately just a lot simpler and less error-prone to make it (relatively) stateless and err on the side of not cooking things. And of course since most servers end up situated in places where there usually aren't people around to hear them, acoustic noise optimization is typically pretty low on the list of priorities.

(Yes, I work on BMC firmware.)
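
To make the "(relatively) stateless" point concrete, here's a rough sketch of that kind of policy. It isn't taken from any real BMC firmware: the function name, the thresholds, and the duty-cycle numbers are all made up for illustration. The only input is the latest temperature reading, and a missing reading (say, PECI or SB-TSI not up yet) goes straight to full speed.

```python
from typing import Optional

# All values below are illustrative assumptions, not from real firmware.
FAILSAFE_DUTY = 100  # percent PWM duty used when no valid reading exists
MIN_DUTY = 30        # percent PWM duty at or below T_LOW
T_LOW = 40.0         # degrees C where the ramp starts
T_HIGH = 80.0        # degrees C where the ramp reaches full speed


def fan_duty(cpu_temp_c: Optional[float]) -> int:
    """Pick a fan duty cycle (percent) from the latest reading only.

    No history is kept, so a BMC reboot or a host reboot cannot leave
    this logic in a stale state.
    """
    if cpu_temp_c is None:
        # Sensor interface not online yet: err on the side of not cooking things.
        return FAILSAFE_DUTY
    if cpu_temp_c <= T_LOW:
        return MIN_DUTY
    if cpu_temp_c >= T_HIGH:
        return FAILSAFE_DUTY
    # Linear ramp between the two thresholds.
    frac = (cpu_temp_c - T_LOW) / (T_HIGH - T_LOW)
    return round(MIN_DUTY + frac * (FAILSAFE_DUTY - MIN_DUTY))
```

With these made-up numbers, `fan_duty(None)` gives 100 right after power-on, and it only drops once a real reading arrives, which is why the fans scream for the first little while.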