one honk maybe more

benjojo posted 14 Dec 2023 10:56 +0000

Amazing how much of a difference NUMA awareness makes in code. I saved ~10% CPU over the fleet by making memory allocation and task scheduling NUMA-aware on bgp.tools's multi-socket machines.

It turns out, when you are not stalled on trying to fetch data from RAM that is on another socket, your stuff is a lot faster!

I then saved another 10% by clearing up some internal RPC traffic. Good week!

An RRD CPU graph. It starts at around 60% usage and shows two drops in CPU usage: the first a ~10% drop with an arrow labelled "NUMA Awareness", the second another ~10% drop labelled "RPC Awareness".

benjojo replied 14 Dec 2023 11:38 +0000
in reply to: https://social.treehouse.systems/users/dee/statuses/111578486265430864

@dee So using https://github.com/benjojo/numa I made my bgpds:

(A) Detect the number of CPU sockets
(B) Randomly pick one
(C) Set the scheduling mask to use only that socket's threads/cores
(D) Set the memory policy (MPOL_LOCAL) to only allocate RAM from the socket it's scheduled on
(E) Re-exec the whole program. You need to do this in Go because set_mempolicy only applies to the calling thread; if you can run it on the root thread (hint: in Go this is hard) you can have it apply to the whole application, but just re-exec'ing yourself is easier. A rough sketch is below.
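
Something along these lines, to give the shape of it. This is not the exact code bgp.tools runs and not the numa library's API; the node discovery via sysfs, the NUMA_PINNED env var and the helper names are all made up for the sketch:

// A minimal sketch of the five steps above, not the exact bgp.tools code and
// not the github.com/benjojo/numa API. Node discovery via sysfs, the
// NUMA_PINNED env var, and the helper names are illustrative assumptions.
package main

import (
	"fmt"
	"math/rand"
	"os"
	"path/filepath"
	"runtime"
	"strconv"
	"strings"
	"syscall"

	"golang.org/x/sys/unix"
)

const mpolLocal = 4 // MPOL_LOCAL from <linux/mempolicy.h>

func main() {
	if os.Getenv("NUMA_PINNED") == "" {
		if err := pinToRandomNode(); err != nil {
			fmt.Fprintln(os.Stderr, "NUMA pinning skipped:", err)
		}
	}
	// ...normal daemon start-up continues here in the re-exec'd process...
}

func pinToRandomNode() error {
	// set_mempolicy and sched_setaffinity act on the calling thread, so stay on one.
	runtime.LockOSThread()

	// (A) Detect the NUMA nodes (sockets) the kernel exposes.
	nodes, _ := filepath.Glob("/sys/devices/system/node/node[0-9]*")
	if len(nodes) < 2 {
		return fmt.Errorf("not a multi-socket machine, nothing to do")
	}

	// (B) Randomly pick one.
	node := nodes[rand.Intn(len(nodes))]

	// (C) Restrict scheduling to that node's threads/cores only.
	raw, err := os.ReadFile(filepath.Join(node, "cpulist")) // e.g. "0-15,32-47"
	if err != nil {
		return err
	}
	var set unix.CPUSet
	for _, part := range strings.Split(strings.TrimSpace(string(raw)), ",") {
		lo, hi := part, part
		if i := strings.IndexByte(part, '-'); i >= 0 {
			lo, hi = part[:i], part[i+1:]
		}
		a, _ := strconv.Atoi(lo)
		b, _ := strconv.Atoi(hi)
		for cpu := a; cpu <= b; cpu++ {
			set.Set(cpu)
		}
	}
	if err := unix.SchedSetaffinity(0, &set); err != nil {
		return err
	}

	// (D) MPOL_LOCAL: only allocate RAM from the node we are scheduled on.
	if _, _, errno := unix.Syscall(unix.SYS_SET_MEMPOLICY, mpolLocal, 0, 0); errno != 0 {
		return errno
	}

	// (E) Re-exec: affinity and memory policy survive execve(2), so the fresh
	// process (and every thread it spawns) starts out with both applied.
	os.Setenv("NUMA_PINNED", "1")
	return syscall.Exec("/proc/self/exe", os.Args, os.Environ())
}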

It is worth pointing out this only makes sense for memory-heavy workloads (all that bgp.tools really does is memory access), and in places where NUMA is actually exposed to you, so it's not useful in cloud environments. Given that most people are on cloud/IaaS these days, this is not that applicable to most people anymore (other than cheap bastards like me!)

I'm 98% sure this saves CPU "%" because while a CPU thread is stuck waiting on the other socket's memory, it can't do anything else; the threads aren't doing useful work in that time, they are just blocked. So what I've really done here is reduce memory access latency.

The only downside is that I have to be a little more careful with memory usage now, since I can now "OOM" on a single socket:

~# numastat -m
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
MemTotal               256568.49       258038.88       514607.36
MemFree                 23292.68          464.24        23756.92
MemUsed                233275.81       257574.63       490850.45

But I like my current odds.
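
If I wanted the daemon to keep an eye on this itself, the same per-node numbers numastat prints are sitting in sysfs. A tiny hypothetical helper (not something I actually run) could look like this:

// A small sketch, not from the post: read per-node MemFree out of
// /sys/devices/system/node/nodeN/meminfo so the daemon can warn before
// a single socket runs dry.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// nodeMemFreeKB returns free memory in kB for each NUMA node, keyed by node ID.
func nodeMemFreeKB() (map[int]int64, error) {
	free := map[int]int64{}
	nodes, err := filepath.Glob("/sys/devices/system/node/node[0-9]*")
	if err != nil {
		return nil, err
	}
	for _, n := range nodes {
		id, _ := strconv.Atoi(strings.TrimPrefix(filepath.Base(n), "node"))
		raw, err := os.ReadFile(filepath.Join(n, "meminfo"))
		if err != nil {
			return nil, err
		}
		for _, line := range strings.Split(string(raw), "\n") {
			// Lines look like: "Node 1 MemFree:      475384 kB"
			f := strings.Fields(line)
			if len(f) >= 4 && f[2] == "MemFree:" {
				free[id], _ = strconv.ParseInt(f[3], 10, 64)
			}
		}
	}
	return free, nil
}

func main() {
	free, err := nodeMemFreeKB()
	if err != nil {
		panic(err)
	}
	for id, kb := range free {
		fmt.Printf("node %d: %.1f GiB free\n", id, float64(kb)/(1<<20))
	}
}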

dee@social.treehouse.. replied 14 Dec 2023 12:01 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/w5Q38VXhg5CWZh24X1

@benjojo oh I agree... I still serve 250k people per month on forums from $200 per month of hardware.

personally, I'm very much in the self-host-everything camp. Just run the boxes, it's really not that hard, and they really don't fail as often as one fears.

but for work, we use crazy amounts of it and it does help. I would never want to run object storage myself given the way that we punish it.

benjojo replied 14 Dec 2023 10:58 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/BL4gN7SQV73WZhS3xy

I also loved watching the NUMA stats on the machine slowly drop as the deployment rolled out. Intel make some nice (and easy to compile) Prometheus exporters that export CPU/RAM metrics, very useful for performance tuning your application (and seeing what is hurting).

https://github.com/intel/pcm/

I have EPYC machines in the fleet too, but annoyingly their tools have been harder to get working. Luckily none of those machines are multi-socket.

A Grafana-style graph showing QPI traffic between two sockets dropping from 4 GByte/s to 400 MByte/s