Amazing how much of a difference NUMA awareness makes in code. I saved ~10% over the fleet by making memory allocation and task scheduling make more sense for bgp.tools's multi socket machines. It turns out, when you are not stalled on trying to fetch data from RAM that is on another socket, your stuff is a lot faster! I then saved another 10% by clearing up some internal RPC traffic. Good week!
dee@social.treehouse..
replied 14 Dec 2023 11:16 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/BL4gN7SQV73WZhS3xy
benjojo
replied 14 Dec 2023 11:38 +0000
in reply to: https://social.treehouse.systems/users/dee/statuses/111578486265430864
@dee So using https://github.com/benjojo/numa I made my bgpd's:
(A) Detect the amount of CPU sockets
(B) Randomly pick one
(C) Set the scheduling mask to use that CPU socket's threads/cores only
(D) Set the memory policy (MPOL_LOCAL) to only allocate RAM from the socket it's scheduled on
(E) Re-exec the whole program. You need to do this in Go because set_mempolicy only works on the calling thread; if you can run it on the root thread (hint: in Go this is hard) you can have it apply to the whole application, but just re-exec'ing yourself is easier.
It is worth pointing out this only makes sense for memory heavy workloads (all that bgp.tools really does is memory access), and in places where NUMA is actually exposed to you, so in cloud environments this is not useful. Given that most people are in cloud/IaaS environments, this is not that applicable to most people anymore (other than the cheap bastards like me!)
I'm 98% sure this saves CPU "%" because while a CPU thread is stuck waiting on the other socket's memory access, it can't do anything else, so it's not like the threads are sitting doing stuff in this case, they are just blocked. So what I've really done here is reduce memory access latency.
The only downside is that I have to be a little more careful with memory usage now, since I can now "OOM" on a single socket:

~# numastat -m
                   Node 0          Node 1           Total
          --------------- --------------- ---------------
MemTotal        256568.49       258038.88       514607.36
MemFree          23292.68          464.24        23756.92
MemUsed         233275.81       257574.63       490850.45

But I like my current odds.
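In Go, the pin-and-re-exec dance in steps (A) to (E) looks roughly like the sketch below. This is a minimal illustration rather than the actual bgp.tools code: it reads node topology straight from sysfs and makes the syscalls via golang.org/x/sys/unix instead of going through github.com/benjojo/numa (whose API isn't shown in the thread), and the NUMA_PINNED environment variable is a made-up marker so the process only re-execs once.

package main

import (
	"fmt"
	"math/rand"
	"os"
	"path/filepath"
	"runtime"
	"strconv"
	"strings"
	"syscall"
	"time"

	"golang.org/x/sys/unix"
)

// Hypothetical marker: set after pinning so the re-exec'd process skips this step.
const pinnedEnvVar = "NUMA_PINNED"

func main() {
	if os.Getenv(pinnedEnvVar) == "" {
		if err := pinToRandomNode(); err != nil {
			fmt.Fprintln(os.Stderr, "NUMA pinning failed, continuing unpinned:", err)
		}
	}
	// ... normal daemon start-up would continue here ...
}

func pinToRandomNode() error {
	// Affinity and memory policy are per-thread, so stay on one OS thread until we re-exec.
	runtime.LockOSThread()

	// (A) enumerate NUMA nodes via sysfs, (B) pick one at random.
	nodes, err := filepath.Glob("/sys/devices/system/node/node[0-9]*")
	if err != nil || len(nodes) == 0 {
		return fmt.Errorf("no NUMA nodes found: %v", err)
	}
	rng := rand.New(rand.NewSource(time.Now().UnixNano()))
	node := nodes[rng.Intn(len(nodes))]

	// (C) restrict scheduling to that node's CPUs only.
	cpus, err := parseCPUList(filepath.Join(node, "cpulist"))
	if err != nil {
		return err
	}
	var set unix.CPUSet
	for _, c := range cpus {
		set.Set(c)
	}
	if err := unix.SchedSetaffinity(0, &set); err != nil {
		return fmt.Errorf("sched_setaffinity: %w", err)
	}

	// (D) MPOL_LOCAL (4 in numaif.h): allocate RAM from the node we are running on.
	// set_mempolicy only affects the calling thread, hence step (E).
	const mpolLocal = 4
	if _, _, errno := unix.Syscall(unix.SYS_SET_MEMPOLICY, mpolLocal, 0, 0); errno != 0 {
		return fmt.Errorf("set_mempolicy: %v", errno)
	}

	// (E) re-exec so the whole new process image inherits both settings.
	env := append(os.Environ(), pinnedEnvVar+"=1")
	return syscall.Exec("/proc/self/exe", os.Args, env)
}

// parseCPUList turns a sysfs cpulist such as "0-7,16-23" into CPU numbers.
func parseCPUList(path string) ([]int, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var cpus []int
	for _, part := range strings.Split(strings.TrimSpace(string(raw)), ",") {
		lo, hi, isRange := strings.Cut(part, "-")
		if !isRange {
			hi = lo
		}
		a, err := strconv.Atoi(lo)
		if err != nil {
			return nil, err
		}
		b, err := strconv.Atoi(hi)
		if err != nil {
			return nil, err
		}
		for c := a; c <= b; c++ {
			cpus = append(cpus, c)
		}
	}
	return cpus, nil
}

The re-exec in (E) works because both the affinity mask and the thread's memory policy survive execve, and every thread the new process creates inherits them from the initial thread.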
dee@social.treehouse..
replied 14 Dec 2023 11:44 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/3CLJ32gTl2XN559k4T
@benjojo we have a healthy 7-figure bill from just 1 of our 3 cloud providers... it's all CPU and RAM, and the apps we have that utilise most of this are memory heavy. so I appreciate something like this that will nerd-snipe one of my engineers nicely
benjojo
replied 14 Dec 2023 11:54 +0000
in reply to: https://social.treehouse.systems/users/dee/statuses/111578597771301251
@dee Play IaaS games, get IaaS prizes :) The last time I priced bgp.tools to run on cloud IaaS it was like 12K GBP a month. I've never spent 12K GBP in hardware capex, and I spend less than 10% of that in opex every month
dee@social.treehouse..
replied 14 Dec 2023 12:01 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/w5Q38VXhg5CWZh24X1
@benjojo oh I agree... I still serve 250k people per month on forums from $200 per month of hardware. personally, I'm very "self-host everything", just run the boxes; it's really not that hard and they really don't fail as often as one fears. but for work, we use crazy amounts and it does help. I would never want to run object storage for the way that we punish it.
dee@social.treehouse..
replied 14 Dec 2023 12:02 +0000
in reply to: https://social.treehouse.systems/users/dee/statuses/111578662864435062
@benjojo I should caveat that I do use Linode as a reverse proxy cache though... I really don't want the whole world talking to my self-hosted tyre fire
dee@social.treehouse..
replied 14 Dec 2023 11:47 +0000
in reply to: https://social.treehouse.systems/users/dee/statuses/111578597771301251
@benjojo 😢 at the "not applicable to Cloud providers". but then... OSS improvements are still beneficial 😁
jamesog@mastodon.soc..
replied 14 Dec 2023 12:14 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/BL4gN7SQV73WZhS3xy
benjojo
replied 14 Dec 2023 12:24 +0000
in reply to: https://mastodon.social/users/jamesog/statuses/111578713825507922
@jamesog Can't beat an RRD graph! They are written to a tmpfs and every 12 hours sync'd to disk. That way you get the upsides of RRD (long haul downsampling) with little downside (disk destruction). I do have Prometheus too, but they are different use cases.
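As a rough illustration of that tmpfs-to-disk arrangement (not the actual setup: the /dev/shm/rrd and /var/lib/rrd paths and the in-process loop are assumptions, and a cron job running rsync would do the same job), a periodic sync could look like:

package main

import (
	"log"
	"os"
	"path/filepath"
	"time"
)

// Assumed paths: RRDs are updated on a tmpfs, and a durable copy is refreshed
// on real disk every 12 hours.
const (
	tmpfsDir = "/dev/shm/rrd"
	diskDir  = "/var/lib/rrd"
)

func main() {
	for {
		if err := syncRRDs(); err != nil {
			log.Println("rrd sync:", err)
		}
		time.Sleep(12 * time.Hour)
	}
}

func syncRRDs() error {
	paths, err := filepath.Glob(filepath.Join(tmpfsDir, "*.rrd"))
	if err != nil {
		return err
	}
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			return err
		}
		// Write to a temp file and rename so a crash mid-copy never leaves a
		// truncated RRD on disk.
		dst := filepath.Join(diskDir, filepath.Base(p))
		tmp := dst + ".tmp"
		if err := os.WriteFile(tmp, data, 0o644); err != nil {
			return err
		}
		if err := os.Rename(tmp, dst); err != nil {
			return err
		}
	}
	return nil
}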
jamesog@mastodon.soc..
replied 14 Dec 2023 12:40 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/49Cbvv5JF418Z7Q5PF
@benjojo 👌🏻 Having had to use Datadog for the last ~3 years I'm too wary of automatic downsampling now. Personally I'd rather have a recording rule or something that's under my control. That said, I can't remember how RRD does its downsampling. I can only presume it's less batshit than the way Datadog does it.
benjojo
replied 14 Dec 2023 14:12 +0000
in reply to: https://mastodon.social/users/jamesog/statuses/111578815393474501
@jamesog Most things are less insane than Datadog. RRD just has different "arrays" for different sample ranges at different intervals. It's retro, but at least you know what you are getting
tmcfarlane@toot.comm..
replied 14 Dec 2023 13:24 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/49Cbvv5JF418Z7Q5PF
benjojo
replied 14 Dec 2023 10:58 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/BL4gN7SQV73WZhS3xy
I also loved watching the NUMA stats on the machine slowly drop as the deployment rolled out. Intel make some nice (and easy to compile) Prometheus exporters that export some CPU/RAM metrics, very useful for performance tuning your application (and seeing what is hurting). I have EPYC machines in the fleet too, but their tools have annoyingly been harder to get working. Luckily none of those machines are multi socket.
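One cheap way to keep an eye on NUMA behaviour without any extra exporters is the kernel's own per-node counters in sysfs. Note this is a sketch adjacent to the above rather than the setup described: these counters track where allocations landed (numa_hit, numa_miss, other_node and friends), not the remote-access metrics Intel's tooling exposes, and they only ever grow, so it is their rate of growth you would watch fall after pinning.

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Dumps the kernel's per-node NUMA allocation counters from
// /sys/devices/system/node/node*/numastat.
func main() {
	nodes, err := filepath.Glob("/sys/devices/system/node/node[0-9]*")
	if err != nil || len(nodes) == 0 {
		fmt.Fprintln(os.Stderr, "no NUMA nodes found:", err)
		os.Exit(1)
	}
	for _, node := range nodes {
		raw, err := os.ReadFile(filepath.Join(node, "numastat"))
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("%s:\n%s\n", filepath.Base(node), raw)
	}
}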
iximeow@haunted.comp..
replied 14 Dec 2023 10:57 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/BL4gN7SQV73WZhS3xy
tokenrove@recurse.so..
replied 14 Dec 2023 12:40 +0000
in reply to: https://benjojo.co.uk/u/benjojo/h/BL4gN7SQV73WZhS3xy
@benjojo I find people are always surprised what a huge deal NUMA awareness is for performance. Have you tried anything like `perf c2c` to find cacheline contention? I wish I was working on this kind of problem right now, it's always fun.