Comment by jbryu

6 months ago

Big thanks to everyone who commented so far. I wasn't able to reply to everyone (busy trying to fix the issue!), but I'm grateful for everyone's insights.

I ended up figuring out a fix, but it's a little embarrassing... Optimizing certain parts of socket.io helped a little (e.g. installing bufferutil: https://www.npmjs.com/package/bufferutil), but the biggest performance gain I found was actually going from 2 node.js containers on a single server to just 1! To be exact, I went from ~500 concurrent players on a single server to ~3000+. I feel silly, because had I been load-testing with 1 container from the start, I would've clearly seen the performance loss when scaling up to 2 containers. Instead I went on a wild goose chase trying to fix things that had nothing to do with the real issue[0].
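For anyone curious about the bufferutil part: it's an optional native addon that ws (the WebSocket engine socket.io uses) picks up automatically when it's installed, so there's no code change needed beyond `npm install bufferutil`. A minimal sketch to sanity-check whether the optional speedup addons are actually resolvable in your deployment (the helper name here is mine, not from any library):

```javascript
// Check whether ws's optional native speedup addons are installed.
// ws detects and uses them automatically at require time; if they're
// missing it silently falls back to slower pure-JS implementations.
function hasOptionalDep(name) {
  try {
    require.resolve(name);
    return true;
  } catch {
    return false;
  }
}

console.log('bufferutil installed:', hasOptionalDep('bufferutil'));
console.log('utf-8-validate installed:', hasOptionalDep('utf-8-validate'));
```

Running this inside the same container as the game server is the important part, since a dependency present on your dev machine can still be missing from the production image.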

In the end it seems like the bottleneck was indeed happening at the NIC/OS layer rather than the application layer. Apparently the NIC/OS prefers dealing with a single process screaming `n` packets at it over `x` processes each screaming `n/x` packets. In fact, it seems like the bigger `x` is, the worse performance degrades. Perhaps it's something to do with context switching, but I'm not 100% sure. Unfortunately, given my lack of infra/networking knowledge, this wasn't intuitive to me at all - it didn't occur to me that scaling down could actually improve performance!

Overall a frustrating but educational experience. Again, thanks to everyone who helped along the way!

TLDR: premature optimization is the root of all evil

[0] Admittedly, AI let me down pretty badly here. So far I've found AI to be an incredible learning and scaffolding tool, but most of my LLM experiences have been in domains I feel comfortable in. This time around, it was pretty sobering to realize that I had been effectively punked by AI multiple times over. The hallucination trap is very real when working in domains outside your comfort zone, and I think I would've been able to debug more effectively had I relied more on hard metrics.