Ask HN: Why does my Node.js multiplayer game lag at 500 players with low CPU?

7 months ago

I’m hosting a turn-based multiplayer browser game on a single Hetzner CCX23 x86 cloud server (4 vCPU, 16GB RAM, 80GB disk). The backend is built with Node.js and Socket.IO and is run via Docker Swarm. I also use Traefik for load balancing.

Matchmaking uses a round-robin sharding approach: each room is always handled by the same backend instance, letting me keep game state in memory and scale horizontally without Redis.

Here’s the issue: At ~500 concurrent players across ~60 rooms (max 8 players/room), I see low CPU usage but high event loop lag. One feature in my game is typing during a player's turn - each throttled keystroke is broadcast to the other players in real-time. If I remove this logic, I can handle 1000+ players without issue.

Scaling out backend instances on my single server doesn't help. I expected less load per backend instance to help, but I still hit the same limit around 500 players. This suggests to me that the bottleneck isn’t CPU or app logic, but something deeper in the stack. But I’m not sure what.

Some server metrics at 500 players:

- CPU: 25% per core (according to htop)

- PPS: ~3000 in / ~3000 out

- Bandwidth: ~100KBps in / ~800KBps out

Could 500 concurrent players just be a realistic upper bound for my single-server setup, or is something misconfigured? I know scaling out with new servers should fix the issue, but I wanted to check in with the internet first to see if I'm missing anything. I’m new to multiplayer architecture so any insight would be greatly appreciated.

30 comments

hnroo99

toast0 7 months ago

What are your processes waiting on? in Linux top, show the WCHAN field. In FreeBSD top, look at the STATE field. Ideally, your service processes are waiting on i/o (epoll, select, kqread, etc) or you're CPU limited.

Is there any cross-room communication? Can you spawn a process per room? Scaling limited at 25% CPU on a 4 vcpu node strongly suggests a locked section limiting you to effectively single threaded performance. Multiple processes serving rooms should bypass that if you can't find it otherwise, but maybe there's something wrong in your load balancing etc.

Personally, I'd rather run with fewer layers, because then you don't have to debug the layers when you have perf issues. Do matchmaking wherever with whatever layers, and let your room servers run in the host os, no containers. But nobody likes my ideas. :P

Edit to add: your network load is tiny. This is almost certainly something with your software, or how you've setup your layers. Unless those vCPUs are ancient, you should be able to push a whole lot more packets.

hnroo99 7 months ago

So when running `top` WCHAN shows `ep_poll` most of the time and sometimes `-`. Even when the game starts lagging this pattern stays pretty consistent.

There is no cross-room communication. I could spawn a process per room but I was trying to address this issue with my current Docker setup where I have multiple `game` containers that run a single node.js process and each process can host multiple rooms.

Not having to use Docker sounds simpler but that's where I'm at atm haha.

I agree that the network load feels very small. Maybe it's a socket.io related issue where when many broadcasts are being fired at once, then a shared I/O step gets bottlenecked?

Here's my actual typing broadcast code, I was originally broadcasting from the socket event callback itself but I found performance improved slightly by batching broadcasts per player in a setInterval loop (also note that only 1 player in a given room can be typing at once, so batching broadcasts per room shouldn't address the bottleneck).

  /**
   * Used to handle very frequent typing events more gracefully to avoid overloading CPU
   */
  const TypingUsersMap = new Map<
    ConnectionId,
    {
      socketId: string | null; // doesn't exist for bots
      roomId: PublicRoomId;
      userId: UserId;
      currentInput: string;
    }
  >();

  type ConnectionId = `${UserId}:${PublicRoomId}`;

  // ! this should be same as client throttle interval
  const TYPING_BROADCAST_INTERVAL = 200;

  export let typingBroadcastInterval: NodeJS.Timeout | undefined = undefined;
  export const startTypingBroadcastJob = () => {
    typingBroadcastInterval = setInterval(() => {
      const freshTypingUsersMap = new Map(TypingUsersMap);
      TypingUsersMap.clear();

      if (freshTypingUsersMap.size === 0) return; // Nothing to do

      // Go through each user that has a pending update
      for (const [_connectionId, data] of freshTypingUsersMap.entries()) {
        const socket = data.socketId
          ? io.sockets.sockets.get(data.socketId)
          : undefined;

        // Use the data we stored to perform the broadcast
        if (socket) {
          // emit to other players
          socket
            .to(data.roomId)
            .volatile.emit(
              SOCKET_EVENT_NAMES.USER_TYPING_RES,
              data.userId,
              data.currentInput
            );
        } else {
          // bots emit to everyone
          io.to(data.roomId).volatile.emit(
            SOCKET_EVENT_NAMES.USER_TYPING_RES,
            data.userId,
            data.currentInput
          );
        }
      }
    }, TYPING_BROADCAST_INTERVAL);
  };

  export const stopTypingBroadcastJob = () => {
    if (typingBroadcastInterval) {
      clearInterval(typingBroadcastInterval);
      typingBroadcastInterval = undefined;
    }
  };

  // this is called from the USER_TYPING socket event callback. so effectively every throttled keystroke by the user gets queued.
  export const queueTypingEvent = ({
    socketId,
    roomId,
    userId,
    currentInput,
  }: {
    socketId: string | null;
    roomId: PublicRoomId;
    userId: UserId;
    currentInput: string;
  }) => {
    const connectionId: ConnectionId = `${userId}:${roomId}`;
    TypingUsersMap.set(connectionId, {
      socketId,
      roomId,
      userId,
      currentInput,
    });
  };

snowman_lars 7 months ago

You say you use Docker via Docker Swarm, so maybe this is about the Docker network setup?

I haven't tried Swarm, but to some degree I assume it can give the same effects as Docker Compose with several services. I'm also less sure of the effects if you never have communication between containers, but I think there may still be the same or a similar issue.

What I experienced when doing not exactly a load test, but just processing a large dataset through multiple docker containers started from a docker compose config, was that the default docker network loopback (docker0) was saturated. After creating a docker network that the various nodes were configured to use, things got a lot better.

So this is the question for you, do all the containers in the swarm talk via docker0? If yes, read up on docker networks in relation to swarm in particular.

octo888 7 months ago

3000 pps / 6 Mbps is pretty much nothing for that server. I wouldn't change random network sysctl options.

> This suggests to me that the bottleneck isn’t CPU or app logic, but something deeper in the stack

Just a word of caution - I have seen plenty of people speed towards eg "it must be a bug in the kernel" when 98% of the time it is the app or some config.

hnroo99 7 months ago

Yeah changing the sysctl options was a shot in the dark... I really hope it's my app code. But the fact that the same bottleneck occurs even when I add more containers which decreases the load per container confuses me. I mentioned this in another comment but I wonder if socket.io broadcast calls share the same I/O resource or something. Maybe a lock?

ycombinatrix 7 months ago

Are you buffering your output? Doing one syscall (write) for each client in a server for each keystroke is a significant amount of IO overhead and context switching.

Try buffering the outgoing keystrokes to each client. Then, someone typing "hello world" in a server of 50 people will use 50 syscalls instead of 550 syscalls.

Think Nagle's algorithm.
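
A minimal sketch of the buffering idea above, assuming a per-room outbox that keeps only the latest pending input per typing user and is drained on a timer. `RoomOutbox` and `Update` are hypothetical names for illustration, not part of Socket.IO or the OP's code; the OP's `TypingUsersMap` has roughly this shape already.

```typescript
// Hypothetical types and names; none of this is from the original code.
type Update = { userId: string; currentInput: string };

class RoomOutbox {
  // Latest pending update per typing user, so stale keystrokes are overwritten.
  private pending = new Map<string, Update>();

  queue(update: Update): void {
    this.pending.set(update.userId, update);
  }

  // Drain everything queued since the last flush into a single payload.
  // Returns null when there is nothing to send, so the caller can skip the emit.
  flush(): Update[] | null {
    if (this.pending.size === 0) return null;
    const batch = Array.from(this.pending.values());
    this.pending.clear();
    return batch;
  }
}

// "hello world" typed one character at a time between two flushes.
const outbox = new RoomOutbox();
let typed = "";
for (const ch of "hello world") {
  typed += ch;
  outbox.queue({ userId: "u1", currentInput: typed });
}
const batch = outbox.flush(); // one update carrying the full string, not eleven
console.log(batch);
```

With this shape, the number of sends per flush depends on the number of recipients, not on how many keystrokes arrived since the previous flush.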

hnroo99 7 months ago
I'm somewhat buffering right now - every time the current turn player types I buffer their input on the backend, and I have a job set up that broadcasts typing events every ~200ms using this buffer.
I could increase this interval, but I'd like to keep it as short as I can afford, to keep that realtime feel (i.e. other players can see what the current turn player is typing).
- ycombinatrix 7 months ago
  
  Are your sockets in blocking or non-blocking mode?
  If you are sequentially sending updates to each client in blocking mode, one single slow client will block execution and slow everything down for everyone. Basically forcing all clients to run as slowly as the slowest client.
  In non-blocking mode, only the slow client will suffer. You will need to buffer the output individually per client because each one will consume the data at different rates.
  Are you already using libuv?
  You should also consider making typing updates unreliable or at least less reliable since they presumably aren't critical for gameplay e.g. dropping typing updates to a client if their send buffer is full will drastically improve performance on slow connections.
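
The "drop typing updates when the client is behind" idea can be sketched as a bounded per-client queue. Everything here (`ClientSendQueue`, the limit of 2, the message strings) is a hypothetical illustration and not a Socket.IO API. The OP's code already uses `.volatile.emit`, which Socket.IO documents as events that may be dropped when the underlying connection is not ready, so this mainly shows the underlying idea.

```typescript
// Hypothetical per-client queue; names and the limit are assumptions for illustration.
class ClientSendQueue<T> {
  private queue: T[] = [];
  private readonly maxPending: number;
  dropped = 0;

  constructor(maxPending: number) {
    this.maxPending = maxPending;
  }

  // Reliable messages (e.g. turn results) are always queued.
  pushReliable(msg: T): void {
    this.queue.push(msg);
  }

  // Unreliable messages (e.g. typing previews) are dropped when the client is behind.
  pushUnreliable(msg: T): boolean {
    if (this.queue.length >= this.maxPending) {
      this.dropped++;
      return false;
    }
    this.queue.push(msg);
    return true;
  }

  // Called when the transport can accept more data for this client.
  drain(): T[] {
    const out = this.queue;
    this.queue = [];
    return out;
  }
}

const slowClient = new ClientSendQueue<string>(2);
slowClient.pushUnreliable("h");
slowClient.pushUnreliable("he");
const accepted = slowClient.pushUnreliable("hel"); // queue is full, so this one is dropped
slowClient.pushReliable("TURN_END");               // reliable messages are still queued
const sent = slowClient.drain();
console.log(accepted, slowClient.dropped, sent);
```

A slow client then only loses typing previews; it never holds up the other players in the room.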

jjice 7 months ago

Node gives access to event loop utilization stats that may be of value.

    import { performance } from 'node:perf_hooks'
    performance.eventLoopUtilization() // returns { idle, active, utilization }

See the docs for how it works and how to derive some value from it.
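
As a rough sketch of how those stats could be sampled over time: `performance.eventLoopUtilization(a, b)` returns the delta between two earlier readings, so a small sampler can report utilization per interval. `createEluSampler`, `EluSample`, and the 1-second interval are assumptions made for this example. Note that on the main thread the values are 0 until the event loop has started, so the first reading taken during startup is not meaningful.

```typescript
import { performance } from "node:perf_hooks";

// Hypothetical shape for one reading; not a Node.js type.
type EluSample = { utilization: number; activeMs: number; idleMs: number };

// Returns a function that reports utilization since its previous call.
function createEluSampler(): () => EluSample {
  let last = performance.eventLoopUtilization();
  return () => {
    const now = performance.eventLoopUtilization();
    // Two-argument form: delta between the newer and the older reading.
    const delta = performance.eventLoopUtilization(now, last);
    last = now;
    return {
      utilization: delta.utilization, // fraction of time the loop was busy (0 to 1)
      activeMs: delta.active,
      idleMs: delta.idle,
    };
  };
}

const sampleElu = createEluSampler();
const timer = setInterval(() => {
  const s = sampleElu();
  console.log(`event loop utilization: ${(s.utilization * 100).toFixed(1)}%`);
}, 1000);
timer.unref(); // do not keep the process alive just for metrics
```

A process that shows low CPU in `htop` but utilization close to 1 here is spending its time inside the event loop, which points at application or library work instead of the network.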

We had a similar situation where our application was heavily IO bound (very little CPU), which caused some initial confusion about the slowdown. We ended up adding better metrics surrounding IO and the event loop, which led us to batch-dequeue our jobs in a more reasonable way that made the entire application much more effective.

If you crack the nut on this issue, I'd love to see an update comment detailing what the issue and solution was!

hnroo99 7 months ago

Nut has been cracked! https://news.ycombinator.com/item?id=44436679
And yeah, I've been using prometheus' `collectDefaultMetrics()` function so far to see event loop metrics, but it looks like node:perf_hooks might provide a more detailed output... thanks for sharing

austin-cheney 7 months ago

Don’t worry about the number of users. Instead consider:

* the number of total sockets as I suspect there could be multiple sockets per user.

* investigate what socket.io does to serialize messages both on and off the wire. I wrote my own WebSocket library for Node and noticed the cost to process messages on the receiving end is about 11x greater than on the sending end. Normally that doesn’t matter until you push it past a critical point. At the critical point everything begins to super crawl because the message quantity per interval exceeds the garbage collection cycle and everything backs up. In my case this scenario didn’t realize until 180000 or 480000 messages per second depending upon the hardware. The critical difference from the hardware side was only about memory speed and cpu availability was largely irrelevant.

* also look at what socket.io does, if at all, to queue messages at each side of the socket. Messages queued both on and off the wire will be a factor if not properly managed or if absent.
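
For the first point, a minimal sketch of checking sockets against users, assuming you can list your live connections as `{ socketId, userId }` pairs. The `Connection` type and the sample data are hypothetical; in a Socket.IO app the user id would come from your own auth or handshake data.

```typescript
// Hypothetical connection record for illustration.
type Connection = { socketId: string; userId: string };

// Count how many live sockets each user currently holds.
function socketsPerUser(connections: Connection[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const c of connections) {
    counts.set(c.userId, (counts.get(c.userId) ?? 0) + 1);
  }
  return counts;
}

const connections: Connection[] = [
  { socketId: "a1", userId: "alice" },
  { socketId: "a2", userId: "alice" }, // e.g. a second tab, or a reconnect that was never cleaned up
  { socketId: "b1", userId: "bob" },
];

const counts = socketsPerUser(connections);
const duplicates = Array.from(counts.entries()).filter(([, n]) => n > 1);
console.log(`${connections.length} sockets for ${counts.size} users`, duplicates);
```

If the socket count is well above the user count, every broadcast fans out to more recipients than the player count suggests.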

ivape 7 months ago

And consider a socket.io alternative that emphasizes performance: https://github.com/uNetworking/uWebSockets

hnroo99 7 months ago

Big thanks to everyone who commented so far, I wasn't able to reply to everyone (busy trying to fix the issue!) but grateful for everyone's insights.

I ended up figuring out a fix but it's a little embarrassing... Optimizing certain parts of socket.io helped a little (eg installing bufferutil: https://www.npmjs.com/package/bufferutil), but the biggest performance gain I found was actually going from 2 node.js containers on a single server to just 1! To be exact I was able to go from ~500 concurrent players on a single server to ~3000+. I feel silly because had I been load-testing with 1 container from the start, I would've clearly seen the performance loss when scaling up to 2 containers. Instead I went on a wild goose chase trying to fix things that had nothing to do with the real issue[0].

In the end it seems like the bottleneck was indeed happening at the NIC/OS layer rather than the application layer. Apparently the NIC/OS prefers to deal with a single process screaming `n` packets at it rather than `x` processes screaming `n/x` packets. In fact it seems like the bigger `x` is, the worse performance degrades. Perhaps something to do with context switching, but I'm not 100% sure. Unfortunately given my lacking infra/networking knowledge this wasn't intuitive to me at all - it didn't occur to me that scaling down could actually improve performance!

Overall a frustrating but educational experience. Again, thanks to everyone who helped along the way!

TLDR: premature optimization is the root of all evil

[0] Admittedly AI let me down pretty bad here. So far I've found AI to be an incredible learning and scaffolding tool, but most of my LLM experiences have been in domains I feel comfortable in. This time around though, it was pretty sobering to realize that I had been effectively punked by AI multiple times over. The hallucination trap is very real when working in domains outside your comfort zone, and I think I would've been able to debug more effectively had I relied more on hard metrics.

bravesoul2 7 months ago

Try cluster mode? I.e. use all cores.

Anyway please follow up or blog when you solve it. Sounds interesting.

hnroo99 7 months ago

Unintuitively, fewer cores ended up being the fix... I did a small writeup here: https://news.ycombinator.com/item?id=44436679

nik736 7 months ago

Please also note that Hetzner is not providing CPU steal information inside of VMs. So there could be 75% steal and you wouldn't notice! It's unlikely for CCX instances, but it happened for me a lot with regular instances.

aristofun 7 months ago

Any monitoring/logging in your nodejs code?

I noticed, for example, adding a newrelic agent drops http throughput almost 10x.

pvg 7 months ago

It sounds like you want to coalesce the outbound updates otherwise everyone typing is accidentally quadratic.

hnroo99 7 months ago
I thought this might've been the issue too, but because the game is turn-based there should only ever be 1 person typing at once (in a given room).
- pvg 7 months ago
  
  60 * 7 is not all that great either if you get cascading and clumping as people type at the same time- coalescing the outbound updates still seems like a good idea and since the game is turn based you know it's not really going to affect gameplay. You've basically made yourself a first person shooter networking problem for a game that's slower than WoW. That feels like overkill in terms of self-imposed obstacles.
  
- brudgers 7 months ago
  
  there should only ever be 1 person typing at once (in a given room)
  Have you verified that is the case?
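
To put the "60 * 7" figure above in numbers, here is a back-of-the-envelope calculation using the values quoted in the thread (about 60 rooms, up to 8 players per room, a 200 ms broadcast interval). It assumes the worst case where every room has someone typing during every interval, which is an assumption and not a measurement.

```typescript
// Figures from the thread; "every room has an active typer" is a worst-case assumption.
const rooms = 60;
const playersPerRoom = 8;
const broadcastIntervalMs = 200;

const recipientsPerBroadcast = playersPerRoom - 1;      // everyone except the typer
const broadcastsPerSecond = 1000 / broadcastIntervalMs; // 5 per room
const outboundPerSecond = rooms * recipientsPerBroadcast * broadcastsPerSecond;

console.log(outboundPerSecond); // 2100
```

That is about 2100 outbound typing messages per second, the same order of magnitude as the ~3000 packets/s out reported in the original post, although messages and packets do not map one to one.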
  

moomoo11 7 months ago

Are you awaiting anywhere, such that you might be better off doing fire n forget instead?

bravesoul2 7 months ago

This might help with the keystrokes.

bigyabai 7 months ago

25% CPU usage could indicate that your I/O throughput is bottlenecked.

cbenskxk 7 months ago

are you using uwebsockets.js?