Comment by sacrelege

10 hours ago

Thanks for putting N-Day-Bench together - really interesting benchmark design and results.

I'd love to see how the model we serve, Qwen3.5 122B A10B, stacks up against the rest on this benchmark. AI Router Switzerland (aiRouter.ch) can sponsor free API access for about a month if that helps for adding it to the evaluation set.

Nice. I've been thinking of doing something similar in our local jurisdiction (Australia).

Are you able to share (or point me toward) any high-level details: (key hardware, hosting stack, high-level economics, key challenges)?

I'd love to offer to buy you a coffee but I won't be in Switzerland any time soon.

  • Ah thanks, I love coffee

    At a high level, it's a mix of our own GPU capacity plus the ability to burst into external nodes when things get busy. Right now we're running a bunch of RTX PRO 6000s, which basically forces you into workstation/server boards since you need full x16 PCIe 5.0 lanes per card.

    We operate a small private datacenter, which gives us some flexibility in how we deploy and scale hardware. On the software side, we're currently using LiteLLM as a load balancer in front of the inference servers, though I'm in the process of replacing that with a custom Rust-based implementation.

    We've only been online since the beginning of this month, so I can't really say much about the economics yet, but we've had some really nice feedback from early customers so far. :)
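    In case anyone's wondering what the core of such a balancer looks like: at its simplest it's just a round-robin picker over backend URLs. A minimal sketch below (the struct and node URLs are illustrative, not our actual implementation, which also handles health checks and weighting):

    ```rust
    use std::sync::atomic::{AtomicUsize, Ordering};

    /// Minimal lock-free round-robin backend picker.
    struct RoundRobin {
        backends: Vec<String>,
        next: AtomicUsize,
    }

    impl RoundRobin {
        fn new(backends: Vec<String>) -> Self {
            Self { backends, next: AtomicUsize::new(0) }
        }

        /// Return the next backend URL, cycling through the list.
        fn pick(&self) -> &str {
            let i = self.next.fetch_add(1, Ordering::Relaxed) % self.backends.len();
            &self.backends[i]
        }
    }

    fn main() {
        // Hypothetical inference nodes behind the proxy.
        let rr = RoundRobin::new(vec![
            "http://gpu-node-1:8000".into(),
            "http://gpu-node-2:8000".into(),
        ]);
        for _ in 0..4 {
            println!("{}", rr.pick()); // alternates node-1, node-2, node-1, node-2
        }
    }
    ```

    The real thing then sits behind an HTTP front end and forwards OpenAI-compatible requests to whichever node it picks.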

Interesting. How fast is your service? Do you guarantee a certain number of tokens/s?

  • We typically observe throughput of around 100–110 toks/s; at larger context sizes this drops to 90–100 toks/s.

    While we don't guarantee a fixed toks/s rate, we scale by provisioning external GPU nodes during peak demand. These nodes run our own dockerized environment over a secure tunnel.

    Our goal is to ensure a consistent baseline performance of at least 60–80 toks/s, even under high load.