Comment by sacrelege
10 hours ago
Ah thanks, I love coffee
At a high level, it's a mix of our own GPU capacity plus the ability to burst into external nodes when things get busy. Right now we're running a bunch of RTX PRO 6000s, which basically forces you into workstation/server boards since you need full x16 PCIe 5.0 lanes per card.
We operate a small private datacenter, which gives us some flexibility in how we deploy and scale hardware. On the software side, we're currently using LiteLLM as a load balancer in front of the inference servers, though I'm in the process of replacing that with a custom Rust-based implementation.
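To make the setup a bit more concrete, here's a rough sketch of what the Rust replacement could look like at its simplest: a round-robin proxy that spreads incoming connections across the inference backends. This is my own illustrative sketch, not their actual implementation; the backend addresses and port are made up, it proxies at the TCP level (whereas LiteLLM routes at the HTTP/model level), and a real version would add health checks, weighted routing, and async I/O.

```rust
use std::net::{TcpListener, TcpStream};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

fn main() -> std::io::Result<()> {
    // Hypothetical backend inference servers; a real deployment would
    // load these from config and health-check them continuously.
    let backends = Arc::new(vec![
        "10.0.0.11:8000".to_string(),
        "10.0.0.12:8000".to_string(),
    ]);
    let next = Arc::new(AtomicUsize::new(0));

    let listener = TcpListener::bind("0.0.0.0:4000")?;
    for client in listener.incoming() {
        let client = client?;
        let backends = Arc::clone(&backends);
        let next = Arc::clone(&next);
        thread::spawn(move || {
            // Pick the next backend round-robin.
            let i = next.fetch_add(1, Ordering::Relaxed) % backends.len();
            if let Ok(upstream) = TcpStream::connect(&backends[i]) {
                proxy(client, upstream);
            }
        });
    }
    Ok(())
}

// Shuttle bytes in both directions until either side closes.
fn proxy(client: TcpStream, upstream: TcpStream) {
    let mut c_read = client.try_clone().expect("clone client");
    let mut u_write = upstream.try_clone().expect("clone upstream");
    let mut u_read = upstream;
    let mut c_write = client;

    let t = thread::spawn(move || {
        let _ = std::io::copy(&mut c_read, &mut u_write);
    });
    let _ = std::io::copy(&mut u_read, &mut c_write);
    let _ = t.join();
}
```

The appeal of rolling your own here is that the routing policy (which GPU node gets which request, queue-depth awareness, model affinity) can be tailored exactly to the hardware, rather than working through a general-purpose proxy's config.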
We've only been online since the beginning of this month, so I can't really say much about the economics yet, but we've had some really nice feedback from early customers so far. :)