Comment by sacrelege
10 hours ago
Ah thanks, I love coffee
At a high level, it's a mix of our own GPU capacity plus the ability to burst into external nodes when things get busy. Right now we're running a bunch of RTX PRO 6000s, which basically forces you into workstation/server boards since you need full x16 PCIe 5.0 lanes per card.
We operate a small private datacenter, which gives us some flexibility in how we deploy and scale hardware. On the software side, we're currently using LiteLLM as a load balancer in front of the inference servers, though I'm in the process of replacing that with a custom Rust-based implementation.
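To make the setup a bit more concrete, here's a rough sketch of what the Rust replacement could look like at its simplest: a round-robin proxy that spreads incoming connections across the inference backends. This is my own illustrative sketch, not their actual implementation; the backend addresses and port are made up, it proxies at the TCP level (whereas LiteLLM routes at the HTTP/model level), and a real version would add health checks, weighted routing, and async I/O.

```rust
use std::net::{TcpListener, TcpStream};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

fn main() -> std::io::Result<()> {
    // Hypothetical backend inference servers; a real deployment would
    // load these from config and health-check them continuously.
    let backends = Arc::new(vec![
        "10.0.0.11:8000".to_string(),
        "10.0.0.12:8000".to_string(),
    ]);
    let next = Arc::new(AtomicUsize::new(0));

    let listener = TcpListener::bind("0.0.0.0:4000")?;
    for client in listener.incoming() {
        let client = client?;
        let backends = Arc::clone(&backends);
        let next = Arc::clone(&next);
        thread::spawn(move || {
            // Pick the next backend round-robin.
            let i = next.fetch_add(1, Ordering::Relaxed) % backends.len();
            if let Ok(upstream) = TcpStream::connect(&backends[i]) {
                proxy(client, upstream);
            }
        });
    }
    Ok(())
}

// Shuttle bytes in both directions until either side closes.
fn proxy(client: TcpStream, upstream: TcpStream) {
    let mut c_read = client.try_clone().expect("clone client");
    let mut u_write = upstream.try_clone().expect("clone upstream");
    let mut u_read = upstream;
    let mut c_write = client;

    let t = thread::spawn(move || {
        let _ = std::io::copy(&mut c_read, &mut u_write);
    });
    let _ = std::io::copy(&mut u_read, &mut c_write);
    let _ = t.join();
}
```

The appeal of rolling your own here is that the routing policy (which GPU node gets which request, queue-depth awareness, model affinity) can be tailored exactly to the hardware, rather than working through a general-purpose proxy's config.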
We've only been online since the beginning of this month, so I can't really say much about the economics yet, but we've had some really nice feedback from early customers so far. :)