Comment by kstrauser
6 days ago
For real. Say it takes 1 machine 5 seconds to reply, and that a machine can only possibly form 1 reply at a time (which I doubt, but for argument).
If the requests were regularly spaced (they certainly won’t be, but for the sake of argument), then 1 machine could serve about 17,000 requests per day, or 120,000 per week. At that rate, you’d need roughly 5,800 machines to serve 700M requests. That’s a lot to me, but not to someone who owns a data center.
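For anyone who wants to check the arithmetic, here's the same back-of-envelope math as a few lines of Python, using the same assumptions (5 s per reply, one reply at a time, evenly spaced requests):

```python
# Back-of-envelope capacity math, same assumptions as above.
SECONDS_PER_REPLY = 5
SECONDS_PER_DAY = 24 * 60 * 60

replies_per_machine_per_day = SECONDS_PER_DAY // SECONDS_PER_REPLY  # 17,280
replies_per_machine_per_week = replies_per_machine_per_day * 7      # 120,960

weekly_requests = 700_000_000
# Ceiling division: you can't run a fraction of a machine.
machines_needed = -(-weekly_requests // replies_per_machine_per_week)

print(machines_needed)  # -> 5788, i.e. roughly 5,800 machines
```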
Yes, those 700M users will issue more than 1 query per week and they won’t be evenly spaced. However, I’d bet most of those queries will take well under 1 second to answer, and I’d also bet each machine can handle more than one at a time.
It’s a large problem, to be sure, but that seems tractable.
Yes. And batched inference is a thing: intelligent grouping/bin-packing and routing of requests. I expect a good amount of "secret sauce" is at this layer.
Here's an entry-level link I found quickly on Google, OP: https://medium.com/@wearegap/a-brief-introduction-to-optimiz...
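To make the batching idea concrete, here's a minimal toy sketch: collect incoming prompts into a batch up to some max size, then hand the whole batch to the model at once so one forward pass amortizes over many requests. The `Batcher` class and its methods are purely illustrative, not any real serving framework's API; real systems also flush on a deadline and do smarter bin-packing by sequence length.

```python
# Toy sketch of request batching for inference (illustrative names only).
from dataclasses import dataclass, field

@dataclass
class Batcher:
    max_batch: int = 8
    pending: list = field(default_factory=list)

    def submit(self, prompt):
        """Queue a prompt; return a full batch when one is ready, else None."""
        self.pending.append(prompt)
        if len(self.pending) >= self.max_batch:
            return self.flush()
        return None

    def flush(self):
        """Hand off everything queued; a real server would run one
        model forward pass over this whole batch."""
        batch, self.pending = self.pending, []
        return batch

b = Batcher(max_batch=3)
assert b.submit("q1") is None
assert b.submit("q2") is None
print(b.submit("q3"))  # -> ['q1', 'q2', 'q3']
```

The win is that a GPU processing 3 prompts in one batch takes far less than 3x the time of one prompt, which is a big part of why per-request cost estimates like the one above are pessimistic.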