Comment by electroly
8 hours ago
For AI inference you don't need to geographically distribute your data centers. Latency, throughput, and routes don't matter here. When it's 10 seconds for the first token and then a 1KB/sec streamed response, whatever is fine. You can serve Australia from the US and it'll barely matter. You can find a spot far outside populated areas with cheap power, available water, and friendly leadership, then put all of your data centers there. If you're worried about major disasters, you can pick a second city. You definitely don't need a data center in every continent.
You're not wrong about the rest, but no AI company would ever build a data center on every continent for this, even if they were prepared to build data centers at all. AI inference isn't like general-purpose hosting.
>Latency, throughput, and routes don't matter here. When it's 10 seconds for the first token and then a 1KB/sec streamed response, whatever is fine. You can serve Australia from the US and it'll barely matter.
This may be true for simpler cases where you just stream responses from a single LLM in some kind of no-brain chatbot. If the pipeline is a bit more complex (multiple calls to different models, not only LLMs but also embedding models, rerankers, agentic stuff, etc.), latencies quickly add up. It also depends on the UI/UX expectations.
Funny reading this, because the feature I developed can't go live for a few months in regions where we have to use Amazon Bedrock (for legal reasons), simply because Bedrock's latency is so poor that stakeholders aren't satisfied with the final speed (users aren't expected to wait 10-15 seconds in that part of the UI; it would be awkward). And a single round trip to AWS Ireland from Asia is already at least ~300ms; multiply that by several calls in a pipeline and it adds up to seconds, just for the round trips. So having only one region is not an option.
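The round-trip arithmetic can be sketched with back-of-the-envelope numbers. The 300 ms RTT is from the comment; the call list is a hypothetical RAG-style pipeline, not the actual one:

```python
# Back-of-the-envelope sketch: how per-call round trips compound in a
# sequential pipeline. Only the network overhead is counted here;
# model compute time would come on top of this.
RTT_MS = 300  # round trip Asia -> AWS Ireland, per the comment

# Hypothetical sequential pipeline stages (assumed, for illustration)
pipeline_calls = ["embed query", "rerank candidates", "draft answer", "refine answer"]

network_overhead_ms = RTT_MS * len(pipeline_calls)
print(f"{len(pipeline_calls)} sequential calls x {RTT_MS} ms RTT "
      f"= {network_overhead_ms} ms of pure network overhead")
# → 4 sequential calls x 300 ms RTT = 1200 ms of pure network overhead
```

Over a second of dead time before a single token of model work, which is why one distant region can sink a multi-call feature.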
Funnily enough, in one region we ended up buying our own GPUs and running the models ourselves. Response times there are about 3x faster on average for the same models than on Bedrock (and Bedrock often hangs for 20+ seconds for no reason, despite all the tricks AWS managers recommended, like cross-region inference and premium tiers). For me, it's been easier and less stressful to run LLMs/embedders/rerankers myself than to fight cloud providers' latencies :)
>then put all of your data centers there
>You definitely don't need a data center in every continent.
Not always possible, due to legal reasons. Many jurisdictions already have (or plan to introduce) strict data processing laws. Many B2B clients (and government clients too) require all data processing to stay in the country, or at least in the region (like the EU), or we simply lose the deals. So, for example, we're already required to use data centers on at least 4 continents, with just 2 more to go (if you don't count Antarctica :)
Sounds like you're betting that the performance users experience today will be the same as the performance they'll expect tomorrow. I wouldn't take that bet.
You can build geographically close ones tomorrow, once you start earning money today. US-EU latency is something like 100ms; AI can handle that just fine.
You mean that if you were Anthropic, you'd build the data centers on every continent? Can you explain your reasoning?
We're talking about billions of dollars of extra capex if you take the "let's build them everywhere" side of the bet instead of the "let's build them in the cheapest possible place" side. It seems to me you'd have to be really sure you need a data center somewhere uneconomical. If you build in the cheap place, it's a safe bet you'll always have enough latency-insensitive workloads to fill it; my side of the bet only goes wrong if we transition almost entirely to latency-sensitive workloads, which I doubt. The other side goes wrong if we don't see a dramatic uptick in latency-sensitive inference workloads. As another comment pointed out, voice agents are the one genuinely latency-sensitive cloud inference workload we have right now, and they do need low latency. Such workloads exist, but they're a slim percentage so far.
I believe I'm taking the safe bet that lets Anthropic make hay while the sun shines without risking a major misstep. Nothing stops them from using their own data centers for cheap slow "base load" while still using cloud partners for less common specialized needs. I just can't see why they would build the international data centers to reduce cloud partner costs on latency-sensitive workloads before those workloads actually show up in significant numbers.
latency absolutely matters? this is such a weird thing to say. for training sure, but customers absolutely want low latency
They want it, sure. Customers want everything if it's free, but this is about what they value with their money. In this thought experiment, you're Anthropic, not the customer; you're making the choice that's best for Anthropic. Will Anthropic lose customers because the latency is higher? No way. Customers want low cost and lots of usage more than they want low latency. In a cutthroat race to the bottom, there's no room to "give away" massively expensive freebies like a data center near every population center when the customer doesn't value those extras with actual money. It's the same reason we all tolerate the relatively slow batched token generation rate: the batching dramatically lowers the cost, and we need low-cost inference more than we want fast generation. If the cost goes up we'll actually leave, for real.
After the initial announcement of "fast mode" in Claude Code, did you ever hear about anyone using it for real? I didn't. Vanishingly few people are willing to pay extra for faster inference.
Remember that time-to-first-token is dominated by prompt processing, which adds orders of magnitude more latency than the network route does. An extra 200 milliseconds of network delay on a 5-10 second time-to-first-token isn't even noticeable; it's within normal TTFT jitter. It would be foolish to spend billions of dollars dropping data centers around the world to shave off the 200 milliseconds when that won't reduce the 5-10 seconds. Skip the exotic locales and put your data centers in Cheap Power Tax Haven County, USA. Perhaps run the numbers and see if Free Cooling City, Sweden is cheaper.
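The proportions here are easy to check with rough numbers (the 7.5 s figure is just the midpoint of the quoted 5-10 s TTFT range, an assumption for illustration):

```python
# Rough proportions: prompt processing dominates time-to-first-token,
# so a distant region's extra round trip barely registers.
ttft_s = 7.5           # assumed midpoint of the quoted 5-10 s TTFT range
extra_network_s = 0.2  # the extra 200 ms a faraway data center adds

share = extra_network_s / (ttft_s + extra_network_s)
print(f"Extra network delay is {share:.1%} of the total time-to-first-token")
# → Extra network delay is 2.6% of the total time-to-first-token
```

A few percent, comfortably inside the run-to-run jitter of prompt processing itself.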
They’re unwilling to pay for fast mode because of the current step-function price increase once you hit your quota. It’s a psychological effect. Most shops I know in the US currently paying $125/mo per seat for Claude would happily - HAPPILY - pay 2x, and begrudgingly pay 10x, that amount for the same service. If fast mode were priced 25% or 50% higher, they’d happily pay for that too. But it’s just not priced that way currently, thanks to weird growth subsidization and psychology.
The only AI use case that cares about latency is interactive voice agents, where you ideally want <200ms response time, and 100ms of network latency kills that. For coding and batch job agents anything under 1s isn't going to matter to the user.
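The voice-agent budget from those numbers can be laid out explicitly (the split into ASR/inference/TTS is an assumed breakdown, just to show what the remaining budget has to cover):

```python
# Hypothetical voice-agent latency budget using the comment's numbers:
# a <200 ms end-to-end target, of which 100 ms of network RTT eats half.
budget_ms = 200    # target response time from the comment
network_ms = 100   # cross-region round trip from the comment

model_budget_ms = budget_ms - network_ms
print(f"Network uses {network_ms / budget_ms:.0%} of the budget, "
      f"leaving {model_budget_ms} ms for ASR, inference, and TTS")
# → Network uses 50% of the budget, leaving 100 ms for ASR, inference, and TTS
```

With half the budget gone before any model runs, serving voice from a distant region is effectively ruled out, while a coding agent with a multi-second TTFT never notices.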
tbh, that's a good point about the voice agents that I hadn't considered. I guess there are some latency-sensitive inference workloads. Thanks for pointing that out.
A customer service chatbot can require more than one LLM call per response to the point that latency anywhere in the system starts to show up as a degraded end-user experience.
Easy solution: use hyperscalers, with their super-expensive API pricing, only when latency really matters; otherwise build your own DC. It's safe to assume customers don't care about latency that much when money is on the line.