Comment by londons_explore

8 hours ago

Except the providers also cache the parsing of the prompt (the KV cache), and that has substantial cost savings (easily an 80% saving on typical coding use cases).

That caching is done server-side and is not passed to the client, which in turn means they still need state management on the server, although it perhaps doesn't need the same level of global replication and availability.
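To put rough numbers on that saving, here is a minimal sketch of the arithmetic. It assumes a chat where every turn resends the full history, and a hypothetical pricing rule where tokens already in the cache are billed at 10% of the normal input rate. The function name, the discount, and the turn sizes are all made up for illustration; output tokens are ignored.

```python
def conversation_cost(turn_tokens, cached_rate=0.1, use_cache=True):
    """Total input cost, in units of one full-price token, for a multi-turn chat.

    Each turn resends the whole history plus the new tokens. With caching,
    the history seen in earlier turns is billed at cached_rate
    (a hypothetical discount, not any provider's real price).
    """
    total = 0.0
    history = 0  # tokens sent in earlier turns
    for new in turn_tokens:
        if use_cache:
            total += history * cached_rate + new
        else:
            total += history + new
        history += new
    return total


# Hypothetical coding session: 20 turns, 2000 new tokens per turn.
turns = [2000] * 20
without = conversation_cost(turns, use_cache=False)
with_cache = conversation_cost(turns, use_cache=True)
saving = 1 - with_cache / without
print(f"without cache: {without:.0f}, with cache: {with_cache:.0f}, saving: {saving:.0%}")
```

With these made-up numbers the saving comes out at roughly 81%, and it grows with the number of turns, since more of each request is repeated history.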


cyanydeez 5 hours ago

From the March changes, it looked like they increased cache eviction rates in VRAM at Claude, causing everyone to start burning tokens as they had to regenerate the token state.
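A toy model of why eviction would burn tokens, assuming a server-side cache that holds a fixed number of session prefixes and evicts the least recently used one. Real eviction policies are not public, so `PrefixCache`, `simulate`, and every number here are hypothetical. It only shows the mechanism: once a prefix is evicted, the whole prompt has to be recomputed instead of just the new tokens.

```python
from collections import OrderedDict


class PrefixCache:
    """Toy LRU cache mapping a session id to the length of its cached prefix."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def process(self, session, prompt_len):
        """Return how many prompt tokens must be recomputed for this request."""
        cached = self.entries.pop(session, 0)  # 0 if missing or evicted
        recompute = prompt_len - min(cached, prompt_len)
        self.entries[session] = prompt_len  # re-insert as most recently used
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return recompute


def simulate(capacity, sessions=4, turns=5, tokens_per_turn=1000):
    """Round-robin over sessions; each turn adds tokens_per_turn to the prompt."""
    cache = PrefixCache(capacity)
    total = 0
    for turn in range(1, turns + 1):
        for s in range(sessions):
            total += cache.process(s, turn * tokens_per_turn)
    return total


roomy = simulate(capacity=4)  # every session stays cached
tight = simulate(capacity=2)  # every session is evicted before its next turn
print(roomy, tight)
```

In this sketch the tight cache recomputes three times as many tokens as the roomy one (60000 versus 20000), because every request becomes a full miss.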