Comment by falloutx

19 days ago

That's also called slowing down the default experience so users have to pay more for the fast mode. I think it's the first time we're seeing blatant speed ransoms in LLMs.

That's not how this works. LLM serving at scale batches multiple requests together for efficiency. Reduce the batching and you can process individual requests faster, but the total token throughput of the system drops.
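The batching tradeoff described above can be sketched with a toy model: each decode step has a fixed overhead plus a per-sequence cost, so bigger batches mean slower per-request speed but higher total throughput. The constants below are invented for illustration, not measured from any real serving system.

```python
# Toy latency/throughput model for batched LLM decoding.
# base_ms and per_seq_ms are made-up illustrative constants.

def step_time_ms(batch_size, base_ms=20.0, per_seq_ms=2.0):
    """Time for one decode step: fixed overhead plus a per-sequence cost."""
    return base_ms + per_seq_ms * batch_size

def per_request_tokens_per_s(batch_size):
    """Each request gets one token per decode step."""
    return 1000.0 / step_time_ms(batch_size)

def total_tokens_per_s(batch_size):
    """The whole batch produces batch_size tokens per step."""
    return batch_size * per_request_tokens_per_s(batch_size)

for b in (1, 8, 32):
    print(f"batch={b:2d}  per-request={per_request_tokens_per_s(b):5.1f} tok/s  "
          f"total={total_tokens_per_s(b):6.1f} tok/s")
```

Under this model, batch size 1 gives each user ~45 tok/s but the hardware only does ~45 tok/s total, while batch size 32 cuts each user to ~12 tok/s but triples-plus total throughput. Nothing artificial is needed for the "fast" tier to be slower by default: it's the normal latency/throughput tradeoff.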

  • They can now easily decrease the speed for the normal mode, and then users will have to pay more for fast mode.

    • Do you have any evidence that this is happening? Or is it just a hypothetical threat you're proposing?

      These companies aren't operating in a vacuum. Most of their users could change providers quickly if they started degrading their service.

      1 reply →

    • Are you at all familiar with the architecture of systems like theirs?

The reason people don't jump to your conclusion here (and why you get downvoted) is that, for anyone familiar with how this is orchestrated on the backend, it's obvious they don't need artificial slowdowns.

      2 replies →