Comment by falloutx

19 days ago

That's also called slowing down the default experience so users have to pay more for the fast mode. I think it's the first time we're seeing blatant speed ransoms in LLMs.

That's not how this works. LLM serving at scale batches multiple requests together for efficiency. Reduce the batching and you can process individual requests faster, but the total token throughput of the system drops.
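The batching tradeoff described above can be sketched with a toy model: each decode step has a fixed overhead plus a per-sequence cost, so bigger batches mean slower per-request speed but higher total throughput. The constants below are invented for illustration, not measured from any real serving system.

```python
# Toy latency/throughput model for batched LLM decoding.
# base_ms and per_seq_ms are made-up illustrative constants.

def step_time_ms(batch_size, base_ms=20.0, per_seq_ms=2.0):
    """Time for one decode step: fixed overhead plus a per-sequence cost."""
    return base_ms + per_seq_ms * batch_size

def per_request_tokens_per_s(batch_size):
    """Each request gets one token per decode step."""
    return 1000.0 / step_time_ms(batch_size)

def total_tokens_per_s(batch_size):
    """The whole batch produces batch_size tokens per step."""
    return batch_size * per_request_tokens_per_s(batch_size)

for b in (1, 8, 32):
    print(f"batch={b:2d}  per-request={per_request_tokens_per_s(b):5.1f} tok/s  "
          f"total={total_tokens_per_s(b):6.1f} tok/s")
```

Under this model, batch size 1 gives each user ~45 tok/s but the hardware only does ~45 tok/s total, while batch size 32 cuts each user to ~12 tok/s but triples-plus total throughput. Nothing artificial is needed for the "fast" tier to be slower by default: it's the normal latency/throughput tradeoff.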

  • They can now easily decrease the speed for the normal mode, and then users will have to pay more for fast mode.

    • Do you have any evidence that this is happening? Or is it just a hypothetical threat you're proposing?

      These companies aren't operating in a vacuum. Most of their users could change providers quickly if they started degrading their service.

      1 reply →

    • Are you at all familiar with the architecture of systems like theirs?

The reason people don't jump to your conclusion here (and why you get downvoted) is that, for anyone familiar with how this is orchestrated on the backend, it's obvious they don't need artificial slowdowns.

      2 replies →