Comment by cootsnuck
20 hours ago
Yeah. LLM inference needs batch processing to have a shred of hope at being cost efficient. Batch processing needs a not-insignificant amount of scale (but probably not as much as people think).
I'm very pro local models, but not as a way to get parity with SoTA frontier models. Just small models trained for their context, doing smaller, specific tasks.
Trying to run bigger LLMs for an individual user to do big tasks is not going to be a good time.
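To put rough numbers on the batching point, here's a back-of-the-envelope sketch. It assumes decoding is memory-bandwidth bound at small batch sizes: every decode step reads the full weights once no matter how many requests share that step, while compute grows with the batch. Every number below (model size, bandwidth, FLOPs, hourly price) is a made-up placeholder, not a benchmark of any real hardware, and the model ignores KV-cache traffic, prefill, and multi-device overhead.

```python
def decode_step_seconds(batch, weight_bytes, mem_bw, flops_per_token, peak_flops):
    """Roofline-style estimate of one decode step for `batch` concurrent requests."""
    mem_time = weight_bytes / mem_bw                      # weights read once per step
    compute_time = batch * flops_per_token / peak_flops   # compute grows with batch
    return max(mem_time, compute_time)


def cost_per_million_tokens(batch, dollars_per_hour, **hw):
    """Dollars per one million generated tokens at a given batch size."""
    tokens_per_second = batch / decode_step_seconds(batch, **hw)
    return dollars_per_hour / 3600 / tokens_per_second * 1e6


# Hypothetical accelerator and model (assumptions for illustration only).
HW = dict(
    weight_bytes=140e9,     # ~70B parameters at 2 bytes each
    mem_bw=2e12,            # 2 TB/s memory bandwidth
    flops_per_token=140e9,  # ~2 FLOPs per parameter per token
    peak_flops=4e14,        # 400 TFLOP/s
)

if __name__ == "__main__":
    for batch in (1, 8, 64, 256, 512):
        cost = cost_per_million_tokens(batch, dollars_per_hour=2.0, **HW)
        print(f"batch={batch:4d}  ~${cost:.2f} per 1M tokens")
```

Under these assumptions the cost per token drops roughly in proportion to batch size until the step becomes compute bound (around a couple hundred concurrent requests here), then flattens out. A single user sits at batch size 1 and pays for the whole weight read alone.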
Wasn't this pretty evident to pretty much anyone who knew even a bit about inference?
Idk what people were thinking. I've never seen anyone offer a plausible way to sidestep batch processing, for example.