Comment by jiqiren
1 day ago
This release introduces parallel requests with continuous batching for high-throughput serving, an all-new non-GUI deployment option, a new stateful REST API, and a refreshed user interface.
Awesome - having the API, MCP integrations, and refined CLI gives you everything you might want. There are some things I'd wanted to try with ChainForge and LMStudio that are now almost trivial.
Thanks for the updates!
Are parallel requests "free"? Or do you halve performance when sending two requests in parallel?
I have seen ~1,300 tokens/sec of total throughput with Llama 3 8B on a MacBook Pro. So no, you don't halve the performance. But batched inference takes more memory, so you have to use shorter contexts than you would without batching.
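If you want to sanity-check this yourself, a minimal sketch: fire the same request at the server sequentially and in parallel and compare aggregate tokens/sec. This assumes an OpenAI-compatible chat endpoint (LM Studio's local server defaults to http://localhost:1234/v1) and a placeholder model id - both are assumptions, substitute whatever your server actually exposes.

    # Rough throughput check: N identical requests with N workers, then
    # aggregate completion tokens / wall-clock time. If batching works,
    # tok/s should grow with N rather than stay flat.
    import time
    import requests
    from concurrent.futures import ThreadPoolExecutor

    URL = "http://localhost:1234/v1/chat/completions"  # assumed local server
    PAYLOAD = {
        "model": "llama-3-8b",  # placeholder id; use what your server reports
        "messages": [{"role": "user", "content": "Write a haiku about batching."}],
        "max_tokens": 128,
    }

    def one_request():
        r = requests.post(URL, json=PAYLOAD, timeout=120)
        r.raise_for_status()
        # OpenAI-compatible servers report token counts under "usage"
        return r.json()["usage"]["completion_tokens"]

    for n in (1, 2, 4):
        start = time.time()
        with ThreadPoolExecutor(max_workers=n) as pool:
            tokens = sum(pool.map(lambda _: one_request(), range(n)))
        dt = time.time() - start
        print(f"{n} parallel: {tokens} tokens in {dt:.1f}s -> {tokens / dt:.0f} tok/s")

Note per-request latency will still rise a bit as N grows; the win is in total tokens/sec across requests, which is what the ~1,300 figure measures.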