Comment by jiqiren
1 day ago
This release introduces parallel requests with continuous batching for high-throughput serving, an all-new non-GUI deployment option, a new stateful REST API, and a refreshed user interface.
Awesome - having the API, MCP integrations, and refined CLI gives you everything you might want. There are some things I'd wanted to try with ChainForge and LMStudio that are now almost trivial.
Thanks for the updates!
Are parallel requests "free"? Or do you halve performance when sending two requests in parallel?
I have seen ~1,300 tokens/sec of total throughput with Llama 3 8B on a MacBook Pro. So no, you don't halve the performance. But batched inference takes more memory, so you have to use shorter contexts than you would without batching.
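If you want to sanity-check this yourself, a minimal sketch: fire the same request at the server sequentially and in parallel and compare aggregate tokens/sec. This assumes an OpenAI-compatible chat endpoint (LM Studio's local server defaults to http://localhost:1234/v1) and a placeholder model id - both are assumptions, substitute whatever your server actually exposes.

    # Rough throughput check: N identical requests with N workers, then
    # aggregate completion tokens / wall-clock time. If batching works,
    # tok/s should grow with N rather than stay flat.
    import time
    import requests
    from concurrent.futures import ThreadPoolExecutor

    URL = "http://localhost:1234/v1/chat/completions"  # assumed local server
    PAYLOAD = {
        "model": "llama-3-8b",  # placeholder id; use what your server reports
        "messages": [{"role": "user", "content": "Write a haiku about batching."}],
        "max_tokens": 128,
    }

    def one_request():
        r = requests.post(URL, json=PAYLOAD, timeout=120)
        r.raise_for_status()
        # OpenAI-compatible servers report token counts under "usage"
        return r.json()["usage"]["completion_tokens"]

    for n in (1, 2, 4):
        start = time.time()
        with ThreadPoolExecutor(max_workers=n) as pool:
            tokens = sum(pool.map(lambda _: one_request(), range(n)))
        dt = time.time() - start
        print(f"{n} parallel: {tokens} tokens in {dt:.1f}s -> {tokens / dt:.0f} tok/s")

Note per-request latency will still rise a bit as N grows; the win is in total tokens/sec across requests, which is what the ~1,300 figure measures.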