Comment by behnamoh

2 days ago

While Tokasaurus’s Async-TP shows impressive throughput gains, it seems over-engineered for common use cases. The CPU overhead from async tensor parallelism only pays off at 6k+ token batches, and you need NVLink-connected GPUs to see real benefits. Most prod deployments don’t need this complexity — you’re better off with simpler approaches unless you’re specifically optimizing for massive batch throughput. The adaptive manager skipping “optional” tasks under load also feels concerning from a reliability perspective.

But surely next year's production deployments will look very different from today's, with different use cases, etc.

  • Sure. Things change over time. Is there a reason to believe they'd be different in such a way that this would be more useful than in today's landscape? I haven't seen such a forecast myself.

Depends on what production means for you. This is useful for batch production jobs.

Also, this seems very useful for generating synthetic data or labelling a bunch of data. 6k batch size is small for data labelling.
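To make the batch-labelling point concrete, here is a rough sketch of what such a job looks like. This is purely illustrative: `label_batch` is a hypothetical stand-in for whatever batched inference endpoint you'd actually call (a Tokasaurus-style engine, or anything else), and the toy labelling rule exists only to keep the example runnable.

```python
def label_batch(texts):
    """Stand-in for a batched model call; here it just labels by length."""
    return ["long" if len(t) > 20 else "short" for t in texts]

def label_dataset(texts, batch_size=6144):
    """Chunk the dataset into large batches and label each chunk.

    With 1M rows and a ~6k batch, that is only ~163 engine calls,
    so per-batch throughput dominates total wall time -- which is
    why a 6k batch is unremarkable for a data-labelling workload.
    """
    labels = []
    for i in range(0, len(texts), batch_size):
        labels.extend(label_batch(texts[i:i + batch_size]))
    return labels

dataset = ["short text", "a considerably longer piece of text"] * 10_000
labels = label_dataset(dataset)
print(len(labels))  # 20000
```

The point of the sketch is just the shape of the workload: a handful of very large batches, where throughput per batch matters far more than per-request latency.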

  • How big a use case is synthetic data generation, really? I'm curious, since I see a lot about it coming from academic projects but haven't seen much tied to commercial deployments.