Comment by recognity

5 hours ago

The insight about TTFT dominating everything resonates. We're seeing the same pattern in CLI tools — the perceived speed of AI features comes down to how fast you get the first useful output, not total processing time.

Curious about your semantic end-of-turn detection: are you using a separate lightweight model for that, or is it baked into the main LLM inference? That seems like the hardest part to get right without adding latency.
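
For reference, the pattern I had in mind for the first option is a cheap scorer running on every streaming ASR partial, so the main model's prefill can start as soon as a turn boundary looks likely rather than after a fixed silence timeout. A minimal sketch, with a toy heuristic standing in for whatever small classifier you'd actually run:

```python
# Sketch of the "separate lightweight model" option. `score_end_of_turn`
# is a stand-in for a real classifier (a distilled encoder, the ASR's
# punctuation head, etc.); the heuristic below is illustrative only.

def score_end_of_turn(partial: str) -> float:
    """Pseudo-probability that the speaker has finished their turn."""
    text = partial.rstrip()
    if not text:
        return 0.0
    if text.endswith(("?", ".", "!")):
        return 0.9
    if text.split()[-1].lower() in {"um", "uh", "and", "so", "but"}:
        return 0.05  # trailing filler: almost certainly more coming
    return 0.4

def maybe_start_prefill(partials, start_llm, threshold=0.8):
    """Fire the LLM on the first partial that looks like a finished turn."""
    for transcript in partials:
        if score_end_of_turn(transcript) >= threshold:
            start_llm(transcript)  # begin prefill early to cut TTFT
            return transcript
    return None

# Example: prefill starts at "...to Boston?" instead of after a silence gap.
maybe_start_prefill(
    ["book me a", "book me a flight", "book me a flight to Boston?"],
    start_llm=lambda t: print(f"prefill: {t!r}"),
)
```

The tradeoff I can't figure out is the threshold: set it low and you waste prefills on false boundaries, set it high and you're back to waiting out the pause.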