Comment by 8note

5 hours ago

I really want a Qwen model running on one of these chips: https://chatjimmy.ai

At 15k tokens/s, it would actually feel worth splitting out worktrees to try several approaches to a problem.

Cerium 5 hours ago

Why is that? It seems like it should go the other way. I want to be sure I can complete a task within a certain amount of wall-clock time. If tokens per second are low, I take on more risk by running one approach at a time, so I have an incentive to multiplex my attention across separate workstreams. If generation is fast enough to occupy my full attention, parallel threads offer no further improvement.
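
To put rough numbers on the wall-clock argument, here is a back-of-the-envelope sketch. Only the 15k tokens/s figure comes from the thread; the 30,000 tokens per attempt, the 50 tokens/s baseline, and the function names are assumptions picked for illustration. It also ignores tool calls, test runs, and the time a person spends reviewing output.

```python
def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock seconds to generate `tokens` at a steady rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


def wall_clock_seconds(attempts: int, tokens_per_attempt: int,
                       tokens_per_second: float, parallel: bool) -> float:
    """Time until every attempt has finished.

    Assumes each parallel attempt gets the full generation rate,
    which is the most favorable case for running in parallel.
    """
    one_attempt = generation_seconds(tokens_per_attempt, tokens_per_second)
    return one_attempt if parallel else attempts * one_attempt


TOKENS_PER_ATTEMPT = 30_000  # hypothetical size of one approach
SLOW_RATE = 50.0             # hypothetical baseline, tokens/s
FAST_RATE = 15_000.0         # figure quoted in the thread, tokens/s
ATTEMPTS = 3                 # hypothetical number of approaches

slow_parallel = wall_clock_seconds(ATTEMPTS, TOKENS_PER_ATTEMPT, SLOW_RATE, parallel=True)
fast_sequential = wall_clock_seconds(ATTEMPTS, TOKENS_PER_ATTEMPT, FAST_RATE, parallel=False)

print(f"3 attempts in parallel at 50 tokens/s:    {slow_parallel:.0f} s")
print(f"3 attempts in sequence at 15k tokens/s:   {fast_sequential:.0f} s")
```

Under these assumptions, three approaches run one after another at the fast rate still finish well before three parallel approaches at the slow rate, so whether parallel worktrees pay off depends mostly on how long the non-generation steps take.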