Comment by mswphd

4 hours ago

The context window size increase for the Qwen3.6 models isn't that bad (e.g. you can likely fit max context well within the 48GB), but MacBook prompt processing is notoriously slow (at least up through M4; M5 got some speedup, but I haven't messed with it).

One thing to keep in mind is that you do not need to fully fit the model in VRAM to run it. For example, I'm able to get acceptable token generation speed (~55 tok/s) on a 3080 by offloading the expert layers to system RAM. I can't remember the prompt processing speed, but generally speaking prompt processing is compute bound, so it benefits more from an actual GPU.
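For anyone wanting to try the expert-offload trick, here's a rough sketch using llama.cpp's tensor-override flag. The flag names (`-ot`/`--override-tensor`, `-ngl`) and the tensor-name regex reflect recent llama.cpp builds and may differ in yours (check `llama-cli --help`); `./model.gguf` and the context size are placeholders, not the setup from my comment.

```shell
# Sketch: run a MoE model with expert layers kept in system RAM.
# "-ngl 99" offloads all layers to the GPU, then "-ot" overrides the
# MoE expert FFN tensors (names like blk.N.ffn_up_exps) back to CPU,
# so only the smaller attention/shared weights need to fit in VRAM.
llama-cli -m ./model.gguf -ngl 99 -ot ".ffn_.*_exps.=CPU" -c 16384 -p "Hello"
```

Recent builds also have a shorthand (`--n-cpu-moe N`, if your version includes it) that keeps the first N layers' experts on CPU without writing the regex by hand.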