Comment by Havoc
15 hours ago
Was excited until I realized DS flash is still enormous. Oh well...glad it exists anyway & happy to see antirez still doing fun stuff
It could run viably with SSD offload on Macs with very little memory. You could even exploit batching to make the model almost compute-limited even in that challenging setting, seeing as the KV cache is so extremely small (for non-humongous context). In fact, if that approach can be made to work, I'd like to see a comparison between DS4 Flash and Pro on the same (Mac) hardware.
>It could run viably with SSD offload on Macs with very little memory
Not really. That's going to land you somewhere in the 0.2-0.5 tokens per second range.
Lovely as modern NVMe drives are, they're not memory.
You can run multiple inferences in parallel on the same set of weights; that's what batching is. Given enough parallelism it can become almost entirely compute-limited, at least for small contexts (apparently at most ~10 GB of KV cache per request, but that's for 1M tokens!).
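A rough back-of-envelope sketch of the argument: if decoding is SSD-bandwidth-bound, one weight read can serve a whole batch of requests, so aggregate throughput scales with batch size until compute takes over. All the numbers below (NVMe bandwidth, active weight bytes per token) are illustrative assumptions, not measurements of any real model or machine:

```python
# Hedged back-of-envelope model of SSD-offloaded decoding.
# Both constants are assumptions chosen for illustration only.
ssd_bw_gb_s = 6.0        # assumed NVMe sequential read bandwidth, GB/s
active_params_gb = 20.0  # assumed weight bytes streamed per decode step, GB

# Batch of 1: every token re-reads the active weights from SSD,
# so throughput is bandwidth / bytes-per-step. Pure I/O bound.
tok_s_batch1 = ssd_bw_gb_s / active_params_gb
print(f"batch=1:  {tok_s_batch1:.2f} tok/s")

# Batching: one streamed pass over the weights serves B independent
# requests, so aggregate throughput grows ~linearly with B until
# compute, not the SSD, becomes the bottleneck.
for batch in (8, 64):
    print(f"batch={batch:<2} {tok_s_batch1 * batch:.1f} tok/s aggregate")
```

With these made-up numbers the batch-of-1 case lands at roughly 0.3 tok/s, consistent with the 0.2-0.5 range above, while modest batching pushes aggregate throughput well past it. A real MoE model complicates this (different tokens hit different experts, so weight reads are only partially shared), but the shape of the argument is the same.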