Comment by tucnak

24 days ago

> It is possible to run locally though

> running one of the heavily quantized versions

There is a night-and-day difference in generation quality between even something like 8-bit and "heavily quantized" versions. Why not quantize to 1-bit anyway? Would that qualify as "running the model"? Food for thought. Don't get me wrong: there's plenty of stuff you can actually run on a 96 GB Mac Studio (let alone on the 128/256 GB ones), but 1T-class models are not in that category, unfortunately. Unless you put four of them in a rack or something.
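For a rough sense of the numbers behind this, here is a back-of-the-envelope sketch of weight storage at different bit widths. It assumes "1T-class" means exactly one trillion weights, counts weights only (no KV cache, activations, or quantization metadata), and uses GB as 10^9 bytes; `weight_gb` and `fits` are hypothetical helper names, not from any library.

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: "1T-class" taken as exactly one trillion weights.
N_PARAMS = 1e12


def fits(ram_gb: float, bits_per_weight: float) -> bool:
    """True if the weights alone would fit in the given amount of RAM."""
    return weight_gb(N_PARAMS, bits_per_weight) <= ram_gb


if __name__ == "__main__":
    for bits in (16, 8, 4, 2, 1):
        size = weight_gb(N_PARAMS, bits)
        # RAM sizes are the Mac Studio configurations mentioned in the thread.
        verdict = {ram: fits(ram, bits) for ram in (96, 128, 256, 512)}
        print(f"{bits:>2}-bit: ~{size:,.0f} GB  fits: {verdict}")
```

Under these assumptions even a hypothetical 1-bit version needs about 125 GB for weights alone, so it would not fit in 96 GB, and 8-bit needs about 1,000 GB.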

True, although the Mac Studio M3 Ultra does go up to 512GB (at ~$10K), so models of this size are not too far out of reach (though I've no idea how useful Kimi K2.5 is compared to SOTA).

Kimi K2.5 is an MoE model with 384 "experts" and an active parameter count of only 32B, although that doesn't really help reduce RAM requirements, since the set of active experts changes on every token and you'd be swapping those weights in and out each time. I wonder if it would be viable to come up with an MoE variant where consecutive sequences of tokens got routed to individual experts, which would change the memory thrashing from per-token to per-token-sequence, perhaps making it tolerable?
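To make the per-token versus per-token-sequence idea concrete, here is a toy simulation, not a model of any real router. Assumptions: one expert per token (real MoE layers pick several experts per layer), uniformly random routing, and a small LRU cache standing in for the experts that fit in fast memory. All names and the cache and run-length values are hypothetical; only the 384-expert count comes from the comment above.

```python
import random
from collections import OrderedDict


def count_expert_loads(routes, cache_slots):
    """Count how often an expert has to be loaded into an LRU cache of resident experts."""
    cache = OrderedDict()
    loads = 0
    for expert in routes:
        if expert in cache:
            cache.move_to_end(expert)      # hit: mark as most recently used
        else:
            loads += 1                     # miss: expert must be fetched
            cache[expert] = True
            if len(cache) > cache_slots:
                cache.popitem(last=False)  # evict the least recently used expert
    return loads


def per_token_routes(n_tokens, n_experts, rng):
    """Toy router: every token independently picks a random expert."""
    return [rng.randrange(n_experts) for _ in range(n_tokens)]


def per_sequence_routes(n_tokens, n_experts, run_length, rng):
    """Toy router: each run of `run_length` consecutive tokens shares one expert."""
    routes = []
    while len(routes) < n_tokens:
        routes.extend([rng.randrange(n_experts)] * run_length)
    return routes[:n_tokens]


# Illustrative values; only N_EXPERTS = 384 is taken from the thread.
N_TOKENS, N_EXPERTS, CACHE_SLOTS, RUN_LENGTH = 1024, 384, 32, 64

rng = random.Random(0)
token_loads = count_expert_loads(
    per_token_routes(N_TOKENS, N_EXPERTS, rng), CACHE_SLOTS)
sequence_loads = count_expert_loads(
    per_sequence_routes(N_TOKENS, N_EXPERTS, RUN_LENGTH, rng), CACHE_SLOTS)

print(f"per-token routing:    {token_loads} expert loads")
print(f"per-sequence routing: {sequence_loads} expert loads")
```

In this toy setup the per-sequence router can need at most one load per run of tokens, so the number of swaps drops by roughly the run length. Whether a real model trained that way would keep its quality is a separate question the sketch says nothing about.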