Comment by zozbot234
9 hours ago
If "fast" routing is per-token, the experts can just reside on SSD's. the performance is good enough these days. You don't need to globally share unified memory across the nodes, you'd just run distributed inference.
Anyway, in the future your local model setups will just be downloading experts on the fly from experts-exchange. That site will become as important to AI as downloadmoreram.com.