Comment by jiggawatts
2 days ago
Next-gen inference chips will use High Bandwidth Flash (HBF) to store model weights.
HBF is made similarly to HBM but is lower power and much higher capacity. It can also be used for caching to reduce costs when processing long chat sessions.