Comment by jiggawatts

2 days ago

The next-gen inference chips will use High Bandwidth Flash (HBF) to store model weights.

HBF stacks are made similarly to HBM but draw less power and offer much higher capacity. They can also be used for caching to reduce costs when processing long chat sessions.