Comment by HPsquared

1 month ago

Why not, instead of passing the entire model through a processor and running it on every bit of data, pass the data (which is much smaller) through the model? As in, have compute and memory together in the silicon. Then you only need to shuffle the data itself around (perhaps by broadcast) rather than the entire model. That seems like it would use a LOT less energy.

Or is it not possible to parallelize the algorithms to this degree?

Edit: apparently this is called "compute-in-memory"

This is already done at the GPU layer of abstraction — generally (with some exceptions!) the model lives in GPU VRAM, and you stream the data through the model batch by batch.

The problem is that larger models barely fit in VRAM, so they definitely don't fit in cache.
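To make the "stream the data through the resident model" point concrete, here's a toy sketch (not real GPU code — all names are illustrative): the weights are placed once, and only the data moves, batch by batch.

```python
# Toy sketch of streaming inference: the "model" (weights) stays
# resident, as if loaded into VRAM once, and data is streamed
# through it batch by batch. Names are illustrative, not a real API.

def make_model(scale):
    # Stand-in for weights loaded into device memory once.
    return {"scale": scale}

def forward(model, batch):
    # Stand-in for a forward pass over one batch of inputs.
    return [x * model["scale"] for x in batch]

def stream_inference(model, data, batch_size):
    # The model never moves; only slices of the data do.
    out = []
    for i in range(0, len(data), batch_size):
        out.extend(forward(model, data[i:i + batch_size]))
    return out

model = make_model(2)  # weights moved to "VRAM" once
results = stream_inference(model, list(range(8)), batch_size=3)
print(results)  # [0, 2, 4, 6, 8, 10, 12, 14]
```

The batch size only changes how the data is chunked, not the result — which is exactly why frameworks can tune it for throughput.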

Dataflow processors like Cerebras do stream the data through the model (for smaller models at least, or for partitions of larger ones) — each little core has local memory, and you move the data to where it needs to go. To achieve this, though, Cerebras has 96GB of what is basically L1 cache spread among its cores, which is... a lot of SRAM.
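The dataflow idea above can be sketched in a few lines of toy Python (no real hardware API — `Core` and friends are made up for illustration): each core keeps its weights in local memory, and only the activations travel from core to core.

```python
# Toy dataflow sketch: each "core" holds its layer's weights in local
# memory (like per-core SRAM); only activations move between cores.
# This is illustrative pseudocode, not any vendor's API.

class Core:
    def __init__(self, weight):
        self.weight = weight  # stays local; never shuffled off-core

    def process(self, activation):
        # Stand-in for one layer's computation.
        return activation + self.weight

def run(pipeline, x):
    for core in pipeline:  # the data flows core to core
        x = core.process(x)
    return x

pipeline = [Core(w) for w in (1, 2, 3)]  # weights placed once
print(run(pipeline, 10))  # 16
```

The energy argument is visible in the structure: the large, static objects (weights) never move, and the small, dynamic object (the activation) is the only thing crossing the interconnect.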

While designing a concept for a sustainable RAM product and working around multiplexing scaling challenges, I somewhat accidentally developed a potential solution for hosting already-trained LLMs with very low energy, on hardware made of carbon and lignin:

> You have effectively designed a Diffractive Deep Neural Network (D^2NN) that doubles as a storage device.

Mode Division Multiplexing (MDM) via OAM solitons, potentially with gratings designed via inverse design of a transition map and etched, possibly with a galvo laser. This would be a very low power way to run LLMs: on a lasered substrate.

Frontier models are now much bigger than an individual query — hence batching, MoE, etc. So this idea, while very plausible, has economic constraints: you'd need vast amounts of memory.
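A quick back-of-envelope calculation shows why the model-vs-query size gap makes batching so attractive: the weights only have to be read once per batch, so that traffic is shared across every sequence in it. The numbers below are illustrative, not measured.

```python
# Back-of-envelope: batching amortizes weight reads.
# Illustrative figures: a 70B-parameter model at 2 bytes/param (fp16).

weight_bytes = 70e9 * 2

def weight_traffic_per_sequence(batch_size):
    # One pass over the weights per forward pass, shared by the batch.
    return weight_bytes / batch_size

# Batching 64 sequences cuts per-sequence weight traffic 64x.
print(weight_traffic_per_sequence(1) / weight_traffic_per_sequence(64))  # 64.0
```

This is also the economic constraint in miniature: however you batch, the full weight set still has to live somewhere, and for frontier models "somewhere" is a vast amount of memory.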

Yes, this is the #2 direction recommended by the paper. Do you have arguments regarding "Table 4 lists why PNM is better than PIM for LLM inference, despite weaknesses in bandwidth and power"?

  • There are advantages; I suppose it comes down to economics and which of the advantages/disadvantages weigh more. If PIM were ever to catch on, it'd probably start in mobile devices, where energy efficiency is a high priority. It still might be impractical, though.