> applying this compression algorithm at scale may significantly relax the memory bottleneck issue.
I don’t think they’re going to downsize though; I think the big players are just going to use the freed-up memory for more workflows or larger models, because the big players want to scale up. It’s an arms race for the best models.
Known in the business as 'pulling a Jevons'
Does the KV cache really grow to use more memory than the model weights? The reduction in overall RAM relies on the KV cache being a substantial proportion of memory usage, but with very large models I can't see how that holds true.
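For a rough sense of scale, a back-of-the-envelope sketch; the model shape below is my own assumption (roughly a 70B-class model with grouped-query attention), not a figure from the article:

    # KV cache size vs. weight size, back-of-the-envelope.
    # All dimensions below are illustrative assumptions.
    n_layers   = 80
    n_kv_heads = 8        # grouped-query attention
    head_dim   = 128
    n_params   = 70e9
    bytes_per  = 2        # fp16/bf16

    def kv_cache_bytes(batch, seq_len):
        # 2x for keys and values, one entry per layer per token
        return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per

    print(f"weights: {n_params * bytes_per / 1e9:.0f} GB")
    for batch, seq in [(1, 8192), (64, 8192), (64, 131072)]:
        print(f"batch={batch:3d} seq={seq:6d}: KV {kv_cache_bytes(batch, seq)/1e9:,.1f} GB")

With those assumptions a single 8k chat holds about 2.7 GB of KV, a serving batch of 64 at 8k already rivals the ~140 GB of weights, and long contexts blow well past it. So for one local user the weights dominate, but a provider batching many long-context requests can easily see the cache win.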
Despite the shortage, RAM is still cheaper than mathematicians.
I don't know, I think if you weighed up the costs of AI-related datacentre spend vs. the average mathematics academic's salary you could come to a different conclusion.
It's also less frustrating to organize worldwide RAM production and logistics than to deal with a single mathematician.
Constantly sitting around trying to solve problems that nobody has made headway on for hundreds of years. Or inventing theorems around 15th century mysticism that won't be applicable for hundreds of years.
Now if you'll excuse me I need to multiply some numbers by 3 and divide them by 2 ... I'm so close guys.
The comment feels a bit like Verdex may have dated a mathematician at some point and it went sour.
Doubt it. You have to pay these mathematicians once and then you can deploy to millions of sites.
But not everyone has to pay the mathematicians, unlike RAM :-)
At the same time, processing is much cheaper than memory.
This is one of the basic avenues for advancement.
Compute, bytes of RAM used, bytes in the model, bytes accessed per iteration, bytes of data used for training.
You can trade the balance if you can find another way to do things; extreme quantisation is but one direction to try. KANs were aiming for more compute and fewer parameters. The recent optimisation projects have been pushing at these various properties. Sometimes gains in one come at the cost of another, but that needn't always be the case.
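As one concrete instance of the memory-for-compute trade, a minimal symmetric int8 quantisation sketch (a generic textbook scheme, nothing paper-specific):

    import numpy as np

    def quantize_int8(w):
        # Symmetric per-tensor int8: 4x smaller than fp32,
        # at the cost of extra compute and some rounding error.
        scale = np.abs(w).max() / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(4096, 4096).astype(np.float32)
    q, scale = quantize_int8(w)
    err = np.abs(dequantize(q, scale) - w).mean()
    print(f"{w.nbytes/1e6:.0f} MB -> {q.nbytes/1e6:.0f} MB, mean abs error {err:.4f}")

Real schemes go further (per-channel scales, 4-bit weights, rotations), but the shape of the trade is the same: fewer bytes held in memory, a little extra compute and rounding error in exchange.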
The same could be said about other IT domains... When you see single webpages that weigh tens of MB, you wonder how we came to this.
Detachment from reality. Code elegance is more important than anything else. As simple as that.
I've thought for a while that the real gains now will not come from throwing more hardware at the problem, but from advances in mathematical techniques that make things far more efficient.
The drop in memory stocks seems counterintuitive to me.
The demand for memory isn't going to go down, we'll just be able to do more with the same amount of memory.
Sigh. Don't make me tap the sign [1]
[1] http://www.incompleteideas.net/IncIdeas/BitterLesson.html
Doesn't seem relevant here. TurboQuant isn't a domain-specific technique like the BL is talking about, it's a general optimisation for transformers that helps leverage computation more effectively.
> If I were Google, I wouldn’t release research that exposes a competitive advantage.
Isn't that a classic tit-for-tat decision, heading for a loss?
Excellence and prestige are valuable too: you get that expensive ML talent at a small discount, better public and professional perception, etc. Judging by Google's public communication, which isn't completely sociopathic, they know this war won't be won in one night, and they are the only sustainably funded company in the competition. Their business is surely at risk, but they can either go rampant or focus. They decided to focus.
We will not see memory demand decrease because this will simply allow AI companies to run more instances. They still want an infinite amount of memory at the moment, no matter how AI improves.
If models become more efficient, we will move more of the work to local devices instead of using SaaS models. We’re still in the mainframe era of LLMs.
As I understand this advancement, it doesn't let you run bigger models; it lets you maintain more chat context. So Anthropic and OpenAI won't need as much hardware running inference to serve their users, but it doesn't do much to make bigger models work on smaller hardware.
Though I'm not an expert, maybe my understanding of the memory allocation is wrong.
The hyperscalers do not want us running models at the edge and they will spend infinite amounts of circular fake money to ensure hardware remains prohibitively expensive forever.
I don't see how we'll ever get to widespread local LLM.
The power efficiency alone is a strong enough pressure to use centralized model providers.
My 3090 running 24B or 32B models is fun, but I know I'm paying way more per token in electricity, on top of lower-quality tokens (rough numbers in the sketch below).
It's fun to run them locally, but for anything actually useful it's cheaper to just pay API prices currently.
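Back-of-the-envelope, with every number an assumption of mine rather than a measurement:

    # Electricity cost per token for local inference.
    # All figures below are assumptions, not measurements.
    watts        = 350      # RTX 3090 under inference load
    tokens_per_s = 30       # quantised ~32B model
    price_kwh    = 0.30     # $/kWh

    kwh_per_token = watts / 1000 / 3600 / tokens_per_s
    print(f"~${kwh_per_token * price_kwh * 1e6:.2f} per million tokens, electricity only")

That works out to roughly a dollar per million tokens before amortising the card itself, which is already in the range providers charge for comparable open-weight models served from far more efficient datacentre GPUs.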
> If models become more efficient
Then we can make them even bigger.
I don't think we are there yet. Models running in data centers will still be noticeably better, because efficiency gains also let providers build and run better models.
Not many people today would settle for models comparable to what was SOTA two years ago.
To run models locally with results as good as the data-centre models, we need both efficiency gains and a wall in AI improvement.
Neither of those conditions seems likely to hold in the near future.
I like the mainframe comparison, but isn't there a key difference? Mainframes died because hardware got cheap -- that's predictable. LLM efficiency improving enough to run locally needs algorithmic breakthroughs, which... aren't.

My gut says we'll end up with a split: stuff where latency matters (copilot, local agents) moves to the edge once models actually fit on a laptop, while training and big context windows stay in the cloud, because that's where the data lives.

One thing I keep going back and forth on: is MoE "better math" or just "better engineering"? That distinction feels like it matters a lot for where this all goes.
I disagree. I think a sharp drop in memory requirements of at least an order of magnitude will cause demand to adjust accordingly.
The Department of Transportation always thinks adding more lanes will reduce traffic.
It doesn't; it induces demand. Why? Because there are always enough people with cars to fill those lanes.
Jevons paradox https://en.wikipedia.org/wiki/Jevons_paradox
Can we say anything about the compression factor for the pure knowledge in these models?