Comment by pvab3

1 month ago

Inference requires a fraction of the power that training does. According to the Villalobos paper, the median date for running out of human-generated text to train on is 2028. At some point we won't be training bigger and bigger models every month. We will run out of additional material to train on, things will continue commodifying, and then the amount of training happening will significantly decrease unless new avenues open for new types of models. But our current LLMs are much more compute-intensive than any other type of generative or task-specific model.

11 comments

SequoiaHope 1 month ago

Run out of training data? They’re going to put these things in humanoids (they are weirdly cheap now) and record high resolution video and other sensor data of real world tasks and train huge multimodal Vision Language Action models etc.

The world is more than just text. We can never run out of pixels if we point cameras at the real world and move them around.

I work in robotics and I don’t think people talking about this stuff appreciate that text and internet pictures are just the beginning. Robotics is poised to generate and consume TONS of data from the real world, not just the internet.

DoctorOetker 1 month ago

While we may run out of human written text of value, we won't run out of symbolic sequences of tokens: we can trivially start with axioms and do random forward chaining (or random backward chaining from postulates), and then train models on 2-step, 4-step, 8-step, ... correct forward or backward chains.
Nobody talks about it, but ultimately the strongest driver for terascale compute will be mathematical breakthroughs in cryptography (not brute-forcing keys, but brute-forcing mathematical reasoning).
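
The random forward chaining idea above can be sketched roughly as follows. This is only a toy illustration, not anyone's actual pipeline: the rule base `RULES`, the axiom set `AXIOMS`, and the helper names are all hypothetical, and a real system would use a proper logic or proof assistant rather than single-letter symbols. The point is that every generated chain is correct by construction and can be checked mechanically.

```python
import random

# Hypothetical toy rule base: each rule is (premises, conclusion) over symbols.
RULES = [
    (("a",), "b"),
    (("a", "b"), "c"),
    (("b",), "d"),
    (("c", "d"), "e"),
    (("e",), "f"),
]
AXIOMS = ("a",)


def random_forward_chain(axioms, rules, steps, rng):
    """Apply up to `steps` randomly chosen applicable rules; return the derivation."""
    known = set(axioms)
    chain = []
    for _ in range(steps):
        # A rule is applicable if all premises are known and it derives something new.
        applicable = [r for r in rules if set(r[0]) <= known and r[1] not in known]
        if not applicable:
            break  # nothing new can be derived from this rule base
        premises, conclusion = rng.choice(applicable)
        known.add(conclusion)
        chain.append((premises, conclusion))
    return chain


def is_valid(axioms, chain):
    """Check that every step only uses facts established earlier in the chain."""
    known = set(axioms)
    for premises, conclusion in chain:
        if not set(premises) <= known:
            return False
        known.add(conclusion)
    return True


def render(axioms, chain):
    """Serialize a derivation as plain text that could serve as a training sample."""
    lines = ["axioms: " + " ".join(axioms)]
    lines += [" ".join(p) + " => " + c for p, c in chain]
    return "\n".join(lines)


rng = random.Random(0)
# Request 2-step, 4-step, and 8-step chains; the toy rule base runs dry after 5 steps.
corpus = [random_forward_chain(AXIOMS, RULES, steps, rng) for steps in (2, 4, 8)]
for chain in corpus:
    print(render(AXIOMS, chain))
    print()
```

With a rule base this small the longest request simply stops early once nothing new is derivable; a richer axiom system would not hit that ceiling so quickly.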
vintermann 1 month ago
Yeah, another source of "unlimited data" is genetics. The human reference genome is about 6.5 GB, but these days, they're moving to pangenomes, wanting to map out not just the genome of one reference individual, but all the genetic variation in a clade. Depending on how ambitious they are about that "all", they can be humongous. And unlike, say, video data, this is arguably a language. We're completely swimming in unmapped, uninterpreted language data.
- boppo1 1 month ago
  
  Can you say more?

yourapostasy 1 month ago

Inference leans heavily on GPU RAM and RAM bandwidth for the decode phase, where an increasing share of time is being spent as people find better ways to leverage inference. So NVIDIA customers will arguably demand a different product mix when the market shifts away from the current training-friendly products. I suspect there will be more than enough demand for inference that whatever power is freed by a relative slackening of training demand will be more than made up by the power needed to drive a large inference market.
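
A back-of-envelope way to see why decode leans on memory bandwidth: if generating each token requires reading every model weight from GPU memory once, then bandwidth divided by model size gives a rough ceiling on single-stream tokens per second. The sketch below assumes exactly that and ignores KV-cache reads, batching, and multi-GPU sharding; the function name and all the numbers are hypothetical, chosen only for illustration.

```python
def decode_tokens_per_second(params_billions, bytes_per_param, bandwidth_gb_s):
    """Rough ceiling on single-stream decode speed.

    Assumes each generated token reads every weight from memory exactly once,
    so the memory bus, not arithmetic throughput, is the limiting factor.
    """
    model_gb = params_billions * bytes_per_param  # billions of params * bytes each = GB
    return bandwidth_gb_s / model_gb


# Hypothetical example: 70B parameters, 2 bytes per parameter, 3000 GB/s of bandwidth.
rate_16bit = decode_tokens_per_second(70, 2, 3000)
# Same model at 1 byte per parameter: half the bytes to read, twice the ceiling.
rate_8bit = decode_tokens_per_second(70, 1, 3000)
print(f"16-bit ceiling: {rate_16bit:.1f} tokens/s")
print(f"8-bit ceiling:  {rate_8bit:.1f} tokens/s")
```

Under these assumptions, more bandwidth or fewer bytes per weight raises the ceiling, while extra compute alone does not, which is one reason an inference-heavy market might want a different hardware mix.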

It isn’t the panacea some make it out to be, but there is obvious utility here to sell. The real argument is shifting towards the pricing.

zozbot234 1 month ago

> We will run out of additional material to train on

This sounds a bit silly. More training will generally result in better modeling, even for a fixed amount of genuine original data. At current model sizes, it's essentially impossible to overfit to the training data so there's no reason why we should just "stop".

_0ffh 1 month ago
You'd be surprised how quickly improvement of autoregressive language models levels off with epoch count (though, admittedly, one epoch is a LOT). Diffusion language models otoh indeed keep profiting for much longer, fwiw.
- zozbot234 1 month ago
  
  Does this also apply to LLM training at scale? I would be a bit surprised if it does, fwiw.
  
pvab3 1 month ago
I'm just talking about text generated by human beings. You can keep retraining with more parameters on the same corpus.
https://proceedings.mlr.press/v235/villalobos24a.html
- x-complexity 1 month ago
  
  > I'm just talking about text generated by human beings.
  That in itself is a goalpost shift from
  > > We will run out of additional material to train on
  Where it is implied "additional material" === "all data, human + synthetic"
  ------
  There's still some headroom left in the synthetic data playground, as cited in the paper linked:
  https://proceedings.mlr.press/v235/villalobos24a.html ( https://openreview.net/pdf?id=ViZcgDQjyG )
  "On the other hand, training on synthetic data has shown much promise in domains where model outputs are relatively easy to verify, such as mathematics, programming, and games (Yang et al., 2023; Liu et al., 2023; Haluptzok et al., 2023)."
  With the caveat that translating this success outside of these domains is hit-or-miss:
  "What is less clear is whether the usefulness of synthetic data will generalize to domains where output verification is more challenging, such as natural language."
  The main bottleneck for this area of the woods will be (X := how many additional domains can be made easily verifiable). So long as (the rate of X) >> (training absorption rate), the road can be extended for a while longer.