Comment by SequoiaHope
17 days ago
Run out of training data? They’re going to put these things in humanoids (they are weirdly cheap now), record high-resolution video and other sensor data of real-world tasks, and train huge multimodal vision-language-action models etc.
The world is more than just text. We can never run out of pixels if we point cameras at the real world and move them around.
I work in robotics and I don’t think people talking about this stuff appreciate that text and internet pictures are just the beginning. Robotics is poised to generate and consume TONS of data from the real world, not just the internet.
While we may run out of human written text of value, we won't run out of symbolic sequences of tokens: we can trivially start with axioms and do random forward chaining (or random backward chaining from postulates), and then train models on 2-step, 4-step, 8-step, ... correct forward or backward chains.
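To make the forward-chaining idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the toy axioms `a` and `b`, the eight propositional rules, and the helper names `random_forward_chain`, `is_valid_chain`, and `make_dataset` are all hypothetical. A real setup would use a far richer logic and a proper proof checker.

```python
import random

# Hypothetical toy theory: axioms are atoms, each rule is (premises, conclusion).
AXIOMS = frozenset({"a", "b"})
RULES = [
    (("a",), "c"),
    (("b",), "d"),
    (("a", "b"), "e"),
    (("c", "d"), "f"),
    (("e",), "g"),
    (("f", "g"), "h"),
    (("c",), "i"),
    (("i", "d"), "j"),
]

def random_forward_chain(axioms, rules, steps, rng):
    """Apply up to `steps` randomly chosen applicable rules; return the derivation."""
    known = set(axioms)
    chain = []
    for _ in range(steps):
        # A rule is applicable if all premises are known and it derives something new.
        applicable = [
            (prem, concl) for prem, concl in rules
            if concl not in known and all(p in known for p in prem)
        ]
        if not applicable:
            break  # nothing new can be derived
        prem, concl = rng.choice(applicable)
        known.add(concl)
        chain.append((prem, concl))
    return chain

def is_valid_chain(axioms, chain, rules):
    """Check that every step uses a real rule whose premises were already derived."""
    known = set(axioms)
    rule_set = set(rules)
    for prem, concl in chain:
        if (prem, concl) not in rule_set or not all(p in known for p in prem):
            return False
        known.add(concl)
    return True

def make_dataset(axioms, rules, lengths=(2, 4, 8), per_length=3, seed=0):
    """Build text examples of correct chains at each requested length."""
    rng = random.Random(seed)
    data = []
    for n in lengths:
        for _ in range(per_length):
            chain = random_forward_chain(axioms, rules, n, rng)
            text = " ; ".join(
                f"{' & '.join(prem)} -> {concl}" for prem, concl in chain
            )
            data.append({"steps": len(chain), "text": text})
    return data

if __name__ == "__main__":
    for example in make_dataset(AXIOMS, RULES):
        print(example["steps"], "|", example["text"])
```

Each generated example is correct by construction, since a step is only emitted when its premises are already known, so the supply of training sequences is limited by compute rather than by human-written text.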
Nobody talks about it, but ultimately the strongest driver for terascale compute will be mathematical breakthroughs in cryptography (not brute-forcing keys, but brute-forcing mathematical reasoning).
Yeah, another source of "unlimited data" is genetics. The human reference genome is about 6.5 GB, but these days, they're moving to pangenomes, wanting to map out not just the genome of one reference individual, but all the genetic variation in a clade. Depending on how ambitious they are about that "all", these can be humongous. And unlike, say, video data, this is arguably a language. We're completely swimming in unmapped, uninterpreted language data.
Can you say more?