Comment by SequoiaHope

18 days ago

Run out of training data? They’re going to put these things in humanoids (they are weirdly cheap now) and record high resolution video and other sensor data of real world tasks and train huge multimodal Vision Language Action models etc.

The world is more than just text. We can never run out of pixels if we point cameras at the real world and move them around.

I work in robotics and I don’t think people talking about this stuff appreciate that text and internet pictures is just the beginning. Robotics is poised to generate and consume TONS of data from the real world, not just the internet.

3 comments

SequoiaHope

DoctorOetker 18 days ago

While we may run out of human written text of value, we won't run out of symbolic sequences of tokens: we can trivially start with axioms and do random forward chaining (or random backward chaining from postulates), and then train models on 2-step, 4-step, 8-step, ... correct forward or backward chains.

Nobody talks about it, but ultimately the strongest driver for terrascale compute will be for mathematical breakthroughs in crypography (not bruteforcing keys, but bruteforcing mathematical reasoning).

vintermann 18 days ago

Yeah, another source of "unlimited data" is genetics. The human reference genome is about 6.5 GB, but these days, they're moving to pangenomes, wanting to map out not just the genome of one reference individual, but all the genetic variation in a clade. Depending on how ambitious they are about that "all", they can be humongous. And unlike say video data, this is arguably a language. We're completely swimming in unmapped, uninterpreted language data.

boppo1 18 days ago

Can you say more?