
Comment by visarga

2 years ago

I have a strong intuition that chat logs are actually the most useful kind of data. They contain many LLM outputs followed by implicit or explicit feedback: from humans, from the real world, and from code execution. Scaling that feedback to 180M users and a trillion interactive tokens per month, as OpenAI has, is a big deal.
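As a minimal sketch of what such a feedback-annotated record could look like (every field name here is a hypothetical illustration, not any actual OpenAI schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChatTurn:
    """One LLM output plus whatever feedback followed it (hypothetical)."""
    prompt: str                          # what the user asked
    llm_output: str                      # what the model answered
    thumbs_up: Optional[bool] = None     # explicit feedback (a rating widget)
    user_followup: Optional[str] = None  # implicit feedback ("no, I meant...")
    code_ran_ok: Optional[bool] = None   # feedback from executing generated code

def has_feedback(turn: ChatTurn) -> bool:
    """A turn only carries training signal if some feedback follows it."""
    return any(s is not None for s in
               (turn.thumbs_up, turn.user_followup, turn.code_ran_ok))
```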

Except LLMs are a distraction from AGI

  • If a brain without language sufficed, a single human could rediscover everything we know on their own. But it's not like that: brains are feeble individually; only in societies do we get cultural evolution. If humanity lost language and culture and started from scratch, it would take us another 300K years to rediscover what we lost.

    But if you train a randomly initialized LLM on that same data, it responds (almost) like a human across a diversity of tasks. Does that imply humans are just language models on two feet? Maybe we are also language-modelling our way through life: a new situation comes up, we generate candidate ideas in language, select among them based on personal experience, then act and observe the outcomes to update our preferences for the future.

  • That doesn't necessarily imply that chat logs are not valuable for creating AGI.

    You can think of LLMs as devices that prompt humans to process input with their meat brains and produce machine-readable output. The fact that the input was LLM-generated isn't necessarily a problem; it is clearly effective at prodding humans to respond. You're training on the human outputs, not the LLM inputs. (Well, more likely on the edge from LLM input to human output, but close enough; see the sketch below.)
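A hedged sketch of what training "on the edge" might mean, reusing the hypothetical `ChatTurn` record above: each (LLM output → human follow-up) pair becomes one supervised example, with the model's message as context and the human's reply as target. This is an illustration of the idea, not any vendor's actual pipeline.

```python
def edges_for_training(log: list[ChatTurn]) -> list[tuple[str, str]]:
    """Extract (LLM input -> human output) edges from a chat log:
    the model's message is the context, the human's reply is the
    supervision target."""
    return [(turn.llm_output, turn.user_followup)
            for turn in log
            if turn.user_followup is not None]
```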

Yeah, similar to how Google's clickstream data makes its lead in search self-reinforcing. But chat data isn't the only kind of data. Multimodal data will be next, and after that, robotics.