Comment by tovej
19 days ago
There's plenty of faults in this idea.
First, the subjectivity of language.
1) People only speak or write down information that needs to be added to a base "world model" that a listener or receiver already has. This context is extremely important to any form of communication and is entirely missing when you train a pure language model. The subjective experience required to parse the text is missing.
2) When people produce text, there is always a motive to do so which influences the contents of the text. This subjective information component of producing the text is interpreted no different from any "world model" information.
A world model should be as objective as possible. Using language, the most subjective form of information is a bad fit.
The other issue in this argument is that you're inverting the implication. You say an accurate world model will produce the best word model, but then suddenly this is used to imply that any good word model is a useful world model. This does not compute.
> People only speak or write down information that needs to be added to a base "world model" that a listener or receiver already has
Which companies try to address with image, video and 3d world capabilities, to add that missing context. "Video generation as world simulators" is what OpenAI once called it
> When people produce text, there is always a motive to do so which influences the contents of the text. This subjective information component of producing the text is interpreted no different from any "world model" information.
Obviously you need not only a model of the world, but also of the messenger, so you can understand how subjective information relates to the speaker and the world. Similar to what humans do
> The other issue in this argument is that you're inverting the implication. You say an accurate world model will produce the best word model, but then suddenly this is used to imply that any good word model is a useful world model. This does not compute
The argument is that training neural networks with gradient descent is a universal optimizer. It will always try to find weights for the neural network that cause it to produce the "best" results on your training data, in the constraints of your architecture, training time, random chance, etc. If you give it training data that is best solved by learning basic math, with a neural architecture that is capable of learning basic math, gradient descent will teach your model basic math. Give it enough training data that is best solved with a solution that involves building a world model, and a neural network that is capable of encoding this, then gradient descent will eventually create a world model.
Of course in reality this is not simple. Gradient descent loves to "cheat" and find unexpected shortcuts that apply to your training data but don't generalize. Just because it should be principally possible doesn't mean it's easy, but it's at least a path that can be monetized along the way, and for the moment seems to have captivated investors
You did not address the second issue at all. You are inverting the implication in your argument. Whether gradient descent helps solve the language model problem or not does not help you show that this means it's a useful world model.
Let me illustrate the point using a different argument with the same structure: 1) The best professional chefs are excellent at cutting onions 2) Therefore, if we train a model to cuy onions using gradient descent, that model will be a very good profrssional chef
2) clearly does not follow from 1)
I think the commenter is saying that they will combine a world model with the word model. The resulting combination may be sufficient for very solid results.
Note humans generate their own non-complete world model. For example there are sounds and colors we don’t hear or see. Odors we don’t smell. Etc…. We have an incomplete model of the world, but we still have a model that proves useful for us.
6 replies →