
Comment by cs702

1 year ago

I agree with you, but your comment strikes me as unfair nitpicking, because the OP is referring to information that has been encoded in words.

We learn ideas from each mode of input. Then one mode can elaborate on data learned from another mode; they build on each other.

From there, remember that text is usually a reflection of things in the real world. Understanding those things in non-textual ways both gives meaning to the text and deepens our understanding of it. Much of the text itself was even stored in other modes, like markup or PDFs, whose structure tells us things about it.

That we learn multimodally from birth is therefore an important point to make.

It might also be a prerequisite for AGI. It could be one of the fundamental laws of information theory, or something like it. Text alone might not be enough, much as digital devices need analog to interface with the real world.

I understand that's the context, but I'm not sure that it's unfair nitpicking. It's common to talk about training data and how poor LLMs are compared to humans despite the apparently larger dataset than any human could absorb in a lifetime. The argument is just wrong because it doesn't properly quantify the dataset size, and when you do, you actually conclude the opposite: it's astounding how good LLMs are despite their profound disadvantage.
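To make that concrete, here's a rough back-of-envelope sketch in Python. Every number in it is a loose, illustrative assumption rather than a measurement: a text corpus on the order of 15 trillion tokens at roughly 4 bytes per token, and a commonly cited ~10 Mbit/s estimate of the information rate leaving each retina, counted over 16 waking hours a day for 20 years.

```python
# Rough back-of-envelope comparison (illustrative numbers only, not measurements).

# Assumption: a large LLM text corpus on the order of 15 trillion tokens,
# at roughly 4 bytes of raw text per token.
llm_tokens = 15e12
bytes_per_token = 4
llm_bytes = llm_tokens * bytes_per_token  # ~60 TB of raw text

# Assumption: visual input alone, using a commonly cited estimate of
# ~10 Mbit/s of information leaving each retina, two eyes,
# ~16 waking hours a day, over 20 years.
retina_bits_per_sec = 10e6
eyes = 2
waking_seconds_per_day = 16 * 3600
days = 20 * 365
human_visual_bytes = retina_bits_per_sec * eyes * waking_seconds_per_day * days / 8

print(f"LLM text corpus:     ~{llm_bytes / 1e12:.0f} TB")
print(f"Human visual input:  ~{human_visual_bytes / 1e12:.0f} TB over 20 years")
print(f"Ratio (human / LLM): ~{human_visual_bytes / llm_bytes:.0f}x")
```

Even counting only vision, and only raw bits rather than useful information, the human sensory stream comes out more than an order of magnitude larger than the text corpus under these assumptions, which is the sense in which LLMs are working at a data disadvantage.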

  • > I understand that's the context, but I'm not sure that it's unfair nitpicking.

    The OP is about much more than that, and taken as a whole, suggests the author is well aware that human beings absorb a lot more data from multiple domains. It struck me as unfair to criticize one sentence out of context while ignoring the rest of the OP.

    > It's common to talk about training data and how poor LLMs are compared to humans despite the apparently larger dataset than any human could absorb in a lifetime.

    Thank you. Like I said, I agree. My sense is the author would agree too.

    It's possible that, to overcome some of the limits we're starting to see, AI models will need to absorb a giant, endless, torrential stream of non-textual, multi-domain data, as people do.

    At the moment, we don't know.