Comment by yalogin

3 months ago

I don’t quite follow. The way I see it, what the LLM “reads” depends on the input modality. If the input comes from a human, it will be in text form; it has to be. If the input comes through a camera, then yes, even text will arrive as camera frames and pixels, and that is how I expect LLMs to process it. So I would expect a vision LLM to already be doing this.

> If the input comes from a human, it will be in text form; it has to be.

Why can't it be a sequence of audio waveforms from human speech?