Comment by yalogin
3 months ago
I don’t quite follow. The way I see it, what the LLM “reads” depends on the input modality. If the input is from a human, it will be in text form; it has to be. If the input comes through a camera then yes, even text will arrive as camera frames and pixels, and that is how I expect the LLM to process it. So I would expect a vision LLM to already be doing this.
> If the input is from a human, it will be in text form; it has to be.
Why can't it be a sequence of audio waveforms from human speech?