Comment by vagab0nd

3 months ago

Human perception is essentially 2D+depth. Shouldn't we be feeding transformers 2D data, say through a convolutional front-end? Instead of tokenizing images, shouldn't we be rendering text as images?
