Comment by vagab0nd
3 months ago
Human perception is essentially 2D plus depth. Shouldn't we be feeding transformers 2D data, for example through a convolutional front-end? Instead of tokenizing images, shouldn't we be rendering text?