Comment by altruios

1 hour ago

Can you elaborate on what a token looks like for a pixel patch, a sound, or a general signal in this model as it currently stands?

My understanding of the pixel representation is this: the image is sliced into a grid, and each square slice gets projected into a numeric array of length x (I'm not sure how long x is, or whether it's variable). That array then gets projected down to a token representing that space (3-4 alphanumeric characters long), and is AGAIN passed into a "position detector" that outputs a token representing that pixel/position. The result gets passed into the lmm as a significantly reduced/translated signal in token space.

First, before continuing: do I have that mostly correct?
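
To make the question concrete, here is a rough Python sketch of the pipeline as I described it above. Everything in it is an assumption for illustration, not how this model actually works: the sizes `PATCH`, `DIM`, and `CODES`, the random projection weights, the nearest-neighbour codebook lookup standing in for "projected down to a token", and the plain patch index standing in for the "position detector" are all hypothetical.

```python
import random

def patchify(image, patch):
    """Slice an H x W grayscale image (list of rows) into flattened patch x patch squares."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            flat = [image[top + r][left + c]
                    for r in range(patch) for c in range(patch)]
            patches.append(flat)
    return patches

def project(vec, weights):
    """Linear projection: each row of weights yields one output component."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def nearest_code(vec, codebook):
    """Index of the closest codebook entry by squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(vec, code)) for code in codebook]
    return dists.index(min(dists))

rng = random.Random(0)
PATCH, DIM, CODES = 4, 8, 16  # hypothetical sizes, chosen only for the demo

# Stand-in data: an 8 x 8 random "image", random projection, random codebook.
image = [[rng.random() for _ in range(8)] for _ in range(8)]
weights = [[rng.gauss(0, 1) for _ in range(PATCH * PATCH)] for _ in range(DIM)]
codebook = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(CODES)]

patches = patchify(image, PATCH)                    # grid of square slices
embeddings = [project(p, weights) for p in patches]  # "number array of length x"
# Pair each patch's position in the grid with its discrete code id.
tokens = [(pos, nearest_code(e, codebook)) for pos, e in enumerate(embeddings)]
print(tokens)
```

In this sketch x is fixed at `DIM` and each patch ends up as a (position, code id) pair. Whether the real model has a discrete code step like this at all, or passes the continuous array straight through, is part of what I'm asking.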