← Back to context

Comment by zozbot234

3 hours ago

BTW, I forgot to mention that you can make this work in a way, but only if your model architecture generalizes the context and attention mechanism such that it's no longer a pure sequence. So you could have a large amount of distinct "early" token sequences, with each being self-contained and not depending on any other tokens, e.g. your source code files might be such. Then later parts of the context would of course depend on all of those files as usual. This makes prefill for the earlier context both reusable and cheaply recomputable throughout, at the cost of losing some dependencies that would've been previously accounted for: your model becomes faster and more efficient, but perhaps not quite as smart.