Comment by Imanari

1 month ago

Fascinating! I wonder if new training techniques could emerge from this. If we say layer 1 = translator, layers 2-5 = reasoner, and layer 6 = retranslator, could we train small 6-layer models but evaluate their performance in a 1 > n*(2-5) > 6 setup, to directly train towards optimal middle layers that can be looped? You'd only have to train 6 layers but get the duplication benefit of the middle layers for free.

Yes, training directly for a diverse mix of "looped" inference procedures makes a lot of sense as a way of allowing for increased inference-time compute. It would likely be complementary to the usual thinking approach, which essentially runs the "loop" across the whole LLM and, critically, yields interpretable output that lets us see what the LLM is thinking about.
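
To make the 1 > n*(2-5) > 6 idea concrete, here is a minimal sketch of the layer schedule and of sampling a different loop count per training step. Everything in it is an assumption for illustration: the names `looped_schedule`, `forward`, and `sample_loop_count` are hypothetical, the "layers" are plain functions on a float rather than real transformer blocks, and uniform sampling of the loop count is just one possible way to get a "diverse mix".

```python
import random
from typing import Callable, List, Sequence

# Toy stand-in for a transformer block: any function from state to state.
Layer = Callable[[float], float]


def looped_schedule(num_layers: int, n_loops: int) -> List[int]:
    """Return 0-indexed layer order: first layer, middle layers n_loops times, last layer."""
    if num_layers < 3:
        raise ValueError("need at least a first, a middle, and a last layer")
    if n_loops < 1:
        raise ValueError("n_loops must be >= 1")
    middle = list(range(1, num_layers - 1))
    return [0] + middle * n_loops + [num_layers - 1]


def forward(layers: Sequence[Layer], x: float, n_loops: int) -> float:
    """Apply the layers in looped order; the middle weights are reused, not copied."""
    for idx in looped_schedule(len(layers), n_loops):
        x = layers[idx](x)
    return x


def sample_loop_count(rng: random.Random, max_loops: int) -> int:
    """Pick a loop count for one training step (assumed: uniform over 1..max_loops)."""
    return rng.randint(1, max_loops)


if __name__ == "__main__":
    print(looped_schedule(6, 2))  # [0, 1, 2, 3, 4, 1, 2, 3, 4, 5]
    rng = random.Random(0)
    toy_layers = [lambda v: v + 1.0] * 6
    n = sample_loop_count(rng, max_loops=4)
    print(n, forward(toy_layers, 0.0, n))
```

In a real model the same schedule would index into a list of blocks, so only 6 layers' worth of parameters are trained while the effective depth at inference is 2 + 4n.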