Comment by koenschipper
1 month ago
This is an incredibly elegant hack. The finding that it only works with "circuit-sized" blocks of ~7 layers is fascinating. It really makes you wonder how much of a model's depth is just routing versus actual discrete processing units.
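For anyone following along, here's roughly how I picture the duplication step working. This is a toy sketch, not the author's code: the `Layer` class, the 32-layer depth, and the choice to centre the block are all my own assumptions.

```python
import copy

class Layer:
    """Stand-in for a transformer decoder layer; idx records its original position."""
    def __init__(self, idx):
        self.idx = idx

def duplicate_middle_block(layers, block_size=7):
    """Return a new layer list with a contiguous middle block repeated in place.

    The block is inserted immediately after the original copy, so the
    duplicated layers run back-to-back mid-stack.
    """
    start = (len(layers) - block_size) // 2           # centre the block (my assumption)
    end = start + block_size
    copies = [copy.deepcopy(l) for l in layers[start:end]]  # independent weight copies
    return layers[:end] + copies + layers[end:]

model = [Layer(i) for i in range(32)]                 # e.g. a 32-layer model
expanded = duplicate_middle_block(model)
print(len(expanded))                                  # → 39 layers after duplication
print([l.idx for l in expanded[12:20]])               # → [12, 13, 14, 15, 16, 17, 18, 12]
```

In a real model you'd do the same thing to something like `model.model.layers` (a `ModuleList` in most HF-style implementations), with the caveat that deep-copied layers share no parameters with the originals, so memory grows accordingly.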
I spend a lot of time wrestling with smaller LLMs for strict data extraction and JSON formatting. Have you noticed if duplicating these specific middle layers boosts a particular type of capability?
For example, does the model become more obedient to system prompts/strict formatting, or is the performance bump purely in general reasoning and knowledge retrieval?
Amazing work doing this on a basement 4090 rig!