Comment by BrownSol

1 month ago

By far one of the most interesting blogs I’ve read in a long while. I’m curious if you could combine this with Karpathy’s auto research to find the best combination of layer duplication. The callout to model merging in 2024 was funny… around that time I became friendly with RomboDawg on HF who had the best merged coding models around and created a couple of Frankenstein models myself.

I say this naively as I’m not that familiar with how transformers work under the hood, but I wonder if you could combine the two approaches in a coherent way. Frankenmerges were often done crudely, just smooshing things together, but knowing how the layers work under the hood, I wonder if there’s a more intelligent way to combine merging and layer duplication to create even better performers.