Comment by mysteria

1 month ago

The astounding thing about Goliath wasn’t that it was a huge leap in performance, it was that the damn thing functioned at all. To this day, I still don’t understand why this didn’t raise more eyebrows.

This wasn't something I dug into in great detail, but I remember my surprise back then that all those merged models, and "expanded" models like Goliath, still generated coherent output. IMO those were community models made by small creators for entertainment rather than work, and only really of interest to the local LLM groups on Reddit, 4chan, and Discord. People might briefly discuss one on the board and say "that's cool," but papers aren't being written about them, so it's less likely that academics or corporate researchers notice.

That being said, I wonder if it's possible to combine the layers of two completely different models, say a Llama and a Qwen, and still get it to work.

Even with math probes, I hit unexpected problems. LLMs fail arithmetic in weird ways. They don’t get the answer wrong so much as get it almost right but forget to write the last digit, as if it got bored mid-number. Or they transpose two digits in the middle. Or they output the correct number with a trailing character that breaks the parser.

Would using grammar parsing help here by forcing the LLM to only output the expected tokens (i.e. numbers)? Or maybe on the scoring side you could look at the actual probabilities per token to see how far the correct digit is.
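The logit-masking idea behind grammar-constrained decoding can be sketched without any particular inference library: zero out (set to -inf) every vocabulary entry that isn't an allowed token before the softmax, so the model can only ever emit digits. This is a minimal numpy toy; the vocab size and the assumption that ids 0-9 are the digit tokens are made up for illustration, not taken from any real tokenizer:

```python
import numpy as np

def mask_to_allowed(logits, allowed_ids):
    """Set every logit outside the allowed set to -inf, so softmax
    assigns those tokens exactly zero probability."""
    masked = np.full_like(logits, -np.inf)
    masked[allowed_ids] = logits[allowed_ids]
    return masked

# Toy vocabulary: pretend ids 0-9 are the digit tokens (assumption).
vocab_size = 50
digit_ids = list(range(10))

rng = np.random.default_rng(0)
logits = rng.normal(size=vocab_size)  # stand-in for one decoding step

masked = mask_to_allowed(logits, digit_ids)
probs = np.exp(masked - masked.max())
probs /= probs.sum()

# Only digit tokens can ever be sampled now; non-digits have prob 0.
assert probs[10:].sum() == 0.0
```

The same per-token probabilities also give you the scoring-side option: instead of parsing the final string, read off `probs[correct_digit_id]` at each step and measure how far the model was from the right digit.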

I think the main challenge with combining layers of different models would be their differing embedding sizes and potentially different vocabularies.

Even between two models of identical architecture, they may have landed on quite different internal representations if the training data recipe was substantially different.

But it would be fun to experiment with.

  • Even with the same embedding sizes and vocabularies, there’s nothing that forces dimension 1 of model 1 to mean the same thing as dimension 1 of model 2. There are lots of ways to permute the dimensions of a model without changing its output, so whatever dimension 1 means the first time you train a model is just as likely to end up as dimension 2 the second time you train it as it is to stay consistent with the first model.

    Nobody here or on Reddit has mentioned this, maybe bc it’s too obvious, but it’s clear to me that the residual connections are an absolutely necessary component to making this merging possible — that’s the only reason dimension 1 of a later layer is encouraged to mean something similar to dimension 1 of an earlier layer.

    • On a related note - would it be easier, instead of doing a benchmark sweep across the whole NxN set of start-end pairs for which layers to modify, to instead measure cross-correlation between outputs of all layers? Shouldn't that produce similar results?
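The permutation-symmetry point above is easy to demonstrate concretely. A transformer layer is overkill for this, so here's a tiny two-layer MLP in numpy as a stand-in: shuffle the hidden dimensions (rows of the first weight matrix and bias, and, consistently, the columns of the second), and the function computed is bit-for-bit unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny two-layer MLP: y = W2 @ relu(W1 @ x + b1) + b2
d_in, d_hidden, d_out = 4, 8, 3
W1 = rng.normal(size=(d_hidden, d_in))
b1 = rng.normal(size=d_hidden)
W2 = rng.normal(size=(d_out, d_hidden))
b2 = rng.normal(size=d_out)

def mlp(x, W1, b1, W2, b2):
    h = np.maximum(W1 @ x + b1, 0.0)  # ReLU hidden layer
    return W2 @ h + b2

# Permute the hidden dimensions: reorder rows of W1/b1 and the
# matching columns of W2. Hidden unit i just becomes hidden unit
# perm[i]; nothing about the input-output mapping changes.
perm = rng.permutation(d_hidden)
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

x = rng.normal(size=d_in)
assert np.allclose(mlp(x, W1, b1, W2, b2), mlp(x, W1p, b1p, W2p, b2))
```

Train two models from different seeds or data and you effectively get two different, incompatible "perms" of each other's hidden dimensions, which is exactly why naively stacking layers from unrelated models shouldn't work without the residual stream nudging dimensions into a shared basis.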

It’s a good spot for hobbyists to fill in the gaps. Maybe it’s not interesting enough for academics to study, and corporate ML teams would probably just fine-tune something that exists rather than spend time on surgery. Even Chinese labs, which are more resource-constrained, don’t care much about 4090-scale models.

It's still non-trivial, as multi-digit numbers can be built from a huge number of combinations of valid tokens.
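To see how fast those combinations blow up, here's a toy count of the ways a digit string can be segmented into tokens, assuming (purely for illustration, not from any real tokenizer) that every 1- to 3-digit chunk is a valid token, roughly like BPE vocabularies that contain many multi-digit pieces:

```python
def count_tokenizations(s, max_token_len=3):
    """Count the ways to split s into chunks of up to max_token_len
    characters, via simple dynamic programming."""
    ways = [0] * (len(s) + 1)
    ways[0] = 1  # one way to tokenize the empty prefix
    for i in range(1, len(s) + 1):
        # The last token can be the final 1..max_token_len chars.
        for k in range(1, min(max_token_len, i) + 1):
            ways[i] += ways[i - k]
    return ways[len(s)]

print(count_tokenizations("12"))          # → 2  ("1"+"2" or "12")
print(count_tokenizations("1234567890"))  # → 274
```

So even a ten-digit answer has hundreds of token-level spellings under this toy vocab, which is why a per-token scoring scheme has to be careful about which segmentation the model actually chose.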

The code in the blog helps derive useful metrics from partial answers.