
Comment by gnulinux

8 months ago

I'm curious whether this would also improve small local models. E.g. if I "alloy" Qwen3-8B and OpenThinker-7B, is it going to be "better" than each model on its own? I'll try testing this on my M1 Pro.

Would it really matter? Normally you use those small local models because you don't have the memory to spare for a larger model, so the real question would be: Is an alloy of Qwen3-8B and OpenThinker-7B better than a Qwen3-15B?

Beyond a certain smallness threshold it might also work to constantly swap the models in and out of memory, but I doubt that's a great experience to build on top of.

  • If it proved correct, it'd be an important insight. If you can run three low-inference-cost models and get performance comparable to a single paid frontier model in agentic workflows, that would suggest something general about the way model performance scales.

    If your product is "good enough" with the current generation of models, you could cut OpenAI/Anthropic/Google out of the loop entirely by using open source & low-cost models.

    • I don't think an alloy can be as good as a larger model in general, though perhaps in special cases it can be.

      Say that you want to translate a string from English to language X. Models A and B, having fewer parameters to spare, have less knowledge of language X. Model C, a larger model, has better knowledge of language X. No matter how A and B collude, they will not exceed the performance of model C.

  • Yes, it would matter. If you only have the budget to run an 8B model and it's sufficient for the easy problems you have, a better 8B model with the same spec requirements is necessarily better regardless of how it compares to some other model. I have tons of problems I throw a specifically sized model at.

    • > a better 8B model with the same spec requirements

      It's not quite the same spec requirements though. When using an alloy, you need double the disk space (not a huge deal on desktop, but it is on mobile) and significantly higher latency (since you swap the models in/out between every turn), and you can only apply it to multi-turn conversations or sufficiently decomposable problems.

  • Haha, so every question involves multiple 10 GB writes to the disk. I think the cost of new SSDs would be less than the cost of more memory, even in the short term.

    • Were you replying to the right comment? (Though I also don't see another comment where what you are saying makes sense.)
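
The per-turn swapping idea from the thread can be sketched as a small toy: an "alloy" that answers each turn with a different member model, keeping only one set of weights resident at a time. This is a minimal illustration, not a real inference API; the model names, the `load`/`unload` calls, and the `StubModel` class are all hypothetical stand-ins.

```python
import itertools

class StubModel:
    """Hypothetical stand-in for a local model whose weights
    can be paged in and out of memory."""

    def __init__(self, name):
        self.name = name
        self.loaded = False

    def load(self):
        # Stands in for reading ~10 GB of weights off disk (the
        # latency cost discussed above).
        self.loaded = True

    def unload(self):
        # Stands in for freeing the weights to make room.
        self.loaded = False

    def generate(self, prompt):
        return f"[{self.name}] reply to: {prompt}"


def alloy_chat(models, prompts):
    """Round-robin the member models across turns, swapping
    weights in and out between turns."""
    replies = []
    picker = itertools.cycle(models)
    current = None
    for prompt in prompts:
        chosen = next(picker)
        if current is not chosen:
            if current is not None:
                current.unload()
            chosen.load()
            current = chosen
        replies.append(current.generate(prompt))
    return replies


models = [StubModel("Qwen3-8B"), StubModel("OpenThinker-7B")]
print(alloy_chat(models, ["q1", "q2", "q3"]))
```

Every change of speaker triggers a full load, which is why the comment above argues this only makes sense for multi-turn or decomposable problems: a single-turn question pays the swap cost without any of the ensemble benefit.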