Comment by bachittle

1 day ago

I’m fascinated by this new paradigm. We’ve more or less perfected Mixture-of-Experts inside a single model, where routing happens between subnetworks. What GPT-5 auto (and this paper) are doing is a step further: “LLM routing” across multiple distinct models. It’s still rough right now, but it feels inevitable that this will get much better over time.

> It’s still rough right now, but it feels inevitable that this will get much better over time.

Yeah, the signals they gather will improve things over time. You can do a lot of heavy lifting with embedding models nowadays: get "satisfaction" signals from chats and adjust your router based on them. It will be weird at first and some people will complain, but at the end of the day, you don't need IMO-gold levels of reasoning to write a fitness plan that the user most likely won't even follow :)
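
A minimal sketch of that loop, assuming a toy `embed()` placeholder and invented model names; the router escalates prompts that look similar to known-hard ones, and satisfaction feedback nudges the threshold:

```python
# Hypothetical sketch: route requests to a cheap or an expensive model using
# prompt embeddings, then nudge the router with user "satisfaction" signals.
# embed() and the model names are stand-ins, not a real API.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: in practice this would call a real embedding model.
    # Here: a deterministic toy vector derived from the text's hash.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

# Seed examples of prompts that historically needed heavy reasoning.
HARD_EXAMPLES = [embed(p) for p in (
    "prove this combinatorics identity",
    "debug this race condition in my lock-free queue",
)]

threshold = 0.5  # adjusted online from feedback

def route(prompt: str) -> str:
    sim = max(float(v @ embed(prompt)) for v in HARD_EXAMPLES)
    return "big-reasoning-model" if sim > threshold else "small-fast-model"

def record_feedback(prompt: str, satisfied: bool) -> None:
    # Crude online update: if users were unhappy with the cheap model's
    # answer, lower the bar for escalating similar prompts next time.
    global threshold
    if not satisfied and route(prompt) == "small-fast-model":
        threshold -= 0.01
    elif satisfied and route(prompt) == "big-reasoning-model":
        threshold += 0.001  # slowly reclaim traffic for the cheap model

print(route("write me a fitness plan"))  # most likely: small-fast-model
```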

Signal gathering is likely the driver of most of the subsidised model offerings we see today.

I wish this could be taken even further, with a big model built as a network of many small, specialized models.

And then maybe you could customize and optimize your own model for local use, almost like mixing and matching different modules. It would be nice to have a model that only knows and does what you need it to.

  • A Team-as-a-Service? It would be interesting to create a Python script that acts like a team of sales, project management, and engineering working together, with a telemetry and KPI dashboard on top. If not to deliver anything useful, then as a learning tool for project-management frameworks.
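
    A toy sketch of that script, with the model call stubbed out so it runs as-is; the roles, the `call_llm` stub, and the KPI counters are all invented for illustration:

    ```python
    # Toy "team" of role-playing agents with a trivial KPI/telemetry layer.
    # call_llm() is a stub standing in for a real model API; nothing here
    # is a real framework.
    import time
    from collections import defaultdict

    KPIS = defaultdict(lambda: {"calls": 0, "seconds": 0.0})

    def call_llm(role: str, prompt: str) -> str:
        # Stub: a real version would hit a model endpoint here.
        return f"[{role}] draft response to: {prompt[:40]}"

    def agent(role: str, task: str) -> str:
        start = time.perf_counter()
        out = call_llm(role, f"You are the {role} lead. Task: {task}")
        KPIS[role]["calls"] += 1
        KPIS[role]["seconds"] += time.perf_counter() - start
        return out

    def run_team(task: str) -> None:
        pitch = agent("sales", task)
        plan = agent("project management", f"Turn this pitch into milestones: {pitch}")
        build = agent("engineering", f"Implement milestone 1 of: {plan}")
        print(build)
        # "Dashboard": just print the counters.
        for role, stats in KPIS.items():
            print(f"{role:20s} calls={stats['calls']} time={stats['seconds']:.4f}s")

    run_team("launch a CLI tool for tracking gym workouts")
    ```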

Does this have a compute benefit, or could one use different specialized LLM architectures/models for the subnetworks?

I'd actually bet against this. The "bitter lesson" suggests doing things end-to-end in-model will (eventually, with sufficient data) outcompete building things outside of models.

My understanding is that GPT-5 already does this by varying the amount of chain-of-thought it performs (in addition to the kind of super-model-level routing described in the post), and I strongly suspect it's only going to get more sophisticated.
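
Nobody outside the lab knows the actual mechanism, but conceptually a per-request thinking budget could look something like this; the difficulty heuristic, budgets, and keyword list are all invented:

```python
# Illustrative only: a per-request "thinking budget" chosen by a cheap
# difficulty estimate. Real systems presumably learn this from data
# rather than keyword-matching.
def estimate_difficulty(prompt: str) -> float:
    hard_markers = ("prove", "optimize", "debug", "derive")
    return min(1.0, 0.2 + 0.4 * sum(m in prompt.lower() for m in hard_markers))

def thinking_budget(prompt: str) -> int:
    # Map difficulty to a max reasoning-token budget.
    d = estimate_difficulty(prompt)
    if d < 0.3:
        return 0          # answer directly, no CoT
    elif d < 0.7:
        return 1024       # brief reasoning
    return 16384          # long deliberation

print(thinking_budget("write me a fitness plan"))                       # 0
print(thinking_budget("prove this inequality, then derive the bound"))  # 16384
```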

  • The bitter-lesson version of this strategy would be to implement heterogeneous experts inside an MoE architecture, so that the model automatically chooses its number of active parameters by routing to wider experts when a token needs them (see the sketch after this comment).

    This approach is much more efficient than the one in the paper from this HN submission, because request-based routing forces you to recompute the KV cache from scratch every time you switch from model to model.
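
    A minimal sketch of heterogeneous experts, assuming PyTorch; the layer, the widths, and the top-1 routing are illustrative, not a reference implementation:

    ```python
    # An MoE layer whose experts share input/output dims but differ in
    # hidden width, so the router's top-1 choice decides how many
    # parameters a token activates.
    import torch
    import torch.nn as nn

    class HeteroMoE(nn.Module):
        def __init__(self, d_model: int = 256, widths=(128, 512, 2048)):
            super().__init__()
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, w), nn.GELU(), nn.Linear(w, d_model))
                for w in widths
            )
            self.router = nn.Linear(d_model, len(widths))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (tokens, d_model). Top-1 routing per token.
            logits = self.router(x)
            choice = logits.argmax(dim=-1)        # (tokens,)
            gate = torch.softmax(logits, dim=-1)
            out = torch.zeros_like(x)
            for i, expert in enumerate(self.experts):
                mask = choice == i
                if mask.any():
                    # Easy tokens hit the narrow expert, hard tokens the
                    # wide one; all inside one model, so the surrounding
                    # attention layers' KV cache is shared either way.
                    out[mask] = gate[mask, i:i+1] * expert(x[mask])
            return out

    layer = HeteroMoE()
    print(layer(torch.randn(4, 256)).shape)  # torch.Size([4, 256])
    ```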