Comment by akavi
20 hours ago
I'd actually bet against this. The "bitter lesson" suggests that doing things end-to-end in-model will (eventually, with sufficient data) outcompete building things outside the model.
My understanding is that GPT-5 already does this by varying the amount of CoT it does (in addition to the kind of above-the-model routing described in the post), and I strongly suspect it's only going to get more sophisticated.
The bitter-lesson-style strategy would be to implement heterogeneous experts inside an MoE architecture, so that the model automatically chooses the number of active parameters by routing to experts with more or fewer parameters (see the sketch below).
This approach is much more efficient than the one in the paper from this HN submission, because request-based routing forces you to recalculate the KV cache from scratch every time you switch models.
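To make the idea concrete, here's a minimal sketch of what a "heterogeneous MoE" layer could look like. This is my own illustration, not from the paper or the post: the hidden sizes, top-1 routing, and class name are arbitrary choices. The point is just that experts share the same input/output width but differ in hidden width, so the router's per-token choice determines how many parameters are active, all inside one model with one shared KV cache.

    # Illustrative sketch only: experts of different capacity behind one router.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HeterogeneousMoE(nn.Module):
        def __init__(self, d_model=512, hidden_sizes=(256, 1024, 4096)):
            super().__init__()
            # Experts of increasing capacity: cheap, medium, large FFNs.
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, h), nn.GELU(), nn.Linear(h, d_model))
                for h in hidden_sizes
            ])
            # Router scores each token against each expert (top-1 routing here).
            self.router = nn.Linear(d_model, len(hidden_sizes))

        def forward(self, x):                      # x: (tokens, d_model)
            probs = F.softmax(self.router(x), dim=-1)
            choice = probs.argmax(dim=-1)          # which expert each token uses
            out = torch.zeros_like(x)
            for i, expert in enumerate(self.experts):
                mask = choice == i
                if mask.any():
                    # Scale by the router probability so the router still gets a gradient.
                    out[mask] = expert(x[mask]) * probs[mask, i].unsqueeze(-1)
            return out

    x = torch.randn(8, 512)
    y = HeterogeneousMoE()(x)
    print(y.shape)  # torch.Size([8, 512])

In this toy version, a token routed to the 4096-wide expert activates roughly 16x more FFN parameters than one routed to the 256-wide expert, but the decision happens per token inside the forward pass, so nothing has to be re-prefilled when the "difficulty" of the request changes.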