
Comment by fennecfoxy

1 month ago

I found this super interesting! Excellent writing! And I loved the cowboy quote, that was the best part; poor thing.

Now it's making me wonder: instead of smashing things together more violently for MoE-type stuff, perhaps it's more effective to create better toolsets that let us analyse smaller models.

Then small models could be trained (faster and cheaper) to be excellent at very specific tasks or domains, the toolset could be used to identify the organs and the organ-selection layers, and a larger Frankenstein's monster model could be stitched together from those organs, with perhaps a little extra training/fine-tuning to improve its organ-selection abilities.

That makes me imagine some sort of future of layer standardisation, in which, for a standard and optimised architecture, sets of layers can be dynamically downloaded, added, swapped out, etc. to maintain the fastest inference speed whilst allowing for flexible skills. Almost like the concept of subagents, but within the architecture of the model itself. Hmmm.
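
To make the organ idea a bit more concrete, here's a toy sketch of what I'm picturing (PyTorch-flavoured; every name here is made up, and in reality each "organ" would be a group of layers lifted from a small pretrained specialist model rather than a fresh feed-forward block):

```python
# Toy sketch of "organ stitching" - not a real implementation.
# Each Organ stands in for a block of layers taken from a small,
# task-specialised model; the router is the only part you'd fine-tune,
# so the stitched model learns which organ to lean on per token.
import torch
import torch.nn as nn

class Organ(nn.Module):
    """Placeholder for a group of layers from a small specialist model."""
    def __init__(self, d_model: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        return x + self.ff(x)  # residual, like a transformer block

class StitchedModel(nn.Module):
    """Frankenstein's monster: frozen organs plus a small trainable router."""
    def __init__(self, d_model: int, organs: list[Organ]):
        super().__init__()
        self.organs = nn.ModuleList(organs)
        for organ in self.organs:
            organ.requires_grad_(False)                # organs stay frozen...
        self.router = nn.Linear(d_model, len(organs))  # ...only this is tuned

    def forward(self, x):  # x: (batch, seq, d_model)
        weights = torch.softmax(self.router(x), dim=-1)           # per-token organ weights
        outs = torch.stack([o(x) for o in self.organs], dim=-1)   # (batch, seq, d_model, n_organs)
        return (outs * weights.unsqueeze(-2)).sum(dim=-1)         # coalesce back into one stream

# Usage: stitch three "specialist" organs together and route between them.
model = StitchedModel(d_model=64, organs=[Organ(64) for _ in range(3)])
y = model(torch.randn(2, 10, 64))
print(y.shape)  # torch.Size([2, 10, 64])
```

The organs stay frozen and only the tiny router gets trained, which is the "little extra fine-tuning to improve its organ-selection abilities" part.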

I'm only versed in transformer architecture at a high level; does anybody know of any architectures where the layers branch & then coalesce like that? Or is it mostly linear, layer by layer?