Comment by bbor
3 months ago
Yes, a good instinct for progress! This has been a central design component of most serious AI work since the late '80s, most notably popularized by Marvin Minsky's The Society of Mind. Highly, highly recommend for anyone with an interest in the mind and AI — it's a series of one-page essays on different aspects of the thesis, a fascinating, Martin-Luther-esque format.
Of course this has been pushed to the side a bit in the rush towards shiny new pure-LLM approaches, but I think that's more a function of a rapidly growing user base than of lost knowledge; the experts still keep this in mind, either in these terms or in terms of "Ensembles". A great example is GPT-4, which AFAIU got its huge performance increase mostly through employing a "mixture of experts", which is essentially a synonym for a society of agents or an ensemble of models.
I don't think "mixture of experts" can be equated with a society of agents. It just routes each token to the most suitable expert sub-networks inside a single model: the experts do not communicate with each other, so how could they form a society?
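For concreteness, here is a minimal sketch of token-level top-k MoE routing (in PyTorch; all names are illustrative, and GPT-4's actual internals are unconfirmed rumor, not something this code reproduces). The property behind my objection is visible directly: each expert processes its tokens independently, and only the router coordinates them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal mixture-of-experts layer: a learned router picks the
    top-k experts per token; experts never see each other's outputs."""
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (batch, seq, d_model)
        logits = self.router(x)                 # (batch, seq, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalize over the k picks
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e      # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

# y = TopKMoE()(torch.randn(2, 10, 64))  # each token sees only its k experts
```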
Hmm, that's a good point, but IMO the distinction isn't sharp enough to make a big deal over. The core idea of SoM as I see it is that human cognition is often quite decentralized, and that any illusion of a unified self is constructed piecemeal from the outputs of smaller, less-aware subsystems. Generally it's expected that the subsystems communicate with each other, yes, but I think "disproportionately rely on one or two members for complex questions but act like you're unified overall" still fits the bill.
The opinion I formed during the first few months after GPT-4's release was that the Society of Mind hypothesis was being disproved by the "maximalist" approach some were taking in order to build a true AGI. It turned out that composing many LLMs into a cognitive architecture, each with a specific purpose (memory, planning, etc.), wasn't scaling.
On the same note, I suggest the following experiment: train a transformer "sliced" into groups of layers, forced to emit/receive tokens at each group boundary (see the sketch below). What I expect: using text rather than raw neural activations as the interface between groups should degrade performance.
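A minimal sketch of that setup, assuming a straight-through estimator to train through the discretization (everything here is hypothetical, not an existing architecture):

```python
import torch
import torch.nn as nn

class SlicedTransformer(nn.Module):
    """Hypothetical setup: layer groups communicate only through
    discrete tokens. At each group boundary, hidden states are
    projected to vocab logits, discretized, and re-embedded, so the
    next group receives text-like symbols, not raw activations."""
    def __init__(self, vocab=1000, d_model=64, n_groups=3, layers_per_group=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.to_vocab = nn.Linear(d_model, vocab)
        layer = lambda: nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.groups = nn.ModuleList(
            nn.TransformerEncoder(layer(), layers_per_group) for _ in range(n_groups)
        )

    def forward(self, ids):                     # ids: (batch, seq) token ids
        h = self.embed(ids)
        for group in self.groups:
            h = group(h)
            logits = self.to_vocab(h)
            # Straight-through trick: hard tokens on the forward pass,
            # soft (differentiable) mixture on the backward pass.
            hard = self.embed(logits.argmax(-1))
            soft = logits.softmax(-1) @ self.embed.weight
            h = soft + (hard - soft).detach()
        return self.to_vocab(h)
```

Comparing this against an identical model with the boundary discretization removed would test the prediction directly.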
This is something you can observe in our own societies: intelligence doesn't compose. You don't double a group's overall intelligence by doubling the number of members; at best you'll observe diminishing returns, at worst intelligence will decrease.