Comment by i000

1 day ago

Would it make sense to embed such a single-purpose network with fixed weights within an LLM before pre-training?

Good question.

It might work; I've considered running a test like this. But it does demand certain things.

The subnetwork has to be either crafted to be "gradient resistant" or kept frozen. Not all discovered or handcrafted circuits would survive gradient pressure as-is, especially the kind of gradients that fly in early pre-training.
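Freezing is the cheap version of this: mask the subnetwork's parameters out of the optimizer step (in PyTorch terms, `requires_grad=False`, or leaving them out of the optimizer's param groups). A minimal numpy sketch of the idea, with hypothetical names, where one "frozen" weight survives an SGD step untouched:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: y = w_frozen * x + w_free * x
# w_frozen stands in for the handcrafted subnetwork; w_free for the rest of the net.
params = {"w_frozen": np.array(2.0), "w_free": float(rng.normal())}
frozen = {"w_frozen"}  # names excluded from the update

def sgd_step(params, grads, lr=0.1):
    """Update every parameter except the frozen ones."""
    return {
        name: value if name in frozen else value - lr * grads[name]
        for name, value in params.items()
    }

# One synthetic step toward the target y = 3x: gradients exist for both
# weights, but only w_free actually moves.
x, y = 1.0, 3.0
pred = params["w_frozen"] * x + params["w_free"] * x
grads = {name: 2 * (pred - y) * x for name in params}

before = params["w_frozen"].copy()
params = sgd_step(params, grads)
assert params["w_frozen"] == before  # the frozen circuit is untouched
```

"Gradient resistant" without freezing is the harder variant: the circuit has to sit in a basin where the pre-training gradients don't tear it apart.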

It has to be able to interface with the native representations that form in a real LLM during pre-training, which is not trivial, and this has to happen early enough in pre-training that gradients start routing through our subnetwork. We can trust "rich get richer" dynamics to take over from there, but for that, the full network needs to discover the subnetwork and start using it.
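One plausible pattern for the interfacing problem is trainable read/write projections around the frozen core: gradients shape the interface, mapping whatever representation the LLM natively forms onto the primitive's expected basis, while the circuit itself stays fixed. A toy sketch (names and widths hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_PRIM = 64, 2  # residual-stream width vs. the primitive's interface width

def frozen_add(a_b):
    """Hypothetical frozen primitive: reads [a, b] in its own tiny basis, emits a + b."""
    return a_b.sum(axis=-1, keepdims=True)

# Trainable read/write projections are the interface; only they see gradients.
W_read = rng.normal(scale=0.02, size=(D_MODEL, D_PRIM))  # trainable
W_write = rng.normal(scale=0.02, size=(1, D_MODEL))      # trainable

def primitive_block(h):
    """h: residual-stream vector. Read into the primitive's basis, write back out."""
    return h + frozen_add(h @ W_read) @ W_write

h = rng.normal(size=(D_MODEL,))
out = primitive_block(h)
assert out.shape == (D_MODEL,)
```

The residual connection matters here: until the projections learn something useful, the block is close to a no-op, so it can't hurt early training.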

And finally, it has to start being used for what we want it to be used for. It's possible that an "addition primitive" structure would be subsumed for something else entirely if you put it into the training run early enough, when the LLM's native circuitry is still nonexistent.

Overall, for an early test, I'd spray 200 frozen copies of the same subnetwork into an LLM across different layers and watch the dynamics as it goes through pre-training. Roll extra synthetic addition problems into the pre-training data to help discovery along. Less of a principled solution and more of an engineering solution.
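Structurally, the "spray and watch" setup might look like this (all names are hypothetical numpy stand-ins for real modules): every copy shares the same frozen weights and gets its own trainable scalar gate initialized near zero, so an undiscovered copy is a no-op, and "discovery" shows up as gates pulling away from zero during pre-training.

```python
import numpy as np

N_LAYERS, N_COPIES = 24, 200
rng = np.random.default_rng(0)

# The frozen "addition primitive": identical handcrafted weights in every copy.
primitive_weights = np.array([[1.0, 1.0]])  # stands in for the real circuit

# Spray copies across layers; each gets a trainable near-zero gate.
copies = [
    {"layer": int(rng.integers(N_LAYERS)),
     "weights": primitive_weights,           # shared and frozen
     "gate": float(rng.normal(0.0, 0.01))}   # trainable: discovery = gate growing
    for _ in range(N_COPIES)
]

# The dynamics to watch during pre-training: the gate distribution per layer.
# "Rich get richer" would show up as a few gates pulling far away from zero.
gates_by_layer = {}
for c in copies:
    gates_by_layer.setdefault(c["layer"], []).append(abs(c["gate"]))
busiest = max(gates_by_layer, key=lambda layer: max(gates_by_layer[layer]))
```

The per-layer gate histogram is the cheap diagnostic: it tells you both which depths find the primitive useful and whether usage concentrates on a few copies or stays diffuse.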

  • +1 I’ve always had the feeling that training from randomly initialized weights without seeding some substructure is unnecessarily slowing LLM training.

Similarly, I'm always surprised that we don't start by training a small set of layers, stack them, and then continue.

Better-than-random initialization is underexplored, but there is some work in that direction.

One of the main issues is that we don't know how to generate useful computational structure for LLMs, or how to transfer existing structure neatly across architectural variations.

      What you describe sounds more like a "progressive growing" approach, which isn't the same, but draws from some similar ideas.
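For reference, the core move in progressive growing (as in, e.g., progressive stacking for BERT) is to duplicate a trained shallow stack to warm-start a deeper model instead of initializing the new depth from scratch. A toy numpy sketch with the actual training loops elided:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hidden width

def make_layer():
    return rng.normal(scale=0.02, size=(D, D))

def forward(layers, x):
    for w in layers:
        x = x + np.tanh(x @ w)  # toy residual block
    return x

# Stage 1: train a small stack (training elided; pretend these are converged).
small_stack = [make_layer() for _ in range(4)]

# Stage 2: grow by duplicating each trained layer in place, then continue
# training the deeper model from this warm start rather than from random init.
grown_stack = [w.copy() for w in small_stack for _ in (0, 1)]  # 4 -> 8 layers

x = rng.normal(size=D)
y_small = forward(small_stack, x)
y_grown = forward(grown_stack, x)  # deeper model, warm-started, still runs
```

The bet is that a converged shallow layer is a far better starting point for a mid-depth layer than random noise, which is the same intuition as seeding substructure, just applied depth-wise.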

I had that in mind too. What if you handcraft a subnetwork with (some subset of) Turing machine capability? Do those kinds of circuits emerge naturally during training? Can reasoning use them for complex computation?