Comment by elchananHaas

3 days ago

One more concern I noticed: this generative approach needs each layer not only to select each output with uniform probability, but to do so regardless of its input.

This is the bad case I am concerned about.

Layer 1 -> (A, B), Layer 2 -> (C, D)

Let's say Layer 1 outputs A and B each with probability 1/2 (a perfect split). Now suppose Layer 2 outputs C whenever it gets A as input and D whenever it gets B. Marginally, Layer 2 still outputs C and D each with probability 1/2, but conditioned on Layer 1's output it is deterministic: the composed system only ever produces (A, C) and (B, D), 2 distinct outputs instead of the 4 you would expect.

If this happens, the claim of an exponential increase in diversity per layer breaks down.
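Here's a tiny simulation of that failure mode (just a toy sketch, not your actual model; `layer1`/`layer2` and the A/B/C/D mapping are the made-up example above):

```python
import random
from collections import Counter

random.seed(0)

def layer1():
    # Perfect split: A and B each with probability 1/2.
    return random.choice(["A", "B"])

def layer2(x):
    # Deterministic given its input: C for A, D for B.
    return "C" if x == "A" else "D"

samples = []
for _ in range(10_000):
    a = layer1()
    samples.append((a, layer2(a)))

print(Counter(a for a, _ in samples))  # ~5000 A, ~5000 B
print(Counter(b for _, b in samples))  # ~5000 C, ~5000 D -- marginally uniform
joint = Counter(samples)
print(f"distinct joint outputs: {len(joint)} of 4 possible")  # only 2: (A,C), (B,D)
```

Both per-layer marginals look perfectly balanced, so a check that only looks at each layer's output frequencies would never catch this; you'd have to measure the conditional distributions (or the joint entropy) to see the collapse.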

It could be that the first-order approximation provided by Split-and-Prune is good enough. My guess, though, is that the gradient updates and Split-and-Prune are helping each other keep the outputs reasonably balanced on the datasets you are working on. Split-and-Prune lets the optimization process "tunnel" through regions of the loss landscape that would otherwise make it hard to balance the classes.