Comment by nobodywillobsrv
2 days ago
Softmax’s exponential comes from counting occupation states. Maximize the ways to arrange things with logits as energies, and you get exp(logits) over a partition function, pure Boltzmann style. It’s optimal because it’s how probability naturally piles up.
I personally don’t think much of the maximum entropy principle. If you look at the axioms that inform it, they don’t seem obviously correct. Further, the usual qualitative argument is only right through a certain lens: namely, that choosing anything else would require you to make more assumptions about your distribution than necessary. Yet it’s easy to find examples where the max entropy solution suppresses some states more than is necessary, which to me contradicts that qualitative argument.
right and it should be totally obvious that we would choose an energy function from statistical mechanics to train our hotdog-or-not classifier
No need to introduce the concept of energy. It's a "natural" probability measure on any space where the outcomes have some weight. In particular, it's the measure that maximizes entropy while fixing the average weight. Of course it's contentious if this is really "natural," and what that even means. Some hardcore proponents like Jaynes argue along the lines of epistemic humility but for applications it really just boils down to it being a simple and effective choice.
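For anyone who wants to poke at the "maximizes entropy while fixing the average weight" claim, here is a minimal sketch in plain Python. The three weights, the perturbation direction, and the step size `eps` are all made up for illustration. With three outcomes, the direction [1, -2, 1] leaves both the total probability and the average weight unchanged, so every distribution satisfying the same two constraints lies along it.

```python
import math

def softmax(weights):
    # Subtract the max before exponentiating for numerical stability.
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    z = sum(exps)  # partition function (up to the constant factor exp(m))
    return [e / z for e in exps]

def entropy(p):
    # Shannon entropy in nats.
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def mean(p, weights):
    # Average weight under the distribution p.
    return sum(pi * wi for pi, wi in zip(p, weights))

# Hypothetical weights, chosen only for illustration.
weights = [0.0, 1.0, 2.0]
p = softmax(weights)

# Moving along [1, -2, 1] keeps sum(p) and mean(p, weights) fixed
# for these particular weights (1 - 2 + 1 = 0 and 0*1 - 2*1 + 1*2 = 0).
direction = [1.0, -2.0, 1.0]
eps = 0.02  # small enough that all probabilities stay positive
q_plus = [pi + eps * d for pi, d in zip(p, direction)]
q_minus = [pi - eps * d for pi, d in zip(p, direction)]

for name, q in [("softmax", p), ("q_plus", q_plus), ("q_minus", q_minus)]:
    print(name, "mean =", round(mean(q, weights), 6), "entropy =", round(entropy(q), 6))
```

All three distributions print the same mean, and the softmax one has the largest entropy. That only checks the mathematical statement; it says nothing about whether maximum entropy is the "right" criterion.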
In statistical mechanics, fixing the average weight has significance, since the average weight i.e. average energy determines the total energy of a large collection of identical systems, and hence is macroscopically observable.
But in machine learning, it has no significance at all. In particular, to fix the average weight, you need to vary the temperature depending on the individual weights, but machine learning practitioners typically fix the temperature instead, so that the average weight varies wildly.
So softmax weights (logits) are just one particular way to parameterize a categorical distribution, and there's nothing precluding another parameterization from working just as well or better.
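The fixed-temperature versus fixed-average-weight point can be checked numerically. A rough sketch, assuming softmax of logits / T; the two logit vectors, the target mean of 1.5, and the helper `temperature_for_mean` are hypothetical and not taken from any library.

```python
import math

def softmax(logits, temperature=1.0):
    # Softmax of logits / temperature, shifted by the max for stability.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def mean_weight(logits, temperature=1.0):
    # Average logit under the softmax distribution at this temperature.
    p = softmax(logits, temperature)
    return sum(pi * xi for pi, xi in zip(p, logits))

def temperature_for_mean(logits, target, lo=1e-3, hi=1e3, iters=100):
    # Bisection on a log scale. The mean weight falls monotonically as the
    # temperature rises: from max(logits) toward the plain arithmetic mean.
    # Assumes target lies strictly between those two values.
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if mean_weight(logits, mid) > target:
            lo = mid  # mean too high, so the temperature must go up
        else:
            hi = mid
    return math.sqrt(lo * hi)

# Hypothetical logit vectors, for illustration only.
a = [0.0, 1.0, 2.0]
b = [0.0, 0.5, 3.0]

# Fixed temperature (the usual ML setup): the average weight differs per input.
print("T = 1:", mean_weight(a), mean_weight(b))

# Fixed average weight (the stat mech setup): each input needs its own temperature.
target = 1.5
t_a = temperature_for_mean(a, target)
t_b = temperature_for_mean(b, target)
print("T for mean 1.5:", t_a, t_b)
```

The two inputs need clearly different temperatures to hit the same average, which is the sense in which fixing T and fixing the average weight are different constraints.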
The connection isn't immediately obvious, but it's simply because solving for the maximum entropy distribution that achieves a given expectation value produces the Boltzmann distribution. In stat mech, our "classifier" over (micro-)states is energy; in A.I. the classifier is labels.
For details, the keyword is Lagrange multiplier [0]. The specific application here is maximizing f, the entropy, subject to the constraint g, the expectation value.
If you're like me at all, the above will be a nice short rabbit hole to go down!
[0]:https://tutorial.math.lamar.edu/classes/calciii/lagrangemult...
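For anyone who wants the short version of that rabbit hole, here is a sketch of the Lagrange multiplier calculation. The symbols are my own labels, not anything from the thread: p_i is the probability of state i, E_i is the quantity whose expectation is fixed, ⟨E⟩ is the fixed value, and α and β are the multipliers for normalization and for the expectation constraint.

```latex
\begin{aligned}
&\text{maximize } f(p) = -\sum_i p_i \ln p_i
\quad \text{subject to} \quad \sum_i p_i = 1, \qquad g(p) = \sum_i p_i E_i = \langle E \rangle \\
&\mathcal{L} = -\sum_i p_i \ln p_i
  - \alpha \Big( \sum_i p_i - 1 \Big)
  - \beta \Big( \sum_i p_i E_i - \langle E \rangle \Big) \\
&\frac{\partial \mathcal{L}}{\partial p_i} = -\ln p_i - 1 - \alpha - \beta E_i = 0
\;\Longrightarrow\; p_i = e^{-1-\alpha}\, e^{-\beta E_i} \\
&\sum_i p_i = 1 \;\Longrightarrow\; p_i = \frac{e^{-\beta E_i}}{Z},
\qquad Z = \sum_j e^{-\beta E_j}
\end{aligned}
```

Writing the logits as z_i = -β E_i turns the last line into the usual softmax, exp(z_i) over the sum of exp(z_j).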
The way that energy comes in is that you have a fixed (conserved) amount of it and you have to portion it out among your states. There's nothing inherently energy-related about it; it just happens that we often want to look at energy distributions and lots of physical systems distribute energy this way (because it's the energy distribution with maximal entropy given the constraints).
(After I wrote this I saw the sibling comment from xelxebar which is a better way of saying the same thing.)