> we show that over-parameterization catalyzes global convergence by ensuring the feasibility of the SVM problem and by guaranteeing a benign optimization landscape devoid of stationary points
Does this mean 'an over-parameterized transformer problem is a convex SVM problem'?
The irony is that your "simplification" uses even more "jargon."
But yes, that's how I would read that, and I also see no issue at all with the language in the paper. These terms are used for precision, and have meaning to those in the field. Papers are written for other experts, not laymen.
OK, but why do they write "benign optimization landscape devoid of stationary points" instead of "convex", other than "just for show"? In my understanding it's not better for either audience, experts or laymen. For experts it would be clearer to just say convex and they would know the implications, and if someone doesn't know what convex means they probably also aren't going to be on board with 'stationary points'. Also, I'm not trying to pick on the authors; I'm just trying to answer the question of which specific parts could be seen as 'just for show'.
I read it the same way as you did, or at least it's an approximation.
In general that's not really surprising. I remember discussions from some years ago about larger networks leading to smoother loss surfaces.