Comment by porridgeraisin

1 day ago

Is it that wide though? For example, how do you explain why you cannot autograd through sampling (and thus have to use either the reparameterization trick or Gumbel)? Sure, instead of relying on differentiability, you can explain it intuitively as "the output changes only when you literally reach the next threshold, so all the way in between you don't really get a good direction", but how far are you going to take this?
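To make the threshold picture concrete, here is a minimal pure-Python sketch. It uses no autograd library; the helper names (`bernoulli_sample`, `gaussian_sample`, `finite_diff`) and all constants are illustrative assumptions. It thresholds fixed noise to draw a Bernoulli sample and shows that nudging the parameter leaves the sample unchanged, then contrasts that with a Gaussian written as `mu + sigma * eps`, where the sample depends smoothly on `mu`.

```python
import random

def bernoulli_sample(p, u):
    """Bernoulli(p) draw obtained by thresholding fixed uniform noise u."""
    return 1.0 if u < p else 0.0

def gaussian_sample(mu, sigma, eps):
    """Reparameterized Gaussian draw: eps is fixed noise, mu and sigma enter smoothly."""
    return mu + sigma * eps

def finite_diff(f, x, h=1e-4):
    """Central finite-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Thresholded sampling: with the noise held fixed, the output is piecewise
# constant in p, so a small nudge to p changes nothing.
u = 0.9  # fixed noise value, chosen far from the threshold p = 0.3
flat_slope = finite_diff(lambda p: bernoulli_sample(p, u), 0.3)

# Reparameterized sampling: estimate d/dmu E[z^2] with z = mu + sigma * eps.
# Per sample, d(z^2)/dmu = 2 * z, and the analytic answer is 2 * mu.
rng = random.Random(0)
mu, sigma = 1.5, 0.5
noise = [rng.gauss(0.0, 1.0) for _ in range(20000)]
pathwise_grad = sum(2.0 * gaussian_sample(mu, sigma, e) for e in noise) / len(noise)

print("slope through thresholded sample:", flat_slope)
print("pathwise gradient estimate:", pathwise_grad, "analytic:", 2 * mu)
```

The first slope is exactly zero almost everywhere, which is the "no good direction" problem; the second estimate lands near the analytic value because the randomness was moved out of the path from `mu` to the output.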

I agree with your general point, that we don't need insane levels of math, but I would say a college level of calculus, linalg and probability is baseline.

A basic benchmark off the top of my head:

Being able to pick up, without stumbling on the fundamentals:

- what LoRA is doing

- how an RBF-kernel SVM works

- why KL and reverse-KL are different

- why using mean squared error is equivalent to MLE on a Gaussian

Not saying the four above pieces are all necessary, but that you should be able to learn them on demand without needing to revisit what a basis vector is.
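As a rough illustration of two of the items above, here is a small standard-library sketch. The distributions `p` and `q`, the data `ys`, and the candidate grid are made-up values chosen only for the demo. The first half computes KL in both directions for a pair of discrete distributions; the second half checks that, with the variance held fixed, the mean that minimizes mean squared error is also the one that minimizes the Gaussian negative log-likelihood.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# KL is not symmetric: swapping the arguments changes the value.
p = [0.8, 0.1, 0.1]
q = [1 / 3, 1 / 3, 1 / 3]
forward_kl = kl(p, q)
reverse_kl = kl(q, p)

def mse(ys, mu):
    """Mean squared error of a constant prediction mu."""
    return sum((y - mu) ** 2 for y in ys) / len(ys)

def gaussian_nll(ys, mu, sigma):
    """Negative log-likelihood of ys under N(mu, sigma^2)."""
    return sum(
        0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)
        for y in ys
    )

# With sigma fixed, NLL = constant + n / (2 * sigma^2) * MSE,
# so both objectives pick the same mu.
ys = [1.0, 2.0, 4.0]
sigma = 1.0
candidates = [i / 10 for i in range(51)]  # 0.0, 0.1, ..., 5.0
best_mse = min(candidates, key=lambda m: mse(ys, m))
best_nll = min(candidates, key=lambda m: gaussian_nll(ys, m, sigma))

print("KL(p||q):", forward_kl, "KL(q||p):", reverse_kl)
print("argmin MSE:", best_mse, "argmin NLL:", best_nll)
```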

"Working out derivatives of arbitrary functions" is school level.

3 comments

fc417fc802 1 day ago

Rate of change -> it is flat -> that is not a useful signal. I don't see the issue?

We aren't talking about doing cutting edge research, just educating people on the basics of how ML does what it does. I agree that the things you list should follow at some point in the sequence for any rigorous education. But it's a question of at what point those things should come up and what the corresponding depth of education is.

For the initial introduction I think everything you listed is entirely out of scope. You don't need any of that to get a basic MLP working using a for loop and naive gradient descent.
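For reference, a sketch of what that can look like: a one-hidden-layer MLP fit to XOR with hand-written gradients, plain for loops, and full-batch gradient descent. The hidden size, initialization range, learning rate, and step count are arbitrary choices made for illustration, and depending on the initialization a network this small may or may not fit XOR exactly.

```python
import math
import random

rng = random.Random(0)

# XOR dataset: four inputs and their targets.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
Y = [0.0, 1.0, 1.0, 0.0]

H = 4  # number of hidden units (arbitrary)
W1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
b1 = [0.0] * H
W2 = [rng.uniform(-1, 1) for _ in range(H)]
b2 = 0.0

def forward(x):
    """Return hidden activations (tanh) and the linear output for one input."""
    h = [math.tanh(W1[j][0] * x[0] + W1[j][1] * x[1] + b1[j]) for j in range(H)]
    y_hat = sum(W2[j] * h[j] for j in range(H)) + b2
    return h, y_hat

def mean_squared_error():
    """Mean squared error over the whole dataset with the current weights."""
    return sum((forward(x)[1] - y) ** 2 for x, y in zip(X, Y)) / len(X)

initial_loss = mean_squared_error()
lr = 0.05

for step in range(5000):
    # Accumulate gradients of the mean squared error over the full batch.
    gW1 = [[0.0, 0.0] for _ in range(H)]
    gb1 = [0.0] * H
    gW2 = [0.0] * H
    gb2 = 0.0
    for x, y in zip(X, Y):
        h, y_hat = forward(x)
        d_out = 2.0 * (y_hat - y) / len(X)  # dLoss/dy_hat
        gb2 += d_out
        for j in range(H):
            gW2[j] += d_out * h[j]
            d_pre = d_out * W2[j] * (1.0 - h[j] ** 2)  # chain rule through tanh
            gb1[j] += d_pre
            gW1[j][0] += d_pre * x[0]
            gW1[j][1] += d_pre * x[1]
    # Naive gradient descent step.
    b2 -= lr * gb2
    for j in range(H):
        W2[j] -= lr * gW2[j]
        b1[j] -= lr * gb1[j]
        W1[j][0] -= lr * gW1[j][0]
        W1[j][1] -= lr * gW1[j][1]

final_loss = mean_squared_error()
print("loss before:", initial_loss, "after:", final_loss)
```

The only calculus involved is the chain rule through `tanh` and a squared error.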

groundzeros2015 21 hours ago

> For the initial introduction I think everything you listed is entirely out of scope.

Who are we giving an intro to who doesn’t have 2 years of STEM education?

porridgeraisin 14 hours ago

> You don't need any of that to get a basic MLP working using a for loop and naive gradient descent.

Well sure. Your initial statement was about "most applied ML".

> Rate of change -> it is flat -> that is not a useful signal. I don't see the issue?

It's not going to be zero if you sample in your practicum setting. You're gonna get "RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn". So yeah.