
Comment by jsight

2 years ago

> The unfortunate thing about the field is that subtle methodological errors often cause subtle failures rather than catastrophic failures as we're used to in many other branches of engineering or science.

I've been doing a lot of studying in the ML field lately, and I'm seeing this a lot. It is just another thing that feels like the polar opposite of everything else that I've done as a software engineer.

Miss a semicolon? Instant error.

Miscalculate some grads on one out of three layers? Every now and then it might even work! But the results will be weird.
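
The usual safeguard against that particular bug is a finite-difference gradient check. A rough sketch, where loss_fn, grad_fn, and params are placeholders for whatever you're actually testing:

    import numpy as np

    def grad_check(loss_fn, grad_fn, params, eps=1e-5):
        # Compare the analytic gradient to a central finite difference,
        # one parameter at a time.
        analytic = grad_fn(params)
        numeric = np.zeros_like(params)
        for i in range(params.size):
            bump = np.zeros_like(params)
            bump.flat[i] = eps
            numeric.flat[i] = (loss_fn(params + bump) - loss_fn(params - bump)) / (2 * eps)
        denom = np.maximum(1e-8, np.abs(analytic) + np.abs(numeric))
        return np.max(np.abs(analytic - numeric) / denom)  # roughly 1e-7 or less if they match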

How about this one: tune your hyper-parameters based on results from your test data.

This is widespread, even the norm, but it is a form of information leakage: you're passing information about the test dataset to the model. The fix is to use three partitions: train, validation, test. Validation is for hyper-parameter tuning (you can do cross-validation there, btw) and test is a single shot, used exactly once at the end.
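
A minimal sketch of that protocol, assuming scikit-learn; X, y, the classifier, and the parameter grid are all placeholders:

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split

    # Carve out the held-out test set first. X, y are placeholders for your dataset.
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # Tune hyper-parameters with cross-validation on the train+validation portion only.
    search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
        cv=5)
    search.fit(X_trainval, y_trainval)

    # Single shot: the test set is touched exactly once, after all tuning decisions are frozen.
    print("test accuracy:", search.score(X_test, y_test))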

  • Yep, I've been guilty of that one lately. That and solving problems by simply overfitting a neural net to the data in the problem domain.

    I mean, it works, but the result is less interesting than what I should have done. :)

    • Definitely. The problem is that doing this helps you get published rather than hurting you. I think this is why there's often confusion when industry tries to use academic models: they don't generalize well because of this overfitting. But also, evaluation is fucking hard, and there's just no way around that. Trying to make it easy (i.e., benchmarkism) just ends up creating more noise instead of the intended decrease.

  • They banged cross-validation into our heads in school, and then no one in NLP uses it, and I just can't understand why not.

    • Not only that, but I've had long arguments with people who claim it isn't information leakage. The other thing I've argued about at length is random sampling. Ever wonder why the "random samples" in generative model papers look so good, especially compared to the samples you get yourself? Because a significant number of people believe that as long as you don't hand-select individual images, it counts as a "random sample." They generate a batch of samples, don't like it, and generate new batches until they do. That's definitely not a random sample; you're just re-rolling the dice until you get a good outcome. But if you don't do this, and report an actual random sample, reviewers will criticize you for it, even if your curated samples are good and all your benchmarks beat the others. Ask me how I know...
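
      For what it's worth, the honest protocol is simple: commit to the randomness up front and report whatever comes out. A rough sketch (latent_dim and model.generate are placeholder names, not any particular library's API):

        import numpy as np

        rng = np.random.default_rng(0)                   # seed committed before looking at any output
        latents = rng.standard_normal((16, latent_dim))  # latent_dim: placeholder
        samples = model.generate(latents)                # model.generate: placeholder model API
        # Report `samples` as-is; no regenerating batches until one "looks good".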
