
Comment by AbrahamParangi

2 years ago

Probably disappointing to the authors but an excellent rebuttal.

This is the sort of mistake that's awfully easy to make in ML. The unfortunate thing about the field is that subtle methodological errors often cause subtle failures rather than the catastrophic failures we're used to in many other branches of engineering or science. You can easily make a system slightly worse or slightly better by contaminating your training set with bad data or by accidentally leaking some information about your target, and the ML system will take it in stride (but with slightly contaminated results).

This result makes sense to me because, as much as I would like it to be true, applying existing compression algorithms to ML feels like too much of a "free lunch". If there were any special magic happening in compression algorithms, we'd use compression algorithms as encoders instead of using transformers as compressors.

> This is the sort of mistake that's awfully easy to make in ML.

It is important to remember this! Mistakes are common because they are easy to make. Science is a noisy process, but there is signal there, and what we see here is exactly what peer review is about. I tend to argue that open publication is a better form of peer review than conferences/journals for exactly this reason. Peer review is about your peers reviewing your work, less about whatever random and noisy standard a conference/journal puts forward. Remember that this was the way things worked for most of our history and that our modern notion of peer review is very recent (the mid-1970s). Older journals were more about accomplishing the mission that arxiv accomplishes today: disseminating works.

https://mitcommlab.mit.edu/broad/commkit/peer-review-a-histo...

[side note] Another reason I'd advocate for abolishing conferences/journals is that it would free us to actively push for reproduction papers, failure papers, and many other important kinds of work, since we would not be held to the "novelty" criterion (almost everything is incremental). "Publishing" is about communicating your work to your peers and having them validate or invalidate your results.

[edit] I think conferences are good in that they bring people together, which encourages collaboration. That's great. But I should clarify that I'm specifically talking about using these platforms as a means to judge the validity of works. If a conference system just wants to invite works and the community, then I'm totally cool with that. I do also like journals in theory, given that there's a conversation happening between authors and reviewers, but I believe this could just as easily be accomplished through arxiv + github or OpenReview (preferred).

Those are used. Search for the minimum description length principle and entropy-based classifiers. The performance is poor, but it is definitely there and really easy to deploy. I have seen gzip used for plagiarism detection, since similar texts tend to compress better together. Use the compression ratio as weights in a spring model for visualisation. It also works with network communication metadata ...
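
A minimal sketch of the kind of gzip-based similarity the parent describes, using normalized compression distance (NCD); the example documents and the "small vs. large" interpretation are illustrative assumptions, not anything from the thread:

```python
import gzip

def ncd(a: str, b: str) -> float:
    """Normalized compression distance via gzip.
    Smaller values suggest the two texts share structure,
    i.e. they compress better together than unrelated texts would."""
    ca = len(gzip.compress(a.encode()))
    cb = len(gzip.compress(b.encode()))
    cab = len(gzip.compress((a + " " + b).encode()))
    return (cab - min(ca, cb)) / max(ca, cb)

# Toy plagiarism-style check: near-duplicates score lower than unrelated text.
original = "The quick brown fox jumps over the lazy dog near the riverbank."
suspect = "The quick brown fox jumps over the lazy dog close to the riverbank."
unrelated = "Stochastic gradient descent updates parameters from noisy gradient estimates."

print(ncd(original, suspect))    # comparatively small
print(ncd(original, unrelated))  # comparatively large
```

The same pairwise distances could feed a spring/force-directed layout for the visualisation use the parent mentions.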

It's true in many experiments. The desire to get the result you want can often overwhelm the need to validate what you are getting.

Especially true when the results confirm any pre-existing thinking you may have.

  • One particular example that I remember from an introductory particle physics class is the History Plots section[1] of the biennial review of experimental data.

    Knowing these quantities is important, but their particular values largely aren't; nobody's funding or career really depended on them being equal to one thing or another. Yet look at all the jumps: after the initial very rough measurements, the values got stuck in the completely wrong place, and when the jump to the right value finally happened, it was of a completely implausible magnitude, like four, six, or ten sigma.

    [1] https://pdg.lbl.gov/2023/reviews/rpp2022-rev-history-plots.p...

    • What's also good to see here is that the post-1990 numbers usually don't even fall within the error bars of the pre-1990 numbers. While uncertainty is great, it isn't the be-all and end-all. I think a lot of people forget how difficult evaluation actually is. Usually we just look at one or two metrics and judge based on that, but such an evaluation is incredibly naive. Metrics and measures are only guides; they provide neither certainty nor targets.

> The unfortunate thing about the field is that subtle methodological errors often cause subtle failures rather than catastrophic failures as we're used to in many other branches of engineering or science.

I've been doing a lot of studying in the ML field lately, and I'm seeing this a lot. It is just another thing that feels like the polar opposite of everything else that I've done as a software engineer.

Miss a semicolon? Instant error.

Miscalculate some grads on one out of three layers? Every now and then it might even work! But the results will be weird.
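To make the contrast concrete, here is a minimal finite-difference gradient check, a common sanity check rather than anything from this thread, with a deliberately mis-scaled gradient standing in for the kind of subtle backward-pass bug described above:

```python
import numpy as np

def loss(w, x, y):
    # Tiny linear model with squared-error loss.
    return 0.5 * np.sum((x @ w - y) ** 2)

def buggy_grad(w, x, y):
    # Correct gradient is x.T @ (x @ w - y); the 0.9 factor mimics a subtle
    # bug: training with it would still mostly work, just a bit "weird".
    return 0.9 * (x.T @ (x @ w - y))

def numeric_grad(f, w, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
x, y, w = rng.normal(size=(8, 3)), rng.normal(size=8), rng.normal(size=3)
ga = buggy_grad(w, x, y)
gn = numeric_grad(lambda w_: loss(w_, x, y), w)

# Relative error around 0.1 here; a correct backward pass would be ~1e-8.
print(np.max(np.abs(ga - gn)) / np.max(np.abs(gn)))
```
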

  • How about this one: tune your hyper-parameters based on the results on your test data.

    This is widespread, even the norm, but it is a form of information leakage: you're passing information about the test set to the model. The solution is to use three partitions: train, validation, test. Validation is for HP tuning (you can do cross-validation, btw) and test is a single shot. (A minimal sketch of such a split follows this thread.)

    • Yep, I've been guilty of that one lately. That and solving problems by simply overfitting a neural net to the data in the problem domain.

      I mean, it works, but the result is less interesting than what I should have done. :)

      3 replies →
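
Here is a minimal sklearn sketch of the train/validation/test discipline described above; the dataset, model, and hyper-parameter grid are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hold out a test set that gets touched exactly once, at the very end.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Tune hyper-parameters with cross-validation on the remaining data only,
# so no information about the test set leaks into the choice of C/gamma.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}, cv=5)
search.fit(X_trainval, y_trainval)

# Single-shot evaluation on the untouched test set.
print(search.best_params_, search.score(X_test, y_test))
```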

Academic research code is largely dogshit, written as quickly as possible by amateurs and barely tested, and the primary intended output of all such code is accumulating paper citations.

A world with half as many scientific papers and twice as much care would produce far more value, but the whole enterprise is hopelessly gamified.

Having worked in other sciences (neuroscience for me), I’m not sure what catastrophic obvious errors you’re used to seeing. The vast majority IME are like this, except with even longer feedback loops (on the order of several months).

  • Well, biology is probably a lot closer to ML in this way. My experience in chemistry and materials science is that 99% of the time, if you do anything wrong, it totally doesn't work at all.

    This is fairly typical in software as well.

Now shift fields such that the subtle methodological errors don't come to light for 20 years.

Which field are you in now? Economics!? Hahaha