
Comment by mattsan

2 years ago

Yep, confirmation bias. Luckily helped with peer review!

Hasn’t this paper made it through peer review?

  • Yeah it was published at ACL ( https://aclanthology.org/2023.findings-acl.426/ ) which is one of the most prestigious conferences in NLP. So kinda disappointing.

    But paper reviewers are usually not supposed to look at the actual source code of the papers, and definitely don't try to reproduce the results. They just read the paper itself, which of course doesn't talk about the error.

    Not sure what the best solution is, other than having the most "hyped" papers double verified by researchers on Twitter.

    • It's customary to use OSF (https://osf.io/) on papers this "groundbreaking," as it encourages scientists to validate and replicate the work.

      It's also weird that at this stage there are no validation checks in place, exactly like the ones the author performed. There was so much talk of needing this post-"replication crisis."


    • Yeah, it’s not (entirely) the students’ fault that this slipped through peer review. I don’t envy the whiplash they’re going to experience over the next few weeks.

      If I was the graduate chair of their department I might schedule a meeting with their supervisor to sort out how this happened.


    • > paper reviewers are usually not supposed to look at the actual source code of the papers

      Wait, what? I haven't reviewed for ACL, but most conferences don't say "don't look at the source code." They say that reviewers are not required to look at it (or at the appendix). But generally it just isn't uploaded. I do always look at the main method when the code is there, but talking to my peers and advisor, this is very uncommon[0]. My experience is that most reviewers do not spend more than an hour on a work and form an opinion within 15 minutes.

      > Not sure what the best solution is, other than having the most "hyped" papers double verified by researchers on Twitter.

      I'd say (as a start):

      1) Get rid of the conference system. A zero-shot (maybe 1-shot if a "rebuttal" is allowed) zero-sum system is just disastrous, especially at scale. There are strong incentives to reject the works you review, since you're competing for the same limited acceptance slots. A conference has a binary outcome, and its purpose is to reject 80% of papers based on a rather noisy metric of "top tier." A journal is a back and forth where reviewers try to improve the paper: their job is to determine whether the idea is good and whether the paper meets the requirements, and they must explicitly state what needs to change for acceptance.

      1.5) An actual rebuttal system could help alleviate some of these issues. Using OpenReview for a conversation between authors and reviewers is critical. A single 1-page response (the norm) is not adequate to answer 4 different people whose critiques often have little overlap. Meanwhile reviewers can (though it breaks guidelines) respond with a single sentence.

      2) ACs need to do a better job of validating reviewers. The number of inane and absolutely unacceptable reviews I have gotten is astounding (>25%). I've also seen reviewers break guidelines with nothing happening. Examples include claims of "lack of novelty" with no explanation, or requests to compare against concurrent work (I've had this happen for a work that was put out _after_ the submission deadline. Not mine, but here's an example[1] of this being done publicly). And if the reviewer is pushed to update their comment, the authors have no way to respond to the update without the conversation mechanism from 1.5). If there is high variance in the reviews -- not just the scores, but what the critiques are about -- then the ACs need to look closer, because something is going wrong. We're in a reviewer-shortage crisis, but we also have an unacknowledged crisis in reviewer quality. Benchmarkism is on the rise, yet benchmarks are an extremely limited form of evaluation; there's a certain irony given how often we discuss Goodhart's Law and reward hacking. I'll even claim that the quality crisis feeds the quantity crisis: I have seen many peers stop reviewing because it isn't worth their time and they aren't getting a fair shot in return. On a personal note, there is a journal I will no longer review for because of unactionable and unreasonable responses, and I won't submit to it either.

      3) Either get rid of double-blind, or actually enforce it. Everything is published on arXiv these days, which in general is great for the community because it lets things move fast, but it also makes it incredibly easy to de-anonymize authors. Big labs, for their part, often actively de-anonymize themselves[2]. In a very noisy process even a slight edge becomes a significant one[3]. These biases can creep in unconsciously: we're all reading arXiv papers constantly, and it isn't unlikely that we come across some of the works we end up reviewing (yet to knowingly happen to me, fwiw). And certain labs use recognizable keywords that can identify them.

      I think one of the major problems comes down to this: in a small community we have a certain level of accountability, because we all end up knowing one another through a few connections. In a large community there is little to no accountability, and whatever depends on good faith can no longer be trusted. That encourages bad actors, especially when the system is highly competitive (see 1)), and creates bad-science/evaluation creep (e.g. it is now standard to tune hyperparameters on test-set results -- this is information leakage, but if you don't do it, you likely can't compete).
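
      To make the leakage point concrete, here's a minimal sketch (mine, not from any of the papers discussed; scikit-learn, the dataset, and the model are arbitrary stand-ins) contrasting the leaky protocol with a proper train/validation/test split:

        # Illustrative only: contrasts tuning hyperparameters on the test set
        # (information leakage) with selecting them on a validation split and
        # reporting once on an untouched test split.
        from sklearn.datasets import load_breast_cancer
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split

        X, y = load_breast_cancer(return_X_y=True)
        # Hold out a test set once, up front; it should only be touched for the final report.
        X_trainval, X_test, y_trainval, y_test = train_test_split(
            X, y, test_size=0.2, random_state=0)
        X_train, X_val, y_train, y_val = train_test_split(
            X_trainval, y_trainval, test_size=0.25, random_state=0)

        candidate_Cs = [0.01, 0.1, 1.0, 10.0]

        def fit(C):
            return LogisticRegression(C=C, max_iter=5000).fit(X_train, y_train)

        # Leaky protocol: pick the C that maximizes *test* accuracy, then report that same number.
        leaky_C = max(candidate_Cs, key=lambda C: fit(C).score(X_test, y_test))
        print("leaky 'result':", fit(leaky_C).score(X_test, y_test))

        # Proper protocol: select on the validation split, touch the test set exactly once.
        best_C = max(candidate_Cs, key=lambda C: fit(C).score(X_val, y_val))
        final = LogisticRegression(C=best_C, max_iter=5000).fit(X_trainval, y_trainval)
        print("reported test accuracy:", final.score(X_test, y_test))

      On a leaderboard, the leaky number is the one that wins, which is exactly the "if you don't, you likely can't compete" dynamic.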

      ======

      [0] Here's a prominent researcher explicitly saying they don't read the appendix, calling it trash, and a poll showing most people don't look at it https://twitter.com/david_picard/status/1660293648796340226

      [1] Here's a prominent researcher criticizing a paper for "not citing his work". I linked the top response, which points out that the submission date was 2 months prior to his arXiv release. This is someone who has published >250 papers vs someone with <50. For added reference, paper 2 (the prominent researcher's) was _published_ June 26th in TMLR, but they did cite the other work (gotta give credit for that) https://twitter.com/RinonGal/status/1667943354670170118

      [2] We have 2 scenarios here: either 1) reviewers do not know Chinchilla == DeepMind, in which case I'd argue they are unfit to review given the prominence of that model, or 2) they do know, and thus know this is a DeepMind work, and we have an ethics problem. Neither sounds great. https://openreview.net/forum?id=OpzV3lp3IMC&noteId=HXmrWV3ln...

      [3] The conclusion of this analysis of the consistency experiment is that even a small amount of inconsistency leads to a lot of noise under a highly selective standard, which means that paper acceptance itself is highly stochastic: (2014 experiment) https://inverseprobability.com/talks/notes/the-neurips-exper...

      [3.1] A shorter version: https://blog.mrtz.org/2014/12/15/the-nips-experiment.html

      [3.2] A follow-up on the 2014 experiment, tl;dr: reviewers are good at identifying bad papers but not good at identifying good papers (i.e. biased toward rejection): https://arxiv.org/abs/2109.09774

      [3.3] A follow-up 2021 experiment (consistent with 2014 experiment): https://blog.neurips.cc/2021/12/08/the-neurips-2021-consiste...

      [3.4] Video form https://www.youtube.com/watch?v=19Q-vMd9bYg


  • I suspect GP commenter meant "replication study" rather than "peer review".

    ;-)

    (Peer review doesn't check whether your data is correct. Reviewers check that your data collection methods make sense given the hypothesis you're testing, and that your conclusions are supported by the data you collected.)