
Comment by gcr

1 day ago

NeurIPS leadership doesn’t think hallucinated references are necessarily disqualifying; see the full article from Fortune for a statement from them: https://archive.ph/yizHN

> When reached for comment, the NeurIPS board shared the following statement: “The usage of LLMs in papers at AI conferences is rapidly evolving, and NeurIPS is actively monitoring developments. In previous years, we piloted policies regarding the use of LLMs, and in 2025, reviewers were instructed to flag hallucinations. Regarding the findings of this specific work, we emphasize that significantly more effort is required to determine the implications. Even if 1.1% of the papers have one or more incorrect references due to the use of LLMs, the content of the papers themselves are not necessarily invalidated. For example, authors may have given an LLM a partial description of a citation and asked the LLM to produce bibtex (a formatted reference). As always, NeurIPS is committed to evolving the review and authorship process to best ensure scientific rigor and to identify ways that LLMs can be used to enhance author and reviewer capabilities.”

> the content of the papers themselves are not necessarily invalidated. For example, authors may have given an LLM a partial description of a citation and asked the LLM to produce bibtex (a formatted reference)

Maybe I'm overreacting, but this feels like an insanely biased response. They found the one potentially innocuous reason and latched onto that as a way to hand-wave the entire problem away.

Science already had a reproducibility problem, and it now has a hallucination problem. Considering the massive influence the private sector has on both the work and the institutions themselves, the future of open science is looking bleak.

  • I found at least one example[0] of authors claiming the reason for the hallucination was exactly this. That said, I do think for this kind of use, authors should go to the effort of verifying the correctness of the output. I also tend to agree with others who have commented that while a hallucinated citation or two may not be particularly egregious, it does raise concerns about what other errors may have been missed.

    [0] https://openreview.net/forum?id=IiEtQPGVyV&noteId=W66rrM5XPk

  • The wording is not hand-wavy. They said "not necessarily invalidated", which could mean just that innocuous reason and nothing more.

    • I really think it is. The primary function of these publications is to validate science. When we find invalid citations, it shows they're not doing their job. When they get called on that, they cite the volume of work their publication puts out and call out the only potential not-disqualifying outcome.

      Seems like CYA, seems like hand wave. Seems like excuses.

    • Even if some of those innocuous mistakes happen, we'll all be better off if we accept people making those mistakes as acceptable casualties in an unforgiving campaign against academic fraudsters.

      It's like arguing against strict liability for drunk driving because maybe somebody accidentally let their grape juice sit too long and they didn't know it was fermented... I can conceive of such a thing, but that doesn't mean we should go easy on drunk driving.

  • I don’t read the NeurIPS statement as malicious per se, but I do think it’s incomplete.

    They’re right that a citation error doesn’t automatically invalidate the technical content of a paper, and that there are relatively benign ways these mistakes get introduced. But focusing on intent or severity sidesteps the fact that citations, claims, and provenance are still treated as narrative artifacts rather than things we systematically verify.

    Once that’s the case, the question isn’t whether any single paper is “invalid” but whether the workflow itself is robust under current incentives and tooling.

    A student group at Duke has been trying to think about this with Liberata, i.e. what publishing looks like if verification, attribution, and reproducibility are first-class rather than best-effort.

    They have a short explainer here that lays out the idea if useful context helps: https://liberata.info/

  • Isn't disqualifying X months of potentially great research due to a misformed, but existing reference harsh? I don't think they'd be okay with references that are actually made up.

    • When your entire job is confirming that science is valid, I expect a little more humility when it turns out you've missed a critical aspect.

      How did these 100 sources even get through the validation process?

      > Isn't disqualifying X months of potentially great research due to a misformed, but existing reference harsh?

      It will serve as a reminder not to cut any corners.


    • Science relies on trust... a lot. So things which show dishonesty are penalised greatly. If we were to remove trust, then peer reviewing a paper might take months or even years of work.


This will continue to happen as long as it is effectively unpunished. Even retracting the paper would do little good, as odds are it would not have been written if the author could not have used an LLM, so they are no worse off for having tried. Scientific publications are mostly a numbers game at this point. It is just one more example of a situation where behaving badly is much cheaper than policing bad behavior, and until incentives are changed to account for that, it will only get worse.

Why not run every submitted paper through GPTZero (before sending to reviewers) and summarily reject any paper with a hallucination?

  • That's how GPTZero wants to situate themselves.

    Who would pay them? Conference organizers are already unpaid and understaffed, and most conferences aren't profitable.

    I think rejections shouldn't be automatic. Sometimes there are just typos. Sometimes authors don't understand BibTeX. This needs to be done in a way that reduces the workload for reviewers.

    One way of doing this would be for GPTZero to annotate each paper during the review step. If reviewers could review a version of each paper with yellow-highlighted "likely-hallucinated" references in the bibliography, then they'd bring it up in their review and they'd know to be on their guard for other probable LLM-isms. If there are only a couple of likely typos in the references, then reviewers could understand that, and if they care about it, they'd bring it up in their reviews and the author would have the usual opportunity to rebut.

    I don't know if GPTZero is willing to provide this service "for free" to the academic community, but if they are, it's probably worth bringing up at the next PAMI-TC meeting for CVPR.

    • Most publication venues already pay for a plagiarism detection service, so it seems it would be trivial to add this on as a cost. Especially given that APCs for journals are several thousand dollars, what's a few dollars more per paper?

> Even if 1.1% of the papers have one or more incorrect references due to the use of LLMs, the content of the papers themselves are not necessarily invalidated.

This statement isn’t wrong, as the rest of the paper could still be correct.

However, when I see a blatant falsification somewhere in a paper I’m immediately suspicious of everything else. Authors who take lazy shortcuts when convenient usually don’t just do it once, they do it wherever they think they can get away with it. It’s a slippery slope from letting an LLM handle citations to letting the LLM write things for you to letting the LLM interpret the data. The latter opens the door to hallucinated results and statistics, as anyone who has experimented with LLMs for data analysis will discover eventually.

  • Yep, it's a slippery slope. No one in their right mind would have tried to use GPT-2 to write part of their paper. But the hallucination error rate kept decreasing. Do you think there is an acceptable hallucination error rate greater than 0?

> NeurIPS leadership doesn’t think hallucinated references are necessarily disqualifying

That seems ridiculous.

I think a _single_ instance of an LLM hallucination should be enough to retract the whole paper and ban further submissions.

  • > For example, authors may have given an LLM a partial description of a citation and asked the LLM to produce bibtex

    This is equivalent to a typo. I’d like to know which “hallucinations” are completely made up, and which have a corresponding paper but contain some error in how it’s cited. The latter I don’t think matters.

    • If you click on the article you can see a full list of the hallucinations they found. They did put in the effort to look for plausible partial matches, but most of them are some variation of "No author or title match. Doesn't exist in publication."

      Here's a random one I picked as an example.

      Paper: https://openreview.net/pdf?id=IiEtQPGVyV

      Reference: Asma Issa, George Mohler, and John Johnson. Paraphrase identification using deep contextualized representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 517–526, 2018.

      Asma Issa and John Johnson don't appear to exist. George Mohler does, but it doesn't look like he works in this area (https://www.georgemohler.com/). No paper with that title exists. There are some with sort of similar titles (https://arxiv.org/html/2212.06933v2 for example), but none that really make sense as a citation in this context. EMNLP 2018 exists (https://aclanthology.org/D18-1.pdf), but that page range is not a single paper. There are papers in there that contain the phrases "paraphrase identification" and "deep contextualized representations", so you can see how an LLM might have come up with this title.
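
      For what it's worth, here is a minimal sketch of that kind of partial-match check. It assumes a small in-memory list of known titles standing in for a real bibliographic index; the function names, the thresholds, and the placeholder titles are all hypothetical and are not how GPTZero actually does it.

```python
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't count."""
    return " ".join(title.lower().split())


def classify_reference(cited_title, known_titles, exact=0.95, partial=0.75):
    """Return (label, best_match, score) for a cited title.

    The thresholds are arbitrary illustrative values, not calibrated ones.
    """
    best_title, best_score = None, 0.0
    cited = normalize(cited_title)
    for known in known_titles:
        # ratio() is a similarity in [0, 1]; 1.0 means identical strings.
        score = SequenceMatcher(None, cited, normalize(known)).ratio()
        if score > best_score:
            best_title, best_score = known, score

    if best_score >= exact:
        label = "match"
    elif best_score >= partial:
        label = "partial match"
    else:
        label = "no match"
    return label, best_title, best_score


if __name__ == "__main__":
    # Placeholder titles, invented for illustration only.
    index = [
        "An example paper about paraphrase detection",
        "Another example paper about contextual embeddings",
    ]
    for cited in [
        "An Example Paper  About Paraphrase Detection",
        "A title that appears nowhere in the index",
    ]:
        label, best, score = classify_reference(cited, index)
        print(f"{label:14s} {score:.2f}  {cited!r} -> {best!r}")
```

      Anything labeled "no match" would still need a human to look it up. A low score only means nothing in this particular index resembles the title, and a real check would also have to compare authors, venue, and page range, as above.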

    • It's not the equivalent of a typo. A typo would be immediately apparent to the reader. This is a semantic error that is much less likely to be caught by the reader.

  • Going through a retraction and blacklisting process is also a lot of work -- collecting evidence, giving authors a chance to respond and mediate discussion, etc.

    Labor is the bottleneck. There aren't enough academics who volunteer to help organize conferences.

    (If a reader of this comment is qualified to review papers and wants to step up to the plate and help do some work in this area, please email the program chairs of your favorite conference and let them know. They'll eagerly put you to work.)

    • That's exactly why the inclusion of a hallucinated reference is actually a blessing. Instead of going back and forth with the fraudster, just tell them to find the paper. If they can't, case closed. Massive amount of time and money saved.


  • I dunno about banning them, humans without LLMs make mistakes all the time, but I would definitely place them under much harder scrutiny in the future.

    • Hallucinations aren't mistakes, they're fabrications. The two are probably referred to by the same word in some languages.

      Institutions can choose an arbitrary approach to mistakes; maybe they don't mind a lot of them because they want to take risks and be on the bleeding edge. But any flexible attitude towards fabrications is simply corruption. The connected in-crowd will get mercy and the outgroup will get the hammer. Anybody criticizing the differential treatment will be accused of supporting the outgroup fraudsters.


Kinda gives the whole game away, doesn’t it? “It doesn’t actually matter if the citations are hallucinated.”

In fairness, NeurIPS is just saying out loud what everyone already knows. Most citations in published science are useless junk: it’s either mutual back-scratching to juice h-index, or it’s the embedded and pointless practice of overcitation, like “Human beings need clean water to survive (Franz, 2002)”.

Really, hallucinated citations are just forcing a reckoning which has been overdue for a while now.

  • > Most citations in published science are useless junk:

    Can't say that matches my experience at all. Once I've found a useful paper on a topic, I primarily navigate the literature by traveling up and down the citation graph. It's extremely effective in practice, and it has continued to get easier as the digitization of metadata has improved over the years.

  • There should be a way to drop any kind of circular citation ring from the indexes.

    • It's tough because some great citations are still hard to find or procure. I sometimes refer to papers that aren't on the Internet (e.g. wonderful old books / journals).
