GPTZero finds 100 new hallucinations in NeurIPS 2025 accepted papers

1 day ago (gptzero.me)

I spot-checked one of the flagged papers (from Google, co-authored by a colleague of mine).

The paper was https://openreview.net/forum?id=0ZnXGzLcOg and the problem flagged was "Two authors are omitted and one (Kyle Richardson) is added. This paper was published at ICLR 2024." I.e., for one cited paper, the author list was off and the venue was wrong. The citation appeared in the background section of the paper and was not fundamental to the paper's validity. So the citation was not fabricated, but it was incorrectly attributed (perhaps via use of an AI autocomplete).

I think there are some egregious papers in their dataset, and this error does make me pause to wonder how much of the rest of the paper used AI assistance. That said, the "single error" papers in the dataset seem similar to the one I checked: relatively harmless and minor errors (which would be immediately caught by a DOI checker), and so I have to assume some of these were included in the dataset mainly to amplify the author's product pitch. It succeeded.

  • >this error does make me pause to wonder how much of the rest of the paper used AI assistance

    And this is what's operative here. The error spotted, the entire class of error spotted, is easily checked/verified by a non-domain expert. These are the errors we can confirm readily, with an obvious and unmistakable signature of hallucination.

    If these are the only errors, we are not troubled. However: we do not know if these are the only errors, they are merely a signature that the paper was submitted without being thoroughly checked for hallucinations. They are a signature that some LLM was used to generate parts of the paper and the responsible authors used this LLM without care.

    Checking the rest of the paper requires domain expertise, perhaps requires an attempt at reproducing the authors' results. That the rest of the paper is now in doubt, and that this problem is so widespread, threatens the validity of the fundamental activity these papers represent: research.

    • > If these are the only errors, we are not troubled. However: we do not know if these are the only errors, they are merely a signature that the paper was submitted without being thoroughly checked for hallucinations. They are a signature that some LLM was used to generate parts of the paper and the responsible authors used this LLM without care.

      I am troubled by people using an LLM at all to write academic research papers.

      It's a shoddy, irresponsible way to work. And also plagiarism, when you claim authorship of it.

      I'd see a failure of the 'author' to catch hallucinations as more like a failure to hide evidence of misconduct.

      If academic venues are saying that using an LLM to write your papers is OK ("so long as you look it over for hallucinations"?), then those academic venues deserve every bit of operational pain and damaged reputation that will result.

      65 replies →

    • This seems like finding spelling errors and using them to cast the entire paper into doubt.

      I am unconvinced that the particular error mentioned above is a hallucination, and even less convinced that it is a sign of some kind of rampant use of AI.

      I hope to find better examples later in the comment section.

      5 replies →

    • The problem is, 10 years ago when I was still publishing even I would let an incorrect citation go through b/c of an old bibtex file or some such.

      1 reply →

    • > However: we do not know if these are the only errors, they are merely a signature that the paper was submitted without being thoroughly checked for hallucinations

      Given how stupidly tedious and error-prone citations are, I have no trouble believing that the citation error could be the only major problem with the paper, and that it's not a sign of low quality by itself. It would be another matter entirely if we were talking about something actually important to the ideas presented in the paper, but it isn't.

    • Agreed, I don't find this evidence of AI. It often happens that authors change, there are multiple venues, or I'm using an old version of the paper. We also need to see the denominator: it matters whether this Google paper had this one bad citation out of 20 references or out of 60.

      Also, everyone I know has been relying on Google Scholar for 10+ years. Is that AI-ish? There are definitely errors on there. If you would extrapolate from citation issues to the content in the age of LLMs, were you doing so back then as well?

      It's the age-old debate about spelling/grammar issues in technical work. In my experience it rarely gets to the point that these errors, e.g. from non-native speakers, affect my interpretation. Others claim to infer shoddy content from them.

    • Google Scholar and the vagaries of copy/paste errors have mangled BibTeX ever since it became a thing; a single citation with these sorts of errors may not even be AI, just “normal” mistakes.

  • The missing analysis is, of course, a comparison with pre-LLM conferences, like 2022 or 2023, which would show a “false positive” rate for the tool.

  • The thing is, when you copy-paste a bibliography entry from the publisher or from Google Scholar, the authors won't be wrong. In this case, they are. If I were to write a paper with AI, I would at least manage the bibliography by hand, conscious of hallucinations. The fact that the hallucination is in the bibliography is a pretty strong indicator that the paper was written entirely with AI.

    • Google Scholar provides imperfect citations - very often the wrong article type (e.g. article versus conference paper), but up to and including missing authors, in my experience.

      1 reply →

    • I'm not sure I agree... while I don't ever see myself writing papers with AI, I hate wrangling a bibtex bibliography.

      I wouldn't trust today's GPT-5-with-web-search to turn a bullet-point list of papers into proper citations without checking myself, but maybe I will trust GPT-X-plus-agent to do this.

      2 replies →

  • As with anything, it is about trusting your tools. Who is culpable for such errors? In the days of human authors, the person writing the text is responsible for not making these errors. When AI does the writing, the person whose name is on the paper should still be responsible—but do they know that? Do they realize the responsibility they are shouldering when they use these AI tools? I think many times they do not; we implicitly trust the outputs of these tools, and the dangers of that are not made clear.

  • Agreed.

    What I find more interesting is how easy these errors are to introduce and how unlikely they are to be caught. As you point out, a DOI checker would immediately flag this. But citation verification isn’t a first-class part of the submission or review workflow today.

    We’re still treating citations as narrative text rather than verifiable objects. That implicit trust model worked when volumes were lower, but it doesn’t seem to scale anymore.

    There’s a project I’m working on at Duke University, where we are building a system that tries to address exactly this gap by making references and review labor explicit and machine-verifiable at the infrastructure level. There’s a short explainer that lays out what we mean, in case more context is useful: https://liberata.info/

    • Citation checks are a workflow problem, not a model problem. Treat every reference as a dependency that must resolve and be reproducible. If the checker cannot fetch and validate it, it does not ship.
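
      A minimal sketch of what that could look like, assuming the public Crossref REST API (api.crossref.org) and a bibliography entry already parsed into a dict; the field names and the example entry below are illustrative, not any conference's actual tooling:

        import requests

        def crossref_metadata(doi):
            """Resolve a DOI against Crossref; return its metadata, or None if it does not resolve."""
            resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
            return resp.json()["message"] if resp.status_code == 200 else None

        def check_reference(ref):
            """Return a list of problems for one parsed bibliography entry."""
            meta = crossref_metadata(ref["doi"])
            if meta is None:
                return [f"DOI {ref['doi']} does not resolve"]
            problems = []
            registered_title = (meta.get("title") or [""])[0].lower()
            if ref["title"].lower() not in registered_title:
                problems.append("title does not match the registered record")
            registered_surnames = {a.get("family", "").lower() for a in meta.get("author", [])}
            for surname in ref["authors"]:
                if surname.lower() not in registered_surnames:
                    problems.append(f"author '{surname}' is not on the registered record")
            return problems

        # Hypothetical entry reproducing the failure mode discussed above: a real paper, one wrong author.
        ref = {"doi": "10.1038/nature14539",  # assumed DOI of LeCun/Bengio/Hinton, "Deep learning", Nature 2015
               "title": "Deep learning",
               "authors": ["LeCun", "Bengio", "Richardson"]}
        print(check_reference(ref))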

  • I see your point, but I don’t see where the author makes any claims about the specifics of the hallucinations, or their impact on the papers’ broader validity. Indeed, I would have found the removal of supposedly “innocuous” examples to be far more deceptive than simply calling a spade a spade, and allowing the data to speak for itself.

    • The author calls the mistakes "confirmed hallucinations" without proof (just more or less evidence). The data never "speak for itself." The author curates the data and crafts a story about it. This story presented here is very suggestive (even using the term "hallucination" is suggestive). But calling it "100 suspected hallucinations", or "25 very likely hallucinations" does less for the author's end goal: selling their service.

      1 reply →

  • BibTeX entries are often also incorrectly generated. E.g., Google Scholar sometimes puts the names of the editors instead of the authors into the entry.

    • > BibTeX entries are often also incorrectly generated

      ...and including the erroneous entry is squarely the author's fault.

      Papers should be carefully crafted, not churned out.

      I guess that makes me sweetly naive

      6 replies →

  • The rate here (about 1% of papers) just doesn't seem that bad, especially if many of the errors are minor and don't affect the validity of the results. In other fields, over half of high-impact studies don't replicate.

  • Yeah, even for the whole "Jane Doe / Jane Smith" thing, my first thought is that it could have been a LaTeX template default value.

    There was dumb stuff like this before the GPT era; it's far from convincing.

    • > Between 2020 and 2025, submissions to NeurIPS increased more than 220% from 9,467 to 21,575. In response, organizers have had to recruit ever greater numbers of reviewers, resulting in issues of oversight, expertise alignment, negligence, and even fraud.

      I don’t think the point being made is “errors didn’t happen pre-GPT”; rather, the task of detecting errors has become increasingly difficult because of the associated effects of GPT.

      7 replies →

    • There are people who just want to punish academics for the sake of punishing academics. Look at all the people downthread salivating over blacklisting or even criminally charging people who make errors like this with felony fraud. It's the perfect brew of anti-AI and anti-academia sentiment.

      Also, in my field (economics), by far the biggest source of finding old papers invalid (or less valid; most papers state multiple results) is good old-fashioned coding bugs. I'd like to see the software engineers on this site say with a straight face that writing bugs should lead to jail time.

      7 replies →

  • > relatively harmless and minor errors

    They are not harmless. These hallucinated references are ingested by Google Scholar, Scopus, etc., and with enough time they will poison those wells. It is also plain academic malpractice, no matter how "minor" the reference is.

  • > So the citation was not fabricated, but it was incorrectly attributed (perhaps via use of an AI autocomplete).

    Well the title says ”hallucinations”, not ”fabrications”. What you describe sounds exactly like what AI builders call hallucinations.

    • Read the article. The author uses the word "fabricate" repeatedly to describe the situation where the wrong authors are in the citation.

  • This is par for the course for GPTZero, which also falsely claims they can detect AI generated text, a fundamentally impossible task to do accurately.

  • The earlier list of ICLR papers had way more egregious examples. Those were taken from the list of submissions, not accepted papers, however.

  • The example you provided doesn't sit right with me.

    If the mistake is one wrong author and venue in a citation, I find it fairly disingenuous to call that a hallucination. At least, it doesn't meet the threshold for me.

    I have seen this kind of mistake made long before LLMs were even a thing. We used to call them just that: mistakes.

  • Sorry, but blaming it on "AI autocomplete" is the dumbest excuse ever. Author lists come from BibTeX entries, and while they often contain errors since they can come from many sources, they do not contain completely made-up authors. I don't share your view that hallucinated citations are less damaging in the background section. Background, related work, and introduction are the sections where citations most often show up. These sections are meant to be read, and generating them with AI is plain cheating.

    • I'm not blaming anything on anything, because I did not (nor did the authors) confirm the cause of any of these errors.

      > I don't share your view that hallucinated citations are less damaging in background section.

      Who exactly is damaged in this particular instance?

      1 reply →

Yuck, this is going to really harm scientific research.

There is already a problem with papers falsifying data/samples/etc.; LLMs being able to put out plausible papers is just going to make it worse.

On the bright side, maybe this will get the scientific community and science journalists to finally take reproducibility more seriously. I'd love to see future reporting that instead of saying "Research finds amazing chemical x which does y" you see "Researcher reproduces amazing results for chemical x which does y. First discovered by z".

  • In my mental model, the fundamental problem of reproducibility is that scientists have a very hard time finding a penny to fund such research. No one wants to grant “hey, I need $1m and 2 years to validate the paper from last year which looks suspicious”.

    Until we can change how we fund science on a fundamental level (how we assign grants), it will indeed be a very hard problem to deal with.

    • In theory, asking grad students and early career folks to run replications would be a great training tool.

      But the problem isn’t just funding, it’s time. Successfully running a replication doesn’t get you a publication to help your career.

      29 replies →

    • Funding is definitely a problem, but frankly reproduction is common. If you build off someone else's work (as is the norm) you need to reproduce first.

      But without replication being impactful to your career, and with the pressure to quickly and constantly push new work, a failure to reproduce is generally considered a reason to move on and tackle a different domain. It takes longer to trace the failure, and the bar is higher to counter an existing work. It's much more likely you've made a subtle mistake. It's much more likely the other work had a subtle success. It's much more likely the other work simply wasn't written up such that it could be sufficiently reproduced.

      I speak from experience too. I still remember in grad school I was failing to reproduce a work that was the main competitor to the work I had done (I needed to create comparisons). I emailed the author and got no response. Luckily my advisor knew the author's advisor and we got a meeting set up and I got the code. It didn't do what was claimed in the paper and the code structure wasn't what was described either. The result? My work didn't get published and we moved on. The other work was from a top 10 school and the choice was to burn a bridge and put a black mark on my reputation (from someone with far more merit and prestige) or move on.

      That type of thing won't change just by adding a reproduction system; it needs an open system and open reproduction as well. Mistakes are common and we shouldn't punish them. The only way to solve these issues is openness.

      2 replies →

    • Partially. There's also the issue that some sciences, like biology, are a lot messier & less predictable than people like to believe.

    • Yes, this should be built into grants and publishing.

      Of course, the problem is that academia likes to assert its autonomy (and grant orgs are largely staffed by academics).

    • I often think we should move from peer review as "certification" to peer review as "triage", with replication determining how much trust and downstream weight a result earns over time.

      1 reply →

  • > I'd love to see future reporting that instead of saying "Research finds amazing chemical x which does y" you see "Researcher reproduces amazing results for chemical x which does y. First discovered by z".

    Most people (that I talk to, at least) in science agree that there's a reproducibility crisis. The challenge is there really isn't a good way to incentivize that work.

    Fundamentally (unless you're independently wealthy and funding your own work), you have to measure productivity somehow, whether you're at a university, government lab, or the private sector. That turns out to be very hard to do.

    If you measure raw number of papers (more common in developing countries and low-tier universities), you incentivize a flood of junk. Some of it is good, but there is such a tidal wave of shit that most people write off your work as a heuristic based on the other people in your cohort.

    So, instead it's more common to try to incorporate how "good" a paper is, to reward people with a high quantity of "good" papers. That's quantifying something subjective though, so you might try to use something like citation count as a proxy: if a work is impactful, usually it gets cited a lot. Eventually you may arrive at something like the H-index, which is defined as the largest number H such that you have written H papers with at least H citations each. Now, the trouble with this method is people won't want to "waste" their time on incremental work.
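
    To make that definition concrete, here's a toy calculation (my own illustration, not tied to how any particular bibliometrics service computes or reports it):

      def h_index(citation_counts):
          """Largest h such that at least h papers have at least h citations each."""
          counts = sorted(citation_counts, reverse=True)
          h = 0
          for rank, count in enumerate(counts, start=1):
              if count >= rank:
                  h = rank
              else:
                  break
          return h

      print(h_index([10, 8, 5, 4, 3]))  # 4: four papers with at least 4 citations each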

    And that's the struggle here; even if we funded and rewarded people for reproducing results, they will always be bumping up the citation count of the original discoverer. But it's worse than that, because literally nobody is going to cite your work. In 10 years, they just see the original paper, a few citing works reproducing it, and to save time they'll just cite the original paper only.

    There's clearly a problem with how we incentivize scientific work. And clearly we want to be in a world where people test reproducibility. However, it's very very hard to get there when one's prestige and livelihood is directly tied to discovery rather than reproducibility.

    • I'd personally like to see top conferences grow a "reproducibility" track. Each submission would be a short tech report that chooses some other paper to re-implement. Cap 'em at three pages, have a lightweight review process. Maybe there could be artifacts (git repositories, etc) that accompany each submission.

      This would especially help newer grad students learn how to begin to do this sort of research.

      Maybe doing enough reproductions could unlock incentives. Like if you do 5 reproductions then the AC would assign your next paper double the reviewers. Or, more invasively, maybe you can't submit to the conference until you complete some reproduction.

      10 replies →

    • > The challenge is there really isn't a good way to incentivize that work.

      What if we got undergrads (with hopes of graduate studies) to do it? It could be a great way to train them in the skills required for research without the pressure of the work also having to be novel.

      4 replies →

    • > Eventually you may arrive at something like the H-index, which is defined as "The highest number H you can pick, where H is the number of papers you have written with H citations."

      It's the Google search algorithm all over again. And it's the certificate trust hierarchy all over again. We keep working on the same problems.

      Like the two cases I mentioned, this is a matter of making adjustments until you have the desired result. Never perfect, always improving (well, we hope). This means we need liquidity with the rules and heuristics. How do we best get that?

      4 replies →

    • > I'd love to see future reporting that instead of saying "Research finds amazing chemical x which does y" you see "Researcher reproduces amazing results for chemical x which does y. First discovered by z".

      But nobody wants to pay for it.

    • Usually you reproduce previous research as a byproduct of doing something novel "on top" of the previous result. I don't really see the problem with the current setup.

      Sometimes you can just do something new and assume the previous result, but that's more the exception. You're almost always going to at least in part reproduce the previous one, and if issues come up, it's often evident.

      That's why citations work as a good proxy: X number of people have done work based around this finding and nobody has seen a clear problem.

      There's a problem of people fabricating and fudging data and not making their raw data available ("on request", or with not enough metadata to be useful), which wastes everyone's time and almost never leads to negative consequences for the authors.

      2 replies →

    • That feels arbitrary as a measure of quality. Why isn't new research simply devalued and replication valued higher?

      "Dr Alice failed to reproduce 20 would-be headline-grabbing papers, preventing them from sucking all the air out of the room in cancer research" is something laudable, but we're not lauding it.

    • > you have to measure productivity somehow,

      No, you do not have to. You give the money to people with the skills and interest in doing research. You need to ensure it's spent correctly, that is all. People will be motivated by wanting to build a reputation and by the intrinsic reward of the work.

    • > If you measure raw number of papers (more common in developing countries and low-tier universities), you incentivize a flood of junk.

      This is exactly what rewarding replication papers (that reproduce and confirm an existing paper) will lead to.

      1 reply →

    • > The challenge is there really isn't a good way to incentivize that work.

      Ban publication of any research that hasn't been reproduced.

      4 replies →

  • Have they solved the issue where papers that cite research already invalidated are still being cited?

    • AFAIK, no, but I could see there being cause to push citations to also cite the validations. It'd be good if standard practice turned into something like

      Paper A, by bob, bill, brad. Validated by Paper B by carol, clare, charlotte.

      or

      Paper A, by bob, bill, brad. Unvalidated.

      2 replies →

    • Nope.

      I am still reviewing papers that propose solutions based on a technique X, conveniently ignoring research from two years ago that shows that X cannot be used on its own. Both the paper I reviewed and the research showing X cannot be used are in the same venue!

      3 replies →

  • Reproducibility is overrated and if you could wave a wand to make all papers reproducible tomorrow, it wouldn't fix the problem. It might even make it worse.

    https://blog.plan99.net/replication-studies-cant-fix-science...

    • ? More samples reduce the variance of a statistic. Obviously it cannot identify systematic bias in a model, or establish causality, or make a "bad" question "good". It's not overrated though -- it would strengthen or weaken the case for many papers.

      7 replies →

  • For ML/AI/Comp sci articles, providing reproducible code is a great option. Basically, PoC or GTFO.

    • The most annoying ones are those which discuss loosely the methodology but then fail to publish the weights or any real algorithms.

      It's like buying a piece of furniture from IKEA, except you just get an Allen key, a hint at what parts to buy, and blurry instructions.

      1 reply →

  • Yeah, spot on. If all we do is add more plausible-sounding text on top of already fragile review and incentive structures, that really could make things worse rather than better.

    Your second point is the important one. AI may be the thing that finally forces the community to take reproducibility, attribution, and verification seriously. That’s very much the motivation behind projects like Liberata, which try to shift publishing away from novelty-first narratives and toward explicit credit for replication, verification, and follow-through. If that cultural shift happens, this moment might end up being a painful but necessary correction.

  • If there is one thing scientific reports must require, it is not using AI to produce the write-up. AI can be used on the data, but not for the text or anything else. AI is a tool, not a replacement for actual work.

  • > LLMs being able to put out plausible papers is just going to make it worse

    If correct form (LaTeX two-column formatting, quoting the right papers and authors of the year etc.) has been allowing otherwise reject-worthy papers to slip through peer review, academia arguably has bigger problems than LLMs.

    • Correct form and relevant citations have been, for generations up to a couple of years ago, mighty strong signals that a work is good and done by a serious and reliable author. This is no longer the case and we are worse off for it.

  • I think, or at least hope, that part of the value of LLMs will be to build the specialized tools that retire them for specific needs. Instead of asking an LLM to solve any problem, restrict the space to a tool that can help you reach your goal faster, without the statistical nature of LLMs.

  • On the bright side, an LLM can really help set up a reproduction environment.

    Perhaps repro should become the basis of peer review?

    • No, it can't. No LLM can purchase the equipment and chemicals and machinery you need to reproduce experiments, nor should you want it to.

  • I heard that most papers in a given field are already not adding any value. (Maybe it depends on the field though.)

    There seems to be a rule in every field that "99% of everything is crap." I guess AI adds a few more nines to the end of that.

    The gems are lost in a sea of slop.

    So I see useless output (e.g. crap on the app store) as having negative value, because it takes up time and space and energy that could have been spent on something good.

    My point with all this is that it's not a new problem. It's always been about curation. But curation doesn't scale. It already didn't. I don't know what the answer to that looks like.

  • Reading the article, this is about CITATIONS which are trivially verifiable.

    This is just article publishers not doing the most basic verification and failing to notice that the citations in the article don't exist.

    What this should trigger is a black mark for all of the authors and their institutions, both of which should receive significant reputational repercussions for publishing fake information. If they fake the easiest to verify information (does the cited work exist) what else are they faking?

  • I'd need to see the same scrutiny applied to pre-AI papers. If a field has a poor replication rate, meaning there's a good chance that a given published paper is just so much junk science, is that better or worse than letting AI hallucinate the data in the first place?

  •   > to finally take reproducibility more seriously
    

    I've long argued for this, as reproduction is the cornerstone of science. There are a lot of potential ways to do this, but one that I like is linking reproduction efforts to the original work. Suppose you're looking at the OpenReview page and it has a link for "reproduction efforts", with at minimum an annotation for confirmation or failure.

    This is incredibly helpful to the community as a whole. Reproduction failures can be valuable even when the original work has no fraud. In those cases, a failure to reproduce reveals important information about the necessary conditions that the original work relies on.

    But honestly, we'll never get this until we drop the entire notion of "novel" or "impact" and "publish or perish". Novel is in the eye of the reviewer, and the lower the reviewer's expertise the less novel a work seems (nothing is novel at a high enough level). Impact can almost never be determined a priori, and when it can you already have people chasing those directions because why the fuck would they not? But publish or perish is the biggest sin. It's one of those ideas that looks nice on paper, like you are meaningfully determining who is working hard and who is hardly working. But the truth is that you can't tell without being in the weeds. The real result is that this stifles creativity, novelty, and impact as it forces researchers to chase lower-hanging fruit: things you're certain will work and can get published. It creates a negative feedback loop as we compete: "X publishes 5 papers a year, why can't you?" I've heard these words even when X has far fewer citations (each of my works had "more impact").

    Frankly, I believe fraud would drop dramatically were researchers not risking job security. The fraud is incentivized by the cutthroat system where you're constantly trying to defend your job, your work, and your grants. There'll always be some fraud, but (with a few exceptions) researchers aren't rockstar millionaires. It takes a lot of work to get to the point where fraud even works, so there's a natural filter.

    I have the same advice as Mervin Kelly, former director of Bell Labs:

      How do you manage genius?
      You don't

This feels less like scientific integrity and more like predatory marketing. I find this public "shame list" approach by GPTZero deeply unethical and technically suspect for several reasons:

1. Doxxing disguised as specific criticism: Publishing the names of authors and papers without prior private notification or independent verification is not how academic corrections work. It looks like a marketing stunt to generate buzz at the expense of researchers' reputations.

2. False Positives & Methodology: How does their tool distinguish between an actual AI "hallucination" and a simple human error (e.g., a typo in a year, a broken link, or a messy BibTeX entry)? Labeling human carelessness as "AI fabrication" is libelous.

3. The "Protection Racket" Vibe: The underlying message seems to be: "Buy our tool, or next time you might be on this list." It’s creating a problem (fear of public shaming) to sell the solution.

We should be extremely skeptical of a vendor using a prestigious conference as a billboard for their product by essentially publicly shaming participants without due process.

  • I think it's great.

    They explicitly distinguish between a "flawed citation" (missing author, typo in title) and a hallucination (completely fabricated journal, fake DOI, nonexistent authors). You can literally click through and verify each one yourself. If you think they're wrong about a specific example, point it out. It doesn't matter if these are honest mistakes or not - they should be highlighted and you should be happy to have a tool that can find them before you publish.

    It's ridiculous to call it doxxing. The papers are already published at NeurIPS with author names attached. GPTZero isn't revealing anything that wasn't already public. They are pointing out what they think are hallucinations which everyone can judge for themselves.

    It might even be terrible at detecting things. Which actually, I do not think is the case after reading the article. But even so, if they are unreliable I think the problem takes care of itself.

  • Don't expect ethics from GPTZero. If you upload a large document, they'll give a fake 100% AI rating behind a blur until you pay up to get the actual analysis. This clearly serves to prey on paranoid authors who are worried about being perceived as using AI.

NeurIPS leadership doesn’t think hallucinated references are necessarily disqualifying; see the full article from Fortune for a statement from them: https://archive.ph/yizHN

> When reached for comment, the NeurIPS board shared the following statement: “The usage of LLMs in papers at AI conferences is rapidly evolving, and NeurIPS is actively monitoring developments. In previous years, we piloted policies regarding the use of LLMs, and in 2025, reviewers were instructed to flag hallucinations. Regarding the findings of this specific work, we emphasize that significantly more effort is required to determine the implications. Even if 1.1% of the papers have one or more incorrect references due to the use of LLMs, the content of the papers themselves are not necessarily invalidated. For example, authors may have given an LLM a partial description of a citation and asked the LLM to produce bibtex (a formatted reference). As always, NeurIPS is committed to evolving the review and authorship process to best ensure scientific rigor and to identify ways that LLMs can be used to enhance author and reviewer capabilities.”

  • > the content of the papers themselves are not necessarily invalidated. For example, authors may have given an LLM a partial description of a citation and asked the LLM to produce bibtex (a formatted reference)

    Maybe I'm overreacting, but this feels like an insanely biased response. They found the one potentially innocuous reason and latched onto that as a way to hand-wave the entire problem away.

    Science already had a reproducibility problem, and it now has a hallucination problem. Considering the massive influence the private sector has on both the work and the institutions themselves, the future of open science is looking bleak.

    • I found at least one example[0] of authors claiming the reason for the hallucination was exactly this. That said, I do think for this kind of use, authors should go to the effort of verifying the correctness of the output. I also tend to agree with others who have commented that while a hallucinated citation or two may not be particularly egregious, it does raise concerns about what other errors may have been missed.

      [0] https://openreview.net/forum?id=IiEtQPGVyV&noteId=W66rrM5XPk

    • I don’t read the NeurIPS statement as malicious per se, but I do think it’s incomplete.

      They’re right that a citation error doesn’t automatically invalidate the technical content of a paper, and that there are relatively benign ways these mistakes get introduced. But focusing on intent or severity sidesteps the fact that citations, claims, and provenance are still treated as narrative artifacts rather than things we systematically verify.

      Once that’s the case, the question isn’t whether any single paper is “invalid” but whether the workflow itself is robust under current incentives and tooling.

      A student group at Duke has been trying to think about this with Liberata, i.e. what publishing looks like if verification, attribution, and reproducibility are first-class rather than best-effort.

      They have a short explainer that lays out the idea, in case more context is useful: https://liberata.info/

    • Isn't disqualifying X months of potentially great research due to a malformed, but existing, reference harsh? I don't think they'd be okay with references that are actually made up.

      8 replies →

  • This will continue to happen as long as it is effectively unpunished. Even retracting the paper would do little good, as odds are it would not have been written if the author could not have used an LLM, so they are no worse off for having tried. Scientific publications are mostly a numbers game at this point. It is just one more example of a situation where behaving badly is much cheaper than policing bad behavior, and until incentives are changed to account for that, it will only get worse.

  • Why not run every submitted paper through GPTZero (before sending to reviewers) and summarily reject any paper with a hallucination?

    • That's how GPTZero wants to situate themselves.

      Who would pay them? Conference organizers are already unpaid and understaffed, and most conferences aren't profitable.

      I think rejections shouldn't be automatic. Sometimes there are just typos. Sometimes authors don't understand BibTeX. This needs to be done in a way that reduces the workload for reviewers.

      One way of doing this would be for GPTZero to annotate each paper during the review step. If reviewers could review a version of each paper with yellow-highlighted "likely-hallucinated" references in the bibliography, then they'd bring it up in their review and they'd know to be on their guard for other probably LLM-isms. If there's only a couple likely typos in the references, then reviewers could understand that, and if they care about it, they'd bring it up in their reviews and the author would have the usual opportunity to rebut.

      I don't know if GPTZero is willing to provide this service "for free" to the academic community, but if they are, it's probably worth bringing up at the next PAMI-TC meeting for CVPR.

      1 reply →

  • > Even if 1.1% of the papers have one or more incorrect references due to the use of LLMs, the content of the papers themselves are not necessarily invalidated.

    This statement isn’t wrong, as the rest of the paper could still be correct.

    However, when I see a blatant falsification somewhere in a paper I’m immediately suspicious of everything else. Authors who take lazy shortcuts when convenient usually don’t just do it once, they do it wherever they think they can get away with it. It’s a slippery slope from letting an LLM handle citations to letting the LLM write things for you to letting the LLM interpret the data. The latter opens the door to hallucinated results and statistics, as anyone who has experimented with LLMs for data analysis will discover eventually.

    • Yep, it's a slippery slope. No one in their right mind would have tried to use GPT-2 to write part of their paper. But the hallucination error rate kept decreasing. What do you think: is there an acceptable hallucination error rate greater than 0?

  • >NeurIPS leadership doesn’t think hallucinated references are necessarily disqualifying

    That seems ridiculous.

  • I think a _single_ instance of an LLM hallucination should be enough to retract the whole paper and ban further submissions.

    •    For example, authors may have given an LLM a partial description of a citation and asked the LLM to produce bibtex
      

      This is equivalent to a typo. I’d like to know which “hallucinations” are completely made up, and which have a corresponding paper but contain some error in how it’s cited. The latter I don’t think matters.

      2 replies →

    • Going through a retraction and blacklisting process is also a lot of work -- collecting evidence, giving authors a chance to respond and mediate discussion, etc.

      Labor is the bottleneck. There aren't enough academics who volunteer to help organize conferences.

      (If a reader of this comment is qualified to review papers and wants to step up to the plate and help do some work in this area, please email the program chairs of your favorite conference and let them know. They'll eagerly put you to work.)

      2 replies →

    • I dunno about banning them; humans without LLMs make mistakes all the time. But I would definitely place them under much harder scrutiny in the future.

      3 replies →

  • Kinda gives the whole game away, doesn’t it? “It doesn’t actually matter if the citations are hallucinated.”

    In fairness, NeurIPS is just saying out loud what everyone already knows. Most citations in published science are useless junk: it’s either mutual back-scratching to juice h-index, or it’s the embedded and pointless practice of overcitation, like “Human beings need clean water to survive (Franz, 2002)”.

    Really, hallucinated citations are just forcing a reckoning which has been overdue for a while now.

    • > Most citations in published science are useless junk:

      Can't say that matches my experience at all. Once I've found a useful paper on a topic, I primarily navigate the literature from there by traveling up and down the citation graph. It's extremely effective in practice, and it's continued to get easier to do as the digitization of metadata has improved over the years.

I was getting completely AI-generated reviews for a WACV publication back in 2024. The area chairs are so overworked that authors don't have much recourse, which sucks but is also really hard to handle unless more volunteers step up to the plate to help organize the conference.

(If you're qualified to review papers, please email the program chair of your favorite conference and let them know -- they really need the help!)

As for my review, the review form has a textbox for a summary, a textbox for strengths, a textbox for weaknesses, and a textbox for overall thoughts. The review I received included one complete set of summary/strengths/weaknesses/closing thoughts in the summary text box, another distinct set of summary/strengths/weaknesses/closing thoughts in the strengths, another complete and distinct review in the weaknesses, and a fourth complete review in the closing thoughts. Each of these four reviews was slightly different, and they contradicted each other.

The reviewer put my paper down as a weak reject, but also said "the pros greatly outweigh the cons."

They listed "innovative use of synthetic data" as a strength, and "reliance on synthetic data" as a weakness.

The ironic part about these hallucinations is that a research paper includes a literature review because the goal of the research is to be in dialogue with prior work, to show a gap in the existing literature, and to further the knowledge that this prior work has built.

By using an LLM to fabricate citations, authors are moving away from this noble pursuit of knowledge built on the "shoulders of giants" and show that, behind the curtain, output volume is what really matters in modern US research communities.

Wow! They're literally submitting references to papers by Firstname Lastname, John Doe and Jane Smith and nobody is noticing or punishing them.

Getting a NeurIPS paper published is extremely lucrative, especially your first one as a PhD student.

Most big tech PhD intern job postings have NeurIPS/ICML/ICLR/etc. first author paper as a de facto requirement to be considered. It's like getting your SAG card.

If you get one of these internships, it effectively doubles or triples your salary that year right away. You will make more in that summer than your PhD stipend. Plus you can now apply in future summers and the jobs will be easier to get. And it sets your career on a good path.

A conservative estimate of the discounted cash value of a student's first NeurIPS paper would certainly be five figures. It's potentially much higher depending on how you think about it, considering potential path dependent impacts on future career opportunities.

We should not be surprised to see cheating. Nonetheless, it's really bad for science that these attempts get through. I also expect some people did make legitimate mistakes letting AI touch their .bib.

  • This is 100% true, if anything you’re massively undercounting the value of publications.

    Most industry AI jobs that aren’t research based know that NeurIPS publications are a huge deal. Many of the managers don’t even know what a workshop is (so you can pass off NeurIPS workshop work as just “NeurIPS”)

    A single first-author main-conference paper effectively allows a non-PhD holder to be treated like they have a PhD (i.e., be qualified for professional researcher jobs). This means that a decent engineer with 1 NeurIPS publication is easily worth 300K+ a year, assuming US citizenship. Even if all they have is a BS ;)

    And if you are lucky to get a spotlight or an oral, that’s probably worth closer to 7 figures…

Could you run a similar analysis for pre-2020 papers? It'd be interesting to know how prevalent making up sources was before LLMs.

  • Also, it'd be interesting how many pre-2020 papers their "AI detector" marks as AI-generated. I distrust LLMs somewhat, but I distrust AI detectors even more.

  • At the end of the article they make a clear distinction between flawed and hallucinated citations. I feel it's hard to argue that a hallucinated citation emerges through a simple mistake:

    > Real Citation: Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521:436-444, 2015.

    > Flawed Citation: Y. LeCun, Y. Bengio, and Geoff Hinton. Deep leaning. nature, 521(7553):436-444, 2015.

    > Hallucinated Citation: Samuel LeCun Jackson. Deep learning. Science & Nature: 23-45, 2021.

  • Yeah, it’s kind of meaningless to attribute this to AI without measuring the base rate.

    It’s for sure plausible that it’s increasing, but I’m certain this kind of thing happened with humans too.

There's a lot of good arguments in this thread about incentives: extremely convincing about why current incentives lead to exactly this behaviour, and also why creating better incentives is a very hard problem.

If we grant that good carrots are hard to grow, what's the argument against leaning into the stick? Change university policies and processes so that getting caught fabricating data or submitting a paper with LLM hallucinations is a career ending event. Tip the expected value of unethical behaviours in favour of avoiding them. Maybe we can't change the odds of getting caught but we certainly can change the impact.

This would not be easy, but maybe it's more tractable than changing positive incentives.

  • the harsher the punishment, the more due process required.

    i don't think there are any AI detection tools that are sufficiently reliable that I would feel comfortable expelling a student or ending someone's career based on their output.

    for example, we can all see what's going on with these papers (and it appears to be even worse among ICLR submissions). but it is possible to make an honest mistake with your BibTeX. Or to use AI for grammar editing, which is widely accepted, and have it accidentally modify a data point or citation. There are many innocent mistakes which also count as plausible excuses.

    in some cases further investigation maybe can reveal a smoking gun like fabricated data, which is academic misconduct whether done by hand or because an AI generated the LaTeX tables. punishments should be harsher for this than they are.

    • Fabricated citations seem to be a popular and non ambiguous way for AI to sabotage science.

I'm surprised by these results. I would have expected non-Anglo-American universities to rank at the top of the list. One of the most valuable features of LLMs from the beginning has been their ability to improve written language. This is particularly beneficial for non-English-speaking researchers in preventing language-related biases. However, the list shows that LLM usage is more intensive in the English-speaking world. Why?

So the headline says

>GPTZero finds 100 new hallucinations in NeurIPS 2025 accepted papers

And I'm left wondering if they mean 100 papers or 100 hallucinations

The subheading says

>GPTZero's analysis 4841 papers accepted by NeurIPS 2025 show there are at least 100 with confirmed hallucinations

Which accidentally a word, but seems to clarify that they do legitimately mean 100 papers.

A later heading says

>Table of 100 Hallucinated Citations in Published Across 53 NeurIPS Papers

Which suggests either the opposite, or that they chose a subset of their findings to point out a coincidentally similar number of incidents.

How many papers did they find hallucinations in? I'm still not certain. Is it 100, 53, or some other number altogether? Does their quality of scrutiny match the quality of their communication? If they did in fact find 100 hallucinations in 53 papers, would the inconsistency with their claim that "papers accepted by NeurIPS 2025 show there are at least 100 with confirmed hallucinations" meet their own bar for a hallucination?

  • They counted multiple hallucinations in a single paper toward the 100, and explicitly call out one paper with 13 incorrect citations that are claimed (reasonably, IMO) to be hallucinated.

    • So you are saying their claim of

      >GPTZero's analysis 4841 papers accepted by NeurIPS 2025 show there are at least 100 with confirmed hallucinations

      Is not true. [Edit: that sounds a bit harsh, making it seem like you are accusing them; it's more that this is the logical conclusion of your (IMO reasonable) interpretation.]

Getting papers published is now more about embellishing your CV than about a sincere desire to present new research. I see this everywhere at every level. Getting a paper published anywhere is a checkbox in completing your resume. As an industry we need to stop taking this into consideration when reviewing candidates or deciding pay. In some sense it has become an anti-signal.

  • It'd be nice if there were a dedicated journal for papers published just because you have to publish for your CV or to get your degree. That way people can keep publishing for the sake of publishing, but you could see at a glance what the deal was.

  • I think it's fairer to say that perverse incentives have added more noise to the publishing signal. Publishing 0 times is not better than publishing 100 times, even if 90% of those are Nth-author formality/politeness citations.

  • I'd like to see a financial approach to deciding pay by giving researchers a small and perhaps nonlinear or time bounded share of any profits that arise from their research.

    Then people's CVs could say "My inventions have led to $1M in licensing revenue" rather than "I presented a useless idea at a decent conference because I managed to make it sound exciting enough to get accepted".

The innumeracy is load-bearing for the entire media ecosystem. If readers could do basic proportional reasoning, half of health journalism and most tech panic coverage would collapse overnight.

GPTZero of course knows this. "100 hallucinations across 53 papers at prestigious conference" hits different than "0.07% of citations had issues, compared to unknown baseline, in papers whose actual findings remain valid."

  • I’m not sure that’s fair in this context.

    In the past, a single paper with questionable or falsified results at a top tier conference was big news.

    Something that casts doubt on the validity of 53 papers at a top AI conference is at least notable.

    > whose actual findings remain valid

    Remain valid according to who? The same group that missed hundreds of hallucinated citations?

    • Which of these papers had falsified results and not bad citations?

      What is the base rate of bad citations pre-AI?

      And finally, yes. Peer review does not mean clicking every link in the footnotes to make sure the original paper didn't mislink, though I'm sure after this brouhaha this too will be automated.

      1 reply →

  • > "0.07% of citations had issues

    Nope, you are getting this part wrong. On purpose or by accident? Because it's pretty clear if you read the article they are not counting all citations that simply had issues. See "Defining Hallucinated Citations".

I wrote before about my embarrassing time with ChatGPT during a period (https://news.ycombinator.com/item?id=44767601). I decided to go back through those old 4o chats with 5.2 pro extended thinking, and the reply was pretty funny because it first slightly ridiculed me, heh. But what it showed was: basically I would say "what 5 research papers from any area of science talk to these ideas" and it would find 1 and invent 4 if it didn't know 4 others, and not tell me. Then I'd keep working with it and it would invent what it thought might be in the papers along the way, making up new papers to cite in its own work to make its own work look valid, lol. Anyway, I'm a moron, sure, and no real harm came of it for me, just still slightly shook I let that happen to me.

  • Just to clarify, you didn't actually look up the publications it was citing? For example, you just stayed in ChatGPT web and used the resources it provided there? Not ridiculing you of course, but am just curious. The last paper I wrote a couple months back I had GPT search out the publications for me, but I would always open a new tab and retrieve the actual publication.

    • I didn't, because I wasn't really doing anything serious to my mind, I think? It basically felt like watching an episode of PBS Spacetime; I think the difference is it's more like playing a video game while thinking you're watching an episode of Spacetime, if that makes sense? I don't use ChatGPT for my real work that much, and I'm not a scientist, so for me it was just mucking around. It pushed me slightly over a line into "I was just playing but now this seems real", and it didn't occur to me to go back through and check all the papers, I guess because quite a lot of chatting had happened since then and, I dunno, I just didn't think to? Not sure that makes much sense. This was also over a year ago, during the time they had the GPT-4o sycophancy mode that made the news, and it wasn't backed by web search, so I took for granted what was in its training data. No good excuse I'm afraid. tldr: poor critical thinking skills on my part there! :)

I'd really like to have studied in these times, where it's so much easier with all the new tools. I could have been a triple doctor.

At work I've built tools that automatically write technical certificates for wind parks.

I've had code written automatically to solve problems I couldn't solve on my own. Complicated linear algebra stuff, which was always too hard.

I should have written papers automatically too; at least my wife already writes her reports with ChatGPT.

Others are writing film scripts with these tools.

Good times.

In at least one case, the authors claimed to have used ChatGPT to "generate the citations after giving it author-year in-text citations, titles, or their paraphrases." They pasted the hallucinations in without checking. They've since responded with corrections to real papers that in most cases are very similar to the hallucination, lending credibility to their claim.[1]

Not great, but to be clear this is different from fabricating the whole paper or the authors inventing the citations. (In this case at least.)

[1] https://openreview.net/forum?id=IiEtQPGVyV

I don't understand: why aren't there automated tools to verify that citations exist? A citation has a structured style (APA, MLA, Chicago), and paper metadata is available via e.g. a web search, even if the paper contents are not.

I guess GPTZero has such a tool. I'm confused why it isn't used more widely by paper authors and reviewers.
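
Mechanically it doesn't have to be much. As a rough sketch (my own assumption, not how GPTZero or any reviewer tooling actually works), a citation without a DOI can be checked for existence against Crossref's free-text bibliographic search plus an author-overlap heuristic:

  import requests

  def top_crossref_hit(citation_text):
      """Search Crossref with the raw citation string and return the top hit, if any."""
      resp = requests.get(
          "https://api.crossref.org/works",
          params={"query.bibliographic": citation_text, "rows": 1},
          timeout=10,
      )
      items = resp.json()["message"]["items"]
      return items[0] if items else None

  def citation_seems_real(citation_text, cited_surnames):
      """Heuristic: the citation is plausible if a top hit exists and shares an author surname."""
      hit = top_crossref_hit(citation_text)
      if hit is None:
          return False
      hit_surnames = {a.get("family", "").lower() for a in hit.get("author", [])}
      return any(s.lower() in hit_surnames for s in cited_surnames)

  # A fabricated author/venue combination should fail the author-overlap check.
  print(citation_seems_real("Deep learning. Science & Nature, 23-45, 2021.", ["Jackson"]))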

  • Citations are too open-ended, too prone to variation, and too full of legitimate minor mistakes that wouldn't bother a human verifier but would break automated tools, to be easily verified in their current form. DOI was supposed to solve some of the literal mechanical variation in establishing the existence of a source, but journal paywalls and limited adoption mean that is not a universal solution. Plus, a DOI still doesn't easily verify the factual accuracy of a citation, like "does the source say what the citation says it does," which is the most important part.

    In my experience you will see considerable variation in citation formats, even in journals that strictly define them and require using BibTeX. And lots of journals leave their citation format rules very vague. It's a problem that runs deep.

  • Looks like GPTZero Source Finder was only released a year ago - if anything, I'm surprised slop-writers aren't using it preemptively, since they're "ahead of the curve" relative to reviewers on this sort of thing...

With regard to confabulating (hallucinating) sources, or anything else, it is worth noting this is a first-class training requirement imposed on models, not a habit models simply picked up from humans.

When training a student, normally we expect a lack of knowledge early, and reward self-awareness, self-evaluation and self-disclosure of that.

But from the very first epoch of a model training run, when the model has all the ignorance of a dropped plate of spaghetti, we optimize the network to respond to information as anything from a typical human to an expert, without any base of understanding.

So the training practice for models is inherently extreme enforced “fake it until you make it”, to a degree far beyond any human context or culture.

(Regardless, humans need to verify, not to mention read, the sources they cite. But it will be nice when models can be trusted to accurately assess what they know/don’t know too.)

Which is worse:

a) p-hacking and suppressing null results

b) hallucinations

c) falsifying data

Would be cool to see an analysis of this

  • I'm doing some research, and this is something I'm unsure of. I see that "suppressing null results" is a bad thing, and I sort of agree, but for me personally, a lot of the null results are just the result of my own incompetence and don't contain any novel insights.

  • All 3 of these should be categorized as fraud, and punished criminally.

    • Only when we can arrest people who say dumb stuff on the internet too. Much like how Trump and Bubba (Bill Clinton) should share a jail cell, those who pontificate about what they don't know (i.e., non-academics critiquing academia) can sit in the same jail cell as the supposed criminal academics.

      You gotta horse trade if you want to win. Take one for the team or get out of the way.

The takeaway for me isn't that LLMs produce bad references—humans do that too—but that cutting corners shows in the final product. If your background section contains made‑up citations, it makes readers wonder how careful you were with the parts they can't check as easily. If you're going to use AI tools for draft writing, you still need to vet every fact.

The spot-check revealing a 'wrong venue and missing authors' rather than a fabricated citation is telling. Citation errors existed long before LLMs - bibtex from Google Scholar is notoriously unreliable, and copy-paste errors are universal. Using these as definitive AI markers seems like working backward from a conclusion. The underlying question is interesting (how much AI use in academic writing?), but the methodology here doesn't seem to actually answer it.

This is awful but hardly surprising. Someone mentioned reproducible code with the papers - but there is a high likelihood of the code being partially or fully AI generated as well. I.e. AI generated hypothesis -> AI produces code to implement and execute the hypothesis -> AI generates paper based on the hypothesis and the code.

Also: there were 15,000 submissions that were rejected at NeurIPS; it would be very interesting to see what % of those rejected were partially or fully AI generated/hallucinated. Are the ratios comparable?

  • Whether the code is AI-generated or not isn't important; what matters is that it really works.

    Sharing code enables others to validate the method on a different dataset.

    Even before LLMs came around there were lots of methods that looked good on paper but turned out not to work outside of accepted benchmarks

And this is the tip of the iceberg, because these are the easy to check/validate things.

I'm sure plenty of more nuanced facts are also entirely without basis.

The reviewing process for top AI conferences has been broken as hell for several years now, due to having too many submissions and too few reviewers (to the point that Master's students are reviewing papers). It was only a matter of time before these conferences would be filled with AI-written papers.

You will find out that top CS conferences are never all that scientific if you actually go to the authors' GitHub and run their code.

It is very concerning that these hallucinations passed through peer review. It's not like peer review is a fool-proof method or anything, but the fact that reviewers did not check the references and notice clearly bogus ones is alarming, and could be a sign that the article authors weren't the only ones using LLMs in the process...

  • Is it common for peer reviewers to check references? Somehow I thought they mostly focused on whether the experiment looked reasonable and the conclusions followed.

    • In journal publications it is, but without DOIs it's difficult.

      In conference publications, it's less common.

      Conference publications (like NeurIPS) are treated as announcements of results, not verified work.

      1 reply →

The incorrect-citations problem will disappear when AI web search and fetch becomes 100x cheaper than it is today. Right now, the APIs are too expensive to do proper multi-hundred-result searches over papers (the search space for any paper is much larger than the final list of citations).

However, we’ll be left with AI written papers and no real way to determine if they’re based on reality or just a “stochastic mirror” (an approximate reflection of reality).

AI might just extinguish the entire paradigm of publish or perish. The sheer volume of papers makes it nearly impossible to properly decide which papers have merit, which are non-replicable and suspect, and which are just a desperate rush to publish. The entire practice needs to end.

  • It's not publish or perish so much as get grant money or perish.

    Publishing is just the way to get grants.

    A PI explained it to me once, something like this

    Idea(s) -> Grant -> Experiments -> Data -> Paper(s) -> Publication(s) -> Idea(s) -> Grant(s)

    That's the current cycle ... remove any step and it's a dead end.

  • But how could we possibly evaluate faculty and researcher quality without counting widgets on an assembly line? /s

    It’s a problem. The previous regime prior to publishing-mania was essentially a clubby game of reputation amongst peers based on cocktail party socialization.

    The publication metrics came out of the harder sciences, I believe, and then spread to the softest of humanities. It was always easy to game a bit if you wanted to try, but now it’s trivial to defeat.

This is mostly an ad for their product. But I bet you can get pretty good results with a Claude Code agent using a couple simple skills.

Should be extremely easy for AI to successfully detect hallucinated references as they are semi-structured data with an easily verifiable ground truth.

It would be ironic if the very detection of hallucinations contained hallucinations of its own.

Why focus on hallucinations/LLMs and not on the authors? There are rules for submitting papers.

If I drop a loaded gun and it fires, killing someone, we don't go after the gun's manufacturer in most cases.

  • This isn't directly to your point, but: A civil suit for such an incident would generally name both the weapon owner (for negligence, etc.) and the manufacturer (for dangerous design).

  • Actually, if you’re the US navy, you DO go after the manufacturer!

    Go look up the P320 pistol and the tons of accidental discharges that it's caused.

    https://stateline.org/2025/03/10/more-law-enforcement-agenci...

    • Thanks. But that's not actually the point I'm trying to make.

      What I'm saying is that the authors have a responsibility, whether they wrote the papers themselves, asked an AI to write and didn't read it thoroughly, or asked their grandparents while on LSD to write it... it all comes back to whoever put their names on the paper and submitted it.

      I think AI is a red herring here.

Gave GPTZero a random ChatGPT text about finances. It was 84% confident that it was entirely human writing.

We've been talking about a "crisis of reproducibility" for years and the incentive to crank out high volumes of low-quality research. We now have a tool that brings the cost of producing plausible-looking research down to zero. So of course we're going to see that tool abused on a galactic scale.

But here's the thing: let's say you're an university or a research institution that wants to curtail it. You catch someone producing LLM slop, and you confirm it by analyzing their work and conducting internal interviews. You fire them. The fired researcher goes public saying that they were doing nothing of the sort and that this is a witch hunt. Their blog post makes it to the front page of HN, garnering tons of sympathy and prompting many angry calls to their ex-employer. It gets picked up by some mainstream outlets, too. It happened a bunch of times.

In contrast, there are basically no consequences to institutions that let it slide. No one is angrily calling the employers of the authors of these 100 NeurIPS papers, right? If anything, there's the plausible deniability of "oh, I only asked ChatGPT to reformat the citations, the rest of the paper is 100% legit, my bad".

The prevalence of hallucinations is another sign that the system needs to change. Citations should be treated less like narrative context and more like verifiable objects.

Better detectors, which is what the article implies we need, won't solve the problem, since AI will likely keep improving.

It’s about the fact that our publishing workflows implicitly assume good faith manual verification, even as submission volume and AI assisted writing explode. That assumption just doesn’t hold anymore

A student initiative at Duke University has been working on what it might look like to address this at the publishing layer itself, by making references, review labor, and accountability explicit rather than implicit

There’s a short explainer video for their system: https://liberata.info/

It’s hard to argue that the current status quo will scale, so we need novel solutions like this.

I'm surprised it's only 100, honestly. Also feels a little sensationalized... Before AI I wonder how many "hallucinations" were in human-written papers. Is there any data on this?

  • These are 100 already-reviewed and accepted papers, judged by researchers in their areas of expertise, usually PhDs and professors.

    These are not all the submissions that they received. The review process can be... brutal for some people (depending on the quality of their submission)

As long as these sorts of papers serve more important purposes for the careers of the authors than anything related to science or discovery of knowledge, then of course this happens and continues.

The best possible outcome is that these two purposes are disconflated, with follow-on consequences for the conferences and journals.

A lot of research in AI/ML seems to me to be "fake it and never make it". Literally it's all about optics, posturing, connections, publicity. Lots of bullshit and little substance. This was true before AI slop, too. But the fact that AI slop can make it past review really showcases how much a paper's acceptance hinges on things other than its substance and results.

I even know PIs who got fame and funding based on some research direction that was supposedly going to be revolutionary. Except all they had were preliminary results that, from one angle, if you squint, hint at some good result. But then the result never comes. That's why I say, "fake it, and never make it".

This suggests that nobody was screening these papers in the first place—so is it actually significant that people are using LLMs in a setting without meaningful oversight?

These clearly aren't being peer-reviewed, so there's no natural check on LLM usage (which is different than what we see in work published in journals).

  • As someone who reviews 20+ papers per year: we don't have time to verify each reference.

    We verify: is the stuff correct, and is it worthy of publication (in the given venue) given that it is correct.

    There is still some trust in the authors not to submit made-up stuff, though that trust is diminishing.

  • Academic venues don't have enough reviewers. This problem isn't new, and as publication volumes increase, it's getting sharply worse.

    Consider the unit economics. Suppose NeurIPS gets 20,000 papers in one year. Suppose each author should expect three good reviews, so area chairs assign five reviewers per paper. In total, 100,000 reviews need to be written. It's a lot of work, even before factoring emergency reviewers in.

    NeurIPS is one venue alongside CVPR, [IE]CCV, COLM, ICML, EMNLP, and so on. Not all of these conferences are as large as NeurIPS, but the field is smaller than you'd expect. I'd guess there are 300k-1m people in the world who are qualified to review AI papers.

    • Seems like using tooling like this to identify papers with fake citations and auto-rejecting them before they ever get in front of a reviewer would kill two birds with one stone.

      2 replies →

  • When I was reviewing such papers, I didn't bother checking that 30+ citations were correctly indexed. I focused on the article itself, and maybe 1 or 2 citations that are important. That's it. For most citations, they are next to an argument that I know is correct, so why would I bother checking. What else do you expect? My job was to figure out if the article ideas are novel and interesting, not if they got all their citations right.

This feels like a big nothingburger to me. Try an analysis on conference submissions (perhaps even published papers) from 1995 for comparison, and one from 2005, and one from 2015. I recall the typos/errors/omissions because I reviewed for those venues and published in them. Even then: so what? If I could find the reference relatively easily and with enough confidence, I was fine. Rarely, I couldn't find it and contacted the author. The job of the reviewer (or even the author) isn't to be a nitpicky editor—that's the editor's job. Editing does not happen until the final printed publication is near, and only for accepted papers; nowadays sometimes it never happens. Now that is a problem perhaps, but it has nothing to do with the authors' use of LLMs.

> we discovered 100s of hallucinated citations missed by the 3+ reviewers who evaluated each paper.

This says just as much about the humans involved.

  • Well for one, it's definitely not the responsibility of the reviewers to check that all the citations exist. That would be insane.

We have the h-index and such; can we have something similar that goes down when you pull stunts like these? Preferably link it to people's ORCID iDs.

How you know it's really real is that they clearly state the FPR and compare against a pre-LLM baseline.

But I saw it in Apple News, so MISSION ACCOMPLISHED!

Clearly there is some demand for those papers, and research, to exist. Good opportunity to fill the gaps.

The downstream effects of this are extremely concerning. We have already seen the damage caused by human written research that was later retracted like the “research” on vaccines causing autism.

As we get more and more papers that may be citing information that was originally hallucinated in the first place, we have a major reliability issue. What is worse, people who did not use AI in the first place will be caught in the crossfire, since they will be referencing incorrect information.

There needs to be a serious amount of education done on what these tools can and cannot do and importantly where they fail. Too many people see these tools as magic since that is what the big companies are pushing them as.

Other than that we need to put in actual repercussions for publishing work created by an LLM without validating it (or just say you can’t in the first place but I guess that ship has sailed) or it will just keep happening. We can’t just ignore it and hope it won’t be a problem.

And yes, humans can make mistakes too. The difference is accountability, and the ability to actually be unsure about something, which pushes you to question yourself and double-check.

What's wild is so many of these are from prestigious universities. MIT, Princeton, Oxford and Cambridge are all on there. It must be a terrible time to be an academic who's getting outcompeted by this slop because somebody from an institution with a better name submitted it.

  • I'm going to be charitable and say that the papers from prestigious universities were honest mistakes rather than paper mill university fabrications.

    One thing that has bothered me for a very long time is that computer science (and I assume other scientific fields) has long since decided that English is the lingua franca, and if you don't speak it you can't be part of it. Can you imagine being told that you could only do your research if you were able to write technical papers in a language you didn't speak, maybe even using glyphs you didn't know? It's crazy when you think about it even a little bit, but we ask it of so many. And that's before considering that 90% of the English-speaking population couldn't crank out a paper at the required vocabulary level anyway.

    A very legitimate, not trying to cheat, use for LLMs is translation. While it would be an extremely broad and dangerous brush to paint with, I wonder if there is a correlation between English-as-a-Second (or even third)-Language authors and the hallucinations. That would indicate that they were trying to use LLMs to help craft the paper to the expected writing level. The only problem being that it sometimes mangles citations, and if you've done good work and got 25+ citations, it's easy for those errors to slip through.

    • I can't speak for the American universities, but remember there is no entrance exam for UK PhDs, you just require a 2:1 or 1st class bachelor's degree/masters (going straight without a masters is becoming more common) usually, which is trivial to obtain. The hard part is usually getting funding, but if you provide your own funding you can go to any university you want. They are only really hard universities to get into for a bachelors, not for masters or PhD where you are more of a money/labour source than anything else.

      1 reply →

Implicitly this makes sense but the amount cited in this article is still hard for me to grasp. Wow.

The most striking part of the report isn't just the 100 hallucinations—it’s the "submission tsunami" (220% increase since 2020) that made this possible. We’re seeing a literal manifestation of a system being exhausted by simulation.

When a reviewer is outgunned by the volume of generative slop, the structure of peer review collapses because it was designed for human-to-human accountability, not for verifying high-speed statistical mimicry. In these papers, the hallucinations are a dead giveaway of a total decoupling of intelligence from any underlying "self" or presence. The machine calculates a plausible-looking citation, and an exhausted reviewer fails to notice the "Soul" of the research is missing.

It feels like we’re entering a loop where the simulation is validated by the system, which then becomes the training data for the next generation of simulation. At that point, the human element of research isn't just obscured—it's rendered computationally irrelevant.

This is going to be a huge problem for conferences. While journals have a longer time to get things right, as a conference reviewer (for IEEE conferences) I was often asked to review 20+ papers in a short time to determine who gets a full paper, who gets to present just a poster, etc. There was normally a second round, but often these would just look at submissions near the cutoff margin in the rankings. Obvious slop can be quickly rejected, but it will be easier to sneak things in.

  • AI conferences are already fucked. Students who are doing their Master's degrees are reviewing those top-tier papers, since there are just too many submissions for existing reviewers.

My website of choice whenever I have to deal with references is dblp [1]. In my opinion it is more reliable than Google Scholar at creating correct BibTeX. Also, when searching for a paper you clearly see where it has been published, or whether it is only on arXiv.

[1] https://dblp.org/

The authors talk about "a model's ability to align with human decisions" as a matter of the past. The omission in the paper is RLHF (Reinforcement Learning from Human Feedback). All these companies are "teaching machines to predict the preferences of people who click 'Accept All Cookies' without reading," by using low-paid human evaluators — “AI teachers.”

If we go back to Google, before its transformation into an AI powerhouse — as it gutted its own SERPs, shoving traditional blue links below AI-generated overlords that synthesize answers from the web’s underbelly, often leaving publishers starving for clicks in a zero-click apocalypse — what was happening?

The same kind of human “evaluators” were ranking pages. Pushing garbage forward. The same thing is happening with AI. As much as the human "evaluators" trained search engines to elevate clickbait, the very same humans now train large language models to mimic the judgment of those very same evaluators. A feedback loop of mediocrity — supervised by the... well, not the best among us. The machines still, as Stephen Wolfram wrote, for any given sequence, use the same probability method (e.g., “The cat sat on the...”), in which the model doesn’t just pick one word. It calculates a probability score for every single word in its vast vocabulary (e.g., “mat” = 40% chance, “floor” = 15%, “car” = 0.01%), and voilà! — you have a “creative” text: one of a gazillion mindlessly produced, soulless, garbage “vile bile” sludge emissions that pollute our collective brains and render us a bunch of idiots, ready to swallow any corporate poison sent our way.
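(A toy illustration of that next-word sampling step, using only the made-up probabilities from the "cat sat on the..." example above; nothing here comes from a real model:)

    # toy next-word sampling with the made-up probabilities from the
    # "The cat sat on the ..." example above; not real model outputs
    import random

    next_word_probs = {"mat": 0.40, "floor": 0.15, "car": 0.0001}

    def sample_next_word(probs):
        words, weights = zip(*probs.items())
        # weights need not sum to 1; random.choices samples proportionally
        return random.choices(words, weights=weights, k=1)[0]

    print("The cat sat on the", sample_next_word(next_word_probs))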

In my opinion, even worse: the corporates are pushing toward “safety” (likely from lawsuits), and the AI systems are trained to sell, soothe, and please — not to think, or enhance our collective experience.

The problem isn’t scale.

The problem is consequences (lack of).

Doing this should get you barred from research. It won’t.

  • Scale IS a problem, just not the only one.

    Consequences are the inevitable solution. Accountability starting with authors, followed by organizations/institutions.

    Warning for first offense, ban after

What if they only accepted handwritten papers? Basically the current system is beyond repair, so we might as well go back to receiving 20 decent papers instead of 20k hallucinated ones.

All papers proven to have used an LLM beyond writing improvement should be automatically retracted.

I'm an author on a paper on breast cancer, and one of our co-authors generated the majority of their work with AI. It just makes me angry.

I don't know about you, but where I'm from, we call citations from sources which don't exist "fabrications" or "fraud" - not "hallucination", which sounds like some medical condition which evokes pity.

This is nice and all, but what repercussions does GPTZero face when their bullshit AI detection hallucinates a student using AI? And when that student receives academic discipline because of it?

Many such cases of this. More than 100!

They claim to have custom detection for GPT-5, Gemini, and Claude. They're making that up!

  • Indeed. My son has been accused by bullshit AI detection as having used AI, and it has devastated his work quality. After being "disciplined" for using AI (when he didn't), he now intentionally tries to "dumb down" his writing so that it doesn't sound so much like AI. The result is he writes much worse. What a shitty, shitty outcome. I've even found myself leaving typos and things in (even on sites like HN) because if you write too well, inevitably some comment replier will call you out as being an LLM even when you aren't. I'm as annoyed by the LLM posts as everybody else, but the answer surely is not to dumb us down into Idiocracy.

    • Stop using em dashes, the fancy quotes that can’t be easily typed. Stop using overused words like certainly and delve. Stop using LLM template slop like “it’s not X, it’s Y”. Stop always doing lists of 3s. We know you didn’t use to use so many emojis or bolded text. Also, AI really fking hates the exclamation mark so that’s a great proof of humanity!

      Most people getting flagged are getting flagged because they actually used AI and couldn’t even be bothered to manually deslop it.

      People who are too lazy to put even a tiny bit of human intentionality into their work deserve it.

I searched Google for one of the hallucinations: [N. Flammarion. Chen "sam generalizes"]

AI Overview: Based on the research, [Chen and N. Flammarion (2022)](https://gptzero.me/news/neurips/) investigate why Sharpness-Aware Minimization (SAM) generalizes better than SGD, focusing on optimization perspectives

The link is a link to the OP web page calling the "research" a hallucination.

Why does "Robust Label Proportions Learning" have a "Scan" link, while all the others have a "Sources" link? Was this web page generated by AI?

Given that many of these detections are being made from references, I don't understand why we're not using automatic citation checkers.

Just ask authors to submit their bib file so we don't need to do OCR on the PDF. Flag the unknown citations and ask reviewers to verify their existence. Then contact authors and ban if they can't produce the cited work.

This is low hanging fruit here!
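As a rough sketch of that kind of screen (the regex-based title extraction, the DBLP query parameters, and the file name are my own simplifications, not an existing submission-system feature):

    # bib_screen.py -- crude sketch of a submission-time citation screen.
    # Pulls title fields out of a .bib file and checks each against DBLP's
    # public search API; titles with zero hits get flagged for a human.
    import re
    import requests

    TITLE_RE = re.compile(r'(?<!book)title\s*=\s*[{"]([^"}]+)[}"]', re.IGNORECASE)

    def bib_titles(path):
        """Very rough extraction; a real tool would use a proper BibTeX parser."""
        with open(path, encoding="utf-8") as f:
            return [t.strip() for t in TITLE_RE.findall(f.read())]

    def dblp_hit_count(title):
        resp = requests.get(
            "https://dblp.org/search/publ/api",
            params={"q": title, "format": "json", "h": 5},
            timeout=30,
        )
        resp.raise_for_status()
        return int(resp.json()["result"]["hits"].get("@total", 0))

    if __name__ == "__main__":
        for title in bib_titles("submission.bib"):  # hypothetical file name
            if dblp_hit_count(title) == 0:
                print("FLAG (no DBLP hit):", title)

In practice you'd want fuzzy matching and a second source besides DBLP rather than a hard zero-hit rule, but even this would catch the "paper that simply doesn't exist" class of hallucination.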

Detecting slop where the authors vet the citations is much harder. The big problem with all the review rules is they have no teeth. If it were up to me we'd review in the open, or at least like ICLR. Publish the list of known bad actors and let us look at the network. The current system is too protective of egregious errors like plagiarism. Authors can get detected at one conference, pull the paper, and submit to another, rolling the dice. We can't allow that to happen, and we should discourage people from associating with these con artists.

AI is certainly a problem in the world of scientific review, but it's far from the only one, and I'm not even convinced it's the biggest. The biggest is just that reviewers are lazy and/or not qualified to review the works they're assigned. It takes at least an hour to properly review a paper in your niche, much more when it's outside it. We're overworked as is, with 5+ works to review, not to mention all the time we have to spend reworking our own papers that were rejected by the slot machine. We could do much better if we dropped this notion of conference/journal prestige and focused on the quality of the works and the reviews.

Addressing those issues also addresses the AI issues because, frankly, *it doesn't matter if the whole work was done by AI, what matters is if the work is real.*

Can we just hallucinate the whole conference by now? Like "Hey AI, generate me the whole conference agenda, schedule, papers, tracks, workshops, and keynote" and not pay the $1k?

"100 Hallucinated Citations in Published Across 53 NeurIPS Papers"

No one cares about citations. They are hallucinated because they are required to be present for political reasons, even though they have no relevance.

If these are so easy to identify, why not just incorporate some kind of screening into the early stages of peer review?

No surprises. Machine learning has, at least since 2012, been the go-to field for scammers and grifters. Machine learning, and technology in general, is basically a few real ideas, a small number of honest hard workers, and then millions of fad chasers and scammers.

It would be great if those scientists who use AI without disclosing it get fucked for life.

  • > It would be great if those scientists who use AI without disclosing it get fucked for life.

    There need to be disincentives for sloppy work. There is a tension between quality and quantity in almost every product. Unfortunately, academia has become a numbers game with paper mills.

  • Harsh sentiment. Pretty soon every knowledge worker will use AI every day. Should people disclose spellcheckers powered by AI? Disclosing is not useful. Being careful in how you use it and checking work is what matters.

      What they are doing is plainly cheating the system to get their 3 conference papers so they can get their $150k+ job at FAANG. It's cheating with no value.

      5 replies →

    • > Should people disclose spellcheckers powered by AI?

      Thank you for that perfect example of a strawman argument! No, spellcheckers that use AI are not the main concern behind disclosing the use of AI in generating scientific papers, government reports, or any large block of nonfiction text that you paid for and that is supposed to make sense.

    • People are accountable for the results they produce using AI. So a scientist is responsible for made up sources in their paper, which is plain fraud.

      3 replies →

    • In general we're pretty good at drawing a line between purely editorial help, like using a spellchecker or even the services of a professional editor (no need to acknowledge), and independent intellectual contribution (must be acknowledged). There's no slippery slope.

    • >Pretty soon every knowledge worker will use AI every day.

      Maybe? There's certainly a push to force the perception of inevitability.

    • False equivalence. This isn't about "using AI" it's about having an AI pretend to do your job.

      What people are pissed about is the fact their tax dollars fund fake research. It's just fraud, pure and simple. And fraud should be punished brutally, especially in these cases, because the long tail of negative effects produces enormous damage.

      1 reply →

    • "Pretty soon every knowledge worker will use AI every day" is a wild statement considering the reporting that most companies deploying AI solutions are seeing little to no benefit, but also, there's a pretty obvious gap between spell checkers and tools that generate large parts of the document for you

  • Instead of publishing their papers in the prestigious zines - which is what they're after - we will publish them in "AI Slop Weekly" with name and picture. Up the submission risk a bit.