
Comment by j2kun

1 day ago

I spot-checked one of the flagged papers (from Google, co-authored by a colleague of mine).

The paper was https://openreview.net/forum?id=0ZnXGzLcOg and the problem flagged was "Two authors are omitted and one (Kyle Richardson) is added. This paper was published at ICLR 2024." I.e., for one cited paper, the author list was off and the venue was wrong. This citation appeared in the background section of the paper and was not fundamental to its validity. So the citation was not fabricated, but it was incorrectly attributed (perhaps via use of an AI autocomplete).

I think there are some egregious papers in their dataset, and this error does make me pause to wonder how much of the rest of the paper used AI assistance. That said, the "single error" papers in the dataset seem similar to the one I checked: relatively harmless and minor errors (which would be immediately caught by a DOI checker), and so I have to assume some of these were included in the dataset mainly to amplify the author's product pitch. It succeeded.

>this error does make me pause to wonder how much of the rest of the paper used AI assistance

And this is what's operative here. The error spotted, the entire class of error spotted, is easily checked/verified by a non-domain expert. These are the errors we can confirm readily, with an obvious and unmistakable signature of hallucination.

If these are the only errors, we are not troubled. However: we do not know if these are the only errors; they are merely a signature that the paper was submitted without being thoroughly checked for hallucinations. They are a signature that some LLM was used to generate parts of the paper and that the responsible authors used this LLM without care.

Checking the rest of the paper requires domain expertise, perhaps requires an attempt at reproducing the authors' results. That the rest of the paper is now in doubt, and that this problem is so widespread, threatens the validity of the fundamental activity these papers represent: research.

  • > If these are the only errors, we are not troubled. However: we do not know if these are the only errors; they are merely a signature that the paper was submitted without being thoroughly checked for hallucinations. They are a signature that some LLM was used to generate parts of the paper and that the responsible authors used this LLM without care.

    I am troubled by people using an LLM at all to write academic research papers.

    It's a shoddy, irresponsible way to work. And also plagiarism, when you claim authorship of it.

    I'd see a failure of the 'author' to catch hallucinations as more like a failure to hide evidence of misconduct.

    If academic venues are saying that using an LLM to write your papers is OK ("so long as you look it over for hallucinations"?), then those academic venues deserve every bit of operational pain and damaged reputation that will result.

    • >I am troubled by people using an LLM at all to write academic research papers.

      I'm an outsider to the academic system. I have cool projects that I feel push some niche application to SOTA in my tiny little domain, which is publishable based on many of the papers I've read.

      If I can build a system that does a thing, and I can benchmark it and prove it's better than previous papers, then my main blocker is getting all my work and information into the "Arxiv PDF" format and tone. Seems like a good use of LLMs to me.

    • I would argue that an LLM is a perfectly sensible tool for structure-preserving machine translation from another language to English. (Where by "another language", you could also substitute "very poor/non-fluent English." Though IMHO that's a bit silly, even though it's possible; there's little sense in writing in a language you only half know, when you'd get a less-lossy result from just writing in your native tongue and then having the LLM translate from that.)

      Google Translate et al were never good enough at this task to actually allow people to use the results for anything professional. Previous tools were limited to getting a rough gloss of what words in another language mean.

      But LLMs can be used in this way, and are being used in this way; and this is increasingly allowing non-English-fluent academics to publish papers in English-language journals (thus engaging with the English-language academic community), where previously they may have felt "stuck" publishing in what few journals exist for their discipline in their own language.

      Would you call the use of LLMs for translation "shoddy" or "irresponsible"? To me, it'd be no more and no less "shoddy" or "irresponsible" than it would be to hire a freelance human translator to translate the paper for you. (In fact, the human translator might be a worse idea, as LLMs are more likely to understand how to translate the specific academic jargon of your discipline than a randomly-selected human translator would be.)

      11 replies →

    • > And also plagiarism, when you claim authorship of it.

      I don't actually mind putting Claude as a co-author on my github commits.

      But for papers there are usually so many tools involved. It would be crowded to include each of Claude, Gemini, Codex, Mathematica, Grammarly, Translate etc. as co-authors, even though I used all of them for some parts.

      Maybe just having a "tools used" section could work?

      1 reply →

    • There are legitimate, non-cheating ways to use LLMs for writing. I often use the wrong verb forms ("They synthesizes the ..."), write "though" when it should be "although", and forget to comma-separate clauses. LLMs are perfect for that. Generating text from scratch, however, is wrong.

      5 replies →

    • > It's a shoddy, irresponsible way to work. And also plagiarism, when you claim authorship of it.

      It reminds me of kids these days and their fancy calculators! Those newfangled doohickeys just aren't reliable, and the kids never realize that they won't always have a calculator on them! Everyone should just do it the good old-fashioned way with slide rules!

      Or these darn kids and their unreliable sources like Wikipedia! Everyone knows that you need a nice solid reliable source that's made out of dead trees and fact-checked by up to 3 paid professionals!

      24 replies →

    • >also plagiarism

      To me, this is a reminder of how much of a specific minority this forum is.

      Nobody I know in real life, personally or at work, has expressed this belief.

      I have literally only ever encountered this anti-AI extremism (extremism in the non-pejorative sense) in places like reddit and here.

      Clearly, the authors in NeurIPS don't agree that using an LLM to help write is "plagiarism", and I would trust their opinions far more than some random redditor.

      28 replies →

  • This seems like finding spelling errors and using them to cast the entire paper into doubt.

    I am unconvinced that the particular error mentioned above is a hallucination, and even less convinced that it is a sign of some kind of rampant use of AI.

    I hope to find better examples later in the comment section.

    • I actually believe it was an AI hallucination, but I agree with you that the problem seems far more concentrated in a few select papers (e.g., one paper made up more than 10% of the detected errors).

    • Why don't you look at the actual article? There are several more egregious examples, e.g., the authors being cited as "John Smith and Jane Doe"

      3 replies →

  • The problem is, 10 years ago when I was still publishing, even I would let an incorrect citation go through b/c of an old bibtex file or some such.

    • Yeah, errors of omission are so common that "Errors and Omissions" is a category of professional liability insurance.

  • > However: we do not know if these are the only errors; they are merely a signature that the paper was submitted without being thoroughly checked for hallucinations

    Given how stupidly tedious and error-prone citations are, I have no trouble believing that the citation error could be the only major problem with the paper, and that it's not a sign of low quality by itself. It would be another matter entirely if we were talking about something actually important to the ideas presented in the paper, but it isn't.

  • Agree, I don't find this to be evidence of AI. It has often happened that authors change, there are multiple venues, or I'm using an old version of the paper. We also need to see the denominator: it matters whether this Google paper had this one bad citation out of 20 references or out of 60.

    Also, everyone I know has been relying on Google Scholar for 10+ years. Is that AI-ish? There are definitely errors on there. If you would extrapolate from citation issues to the content in the age of LLMs, were you doing so back then as well?

    It's the age-old debate about spelling/grammar issues in technical work. In my experience these errors, e.g. from non-native speakers, rarely get to the point of affecting my interpretation. Others claim to infer shoddy content from them.

  • Google Scholar and the vagaries of copy/paste errors have mangled bibtex ever since it became a thing; a single citation with these sorts of errors may not even be AI, just “normal” mistakes.

The missing analysis is, of course, a comparison with pre-LLM conferences, like 2022 or 2023, which would show a “false positive” rate for the tool.

The thing is, when you copy-paste a bibliography entry from the publisher or from Google Scholar, the authors won't be wrong. In this case, they were. If I were to write a paper with AI, I would at least manage the bibliography by hand, conscious of hallucinations. The fact that the hallucination is in the bibliography is a pretty strong indicator that the paper was written entirely with AI.

  • Google Scholar provides imperfect citations: very often the wrong article type (e.g., journal article versus conference paper), but up to and including missing authors, in my experience.

    • I've had the same experience. Also papers will often have multiple entries in Google Scholar, with small differences between them (enough that Scholar didn't merge them into one entry).

  • I'm not sure I agree... while I don't ever see myself writing papers with AI, I hate wrangling a bibtex bibliography.

    I wouldn't trust today's GPT-5-with-web-search to turn a bullet point list of papers into proper citations without checking myself, but maybe I will trust GPT-X-plus-agent to do this.

    • Reference managers have existed for decades now and they work deterministically. I paid for one when writing my doctoral thesis because it would have been horrific to do by hand. Any of the major tools like Zotero or Mendeley (I used Papers) will export a bibtex file for you, and they will accept a RIS or similar format that most journals export.

    • This seems solvable today if you treat it as an architecture problem rather than relying on the model's weights. I'm using LangGraph to force function calls to Crossref or OpenAlex for a similar workflow. As long as you keep the flow rigid and only use the LLM for orchestration and formatting, the hallucinations pretty much disappear.
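
      Below is a rough sketch of what the deterministic lookup step can look like (my own illustration, not the pipeline described above; the LangGraph wiring is omitted and the helper name is hypothetical). The point is that author lists and titles come from Crossref's records, never from the model:

      ```python
      # Fetch citation metadata straight from Crossref by DOI, so the LLM only
      # formats what the API returned and never gets a chance to invent authors.
      import requests

      def fetch_crossref_metadata(doi: str) -> dict:
          """Return title, authors, and year for a DOI as recorded by Crossref."""
          resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
          resp.raise_for_status()
          msg = resp.json()["message"]
          authors = [f"{a.get('given', '')} {a.get('family', '')}".strip()
                     for a in msg.get("author", [])]
          year = msg.get("issued", {}).get("date-parts", [[None]])[0][0]
          return {"title": msg["title"][0], "authors": authors, "year": year}

      # Example: fetch_crossref_metadata("10.1038/nature14539")
      ```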

Agreed.

What I find more interesting is how easy these errors are to introduce and how unlikely they are to be caught. As you point out, a DOI checker would immediately flag this. But citation verification isn’t a first-class part of the submission or review workflow today.

We’re still treating citations as narrative text rather than verifiable objects. That implicit trust model worked when volumes were lower, but it doesn’t seem to scale anymore.

There’s a project I’m working on at Duke University, where we are building a system that tries to address exactly this gap by making references and review labor explicit and machine-verifiable at the infrastructure level. There’s a short explainer that lays out what we mean, in case more context is useful: https://liberata.info/

  • Citation checks are a workflow problem, not a model problem. Treat every reference as a dependency that must resolve and be reproducible. If the checker cannot fetch and validate it, it does not ship.
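
    A minimal sketch of that kind of gate (an assumed workflow on my part, not an existing tool): resolve every DOI in the bibliography through doi.org and fail the check if any of them does not exist.

    ```python
    # "References as dependencies": every DOI must resolve, or the check fails,
    # the same way a missing package would fail a build.
    import sys
    import requests

    def doi_resolves(doi: str) -> bool:
        # doi.org answers a registered DOI with a redirect (3xx) and an
        # unregistered one with 404, so there is no need to follow the redirect.
        resp = requests.head(f"https://doi.org/{doi}", allow_redirects=False, timeout=10)
        return 300 <= resp.status_code < 400

    def check_bibliography(dois: list[str]) -> int:
        broken = [d for d in dois if not doi_resolves(d)]
        for d in broken:
            print(f"UNRESOLVED: {d}")
        return 1 if broken else 0  # non-zero exit code blocks the "ship"

    if __name__ == "__main__":
        sys.exit(check_bibliography(sys.argv[1:]))
    ```

    Existence is only the easy half; comparing the author list and venue against the resolved metadata is what would catch the kind of error flagged in the parent comment.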

I see your point, but I don’t see where the author makes any claims about the specifics of the hallucinations, or their impact on the papers’ broader validity. Indeed, I would have found the removal of supposed “innocuous” examples to be far more deceptive than simply calling a spade a spade, and allowing the data to speak for itself.

  • The author calls the mistakes "confirmed hallucinations" without proof (just stronger or weaker evidence). The data never "speak for itself": the author curates the data and crafts a story about it. The story presented here is very suggestive (even using the term "hallucination" is suggestive). But calling it "100 suspected hallucinations", or "25 very likely hallucinations", does less for the author's end goal: selling their service.

    • Obviously a post on a startup's blog will be more editorialized than an academic paper. Still, this seems like an important discussion to have.

Bibtex entries are often also incorrectly generated. E.g., Google Scholar sometimes puts the names of the editors instead of the authors into the bibtex entry.

  • > Bibtex entries are often also incorrectly generated

    ...and including the erroneous entry is squarely the author's fault.

    Papers should be carefully crafted, not churned out.

    I guess that makes me sweetly naive.

    • That's not happening, for a similar reason that people do not bug-check every single line of every single third-party library in their code. It's a chore that costs valuable time you could instead spend on getting the actual work done. What's really important is that the scientific contribution is 100% correct and solid. For the references, the "good enough" paradigm applies. They mustn't be completely bogus, like the referenced work not existing at all, which would indicate that the authors didn't even look at the reference. But minor issues like typos, or rare issues with wrong authors, can happen.

      1 reply →

    • I don't think the original comment was saying this isn't a problem, but that flagging it as an LLM hallucination is a much more serious allegation. In this case, it also seems like it was done to market a paid product, which makes the collateral damage less tolerable in my opinion.

      > Papers should be carefully crafted, not churned out.

      I think you can say the same thing for code and yet, even with code review, bugs slip by. People aren't perfect and problems happen. Trying to prevent 100% of problems is usually a bad cost/benefit trade-off.

    • What's the benefit to society of making sure that academics waste even more of their valuable hours verifying that Google Scholar did not include extraneous authors in some citation which is barely even relevant to their work? With search engines being as good as they are, it's not like we can't easily find that paper anyway.

      The entire idea of super-detailed citations is itself quite outdated in my view. Sure, citing the work you rely on is important, but that could be done just as well via hyperlinks. It's not like anybody (exclusively) relies on printed versions any more.

    • You want the content of the paper to be carefully crafted. Bibtex entries are the sort of thing you want people to copy and paste from a trusted source, as they can be difficult to do consistently correctly.

    • Pointing out these errors isn't wrong. But making the leap to "therefore: AI hallucinations!" without substantiating those accusations is.

The rate here (about 1% of papers) just doesn't seem that bad, especially if many of the errors are minor and don't affect the validity of the results. In other fields, over half of high-impact studies don't replicate.

As with anything, it is about trusting your tools. Who is culpable for such errors? In the days of human authors, the person writing the text is responsible for not making these errors. When AI does the writing, the person whose name is on the paper should still be responsible—but do they know that? Do they realize the responsibility they are shouldering when they use these AI tools? I think many times they do not; we implicitly trust the outputs of these tools, and the dangers of that are not made clear.

Yeah, even for the whole "Jane Doe / John Smith" example, my first thought is that it could have been a LaTeX default value.

There was dumb stuff like this before the GPT era; it's far from convincing.

  • > Between 2020 and 2025, submissions to NeurIPS increased more than 220% from 9,467 to 21,575. In response, organizers have had to recruit ever greater numbers of reviewers, resulting in issues of oversight, expertise alignment, negligence, and even fraud.

    I don’t think the point being made is “errors didn’t happen pre-GPT”, rather that the task of detecting errors has become increasingly difficult because of the associated effects of GPT.

    • > rather that the task of detecting errors has become increasingly difficult because of the associated effects of GPT.

      Did the increase to submissions to NeurIPS from 2020 to 2025 happen because ChatGPT came out in November of 2022? Or was AI getting hotter and hotter during this period, thereby naturally increasing submissions to ... an AI conference?

      6 replies →

  • There are people who just want to punish academics for the sake of punishing academics. Look at all the people downthread salivating over blacklisting, or even criminally charging with felony fraud, people who make errors like this. It's the perfect brew of anti-AI and anti-academia sentiment.

    Also, in my field (economics), by far the biggest source of finding old papers invalid (or less valid; most papers state multiple results) is good old-fashioned coding bugs. I'd like to see the software engineers on this site say with a straight face that writing bugs should lead to jail time.

    • And research codebases (in AI and otherwise) are usually of extremely bad quality. It's usually a bunch of extremely poorly-written scripts, with no indication of which order to run them in, how inputs and outputs should flow between them, or which specific files the scripts were run on to calculate the statistics presented in the paper.

    • > I'd like to see the software engineers on this site say with a straight face that writing bugs should lead to jail time.

      My hand is up.

      I do not believe in gaol, but I do agree with the sentiment.

      5 replies →

> So the citation was not fabricated, but it was incorrectly attributed (perhaps via use of an AI autocomplete).

Well, the title says "hallucinations", not "fabrications". What you describe sounds exactly like what AI builders call hallucinations.

  • Read the article. The author uses the word "fabricate" repeatedly to describe the situation where the wrong authors are in the citation.

This is par for the course for GPTZero, which also falsely claims it can detect AI-generated text, a fundamentally impossible task to do accurately.

  • I'm not going to bat for GPTZero, but I think it's clearly possible to identify some AI-written prose. Scroll through LinkedIn or Twitter replies and there are clear giveaways in tone, phrasing and repeated structures (it's not just X it's Y).

    Not to say that you could ever feasibly detect all AI-generated text, but if it's possible for people to develop a sense for the tropes of LLM content then there's no reason you couldn't detect it algorithmically.

The earlier list of ICLR papers had way more egregious examples. Those were taken from the list of submissions, not accepted papers, however.

> relatively harmless and minor errors

They are not harmless. These hallucinated references are ingested by Google Scholar, Scopus, etc., and with enough time they will poison those wells. It is also plain academic malpractice, no matter how "minor" the reference is.

The example you provided doesn't sit right with me.

If the mistake is one error of author and venue in a citation, I find it fairly disingenuous to call that a hallucination. At least, it doesn't meet the threshold for me.

I have seen this kind of mistake made long before LLMs were even a thing. We used to call them just that: mistakes.

Sorry, but blaming it on "AI autocomplete" is the dumbest excuse ever. Author lists come from BibTeX entries, and while those often contain errors, since they can come from many sources, they do not contain completely made-up authors. I don't share your view that hallucinated citations are less damaging in the background section. Background, related works, and introduction are the sections where citations most often show up. These sections are meant to be read, and generating them with AI is plain cheating.

  • I'm not blaming anything on anything, because I did not (nor did the authors) confirm the cause of any of these errors.

    > I don't share your view that hallucinated citations are less damaging in the background section.

    Who exactly is damaged in this particular instance?

    • Trust is damaged. I cannot verify that the evidence is correct, only that the conclusions follow from the evidence. I have to rely on the authors to truthfully present their evidence. If they, for whatever reason, add hallucinated citations to their background, that trust is 100% gone.

      1 reply →