
Comment by cogman10

1 day ago

Yuck, this is going to really harm scientific research.

There is already a problem with papers falsifying data/samples/etc.; LLMs being able to put out plausible papers is just going to make it worse.

On the bright side, maybe this will get the scientific community and science journalists to finally take reproducibility more seriously. I'd love to see future reporting that instead of saying "Research finds amazing chemical x which does y" you see "Researcher reproduces amazing results for chemical x which does y. First discovered by z".

In my mental model, the fundamental problem of reproducibility is that scientists have a very hard time finding a penny to fund such research. No one wants to fund a grant that says “hey, I need $1m and 2 years to validate the paper from last year which looks suspicious”.

Until we change how we fund science at a fundamental level, meaning how we assign grants, it will indeed be a very hard problem to deal with.

  • In theory, asking grad students and early career folks to run replications would be a great training tool.

    But the problem isn’t just funding, it’s time. Successfully running a replication doesn’t get you a publication to help your career.

    • Grad students don’t get to publish a thesis on reproduction. Everyone from the undergraduate research assistant to the tenured professor with research chairs is hyper-focused on “publishing” as many “positive results” on “novel” work as possible.


    • That... still requires funding. Even if your lab happens to have all the equipment required to replicate, you're paying the grad student for their time spent on replicating this paper and you'll need to buy some supplies: chemicals, animal subjects, shared equipment time, etc.

    • You may well know this, but I get the sense that it isn’t necessarily common knowledge, so I want to spell it out anyway:

      In a lot of cases, the salary for a grad student or tech is small potatoes next to the cost of the consumables they use in their work.

      For example, I work for a lab that does a lot of sequencing, and if we’re busy one tech can use 10k worth of reagents in a week.


    • There is actually a ton of replication going on at any given moment, usually because we work off of each other's work, whether those others are internal or external. But, reporting anything basically destroys your career in the same way saying something about Weinstein before everyone's doing it does. So, most of us just default to having a mental list of people and circles we avoid as sketchy and deal with it the way women deal with creepy dudes in music scenes, and sometimes pay the troll toll. IMO, this is actually one of the reasons for recent increases in silo-ing, not just stuff being way more complicated recently; if you switch fields, you have to learn this stuff and pay your troll tolls all over again. Anyway, I have discovered or witnessed serious replication problems four times --

      (1) An experiment I was setting up using the same method both on a protein previously analyzed by the lab as a control and some new ones yielded consistently "wonky" results (read: need different method, as additional interactions are implied that make standard method inappropriate) in both. I wasn't even in graduate school yet and was assumed to simply be doing shoddy work, after all, the previous work was done by a graduate student who is now faculty at Harvard, so clearly someone better trained and more capable. Well, I finally went through all of his poorly marked lab notebooks and got all of his raw data... his data had the same "wonkiness," as mine, he just presumably wanted to stick to that method and "fixed" it with extreme cherry-picking and selective reporting. Did the PI whose lab I was in publish a retraction or correction? No, it would be too embarrassing to everyone involved, so the bad numbers and data live on.

      (2) A model or, let's say "computational method," was calibrated on a relatively small, incomplete, and partially hypothetical data-set maybe 15 years ago, but, well, that was what people had. There are many other models that do a similar task, by the way, no reason to use this one... except this one was produced by the lab I was in at the time. I was told to feed the results of this one into something I was working on and instead, when reevaluating it on the much larger data-set we have now, found it worked no better than chance. Any correction or mention of this outside the lab? No, and even in the lab, the PI reacted extremely poorly and I was forced to run numerous additional experiments which all showed the same thing, that there was basically no context in which this model was useful. I found a different method worked better and subsequently, had my former advisor "forget" (for the second time) to write and submit his portion of a fellowship he previously told me to apply to. This model is still tweaked in still useless ways and trotted out in front of the national body that funds a "core" grant that the PI basically uses as a slush fund, as a sign of the "core's" "computational abilities." One of the many reasons I ended up switching labs. PI is a NAS member, by the way, and also auto-rejects certain PIs from papers and grants because "he just doesn't like their research" (i.e. they pissed him off in some arbitrary way), also flew out a member of the Swedish RAS and helped them get an American appointment seemingly in exchange for winning a sub-Nobel prize for research... they basically had nothing to do with, also used to basically use various members as free labor on super random stuff to faculty who approved his grants, so you know the type.

      (3) Well, here's a fun one with real stakes. Amyloid-β oligomers, field already rife with fraud. A lab that supposedly has real ones kept "purifying" them for the lab involved in 2, only for the vial to come basically destroyed. This happened multiple times, leading them to blame the lab, then shipping. Okay, whatever. They send raw material, tell people to follow a protocol carefully to make new ones. Various different people try, including people who are very, very careful with such methods and can make everything else. Nobody can make them. The answer is "well, you guys must suck at making them." Can anyone else get the protocol right? Well, not really... But, admittedly, someone did once get a different but similar protocol to work only under the influence of a strong magnetic field, so maybe there's something weird going on in their building that they actually don't know about and maybe they're being truthful. But, alternatively, they're coincidentally the only lab in the world that can make super special sauce, and everybody else is just a shitty scientist. Does anyone really dig around? No, why would a PI doing what the PI does in 2 want to make an unnecessary enemy of someone just as powerful and potentially shitty? Predators don't like fighting.

      (4) Another one that someone just couldn't replicate at all, poured four years into it, origin was a big lab. Same vibe as third case, "you guys must just suck at doing this," then "well, I can't get in contact with the graduate student who wrote the paper, they're now in consulting, and I can't find their data either." No retraction or public comment, too big of a name to complain about except maybe on PubPeer. Wasted an entire R21.

    • Enough people will falsify the replication and pocket the money, taking you back to where you were in the first place and poorer for it. The loss of trust is an existential problem for the USA.

    • Grad students have this weird habit of eating food and renting places to live, though, so that's also money

  • Funding is definitely a problem, but frankly reproduction is common. If you build off someone else's work (as is the norm) you need to reproduce first.

    But without repetition being impactful to your career and the pressure to quickly and constantly push new work, a failure to reproduce is generally considered a reason to move on and tackle a different domain. It takes longer to trace the failure and the bar is higher to counter an existing work. It's much more likely you've made a subtle mistake. It's much more likely the other work had a subtle success. It's much more likely the other work simply wasn't written such that a work could be sufficiently reproduced.

    I speak from experience too. I still remember in grad school I was failing to reproduce a work that was the main competitor to the work I had done (I needed to create comparisons). I emailed the author and got no response. Luckily my advisor knew the author's advisor and we got a meeting set up and I got the code. It didn't do what was claimed in the paper and the code structure wasn't what was described either. The result? My work didn't get published and we moved on. The other work was from a top 10 school and the choice was to burn a bridge and put a black mark on my reputation (from someone with far more merit and prestige) or move on.

    That type of thing won't change with a reproduction system alone; it needs an open system, and an open reproduction system as well. Mistakes are common and we shouldn't punish them. The only way to solve these issues is openness.

    • > If you build off someone else's work (as is the norm) you need to reproduce first.

      Not if the result you're building off of is a model, you can just assume it


  • Partially. There's also the issue that some sciences, like biology, are a lot messier & less predictable than people like to believe.

  • yes, this should be built-in to grants and publishing

    of course the problem is that academia likes to assert its autonomy (and grant orgs are staffed by academia largely)

  • I often think we should move from peer review as "certification" to peer review as "triage", with replication determining how much trust and downstream weight a result earns over time.

    • grants should come with money and requirement for independent reproduction

      academia is too fragmented and extremely inefficient

> I'd love to see future reporting that instead of saying "Research finds amazing chemical x which does y" you see "Researcher reproduces amazing results for chemical x which does y. First discovered by z".

Most people (that I talk to, at least) in science agree that there's a reproducibility crisis. The challenge is there really isn't a good way to incentivize that work.

Fundamentally (unless you're independently wealthy and funding your own work), you have to measure productivity somehow, whether you're at a university, government lab, or the private sector. That turns out to be very hard to do.

If you measure raw number of papers (more common in developing countries and low-tier universities), you incentivize a flood of junk. Some of it is good, but there is such a tidal wave of shit that most people write off your work as a heuristic based on the other people in your cohort.

So, instead it's more common to try to incorporate how "good" a paper is, to reward people with a high quantity of "good" papers. That's quantifying something subjective though, so you might try to use something like citation count as a proxy: if a work is impactful, usually it gets cited a lot. Eventually you may arrive at something like the H-index, which is defined as "The highest number H you can pick, where H is the number of papers you have written with at least H citations each." Now, the trouble with this method is people won't want to "waste" their time on incremental work.

And that's the struggle here; even if we funded and rewarded people for reproducing results, they will always be bumping up the citation count of the original discoverer. But it's worse than that, because literally nobody is going to cite your work. In 10 years, they just see the original paper, a few citing works reproducing it, and to save time they'll just cite the original paper only.

There's clearly a problem with how we incentivize scientific work. And clearly we want to be in a world where people test reproducibility. However, it's very very hard to get there when one's prestige and livelihood is directly tied to discovery rather than reproducibility.
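To make the H-index point above concrete, here is a minimal sketch of the usual definition (the largest h such that at least h papers have at least h citations each). The function name and the citation counts are made up for illustration, not taken from any real researcher.

```python
from typing import Iterable


def h_index(citation_counts: Iterable[int]) -> int:
    """Largest h such that at least h papers have at least h citations each."""
    # Rank papers from most cited to least cited.
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(ranked, start=1):
        if citations >= rank:
            h = rank  # the top `rank` papers all have >= rank citations
        else:
            break
    return h


# Hypothetical record: six well-cited "novel" papers.
established = [40, 30, 12, 9, 7, 6]
# Same record plus three replication papers that almost nobody cites.
with_replications = established + [1, 1, 1]

print(h_index(established), h_index(with_replications))
```

Under these assumed numbers the two values come out the same: rarely cited replication papers do not move the metric at all, which is the incentive problem described above.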

  • I'd personally like to see top conferences grow a "reproducibility" track. Each submission would be a short tech report that chooses some other paper to re-implement. Cap 'em at three pages, have a lightweight review process. Maybe there could be artifacts (git repositories, etc) that accompany each submission.

    This would especially help newer grad students learn how to begin to do this sort of research.

    Maybe doing enough reproductions could unlock incentives. Like if you do 5 reproductions then the AC would assign your next paper double the reviewers. Or, more invasively, maybe you can't submit to the conference until you complete some reproduction.

    • The problem is that reproducing something is really, really hard! Even if something doesn't reproduce in one experiment, it might be due to slight changes in some variables we don't even think about. There are some ways to circumvent it (e.g. the team that's being reproduced cooperating with the reproducing team and agreeing on which variables are important for the experiment and which are not), but it's really hard. The solutions you propose will unfortunately incentivize bad reproductions and we might reject theories that are actually true because of that. I think that one of the best ways to fight the crisis is to actually improve the quality of science: articles whose authors refuse to share their data should be automatically rejected. We should also move towards requiring preregistration with strict protocols for almost all studies.


    • Is it time for some sort of alternate degree to a PhD beyond a Master's? Showing, essentially, "this person can learn, implement, validate, and analyze the state of the art in this field"?


  • > The challenge is there really isn't a good way to incentivize that work.

    What if we got Undergrads (with hope of graduate studies) to do it? Could be a great way to train them on the skills required for research without the pressure of it also being novel?

    • Those undergrads still need to be advised and they use lab resources.

      If you're a tenure-track academic, your livelihood is much safer from having them try new ideas (that you will be the corresponding author on, increasing your prestige and ability to procure funding) instead of incrementing.

      And if you already have tenure, maybe you have the undergrad do just that. But the tenure process heavily filters for ambitious researchers, so it's unlikely this would be a priority.

      If instead you did it as coursework, you could get them to maybe reproduce the work, but if you only have the students for a semester, that's not enough time to write up the paper and make it through peer review (which can take months between iterations)

    • Most interesting results are not so simple to recreate that we could reliably expect undergrads to perform the replication, even if we ignore the cost of the equipment and consumables that replication would need and the time/supervision required to walk them through the process.

    • Unfortunately, that might just lead to a bunch of type II errors instead, if an effect requires very precise experimental conditions that undergrads lack the expertise for.


  • > Eventually you may arrive at something like the H-index, which is defined as "The highest number H you can pick, where H is the number of papers you have written with at least H citations each."

    It's the Google search algorithm all over again. And it's the certificate trust hierarchy all over again. We keep working on the same problems.

    Like the two cases I mentioned, this is a matter of making adjustments until you have the desired result. Never perfect, always improving (well, we hope). This means we need liquidity with the rules and heuristics. How do we best get that?

  • > I'd love to see future reporting that instead of saying "Research finds amazing chemical x which does y" you see "Researcher reproduces amazing results for chemical x which does y. First discovered by z".

    But nobody wants to pay for it

  • usually you reproduce previous research as a byproduct of doing something novel "on top" of the previous result. I don't really see the problem with the current setup.

    sometimes you can just do something new and assume the previous result, but that's more the exception. you're almost always going to at least in part reproduce the previous one. and if issues come up, it's often evident.

    that's why citations work as a good proxy. X number of people have done work based around this finding and nobody has seen a clear problem

    there's a problem of people fabricating and fudging data and not making their raw data available ("on request" or with not enough meta data to be useful) which wastes everyone's time and almost never leads to negative consequences for the authors

    • It's often quite common to see a citation say "BTW, we weren't able to reproduce X's numbers, but we got fairly close number Y, so Table 1 includes that one next to an asterisk."

      The difficult part is surfacing that information to readers of the original paper. The semantic scholar people are beginning to do some work in this area.


  • That feels arbitrary as a measure of quality. Why isn't new research simply devalued and replication valued higher?

    "Dr Alice failed to reproduce 20 would-be headline-grabbing papers, preventing them from sucking all the air out of the room in cancer research" is something laudable, but we're not lauding it.

  • > you have to measure productivity somehow,

    No, you do not have to. You give people with the skills and interest in doing research the money. You need to ensure its spent correctly, that is all. People will be motivated by wanting to build a reputation and the intrinsic reward of the work

  • > If you measure raw number of papers (more common in developing countries and low-tier universities), you incentivize a flood of junk.

    This is exactly what rewarding replication papers (that reproduce and confirm an existing paper) will lead to.

    • And yet if we can't reproduce an existing paper, it's very possible that existing paper is junk itself.

      Catch-22 is a fun game to get caught in.

  • > The challenge is there really isn't a good way to incentivize that work.

    Ban publication of any research that hasn't been reproduced.

    • If we did that, CERN could not publish, because nobody else has the capabilities they do. Do we really want to punish CERN (which has a good track record of scientific integrity) because their work can't be reproduced? I think the model in many of these cases is that the lab publishing has to allow some number of postdocs or competitor labs to come to their lab and work on reproducing it in-house with the same reagents (biological experiments are remarkably fragile).

    • > Ban publication of any research that hasn't been reproduced.

      Unless it is published, nobody will know about it and thus nobody will try to reproduce it.


    • lol, how would the first paper carrying some new discovery get published?

Have they solved the issue where papers that cite research already invalidated are still being cited?

  • AFAIK, no, but I could see there being cause to push citations to also cite the validations. It'd be good if standard practice turned into something like

    Paper A, by bob, bill, brad. Validated by Paper B by carol, clare, charlotte.

    or

    Paper A, by bob, bill, brad. Unvalidated.
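As a rough sketch of the citation style suggested above, the validation status could be carried alongside each reference and rendered automatically. The `Paper` class and `format_citation` helper are hypothetical names invented for this illustration, not part of any existing citation tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Paper:
    title: str
    authors: List[str]
    # Papers that independently reproduced this one (empty if none are known).
    validated_by: List["Paper"] = field(default_factory=list)


def format_citation(paper: Paper) -> str:
    """Render a citation that states whether the result has been validated."""
    base = f"{paper.title}, by {', '.join(paper.authors)}."
    if not paper.validated_by:
        return f"{base} Unvalidated."
    validations = "; ".join(
        f"{p.title} by {', '.join(p.authors)}" for p in paper.validated_by
    )
    return f"{base} Validated by {validations}."


paper_b = Paper("Paper B", ["carol", "clare", "charlotte"])
paper_a = Paper("Paper A", ["bob", "bill", "brad"], validated_by=[paper_b])

print(format_citation(paper_a))
print(format_citation(paper_b))
```

The design choice here is that "Unvalidated" is the default state, so a reference only earns the "Validated by" note once a reproducing paper is explicitly attached to it.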

  • Nope.

    I am still reviewing papers that propose solutions based on a technique X, conveniently ignoring research from two years ago that shows that X cannot be used on its own. Both the paper I reviewed and the research showing X cannot be used are in the same venue!

Reproducibility is overrated and if you could wave a wand to make all papers reproducible tomorrow, it wouldn't fix the problem. It might even make it worse.

https://blog.plan99.net/replication-studies-cant-fix-science...

  • ? More samples reduce the variance of a statistic. Obviously it cannot identify systematic bias in a model, or establish causality, or make a "bad" question "good". It's not overrated though -- it would strengthen or weaken the case for many papers.

    • If you have a strong grip on exactly what it means, sure, but look at any HN thread on the topic of fraud in science. People think replication = validity because it's been described as the replication crisis for the last 15 years. And that's the best case!

      Funding replication studies in the current environment would just lead to lots of invalid papers being promoted as "fully replicated" and people would be fooled even harder than they already are. There's got to be a fix for the underlying quality issues before replication becomes the next best thing to do.
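The statistical point in this subthread (more samples shrink variance but cannot remove a systematic bias) can be sketched with a small simulation. Everything here is an assumption chosen for illustration: the true value, the size of the bias, the noise level, and the sample sizes.

```python
import random
import statistics


def simulate_means(n, trials, true_value=0.0, bias=0.5, noise_sd=1.0, seed=0):
    """Return `trials` sample means, each computed from n biased, noisy measurements."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        # Every measurement shares the same systematic offset `bias`.
        sample = [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means


small = simulate_means(n=25, trials=1000)
large = simulate_means(n=400, trials=1000)

# Spread of the estimate across repeated "studies" at each sample size.
var_small = statistics.pvariance(small)
var_large = statistics.pvariance(large)

# Average error relative to the true value (0.0 in this sketch).
bias_small = statistics.fmean(small) - 0.0
bias_large = statistics.fmean(large) - 0.0

print(var_small, var_large)
print(bias_small, bias_large)
```

In this setup the spread of the estimates drops sharply with the larger sample size, while the average error stays near the assumed bias. A replication that shares the original study's flawed method would agree with it just as confidently.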


For ML/AI/Comp sci articles, providing reproducible code is a great option. Basically, PoC or GTFO.

  • The most annoying ones are those which discuss loosely the methodology but then fail to publish the weights or any real algorithms.

    It's like buying a piece of furniture from IKEA, except you just get an Allen key, a hint at what parts to buy, and blurry instructions.

    • This is so egregious. The value of such papers is basically nothing but they're extremely common.

Yeah, spot on. If all we do is add more plausible sounding text on top of already fragile review and incentive structures, that really could make things worse rather than better

Your second point is the important one. AI may be the thing that finally forces the community to take reproducibility, attribution, and verification seriously. That’s very much the motivation behind projects like Liberata, which try to shift publishing away from novelty-first narratives and toward explicit credit for replication, verification, and follow-through. If that cultural shift happens, this moment might end up being a painful but necessary correction.

If there is one thing scientific reports must require, it is not using AI to produce the documentation. They can be of the data but not of the source or anything else. AI is a tool, not a replacement for actual work.

> LLMs being able to put out plausible papers is just going to make it worse

If correct form (LaTeX two-column formatting, quoting the right papers and authors of the year etc.) has been allowing otherwise reject-worthy papers to slip through peer review, academia arguably has bigger problems than LLMs.

  • Correct form and relevant citations have been, for generations up to a couple of years ago, mighty strong signals that a work is good and done by a serious and reliable author. This is no longer the case and we are worse off for it.

I think, at least I hope, that a part of the LLM value will be to create their retirement for specific needs. Instead of asking it to solve any problem, restrict the space to a tool that can help you then reach your goal faster without the statistical nature of LLMs.

On the bright side, an LLM can really help set up a reproduction environment.

Perhaps repro should become the basis of peer review?

I heard that most papers in a given field are already not adding any value. (Maybe it depends on the field though.)

There seems to be a rule in every field that "99% of everything is crap." I guess AI adds a few more nines to the end of that.

The gems are lost in a sea of slop.

So I see useless output (e.g. crap on the app store) as having negative value, because it takes up time and space and energy that could have been spent on something good.

My point with all this is that it's not a new problem. It's always been about curation. But curation doesn't scale. It already didn't. I don't know what the answer to that looks like.

Maybe it will also change the whole idea of publication as the evaluation of science.

Reading the article, this is about CITATIONS which are trivially verifiable.

This is just article publishers not doing the most basic verification and failing to notice that the citations in the article don't exist.

What this should trigger is a black mark for all of the authors and their institutions, both of which should receive significant reputational repercussions for publishing fake information. If they fake the easiest to verify information (does the cited work exist) what else are they faking?
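A minimal sketch of the "does the cited work exist" check described above. The lookup is deliberately passed in as a function, because the real source of truth (a publisher database, a DOI resolver, a library catalogue) is not specified in the thread; the `KNOWN_WORKS` set and its entries are made-up stand-ins.

```python
from typing import Callable, Iterable, List


def find_unresolved(references: Iterable[str], exists: Callable[[str], bool]) -> List[str]:
    """Return the references for which the lookup finds no matching record."""
    return [ref for ref in references if not exists(ref)]


# Stand-in for a real bibliographic index; contents are invented for illustration.
KNOWN_WORKS = {
    "doe 2019 protein folding survey",
    "roe 2021 sequencing costs",
}


def in_local_index(reference: str) -> bool:
    """Hypothetical lookup: case-insensitive exact match against KNOWN_WORKS."""
    return reference.strip().lower() in KNOWN_WORKS


cited = ["Doe 2019 protein folding survey", "Nemo 2024 imaginary result"]
missing = find_unresolved(cited, in_local_index)
print(missing)
```

A real check would need fuzzier matching than exact strings, but even this crude version flags a reference that resolves to nothing, which is the failure the comment says publishers are not catching.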

I'd need to see the same scrutiny applied to pre-AI papers. If a field has a poor replication rate, meaning there's a good chance that a given published paper is just so much junk science, is that better or worse than letting AI hallucinate the data in the first place?

  > to finally take reproducibility more seriously

I've long argued for this, as reproduction is the cornerstone of science. There are a lot of potential ways to do this, but one that I like is linking to the original work. Suppose you're looking at the OpenReview page and it has a link for "reproduction efforts" with, at minimum, an annotation for confirmation or failure.

This is incredibly helpful to the community as a whole. Reproduction failures can be incredibly helpful even when the original work has no fraud. In those cases a reproduction failure reveals important information about the necessary conditions that the original work relies on.

But honestly, we'll never get this until we drop the entire notion of "novel" or "impact" and "publish or perish". Novel is in the eye of the reviewer and the lower the reviewer's expertise the less novel a work seems (nothing is novel at a high enough level). Impact can almost never be determined a priori, and when it can you already have people chasing those directions because why the fuck would they not? But publish or perish is the biggest sin. It's one of those ideas that looks nice on paper, like you are meaningfully determining who is working hard and who is hardly working. But the truth is that you can't tell without being in the weeds. The real result is that this stifles creativity, novelty, and impact as it forces researchers to chase lower hanging fruit. Things you're certain will work and can get published. It creates a negative feedback loop as we compete: "X publishes 5 papers a year, why can't you?" I've heard these words even when X has far fewer citations (each of my works had "more impact").

Frankly, I believe fraud would dramatically reduce were researchers not risking job security. The fraud is incentivized by the cutthroat system where you're constantly trying to defend your job, your work, and your grants. There'll always be some fraud but (with a few exceptions) researchers aren't rockstar millionaires. It takes a lot of work to get to the point where fraud even works, so there's a natural filter.

I have the same advice as Mervin Kelly, former director of Bell Labs:

  How do you manage genius?
  You don't