
Comment by godelski

2 days ago

I find this study quite suspect. I'd have to dive deeper but there are definitely significant alarm bells that should be going off for anyone reading.

Figure 2 (page 6) screams problems. There are only 16 professors (3k comparisons each?!?!) and the professors are all over the place. That's very high variance, suggesting the study has no meaningful statistical power. Poor instructor 16 can't catch a break lol

There's also really clear bias given that the main results only feature Google models. Other models show up elsewhere, why not there?

I'm no lawyer, but I'm a pretty competent statistician and can confidently say this paper has a smell to it. I can't call it bullshit, but there are red flags all over

Independent of whether it has any meaning (because the entire paper might be a bit iffy), I find it curious that Instructors 3 and 8 have the lowest harmfulness rates, quite a bit lower than even the LLMs, but not the highest preference rates. Harmfulness anticorrelates with preference, but not perfectly. Some amount of charisma appears to be a factor even in selections by professionals?

  • This is exactly why I'd be cautious about interpreting the preference metric too strongly

  • Yeah it's difficult to interpret.

    One possible interpretation, the statements were very bland. These would be very low harm but also not very informative

Sure, but in two years AI has gone from “impressive tool, but not a replacement for knowledge workers” to “the study where it beats our highest caliber of knowledge workers may have some methodological deficits.” In another two years it’s going to be curtains.

  • The issue is, it almost always outperforms knowledge workers.

    IF the right questions are asked, and IF steered into and corrected at a few crucial points. IF not it goes off in the wrong direction really quick and that's a problem that's still mostly unsolved in the last 2 years.

    And that can be catastrophic in high risk environments, like legal, medical or high risk software products where being wrong in the wrong place can mean bankruptcy or even cost a life.

    I help run a few marketing websites where I let the CEOs run crazy with Claude cowork; they are making PRs like madmen, but they are not allowed to touch any of the APIs & platforms where there is real user data & sensitive information.

    • Ya, while the tools are really solid and have seen huge leaps these past two years, in no way will an LLM be able to do any of it unguided in two years. Just a humble opinion that I would love to see be wrong.


    • > And that can be catastrophic in high risk environments, like legal, medical or high risk software products where being wrong in the wrong place can mean bankruptcy or even cost a life.

      Which also happens with humans – does it do so at a lower rate? On its own, it kind of sounds like similar anti-self-driving-car arguments.


    • I kinda disagree. High risk environments just mean that they will have to have a human-in-the-loop for a longer time, which drastically reduces the skill required of that human (which still requires high skill, just not stupidly high).


    • Yeah but even what you describe makes it an extremely useful tool and productivity boost. Sure, we're not going to deploy a lawyer agent with full autonomy and no more oversight than a real lawyer. But isn't it wild that's now the frontier?

      It's not like self driving cars where better than a human 80% of the time isn't good enough and they aren't really usable until it's 95%, 99% etc.

  • > the study where it beats our highest caliber of knowledge workers may have some methodological deficits

    The point is that if the study can't validate the claims being made then we can't actually extrapolate from that claim. What you're predicting may or may not come true, but the study (which is the topic at hand) isn't useful for supporting the assertion.

  • > Sure, but in two years AI has gone from “impressive tool, but not a replacement for knowledge workers” to “the study where it beats our highest caliber of knowledge workers may have some methodological deficits.”

    With that kind of logic ... anything is possible.

  • I'd say if it does have methodological deficits, it should be ignored. Measuring a length with a wet spaghetti can only result in nonsense.

  • Autopilots have been able to land planes for years (decades?), and yet they still don't land passenger planes at any increased rate.

  • Assuming it keeps improving at the same rate, which I think we are already seeing not play out. If you compare the first six months when GPT truly hit the mainstream to the previous six months, the improvements are not nearly as evident. That isn’t to say they aren’t noticeable, I could definitely tell it’s improving, but not nearly at the pace it once was.

    There’s also the fact that they can’t possibly keep improving frontier models at the same rate (i.e. training investment) when investment starts slowing down. The amount of cash being burned is completely unsustainable and you’re already seeing some pullback.

    • On the other hand we keep seeing only marginal generational improvements in the CPU space, yet performance gains over the last 10 years in CPUs are very material.

      Every new model might not be a leap like it used to be, but give it enough time and improvements add up.


    • The issue is that before GPT models basically were useless for any conversation. We are literally in science fiction realm. From a text conversation perspective the gap between where we are at and what’s left to get to is relatively small.

      In my opinion, the main thing we need to do is have training happen continuously. And probably more real world data (from sensors).


    • I agree. But notice that you assume that there is a metric with which you can measure improvement. Which is fine if you are measuring against your personal taste.

      But it might be that the optimization target itself has a ceiling. If you're training toward human approval ratings from a broad population, you converge toward what median preference selects for. The plateau is baked into what you're measuring against.

    • It doesn't even need to 'improve' at the same rate to have extraordinary impact in society. Even if the frontier models stayed roughly the same in cost and capability for just 1-2 years, the harnesses and processes built around them would mature. We have not yet metabolized these models. Frankly, a lot of this feels like late 80s early 90s complaints about how office computerization wasn't happening yet--it was, just not at the rate promised by the companies selling computers to businesses. We don't look back at those people in the 80s saying that paper was here to stay as visionaries just because they noticed that propaganda temporarily outran the business environment.

      I just wish people would take a step back and think about the timescales here. Language Models are Unsupervised Multitask Learners was in 2019. Here we are seven years later and LOOK AROUND. The landscape is unrecognizable. It's worth thinking about who, in those seven years, had an accurate estimate of the future and whose estimate fundamentally failed. And just as it is valuable to note where propaganda about progress speeds past where we are, we should remember that it is costless to announce that at some unspecified future time all of this will settle down and things will go back to the way they were.


  • >the study where it beats our highest caliber of knowledge workers may have some methodological deficits.

    That isn’t even remotely what this study is looking at.

  • Your “some methodological deficits” is doing a lot of work.

    • What if the methodological deficits are actually causing the paper to underestimate the quality of the AI responses? Why assume any deficits would bias the AI's competence upwards instead of downwards?


  • I mean, my shoe could beat the highest caliber of knowledge workers with enough methodological deficits.

  • "the study that claims it beats our highest caliber of knowledge workers has methodological deficits" ftfy

    so extrapolating from that, in another two years it will continue to bamboozle

More than that, the entire structure of the study is pointless. They set it up as question/response and then had humans rate the response. That's literally what LLMs are trained to do, which ultimately is convincing a human to click the "I like this one better" button on its response.

  • LLMs are trained to convince a typical human to click the "I like this one better" on their response.

    Convincing a human law professor to click the "I would prefer to deliver this response to a student" button, and to not click the "this response is pedagogically harmful" button is a different task!

    I could imagine an LLM convincing a typical human to click the "I like this one better" button with flattery, or with nice-sounding platitudes, or with hand-wavey explanations that sound plausible. And in fact that's exactly what LLMs do when they go wrong - they bluff and output superficially plausible nonsense!

    But these weren't typical humans, these were law professors specifically tasked with deciding which response was a better option to give to students as a canonical answer to a contract law question. So I think this is a genuinely impressive result.

  • This is kind of like saying you can't compare Computer Vision models to Human performance because those models were literally trained to identify objects in images...

    • I'm not saying you can't compare them, I'm saying it's pointless. LLMs are extremely large scale multivariate regression machines; evaluating their output within their own training domain is as pointless as seeing if a ball rolls downhill.

I think your 3k figure comes from here - It is explained:

> As judges, the professors then completed 2,918 blinded, forced-choice comparisons (median per judge: 200), each time indicating which of the two anonymized responses, from the instructor or the LLM, they would rather give to a student

  • So were the answers fact checked? If not, that seems like a pretty obvious flaw!

    • The study deliberately analyzes questions that don't have clear black or white answers, what matters is the reasoning.

More and more I see papers that interview 8 people and draw conclusions based on their expert opinions. AI and cybersecurity are full of this.

Even saw some where they just slapped interviews + protocol into chatgpt as 'methodology' to extract the results -_-. Peer reviewed and published.

  • People don't always have the resources to conduct massive "proper" studies. We live in the real world, and have to settle for what studies people can conduct.

    Not saying we should take such studies as the "gospel truth" ... but if you ignore them and only consider "proper" studies, you'll be waiting a very long time to learn anything new.

    • You are saying the companies that are planning to build structures the size of Manhattan, while claiming multiple trillions TAM, and eventual apotheosis, along with the consumers of these models can't scrape together enough coins to fund a study with a decent statistical power?

      We have to settle for 'crumbs'?

      Why would you say this like it is true?

    • These studies are often conducted by the AI companies themselves (in this case, an institute that receives funding from AI companies). If they were interested in the truth (which they obviously are not) and not propaganda (which they obviously are), they would fund the necessary research. AI companies have plenty of money and can well afford to do this properly.

      Other than AI companies, a more realistic option is state funded universities (particularly in Europe and East Asia) which have consumer protection agencies whose purpose is to protect their residents from corporate greed, and as such should fund, commission, or even conduct such studies. They also have enough money to do this properly.

      If there is enough money for propaganda, there should also be enough money for the truth.

The paper says the professors have a median of 200 comparisons each. It also says they only used 2 models because using more models would require more comparisons and they selected Google models because Google was branded/advertised as being education focused. When you see other models show up elsewhere, that's because they extended the main idea to other models but using LLMs to judge instead of human professors.

  • Sure, but the biggest problem is they have no statistical significance. Variance is too high. How do you distinguish the signal from the noise? Confidence intervals aren't enough.

    But is it a surprise law professors aren't great statisticians?

    • I disagree. 16 isn't necessarily the relevant N here but the number of responses is.

      If you have 100 responses from 1 professor, and the AI wins 75% of the time that is very likely a true signal that the AI is better than this prof. It would be incorrect to generalize this to all profs though.

      Further, if you sample 16 profs and the AI beats 10 of them you can be fairly certain that the real percentage of profs it beats isn't 10%. Further, when estimating the probability that the AI beats a random prof, it's the relative estimation error that scales with 1/sqrt N. If you have a coin and it lands heads up 16 times, that tells you something quite robust about the coin.

      Reasonably estimating confidence intervals at small N and high p is not trivial. But it can be done.

      A good heuristic is "add 2 successes and 2 failures" which is due to Agresti & Coull.

      See down the page here for source papers:

      https://en.wikipedia.org/wiki/Binomial_proportion_confidence...
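      To make the arithmetic above concrete, here is a minimal Python sketch. It is my own illustration, not anything from the paper: the function names `agresti_coull_interval` and `binomial_tail` are hypothetical, and the inputs (10 of 16 professors, 75 wins out of 100) are just the made-up numbers from this comment, not the study's data. With z = 1.96 the adjustment is close to the "add 2 successes and 2 failures" heuristic.

```python
import math


def agresti_coull_interval(successes, trials, z=1.96):
    """Approximate confidence interval for a binomial proportion.

    Agresti-Coull adjustment: add z^2/2 pseudo-successes and z^2/2
    pseudo-failures, then apply the usual normal-approximation formula.
    z = 1.96 corresponds to roughly 95% coverage.
    """
    n_adj = trials + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clamp to [0, 1] because a proportion cannot leave that range.
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


def binomial_tail(successes, trials, p=0.5):
    """Exact P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        math.comb(trials, k) * p ** k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical: the AI beats 10 of 16 sampled professors.
low, high = agresti_coull_interval(10, 16)
print(f"share of profs beaten, approx. 95% CI: {low:.2f} to {high:.2f}")

# Hypothetical: the AI wins 75 of 100 comparisons against one professor.
# Chance of seeing that many wins if the two were really evenly matched:
tail = binomial_tail(75, 100)
print(f"P(>= 75 wins out of 100 | p = 0.5) = {tail:.2e}")
```

      Under these assumptions the interval for 10 of 16 comes out at roughly 0.39 to 0.82: wide, but nowhere near 10%. Note that the tail calculation treats the 100 comparisons as independent, which repeated judgments by the same professors may not be.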

  • I think it is more likely that they selected Gemini because the lead author is a fellow at an institute which receives a lot of their funding from Google.

The study was conducted by Stanford’s HAI institute, which receives heavy funding from Google (how much I couldn’t find because they don’t publish their donations in a place I could find it; but I suspect it is a lot). And the authors did not declare a non-conflict of interest at the end of the paper.

  • Wait, where are you seeing the link to HAI? TFA mentions something called "liftlab" which seems to be something under Stanford Law School and separate from HAI. The study has more than a dozen authors from as many different universities but HAI is not mentioned.

    • You are right, this study was technically conducted by The Stanford Law AI Initiative which is co-chaired by Julian Nyarko who is also a senior fellow at HAI, and is also the lead author of this study.

      This is enough of an association to claim a conflict of interest between the study authors and Google. But I wanted to go further and see if The Stanford Law AI Initiative had been given a research grant from HAI. So I spent way too long on both of their websites to find a list of research grants either awarded by HAI or received by Stanford Law AI Initiative. But no such luck. Despite HAI having a page dedicated to Centers and Labs, and to Research partners, and despite claiming 500+ research funded, they only list like 6 organizations each, and then link to each other in their “See More” button below.

      I have a feeling I will have to browse through some tax filing papers to find the truth here. But I am not a journalist, so I am not gonna. I am simply gonna leave it at the obvious associations involved here. And maybe issue a correction: “conducted by a senior fellow at HAI”.

  • Do papers need a "non-conflict of interest" disclosure nowadays to not be considered just ads?

    • When they are studying a consumer product it is pretty customary to declare a non-conflict of interest. So yes. Declare it at the end of your paper please.

      Unless you have a conflict of interest, in which case declare e.g. “the lead author of this paper is an Associate Director and Senior Fellow at HAI which receives funding from Google the company which makes Gemini”.

  • The HAI is also funded with money from OpenAI, Anthropic, and other big tech corporations. I don't know what you are trying to prove.

> There's also really clear bias given that the main results only feature Google models.

The main results also don’t seem to know what a “model” is, as the two “models” it refers to are “stock Gemini 2.5 Pro” and “a retrieval-augmented version of NotebookLM”.

One of which is a model, and the other of which is an interface backed by different models depending on exactly when the analysis was performed.

I find it entirely likely that the preference for the AI generated answers is entirely due to the confidence of its assertions. Given the numbers of evaluations each prof had to do, there’s no way they researched the answers thoroughly. But if there’s one thing we all know LLMs can do well, it’s to generate text that sounds extremely confident. And that signal is appealing in choosing which of two statements you’d give to students.

But does it really matter? It seems fairly obvious that AI is going to outperform professors. While the studies run, there are three more model releases that change the calculus entirely. I wonder how much we are learning with these studies about what is going on.

  • > I wonder how much we are learning with these studies about what is going on.

    So your alternative is to not have any studies and everyone can just stump up anecdata as "evidence" for the capabilities of these models?

    • Doing things that are well meaning, but ineffective is not great policy. The simplest alternative to doing things that don't work is always not doing them. Better ideas are of course welcome, but not required.


Agreed. The study might show something useful, but the headline is doing a lot of work.

Viewed the other way around, one should ask with what intent the study was set up like this. And for obvious reasons it sounds monetary in nature.

I never get the same answer from any two lawyers. I hate law as a result. With developers you might get disagreements based on experience, but there's usually a strong consensus on specific things; with lawyers and courts it's all over the flipping place. I wouldn't be surprised if LLMs can "pass" on paper (i.e. college exams) but in practice, they might 'struggle' in different courts.

...On the other hand, if an LLM has access to every transcript of every case a Judge has overseen, they might have an unfair advantage in any case... Hmmm...

This is all assuming the AI lawyer doesn't hallucinate and start referencing cases that don't exist.

  • I now foresee a future where law firms have models trained on all the transcriptions of individual judges, lawyers and prosecutors, and run agents against them to decide on the optimal strategy for a case.

    • Agree, though I've also heard from a lawyer to be very careful trusting an LLM for legal advice, and I believe them because the law is insanely nuanced (they disagree with me on this). Just talk to a room of lawyers about what should be "simple" clean cut legal issues, and they might ALL disagree based on nuanced reasons and personal experiences with cases.

> That's very high variance

Do you doubt that the educational value of a law professor can vary from 0 to somewhat reasonable? You are not studying screws here.

This is the bit I'm suspicious of:

> They calibrated AI responses to match the length and structure of human answers

which I would guess removes AI's hallucinations and errors somewhat.

> confidently say this paper has a smell to it. I can't call it bullshit, but there are red flags all over

You can confidently say that you are unsure?