AI outperforms law professors in Stanford Law study

2 days ago (law.stanford.edu)

https://law.stanford.edu/wp-content/uploads/2026/06/salinas_...

I find this study quite suspect. I'd have to dive deeper but there's definitely significant alarm bells that should be going off for anyone reading.

Figure 2 (page 6) screams problems. There's only 16 professors (3k comparisons each?!?!) and the professors are all over the place. That's very high variance, suggesting the study has no meaningful statistical power. Poor instructor 16 can't catch a break lol

There's also really clear bias given that the main results only feature Google models. Other models show up elsewhere, why not there?

I'm no lawyer, but I'm a pretty competent statistician and can confidently say this paper has a smell to it. I can't call it bullshit, but there are red flags all over

  • Independent of whether it has any meaning (because the entire paper might be a bit iffy), I find it curious that Instructors 3 and 8 have the lowest harmfulness rates, quite a bit lower than even the LLMs, but not the highest preference rates. Harmfulness anticorrelates with preference, but not perfectly. Some amount of charisma appears to be a factor even in selections by professionals?

    • This is exactly why I'd be cautious about interpreting the preference metric too strongly

    • Yeah it's difficult to interpret.

      One possible interpretation, the statements were very bland. These would be very low harm but also not very informative

  • Sure, but in two years AI has gone from “impressive tool, but not a replacement for knowledge workers” to “the study where it beats our highest caliber of knowledge workers may have some methodological deficits.” In another two years it’s going to be curtains.

    • The issue is, it almost always outperforms knowledge workers.

      IF the right questions are asked, and IF steered into and corrected at a few crucial points. IF not it goes off in the wrong direction really quick and that's a problem that's still mostly unsolved in the last 2 years.

      And that can be catastrophic in high risk environments, like legal, medical or high risk software products where being wrong in the wrong place can mean bankruptcy or even cost a life.

      I help run a few marketing websites where I let the CEOs run crazy with Claude cowork. They are making PRs like madmen, but they are not allowed to touch any of the APIs and platforms where there is real user data and sensitive information.

    • > the study where it beats our highest caliber of knowledge workers may have some methodological deficits

      The point is that if the study can't validate the claims being made then we can't actually extrapolate from that claim. What you're predicting may or may not come true, but the study (which is the topic at hand) isn't useful for supporting the assertion.

    • > Sure, but in two years AI has gone from “impressive tool, but not a replacement for knowledge workers” to “the study where it beats our highest caliber of knowledge workers may have some methodological deficits.”

      With that kind of logic ... anything is possible.

    • I'd say if it does have methodological deficits, it should be ignored. Measuring a length with a wet spaghetti can only result in nonsense.

    • Assuming it keeps improving at the same rate, which I think we are already seeing not play out. If you compare the first six months when GPT truly hit the mainstream to the previous six months, the improvements are not nearly as evident. That isn’t to say they aren’t noticeable, I could definitely tell it’s improving, but not nearly at the pace it once was.

      There’s also the fact that they can’t possibly keep improving frontier models at the same rate (I.e. training investment) when investment starts slowing down. The amount of cash being burned is completely unsustainable and you’re already seeing some pullback.

    • >the study where it beats our highest caliber of knowledge workers may have some methodological deficits.

      That isn’t even remotely what this study is looking at.

    • "the study that claims it beats our highest caliber of knowledge workers has methodological deficits" ftfy

      so extrapolating from that, in another two years it will continue to bamboozle

  • More than that, the entire structure of the study is pointless. They set it up as a question/response and then had humans rate the response. That's literally what LLMs are trained to do, which ultimately is convincing a human to click the "I like this one better" button on their response.

    • LLMs are trained to convince a typical human to click the "I like this one better" on their response.

      Convincing a human law professor to click the "I would prefer to deliver this response to a student" button, and to not click the "this response is pedagogically harmful" button is a different task!

      I could imagine an LLM convincing a typical human to click the "I like this one better" button with flattery, or with nice-sounding platitudes, or with hand-wavey explanations that sound plausible. And in fact that's exactly what LLMs do when they go wrong - they bluff and output superficially plausible nonsense!

      But these weren't typical humans, these were law professors specifically tasked with deciding which response was a better option to give to students as a canonical answer to a contract law question. So I think this is a genuinely impressive result.

    • This is kind of like saying you can't compare Computer Vision models to Human performance because those models were literally trained to identify objects in images...

  • I think your 3k figure comes from here - It is explained:

    > As judges, the professors then completed 2,918 blinded, forced-choice comparisons (median per judge: 200), each time indicating which of the two anonymized responses, from the instructor or the LLM, they would rather give to a student

  • more and more i see papers. interview 8 ppl, draw conclusions based on their expert opinions. AI and Cybersecurity are full of this.

    Even saw some where they just slapped interviews + protocol into chatgpt as 'methodology' to extract the results -_-. Peer reviewed and published.

    • People don't always have the resources to conduct massive "proper" studies. We live in the real world, and have to settle for what studies people can conduct.

      Not saying we should take such studies as the "gospel truth" ... but if you ignore them and only consider "proper" studies, you'll be waiting a very long time to learn anything new.

  • The paper says the professors have a median of 200 comparisons each. It also says they only used 2 models because using more models would require more comparisons and they selected Google models because Google was branded/advertised as being education focused. When you see other models show up elsewhere, that's because they extended the main idea to other models but using LLMs to judge instead of human professors.

    • Sure, but the biggest problem is they have no statistical significance. Variance is too high. How do you distinguish the signal from the noise? Confidence intervals aren't enough.

      But is it a surprise law professors aren't great statisticians?
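
To put rough numbers on the signal-versus-noise question, here is a minimal sketch of a Wilson score interval for a single judge's preference rate. The judge is hypothetical: 200 comparisons matches the median quoted in the thread, but the 120 "LLM preferred" count is invented for illustration. It also assumes the comparisons are independent, which is probably optimistic when the same judge sees related questions, so the real uncertainty is likely wider.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical judge: 200 comparisons, LLM answer preferred in 120 of them.
low, high = wilson_interval(120, 200)
print(f"preference rate 0.600, interval {low:.3f} to {high:.3f}")
```

With these made-up counts the interval spans more than ten percentage points, which gives a feel for how much a single judge's rate can move from sampling noise alone, before accounting for any between-judge variance.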

    • I think it is more likely that they selected Gemini because the lead author is a fellow at an institute which receives a lot of their funding from Google.

  • The study was conducted by Stanford’s HAI institute, which receives heavy funding from Google (how much I couldn’t find, because they don’t publish their donations in a place I could find; but I suspect it is a lot). And the authors did not include a conflict-of-interest statement at the end of the paper.

    • Wait, where are you seeing the link to HAI? TFA mentions something called "liftlab" which seems to be something under Stanford Law School and separate from HAI. The study has more than a dozen authors from as many different universities but HAI is not mentioned.

    • The HAI is also funded with money from OpenAI, Anthropic, and other big tech corporations. I don't know what you are trying to prove.

  • > There's also really clear bias given that the main results only feature Google models.

    The main results also don’t seem to know what a “model” is, as the two “models” it refers to are “stock Gemini 2.5 Pro” and “a retrieval-augmented version of NotebookLM”.

    One of which is a model, and the other of which is an interface backed by different models depending on exactly when the analysis was performed.

  • I find it entirely likely that the preference for the AI generated answers is entirely due to the confidence of its assertions. Given the numbers of evaluations each prof had to do, there’s no way they researched the answers thoroughly. But if there’s one thing we all know LLMs can do well, it’s to generate text that sounds extremely confident. And that signal is appealing in choosing which of two statements you’d give to students.

  • But does it really matter? It seems fairly obvious that AI is going to outperform professors. While the studies run, there are three more model releases that change the calculus entirely. I wonder how much we are learning with these studies about what is going on.

    • > I wonder how much we are learning with these studies about what is going on.

      So your alternative is to not have any studies and everyone can just stump up anecdata as "evidence" for the capabilities of these models?

  • Agreed. The study might show something useful, but the headline is doing a lot of work.

  • Viewed the other way around, one should ask what the intent was behind designing the study like this. And for obvious reasons it sounds monetary in nature.

  • I never get the same answer from any two lawyers. I hate law as a result. With developers you might get disagreements based on experience, but there's usually a strong consensus on specific things; with lawyers and courts it's all over the flipping place. I wouldn't be surprised if LLMs can "pass" on paper (i.e. college exams) but in practice, they might 'struggle' in different courts.

    ...On the other hand, if an LLM has access to every transcript of every case a Judge has overseen, they might have an unfair advantage in any case... Hmmm...

    This is all assuming the AI lawyer doesn't hallucinate and start referencing cases that don't exist.

    • I now foresee a future where law firms have models trained on all the transcriptions of individual judges, lawyers and prosecutors, and run agents against them to decide on the optimal strategy for a case.

  • > That's very high variance

    Do you doubt that educational value of a law professor can vary from 0 to somewhat reasonable? You are not studying screws here.

  • This is the bit I'm suspicious of:

    > They calibrated AI responses to match the length and structure of human answers

    which I would guess removes AI's hallucinations and errors somewhat.

  • > confidently say this paper has a smell to it. I can't call it bullshit, but there are red flags all over

    You can confidently say that you are unsure?

As a software engineer I have some intuition for what the risks are of letting agents do some tasks vs others.

I don't have a similar intuition calibrated for what could go wrong when asking AI to draft a legal document. Some things seem harmless, i.e. drafting a will, but I don't really know- our legal system is notoriously rife with footguns.

  • I've used general purpose LLM AI (e.g. run-of-the-mill Claude, GPT etc) heavily to draft legal documents. The biggest trap is the hallucinated citation. It will easily insert an absolutely authentic sounding quotation from another case that perfectly proves the point you are trying to make, then it'll make up an authentic name for it, e.g. United States v. Shenzhou Electronics Inc or whatever. You can get really comfortable after checking its output a few times and getting no false citations, and then BAM, it'll put three in the next motion it writes.

    Any lawyer who isn't using LLMs for research is behind the curve, though. They are unbelievable at finding niche cases you would never have found on your own. Previously it was a lot of exact search term matching, which is inherently useless for a lot of legal research. I need something that can search on vaguer terms, which AI can do incredibly well. Just check the results. I'm sure the LLMs from Lexis Nexis/Westlaw are probably better than the general purpose ones.

    LLMs make fantastic paralegals. If you're doing any legal work, you should be using it, even if it's just to shoot ideas at. Have it play devil's advocate. My friend always has it play the other party's lawyer to see what all the counter-arguments are going to be.

    Just like you would with software development. If you care about what you are creating, CHECK THE OUTPUT.

    • > The biggest trap is the hallucinated citation. It will easily insert an absolutely authentic sounding quotation from another case that perfectly proves the point you are trying to make, then it'll make up an authentic name for it, e.g. United States v. Shenzhou Electronics Inc or whatever.

      Naive question from an outsider: aren't there searchable databases of cases (with complete text) so that citations could be checked automatically, either by the same or an independent agent?

    • >The biggest trap is the hallucinated citation

      The "biggest problem" being the one thing that is trivial to verify against concrete databases is a bit convenient don't you think?

      I think it's more likely that it makes mistakes evenly but the one thing that you are able to check with certainty is the only place you discover the errors.

    • Just because the citation exists, what the LLM says it stands for and what it actually stands for are not the same.

      For testing, I've asked (admittedly last-gen) LLMs to generate legal opinions regarding issues in commercial English civil litigation, and I received back cases where the citation is real, but the area of law (family law) is not relevant as family courts apply a very different set of procedural rules.

      (If you squint a bit, they sometimes might be relevant... and could be useful for a particularly creative litigator to make a novel argument on behalf of a very risk tolerant client. But you would very much want to go read those cases and think quite hard about them.)

    • I think the paralegal analogy is right, but with one important difference: a human paralegal usually knows when they are unsure, or at least can be trained to flag uncertainty

    • A legal professional can be personally liable for not finding the most recent case-law.

      The knowledge cut off gap means the models sometimes don't know about the most recent case-law, in a given situation.

      I've seen this happen multiple times now. Accountants and legal professionals advising clients based on outdated information assembled through ChatGPT, Claude and Copilot.

      Professionals drafting letters and missing recent case-law which handles their exact case. It's unreliable. So it can save you some work; but it can't save you all of the work. And in some cases its mistakes really force you to redo all the work, and more, to be thorough and have confidence in the result.

    • It seems companies like Thomson Reuters and other legal services have an incentive to build LLMs with RAG over the texts of legal cases, plus robust hallucination detection on references.

    • ChatGPT regularly hallucinates entire cases out of whole cloth or fabricates an entirely different fact pattern for a given case. Perplexity does much better at citing its sources and providing accurate quotes, at least in my experience.

  • I think this is probably true for most skilled professions. AI is best used in the hands of folks already knowledgeable in the skills/professions they are using it for.

    I liken it to me googling things as a sysadmin vs. Jane from accounting doing it. The non-tech end user is far more likely to make the problem worse, or install something sketchy from the ad-riddled results, than I am, or than one of my help desk employees is.

    I wouldn't trust myself to draft an important legal document using AI without the advice of a lawyer, much like I wouldn't really want to rely on my lawyer to use AI to write code for me.

    • > I think this is probably true for most skilled professions.

      I agree, BUT I also find that it's easy for experts to atrophy quickly. When the AI is right 80-90% of the time it lulls you into overconfidence.

      I find those that are best and make the greatest use are the ones who remain skeptical but also use the tool. The same people who were already nuanced and picky before AI. The same people who already doubted and questioned their own work, and used that suspicion to help prevent them from having overconfidence in their own work. If you weren't willing to just "lgtm" your own code, it's difficult to do that with AI.

      (To be clear, I'm not saying perfectionists. Some might call them that because the picky people have higher standards, but a good expert has to also understand that perfection doesn't exist. That's often a driving force in the suspicion! This also tends to cause them to continually improve)

    • > sysadmin

      Another domain where LLMs are very effective at confidently leading people down a messy path. I have a roommate using LLMs to guide him through setting up some ollama stuff in my WSL (I happen to have the half-decent GPU here) and after multiple rounds of the bot trying to get him to do things that were redundant if not in the wrong direction entirely (and vaguely insulting as a matter of course), I had to write "ground truths" along these lines, and probably more as I find them:

        We are using systemd. ~/.bashrc or similar dotfiles should not be used to start services/processes automatically. Do not "sudo" anything in ~/.bashrc.
      

      [Yes, it did that]

        A systemd service should be created for any processes/services that need to run automatically and persistently. The current output of `systemctl list-unit-files | grep enabled` is available at [ . . . ]  
      
        sshd is already enabled + running and listening on 0.0.0.0:22 and [::]:22. ~/.ssh perms are already 700 and ~/.ssh/authorized_keys perms are already 600. Public key authentication is already enabled in sshd and ~/.ssh/authorized_keys already contains pubkeys ENDING as follows: . . . 
      
        tailscaled is already enabled + running; the tailscale address for [host] is [addr]
      
        It is not necessary to fix connectivity to any 192.168.0.0/16 ; tailscale interface should be used for any traffic to [host] or other hosts involved in the project; hosts/nodes lacking tailscale interface should be assigned one  
      

      [roommate + bot spent 45 minutes on trying to configure their way through NAT when not having to do that is almost the entire point of tailscale. It was just (essentially) like, "You're absolutely right. We have tailscale set up, so we don't need to be able to ssh to that other interface at all. Not troubleshooting that would have saved 45 whole minutes. Oh well, now what?"]

      Maybe it's just me, but I'm not inclined to trust the judgment of something that can't keep this kind of thing straight, which I know is to some degree a matter of having all the needed info in the context window. But maybe it would be able to do that if it didn't waste tokens telling me to cd into the same directory that I'm already in every 2 minutes, or chmod .ssh/ again, or (when it really needs to burn some tokens) blow away the .venv and pull a bunch of modules again just to "start clean".

    • im not so sure

      i think devs overestimate their own role and underestimate others

      i am seeing lawyers and doctors roll out their own software with AI

      but we dont have their training and experience

    • It's like that in engineering, for sure. My background is in aerospace and there are lots of things that a reasonably technically-inclined random can probably do passably. It takes an engineer to know which tasks those are, though.

      I would imagine it's similar in law, in that it takes a lawyer or judge to know where the foot guns lie.

  • IME so far (as both a lawyer and a software engineer), LLM error rates when drafting code and legal documents are reasonably comparable, but it's more problematic in the legal context because legal documents do not benefit from many of the structural safeguards available for code. For legal documents, there are no automated tests, no static typing, no test environments, no logging/observability instrumentation, no sandboxing.

    The time lag between drafting and "deployment" also makes for much less effective, much more expensive debugging loops. You can deploy your code to prod in seconds, see an error pop up in the logs, and immediately start debugging. But it will take at a minimum days and frequently as long as several years before an error in a contract or a court filing will be detected, and often the error is beyond correction at that point. Thus, the errors are both more difficult to detect and to resolve.

    And the consequences of error are often much greater, both because they are not correctable and because a legal error may risk someone's life, liberty, or substantial property. Although that's not categorically the case, obviously bugs in certain safety critical systems can be as bad or even worse than legal mistakes. But in general, most software is lower stakes than most legal writing.

    On the flip side, LLMs do seem to do a better job with basic style and structure for legal documents compared to code. Things like following IRAC format, citing assertions of law (although hallucination remains an issue), and writing comprehensible sentences. These would be the equivalents in code to best practices like good comments, cohesion, consistent use of design patterns, test coverage, clear variable names, DRY, etc. Although the better performance on those more qualitative metrics may just be because even the longest legal documents are typically simpler in structure and have fewer lines of text than a large, complex codebase. Or maybe it's because LLMs are trained on natural language text more than on code. Or because natural language is more forgiving than code, in that minor variation in diction or grammar is unlikely to have any significant effect on how the document is interpreted, whereas even single character errors in code can have enormous effects.

    • There is also one thing I would like to add, and you can correct me if you disagree: coding benefits much more from thorough planning. Now, I exclusively work by first writing a plan that has well-defined steps and goals, which can of course change over time.

      It seems to me like it would be more difficult to achieve with legal documents and, in my experience at least, writing a concrete plan has been the decisive factor that makes my AI coding robust (plus all that you mentioned).

    • This is a very good comment. But notice how even in software engineering there is still disagreement about these structural safeguards.

      So yes, we can say the LLM created bad code when it does not compile or fails prewritten tests.

      But experts might disagree what good comments, good cohesion, appropriate use of design patterns, appropriate test coverage or clear variable names are.

      So what are we supposed to train the LLMs towards? Somebody still has to decide what "good" is.

    • Well, this is largely the fault of law itself, especially English-style law. Without a parseable legal code, and with every single tiny municipality (some less than 1 square mile) having its own set of rules and laws, not all published or available, but which citizens are of course expected to abide by, how could we expect AI to do better than some typical TV southern lawyer who knows the judge?

  • > Some things seem harmless, i.e. drafting a will

    Absolutely not harmless if you're the executor of an estate forced to deal with a screwed up AI will. I just handled my dad's estate this spring. It's a frustrating and confusing process even with the simplest of estates.

    • I recently had to file to become an estate admin with no will at all. And it was literally cheaper for me to fly 3000 miles to do it in person than it was to pay a lawyer. Because lawyers are frankly greedy scumbags half the time. They don't offer an appropriate cost for the service. Instead the conversation immediately goes to "how much" money is in the accounts and suddenly they want a percentage of your father's estate for filing two pieces of paper.

      And in my experience if you do actually pay a lawyer for something they will act like you're not worth their time and will literally roll their eyes at you when you're trying to explain the minor details of a case because they are too lazy to listen and zone in like I would when doing my job.

  • I wouldn't consider drafting a will to be harmless. If it's done poorly the next of kin could have to deal with a huge headache and potentially months or years of probate proceedings.

    • I had a very well crafted will from my parents, one of whom was a very good lawyer hiring other good lawyers. It was still a pain in the ass for many of the reasons they were trying to make it easy for us.

      One thing I learned, just bite the bullet and re-write the whole fucking will instead of making riders.

      Piecing the will together from riders was terrible. All the clauses fell away as everyone got older. The final will could have been 8 pretty clear pages.

      The other part that is hard is just knowing all of the things that happen with assets and a passing. Luckily we had another lawyer and financial folks to advise us. It was still a lot and not that easy to find details. This was pre-AI, which would have helped walk through this shit.

  • I would think that LLMs would be better at avoiding foot-guns. That’s a situation where you have a list of well known rules and potential pit falls, and the work of the lawyer is to apply those to a fact pattern. That’s something that has been hard to automate programmatically, because the fact patterns are similar but different. LLMs, however, seem to excel at applying general principles to differing fact patterns.

    • I would categorize this in the "expertise that people internalize but never figure out how to verbalize" department, and that is a department we have no way to teach an LLM because if nobody is writing out those unspoken, subconscious rules then the LLM has nothing to read about them in its training data.

    • I don't know the source offhand, but I've seen LLMs hallucinating case citations in order to "prove" their premises.

      can't get more foot gun than "well according to [fiction] it is a well established practice (that the defendant is guilty)"

    • But can an LLM come up with questions like what the definition of "is" is? Seems to me there's a lot of "depends on how you read it" type of stuff where lawyers excel at finding novel interpretations. So what coders think of as rules is much less straightforward to understand when it comes to laws

  • I think that's actually a perfect analogy to AI writing code. Drafting a will seems like not a big deal, until that will is accepted as "good enough" and is then in court and under fire.

  • As someone who's been sued frivolously...

    Believe it or not...

    A lot can go wrong if you have real life human lawyers draft a legal document.

  • > drafting a will

    Such a document may not make a difference to the person that eventually will have died, but it can make or break the life of generations to come in countries that are so heavily optimized for dynasty building like the US.

  • This is why I can’t see how college grads are going to survive the AI apocalypse. domain experts driving LLMs are super powerful because they can spot where they make mistakes. Juniors don’t have that insight and the LLMs then cost them productivity.

    • > domain experts driving LLMs are super powerful because they can spot where they make mistakes

      I don’t know if that’ll be true for long. I just had my colleague who’s a very competent engineer IMO hand me a frontier model vibed PR to review (after reviewing it himself, he claims) which contained random variable assignments, conditionals that do nothing, etc. He’d never do such a thing before. People become too comfortable and get confirmation bias as well.

  • I think that's the right intuition. Legal AI feels especially dangerous because the output can look competent while hiding jurisdiction-specific footguns

  • > drafting a will

    Tell me you've never been the executor of an estate in the United States without telling me.

    • I think going through this process has made me uniquely qualified to write one.

  • there’s really no limit to how many times and ways you can review something with AI, except dollars.

  • There will still need to be a lawyer in the loop to review and stamp and take accountability.

    However, the good news is that a whole bunch of lawyer positions in drafting docs and research will be able to be eliminated due to AI.

  • I imagine it's really hard to spot a comma in the wrong place, or a missing sentence in a 10 page contract unless you wrote it yourself, or you assembled it from some battle tested templates.

I understand why the conversation on this article looks like it does, but the study is specifically focused on the potential for LLMs to operate as tutors for law students. I enjoy the extrapolation out to whether LLMs will replace lawyers, but did not find that to be discussed in the study itself.

In the framing of using LLMs as legal tutors, with the implication of lowering the cost of legal training, this seems like a socially-positive outcome. Furthermore, it feels kind of intuitive to me that any contemporary system operating with an LLM and access to legal reference material will be prepared to answer _student-originated questions_ comprehensively and with breadcrumbs or direct references to educational/source materials, as seems to have been found in the study.

The authors explicitly and intentionally emphasize that many legal questions require contextualization, as opposed to some discrete calculated answer. The result of the study implies that the LLM-based systems were capable of using what many of us here understand to be the "stochastic best-fit algorithmic generation" of a contemporary language model to adequately contextualize a student's question, providing insight into the trade-offs or complications implicit in the question, while then, critically, _meeting the professional standards of legal educators in explaining that complexity to a student_.

Realistically, I would hope this provides some confidence to readers of HN that they can actually ask a legal question to an LLM and expect the response will explain the complexity of the law in relation to the question. This is great news, and is likely the minimal pre-work any of us should do before actually consulting a lawyer, if time permits.

On the other hand, I do _not_ think that this study provides any indication that an LLM is prepared to actually provide direct legal counsel. Possibly in the same way that a legal textbook does not replace legal counsel, or perhaps more accurately, the same way that stumbling upon a legal case study for approximately the same situation you're in doesn't guarantee you'll have the same result.

  • > On the other hand, I do _not_ think that this study provides any indication that an LLM is prepared to actually provide direct legal counsel

    I think it indicates that LLMs are smart enough to be used in the context of law education.

In general it is not surprising. Even if this particular study is bad.

There are certain areas of law work that are about analyzing large amounts of texts, drawing conclusions and writing other texts based on that and nothing more. That is literally the bread of LLMs.

Those types of lawyers should be the first in line for unemployment, not programmers, not even close.

  • "That is literally the bread of LLMs." Correct. However, programming has a large number of advantages re: LLM use compared to law:

    You can execute the logic, and set up loops from the output. You can set up more useful RL. It's easier to generate synthetic training data. It naturally supports tool use and agent parallelism. It's easier to integrate with APIs (with what few APIs the court systems provide). Programming explicitly encodes abstractions at the function, module levels etc that are easier to KG/reason/build upon than text chunks.

  • 'Bread *and butter'. The English expression requires the second part—but otherwise fits perfectly in your well-stated point, with which I wholeheartedly agree.

    Source: AAL.

    • Thank you! As a non native speaker I was not sure if “and butter” is a mandatory part but didn’t want (nor had time) to llm the comment for the sake of authenticity :) TIL

  • Just because it is theoretically the bread and butter of LLMs does not mean LLMs are capable of doing the job. It still needs to be proven, setting prior beliefs aside. Law is a life-critical system and deserves our highest level of scrutiny.

  • I see the same problem with AI in both programming and law though.

    AI is like a scab on a wound: it's a temporary filler, it rushes in to fill a void, but it's not going to be the final solution.

    Models showed us that there was huuuge unmet demand for literacy, both in software and in law. But now we have a choice to either address the systemic causes of the unmet demand, or just try to paper over them with layers and layers of AI scab.

    • > But now we have a choice to either address the systemic causes of the unmet demand, or just try to paper over them with layers and layers of AI scab.

      Yeah, but in my experience it won't come down to "which is the better solution" but "which is cheaper/easier"

      So I look forward to lots of layers of papered over AI scabs in the future. It won't be cheaper in the long run, but it will pump someone's quarterly numbers enough that they get a promotion before the problem they introduce comes back to them

  • These are academics. Not to disparage them or their work at all but it is very different to the transaction or litigation work that is done in BigLaw. It is a lot more focused on analysing and summarising existing texts, which are themselves more easily available for LLMs to train on (statutes, case law, legal journals, textbooks). As such it is probably the easiest legal work to LLM-ify but also the least valuable, because I assume law professors aren't getting paid nearly as much as BigLaw lawyers. So this approach won't scale. Not to say AI won't crack BigLaw but it will be a different challenge.

  • LLMs answered student questions off the top of their heads, without any refresher look into the case law. And systems that were primed with the case law like NotebookLM underperformed when compared to baseline LLMs that you'd ask anything about anything.

    It's not about what LLMs can or are suited to do. This study shows strengths of what's already in them, innately.

  • The more I see the evolution, the more it looks to me that every knowledge worker is going to be impacted.

  • > analyzing large amounts of texts, drawing conclusions and writing other texts based on that and nothing more

    The same could be said about programming. Or if you want to be even more reductive, looking at a screen and pressing buttons to make the correct lights light up https://xkcd.com/722/

    • Philosophically or metaphorically speaking - yes.

      But in my comment it is literally what some subset of lawyers do.

      Literally is much more tangible and risky in terms of real impact on employment etc.

    • Oh wow, did Randall Munroe inadvertently predict the employee workload in the show Severance? :)

I'm surprised Stanford Law would go along with this over-reaching press release title. How about "For common first-year contracts-law questions, law professors preferred AI-generated answers to professor-generated answers"

  • The revised title is spot on. It's odd to me how academics are trying to sound like top research labs' CEOs trying to pump valuations by overreaching claims.

    • It is rarely the academics writing the press release. It is even rarer that the author of the press release chooses the title.

I wonder if this could be explained in a similar way to Hollywood movies. If the movies are designed to please the largest group of people, there is a greater chance people will choose to see it than another movie. The human law professors come with their own personalities, beliefs, and opinions that come through in their writing. An LLM has been trained to please the largest swathe of the population. That doesn't mean the answer is better; just like Captain America isn't necessarily better than American Beauty.

My best guess is that Gemini was trained on the textbooks that the questions are meant to test against, thus they are probably better at explicit recall of those questions or related questions.

This is a pretty limited introductory course based on what it says in the methods of the paper itself.

  • That and the research is done by Stanford’s HAI institute with an obvious bias and the paper is curiously missing a conflict of interest statement.

    EDIT: just found out that Google is a major donor to HAI. So this research is at least partially funded by Google. Which is probably the reason the authors fail to declare no conflict of interest.

Figure I.1 is telling. It shows answer length is the strongest predictor of win rate. I suspect this is due to the flawed methodology of the study. Professors were instructed to be succinct ("Please be concise. We expect that each answer takes no more than 3 minutes to write down.") and likely erred on the short side. Also, professors may not have put great effort into their written answers, especially when already trying to be concise. This isn't the headline the authors think it is.

By its very nature, the field of law is ideally suited for AI language models. Fundamentally, everything is based on interconnected texts. I believe that even larger waves of layoffs could loom here than in the IT sector. However, it is likely that a more powerful lobby will be at work here—one that will grossly inflate the perceived value of their work and shield it from outside intrusion.

  • As a lawyer, I think your intuition is right re llms. Law is the wordplay that llms thrive at.

    However the waves are starting and they ARE going to be huge. Corporate clients are insisting on AI. They don’t want to pay an associate hours to draft anything to be reviewed by a partner. They want a top partner to use AI and just proofread.

What the LLM cannot do is explain why it said what it said, when cross-examined. It simply hallucinates the best account of why someone would have said such a thing as it said, same as it can give a probable account of why someone else said something different. The question 'But why did you say this not that ...?' does not lead it to make explicit its grounds for what it said, but just to make a new more complicated statement.

  • This is true in the naive case.

    There are however LLM context building techniques that anchor completions in data structures that persist the structure of claims that support the conclusion contained in a completion. Lots of different patterns exist —organizing logic in language is a rich domain— but the one I’ve liked the most is something called a Claim Dependency Graph that models the relationships between atomic claims as graph edges.

    There’s a whole suite of operations you can perform on these structures, and “reconstruct how you came to this conclusion” is absolutely one of them.
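A minimal sketch of what such a claim dependency structure could look like, assuming each atomic claim simply lists the claims it rests on. The `ClaimGraph` class, its method names, and the example claims are all hypothetical illustrations, not any particular library's API or the commenter's actual implementation:

```python
from dataclasses import dataclass, field


@dataclass
class ClaimGraph:
    """Toy claim dependency graph: each claim records the claims it rests on."""
    text: dict[str, str] = field(default_factory=dict)
    supports: dict[str, list[str]] = field(default_factory=dict)

    def add(self, claim_id: str, text: str, rests_on: tuple[str, ...] = ()) -> None:
        # Supporting claims must already exist, which keeps the graph acyclic.
        for dep in rests_on:
            if dep not in self.text:
                raise KeyError(f"unknown supporting claim: {dep}")
        self.text[claim_id] = text
        self.supports[claim_id] = list(rests_on)

    def explain(self, claim_id: str) -> list[str]:
        """Return supporting claims in dependency order, ending with claim_id."""
        ordered: list[str] = []
        seen: set[str] = set()

        def visit(cid: str) -> None:
            if cid in seen:
                return
            seen.add(cid)
            for dep in self.supports[cid]:
                visit(dep)
            ordered.append(cid)

        visit(claim_id)
        return ordered


# Hypothetical example claims.
g = ClaimGraph()
g.add("c1", "The goods were delivered.")
g.add("c2", "No payment was made.")
g.add("c3", "The buyer breached the contract.", rests_on=("c1", "c2"))

for cid in g.explain("c3"):
    print(cid, g.text[cid])
```

The point of the design is that "reconstruct how you came to this conclusion" becomes a graph traversal over stored claims rather than a fresh completion, so the answer cannot drift between askings.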

  • A human has a motive that exists that frames the thought being expressed. An LLM is going to be creating a “de novo” thought in response to a line of questioning.

  • Same is probably true of humans. In a conversation, we often respond from instinct, then work backwards to a rationalization only when asked. For more considered thoughts, if we’re lucky, we can remember our “reasoning traces” but that’s as deep as our introspection goes. Unless we’re neuroscientists, we don’t even know how many neurons we have, let alone have any understanding of how they generate our thoughts. Motivated reasoning impairs our introspection further, and then dishonesty and communication errors prevent us from relaying the limited remaining information to each other.

    Model interpretability work has advanced a lot. Arguably we already can explain AI decision-making better than human brains.

    • No, it happens in the immediate context, where e.g. we say 'No I meant Meredith Jones, not Meredith Smith'- and the possibility of this elaboration is actually part of ordinary communication. I did mean Meredith Jones, not Meredith Smith - thus the use of the past tense. The LLM will just give the best answer for what one might have meant, completely reopening calculation.

      The point is familiar but there are good illustrations in the Atlantic article by a book editor. At first it seems abstract AI hate, but then she gets to the details. AI text cannot be edited. https://www.theatlantic.com/technology/2026/05/how-to-tell-a... or https://archive.ph/YJsGK

    • Nonsense, some of my friends are lawyers and they're able to give you consistent interpretations on why they think about a certain aspect of a law a certain way. The whole thing is that they work with this the entire time, so they have a really consistent 'head model' of how things work and why and how considerations should be weighted/ordered/whatever. LLMs just do not have this, there's no consistent underlying reasoning (the 'reasoning' traces in LLMs are really inconsistent)

  • LLMs hallucinate, because humans hallucinate.

    Asking the LLM in a way where it annotates its sources, it can greatly increase the pattern matching to closely simulate logic, just like in humans.

    I understand the question of why did you say this, not that, I have seen other ways of asking that which do not seem to trigger the LLMs over-response in the other direction.

    • No, the hallucination of its reasons follows immediately from the technique of probabilistic inference. You can see this in real time, just ask 'why did you use this word, not that word?' It is in the position of a desperate liar. All its responses are essentially 'rationalizations'

I do question at what point AI could be useful as a teaching aid.

The quality of LLMs depends heavily on, among other things, how you word your questions.

Knowing the correct questions to ask is not something most students know how to do given that it tends to require a fair bit of pre-existing domain knowledge.

Having been a law student and practicing lawyer, it's clear to me that law professors aren't really representative of much if any part of private practice. Most of the things they think and reason about are quite theoretical and academic, and it doesn't surprise me that the models would regurgitate a more average response which most human graders would prefer.

That's the entire point, though!

The legal academy is supposed to have outlying opinions on things and present novel philosophical answers to questions. (And questions to answers!) So in addition to the statistical arguments against this paper made elsewhere, to me it doesn't reveal much new information.

In many (most?) countries you can defend yourself, waive your court appointed attorney. You are of course highly discouraged to do so. But sometimes people do it, mostly for smaller claims where they don't want to rack up legal bills for things which might cost more than what is at stake.

But it makes me wonder: will clients be able to use these AI-attorney systems in court in the future? Where they basically either just parrot what the model is instructing them to do, or - I dunno - give the model permission to speak for them (while waiving liabilities).

I have no doubt that some complex AI system can perform better than a bottom-tier, overworked lawyer.

  • Pro se litigants are hyper vulnerable to LLM hallucinations.

    One wrong advice clump and, like a step onto the wrong path while hiking, all subsequent steps go in the wrong direction. And sycophancy tuning means marginal one-sided takes get presented as sure-fire things.

    I’m of the opinion that the big wins aren’t in using the LLMs to do the work (legal, in this case), but rather to refine and improve the dialog and presentation from all parties. A court-centric LLM could give likely procedural needs to a litigant, and a law-firm-centric LLM could help a pro se litigant create a meaningful and refined set of questions for lawyer consideration, condensed and targeted, saving all parties time and confusion while meeting the client’s linguistic needs ‘where they are’.

    All the lawyers know things LLMs never will, the law is interpreted, and the written part isn’t engineering grade facts but suggestions interpreted in context. Arguably this is a racket and a thin veneer of plausible deniability for authoritarian rule. But as the law stands even with federal statutes and citations from the courts website, practicing lawyers will frequently end up explaining that in this county/country/court/jurisdiction The Way of Things is different.

    • I think it could work for some things. Years before LLMs became capable of doing anything substantial, people were selling "legal services" via websites where people could dispute trivial stuff like parking tickets, and what have you in the small courts.

      Those services were usually just based on NLP + simple decision trees, and people actually won their cases.

      Of course, doing huge corporate contract disputes, IP disputes, M&A, and whatever will probably be out of question for a good while. Same with more serious criminal cases where the stakes are very high.

      But I think there's potential for automating away less serious cases, especially where there's good structure.

      And of course, it all depends on what kind of legal system one is situated in. Immediately I'd think that Civil Law would be easier for AI lawyers, as its inherent structure is a better fit for machine reasoning. So I'd expect to see more AI products start in Civil Law countries.

    • > Arguably this is a racket and a thin veneer of plausible deniability for authoritarian rule.

      The fact that Lexis and WestLaw have such an iron grip on the entirety of the US legal system is exactly why general LLMs are completely unequipped to be useful in this domain.

One way to make legal services more affordable and accessible would be to put the burden of ensuring the AI legal services are accurate on a private-public partnership with the government.

If a person using the service is given inaccurate legal advice and acts on that advice, the person can't be charged with a crime, can't be given any civil penalties, etc., as long as the law in question is non-obvious.

Obviously if by some exploit, some fundamentally obvious crime (murder, theft, obvious fraud, etc.) is said to be legal, that wouldn't apply, but of course the service should try to prevent those kinds of exploits anyway.

Could limit this to something like business regulations to begin with, or even specifically for small businesses, or contracts within some time limit and dollar amount that would otherwise be coverable by small claims court, etc.

In the hands of a domain expert, AI is useful. In the hands of the naive, it is a foot gun.

I killed my Arch installation and was stuck at the GRUB prompt. Unwilling to brush up my rusty knowledge of GRUB syntax, I asked Gemini for help. The commands Gemini suggested would have wiped my hd...

Once Gemini was told that I was using BTRFS, the suggestion from Gemini looked a bit more sane, but still looked incorrect to me.

It was only after I informed Gemini that I was using an NVMe drive with BTRFS that it finally produced a sane command.

I'm going to need some legal help for my startup. But I can't pay much. So I figured I will ask AI all relevant questions, as well as forms filled etc. Perhaps even create a patent-application for me.

THEN I find a human lawyer and give AI's answers to them and say "Can you find any errors in this? Can you improve it?"

That way I think my legal bills should be smaller because the AI has already done most of the work. What do you think? Which LLM is best for legal work?

  • I think that within a few years, most lawyers will expect that clients will have run contracts through an LLM prior to sending them to outside counsel. Emails will be along the lines of:

    Please see attached contract we received from [counterparty]. ChatGPT says blah, blah and blah should be revised. What do you think? Is there anything else that we should change?

    • Right. That will reduce workload for the lawyers. But will their fees then go down? I'm kinda worried that if I don't give them the LLM produced legal docs for review they will just use the LLM themselves and then charge me for the work the LLM did :-)

      It's a bit like with doctors, you'll want a second opinion, if you can afford it.

    • > most lawyers will expect that clients will have run contracts through an LLM prior

      and if clients won't do LLM as their last step, lawyers will do LLM as their first step ... probably both

  • i use codex to do initial research and draft texts (in typst). i use files-output skill so that all research contexts are rendered into md files.

    i do second phase on codex, by asking to download all pdfs and extract all text of laws it references. can repeat fully local research step.

    after i ask gemini to find issues and criticize.

    UPDATE: there are many legal skills on github to try, i have not used any yet

  • You are probably going to pay about the same.

    While you might find a lawyer who bills fewer hours, think about what would happen if you got a second legal opinion from another human lawyer. The second lawyer would still need to do essentially the same work. In fact, their ethics would likely require them to independently review the facts, documents, and legal issues before giving you advice.

    So even if AI gives you a draft, the lawyer is not simply checking grammar or spotting obvious mistakes. They still have to verify the analysis, look for missing issues, and decide whether the work is legally sound. Could even be more expensive

I beat lawyers twice before generative AI even existed. Recently I asked Gemini a few questions about personal conflicts in everyday life. It's often too conservative, with views too shallow for the problem. So I still handle human conflicts myself. I only outsource the templated stuff like routine chat replies or marketing copy, though it saves me a huge amount of time. People who quote AI in serious conflicts are too weak to handle them on their own.

16 is such a small number for what they phrase as an important finding. It really couldn't be much harder to coordinate with 100+ professors.

> rated AI responses significantly higher than answers written by other professors, with AI winning 75% of head-to-head matchups.

That's the problem, you never know when the 25% deliver a true stink bomb, and that's not considering prompting - while a fair prompt/question may be considered objective, it's very easy to stray.
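For anyone who wants to put rough numbers on the "16 raters vs. 75%" worry, here is a hedged back-of-the-envelope sketch. It assumes the "nearly 3,000" comparisons from the press release can be rounded to 3,000, computes a Wilson score interval as if every comparison were independent, and then shrinks the sample with a Kish design effect for 16 equal-sized rater clusters. The intra-rater correlation of 0.05 is a made-up illustrative value, not something taken from the paper:

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def effective_n(n: int, clusters: int, icc: float) -> float:
    """Kish design-effect adjustment, assuming equal-sized clusters."""
    m = n / clusters
    return n / (1 + (m - 1) * icc)


N = 3000         # assumption: "nearly 3,000" comparisons, rounded
RATERS = 16
WIN_RATE = 0.75

# Naive interval: treats all comparisons as independent draws.
naive = wilson_interval(round(WIN_RATE * N), N)

# Clustered interval: icc=0.05 is purely illustrative.
n_eff = effective_n(N, RATERS, icc=0.05)
adjusted = wilson_interval(round(WIN_RATE * n_eff), round(n_eff))

print(f"naive:    {naive[0]:.3f} to {naive[1]:.3f}")
print(f"adjusted: {adjusted[0]:.3f} to {adjusted[1]:.3f} (n_eff ~ {n_eff:.0f})")
```

Under these assumptions the naive interval is tight (roughly 0.73 to 0.77) and the clustered one is noticeably wider. How much wider depends entirely on the correlation you plug in, which is exactly the number a reader would want from the paper.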

It is important for society to understand it is not merely programmers and customer support who are at risk of losing their jobs. Clearly A.I can do much more than just program.

There is quite a simple solution for many of the problems described in the comments: Make drafting legal papers a defined interface.

If you think about it and extract semantics of any law you get something that looks familiar, sort of like code. Of course there's some complexities where certain phrases can mean different things, but legal papers in a way are written like they're programming languages already especially when it comes to law.

First we would have to define a language that can handle ambiguous operations and we already have this with programmatic proofs where n should land in x. So in the end I'd assume it would look something like this in a two party dispute:

This is very simplified and pseudo like language, writing out a full contract would be as long as a real contract.

     DEFINE DEFENDANT "A Corp"
     DEFINE PLAINTIFF "B Corp"
     DEFINE CONTRACT  CONTRACT(PLAINTIFF, DEFENDANT, 3054-41-95)

     // attaching extracted requirements, definitions and obligations of contract

     FACT   PLAINTIFF delivered(goods) ON 7054-34-99
     FACT   DEFENDANT paid(0) OF CONTRACT.amount

     CLAIM  breach WHEN obligation(DEFENDANT, "pay") IS NOT satisfied

     PROVE breach:
         REQUIRE  PLAINTIFF performed
         REQUIRE  DEFENDANT.paid < CONTRACT.amount
         ASSERT   delay WITHIN reasonable(time)

     IF PROVE(breach):
         AWARD PLAINTIFF (CONTRACT.amount - DEFENDANT.paid) + interest()
     ELSE:
         DISMISS

Then you would run a proof based LLM to generate it into target language and since we already had an example of this from one of the AI labs we know it works. Automatic citations and supporting proof would be automatically populated from reviewed legal -> DSL extracted papers as supporting evidence.

I am sure that many AI labs are working on something similar already and we will see something like that in the near future as proof based llms evolve.
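To make the pseudo-DSL above a bit more concrete, here is a rough Python sketch of the same breach check. Everything in it is an assumption for illustration: the `Contract` and `Facts` types, reading `ASSERT delay WITHIN reasonable(time)` as a plain boolean, passing interest in as a number, and the 1000.0 contract amount. It says nothing about how a real proof-based system would encode this:

```python
from dataclasses import dataclass


@dataclass
class Contract:
    plaintiff: str
    defendant: str
    amount: float


@dataclass
class Facts:
    plaintiff_performed: bool   # REQUIRE PLAINTIFF performed
    defendant_paid: float       # compared against CONTRACT.amount
    delay_reasonable: bool      # one reading of ASSERT delay WITHIN reasonable(time)


def prove_breach(contract: Contract, facts: Facts) -> bool:
    """Mirror of the PROVE block: all three conditions must hold."""
    return (
        facts.plaintiff_performed
        and facts.defendant_paid < contract.amount
        and facts.delay_reasonable
    )


def judgment(contract: Contract, facts: Facts, interest: float = 0.0) -> tuple[str, float]:
    """Mirror of the IF/ELSE block: award the shortfall plus interest, or dismiss."""
    if prove_breach(contract, facts):
        return ("AWARD", contract.amount - facts.defendant_paid + interest)
    return ("DISMISS", 0.0)


# Hypothetical amounts; the party names follow the DEFINE lines above.
contract = Contract(plaintiff="B Corp", defendant="A Corp", amount=1000.0)
unpaid = judgment(contract, Facts(True, 0.0, True))
paid_in_full = judgment(contract, Facts(True, 1000.0, True))
print(unpaid, paid_in_full)
```

The hard part the sketch skips is the one the comment already flags: deciding whether `delay_reasonable` is true is an interpretive judgment, not a lookup.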

Does the "outperforming" conclusion incorporate the appropriateness of decisions? Or just if things are technically correct. Without human eyes on cases, things could easily get very off track. AI can do a lot of data wrangling, but there is no conscience.

Yes, LLMs are great at search. That's not news.

  • Isn't "getting greater" the more accurate representation, though?

    In 'critical' industries, the error rate is massively important, and if the quality of search is reaching an acceptable error rate, that's quite big news.

When I see news pieces like this I wonder about the failures. Maybe the failure percentage is low but what happens if a bot gives bad counseling? Who is responsible then?

Attorneys will be using LLMs for convenience but they will not disappear, because there ultimately needs to be a human responsible for the decisions.

As others pointed out, it kind of implies it surpasses professors, but reading more carefully it seems more like the mythos situation. There was a single professor or test that it surpasses.

Reading it makes me extremely suspicious on how cherry picked this was

I'm not a law lecturer. I spend most of my time wrangling contracts and advising about data law. But I did a stint of part-time work teaching a masters in law.

My experience then (this was back before "Attention Is All You Need", I hadn't met the output of generative models) was that students tended to produce work that did not have a proper thread of reasoning in it. There was a tendency to repeat things they had read but rehashed in various ways.

Reviewing some of their texts it was clear that much of the writing - by law tutors - was of the same kind. Much was incorrect. The fact that someone at some time had said a particular case was a proposition for something, meant that got repeated from book to book. Many authors simply didn't read their sources or check their references. Students repeated what they had been told incuriously.

Note: this was a graduate level course. Not wet behind the ears undergraduates.

The worst material was little potted notes produced for law students. Utterly awful material in most cases.

Anyway, when LLMs became a thing, a lot of what did not feel right about their output, and many of their error patterns, reminded me of the experience of teaching masters' students.

One of the saving graces of English court room practice (when I did that sort of thing) was that judges would say to you "where does it say that?" in a case you cited. You had better have them all at your fingertips and know exactly where you had cited. That avoided a lot of hallucination.

Just a random remark which might be of interest.

Yeah this could be interesting. A lot of the spotlight has been on “law firm stuff” like demand letters and writing contracts…

But imagine if a dev team didn’t have to go engineer -> product manager -> legal team to get a question answered on local data retention requirements. You could ship that much faster.

  • Would you take responsibility for missing details about local data retention requirements?

    • Yes.

      If the only purpose of asking a lawyer is transferring risk (aka cover your ass) while getting the same advice as an LLM, that’s slowing down delivery for purely bureaucratic reasons.

      I’ve seen that mentality at big companies where everyone is scared to stick their neck out and be accountable for a decision. And nothing gets done. Drives me crazy.

      But the people who move up are the people who take ownership and get shit done (and are right a lot).

      (BTW, I have been at companies that were sued by regulators. They never really punish the individual(s) who were in the room when the decision is made. So your worry is kind of misplaced.)

Tangential, is there a "test suite/CI" for AI writing legal documents? Long back in terms of AI progress, a lawyer filed something with hallucinated sources. Do new tools prevent this?
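As a very rough idea of what a CI-style check for that could look like: the sketch below pulls U.S. Reports style citations out of a draft with a regex and flags any that are not in a verified set. The `VERIFIED` set, the regex (which only handles the "volume U.S. page" form), and the "Smith v. Jones" citation are all illustrative assumptions; a real tool would need a proper citation parser and an authoritative database, and a citation that exists still may not say what the draft claims:

```python
import re

# Matches only the simple "<volume> U.S. <page>" form, e.g. "410 U.S. 113".
US_REPORTS = re.compile(r"\b\d{1,3} U\.S\. \d{1,4}\b")

# Stand-in for a real, authoritative citation database (assumption).
VERIFIED = {
    "5 U.S. 137",    # Marbury v. Madison
    "410 U.S. 113",  # Roe v. Wade
}


def unverified_citations(draft: str, verified: set[str]) -> list[str]:
    """Return citations found in the draft that are absent from the verified set."""
    return [c for c in US_REPORTS.findall(draft) if c not in verified]


draft = (
    "See Marbury v. Madison, 5 U.S. 137 (1803), and "
    "Smith v. Jones, 999 U.S. 999 (2031)."  # made-up citation for the example
)

unverified = unverified_citations(draft, VERIFIED)
print(unverified)
```

A build step could fail whenever that list is non-empty, which is roughly the automated version of a judge asking "where does it say that?".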

I'd read this less as "AI replaces law professors" and more as "AI may be a surprisingly strong first-pass tutor, especially when the student knows enough to question it"

The interesting shift isn't whether AI beats law professors on tests — it's what happens to the value chain after that threshold is crossed.

When AI clears the knowledge bar in a domain, the remaining moat becomes trust, accountability, and local regulatory context. That's actually good news for niche SaaS builders targeting specific jurisdictions: the generic AI layer commoditizes, but the "AI + local compliance + human accountability" bundle still has real pricing power.

Curious whether anyone has seen this play out already in contract review or compliance tooling outside the US.

I'm not a lawyer, I program.

My understanding is that Civil Law (most of the world excluding UK, US, AU) is like a program: you feed it a situation, it outputs a decision, every once in a while you edit it.

Common Law (UK, US) isn't really a program, but you could stretch and say it's a state machine that has been running since the country started. Every interaction sets a new precedent and changes the state. But the programming analogy falls apart because no one in their right mind would design such a program.

LLMs might actually be the best example of such a program though: Common Law is basically one long chat with an LLM, hundreds of years long.

Before LLMs came along, a Common Law system seemed to have a finite time limit before it's co-opted by wealthy people with the resources to read the whole history. Now I think maybe we can push it a bit further.

But it's still a terrible program.

> In a blind evaluation of nearly 3,000 anonymized comparisons, professors rated AI responses significantly higher than answers written by other professors, with AI winning 75% of head-to-head matchups.

75% win rate seems pretty good!

Paper link: https://law.stanford.edu/wp-content/uploads/2026/06/salinas_...

  • I wonder to what degree the AI was just better at communicating. My experience with attorneys is that they are often some of the worst writers.

    • The writing is always fluid and grammatically flawless. This carries much more weight with us than we believe. I know the illusion well from decades of grading college papers. Many of the highest quality students use English as a second language, and I know this, but an American well trained in writing, grammar, spelling always gives an impression of superiority. (Being well trained in writing, grammar, spelling etc is of course high merit, which is how the illusion forms - it is basically an illusion of global 'intelligence')

  • I do wish they'd used some more objective criteria. Simply being preferable is one of the things LLMs have trained for since the beginning, hence their sycophantic nature.

    • Maybe sycophantic nature is a good fit for the legal system. A successful lawyer once told me that the most important thing is to know your judge. Objectivity isn't a big thing in court. They'll cite random newspaper articles as evidence and throw out expert opinions - if they like. There might be a way to appeal - but that road often is not functional.

Oh, a "Human-Centered" study by an AI lover:

Julian Nyarko

    Professor of Law
    Co-Chair Stanford Law AI Initiative
    Senior Fellow, Stanford Institute for Human-Centered AI (HAI)

LOL!

* Gemini 2.5 Pro (no outside resources), and
* NotebookLM (not versioned -- with added legal resources).

NotebookLM was considered slightly better than 2.5 Pro by the evaluators.

I think there will be a market for firms that aggressively market themselves as non-AI, and then as more people turn towards that human connection we'll go full circle

  • Nobody wants to pay their lawyers more than they have to. There will be a huge market for firms that can use AI to avoid charging clients for $1,000/hour junior associates.

  • If you want human connection the legal system is not where you are going to find it, period.

    I don't think there will be any such market for "non ai" law. If I'm involved with the legal system I just want out as quick as possible as cheap as possible.

    • Bad legal advice will keep you dealing with the legal system for much longer and at much greater cost. Something being cheap and quick upfront doesn't mean it will be cheap and quick by the end of the process.

Curious how they do a “blind” preference test. To any evaluator I’m sure it’s quite clear which answer is AI vs human.

This is exactly what LLMs are designed to do. Double up a lot of data and find connections and patterns in it.

So no wonder on this point.

One thing I want to mention: Law != Justice.

So while LLMs are awesome at the law study they will suck at justice. Just because one has to solve very emotional problems with it at times. And LLMs are not that good at finding the correct emotion.

  • Also because their reasoning is just a statistical model of whatever they've been fed. No experience of pain, humility, human connection, etc in this.

Question is: if a legal question is answered incorrectly by an LLM, who is going to be held responsible?

The title of the study "Law Professors Prefer AI Over Peer Answers" is VERY different from the title on HackerNews. This is completely clickbait at this point.

And this was done with Gemini 2.5

By the time any research study on AI is published, the models are already 0.5-1 generation ahead. Even this bullish outcome for AI models and their ability to perform useful work does not reflect how good they are now.

Incredible that the common people will be able to wrest the right to rule of law away from the bloated legal caste, who have built themselves quite the moat.

The inaccessibility of justice is a huge driver of inequality. Any tools which bridge this gap will help make a more just society.

  • The profession is walking into a courtroom 90 minutes late because you know the judge's work pattern, then going "hey Mike, how are the kids" after 22 years in the same jurisdiction. Then the old boys haggle based on how much the lawyer is charging. You are basically paying for access to the social club. Better outcomes when part of the in-group, of course.

    • Would like to plot attitudes to AI against parental incomes or inheritance. If your value derives from having contacts and access to gatekept materials, rather than pure technical expertise, you've got a lot to lose as the walls come crumbling down.

      There was another thread about the impact of AI on maths, and one of the arguments was about peer review... Made me wonder whether the writer was more concerned about the established order and gates being upset, or whether there's actually a valid technical criticism.

Personally I think this is very good. One of the hardest things out there is maintaining a society in the face of changing times and it's because law is dense and slow.

I think, in the right hands, this could be huge.

After a quick look at the study details and statistics, it does not look very definitive one way or the other.

I mean, LLMs do OK with tutoring, but that depends more on how unique the questions are than on how difficult they are.

Honestly, it's not surprising that AI-provided answers were flagged less often as "pedagogically harmful" if we take into account that LLMs somehow create an "average" of all the knowledge they ingested.

While they provided the questions that professors and LLMs were asked to respond to, they don't include any of the answers from either the humans or the LLMs, so there's no way to independently verify that the LLMs actually returned "better" answers.

Given the number of responses the professors were asked to rate (200 each), they probably graded them the same way that bar exam responses are graded: quickly and superficially. Not surprising that LLMs achieved higher scores in this scenario, since they excel at producing superficially nice answers that don't hold up under scrutiny.

Also...unless statistics has changed in the past 2 decades, the math in the charts doesn't math. That's probably why they're leaving out the actual numerical data. I also wouldn't be surprised if we learn in the coming days that the charts were AI generated.

America has the jury system, which means you have to be a good actor.

Making people believe that the 14 year old girl is a slut that was raping your poor client- THAT is lawyering.

What is the point of this conclusion? That law professors like the tone and verbosity of AI slop? Okay?

  • I had a similar thought. What if the result, statistical and significance critiques aside, mostly means that when it comes to first-year tutoring of law students, the vibe, tone and overall presentation of arguments weigh a lot, maybe even more than the factual arguments themselves?

    In such a framing I don't find it surprising at all that teachers prefer the more polished answers generated by AI, because if LLMs are good at one thing, it is being confident in whatever they generate and presenting it convincingly.

Law and accounting both seem to be the perfect fields to replace with AI.

Just massive data where you either do calculations or interpretation.

You will replace 100 lawyers with AI and have a single lawyer to review what the AI outputs and stamp their name on it for accountability.

This contradicts my anecdata.

Recently, I tasked Opus 4.6 with studying a new Czech building permit law in conjunction with some waste disposal regulations, and the result was disappointing. The model could not stop drawing conclusions from obsolete regulations in its training dataset, even when given the full text of the new law. The usual "you are totally right" also applied, and its conclusions were most of the time obviously wrong even to a human with cursory knowledge of the subject.

I ended up studying the relevant regulations myself over the weekend.

I skimmed portions of the study but didn't manage to figure out whether this actually measures a preference for confident mediocrity.

Uh, oh ... AI is in for it now. It has drawn the ire of lawyers. ;-D

Library outperforms student... more news at 9

  • This was an open book test. The real problem with this study is that winning the most head-to-head preference tests is not the right metric. It doesn't much matter if two answers are right, and one is written a little better than the other. It matters quite a lot if one answer is right and another is wrong.

    The authors point out that this other metric was computed in prior work and incorrectly dismiss it as being not as good as winning percentage in head-to-head competitions. The cited prior work shows that the models fare poorly on that metric. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5166938

  • Except the library outperformed the professors, which is quite a bit more impressive.
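
To make the preference-versus-correctness point from the reply above concrete, here is a minimal Python sketch. Everything in it is hypothetical: the `Answer` fields, the toy `prefer` rater, and the four made-up answer pairs are illustrations only, not data or methods from the study. The rater is assumed to always pick the correct answer when exactly one is wrong, and otherwise the more polished one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    author: str    # hypothetical label: "model" or "professor"
    correct: bool  # whether the answer is legally right
    polish: int    # toy stand-in for tone and presentation


def prefer(a: Answer, b: Answer) -> Answer:
    """Toy rater: correctness decides first, polish breaks ties."""
    if a.correct != b.correct:
        return a if a.correct else b
    return a if a.polish >= b.polish else b


def win_rate(author: str, pairs) -> float:
    """Share of head-to-head comparisons won by `author`."""
    wins = sum(1 for a, b in pairs if prefer(a, b).author == author)
    return wins / len(pairs)


def error_rate(author: str, pairs) -> float:
    """Share of `author`'s own answers that are wrong."""
    own = [x for pair in pairs for x in pair if x.author == author]
    return sum(1 for x in own if not x.correct) / len(own)


# Made-up pairs: the model is always more polished but wrong once.
pairs = [
    (Answer("model", True, 3), Answer("professor", True, 2)),
    (Answer("model", True, 3), Answer("professor", True, 2)),
    (Answer("model", False, 3), Answer("professor", True, 2)),
    (Answer("model", True, 3), Answer("professor", True, 2)),
]

model_wins = win_rate("model", pairs)      # 0.75
model_err = error_rate("model", pairs)     # 0.25
prof_err = error_rate("professor", pairs)  # 0.0
print(model_wins, model_err, prof_err)
```

Under these assumptions the model wins 75% of comparisons even with a rater who catches every error, while being wrong in a quarter of its answers against zero for the professor. The two metrics rank the authors in opposite order, which is why the choice of metric matters.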

He is basically an AI professor for law. This study just confirms his existence:

https://juliannyarko.com/

Stanford and its donors of course want to replace anyone but its administrators, so they cheer on such anti-intellectual nonsense.

  • This is the state of HN. Created new account. Accused without evidence. Emotional clickbait.

    • I vibe coded hn10k earlier this year. You could choose to see pages with comments only started by 1k+, 10k+ or 100k+ karma contributors. I'm too lazy to keep it up, but I found 1k and 10k both to be better experiences than "vanilla".

[flagged]

  • Just so you know, I have nothing to do with Stanford, but I am flagging this as conspiratorial nonsense. So when your comment is flagged, I just want you to know that it doesn't confirm your belief; it's just that this comment harms discussion and so must be removed.

[flagged]

  • A law professor studying AI has an affiliation with the center at their university that studies applications of AI? Scandalous!

  • You're suspicious that the person doing academic research on how AI applies to law has a job related to research on law and AI?

    • You are not? It is at least worth investigating how much this professor benefits from AI companies. In fact this is HN. Let me come back to you in about 10 minutes.

      EDIT: 10 min later. I give up. I tried to find who is funding HAI and came up empty-handed. Usually you can see that in their yearly reports, but no such luck for me. I know Google and Bill Gates are big donors, so take that as you will.

Marc Andreessen argued that we've already reached AGI. He says that the top AI models give better answers than 99% of people he has access to, and he has access to some of the best people in their field.

I'm getting more convinced. I mean, sure, it makes dumb mistakes sometimes, but it's a particular set of self-serving mistakes, like commenting out tests in order to pass. We obviously don't want this behavior, but I wouldn't say it's dumb.

It'll be like the Turing test, which we just blew past years ago and no one cared. After all the hand-wringing about sentience and the rights of an AI that passes the Turing test, now we just have AI bots running 24/7 writing slop.

How does everyone else feel?

  • > Marc Andreessen argued that we've already reached AGI. He says that the top AI models give better answers than 99% of people he has access to, and he has access to some of the best people in their field.

    He stands to make billions if enough people believe him; unless you do too, consider that you’re the mark. For example, if that were true, it would have to mean that AI companies either aren’t letting customers use the good models or are instructing them to frequently make errors which reveal a fundamental lack of reasoning ability.

    Consider also that his wealth means he hasn’t had to defend an idea stringently since the 90s. I wouldn’t be surprised if he does think LLMs give deep answers because it often looks that way until you critically review the response and ask questions like what’s missing which require you to have a decent understanding of the problem domain.

    • And you stand to lose your job and your identity as a programmer.

      He makes billions, but he already is a billionaire. Gaining billions more doesn't mean shit. The guy really has nothing to lose, and the utility of what he gains contributes little to his lifestyle.

      I will tell you this. HN has been comically wrong about everything related to AI. They said driverless cars have no chance of becoming usable. Now Tesla FSD is almost there and I sleep in Waymo cars. HN said AI will never code; now everyone uses it to code.

      It's fucking stupid. This is one of the smartest forums on the internet but HN becomes next to stupid when predicting AI. Why? Because humans can't face the truth. When the victim of attack is yourself, it doesn't matter how smart you are... you have to scaffold a rationalization to spare yourself as the victim. You have to lie to yourself and tell yourself that you matter.

      The truth of it is, while LLMs are not the end game, AI in general is on a trajectory to take over. It shows us how meaningless our skills are... not only as programmers but as artists. That beautiful song you felt had greater meaning? It's all reproducible via an algorithm because it never really had a greater meaning. It was just a pattern.


  • Marc Andreessen has a strong financial incentive to feel this way and to convince others to feel this way.

    I also think it’s easy to think that AI gives good answers if you don’t know the field well. In fields where I know the material, the answers are pretty variable and can be quite bad.

    • HNers have strong incentive to feel the opposite. Humanity in general has strong incentive to feel the opposite.

      AI is not only replacing programmers, but art and the meaning of being human itself. It's showing us how trivial all of human creation is as it's just patterns from an algorithm.

  • >Marc Andreessen argued that we've already reached AGI. He says that the top AI models give better answers than 99% of people he has access to, and he has access to some of the best people in their field.

    He has access to employees and yes-men. What he actually needs to hear, nobody will tell him, AI even less so. Every shit idea he has would be "what a bright idea"-ed by both everyone around him and AI.

    And of course there's the little matter that he makes money and increases his power by selling AI. What seller doesn't promote their stuff as the greatest ever?

  • Knowing the question is half of the answer. LLMs are great at scoping your context and answering precisely what you asked; it's also why they go off the rails when they misunderstand a part of your question. Incidentally, they're great at "knowing" and reaching for knowledge.

    Humans have the advantage of perspective. We always lack some knowledge and answer broadly. This is bad if you have a particular goal in mind, but better if you're just generally learning, because you see more and learn to discriminate the correct from the wrong. And most importantly, being wrong is part of human ingenuity - because sometimes we turn something "obviously" wrong into something right.

  • Getting the right answers is just half of it, you need to know the right questions to ask. I haven't yet seen AI crack that one.

  • > Marc Andreessen argued that we've already reached AGI. He says that the top AI models give better answers than 99% of people he has access to, and he has access to some of the best people in their field.

    Investor with vested interest in AI companies makes claim of reaching "AGI".

    He is one of the last people to listen to about AGI. Unless the term "AGI" means something entirely different to him vs to independent researchers vs to CEOs, since the term has become entirely meaningless.

  • [flagged]

    • I’m not an AI stan by any means and certainly no fan of Andreessen, but using the term “clanker” immediately biases your statement and can discredit what is a well-referenced or well-meaning comment.