Comment by stult
2 days ago
IME so far (as both a lawyer and a software engineer), LLM error rates when drafting code and legal documents are reasonably comparable, but it's more problematic in the legal context because legal documents do not benefit from many of the structural safeguards available for code. For legal documents, there are no automated tests, no static typing, no test environments, no logging/observability instrumentation, no sandboxing.
The time lag between drafting and "deployment" also makes for much less effective, much more expensive debugging loops. You can deploy your code to prod in seconds, see an error pop up in the logs, and immediately start debugging. But it will take at a minimum days and frequently as long as several years before an error in a contract or a court filing will be detected, and often the error is beyond correction at that point. Thus, the errors are both more difficult to detect and to resolve.
And the consequences of error are often much greater, both because they are not correctable and because a legal error may risk someone's life, liberty, or substantial property. That's not categorically the case, of course: bugs in certain safety-critical systems can be as bad as or even worse than legal mistakes. But in general, most software is lower stakes than most legal writing.
On the flip side, LLMs do seem to do a better job with basic style and structure for legal documents compared to code. Things like following IRAC format, citing assertions of law (although hallucination remains an issue), and writing comprehensible sentences. These would be the equivalents in code to best practices like good comments, cohesion, consistent use of design patterns, test coverage, clear variable names, DRY, etc. Although the better performance on those more qualitative metrics may just be because even the longest legal documents are typically simpler in structure and have fewer lines of text than a large, complex codebase. Or maybe it's because LLMs are trained on natural language text more than on code. Or because natural language is more forgiving than code, in that minor variation in diction or grammar is unlikely to have any significant effect on how the document is interpreted, whereas even single character errors in code can have enormous effects.
There is also one thing I would like to add, and you can correct me if you disagree: coding benefits much more from thorough planning. Now, I exclusively work by first writing a plan that has well-defined steps and goals, which can of course change over time.
It seems to me like it would be more difficult to achieve with legal documents and, in my experience at least, writing a concrete plan has been the decisive factor that makes my AI coding robust (plus all that you mentioned).
I'm not sure about that, I actually think planning may be just as important in both domains. Outlining before drafting is an almost universal best practice in legal writing that is drilled into law students to the point that outlining as exam prep is something students spend several weeks on each semester. So personally I always have a fairly detailed implementation plan in the form of an outline before I ask an LLM to draft a more detailed legal document.
I've also adopted an AI coding workflow that involves a lot of planning, although I actually write very little of the plan myself anymore. I have a chain of slash commands like this: create-issue -> plan-issue -> build-plan -> pr-into-dev. I write a relatively brief description of what I want accomplished to create the issue, and then the agent fleshes out my description with more detailed requirements and acceptance criteria. I review the issue description, and the LLM often identifies open questions I failed to consider, so I revise as necessary and then the agent posts the description to the GH issue. I have planning separated because I often create issues quickly when something occurs to me and then circle back at a later date to implement, and want the agent to create the concrete implementation plan with an up-to-date snapshot of the code in context. Then I review that again, adjusting as necessary, and then the agent posts the result as a comment on the original issue.
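The chain described above can be sketched as a small pipeline with review gates. This is only a rough illustration: the stage names come from the comment, while `WorkItem`, `run_chain`, and the `agent` and `approve` callables are hypothetical stand-ins, not a real slash-command API. The choice to gate only the first two stages reflects the two reviews the comment actually describes.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Stage names are taken from the comment; everything else is hypothetical.
STAGES = ["create-issue", "plan-issue", "build-plan", "pr-into-dev"]
# The comment describes a human review of the issue and of the plan.
REVIEWED_STAGES = {"create-issue", "plan-issue"}


@dataclass
class WorkItem:
    description: str
    history: List[str] = field(default_factory=list)


def run_chain(
    item: WorkItem,
    agent: Callable[[str, WorkItem], str],
    approve: Callable[[str, str], bool],
) -> WorkItem:
    """Run the stages in order, stopping at the first rejected review."""
    for stage in STAGES:
        output = agent(stage, item)  # the agent drafts this stage's artifact
        if stage in REVIEWED_STAGES and not approve(stage, output):
            item.history.append(f"{stage}: rejected")
            break  # revise the input and rerun instead of building on a bad plan
        item.history.append(f"{stage}: done")
    return item
```

Stopping at the first rejected review is the point of the sketch: a gap in the description gets caught at the issue or plan stage rather than surfacing later in the implementation.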
Like you, I've found this detailed planning makes for a very robust coding agent (again, also in combination with the aforementioned best practices, especially requiring 100% test coverage because forcing it to exercise every line of code avoids hallucinated dummy tests that assert on nothing). Interestingly, in comparison to legal writing, I also rely on the agent to decompose complex tasks into separate issues or subissues as appropriate, which is something that is never necessary for legal analysis because pretty much every legal analysis can be one-shotted.
For legal writing, my workflow is nowhere near as structured as that. For context, I have only ever used LLMs for drafting what are effectively emails to clients or memoranda of law for clients that are a step up in complexity and formality from an email. So not something that will be filed with a court necessarily but very much in the same format and style as a formal motion that would be submitted to a court on behalf of a client. And never a contract, will, or judicial opinion, nor a communication with a counterparty like a demand letter or C&D. So YMMV for other types of legal writing.
That said, I typically start drafting a memo by conversing casually with an agent to explore the general boundaries of an issue I am evaluating, by identifying relevant sources of law, potentially related issues, and the analytical process I need to follow (i.e., what issues to evaluate and what order to evaluate them in, more or less the analytical "algorithm"). Once I have a good sense of that algorithm, I put together a high level outline and then ask the agent to draft a detailed memo around that outline. Or at least that's what I used to do before the last few months, since when the models have matured to the point where I increasingly just ask the agent to write the outline based on the conversation we had, then review that, then ask it to write the memo based on the outline.
As I have been writing this, it occurs to me that actually I am following almost the exact same process for writing code and for writing legal memos, and should probably distill the legal writing process into a similarly well-structured set of chained skills/slash commands. In both domains, I describe an issue at a high level, get the LLM to fill in some of the broad outline level details, review that, then get the LLM to implement the complete final product. (Also perhaps worth noting while I do occasionally conduct general high level research by talking to a frontier lab LLM, I have always used locally hosted OS/OW models for drafting memos where I need to provide concrete, specific factual information about clients to the LLM, to avoid attorney-client privilege issues, so the quality has lagged behind the frontier models, which is part of why I haven't developed this workflow into as structured of an approach as I have for coding).
In both coding and legal contexts, I think that this planning or outlining step is critical not (or not just) because it forces the agent to create a higher quality product, but because it forces me to review what I am asking the agent to do at a sufficiently detailed level that I can catch errors before they crop up in the implementation. A lot of the time, the errors that occur if I skip this step aren't because the LLM has made any clear mistake, but because I failed to specify some aspect of the task and the LLM is forced to guess at what I really intended, which is where agents often struggle.
So I guess I would tentatively suggest that legal writing does in fact benefit from thorough planning, though it is hard for me to quantify whether those benefits are greater or less than the comparable benefits for code.
This is a very good comment. But notice how even in software engineering there is still disagreement about these structural safeguards.
So yes, we can say the LLM created bad code when it does not compile or fails prewritten tests.
But experts might disagree about what good comments, good cohesion, appropriate use of design patterns, appropriate test coverage or clear variable names are.
So what are we supposed to train the LLMs towards? Somebody still has to decide what "good" is.
Hidden gem of a comment, thanks for writing
Well, this is largely the fault of law itself, especially English-style law. Without a parseable legal code, and with every single tiny municipality (some less than 1 square mile) having its own set of rules and laws, not all published or available, but which citizens are of course expected to abide by, how could we expect AI to do well, as opposed to some typical TV southern lawyer who knows the judge?
I could not agree more. A simple example: it boggles my mind how every state organizes their statutes in entirely dissimilar ways. I'm not sure there's a need for every state to have slightly different wording for a murder statute in the first place, but even assuming there is, why do they all have to be scattered around in different code sections instead of every state just following some consistent convention like always putting the murder statute at Title V, Section 1.4 (or whatever the case may be, that's just a random invented example).
For murder that's not such a huge deal because the statutes are typically easy to track down and don't really differ all that much substantively, but once you get really into the weeds on something like commercial contracts it can be a huge pain to do cross-jurisdictional research.
And that's just a tiny, super obvious example of how impenetrable statutory law is, which isn't even the really pernicious problem. Case law is infinitely worse. It makes me absolutely furious how difficult legal research still is. The Westlaw/LexisNexis duopoly is a moral crime and wildly destructive to the quality of government in this country. Every single written court opinion should be publicly available for free on the internet in an easily searched format. It would cost practically nothing to achieve. We're talking about less text than Wikipedia hosts. Yet still many states make it almost impossible to access case law. Even though these cases are law. Binding law that we are supposed to follow, yet we cannot even easily access. It's insane, and largely perpetuated by the complacency of lawyers who can charge others for what should be free, the lobbying of the duopoly, and the incompetence of politicians.
If all of the laws were consistently available and stored in reasonable, consistent citation formats (I would settle for hyperlinking as a replacement for the rat's nest of wildly varying jurisdiction-specific citation systems), it would even be possible to introduce a form of unit testing for legal drafting that would allow us to automatically verify if the LLM hallucinated a citation.
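A rough sketch of what such a "unit test" for citations might look like, under heavy assumptions: the regex covers only a few "volume reporter page" formats, and the `known` set stands in for the freely available case law database that, as noted above, does not exist today. The function names are hypothetical, and the second citation in the sample draft is deliberately made up so the check has something to flag.

```python
import re
from typing import List, Set

# Simplified pattern for "volume reporter page" citations, e.g. "347 U.S. 483".
# Real citation formats vary far more; this reporter list is only illustrative.
CITATION_RE = re.compile(r"\b(\d{1,4})\s+(U\.S\.|F\.2d|F\.3d|S\. Ct\.)\s+(\d{1,5})\b")


def extract_citations(text: str) -> List[str]:
    """Pull out every citation the simplified pattern recognizes."""
    return [f"{vol} {rep} {page}" for vol, rep, page in CITATION_RE.findall(text)]


def unverified_citations(text: str, known: Set[str]) -> List[str]:
    """Return citations in the draft that are absent from the known set."""
    return [c for c in extract_citations(text) if c not in known]


# Hypothetical stand-in for a lookup against a public case law database.
known = {"347 U.S. 483", "5 U.S. 137"}

# The second citation is invented on purpose for this example.
draft = (
    "See Brown v. Board of Education, 347 U.S. 483 (1954); "
    "but see Smith v. Jones, 999 U.S. 999 (2031)."
)

print(unverified_citations(draft, known))
```

Passing this kind of check would only show that a cited case exists, not that it supports the proposition it is cited for, so it would catch outright hallucinated citations but not mischaracterized ones.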
It also doesn't help that we (for what were at the time very good reasons) moved away from the system of legal writs that used to provide fairly standardized, almost "cut and paste" templates for legal filings. So now every legal document (filings, memos, contracts, court opinions, statutes) is drafted like a bespoke, artisanal creation with few strict structural or stylistic conventions. That makes automated interpretation much harder than it needs to be.