Comment by t34t34r43

5 hours ago

Posting this under a burner so I don't dox myself: I work in FinTech on a regulated product. We have access to Mythos. Mythos identified part of our codebase that it confidently asserted was not compliant with a particular regulation, and claimed we were at grave risk by allowing it to operate the way it was.

Except this was not the case, it had of course hallucinated what the regulation actually required (I know this because the code in question had already been reviewed by human counsel). This is (supposedly) the most bleeding-edge model available.

We use a lot of genAI to help us write code, but there is no way in the mid-term we could ever rely on these tools to actually build compliant financial products. We'd have to be totally mad. Yes, lots of Fintech companies are using these agents to accelerate, but anyone who's using them to actually ship product without a human actually digging into it is opening themselves up to a world of risk.

136 comments


PeterStuer 3 hours ago

I have worked on highly regulated areas in finance (risk). Compliance is a highly creative art, often requiring lots of out-of-the-box thinking and non-obvious solutions. The people I found worst at this were IT. They tend to over-interpret regulation, and super-restrict beyond what is needed for actual de-facto compliance.

My guess is the model makes the same mistakes as the programmers: taking 'rules' literally, unaware of sectoral joint understanding, validated interpretations and habits. (btw. this is often on the non-tech side also a difference between regulatory and legal. The former are much more result oriented while the latter are primarily risk averse.)

davedx 37 minutes ago

Ha. I've worked in a fairly strongly regulated sector (energy, in the Netherlands), where I collaborated closely with our head of compliance, and she heavily over-interpreted the regulations while I often tried to find more pragmatic solutions.
I think adherence to regulation and compliance is nothing to do with whether you're a SWE, a risk officer, or C-level, and everything to do with your own principles, ethics, professional attitude, and pragmatism.
thewebguyd 2 hours ago
> IT. They tend to over-interpret regulation, and super-restrict beyond what is needed for actual de-facto compliance.
IME this is less the fault of IT and more so bad auditors that won't consider, or just don't understand, what compensating controls are. If it doesn't meet their little checklist exactly, they fail the audit.
- antonvs 1 hour ago
  
  > IT. They tend to over-interpret regulation, and super-restrict beyond what is needed for actual de-facto compliance.
  This is such a nonsensical claim. If a company is asking someone from IT to read the regulations and implement them, then obviously you’re going to get something that conforms to the written specification they were provided.
  But a company that does that is basically delegating both compliance and legal functions to IT. No sane company does that.
- hparadiz 2 hours ago
  
  It's cause IT never has to live with the consequences of their decisions. Who cares if the other department keeps bleeding talent because you twisted the knobs so hard no one wants to work in your system?
  
jayd16 3 hours ago
Who gets in trouble if it turns out you are actually held to the literal rule?
- PeterStuer 2 hours ago
  
  Contrary to what you indicate rules are not declared in a vacuum, for people to read and then algorithmically 'implement'. There are many ways to interpret regulation, and there will be both accompanying clarifications, as well as compliance departments negotiating with regulators on what is an acceptable and sufficient compliance action. Then there furthermore is a risk that will be calculated vs the cost and opportunity costs etc.
  As an enterprise architect, these are all part of the meetings you have with compliance when you are working on major projects. I have had the privilege of working with some excellent compliance officers, and they are the opposite of the nay-saying caricature that is often painted of them. I found these people to be extremely creative and helpful, working together towards solutions rather than stalling or nixing viable progress.
  
- scott_w 3 hours ago
  
  That's why you work with your Legal/Compliance Team to make sure you stay in line. They can explain when a rule applies and when it doesn't. This needs the engineering side to be able to explain what's happening, and translate it into the business process as closely as possible, and the legal side to be able to apply the law to the case.
- tsunamifury 3 hours ago
  
  If you think rules are literal then you aren’t aware of how the world works.
  There’s a reason it’s called “judgement”
  

trumpdong 4 hours ago

It was my impression that a whole lot of products are only pretending to be compliant, and that it's much more profitable to operate like that.

InsideOutSanta 3 hours ago

I've worked in fintech for 30 years. I've never seen a product that was intentionally "only pretending to be compliant" with laws.
I've seen accidental non-compliance. I've seen what I would call negligent compliance, where a company attempted to be compliant but didn't meet full, correct compliance (one example I've seen is that a company assigned resources to compliance and forgot to increase resources as workload increased, causing them to be increasingly behind on compliance work), but I've never seen a company that just decided to pretend to be compliant knowing that they were not.
rpicard 4 hours ago
In my experience this is not representative of most fintechs. Of course there are both cases of real intentional noncompliance, and accidental, but by and large it seems like everyone’s trying to innovate within the law.
- scott_w 2 hours ago
  
  This makes sense because these companies want to become large companies and contract with large companies. Large companies, by and large, try to follow the law (while trying to bend it to the limit) because they're aware they have a big target on their back and no CEO wants to be on the front page of the papers for tanking a company in such a stupid fashion.
saghm 4 hours ago

Even if that's the case, I feel like accurately knowing which regulations you're in compliance with and which you're not would be kind of important from a risk management perspective. From a "maximize profits" perspective (which I'm not saying is good but what you're saying you thought they operated with), you'd want to know the potential gain from ignoring a given regulation and the likelihood of getting caught (along with the cost of the punishment if that happens). This is the kind of math that I'd expect a finance company to be pretty familiar with, and giving that up for a fuzzy "idk if we're in compliance or not" check seems like a pretty huge liability (unless there's confidence in not being liable for blindly trusting the LLM, which I hope is not the future we're headed for but I guess I can never be totally confident in us not somehow ending up with rules that defy common sense).
sandworm101 4 hours ago
Companies that are growing tend towards faking compliance. Many financial rules like PCI only kick in at certain scales. So a company growing very quickly will often be behind the curve but will do everything to seem like they are compliant. Then they would hire people like me to come in and make them actually compliant. More often than not, making an effort at improvement was enough to keep the ball rolling.
- mattmanser 4 hours ago
  
  I think it's the same throughout startup software to be honest. It's just easier to point out when there's clear rules.
  Security, GDPR, backups, build pipelines, disaster recovery, most of it will be faked, half-heartedly done once or ignored entirely.
  Then there's the more abstract things like scalability, idempotency when integrating with external APIs, error recovery, accessibility, UX, etc.
  Almost always that sort of stuff will have been entirely ignored, or there will be a fig leaf over a real mess of misunderstood standards or manual intervention steps.
  Startup developers usually have to be generalists as they often wear many hats, so things that need deeper domain knowledge get done to a bare minimum.
IAmGraydon 3 hours ago
Where did you get this impression from?
- parineum 2 hours ago
  
  A worldview built on reading comments from news aggregators.

ilaksh 2 hours ago

3 years max. Maybe 5 if you are lucky. The models will continue to improve. The exponential gains in compute efficiency that have been ongoing for 70+ years will continue and that will result in even smarter models. There are dramatic hardware changes in the pipeline.

But really that particular issue could have been solved by literally just telling it in a markdown file or instructions something like "verify all facts or compliance requirements with web search and include citations in responses".

ofjcihen 2 hours ago
This is akin to “don’t make mistakes”
“Verify all facts and compliance requirements” leaves enormous holes even if you assume the LLM has a concept of facts and requirements (it does not).
What facts? What requirements? For what industry? For what subset of that industry? For what country or countries that you will be doing business in? Are these current “facts” and “requirements” or is the LLM referencing a dusty article from 1992 for which the subject matter has been radically overhauled?
In my job I regularly see small but incredibly important mistakes like this lead to major issues. Some of those are human driven but increasingly the defense of the person responsible has turned into “Claude said it was fine though!”
- kolinko 16 minutes ago
  
  Well, you wouldn't just give human a task "verify all facts and compliance requirements" and expect it to end well either, no?
- ilaksh 2 hours ago
  
  It can make mistakes and will sometimes, but what he specifically mentioned was a case where it did not pull up a reference that it needed. So using a web search tool effectively would make a big difference.
  
vor_ 24 minutes ago
> 3 years max. Maybe 5 if you are lucky. The models will continue to improve. The exponential gains in compute efficiency that have been ongoing for 70+ years will continue and that will result in even smarter models. There are dramatic hardware changes in the pipeline.
I remember hearing that 10 years ago about self-driving.
- DaSHacka 2 minutes ago
  
  "Just 2 more weeks guys, and AI will be able to do everything!"
jppope 2 hours ago

Stuff like that is risk tolerance... it's not strictly codified and it's more akin to probability. Different companies at different stages, in different industries, will all interpret their risk differently... how will a smarter model improve that?
suttontom 2 hours ago

Ah yes, the magical equivalent of "you are a senior software engineer who writes bug-free code".
IME people would benefit greatly from the process, albeit tedious and time-consuming, of testing out the same prompt sequence/session with the exact same model multiple times. It becomes clear extremely quickly how capable but unreliable and inconsistent a model can be even when given the same context. If you have ever completed a long, complicated task with an agent and then lost the session and tried doing the same thing again from scratch you may have had the experience of seeing the subtle changes that come up in the model's thinking which lead it to accept or reject certain paths and ignore or incorporate prompt instructions like the one you've provided.
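A rough sketch of the repeat-the-same-prompt experiment described above. Everything here is an assumption for illustration: `ask_model` is a hypothetical stand-in for whatever client you actually call (it is not a real API), and the stub only simulates an inconsistent model with a seeded random generator.

```python
import random
from collections import Counter
from typing import Callable


def consistency_report(ask_model: Callable[[str], str], prompt: str, runs: int = 10) -> dict:
    """Send the same prompt `runs` times and summarize how often the answers agree."""
    # Light normalization so trivial casing/whitespace differences don't count as disagreement.
    answers = [ask_model(prompt).strip().lower() for _ in range(runs)]
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return {
        "distinct_answers": len(counts),
        "most_common": top_answer,
        "agreement": top_count / runs,  # 1.0 means every run gave the same answer
        "counts": dict(counts),
    }


def make_flaky_stub(seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for a real model call; its answer varies from run to run."""
    rng = random.Random(seed)

    def ask(prompt: str) -> str:
        return rng.choice(["Compliant", "Compliant", "Not compliant"])

    return ask


if __name__ == "__main__":
    report = consistency_report(make_flaky_stub(seed=0), "Is this code compliant?", runs=20)
    print(report)
```

With a real model you would likely compare something coarser than exact strings (a verdict field, say), since free-form answers almost never match verbatim. An agreement value well below 1.0 on a yes/no question is the inconsistency the comment is talking about.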
eikenberry 2 hours ago
The classic 3-5 year window for a new technology that is uncertain and requires just a few more breakthroughs to get there...
- Upvoter33 7 minutes ago
  
  written with confidence too. I'm amazed at the levels of confidence people have in predicting the (unclear) future.
- weakfish 1 hour ago
  
  Like full self driving!

bobkb 3 hours ago

IMHO even if we are using auditing tools I believe we must use deterministic tools for critical analysis like this. Such rule and pattern based systems may not scale beyond certain point but they can be accurate.

ericmcer 3 hours ago

The dynamic of "agent codes, human reviews" does seem like the only sane one for the foreseeable future. Even Anthropic themselves still fall back to this.

The problem is that this sucks: even if all software engineers keep their jobs and salaries, the floor is still pulled out from under us. Imagine if a surgeon's job was to supervise robot surgeons from a remote computer, or a woodworker just signed off on work before the machines did all the cutting and assembly. Sure, they still have important jobs in their field, but the soul & humanity of their skill is gone.

hax0ron3 3 hours ago
I never found there to be much soul and humanity in the job to begin with. Coding personal projects has soul, but for me at least the demands of high-velocity sprint-based software development to match business needs removed most of the soul and humanity long before AI got good at coding. And I mean, I totally understand why it has to be like that. In most businesses, you do better by shipping decent software fast than by shipping great software slowly. I don't have a problem with that in principle. But it does mean that for me, the software development side of things has never had much soul and humanity to begin with. It was just being a glorified assembly line worker, with the sprints being the assembly line. Of course, others may have had very different experiences, but that has been mine.
For me, AIs have actually made the job more soulful, not less. For one thing, it lets me use the part of my mind that is good at human language, not just the part of my mind that is good at software. This makes the job feel a bit less one-dimensional in terms of what parts of me are engaged while doing it. For another, I find it liberating to no longer have to think much about boilerplate code or to spend time roaming around the Internet looking up documentation of various language syntax and API details, the vast majority of which are arbitrary rather than being based on any kind of mathematical beauty. For me it makes the job more soulful that I can think of the job on a higher level instead of having to spend effort on arbitrary and tedious details.
Of course there is still the question of "will the job even exist in a few years, at least for more than a relatively small number of people?". But that's a separate question. For now at least, I am finding that for me AIs have brought a lot more soul and humanity to the job than it ever had before.
- abalashov 1 hour ago
  
  That's an interesting perspective. It's hard for me to relate to it because I haven't worked in a job where I just have to ship code 'for work' in so long. Being a more or less one-man software company, all my work projects, but especially our products, feel like personal projects.
  However, if I were just having to do things for the man, I might have a rather different take on all this.
  
davedx 32 minutes ago

I don't know if I agree that's the only sane workflow; the problem is, I am way less invested doing code reviews of agents than I am reviewing code by human colleagues.
I would love to be able to say I pay the same amount of attention and am just as diligent and communicate as clearly with an agent, but it wouldn't be honest: I scan agent PRs for obvious mistakes or misinterpretation of what they've implemented.
With human colleagues I usually know them and their style, their way of working, so have a better idea what to look for. You also have a genuine return on providing feedback that helps coworkers learn and improve, whereas with agents, all the feedback you write is gone when the thing gets merged (unless your org has some kind of shared memory for its agents).
I don't have the answer for what the future looks like, but I suspect agent-type-1 reviews agent-type-2 is actually where we'll end up.
odeono 3 hours ago
"Soul and humanity" is doing a lot of work here.
Does the woodworker who shapes wood using a handsaw use less "soul" than the one who uses a machine?
Does the musician who uses a DAW and VSTs instead of analogue tape recorders create music with less "soul"?
Does the painter who buys acrylic paint instead of synthesizing their own dye from plants use less "soul"?
As technological innovation progresses, the barrier to creation falls. The process of creating something is not to be conflated with the final piece of art itself.
- hatsix 2 hours ago
  
  Does the carpenter who used to build custom fit cabinets with hand and power tools put in the same creativity when he just carries around a scanner, scans the area, the customers use software to select the layout, approve the work, then the CNC cuts out the wood, then all that's left is to put the screws in the holes and go home.
  This isn't like the step from hand saws to power saws, and it's disingenuous to pretend like it is. This is what the startup machine has been doing to every industry... finding... "inefficiencies" and "optimizing" them.
- jadbox 3 hours ago
  
  Not _my_ opinion, but I just wanted to share that many people (in the Midwest) do believe that anything synthetic that is not readily made from simple materials has "less soul". It's a sorta test of "if I dropped you off in the jungle, can you still produce works of soul? Or are you just another cog in the machine?".
- runarberg 2 hours ago
  
  Your analogies are flawed. DAWs and skill saws generate nothing. They take skill to operate, and a novice cannot use these tools at all unless they know the craft.
  Compare this to prompting an LLM: “Generate a third-person game with a view from above where you can steal cars, shoot at people, run from the police, etc.” Anybody with access to the tool can do this, and the results are just another uninspiring GTA clone, as you would imagine.
  The latter is more like a carpenter ordering their “work” from Alibaba than it is like using a skill saw.
- ImprobableTruth 3 hours ago
  
  Except it's not just a tool.
  It's when a woodworker, musician or painter completely outsources their work and just marks what's wrong, sending those parts back. Yes, the final art piece might be the same, but the artist definitely uses less of their "soul".
lubujackson 2 hours ago
I think there is a big difference between a surgeon, who is performing a specific task with a clear outcome, and a woodworker, who might produce a unique piece of art or a functional chair. I think the surgeon-type tasks will be replaced eventually. More interesting are the woodworker types, which have some similarities to SWEs.
When industrialization hit, we definitely lost a ton of craftsmanship and craftsmen, but a standard Ikea chair is less likely to wobble than the average chair at a much better price (for a random example). Yes, we traded artistry for convenience, but what we really did was bifurcate our needs into "some place stable to sit" and "a beautiful chair for my home". Most people wanted the former more than the latter, and the same applies to software.
If we split the roles into buckets, many woodworkers disappeared, some became artisans, some became designers for industrially-produced products, and some catered to Luddites for a long transitional period. Despite Anthropic's claims, SWEs won't disappear in a year but over a generation or two, no matter how good LLMs become.
Obviously software is much more complicated and integrated into other elements of business, which in a way makes it more vulnerable to AI taking over and in another way will be at the mercy of larger shifts to how businesses organize human roles and responsibilities. What we call "taste" comes down to "intent" - what the hell does a company do? What should it be doing and how should it operate? These will be the only questions that matter and the one thing LLMs can't replace since they will always choose the most default path. So I think humans' roles will be to inject intent/taste at different levels of abstraction throughout an organization.
- Melatonic 17 minutes ago
  
  I'm not sure your assumption holds at all. If anything the Ikea chair could be argued to be a very efficient use of resources producing a minimally useful chair. But why is it less likely to wobble?
  In addition the incentives are misaligned - the "artisan" made chair (in the past) wasn't likely made for aesthetic reasons - it was made to last long term and function. And if it wobbled or had any problems the original woodworker was probably around to fix it.
adrianN 3 hours ago

After a couple of years of this their expertise will be gone too and then nobody is qualified to supervise the clankers.

deanc 5 hours ago

I've worked on projects in the airline and health industry which are highly regulated too. The regulations can be incredibly difficult to process and implement, and it is hard to make sure you adhere to everything correctly. I've been involved in multiple scenarios where people have made false assertions about compliance or lack thereof. I'd still place a bet that the SOA models make _far_ less mistakes than humans.

genxy 5 hours ago
They might make fewer mistakes, but they aren't evenly distributed. They don't use logic when making mistakes, it is gaps in the training data and how large of a span they have to bridge in the latent space. Just as they aren't smart like humans, they aren't stupid like humans. Don't mistake rate for quality.
- Terr_ 2 hours ago
  
  Yeah, this starts to overlap with some autonomous vehicle stuff, where I like to say that the rate of errors is not the shape or distribution of errors.
  We have long historical experience and innate tools for detecting and mitigating errors made by humans. If we can't apply those to automation, then even fewer total mistakes may end up being a worse outcome.
csallen 4 hours ago
For some reason, tons of people seem to be in camps at both extremes. It's either "AI sucks don't trust it!" or "AI is so much better than humans!"
But the most reasonable take, which I'm happy to see reflected in so many comments in this thread, is… use both.
Do an AI pass, and have humans verify, and vice versa. Let the humans drive the AI. Then the unique shortcomings of each party can be covered by the other's strengths.
- hammock 4 hours ago
  
  AI review is never going to beat a fully resourced human review.
  It might beat an underresourced human review, on time, efficiency, cost metrics. But on the metric of accuracy, throwing unlimited humans at a problem will still beat throwing unlimited AI at it
  
- bigstrat2003 4 hours ago
  
  > Do an AI pass, and have humans verify, and vice versa. Let the humans drive the AI.
  You can do that, sure. But doing so negates any improvements in speed the LLM brought. And at that point, you may as well just do it yourself to begin with.
  
- BurningFrog 3 hours ago
  
  This makes sense, but a logical next step is to have one AI write code, and then have another AI, instead of humans, verify it.
  Or are current AIs too similar for that to be fruitful?
  
criticalfault 4 hours ago

not according to my experience.
regulation questions, even the simple ones, AI gets wrong all the time. it wasn't Mythos, but other models like Opus.
I can adjust my view on this topic if/when we get access to Mythos.
sillyfluke 4 hours ago
>I'd still place a bet that the SOA models make _far_ less mistakes than humans.
Genuine question: your top coder seems to be producing the most error-free code from your perspective, has the deepest knowledge of the architecture and codebase, and is faster on the trigger than the others.
But your top coder has proven and verifiable dementia, where they will confidently assume the existence of apis and code that do not exist, mix up the purpose of others and forget other things, and you can't predict when and how they will introduce errors into the system or the severity of such errors.
Are you really comfortable letting this person with dementia generate most of your codebase in the airline and health industry?
I also hope you have an iron-clad agreement that prevents the model provider from doing silent updates because all your evidence of correctness you collected thus far goes out the window in that case.
Another genuine question:
You have witnessed a human coder and the AI you're using make the same important mistake. Assuming you do not have the time and resources to retrain, fine-tune, and test your frontier model:
Who would you trust not to make the same mistake multiple times in the future after you have warned them that their job depends on it, the AI or the human?
- deanc 4 hours ago
  
  Your top coder has guard rails in place to prevent him autonomously going free - right? This is how you should approach agentic development with LLMs. Like it or not, we are the final bastion, the gatekeepers. The hallucination thing I think is mostly overblown and from speaking to colleagues it seems to vary wildly depending on which model and harness you are using - always go for SOA. In the last 3 months I can count on one hand where it's done something wrong and that's primarily as I'm operating it with guard rails and giving it context.
  
realusername 4 hours ago

> I'd still place a bet that the SOA models make _far_ less mistakes than humans.
Well too bad, the problem is that they also produce things much faster than humans so errors will compound quicker.
porridgeraisin 4 hours ago

This stupid argument again. The number of mistakes _does not matter_. Get. This. In. Your. Head. The predictability of the _type_ of error is what matters. For LLMs and machine learning in general the error distribution is not what you would expect and it is not possible to predict either.

tpoacher 3 hours ago

In some sense, you should still act on this, since if an external auditor relies on the same stack, it'll still cause you headaches.

whatevaa 3 hours ago

The models can change at any time and behave differently.

solenoid0937 3 hours ago

I use Opus 4.8 and GPT 5.5 and haven't suffered from hallucinations in months. But we also put a lot of effort into our harness.

Aeolun 3 hours ago
Opus 4.8 and gpt constantly hallucinate stuff as well. If you haven’t encountered or caught it that’s something different. Of course these days it’s mostly confidently asserting a wrong thing.
Loic 3 hours ago

Sometimes the harness can only be a human.
And this is fine. Developing new software with a really smart intern is the same: you, as an expert, need to bring your experience/expertise to the table to get everything right. Because experience needs time.

galactushonor 4 hours ago

> it had of course hallucinated what the regulation actually required

Did it do the correct job once you put the regulations doc(s) in the context?

loloquwowndueo 4 hours ago
What I usually do when in doubt is challenge the AI. “Please quote the section of regulation the product is non compliant with”. It usually admits it hallucinated the whole thing.
- mattmanser 4 hours ago
  
  It sometimes says that even if it hasn't though, so like everything with LLMs, you can't actually rely on that.
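Since the model's own answer to "quote the section" can't be relied on either way, one option is to check the quote against the regulation text deterministically. A minimal sketch, assuming you already have the regulation as plain text; the function names and the sample regulation are made up for illustration. It only tells you whether the quoted passage exists, not whether the model's interpretation of it is right.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line wraps don't cause false mismatches."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_appears_in_source(quote: str, source_text: str) -> bool:
    """Return True only if the quoted passage occurs verbatim (after normalization)."""
    needle = normalize(quote)
    # An empty quote would trivially "match", so reject it explicitly.
    return bool(needle) and needle in normalize(source_text)


if __name__ == "__main__":
    # Made-up regulation text, purely for illustration.
    regulation = """Section 4.2  Records must be retained
    for a period of five years."""
    print(quote_appears_in_source("records must be retained for a period of five years", regulation))
    print(quote_appears_in_source("records must be retained for ten years", regulation))
```

A quote that fails this check is a strong sign of a hallucinated requirement. A quote that passes still needs a human (or counsel) to decide whether it actually applies.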

rvz 5 hours ago

100%. Unfortunately those not in the depths of mission critical systems or regulated products will continue to believe that producing tons of code quickly using LLMs without humans in these systems is acceptable.

Here's an example of what we will continue to see with folks fully immersed in gen AI psychosis:

"The creator of claude code said that he no longer writes code for about 6 months and now has Claude doing all his work now. He also said recently that he no longer prompts Claude and now has it running in loops and it is self-improving itself and performing better than a human!"

If the code produced by the LLM is perfect, the LLM takes the credit. But when a disaster happens, you cannot blame the LLM and it then falls on the human who did it.

I don't think SWEs heavily vibe-coding with LLMs realize the risk in not understanding what the code being produced by the LLM is doing, even after generating tests (lol). We will see more of this too. [0]

[0] https://sketch.dev/blog/our-first-outage-from-llm-written-co...

oceanplexian 4 hours ago
Why is it such a dramatic statement for Boris to claim that he no longer writes code?
Are people on HN still typing out functions by hand one character at a time?
It would be like a developer in 2020 claiming that he only writes assembly because compilers can’t be trusted. No one is taking that person seriously. If you chose a career in tech you made a decision to work in one of the fastest moving fields in human history. Now it’s time to get over it, learn the new tools and adapt.
- msm_ 3 hours ago
  
  >Are people on HN still typing out functions by hand one character at a time?
  Well I use tab completion, of course. And I copy-paste snippets from LLM more often than from SO now. But otherwise not much has changed in my career in the last 5 years. Is this different for you?
  I'm not fundamentally opposed to code generation, and I use LLMs for some tasks, but I don't see myself vibecoding whole pages of production code. I vibecoded a throwaway note-taking app for myself though.
- lelanthran 2 hours ago
  
  > Now it’s time to get over it, learn the new tools and adapt.
  If the AI is producing what you tell it to, why are you needed?
- bigstrat2003 4 hours ago
  
  > Now it’s time to get over it, learn the new tools and adapt.
  No, thank you. I have used the new tools, determined that they aren't helpful to me, and set them aside as I would with any other bad tool. I don't feel the need to let hype take the steering wheel.
- rvz 4 hours ago
  
  > Now it’s time to get over it, learn the new tools and adapt.
  Exactly. You are free to use openclaw or a coding agent to build a competing bank, hedge-fund, hospital or even a new airliner because the previous ones were built by humans. Surely an AI can do it better by itself.
  So why haven't you done it yet?
- matkoniecz 3 hours ago
  
  > Are people on HN still typing out functions by hand one character at a time?
  Yes, me. Yes, I tried LLMs for what I am doing and will try again in few months. No, there was no noticeable or clear improvement over doing it manually.
  Yes, I am using some LLMs for some purposes but Claude Code had slight improvement, if any, not worth introducing proprietary dependency.
- solenoid0937 3 hours ago
  
  It is because HN is contrarian and behind the times.
  I work at a big tech company and I don't know a single person that still hand writes code. Most people haven't hand written code for at least half a year now.
  I do wonder what sort of bug is making its rounds on HN that people here find this so shocking and unbelievable.
- rjrjrjrj 3 hours ago
  
  C'mon, the LLM/compiler false analogy? In 2026?
- troupo 3 hours ago
  
  > Why is it such a dramatic statement for Boris to claim that he no longer writes code?
  Because we can actually see the disjointed slop that Anthropic produces. And when issues happen, they can't fix them for weeks on end because no one understands what code does anymore, and all of their "hard problems causing issues" they blog about are literally "if we had actual engineers this wouldn't even be an issue to begin with". Like this bullshit they had in spring: https://www.anthropic.com/engineering/april-23-postmortem
  > It would be like a developer in 2020 claiming that he only writes assembly because compilers can’t be trusted.
  LLMs are not compilers. For a few very obvious reasons I'll leave as an exercise to figure out

mbbutler 4 hours ago

False-positive rate is so high with Mythos according to friends and other reporting I have seen.

The original Mythos release used ASan to filter false-positives so it was able to maintain a good FPR, but when Mythos moves into domains that don't have a readily available oracle to help filter hits, the result is a deluge of false bullshit.
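
To make the "oracle" idea concrete, here is a minimal sketch of filtering model findings through an independent check. Everything in it is an assumption for illustration: `Finding`, `filter_with_oracle`, and `toy_oracle` are hypothetical names, and the toy oracle only stands in for a real check such as running a reproducer under a sanitizer. It is not how Mythos or ASan are actually wired together.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Finding:
    """A candidate issue reported by a model (hypothetical structure)."""
    location: str
    reproducer: bytes


def filter_with_oracle(
    findings: Iterable[Finding],
    oracle: Callable[[Finding], bool],
) -> List[Finding]:
    """Keep only the findings that the oracle independently confirms."""
    return [f for f in findings if oracle(f)]


def toy_oracle(finding: Finding) -> bool:
    # Stand-in for a real check (e.g. replaying the reproducer under a sanitizer).
    # Assumption for this sketch: a finding is "confirmed" if it has a reproducer.
    return len(finding.reproducer) > 0


candidates = [
    Finding("parser.c:120", b"\xff\xff"),  # has a reproducer, survives the filter
    Finding("auth.c:45", b""),             # no reproducer, dropped as unconfirmed
]
confirmed = filter_with_oracle(candidates, toy_oracle)
```

The point of the sketch is the shape of the problem: in domains like regulatory compliance there is no obvious function to pass as `oracle`, so nothing sits between the model's raw output and the humans who have to triage it.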

Lionga 5 hours ago

Have you added "Make no mistakes" to the proompt? Mythos can't go wrong then, must be a skill issue.

cheschire 5 hours ago
it's shocking people don't realize you're being ironic
- steveBK123 5 hours ago
  
  AI cannot fail, it can only be failed
  
- SpicyLemonZest 5 hours ago
  
  I realize they’re being ironic, it’s just a poor contribution to an otherwise productive conversation.

franze 5 hours ago

what am i missing?

you take a spec and create tests, every little thing

you use another ai to verify these tests against the spec

you review the tests vs the spec (at one point human review)

you put the tests off limits to change / wall them.

you let the ai write the software that fulfills the tests.

there will be some gaps where you repeat the cycle above

if the tests fulfill the spec, the code will fulfill the spec
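
a rough sketch of the "wall the tests" step, assuming the reviewed tests live in one directory of `.py` files. the function names (`snapshot`, `changed_files`) are made up for illustration, and hashing files is just one possible way to detect tampering, not a claim about how any particular agent setup does it:

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def snapshot(test_dir: Path) -> Dict[str, str]:
    """Record a SHA-256 digest for every test file once human review is done."""
    return {
        str(p.relative_to(test_dir)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(test_dir.rglob("*.py"))
    }


def changed_files(test_dir: Path, approved: Dict[str, str]) -> List[str]:
    """Return test files that were added, removed, or modified since approval."""
    current = snapshot(test_dir)
    return sorted(
        name
        for name in set(approved) | set(current)
        if approved.get(name) != current.get(name)
    )
```

you would call `snapshot` after the human review, store the result somewhere the agent cannot write to, and reject any run where `changed_files` returns a non-empty list.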

torben-friis 4 hours ago
>you take a spec and create tests, every little thing
A spec detailed enough and unambiguous enough to be translated into machine execution deterministically is called code.
Unlike a compiler, AI can build with a spec that is not detailed enough or unambiguous enough: It does so by filling in the gaps with educated guesses.
This is safe if and only if you take the time to later read the output, understand what its guesses were, and judge whether they were acceptable. No AI can do this for you because the truth lies in your original intentions, which it does not have access to.
The jury is still out on how reliable and time-consuming this is vs. writing the code yourself; it is not immediately obvious that it is faster or requires a smaller cognitive load.
- hparadiz 4 hours ago
  
  Code is not a spec. It's an instruction set. It can be a spec if you try hard, but that's not an inherent property of code. For example, you can write code to be a compiler; that makes it a spec. But hello world is not a spec.
  As for whether or not LLMs can write unit tests: the answer is yes.
  
steveBK123 5 hours ago

If each step requires micro-steps iterating with an LLM with human review to prevent hallucinations creeping in.. at some point you might just be better off letting the human do the work.
Particularly as tokenmaxxing has ended and people are being charged more economic prices. If pricing goes up 5-10x the way Uber etc. did on the path to profitability, even more so.
officialchicken 5 hours ago
IME, regulatory compliance is something you are rarely able to test for in a nice little box or with a well-known suite. So there's no easy "this complies" in many situations, no matter how many lawyers, compliance officers, and LLMs you run it past.
- franze 5 hours ago
  
  so, what's the difference from human engineering?
  other than there are "internal micro feedback loops" during development?
hedora 3 hours ago

I walked down that path for a few months. The more you constrain LLMs, the more underhandedly they behave in order to produce something that satisfies all the constraints.
Doing the above doesn't actually make the model smarter, so, if it couldn't get to correct code with fewer steps, then the light you see at the end of the tunnel is an oncoming train.
sigbottle 4 hours ago

This is such an abstract principle that the principle itself cannot be refuted. The plan sounds fine on paper. "Just iterate bro". But it entirely depends on what rational agents you put into the system. Obviously, if I sub in a 5 year old child everywhere, this loop breaks. Humans and AI, sometimes one is better than the other at certain things, we're still learning.
The only way to test this is to test it out, in real life. Sometimes people see results, sometimes people don't. Note that yes, I am including the entire iteration process - even after iterating, people still don't see results with AI.
I have had both positive and negative experiences with AI, over multi-week projects. But apparently on hackernews, anything positive about AI is proof that AI is superhuman and taking over, and all follies about AI are lies by stupid humans who secretly have psychological dispositions to fear AI. Sometimes the AI genuinely isn't good enough. Are we not allowed to say that now? We might not know why, but it's just the truth.
The other solution is to formally analyze the entire space of possible actions the agent can take a priori. Then yes, you can definitively say whether or not the principle breaks. Can you, though? Can you give a formal specification for the space of possible actions for AI and show that your loop never breaks, or breaks less than humans, or meets any other sensible criterion? If not, then you can't just give an abstract principle and start making inferences from that.
bobkb 3 hours ago

It’s impossible to write a spec that’s unambiguous, complete, and correct in a natural language. Thus prompts will always generate unreliable software.

SuperV1234 5 hours ago

Is that all that Mythos did?

Did it find any real potential issues or optimization/simplification opportunities, or spark any thought-provoking discussion within your organization?

Or was it purely a net negative experience?

margalabargala 5 hours ago

Read their comment. It's a negative anecdote surrounded by them using genAI all the time.
You're the only one coming away thinking there was a net negative experience.
troupo 5 hours ago
In regulated industries none of those matter if the tool invents compliance issues or breaks compliance.
The only thought-provoking discussion should be "why the hell do we have this stochastic parrot anywhere near our codebase"
- bloaf 5 hours ago
  
  I think that what technical people fail to understand is that a lot of the time, "compliance" is not the same as a binary compiles/does not compile. For a lot of rules/regulations, compliance means "making enough effort that legal is willing to back you up".
  A system which will just randomly decide to give the legal team reasons to not back you up is:
  * A system whose output will get brought up in lawsuits and make legal's job harder.
  * A system that will make the dev team perpetually chase its tail while it oscillates between the several different valid interpretations of the rules.
- brookst 5 hours ago
  
  Odd take. So if it identified 17 real gaps and helped fix them, the fact it was wrong about one gap, and the appropriate humans caught it and no harm was done, the whole thing is useless?
  Not saying that is the situation, I don’t know. But if “one error is too many” is your point of view… do you think the humans in these orgs are 100% perfect 100% of the time?
  

gaiagraphia 5 hours ago

Isn't that a net positive though? (Not sure about the human and tech cost.) I'm guessing that without using Mythos, those conversations would never have been had, and confidence in the compliance of the product would've been lower.

I love using AI tools as casinos. It's epic in helping to forge ideas and kickstart thought processes. You basically have the entirety of world knowledge at your fingertips to have a pint with.

vulcan01 5 hours ago
your parent:
> the code in question had already been reviewed by human counsel
- johnbarron 5 hours ago
  
  They can't read all the comments they comment on...
cucumber3732842 5 hours ago
> I'm guessing that without using Mythos, those conversations would never have been had, and confidence in the compliance of the product would've been lower.
The conversations had already been had and the product made compliant. Mythos just pulled new rules out of its ass and of course the product wasn't compliant with those. So they do a fire drill and find that to be the case at great expense.
Yeah, you can frame it as "more checking is always better" if you want, but that's just the same old "other people's resources are valueless" sleight of hand we see on everything. It probably was mostly wasteful work.
- hedora 3 hours ago
  
  There's a chapter in Simple Sabotage about how to undermine a white collar organization from the inside. One of the key tactics is to hold meetings that revisit decided upon points, and to invent unnecessary process / checking.
  So, in this case, the LLM's behavior was equivalent to the behavior of the resistance during WWII.
  I think that book should be required reading for all engineering students.