Comment by iandanforth

5 hours ago

Wut? I pilot LLMs all day but there's no way in hell I'd agree to be at the helm of a finance product. That first pillar is still there. Maybe the author isn't aware of the impact they have, but I know, with the evidence of reverted PRs, that when I step outside my area of deep knowledge I can no longer call BS on the agents. Our most capable agent, with access to the same kind of distributed systems the author talks about, is regularly wrong, frequently myopic, and just outright dumb constantly. It's the expertise of engineers on the team that push it back on track.

Posting this under a burner so I don't dox myself: I work in FinTech on a regulated product. We have access to Mythos. Mythos identified part of our codebase that it confidently asserted was not compliant with a particular regulation, and claimed we were at grave risk by allowing it to operate the way it was.

Except this was not the case, it had of course hallucinated what the regulation actually required (I know this because the code in question had already been reviewed by human counsel). This is (supposedly) the most bleeding-edge model available.

We use a lot of genAI to help us write code, but there is no way in the mid-term we could ever rely on these tools to actually build compliant financial products. We'd have to be totally mad. Yes, lots of Fintech companies are using these agents to accelerate, but anyone who's using them to actually ship product without a human actually digging into it is opening themselves up to a world of risk.

  • I have worked on highly regulated areas in finance (risk). Compliance is a highly creative art, often requiring lots of out-of-the-box thinking and non-obvious solutions. The people I found worst at this were IT. They tend to over-interpret regulation, and super-restrict beyond what is needed for actual de-facto compliance.

    My guess is the model makes the same mistakes as the programmers: taking 'rules' literally, unaware of sectoral joint understanding, validated interpretations and habits. (Btw, on the non-tech side this is often also a difference between regulatory and legal: the former are much more result-oriented, while the latter are primarily risk-averse.)

    • > IT. They tend to over-interpret regulation, and super-restrict beyond what is needed for actual de-facto compliance.

      IME this is less the fault of IT and more so bad auditors that won't consider, or just don't understand, what compensating controls are. If it doesn't meet their little checklist exactly, they fail the audit.

  • It was my impression that a whole lot of products are only pretending to be compliant, and that it's much more profitable to operate like that.

    • I've worked in fintech for 30 years. I've never seen a product that was intentionally "only pretending to be compliant" with laws.

      I've seen accidental non-compliance. I've seen what I would call negligent compliance, where a company attempted to be compliant but didn't meet full, correct compliance (one example I've seen is that a company assigned resources to compliance and forgot to increase resources as workload increased, causing them to be increasingly behind on compliance work), but I've never seen a company that just decided to pretend to be compliant knowing that they were not.

    • In my experience this is not representative of most fintechs. Of course there are both cases of real intentional noncompliance, and accidental, but by and large it seems like everyone’s trying to innovate within the law.

    • Even if that's the case, I feel like accurately knowing which regulations you are and are not in compliance with would be kind of important from a risk management perspective. From a "maximize profits" perspective (which I'm not saying is good but what you're saying you thought they operated with), you'd want to know the potential gain from ignoring a given regulation and the likelihood of getting caught (along with the cost of the punishment if that happens). This is the kind of math that I'd expect a finance company to be pretty familiar with, and giving that up for a fuzzy "idk if we're in compliance or not" check seems like a pretty huge liability (unless there's confidence in not being liable for blindly trusting the LLM, which I hope is not the future we're headed for but I guess I can never be totally confident in us not somehow ending up with rules that defy common sense).
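
The math described above fits in a few lines. This is only a back-of-the-envelope sketch: the function name and every number are made up for illustration, and a real assessment would also have to price in things like reputational damage or losing a license.

```python
def expected_value_of_ignoring(gain: float, p_caught: float, penalty: float) -> float:
    """Expected net payoff of ignoring a rule: gain minus probability-weighted penalty."""
    if not 0.0 <= p_caught <= 1.0:
        raise ValueError("p_caught must be a probability in [0, 1]")
    return gain - p_caught * penalty


# Illustrative numbers only (not from any real case).
result = expected_value_of_ignoring(gain=100_000, p_caught=0.25, penalty=1_000_000)
print(result)  # -150000.0
```

The point of the sketch is that every input has to be known. A fuzzy "maybe we're compliant" check means you can't even say which regulations the calculation applies to.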

    • Companies that are growing tend towards faking compliance. Many financial rules like PCI only kick in at certain scales. So a company growing very quickly will often be behind the curve but will do everything to seem like they are compliant. Then they would hire people like me to come in and make them actually compliant. More often than not, making an effort at improvement was enough to keep the ball rolling.

  • IMHO even if we are using auditing tools, I believe we must use deterministic tools for critical analysis like this. Such rule- and pattern-based systems may not scale beyond a certain point, but they can be accurate.

  • The dynamic of "agent codes, human reviews" does seem like the only sane one for the foreseeable future. Even Anthropic themselves still fall back to this.

    The problem is that it sucks: even if all software engineers keep their jobs and salaries, the floor is still pulled out from under us. Imagine if a surgeon's job was to supervise robot surgeons from a remote computer, or a woodworker just signed off on work before the machines did all the cutting and assembly. Sure, they still have important jobs in their field, but the soul & humanity of their skill is gone.

    • "Soul and humanity" is doing a lot of work here.

      Does the woodworker who shapes wood with a handsaw use less "soul" than the one who uses a machine?

      Does the musician who uses a DAW and VSTs instead of analogue tape recorders create music with less "soul"?

      Does the painter who buys acrylic paint instead of synthesizing their own dye from plants use less "soul"?

      As technological innovation progresses, the barrier to creation falls. The process of creating something is not to be conflated with the final piece of art itself.

    • I never found there to be much soul and humanity in the job to begin with. Coding personal projects has soul, but for me at least the demands of high-velocity sprint-based software development to match business needs removed most of the soul and humanity long before AI got good at coding. And I mean, I totally understand why it has to be like that. In most businesses, you do better by shipping decent software fast than by shipping great software slowly. I don't have a problem with that in principle. But it does mean that for me, the software development side of things has never had much soul and humanity to begin with. It was just being a glorified assembly line worker, with the sprints being the assembly line. Of course, others may have had very different experiences, but that has been mine.

      For me, AIs have actually made the job more soulful, not less. For one thing, it lets me use the part of my mind that is good at human language, not just the part of my mind that is good at software. This makes the job feel a bit less one-dimensional in terms of what parts of me are engaged while doing it. For another, I find it liberating to no longer have to think much about boilerplate code or to spend time roaming around the Internet looking up documentation of various language syntax and API details, the vast majority of which are arbitrary rather than being based on any kind of mathematical beauty. For me it makes the job more soulful that I can think of the job on a higher level instead of having to spend effort on arbitrary and tedious details.

      Of course there is still the question of "will the job even exist in a few years, at least for more than a relatively small number of people?". But that's a separate question. For now at least, I am finding that for me AIs have brought a lot more soul and humanity to the job than it ever had before.

    • I think there is a big difference between a surgeon, who is performing a specific task with a clear outcome, and a woodworker, who might produce a unique piece of art or a functional chair. I think the surgeon-type tasks will be replaced eventually. More interesting are the woodworker types, which have some similarities to SWEs.

      When industrialization hit, we definitely lost a ton of craftsmanship and craftsmen, but a standard Ikea chair is less likely to wobble than the average chair, at a much better price (for a random example). Yes, we traded artistry for convenience, but what we really did was split our needs into "some place stable to sit" and "a beautiful chair for my home". Most people wanted the former more than the latter, and the same applies to software.

      If we split the roles into buckets, many woodworkers disappeared, some became artisans, some became designers for industrially-produced products, and some catered to Luddites for a long transitional period. Despite Anthropic's claims, SWEs won't disappear in a year but over a generation or two, no matter how good LLMs become.

      Obviously software is much more complicated and integrated into other elements of business, which in a way makes it more vulnerable to AI taking over and in another way leaves it at the mercy of larger shifts in how businesses organize human roles and responsibilities. What we call "taste" comes down to "intent": what the hell does a company do? What should it be doing and how should it operate? These will be the only questions that matter and the one thing LLMs can't replace, since they will always choose the most default path. So I think humans' role will be to inject intent/taste at different levels of abstraction throughout an organization.

    • After a couple of years of this their expertise will be gone too and then nobody is qualified to supervise the clankers.

  • 3 years max. Maybe 5 if you are lucky. The models will continue to improve. The exponential gains in compute efficiency that have been ongoing for 70+ years will continue and that will result in even smarter models. There are dramatic hardware changes in the pipeline.

    But really that particular issue could have been solved by literally just telling it in a markdown file or instructions something like "verify all facts or compliance requirements with web search and include citations in responses".

    • This is akin to “don’t make mistakes”

      “Verify all facts and compliance requirements” leaves enormous holes even if you assume the LLM has a concept of facts and requirements (it does not).

      What facts? What requirements? For what industry? For what subset of that industry? For what country or countries that you will be doing business in? Are these current “facts” and “requirements” or is the LLM referencing a dusty article from 1992 for which the subject matter has been radically overhauled?

      In my job I regularly see small but incredibly important mistakes like this lead to major issues. Some of those are human driven but increasingly the defense of the person responsible has turned into “Claude said it was fine though!”

    • Stuff like that is risk tolerance... it's not strictly codified and it's more akin to probability. Different companies at different stages, in different industries, will all interpret their risk differently... how will a smarter model improve that?

    • Ah yes, the magical equivalent of "you are a senior software engineer who writes bug-free code".

      IME people would benefit greatly from the process, albeit tedious and time-consuming, of testing out the same prompt sequence/session with the exact same model multiple times. It becomes clear extremely quickly how capable but unreliable and inconsistent a model can be even when given the same context. If you have ever completed a long, complicated task with an agent and then lost the session and tried doing the same thing again from scratch you may have had the experience of seeing the subtle changes that come up in the model's thinking which lead it to accept or reject certain paths and ignore or incorporate prompt instructions like the one you've provided.
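
For anyone who wants to try the repeated-run experiment, here is a minimal sketch of how the results could be tallied. `ask` is a hypothetical stand-in for whatever wrapper you have around your model client (it is not a real library call), and the stub below fakes the answers so the example runs offline.

```python
from collections import Counter
from typing import Callable


def answer_distribution(ask: Callable[[str], str], prompt: str, runs: int = 10) -> Counter:
    """Send the same prompt `runs` times and count each distinct (normalized) answer."""
    return Counter(ask(prompt).strip().lower() for _ in range(runs))


def agreement_rate(dist: Counter) -> float:
    """Share of runs that produced the most common answer (1.0 means fully consistent)."""
    total = sum(dist.values())
    return dist.most_common(1)[0][1] / total if total else 0.0


# Stub standing in for a real model call; the answers are invented.
canned = iter(["Compliant", "compliant", "Not compliant", "compliant"])
dist = answer_distribution(lambda _prompt: next(canned), "Is module X compliant?", runs=4)
print(agreement_rate(dist))  # 0.75
```

Exact string matching is a crude measure for free-form answers, so in practice you would probably compare a structured field (a verdict, a cited section) rather than the whole response.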

  • I've worked on projects in the airline and health industry which are highly regulated too. The regulations can be incredibly difficult to process and implement, and make sure you adhere to everything correctly. I've been involved in multiple scenarios where people have made false assertions about compliance or lack of. I'd still place a bet that the SOA models make _far_ less mistakes than humans.

    • They might make fewer mistakes, but they aren't evenly distributed. They don't use logic when making mistakes; the mistakes come from gaps in the training data and how large of a span they have to bridge in the latent space. Just as they aren't smart like humans, they aren't stupid like humans. Don't mistake rate for quality.

    • For some reason, tons of people seem to be in camps at both extremes. It's either "AI sucks don't trust it!" or "AI is so much better than humans!"

      But the most reasonable take, which I'm happy to see reflected in so many comments in this thread, is… use both.

      Do an AI pass, and have humans verify, and vice versa. Let the humans drive the AI. Then the unique shortcomings of each party can be covered by the other's strengths.

    • Not in my experience.

      Regulation questions, even the simple ones, AI gets wrong all the time. It wasn't Mythos, but other models like Opus.

      I can adjust my view on this topic if/when we get access to Mythos.

    • >I'd still place a bet that the SOA models make _far_ less mistakes than humans.

      Genuine question: your top coder seems to be producing the most error-free code from your perspective, has the deepest knowledge of the architecture and codebase, and is faster on the trigger than the others.

      But your top coder has proven and verifiable dementia, where they will confidently assume the existence of APIs and code that do not exist, mix up the purpose of others and forget other things, and you can't predict when and how they will introduce errors into the system or the severity of such errors.

      Are you really comfortable letting this person with dementia generate most of your codebase in the airline and health industry?

      I also hope you have an iron-clad agreement that prevents the model provider from doing silent updates because all your evidence of correctness you collected thus far goes out the window in that case.

      Another genuine question:

      You have witnessed a human coder and the AI you're using make the same important mistake. Assuming you do not have the time and resources to retrain, fine-tune, and test your frontier model:

      Who would you trust not to make the same mistake multiple times in the future after you have warned them that their job depends on it, the AI or the human?

    • > I'd still place a bet that the SOA models make _far_ less mistakes than humans.

      Well too bad, the problem is that they also produce things much faster than humans so errors will compound quicker.

    • This stupid argument again. The number of mistakes _does not matter_. Get. This. In. Your. Head. The predictability of the _type_ of error is what matters. For LLMs and machine learning in general the error distribution is not what you would expect and it is not possible to predict either.

  • In some sense, you should still act on this, since if an external auditor relies on the same stack, it'll still cause you headaches.

  • I use Opus 4.8 and GPT 5.5 and haven't suffered from hallucinations in months. But we also put a lot of effort into our harness.

    • Opus 4.8 and GPT constantly hallucinate stuff as well. If you haven’t encountered or caught it, that’s something different. Of course these days it’s mostly confidently asserting a wrong thing.

    • Sometimes the harness can only be a human.

      And this is fine. Developing new software with a really smart intern is the same: you, as an expert, need to bring your experience/expertise to the table to get everything right. Because experience needs time.

  • > it had of course hallucinated what the regulation actually required

    Did it do the correct job once you put the regulations doc(s) in the context?

    • What I usually do when in doubt is challenge the AI. “Please quote the section of regulation the product is non compliant with”. It usually admits it hallucinated the whole thing.
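
The "please quote the section" challenge can also be checked mechanically, so you don't have to take the model's word for the quote either. A rough sketch, assuming you have the regulation available as plain text; the function names and the sample text are invented for illustration.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line wrapping does not cause false misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_appears_in(quote: str, regulation_text: str) -> bool:
    """True only if the quoted passage occurs verbatim (modulo case/whitespace) in the source."""
    q = _normalize(quote)
    return bool(q) and q in _normalize(regulation_text)


# Made-up regulation text, for illustration only.
regulation = """Section 4.2  Records must be retained
for a minimum of five years."""

print(quote_appears_in("records must be retained for a minimum of five years", regulation))  # True
print(quote_appears_in("records must be encrypted at rest", regulation))  # False
```

This only proves the quote exists in the document, not that the model interpreted it correctly. That part still needs a human (or counsel).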

  • 100%. Unfortunately those not in the depths of mission critical systems or regulated products will continue to believe that producing tons of code quickly using LLMs without humans in these systems is acceptable.

    Here's an example of what we will continue to see with folks fully immersed in gen AI psychosis:

    "The creator of claude code said that he no longer writes code for about 6 months and now has Claude doing all his work now. He also said recently that he no longer prompts Claude and now has it running in loops and it is self-improving itself and performing better than a human!"

    If the code produced by the LLM is perfect, the LLM takes the credit. But when a disaster happens, you cannot blame the LLM and it then falls on the human who did it.

    I don't think SWEs heavily vibe-coding with LLMs realize the risk in not understanding what the code being produced by the LLM is doing, even after generating tests (lol). We will see more of this too. [0]

    [0] https://sketch.dev/blog/our-first-outage-from-llm-written-co...

    • Why is it such a dramatic statement for Boris to claim that he no longer writes code?

      Are people on HN still typing out functions by hand one character at a time?

      It would be like a developer in 2020 claiming that he only writes assembly because compilers can’t be trusted. No one is taking that person seriously. If you chose a career in tech you made a decision to work in one of the fastest moving fields in human history. Now it’s time to get over it, learn the new tools and adapt.

  • False-positive rate is so high with Mythos according to friends and other reporting I have seen.

    The original Mythos release used ASan to filter false-positives so it was able to maintain a good FPR, but when Mythos moves into domains that don't have a readily available oracle to help filter hits, the result is a deluge of false bullshit.

  • what am i missing?

    you take a spec and create tests, every little thing

    you use another ai to verify these tests against the spec

    you review the tests vs the spec (at one point human review)

    you put the tests off limits to change / wall them.

    you let the ai write the software that fulfills the tests.

    there will be some gaps where you repeat the cycle above

    if the tests fulfill the spec, the code will fulfill the spec
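
The "wall the tests" step in the loop above could be enforced with something as simple as content hashes taken before and after the agent runs. This is a sketch of one possible mechanism, not a description of any particular tool; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot(test_dir: Path) -> dict[str, str]:
    """Map each file under test_dir (relative path) to the SHA-256 of its contents."""
    return {
        str(p.relative_to(test_dir)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(test_dir.rglob("*"))
        if p.is_file()
    }


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Files that were added, removed or modified between two snapshots."""
    return sorted(k for k in before.keys() | after.keys() if before.get(k) != after.get(k))
```

Take a `snapshot` once the humans have reviewed the tests, take another after the agent finishes, and reject the run if `changed_files` returns anything. Note this only guards the tests against edits; it says nothing about whether the tests cover the spec.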

    • >you take a spec and create tests, every little thing

      A spec detailed enough and unambiguous enough to be translated into machine execution deterministically is called code.

      Unlike a compiler, AI can build with a spec that is not detailed enough or unambiguous enough: It does so by filling in the gaps with educated guesses.

      This is safe if and only if you take the time to later read the output, understand what its guesses were, and judge whether they were acceptable. No AI can do this for you because the truth lies in your original intentions, which it does not have access to.

      The jury is still out on how reliable and time-consuming this is vs. writing the code yourself; it is not immediately obvious that it is faster or requires a smaller cognitive load.

    • If each step requires micro-steps iterating with an LLM with human review to prevent hallucinations creeping in.. at some point you might just be better off letting the human do the work.

      Particularly as tokenmaxxing has ended and people are being charged more economic prices. If pricing goes up 5-10x the way Uber etc. did on the path to profitability, even more so.

    • IME, regulatory compliance is something you are rarely able to test for in a nice little box or with a well-known suite. So there's no easy "this complies" in many situations, no matter how many lawyers, compliance officers, and LLMs you run it past.

    • I walked down that path for a few months. The more you constrain LLMs, the more underhandedly they behave in order to produce something that satisfies all the constraints.

      Doing the above doesn't actually make the model smarter, so, if it couldn't get to correct code with fewer steps, then the light you see at the end of the tunnel is an oncoming train.

    • This is such an abstract principle that the principle itself cannot be refuted. The plan sounds fine on paper. "Just iterate bro". But it entirely depends on what rational agents you put into the system. Obviously, if I sub in a 5 year old child everywhere, this loop breaks. Humans and AI, sometimes one is better than the other at certain things, we're still learning.

      The only way to test this is to test it out, in real life. Sometimes people see results, sometimes people don't. Note that yes, I am including the entire iteration process - even after iterating, people still don't see results with AI.

      I have had both positive and negative experiences with AI, over multi-week projects. But apparently on hackernews, anything positive about AI is proof that AI is superhuman and taking over, and all follies about AI are lies by stupid humans who secretly have psychological dispositions to fear AI. Sometimes the AI genuinely isn't good enough. Are we not allowed to say that now? We might not know why, but it's just the truth.

      The other solution is to formally analyze the entire space of possible actions the agent can take a priori. Then yes, you can definitively say whether or not the principle breaks or not. Can you, though? Can you give a formal specification for the space of possible actions for AI and show that your loop never breaks, or breaks less than humans, or any other sensible criteria? If not, then you can't just give an abstract principle and start making inferences from that.

    • It’s impossible to write a spec in natural language that is unambiguous, complete and correct. Thus prompts will always generate unreliable software.

  • Is that all that Mythos did?

    Did it find any real potential issues or optimization/simplification opportunities, or spark any thought-provoking discussion within your organization?

    Or was it purely a net negative experience?

    • Read their comment. It's a negative anecdote surrounded by them using genAI all the time.

      You're the only one coming away thinking there was a net negative experience.

    • In regulated industries none of those matter if the tool invents compliance issues or breaks compliance.

      The only thought-provoking discussion should be "why the hell do we have this stochastic parrot anywhere near our codebase"

  • Isn't that a net positive though? (Not sure about the human and tech cost.) I'm guessing that without using Mythos, those conversations would never have been had, and confidence in the compliance of the product would've been lower.

    I love using AI tools as casinos. It's epic in helping to forge ideas and kickstart thought processes. You basically have the entirety of world knowledge at your fingertips to have a pint with.

    • > I'm guessing that without using Mythos, those conversations would never have been had, and confidence in the compliance of the product would've been lower.

      The conversations had already been had and the product made compliant. Mythos just pulled new rules out of its ass and of course the product wasn't compliant with those. So they do a fire drill and find that to be the case at great expense.

      Yeah, you can frame it as "more checking is always better" if you want, but that's just the same old "other people's resources are valueless" sleight of hand we see on everything. It probably was mostly wasteful work.

> It's the expertise of engineers on the team that push it back on track.

But how are you so sure your colleagues are not more "expert" than you? Prior to LLMs there was room for very good engineers and mediocre engineers to work together in 99% of the companies out there. With LLMs, only the "best" engineers will survive, because nobody needs mediocre engineers anymore.

This being HN, I imagine every engineer reading this thinks they are in the top 5-10% of their company/city/country, and therefore that they are not one of the "mediocre" engineers who can be affected by the introduction of LLMs. Statistically, they are probably wrong. So, it's all about ego. Chances are you are not a rockstar and LLMs will eventually take over your job.

As usual, the only winners here are corporations and executives. Most of us are the last monkeys in the chain, and so we'll get screwed.

  • The corporations and executives are already winning if you swallowed the concept of 'rockstar' engineer. Sure there are more and less experienced engineers, but even interns can and often do provide good input and spot mistakes made by seniors. The 'rockstar' engineer at most tech companies simply equates to the somewhat autistic guy with a brown nose who's working 15 hour days for a pat on the head from management (and making many mistakes in the process).

    • Even if we forget "rockstar", there are certainly different levels of engineers. More experience doesn't automatically mean better either. That is not to say experience doesn't matter. It matters quite a bit. Sure, good interns can sometimes have good feedback or spot mistakes. But not consistently enough.

      All of this to say that it's not just experience that makes one a better engineer.

  • > because nobody needs mediocre engineers anymore.

    This is giving too much credit to LLMs. I think LLMs are great and incredibly useful in both personal and professional settings. However, they exist on a separate plane from human workers, in the tools category.

    Sooner or later, people will find out that LLMs only overlap with the existing human hierarchy (e.g. junior dev X%, senior dev Y%, etc.), but almost never 100%. If the overlap with a certain position is 100%, you are probably using the humans wrong to begin with, since humans have one of the most prized things, of which I don't see a single ounce in LLMs: initiative.

    • It's hard to show initiative without a pulse. Most agents don't have that (yet). But can't be too hard to build.

  • > With LLMs, only the "best" engineers will survive, because nobody needs mediocre engineers anymore.

    LLMs are going to show that there's a huge divide in "engineers" between people who love "coding" and people who like "engineering".

    The group of people kicking and screaming the most are the people who love code and don't want to see their coding go away.

    These are typically the build vs buy folks. "We can't use anything anyone else wrote, I can do it better..."

    What do you think Staff level engineers do? They don't sit around coding all day.

    Writing the code is just something you had to do in the past to get the job done.

    What you get paid to do is "engineer" and the two are separate. Coding is very small part of the average engineer's job.

    And yet the vast majority of engineers think that the world is going to end if they aren't spending most of their time "coding".

    • Very well said, and if you look at some of the other threads on Hacker News about why people don’t like AI, it's specifically because they like typing and coding.

      The majority of my time as an engineering manager has been spent teaching “engineers” how to actually do engineering with any kind of rigor.

      The engineers who have absolutely no theoretical, structural or systems basis for what they’re doing are the vast, vast majority.

  • Exactly. Same with tractors. Once they arrived, nobody benefited except Big Tractor.

    Famously a net loss for humanity.

  • > With LLMs, only the "best" engineers will survive, because nobody needs mediocre engineers anymore.

    I don't think this is true.

    A good engineer doesn't have infinite throughput. In my opinion the best engineers should be constantly bottlenecked because they solve difficult problems. They don't have time for grunt work. Every company needs less than perfect engineers, AI assisted or not.

  • Well almost 70% of the developers in the industry can't write a fizz buzz.

    But, besides coding skills (which some possess), the engineering, social, and business ones are close to non existent.

> Wut? I pilot LLMs all day but there's no way in hell I'd agree to be at the helm of a finance product.

Dunno how much longer that is going to remain true for your specific employer - all the fintech companies I deal with personally have had some sort of AI account for their devs since last year.

Even places like Jane Street have employees posting blogs (one of which was on the HN front page about 60m ago) saying they mostly direct agents.

How long do you think your specific employer is going to hold out?

  • Sorry if I was unclear. I don't work in finance. I do work with agents. I think expert engineers in finance who are guiding agents are adding a lot of value because of their knowledge of finance. Because I lack that knowledge of finance, even given access to agents, I would not accept a role guiding agents in a finance company because I wouldn't be able to guide the agents well and my/our output would be bad.

Norwegians have a saying: “Den som er ferdig utlært, er ikke utlært – men ferdig.” Meaning: whoever thinks they are done learning is not learned, just done. Typical Scandinavian cold, hard truth…

I understand the frustration of spending years nurturing a skill and then seeing its value decline. But this isn’t really an LLM problem. The same thing happened to factory workers, typists, draftsmen, and many others before. The technology changes, but the underlying issue is the economic system we live in, where the market can suddenly decide that something you’ve spent years mastering is worth much less than before.

LLMs are not creating that dynamic. They’re just accelerating it.

Unfortunately every software-related industry is embracing LLM/Codegen. Your banks, fintechs, insurance. Everyone. Your concerns are the same ones I'm having, yet they're regularly dismissed or hand-waved away as "don't worry about it, the delivery velocity/ROI is worth it".

  • It's not so much about velocity or quality, both of which LLMs do (or will) provide.

    The real question is about accountability and liability.

    When a major data leak happens, who will they sue or fire? That is the value engineers provide. They understand, confirm, and take ownership.

    • This is what I'm wondering too. We've signed a confidentiality agreement with all the big players (as I'm sure all other companies have done), which is supposed to ensure our data is both segregated and not used for training. I don't trust these companies not to do just that; their business is in taking what we have and training their models.

    • This question has been easily answered by many companies.

      You, the IC, the developer prompting the code extruder, are ultimately responsible for its outputted code and its behaviour.

      You may feel pressured to push out thousands of lines of code a day. You may see those thousands of lines refactored several times over the lifespan of a merge request. You may be asked to continue this in the long term, with all the mental fatigue that entails.

      When it's too much for you to sustainably deal with and you turn to using LLMs to review the code, that will still, presumably, fall on you at the end of the day.

      The output is your responsibility.

    • Ostensibly, due diligence should not change. But people are lazy, just as they've always been around testing/QA/definition-of-done.

      I'm not even certain that laziness gets them further along than it used to; I think it's that people have not had their overconfidence painfully corrected yet. Behaviors will re-align pretty fast when people realize that no, they're not going to get away with just pressing a button and saying everything is "good". That is happening right now.

    • Don't worry, we can throw it all in 55-gallon drums and dump it over a cliff when the time comes.

    • Just having this discussion with someone about AI in healthcare and how issues are going to be handled.

      If a nurse does something incorrectly, they can lose their license, ensuring that nurse will never be a nurse again. There is a very clear path of accountability and very clear ways to mitigate it.

      For instance, if a nurse is drunk and you recognize there is a pattern of people showing up drunk, you institute drug tests and breathalyzers and move on.

      While we probably won't have LLMs autonomously performing procedures, they are 100% parsing documentation, reading lab results, making suggestions, etc. And right now, the burden has been placed squarely on the clinicians themselves. The system will feed them the data, ask if they approve/agree, and then essentially wash its hands of accountability. Let's say an LLM starts incorrectly reading lab results: how is that fixed/remedied? A prompt update? Additional safeguards? Adjusting the temperature? Changing a model?

      This is a far different type of engineering that still feels pretty new. Granted, I'm still an amateur in this space (I use Claude Code a decent bit), but it feels really opaque to me right now.

    • > When a major data leak happens, who will they sue or fire? That is the value engineers provide. They understand, confirm, and take ownership.

      This goes for serious incidents, disasters, outages and security breaches.

      If there was an investigation and the answer was "a piece of software was vibe coded with AI" why would anyone trust the software vendor after that?

  • Are banks that concerned about velocity? Because moving fast and breaking things in the banking sector can get extremely expensive. It's also not a who-gives-a-shit industry like operating a taxi service or hosting images, but a very tightly regulated sector.

    • I might have been a bit broad with the brush. I can't speak for banks, but I can speak for the fintech/money-movement space (e.g. Remitly, Wise, Revolut).

      It's a race to get first-to-market for backend integrations/features. It's given rise to a culture of "move fast break things" where safety is only for some core features, but absolutely not for the constellation of other services we provide. Failure rates have increased almost a percentage point since Codegen/LLM adoption was mandated from up top.

      You would think regulators would be on top of this, but our industry runs on all actors "self reporting" their outages. Most don't unless they can't hide it (>1h)

    • 'Keeping up with regulations' may as well be a separate field from the core stuff. It has the same pressures as any other development effort. Managers will want the integration to the KYC service LLM'd as quickly as possible.

Regarding PRs: for the ones with complex requirements, what I am seeing is that time to initial PR is very short, and then a ping-pong between the reviewer and developer begins, because in my cases (not all) the developer vibe-coded parts and didn't deeply understand the requirements or their own code, so it takes multiple iterations for them to fix it. You can argue this is a human problem, but this is the net effect I'm seeing.

I am not sure, but for complex cases it seems to me that the earlier sum of moderately long PR time + moderately long review time has been replaced by very short PR time + even longer review time. I am not sure if there's a net gain in these cases. Sometimes, even if the code is functionally correct, it's verbose enough (e.g., too many intermediate functions) that I think it will impact future reviews.

> That first pillar is still there. Maybe the author isn't aware of the impact they have, but I know, with the evidence of reverted PRs, that when I step outside my area of deep knowledge I can no longer call BS on the agents. Our most capable agent, with access to the same kind of distributed systems the author talks about, is regularly wrong, frequently myopic, and just outright dumb constantly. It's the expertise of engineers on the team that push it back on track.

I'd posit there's another layer. You have domain knowledge, certainly. But more valuable still is the wisdom to find more.

Anthropic and OpenAI can stick financial regulations in the training data all they want, but the AI systems will never learn to anticipate the future, or reach out to clients, partners, or regulators in complicated situations.

  • > AI systems will never learn to anticipate the future

    Citation needed. I don’t see any reason these systems shouldn’t be able to speculate; indeed some would say that’s all they do, even about the past.

Yeah, I'm constantly shocked at how simultaneously smart and dumb Opus can be. It can tell me a LOT about my codebase, but it will miss very critical clarifications that I begin with. And when I call it out, it's obvious it remembered them; it just ignored them.

I agree with this experience. LLMs are great and save me a lot of time, but they need frequent nudges to avoid going down a completely wrong path. I just don't feel like the management dream of "every engineer has 3 agents working for them full time" is quite a reality yet. I'm not saying it won't get there, or that I feel secure being a software engineer until I'm of retirement age, but I also think it's important to understand the limitations of the tools. You do need to know your codebase. You do need to iterate on small chunks of it at a time. You do need to carefully understand every line of code you're putting into production. LLMs are amazing at generating a lot of proposals, but you need to carefully consider each one.

Most surprising to me about the article was the desire for OP's company to use AI for design docs. I feel like AI-generated design docs are some of the worst -- basically treating English as a programming language. They aren't enjoyable to read, and they often miss the forest for the trees. A human written sketch explaining why we're here and what we're working towards is still meaningful and important. If you want code-level details of every decision and algorithm, we have code for that.

I have mixed feelings on whether these documents are useful LLM inputs. I did a project where I carefully paired with Claude Code on producing a specification that another model would actually implement. I'm not sure it saved me any time, and it was very un-fun. (I kind of blame Opus 4.7 xhigh for this. It ain't speedy.) I feel like I can nitpick code to get exactly what I want, but defining exactly what I want an auto-mode LLM to go and do, in English, is much more difficult. I don't think the PLAN.md I generated would have been useful for a human trying to understand the system (too verbose), and Claude Code still made its usual mistakes that I have reminded it a billion times not to make (t.Context() in tests, not context.Background()!), so I'm just not sure it was worth it. I would say I probably wouldn't do it again in the near future. A rough sketch to get humans on board and to get the high level details worked out, written by hand, and then pairing with the LLM on actually typing in the code seems the most productive to me. But I do try to go outside my comfort zone once in a while to test the edges of these tools. They are very impressive and are worth a lot of the hype. (I know I will never write a YAML file again. I hate it more than anything, and Claude is amazing at it. But I worry I wouldn't feel the same way if I hadn't already had 8 years of k8s experience.)

> I pilot LLMs all day

Love the metaphor. Planes are sophisticated machines capable of auto-piloting, but humans are still needed to ultimately pilot the beast.

  • There is a product called Microsoft Copilot...

    • A slightly different metaphor. Copilot suggests it is next to you, helping you pilot... something else. The computer? The system? But "piloting the LLM" changes the relationship. The LLM is the thing that is being piloted.

You pilot LLMs all day but that might not last.

A lot of companies are investing money in “AI factories” that are going to automate a lot of software development (that is, steer LLMs) on the basis of Jira tickets (or Linear/Trello cards or whatever).

a year ago I would have agreed, but the gap is getting smaller all the time... these things can do 90% of the work, and how many people does a company really need for the remaining 10%? certainly not as many as they needed before

  • The things can do 90% of the work ... but only if used by the right people.

    I've seen first-hand what less experienced developers produce using the same models: your 90% accuracy suddenly drops to 50%...

    • With Opus 4.8 we're frankly approaching 100% of the work, but only if tasked by the right people. A decade ago I worked as an enterprise architect and left it because I preferred coding. Now I'm an enterprise architect again, and we're at the point where I've set up Microsoft Fabric and integrated an ADLS Gen2 with a Lakehouse, building Dimension and Fact tables for our Business Intelligence people with Cowork. A month ago I didn't know what Dimension and Fact tables were in a data warehouse, and now I've not only set up a flow for it, I've made it more accurate than what they had before, because I understood how BC365 worked and the previous consultants didn't.

      We had a PoC in place to get Fabric; it had like 500 hours allocated for what I did in a week with Cowork, and my product is actually on a secure VNet with Azure identity security, with both a test and a production environment delivering actual data.

      Cowork even made the damn PowerPoint slideshows for decision makers.

      The single saving grace right now is that it apparently isn't easy for everyone to do this yet. But I didn't use a whole lot of my knowledge of software engineering to make any of it happen, not even the pandas and arrow code that moves the data behind the scenes. I mainly used my knowledge of NIS2 compliance and general data architecture in a step-by-step process. To me, anyone with common sense should be capable of doing this, and I really don't think I'm special... but then I teach other people AI at our company and they can barely get it to create a running program. Which is fine for now, but I have to work another 20ish years before I retire, and by then a lot of young people will have grown up with AI, and like I said, I'm not special. I think the only thing that differentiates me is that I mash the buttons until it works but also have decades of security and compliance hammered into me.

[flagged]

  • "I ended up working in software development roles in the domains of finance, bookkeeping and payment processing, where I had great autonomy and a close and candid relationship with Product Managers and stakeholders.

    I learnt a lot about the domain and how to effectively write programs for it: PCI compliance, double-entry ledgers, escrows, reconciliation, payment lifecycles, bank transfer idempotency, etc.

    It was, then, obvious that I should focus my career on becoming an expert on that domain to stand out as a professional and differentiate myself in a field that showed signs of an increasing need for domain specialists."

  • The backend is the bit that "does stuff" so it's the part that needs to be correct.

    He said "Last year, I got hired by a company in the finance workspace."