Comment by Aurornis

2 days ago

> This is why I use AI for all my medical questions and doctors use AI to write software, and we both smirk at the quality the other person is getting from it.

There is an interesting third group emerging: People who acknowledge the quality problem, but think they can deal with it by applying more AI to the output.

This takes the form of people who spin up a lot of "agents" and give them personalities like security director or quality director (which are unnecessarily complex and maddeningly unpredictable ways to trigger an LLM session for doing a security review or a quality check pass).

It also includes the person who knows that their app is full of bugs, but thinks it's not a problem because they can have the AI fix the bugs as they show up. People in this class haven't encountered security breaches or data loss bugs yet. They think it's all about having Claude fix that div that isn't centered or handle that error code that shows up some times.

> People who acknowledge the quality problem, but think they can deal with it by applying more AI to the output.

Brute Force: if it doesn't work, you're just not using enough.

What if they're right though?

  • It does not have to be brushed away as "brute force" necessarily. We can, and do, build more reliable systems out of less reliable components. In fact, most industrial engineering accepts some defect rate and builds margins around it.

    Software is no different. Even without AI, you already have buggy compilers and buggy OSes and buggy libraries. You just tend to accept the risk because you have some idea of what the failure modes are and can work around it or manage the risk in some other way (buy literal insurance.)

    • > you already have buggy compilers and buggy OSes and buggy libraries.

      Which run, I must add, on effectively infallible hardware. Most of the software straight up assumes that the CPUs and the RAM will function perfectly and don't bother even trying to detect such failures (unless those failures manifest themselves in a catastrophic manner, the show will simply go on).

      So in effect, we also can, and do, build less reliable systems out of more reliable components, and that's how software is different.

      6 replies →

  • There are other places where some process has an error rate and you make up for that error rate by doing the work more than once and then comparing results. For example, I've heard in a video that satellites and other space craft often have 3 or 4 processors and compare the results to make sure there were no errors due to radiation. Similarly, we have RAID arrays that store data multiple times because disks can fail. So, even if AI has a failure rate of like 20%, maybe you can make up for that by running the same prompt multiple times with slight variations or with different models, comparing the results and choosing the best.

  • I've seen it turn right in business contexts. Sometimes you can even lower your standard of "good enough" and find quantity has a quality all its own.

    But it requires taste and engineering to do it right, and on the right things. It'll be an interesting few years.

    • I think it also requires someone who knows just enough to be able to navigate between those ideas that will set you back and those which will propel you forward. At the end of the day, you still need some human filter.

  • they are right. bad output is user error. there, am i suiting the role appropriately? i do like 65% believe that, fwiw.

> There is an interesting third group emerging: People who acknowledge the quality problem, but think they can deal with it by applying more AI to the output.

That's the entire big tech's business strategy right now.

I'm in a similar-ish boat here. I acknowledge that what I paid an LLM $100 to develop isn't as good as what if pay a human $100,000 to do, but it's "good enough" to solve the problem.

How did you get over 52,000 karma in under 3 years with no submissions at all?

Are you averaging like 2000+ comments a month?

  • They spin up agents, and then give them roles like commenter, and director of quality for the commenter. Although I'm unsure how the director helps since I've never seen one do actual work.

  • Commenting more than I should, to be honest.

    I have a few periods during my daily routine where I’m waiting somewhere away from the computer and need a break from email.

    A lot of my comments have double digit upvotes and some get into the mid hundreds. I try to actually read articles and provide thoughtful comments, which gets upvoted a lot more than the throwaway.

    > Are you averaging like 2000+ comments a month?

    52000 / 3 years would be under 1500 points per month or 48 points per day. That could be done with 1-2 helpful comments per day on popular threads.

    • I browse HN a bit more than I should and I see you and simonw around a lot, like you said always providing thoughtful commentary.

      When I write comments on here I tend to spend upwards of 15 minutes to draft and reformulate my comments. Sometimes double-checking what I'm about to say (sometimes not thoroughly enough as some of my recent comments show) and I was wondering if you have a similar experience in that regard or do you just manage to fire off a comment in a stream of thought fashion from start to end?

    • Serious, non-acusatory question. Your writing looks human. Do you use any writing assistants?

      Where else, other than HN, do you post?

  • 3 pages deep into their comment history only brings me to 5 days ago so probably yes.

>It also includes the person who knows that their app is full of bugs, but thinks it's not a problem because they can have the AI fix the bugs as they show up. People in this class haven't encountered security breaches or data loss bugs yet.

How come? Their human code didn't have any of either all those decades?

> There is an interesting third group emerging: People who acknowledge the quality problem, but think they can deal with it by applying more AI to the output.

Ah yes, the known unknowns.

The discussion reminds me of a talk Zizek gave in which he discusses the speech Rumsfeld gave regarding the evidence Iraq supplying weapons to terrorist[0].

Zezik argues the unknown knowns are far more interesting (and the reason why USA was losing in Iraq). While Rumsfeld focused on the unknown unknowns.

I've noticed that domain experts who implicitly know the the known unknowns of their field distrust LLMs because they can identify their shortcomings. Those subtle mistakes LLMs make. I argue this is why domain experts using LLMs get such a boost. They can identify and avoid pitfalls sometimes before they happen. But in other fields the same people are in awe of LLM capabilities precisely because the known unknowns are a mystery.

The Unknown Unknowns of LLMs are the IMO the most interesting. The so called emergent capabilities of the technology. The use of LLMs in others fields such as biology, eg in protein language models, is really cool.

Everyone focuses on replacement of people workers when I think opening new fields of work for humans should be the goal of LLMs by leveraging the tech to discover.

The other interesting caregory is unknown knows. But that's another topic for another time.

[0] https://en.wikipedia.org/wiki/There_are_unknown_unknowns

  • As an aside, the mass mockery in response to Rumsfeld's statement always bothered me because it's the single most intelligent statement he ever made about the Iraq war, and if he had started out with that mindset things probably would not have gone nearly as pear-shaped as they did.

    • This is one of those classic "sounds dumb / doesn't play well on TV but is actually smarter than most of the other people babbling about it" things. Nassim Taleb has written for example about how maddening it is to watch world-class economists who are also just sort of awkward and a little nerdy go on TV and "lose" to blowhards who don't actually know what the hell they're talking about but appear confident and look good on camera. Thankfully in Rumsfeld's case I think as time has gone on it's become a pretty respected statement about risk even if people still occasionally find the phrasing a bit amusing.

I always imagine the model rolling its silicon eyes when it’s assigned a personality (“you are an expert growth hacker”) at the start of the prompt. Was that ever actually shown to be effective? Is it still?

  • > Was that ever actually shown to be effective? Is it still?

    Yes! Personas demonstrated measurable improvement in a few different ways, with caveats of course. The common intuition is that personas influence token space in beneficial ways.

    I'll come back here later on desktop and link a few (still) relevant papers on this topic.

  • I remember there were some studies that this kind of thing was effective a year or so ago, so essentially a lifetime in Model years.

    However to me it seems completely reasonable that it would work, because my understanding of what happens is the model interprets what you said as:

    Look for a group of people who are considered to be expert growth hackers by the world at large and answer my questions as though they were answering them.

    So assuming that there are a set of questions that can best be answered by people that most other people identify as expert growth hackers then yes, I believe assigning a personality in this way should obviously work.

    • I imagined it as kind of a shorthand for "you should be spending my tokens on looking for / addressing issues like X, Y, and Z," where X, Y, and Z are the sorts of things that an expert in [insert domain here] would be likely to care most about.

      3 replies →

    • It's been interesting to see how aggressively some reasoning models like to "reason" by analogy. They love to say things like "it's like a CPU" or "it's like a highway", and then they start to make logical leaps based off that rather than just using it for user explanation. Gemini 2.5 and 3.1 Pro have been particularly bad for this type of behavior. Telling models to "speak as though you are a physiologist considering the case with an expert colleague" gets them to "reason" using a more correct linguistic substrate.

      The Opus models over the last year doesn't seem as vulnerable to this type of behavior and I've noticed the "identify as expert" prompt tricks aren't as meaningful there.

    • I propose we move away from the framing of "Model years" - they're standard human research years. Yes, likely more people are working on it, and also working harder, but ever since we acquired a certain amount of compute in the world, many people were able to independently find the same patterns and train models.

  • I feel it helps for the personality aspect, how it handles answers and general vocabulary, but it doesn’t in any way improve skill level, at least that’s my take from building an assistant.

  • It reminds me when people would stuff their image prompts with things like NO DEFORMED FINGERS.

  • I've always wondered if the go-to should have been prefilling its response with "I am an expert growth leader, and here are my thoughts:".

  • There was a time when stuff like "Unreal Engine, trending on ArtStation, 8K resolution" actually worked when prompting image gen models because such labels actually correlated with higher-quality images in the web-crawled training datasets available back then.

  • Back with some papers. (Apologies in advance; I typically don't edit/format comments much here, please bear with me.)

    Notable papers describing performance improvements with prescribed roles and personas:

    - ExpertPrompting: Instructing Large Language Models to be Distinguished Experts (2023) https://arxiv.org/abs/2305.14688 (if you're going to only read one paper here, maybe read this one but know there has been a lot of follow up with more modern models.)

    - Expert Personas Improve LLM Alignment but Damage Accuracy (2026) https://arxiv.org/abs/2603.18507

    - When Does Persona Prompting Actually Help? (2026) https://arxiv.org/abs/2605.29420

    - Unveiling Power on Combining Prompt Engineering Techniques: An Experimental Evaluation on Code Generation (2025) https://doi.org/10.5753/sbbd.2025.247251

    - A Pattern Language for Persona-based Interactions with LLMs (2025) https://www.dre.vanderbilt.edu/~schmidt/PDF/Persona-Pattern-...

    A TLDR of my *admittedly heavily biased* mental model (so take it with a grain of salt): personas do improve task alignment and precision to measurable effect but with observed negative impact to accuracy and knowledge grounding. Overall, this makes it quite suitable and preferred for code generation scenarios. (Don't over-index on 'accuracy' here as meaning "bad code", it's more about verbosity/jargon reducing clarity of higher order goals like business objectives and system architecture.)

    Outside of code generation, personas have the interesting effect of increasing implicit biases and stereotypes. It's not hard to imagine something like "you are a left|right wing politician ..." or "you are a senior-citizen|teenager ..." influencing token space construction considerably.

  • From what I've heard, personas give a greater chance that the LLM will answer confidently.. and also a greater chance it'll hallucinate something when the data is sparse. Supposedly "grounding" the personas on real documents/web searches is the best approach. Anecdotal though.

  • The reason it seems suspicious is that it's phrased in a way that's oriented towards humans. I haven't tested this, but I suspect you'd get similar results if you said something like "orient your response to that of a growth hacker." Either one is likely to have the desired effect on the stochastic result.

  • At least in the beginning of spicy autocomplete, this sort of role-play did work pretty dramatically at aligning a conversation to a task, though I don't think anyone ever tested it versus somewhat less cringe priming.

    After that, cargo cults do what they do best.

    • > though I don't think anyone ever tested it versus somewhat less cringe priming.

      I really wonder if phrasing it differently would make a difference. In good faith conversations, it just doesn't happen that someone tells someone else who that person is.

> People who acknowledge the quality problem, but think they can deal with it by applying more AI to the output.

This is just like throwing more money at a problem, hoping that it might solve it, but instead one throws tokens.

3 out of 5 voting works quite well for hardware sensors and for computing in space.

No reason why it won't improve the quality of the agents output too, eventually. Spin 5 from different providers, take the vote.