Comment by josefritzishere
6 hours ago
I appreciate the directness of calling LLMs "Bullshit machines." This terminology for LLMs is well established in academic circles and is much easier for laypeople to understand than terms like "non-deterministic." I personally don't like the excessive hype around the capabilities of AI. Setting realistic expectations will drive better product adoption than carpet-bombing users with marketing.
I still have mixed feelings about LLMs.
Take the example of code (though this extends to many domains): it can sometimes produce near-perfect architecture and implementation if I give it enough detail about the technical constraints and pitfalls, turning an 8h coding job into 1h of review work.
On the other hand, it can be very wrong while acting certain it is right. Just yesterday Claude tried gaslighting me into accepting that the bug I was seeing came from a piece of code that already had strong guardrails, and it was adamant that the part I suspected could in no way cause the issue. Turns out I was right, but I was starting to doubt myself.
I think over time we will find better usage patterns for these machines. Even putting a model in a position where it can gaslight the user seems like a complete failure of the usage model. Not critiquing you at all on this; it's how these models are marketed and what all the tooling is built around. But they are incredibly useful, and I think once we figure out how to use them better, we can minimise these downsides and make ourselves much more productive without all the failures.
Of course that won't happen until the bubble pops - companies are racing to make themselves indispensable and to completely corner certain markets and to do so they need autonomous agents to replace people.
If it bullshits so much, you wouldn't have a problem giving me an example of it bullshitting on ChatGPT (paid version)? Let's take any example of a text prompt fitting within a few pages - it may be a question in science or math or any other domain. Can you get it to bullshit?
I like to let new models write a few lines of Latin poetry - they rarely get the meter right.
I don't have access to paid ChatGPT right now, but here's Opus 4.6 with extra thinking enabled: https://claude.ai/share/6e0e8ef5-06e4-4514-ba7e-299357c1fc55
The initial draft fucks up the meter in lines 3 and 8; the final version gets line 2 wrong ("venit meis") and is somewhat obnoxious in that verses 2 and 8 basically repeat each other. The thinking trace is useless and gives us no clue why the model exchanged a bland but metrically correct first distich for a more interesting but metrically incorrect one.
In fact, the "careful" examination of its own output completely skips the erroneously modified half-verse in line 2 - now, tell me that's a coincidence and not a sign of bullshitting.
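For anyone unfamiliar with the form: a dactylic hexameter line has six feet, where feet 1 through 4 may each be a dactyl (long-short-short) or a spondee (long-long), the fifth foot is almost always a dactyl, and the sixth counts as a spondee since its final syllable is anceps. The genuinely hard part (determining Latin syllable quantities) is skipped in this rough sketch, which just validates a hand-scanned foot sequence; the encoding and function name are my own invention:

    import re

    # Feet 1-4 may be a dactyl (D) or a spondee (S); foot 5 must be a
    # dactyl in the strict pattern; foot 6 counts as a spondee because
    # its final syllable is anceps.
    STRICT_HEXAMETER = re.compile(r"[DS]{4}DS")

    def is_hexameter(feet: str) -> bool:
        """feet: a hand-scanned foot sequence, one letter per foot."""
        return STRICT_HEXAMETER.fullmatch(feet) is not None

    # Aeneid 1.1 ("Arma virumque cano, Troiae qui primus ab oris")
    # scans dactyl-dactyl-spondee-spondee-dactyl-spondee:
    assert is_hexameter("DDSSDS")
    assert not is_hexameter("DDSSSS")  # non-dactylic fifth foot fails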
https://discuss.systems/@palvaro/116286268110078647
Arguing with Gemini Home Assistant about whether or not it can turn off the lights. When the user gets frustrated and tells the LLM to kill itself, the LLM turns off the lights.
I think you highlight one of the problems for users of LLMs: you can't tell anymore whether it's BS or not.
I caught Claude the other day hallucinating code that was not only wrong, but dangerously wrong, leading to tasks failing and never recovering. But it certainly wasn't obvious.
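The commenter doesn't show the actual code, but a hypothetical sketch of that kind of "dangerously wrong" pattern (all names here are invented, not Claude's output) is a worker loop that swallows every exception and acknowledges the task anyway, so a failed task is silently dropped and never retried:

    import queue

    def process_tasks(tasks: queue.Queue, handler) -> None:
        # Looks defensive, but the blanket except discards every error,
        # and task_done() runs regardless, so a failed task is never
        # retried or even logged; it simply vanishes.
        while True:
            task = tasks.get()
            try:
                handler(task)
            except Exception:
                pass  # the dangerous part: failure is invisible
            finally:
                tasks.task_done()  # acknowledged even on failure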
To me it’s the other way around. It’s difficult to trust (paid) ChatGPT’s output consistently.
When I need exact, especially up-to-date, facts, I have to constantly double-check everything.
I split my sessions into projects by topic, yet it regularly mixes things up in subtle and not-so-subtle ways. It seems to have no actual understanding of continuity, and especially not of causality.
It’s _very_ easy to lead it astray and have it confidently echo false assumptions.
In any case, I’ve become more precise at prompting and good at spotting when it fails. I think the trick is not to take its output too seriously.
> If it bullshits so much, you wouldn't have a problem giving me an example of it bullshitting on ChatGPT (paid version)?
There's an entire paragraph in the essay about aphyr's direct experience with ChatGPT failures and sustained bullshitting that we'd never expect from a moderately-skilled human who possesses at least two functioning brain cells. That paragraph begins "I have recently argued for forty-five minutes with ChatGPT". Do notice that there are six sentences in the paragraph. I encourage you to read all of them (and make sure to check out the footnote... it's pretty good).
The exact text of the ChatGPT session is irrelevant; even if you reported that you were unable to reproduce the issue, it would only reinforce one of the underlying points, namely that these systems are unreliable. aphyr has a pretty extensive body of published work indicating that he'd be unlikely to fabricate a story of an LLM repeatedly failing to accomplish a task that any moderately-skilled human could accomplish when equipped with the proper tools. So, I believe that his report is true and accurate.
There's also this seven-week-old example [0] (linked in the essay) of ChatGPT very confidently recommending an asinine course of action because it was unable to understand what the hell it was being told.
Listening to the audio is not required, as there's a reasonably accurate on-screen transcript, but it is valuable to hear just how very hard they've worked to make this tool sound both confident and capable, even in situations where it's soul-crushingly incorrect. Those of us who have worked in Blasted Corporate Hellscapes may recognize how this manner of speaking can be very, very compelling to a certain sort of person (who, as it turns out, is frequently found in a management position).
[0] <https://www.instagram.com/reel/DUylL79kvub/>