Comment by bensyverson
8 hours ago
I get the frustration, but it's reductive to just call LLMs "bullshit machines" as if the models are not improving. The current flagship models are not perfect, but if you use GPT-2 for a few minutes, it's incredible how much the industry has progressed in seven years.
It's true that people don't have a good intuitive sense of what the models are good or bad at (see: counting the Rs in "strawberry"), but this is more a human limitation than a fundamental problem with the technology.
Two things can be true at the same time: The technology has improved, and the technology in its current state still isn't fit for purpose.
I stress test commercially deployed LLMs like Gemini and Claude with trivial tasks: sports trivia, fixing recipes, explaining board game rules, etc. It works well like 95% of the time. That's fine for inconsequential things. But you'd have to be deeply irresponsible to accept that kind of error rate on things that actually matter.
The most intellectually honest way to evaluate these things is how they behave now on real tasks. Not with some unfalsifiable appeal to the future of "oh, they'll fix it."
The errors are also not distributed in the same way as you'd expect from a human. The tools can synthesize a whole feature in a moderately complicated web app including UI code, schema changes, etc, and it comes out perfectly. Then I ask for something simple like a shopping list of windshield wipers etc for the cars and that comes out wildly wrong (like wrong number of wipers for the cars, not just the wrong parts), stuff that a ten year old child would have no trouble with. I work in the field so I have a qualitative understanding of this behavior but I think it can be extremely confusing to many people.
One of the reasons I'm comfortable using them as coding agents is that I can and do review every line of code they generate, and those lines of code form a gate. No LLM-bullshit can get through that gate, except in the form of lines of code, that I can examine, and even if I do let some bullshit through accidentally, the bullshit is stateless and can be extracted later if necessary just like any other line of code. Or, to put it another way, the context window doesn't come with the code, forming this huge blob of context to be carried along... the code is just the code.
That exposes me to when the models are objectively wrong and helps keep me grounded with their utility in spaces I can check them less well. One of the most important things you can put in your prompt is a request for sources, followed by you actually checking them out.
And one of the things the coding agents teach me is that you need to keep the AIs on a tight leash. What is their equivalent in other domains of them "fixing" the test to pass instead of fixing the code to pass the test? In the programming space I can run "git diff *_test.go" to ensure they didn't hack the tests when I didn't expect it. It keeps me wondering what the equivalent of that is in my non-programming questions. I have unit testing suites to verify my LLM output against. What's the equivalent in other domains? Probably some other isolated domains here and there do have some equivalents. But in general there isn't one. Things like "completely forged graphs" are completely expected but it's hard to catch this when you lack the tools or the understanding to chase down "where did this graph actually come from?".
The success with programming can't be translated naively into domains that lack the tooling programmers built up over the years, and based on how many times the AIs bang into the guardrails the tools provide I would definitely suggest large amounts of skepticism in those domains that lack those guardrails.
> the technology in its current state still isn't fit for purpose.
This is a broad statement that assumes we agree on the purpose.
For my purpose, which is software development, the technology has reached a level that is entirely adequate.
Meanwhile, sports trivia represents a stress test of the model's memorized world knowledge. It could work really well if you give the model a tool to look up factual information in a structured database. But this is exactly what I meant above; using the technology in a suboptimal way is a human problem, not a model problem.
There's nothing in these models that say its purpose is software development. Their design and affordances scream out "use me for anything." The marketing certainly matches that, so do the UIs, so do the behaviors. So I take them at their word, and I see that failure modes are shockingly common even under regular use. I'm not out to break these things at all. I'm being as charitable and empirical as I can reasonably be.
If the purpose is indeed software development with review, then there's nothing stopping multi-billion dollar companies from putting friction into these sytems to direct users towards where the system is at its strongest.
1 reply →
Which things actually matter? I think we can all agree that an LLM isn't fit for purpose to control a nuclear power plant or fly a commercial airliner. But there's a huge spectrum of things below that. If an LLM trading error causes some hedge fund to fail then so what? It's only money.
Not to mention that it would then make some hedge fund with a better backtesting harness or more AI scrutiny more successful thus keeping the financial market work as designed.
> I stress test commercially deployed LLMs like Gemini and Claude with trivial tasks: sports trivia, fixing recipes, explaining board game rules, etc. It works well like 95% of the time. That's fine for inconsequential things. But you'd have to be deeply irresponsible to accept that kind of error rate on things that actually matter.
95% is not my experience and frankly dishonest.
I have ChatGPT open right now, can you give me examples where it doesn't work but some other source may have got it correct?
I have tested it against a lot of examples - it barely gets anything wrong with a text prompt that fits a few pages.
> The most intellectually honest way to evaluate these things is how they behave now on real tasks
A falsifiable way is to see how it is used in real life. There are loads of serious enterprise projects that are mostly done by LLMs. Almost all companies use AI. Either they are irresponsible or you are exaggerating.
Lets be actually intellectually honest here.
>95% is not my experience and frankly dishonest.
Quite frankly, this is exactly like how two people can use the same compression program on two different files and get vastly different compression ratios (because one has a lot of redundancy and the other one has not).
8 replies →
Six months bro, we're still so early
Whether LLMs can create correct content doesn't matter. We've already seen how they are being used and will be used.
Fake content and lies. To drive outrage. To influence elections. To distract from real crimes. To overload everyone so they're too tired to fight or to understand. To weaken the concept that anything's true so that you can say anything. Because who cares if the world dies as long as you made lots of money on the way.
> Because who cares if the world dies as long as you made lots of money on the way.
Guiding principle of the AI industry
It's really the whole tech industry as it exists right now and AI is a victim of bad timing. If this AI had been invented 40 years ago there'd have been a lower ceiling on the damage it could do.
Another way of saying that is that capitalism is the real problem, but I was never anti-capitalist in principle, it's just gotten out of hand in the last 5-10 years. (Not that it hadn't been building to that.)
2 replies →
Computer graphics have been improving for decades but the uncanny valley remains undefeated. I don't know why anyone expects a breakthrough in other areas. There's a wall we hit and we don't understand our own consciousness and effectiveness well enough to replicate it.
We have credible deepfakes on demand. (To be fair, there have been deceptive photos as long as photos have existed, but the cost of automating their creation going to basically zero has a social impact)
We can use AI to make video clips to trick boomers on Facebook into thinking Obama eats babies. They already want to believe it. AI isn't outputting real full-length books and movies.
In computer graphics we understand how it works, we just lack the computational power to do it real time, but we can with sufficient processing produce realistic looking images with physically accurate lighting. But when it comes to cognition its a lot of guesswork, we haven't yet mapped out the neuron connections in a brain, we haven't validated it works as popular science writing suggests. We don't understand intelligence, so all we can do is accidentally bumble into it and it seems unlikely that will just happen especially when its so hard to compute what we are already doing.
That's not why the author calls them bullshit machines.
> One way to understand an LLM is as an improv machine. It takes a stream of tokens, like a conversation, and says “yes, and then…” This yes-and behavior is why some people call LLMs bullshit machines. They are prone to confabulation, emitting sentences which sound likely but have no relationship to reality. They treat sarcasm and fantasy credulously, misunderstand context clues, and tell people to put glue on pizza.
Yes, there have been improvements on them, but none of those improvements mitigate the core flaw of the technology. The author even acknowledges all of the improvements in the last few months.
Bullshit is the perfect term here, even as AI's get so much better and capable Brandolini's Law aka the "bullshit asymmetry principle" always applies--the energy required to refute misinformation is an order of magnitude larger than that needed to produce it. Even to use AIs effectively today requires a very good BS detector--some day in the future it won't.
models are improving. the pricing already assumes they're ready for prod. that's where the fires start
Calling LLMs "bullshit machines" is a reference to a 2024 paper [1] which itself uses the concept of "bullshit" as defined in the essay/book "On Bullshit" by Harry G. Frankfurt [2]. The TL;DR is that LLMs are fundamentally bullshit machines because they are only made to generate sentences that sound plausible, but plausible does not always mean true.
[1]: https://link.springer.com/article/10.1007/s10676-024-09775-5
[2]: https://en.wikipedia.org/wiki/On_Bullshit
it's not a bullshit machine because its output is bad, it's a bullshit machine because its output is literally 'bullshit' as in, output that is statistically likely but with no factual or reasoning basis. as the models have improved, their bullshit is more statistically likely to sound coherent (maybe even more likely to be 'accurate'), but no more factual and with no more reasoning.
However, when fed source material into the context they will lie less, right? So at this point is it not just a battle of the nines until it's called "good enough"?
I also wonder if I leave my secretary with a ream of papers and ask him for a summary how many will he actually read and understand vs skim and then bullshit? It seems like the capacity for frailty exists in both "species".
It doesn't matter how good the models become. They can only deal in bullshit, in the academic use of the term.
They are bullshit machines because they do not have an internal mental model of truth like a human does. The flagship models bullshit less, but their fundamental architectures prevent having truth interfere with output.
https://philosophersmag.com/large-language-models-and-the-co...
"Bullshit" is a human concept. LLMs do not work like the human brain, so to call their output "bullshit" is ascribing malice and intent that is simply not there. LLMs do not "think." But that does not mean they're not incredibly powerful and helpful in the right context.
I sort of agree. In this context "bullshit" means "speech intended to persuade without regard for truth", and while it's true that LLM output is without regard for truth, it's not an entity capable of the agency to persuade, although functionally that is what it can appear like.
https://en.wikipedia.org/wiki/On_Bullshit
> it's reductive to just call LLMs "bullshit machines" as if the models are not improving
This is true, but I prefer to think of it as "It's delusional to pretend as if human beings are not bullshit machines too".
Lies are all we have. Our internal monologue is almost 100% fantasy. Even in serious pursuits, that's how it works. We make shit up and lie to ourselves, and then only later apply our hard-earned[1] skill prompts to figure out whether or not we're right about it.
How many times have the nerds here been thinking through a great new idea for a design and how clever it would be before stopping to realize "Oh wait, that won't work because of XXX, which I forgot". That's a hallucination right there!
[1] Decades of education!
I'm not entirely sure I can agree, although the premise is seductive in certain ways. We do lie to ourselves, but we also have meta-cognition - we can recognise our own processes of thought. Imperfect as it may be, we have feedback loops which we can choose to use, we have heuristics we can apply, we can consciously alter our behaviour in the presence of contextual inputs, and so on.
Being wrong is not the same as a hallucination. It's a natural step on a journey to being more right. This feels a bit like Andreesen proudly stating he avoids reflection - you can act like that, but the human brain doesn't have to. LLMs have no choice in the matter.
The problem, unfortunately, is the scale. It's always scale. Humans make all the kinds of mistakes that we ascribe to LLMs, but LLMs can make them much faster and at much larger scale.
Models have gotten ridiculously better, they really have, but the scale has increased too, and I don't think we're ready to deal with the onslaught.
Scale is very different, but I wonder if human trust isn't the real issue. We trust technology too much as a group. We expect perfection, but we also assume perfection. This might be because the machines output confident sounding answers and humans default to trusting confidence as an indirect measure for accuracy, but I think there is another level where people just blindly trust machines because they are so use to using them for algorithms that trend towards giving correct responses.
Even before LLMs where in the public's discourse, I would have business ask about using AI instead of building some algorithm manually, and when I asked if they had considered the failure rate, they would return either blank stares or say that would count as a bug. To them, AI meant an algorithm just as good as one built to handle all edge cases in business logic, but easier and faster to implement.
We can generally recognize the AIs being off when they deal in our area of expertise, but there is some AI variant of Gell-Mann Amnesia at play that leads us to go back to trusting AI when it gives outputs in areas we are novices in.
"Lies are all we have."
If so, how do we distinguish between code that works and code that doesn't work? Why should we even care?
> If so, how do we distinguish between code that works and code that doesn't work?
Hilariously, not by using our brains, that's for sure. You have to have an external machine. We all understand that "testing" and "code review" are different processes, and that's why.
3 replies →
So your logic is humans and LLMs are the same because humans are wrong sometimes?
Pretty much, yeah. Or rather, the fact that we're both reliably wrong in identifiably similar ways makes "we're more alike than different" an attractive prior to me.
1 reply →
Humans are different. Humans - at least thoughtful humans - know the difference between knowing something and not knowing something. Humans are capable of saying "I don't know" - not just as a stream of tokens, but really understanding what that means.
> Humans - at least thoughtful humans - know the difference between knowing something and not knowing something.
Your no-true-scotsman clause basically falsifies that statement for me. Fine, LLMs are, at worst I guess, "non-thoughtful humans". But obviously LLMs are right an awful lot (more so than a typical human, even), and even the thoughtful make mistakes.
So yeah, to my eyes "Humans are NOT different" fits your argument better than your hypothesis.
(Also, just to be clear: LLMs also say "I don't know", all the time. They're just prompted to phrase it as a criticism of the question instead.)
2 replies →