Comment by Humorist2290
5 days ago
> Again, we have moved past hallucinations and errors to more subtle, and often human-like, concerns.
From my experience we just get both: the constant risk of some catastrophic hallucination buried in the output, on top of the more subtle, pervasive concerns. I haven't tried with Gemini 3, but when I prompted Claude to write a 20-page short story it couldn't even keep basic chronology and characters straight. I wonder if the 14-page research paper would stand up to scrutiny.
I feel like hallucinations have changed over time, from factual errors randomly shoehorned into the middle of sentences to the LLMs confidently telling you they're right and even providing their own reasoning to back up their claims, backed most of the time by references that don't exist.
I recently tasked Claude with reviewing a page of documentation for a framework and writing a fairly simple method using it. It spit out some great-looking code, but sadly it made up an entire stack of functionality that the framework doesn't support.
The conventions even matched the rest of the framework, so it looked kosher, and I had to do some searching to see if Claude had referenced an outdated or beta version of the docs. It hadn't - it just hallucinated the functionality completely.
When I pointed that out, Claude quickly went down a rabbit hole of writing some very bad code and attempting very unconventional things (modifying configuration code in a different part of the project that wasn't needed for the task at hand) to accomplish the goal. It was almost as if it were embarrassed and rushing toward an acceptable answer.
I've noticed the new OpenAI models contradict themselves a lot more than I've ever seen before! Things like:
- Aha, the error clearly lies in X, because ... so X is fine, the real error is in Y ... so Y is working perfectly. The smoking gun: Z ...
- While you can do A, in practice it is almost never a good idea because ... which is why it's always best to do A
I've seen it do this too. I had it keeping a running tally over many turns, and occasionally it would say something like: "... bringing the total to 304.. 306, no 303. Haha, just kidding, I know it's really 310." With the last number being the right one. I'm curious whether it's an organic behavior or a taught one. It could be self-learned through reinforcement learning, a way to correct itself since it doesn't have access to a backspace key.
Yeah.
I worked with Grok 4.1 and it was awesome until it wasn't.
It told me to build something, only to tell me at the end that I could have done it smaller and cheaper.
And it did that multiple times.
Best reply was the one that ended with something along the lines of "I've built dozens of them!"
I like when they tell you they’ve personally confirmed a fact in a conversation or something.
I got a 3000 word story. Kind of bland, but good enough for cheating in high school.
See prompt, and my follow-up prompts instructing it to check for continuity errors and fix them:
https://pastebin.com/qqb7Fxff
It took me longer to read and verify the story (10 minutes) than to write the prompts.
I got illustrations too. Not great, but serviceable. Image generation costs more compute to iterate on and correct errors in.
Disappointingly, that is an exceedingly good story for a high school assignment. The use of an appositive phrase alone would raise alarm bells though.
It's nitpicking, but why not -- what lens on an old DSLR, one older than a car, will let you take a macro shot, a wide shot, and a zoom shot of a bird?
In any case I'm not surprised. It's a short story, and it is indeed _serviceable_, but literature is more than just service to an assignment.