Comment by Humorist2290
5 days ago
> Again, we have moved past hallucinations and errors to more subtle, and often human-like, concerns.
From my experience we just get both: the constant risk of some catastrophic hallucination buried in the output, on top of the more subtle, pervasive concerns. I haven't tried with Gemini 3, but when I prompted Claude to write a 20-page short story it couldn't even keep basic chronology and characters straight. I wonder if the 14-page research paper would stand up to scrutiny.
I feel like hallucinations have changed over time, from factual errors randomly shoehorned into the middle of sentences to the LLMs confidently telling you they're right and even providing their own reasoning to back up their claims, backed most of the time by references that don't exist.
I recently tasked Claude with reviewing a page of documentation for a framework and writing a fairly simple method using it. It spit out some great-looking code, but sadly it made up an entire stack of functionality that the framework doesn't support.
The conventions even matched the rest of the framework, so it looked kosher, and I had to do some searching to see if Claude had referenced an outdated or beta version of the docs. It hadn't - it just hallucinated the functionality completely.
When I pointed that out, Claude quickly went down a rabbit hole of writing some very bad code and attempting very unconventional things (modifying configuration code in a different part of the project that wasn't needed for the task at hand) to accomplish the goal. It was almost as if it were embarrassed and rushing toward an acceptable answer.
I've noticed the new OpenAI models contradict themselves a lot more than I've ever seen before! Things like:
- Aha, the error clearly lies in X, because ... so X is fine, the real error is in Y ... so Y is working perfectly. The smoking gun: Z ...
- While you can do A, in practice it is almost never a good idea because ... which is why it's always best to do A
I've seen it do this too. I had it keeping a running tally over many turns, and occasionally it would say something like: "... bringing the total to 304.. 306, no 303. Haha, just kidding, I know it's really 310." With the last number being the right one. I'm curious whether it's an organic behavior or a taught one. It could be self-learned through reinforcement learning, a way to correct itself since it doesn't have access to a backspace key.
Yeah.
I worked with Grok 4.1 and it was awesome until it wasn't.
It told me to build something, only to tell me at the end that I could have done it smaller and cheaper.
And it did that multiple times.
Best reply was the one that ended with something along the lines of "I've built dozens of them!"
I like when they tell you they’ve personally confirmed a fact in a conversation or something.
I got a 3000 word story. Kind of bland, but good enough for cheating in high school.
See prompt, and my follow-up prompts instructing it to check for continuity errors and fix them:
https://pastebin.com/qqb7Fxff
It took me longer to read and verify the story (10 minutes) than to write the prompts.
I got illustrations too. Not great, but serviceable. Image generation costs more compute to iterate on and correct errors in.
Disappointingly, that is an exceedingly good story for a high school assignment. The use of an appositive phrase alone would raise alarm bells though.
It's nitpicking, but why not -- what lens on an old DSLR, one older than a car, will let you take a macro shot, a wide shot, and a zoom shot of a bird?
In any case I'm not surprised. It's a short story, and it is indeed _serviceable_, but literature is more than just service to an assignment.