Comment by simonw

2 months ago

Were you running it inside a coding agent like Codex?

If so then it should have realized its mistake when it tried to run those CLI commands and saw the error message. Then it can try something different instead.

If you were using a regular chat interface and expecting it to know everything without having an environment to try things out then yeah, you're going to be disappointed.

4 comments

heavyset_go 2 months ago

No, Codex doesn't have permission to install random software on my machine and then execute it to see if it's real or a hallucination.

CLI utility here means software with a CLI, not classic Unix-y CLI tools.

The WebDAV hallucinations happened in the chat interface.

varenc 2 months ago

It's not an all-or-nothing permission. The way I use Claude Code, it has to ask me for permission for every CLI tool use. This seems like a reasonable way to balance security with utility, and it would allow the agent to correct itself when it hallucinates CLI tools. Or just run it in an isolated container where it can't break anything and give it full permissions.
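
To make the per-command approval idea concrete, here is a rough sketch in Python. It is not how Claude Code or Codex actually implement their permission prompts; `run_with_approval` and the `approve` callback are hypothetical names made up for illustration. The only assumption is that the agent proposes a shell command as a string and that whatever comes back is handed to the agent as feedback.

```python
import shlex
import shutil
import subprocess
from typing import Callable


def run_with_approval(command: str, approve: Callable[[str], bool]) -> dict:
    """Run one agent-proposed command only if the tool exists and the user approves.

    Hypothetical helper: `approve` stands in for whatever prompt the user sees.
    """
    argv = shlex.split(command)
    if not argv:
        return {"status": "rejected", "detail": "empty command"}

    # A hallucinated tool is caught here, before anything is installed or run.
    if shutil.which(argv[0]) is None:
        return {"status": "not_found", "detail": f"{argv[0]!r} is not on PATH"}

    # Nothing executes without an explicit yes for this exact command.
    if not approve(command):
        return {"status": "denied", "detail": "user declined"}

    # No shell is involved: the argument list is passed directly to the program.
    result = subprocess.run(argv, capture_output=True, text=True)
    return {
        "status": "ran",
        "returncode": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
    }
```

In a loop, the `not_found` or `denied` result (or the captured `stderr`) would go back to the agent so it can try something different, and `approve` could be as simple as `lambda cmd: input(f"Run {cmd}? [y/N] ").lower() == "y"`.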
- heavyset_go 2 months ago
  
  I don't want any LLM tool prompting me to install and run software it makes up on the fly.
  Typosquatting is a thing, for example, and I'm sure hallucination squatting will be, too.
  I also don't want to run anything in a "sandbox", either. Containers are not sandboxes despite things like the Gemini CLI pretending they are.

hashhar 2 months ago

Codex for me behaves very junior engineer-ish. Claude is smarter and tries to think long term.

A great example of their behaviours for a problem that isn't 100% specified in detail (because detail would need iterations) is available at https://gist.github.com/hashhar/b1215035c19a31bbe4b58f44dbb4....

I gave both Codex (GPT5-ExHi) and Claude (Opus 4.5 Thinking) the exact same prompts and the end results were very different.

The most interesting bit was asking both of them to try to justify why there were differences and then critique each other's code. Claude was so good at this: it took the best parts of GPT's code, fixed a bug there, and ended up with a pretty nice implementation.

The Claude-generated code was much better organised too (less script-like, more program-like).