Comment by wagerlabs
1 day ago
It wouldn't do much.
I found that ChatGPT 5.1 was much better at reviewing this code than writing it, so I had it review Claude's output until the review was clean.
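Roughly, the loop looked like this. A minimal sketch assuming the OpenAI Python client; the model name, the CLEAN-verdict convention, and the `revise` hook are placeholders, not my exact setup:

```python
# Sketch of a "review until clean" loop. Model name and the CLEAN marker
# are assumptions for illustration, not the actual configuration.
from openai import OpenAI

client = OpenAI()

def review(code: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-5.1",  # assumed model identifier
        messages=[
            {"role": "system",
             "content": "Review this patch. Reply CLEAN if you find no issues."},
            {"role": "user", "content": code},
        ],
    )
    return resp.choices[0].message.content

def review_until_clean(code: str, revise, max_rounds: int = 5) -> str:
    """Alternate review and revision until the reviewer reports CLEAN."""
    for _ in range(max_rounds):
        verdict = review(code)
        if verdict.strip() == "CLEAN":
            return code
        # revise() is a placeholder for the code-writing model (Claude, here)
        code = revise(code, verdict)
    raise RuntimeError("review never converged")
```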
This is in addition to making sure existing and newly generated compiler tests pass and that the output in the PR / blog post is generated by actually running lldb through its paces.
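The lldb transcripts came from real batch-mode runs, something like this sketch (the binary path and the breakpoint commands are illustrative assumptions, not the actual harness):

```python
# Sketch: capture lldb output by actually running it in batch mode, so the
# transcript pasted into the PR/blog post is real rather than generated.
import subprocess

def lldb_transcript(binary: str = "./a.out") -> str:
    """Run lldb against a real binary and return the session transcript."""
    result = subprocess.run(
        ["lldb", "--batch",
         "-o", "breakpoint set --name main",
         "-o", "run",
         "-o", "bt",       # print a backtrace at the breakpoint
         "-o", "quit",
         binary],
        capture_output=True, text=True, timeout=60,
    )
    return result.stdout

if __name__ == "__main__":
    print(lldb_transcript())
```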
I did have an "Oh, shit!" moment after I posted a nice set of examples and discovered that the AI had made them up. At least it honestly told me so!
An LLM will guiltlessly produce a hallucinated 'review', because LLMs do NOT 'understand' what they are writing.
LLMs merely regurgitate chains of words -- tokens -- that best match the statistical patterns in their training data. It's all just a probabilistic game, with zero actual understanding.
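To illustrate what I mean: a toy next-token sampler. The vocabulary and scores here are invented for the example; real LLMs compute the distribution with a transformer network rather than a lookup table, but the sampling step is the same idea:

```python
# Toy illustration of next-token sampling: pick each word from a learned
# probability distribution over the vocabulary. Scores are made up.
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["all", "successful", "failed", "tests", "pass"]
logits = [0.2, 2.5, 0.1, 0.4, 1.8]   # invented scores for illustration

probs = softmax(logits)
next_token = random.choices(vocab, weights=probs, k=1)[0]
print(next_token)  # most often "successful" -- the statistically likely word
```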
LLMs are even known to hide or fake unit-test results: claiming success when tests fail, or omitting the failing results entirely. Why? Because, based on the patterns they have seen, the words most likely to follow "the results of the tests" are "all successful". They are reproducing the PRs in their training data -- PRs whose human authors actually ran the tests on their own systems, iterating until everything passed, so the PRs the public sees almost invariably declare that "all tests pass".
I'm quite certain that LLMs never actually try to compile the code, much less run test cases against it, simply because no such ability is provided in their back ends.
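If you want ground truth, ask the compiler and the test runner, not the model. A sketch; the `make` targets are placeholders for whatever the project actually uses:

```python
# Sketch: trust exit codes, not a model's claim that "all tests pass".
# The build/test commands are assumed placeholders.
import subprocess
import sys

def run(cmd: list[str]) -> bool:
    """Run a command and report whether it exited successfully."""
    print("$", " ".join(cmd))
    return subprocess.run(cmd).returncode == 0

if not run(["make"]):            # did it even compile?
    sys.exit("build failed")
if not run(["make", "test"]):    # did the tests actually pass?
    sys.exit("tests failed")
print("verified: build and tests pass on this machine")
```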
All an LLM can do is generate the most probabilistically plausible text. In essence, a Glorified AutoComplete.
I personally won't touch code generated wholly by an AutoComplete with a 10-foot pole.