Comment by worble

2 months ago

I'd be curious how well it does on 100th Coin's NES accuracy tests: https://github.com/100thCoin/AccuracyCoin

31 comments

worble

Indeed, that's what I hinted at in https://news.ycombinator.com/item?id=46437688 shortly after: OK, one can "generate" a "solution" much more easily than before, but until we can somehow verify that it actually does what it says it does (we know about hallucinations and have no reason to believe that has changed), testing itself, especially against well-known "problems", becomes more and more important.

That said, it doesn't answer the "why" in the first place, which is an even more important question. At least it helps somewhat in comparing against existing alternatives.

garciasn 2 months ago
Isn’t this how all software development works? Folks commit code, it’s tested and reviewed, and then it’s deployed.
Why would this be any different?
- PaulDavisThe1st 2 months ago
  
  That's not how software development works.
  Folks think, they write code, they do their own localized evaluation and testing, then they commit and then the rest of the (down|up)stream process begins.
  LLMs skip over the "actually verify that the code I just wrote does what I intended it to" step. Granted, most humans don't do this step as thoroughly and carefully as would be desirable (sometimes through laziness, sometimes because of a belief in (down|up)stream testing processes). But LLMs don't do it at all.
  

roger_ 2 months ago

I’m sure you can point Claude at that page and have it make the necessary changes to pass.

deadbabe 2 months ago
Or it could loop infinitely, never quite being able to pass all the tests.
- hu3 2 months ago
  
  which is easily fixable with some human guidance
  
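
For what it's worth, the "loops forever" versus "human guidance" exchange above can be sketched as a bounded retry loop. This is only an illustration: `run_tests`, `attempt_fix`, and `ask_human` are hypothetical callables invented for the example, not part of any real tool or of the AccuracyCoin project. The idea is to cap the number of automated attempts and hand the remaining failures to a person when the budget runs out.

```python
from typing import Callable, Optional


def fix_until_passing(
    run_tests: Callable[[], int],          # hypothetical: returns number of failing tests
    attempt_fix: Callable[[int], None],    # hypothetical: one automated fix attempt
    max_attempts: int = 5,
    ask_human: Optional[Callable[[int], None]] = None,  # hypothetical escalation hook
) -> bool:
    """Return True once the suite passes, False if the attempt budget runs out."""
    for _ in range(max_attempts):
        failures = run_tests()
        if failures == 0:
            return True
        attempt_fix(failures)

    # Check the result of the last attempt before giving up.
    failures = run_tests()
    if failures == 0:
        return True

    # Budget exhausted: stop looping and escalate instead of retrying forever.
    if ask_human is not None:
        ask_human(failures)
    return False
```

The cap is the whole point: without `max_attempts`, a fixer that never converges would spin indefinitely, which is the failure mode raised above.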