Comment by simonw
6 days ago
I'm sure a lot of them are superstitions! I've written about that before: https://simonwillison.net/2023/Aug/27/wordcamp-llms/#superst...
One of the more "engineering" like skills in using this stuff is methodically figuring out what's a superstition and what actually works.
Nice to see that you recognize that!
> One of the more "engineering" like skills in using this stuff is methodically figuring out what's a superstition and what actually works.
The problem is there are so many variables and the system is so chaotic that this is a nearly impossible task for things that don’t have an absolutely enormous effect size.
For most things you're testing, you need to run the experiment many, many times to get any kind of statistically significant result, which rules out manual review.
And since we have tried and failed to develop objective code quality metrics, you're left with metrics like "does this pass the automated tests or not?", but that doesn't tell you whether the code is any good, or whether it is overfitting the test suite. Then when a new model comes out, you have to scrap your results and run your experiments all over again. This is engineering as if the laws of physics were constantly changing; if I lived in that universe, I think I'd take my ball and go home.
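To make "run the experiment many times" concrete, here's a minimal sketch of what that bare-minimum harness looks like. Everything in it is hypothetical scaffolding: `run_trial` is a stand-in for your actual model call plus automated test, and the trial count and interval are just illustrative defaults.

```python
import math

def run_trial(prompt: str) -> bool:
    """Stand-in (hypothetical): send `prompt` to the model, run the
    automated test suite against the output, return True on pass."""
    raise NotImplementedError

def pass_rate_ci(prompt: str, n: int = 100, z: float = 1.96):
    """Run n independent trials and return the pass rate plus a
    normal-approximation ~95% confidence interval on it."""
    passes = sum(run_trial(prompt) for _ in range(n))
    p = passes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)
```

Even this toy version makes the cost visible: a hundred trials per prompt variant, per model, and all it measures is "passed the tests", not "is good code".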
There's always been a bit of magic to being a programmer, and if you look at the cover of SICP, people like to imagine that they are wizards or alchemists. But "vibe engineering" moves that to a whole new level. You're a wizard mixing up gunpowder and sacrificing chickens to fire spirits before you light it. It's not engineering, because unless the models fundamentally change you'll never really be able to sort the science from the superstition. Software engineering already had too much superstition for my taste, but we're at a whole new level now.
Here's an example from today of something I just figured out.
I had Claude Code do some work which I pushed as a branch to GitHub. Then I opened a PR so I could more easily review it and added a bunch of notes and comments there.
On a hunch, I pasted the URL to that PR into Claude Code and said "use the GitHub API to fetch the notes on this PR"...
... and it did exactly that. It guessed the API URL, fetched the JSON, and read my notes back to me.
I told it to address each note in turn and commit the result. It did.
If a future model changes such that it can no longer correctly guess the URL to fetch JSON notes for a GitHub PR, I'll notice when this trick fails. For the moment it's something I get to tuck into my ever-expanding list of things that Claude (and likely other good models) can do.
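For anyone curious, the call it presumably reconstructed looks something like the sketch below. This is my guess at the endpoint it used, not a transcript of what Claude actually ran; note that GitHub serves PR conversation comments from the issues endpoint, while inline review comments live under /pulls/{number}/comments.

```python
import requests

def fetch_pr_notes(owner: str, repo: str, number: int) -> list[str]:
    """Fetch the conversation comments ("notes") on a pull request
    via the GitHub REST API. Works unauthenticated for public repos."""
    url = f"https://api.github.com/repos/{owner}/{repo}/issues/{number}/comments"
    resp = requests.get(url, headers={"Accept": "application/vnd.github+json"})
    resp.raise_for_status()
    return [comment["body"] for comment in resp.json()]
```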
How is that an example of something you are doing that might be a superstition?
You asked it to do a single, easily verifiable task and it did it. Sure, you don't know whether that's something it can do reliably until you test it.
An example of a possible superstitious action would be always adding commands as notes in a PR because you believe Claude gives PR notes more weight.
That’s something that sounds crazy, but it’s perfectly believable that some artifact of training could lead some model to actually behave this way. And you can imagine that someone picking up on this pattern could continue to favor writing commands as PR notes years after model changes have removed this behavior.
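The way out of that trap is the same as before: measure rather than believe. A hypothetical two-proportion check, where `attempt` stands in for an assumed harness that runs one task with the instructions delivered either as PR notes or directly in the prompt:

```python
import math

def compare_placements(attempt, instructions: str, n: int = 200):
    """Hypothetical A/B check: does putting `instructions` in PR notes
    actually beat putting them in the prompt? `attempt(variant, text)`
    is an assumed harness returning True/False for one trial."""
    p1 = sum(attempt("pr_notes", instructions) for _ in range(n)) / n
    p2 = sum(attempt("plain_prompt", instructions) for _ in range(n)) / n
    pooled = (p1 + p2) / 2
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    z = (p1 - p2) / se if se else 0.0
    return p1, p2, z  # |z| > 1.96 suggests a real difference at ~95%
```

And of course the result only holds for the model you tested it on, which is exactly the problem.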