Comment by Lerc

5 days ago

I guess you have a couple of options.

You could trust the expert analysis of people in that field. You can run into personal ideologies or outliers, but asking several people tends to surface a degree of consensus.

You could try a variety of tasks that do something complex but produce results that are easy to test.

When I started trying chatbots for coding, one of my test prompts was

    Create a JavaScript function edgeDetect(image) that takes an ImageData object and returns a new ImageData object with all direction Sobel edge detection.  

That was about the level where some models would succeed and some would fail.
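For reference, here is a minimal sketch of the kind of answer that prompt is fishing for. To keep it runnable outside a browser, it operates on a plain `{ width, height, data }` object with the same shape as `ImageData` (a browser answer would accept and return real `ImageData`):

```javascript
// Sobel edge detection over an ImageData-like object:
// { width, height, data: Uint8ClampedArray of RGBA bytes }.
function edgeDetect(image) {
  const { width, height, data } = image;
  const out = new Uint8ClampedArray(data.length);
  // Luminance lookup with edge clamping so the kernel works at borders.
  const lum = (x, y) => {
    x = Math.min(width - 1, Math.max(0, x));
    y = Math.min(height - 1, Math.max(0, y));
    const i = (y * width + x) * 4;
    return 0.299 * data[i] + 0.587 * data[i + 1] + 0.114 * data[i + 2];
  };
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      // Horizontal and vertical Sobel kernels.
      const gx =
        -lum(x - 1, y - 1) + lum(x + 1, y - 1)
        - 2 * lum(x - 1, y) + 2 * lum(x + 1, y)
        - lum(x - 1, y + 1) + lum(x + 1, y + 1);
      const gy =
        -lum(x - 1, y - 1) - 2 * lum(x, y - 1) - lum(x + 1, y - 1)
        + lum(x - 1, y + 1) + 2 * lum(x, y + 1) + lum(x + 1, y + 1);
      // Combined gradient magnitude, clamped to the byte range.
      const mag = Math.min(255, Math.hypot(gx, gy));
      const o = (y * width + x) * 4;
      out[o] = out[o + 1] = out[o + 2] = mag;
      out[o + 3] = 255; // opaque alpha
    }
  }
  return { width, height, data: out };
}
```

The nice property as a test prompt: feed it a half-black/half-white image and you can check at a glance (or with a couple of assertions) that the boundary lights up and flat regions stay black.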

Recently I found that

    Can you create a webgl glow blur shader that takes a 2d canvas as a texture and renders it onscreen with webgl boosting the brightness so that #ffffff is extremely bright white and glowing,

produced a nice demo with sliders for the parameters. After a few refinements (a hierarchical scaling version), I got it to produce the same interface as a module I had written myself, and it worked as a drop-in replacement.

These things are fairly easy to check: if the result is performant and visually correct, it's about good enough to go.
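To give a sense of what the core of such a demo looks like, here is a hedged sketch of a brightness-boosting fragment shader, plus the same math as a plain function for checking behavior on the CPU. The uniform names (`u_canvas`, `u_gain`) are illustrative, not from the original demo:

```javascript
// Fragment shader: sample the 2D canvas (uploaded as a texture) and
// multiply the color by a gain so near-white pixels blow out to a glow.
const glowFragmentShader = `
  precision mediump float;
  uniform sampler2D u_canvas; // the 2D canvas as a texture
  uniform float u_gain;       // brightness boost, e.g. 2.0 - 4.0
  varying vec2 v_uv;
  void main() {
    vec4 c = texture2D(u_canvas, v_uv);
    // Values above 1.0 clip to pure, glowing white (#ffffff and up).
    gl_FragColor = vec4(c.rgb * u_gain, c.a);
  }
`;

// The same boost on the CPU, handy for sanity-checking the shader:
// full white stays clipped at full white for any gain >= 1.
function boost(rgb, gain) {
  return rgb.map((c) => Math.min(1, c * gain));
}
```

This is exactly why such prompts make good capability tests: the visual check ("does white glow?") maps directly onto a one-line piece of math you can verify independently.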

It's also worth noting that as they attempt more and more ambitious tasks, they are quite probably testing around the limit of capability. There is both marketing and science in this area. When they say they can do X, it might not mean it can do it every time, but it has done it at least once.

> You could trust the expert analysis of people in that field

That’s the problem: the experts all promise things that can’t be easily replicated. What the experts promise doesn’t match the model’s behavior. The same request might succeed or might fail, and might fail in such a way that subsequent prompts may or may not recover from it.

  • The experts I am talking about trusting here are the ones doing the replication, not the ones making the claims.

  • That's how working with junior team members or open source project contributors goes too. Perhaps that's the big disconnect. Reviewing and integrating LLM contributions slotted right into my existing workflow on my open source projects. Not all of them work. They often need fixing, stylistic adjustments, or tweaking to fit a larger architectural goal. That is the norm for all contributions in my experience. So the LLM is just a very fast, very responsive contributor to me. I don't expect it to get things right the first time.

    But it seems lots of folks do.

    Nevertheless, style, tweaks, and adjustments are a lot less work than banging out a thousand lines of code by hand. And whether an LLM or a person on the other side of the world did it, I'd still have to review it. So I'm happy to take increasingly common and increasingly sophisticated wins.

  • Juniors grow into mids, and eventually into seniors. OSS contributors eventually learn the codebase; you talk to them, you all get invested in the shared success of the project, and sometimes you even become friends.

      For me, personally, I just don't see the point of putting that same effort into a machine. It won't learn or grow from the corrections I make in that PR, so why bother? I might as well have written it myself and saved the merge review headache.

      Maybe one day it'll reach perfect parity with what I could've written myself, but today isn't that day.
