
Comment by bob1029

5 hours ago

These tests are looking increasingly like a waste of time.

The "intelligence" is clearly there now. Trying to measure it seems pointless. I can't shop for hammers at the hardware store and sort by the quality of finished products they would produce. That is clearly an insane ask, but that's approximately what is being pushed for with these models now.

Domain specificity (harness & environment) is where the magic happens next. I intentionally use a slightly less powerful model to help reveal weakness in how I've exposed the domain to the model. Having capability reserves available dramatically increases confidence around a project like this. If the customer starts to complain about some edges, I can crank them up to gpt5.5 for target scenarios. If I'm already on 5.5 there's nowhere else to go. I'm up against the wall.

> "the intelligence is clearly there"

I wonder if I am using the same models as everyone else. To me, LLMs still give good answers 80% of the time, but the other 20% of the time they fail in such a miserable way that it is obvious the "intelligence" is not there.

  • It might be extra demand for rigor that's not equally applied to humans. One could argue that other coders in our teams, or even ourselves, often fail in "a miserable way", say about 20% of the time. But we block this out, or consider it "regular functioning", or just a one-off based on something we got wrong, "just a try" we redo, etc.

    But when an LLM does it on an area we know, we notice and suddenly it's too much.

    • Because a human fails in a known way. If a human does not have expertise in domain X or tech Y, they will fail there and the expectation is that they will fail.

      With an LLM you never know where it can fail. There is no domain expertise for an LLM. It can fail in a miserable way in the same domain it worked spectacularly for.

    • No. It is not intelligent at all to confidently assert false things you know nothing about, and humans don’t do this outside of compulsive liars. For example…

      A few days ago I asked ChatGPT where a Spurgeon quote came from. Response:

      “That quote is widely attributed to Charles Spurgeon, but pinning down an exact sermon or written source is surprisingly difficult—and that’s a red flag.

      Short answer: There’s no well-attested primary source (sermon, lecture, or publication) where Spurgeon clearly says that exact wording. Etc. etc. …

      Why it sounds like Spurgeon: It fits his theology and rhetoric almost perfectly: • etc etc. …

      Closest authentic themes (but not the quote): Spurgeon repeatedly says things like: • etc etc. …

      So the quote is basically: a modern condensation of real Spurgeon ideas, not a verifiable citation etc. etc.”

      Utter bullshit. One web search produces the full sermon manuscript with the quote.

      One could argue that the previous context in the thread primed the LLM to fail here, but once again, a person is not confused by the change of topic.


    • > But when an LLM does it on an area we know, we notice and suddenly it's too much.

      Well of course. The owners of the companies building this are constantly talking about it replacing us all. Why would it be surprising that it would then be held to a higher standard?


  • I get about the same success rate with my problems (scientific computing usually), but they're often _much_ easier to check than to write, so an 80% success rate becomes game-changing.

  • It really depends on the field you are in, the tasks you set, and how much of that was in the training set. A web developer will find it succeeding in all tasks, while a developer of some exotic C++ physics simulation will find it lacking.

    The "works for me" says more about the reviewer's field than about the LLM.

    • Funny you used this example :)

      I'm a month and a half deep into using it to make a traffic simulator with a bespoke physics engine that has complete drivetrain, suspension, and tire kernels. Think rally sim with an arcadey super off road presentation. It also has a full (also bespoke) webtransport stack that has held up beyond my wildest dreams. The simulation itself is capable of >500k cars. That was all complete about 2 weeks ago; the remainder of the work is integrating and optimizing the (you guessed it, also bespoke) pure synthesis sound engines for drivetrain/engine/tire/collision noise, and making pixi performant enough to actually display it all.

      My biggest regret is actually accepting its choice of pixi. If I had just trusted what I knew and done my own renderer too, it'd already be finished! In the meantime I'm having fun boiling down the nonlinear continuous-ish models into fitted surrogate polynomials and regime-specific closed forms. Currently using cloud credits I was given to test the library I need to accelerate this work on CDNA3/4 cards. It's so nice to make someone else's room hot for a change.

      I've really enjoyed the ~3 month speedrun from "he has psychosis" to "the model did everything", yet somehow the number of people having this kind of success continues to match up with where I'd rank a given dev. There just aren't that many talented people out there and an even smaller subset of them are aiming high enough with LLMs, if at all. It's a truly awesome time to not have/need a job

      E: Most of my frustration is directed at OAI, they keep fucking up the cache and usage calculations. They got a grand out of me, I'm excited to see what Deepseek does for me with the same.

  • In my experience of hiring and managing people, I would have been very happy if they gave good answers or produced good results 80% of the time.

Why would it be a "waste of time"?

We are just getting into the nitty-gritty of LLM benchmarking. To be fair, the benchmarks still have a long way to go, IMO. But it's incredibly exciting that a locally run LLM is capable of producing results similar to those of a SOTA model.

I agree. I feel like sonnet 4.6 is sufficient for almost everything. Beyond that level it feels like the orchestration is more important.

That being said, the models still surprise me on a daily basis with a broad range of hallucinations, a lack of epistemology or common sense, or an inability to follow instructions.

Today it was trying to get opus 4.8 to just follow a simple architectural pattern for controllers in a rails app. It was like pulling teeth out of a shark.

> I can't shop for hammers at the hardware store and sort by the quality of finished products they would produce.

What? You can and you should. That's exactly what product tests enable you to do. If you need a glue, you want to look at someone who tried gluing some things with a few different glues, so you know roughly what to expect from each specific glue.