Comment by stefan_

10 months ago

Have you tried it? In my experience they just go off on a hallucination loop, or blow up the code base with terrible re-implementations.

Similarly Claude 3.5 was stuck on TensorRT 8, and not even pointing it at the documentation for the updated 10 APIs for RAG could ever get it to correctly use the new APIs (not that they were very complex; bind tensors, execute, retrieve results). The whole concept of the self-reinforcing Agent loop is more of a fantasy. I think someone else likened it to a lawnmower that will run rampage over your flower bed at the first hiccup.

Yes, they're part of my daily toolset. And yes, they can spin out. I just hit the "reject" button when they do, and revise my prompt. Or, sometimes, I just take over and fill in some of the structure of the problem I'm trying to solve myself.

I don't know about "self-reinforcing". I'm just saying: coding agents compile and lint the code they're running, and when they hallucinate interfaces, they notice. The same way any developer who has ever used ChatGPT knows that you can paste most errors into the web page and it will often (maybe even usually) come up with an apposite fix. I don't understand how anybody expects to convince LLM users this doesn't work; it obviously does work.

  • > I don't understand how anybody expects to convince LLM users this doesn't work; it obviously does work.

    This is really one of the hugest divides I've seen in the discourse about this: anti-LLM people saying very obviously untrue things, which is uh, kind of hilarious in a meta way.

    https://bsky.app/profile/caseynewton.bsky.social/post/3lo4td... is an instance of this from a few days ago.

    I am still trying to sort out why experiences are so divergent. I've had much more positive LLM experiences while coding than many other people seem to, even as someone who's deeply skeptical of what's being promised about them. I don't know how to reconcile the two.

    • > I am still trying to sort out why experiences are so divergent. I've had much more positive LLM experiences while coding than many other people seem to, even as someone who's deeply skeptical of what's being promised about them. I don't know how to reconcile the two.

      I am also trying to sort this out, but I'm probably someone you'd consider to be on the other, "anti-LLM" side.

      I wonder if part of this is simply level of patience, or, similarly, having a work environment that's chill enough to allow for experimentation?

      From my admittedly short attempts to use agentic coding so far, I usually give up pretty quickly because I experience, as others in the thread described, the agent just spinning its wheels or going off and mangling the codebase like a lawnmower.

      Now, I could totally see a scenario where if I spent time tweaking prompts, writing rule files, and experimenting with different models, I could improve that output significantly. But this is being sold to me as a productivity tool. I've got code to write, and I'm pretty sure I can write it fairly quickly myself, and I simply don't have time at my start up to muck around with babysitting an AI all day -- I have human junior engineers that need babysitting.

      I feel like I need to be a lot more inspired that the current models can actually improve my productivity in order to spend the time required to get there. Maybe that's a chicken-or-egg problem, but that's how it is.

      2 replies →

    • > This is really one of the hugest divides I've seen in the discourse about this: anti-LLM people saying very obviously untrue things, which is uh, kind of hilarious in a meta way.

      > https://bsky.app/profile/caseynewton.bsky.social/post/3lo4td... is an instance of this from a few days ago.

      Not sure why this is so surprising? ChatGPT search was only released in November last year, was a different mode, and it sucked. Search in o3 and o4-mini came out like three weeks ago. Otherwise you were using completely different products from Perplexity or Kagi, which aren't widespread yet.

      Casey Newton even half acknowledges that timing ("But it has had integrated web search since last year"...even while in the next comment criticising criticisms using the things "you half-remember from when ChatGPT launched in 2022").

      If you give the original poster the benefit of the doubt, you can sort of see what they're saying, too. An LLM, on its own, is not a search engine and can not scan the web for information. The information encoded in them might be ok, but is not complete, and does not encompass the full body of the published human thought it was trained on. Trusting an offline LLM with an informational search is sometimes a really bad idea ("who are all the presidents that did X").

      The fact that they're incorrect when they say that LLM's can't trigger search doesn't seem that "hilarious" to me, at least. The OP post maybe should have been less strident, but it also seems like a really bad idea to gatekeep anybody wanting to weigh in on something if their knowledge of product roadmaps is more than six months out of date (which I guarantee is all of us for at least some subject we are invested in).

      2 replies →

    • These are mostly value judgments and people are using words that mean different things to different people but I would point that LLM boosters have been saying the same thing for each product release: "now it works, you are just using the last-gen model/technique which doesn't really work (even though I said the same thing for that model/technique and every one before that). Moreover there still hasn't been significant, objectively observable impact: no explosion in products, no massive acceleration of feature releases, no major layoffs attributed to AI (to which the response every time is that it was just released and you will see the effects in a few months).

      Finally, if it really were really true that some people know the special sauce of how to use LLMs to make a massive difference in productivity but many people didn't know how to do that then you could make millions or tens of millions per year as a consultant training everyone at big companies. In other words if you really believed what you were saying you should pick up the money on the ground.

      2 replies →

    • > I am still trying to sort out why experiences are so divergent. I've had much more positive LLM experiences while coding than many other people seem to, even as someone who's deeply skeptical of what's being promised about them. I don't know how to reconcile the two.

      As with many topics, I feel like you can divide people in a couple of groups. You have people who try it, have their mind blown by it, so they over-hype it. Then the polar-opposite, people who are overly dismissive and cement themselves into a really defensive position. Both groups are relatively annoying, inaccurate, and too extremist. Then another group of people might try it out, find some value, integrate it somewhat and maybe got a little productivity-boost and moves on with their day. Then a bunch of other groupings in-between.

      Problem is that the people in the middle tend to not make a lot of noise about it, and the extremists (on both ends) tend to be very vocal about their preference, in their ways. So you end up perceiving something as very polarizing. There are many accurate and true drawbacks with LLMs as well, but it also ends up poisoning the entire concept/conversation/ecosystem for some people, and they tend to be noisy as well.

      Then the whole experience depends a lot on your setup, how you use it, what you expect, what you've learned and so many much more, and some folks are very quick to judge a whole ecosystem without giving parts of it an honest try. It took me a long time to try Aider, Cursor and others, and even now after I've tried them out, I feel like there are probably better ways to use this new category of tooling we have available.

      In the end I think reality is a bit less black/white for most folks, common sentiment I see and hear is that LLMs are probably not hellfire ending humanity nor is it digital-Jesus coming to save us all.

      1 reply →

    • > anti-LLM people saying very obviously untrue things, which is uh, kind of hilarious in a meta way.

      tptacek shifted the goal posts from "correct a hallucination" to "solve a copy pasted error" (very different things!) and just a comment later theres someone assassinating me as an "anti-LLM person" saying "very obviously untrue things", "kind of hilarious". And you call yourself "charitable". It's a joke.

      1 reply →

    • > I am still trying to sort out why experiences are so divergent

      I suspect part of it is that there still isn't much established social context for how to interact with an LLM, and best practices are still being actively discovered, at least compared to tools like search engines or word processors.

      Search engines somewhat have this problem, but there's some social context around search engine skill, colloquially "google-fu", if it's even explicitly mentioned.

      At some point, being able to get the results from a search engine stopped being entirely about the quality of the engine and instead became more about the skill of the user.

      I imagine, as the UX for AI systems stabilizes, and as knowledge of the "right way", to use them diffuses through culture, experiences will be less divergent.

      1 reply →

    • So is the real engineering work in the agents rather than in the LLM itself then? Or do they have to be paired together correctly? How do you go about choosing an LLM/agent pair efficiently?

      4 replies →

    • I think it is pretty simple: people tried it a few times a few months ago in a limited setting, formed an opinion based on those limited experiences and cannot imagine a world where they are wrong.

      That might sound snarky, but it probably works out for people in 99% of cases. AI and LLMs are advancing at a pace that is so different from any other technology that people aren't yet trained to re-evaluate their assumptions at the high rate necessary to form accurate new opinions. There are too many tools coming (and going, to be fair).

      HN (and certain parts of other social media) is a bubble of early adopters. We're on the front lines seeing the war in realtime and shaking our heads at what's being reported in the papers back home.

      2 replies →

    • > an instance of this from a few days ago.

      Bro I've been using LLMs for search since before it even had search capabilities...

      "LLMs not being for search" has been an argument from the naysayers for a while now, but very often when I use an LLM I am looking for the answer to something - if that isn't [information] search, then what is?

      Whether they hallucinate or outright bullshit sometimes is immaterial. For many information retrieval tasks they are infinitely better than Google and have been since GPT3.

      1 reply →

> I think someone else likened it to a lawnmower that will run rampage over your flower bed at the first hiccup

This reminds me of a scene from the recent animation movie "Wallace and Gromit: Vengeance Most Fowl" where Wallace actually uses a robot (Norbot) to do gardening tasks, and rampages over Gromit's flower bed.

https://youtu.be/_Ha3fyDIXnc

I mean, I have. I use them every day. You often see them literally saying "Oh there is a linter error, let me go fix it" and then a new code generation pass happens. In the worst case, it does exactly what you are saying, gets stuck in a loop. It eventually gets to the point where it says "let me try just once more" and then gives up.

And when that happens I review the code and if it is bad then I "git revert". And if it is 90% of the way there I fix it up and move on.

The question shouldn't be "are they infallible tools of perfection". It should be "do I get value equal to or greater than the time/money I spend". And if you use git appropriately you lose at most five minutes on a agent looping. And that happens a couple of times a week.

And be honest with yourself, is getting stuck in a loop fighting a compiler, type-checker or lint something you have ever experienced in your pre-LLM days?

Have you tried it? More than once?

I’m getting massive productivity gains with Cursor and Gemini 2.5 or Claude 3.7.

One-shotting whole features into my rust codebase.

  • I use it all the time, multiple times daily. But the discussion is not being very honest, particularly for all the things that are being bolted on (agent mode, MCP). Like just upstream people dunk on others for pointing out that maybe giving the model an API call to read webpages isn't quite turning LLM into search engines. Just like letting it run shell commands has not made it into a full blown agent engineer.

    I tried it again just now with Claude 3.7 in Cursors Agent/Compose (they change this stuff weekly). Write a simple C++ TensorRT app that loads an engine and runs inference 100 times for a benchmark, use this file to source a toolchain. It generated code with the old API & a CMake file and (warning light turns on) a build script. The compile fails because of the old API, but this time it managed to fix it to use the new API.

    But now the linking fails, because it overwrote the TRT/CUDA directories in the CMakeLists with some home cooked logic (there was nothing to do, the toolchain script sets up the environment fully and just find_package would work).

    And this is where we go off the rails; it messes with the build script and CMakeLists more, but still it can not link. It thinks hey it looks like we are cross-compiling and creates a second build script "cross-compile.sh" that tries to use the compiler directly, but of course that misses things that the find_package in CMake would setup and so fails with include errors.

    It pretends its a 1970 ./configure script and creates source files "test_nvinfer.cpp" and "test_cudart.cpp" that are supposed to test for the presence of those libraries, then tries to compile them directly; again its missing directories and obviously fails.

    Next we create a mashup build script "cross-compile-direct.sh". Not sure anymore what this one tried to achieve, didn't work.

    Finally, and this is my favorite agent action yet, it decides fuck it, if the library won't link, why don't we just mock out all the actual TensorRT/CUDA functionality and print fake benchmark numbers to demonstrate LLMs can average a number in C++. So it writes, builds ands runs a "benchmark_mock.cpp" that subs out all the useful functionality for random data from std::mt19937. This naturally works, so the agent declares success and happily updates the README.md with all the crap it added and stops.

    This is what running the lawnmower over the flower bed means; you have 5 more useless source files and a bunch more shell scripts and a bunch of crap in a README that were all generated to try and fail to fix a problem it could not figure out, and this loop can keep going and generate more nonsense ad infinitum.

    (Why could it not figure out the linking error? We come back to the shitty bolted on integrations; it doesn't actually query the environment, search for files or look at what link directories are being used, as one would investigating a linking error. It could of course, but the balance in these integrations is 99% LLM and 1% tool use, and even context from the tool use often doesn't help)

    • It's really weird for me to see people talking about using LLMs in coding situations in a frame where "agents" (we're not even at MCP yet!) are somehow an extra. People discussing the applicability of LLMs to programming, and drawing conclusions (even if only for themselves) about how well it works, should be experienced with a coding agent.

    • LLMs aren't good at deep ML problems yet. We know this, its what MLE-bench is for. That doesn't mean they aren't good at other coding problems

Someone gave me the tip to add "all source files should build without error", which you'd think would be implicit, but it seems not.

  • There's definitely a skill to using them well (I am not yet expert); my only frustration is with people who (like me) haven't refined the skill but have also concluded that there's no benefit to the tool. No, really, in this case, you're mostly just not holding it right.

    The tools will get better, but what I see happening with people who are good at using them (and from my own code, even in my degraded LLM usage), we have an existence proof of the value of the tools.