Comment by libraryofbabel

2 months ago

Strongly recommend this blog post too which is a much more detailed and persuasive version of the same point. The author actually goes and builds a coding agent from zero: https://ampcode.com/how-to-build-an-agent

It is indeed astonishing how well a loop with an LLM that can call tools works for all kinds of tasks now. Yes, sometimes they go off the rails, there is the problem of getting that last 10% of reliability, etc. etc., but if you're not at least a little bit amazed then I urge you go to and hack together something like this yourself, which will take you about 30 minutes. It's possible to have a sense of wonder about these things without giving up your healthy skepticism of whether AI is actually going to be effective for this or that use case.

This "unreasonable effectiveness" of putting the LLM in a loop also accounts for the enormous proliferation of coding agents out there now: Claude Code, Windsurf, Cursor, Cline, Copilot, Aider, Codex... and a ton of also-rans; as one HN poster put it the other day, it seems like everyone and their mother is writing one. The reason is that there is no secret sauce and 95% of the magic is in the LLM itself and how it's been fine-tuned to do tool calls. One of the lead developers of Claude Code candidly admits this in a recent interview.[0] Of course, a ton of work goes into making these tools work well, but ultimately they all have the same simple core.

[0] https://www.youtube.com/watch?v=zDmW5hJPsvQ

76 comments

libraryofbabel

deadbabe 2 months ago

Generally when LLM’s are effective like this, it means a more efficient non-LLM based solution to the problem exists using the tools you have provided. The LLM helps you find the series of steps and synthesis of inputs and outputs to make it happen.

It is expensive and slow to have an LLM use tools all the time for solving the problem. The next step is to convert frequent patterns of tool calls into a single pure function, performing whatever transformation of inputs and outputs are needed along the way (an LLM can help you build these functions), and then perhaps train a simple cheap classifier to always send incoming data to this new function, bypassing LLMs all together.

In time, this will mean you will use LLMs less and less, limiting their use to new problems that are unable to be classified. This is basically like a “cache” for LLM based problem solving, where the keys are shapes of problems.

The idea of LLMs running 24/7 solving the same problems in the same way over and over again should become a distant memory, though not one that an AI company with vested interest in selling as many API calls as possible will want people to envision. Ideally LLMs are only needed to be employed once or a few times per novel problem before being replaced with cheaper code.

toobulkeh 2 months ago

I’ve been tinkering with this, but haven’t found a pattern or library of someone solving this.
Have you?

vidarh 2 months ago

There's a Ruby port of the first article you linked as well. Feature-wise they're about the same, but if you (like me) enjoy Ruby more than Python it's worth reading both articles:

https://radanskoric.com/articles/coding-agent-in-ruby

forgingahead 2 months ago
Love to see the Ruby implementations! Thanks for sharing.
- ichiwells 2 months ago
  
  Thank you so much for sharing this!
  We are using ruby to build a powerful AI toolset in the construction space, and we love how simple all of the SaaS parts are and not reinventing the wheel, but the ruby LLM SDK ecosystem is a bit lagging, so we've written a lot of our own low-level tools.
  (btw we are also hiring rubyists https://news.ycombinator.com/item?id=43865448)
  
  1 reply →

datpuz 2 months ago

Can't think of anything an LLM is good enough at to let them do on their own in a loop for more than a few iterations before I need to reign it back in.

hbbio 2 months ago

That's why in practice you need more than this simple loop!
Pretty much WIP, but I am experimenting with simple sequence-based workflows that are designed to frequently reset the conversation [2]
This goes well with Microsoft paper "LLMs Get Lost In Multi-Turn Conversation " that was published Friday [1].
- [1]: https://arxiv.org/abs/2505.06120
- [2]: https://github.com/hbbio/nanoagent/blob/main/src/workflow.ts
Groxx 2 months ago
They're extremely good at burning through budgets, and get even better when unattended
- _kb 2 months ago
  
  Maximising paperclip production too.
- mycall 2 months ago
  
  Is that really true? I though there free models and $200 all you can eat models.
  
  9 replies →
CuriouslyC 2 months ago
The main problem with agents is that they aren't reflecting on their own performance and pausing their own execution to ask a human for help aggressively enough. Agents can run on for 20+ iterations in many cases successfully, but also will need hand holding after every iteration in some cases.
They're a lot like a human in that regard, but we haven't been building that reflection and self awareness into them so far, so it's like a junior that doesn't realize when they're over their depth and should get help.
- vendiddy 2 months ago
  
  I think they are capable of doing it, but it requires prompting.
  I constantly have to instruct them: - Go step by step, don't skip ahead until we're done with a step - Don't make assumptions, if you're unsure ask questions to clarify
  And they mostly do this.
  But this needs to be default behavior!
  I'm surprised that, unless prompted, LLMs never seem to ask follow-up questions as a smart coworker might.
- ariwilson 2 months ago
  
  Is there value in adding an overseer LLM that measures the progress between n steps and if it's too low stops and calls out to a human?
  
  28 replies →
vidarh 2 months ago
They've written most of the recent iterations of X11 bindings for Ruby, including a complete, working example of a systray for me.
They also added the first pass of multi-monitor support for my WM while I was using it (restarted it repeatedly while Claude Code worked, in the same X session the terminal it was working in was running).
You do need to reign them back in, sure, but they can often go multiple iterations before they're ready to make changes to your files once you've approved safe tool uses etc.
- TZubiri 2 months ago
  
  How do they read the screen?
  
  1 reply →
- datpuz 2 months ago
  
  Agents? Doubt.
  
  3 replies →
eru 2 months ago

The hope is that the ground truth from calling out to tools (like compilers or test runs) will eventually be enough keep them on track.
Just like humans and human organisations also tend to experience drift, unless anchored in reality.
mkagenius 2 months ago

I built android-use[1] using LLM. It is pretty good at self healing due to the "loop", it constantly checks if the current step is actually a progress or regress and then determines next step. And the thing is nothing is explicitly coded, just a nudge in the prompts.
1. clickclickclick - A framework to let local LLMs control your android phone (https://github.com/BandarLabs/clickclickclick)
loa_in_ 2 months ago

You don't have to. Most of the appeal is automatically applying fixes like "touch file; make" after spotting a trivial mistake. Just let it at it.
JeremyNT 2 months ago

Definitely true currently, which is why there's so much focus on using them to write real code that humans have to actually commit and put their names on.
Longer term, I don't think this holds due to the nature of capitalism.
If given a choice between paying for an LLM to do something that's mostly correct versus paying for a human developer, businesses are going to choose the former, even if it results in accelerated enshittification. It's all in service of reducing headcount and taking control of the means of production away from workers.

meander_water 2 months ago

There's also this one which uses pocketflow, a graph abstraction library to create something similar [0]. I've been using it myself and love the simplicity of it.

[0] https://github.com/The-Pocket/PocketFlow-Tutorial-Cursor/blo...

wepple 2 months ago

Ah, it’s Thorsten Ball!

I thoroughly enjoyed his “writing an interpreter”. I guess I’m going to build an agent now.

sesm 2 months ago

Should we change the link above to use `?utm_source=hn&utm_medium=browser` before opening it?

libraryofbabel 2 months ago

fixed :)

orange_puff 2 months ago

I have been trying to find such an article for so long, thank you! I think a common reaction to Agents is “well, it probably cannot solve a really complex problem very well”. But to me, that isn’t the point of an agent. LLMs function really well with a lot of context, and agent allows the LLM to discover more context and improve its ability to answer questions.

xnx 2 months ago

> The reason is that there is no secret sauce and 95% of the magic is in the LLM itself

Makes that "$3 billion" valuation for Windsurf very suspect

TonyEx 2 months ago

The value in the windsurf acquisition isn't the code they've written, it's the ability to see what people are coding and use that information to build better LLMs. -- Product development.
rrrx3 2 months ago

Indeed. But keep in mind they weren't just buying the tooling - they get the team, the brand, and the positional authority as well. OpenAI could have spun up a team to build an agent code IDE, and they would have been starting on the back foot with users, would have been compared to Cursor/Windsurf...
The price tag is hefty but I figure it'll work out for them on the backside because they won't have to fight so hard to capture TAM.

gchamonlive 2 months ago

How far can you go with the best models that fit in a consumer grade GPU (24GB vram)?

TZubiri 2 months ago

there is the problem of getting that last 10% of reliability.

In my experience, that next 9% will take 9 times the effort.

And that next 0.9% will take 9 times the effort.

And so on.

So 90% is very far off from 99.999% reliability. Which would still be less reliable than an ec2 instance.

com 2 months ago

As the project management saying goes: 99% done is much closer to 0% done than 100% done

aibrother 2 months ago

thanks for the rec. and yeah agreed with the observations as well

kcorbitt 2 months ago

For "that last 10% of reliability" RL is actually working pretty well right now too! https://openpipe.ai/blog/art-e-mail-agent