Comment by coffeemug

3 days ago

If you ignore the word "agent" and autocomplete it in your mind to "step", things will make more sense.

Here is an example-- I highlight physical books as I read them with a red pen. Sometimes my highlights are underlines, sometimes I bracket relevant text. I also write some comments in the margins.

I want to photograph relevant pages and get the highlights and my comments into plain text. If I send an image of a highlighted/commented page to ChatGPT and ask it to get everything into plain text, it doesn't work. It's just not smart enough to do it in one prompt. So you have to do it in steps. First you ask for the comments. Then for the underlined highlights. Then for the bracketed highlights. Then you merge the outputs. Empirically, this produces much better results. (This is a really simple example, but imagine you add summarization or something; then the steps feed into each other.)
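To make the "steps" idea concrete, here is a minimal sketch of that pipeline. It assumes the OpenAI Python SDK and a vision-capable model; the model name, prompts, and the ask_about_page helper are illustrative, not coffeemug's actual setup.

    # Minimal sketch of "one narrow prompt per step, then merge".
    # Assumes the OpenAI Python SDK; model name, prompts, and the
    # ask_about_page helper are illustrative.
    import base64
    from openai import OpenAI

    client = OpenAI()

    def ask_about_page(image_path: str, instruction: str) -> str:
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        resp = client.chat.completions.create(
            model="gpt-4o",  # any vision-capable model
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": instruction},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                ],
            }],
        )
        return resp.choices[0].message.content

    page = "page_042.jpg"
    # Steps 1-3: each extraction is its own narrowly scoped call.
    comments   = ask_about_page(page, "Transcribe only the handwritten margin comments.")
    underlines = ask_about_page(page, "Transcribe only the passages underlined in red pen.")
    brackets   = ask_about_page(page, "Transcribe only the passages marked with red brackets.")

    # Step 4: merge the partial results (this could itself be another
    # LLM call, e.g. a summarization step).
    print("\n\n".join(["MARGIN COMMENTS", comments,
                       "UNDERLINED", underlines,
                       "BRACKETED", brackets]))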

As these things get complicated, you start bumping into repeated problems (like understanding what's happening between each step, tweaking prompts, etc.). Having a library with some nice tooling can help with those. It's not especially magical, and nothing you couldn't do yourself. But you could also write Datadog or Splunk yourself. It's just convenient not to.
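For a sense of what that tooling amounts to, here is a hypothetical tracing decorator (all names made up) that records what each step produced and how long it took, which is most of what you want when you're tweaking prompts between steps.

    # Hypothetical sketch of the kind of plumbing such a library provides:
    # log each step's name, duration, and output preview so you can see
    # what happened between steps. Everything here is illustrative.
    import functools, json, time

    def traced(step_name: str):
        def wrap(fn):
            @functools.wraps(fn)
            def inner(*args, **kwargs):
                start = time.time()
                result = fn(*args, **kwargs)
                print(json.dumps({
                    "step": step_name,
                    "seconds": round(time.time() - start, 2),
                    "output_preview": str(result)[:120],
                }))
                return result
            return inner
        return wrap

    @traced("extract_margin_comments")
    def extract_margin_comments(page_image: str) -> str:
        # ask_about_page is the helper from the sketch above
        return ask_about_page(page_image, "Transcribe only the margin comments.")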

The internet decided to call these types of programs agents, which confuses engineers like you (and me) who tend to think concretely. But if you get past that word, and maybe write an example app or something, I promise these things will make sense.

To add some color to this:

Anthropic does a good job of breaking down some common architecture around using these components [1] (good outline of this if you prefer video [2]).

"Agent" is definitely an overloaded term - the best framing of this I've seen is aligns more closely with the Anthropic definition. Specifically, an "agent" is a GenAI system that dynamically identifies the tasks ("steps" from the parent comment) without having to be instructed that those are the steps. There are obvious parallels to the reasoning capabilities that we've seen released in the latest cut of the foundation models.

So for example, the "Agent" would first build a plan for how to address the query, dynamically farm out the steps in that plan to other LLM calls, and then evaluate execution for correctness/success.

[1] https://www.anthropic.com/research/building-effective-agents
[2] https://www.youtube.com/watch?v=pGdZ2SnrKFU
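A rough sketch of that plan / execute / evaluate loop, again using the OpenAI SDK for brevity. The JSON plan format, prompts, and max_revisions knob are assumptions for illustration, not Anthropic's definition or API.

    # Sketch of an agent loop: the model writes its own plan, each planned
    # step becomes its own LLM call, then the model checks and revises its
    # answer. All prompts and formats are illustrative.
    import json
    from openai import OpenAI

    client = OpenAI()

    def llm(prompt: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    def run_agent(query: str, max_revisions: int = 3) -> str:
        # 1. The model decides the steps itself (this is the "agentic" part).
        plan = json.loads(llm(
            "Break this task into a JSON list of short step descriptions. "
            f"Return only the JSON list.\nTask: {query}"))

        results = []
        for step in plan:
            # 2. Each step is farmed out as its own call, seeing prior results.
            results.append(llm(
                f"Task: {query}\nCurrent step: {step}\n"
                f"Results so far: {results}"))

        answer = results[-1]
        for _ in range(max_revisions):
            # 3. The model evaluates its own output and revises if needed.
            verdict = llm(
                f"Task: {query}\nDraft answer: {answer}\n"
                "Reply with exactly PASS if the draft fully answers the task; "
                "otherwise explain what is missing.")
            if verdict.strip().startswith("PASS"):
                break
            answer = llm(
                f"Task: {query}\nDraft answer: {answer}\n"
                f"Reviewer feedback: {verdict}\nRewrite the draft to fix this.")
        return answer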

  • This sums up the range: from multiple LLM calls to build smart features, to letting the LLM decide what to do next. I think you can go very far with the former, but the latter is more autonomous in unconstrained environments (like chatting with a human, etc.).