Comment by potatolicious
3 days ago
It's pretty early days for these kinds of systems, so there's no "one true" architecture that people have settled on. There are two broad variations that I see:
1 - The LLM is in charge and at the top of the stack. The deterministic bits are exposed to the LLM as tools, but you instruct the LLM specifically to use them in a particular way. For example: "Generate this code, and then run the build and tests. Do not proceed with more code generation until build and tests successfully pass. Fix any errors reported at the build and test step before continuing." This mostly works fine, but is of course subject to the LLM not following instructions reliably (it gets worse as the context gets longer).
2 - A deterministic system is at the top, and uses LLMs in an otherwise-scripted program. This potentially works better when the domain the LLM is meant to solve is narrow and well-understood. In this case the structure of the system is more like a traditional program, but one that calls out to LLMs as-needed to fulfill certain tasks (see the sketch after this list).
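To illustrate variation 2, here's a minimal sketch: a plain script is in charge, and the LLM is called as a narrow subroutine. The `call_llm` helper and the ticket-triage task are made up for illustration; substitute your provider's actual client.

```python
# Variation 2: a deterministic pipeline that calls an LLM for one narrow,
# well-understood step. `call_llm` is a stand-in for a real client library.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # your provider's API call goes here

def triage_ticket(ticket_text: str) -> str:
    # The LLM handles only the fuzzy classification step...
    label = call_llm(
        "Classify this support ticket as exactly one of: "
        "'bug', 'feature-request', 'billing'.\n\n" + ticket_text
    ).strip().lower()

    # ...and ordinary deterministic code validates and acts on the result.
    if label not in {"bug", "feature-request", "billing"}:
        label = "bug"  # conservative fallback on unexpected output
    return label
```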
> "I’m not understanding how a probabilistic machine output can reliably map onto a strict input schema."
So there are two tricks to this:
1 - You can actually force the machine output into strict schemas. Basically all of the large model providers now support outputting in defined schemas - heck, Apple just announced their on-device LLM which can do that as well. If you want the LLM's output guaranteed to conform to a specified schema, that's trivial to do today! This is fundamental to tool-calling. (A rough sketch follows after this list.)
2 - But often you don't actually want to force the LLM into strict schemas. For the coding tool example above where the LLM runs build/tests, it's often much more productive to directly expose stdout/stderr to the LLM. If the program crashed on a test, it's often very productive to just dump the stack trace as plaintext at the LLM, rather than try to coerce the data into a stronger structure and then show it to the LLM.
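For trick 1, here's roughly what constrained output looks like. The request shape varies by provider (OpenAI calls this "structured outputs"), so the `client.generate` call below is a placeholder rather than any real API:

```python
import json

# A JSON Schema describing the only shape the model is allowed to emit.
# Providers that support structured outputs constrain decoding to this
# schema, so the response is guaranteed to parse against it.
DIFF_SCHEMA = {
    "type": "object",
    "properties": {
        "file": {"type": "string"},
        "patch": {"type": "string"},
        "commit_message": {"type": "string"},
    },
    "required": ["file", "patch", "commit_message"],
    "additionalProperties": False,
}

def generate_diff(client, prompt: str) -> dict:
    # `client.generate` is a placeholder; check your provider's docs for
    # how to attach a schema to a request.
    raw = client.generate(prompt, schema=DIFF_SCHEMA)
    return json.loads(raw)  # cannot fail to parse if decoding is constrained
```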
How much structure vs. freeform you want is very much domain-specific, but the important realization is that more structure isn't always good.
To make this concrete, the flow would be something like:
[LLM generates a bunch of code, in a structured format that your IDE understands and can convert into a diff]
[LLM issues the `build_and_test` tool call to your IDE. Your IDE executes the build and tests.]
[Build and tests (deterministic) complete, IDE returns the output to the LLM. This can be unstructured or structured.]
[LLM does the next thing]
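Stitched together, that loop might look something like this sketch, where `llm_step` and `apply_diff` are placeholders for the model call and the IDE's diff application, and the success check is deliberately naive:

```python
import subprocess

def build_and_test() -> str:
    # Deterministic step: run the real build/tests and capture everything.
    result = subprocess.run(["make", "test"], capture_output=True, text=True)
    # Per point 2 above: hand the raw output straight back to the model.
    return result.stdout + result.stderr

def coding_loop(llm_step, apply_diff, task: str, max_rounds: int = 5):
    feedback = ""
    for _ in range(max_rounds):
        diff = llm_step(task, feedback)  # structured output (trick 1)
        apply_diff(diff)                 # IDE-side, deterministic
        feedback = build_and_test()      # deterministic, freeform output
        if "FAILED" not in feedback:     # naive success check, illustrative
            return diff
    raise RuntimeError("gave up after max_rounds")
```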
So, to summarize, there is a feedback loop like this: LLM <--> deterministic agent? And there's an asymmetry in strictness, i.e. LLM --> agent funnels probabilistic output into 1+ structured fields, whereas agent --> LLM can be more freeform (stderr plaintext). Is that right?
A few questions:
1) How does the LLM know where to put output tokens when there's more than one structured field option?
2) Is this loop effective for projects from scratch? How good is it at proper design (understanding tradeoffs in algorithms, etc)?
> "there is a feedback loop like this: LLM <--> deterministic agent?"
More or less, though the agent doesn't have to be deterministic. There's a sliding scale of how much determinism you want in the "overseer" part of the system. This is a huge area of active development with not a lot of settled stances.
There's a lot of work being put into making the overseer/agent a LLM also. The neat thing is that it doesn't have to be the same LLM, it can be something fine-tuned to specifically oversee this task. For example, "After code generation and build/test has finished, send the output to CodeReviewerBot. Incorporate its feedback into the next round of code generation." - where CodeReviewerBot is a different probabilistic model trained for the task.
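A sketch of that pattern, with `coder_llm` and `reviewer_llm` as placeholders for two (possibly different) models, and "LGTM" as a toy approval signal:

```python
def generate_with_review(coder_llm, reviewer_llm, task: str, rounds: int = 3):
    feedback = ""
    code = None
    for _ in range(rounds):
        # The main model generates; a separately-trained model reviews.
        code = coder_llm(task, feedback)
        review = reviewer_llm(code)  # e.g. a fine-tuned CodeReviewerBot
        if "LGTM" in review:         # toy approval signal
            break
        feedback = review  # fold review comments into the next round
    return code
```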
You could even put a human in as part of the agent: "do this stuff, then upload it for review, and continue only after the review has been approved" is a totally reasonable system where (part of) the agent is literal people.
> "And there's a asymmetry in strictness, i.e. LLM --> agent funnels probabilistic output into 1+ structured fields, whereas agent --> LLM can be more freeform (stderr plaintext). Is that right?"
Yes, though some flexibility exists here. If LLM --> deterministic agent, then you'd want to squeeze the output into structured fields. But if the agent is itself probabilistic/a LLM, then you can also just dump unstructured data at it.
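Concretely, the same build log might be packaged either way depending on who consumes it (both branches below are illustrative, not a real protocol):

```python
import re

def package_failure(raw_log: str, consumer_is_llm: bool):
    if consumer_is_llm:
        # An LLM consumer can take the whole log, stack traces and all.
        return raw_log
    # A scripted consumer needs structure: parse the log into fields.
    failed = re.findall(r"FAILED (\S+)", raw_log)
    return {"failed_tests": failed, "count": len(failed)}
```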
It's kind of the wild west right now in this whole area. There's not a lot of common wisdom besides "it works better if I do it this way".
> "1) how does the LLM know where to put output tokens given more than one structured field options?"
Prompt engineering and a bit of praying. The trick is that there are methods for ensuring the LLM doesn't hallucinate things that break the schema (fields that don't exist for example), but output quality within the schema is highly variable!
For example, you can force the LLM to output a schema that references a previous commit ID... but it might hallucinate a non-existent ID. You can make it output a list of desired code reviewers, and it'll respect the format... but hallucinate non-existent reviewers.
Smart prompt engineering can reduce the chances of this kind of undesired behavior, but given that it's a giant ball of probabilities, performance is never truly guaranteed. Remember also that this is a language model, so it's sensitive to the schema itself: obtuse naming within the schema will negatively impact reliability.
This is actually part of the role of the agent. "This code reviewer doesn't exist. Try again. The valid reviewers are: ..." is a big part of why these systems work at all.
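Sketching that validate-and-retry role, with a made-up `VALID_REVIEWERS` set and an `llm_call` placeholder that returns a schema-valid (but possibly hallucinated) list of names:

```python
VALID_REVIEWERS = {"alice", "bob", "carol"}

def request_reviewers(llm_call, prompt: str, max_tries: int = 3) -> list[str]:
    for _ in range(max_tries):
        # Schema enforcement guarantees a list of strings comes back...
        proposed = llm_call(prompt)
        # ...but not that the names are real. The agent checks that part.
        bogus = [r for r in proposed if r not in VALID_REVIEWERS]
        if not bogus:
            return proposed
        # Feed the error back and let the model try again.
        prompt += (
            f"\nThese reviewers don't exist: {bogus}. "
            f"Valid reviewers are: {sorted(VALID_REVIEWERS)}. Try again."
        )
    raise ValueError("could not get a valid reviewer list")
```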
> "2) Is this loop effective for projects from scratch? How good is it at proper design (understanding tradeoffs in algorithms, etc)?"
This is where the quality of the initial prompt and the structure of the agent come into play. I don't have a great answer here, besides that making these agents better at decomposing higher-level tasks (including understanding tradeoffs) is a lot of what's at the bleeding edge.
Wait, so you just tell the LLM the schema, and hope it replicates it verbatim with content filled into it? I was under the impression that you say "hey, please tell me what to put in this box" repeatedly until your data model is done. That sort of surprises me!
This interface interests me the most because it sits right at the reliability-flexibility tradeoff that people are constantly debating w/ the new AI tech. Are there "mediator" agents with some reliability AND some flexibility? I could see a loosey goosey LLM passing things off to Mr. Stickler agent leading to failure all the time. Is the mediator just humans?