Comment by luodaint
21 days ago
In production business logic, this failure type is harder to spot than in benchmarking tasks. A benchmark has a known answer, so mismatches are easy to detect. A business logic pipeline has no such reference. If the LLM returns a structurally valid output that happens to be semantically wrong, the pipeline carries on as normal. There is no error to catch.
I built a dedupe pipeline where an LLM decides whether two feature requests are similar enough to merge. It occasionally produced false positives: valid JSON structure, but an incorrect similarity assessment. Retrying didn't help in this case. The fix was a deterministic gate that validates the model's output against a semantic similarity score calculated separately.
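For anyone who wants to picture the gate, here is a minimal sketch, not the actual pipeline. Everything named in it is an assumption for illustration: the `{"merge": bool}` output shape, the `gate_merge` and `similarity` helpers, and the `MERGE_THRESHOLD` value. The word-count cosine score is only a stdlib stand-in for whatever separately calculated semantic similarity score you use (an embedding model, most likely).

```python
import json
import math
import re
from collections import Counter

MERGE_THRESHOLD = 0.5  # hypothetical cutoff; tune it on labeled pairs


def similarity(a: str, b: str) -> float:
    """Cosine similarity over word counts (stand-in for an embedding score)."""
    va = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    vb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gate_merge(llm_output: str, req_a: str, req_b: str) -> dict:
    """Accept the model's merge decision only if the independent score agrees."""
    decision = json.loads(llm_output)  # assumed shape: {"merge": bool}
    score = similarity(req_a, req_b)
    wants_merge = decision.get("merge") is True
    approved = wants_merge and score >= MERGE_THRESHOLD
    return {
        "merge": approved,
        "score": score,
        # True when the model asked for a merge and the gate refused it
        "overridden": wants_merge and not approved,
    }


if __name__ == "__main__":
    result = gate_merge(
        '{"merge": true}',
        "Add dark mode to the dashboard",
        "Export invoices as CSV",
    )
    print(result)
```

The gate only ever vetoes a merge; it never forces one the model did not ask for. Overridden pairs could be logged or sent to manual review rather than retried, since a retry on the same inputs tends to repeat the same judgment.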
This makes it clear why recovery only works with the help of additional tools when the error rate is at zero percent: the LLM does not recognize that it made a mistake. That is what the guardrail is for, and retry is just one way of implementing the guardrail concept.
Definitely, there are several failure modes and Forge doesn't address all of them. This is just one tool in the toolbox for getting things stable enough for production use at reduced cost.
In my mind, Forge sits one level lower than a gate, which would sit more at the workflow level. The two are perfectly complementary.