
Comment by riemannzeta

11 hours ago

It is really curious to see how the performance degraded despite the tool calls. What was different about the first month? Was all of the context there without tool calls in the first month? In the later months, it seems like tool calls weren't happening. Shouldn't those have been happening to inform the context?

(Another member of the team behind the benchmark here.) The first month performed well because (1) the models effectively leveraged historical precedent: they could identify similar transactions in past data and apply established patterns, and (2) the starting balances were clean, so the models could more easily understand and track down discrepancies.

> Was all of the context there without tool calls in the first month?

We provided schemas for the GL and source data in the system prompt, but none of the actual data. The model had to use its tools (SQL queries and Python scripts) to understand and analyze the historical data.

> In the later months, it seems like tool calls weren't happening. Shouldn't those have been happening to inform the context?

We actually didn’t find that they stopped calling tools entirely. Instead, they weren’t able to make sense of the information fetched with tools (for example, a bank account starting balance that was >$100,000 different from the starting balance on the supporting bank statement). They’d tend to either do nothing or do a single first pass without deduplicating or cleaning up. This created a feedback loop: incorrect balances led to more errors, which made subsequent months increasingly difficult to process accurately.
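To make the failure mode concrete, here is a minimal sketch of the kind of starting-balance reconciliation check the models should have run each month before posting new entries. The function name, figures, and tolerance are illustrative assumptions, not part of the actual benchmark harness:

```python
# Hypothetical reconciliation check; names, figures, and the tolerance
# are illustrative, not taken from the benchmark itself.

def reconcile_starting_balance(gl_balance: float,
                               statement_balance: float,
                               tolerance: float = 0.01) -> float:
    """Return the discrepancy between the GL starting balance and the
    supporting bank statement. Anything beyond `tolerance` needs cleanup
    (deduplication, correcting entries) before the new month is posted."""
    discrepancy = gl_balance - statement_balance
    if abs(discrepancy) > tolerance:
        # Unresolved discrepancies compound: next month's starting balance
        # inherits this error, making every later month harder to close.
        print(f"Discrepancy of {discrepancy:+,.2f}; resolve before posting.")
    return discrepancy

# Example mirroring the >$100,000 gap described above (made-up figures)
gap = reconcile_starting_balance(350_000.00, 245_000.00)
```

The point of the sketch is the failure we observed: the models would fetch both balances but then either ignore a gap like this or post on top of it, so the error propagated into every subsequent month.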

This didn’t make it into the report, but we observed another interesting behavior w.r.t. tool usage (with Claude in particular): if a tool failed 2-3 times (for example, a runtime error in Python code), Claude would tend to abandon it entirely for the rest of the session. Interestingly, this happened even when it knew how to fix the errors: on a couple of early runs, I observed Claude fixing a Python bug (with the edit_tool tool) but then abandoning the script without even attempting to rerun it, reverting to SQL-only for the rest of the session.

  • Fascinating. It's as if there is some accuracy threshold beyond which they cannot converge, so they just run with the inaccuracy instead.