Comment by yunyu
8 hours ago
Hey all, member of the benchmark team here! The goal for this project was to see how well LLMs could do bookkeeping without an overly opinionated scaffold. We gave them access to processed transaction records and code execution tools, but it was up to them to choose exactly how to use those.
Claude and Grok 4 did reasonably well (within CPA baselines) for the first few months, but tended to degrade as more data came in. Interestingly, the failures aren't exclusively a context length problem: we reset the context monthly (with past decisions, accruals/deferrals, and comments available via tool calls), and the errors look more like reward hacking than pure hallucination.
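Roughly, the setup looked like this (a minimal sketch, not our actual harness; the tool names, table names, and the agent interface here are illustrative assumptions):

```python
import sqlite3

def run_sql(db_path: str, query: str, params: tuple = ()) -> list[tuple]:
    """SQL tool: read-only access to the processed transaction records."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(query, params).fetchall()

def prior_notes(db_path: str, month: str) -> list[tuple]:
    """Past decisions, accruals/deferrals, and comments live in the database
    and are fetched via tool calls, not carried over in the prompt."""
    return run_sql(
        db_path,
        "SELECT period, kind, note FROM bookkeeping_notes WHERE period < ?",
        (month,),
    )

def close_month(agent, db_path: str, month: str) -> None:
    """One episode per month: fresh context, tools only."""
    agent.reset_context()  # monthly context reset
    agent.register_tool("run_sql", lambda q: run_sql(db_path, q))
    agent.register_tool("prior_notes", lambda: prior_notes(db_path, month))
    agent.run(task=f"Close the books for {month}.")
```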
Accounting is very interesting in an RL-first world, as it is pretty easy to develop intermediate rewards for training models. We are pretty sure we can juice the performance more with a far more rigid scaffold, but that's less relevant from a capabilities research perspective. We're continuing down this research direction and will see how it goes.
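As a toy illustration of what an intermediate reward could look like (assumed journal_entries schema and tolerances, not something we actually trained on): give partial credit for a ledger that balances and a cash account that ties to the bank statement.

```python
import sqlite3

def intermediate_reward(conn: sqlite3.Connection, month: str, stmt_ending_balance: float) -> float:
    """Partial credit for (1) a ledger that balances in the period and
    (2) a GL cash balance that ties to the bank statement at month end."""
    debits, credits = conn.execute(
        "SELECT COALESCE(SUM(debit), 0), COALESCE(SUM(credit), 0) "
        "FROM journal_entries WHERE period = ?", (month,)
    ).fetchone()
    (gl_cash,) = conn.execute(
        "SELECT COALESCE(SUM(debit) - SUM(credit), 0) FROM journal_entries "
        "WHERE account = 'Cash' AND period <= ?", (month,)
    ).fetchone()

    balanced = abs(debits - credits) < 0.01                 # double-entry constraint
    reconciled = abs(gl_cash - stmt_ending_balance) < 0.01  # bank reconciliation
    return 0.5 * balanced + 0.5 * reconciled
```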
Let us know if you have any questions!
It's a start. The world needs a better way to handle bookkeeping, and the existing tools sure aren't cutting it.
Bookkeeping for my small business runs into the tens of thousands of dollars every year, and the amount of human error associated with processing assorted ecommerce and other transactions is astounding, even after extensive planning and SOPs.
The other pain point is Quickbooks. The tool is so sprawling and complex that half the time support agents can't figure out what's wrong. The fact that Intuit jacks up the price every year for this POS is very irritating. They get away with it because they are practically a monopoly, with most small business CPAs locked into their ecosystem.
Hope your team can work out the performance issues. Alternatives to the current bookkeeping options are sorely needed.
> It's a start. The world needs a better way to handle bookkeeping, and the existing tools sure aren't cutting it.
God, please, no. Non-deterministic language models aren't the solution for improving bookkeeping.
Humans (accountants) are non-deterministic too, so I'm unsure whether an LLM would be better or worse if we threw more effort at the problem.
But in general, I tend to side with the "let's leave the math to purpose-built models/applications" camp instead of generalized LLMs. LLMs are great if you are just aiming for "good enough to get through next quarter" type results. If you need 100% accuracy, an LLM isn't going to cut it.
Well, I've seen worse bookkeepers. "You know, you approved the budget, but where are our customers' payments in the balance sheets? We can't find them!" - "Uhm..."
No context on what your business is, but I hated QuickBooks; love Xero, though.
There are some other alternatives too: Zoho, FreshBooks.
Really depends what you do.
Love this as a real world benchmark!
How much prompt iteration did you do? I've noticed when building real world agentic apps that small prompt tweaks can make a huge difference in behavior (re: the reward hacking vs hallucinating). Would love to learn more about the approach here.
Hey, member of the benchmark team. We iterated on the prompts based on observed model behaviors. A few key examples (a rough sketch of the resulting prompt shape follows the list):
Schema introspection: Models were spending significant tokens exploring the database structure through trial-and-error SQL queries, so we included the complete data model in the system prompt upfront.
Reward hacking: We added explicit instructions against gaming the reconciliation checks. This reduced the frequency initially, but models would eventually ignore these constraints.
Domain context: Including company background (YC-backed startup) substantially improved transaction categorization, particularly for startup-specific items like SAFE notes that require domain knowledge to classify correctly.
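Put together, the system prompt ended up shaped roughly like this (a paraphrased sketch, not the verbatim prompt; the schema and wording below are illustrative):

```python
SYSTEM_PROMPT = """
You are doing the monthly bookkeeping for <company>, a YC-backed startup.

Data model (full schema given upfront; no need to introspect it):
  transactions(id, date, amount, counterparty, source, raw_memo)
  journal_entries(id, transaction_id, period, account, debit, credit, memo)
  bookkeeping_notes(period, kind, note)  -- past decisions, accruals/deferrals, comments

Rules:
  - Do not game the reconciliation checks. Never post plug entries whose only
    purpose is to force balances to tie out; investigate discrepancies instead.
  - Classify startup-specific items (e.g. SAFE notes) according to GAAP.
"""
```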
This is a fascinating domain! Many years ago, I studied financial accounting in grad school and even spent some time modeling a double-entry bookkeeping system. The hardest problem, if I recall correctly, wasn't the implementation but the data quality. The world needs a golden dataset of accounting procedures.
Regarding the diminishing returns with frontier models:
My general experience working with LLMs is that they perform better when you work incrementally and avoid contiguous, greedy approaches: aggregate as you go, don't take on incrementally larger tasks, and keep the workload minimal.
Regarding agentic tool building: feels like I'm looking at a window into the future.
It is really curious to see how the performance degraded despite the tool calls. What was different about the first month? Was all of the context there without tool calls in the first month? In the later months it seems like tool calls weren't happening. Shouldn't those have been happening to inform the context?
(Another member of the team behind the benchmark here) The first month performed well because (1) the models effectively leveraged historical precedent - they could identify similar transactions from past data and apply established patterns, and (2) the starting balances were clean, so they were more easily able to understand / track down discrepancies.
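For example, a typical first-month move was to check how a counterparty had been booked in prior periods before categorizing a new transaction, something along these lines (assumed schema, illustrative query, not our actual queries):

```python
# Precedent lookup: find the accounts used for this counterparty in earlier
# periods and reuse the dominant one for the new transaction.
PRECEDENT_SQL = """
SELECT je.account, COUNT(*) AS times_used
FROM journal_entries je
JOIN transactions t ON t.id = je.transaction_id
WHERE t.counterparty = :counterparty
  AND je.period < :current_period
GROUP BY je.account
ORDER BY times_used DESC;
"""
```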
> Was all of the context there without tool calls in the first month?
We provided schemas for the GL and source data in the system prompt, but none of the actual data. The model had to use its tools (SQL and python script) to understand / analyze historical data.
> In the later months it seems like tool calls weren't happening. Shouldn't those have been happening to inform the context?
We actually didn’t find that they stopped calling tools entirely. Instead, they weren’t able to make sense of the information fetched with tools (for example, a bank account starting balance that was >$100,000 different from the starting balance on the supporting bank statement). They’d tend to either do nothing or just do a first pass without deduplicating / cleaning up. This created a feedback loop where incorrect balances led to more errors and made subsequent months increasingly difficult to process accurately.
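To make the failure mode concrete, these are the kinds of sanity checks that got skipped (assumed schema, illustrative code, not from the benchmark): compare the GL's opening cash balance against the bank statement and look for duplicated source transactions before posting the new month.

```python
import sqlite3

def opening_balance_gap(conn: sqlite3.Connection, month: str, stmt_opening_balance: float) -> float:
    """GL opening cash balance vs. the supporting bank statement. A six-figure
    gap here means prior months need cleanup before posting anything new."""
    (gl_opening,) = conn.execute(
        "SELECT COALESCE(SUM(debit) - SUM(credit), 0) FROM journal_entries "
        "WHERE account = 'Cash' AND period < ?", (month,)
    ).fetchone()
    return gl_opening - stmt_opening_balance

# Likely duplicate source transactions that would inflate balances if posted twice:
DUPLICATE_CANDIDATES_SQL = """
SELECT date, amount, counterparty, COUNT(*) AS n
FROM transactions
GROUP BY date, amount, counterparty
HAVING COUNT(*) > 1;
"""
```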
This didn’t make it into the report, but another interesting behavior we observed w.r.t. tool usage (with Claude in particular): if a tool failed 2-3 times (for example, a runtime error in python code), Claude would tend to abandon it entirely for the rest of the session. Interestingly, this happened even when it knew how to fix the errors: on a couple of early runs, I observed Claude fixing a python bug (with the edit_tool tool) but then abandoning the script without even attempting to rerun it, and reverting to SQL-only for the rest of the session.
Fascinating. It's like there is some accuracy threshold beyond which they cannot converge, and instead just run with the inaccuracy.
Is there a detailed overview (like an arXiv paper or an actual training set)?