Comment by theodorewiles

7 months ago

For me this benchmark suggests that an LLM will try to “force the issue”, which results in compounding errors. But I think the logical counterpoint is that you may be asking the LLM to come up with an answer without all of the necessary details. Some of these details are “baked into” historical transactions, which is why it does well in months 1-2.

My takeaway is that scaling in the enterprise is about making implicit information explicit.