Comment by Havoc
15 hours ago
Remember that test where you ask an LLM whether 9.11 or 9.9 is the bigger number? [Just checked gpt-4o still gets it wrong]
I don't think you'll find many sane CFOs willing to send the resulting numbers to the IRS based on that. That's just asking to get nailed for tax fraud.
It is coming for the very bottom end of bookkeeping work quite soon though, especially for first drafts. There are a lot of people doing stuff like expense classification. And if you give an LLM an invoice it can likely figure out whether it's stationery or rent with high accuracy. OCR and text classification are easier for LLMs than numbers. Tools like Concur can basically do this already.
> Remember that test where you ask an LLM whether 9.11 or 9.9 is the bigger number? [Just checked gpt-4o still gets it wrong]
Interesting, 4o got this right for me in a couple of different framings, including the simple "Which number is larger, 9.9 or 9.11?". To be a full apologist, there are a few contexts (a lot of software versioning, for one) where 9.11 is effectively the bigger number, so it may be an ambiguous question without context anyway.
How can "which is the larger number" be an ambiguous question?
Which is the bigger version number? Version 9.9 or version 9.11? Which is the bigger dollar amount? $9.9 or $9.11?
Periods are not only used as the decimal separator but also as a separator between multiple semi-independent numbers.
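To make the ambiguity concrete, here's a minimal Python sketch of the two readings. The helper names `as_decimal` and `as_version` are made up for illustration, and the version parser assumes plain dot-separated integers rather than the full semver grammar (no pre-release or build tags).

```python
from decimal import Decimal


def as_decimal(text: str) -> Decimal:
    """Read the string as one decimal number: the period is a decimal point."""
    return Decimal(text)


def as_version(text: str) -> tuple[int, ...]:
    """Read the string as dot-separated integer components (simplified versioning)."""
    return tuple(int(part) for part in text.split("."))


# Decimal reading: 9.9 is 9.90, which is greater than 9.11.
print(as_decimal("9.9") > as_decimal("9.11"))  # True

# Version reading: tuples compare component by component, so (9, 9) < (9, 11).
print(as_version("9.9") > as_version("9.11"))  # False
```

Same two strings, opposite answers, depending only on what the period is taken to mean.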
As everyone else has said, semver. I use semver so often that my initial reading of 9.9 < 9.11 in a Hacker News comment would evaluate to true.
There are some contexts where 9.11 is larger than 9.9, such as semver, so it could be ambiguous depending on the context.
Larger in magnitude or in count of digits?
It gets it right for me... https://chatgpt.com/share/687e8c28-7714-800c-abf4-e9cd3ce87b...
Ah, wouldn’t be an LLM discussion thread without one of these “it works/doesn’t” conversations.
If it makes you feel any better, the other infamous one "I spend so much time chasing hallucinations, I could have done it myself" is currently a sibling comment
There were so many embarrassing threads about this that OpenAI almost certainly added it to the training dataset with high priority.
GPT-4o is so far behind the frontier; you shouldn't use it as an indicator of what LLMs are capable of.