Comment by Havoc

15 hours ago

Remember that test where you ask an LLM whether 9.11 or 9.9 is the bigger number? [Just checked gpt-4o still gets it wrong]

I don't think you'll find many sane CFOs willing to send the resulting numbers to the IRS based on that. That's just asking to get nailed for tax fraud.

It is coming for the very bottom end of bookkeeping work quite soon though, especially for first drafts. There are a lot of people doing stuff like expense classification. And if you give an LLM an invoice it can likely figure out whether it's stationery or rent with high accuracy. OCR and text classification are easier for LLMs than numbers. Things like Concur can basically do this already.

14 comments


ASpring 14 hours ago

> Remember that test where you ask an LLM whether 9.11 or 9.9 is the bigger number? [Just checked gpt-4o still gets it wrong]

Interesting, 4o got this right for me in a couple of different framings, including the simple "Which number is larger, 9.9 or 9.11?". To be a full apologist, there are a few places (software versioning, for one) where 9.11 is essentially the bigger number, so it may be an ambiguous question without context anyway.

multjoy 14 hours ago
How can "which is the larger number" be an ambiguous question?
- Zerot 10 hours ago
  
  Which is the bigger version number? Version 9.9 or version 9.11? Which is the bigger dollar amount? $9.9 or $9.11?
  Periods are not always used for the decimal separator but also as a separator for multiple sets of semi-independent numbers.
  
- com2kid 13 hours ago
  
  As everyone else has said, semver. I use semver so often that my initial reading of 9.9 < 9.11 in a Hacker News comment would evaluate to true.
- acrooks 13 hours ago
  
  There are some contexts where 9.11 is larger than 9.9, such as semver, so it could be ambiguous depending on the context.
- mwigdahl 14 hours ago
  
  Larger in magnitude or in count of digits?
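To make the ambiguity in this subthread concrete, here is a minimal Python sketch of the two readings. The function names are made up for illustration, and the version reading assumes plain dot-separated integers rather than full semver (no pre-release tags or build metadata).

```python
from decimal import Decimal


def larger_as_decimal(a: str, b: str) -> str:
    """Return whichever string is larger when both are read as decimal numbers."""
    return a if Decimal(a) > Decimal(b) else b


def larger_as_version(a: str, b: str) -> str:
    """Return whichever string is larger when both are read as dotted versions.

    Assumption: each component is a plain integer, compared left to right.
    """
    def key(s: str) -> tuple[int, ...]:
        return tuple(int(part) for part in s.split("."))

    return a if key(a) > key(b) else b


print(larger_as_decimal("9.9", "9.11"))  # 9.9
print(larger_as_version("9.9", "9.11"))  # 9.11
```

Same two strings, different answers: as decimals 9.9 > 9.11, but as versions the second component is compared as an integer, so 11 > 9.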

umanwizard 15 hours ago

It gets it right for me... https://chatgpt.com/share/687e8c28-7714-800c-abf4-e9cd3ce87b...

yoyohello13 14 hours ago
Ah, wouldn’t be an LLM discussion thread without one of these “it works/doesn’t” conversations.
- mdaniel 14 hours ago
  
  If it makes you feel any better, the other infamous one "I spend so much time chasing hallucinations, I could have done it myself" is currently a sibling comment
riku_iki 13 hours ago

There were so many embarrassing threads about this that OpenAI surely added it to the training dataset with high priority.

crthpl 15 hours ago

GPT-4o is so far behind the frontier; you shouldn't use it as an indicator of what LLMs are capable of.