Comment by billforsternz
9 months ago
I asked Google "how many golf balls can fit in a Boeing 737 cabin" last week. The "AI" answer helpfully broke the solution into 4 stages:
1) A Boeing 737 cabin is about 3000 cubic metres [wrong, about 4x2x40 ~ 300 cubic metres]
2) A golf ball is about 0.000004 cubic metres [wrong, it's about 40cc = 0.00004 cubic metres]
3) 3000 / 0.000004 = 750,000 [wrong, it's 750,000,000]
4) We have to make an adjustment because seats etc. take up room, and we can't pack perfectly. So perhaps 1,500,000 to 2,000,000 golf balls, final answer [wrong, you should have been reducing the number!]
So 1), 2) and 3) were out by 1, 1 and 3 orders of magnitude respectively (the errors partially cancelled out), and 4) was nonsensical.
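For reference, redoing the estimate with the corrected figures looks something like the Python sketch below; every input is still a rough approximation, and the packing and seat adjustments are guesses rather than data.

```python
# Back-of-envelope redo with the corrected figures above.
# Every input is a rough approximation, not a measured value.
cabin_volume = 4 * 2 * 40   # ~320 cubic metres for a 737 cabin
ball_volume = 40e-6         # ~40 cc per golf ball, in cubic metres

raw = cabin_volume / ball_volume
print(f"no seats, perfect packing: {raw:,.0f}")      # ~8,000,000

# Seats and imperfect sphere packing REDUCE the count, they don't raise it.
adjusted = raw * 0.74 * 0.70   # ~74% packing density, ~30% lost to fixtures (guesses)
print(f"with rough adjustments:   {adjusted:,.0f}")  # ~4,100,000
```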
This little experiment made me skeptical about the state of the art of AI. I have seen much AI output which is extraordinary; it's funny how one serious fail can impact my point of view so dramatically.
> I have seen much AI output which is extraordinary; it's funny how one serious fail can impact my point of view so dramatically.
I feel the same way. It's like discovering for the first time that magicians aren't doing "real" magic, just sleight of hand and psychological tricks. From that point on, it's impossible to be convinced that a future trick is real magic, no matter how impressive it seems. You know it's fake even if you don't know how it works.
I think there is a big divide here. Every adult on earth knows magic is "fake", but some can still be amazed and entertained by it, while others find it utterly boring because it's fake, and the only possible (mildly) interesting thing about it is to try to figure out what the trick is.
I'm in the second camp but find it kind of sad and often envy the people who can stay entertained even though they know better.
Idk I don’t think of it as fake - it’s creative fiction paired with sometimes highly skilled performance. I’ve learned a lot about how magic tricks work and I still love seeing performers do effects because it takes so much talent to, say, hold and hide 10 coins in your hands while showing them as empty or to shuffle a deck of cards 5x and have the audience cut it only to pull 4 aces off the top.
I think the problem-solving / want-to-be-engineer side of my brain lights up in that "how did he do that??" way. To me that's the fun of it... I immediately try to engineer my own solutions to what I just saw happen. So I guess I'm in the first camp, but find trying to figure out the trick hugely interesting.
I love magic, and illusions in general. I know that Disney's Haunted Mansion doesn't actually have ghosts. But it looks pretty convincing, and watching the documentaries about how they made it is pretty mind-blowing especially considering that they built the original long before I was born.
I look at optical illusions like The Dress™ and am impressed that I cannot force my brain to see it correctly even though I logically know what color it is supposed to be.
Finding new ways that our brains can be fooled despite knowing better is kind of a fun exercise in itself.
I think magic is extremely interesting (particularly close-up magic), but I also hate the mindset (which seems to be common though not ubiquitous) that stigmatizes any curiosity in how the trick works.
In my view, the trick as it is intended to appear to the audience and the explanation of how the trick is performed are equal and inseparable aspects of my interest as a viewer. Either one without the other is less interesting than the pair.
3 replies →
It's still entertaining, that's true. I like magic tricks.
The point is the analogy to LLMs. A lot of people are very optimistic about their capabilities, while other people who have "seen behind the curtain" are skeptical, and feel that the fundamental flaws are still there even if they're better-hidden.
To be fair, I love that magicians can pull tricks on me even though I know it is fake.
2.5 Pro nails each of these calculations. I don't agree with Google's decision to use a weak model in its search queries, but you can't say progress on LLMs is bullshit as evidenced by a weak model no one thinks is close to SOTA.
It's fascinating to me when you ask one for translated passages from authors who never wrote or translated the work in question, especially if they passed away before the piece was written.
The AI will create something for you and tell you it's that author's own work.
"That's impossible because..."
"Good point! Blah blah blah..."
Absolutely shameless!
I just asked my company-approved AI chatbot the same question.
It got the golf ball volume right (0.00004068 cubic meters), but it still overestimated the cabin volume at 1000 cubic meters.
Its final calculation was reasonably accurate at 24,582,115 golf balls - even though 1000 ÷ 0.00004068 = 24,582,104. Maybe it was using more significant figures for the golf ball size than it showed in its answer?
It didn't acknowledge other items in the cabin (like seats) reducing its volume, but it did at least acknowledge inefficiencies in packing spherical objects and suggested the actual number would be "somewhat lower", though it did not offer an estimate.
When I pressed it for an estimate, it used a packing density of 74% and gave an estimate of 18,191,766 golf balls. That's one more than the calculation should have produced, but arguably insignificant in context.
Next I asked it to account for fixtures in the cabin such as seats. It estimated a 30% reduction in cabin volume and redid the calculations with a cabin volume of 700 cubic meters. These calculations were much less accurate. It told me 700 ÷ 0.00004068 = 17,201,480 (off by ~6k). And it told me 17,201,480 × 0.74 was 12,728,096 (off by ~1k).
I told it the calculations were wrong and to try again, but it produced the same numbers. Then I gave it the correct answer for 700 ÷ 0.00004068. It told me I was correct and redid the last calculation correctly using the value I provided.
Of all the things for an AI chatbot which can supposedly "reason" to fail at, I didn't expect it to be basic arithmetic. The one I used was closer, but it was still off by a lot at times despite the calculations being simple multiplication and division. Even if it might not matter in the context of filling an airplane cabin with golf balls, it does not inspire trust for more serious questions.
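For comparison, here's a quick throwaway Python check of the same steps, using the chatbot's own inputs (the 1000 m³ cabin and 0.00004068 m³ ball). It's only a sanity check of the arithmetic, not a claim about how the chatbot works:

```python
ball = 0.00004068    # m^3, the golf ball volume the chatbot used
cabin = 1000.0       # m^3, the chatbot's (over)estimated cabin volume

print(f"{cabin / ball:,.0f}")        # 24,582,104  (chatbot said 24,582,115)
print(f"{700 / ball:,.0f}")          # 17,207,473  (chatbot said 17,201,480)
print(f"{17_201_480 * 0.74:,.0f}")   # 12,729,095  (chatbot said 12,728,096)
```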
> Its final calculation was reasonably accurate at 24,582,115 golf balls - even though 1000 ÷ 0.00004068 = 24,582,104. Maybe it was using more significant figures for the golf ball size than it showed in its answer?
1000 ÷ 0.00004068 = 25,000,000. I think this is an important point that's increasingly widely misunderstood. All those extra digits you show are just meaningless noise and should be ruthlessly eliminated. If 1000 cubic metres in this context really meant 1000.000 cubic metres, then by all means show maybe the four digits of precision you get from the golf ball (but I am more inclined to think 1000 cubic metres is actually the roughest of rough approximations, with just one digit of precision).
In other words, I don't fault the AI for mismatching one set of meaninglessly precise digits for another, but I do fault it for using meaninglessly precise digits in the first place.
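To make that concrete, here is a tiny helper for rounding to a given number of significant figures; the `round_sig` function is hypothetical, written just for illustration, not anything the chatbot uses:

```python
import math

def round_sig(x: float, sig: int) -> float:
    """Round x to `sig` significant figures."""
    if x == 0:
        return 0.0
    return round(x, sig - 1 - math.floor(math.log10(abs(x))))

print(round_sig(1000 / 0.00004068, 2))  # 25000000.0  (two significant figures)
print(round_sig(1000 / 0.00004068, 1))  # 20000000.0  (one significant figure)
```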
I agree those digits are not significant in the context of the question asked. But if the AI is going to use that level of precision in the answer, I expect it to be correct.
1 reply →
Just tried with o3-mini-high and it came up with something pretty reasonable: https://chatgpt.com/share/67f35ae9-5ce4-800c-ba39-6288cb4685...
It's just the usual HN sport: ask a low-end, obsolete or unspecified model, get a bad answer, brag about how you "proved" AI is pointless hype, collect karma.
Edit: Then again, maybe they have a point, going by an answer I just got from Google's best current model ( https://g.co/gemini/share/374ac006497d ) I haven't seen anything that ridiculous from a leading-edge model for a year or more.
Weird thing is, in Google AI Studio all their models, from the state-of-the-art Gemini 2.5 Pro to the lightweight Gemma 2, gave a roughly correct answer. Most even recognised the packing efficiency of spheres.
But Google search gave me the exact same slop you mentioned. So whatever Search is using, they must be using their crappiest, cheapest model. It's nowhere near state of the art.
Makes sense that search has a small, fast, dumb model designed to summarize and not to solve problems. Nearly 14 billion Google searches per day. Way too much compute needed to use a bigger model.
Massive search overlap though - and some questions (like the golf ball puzzle) can be cached for a long time.
3 replies →
I have a strong suspicion that for all the low-threshold APIs/services, before the real model sees my prompt, it gets evaluated by a quick model to see if it's something they care to bother the big models with. If not, I get something shaken out of the sleeve of a bottom-of-the-barrel model.
Google is shooting themselves in the foot with whatever model they use for search. It's probably a 2B or 4B model to keep up with demand, and man is it doing way more harm than good.
It's most likely one giant ["input token close enough question hash"] = answer_with_params_replay? It doesn't misunderstand the question, it just tries to squeeze the input into something close enough?
It'll get it right next time because they'll hoover up the parent post.
This reminds me of the Google quick answers we had for a time in search. It was quite funny if you lived outside the US, because it very often got the units or numbers wrong because of different decimal separators.
No wonder Trump isn't afraid to put tariffs on Canada. Who could take a 3.8-square-mile country seriously?
I've seen humans make exactly these sorts of mistakes?
As another commenter mentioned, LLMs tend to make these bad mistakes with enormous confidence. And because they represent SOTA technology (and can at times deliver incredible results), they have extra credence.
More than even filling the gaps in knowledge / skills, it would be a huge advancement in AI for it to admit when it doesn't know the answer or is just wildly guessing.
A lot of humans are similarly good at some stuff and bad at other things.
Looking up the math ability of the average American, this is given as an example of the median level (from https://www.wyliecomm.com/2021/11/whats-the-latest-u-s-numer...):
>Review a motor vehicle logbook with columns for dates of trip, odometer readings and distance traveled; then calculate trip expenses at 35 cents a mile plus $40 a day.
Which is ok but easier than golf balls in a 747 and hugely easier than USAMO.
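That logbook task boils down to a single linear formula; a toy version with made-up figures (say 120 miles over 2 days) looks like:

```python
# Hypothetical trip: 120 miles driven over 2 days.
miles, days = 120, 2
expenses = miles * 0.35 + days * 40   # 35 cents a mile plus $40 a day
print(f"${expenses:.2f}")             # $122.00
```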
Another question you could try from the easy math end is: Someone calculated the tariff rate for a country as (trade deficit)/(total imports from the country). Explain why this is wrong.