Comment by wedn3sday
1 month ago
The only metric I really care about, and the one that I think shows the fundamental failure of LLMs as a technology, is this one here [1]. The fact that o1 fails a non-zero amount of the time on the question, "what is 6*1?" means that the models just do not "understand" _anything_ and are still just fancy stochastic parrots. Now, stochastic parrots are still useful! Just not the digital god a lot of people seem to think we're heading towards.
[1] https://www.reddit.com/media?url=https%3A%2F%2Fpreview.redd....
I'm not seeing anything in that graph that implies that o1 ever fails on "what is 6*1?" The chart is graphing the number of digits on each axis; it fails on "what is (some 6 digit number) * (some 1 digit number)"
I don't think this will or necessarily should ever be fixed. The eventual solution (I imagine) will be to simply plug in a calculator. All the MCP talk on HN pushed me to try MCP out, and I'm sold. A Swiss army knife of tools like a calculator available would let a brain do what a brain is best at, and a calculator what a calculator is best at.
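The tool side of that is simple enough to sketch. This isn't the actual MCP wiring (the server/protocol glue is elided), just an illustrative Python calculator tool of the kind you'd expose: it safely evaluates arithmetic strings so the model never has to multiply in its weights. The function name and structure here are hypothetical.

```python
# Illustrative "calculator tool" an LLM could call instead of doing
# arithmetic itself. Uses Python's ast module to evaluate arithmetic
# only, rejecting anything else. Not the real MCP API, just the idea.
import ast
import operator

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def calculator(expression: str):
    """Evaluate a plain arithmetic expression like '123456 * 7'."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError("only arithmetic is allowed")
    return ev(ast.parse(expression, mode="eval"))

# The model hands the string over instead of guessing:
print(calculator("123456 * 7"))  # 864192
```

The point being: a (6-digit) * (1-digit) product is trivial for the tool and unreliable for the model, so route it to the tool.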
The chart you show is about the accuracy of x*y where x and y have an increasing number of digits.
This graph shows that both o1 and o3-mini are better at mental arithmetic than any human I have known. It only starts to break down towards calculating the product of two eight-digit factors etc.
Humanity fails that question an embarrassingly large number of times.
I can proudly (?) proclaim I will never fail that question. Pretty sure I don't know anyone who would either, including my 7yo.
I've certainly heard things wrong before, and had high fever, and loads of alcohol, and migraines, and high sleep deprivation, and adrenaline. All things that greatly affect whether I can do something seemingly simple or not.
Your 7yo can multiply a 5-digit number by another 5-digit number with >95% accuracy?