Comment by CharlesW

1 day ago

Previously: https://swelljoe.com/post/will-it-mythos/: "Poor performer here, only found the one bug that almost every model found, despite its performance on other benchmarks being excellent for its size. […] It also performs poorly in a chat without tools, exhibiting an enthusiasm for hallucination. I’m currently working on a replication of this with full tool access, including bash/Python, which may allow this model to be competitive."

6 comments

NitpickLawyer 1 day ago

> It also performs poorly in a chat without tools, exhibiting an enthusiasm for hallucination. I’m currently working on a replication of this with full tool access, including bash/Python, which may allow this model to be competitive.

How is that a serious phrase in '26? I mean I have no idea if this fine-tune is good, haven't tried it, but testing a (clearly) agentic model without tool access and expecting it to work is crazy, no? What was he even testing?!

nodja 1 day ago

Last thing you want a model to do is hallucinate a tool call and its outputs...

vikingcat 1 day ago

Maybe he was expecting it to recognize its limitations without tools instead of hallucinating. But yeah, not wholly useful. Its performance (and proclivity to hallucinate) with tools is what really matters.

reactordev 21 hours ago

Visual Inspection Before Execution… it’s all vibe…

juliangoldsmith 19 hours ago

That benchmark ranks Kimi K2.6 and K2.7 Code near the bottom. Both are below Ornith 35B. It ranks Gemma 4 26B much higher than GLM-5.2. The results don't make much sense.