Step 3.5 Flash – Open-source foundation model, supports deep reasoning at speed

10 hours ago (static.stepfun.com)

> 51.0% on Terminal-Bench 2.0, proving its ability to handle sophisticated, long-horizon tasks with unwavering stability

I don't know anything about Terminal-Bench, but on the face of it, a 51% score on a test metric doesn't sound like it would guarantee 'unwavering stability' on sophisticated long-horizon tasks.

  • That score is on par with Gemini 3 Flash but these scores look much more affected by the agent used than the model, from scrolling through the results.

It's nice to see more focus on efficiency. All the recent model releases have come with massive jumps in certain benchmarks, but when you dig into it, it's almost always paired with a massive increase in token usage to achieve those results (ahem Google Deep Think ahem). For AI to truly be transformational it needs to solve the electricity problem.

  • And not just token usage, expensive token usage; when it comes to tokens/joule not all tokens are equal. Efficient use of MoE architectures does have an impact on tokens/joule and tokens/sec.
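To make the tokens/joule point concrete, here is a rough sketch. The throughput, power draw, and token counts are made-up illustrative numbers, not measurements of any model mentioned in this thread, and the function names are hypothetical.

```python
def tokens_per_joule(tokens_per_second: float, watts: float) -> float:
    """Tokens generated per joule (1 watt = 1 joule per second)."""
    return tokens_per_second / watts


def energy_joules(total_tokens: float, tokens_per_second: float, watts: float) -> float:
    """Energy spent generating total_tokens at a steady speed and power draw."""
    return total_tokens * watts / tokens_per_second


# Hypothetical setup: 100 tokens/s while drawing 500 W.
efficiency = tokens_per_joule(100, 500)

# Same hardware and speed, but one answer needs far more reasoning tokens.
verbose_answer = energy_joules(22_000, 100, 500)
terse_answer = energy_joules(6_000, 100, 500)

print(f"{efficiency:.2f} tokens/joule")
print(f"verbose: {verbose_answer / 3.6e6:.4f} kWh, terse: {terse_answer / 3.6e6:.4f} kWh")
```

Under these assumptions the energy per answer scales directly with how many tokens the model emits, which is why a benchmark gain bought with extra reasoning tokens is not free.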

Hallucinates like crazy. Use with caution. I tested it with simple prompts like "Find me championship decks for X pokemon" and "How does Y deck work". Opus 4.6, Deepseek and Kimi all performed well, as expected.

  • I mean, is it possible the latter models used Search? Not saying Stepfun's perfect (it is not.) Gemini especially and unsurprisingly uses search a lot and it is ridiculously fast, too.

Number of params isn’t really the relevant metric imo. Top models don’t support local inference. More relevant is tokens per dollar or per second.

Recent model released a couple of weeks ago. "Mixture of Experts (MoE) architecture, it selectively activates only 11B of its 196B parameters per token". Beats Kimi K2.5 and GLM 4.7 on more benchmarks than it loses to them.

Edit: there are 4-bit quants that can be run on a 128GB machine like a GB10 [1], AI Max+ 395, or Mac Studio.

[1] https://forums.developer.nvidia.com/t/running-step-3-5-flash...
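A back-of-the-envelope check on why a 4-bit quant fits in 128GB, using the 196B total / 11B active figures quoted above. This is a sketch that assumes exactly 4 bits per weight and counts weights only; real quant formats store extra scale metadata, and the KV cache and runtime need memory on top. The function name is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 196e9   # total parameters, from the quoted model description
ACTIVE_PARAMS = 11e9   # parameters activated per token, same source

q4_gb = weight_memory_gb(TOTAL_PARAMS, 4)     # idealized 4-bit quant
bf16_gb = weight_memory_gb(TOTAL_PARAMS, 16)  # unquantized 16-bit weights
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"4-bit weights:  ~{q4_gb:.0f} GB")
print(f"16-bit weights: ~{bf16_gb:.0f} GB")
print(f"active per token: {active_fraction:.1%} of parameters")
```

Note that all 196B parameters still have to sit in memory even though only about 11B are used per token; the MoE sparsity helps speed, not footprint.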

  • > Beats Kimi K2.5 and GLM 4.7 on more benchmarks than it loses to them.

    Does this really mean anything? I, for example, tend to ignore certain benchmarks that are focused on agentic tasks because that is not my use case. Instruction following, long-context reasoning and low hallucination rates carry more weight for me.

SWE-bench Verified is nice, but we need better SWE benchmarks. Making a fair benchmark is a lot of work, and running it continuously takes a lot of money.

Most "live" benchmarks are not run often enough on recent models to give you a good picture of which models win.

The idea of a live benchmark is great! There are thousands of GitHub issues that are resolved with a PR every day.

Holy moly, I gave it a simple coding prompt and the amount of reasoning output could fill a small book.

> create a single html file with a voxel car that drives in a circle.

Compared to GLM 4.7 / 5 and Kimi 2.5 it took a while. The output was fast, but because it wrote so much I had to wait longer. Also, the output was more bare-bones compared to the others.

  • That's been my experience as well. Huge amounts of reasoning. The model itself is good but even if you get twice as many tokens as with another model, the added amount of reasoning may make it slower in the end.
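The trade-off described above is simple arithmetic: wall-clock time is total tokens divided by speed, so double the tokens/sec loses if the model emits more than double the tokens. A sketch with made-up numbers (not measurements of any of these models; the function name is hypothetical):

```python
def wall_clock_seconds(reasoning_tokens: int, answer_tokens: int,
                       tokens_per_second: float) -> float:
    """Time to stream the full response, reasoning included."""
    return (reasoning_tokens + answer_tokens) / tokens_per_second


# Hypothetical fast model that reasons at length.
fast_but_verbose = wall_clock_seconds(20_000, 2_000, tokens_per_second=100)

# Hypothetical model at half the speed that reasons much less.
slow_but_terse = wall_clock_seconds(4_000, 2_000, tokens_per_second=50)

print(f"fast but verbose: {fast_but_verbose:.0f} s")
print(f"slow but terse:   {slow_but_terse:.0f} s")
```

With these assumed figures the model with twice the throughput still takes nearly twice as long to finish.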

Interesting.

Each time a Chinese model makes the news, I wonder: How come no major models are coming from Japan or Europe?

  • You would be surprised how far behind the times the Japanese IT industry is (a decade at least, IMO). There is only a very limited startup culture here (in size, talent pool, and business ideas), there is no real risk-taking venture capital market (maybe Masayoshi Son is the exception, but he tends to invest mostly in the US), and most software companies use very, very outdated management practices. On top of that, most software development has been outsourced to India, Vietnam, China, etc., so management sees no value in software talent... SW engineers' social recognition here is mostly on the level of accountants. Under such circumstances Japan will never have a chance to contribute to AI meaningfully (other than niche academic research).

  • 1. The US and China are the two biggest economies by GDP. 2. The US is the default destination for worldwide investors (because of historically good returns). China has a huge state economy and the state can direct investment into this area.

  • The Koreans have released some good models lately. And Mistral is also releasing open-weights models that aren't too shabby.

  • Have you heard of Pleias? Their SLM Baguettotron is blazingly fast, and surprisingly good at reasoning (but it's not programming-oriented).

So who exactly is StepFun? What is their business (how do they make money)? Each time I click “About Stepfun” somewhere on their website, it sends me to a generic landing page in a loop.