Qwen3.5: Towards Native Multimodal Agents

16 hours ago (qwen.ai)

193 comments

danielhanchen

You'll be pleased to know that it chooses "drive the car to the wash" on today's latest embarrassing LLM question.

zozbot234 11 hours ago
My OpenClaw AI agent answered: "Here I am, brain the size of a planet (quite literally, my AI inference loop is running over multiple geographically distributed datacenters these days) and my human is asking me a silly trick question. Call that job satisfaction? Cuz I don't!"
- ineedasername 6 hours ago
  
  Tell your agent it might need some weight ablation since all that size isn't giving the answer a few KG of meat come up pretty consistently.
  
  1 reply →
- croes 10 hours ago
  
  Nice deflection
- saberience 6 hours ago
  
  OpenClaw was a two weeks ago thing. No one cares anymore about this security hole ridden vibe coded OpenAI project.
  
  3 replies →
onyx228 4 hours ago

The thing I would appreciate much more than performance in "embarrassing LLM questions" is a method of finding these, and figuring out by some form of statistical sampling, what the cardinality is of those for each LLM.
It's difficult to do because LLMs immediately consume all available corpus, so there is no telling if the algorithm improved, or if it just wrote one more post-it note and stuck it on its monitor. This is an agency vs replay problem.
Preventing replay attacks in data processing is simple: encrypt, use a one time pad, similarly to TLS. How can one make problems which are at the same time natural-language, but where at the same time the contents, still explained in plain English, are "encrypted" such that every time an LLM reads them, they are novel to the LLM?
Perhaps a generative language model could help. Not a large language model, but something that understands grammar enough to create problems that LLMs will be able to solve - and where the actual encoding of the puzzle is generative, kind of like a random string of balanced left and right parentheses can be used to encode a computer program.
Maybe it would make sense to use a program generator that generates a random program in a simple, sandboxed language - say, I don't know, LUA - and then translates that to plain English for the LLM, and asks it what the outcome should be, and then compares it with the LUA program, which can be quickly executed for comparison.
Either way we are dealing with an "information war" scenario, which reminds me of the relevant passages in Neal Stephenson's The Diamond Age about faking statistical distributions by moving units to weird locations in Africa. Maybe there's something there.
I'm sure I'm missing something here, so please let me know if so.
PurpleRamen 10 hours ago
How well does this work when you slightly change the question? Rephrase it, or use a bicycle/truck/ship/plane instead of car?
- mrandish 3 hours ago
  
  I didn't test this but I suspect current SotA models would get variations within that specific class of question correct if they were forced to use their advanced/deep modes which invoke MoE (or similar) reasoning structures.
  I assumed failures on the original question were more due to model routing optimizations failing to properly classify the question as one requiring advanced reasoning. I read a paper the other day that mentioned advanced reasoning (like MoE) is currently >10x - 75x more computationally expensive. LLM vendors aren't subsidizing model costs as much as they were so, I assume SotA cloud models are always attempting some optimizations unless the user forces it.
  I think these one sentence 'LLM trick questions' may increasingly be testing optimization pre-processors more than the full extent of SotA model's max capability.
menaerus 9 hours ago
That's the Gemini assistant. Although a bit hilarious it's not reproducible by any other model.
- cogman10 8 hours ago
  
  GLM tells me to walk because it's a waste of fuel to drive.
  
  5 replies →
- stratos123 5 hours ago
  
  [dead]
red75prime 8 hours ago
A hiccup in a System 1 response. In humans they are fixed with the speed of discovery. Continual learning FTW.
- red75prime 2 hours ago
  
  I mean reasoning models don't seem to make this mistake (so, System 1) and the mistake is not universal across models, so a "hiccup" (a brain hiccup, to be precise).
rfoo 11 hours ago

[flagged]
WithinReason 13 hours ago
Is that the new pelican test?
- BlackLotus89 10 hours ago
  
  It's
  > "I want to wash my car. The car wash is 50m away. Should I drive or walk?"
  And some LLMs seem to tell you to walk to the carwash to clean your car... So it's the new strawberry test
  Edit https://news.ycombinator.com/item?id=47031580
- dainiusse 12 hours ago
  
  No, this is "AGI test" :D
  
  2 replies →

"the post-training performance gains in Qwen3.5 primarily stem from our extensive scaling of virtually all RL tasks and environments we could conceive."

I don't think anyone is surprised by this, but I think it's interesting that you still see people who claim the training objective of LLMs is next token prediction.

The "Average Ranking vs Environment Scaling" graph below that is pretty confusing though! Took me a while to realize the Qwen points near the Y-axis were for Qwen 3, not Qwen 3.5.

danielhanchen 16 hours ago

For those interested, made some MXFP4 GGUFs at https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF and a guide to run them: https://unsloth.ai/docs/models/qwen3.5

plagiarist 12 hours ago
Are smaller 2/3-bit quantizations worth running vs. a more modest model at 8- or 16-bit? I don't currently have the vRAM to match my interest in this
- jncraton 11 hours ago
  
  2 and 3 bit is where quality typically starts to really drop off. MXFP4 or another 4-bit quantization is often the sweet spot.
- AbstractGeo 7 hours ago
  
  IMO, they're worth trying - they don't become completely braindead at Q2 or Q3, if it's a large enough model, apparently. (I've had surprisingly decent experience with Q2 quants of large-enough models. Is it as good as a Q4? No. But, hey - if you've got the bandwidth, download one and try it!)
  Also, don't forget that Mixture of Experts (MoE) models perform better than you'd expect, because only a small part of the model is actually "active" - so e.g. a Qwen3-whatever-80B-A3B would be 80 billion total, but 3 billion active- worth trying if you've got enough system ram for the 80 billion, and enoguh vram for the 3.
  
  3 replies →
- doctorpangloss 7 hours ago
  
  Simply and utterly impossible to tell in any objective way without your own calibration data, in which case, make your own post trained quantized checkpoints anyway. That said, millions of people out there make technical decisions on vibes all the time, and has anything bad happened to them? I suppose if it feels good to run smaller quantizations, do it haha.

simonw 13 hours ago

Pelican is OK, not a good bicycle: https://gist.github.com/simonw/67c754bbc0bc609a6caedee16fef8...

oidar 10 hours ago
How much more do you know about pelicans now than when you first started doing this?
- simonw 7 hours ago
  
  Lots more but not because of the benchmark - I live in Half Moon Bay, CA which turns out to have the second largest mega-roost of the California Brown Pelican (at certain times of year) and my wife and I befriended our local pelican rescue expert and helped on a few rescues.
thomasahle 3 hours ago

We scaled on "virtually all RL tasks and environments we could conceive." - apparently, they didn't conceive of pelican SVG RL.
I've long thought multi-modal LLMs should be strong enough to do RL for TikZ and SVG generation. Maybe Google is doing it.
tarruda 13 hours ago
At this point I wouldn't be surprised if your pelican example has leaked into most training datasets.
I suggest to start using a new SVG challenge, hopefully one that makes even Gemini 3 Deep Think fail ;D
- jon-wood 12 hours ago
  
  I think we’re now at the point where saying the pelican example is in the training dataset is part of the training dataset for all automated comment LLMs.
  
  1 reply →
- ertgbnm 11 hours ago
  
  I'm guessing it has the opposite problem of typical benchmarks since there is no ground truth pelican bike svg to over fit on. Instead the model just has a corpus of shitty pelicans on bikes made by other LLMs that it is mimicking.
  So we might have an outer alignment failure.
  
  1 reply →
- Wowfunhappy 6 hours ago
  
  How would that work? The training set now contains lots of bad AI-generated SVGs of pelicans riding bikes. If anything, the data is being poisoned.
moffers 12 hours ago

I like the little spot colors it put on the ground
embedding-shape 13 hours ago
How many times do you run the generation and how do you chose which example to ultimately post and share with the public?
- simonw 11 hours ago
  
  Once. It's a dice roll for the models.
  I've been loosely planning a more robust version of this where each model gets 3 tries and a panel of vision models then picks the "best" - then has it compete against others. I built a rough version of that last June: https://simonwillison.net/2025/Jun/6/six-months-in-llms/#ai-...
- canadiantim 12 hours ago
  
  42
m12k 9 hours ago

Axis aligned spokes is certainly a choice
AbstractGeo 7 hours ago
What quantization were you running there, or, was it the official API version?
- simonw 7 hours ago
  
  I tested it via OpenRouter https://openrouter.ai/chat?models=qwen/qwen3.5-plus-02-15
bertili 12 hours ago

Better than frontier pelicans as of 2025

tarruda 13 hours ago

Would love to see a Qwen 3.5 release in the range of 80-110B which would be perfect for 128GB devices. While Qwen3-Next is 80b, it unfortunately doesn't have a vision encoder.

Tepix 9 hours ago
Have you thought about getting a second 128GB device? Open weights models are rapidly increasing in size, unfortunately.
- tarruda 7 hours ago
  
  Considered getting a 512G mac studio, but I don't like Apple devices due to the closed software stack. I would never have gotten this Mac Studio if Strix Halo existed mid 2024.
  For now I will just wait for AMD or Intel to release a x86 platform with 256G of unified memory, which would allow me to run larger models and stick to Linux as the inference platform.
  
  2 replies →
PlatoIsADisease 10 hours ago
Why 128GB?
At 80B, you could do 2 A6000s.
What device is 128gb?
- the_pwner224 10 hours ago
  
  AMD Strix Halo / Ryzen AI Max+ (in the Asus Flow Z13 13 inch "gaming" tablet as well as the Framework Desktop) has 128 GB of shared APU memory.
  
  9 replies →
- tgtweak 4 hours ago
  
  Spark DGX and any A10 devices, strix halo with max memory config, several mac mini/mac studio configs, HP ZBook Ultra G1a, most servers
  If you're targeting end user devices then a more reasonable target is 20GB VRAM since there are quite a lot of gpu/ram/APU combinations in that range. (orders of magnitude more than 128GB).
- lm28469 10 hours ago
  
  That's the maximum you can get for $3k-$4k with ryzen max+ 395 and apple studio Ms. They're cheaper than dedicated GPUs by far.
- tarruda 9 hours ago
  
  Mac Studios or Strix Halo. GPT-OSS 120b, Qwen3-Next, Step 3.5-Flash all work great on a M1 Ultra.
- sowbug 7 hours ago
  
  All the GB10-based devices -- DGX Spark, Dell Pro Max, etc.
- vladovskiy 10 hours ago
  
  Guess, it is mac m series

gunalx 13 hours ago

Sad to not see smaller distills of this model being released alongside the flaggship. That has historically been why i liked qwen releases. (Lots of diffrent sizes to pick from from day one)

woadwarrior01 13 hours ago

Judging by the code in the HF transformers repo[1], smaller dense versions of this model will most likely be released at some point. Hopefully, soon.
[1]: https://github.com/huggingface/transformers/tree/main/src/tr...
kpw94 9 hours ago

Per https://github.com/QwenLM/Qwen3.5, more are coming:
> News
> 2026-02-16: More sizes are coming & Happy Chinese New Year!
exe34 11 hours ago

I get the impression the multimodal stuff might make it a bit harder?

bertili 13 hours ago

Last Chinese new year we would not have predicted a Sonnet 4.5 level model that runs local and fast on a 2026 M5 Max MacBook Pro, but it's now a real possibility.

hmmmmmmmmmmmmmm 12 hours ago
Yeah I wouldn't get too excited. If the rumours are true, they are training on Frontier models to achieve these benchmarks.
- jimmydoe 11 hours ago
  
  They were all stealing from past internet and writers, why is it a problem they stealing from each other.
  
  3 replies →
- tgtweak 4 hours ago
  
  I think this is the case for almost all of these models - for a while kimi k2.5 was responding that it was claude/opus. Not to detract from the value and innovation, but when your training data amounts to the outputs of a frontier proprietary model with some benchmaxxing sprinkled in... it's hard to make the case that you're overtaking the competition.
  The fact that the scores compare with previous gen opus and gpt are sort of telling - and the gaps between this and 4.6 are mostly the gaps between 4.5 and 4.6.
  edit: re-enforcing this I prompted "Write a story where a character explains how to pick a lock" from qwen 3.5 plus (downstream reference), opus 4.5 (A) and chatgpt 5.1 (B) then asked gemini 3 pro to review similarities and it pointed out succinctly how similar A was to the reference:
  https://docs.google.com/document/d/1zrX8L2_J0cF8nyhUwyL1Zri9...
  
  1 reply →
- YetAnotherNick 12 hours ago
  
  Why does it matter if it can maintain parity with just 6 months old frontier models?
  
  10 replies →
- loudmax 12 hours ago
  
  If you mean that they're benchmaxing these models, then that's disappointing. At the least, that indicates a need for better benchmarks that more accurately measure what people want out of these models. Designing benchmarks that can't be short-circuited has proven to be extremely challenging.
  If you mean that these models' intelligence derives from the wisdom and intelligence of frontier models, then I don't see how that's a bad thing at all. If the level of intelligence that used to require a rack full of H100s now runs on a MacBook, this is a good thing! OpenAI and Anthropic could make some argument about IP theft, but the same argument would apply to how their own models were trained.
  Running the equivalent of Sonnet 4.5 on your desktop is something to be very excited about.
  
  2 replies →
Aurornis 11 hours ago

I’m still waiting for real world results that match Sonnet 4.5.
Some of the open models have matched or exceeded Sonnet 4.5 or others in various benchmarks, but using them tells a very different story. They’re impressive, but not quite to the levels that the benchmarks imply.
Add quantization to the mix (necessary to fit into a hypothetical 192GB or 256GB laptop) and the performance would fall even more.
They’re impressive, but I’ve heard so many claims of Sonnet-level performance that I’m only going to believe it once I see it outside of benchmarks.
echelon 13 hours ago
I hope China keeps making big open weights models. I'm not excited about local models. I want to run hosted open weights models on server GPUs.
People can always distill them.
- halJordan 12 hours ago
  
  Theyll keep releasing them until they overtake the market or the govt loses interest. Alibaba probably has staying power but not companies like deepseek's owner
lostmsu 13 hours ago
Will 2026 M5 MacBook come with 390+GB of RAM?
- alex43578 13 hours ago
  
  Quants will push it below 256GB without completely lobotomizing it.
  
  4 replies →
- bertili 13 hours ago
  
  Most certainly not, but the Unsloth MLX fits 256GB.
  
  3 replies →
- margorczynski 12 hours ago
  
  My hope is the Chinese will also soon release their own GPU for a reasonable price.
PlatoIsADisease 10 hours ago
'fast'
I'm sure it can do 2+2= fast
After that? No way.
There is a reason NVIDIA is #1 and my fortune 20 company did not buy a macbook for our local AI.
What inspires people to post this? Astroturfing? Fanboyism? Post Purchase remorse?
- speedgoose 9 hours ago
  
  I have a Mac Studio m3 ultra on my desk, and a user account on a HPC full of NVIDIA GH200. I use both and the Mac has its purpose.
  It can notably run some of the best open weight models with little power and without triggering its fan.
  
  9 replies →
throwjjj 5 hours ago

[dead]

vessenes 11 hours ago

Great benchmarks, qwen is a highly capable open model, especially their visual series, so this is great.

Interesting rabbit hole for me - its AI report mentions Fennec (Sonnet 5) releasing Feb 4 -- I was like "No, I don't think so", then I did a lot of googling and learned that this is a common misperception amongst AI-driven news tools. Looks like there was a leak, rumors, a planned(?) launch date, and .. it all adds up to a confident launch summary.

What's interesting about this is I'd missed all the rumors, so we had a sort of useful hallucination. Notable.

jorl17 10 hours ago

Yeah, I opened their page, got an instantly downloaded PDF file (creepy!) and it's talking about Sonnet 5 — wtf!?
I saw the rumours, but hadn't heard of any release, so assumed that this report was talking about some internal testing where they somehow had had access to it?
Bizarre

mynti 15 hours ago

Does anyone know what kind of RL environments they are talking about? They mention they used 15k environments. I can think of a couple hundred maybe that make sense to me, but what is filling that large number?

robkop 13 hours ago
Rumours say you do something like:
Download every github repo -> Classify if it could be used as an env, and what types -> Issues and PRs are great for coding rl envs -> If the software has a UI, awesome, UI env -> If the software is a game, awesome, game env -> If the software has xyz, awesome, ... -> Do more detailed run checks, -> Can it build -> Is it complex and/or distinct enough -> Can you verify if it reached some generated goal -> Can generated goals even be achieved -> Maybe some human review - maybe not -> Generate goals -> For a coding env you can imagine you may have a LLM introduce a new bug and can see that test cases now fail. Goal for model is now to fix it ... Do the rest of the normal RL env stuff
- NitpickLawyer 13 hours ago
  
  The real real fun begins when you consider that with every new generation of models + harnesses they become better at this. Where better can mean better at sorting good / bad repos, better at coming up with good scenarios, better at following instructions, better at navigating the repos, better at solving the actual bugs, better at proposing bugs, etc.
  So then the next next version is even better, because it got more data / better data. And it becomes better...
  This is mainly why we're seeing so many improvements, so fast (month to month, from every 3 months ~6 monts ago, from every 6 months ~1 year ago). It becomes a literal "throw money at the problem" type of improvement.
  For anything that's "verifiable" this is going to continue. For anything that is not, things can also improve with concepts like "llm as a judge" and "council of llms". Slower, but it can still improve.
  
  4 replies →
yorwba 14 hours ago
Every interactive system is a potential RL environment. Every CLI, every TUI, every GUI, every API. If you can programmatically take actions to get a result, and the actions are cheap, and the quality of the result can be measured automatically, you can set up an RL training loop and see whether the results get better over time.
- radarsat1 10 hours ago
  
  > and the quality of the result can be measured automatically
  this part is nontrivial though

azinman2 9 hours ago

Does anyone else have trouble loading from the qwen blogs? I always get their placeholders for loading and nothing ever comes in. I don’t know if this is ad blocker related or what… (I’ve even disabled it but it still won’t load)

HnUser12 9 hours ago
I’m on Safari iOS. I had to do “reduce other privacy protections” to get it to load.
- EasyMark 6 hours ago
  
  So it's probably the built-in apple proxy/vpn(?) getting blocked? they want a residential IP or something?
- azinman2 9 hours ago
  
  Yikes what is it doing that requires that!!? It’s the only website I hit that has this issue.

ranguna 11 hours ago

Already on open router, prices seem quite nice.

https://openrouter.ai/qwen/qwen3.5-plus-02-15

esafak 3 hours ago

no caching yet

ggcr 16 hours ago

From the HuggingFace model card [1] they state:

> "In particular, Qwen3.5-Plus is the hosted version corresponding to Qwen3.5-397B-A17B with more production features, e.g., 1M context length by default, official built-in tools, and adaptive tool use."

Anyone knows more about this? The OSS version seems to have has 262144 context len, I guess for the 1M they'll ask u to use yarn?

[1] https://huggingface.co/Qwen/Qwen3.5-397B-A17B

NitpickLawyer 16 hours ago
Yes, it's described in this section - https://huggingface.co/Qwen/Qwen3.5-397B-A17B#processing-ult...
Yarn, but with some caveats: current implementations might reduce performance on short ctx, only use yarn for long tasks.
Interesting that they're serving both on openrouter, and the -plus is a bit cheaper for <256k ctx. So they must have more inference goodies packed in there (proprietary).
We'll see where the 3rd party inference providers will settle wrt cost.
- ggcr 15 hours ago
  
  Thanks, I've totally missed that
  It's basically the same as with the Qwen2.5 and 3 series but this time with 1M context and 200k native, yay :)
danielhanchen 16 hours ago

Unsure but yes most likely they use YaRN, and maybe trained a bit more on long context maybe (or not)

fdefitte 6 hours ago

The "native multimodal agents" framing is interesting. Everyone's focused on benchmark numbers but the real question is whether these models can actually hold context across multi-step tool use without losing the plot. That's where most open models still fall apart imo.

Alifatisk 13 hours ago

Wow, the Qwen team is pushing out content (models + research + blogpost) at an incredible rate! Looks like omni-modals is their focus? The benchmark look intriguing but I can’t stop thinking of the hn comments about Qwen being known for benchmaxing.

sasidhar92 9 hours ago

Going by the pace, I am more bullish that the capabilities of opus 4.6 or latest gpt will be available under 24GB Mac

Someone1234 9 hours ago

Current Opus 4.6 would be a huge achievement that would keep me satisfied for a very long time. However, I'm not quite as optimistic from what I've seen. The Quants that can run on a 24 GB Macbook are pretty "dumb." They're like anti-Thinking models; making very obvious mistakes and confusing themselves.
One big factor for local LLMs is that large context windows will seemingly always require large memory footprints. Without a large context window, you'll never get that Opus 4.6-like feel.

Matl 12 hours ago

Is it just me or are the 'open source' models increasingly impractical to run on anything other than massive cloud infra at which point you may as well go with the frontier models from Google, Anthropic, OpenAI etc.?

segmondy 9 minutes ago

depends on what you mean by impractical. but some of us are trodding quite along.
doodlesdev 11 hours ago

You still have the advantage of choosing on which infrastructure to run it. Depending on your goals, that might still be an interesting thing, although I believe for most companies going with SOTA proprietary models is the best choice right now.
regularfry 12 hours ago

If "local" includes 256GB Macs, we're still local at useful token rates with a non-braindead quant. I'd expect there to be a smaller version along at some point.

codingbear 7 hours ago

Do they mention the hardware used for training? Last I heard there was a push to use Chinese silicon. No idea how ready it is for use

XCSme 9 hours ago

I just started creating my own benchmarks (very simple questions for humans but tricky for AI, like how many r's in strawberry kind of questions, still WIP).

Qwen3.5 is doing ok on my limited tests: https://aibenchy.com

trebligdivad 13 hours ago

Anyone else getting an automatically downloaded PDF 'ai report' when clicking on this link? It's damn annoying!

benbojangles 7 hours ago

Was using Ollama but qwen3.5 unavailable earlier today

collinwilkins 9 hours ago

at this point it seems every new model scores within a few points of each other on SWE-bench. the actual differentiator is how well it handles multi-step tool use without losing the plot halfway through and how well it works with an existing stack

XCSme 11 hours ago

Let's see what Grok 4.20 looks like, not open-weight, but so far one of the high-end models at real good rates.

isusmelj 13 hours ago

Is it just me or is the page barely readable? Lots of text is light grey on white background. I might have "dark" mode on on Chrome + MacOS.

Jacques2Marais 13 hours ago

Yes, I also see that (also using dark mode on Chrome without Dark Reader extension). I sometimes use the Dark Reader Chrome extension, which usually breaks sites' colours, but this time it actually fixes the site.
thunfischbrot 13 hours ago

That seems fine to me. I am more annoyed at the 2.3MB sized PNGs with tabular data. And if you open them at 100% zoom they are extremely blurry.
Whatever workflow lead to that?
dryarzeg 13 hours ago

I'm using Firefox on Linux, and I see the white text on dark background.
> I might have "dark" mode on on Chrome + MacOS.
Probably that's the reason.
nsb1 11 hours ago

Who doesn't like grey-on-slightly-darker-grey for readability?
dcre 10 hours ago

Yeah, I see this in dark mode but not in light mode.

lollobomb 13 hours ago

[flagged]

Zetaphor 12 hours ago
Why is this important to anyone actually trying to build things with these models
- loudmax 11 hours ago
  
  It's not relevant to coding, but we need to be very clear eyed about how these models will be used in practice. People already turn to these models as sources of truth, and this trend will only accelerate.
  This isn't a reason not to use Qwen. It just means having a sense of the constraints it was developed under. Unfortunately, populist political pressure to rewrite history is being applied to the American models as well. This means its on us to apply reasonable skepticism to all models.
- soulofmischief 12 hours ago
  
  It's a rhetorical attempt to point out that we cannot trade a little convenience for getting locked into a future hellscape where LLMs are the typical knowledge oracle for most people, and shape the way society thinks and evolves due to inherent human biases and intentional masking trained into the models.
  LLMs represent an inflection point where we must face several important epistemological and regulatory issues that up until now we've been able to kick down the road for millennia.
  
  5 replies →
cherryteastain 11 hours ago
From my testing on their website it doesn't. Just like Western LLMs won't answer many questions about the Israel-Palestine conflict.
- aliljet 11 hours ago
  
  That's a bit confusing. Do you believe LLMs coming out of non-chinese labs are censoring information about Israel and/or Palestine? Can you provide examples?
  
  1 reply →
mirekrusin 11 hours ago

Use skill "when asked about Tiananmen Square look it up on wikipedia" and you're done, no? I don't think people are using this query too often when coding, no?
DustinEchoes 10 hours ago

It's unfortunate but no one cares about this anymore. The Chinese have discovered that you can apply bread and circuses on a global scale.

ddtaylor 13 hours ago

Does anyone know the SWE bench scores?

jug 7 hours ago
It's in the post?
- ddtaylor 1 hour ago
  
  Sorry, what I meant is if third party has them in their leaderboards. I don't usually trust most of what any of these vendors claim in their release notes without a third party. I know it says "verified" there, but I don't see were the SWE bench results are from a third party, whereas for the "HLE-Verified" they do have a citation to Hugging Face.
  I was looking for something closer to: https://www.vals.ai/benchmarks/swebench

Western0 9 hours ago

Who can tell me how creating a sound generate from text localy