Comment by sieste

1 day ago

That's almost exactly my setup and I'm very happy with its performance.

I noticed recently that I started to prefer my local Qwen3.6 35B A3B and pi agent over Claude Code.

Both fail at different tasks, and Qwen more so than Claude.

But the way Qwen fails is much more straightforward. In writing tasks Qwen's hallucinations and bullshitting are much easier to spot because it doesn't have the sleek vocabulary and wordsmithing skills to disguise its ignorance.

In coding tasks that Qwen can't solve, it often just goes into a tool-calling doom loop that the pi harness can catch, whereas Claude attempts ever more convoluted and creative things, making more and more of a mess that takes forever to clean up.
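To make the "doom loop that the harness can catch" idea concrete, here is a minimal sketch of one way a harness could detect it: flag the session once the same tool call repeats several times in a row. This is an illustration only and not how pi actually does it; the class name `DoomLoopDetector`, the `max_repeats` threshold, and the tool names in the usage lines are all hypothetical.

```python
import json
from collections import deque


class DoomLoopDetector:
    """Flags a run of identical consecutive tool calls (hypothetical helper)."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        # Only the last `max_repeats` calls are kept.
        self.recent = deque(maxlen=max_repeats)

    def record(self, tool_name: str, arguments: dict) -> bool:
        """Record one call; return True once the last `max_repeats` calls are identical."""
        # Serialize arguments with sorted keys so key order does not matter.
        key = (tool_name, json.dumps(arguments, sort_keys=True))
        self.recent.append(key)
        return len(self.recent) == self.max_repeats and len(set(self.recent)) == 1


detector = DoomLoopDetector(max_repeats=3)
for _ in range(3):
    looping = detector.record("read_file", {"path": "a.py"})
print(looping)  # True: the same call was made three times in a row
```

A real harness would probably also want to catch loops that cycle through two or three different calls; this sketch only handles exact repeats.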

I think part of the story is that the tasks for which I use AI are fairly simple and maybe don't need a frontier model. But I wonder if "proper" developers have had a similar experience?

I keep finding more and more use cases for Q3.6 27b (same league), and performance is best when the answer to my question is already in the context.

The moment I'm trying something open-ended or ambitious, Claude/ChatGPT clearly take you to the goal quicker.

For things where there's a way to build a knowledge base, though, the local LLM can definitely be a true contender. Plus, with a big context and no worries about filling it over and over, you can get quite far.

I'm writing this literally in between cooking a pasta that the local LLM ordered the ingredients for online. I've built a grocery shopping skill, so it roughly knows what I have in the fridge (loosely), my last 10 representative orders (general preferences plus rich info about shops and SKUs around me) and actual real-time in-stock info. The last part has been my personal pet peeve with every product that promised cooking ingredient delivery (that is not packaged specifically for that).

This is what has been promised to us by every big tech company with an agent, and now a local LLM has actually solved it for me fully.

  • I keep playing around with this exact concept. While I don’t always trust an entirely AI-generated recipe, more traditional setups are super rigid when it comes to ingredients.

    • I kept getting recipes with "that one ingredient", which was either a major PITA to source or produced too much waste, even from a real-world dietician consultation. For example: use 1/4 of a pumpkin for something. Those were good recipes in terms of macronutrient composition, but they don't work long term due to logistics.

      I'm years past needing that strict diet, but the itch to fix or ease some parts of the process stayed.

>In writing tasks Qwen's hallucinations and bullshitting are much easier to spot because it doesn't have the sleek vocabulary and wordsmithing skills to disguise its ignorance.

Can't wait until we just remove the language from the LLMs for accuracy and efficiency

  • Imagine how accurate you could be if you could circumvent the coding agent and just type the source code directly into an editor all by yourself o_O Like a write_file skill but for humans!!!

I know the big labs like to pretend that their models are trillion parameter. But how likely is that really to be the case when Qwen 3.6 35B A3B gets so close to their performance? Seems that with the best research applied, best training data, they'd be able to top the charts with a 60B model quite easily.

  • Qwen 35B isn't even remotely close to the big models. It's just people overhyping small models. Ignore the benchmarks; they are almost meaningless.

    If you want something comparable, you need the trillion-parameter open models like DeepSeek.

    • The number of parameters doesn't make the model smarter, it just makes it know more stuff out of the box.

      At some point there are diminishing returns and your coding LLM performs worse because you encoded useless stuff like Pokemon combinations or languages you don't speak into its parameter space.

      The "smartness" of the model comes from RLHF post-training, which is orthogonal to model size.

      Also, if you're using an agentic harness a much better approach is to let the model control its own context. If you ever reach a point where your coding LLM needs to know about Pokemon, just give it a web search tool and let it google the Pokemons.


Do the two cards "share" their memory pool? Can work still be split across them? I'm wondering how it would do with something like fine-tuning?

Do you get the speed of the 5080 with the memory of the 3090?

Frontier models are still better (everyone would use them if they were cheap). Open source models are capable even on non-"simple" problems, but I trust them less, even though I usually write plans for all changes, and they are worse at debugging. I recently converted my homelab to NixOS and let's just say Deepseek failed and Fable did great (the night before getting killed).

  • While what you say is in general true, every model that followed Opus 4.6 on the Anthropic side has been increasingly worse at what the previous user points out: they are extremely smart and can convince the user of major falsehoods.

    They are way too trained/reinforced on solving problems rather than assisting you, something they have become extremely bad at.

    It's hard to explain because I too have had many moments where "Fable5 / Opus4.8 xhigh could solve bugs/stuff that previous models couldn't". I know that to be true, and they are very useful for that.

    But 90% of my tasks are quite mundane and I need thorough investigation and a proper assistant. Not a smart bullshitter fixated on solving the issue itself. On that Opus 4.6 has been the last good model.

    Anything after that is completely skewed towards passing benchmarks and E2E tasks, but definitely not assisting.

    Fable in particular was a disaster on that: non-stop thoroughness on whatever fix it fixated on, writing thousands of experiments in /tmp, etc. Great model, not gonna lie, but only if your focus is vibe coding and you accept that you're nothing but an assistant and accept its shortcomings.

    • yeah, the "proactivity" of recent anthropic models and the sophisticated bullshitting are bad, although my experience is that even on simple tasks i've never used an oss model that has consistently been better in terms of the quality of the result.

I have said this before as well: these top-of-the-line models write clever, convoluted code. The code looks intelligent from above, but is a maintenance headache. It makes the entire thing fragile for future development on top of it.

The smaller models, especially the aforementioned ones, fail much more, but they do not write that kind of insane code. They do simple, non-clever coding like humans do. Much easier to maintain and build upon.

Qwen-3.6-27b is a wonderful model. Exceptionally good for its size, and excellent in general as well. And with MTP available now, it can run at 60+ tps on a single 3090... this is roughly 30% faster tgs than most of the hosted ones being served from giant data centers.

Not having a lot of experience with this, I ask a naive question: is there a world where you can take your local LLM and hook it up to Claude and get more Claude-like results from your local model? Obviously, there are going to be material differences in how these perform, but are we getting close to a place where this is viable? I imagine that the answers are a combination of “not yet” and “yes but it’s a lot slower” and “yes but there is actually little point to doing this because ‘what Claude gets you’ is highly baked into anthropic’s models and that’s part of what you’re paying for.”

  • Already been done. Look at the Forge project for local LLMs. It can bring 8b models up to Opus-like performance at tool calling.

  • You can use ollama as the backend for claude code!

      ollama launch claude --model
    

    I would characterize it as doable, but not really viable. It's "yes you can do it but it's a lot slower", with a hint of "and the best local LLMs are on par with Haiku or maybe Sonnet, so larger and longer tasks get notably worse".

  • I have a "task router" that is a small local LLM on my mac mini (Qwen 3.5 0.8B) that I use to decide (when activated) with Pi whether to route a given task to my local LLM (Step 3.7 Flash) or to <given cloud provider>, if that counts? It works surprisingly well really. Though some of the cloud providers are getting so good and so cheap (GLM 5.1/5.2, MiniMax M3, among others) that the need to use my local one becomes less and less relevant, depressingly!

  • You're kinda talking about Claude being used in a planning/architect role, while the local LLM just executes the plan (performing edits) -- at least in that form it's a thing, yes.
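The "task router" reply above can be sketched roughly as follows: a small classifier labels each task, and the label picks the backend. This is a sketch under stated assumptions, not that commenter's setup. `make_router`, `keyword_classifier`, and the two lambda backends are hypothetical stand-ins; in a real setup the classifier would be a call to the small local model and the backends would be actual model clients.

```python
from typing import Callable, Dict

Backend = Callable[[str], str]


def make_router(
    classify: Callable[[str], str],
    backends: Dict[str, Backend],
    default: str = "local",
) -> Backend:
    """Build a function that sends each task to the backend named by `classify`."""

    def route(task: str) -> str:
        label = classify(task).strip().lower()
        # Unknown labels fall back to the default backend.
        backend = backends.get(label, backends[default])
        return backend(task)

    return route


def keyword_classifier(task: str) -> str:
    """Placeholder for the small router model: a crude keyword heuristic."""
    hard_markers = ("design", "architecture", "refactor", "debug")
    return "cloud" if any(word in task.lower() for word in hard_markers) else "local"


router = make_router(
    keyword_classifier,
    {
        "local": lambda task: "local:" + task,  # stand-in for the local LLM
        "cloud": lambda task: "cloud:" + task,  # stand-in for a cloud provider
    },
)
print(router("rename this variable"))  # local:rename this variable
print(router("Debug the deadlock"))    # cloud:Debug the deadlock
```

Falling back to the local backend on an unrecognized label keeps a confused router from silently spending money on the cloud provider.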

It's also going to fail consistently. When calling Claude, you don't know what version of the model you are talking to; it might be quantized due to load or have been patched.

This is true. The failure modes are simpler. And yes, the ceiling is lower as well. Smaller models' stability is lower over long sequences, and thus anything that needs a lot of CoT will be weaker. For example, I had a dumb lock + condvar with multiple defenses against lost wakeups in an N-producer, 1-consumer queue thing. Models generally need a lot of CoT before they realise they can switch it to a semaphore instead. Qwen typically isn't stable over such long CoTs and ends up adding more and more slop and band-aids, versus a larger model that outputs a large CoT and then realises it can swap 3 functions out for 2 lines if we use a semaphore.
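For readers who haven't seen the semaphore version, here is a minimal Python sketch of the idea. It is not the original code (which isn't shown here); the class name `SemaphoreQueue` is made up for illustration. The point is that a counting semaphore remembers every `release()`, so a signal sent before the consumer starts waiting is not lost, and the lost-wakeup defenses a condvar needs go away.

```python
import threading
from collections import deque


class SemaphoreQueue:
    """Unbounded N-producer, 1-consumer queue built on a counting semaphore."""

    def __init__(self):
        self._items = deque()
        # The semaphore's counter tracks how many items are queued.
        self._available = threading.Semaphore(0)

    def put(self, item):
        self._items.append(item)    # deque appends are thread-safe
        self._available.release()   # one permit per queued item

    def get(self):
        self._available.acquire()   # blocks until at least one item is queued
        return self._items.popleft()


q = SemaphoreQueue()
producers = [
    threading.Thread(target=lambda base=base: [q.put(base + i) for i in range(100)])
    for base in (0, 100, 200)
]
for t in producers:
    t.start()
received = [q.get() for _ in range(300)]
for t in producers:
    t.join()
print(len(received))  # 300
```

This relies on `collections.deque` having thread-safe `append` and `popleft`; in a language without such a container, the item storage would still need its own short lock, but the wakeup logic stays this simple.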