Comment by mark_l_watson

1 day ago

I have spent a HUGE amount of time the last two years experimenting with local models.

A few lessons learned:

1. small models like the new qwen3.5:9b can be fantastic for local tool use, information extraction, and many other embedded applications.

2. For coding tools, just use Google Antigravity and gemini-cli, or, Anthropic Claude, or...

Now to be clear, I have spent perhaps 100 hours in the last year configuring local models for coding using Emacs, Claude Code (configured for local), etc. However, I am retired and this time was a lot of fun for me: lots of effort trying to maximize local-only results. I don't recommend it for others.

I do recommend getting very good at using embedded local models in small practical applications. Sweet spot.

102 comments


sdrinf 20 hours ago

Just want to echo the recommendation for qwen3.5:9b. This is a smol, thinking, agentic tool-using, text-image multimodal creature, with very good internal chains of thought. CoT can sometimes be excessive, but it leads to a very stable decision-making process, even across very large contexts - something we haven't seen from models of this size before.

What's also new here is the VRAM/context-size trade-off: for 25% of its attention network, they use the regular KV cache for global coherency, but for 75% they use a new KV cache with linear(!!!!) memory-token-context size expansion, which means e.g. ~100K tokens -> 1.5 GB VRAM use, meaning for the first time you can do extremely long conversations / document processing with e.g. a 3060.

Strong, strong recommend.
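
For anyone who wants to sanity-check numbers like these, here is a rough back-of-the-envelope sketch. It assumes that only a quarter of the layers keep a per-token KV cache and that the rest hold a fixed-size state that is ignored here. The layer count, KV head count, head size, and fp16 cache below are hypothetical values picked for illustration, not the published qwen3.5:9b configuration.

```python
def kv_cache_bytes(n_tokens, n_cached_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed for keys + values in the layers that keep a per-token cache."""
    # Factor of 2 covers one key vector and one value vector per token.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem
    return n_tokens * n_cached_layers * per_token_per_layer


# Hypothetical configuration, chosen only to illustrate the arithmetic.
TOTAL_LAYERS = 32
FULL_ATTENTION_LAYERS = TOTAL_LAYERS // 4  # the "25%" that keep a regular KV cache
CONTEXT_TOKENS = 100_000

hybrid = kv_cache_bytes(CONTEXT_TOKENS, FULL_ATTENTION_LAYERS, n_kv_heads=4, head_dim=128)
dense = kv_cache_bytes(CONTEXT_TOKENS, TOTAL_LAYERS, n_kv_heads=4, head_dim=128)

print(f"hybrid: {hybrid / 2**30:.2f} GiB, all layers cached: {dense / 2**30:.2f} GiB")
```

Under these assumed numbers the cache is a quarter of what it would be if every layer kept per-token keys and values, which is the whole point of the trade-off. Swap in the real config values from the model card before trusting any absolute figure.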

steve_adams_86 20 hours ago
I've been building a harness for qwen3.5:9b lately (to better understand how to create agentic tools/have fun) and I'm not going to use it instead of Opus 4.6 for my day job but it's remarkably useful for small tasks. And more than snappy enough on my equipment. It's a fun model to experiment with. I was previously using an old model from Meta and the contrast in capability is pretty crazy.
I like the idea of finding practical uses for it, but so far haven't managed to be creative enough. I'm so accustomed to using these things for programming.
- tempoponet 18 hours ago
  
  What kind of small tasks do you find it's good at? My non-coding use of agents has been related to server admin, and my local-llm use-case is for 24/7 tasks that would be cost-prohibitive. So my best guess for this would be monitoring logs, security cameras, and general home automation tasks.
  
threecheese 19 hours ago
You can really see the limitations of qwen3.5:9b in reasoning traces; it’s fascinating. When a question “goes bad”, sometimes the thinking tokens are WILD - it’s like watching Poirot after a head injury.
Example: “what is the air speed velocity of a swallow?” - qwen knew it was a Monty Python gag, but couldn't and didn't figure out which one.
- scottmf 16 hours ago
  
  As a person who also knows there's a connection between that phrase and Monty Python and not much more information beyond that, I'm not sure how to feel.
- cassianoleal 14 hours ago
  
  African or European?
  
kingo55 20 hours ago
How does it compare in quality with larger models in the same series? E.g. 122b?
- buzzin_ 5 hours ago
  
  The chart on this link compares all qwen3.5 models down to 0.8B.
  https://www.reddit.com/r/LocalLLaMA/comments/1ro7xve/qwen35_...
ggsp 19 hours ago
How much difference are you seeing between standard and Q4 versions in terms of degradation, and is it constant across tasks or more noticeable in some vs others?
- rnewme 19 hours ago
  
  Less than expected; search for Unsloth's recent benchmark.
dsr_ 19 hours ago
[flagged]
- tobr 8 hours ago
  
  Describing what computers do as “thinking” is not new. It’s a useful and obvious metaphor. https://www.gutenberg.org/ebooks/68991
  
- scronkfinkle 16 hours ago
  
  Do you also require computers to grow legs when they "run"?
  "Thinking" is just a term to describe a process in generative AI where you generate additional tokens in a manner similar to thinking a problem through. It's kind of a tired point to argue against the verb since its meaning is well understood at this point.
  
- sayamqazi 10 hours ago
  
  Are insects not creatures?
- peddling-brink 17 hours ago
  
  Rebooting a machine running an LLM isn’t noticed by the LLM.
  Would you feel comfortable digitally torturing it? Giving it a persona and telling it terrible things? Acts of violence against its persona?
  I’m not confident it’s not “feeling” in a way.
  Yes its circuitry is ones and zeros, we understand the mechanics. But at some point, there’s mechanics and meat circuitry behind our thoughts and feelings too.
  It is hubris to confidently state that this is not a form of consciousness.
  
- fragmede 17 hours ago
  
  What do you imagine the psychiatrist will do? That's an incredibly dismissive take.
  
- inquirerGeneral 15 hours ago
  
  [dead]
- woctordho 14 hours ago
  
  Then don't feel sorrow killing it. Living things are not so special.

johnmaguire 1 day ago

I'd love to know how you fit smaller models into your workflow. I have an M4 Macbook Pro w/ 128GB RAM and while I have toyed with some models via ollama, I haven't really found a nice workflow for them yet.

philipkglass 1 day ago
It really depends on the tasks you have to perform. I am using specialized OCR models running locally to extract page layout information and text from scanned legal documents. The quality isn't perfect, but it is really good compared to desktop/server OCR software that I formerly used that cost hundreds or thousands of dollars for a license. If you have similar needs and the time to try just one model, start with GLM-OCR.
If you want a general knowledge model for answering questions or a coding agent, nothing you can run on your MacBook will come close to the frontier models. It's going to be frustrating if you try to use local models that way. But there are a lot of useful applications for local-sized models when it comes to interpreting and transforming unstructured data.
- mandeepj 20 hours ago
  
  > I formerly used that cost hundreds or thousands of dollars for a license
  Azure Doc Intelligence charges $1.50 for 1000 pages. Was that an annual/recurring license?
  Would you mind sharing your OCR model? I'm using Azure for now, as I want to focus on building the functionality first, but would later opt for a local model.
  
tempaccount5050 18 hours ago

Not OP but I had an XML file with inconsistent formatting for album releases. I wanted to extract YouTube links from it, but the formatting was different from album to album. Nothing you could regex or filter manually. I shoved it all into a DB, looked up the album, then gave the xml to a local LLM and said "give me the song/YouTube pairs from this DB entry". Worked like a charm.
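
A minimal sketch of that kind of extraction step, assuming the model is asked to answer with a JSON array. `ask_model` is a hypothetical callable standing in for whatever local runtime you use (it takes a prompt string and returns the reply text), and the prompt wording and the `song`/`url` keys are made up for illustration.

```python
import json
from typing import Callable

# Hypothetical prompt; adjust the wording to whatever your model responds to best.
PROMPT = (
    "Extract every song title and its YouTube link from the XML below. "
    'Answer only with a JSON array of objects like {"song": "...", "url": "..."}.\n\n'
)


def extract_pairs(xml_fragment: str, ask_model: Callable[[str], str]) -> list[tuple[str, str]]:
    """Ask the model for song/URL pairs and keep only well-formed, grounded ones."""
    reply = ask_model(PROMPT + xml_fragment)
    # Small models often wrap the JSON in chatter, so cut out the array itself.
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end < start:
        return []
    try:
        items = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return []
    pairs = []
    for item in items:
        if not isinstance(item, dict):
            continue
        song, url = item.get("song"), item.get("url")
        # Drop any URL that does not literally appear in the input (likely invented).
        if isinstance(song, str) and isinstance(url, str) and url in xml_fragment:
            pairs.append((song, url))
    return pairs
```

The literal `url in xml_fragment` check is deliberately strict: it will also reject URLs that appear XML-escaped in the source (for example with `&amp;`), so unescape the fragment first if your data has those.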
Bluecobra 21 hours ago
I didn’t realize that you can get 128GB of memory in a notebook, that is impressive!
- lambda 21 hours ago
  
  I've got a 128 GiB unified memory Ryzen Ai Max+ 395 (aka Strix Halo) laptop.
  Trying to run LLM models somehow makes 128 GiB of memory feel incredibly tight. I'm frequently getting OOMs when I'm running models that are pushing the limits of what this can fit, and I need to leave more memory free for the system than I was expecting. I was expecting to be able to run models of up to ~100 GiB quantized, leaving 28 GiB for system memory, but it turns out I need to leave more room for context and overhead. ~80 GiB quantized seems like a better max limit when not running on a headless system, since I'm running a desktop environment, browser, IDE, compilers, etc. in addition to the model.
  And memory bandwidth limitations for running the models are real! 10B active parameters at 4-6 bit quants feels usable but slow; much more than that and it really starts to feel sluggish.
  So this can fit models like Qwen3.5-122B-A10B but it's not the speediest and I had to use a smaller quant than expected. Qwen3-Coder-Next (80B/3B active) feels quite speedy, though not quite as smart. Still trying out models; Nemotron-3-Super-120B-A12B just came out, but looks like it'll be a bit slower than Qwen3.5 while not offering up any more performance, though I do really like that they have been transparent in releasing most of its training data.
  
- AzN1337c0d3r 21 hours ago
  
  Most workstation class laptops (i.e. Lenovo P-series, Dell Precision) have 4 DIMM slots and you can get them with 256 GB (at least, before the current RAM shortages).
  There's also the Ryzen AI Max+ 395 that has 128GB unified in laptop form factor.
  Only Apple has the unique dynamic allocation though.
  
saltwounds 1 day ago

I use Raycast and connect it to LM Studio to run text clean up and summaries often. The models are small enough I keep them in memory more often than not
aneyadeng 13 hours ago

[flagged]
echelon 21 hours ago
Shouldn't we prioritize large scale open weights and open source cloud infra?
An OpenRunPod with decent usage might encourage more non-leading labs to dump foundation models into the commons. We just need infra to run it. Distilling them down to desktop is a fool's errand. They're meant to run on DC compute.
I'm fine with running everything in the cloud as long as we own the software infra and the weights.
Conceivably the only way we could catch up to Claude Code is to have the Chinese start releasing their best coding models and for them to get significant traction with companies calling out to hosted versions. Otherwise, we're going to be stuck in a takeoff scenario with no bridge.
- girvo 20 hours ago
  
  I run Qwen3.5-plus through Alibaba’s coding plan (Model Studio): incredibly cheap, pretty fast, and decent. I can’t compare it to the highest released weight one though.
  

flutetornado 18 hours ago

My experience with qwen3.5 9b has not been the same. It’s definitely good at agentic responses but it hallucinates a lot. 30%-50% of the content it generated for a research task (local code repo exploration) turned out to be plain wrong to the extent of made up file names and function names. I ran its output through KimiK2 and asked it to verify its output - which found out that much of what it had figured out after agentic exploration was plain wrong. So use smaller models but be very cautious how much you depend on their output.
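
One cheap guard against the made-up file names, sketched below under some assumptions: it only checks that file paths mentioned in the model's report exist in the repo, it does not verify function names or claims about behavior. `missing_paths`, the regex, and the extension list are illustrative choices, not part of any tool mentioned here.

```python
import re
from pathlib import Path

# Matches relative paths ending in a few common source extensions (an arbitrary list).
PATH_PATTERN = re.compile(r"[\w./-]+\.(?:py|js|ts|go|rs|java|c|h|cpp|md)\b")


def missing_paths(report: str, repo_root: str) -> list[str]:
    """Return the file paths mentioned in `report` that do not exist under `repo_root`."""
    root = Path(repo_root)
    mentioned = sorted(set(PATH_PATTERN.findall(report)))
    return [p for p in mentioned if not (root / p).is_file()]
```

Anything this returns is worth feeding back to the model (or to a stronger one) before trusting the rest of the report.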

mongrelion 6 hours ago

At what temperature did you run it and what was your context limit?

adamkittelson 19 hours ago

Anecdotal but for some reason I had a pretty bad time with qwen3.5 locally for tool usage. I've been using GPT-OSS-120B successfully and switched to qwen so that I could process images as well (I'm using this for a discord chat bot).

Everything worked fine on GPT but Qwen as often as not preferred to pretend to call a tool and not actually call it. After much aggravation I wound up just setting my bot / llama swap to use gpt for chat and only load up qwen when someone posts an image and just process / respond to the image with qwen and pop back over to gpt when the next chat comes in.
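
The routing itself is tiny. A sketch of the idea, where the `Message` shape and the model identifiers are hypothetical placeholders rather than my actual bot code or real llama swap config names:

```python
from dataclasses import dataclass, field

# Placeholder identifiers; use whatever names your model server exposes.
CHAT_MODEL = "gpt-oss-120b"
VISION_MODEL = "qwen3.5"


@dataclass
class Message:
    text: str
    image_urls: list[str] = field(default_factory=list)


def pick_model(message: Message) -> str:
    """Send messages with images to the vision model, everything else to the chat model."""
    return VISION_MODEL if message.image_urls else CHAT_MODEL
```

The cost of this approach is the model swap latency whenever an image shows up, which is acceptable for a chat bot but may not be for anything interactive.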

GorbachevyChase 18 hours ago
You are responsible for the dead internet theory.
- yard2010 7 hours ago
  
  Why?

dhblumenfeld1 19 hours ago

Have you found that using a frontier model for planning and a small local model for writing code is a solid workflow? Been wanting to experiment with relying less on Claude Code/Codex and more on local models.

dataflow 20 hours ago

Thanks for sharing this, it's super helpful. I have a question if you don't mind: I want a model that I can feed, say, my entire email mailbox to, so that I can ask it questions later. (Just the text content, which I can clean and preprocess offline for its use.) Have any offline models you've dealt with seemed suitable for that sort of use case, with that volume of content?

lilactown 10 hours ago
If your inbox is as big as mine, you won’t be able to load all the text content into a prompt even with SotA cloud hosted models.
Instead you should give it tools to search over the mailbox for terms, labels, addresses, etc. so that the model can do fine grained filters based on the query, read the relevant emails it finds, then answer the question.
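
As a rough sketch of what such a tool could look like, assuming the mailbox has already been parsed into simple records: the `Email` fields and the `search_mail` name and parameters are hypothetical, and a real setup would expose this function to the model through whatever tool-calling interface your runtime provides.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class Email:
    sender: str
    subject: str
    body: str
    sent: date


def search_mail(
    mailbox: Iterable[Email],
    terms: Iterable[str] = (),
    sender: Optional[str] = None,
    after: Optional[date] = None,
    before: Optional[date] = None,
    limit: int = 20,
) -> list[Email]:
    """Return up to `limit` emails matching every filter, oldest first."""
    wanted = [t.lower() for t in terms]
    hits = []
    for mail in mailbox:
        if sender is not None and sender.lower() not in mail.sender.lower():
            continue
        if after is not None and mail.sent < after:
            continue
        if before is not None and mail.sent > before:
            continue
        # Case-insensitive match: every term must appear in the subject or body.
        text = f"{mail.subject}\n{mail.body}".lower()
        if all(t in text for t in wanted):
            hits.append(mail)
    hits.sort(key=lambda m: m.sent)
    return hits[:limit]
```

The `limit` matters more than it looks: it keeps each tool result small enough to fit in a local model's context, and the model can always call the tool again with narrower filters.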
- dataflow 10 hours ago
  
  Thanks, yeah. I think strong prefiltering is pretty much always doable because, if nothing else, I usually know the time range of the relevant emails and probably the sender/recipient or some keywords, plus I know how to filter out a big chunk of the irrelevant emails (like mailing lists, etc.), so I'm hoping it's not actually that much data for each search. What I don't know is which models would be most suitable even in the case where I can fit the data.
  As an example of the kind of query I'm interested in, I want a model that can tell me all the flights I took within a given time range (so that means it'd have to filter out cancellations). Or, for a given flight, the arrival and departure times and time zones (or the city and country so I can look up the time zone). Stuff like that. (Travel is just an example obviously, I have other topics to ask about.) It's not a terribly large number of emails to search through in each query, but the email structures are too heterogeneous across senders to write custom tooling for each case.
perbu 19 hours ago
Prompt injection is a problem if your agent has access to anything.
The local models are quite weak here.
- dataflow 19 hours ago
  
  Security is not a concern for the purpose of my question here, please ignore that for now. I'm just looking for text summary and search functionality here, not looking to give it full system access and let it loose on my computer or network. I can easily set up VM/sandboxing/airgapping/etc. as needed.
  My question is really just about what can handle that volume of data (ideally, with the quoted sections/duplications/etc. that come with email chains) and still produce useful (textual) output.
  

eek2121 19 hours ago

Qwen is actually really good at code as well. I used qwen3-coder-next a while back and it was every bit as good as claude code in the use cases I tested it in. Both made the same amount of mistakes, and both did a good job of the rest.

storus 18 hours ago

Coding locally with Qwen3-Coder-Next or Qwen-3.5 is a piece of cake on a workstation card (RTX Pro 6000); set it up in llama.cpp or vLLM in 1 hour, install Claude Code, force local API hostname and fake secret key, and just run it like regular setup with Claude4 but on Qwen.

chrisweekly 18 hours ago

Thanks for this, Mark. And for your website and books and generosity of spirit. Signal in the noise. Have an awesome weekend!

sakesun 18 hours ago

Becoming a retired builder is the ultimate bliss.

manmal 1 day ago

What about running e.g. Qwen3.5 128B on a rented RTX Pro 6000?

girvo 20 hours ago

IMO you’re better off using qwen3.5-plus through the model studio coding plan, but ymmv

nine_k 1 day ago

What kind of hardware did you use? I suppose that an 8GB gaming GPU and a Mac Pro with 512 GB unified RAM give quite different results, both formally being local.

fzzzy 21 hours ago
A Mac Pro with 512 gb unified ram does not exist.
- nine_k 21 hours ago
  
  Mac Studio Ultra, my bad. The 512 GB option existed up until March 2026: https://macdailynews.com/2026/03/06/apple-drops-512gb-m3-ult...

cyanydeez 20 hours ago

Cline (https://marketplace.visualstudio.com/items?itemName=saoudriz...) in vscode, inside a code-server run within docker (https://docs.linuxserver.io/images/docker-code-server/) using lmstudio (https://lmstudio.ai/) to access unsloth models (https://unsloth.ai/docs/get-started/unsloth-model-catalog), specifically (https://unsloth.ai/docs/models/qwen3-coder-next), appears to be right at the edge of productivity, as long as you realize what complexity means when issuing tasks.

kylehotchkiss 1 day ago

I've been really interested in the difference between 3.5 9b and 14b for information extraction. Is there a discernible difference in quality or capability?

sieabahlpark 20 hours ago

[dead]