
Comment by pistoriusp

17 hours ago

Do you use a local/free model?

I am currently using a local model, qwen3:8b, running on a 2020 Mac mini (2018 Intel chip) to classify news headlines, and it's working decently well for my task. Each headline takes about 2-3 seconds, but the results are pretty accurate. It uses about 5.3 GB of RAM.

  • Can you expand a bit on your software setup? I thought running local models was restricted to having expensive GPUs or the latest Apple Silicon with unified memory. I have an Intel 11th-gen home server which I would like to use to run some local model for tinkering, if possible.

    • Those little 4B and 8B models will run on almost anything. They're really fun to try out but severely limited compared to the larger ones - classifying headlines into categories should work well, but I wouldn't trust them to refactor code!

      If you have 8GB of RAM you can even try running them directly in Chrome via WebAssembly. Here's a demo running a model that's less than 1GB to load, entirely in your browser (and it worked for me in Mobile Safari just now): https://huggingface.co/spaces/cfahlgren1/Qwen-2.5-WebLLM
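
      Under the hood, a setup like that looks roughly like the following with the @mlc-ai/web-llm package. This is just a sketch, not the linked demo's actual code; the model id is an example, and in practice you'd pick one from webllm.prebuiltAppConfig.model_list:

      ```typescript
      import * as webllm from "@mlc-ai/web-llm";

      // Downloads and compiles the model in the browser (cached after the first run).
      // The model id here is an example; check webllm.prebuiltAppConfig.model_list.
      const engine = await webllm.CreateMLCEngine("Qwen2.5-0.5B-Instruct-q4f16_1-MLC", {
        initProgressCallback: (report) => console.log(report.text),
      });

      // OpenAI-style chat call, running entirely client-side.
      const reply = await engine.chat.completions.create({
        messages: [{ role: "user", content: "Is this headline political? Answer True or False: <headline>" }],
      });
      console.log(reply.choices[0].message.content);
      ```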

    • It's really just a performance tradeoff, and a question of where your acceptable performance level is.

      Ollama, for example, will let you run any available model on just about any hardware. But using the CPU alone is _much_ slower than running it on any reasonable GPU, and obviously CPU performance varies massively too.

      You can even run models that are bigger than available RAM, but performance will be terrible.

      The ideal case is to have a fast GPU and run a model that fits entirely within the GPU's memory. In these cases you might measure the model's processing speed in tens of tokens per second.

      As you move away from that ideal, processing speed drops. On CPU only, with a model that fits in RAM, you'd be maxing out in the low single-digit tokens per second, and on lower-performance hardware you start talking about seconds per token instead. If the model does not fit in RAM, then the measurement is minutes per token.

      For most people the minimum acceptable performance level is in the double-digit tokens-per-second range, which is why they optimize for that with high-end GPUs with as much memory as possible and choose models that fit inside the GPU's RAM. But in theory you can run large models on a potato, if you're prepared to wait until next week for an answer.
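
      As a very rough back-of-envelope (this treats decoding as purely memory-bandwidth bound, which is a simplification, and the bandwidth figures are just illustrative):

      ```typescript
      // At batch size 1 a dense model has to stream essentially all of its weights
      // from memory for every generated token, so an upper bound on decode speed is
      // memory bandwidth divided by model size in memory.
      function roughTokensPerSecond(bandwidthGBperSec: number, modelSizeGB: number): number {
        return bandwidthGBperSec / modelSizeGB;
      }

      // Illustrative ceilings for a ~5 GB quantized 8B model:
      console.log(roughTokensPerSecond(900, 5)); // fast GPU VRAM   (~900 GB/s) -> ~180 tok/s
      console.log(roughTokensPerSecond(40, 5));  // desktop DDR4     (~40 GB/s) -> ~8 tok/s
      console.log(roughTokensPerSecond(0.5, 5)); // swapping to disk (~0.5 GB/s) -> ~0.1 tok/s, i.e. ~10 s/token
      ```

      Real numbers come in below these ceilings, and once you're swapping the access pattern is far worse than a clean sequential read, which is how you end up at minutes per token.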


    • It really is a very simple setup. I basically had an old Intel-based Mac mini from 2020 (the Intel chip inside it is from 2018). It's a 3 GHz 6-core Core i5. I had upgraded the RAM to 32 GB when I bought it; however, Ollama only uses about 5.5 GB of it, so it can be run on a 16 GB Mac too.

      The Qwen model I am using is fairly small but does the job of classifying headlines pretty decently. All I ask it to do is decide whether a specific headline is political or not, and it only responds with True or False.

      I access this model from an app (running locally) using the `http://localhost:11434/api/generate` REST API with `think` set to false (a rough sketch of the request is below).

      Note that this Qwen model is a `thinking` model, so disabling thinking is important; otherwise it takes very long to respond.
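
      Roughly, the request looks something like this (a sketch rather than my exact code; the prompt wording and the helper name are just illustrative):

      ```typescript
      // Assumes Ollama is running locally and qwen3:8b has been pulled.
      async function classifyHeadline(headline: string): Promise<boolean> {
        const res = await fetch("http://localhost:11434/api/generate", {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({
            model: "qwen3:8b",
            prompt: `Is this news headline political? Reply with only True or False.\nHeadline: ${headline}`,
            stream: false, // return one JSON object instead of a token stream
            think: false,  // skip the thinking step, as noted above
          }),
        });
        const data = await res.json(); // { response: "True" | "False", ... }
        return data.response.trim().toLowerCase() === "true";
      }

      // e.g. await classifyHeadline("Parliament passes new budget bill");
      ```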

      Note that I tested this on my newer M4 Mac mini too, and the performance there is a LOT faster.

      Also, on my new M4 Mac, I originally tried using Apple's built-in Foundation Models for this task, and while they were decent, they often hit Apple's guardrails and refused to respond because they claimed the headline was too sensitive. So I switched to the Qwen model, which didn't have this problem.

      Note that while this does the job I need it to, as another comment said, it won't be much help for things like coding.