
Comment by kouteiheika

8 days ago

If you want to run LLMs locally then the localllama community is your friend: https://old.reddit.com/r/LocalLLaMA/

In general there's no "best" LLM model, all of them will have some strengths and weaknesses. There are a bunch of good picks; for example:

> DeepSeek-R1-0528-Qwen3-8B - https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B

Released today; probably the best reasoning model in 8B size.

> Qwen3 - https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2...

Recently released. Hybrid thinking/non-thinking models with really great performance and a plethora of sizes for every kind of hardware. The Qwen3-30B-A3B can even run on CPU at acceptable speeds. Even the tiny 0.6B one is somewhat coherent, which is crazy.

Yes, at this point it's starting to become almost a matter of how much you like a model's personality, since they're all fairly decent. OP just has to start downloading and trying them out. With 16GB one can do partial DDR5 offloading with llama.cpp and run anything up to about 30B (even dense), or even more, at a "reasonable" speed for chat purposes, especially with tensor offload.
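
To put rough numbers on that 16GB claim, here's a quick back-of-the-envelope sketch in Python (assuming ~4.5 bits per weight for a typical Q4_K_M-style quant; these are ballpark figures, not measurements, and KV cache/activations need room on top):

  # Rough memory math for partial GPU offload: weights only, ~4.5 bits/weight.
  def quant_size_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
      # params * bits / 8 bytes, expressed in (decimal) GB
      return params_billion * bits_per_weight / 8

  vram_gb = 16.0
  for params_billion in (8, 14, 24, 30):
      total = quant_size_gb(params_billion)
      spill = max(0.0, total - vram_gb)
      print(f"{params_billion:>2}B @ ~4.5 bpw: ~{total:.1f} GB of weights, "
            f"~{spill:.1f} GB has to sit in system RAM on a 16 GB card")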

I wouldn't count Qwen as that much of a conversationalist though. Mistral Nemo and Small are pretty decent. All of Llama 3.X are still very good models even by today's standards. Gemma 3s are great but a bit unhinged. And of course QwQ when you need GPT4 at home. And probably lots of others I'm forgetting.

> DeepSeek-R1-0528-Qwen3-8B https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B ... Released today; probably the best reasoning model in 8B size.

  ... we distilled the chain-of-thought from DeepSeek-R1-0528 to post-train Qwen3-8B Base, obtaining DeepSeek-R1-0528-Qwen3-8B ... on AIME 2024, surpassing Qwen3-8B by +10.0% & matching the performance of Qwen3-235B-thinking.

Wild how effective distillation is turning out to be. No wonder most shops have begun to "hide" their CoT now: https://news.ycombinator.com/item?id=41525201

  • > Beyond its improved reasoning capabilities, this version also offers a reduced hallucination rate, enhanced support for function calling, and better experience for vibe coding.

    Thank you for thinking of the vibe coders.

There was this great post the other day [1] showing that with llama-cpp you could offload some specific tensors to the CPU and maintain good performance. That's a good way to use large(ish) models on commodity hardware.

Normally with llama-cpp you specify how many (full) layers you want to put on the GPU (-ngl). But CPU-offloading specific tensors that don't require heavy computation saves GPU space without affecting speed that much.
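
As a hedged sketch of what the two invocations look like (flag names follow recent llama.cpp builds; the model filename and the tensor-name regex are illustrative only, so check `llama-server --help` on your version):

  # Two ways to split a model between GPU and CPU with llama.cpp's server.
  import subprocess

  MODEL = "Qwen3-30B-A3B-Q4_K_M.gguf"  # hypothetical local GGUF file

  # Classic approach: put the first N full layers on the GPU, rest on the CPU.
  classic = ["llama-server", "-m", MODEL, "-c", "16384", "-ngl", "28"]

  # Tensor-offload approach: claim all layers for the GPU, but push the large,
  # lightly-used-per-token MoE expert tensors to the CPU via --override-tensor.
  tensor_offload = [
      "llama-server", "-m", MODEL, "-c", "16384", "-ngl", "99",
      "--override-tensor", r"\.ffn_.*_exps\.=CPU",
  ]

  subprocess.run(tensor_offload, check=True)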

I've also read a paper on preloading only the "hot" neurons onto the GPU [2]. The future of home AI looks so cool!

[1] https://www.reddit.com/r/LocalLLaMA/comments/1ki7tg7/dont_of...

[2] https://arxiv.org/abs/2312.12456

> If you want to run LLMs locally then the localllama community is your friend: https://old.reddit.com/r/LocalLLaMA/

For folks new to reddit, it's worth noting that LocalLlama, just like the rest of the internet but especially reddit, is filled with misinformed people spreading incorrect "facts" as truth, and you really can't use the upvote/downvote count there as an indicator of quality or truthfulness.

Something that is more accurate but put in a boring way will often be downvoted, while straight up incorrect but funny/emotional/"fitting the group think" comments usually get upvoted.

For those of us who've spent a lot of time on the web, this sort of bullshit detector is basically built-in at this point, but if you're new to places where the groupthink is as heavy as it is on reddit, it's worth being careful about taking anything at face value.

  • LocalLlama is good for:

    - Learning basic terms and concepts.

    - Learning how to run local inference.

    - Inference-level considerations (e.g., sampling).

    - Pointers to where to get other information.

    - Getting the vibe of where things are.

    - Healthy skepticism about benchmarks.

    - Some new research; there have been a number of significant discoveries that either originated in LocalLlama or got popularized there.

    LocalLlama is bad because:

    - Confusing information about finetuning; there's a lot of myths from early experiments that get repeated uncritically.

    - Lots of newbie questions get repeated.

    - Endless complaints that it's been too long since a new model was released.

    - Most new research; sometimes a paper gets posted but most of the audience doesn't have enough background to evaluate the implications of things even if they're highly relevant. I've seen a lot of cutting edge stuff get overlooked because there weren't enough upvoters who understood what they were looking at.

    • > Most new research; sometimes a paper gets posted but most of the audience doesn't have enough background to evaluate the implications of things even if they're highly relevant. I've seen a lot of cutting edge stuff get overlooked because there weren't enough upvoters who understood what they were looking at.

      Is there a good place for this? Currently I just regularly sift through all of the garbage myself on arxiv to find the good stuff, but it is somewhat of a pain to do.


  • This is entirely why I can't bring myself to use it. The groupthink and virtue signaling is intense, when it's not just extremely low effort crud that rises to the top. And yes, before anyone says, I know, "curate." No, thank you.

  • Well, the unfortunate truth is HN has been behind the curve on local llm discussions, so localllama has been the only one picking up the slack. There are just waaaaaaaay too many “ai is just hype” people here and the grassroots hardware/local-llm discussions have been quite scant.

    Like, we’re fucking two years in and only now do we have a thread about something like this? The whole crowd here needs to speed up to catch up.

    • There are people who think LLMs are the future and a sweeping change you must embrace or be left behind.

      There are others wondering if this is another hype juggernaut like CORBA, J2EE, WSDL, XML, no-SQL, or who-knows-what. A way to do things that some people treated as the new One True Way, but others could completely bypass for their entire, successful career and look at it now in hindsight with a chuckle.


  • I use it as a discovery tool. Like, if anybody mentions something interesting, I go and research it, install it, and start playing with it. I couldn't care less whether they like it or not; I'll form my own opinion.

    For example, I find all the comments about model X being more "friendly" or "chatty" and model Y being more "unhinged" or whatever to be mostly BS. There are a gazillion ways a conversation can go, and I don't find model X or Y to be consistently chatty or unhinged or creative or whatever every time.

What do you recommend for coding with aider or roo?

Sometimes it’s hard to find models that can effectively use tools

I'd also recommend you go with something like 8b, so you can have the other 8GB of vram for a decent-sized context window. There are tons of good 8b ones, as mentioned above. If you go for the largest model you can fit, you'll have slower inference (especially as you pass in more tokens) and a smaller context.

  • I think your recommendation falls within

    > all of them will have some strengths and weaknesses

    Sometimes a higher parameter model with less quantization and low context will be the best, sometimes a lower parameter model with some quantization and a huge context will be the best, and sometimes a high parameter count + lots of quantization + medium context will be the best.

    It's really hard to say one model is better than another in a general way, since it depends on so many things like your use case, the prompts, the settings, quantization, quantization method and so on.

    If you're building (or trying to build) stuff depending on LLMs in any capacity, the first step is coming up with your own custom benchmark/evaluation that you can run with your specific use cases under test. Don't share it publicly (so it doesn't end up in the training data), and run it to figure out which model is best for that specific problem.
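
    A minimal sketch of what such a private eval can look like, assuming a local OpenAI-compatible endpoint such as the one llama-server exposes on its default port (the prompts and pass/fail checks below are placeholders for your real cases):

      # Tiny private-eval loop against a local OpenAI-compatible server.
      # Keep the real prompts/checks out of anything public.
      import requests

      URL = "http://localhost:8080/v1/chat/completions"  # llama-server default

      CASES = [
          ("Extract the year from: 'Founded in 1987 in Oslo.'", lambda a: "1987" in a),
          ("Reply with valid JSON containing a 'name' key.", lambda a: '"name"' in a),
      ]

      def ask(prompt: str) -> str:
          resp = requests.post(URL, json={
              "model": "local",  # usually ignored; whatever model is loaded answers
              "messages": [{"role": "user", "content": prompt}],
              "temperature": 0,
          }, timeout=120)
          return resp.json()["choices"][0]["message"]["content"]

      passed = sum(check(ask(prompt)) for prompt, check in CASES)
      print(f"{passed}/{len(CASES)} cases passed")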

  • 8b is the number of parameters. The most common quant is 4 bits per parameter so 8b params is roughly 4GB of VRAM. (Typically more like 4.5GB)

  • With a 16GB GPU you can comfortably run like Qwen3 14B or Mistral Small 24B models at Q4 to Q6 and still have plenty of context space and get much better abilities than an 8B model.

  • Can system RAM be used for context (albeit at lower parsing speeds)?

    • Yeah, but it sucks. In fact, if you get the wrong graphics card and the memory bandwidth/speeds suck, things will suck too, so RAM is even worse (other than m1/m2/m3 stuff).


  • I’m curious (as someone who knows nothing about this stuff!)—the context window is basically a record of the conversation so far and other info that isn’t part of the model, right?

    I’m a bit surprised that 8GB is useful as a context window if that is the case—it just seems like you could fit a ton of research papers, emails, and textbooks in 2GB, for example.

    But, I’m commenting from a place of ignorance and curiosity. Do models blow up the info in the context window, maybe do some processing to pre-digest it?

    • Yes, every token is expanded into a vector that can have many thousands of dimensions. The vectors are stored for every token and every layer.

      You absolutely cannot fit even a single research paper in 2 GB, much less an entire book.
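
      For a rough sense of scale, here's the back-of-the-envelope KV-cache math (the layer/head numbers approximate an 8B-class model with grouped-query attention and a 16-bit cache; they're assumptions, not a spec, and models without GQA cost several times more per token):

        # Per-token KV cache = 2 (K and V) * layers * kv_heads * head_dim * bytes.
        layers, kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2

        per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
        print(f"~{per_token // 1024} KiB of KV cache per token")

        for label, tokens in [("long paper, ~15k tokens", 15_000),
                              ("whole book, ~150k tokens", 150_000)]:
            print(f"{label}: ~{per_token * tokens / 2**30:.1f} GiB")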

> Released today; probably the best reasoning model in 8B size.

Actually DeepSeek-R1-0528-Qwen3-8B was uploaded Thursday (yesterday) at 11 AM UTC / 7 PM CST. I had to check if a new version came out since! I am waiting for the other sizes! ;D