Ask HN: What is the best LLM for consumer grade hardware?

8 days ago

I have a 5060ti with 16GB VRAM. I’m looking for a model that can hold basic conversations, no physics or advanced math required. Ideally something that can run reasonably fast, near real time.

If you want to run LLMs locally then the localllama community is your friend: https://old.reddit.com/r/LocalLLaMA/

In general there's no "best" LLM model, all of them will have some strengths and weaknesses. There are a bunch of good picks; for example:

> DeepSeek-R1-0528-Qwen3-8B - https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B

Released today; probably the best reasoning model in 8B size.

> Qwen3 - https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2...

Recently released. Hybrid thinking/non-thinking models with really great performance and a plethora of sizes for every kind of hardware. The Qwen3-30B-A3B can even run on a CPU at acceptable speeds. Even the tiny 0.6B one is somewhat coherent, which is crazy.

  • Yes at this point it's starting to become almost a matter of how much you like the model's personality since they're all fairly decent. OP just has to start downloading and trying them out. With 16GB one can do partial DDR5 offloading with llama.cpp and run anything up to about 30B (even dense) or even more at a "reasonable" speed for chat purposes. Especially with tensor offload.

    I wouldn't count Qwen as that much of a conversationalist though. Mistral Nemo and Small are pretty decent. All of Llama 3.X are still very good models even by today's standards. Gemma 3s are great but a bit unhinged. And of course QwQ when you need GPT4 at home. And probably lots of others I'm forgetting.

  • > DeepSeek-R1-0528-Qwen3-8B https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B ... Released today; probably the best reasoning model in 8B size.

      ... we distilled the chain-of-thought from DeepSeek-R1-0528 to post-train Qwen3-8B Base, obtaining DeepSeek-R1-0528-Qwen3-8B ... on AIME 2024, surpassing Qwen3-8B by +10.0% & matching the performance of Qwen3-235B-thinking.
    

    Wild how effective distillation is turning out to be. No wonder, most shops have begun to "hide" CoT now: https://news.ycombinator.com/item?id=41525201

    • > Beyond its improved reasoning capabilities, this version also offers a reduced hallucination rate, enhanced support for function calling, and better experience for vibe coding.

      Thank you for thinking of the vibe coders.

  • There was this great post the other day [1] showing that with llama-cpp you could offload some specific tensors to the CPU and maintain good performance. That's a good way to use large(ish) models on commodity hardware.

    Normally with llama-cpp you specify how many (full) layers you want to put on the GPU (-ngl). But CPU-offloading specific tensors that don't require heavy computation saves GPU space without affecting speed that much (rough example below).

    I've also read a paper on loading only "hot" neurons into the GPU [2]. The future of home AI looks so cool!

    [1] https://www.reddit.com/r/LocalLLaMA/comments/1ki7tg7/dont_of...

    [2] https://arxiv.org/abs/2312.12456
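
    For anyone who wants to try it, the invocation looks roughly like this (flag names are from llama.cpp; the model file, context size, and tensor regex are just examples to adapt to your setup):

      # -ngl 99 keeps all layers "on GPU"; -ot then overrides the MoE expert FFN tensors back to CPU RAM
      llama-server \
        -m Qwen3-30B-A3B-Q4_K_M.gguf \
        -ngl 99 \
        -ot ".ffn_.*_exps.=CPU" \
        --ctx-size 16384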

  • > If you want to run LLMs locally then the localllama community is your friend: https://old.reddit.com/r/LocalLLaMA/

    For folks new to reddit, it's worth noting that LocalLlama, just like the rest of the internet but especially reddit, is filled with misinformed people spreading incorrect "facts" as truth, and you really can't use the upvote/downvote count as an indicator of quality or how truthful something is there.

    Something that is more accurate but put in a boring way will often be downvoted, while straight up incorrect but funny/emotional/"fitting the group think" comments usually get upvoted.

    For those of us who've spent a lot of time on the web, this sort of bullshit detector is basically built in at this point, but if you're new to places where the groupthink is as heavy as on reddit, it's worth being careful about taking anything at face value.

    • LocalLlama is good for:

      - Learning basic terms and concepts.

      - Learning how to run local inference.

      - Inference-level considerations (e.g., sampling).

      - Pointers to where to get other information.

      - Getting the vibe of where things are.

      - Healthy skepticism about benchmarks.

      - Some new research; there have been a number of significant discoveries that either originated in LocalLlama or got popularized there.

      LocalLlama is bad because:

      - Confusing information about finetuning; there's a lot of myths from early experiments that get repeated uncritically.

      - Lots of newbie questions get repeated.

      - Endless complaints that it's been too long since a new model was released.

      - Most new research; sometimes a paper gets posted but most of the audience doesn't have enough background to evaluate the implications of things even if they're highly relevant. I've seen a lot of cutting edge stuff get overlooked because there weren't enough upvoters who understood what they were looking at.

      2 replies →

    • This is entirely why I can't bring myself to use it. The groupthink and virtue signaling is intense, when it's not just extremely low effort crud that rises to the top. And yes, before anyone says, I know, "curate." No, thank you.

      30 replies →

    • Lol this is true but also a TON of sampling innovations that are getting their love right now from the AI community (see min_p oral at ICLR 2025) came right from r/localllama so don't be a hater!!!

      1 reply →

    • Well, the unfortunate truth is HN has been behind the curve on local LLM discussions, so localllama has been the only one picking up the slack. There are just waaaaaaaay too many "ai is just hype" people here and the grassroots hardware/local-LLM discussions have been quite scant.

      Like, we’re fucking two years in and only now do we have a thread about something like this? The whole crowd here needs to speed up to catch up.

      4 replies →

    • I use it as a discovery tool. If anybody mentions something interesting, I go research it, install it, and start playing with it. I couldn't care less whether they like it or not; I'll form my own opinion.

      For example, I find all the comments about model X being more "friendly" or "chatty" and model Y being more "unhinged" or whatever to be mostly BS. There are a gazillion ways a conversation can go, and I don't find model X or Y to be consistently chatty or unhinged or creative or whatever every time.

  • What do you recommend for coding with aider or roo?

    Sometimes it’s hard to find models that can effectively use tools

    • I haven't found a good one locally; I use DeepSeek R1 0528. It's slow but free and really good at coding (OpenRouter has it for free currently).

      1 reply →

  • I'd also recommend you go with something like 8b, so you can have the other 8GB of vram for a decent sized context window. There's tons of good 8b ones, as mentioned above. If you go for the largest model you can fit, you'll have slower inference (as you pass in more tokens) and smaller context.

    • I think your recommendation falls within

      > all of them will have some strengths and weaknesses

      Sometimes a higher parameter model with less quantization and low context will be the best, sometimes lower parameter model with some quantization and huge context will be the best, sometimes high parameter count + lots of quantization + medium context will be the best.

      It's really hard to say one model is better than another in a general way, since it depends on so many things like your use case, the prompts, the settings, quantization, quantization method and so on.

      If you're building/trying to build stuff depending on LLMs in any capacity, the first step is coming up with your own custom benchmark/evaluation that you can run with your specific use cases being put under test. Don't share this publicly (so it doesn't end up in the training data) and run it in order to figure out what model is best for that specific problem.

    • 8b is the number of parameters. The most common quant is 4 bits per parameter so 8b params is roughly 4GB of VRAM. (Typically more like 4.5GB)

      3 replies →

    • With a 16GB GPU you can comfortably run like Qwen3 14B or Mistral Small 24B models at Q4 to Q6 and still have plenty of context space and get much better abilities than an 8B model.

    • I’m curious (as someone who knows nothing about this stuff!)—the context window is basically a record of the conversation so far and other info that isn’t part of the model, right?

      I’m a bit surprised that 8GB is useful as a context window if that is the case—it just seems like you could fit a ton of research papers, emails, and textbooks in 2GB, for example.

      But, I’m commenting from a place of ignorance and curiosity. Do models blow up the info in the context window, maybe do some processing to pre-digest it?

      1 reply →

  • > Released today; probably the best reasoning model in 8B size.

    Actually DeepSeek-R1-0528-Qwen3-8B was uploaded Thursday (yesterday) at 11 AM UTC / 7 PM CST. I had to check if a new version came out since! I am waiting for the other sizes! ;D

What is everyone using their local LLMs for primarily? Unless you have a beefy machine, you'll never approach the level of quality of proprietary models like Gemini or Claude, but I'm guessing these smaller models still have their use cases, just not sure what those are.

  • I'm currently experimenting with Devstral for my own local coding agent that I've slowly put together. It's in many ways nicer than Codex in that 1) it has full access to my hardware, so it can start VMs, make network requests and everything else I can do, which Codex cannot, and 2) it's way faster, both in initial setup and in working through things and creating a patch.

    Of course, it still isn't at the same level as Codex itself; the model Codex is using is just way better, so of course it'll get better results. But Devstral (as I currently use it) is able to make smaller changes and refactors, and I think if I evolve the software a bit more it can start making larger changes too.

    • Why are you comparing it to Codex and not Claude Code, which can do all those things?

      And why not just use Openhands, which it was designed around which I presume can also do all those things?

  • I generally try a local model first for most prompts. It's good enough surprisingly often (over 50% for sure). Every time I avoid using a cloud service is a win.

  • I think that the future of local LLMs is delegation. You give it a prompt and it very quickly identifies what should be used to solve the prompt.

    Can it be solved locally with locally running MCPs? Or maybe it's a system API - like reading your calendar or checking your email. Otherwise it identifies the best cloud model and sends the prompt there.

    Basically Siri if it was good

    • I completely disagree. I don't see the current status quo fundamentally changing.

      That idea makes so much sense on paper, but it's not until you start implementing it that you realize why no one does it (including Siri). "Some tasks are complex and better suited for a complex giant model, but small models are perfectly capable of running simple limited tasks" makes a ton of sense, but the component best equipped to evaluate that decision is the smarter component of your system. At which point, you might as well have had it run the task.

      It's like assigning the intern to triage your work items.

      When actually implementing the application with that approach, every time you encounter an "AI miss" you would (understandably) blame the small model, and eventually give up and delegate yet another scenario to the cloud model.

      Eventually you feel you're artificially handcuffing yourself, compared to literally everybody else, by trying to ship something utilizing a 1B model. You have the worst of all options: a crappy model with lots of hiccups that is still (by far) the most resource-intensive part of your application, making the whole thing super heavy, while you delegate more and more to the cloud model.

      The local LLM scenario is going to be entirely driven by privacy concerns (for which there is no alternative; it's not like an E2EE LLM API could exist) or cost concerns, if you believe you can run it cheaper.

      2 replies →

  • I avoid using cloud whenever I can on principle. For instance, OpenAI recently indicated that they are working on some social network-like service for ChatGPT users to share their chats.

    Running it locally helps me understand how these things work under the hood, which raises my value on the job market. I also play with various ideas which have LLM on the backend (think LLM-powered Web search, agents, things of that nature), I don't have to pay cloud providers, and I already had a gaming rig when LLaMa was released.

  • > unless you have a beefy machine

    The average person in r/locallama has a machine that would make r/pcmasterrace users blush.

    • An Apple M1 is decent enough for LMs. My friend wondered why I got so excited about it when it came out five years ago. It wasn't that it was particularly powerful - it's decent. What it did was to set a new bar for "low end".

      12 replies →

  • General local inference strengths:

    - Experiments with inference-level control; can't do the Outlines / Instructor stuff with most API services, can't do the advanced sampling strategies, etc. (They're catching up but they're 12 months behind what you can do locally.)

    - Small, fast, finetuned models; _if you know what your domain is sufficiently to train a model you can outperform everything else_. General models usually win, if only due to ease of prompt engineering, but not always.

    - Control over which model is being run. Some drift is inevitable as your real-world data changes, but when your model is also changing underneath you it can be harder to build something sustainable.

    - More control over costs; this is the classic on-prem versus cloud decision. In most cases you just want to pay for the cloud... but we're not in ZIRP anymore, and a predictable power bill can trump sudden unpredictable API bills.

    In general, the move to cloud services was originally a cynical OpenAI move to keep GPT-3 locked away. They've built up a bunch of reasons to prefer the in-cloud models (heavily subsidized fast inference, the biggest and most cutting edge models, etc.) so if you need the latest and greatest right now and are willing to pay, it's probably the right business move for most businesses.

    This is likely to change as we get models that can reasonably run on edge devices; right now it's hard to build an app or a video game that incidentally uses LLM tech because user revenue is unlikely to exceed inference costs without a lot of careful planning or a subscription. Not impossible, but definitely adds business challenges. Small models running on end-user devices opens up an entirely new level of applications in terms of cost-effectiveness.

    If you need the right answer, sometimes only the biggest cloud API model is acceptable. If you've got some wiggle room on accuracy and can live with sometimes getting a substandard response, then you've got a lot more options. The trick is that the things an LLM is best at are always going to be things where less than five nines of reliability is acceptable, so even though the biggest models are more reliable on average, there are many tasks where you might be just fine with a small, fast model that you have more control over.

  • This is an excellent example of a local LLM application [1].

    It's an AI-driven chat system designed to support students in the Introduction to Computing course (ECE 120) at UIUC, offering assistance with course content, homework, or troubleshooting common problems.

    It serves as an educational aid integrated into the course's learning environment via the UIUC Illinois Chat system [2].

    Personally, I've found it really useful that it provides the relevant portions of the course study materials (for example, the slides directly related to the discussion) so the students can check the sources and verify the answers provided by the LLM.

    It seems to me that RAG is the killer feature for local LLMs [3]. It directly addresses the main pain point of LLM hallucinations and helps LLMs stick to the facts.

    [1] Introduction to Computing course (ECE 120) Chatbot:

    https://www.uiuc.chat/ece120/chat

    [2] UIUC Illinois Chat:

    https://uiuc.chat/

    [3] Retrieval-augmented generation [RAG]:

    https://en.wikipedia.org/wiki/Retrieval-augmented_generation

    • Does this actually need to be local? Since the chatbot is open to the public, and I assume the course material used for RAG (all on this page: https://canvas.illinois.edu/courses/54315/pages/exam-schedul...) stays freely accessible - I clicked a few links without being a student - I assume a pre-prompted larger non-local LLM would outperform the local instance. Though you can imagine an equivalent course with all of its content ACL-gated/'paywalled' could benefit from local RAG, I guess.

  • You still can get decent stuff out of local ones.

    Mostly I use it for testing tools and integrations via the API, so as not to spend money on subscriptions. When I see something working, I switch to a proprietary model to get the best results.

    • If you're comfortable with the API, all the services provide pay-as-you-go API access that can be much cheaper. I've tried local, but the time cost of getting it to spit out something reasonable wasn't worth the literal pennies the answers from the flagship would cost.

      3 replies →

  • If you look on localllama you'll see most of the people there are really just trying to do NSFW or other questionable or unethical things with it.

    The stuff you can run on reasonable home hardware (e.g. a single GPU) isn't going to blow your mind. You can get pretty close to GPT3.5, but it'll feel dated and clunky compared to what you're used to.

    Unless you have already spent big $$ on a GPU for gaming, I really don't think buying GPUs for home makes sense, considering the hardware and running costs, when you can go to a site like vast.ai and borrow one for an insanely cheap amount to try it out. You'll probably get bored and be glad you didn't spend your kids' college fund on a rack of H100s.

    • There's some other reasons to run local LLMs. If it's on my PC, I can preload the context with, say, information about all the members of my family. Their birthdays, hobbies, favorite things. I can load in my schedule, businesses I frequent. I can connect it to local databases on my machine. All sorts of things that can make it a useful assistant, but that I would never upload into a cloud service.

  • Shouldn't the mixture of experts (MoE) approach allow one to conserve memory by working on one specific problem type at a time?

    > (MoE) divides an AI model into separate sub-networks (or "experts"), each specializing in a subset of the input data, to jointly perform a task.

    • Sort of, but the "experts" aren't easily divisible in a conceptually interpretable way so the naive understanding of MoE is misleading.

      What you typically end up with in memory constrained environments is that the core shared layers are in fast memory (VRAM, ideally) and the rest are in slower memory (system RAM or even a fast SSD).

      MoE models are typically very shallow-but-wide in comparison with the dense models, so they end up being faster than an equivalent dense model, because they're ultimately running through fewer layers each token.

  • I have a large repository of notes, article drafts, and commonplace book-type stuff. I experimented a year or so ago with a system using RAG to "ask myself" what I have to say about various topics. (I suppose nowadays I would use MCP instead of RAG?) I was not especially impressed by the results with the models I was able to run: long-winded responses full of slop and repetition, irrelevant information pulled in from notes that had some semantically similar ideas, and such. I'm certainly not going to feed the contents of my private notebooks to any of the AI companies.

I concur with the LocalLLaMA subreddit recommendation. Not in terms of choosing "the best model", but for answering questions, finding guides, the latest news and gossip, names of tools and various models, and how they stack up against each other, etc.

There's no one "best" model, you just try a few and play with parameters and see which one fits your needs the best.

Since you're on HN, I'd recommend skipping Ollama and LMStudio. They might restrict access to the latest models and you typically only choose from the ones they tested with. And besides what kind of fun is this when you don't get to peek under the hood?

llamacpp can do a lot itself, and it can run most recently released models (when changes are needed, they adjust literally within a few days). You can get models from huggingface, obviously. I prefer the GGUF format; it saves me some memory (you can use lower quantization; I find most 6-bit quants somewhat satisfactory).

I find that the size of the model's GGUF file roughly tells me whether it'll fit in my VRAM. For example, a 24GB GGUF model will NOT fit in 16GB, whereas a 12GB one likely will. However, the more context you add, the more memory will be needed.

Keep in mind that models are trained with a certain context window. If a model has an 8K-token context (like most older models do) and you load it with a 32K context, it won't be much help.

You can run llamacpp on Linux, Windows, or macOS, and you can get the binaries or compile it locally. It can split the model between VRAM and RAM (if the model doesn't fit in your 16GB). It even has a simple React front-end (llama-server). The same binary provides a REST service with a protocol similar to (but simpler than) OpenAI's and all the other "big" guys'.

Since it implements the OpenAI REST API, it also works with a lot of front-end tools if you want more functionality (e.g. oobabooga, aka text-generation-webui).
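
For example, once llama-server is running (it listens on port 8080 by default), any OpenAI-style client or a plain curl can talk to it; roughly:

  curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain GGUF in one sentence."}
          ],
          "temperature": 0.7
        }'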

Koboldcpp is another backend you can try if you find llamacpp to be too raw (I believe it's still llamacpp under the hood).

  • Why skip ollama? I can pull any GGUF straight from HuggingFace eg:

    `ollama run hf.co/unsloth/DeepSeek-R1-0528-GGUF:Q8_0`

  • > Since you're on HN, I'd recommend skipping Ollama and LMStudio.

    I disagree. With Ollama I can set up my desktop as an LLM server, interact with it over WiFi from any other device, and let Ollama switch seamlessly between models as I want to swap. Unless something has changed recently, with llama.cpp's CLI you still have to shut it down and restart it with a different command line flag in order to switch models even when run in server mode.

    That kind of overhead gets in the way of experimentation and can also limit applications: there are some little apps I've built that rely on being able to quickly swap between a 1B and an 8B or 30B model by just changing the model parameter in the web request.
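
    For example, in those little apps the "model switch" is literally just a different string in the request body (the tags below are just illustrations; use whatever you happen to have pulled):

      # Ollama listens on port 11434 by default and loads/unloads models on demand
      curl http://localhost:11434/api/generate \
        -d '{"model": "llama3.2:1b", "prompt": "Is this spam? ...", "stream": false}'

      # same request, bigger model: only the tag changes
      curl http://localhost:11434/api/generate \
        -d '{"model": "qwen3:30b", "prompt": "Is this spam? ...", "stream": false}'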

    • llamacpp can set up a REST server with an OpenAI-compatible API, so you can get many front-end LLM apps to talk to it the same way they talk to ChatGPT, Claude, etc. And you can connect to that machine from another one on the same network through whatever port you set it to. See llama-server.

      When you get Ollama to "switch seamlessly" between models it still simply reloads a different model with llamacpp which is what it's based on.

      I prefer llamacpp because doing things "seamlessly" obscures the way things work behind the scenes, which is what I want to learn and play with.

      Also, and I'm not sure if it's still the case, but it used to be that when llamacpp gets adjusted to work with the latest model, it sometimes takes a while for the Python API that Ollama uses to be updated. It was the case with one of the LLaMAs, forget which one, where people said "oh yeah, don't try this model with Ollama, they're waiting on the llamacpp folks to update llama-cpp-python to bring in the latest changes from llamacpp, and once they do, Ollama will bring the latest into their app and we'll be up and running. Be patient."

      1 reply →

  • Ollama has a really good perk in that it makes it trivial to control which model is loaded into and unloaded from the GPU. So if you're using a frontend like librechat or openwebui, switching models is as easy as picking from the dropdown without having to fiddle with the command line.

The best place to look is HuggingFace

Qwen is pretty good and comes in a variety of sizes. Given your VRAM, I'd suggest Qwen/Qwen3-14B-GGUF at Q4_K_M, run with llama-server or LM Studio (there might be alternatives to LM Studio, but generally these are nice UIs for llama-server). It'll use around 7-8GB for weights, leaving room for incidentals.

Llama 3.3 could work for you

Devstral is too big, but you could run a quantized version.

Gemma is good but tends to refuse a lot. MedGemma is a nice thing to have just in case.

“Uncensored” Dolphin models from Eric Hartford and “abliterated” models are what you want if you don’t want them refusing requests, it’s mostly not necessary for routine use, but sometimes you ask em to write a joke and they won’t do it, or if you wanted to do some business which involves defense contracting or security research, that kind of thing, could be handy.

Generally it's bf16 dtype, so you multiply the number of billions of parameters by two to get the unquantized model size in GB.

Then to get a model that fits on your rig, you generally want a quantized model. Typically I go for "Q4_K_M", which means 4 bits per param, so you divide the number of billions of params by two to calculate the VRAM needed for the weights.

I'm not sure about the overhead for activations, but it might be a good idea to leave wiggle room and experiment with sizes well below 16GB.

llama-server is a good way to run models; it has a GUI on the index route and -hf to download models.

LM Studio is a good gui and installs llama server for you and can help with managing models

Make sure you run some server that loads the model once. You definitely don't want to load many gigabytes of weights into VRAM on every question if you want fast, near-real-time answers.
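
A minimal llama-server setup along those lines might look like this (the repo/quant is just the Qwen3-14B suggestion from above; on recent llama.cpp builds -hf downloads and caches the GGUF from Hugging Face on first run):

  llama-server -hf Qwen/Qwen3-14B-GGUF:Q4_K_M \
    -ngl 99 -c 8192 --port 8080
  # the built-in web UI is then at http://localhost:8080,
  # and any OpenAI-compatible client can use http://localhost:8080/v1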

I only have 8GB of VRAM to work with currently, but I'm running OpenWebUI as a frontend to Ollama, and I have a very easy time loading up multiple models and letting them duke it out, either at the same time or in a round robin.

You can even keep track of the quality of the answers over time to help guide your choice.

https://openwebui.com/

  • AMD 6700 XT owner here (12GB VRAM) - can confirm.

    Once I figured out my local ROCm setup Ollama was able to run with GPU acceleration no problem. Connecting an OpenWebUI docker instance to my local Ollama server is as easy as a docker run command where you specify the OLLAMA_BASE_URL env var value. This isn't a production setup, but it works nicely for local usages like what the immediate parent is describing.
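
    Something along these lines (the port, volume name, and host.docker.internal address are just examples; on Linux you may need --add-host=host.docker.internal:host-gateway or your LAN IP instead):

      docker run -d -p 3000:8080 \
        -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
        -v open-webui:/app/backend/data \
        --name open-webui \
        ghcr.io/open-webui/open-webui:main
      # then browse to http://localhost:3000 and pick any model Ollama has pulled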

The Qwen3 family (and the R1 Qwen3-8B distill) is #1 for programming and reasoning.

However it's heavily censored on political topics because of its Chinese origin. For world knowledge, I'd recommend Gemma3.

This post will be outdated in a month. Check https://livebench.ai and https://aider.chat/docs/leaderboards/ for up to date benchmarks

  • > This post will be outdated in a month

    The pace of change is mind boggling. Not only for the models but even the tools to put them to use. Routers, tools, MCP, streaming libraries, SDKs...

    Do you have any advice for someone who is interested, developing alone and not surrounded by coworkers or meetups who wants to be able to do discovery and stay up to date?

At 16GB a Q4 quant of Mistral Small 3.1, or Qwen3-14B at FP8, will probably serve you best. You'd be cutting it a little close on context length due to the VRAM usage... If you want longer context, a Q4 quant of Qwen3-14B will be a bit dumber than FP8 but will leave you more breathing room. Mistral Small can take images as input, and Qwen3 will be a bit better at math/coding; YMMV otherwise.

Going below Q4 isn't worth it IMO. If you want significantly more context, probably drop down to a Q4 quant of Qwen3-8B rather than continuing to lobotomize the 14B.

Some folks have been recommending Qwen3-30B-A3B, but I think 16GB of VRAM is probably not quite enough for that: at Q4 you'd be looking at 15GB for the weights alone. Qwen3-14B should be pretty similar in practice though despite being lower in param count, since it's a dense model rather than a sparse one: dense models are generally smarter-per-param than sparse models, but somewhat slower. Your 5060 should be plenty fast enough for the 14B as long as you keep everything on-GPU and stay away from CPU offloading.

Since you're on a Blackwell-generation Nvidia chip, using LLMs quantized to NVFP4 specifically will provide some speed improvements at some quality cost compared to FP8 (and will be faster than Q4 GGUF, although ~equally dumb). Ollama doesn't support NVFP4 yet, so you'd need to use vLLM (which isn't too hard, and will give better token throughput anyway). Finding pre-quantized models at NVFP4 will be more difficult since there's less-broad support, but you can use llmcompressor [1] to statically compress any FP16 LLM to NVFP4 locally — you'll probably need to use accelerate to offload params to CPU during the one-time compression process, which llmcompressor has documentation for.

I wouldn't reach for this particular power tool until you've decided on an LLM already, and just want faster perf, since it's a bit more involved than just using ollama and the initial quantization process will be slow due to CPU offload during compression (albeit it's only a one-time cost). But if you land on a Q4 model, it's not a bad choice once you have a favorite.

1: https://github.com/vllm-project/llm-compressor
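
If you go that route, serving the compressed checkpoint is roughly this (the local path and context length below are placeholders):

  pip install vllm
  # assumes the one-time llmcompressor pass already wrote the NVFP4 checkpoint to ./my-model-nvfp4
  vllm serve ./my-model-nvfp4 --max-model-len 16384
  # exposes an OpenAI-compatible endpoint at http://localhost:8000/v1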

I'd suggest buying a better GPU, only because all the models you want need a 24GB card. Nvidia... more or less robbed you.

That said, Unsloth's version of Qwen3 30B, running via llama.cpp (don't waste your time with any other inference engine), with the following arguments (documented in Unsloth's docs, but sometimes hard to find): `--threads (number of threads your CPU has) --ctx-size 16384 --n-gpu-layers 99 -ot ".ffn_.*_exps.=CPU" --seed 3407 --prio 3 --temp 0.6 --min-p 0.0 --top-p 0.95 --top-k 20` along with the other arguments you need.

Qwen3 30B: https://huggingface.co/unsloth/Qwen3-30B-A3B-128K-GGUF (since you have 16GB, grab Q3_K_XL, since it fits in vram and leaves about 3-4GB left for the other apps on your desktop and other allocations llama.cpp needs to make).

Also, why 30B and not the full fat 235B? You don't have 120-240GB of VRAM. The 14B and less ones are also not what you want: more parameters are better, parameter precision is vastly less important (which is why Unsloth has their specially crafted <=2bit versions that are 85%+ as good, yet are ridiculously tiny in comparison to their originals).

Full Qwen3 writeup here: https://unsloth.ai/blog/qwen3

  • > only because all the models you want need a 24GB card

    ???

    Just run a q4 quant of the same model and it will fit no problem.

    • Q4_K_M is the "default" for a lot of models on HF, and they generally require ~20GB of VRAM to run. It will not fit entirely on a 16GB card. You want about 3-4GB of VRAM on top of what the model requires.

      A back of the envelope estimate of specifically unsloth/Qwen3-30B-A3B-128K-GGUF is 18.6GB for Q4_K_M.

I'm afraid that 1) you are not going to get a definite answer, 2) an objective answer is very hard to give, 3) you really need to try a few most recent models on your own and give them the tasks that seem most useful/meaningful to you. There is drastic difference in output quality depending on the task type.

Generally speaking, how can you tell how much VRAM a model will take? It seems like a valuable bit of data that is missing from downloadable model (GGUF) files.

  • Very roughly, you can consider the B's of a model as GBs of memory; then it depends on the quantization level. Say for an 8B model:

    - FP16: 2x 8GB = 16GB

    - Q8: 1x 8GB

    - Q4: 0.5x 8GB = 4GB

    It doesn't map 100% neatly like this, but it gives you a rough measure. On top of this you need some more memory depending on the context length and some other stuff.

    Rationale for the calculation above: a model is basically billions of variables, each holding a floating-point value, so the size of a model roughly maps to the number of variables (weights) x the precision of each variable (4, 8, 16 bits...).

    You don't have to quantize all layers to the same precision, which is why you sometimes see fractional quantizations like 1.58 bits.
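
    As a back-of-the-envelope sketch (weights only; it ignores the KV cache and runtime overhead):

      # bytes per weight: FP16 = 2, Q8 = 1, Q4 ~= 0.5; params are in billions
      awk 'BEGIN { p = 8; printf "%dB params -> FP16 ~%dGB, Q8 ~%dGB, Q4 ~%.1fGB\n", p, p*2, p, p*0.5 }'
      # prints: 8B params -> FP16 ~16GB, Q8 ~8GB, Q4 ~4.0GB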

    • The 1.58-bit quantization uses 3 values: -1, 0, 1. The bit count comes from log_2(3) = 1.58...

      At that level you can pack 4 weights into a byte using 2 bits per weight. However, there is one bit configuration in each that is unused.

      More complex packing arrangements are done by grouping weights together (e.g. a group of 3) and assigning a bit configuration to each combination of values via a lookup table. This allows greater compression, closer to the theoretical 1.58 bits.

Ollama[0] has a collection of models that are either already small or quantized/distilled, and come with hyperparameters that are pretty reasonable, and they make it easy to try them out. I recommend you install it and just try a bunch because they all have different "personalities", different strengths and weaknesses. My personal go-tos are:

Qwen3 family from Alibaba seem to be the best reasoning models that fit on local hardware right now. Reasoning models on local hardware are annoying in contexts where you just want an immediate response, but vastly outperform non-reasoning models on things where you want the model to be less naive/foolish.

Gemma3 from google is really good at intuition-oriented stuff, but with an obnoxious HR Boy Scout personality where you basically have to add "please don't add any disclaimers" to the system prompt for it to function. Like, just tell me how long you think this sprain will take to heal, I already know you are not a medical professional, jfc.

Devstral from Mistral performs the best on my command line utility where I describe the command I want and it executes that for me (e.g. give me a 1-liner to list the dotfiles in this folder and all subfolders that were created in the last month).

Nemo from Mistral, I have heard (but not tested), is really good for routing-type jobs, where you need something that can make a simple multiple-choice decision competently with low latency, and it's easy to fine-tune if you want to get that sophisticated.

[0] https://ollama.com/search

Basic conversations are essentially RP, I suppose. You can look at the KoboldCPP or SillyTavern subreddits.

I was trying Patricide unslop mell and some of the Qwen ones recently. Up to a point more params is better than worrying about quantization. But eventually you'll hit a compute wall with high params.

KV cache quantization is awesome (I use q4 for a 32k context with a 1080ti!) and context shifting is also awesome for long conversations/stories/games. I was using ooba but found recently that KoboldCPP not only runs faster for the same model/settings but also Kobold's context shifting works much more consistently than Ooba's "streaming_llm" option, which almost always re-evaluates the prompt when hooked up to something like ST.

Related question: what is everyone using to run a local LLM? I'm using Jan.ai and it's been okay. I also see OpenWebUI mentioned quite often.

Has anyone had a chance to try local LLMs on the new AMD AI Max+ with 128GB of unified RAM?

Wow, a 5060Ti. 16gb + I'm guessing >=32gb ram. And here I am spinning Ye Olde RX 570 4gb + 32gb.

I'd like to know how many tokens you can get out of the larger models especially (using Ollama + Open WebUI on Docker Desktop, or LM Studio whatever). I'm probably not upgrading GPU this year, but I'd appreciate an anecdotal benchmark.

  - gemma3:12b
  - phi4:latest (14b)
  - qwen2.5:14b [I get ~3 t/s on all these small models, acceptably slow]

  - qwen2.5:32b [this is about my machine's limit; verrry slow, ~1 t/s]
  - qwen2.5:72b [beyond my machine's limit, but maybe not yours]

  • I'm guessing you probably also want to include the quantization levels you're using, as otherwise they'll be a huge variance in your comparisons with others :)

This is what I have: https://sabareesh.com/posts/llm-rig/ ("All You Need is 4x 4090 GPUs to Train Your Own Model").

This question might sound very basic, but it's for the hardware folks: are there any AI-enabled embedded software tools that make the lives of embedded developers easier? Also, how many embedded developers are there in large MNCs like automobile, medical device, or consumer electronics companies? I am trying to judge the TAM for such a startup.

I have an RTX 3070 with 8GB VRAM and for me Qwen3:30B-A3B is fast enough. It's not lightning fast, but more than adequate if you have a _little_ patience.

I've found that Qwen3 is generally really good at following instructions, and you can very easily turn reasoning off by adding "/no_think" to the prompt.

The reason Qwen3:30B works so well is because it's a MoE. I have tested the 14B model and it's noticeably slower because it's a dense model.
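
For example (the model tag is whatever your local copy is called; /no_think is Qwen3's soft switch for skipping the thinking block):

  ollama run qwen3:30b "Write a one-line commit message for a typo fix /no_think"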

  • How are you getting Qwen3:30B-A3B running with 8GB? On my system it takes 20GB of VRAM to launch it.

    • It offloads to system memory, but since there are "only" 3 Billion active parameters, it works surprisingly well. I've been able to run models that are up to 29GB in size, albeit very, very slow on my system with 32GB RAM.

    • Probably offloading to regular RAM, I'd wager. Or really, really, reaaaaaaally quantized to absolute fuck. Qwen3:30B-A3B at Q1 with a 1k Q4 context uses 5.84GB of VRAM.

People ask this question a lot and annoyingly the answer is: there are many definitions of “best”. Speed, capabilities (e.g. do you want it to be able to handle images or just text?), quality, etc.

It’s like asking what the best pair of shoes is.

Go on Ollama and look at the most popular models. You can decide for yourself what you value.

And start small, these things are GBs in size so you don’t want to wait an hour for a download only to find out a model runs at 1 token / second.

I think you'll find that on that card most models that are approaching the 16G memory size will be more than fast enough and sufficient for chat. You're in the happy position of needing steeper requirements rather than faster hardware! :D

Ollama is the easiest way to get started trying things out IMO: https://ollama.com/

Phi-4 is scared to talk about anything controversial, as if it's being watched.

I asked it a question about militias. It thought for a few pages about the answer and whether to tell me, then came back with "I cannot comply".

Nidum is the name of an uncensored Gemma; it does a good job most of the time.

I find Ollama + TypingMind (or similar interface) to work well for me. As for which models, I think this is changing from one month to the next (perhaps not quite that fast). We are in that kind of period. You'll need to make sure the model layers fit in VRAM.

Good question. I've had some success with Qwen2.5-Coder 14B; I used the quantised version: huggingface.co/Qwen/Qwen2.5-Coder-14B-Instruct-GGUF:latest. It worked well on my MacBook Pro M1 32GB. It does get a bit hot on a laptop though.

I like the Mistral models. Not the smartest but I find them to have good conversation while being small, fast and efficient.

And the part I like the most is that there is almost no censorship, at least not for the models I tried. For me, having an uncensored model is one of the most compelling reasons for running an LLM locally. Jailbreaks are a PITA, and abliteration and other uncensoring fine-tunings tend to make models that have already been made dumb by censorship even dumber.

Pick up a used 3090 with more ram.

It holds its value, so you won't lose much, if anything, when you resell it.

But otherwise, as said, install Ollama and/or Llama.cpp and run the model using the --verbose flag.

This will print out the tokens-per-second result after each prompt is returned.

Then find the best model that gives you a token per second speed you are happy with.

And as also said, 'abliterated' models are less censored versions of normal ones.

I’ve had awesome results with Qwen3-30B-A3B compared to other local LMs I’ve tried. Still not crazy good but a lot better and very fast. I have 24GB of VRAM though

Agree with what others have said: you need to try a few out. But I'd put Qwen3-14B on your list of things to try out.

Does anyone know of any local models that are capable of the same tool use (specifically web searches) during reasoning that the foundation models are?

I realize they aren’t going to be as good… but the whole search during reasoning is pretty great to have.

hf.co/bartowski/deepseek-ai_DeepSeek-R1-0528-Qwen3-8B-GGUF:Q6_K is a decent performing model, if you're not looking for blinding speed. It definitely ticks all the boxes in terms of model quality. Try a smaller quant if you need more speed.

The largest Gemma 3 and Qwen 3 you can run. Offload to RAM as many layers as you can.

It's a bit like asking what flavour of icecream is the best. Try a few and see.

For 16GB and speed, you could try Qwen3-30B-A3B with some offload to system RAM, or use a dense model, probably a 14B quant.

I'm running llama3.2 out of the box on my 2013 Mac Pro, the low end quad core Xeon one, with 64GB of RAM.

It's slow-ish but still useful, getting 5-10 tokens per second.

VEGA64 (8GB) is pretty much obsolete for this AI stuff, right (compared to e.g. M2Pro (16GB))?

I'll give Qwen2.5 a try on the Apple Silicon, thanks.

What about for a 5090?

  • I run Qwen3-32B with Unsloth Dynamic Quants 2.0, quantized to 4+ bits, and with the key-value cache reduced to 8-bit. It's my favorite configuration so far. This configuration has the best quality/speed ratio at this moment imho.

    It's pretty magical - it often feels like I'm talking to GPT-4o or o1, until it makes a silly mistake once in a while. It supports reasoning out of the box, which improves results considerably.

    With the settings above, I get 60 tokens per second on an RTX 5090, because it fits entirely in GPU memory. It feels faster than GPT-4o. A 32k context with 2 parallel generations* consumes 28 GB of VRAM (with llama.cpp), so you still have 4 GB left for something else.

    * I use 2 parallel generations because there's a few of us sharing the same GPU. If you use only 1 parallel generation, you can increase the context to 64k
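
    For reference, the llama.cpp flags for a setup like that look roughly like this (the GGUF filename is a placeholder; -c is the total context, so 65536 split over two parallel slots gives 32k each):

      llama-server -m Qwen3-32B-UD-Q4_K_XL.gguf \
        -ngl 99 -c 65536 --parallel 2 \
        -ctk q8_0 -ctv q8_0   # 8-bit key-value cache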

  • Comes with 32GB VRAM right?

    Speaking of, would a Ryzen 9 12 core be nice for a 5090 setup?

    Or should one really go dual 5090?