DeepSeek-v3.1

2 days ago (api-docs.deepseek.com)

For local runs, I made some GGUFs! You need around RAM + VRAM >= 250GB for good performance with the dynamic 2-bit quant (2-bit MoE layers, 6-8-bit for the rest). You can also offload to SSD, but it'll be slow.

./llama.cpp/llama-cli -hf unsloth/DeepSeek-V3.1-GGUF:UD-Q2_K_XL -ngl 99 --jinja -ot ".ffn_.*_exps.=CPU"

More details on running + optimal params here: https://docs.unsloth.ai/basics/deepseek-v3.1

  • > More details on running + optimal params here: https://docs.unsloth.ai/basics/deepseek-v3.1

    Was that document almost exclusively written with LLMs? I looked at it last night (~8 hours ago) and it was riddled with mistakes; the most egregious was that the "Run with Ollama" section had instructions for how to install Ollama, but then the shell commands were actually running llama.cpp, a mistake probably no human would make.

    Do you have any plans on disclosing how much of these docs are written by humans vs not?

    Regardless, thanks for the continued release of quants and weights :)

    • Oh hey, sorry, the docs are still under construction! Are you referring to merging GGUFs for Ollama? It should work fine, i.e.:

      ```
      ./llama.cpp/llama-gguf-split --merge \
        DeepSeek-V3.1-GGUF/DeepSeek-V3.1-UD-Q2_K_XL/DeepSeek-V3.1-UD-Q2_K_XL-00001-of-00006.gguf \
        merged_file.gguf
      ```

      Ollama only accepts merged GGUFs (not split ones), hence the command.

      All docs are written by humans (primarily my brother and me); there might just be some typos here and there (sorry in advance).

      I'm also uploading Ollama-compatible versions directly so `ollama run` can work (it'll take a few more hours).
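
      For reference, once you have the merged file, a rough sketch of wiring it into Ollama looks like this (the model name is just a placeholder):

          cat > Modelfile <<'EOF'
          FROM ./merged_file.gguf
          EOF
          ollama create deepseek-v3.1-local -f Modelfile
          ollama run deepseek-v3.1-local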

    • > but then the shell commands were actually running llama.cpp, a mistake probably no human would make.

      But in the docs I see things like

          cp llama.cpp/build/bin/llama-* llama.cpp
      

      Wouldn't this explain that? (Didn't look too deep)

  • By the way, I'm wondering why unsloth (a goddamn python library) tries to run apt-get with sudo (and fails on my NixOS). Like how tf are we supposed to use that?

    • Oh hey I'm assuming this is for conversion to GGUF after a finetune? If you need to quantize to GGUF Q4_K_M, we have to compile llama.cpp, hence apt-get and compiling llama.cpp within a Python shell.

      There is a way to convert to Q8_0, BF16, or F16 without compiling llama.cpp; it's enabled if you use `FastModel` rather than `FastLanguageModel`.

      Essentially I try `sudo apt-get`; if that fails, plain `apt-get`; and if that also fails, it just fails. We need `build-essential cmake curl libcurl4-openssl-dev`.

      See https://github.com/unslothai/unsloth-zoo/blob/main/unsloth_z...
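
      In shell terms, the fallback is roughly this (just a sketch, not the actual unsloth-zoo code):

          # try a privileged install first, then an unprivileged one, then give up
          sudo apt-get install -y build-essential cmake curl libcurl4-openssl-dev \
            || apt-get install -y build-essential cmake curl libcurl4-openssl-dev \
            || echo "could not install build deps, please install them manually" >&2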

      65 replies →

    • hey fellow crazy person! slight tangent: one thing that helps keep me grounded with "LLMs are doing much more than regurgitation" is watching them try to get things to work on nixos - and hitting every rake on the way to hell!

      nixos is such a great way to expose code doing things it shouldn't be doing.

      3 replies →

  • Thanks for your great work with quants. I would really appreciate UD GGUFs for V3.1-Base (and even more so, GLM-4.5-Base + Air-Base).

    • Thanks! Oh base models? Interesting since I normally do only Instruct models - I can take a look though!

  • It’d also be great if you guys could do a fine tune to run on an 8x80G A/H100. These H200/B200 configs are harder to come by (and much more expensive).

    • Unsloth should work on any GPU setup, all the way from the old Tesla T4s to the newer B200s :) We're working on a faster and better multi-GPU version, but using accelerate/torchrun manually with Unsloth should work out of the box!
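
      A minimal sketch of what launching that looks like (the script name and process count are placeholders for your own setup):

          # one process per GPU on a single 8-GPU node
          accelerate launch --num_processes 8 finetune_unsloth.py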

      1 reply →

  • For such a dynamic 2-bit quant, are there any benchmark results showing how much performance I would give up compared to the original model? Thanks.

    • If you are running a 2-bit quant, you are not giving up performance but gaining 100% of it, since the alternative is usually 0%. Smaller quants are for folks who otherwise wouldn't be able to run anything at all, so you run the largest quant you can relative to your hardware. I, for instance, often ran Q3_K_L; I don't think of how much performance I'm giving up, but rather that without Q3 I wouldn't be able to run it at all. With that said, for R1 I did some tests against two public interfaces and my local Q3 crushed them. The problem with a lot of model providers is that we can never be sure what they are serving up, and they could take shortcuts to maximize profit.

      5 replies →

For reference, here is the terminal-bench leaderboard:

https://www.tbench.ai/leaderboard

Looks like it doesn't get close to GPT-5, Claude 4, or GLM-4.5, but still does reasonably well compared to other open weight models. Benchmarks are rarely the full story though, so time will tell how good it is in practice.

Looks to be the ~same intelligence as gpt-oss-120B, but about 10x slower and 3x more expensive?

https://artificialanalysis.ai/models/deepseek-v3-1-reasoning

  • > same intelligence as gpt-oss-120B

    Let's hope not, because gpt-oss-120B can be dramatically moronic. I am guessing the MoE contains some very dumb subnets.

    Benchmarks can be a starting point, but you really have to see how the results work for you.

  • My experience is that gpt-oss doesn't know much about obscure topics, so if you're using it for anything except puzzles or coding in popular languages, it won't do as well as the bigger models.

    Its knowledge seems to be lacking even compared to GPT-3.

    No idea how you'd benchmark this though.

    • > My experience is that gpt-oss doesn't know much about obscure topics

      That is the point of these small models. Remove the bloat of obscure information (address that with RAG), leaving behind a core “reasoning” skeleton.

      1 reply →

    • Something I was doing informally that seems very effective is asking for details about smaller cities and towns and lesser points of interest around the world. Bigger models tend to have a much better understanding and knowledge base for the more obscure places.

      3 replies →

It's a hybrid reasoning model. It's good with tool calls and doesn't overthink everything, but it regularly uses outdated tool-call formats at random instead of the standard JSON format. I guess the V3 training set has a lot of those.

  • What formats? I thought the very schema of JSON is what allows these LLMs to enforce structured outputs at the decoder level? I guess you can do it with any format, but why stray from JSON?

    • Sometimes it will randomly generate something like this in the body of the text:

      ```
      <tool_call>executeshell
      <arg_key>command</arg_key>
      <arg_value>echo "" >> novels/AI_Voodoo_Romance/chapter-1-a-new-dawn.txt</arg_value>
      </tool_call>
      ```

      or this:

      ```
      <|toolcallsbegin|><|toolcallbegin|>executeshell<|toolsep|>{"command": "pwd && ls -la"}<|toolcallend|><|toolcallsend|>
      ```

      Prompting it to use the right format doesn't seem to work. Claude, Gemini, GPT-5, and GLM 4.5 don't do that. To accommodate DeepSeek, the tiny agent that I'm building will have to support all the weird formats.

      3 replies →

    • In the strict modes in APIs, the sampling code essentially rejects and re-inferences any sampled token that wouldn't create valid JSON under a grammar built from the schema. Generally, the training is doing 99% of the work, of course; "strict" just means "we'll check its work to the point that a GBNF grammar created from the schema will validate."
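
      The same idea shows up in llama.cpp via GBNF grammars, e.g. with the generic JSON grammar it ships with (just an illustration of grammar-constrained sampling, not what the hosted APIs actually run; `model.gguf` is a placeholder):

          # constrain sampling so the output must parse as JSON
          ./llama.cpp/llama-cli -m model.gguf \
            --grammar-file llama.cpp/grammars/json.gbnf \
            -p "Return a JSON object describing today's weather in Paris."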

      One of the funnier info scandals of 2025 has been that only Claude was even close to properly trained on JSON file edits until o3 was released, and even then it needed a bespoke format. Geminis have required using a non-formalized diff format by Aider. It wasn't until June that Gemini could do diff-string-in-JSON better than 30% of the time, and until GPT-5 that an OpenAI model could. (Though v4a, as OpenAI's bespoke edit format is called, is fine because it at least worked well in tool calls. Gemini's was a clown show: you had to post-process regular text completions to parse out any diffs.)

      6 replies →

It seems behind Qwen3 235B 2507 Reasoning (which I like) and gpt-oss-120B: https://artificialanalysis.ai/models/deepseek-v3-1-reasoning

Pricing: https://openrouter.ai/deepseek/deepseek-chat-v3.1

  • Those Qwen3 2507 models are the local crème de la crème right now. If you've got any sort of GPU and ~32GB of RAM to play with, the A3B one is great for pair-programming tasks.

  • I too like Qwen a lot; it's one of the best models for programming. I generally use it via the chat.

Not sure if it's just chat.deepseek.com, but one strange thing I've noticed is that it now replies to like 90% of your questions with "Of course.", even when it doesn't fit the prompt at all. Maybe it's the backend injecting it to be more obedient? But you can tell it `don't begin the reply to this with "of" ending "course"` and it will listen. It's very strange.

Some people on Reddit (very reliable source, I know) are saying it was trained on a lot of Gemini output, and I can see that. For example, it does that annoying thing Gemini does now where, when you use slang or really any informal terms, it puts them in quotes in its reply.

  • > for example it does that annoying thing gemini does now where when you use slang or really any informal terms it puts them in quotes in its reply

    Haven't used Gemini much, but the time I used it, it felt very academic and theoretical compared to Opus 4. So that seems to fit. But I'll have to do more evaluation of the non-Claude models to get a better idea of the differences.

    • All this points to "personality" being a big -- and sticky -- selling point for consumer-facing chat bots. People really did like the chatty, emoji-filled persona of the previous ChatGPT models. So OpenAI was ~forced to adjust GPT-5 to be closer to that style.

      It raises a funny "innovator's dilemma" that might happen. Where an incumbent has to serve chatty consumers, and therefore gets little technical/professional training data. And a more sober workplace chatbot provider is able to advance past the incumbent because they have better training data. Or maybe in a more subtle way, chatbot personas give you access to varying market segments, and varying data flywheels.

Seems to hallucinate more than any model I've ever worked with in the past 6 months.

  • DeepSeek is bad for hallucinations in my experience. I wouldn't trust its output for anything serious without heavy grounding. It's great for fantastical fiction though. It also excels at giving characters "agency".

It's a very smart move for DeepSeek to put out an Anthropic-compatible API, similar to Kimi K2 and GLM-4.5 (puzzled as to why Qwen didn't do this). You can set up a simple function in your .zshrc to run Claude Code with these models:

https://github.com/pchalasani/claude-code-tools/tree/main?ta...
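
A minimal sketch of such a function (ANTHROPIC_BASE_URL / ANTHROPIC_AUTH_TOKEN / ANTHROPIC_MODEL are the env vars Claude Code reads; the DeepSeek endpoint and model name here are assumptions, so check the linked repo for the exact values):

    # hypothetical ~/.zshrc helper
    dsclaude() {
      ANTHROPIC_BASE_URL="https://api.deepseek.com/anthropic" \
      ANTHROPIC_AUTH_TOKEN="$DEEPSEEK_API_KEY" \
      ANTHROPIC_MODEL="deepseek-chat" \
      claude "$@"
    }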

  • Wow, thanks! I ran into my Claude Code session limit like an hour ago, tried the method you linked, and added 10 CNY to a DeepSeek API account; an hour later I've got 7.77 CNY left and have used 3.3 million tokens.

    I'm not confident enough to say it's as good as claude opus or even sonnet, but it seems not bad!

    I did run into an api error when my context exceeded deepseek's 128k window and had to manually compact the context.

Sad to see the off-peak discount go. I was able to crank tokens like crazy and not have it cost anything. That said, the pricing is still very, very good, so I can't complain too much.

So, is the output price there why most models are extremely verbose? Is it just a ploy to make extra cash? It's super annoying that I have to constantly tell it to be more and more concise.

  • > It's super annoying that I have to constantly tell it to be more and more concise.

    While system prompting is the easy way of limiting the output in a somewhat predictable manner, have you tried setting `max_tokens` when doing inference? For me that works very well for constraining the output: if you set it to 100 you get very short answers, while if you set it to 10,000 you can get very long responses.
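
    For example, against DeepSeek's OpenAI-compatible endpoint it's just one field in the request body (a sketch; swap in whatever model/endpoint you're actually calling):

        curl https://api.deepseek.com/chat/completions \
          -H "Authorization: Bearer $DEEPSEEK_API_KEY" \
          -H "Content-Type: application/json" \
          -d '{
                "model": "deepseek-chat",
                "messages": [{"role": "user", "content": "Explain MoE routing briefly."}],
                "max_tokens": 100
              }'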

Is it good at tool use? For me tool use is table stakes; if a model can't use tools, then it's almost useless.

Looks quite competitive among open-weight models, but I guess it's still well behind GPT-5 or Claude.

This might be OT and covered somewhere else, but what's the latest/greatest on these models and their effect on the linguistics field, and conversely, what do the latest and greatest in linguistics think about these models?

Cries in 128k context. Probably will be a good orchestrator though, can always delegate to Gemini.

It still can't name all the states in India.

  • That's interesting. I am curious about the extent of the training data in these models.

    I asked Kimi K2 for an account of growing up in my home town in Scotland, and it was ridiculously accurate. I then asked it to do the same for a similarly sized town in Kerala. ChatGPT suggested that while it was a good approximation, K2 got some of the specifics wrong.

Cheap!

$0.56 per million tokens in — and $1.68 per million tokens out.

  • The next cheapest capable model is GLM 4.5 at $0.6 per million tokens in and $2.2 per million tokens out. Glad to see DeepSeek is still the value king.

    But I am still disappointed with the price increase.

how can deepseek be so cheap* yet so effective?

*pricing (deepseek-chat / deepseek-reasoner):
  1M input tokens (cache hit): $0.07
  1M input tokens (cache miss): $0.56
  1M output tokens: $1.68

  • I think it's a combination of the MoE model architecture and inference being done in large batches run in parallel.
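
    To put rough numbers on the MoE part (using DeepSeek's published V3 figures, so treat this as an approximation): the model has ~671B total parameters but only ~37B are activated per token, so per-token compute is roughly that of a 37B dense model, while large parallel batches amortize the memory cost of keeping all the experts loaded across many concurrent requests.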

Hmm. It’s still not close to paid frontier on SWE bench.

  • In my experience, Qwen 3 coder has been very good for agentic coding with Cline. I tried DeepSeek v3.1 and wasn't pleased with it.

Just saw this on the Chinese internet: DeepSeek officially mentioned that v3.1 is trained using UE8M0 FP8, as that is the FP8 format to be supported by the next-gen Chinese AI chip. So basically:

some Chinese next-gen AI chips are coming, and DeepSeek is working with them to get its flagship model trained using such domestic chips.

Interesting times ahead! Just imagine what it could do to NVIDIA's share price when DeepSeek releases a SOTA new model trained without using NVIDIA chips.

  • Time to short Nvidia?

    • No, because people never really talk about the quantity of the alternatives -- i.e. Huawei Ascend. Even if Huawei can match the quality, their yields are still abysmal. The numbers I've heard are in the hundreds of thousands vs. millions by Nvidia. In the near future, Nvidia's dominance is pretty secure. The only thing that can threaten it is if this whole AI thing isn't worth what some people imagined it is worth and people start to realize this.

    • There's no evidence that v3.1 was trained on Chinese chips (they said it very ambiguously; they only said they adapted the model for Chinese chips, which could mean training or inference).

      Anyway, from my experience, if China really had advanced AI chips for SOTA models, I'm sure the propaganda machine would go all out; look how they boasted about the Huawei CPU that's two generations behind Qualcomm and TSMC.

They say the SWE-bench Verified score is 66%; Claude Sonnet 4 is 67%. Not sure if the 1% difference here is statistically significant or not.
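
For rough scale: SWE-bench Verified has about 500 tasks, so the standard error on a 66% score is roughly sqrt(0.66 × 0.34 / 500) ≈ 2 points, which puts a 1-point gap well within the noise.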

I'll have to see how things go with this model after a week, once the hype has died down.

[flagged]

  • Every country acts in its own best interest; the US is not unique in this regard.

    Wait until you find out that China also acts the same way toward the rest of the world (surprise Pikachu face).

  • This does not make any sense to me. “There”? “‘Nationalist’ () bans” of and by whom?

    Dark propaganda opposed to what, light propaganda? The Chinese model being released is about keeping China down?

    You seem very animated about this, but you would probably have more success if you tried to clarify this a bit more.

Reminder: DeepSeek is a Chinese company whose head start is attributed to stealing IP from American companies. Without the huge theft, they'd be nowhere.

  • As if those American companies played fair with training their AIs.

    It's theft all the way down, son

  • I can't say whether those claims are true. But even if they were, it feels selective. Every major AI company trained on oceans of data they didn't create or own. The whole field was built on "borrowing" IP, open-source code, academic papers, datasets, art, text, you name it.

    Drawing the line only now... saying this is where copying stops being okay doesn't seem very fair. No AI company is really in a position to whine about it from my POV (ignoring any lawyer POV). Cue the world's smallest violin

  • Can you contrast this with Western companies? What are the Chinese companies stealing that Western companies aren’t? Do you mean tech or content?

    • Ethics of Chinese vs. Western companies? Everything. I'm sure you're aware of how many hundreds of $billions of American IP are stolen by Chinese companies.

      4 replies →

  • I find it hilarious you felt the need to make this comment in defense of American LLMs. You know that American LLMs aren’t trained ethically either, right? Many people’s data was used for training without their permission.

    BTW DeepSeek has contributed a lot, with actual white papers describing in detail their optimizations. How are the rest of the American AI labs doing in contributing research and helping one another advance the field?

  • Reminder that OpenAI is an American company whose head start is attributed to stealing copyrighted material from everyone else. Without the huge theft, they'd be nowhere.

    • Last I checked, as it concerns the training of their models, all legal challenges are pending. No theft has yet been proven, as they used publicly available data.

      3 replies →

  • If an American company did this, it would be "innovative bootstrapping". Yawn.