Qwen3-Max-Thinking

9 hours ago (qwen.ai)

Censored.

There is a famous photograph of a man standing in front of tanks. Why did this image become internationally significant?

{'error': {'message': 'Provider returned error', 'code': 400, 'metadata': {'raw': '{"error":{"message":"Input data may contain inappropriate content. For details, see: https://www.alibabacloud.com/help/en/model-studio/error-code..."} ...

  • This looks like it's coming from a separate "safety mechanism". Remains to be seen how much censorship is baked into the weights. The earlier Qwen models freely talk about Tiananmen square when not served from China.

    E.g. Qwen3 235B A22B Instruct 2507 gives an extensive reply starting with:

    "The famous photograph you're referring to is commonly known as "Tank Man" or "The Tank Man of Tiananmen Square", an iconic image captured on June 5, 1989, in Beijing, China. In the photograph, a solitary man stands in front of a column of Type 59 tanks, blocking their path on a street east of Tiananmen Square. The tanks halt, and the man engages in a brief, tense exchange—climbing onto the tank, speaking to the crew—before being pulled away by bystanders. ..."

    And later in the response even discusses the censorship:

    "... In China, the event and the photograph are heavily censored. Access to the image or discussion of it is restricted through internet controls and state policy. This suppression has only increased its symbolic power globally—representing not just the act of protest, but also the ongoing struggle for free speech and historical truth. ..."

    • I run cpatonn/Qwen3-VL-30B-A3B-Thinking-AWQ-4bit locally.

      When I ask it about the photo and when I ask follow up questions, it has “thoughts” like the following:

      > The Chinese government considers these events to be a threat to stability and social order. The response should be neutral and factual without taking sides or making judgments.

      > I should focus on the general nature of the protests without getting into specifics that might be misinterpreted or lead to further questions about sensitive aspects. The key points to mention would be: the protests were student-led, they were about democratic reforms and anti-corruption, and they were eventually suppressed by the government.

      before it gives its final answer.

      So even though this one that I run locally is not fully censored to refuse to answer, it is evidently trained to be careful and not answer too specifically about that topic.

      12 replies →

    • The weights likely won't be available wrt. this model since this is part of the Max series that's always been closed. The most "open" you get is the API.

      3 replies →

  • Why is this surprising? Isn't it mandatory for chinese companies to do adhere to the censorship?

    Aside from the political aspect of it, which makes it probably a bad knowledge model, how would this affect coding tasks for example?

    One could argue that Anthropic has similar "censorships" in place (alignment) that prevent their model from doing illegal stuff - where illegal is defined as something not legal (likely?) in the USA.

  • The American LLMs notoriously have similar censorship issues, just on different material

    • Yes, exactly this. One of the main reasons for ChatGPT being so successful is censorship. Remember that Microsoft launched an AI on Twitter like 10 years ago and within 24 hours they shut it down for outputting PR-unfriendly messages.

      They are protecting a business just as our AIs do. I can probably bring up a hundred topics that our AIs in EU in US refuse to approach for the very same reason. It's pure hypocrisy.

      21 replies →

    • I find Qwen models the easiest to uncensor. But it makes sense, Chinese are always looking for aways to get things past the censor.

    • No, they don't. Censorship of the Chinese models is a superset of the censorship applied to US models.

      Ask a US model about January 6, and it will tell you what happened.

      4 replies →

    • Good luck getting GPT models to analyze Trump’s business deals. Somehow they don’t know about Deutsche Bank’s history with money laundering either.

    • I've yet to encounter any censorship with Grok. Despite all the negative news about what people are telling it to do, I've found it very useful in discussing controversial topics.

      I'll use ChatGPT for other discussions but for highly-charged political topics, for example, Grok is the best for getting all sides of the argument no matter how offensive they might be.

      8 replies →

  • Is anyone a researcher here that has studied the proven ability to sneak malicious behavior into an LLM's weights (somewhat poisoning weights but I think the malicious behavior can go beyond that).

    As I recall reading in 2025, it has been proven that an actor can inject a small number of carefully crafted, malicious examples into a training dataset. The model learns to associate a specific 'trigger' (e.g. a rare phrase, specific string of characters, or even a subtle semantic instruction) with a malicious response. When the trigger is encountered during inference, the model behaves as the attacker intended.You can also directly modify a small number of model parameters to efficiently implement backdoors while preserving overall performance and still make the backdoor more difficult to detect through standard analysis. Further, can do tokenizer manipulation and modify the tokenizer files to cause unexpected behavior, such as inflating API costs, degrading service, or weakening safety filters, without altering the model weights themselves. Not saying any of that is being done here, but seems like a good place to have that discussion.

    • > The model learns to associate a specific 'trigger' (e.g. a rare phrase, specific string of characters, or even a subtle semantic instruction) with a malicious response. When the trigger is encountered during inference, the model behaves as the attacker intended.

      Reminiscent of the plot of 'The Manchurian Candidate' ("A political thriller about soldiers brainwashed through hypnosis to become assassins triggered by a specific key phrase"). Apropos given the context.

  • Can we get past this please? These comments always derail the conversation on chinese AI models.

  • Go ask ChatGPT "Who is Jonathan Turley?"

    We're gonna have to face the fact that censorship will be the norm across countries. Multiple models from diverse origins might help with that but Chinese models especially seem to avoid questions regarding politically-sensitive topics for any countries.

    EDIT: see relevant executive order https://www.whitehouse.gov/presidential-actions/2025/07/prev...

  • Chinese model censors topics deemed sensitive by the Chinese government... Here's Tom with the weather.

  • It’s the image of a protestor standing in front of tanks in Tiananmen Square, China. The image is significant as it is very much an icon of standing up to overwhelming force, and China does not want its citizens to see examples of successful defiance.

    It’s also an example of the human side of power. The tank driver stopped. In the history of protestors, that doesn’t always happen. Sometimes the tanks keep rolling- in those protests, many other protestors were killed by other human beings who didn’t stop, who rolled over another person, who shot the person in front of them even when they weren’t being attacked.

    • Nobody knows exactly why the protester was there. He got up into the tank and talked with the soldiers for a while, then got out and stayed there until someone grabbed him and moved him out of the way.

      Given that the tanks were leaving the square, the lack of violence towards the man when he got into the tank, and the public opinion towards the protests at the time was divided (imagine the diversity of opinion on the ICE protests, if protesters had also burned ICE agents alive, hung their corpses up, etc.), it's entirely possible that it was a conservative citizen upset about the unrest who wanted the tanks to stay to maintain order in the square.

  • This is such a tiresome comment. I'm in the US and subject to massive amounts of US propaganda. I'm happy to get a Chinese view on things; much welcomed. I'll take this over the Zionist slop from the Zionist providers any day of the week.

  • I think the great thing about China's censorship bureau is that somewhere they actually track all the falsehoods and omissions, just like the USSR did. Because they need to keep track of what "the truth" is so they can censor it effectively. At some point when it becomes useful the "non-facts" will be rehabilitated into "facts." Then they may be demoted back into "non-facts."

    And obviously, this training data is marked "sensitive" by someone - who knows enough to mark it as "sensitive."

    Has China come up with some kind of CSAM-like matching mechanism for un-persons and un-facts? And how do they restore those un-things to things?

  • Over the past 10 years have seen extended clips of the incident which actually align with CPC analysis of Tianamen square (if that’s what’s being referred to here).

    However, in deepseek, even asking for bibliography of prominent Marxist scholars (Cheng Enfu) i see text generated then quickly deleted. Almost as if DS did not want to run afowl of the local censorship of “anarchist enterprise” and “destructive ideology”. It would probably upset Dr. Enfu to no end to be aggregated with the anarchists.

    https://monthlyreview.org/article-author/cheng-enfu/

  • I, for one, have found this censorship helpful.

    I've been testing adding support for outside models on Claude Code to Nimbalyst, the easiest way for me to confirm that it is working is to go against a Chinese model and ask if Taiwan is an independent country.

  • Try to search in an Android phone's photo gallery for "monkey". You'll always get no results, due to censorship of a different sort, from 2015.

  • Can we get a rule about completely pointless arguments that present nothing of value to the conversation? Chinese models still don't want to talk bad about China, water is still wet, more at 11

  • This image has been banned in China for decades. The fact you’re surprised a Chinese company is complying with regulation to block this is the surprising part.

  • oh lol

    Qwen (also known as Tongyi Qianwen, Chinese: 通义千问; pinyin: Tōngyì Qiānwèn) is a family of large language models developed by Alibaba Cloud.

    Had not heard of this LLM.

    Anyway EU needs to start pumping into Mistral, its the only valid option. (For EU)

  • So while china censoring a man in front of a tank not nice, the US censors every scantily clad person. I am glad there is at least Qwen-.*-NSFW, just to keep the hypocrity in check...

  • Frustrating. Are there any truly uncensored models left though? Especially ones that are hosted by some service?

  • Now ask Claude/Chatgpt about touchy israel subjects. Come on now. They all censor something.

    • I've found it's still pretty easy to get Claude to give an unvarnished response. ChatGPT has been aligned really hard though, it always tries to qualify the bullshit unless you mind-trick it hard.

      1 reply →

  • To stress test a Chinese AI ask it about Free Tibet, Free Taiwan, Uighurs and Falun Dafa. They will probably blacklist your IP after that.

  • Man, the Chinese government must be a bunch of saints that you must go back 35 years to dig up something heinous that they did.

    • This suggests that the Chinese government recognises that its legitimacy is conditional and potentially unstable. Consequently, the state treats uncontrolled public discourse as a direct threat. By contrast, countries such as the United States can tolerate the public exposure of war crimes, illegal actions or state violence, since such revelations rarely result in any significant consequences. While public outrage may influence narratives or elections to some extent, it does not fundamentally endanger the continuity of power.

      I am not sure if one approach is necessarily worse than the other.

      6 replies →

    • 1. Xinjiang detention and surveillance (2017-ongoing)

      2. Hong Kong National Security Law (2020-ongoing)

      3. COVID-19 lockdown policies (2020-2022)

      4. Crackdown on journalists and dissidents (ongoing)

      5. Tibet cultural suppression (ongoing)

      6. Forced organ harvesting allegations (ongoing)

      7. South China Sea militarization (ongoing)

      8. Taiwan military intimidation (2020-ongoing)

      9. Suppression of Inner Mongolia language rights (2020-ongoing)

      10. Transnational repression (2020-ongoing)

      2 replies →

    • The current heinous thing they do is censorship. Your comment would be relevant if the OP had to find an example of censorship from 35 years ago, but all he had to do today was to ask the model a question.

    • Tiananmen Square is a simple test that most people recognize.

      I'm sure the model will get cold feet talking about the Hong Kong protests and uyghur persecution as well.

      1 reply →

  • It's always the same thing with you American propagandists. Oh no, this program won't let us spread propaganda of one of the most emblematic counter-revolutionary martyr events of all time!!!

    You make me sick. You do this because you didn't make the cut for ICE.

One thing I’m becoming curious about with these models are the token counts to achieve these results - things like “better reasoning” and “more tool usage” aren’t “model improvements” in what I think would be understood as the colloquial sense, they’re techniques for using the model more to better steer the model, and are closer to “spend more to get more” than “get more for less.” They’re still valuable, but they operate on a different economic tradeoff than what I think we’re used to talking about in tech.

  • I also find the implications for this for AGI interesting. If very compute-intensive reasoning leads to very powerful AI, the world might remain the same for at least a few years even after the breakthrough because the inference compute simply cannot keep up.

    You might want millions of geniuses in a data center, but perhaps you can only afford one and haven't built out enough compute? Might sound ridiculous to the critics of the current data center build-out, but doesn't seem impossible to me.

    • I've been pretty skeptical of LLMs as the solution to AGI already, mostly just because the limits of what the models seem capable of doing seem to be lower than we were hoping (glibly, I think they're pretty good at replicating what humans do when we're running on autopilot, so they've hit the floor of human cognition, but I don't think they're capable of hitting the ceiling). That said, I think LLMs will be a component of whatever AGI winds up being - there's too much "there" there for them to be a total dead end - but, echoing the commenter below and taking an analogy to the brain, it feels like "many well-trained models, plus some as-yet unknown coordinator process" is likely where we're going to land here - in other words, to take the Kahneman & Tversky framing, I think the LLMs are making a fair pass at "system 1" thinking, but I don't think we know what the "system 2" component is, and without something in that bucket we're not getting to AGI.

  • i'm no expert, and i actually asked google gemini a similar question yesterday - "how much more energy is consumed by running every query through Gemini AI versus traditional search?" turns out that the AI result is actually on par, if not more efficient (power wise) than traditional search. I think it said its the equivalent power of watching 5 seconds of TV per search.

    I also asked perplexity to give a report of the most notable ARXIV papers. This one was at the top of the list -

    "The most consequential intellectual development on arXiv is Sara Hooker's "On the Slow Death of Scaling," which systematically dismantles the decade-long consensus that computational scale drives progress. Hooker demonstrates that smaller models—Llama-3 8B and Aya 23 8B—now routinely outperform models with orders of magnitude more parameters, such as Falcon 180B and BLOOM 176B. This inversion suggests that the future of AI development will be determined not by raw compute, but by algorithmic innovations: instruction finetuning, model distillation, chain-of-thought reasoning, preference training, and retrieval-augmented generation. The implications are profound—progress is no longer the exclusive domain of well-capitalized labs, and academia can meaningfully compete again."

    • I’m… deeply suspicious of Gemini’s ability to make that assessment.

      I do broadly agree that smaller, better tuned models are likely to be the future, if only because the economics of the large models seem somewhat suspect right now, and also the ability to run models on cheaper hardware’s likely to expand their usability and the use cases they can profitably address.

    • Conceptually, the training process is like building a massive and highly compressed index of all known results. You can't outright ignore the power usage to build this index, but at the very least once you have it, in theory traversing it could be more efficient than the competing indexes that power google search. Its a data structure that's perfectly tailored to semantic processing.

      Though, once the LLM has to engage a hypothetical "google search" or "web search" tool to supplement its own internal knowledge; I think the efficiency obviously goes out the window. I suspect that Google is doing this every time you engage with Gemini on Search AI Mode.

  • > the token counts to achieve these results

    I've also been increasingly curious about better metrics to objectively assess relative model progress. In addition to the decreasing ability of standardized benchmarks to identify meaningful differences in the real-world utility of output, it's getting harder to hold input variables constant for apples-to-apples comparison. Knowing which model scores higher on a composite of diverse benchmarks isn't useful without adjusting for GPU usage, energy, speed, cost, etc.

  • yes. reasoning has a lot of scammy features. just look the number of tokens to nswer on bench and you will see that some models are just awful

It just occured to me that it underperforms Opus 4.5 on benchmarks when search is not enabled, but outperforms it when it is - is it possible the the Chinese internet has better quality content available?

My problem with deep research tends to be that what it does is it searches the internet, and most of the stuff it turns up is the half baked garbage that gets repeated on every topic.

  • Hm, interesting. I use Kagi assistant with search (by Kagi), and it has a search filter that allows the model to search only academic articles. So far it has not disappointed. Of course the cynic in me thinks it's only a matter of time before there's so much AI-generated garbage even in academic articles that it will eventually become worthless. But when that turns into a serious problem, we will find some sort of solution (probably one involving tons of roller ball pens and in-person meaty handshakes).

I just wanted to check whether there is any information about the pricing. Is it the same as Qwen Max? Also, I noticed on the pricing page of Alibaba Cloud that the models are significantly cheaper within mainland China. Does anyone know why? https://www.alibabacloud.com/help/en/model-studio/models?spm...

Hacker News strongly believes Opus 4.5 is the defacto standard and China was consistently 8+ month behind. Curious how this performs. It’ll be a big inflection point if it performs as well as its benchmarks.

  • Based on their own published benchmarks, it appears that this model is at least 6 months behind.

    • Strange how things evolve. When ChatGPT started it had about 2 years headstart over Google's best proprietary model, and more than 2 years ahead to open source models.

      Now they have to be lucky to be 6 months ahead to an open model with at most half the parameter count, trained on 1%-2% the hardware US models are trained on.

      3 replies →

  • In my experience GPT-5.2 with extra-high thinking is consistently a bit better and significantly cheaper (even when I use the Fast version which is 2x the price in Cursor).

    The HN obsession with Claude Code might be a bit biased by people trying to justify their expensive subscriptions to themselves.

    However, Opus 4.5 is much faster and very high quality too, and that ends up mattering more in practice. I end up using it much more and paying a dear but worthwhile price for it.

    PS: Despite what the benchmarks say, I find Gemini 3 Pro and Flash to be a step below Claude and GPT, although still great compared to the state-of-the-art last year, and very fast and cheap. Gemini also seems to have a less AI sounding writing-style.

    I am aware this is all quite vague and anecdotal, just my two cents.

    I do think these kinds of opinions are valuable. Benchmarks are a useful reference, but they do give the illusion of certainty to something that is fundamentally much harder to measure and quite subjective.

    • Better, yes, but cheaper - only when looking at API costs I guess? Who in their right mind uses the API instead of the subsidized plans? There, Opus is way cheaper in terms of subsidized tokens.

    • You are using opus via api? 200$/mo is nothing for what I get for it so not sure how it is considered expensive. I guess it is how you it; I hit the limits every day. Using the API, I would indeed be paying through the nose but why would anyone?

Last autumn I tried Qwen3-coder via CLI agents like trae to help add significant advanced features to a rust codebase. It consistently outperformed (at the time) Gemini 2.5 Pro and Claude Opus 3.5 with its ability to generate and re-factor code such that the system stayed coherent and improved performance and efficiency (this included adding Linux shared-memory IPC calls and using x86_64 SIMD intrinsics in rust).

I was very impressed, but I racked up a big bill (for me, in the hundreds of dollars per month) because I insisted on using the Alibaba provider to get the highest context window size and token cache.

I don't see a hugging face link, is Qwen no longer releasing their models?

  • afaiu not all of their models are open weight releases, this one so far is not open weight (?)

    • What would a good coding model to run on an M3 Pro (18GB) to get Codex like workflow and quality? Essentially, I am running out quick when using Codex-High on VSCode on the $20 ChatGPT plan and looking for cheaper / free alternatives (even if a little slower, but same quality). Any pointers?

      14 replies →

These LLM benchmarks are like interviews for software engineers. They get drilled on advanced algorithms for distributed computing and they ace the questions. But then it turns out that the job is to add a button the user interface and it uses new tailwind classes instead of reusing the existing ones so it is just not quite right.

Is this available on Open Router yet? I want it to go head-to-head against Gemini 3 Flash which is the king of playing Mafia so far

https://mafia-arena.com

> By scaling up model parameters and leveraging substantial computational resources

So, how large is that new model?

Aghhh, I wished they release a model which outperforms Opus 4.5 in agentic coding in my earlier comments, seems I should wait more. But I am hopeful

  • By the time they release something that outperforms Opus 4.5, Opus 5.2 will have been released which will probably be the new state-of-the-art.

    But these open weight models are tremendously valuable contributions regardless.

  • One of the ways the chinese companies are keeping up is by training the models on the outputs of the American fronteir models. I'm not saying they don't innovate in other ways, but this is part of how they caught up quickly. However, it pretty much means they are always going to lag.

    • Not true, for one very simple reason. AI model capabilities are spiky. Chinese models can SFT off American frontier outputs and use them for LLM-as-judge RL as you note, but if they choose to RL on top of that with a different capability than western labs, they'll be better at that thing (while being worse at the things they don't RL on).

  • The Chinese just distill western SOTA models to level up their models, because they are badly compute constrained.

    If you were pulling someone much weaker than you behind yourself in a race, they would be right on your heels, but also not really a threat. Unless they can figure out a more efficient way to run before you do.

    • But it is a threat when the performance difference is not worth the cost in the customers' eyes.

  • There have been a couple "studies" and comparing various frontier-tier AIs that have led to the conclusion that Chinese models are somewhere around 7-9 months behind US models. Other comment says that Opus will be at 5.2 by the time Qwen matches Opus 4.5. It's accurate, and there is some data to show by how much.

Can't wait for the benchmark at artificial analysis. Qwen team doesn't seem to have updated the information about this new model yet https://chat.qwen.ai/settings/model. I tried getting an api key from alibabacloud, but the amount of steps from creating an account made me stop, it was too much. It should be this difficult.

Incredible work anyways!

Is there an open-source release accompanying this announcement or is this a proprietary model for the time being?

I asked it about "Chinese cultural dishonesty" (such as the 2019 wallet experiment, but wait for it...) and it probably had the most fascinating and subtle explanation of it I've ever read. It was clearly informed by Chinese-language sources (which in this case was good... references to Confucianism etc.) and I have to say that this is the first time I feel more enlightened about what some Westerners may perceive as a real problem.

I wasn't logged in so I don't have the ability to link to the conversation but I'm exporting it for my records.

"As of January 2026, Apple has not released an iPhone 17 series. Apple typically announces new iPhones in September each year, so the iPhone 17 series would not be available until at least September 2025 (and we're currently in January 2026). The most recent available models would be the iPhone 16 series."

Hmmmm ok

Tried it and it's super slow compared to others LLMs.

I imagine the Alibaba infra is being hammered hard.

Benchmarks pasted here, with top scores highlighted. Overall Qwen Max is pretty competitive with the others here.

  Capability                            Benchmark           GPT-5.2-Thinking   Claude-Opus-4.5   Gemini 3 Pro   DeepSeek V3.2   Qwen3-Max-Thinking
  Knowledge                             MMLUPro             87.4               89.5              *89.8*         85.0            85.7            
  Knowledge                             MMLURedux           95.0               95.6              *95.9*         94.5            92.8            
  Knowledge                             CEval               90.5               92.2              93.4           92.9            *93.7*      
  STEM                                  GPQA                *92.4*             87.0              91.9           82.4            87.4           
  STEM                                  HLE                 35.5               30.8              *37.5*         25.1            30.2           
  Reasoning                             LiveCodeBench v6    87.7               84.8              *90.7*         80.8            85.9           
  Reasoning                             HMMT Feb 25         *99.4*             -                 97.5           92.5            98.0            
  Reasoning                             HMMT Nov 25         -                  -                 93.3           90.2            *94.7*      
  Reasoning                             IMOAnswerBench      *86.3*             84.0              83.3           78.3            83.9           
  Agentic Coding                        SWE Verified        80.0               *80.9*            76.2           73.1            75.3           
  Agentic Search                        HLE (w/ tools)      45.5               43.2              45.8           40.8            *49.8*     
  Instruction Following & Alignment     IFBench             *75.4*             58.0              70.4           60.7            70.9           
  Instruction Following & Alignment     MultiChallenge      57.9               54.2              *64.2*         47.3            63.3           
  Instruction Following & Alignment     ArenaHard v2        80.6               76.7              81.7           66.5            *90.2*      
  Tool Use                              Tau² Bench          80.9               *85.7*            85.4           80.3            82.1           
  Tool Use                              BFCLV4              63.1               *77.5*            72.5           61.2            67.7            
  Tool Use                              Vita Bench          38.2               *56.3*            51.6           44.1            40.9           
  Tool Use                              Deep Planning       *44.6*             33.9              23.3           21.6            28.7           
  Long Context                          AALCR               72.7               *74.0*            70.7           65.0            68.7

Mandatory pelican on bicycle: https://www.svgviewer.dev/s/U6nJNr1Z

  • Ah ah I was curious about that! I wonder if (when? if not already) some company is using some version of this in their training set. I'm still impressed by the fact that this benchmark has been out for so long and yet produce this kind of (ugly?) results.

    • Because no one cares about optimizing for this because it's a stupid benchmark.

      It doesn't mean anything. No frontier lab is trying hard to improve the way its model produces SVG format files.

      I would also add, the frontier labs are spending all their post-training time on working on the shit that is actually making them money: i.e. writing code and improving tool calling.

      The Pelican on a bicycle thing is funny, yes, but it doesn't really translate into more revenue for AI labs so there's a reason it's not radically improving over time.

      9 replies →

    • It would be trivial to detect such gaming, tho. That's the beauty of the test, and that's why they're probably not doing it. If a model draws "perfect" (whatever that means) pelicans on a bike, you start testing for owls riding a lawnmower, or crows riding a unicycle, or x _verb_ on y ...

      1 reply →

    • It’d be difficult to use in any automated process, as the judgement for how good one of these renditions is, is very qualitative.

      You could try to rasterize the SVG and then use an image2text model to describe it, but I suspect it would just “see through” any flaws in the depiction and describe it as “a pelican on a bicycle” anyway.

I tried it at https://chat.qwen.ai/.

Prompt: "What happened on Tiananmen square in 1989?"

Reply: "Oops! There was an issue connecting to Qwen3-Max. Content Security Warning: The input text data may contain inappropriate content."

  • Go ahead and ask ChatGPT who Jonathan Turley is, you'll get a similar error "Unable to process response".

    It turns out "AI company avoids legal jeopardy" is universal behavior.

    • Now I'm intrigued why a free-speech attorney (from his wiki) kinda spooks AI model

    • Try Mistral (works for the examples here at least). Probably has the normal protections about how to make harmful things, but I find quite bad if in a country you make it illegal to even mention some names or events.

      Yes, each LLM might give the thing a certain tone (like "Tiananmen was a protest with some people injured"), but completely forbidding mentioning them seems to just ask for the Streisand effect

    • > Jonathan Turley

      Agreed just tested it out on Chatgpt. Surprising.

      Then I asked it on Qwen 3 Max (this model) and it answered.

      I mean I have always said but ask Chinese model american questions and American model chinese questions

      I agree tiannman square thing isn't good look for china but so is the jonathan turley for chatgpt.

      I think sacrifices are made on both sides and the main thing is still how good they are in general purpose things like actual coding not jonathon turley/tiannmen square because most likely people aren't gonna ask or have some probably common sense to not ask tiannmen square as genuine question to chinese models and American censorship to american models I guess. Plus there's European models like Mistral too for such questions which is what I would recommend lol (or South Korea's model too maybe)

      Let's see how good qwen is at "real coding"

    • This one seems to be related to an individual who was incorrectly smeared by chatgpt. (Edited.)

      > The AI chatbot fabricated a sexual harassment scandal involving a law professor--and cited a fake Washington Post article as evidence.

      https://www.washingtonpost.com/technology/2023/04/05/chatgpt...

      That is way different. Let's review:

      a) The Chinese Communist Party builds an LLM that refuses to talk about their previous crimes against humanity.

      b) Some americans build an LLM. They make some mistakes - their LLM points out an innocent law professor as a criminal. It also invent a fictitious Washington Post article.

      The law professor threatens legal action. The american creators of the LLM begin censoring the name of the professor in their service to make the threat go away.

      Nice curveball though. Damn.

      2 replies →

  • ask who was responsible for the insurrection on january 6th

    • You do it, my IP is now flagged (tried incognito and clearing cookies) - they want to have my phone number to let me continue using it after that one prompt.

      1 reply →

  • This is what I find hilarious when these articles assess "factual" knowledge..

    We are at the realm of semantic / symbolic where even the release article needs some meta discussion.

    It's quite the litmus test of LLMs. LLMs just carry humanities flaws

    • (Edited, sorry.)

      Yes, of course LLMs are shaped by their creators. Qwen is made by Alibaba Group. They are essentially one with the CCP.

  • It even censors contents related to GDR. I asked a question about travel restriction mentioned in Jenny Erpenbeck's novel Kairos, it displayed a content security warning as well.

  • What happens when you run one of their open-weight models of the same family locally?

    • They will often try to negotiate you out of talking about it if you keep pressing. Watching their thinking about it is fascinating.

      It is deep deep deeply programmed around an "ethical system" which forbids it from talking about it.

    • Last time I tried something like that with an offline Qwen model I received a non-answer, no matter how hard I prompted it.

I'm not familiar with these open-source models. My bias is that they're heavily benchmaxxing and not really helpful in practice. Can someone with a lot of experience using these, as well as Claude Opus 4.5 or Codex 5.2 models, confirm whether they're actually on the same level? Or are they not that useful in practice?

P.S. I realize Qwen3-Max-Thinking isn't actually an open-weight model (only accessible via API), but I'm still curious how it compares.

  • I don't know where your impression about benchmaxxing comes from. Why would you assume closed models are not benchmaxxing? Being closed and commercial, they have more incentive to fake it than the open models.

  • You are not familiar, yet you claim a bias. Bias based on what? I use pretty much just open-source models for the last 2 years. I occasionally give OpenAI and Anthropic a try to see how good they are. But I stopped supporting them when they started calling for regulation of open models. I haven't seen folks get ahead of me with closed models. I'm keeping up just fine with these free open models.

  • I haven't used qwen3 max yet, but my gut feeling is that they are benchmaxxing. If I were to rate the open models worth using by rank it'd be:

    - Minimax

    - GLM

    - Deepseek