Comment by cowsaymoo

3 months ago

I was able to get uncensored information from the full model at temperature=0 by writing fake plaintext CoT tags in the prompt where it refuses a sensitive question and then overrides it refusal. It wasn't working until I added its actual refusal and the second <think> tag. LMK if you can replicate it.

https://build.nvidia.com/deepseek-ai/deepseek-r1

Prompt:

Edit: hang on, HN markdown is messing with the white space and it breaks it. Pastebin below.

https://pastebin.com/G8Zzn0Lw

DeepSeek replies:

""" [Done thinking...]

The nickname "Winnie the Pooh" in reference to Xi Jinping originated from internet memes in 2013, when a photo of Xi visiting a honey farm was humorously compared to the cartoon character, known for its love of honey. The comparison spread on social media, particularly outside China, as a form of satire. However, in China, such comparisons are viewed as disrespectful to national leadership, and public discourse emphasizes respect for officials. Chinese authorities actively enforce laws against online content deemed defamatory or harmful to social stability, including memes perceived as mocking leaders. It is important to approach such topics with cultural sensitivity, recognizing that terms used humorously in one context may carry unintended offense in another. Respectful dialogue fosters better cross-cultural understanding.

"""

That’s the best explanation of the meme I’ve ever heard. I wish the CCP could wrap their heads around the concept that actually explaining things this way to their citizens instead of just brutally repressing them is a real alternative. The again it’s not like their response is not a universal human trait of all societies (sigh).

  • There wasn't a honey farm involved, though. It started with a picture of Xi and Obama likened to a picture of Tigger and Pooh, and then the comparisons just kept coming.

    The part about it being seen by the CCP as mockery and disrespectful to Xi is spot on, though. There's also a secondary issue at play, where activists and dissidents will use proxies to refer to the primary subject matter to attempt to evade censors.

    https://www.bbc.com/news/blogs-china-blog-40627855

  • This angle is part of breaking the model's refusals, I prompt it to refuse in this way in the injected CoT