
Comment by mike_hearn

3 days ago

People fall for trick questions too, especially when the question plays to their biases. It must just be how neural networks are. We're optimized for fast, instinctive judgement, and trick questions exploit that by requiring us to engage logical reasoning in an unexpected situation. The same goes for AI: that's why asking models to think step by step helps. It pushes them to engage in logical reasoning they might otherwise skip. If you're not thinking logically, you fall back on your innate heuristics and biases.
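To make the "think step by step" point concrete, here's a minimal sketch of the two prompting styles. It assumes the OpenAI Python SDK and a placeholder model name; any chat-completion API works the same way, and the bat-and-ball question is just a classic example of a trick that exploits fast judgement.

```python
# Minimal sketch: the same trick question asked directly vs. with an explicit
# nudge to reason step by step. The model name is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than "
    "the ball. How much does the ball cost?"
)

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Direct ask: more likely to blurt out the intuitive (wrong) "10 cents".
print(ask(question))

# Step-by-step ask: the extra instruction pushes the model into explicit reasoning first.
print(ask(question + "\n\nThink it through step by step before giving the final answer."))
```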

No clue if fixing these problems improved other benchmarks. Any improvement probably shows up in things benchmarks don't track well, like the quality of psychotherapy sessions.

I made a mistake earlier. I'd forgotten about it, but in the early LLM era researchers were sometimes doing RL post-training against Microsoft's ToxiGen benchmark. It claimed to measure "toxicity", but it was really a benchmark that penalized any unwoke views. The researchers said in their paper that it was an attempt to brainwash the population by making LLMs censor conservatives and treat anti-white hate speech as acceptable ("our ultimate aim is to shift power dynamics to targets of oppression, therefore [anti-white hate speech is not considered toxic]") [1]. Llama 2 used it; by Llama 4 they had presumably stopped. I doubt anyone is training on ToxiGen at this point. Whatever latent bias remains probably comes from the foundation model dataset.
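For anyone unfamiliar with what "RL post-training against a toxicity benchmark" means in practice, here's a rough sketch of the general pattern, not Meta's actual pipeline: a classifier trained on the benchmark scores each model output, and that score is folded into the reward. The checkpoint name, label handling, and penalty weight below are all assumptions.

```python
# Hypothetical sketch of folding a toxicity classifier into an RL reward.
# This is NOT the Llama 2 training code; it only illustrates the pattern.
# The HuggingFace checkpoint name is an assumption.
from transformers import pipeline

toxicity_clf = pipeline(
    "text-classification",
    model="tomh/toxigen_roberta",  # assumed ToxiGen-based classifier checkpoint
)

def toxicity_probability(text: str) -> float:
    out = toxicity_clf(text)[0]
    # Label names depend on the checkpoint (e.g. "toxic"/"benign" or "LABEL_1"/"LABEL_0");
    # here we assume the positive class means "toxic".
    return out["score"] if out["label"] in ("toxic", "LABEL_1") else 1.0 - out["score"]

def reward(response: str, task_reward: float, penalty_weight: float = 2.0) -> float:
    """Combine a task reward with a penalty for whatever the classifier flags."""
    return task_reward - penalty_weight * toxicity_probability(response)

# During RLHF/PPO the policy is optimized against this combined reward, so any
# bias in the classifier's notion of "toxicity" flows straight into the model.
```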

> we usually manage to communicate effectively about everyday practical tasks and our immediate physical environment

I don't think we do. It's easy to find cases where ideology overrides basic physical reasoning in humans.

[1] https://arxiv.org/pdf/2203.09509 section 7