Comment by tqi

14 days ago

> Our testing shows that Llama 4 responds with strong political lean at a rate comparable to Grok (and at half of the rate of Llama 3.3) on a contentious set of political or social topics. While we are making progress, we know we have more work to do and will continue to drive this rate further down.

In my experience, these subjective benchmarks are completely meaningless, because the researchers involved have a strong incentive (promotions, discretionary equity) to cherry-pick measures that they can easily improve.