Comment by joshuaisaact
9 hours ago
This has been pretty comprehensively disproven:
https://arxiv.org/abs/2311.10054
Key findings:
- Tested 162 personas across 6 types of interpersonal relationships and 8 domains of expertise, with 4 LLM families and 2,410 factual questions
- Adding personas in system prompts does not improve model performance compared to the control setting where no persona is added
- Automatically identifying the best persona is challenging, with predictions often performing no better than random selection
- While adding a persona may lead to performance gains in certain settings, the effect of each persona can be largely random
Fun piece of trivia - the paper was originally designed to prove the opposite result (that personas make LLMs better). They revised it when they saw the data completely disproved their original hypothesis.
A persona is not the same thing as a role. The point of a role is to limit the work of the agent and to focus it on one or two behaviors.
What the paper is really addressing is whether keywords like "you are a helpful assistant" give better results.
The paper is not addressing a role such as "you are a system designer" or "you are a security engineer", which will produce completely different results and focus the output of the LLM.
Aside from what you said about applicability, the paper actually contradicts their claim!
In the domain alignment section:
> The coefficient for “in-domain” is 0.004(p < 0.01), suggesting that in-domain roles generally lead to better performance than out-domain roles.
Although the effect size is small, why would you not take advantage of it?
How well does such llm research hold up as new models are released?
Most model research decays because the evaluation harness isn’t treated as a stable artefact. If you freeze the tasks, acceptance criteria, and measurement method, you can swap models and still compare apples to apples. Without that, each release forces a reset and people mistake novelty for progress.
In a discussion about LLMs you link to a paper from 2023, when not even GPT-4 was available?
And then you say:
> comprehensively disproven
? I don't think you understand the scientific method
Fair point on the date - the paper was updated October 2024 with Llama-3 and Qwen2.5 (up to 72B), same findings. The v1 to v3 revision is interesting. They initially found personas helped, then reversed their conclusion after expanding to more models.
"Comprehensively disproven" was too strong - should have said "evidence suggests the effect is largely random." There's also Gupta et al. 2024 (arxiv.org/abs/2408.08631) with similar findings if you want more data points.
...or even how fast technology is evolving in this field.
One study has "comprehensively disproven" something for you? You must be getting misled left, right, and centre if that's how you absorb study results.