Comment by janalsncm

2 days ago

> Can you expand on this? For tasks with verifiable rewards you can improve with rejection sampling and search (i.e. test-time compute). For things like creative writing it’s harder.

For creative writing, you can do the same; you just use human verifiers rather than automatic ones.
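Concretely, the loop is the same best-of-N idea either way (a minimal Python sketch; `generate` and `score` are hypothetical stand-ins, where `score` might run unit tests for code or record a human rating for prose):

```python
def best_of_n(prompt, generate, score, n=16):
    """Sample n candidates and keep the one the verifier rates highest.

    `generate` and `score` are placeholders: `score` could execute unit
    tests (an automatic verifier) or collect a human judgment (a manual one).
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```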

LLMs have encountered the entire spectrum of quality in their training data, from extremely poor writing and sloppy code to absolute masterpieces. Part of what reinforcement learning techniques do is reinforce the "produce things like the masterpieces" behavior while suppressing the "produce low-quality slop" one.
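In policy-gradient terms, that amounts to weighting each sampled output's log-likelihood by how its verifier score compares to the batch average (a minimal REINFORCE-style sketch, not any lab's actual training loss; `log_probs` and `rewards` are assumed per-sequence tensors):

```python
import torch

def reinforce_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """REINFORCE with a mean baseline: above-average outputs get reinforced,
    below-average ones (the slop) get suppressed."""
    advantages = rewards - rewards.mean()  # positive for masterpiece-like outputs, negative for slop
    return -(advantages.detach() * log_probs).mean()
```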

Because there are humans in the loop, this is hard to scale. I suspect that the propensity of LLMs for certain kinds of writing (bullet points, bolded text, a tidy conclusion) is a direct result of this. If you have to judge 200 LLM outputs per day, you prize different qualities than when you only have to judge 3: "does this look correct at a glance" becomes a much more important quality.