Comment by gabriel666smith · 6 days ago

I'm struggling a bit to understand the difference between the results reported in the blog post and the examples in the GitHub repo.

The blog states:

> "Alpha Writing demonstrates substantial improvements in story quality when evaluated through pairwise human preferences. Testing with Llama 3.1 8B revealed:

72% preference rate over initial story generations (95 % CI 63 % – 79 %) 62% preference rate over sequential-prompting baseline (95 % CI 53 % – 70 %) These results indicate that the evolutionary approach significantly outperforms both single-shot generation and traditional inference-time scaling methods for creative writing tasks."

But in all of the examples using Llama 3.1 8B in the GitHub repo that I could find, the five stories with the highest final Elo ratings are all marked elsewhere as:

"generation_attempt": null

Where the 'variant' stories, which I take to be 'evolved' stories, are marked:

"generation_type": "variant", "parent_story_id": "897ccd25-4776-4077-a9e6-0da34abb32a4"

I.e., none of the 'winning' stories has a parent story; they seem, explicitly, to have been the model's initial attempts. The examples appear to show the opposite of the blog post's claim.

Perhaps 'variants' do slightly outperform initial stories on average (I don't have time to properly analyse the output data in the repo, though a sketch of that check is below); it seems unlikely based on how I've read it (I could be wrong!), and it might be borne out with far more iterations.
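
For anyone with more time than I have, a check along these lines would settle it. This is only a sketch: I'm assuming the output is a JSON list of story records with the "generation_type" key shown above plus some final Elo field; the file path and the `elo` key are guesses on my part, not the repo's actual schema.

```python
import json
from statistics import mean

# Hypothetical path - point this at the repo's actual output file.
with open("output/stories.json") as f:
    stories = json.load(f)

# Split final Elo ratings by whether a story was evolved ("variant") or not.
variant_elos = [s["elo"] for s in stories if s.get("generation_type") == "variant"]
initial_elos = [s["elo"] for s in stories if s.get("generation_type") != "variant"]

print(f"mean Elo, variants: {mean(variant_elos):.1f} (n={len(variant_elos)})")
print(f"mean Elo, initial:  {mean(initial_elos):.1f} (n={len(initial_elos)})")
```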

However, a really important part of creative writing as a task is that you (unfortunately) only get to tell a story once; the losing variants won't ultimately matter. So, if I've read it correctly and all the winning stories are 'not evolved' - i.e., they come straight from the initial prompt - that is problematically different from the blog's claim that:

> "we demonstrate that creative output quality can be systematically improved through increased compute allocation"

Super interesting work - I'd love to be told that I'm reading this wrong! I was digging through in such detail so I could actually compare differently-performing stories line for line (which would also be nice to see in the blog post, perhaps).

Reply from the project author:

Just to clarify a slight misunderstanding: the variants without a parent ID aren't from the initial batch; the parent just didn't carry over to the next batch. You can see that `897ccd25-4776-4077-a9e6-0da34abb32a4` emerges in batch 5. Apologies - I probably should make this clearer. I appreciate the feedback on the blog post!

  • Ah, got you! That makes sense, and makes it much clearer. Thanks. In that case, I totally retract the criticism in my previous comment. Appreciate it.

    So - just so I completely understand - the variant we're calling `897ccd25-4776-4077-a9e6-0da34abb32a4` emerged during batch 5, and doesn't have a parent in a prior batch? Very interesting to compare iterations.

    I currently run some very similar scripts in one of my own workflows. Though I understand that making LLMs do 'good creative writing' wasn't necessarily the point here - perhaps the point was solely to show that LLMs can improve their own work according to their own (prompted) metric(s) - the blog post is right to point out that there's a huge limitation around prompt sensitivity (not to mention the subjectivity of judging art).

    As a human using LLMs to create work that suits my own (naturally subjective) tastes and preferences, I currently get around this issue by giving feedback on variants manually, then having the LLM update its own prompt (much like a .cursorrules file, but for prose) based on that feedback, and only then generating new variants, to be fed back on, and so on.

    It's extremely hard to one-shot-prompt everything you do or don't like in writing, but you can build a really beefy ruleset - which even tiny LLMs are very good at following - incredibly quickly by asking the LLM to iterate on its own instructions in this manner. A rough sketch of that loop follows.
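
    In Python, the loop looks something like this. It's a sketch of my own workflow, not anything from this repo: `call_llm` is a placeholder for whatever completion call you use (a local Llama 3.1 8B, in my case), and all the names are illustrative.

    ```python
    def call_llm(prompt: str) -> str:
        """Placeholder for your model call - swap in your own client here."""
        raise NotImplementedError

    def evolve_ruleset(task: str, ruleset: str, max_rounds: int = 5) -> str:
        """Alternate between drafting, manual feedback, and having the LLM
        fold that feedback back into its own instructions."""
        for _ in range(max_rounds):
            draft = call_llm(
                f"Follow these style rules strictly:\n{ruleset}\n\nTask: {task}"
            )
            print(draft)
            feedback = input("Notes on this draft (blank to stop): ").strip()
            if not feedback:
                break
            # The model rewrites its own ruleset - like a .cursorrules file,
            # but for prose - so the next draft reflects the feedback.
            ruleset = call_llm(
                "Here is a ruleset for creative writing:\n"
                f"{ruleset}\n\n"
                "A reader gave this feedback on a draft written under these "
                f"rules:\n{feedback}\n\n"
                "Rewrite the ruleset to incorporate the feedback. "
                "Return only the updated ruleset."
            )
        return ruleset
    ```

    The important design choice is that the human feedback updates the instructions rather than any single draft, so improvements persist across every future generation instead of dying with a losing variant.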

    Like I said, I'm not sure whether your goal is to 'prove improvement is possible' or to 'create a useful creative-writing assistant', but if it's the latter, that's the technique that has created the most value for me personally over the last couple of years. Sharing in case it's useful.

    Congrats on the cool project!