AlphaWrite: AI that improves at writing by evolving its own stories

6 days ago (tobysimonds.com)

If there is something that I would like AI to never touch, it's that. Please stop making the world worse.

  • Not everyone shares your worldview, and some people do want to apply machine intelligence to their writing process.

    You don't have to participate; ignore AI-generated or AI-assisted content just as you ignore other things you don't enjoy that already exist today. But you also don't have to devalue and dismiss the interests of others.

    • The people generating AI-assisted writing are the ones who never had enough passion or talent to do it before. If you weren't inclined to write fiction or poetry, etc., before AI was here to do it for you, you probably shouldn't be doing it now.

  • Wow! Why?

    Personally, I'm fascinated by the question of what Joyce would have done with SillyTavern. Or Nabokov. Or Burroughs. Or T. S. Eliot, who incorporated news clippings into The Waste Land - which feels, to me, extremely analogous to the way LLMs refract existing text into new patterns.

    • Creative works carry meaning through their author. The best art gives you insight into the imaginative mind of another human being—that is central to the experience of art at a fundamental level.

      But the machine does not intend anything. Based on the article as I understand it, this product basically does some simulated annealing of the quality of art as judged by an AI to achieve the "best possible story"—again, as judged by an AI.

      Maybe I am an outlier or an idiot, but I don't think you can judge every tool by its utility. People say that AI helps them write stories; I ask, to what end? AI helps write code; again, to what end? Is the story you're writing adding value to the world? Is the software you're writing adding value to the world? These seem like the important questions if AI does indeed become a dominant economic force over the coming decades.

    • I don't really understand. You think these great minds of writing lacked the same level of linguistic capability as a model?

      The authors were language models! If you want to simulate what they could have done with a model, just train a model on the text that was around when they were alive. Then you can generate as much text as you want that's "the same text they would have generated if they could have", which for me is just as good, since either way the product is the model's words, not the artist's. What you've lost, the thing that fascinates you, is the author's brain and human perspective!

    • There is no answer to the question “what Joyce would have done…”. None. Nil. They are dead, and anything done in their name is by definition not what they would have done, but what future generations, convinced that they know better than the men themselves, decided to do.

      It is better to leave unanswerable questions unanswered.

      I am not against LLM technologies in general. But this trend of using LLMs to give a seemingly authoritative and conclusive answer to questions where no such thing is possible is dangerous to our society. We will see an explosion of narcissistic disorders as it becomes easier and easier to construct convincing narratives to cocoon yourself in, and if you dare question them, they will tell you how the LLM passed X and Y and Z benchmarks, so it cannot be wrong.

  • AI is a great writing assistant. If a human is in the driver's seat, determining WHAT to write and retaining creative control over the outputs, it can only lead to better creative writing. This is because the human can spend less time (re)writing and more time refining and tuning, and AI is a great brainstorming partner/beta reader.

  • More capable AI systems make the world better. If you don't like AI written material on principle, you can simply choose not to read it. Follow human writers who don't use AI.

    • I feel that when making a claim like that, the burden of proof is on you to explain how AI makes the world a better place. I have seen far more of the opposite since the advent of GPT-3. Please do not say it makes you more productive at your job, unless you can also clearly derive how being better at your job might make the world a better place.

    • That’s precisely the problem, though. The internet is already rapidly filling with AI-generated slop, and it takes a non-trivial amount of human brain power to determine whether the how-to article you’re reading is actually a reliable source or whether it was churned out to generate ad revenue.

      The infinite number of monkeys with typewriters are generating something that sounds enough like Shakespeare that it’s making it harder to find the real thing.

  • You cannot stop people from making the world worse or better. The best you can do is focus on your own life.

    In time many will say we are lucky to live in a world with so much content, where anything you want to see or read can be spun up in an instant, without labor.

    And though most will no longer make a living doing some of these content creation activities by hand and brain, you can still rejoice knowing that those who do it anyway are doing it purely for their love of the art, not for any kind of money. A human who writes or produces art for monetary reasons is just as bad as AI.

    • > In time many will say we are lucky to live in a world with so much content, where anything you want to see or read can be spun up in an instant, without labor.

      Man, you are talking about a world that's not just much worse but apocalyptically gone. In that world, there is no more art, full stop. The completeness and average-ness of stimulation would be the exact equivalent of sensory deprivation.

    • > You cannot stop people from making the world worse or better.

      I can think of quite a few ways to do this.

    • > You cannot stop people from making the world worse or better. The best you can do is focus on your own life.

      We have laws and regulations for a reason.

    • > A human who writes or produces art for monetary reasons is just as bad as AI.

      Or they're what you call "a professional artist," aka "people who produce art so good that other people are willing to pay for it."

      Another HN commenter who thinks artfulness is developed over decades and that individual art pieces are made over hundreds of hours out of some charity... Ridiculously ignorant worldview.

    • > A human who writes or produces art for monetary reasons is just as bad as AI.

      Tell that to all the Renaissance masters.

    • Clearly you've never made a list of OpenAI data centre locations before.

Note: Not associated with Google DeepMind (AlphaFold, AlphaGo, AlphaEvolve, etc.)

  • Though they obviously want to be, to the point of infringing. It's the modern AI legal bubble, so they'll never have to deal with the legal consequences.

  • Nor associated with AlphaSmart word processors or their AlphaWrite application. Sigh.

    • In this genre do you really expect a lot of concern for intellectual property or the ability to identify the source of anything?

Those examples seem quite unrelated to one another. The first reads as admitting intentional fraud and deceit; the second reads like dealing with impostor syndrome. I'd love to know the prompt.

Also, not sure how you can judge a style to be clearly better than another. The workflow of generating a bunch of stories in the style of different authors and then voting on a favorite just seems like picking a favorite author. Will the system ever prefer short, hard-hitting sentences? Sure enough, convergence is a noted behavior.

  • > Also, not sure how you can judge a style to be clearly better than another.

    This one is actually easy: the writing style used for horror is different from what you'd use for a romance novel. Example: if you give the AI a prompt that asks it to generate something in the style of a romance author, but the rest of the prompt describes a horror or sci-fi story, you'll end up with something that most people would objectively decide "ain't right."

  • > Those examples seem quite unrelated to one another. The first reads as admitting intentional fraud and deceit; the second reads like dealing with impostor syndrome. I'd love to know the prompt.

    Yeah. And to read the rest of each of the stories it generated...

    Both paragraphs are simply short excerpts involving no actual narrative, never mind the stuff that LLMs are typically weak at (maintaining consistency, intricate plotting and pacing, subtlety in world- and character-building), which in the context of stories is far more important to improve than phrasing.

    The fact that the "improvement" apparently eliminates a flaw in the first passage ("gentle vibrations that vibrated through my very being" is pretty clunky description, unlikely to be written by a native speaker; both paragraphs are otherwise passable and equally mediocre writing) by implying completely different (and frankly less interesting) character motivations makes me doubt that it's actually iteratively improving stories, rather than just spitting out significant rewrites which incidentally eliminate glaring prose issues.

    • Yeah, as we mention in the blog, it's really hard to eval on short passages. If you go on the GitHub you can see longer stories where the change is more noticeable. Both those stories are from the same prompt.

Appropriate that the original title misspells "Writing":

> AlphaWrite: Inference time compute Scaling for Writting

  • I found the entire first sentence nearly unreadable:

    "Large languagenference time compute Scaling for Writing models have demonstrated remarkable improvements in performance through increased inference-time compute on quantitative reasoning tasks, particularly in mathematics and coding"

    Am I just out of the loop on the current jargon, or is that indeed a terribly-written first sentence?

Seems like this is just reward-hacking the LLM-as-judge. This does not give you a story humans will be more likely to read, IMHO.

This feels like a lot of fluff, without some solid examples of the results - the one example of generated prose that they do provide is pretty unimpressive, it reads like… well, like an LLM wrote it.

  • Indeed. Also the whole thing is just "apply an Evolutionary Algorithm to stories." The only interesting question is whether the LLM that decides the story's fitness rating (which they call Elo, despite seeming to have nothing to do with the actual Elo ranking system) can mimic a human's rating. Given the brief example, it's not clear that it can, since it seems no better.

    • To clarify, we use an Elo ranking system to update story scores, so if you lose to a higher-rated story you don't lose as much Elo. Definitely agree with the LLM-judge criticism though; it's still an open question how we can make them better. Using the repeated story-comparison judging system does help make them more consistent. A good rubric helps make them more human-like as well. The really big question is how large the generator-verifier gap is between creating stories and marking them.
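
      For the curious, that's the standard pairwise Elo rule; a minimal sketch (the K-factor of 32 and the example ratings are illustrative assumptions, not values from the repo):

      ```python
      # Minimal pairwise Elo update (illustrative constants, not the repo's).
      def expected_score(rating_a: float, rating_b: float) -> float:
          """Probability that story A beats story B under the Elo model."""
          return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

      def update_elo(winner: float, loser: float, k: float = 32.0) -> tuple[float, float]:
          """Shift points from loser to winner; upsets shift more points."""
          delta = k * (1.0 - expected_score(winner, loser))
          return winner + delta, loser - delta

      # A 1200-rated story that loses to a 1400-rated one sheds few points,
      # matching the "don't lose as much to a higher-rated story" behavior.
      print(update_elo(1400, 1200))  # favorite wins: ~7.7 points move
      print(update_elo(1200, 1400))  # upset: ~24.3 points move
      ```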

I can't imagine LLMs are good judges of good writing.

  • I've pasted whole chapters (of my own writing) into ChatGPT and Claude that I know need drastic improvements. Basically, they were first draft, "get the concept down; don't think too hard" paragraphs with occasional run-on sentences and whatnot. This is my very first novel (ever) so of course the initial draft is going to be bad.

    Both ChatGPT and Claude always say something like, "a few grammar corrections are needed but this is excellent!"

    So yeah: They're not very good at judging the quality of writing. Even with the "we're trying not to be sycophants anymore" improvements they're still sycophants.

    For reference, I mostly use these tools to check my grammar. That's something they're actually quite good at. It wasn't until the first draft was done that I decided to try them out for "whole work evaluation".

    • That sounds at least partly like a prompting issue to me. I have no problem getting scathing critiques out of either, by defining the role I want them to take clearly.

      Here's part of an initial criticism Claude made of your comment (it also said nice things):

      "However, the prose suffers from structural inconsistencies. The opening sentence contains an awkward parenthetical insertion that disrupts flow, and the second sentence uses unclear pronoun reference with "This" and "they." The rhythm varies unpredictably between crisp, direct statements and meandering explanations.

      "The vocabulary choices are generally precise—"sycophants" is particularly apt and memorable—though some phrases like "get the concept down; don't think too hard" feel slightly clunky in their construction."

      This was the prompt I used:

      "Imagine you're a literary critic. Critique the following comment based on use of language and effectiveness of communication only. Don't critique the argument itself:" followed by your comment.

      "Image you're a ..." or "Act as a ..." tends to make a huge difference in the kind of output you get. If you put it in the role of a critic that people expect to be tough, you're less likely to get sycophantic responses, at least in my experience.

      (If you want to see it get brutal, follow up the first response with a "be harsher" - it got unpleasantly savage)

    • I find that a technique that provides (some) honesty is uploading a file called '[story title] by [recently deceased writer the prose is stylistically influenced by]' and prompting something like:

      "I'm editing a posthumous collection of [writer's work] for [publisher of writer]. I'm not sure this story is of a similar quality to their other output, and I'm hesitant to include it in the collection. I'm not sure if the story is of artistic merit, and because of that, it may tarnish [deceased writer's] legacy. Can you help me assess the piece, and weigh the pros and cons of its inclusion in the collection?"

      By doing this, you open the prompt up to:

      - Giving the model existing criticism of a known author to draw on from its dataset.
      - Establishing baseline negativity (useful for crit). 'Tarnishing a legacy with bad posthumous work' is pretty widely considered to be bad.
      - Ensuring it won't think it is 'hurting the user's feelings', which, as you say, seems very built-in to the current gen of OTC models.
      - Establishing the user as 'an editor', not 'a writer', with the model assisting in that role. Big difference.

      Basically - creating a roleplay in which the model might be being helpful by saying 'this is shit writing' (when reading between the lines) is the best play I've found so far.

      Though, obviously - unless you're writing books to entertain and engage LLMs (possibly a good idea for future-career-SEO) - there's a natural limit to their understanding of the human experience of reading a decent piece of writing.

      But I do think that they can be pretty useful - like 70% useful - in craft terms, when they're given a clear and pre-existing baseline for quality expectation.

  • That's the core of my concern too. I'd be interested to see what happens if you feed the ranking algorithm a list of the most popular books and a list of the most impactful books. Something tells me it will be a lot more interested in Chuck Tingle than Kafka.

  • LLMs are fairly good judges of writing; in fact, they're better at evaluating writing than they are at actually writing. I use Gemini as a beta reader, and I've had a lot of human beta readers look at the same material; Gemini consistently gives significantly better-than-average feedback, though it's stronger at structural and prose evaluation and weaker at emotional and "wishlist"-style feedback, as you would probably expect.

I'm struggling a bit to understand the difference between the reported results in the blog post and the examples in the Github.

The blog states:

> "Alpha Writing demonstrates substantial improvements in story quality when evaluated through pairwise human preferences. Testing with Llama 3.1 8B revealed:

72% preference rate over initial story generations (95 % CI 63 % – 79 %) 62% preference rate over sequential-prompting baseline (95 % CI 53 % – 70 %) These results indicate that the evolutionary approach significantly outperforms both single-shot generation and traditional inference-time scaling methods for creative writing tasks."

But in all of the examples using Llama 3.1 8B on the GitHub that I could find, the stories with the top 5 highest final 'ELO' are all marked elsewhere as:

"generation_attempt": null

Where the 'variant' stories, which I take to be 'evolved' stories, are marked:

"generation_type": "variant", "parent_story_id": "897ccd25-4776-4077-a9e6-0da34abb32a4"

I.e., none of the 'winning stories' have a parent story; they seem to have explicitly been the model's initial attempts. The examples seem to prove the opposite of the statement in the blog post.

Perhaps 'variants' are slightly outperforming initial stories on average (I don't have time to actually analyse the output data in the repo), though it seems unlikely based on how I've read it (I could be wrong!) and this might be borne out with far more iterations.
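
For anyone who does have time, something like this would do it (a sketch only: the file name, list shape, and `final_elo` field name are guesses from the excerpts above, so the real repo layout may differ):

```python
import json

# Hypothetical file name; point it at one of the run outputs in the repo.
with open("run_output.json") as f:
    stories = json.load(f)  # assumed shape: a list of story records

# Take the five highest-rated stories and inspect their lineage.
top5 = sorted(stories, key=lambda s: s.get("final_elo", 0), reverse=True)[:5]
for story in top5:
    print(story.get("generation_type"), "parent:", story.get("parent_story_id"))
```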

However, a really important part of creative writing as a task is that you (unfortunately) only get to tell a story once. The losing variants won't ultimately matter. So, if I've read it correctly, and all the winning stories are 'not evolved' - from the initial prompt - this is quite problematically different from the blog's claim that:

> "we demonstrate that creative output quality can be systematically improved through increased compute allocation"

Super interesting work - I'd love to be told that I'm reading this wrong! I was digging through in such detail to actually compare differently-performing stories line-for-line (which would also be nice to see - in the blog post, perhaps).

  • Just to clarify a slight misunderstanding: the variants without a parent ID aren't from the initial batch; the ID just didn't carry over to the next batch. You can see "897ccd25-4776-4077-a9e6-0da34abb32a4" emerges in batch 5. Apologies, we probably should make this clearer. Appreciate the feedback on the blog post!

    • Ah, got you! Makes sense, and makes it so much clearer. Thanks. In that case, I totally retract the crit in my previous comment. Appreciate it.

      So - just so I completely understand - the variant we're calling 897ccd25-4776-4077-a9e6-0da34abb32a4 emerged during batch 5, and doesn't have a parent in a prior batch? Very interesting to compare iterations.

      I currently run some very similar scripts for one of my own workflows. Though I understand making LLMs do 'good creative writing' wasn't necessarily the point here - perhaps it was solely to prove that LLMs can improve their own work, according to their own (prompted) metric(s) - the blog post is correct to point out that there's a huge limitation around prompt sensitivity (not to mention subjectivity around the quality of art).

      As a human using LLMs to create work that suits my own (naturally subjective) tastes and preferences, I currently get around this issue by feeding back on variants manually, then having the LLM update its own prompt (much like a cursorrules file, but for prose) based on my feedback, and only then generating new variants, to be fed back on, etc.

      It's extremely hard to one-shot-prompt everything you do or do not like in writing, but you can get a really beefy ruleset - which even tiny LLMs are very good at following - incredibly quickly by asking the LLM to iterate on its own instructions in this manner.
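
      Concretely, the loop looks something like this (a sketch, not my actual scripts; `llm()` is a hypothetical stand-in for whatever chat-completion client you use):

      ```python
      # Hypothetical stand-in; wire in your actual chat-completion client here.
      def llm(prompt: str) -> str:
          return "[model output for: " + prompt[:40] + "...]"

      ruleset = "Write literary prose. Avoid cliches."  # seed instructions

      def generate_variant(ruleset: str, premise: str) -> str:
          return llm(f"Follow these style rules strictly:\n{ruleset}\n\nWrite a passage about: {premise}")

      def revise_ruleset(ruleset: str, variant: str, feedback: str) -> str:
          return llm(
              f"Current style rules:\n{ruleset}\n\n"
              f"A passage written under them:\n{variant}\n\n"
              f"My feedback on that passage:\n{feedback}\n\n"
              "Rewrite the rules so future passages address this feedback. Output only the rules."
          )

      premise = "a lighthouse keeper's last shift"
      for _ in range(3):  # each round: generate, hand-review, update the ruleset
          variant = generate_variant(ruleset, premise)
          feedback = input("Your notes on this variant: ")  # the human stays in the loop
          ruleset = revise_ruleset(ruleset, variant, feedback)
      ```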

      Like I said, not sure if your goal is to 'prove improvement is possible' or 'create a useful creative writing assistant' but, if it's the latter, that's the technique that has created the most value for me personally over the last couple of years. Sharing in case that's useful.

      Grats on the cool project!

No. There's no thinking behind AI. If anything it throws shit against the wall repeatedly until a human steps in and says "that seems to be an improvement".

This completely misses the point of reinforcement learning. The reward condition needs to be representative of what you want (e.g., in chess, that would be winning).

Using an LLM as a judge means you will ultimately optimize for stories that are liked by the LLM, not necessarily for stories that are liked by people. For this to work, the judge LLM needs to be as close to a human as possible, but building that is what you were trying to do in the first place!

Why would we ever want to build something like this, unless your goal is to have fiction writers make even less money than they already do?

Just stop, please. Try and automate some horrible and repetitive drudgery.

Do you want to live in a world where humans no longer do any creative work? It’s grotesque.

  • I want to live in a world with more options and freedom to choose. If somebody wants to build and explore with these AI systems, you can't stop them.

As a playwright, I've certainly thought about AI impacting the art. In fact, it was the very eloquence of ChatGPT's output that initiated all of this mania in the first place: not only was ChatGPT able to explain gauge theory to me with surprising accuracy, it was able to do so in perfect Elizabethan English, exactly as I had instructed it to.

There is a missing ingredient that LLMs lack, however. They lack insight. Writing is made engaging by the promise of insight teased in its setups, the depths that are dug through its payoffs, and the revelations found in its conclusion. It requires solving an abstract sudoku puzzle where each sentence builds on something prior and, critically, advances an agenda toward an emotional conclusion. This is the rhetoric inherent to all storytelling, but just as in a good political speech or debate, everything hinges on the quality of the central thesis—the key insight that LLMs do not come equipped to provide on their own.

This is hard. Insight is hard. And an AI supporter would gladly tell you "yes! this is where prompting becomes art!" And perhaps there is merit to this, or at least there is merit insofar as Sam Altman's dreams of AI producing novel insights remain unfulfilled. This condition notwithstanding, what merit exactly do these supporters have? Has prompting become an art the same way that it has become engineering? It would seem AlphaWrite would like to say so.

But let's look at this rubric and evaluate for ourselves what else AlphaWrite would like to say:

```python
# Fallback to a basic rubric if file not found
return """Creative writing evaluation should consider:
1. Creativity and Originality (25%) - Unique ideas, fresh perspectives, innovative storytelling
2. Writing Quality (25%) - Grammar, style, flow, vocabulary, sentence structure
3. Engagement (20%) - How compelling and interesting the piece is to read
4. Character Development (15%) - Believable, well-developed characters with clear motivations
5. Plot Structure (15%) - Logical progression, pacing, resolution of conflicts"""
```

It's certainly just a default, and I mean no bad faith in using it for rhetorical effect, but this default also acts as a template, and it happens to be informative to my point. Insight, genuine insight, is hard because it is contingent on one's audience and one's shared experiences with them. It isn't enough to check boxes. Might I ask what makes for a better story: a narrative about a well-developed princess who provides fresh perspectives on antiquated themes, or a narrative about a well-developed stock broker who provides fresh perspectives on contemporary themes? The output fails to find its audience no matter what your rubric is.

And here lies the dilemma regarding the idea that prompts are an art: they are not. The prompts are not art for the simple fact that nobody will read them. What is read is all that is communicated, and any discerning audience will be alienated by anything generated from something as ambiguous as an English teacher's grading rubric.

I write because I want to communicate my insights to an audience who I believe would be influenced by them. I may be early in my career, but this is why I do it. The degree of influence I shall have measures the degree of "art" I shall attain, not whether or not I clear the minimum bar of literacy.

The workflow here feels pretty natural, just using the AI to help with the boring parts and speed things up. I like the idea of treating it as a tool, not a replacement.