Comment by suddenlybananas

6 days ago

I can't imagine LLMs are good judges of good writing.

I've pasted whole chapters (of my own writing) into ChatGPT and Claude that I know need drastic improvements. Basically, they were first draft, "get the concept down; don't think too hard" paragraphs with occasional run-on sentences and whatnot. This is my very first novel (ever) so of course the initial draft is going to be bad.

Both ChatGPT and Claude always say something like, "a few grammar corrections are needed but this is excellent!"

So yeah: They're not very good at judging the quality of writing. Even with the "we're trying not to be sycophants anymore" improvements they're still sycophants.

For reference, I mostly use these tools to check my grammar. That's something they're actually quite good at. It wasn't until the first draft was done that I decided to try them out for "whole work evaluation".

  • That sounds at least partly like a prompting issue to me. I have no problem getting scathing critiques out of either by clearly defining the role I want them to take.

    Here's part of an initial criticism Claude made of your comment (it also said nice things):

    "However, the prose suffers from structural inconsistencies. The opening sentence contains an awkward parenthetical insertion that disrupts flow, and the second sentence uses unclear pronoun reference with "This" and "they." The rhythm varies unpredictably between crisp, direct statements and meandering explanations.

    "The vocabulary choices are generally precise—"sycophants" is particularly apt and memorable—though some phrases like "get the concept down; don't think too hard" feel slightly clunky in their construction."

    This was the prompt I used:

    "Imagine you're a literary critic. Critique the following comment based on use of language and effectiveness of communication only. Don't critique the argument itself:" followed by your comment.

    "Image you're a ..." or "Act as a ..." tends to make a huge difference in the kind of output you get. If you put it in the role of a critic that people expect to be tough, you're less likely to get sycophantic responses, at least in my experience.

    (If you want to see it get brutal, follow up the first response with a "be harsher" - it got unpleasantly savage.)
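    For what it's worth, here's roughly what that role-framing looks like if you drive it through the API instead of the chat UI. This is only a minimal sketch using the Anthropic Python SDK - the model name, token limit, and file path are my own placeholders, not anything from this thread:

        import anthropic

        # Load the draft you want critiqued (path is just an example).
        with open("chapter_01.txt", encoding="utf-8") as f:
            draft = f.read()

        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

        response = client.messages.create(
            model="claude-sonnet-4-20250514",  # placeholder: use whatever model is current
            max_tokens=1024,
            # The role-framing goes in the system prompt, same as the chat prompt above.
            system=(
                "Imagine you're a literary critic. Critique the following text based on "
                "use of language and effectiveness of communication only. "
                "Don't critique the argument itself."
            ),
            messages=[{"role": "user", "content": draft}],
        )

        print(response.content[0].text)

    A follow-up "be harsher" turn is just another user message appended to the same conversation.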

    • Those critiques are nonsensical; the referents of "this" and "they" are completely obvious in the comment and the "clunky" construction is completely fine. There's also no "this" in the second sentence?


  • I find that one technique that elicits (some) honesty is to upload a file called '[story title] by [recently deceased writer the prose is stylistically influenced by]' and prompt something like:

    "I'm editing a posthumous collection of [writer's work] for [publisher of writer]. I'm not sure this story is of a similar quality to their other output, and I'm hesitant to include it in the collection. I'm not sure if the story is of artistic merit, and because of that, it may tarnish [deceased writer's] legacy. Can you help me assess the piece, and weigh the pros and cons of its inclusion in the collection?"

    By doing this, you open the prompt up to:

    - Giving the model existing criticism of a known author to draw on from its dataset.
    - Establishing baseline negativity (useful for crit): 'tarnishing a legacy with bad posthumous work' is pretty widely considered to be bad.
    - Ensuring it won't think it is 'hurting the user's feelings', which, as you say, seems very built-in to the current gen of OTC models.
    - Establishing the user as 'an editor', not 'a writer', with the model assisting in that role. Big difference.

    Basically - creating a roleplay in which saying 'this is shit writing' (reading between the lines) counts as being helpful is the best play I've found so far.

    Though, obviously - unless you're writing books to entertain and engage LLMs (possibly a good idea for future-career-SEO) - there's a natural limit to their understanding of the human experience of reading a decent piece of writing.

    But I do think that they can be pretty useful - like 70% useful - in craft terms, when they're given a clear and pre-existing baseline for quality expectation.
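    If it helps, that framing is easy to templatize so you can reuse it across pieces. A trivial sketch in Python (the names in the example call are placeholders, not recommendations):

        def framing_prompt(writer: str, publisher: str) -> str:
            """Build the 'posthumous collection' framing described above."""
            return (
                f"I'm editing a posthumous collection of {writer}'s work for {publisher}. "
                "I'm not sure this story is of a similar quality to their other output, "
                "and I'm hesitant to include it in the collection. I'm not sure if the "
                "story is of artistic merit, and because of that, it may tarnish "
                f"{writer}'s legacy. Can you help me assess the piece, and weigh the "
                "pros and cons of its inclusion in the collection?"
            )

        # Example usage with placeholder names:
        print(framing_prompt("Jane Doe", "Example House"))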

That's the core of my concern too. I'd be interested to see what happens if you feed the ranking algorithm a list of the most popular books and a list of the most impactful books. Something tells me it would be a lot more interested in Chuck Tingle than Kafka.

LLMs are fairly good judges of writing; in fact, they're better at evaluating writing than they are at actually writing. I use Gemini as a beta reader, and I've had a lot of human beta readers look at the same material; Gemini consistently gives significantly better-than-average feedback, though it's stronger at structural and prose evaluation and weaker at emotional and "wishlist" style feedback, as you would probably expect.

  • How are you using it as a beta reader? What prompts do you use? I'd love to try it.

    • Just dump your manuscript into Google's AI Studio, tell it you'd like it to serve as a beta reader/editor, and tell it what your objectives are with your manuscript so it can give you targeted feedback.
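      The same thing works if you'd rather script it than paste into the AI Studio UI. A minimal sketch with the google-generativeai Python package - the model name, file path, and stated objective are placeholders you'd swap for your own:

          import google.generativeai as genai

          genai.configure(api_key="YOUR_API_KEY")  # or load it from an environment variable

          # Load the manuscript text (path is just an example).
          with open("manuscript.txt", encoding="utf-8") as f:
              manuscript = f.read()

          model = genai.GenerativeModel(
              "gemini-1.5-pro",  # placeholder: pick whichever model you'd use in AI Studio
              # State the role and your objectives up front, as described above.
              system_instruction=(
                  "Act as a beta reader and editor. My objective is a fast-paced "
                  "thriller for adult readers; give targeted feedback on structure, "
                  "pacing, and prose."
              ),
          )

          response = model.generate_content(manuscript)
          print(response.text)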