Comment by hnfong
2 days ago
I have a feeling that you believe "translation, grammar, and tone-shifting" works but "code generation sucks" for LLMs because you're good at coding and hence you see its flaws, and you're not in the business of doing translation etc.
Pretty sure if you're going to use LLMs for translating anything non-trivial, you'd have to carefully review the outputs, just like if you're using LLMs to write code.
You know, you're right. It -also- sucks at those tasks because on top of the issue you mention, unedited LLM text is identifiable if you get used to its patterns.
Exactly. Books are still being translated by human translators.
I have a text on my computer, the first couple of paragraphs from the Dutch novel "De aanslag", and every few years I feed it to the leading machine translation sites, and invariably, the results are atrocious. Don't get me wrong, the translation is quite understandable, but the text is wooden, and the translation contains 3 or 4 translation blunders.
GPT-5 output for example:
Far, far away in the Second World War, a certain Anton Steenwijk lived with his parents and his brother on the edge of Haarlem. Along a quay, which ran for a hundred meters beside the water and then, with a gentle curve, turned back into an ordinary street, stood four houses not far apart. Each surrounded by a garden, with their small balconies, bay windows, and steep roofs, they had the appearance of villas, although they were more small than large; in the upstairs rooms, all the walls slanted. They stood there with peeling paint and somewhat dilapidated, for even in the thirties little had been done to them. Each bore a respectable, bourgeois name from more carefree days:

Welgelegen
Buitenrust
Nooitgedacht
Rustenburg

Anton lived in the second house from the left: the one with the thatched roof. It already had that name when his parents rented it shortly before the war; his father had first called it Eleutheria or something like that, but then written in Greek letters. Even before the catastrophe occurred, Anton had not understood the name Buitenrust as the calm of being outside, but rather as something that was outside rest—just as extraordinary does not refer to the ordinary nature of the outside (and still less to living outside in general), but to something that is precisely not ordinary.
Can you provide a reference translation, or at least call out the issues you see with this passage? I see "far, far away in the [time period]", which I imagine should be "a long time ago". What are the other issues?
- "they were more small than large" (what?)
- "even in the thirties little had been done to them" (done to them?)
- "Welgelegen Buitenrust Nooitgedacht Rustenburg" (Untranslated!)
- "his father had first called it Eleutheria" (should be "his father would rather have called it")
- "just as extraordinary does not refer to the ordinary nature of the outside" (complete non-sequitur)
2 replies →
Not the original poster, but you can read these paragraphs translated by a human on amazon's sneak peek: https://lesen.amazon.de/sample/B0D74T75KH?f=1&l=de_DE&r=2801...
The difference is gigantic.
By definition, transformers can never exceed the average of their training data.
That's the thing, and it's what companies pushing LLMs don't seem to realize yet.
Can you expand on this? For tasks with verifiable rewards you can improve with rejection sampling and search (i.e. test time compute). For things like creative writing it’s harder.
For creative writing, you can do the same, you just use human verifiers rather than automatic ones.
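The rejection-sampling idea mentioned above can be sketched in a few lines: draw several candidate outputs and keep the one a verifier scores highest (best-of-N). Here `generate` and `score` are toy stand-ins invented for illustration, not any real LLM or reward-model API; for creative writing, the `score` role would be played by a human rater.

```python
import random

# Toy stand-in for an LLM sampler (hypothetical, for illustration only).
def generate(prompt: str) -> str:
    """Pretend LLM: returns one of several candidate completions."""
    candidates = [
        "a long time ago, during the Second World War",
        "far, far away in the Second World War",
        "back in the Second World War",
    ]
    return random.choice(candidates)

# Toy stand-in for a verifier / reward model (or a human rater).
def score(prompt: str, completion: str) -> float:
    """Higher is better. This toy heuristic just penalizes one
    known mistranslation pattern and mildly prefers longer text."""
    if completion.startswith("far, far away"):
        return 0.0
    return len(completion) / 100

def best_of_n(prompt: str, n: int = 8) -> str:
    """Rejection sampling / best-of-N: draw n samples and keep the
    one the verifier scores highest (simple test-time compute)."""
    samples = [generate(prompt) for _ in range(n)]
    return max(samples, key=lambda s: score(prompt, s))
```

The point of the sketch is that sample quality improves with N only as far as the verifier can tell good from bad, which is exactly where human judgment becomes the bottleneck.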
LLMs have encountered the entire spectrum of quality in their training data, from extremely poor writing and sloppy code to absolute masterpieces. Part of what reinforcement-learning techniques do is reinforce the "produce things that are like the masterpieces" behavior while suppressing the "produce low-quality slop" one.
Because there are humans in the loop, this is hard to scale. I suspect that the propensity of LLMs for certain kinds of writing (bullet points, bolded text, conclusions) is a direct result of this. If you have to judge 200 LLM outputs per day, you prize different qualities than when you judge just 3: "does this look correct at a glance?" becomes a much more important quality.