
Comment by miki123211

6 months ago

If somebody writes in a foreign language and asks ChatGPT to translate to English, is that AI-generated content? What if they write on paper and use an LLM to OCR it? What if they give the AI a very detailed outline, constantly ask for rewrites and are ruthless in removing any facts they're not 100% sure of if they slip in? What if they only use AI to fix the grammar and rewrite bad English into a proper scientific tone?

My answer would be a clear "no" to all of these, even though the content ultimately ends up fully copy-pasted from an LLM in all those cases.

My answer is a clear "yes" to most of those.

Yes, machine translations are AI-generated content. I read foreign-language news sites that sometimes carry machine-translated articles, and the quality stands out, and not in a good way.

"Maybe" for "writing on paper and using LLM for OCR". It's like automatic meeting transcript - if the speaker has perfect pronunciation, it works well. If they don't, then the meeting notes still look coherent but have little relationship to what speaker said and/or will miss critical parts. Sadly there is no way for reader to know that from reading the transcript, so I'd recommend labeling "AI edited" just in case.

Yes, even if "they give the AI a very detailed outline, constantly ask for rewrites, etc.", it's still AI-generated. I am not sure how you can argue otherwise; the words are not theirs. Also, it's really easy to convince yourself that you are being "ruthless in removing any facts you're not 100% sure of" while actually being anything but.

"What if they only use AI to fix the grammar and rewrite bad English into a proper scientific tone?" - I'd label it "AI-edited" if the rewrites are minor or "AI-generated" if the rewrites are major. This one is especially insidious as people may not expect rewrites to change meaning, so they won't inspect them too much, so it will be easier for hallucinations to slip in.

  • > they give the AI a very detailed outline […]

    Honestly, I think that's a tough one.

    (a) it "feels" like you are doing work. Without you the LLM would not even start. (b) it is very close to how texts are generated without LLMs. Be it in academia, with the PI guiding the process of grad students, or in industry, with managers asking for documentation. In both cases the superior takes (some) credit for the work that is in large parts by others.

    • I don't see anything "tough" here.

      At least in academia, if a PI takes credit for a student's work and does not list them as a co-author, it's widely considered unethical. The rules there are simple: if someone contributed to the text, they get onto the author list.

      If we had the same rule for blogs - "this post is authored by fho and ChatGPT" - then I'd be completely satisfied, as that would be sufficient AI disclosure.

      As for industry, I think the rules vary widely from place to place. In some places authorship does not even come up: a slide deck or document can contain copies from random internet sites, or from some previous version of the doc, and a reference will only be present if there is a need (say, to lend it authority).

For the translation part, let me just point out the offensively bad translations that Reddit (URLs with an additional ?tl=foo) and YouTube's automatic dubbing force upon users.

These are immediately, negatively obvious as AI content.

For the other questions, the consensus of many publications/journals has been to treat grammar and spell checking just like non-AI writing, but to require that other uses be declared. So for most of your questions the answer is a firm "yes".

If the purpose is to identify text that can be used as training data, in some ways it makes sense to me to mark anything and everything that isn't hand-typed as AI-generated.

Take your last example: to me, the concept of "proper scientific tone" exists because humans hand-typed/wrote in a certain way. If we use AI-edited/transformed text as the source for what "proper scientific tone" looks like, we still could end up with an echo chamber where AI biases for certain words and phrases feed into training data for the next round.

Being strict about how we mark text could mean a world where 99% of text is marked as AI-touched and less than 1% is marked as human-originated. That's still plenty of text to train on, though such a split could also arguably introduce its own (measurable) biases...

  • > we still could end up with an echo chamber where AI biases for certain words and phrases feed into training data for the next round.

    That’s how it works with humans too. “That sounds professional because it sounds like the professionals”.

All four of your examples are situations where an LLM has the potential to contaminate the structure or content of the text, so in all four cases it is clear-cut that the output poses the same essential hazards to training or consumption as something produced "whole cloth" from a minimal prompt; post-hoc human supervision will at best reduce the severity of these risks.

OK, sure, there are gradations.

The new encoding can contain a FLOAT32 side channel on every character, to represent its proportional "AI-ness" – kinda like the 'alpha' transparency channel on pixels.

Stop ruining my simple and perfect ideas with nuance and complexity!
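Half-joking as that is, here is a minimal Python sketch of what such a per-character "AI-ness" channel could look like; the class and field names are invented purely for illustration and don't correspond to any real format.

```python
# Toy sketch: pair each character of a text with a float in [0.0, 1.0]
# describing how "AI-touched" it is, much like an alpha channel pairs
# each pixel with a transparency value. Names here are made up.
from dataclasses import dataclass

@dataclass
class AnnotatedText:
    text: str
    ai_ness: list[float]  # one value per character, 0.0 = human, 1.0 = AI

    def __post_init__(self):
        if len(self.text) != len(self.ai_ness):
            raise ValueError("need exactly one AI-ness value per character")

    def overall_ai_share(self) -> float:
        """Mean AI-ness across the whole text."""
        return sum(self.ai_ness) / len(self.ai_ness) if self.ai_ness else 0.0

# Example: a human-typed sentence where only the last word was AI-rewritten.
sample = AnnotatedText(
    text="Results were good",
    ai_ness=[0.0] * 13 + [1.0] * 4,
)
print(f"{sample.overall_ai_share():.0%} AI-touched")  # -> 24% AI-touched
```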

  • Nuance and complexity are a thing, but many of the GP's examples should clearly be labeled as AI...

    > What if they give the AI a very detailed outline, constantly ask for rewrites and are ruthless in removing any facts they're not 100% sure of if they slip in?

    • The whole point of those examples is to demonstrate that there is considerable diversity of opinion on how those cases "should" be classified -- which tells us that, at least in the near term, nothing useful can be expected from such a simplistic classification scheme.