Comment by prodigycorp

22 days ago

Random aside about training data:

One of the funniest things I've started to notice from Gemini in particular is that, in random situations, it writes English with an agreeable affect that I can only describe as... Indian? I've never noticed such a thing leak through before. There must be a ton of people in India generating new datasets for training.

11 comments

evntdrvn 22 days ago

There was a really great article or blog post published in the last few months about the author's very personal experience whose gist was "People complain that I sound/write like an LLM, but it's actually the inverse because I grew up in X where people are taught formal English to sound educated/western, and those areas are now heavily used for LLM training."

I wish I could find it again, if someone else knows the link please post it!

gxnxcxcx 22 days ago
I'm Kenyan. I don't write like ChatGPT, ChatGPT writes like me
https://news.ycombinator.com/item?id=46273466
- tverbeure 21 days ago
  
  Thanks for that link.
  This part made me laugh though:
  > These detectors, as I understand them, often work by measuring two key things: ‘Perplexity’ and ‘burstiness’. Perplexity gauges how predictable a text is. If I start a sentence, "The cat sat on the...", your brain, and the AI, will predict the word "floor."
  I can't be the only one whose brain predicted "mat"?
  
- evntdrvn 20 days ago
  
  Thank you!!! :)
awesome_dude 22 days ago

I've been critical of people who default to "an em dash being used means the content is generated by an LLM", or "they've numbered their points, must be an LLM".
I do know that LLMs generate content heavy with those constructs, but they didn't create the ideas out of thin air. Those constructs were in the training set, and existed strongly enough that LLMs saw them as commonplace/best practice.

blenderob 22 days ago

That's very interesting. Any examples you can share that show that agreeable affect?

prodigycorp 22 days ago
I'm going to do a cursory look through my antigrav history; I want to find it too. I remember it's primarily in the exclamations of agreement/revelation, and one time in an expression of concern, which I remember being slightly off from what's natural for an American English speaker.
- prodigycorp 22 days ago
  
  Can't find anything, too many messages telling the agent "please do NOT those changes". I'm going to remember to save them going forward.