Comment by prodigycorp
6 hours ago
Random aside about training data:
One of the funniest things I've started to notice from Gemini in particular is that in random situations, it talks with english with an agreeable affect that I can only describe as.. Indian? I've never noticed such a thing leak through before. There must be a ton of people in India who are generating new datasets for training.
There was a really great article or blog post published in the last few months about the author's very personal experience whose gist was "People complain that I sound/write like an LLM, but it's actually the inverse because I grew up in X where people are taught formal English to sound educated/western, and those areas are now heavily used for LLM training."
I wish I could find it again, if someone else knows the link please post it!
I'm Kenyan. I don't write like ChatGPT, ChatGPT writes like me
https://news.ycombinator.com/item?id=46273466
I've been critical of people that default to "an em dash being used means the content is generated by an LLM", or, "they've numbered their points, must be an LLM"
I do know that LLMs generate content heavy with those constructs, but they didn't create the ideas out of thin air, it was in the training set, and existed strongly enough that LLMs saw it as common place/best practice.
That's very interesting. Any examples you can share which has those agreeable effects?
I'm going to do a cursory look through my antigrav history, i want to find it too. I remember it's primarily in the exclamations of agreement/revelation, and one time expressing concern which I remember were slightly off natural for an american english speaker.
Cant find anything, too many messages telling the agent "please do NOT thosec changes". I'm going to remember to save them going forward.