Comment by prodigycorp
11 days ago
Random aside about training data:
One of the funniest things I've started to notice from Gemini in particular is that in random situations, it writes English with an agreeable affect that I can only describe as... Indian? I've never noticed such a thing leak through before. There must be a ton of people in India who are generating new datasets for training.
There was a really great article or blog post published in the last few months about the author's very personal experience whose gist was "People complain that I sound/write like an LLM, but it's actually the inverse because I grew up in X where people are taught formal English to sound educated/western, and those areas are now heavily used for LLM training."
I wish I could find it again. If someone else knows the link, please post it!
I'm Kenyan. I don't write like ChatGPT, ChatGPT writes like me
https://news.ycombinator.com/item?id=46273466
Thanks for that link.
This part made me laugh though:
> These detectors, as I understand them, often work by measuring two key things: ‘Perplexity’ and ‘burstiness’. Perplexity gauges how predictable a text is. If I start a sentence, "The cat sat on the...", your brain, and the AI, will predict the word "floor."
I can't be the only one whose brain predicted "mat"?
Thank you!!! :)
I've been critical of people that default to "an em dash being used means the content is generated by an LLM", or, "they've numbered their points, must be an LLM".
I do know that LLMs generate content heavy with those constructs, but they didn't create them out of thin air. They were in the training set, and appeared often enough that LLMs picked them up as commonplace/best practice.
That's very interesting. Any examples you can share that have that agreeable affect?
I'm going to do a cursory look through my antigrav history; I want to find it too. I remember it's primarily in the exclamations of agreement/revelation, and one time in an expression of concern, which I remember being slightly off from what's natural for an American English speaker.
Can't find anything, too many messages telling the agent "please do NOT those changes". I'm going to remember to save them going forward.