Comment by COAGULOPATH
5 months ago
>but much worse (and worse even in comparison to GPT4) than English composition
O1 is supposed to be a reasoning model, so I don't think judging it by its English composition abilities is quite fair.
When they release a true next-gen successor to GPT-4 (Orion, or whatever), we may see improvements. Everyone complains about the "ChatGPTese" writing style, and surely they'll fix that eventually.
>Like they hired a few hundred professors, journalists and writers to work with the model and create material for it, so you just get various combinations of their contributions.
I'm doubtful. The most prolific (human) author is probably Charles Hamilton, who wrote 100 million words in his life. Put through the GPT tokenizer, that's 133m tokens. Compared to the text training data for a frontier LLM (trillions or tens of trillions of tokens), it's unrealistic that human experts are doing any substantial amount of bespoke writing. They're probably mainly relying on synthetic data at this point.
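For scale, a rough back-of-envelope sketch of that comparison (the ~0.75 words-per-token figure is OpenAI's commonly cited rule of thumb, and the ~10-trillion-token corpus size is an assumed order of magnitude, not a published number):

```python
# Back-of-envelope: one very prolific human author vs. a frontier LLM's text corpus.
# Assumes ~0.75 English words per GPT token (a rough rule of thumb) and a
# ~10 trillion token training corpus (an assumption for illustration only).
hamilton_words = 100_000_000                  # Hamilton's estimated lifetime word count
hamilton_tokens = hamilton_words / 0.75       # ~133 million tokens
frontier_corpus_tokens = 10_000_000_000_000   # ~10T tokens

share = hamilton_tokens / frontier_corpus_tokens
print(f"{hamilton_tokens / 1e6:.0f}M tokens, about {share:.4%} of the corpus")
# -> 133M tokens, about 0.0013% of the corpus
```

Even the most prolific writer in history would amount to a rounding error in a frontier-scale corpus, which is the point of the comparison.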
> When they release a true next-gen successor to GPT-4 (Orion, or whatever), we may see improvements. Everyone complains about the "ChatGPTese" writing style, and surely they'll fix that eventually.
IMO that has already peaked. The original GPT-4 certainly was terminally corny, but competitors like Claude/Llama aren't as bad, and neither is 4o. Some of the bad writing comes from things they can't/don't want to solve - "harmlessness" RLHF especially makes them all cornier.
Then again, a lot of it is just that GPT4 speaks African English because it was trained by Kenyans and Nigerians. That's actually how they talk!
https://medium.com/@moyosoreale/the-paul-graham-vs-nigerian-...
I just wanted to thank you for the Medium article you posted. I was online when Paul made that bizarre "delve" tweet, but I never knew so much about Nigeria and its English. As someone from a former British colony too, I understood why using such a word was perfectly normal, but I wasn't aware that Kenyans and Nigerians trained ChatGPT.
It wasn't bizarre, it was ignorant if not borderline racist. He was telling native English speakers from non-Anglo-Saxon countries that their English isn't normal.
Italians would say "enormous", since it comes directly from Latin.
In general, people whose first language is a Latin-derived one are very likely to use those "difficult" words, because to them they are "completely normal" words.
The bulk in terms of the number of tokens may well be synthetic data, but I personally know of at least three companies, two of which I've done work for, that have people doing substantial amounts of bespoke writing under rather heavy NDAs. I've personally done a substantial amount of bespoke writing for training data for one provider, at good tech-contractor fees (though I know I'm one of the highest-paid people for that company, and rates vary by a factor of several even for a company with no exposure to third-world contractors).
That said, the speculation that you just "get various combinations" of those contributions is nonsense, and it's also by no means only STEM data.
How do those companies gauge that what those contractors are writing isn't AI-generated?
It doesn't matter per se if it's AI-generated, so it's no crisis if some makes it through. It matters whether it's good. So: multiple rounds of review to judge the output and catch the writers who keep producing poor results.
But I also know they've fired people who were dumb enough to cut and paste a response that included UI elements from a given AI website...
I'm not sure I see the value in conflating input tokens and output tokens. Hamilton certainly read and experienced far more tokens than he wrote down on paper.
What could go wrong!