Comment by lz400
19 days ago
>Hmm... How will it filter out those by the dumbest coders in the world?
if you know, and I know, and the guys at openai and anthropic know... not a big leap that the models will know too? many datasets are curated and labeled by humans
> if you know, and I know,
We don't know.
> and the guys at openai and anthropic know... not a big leap that the models will know too?
The models don't "know" anything. They just regurgitate what they are fed.
"Child abuse images found in AI training data"
https://www.axios.com/2023/12/20/ai-training-data-child-abus...
> many datasets are curated and labeled by humans
Including these ones: "AI industry insiders launch site to poison the data that feeds them"
https://www.theregister.com/2026/01/11/industry_insiders_see...
> having a curated dataset of the works and posts of the top 200 coders in the world
I can't imagine many of the top 200 coders in the world giving their work to the parrots.
But show me the list of the top 200 coders in the world, and I might change my mind! :)
Top 200 that work partially in public. A good example is Mitchell Hashimoto: he works in open source, uses AI a lot, and writes about it. Next-gen AI will learn from the lessons people like him share.
I mean, having a curated dataset of the works and posts of the top 200 coders in the world (at least the public ones) is not very difficult. I’m sure articles like the one in the OP will be very easy to mark as “high value training data”. I think you’re letting your bias blind you.