Comment by lz400

19 days ago

>Hmm... How will it filter out those by the dumbest coders in the world?

if you know, and I know, and the guys at openai and anthropic know... not a big leap that the models will know too? many datasets are curated and labeled by humans


chrisjj 19 days ago

> if you know, and I know,

We don't know.

> and the guys at openai and anthropic know... not a big leap that the models will know too?

The models don't "know" anything. They just regurgitate what they are fed.

"Child abuse images found in AI training data"

https://www.axios.com/2023/12/20/ai-training-data-child-abus...

> many datasets are curated and labeled by humans

Including these ones: "AI industry insiders launch site to poison the data that feeds them"

https://www.theregister.com/2026/01/11/industry_insiders_see...

chrisjj 19 days ago

> having a curated dataset of the works and posts of the top 200 coders in the world

I can't imagine many of the top 200 coders in the world giving their work to the parrots.

But show me the list of the top 200 coders in the world, and I might change my mind! :)
- lz400 18 days ago
  
  Top 200 that work partially in public. A good example is Mitchell Hashimoto: he works in open source, uses AI a lot, and writes about it. Next-gen AI will learn from the lessons people like him share.
  
lz400 19 days ago

I mean, having a curated dataset of the works and posts of the top 200 coders in the world (at least the public ones) is not very difficult. I'm sure articles like the one in OP will be very easy to mark as "high value training data". I think you're letting your bias blind you.