Comment by smt88
5 months ago
Reddit and HN are among the highest quality sources of training text and are probably weighted very heavily as "probably human" in the mainstream models.
Any source of text with huge amounts of automated and community moderation will be better quality than, say, Twitter.
Reddit is anything but high quality.
That depends heavily on the subreddits you browse. There absolutely are places with high quality content, though it feels like they are getting sparser and sparser.
Not in that sense; high quality in the sense that there are a lot of actual, real people posting there, and those people tend to come from a pretty diverse set of backgrounds.
Perhaps on the smaller subreddits, but have a look at /r/all on any given day and it's obvious that it is neither real people nor diverse backgrounds. Every single subreddit that goes above a certain activity threshold collapses into the exact same state of astroturfed, mass-produced political slop targeted at low-IQ people.
Old Reddit was.
Oh man, someone should train an LLM on pre-Digg death Reddit and modern Reddit and have them chat. It’d be a hoot.
"among the highEST" is a superlative, so it's a relative claim; it doesn't entail "high".