Comment by hintymad

13 days ago

Wouldn't this be worrisome? People used StackOverflow and generated new knowledge along the way. Without such medium for discussion, how can we feed models with up-to-date quality knowledge?

33 comments


Jyaif 13 days ago

We unironically need a StackOverflow for LLMs.

LLMs would post solutions to the issues that they've discovered after doing a lot of research.

Unfortunately the LLMs are concentrated in a few providers (OpenAI, Anthropic, Google), so there's a chance they each end up doing their own private (and closed) StackOverflows. By leveraging their private StackOverflows, their LLMs will be able to short-circuit complex reasoning, saving tokens, time, and money.

JadeNB 13 days ago
> LLMs would post solutions to the issues that they've discovered after doing a lot of research.
How do you envision the correctness of these solutions being judged? If by other LLMs, then we run into a problem of infinite descent. If by humans, then you'd need some way to motivate expert or semi-expert humans (so that their ratings are themselves correct) to participate in a massive project of evaluating the correctness of a constant stream of content from content-generators that never sleep.
- Jyaif 13 days ago
  
  > How do you envision the correctness of these solutions being judged?
  By LLMs. I think it's possible for agents to infer whether the user was satisfied or not, at least with my usage pattern. For example, if I end the discussion it's a good sign. If I ask follow-up questions that look like workarounds, it's a bad sign :-)
  You could also simply prompt the users whether they were satisfied with the answer they received, possibly incentivizing them with StackOverflow-style gamification.
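
A minimal sketch of the heuristic described above, assuming a conversation is available as a plain list of user messages. The `Conversation` type, the `WORKAROUND_HINTS` phrases, and the scoring rule are all hypothetical and only illustrate the idea; a real system would presumably need something far more robust than keyword matching.

```python
from dataclasses import dataclass

# Hypothetical phrases suggesting the user is still hunting for a workaround.
WORKAROUND_HINTS = ("workaround", "still fails", "doesn't work", "instead", "another way")


@dataclass
class Conversation:
    # User turns in order; the first entry is the original question.
    user_messages: list[str]


def satisfaction_score(conv: Conversation) -> float:
    """Return a rough score in [0, 1]; higher means the user was more likely satisfied."""
    follow_ups = conv.user_messages[1:]
    if not follow_ups:
        # The user ended the discussion after the first answer: a good sign.
        return 1.0
    # Count follow-ups that look like the user is working around a bad answer.
    flagged = sum(
        any(hint in msg.lower() for hint in WORKAROUND_HINTS) for msg in follow_ups
    )
    return 1.0 - flagged / len(follow_ups)
```

An explicit "were you satisfied?" prompt, as suggested above, could be combined with this score rather than replaced by it.
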
nikole9696 13 days ago

This actually reminds me of the MCP concept. Similar?

crazygringo 13 days ago

Plenty of documentation, and plenty of code that the AI can read itself.

E.g. if a library has a bug that has a common workaround, it can learn that from open source code using the library that uses the workaround.

hintymad 13 days ago

This and the other thread that talks about RL and synthetic data seem to suggest that AI can figure out all the technical issues without humans looking into them. I'm not sure if that's true at all.
nitwit005 13 days ago
That assumes there is documentation or examples. A big reason Stack Overflow took off was people struggling with things like the Android API documentation.
Some of those discussions made people go figure out how to do it, and then post it as an answer. The knowledge didn't exist anywhere until they did.
- ToValueFunfetti 13 days ago
  
  It might make sense for AI companies to throw agents at new technologies to trial-and-error their way to internal documentation which they then provide to their models. On the other hand, the people making tomorrow's APIs have LLMs too and that makes documentation ~free. Hallucinations could still bring you back to the first hand, though.
- crazygringo 13 days ago
  
  When I talk about code it can learn from, I'm talking about GitHub etc.
  Even if stuff isn't in the official documentation, eventually there are projects that use it.
  And if the library in question is open-source, then the LLM's can just ingest and read that directly.
soraminazuki 12 days ago

Sounds nothing like the world we live in. When has there ever been a time when there was an abundance of software documentation? How can plenty of documentation or code be made if AI scraper bots hammer the servers that host them, steal content, and drive people away from the actual authors?
kajman 13 days ago
The only way I could see this being surfaced the same is if the code essentially had a SO answer written into the doc comment.
- mcswell 13 days ago
  
  What documentation?
insane_dreamer 12 days ago

lots of undocumented gotchas that only surfaced because someone used it and posted about it

vanuatu 13 days ago

I don't think it's much of an issue

- RL envs + synthetic data + human-annotated data

- Usage data from Codex/Claude Code/Cursor

Most of the model abilities in coding come from post-training, not pretraining

torben-friis 13 days ago
A better question is what's left for those who don't have access to that. We went from publicly available to vacuumed from private users
- vanuatu 13 days ago
  
  Open source models
  unfortunately all the incentives right now are for repos to be private
  

jmyeet 13 days ago

Yeah, this is something I've been thinking about too. LLMs have basically profited from "stealing" (arguably) user-generated content from a time when there were no LLMs. In the LLM era there won't be a new Stack Overflow to train LLMs on going forward.

We're getting closer to Dead Internet Theory too, where a lot of accounts, particularly on Twitter, are just LLMs. I imagine it's a huge problem on Reddit too. Just people farming karma or otherwise involved in influence campaigns or simply grifting for ad revenue.

So we're going to get to a point where the corpus we train LLMs on will itself just be filled with LLM slop. Self-reinforcing slop. Is that the future?

mattmanser 13 days ago

It's happening here too, I saw dang hint that they're not even responding to a lot of questions about it anymore because of the volume of the problem.
If you browse with showdead on you'll be seeing a lot more of what look like reasonable comments greyed out.
aucisson_masque 13 days ago

It's been studied: LLMs that feed on their own data regress, and they become very bad after a few generations.
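
A toy analogy for that feedback loop, not an LLM experiment: each "generation" fits a Gaussian to a small sample drawn from the previous generation's fit, then the next generation samples from that fit. The function name, sample size, and seed are assumptions chosen for illustration. With small samples the estimated spread tends to drift toward zero over many generations, which loosely mirrors a model losing the diversity of its original data.

```python
import random
import statistics


def simulate_generations(generations: int, sample_size: int, seed: int = 0) -> list[float]:
    """Refit a Gaussian to its own samples repeatedly; return the std after each generation."""
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0: the "real" data distribution
    stds = [std]
    for _ in range(generations):
        # Draw training data from the current model only (no fresh real data).
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # "Train" the next model by re-estimating mean and standard deviation.
        mean = statistics.fmean(samples)
        std = statistics.stdev(samples)
        stds.append(std)
    return stds


if __name__ == "__main__":
    history = simulate_generations(generations=500, sample_size=10)
    print(f"start std: {history[0]:.3f}, final std: {history[-1]:.3e}")
```

Mixing some fresh samples from the original distribution into each generation is the obvious variation to try; this sketch deliberately leaves that out to show the closed loop.
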

hgoel 12 days ago

People still like to talk about the interesting problems they solved and how. Issue isn't SO having choked itself out, issue is that even the major search engines are pivoting towards AI answers instead of surfacing small blogs.

stackghost 13 days ago

I'm sure the AI companies will continue to pirate textbooks and papers, like always.

add-sub-mul-div 13 days ago

Careful, you can't point out that the AI emperor has no clothes or you'll get called a Luddite.

piker 13 days ago

Yes. Very.

akkad33 13 days ago

Pointing them to docs? Which is what Stack Overflow answers did anyway?

mlinhares 13 days ago

I wrote multiple answers to questions that weren't just "point to docs". And even when it is pointing to docs you are providing the reasoning as to why it works one way or another.
izacus 13 days ago
What docs? Who writes docs now that AIs answer everything?
- Fabricio20 13 days ago
  
  Ever since the AI stuff started rolling around in coding, I've seen MORE documentation. There's a big incentive to properly document your API endpoints so LLMs can figure them out from specs, and even when they're not documented, the LLMs can also just read the code and figure it out directly (for libraries and similar). And at least in my experience they tend to document or write it down for future sessions too!
- ethagnawl 13 days ago
  
  I know you're being facetious but there may well be docs. It's just that the same AI most likely wrote _them_, too.
  Did anyone (person or competing LLM) bother to verify that they're correct, though? Who knows! Let the next generation of models worry about that.
  
- Morromist 13 days ago
  
  I've heard this is most of some CS jobs now. Just writing documentation for AI.
- vanuatu 13 days ago
  
  On the contrary, there's more of an incentive for APIs to have docs for agent discovery. The docs/interfaces themselves can be auto-generated (Stainless / Mintlify).

nsxwolf 13 days ago

How do you convince people to not want an instant answer? Even if SO didn’t result in so many “What have you tried?” responses and immediate closures, most people would still prefer instant feedback.