Comment by simonw

2 days ago

It turns out the Cursor one is stitching together a ton of open source components already.

That said, I don't really find the critique that models have browser source code in their training data particularly interesting.

If they spat out a full, working implementation in response to a single prompt then sure, I'd be suspicious they were just regurgitating their training data.

But if you watch the transcripts for these kinds of projects you'll see them make thousands of independent changes, reacting to test failures and iterating towards an implementation that matches the overall goals of the project.

The fact that Firefox and Chrome and WebKit are likely buried in the training data somewhere might help them a bit, but it still looks to me more like an independent implementation that's influenced by those and many other sources.

> The fact that Firefox and Chrome and WebKit are likely buried in the training data somewhere might help them a bit, but it still looks to me more like an independent implementation that's influenced by those and many other sources.

They generate a statistically appropriate token based on a very small context window. And they are slightly nerfed not to reproduce everything verbatim because that would bring all sorts of lawsuits.

Of course they are not reproducing WebKit or Blink or Firefox verbatim. However, it's not an "independent implementation". That's why it's "stringing together a bunch of open-source components": https://news.ycombinator.com/item?id=46649586

Edit: also, this "independent implementation" cannot be compiled by their own CI and doesn't work, apparently.