Comment by diggan

2 days ago

> And this is exactly why open source models can't surpass open weight models.

It's a fair point, but how strong a point it is remains to be seen. Some architectures are better than others even with the same training data, so it's not impossible that we'll one day see an innovative open architecture beat the current proprietary ones. Any lead would probably be short-lived, though, as the proprietary models would presumably catch up in their next release.

How can open source models that respect robots.txt possibly perform as well if they are missing information that the other models have access to?

  • Maybe the missing data makes it 3% worse but the architecture is 5% better. Or your respect for robots.txt gets you more funding and you gain a 4% advantage by training longer.

    Don't focus too much on a single variable, especially when all the variables have diminishing returns.

  • How can we possibly find out without trying?

    • It is logically impossible for an LLM to know, for example, that fooExecute() takes two int arguments if the documentation is blocked by robots.txt and there are no examples of fooExecute() usage in the wild, don't you agree? (A sketch of that scenario follows below.)

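A minimal Python sketch of that scenario, using the standard library's urllib.robotparser; the site, the crawler name, and fooExecute() are all hypothetical:

```python
# Sketch: a training crawler that honors robots.txt never ingests
# disallowed pages, so facts found only there can't reach the model.
# example.com, "TrainingCrawler", and fooExecute() are hypothetical.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

docs_url = "https://example.com/docs/fooExecute"
if rp.can_fetch("TrainingCrawler", docs_url):
    print("allowed: page can enter the training corpus")
else:
    # Disallowed: the signature fooExecute(int, int) never enters
    # the corpus, so the model can only guess at it.
    print("blocked: signature is invisible to the model")
```

If robots.txt disallows the docs path, can_fetch() returns False and the page simply never enters the corpus; no architecture improvement can recover a fact the model was never shown.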