
Comment by jowea

2 days ago

How can open-source models that respect robots.txt possibly perform equally well if they are missing information that the other models have access to?

Maybe the missing data makes it 3% worse but the architecture is 5% better. Or your respect for robots.txt gets you more funding and you gain a 4% advantage by training longer.

Don't focus too much on a single variable, especially when all the variables have diminishing returns.

How can we possibly find out without trying?

  • It is logically impossible for an LLM to know, for example, that fooExecute() takes two int arguments if the documentation is blocked by robots.txt and there are no examples of fooExecute() usage in the wild, don't you agree?

    • I agree, but I also think it's less important. I don't want a big fat LLM that memorized every API out there and whose weights have to be updated as soon as an API changes. I like the current approach of Codex (and similar) where they can look up the APIs they need as they're doing the work instead, so the same weights keep working no matter how much the APIs change.
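
      A toy sketch of what I mean, in Python. To be clear, this is not how Codex actually works internally; fetch_docs and fooExecute are made-up names and the "model" here is a stub. The point is only that the signature lives in text fetched at call time, not in the weights:

        # Hypothetical: resolve an API signature at runtime instead of
        # relying on anything memorized in the model's weights.
        def fetch_docs(symbol: str) -> str:
            # Stand-in for a real lookup tool (web search, a local docs
            # index, `man`); hard-coded so the sketch runs on its own.
            docs = {"fooExecute": "fooExecute(a: int, b: int) -> int"}
            return docs.get(symbol, "no documentation found")

        def write_call(symbol: str) -> str:
            # The "model" reads whatever the docs say today and emits a
            # call that matches; change the docs and the output follows.
            signature = fetch_docs(symbol)
            if "(a: int, b: int)" in signature:
                return f"{symbol}(1, 2)"
            return f"# unknown signature for {symbol}: {signature}"

        print(write_call("fooExecute"))  # -> fooExecute(1, 2)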

    • Sure, the model would not “know” about your example, but that’s not the point; the penultimate[0] goal is for the model to figure out the method signature on its own just like a human dev might leverage her own knowledge and experience to infer that method signature. Intelligence isn’t just rote memorization.

      [0] the ultimate, of course, being profit.
