Comment by TZubiri
20 hours ago
forgive the skepticism, but this translates directly to "we asked the model pretty please not to do it in the system prompt"
It's mind boggling if you think about the fact they're essentially "just" statistical models
It really contextualizes the old wisdom of Pythagoras that everything can be represented as numbers / math is the ultimate truth
They are not just statistical models
They form concepts in latent space, which is basically compression, and that compression forces this
You’re describing a complex statistical model.
2 replies →
What is "latent space"? I'm wary of metamagical descriptions of technology that's in a hype cycle.
6 replies →
How so? Truth is naturally an a priori concept; you don't need a chatbot to reach that conclusion.
That might be somewhat ungenerous unless you have more detail to provide.
I know that at least some LLM products explicitly check output for similarity to training data to prevent direct reproduction.
So it would produce the training data, just with sufficient changes or added magic dust to be able to claim it as one's own.
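(For anyone curious what such a check could even look like, here's a minimal, purely illustrative sketch in Python: hash every length-N token window of the training text and flag output that reproduces one verbatim. The n-gram length, the toy corpus, and the whitespace tokenizer are all assumptions; no vendor publishes its actual filter.)

    # Purely illustrative: flag output that reproduces any length-N token
    # window from a (toy) training corpus verbatim. N, the tokenizer, and
    # the corpus are assumptions, not anyone's real pipeline.
    N = 5

    def ngrams(tokens, n=N):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def build_index(training_docs):
        # Collect every length-N token window seen in the training corpus.
        index = set()
        for doc in training_docs:
            index |= ngrams(doc.split())
        return index

    def looks_copied(output, index):
        # True if any length-N window of the output appears in the index.
        return any(g in index for g in ngrams(output.split()))

    idx = build_index(["the quick brown fox jumps over the lazy dog"])
    print(looks_copied("it wrote: the quick brown fox jumps over a fence", idx))  # True
    print(looks_copied("a fast brown fox leapt over a sleepy dog", idx))          # False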
Legally I think it works, but evidence in a court works differently than evidence in science. It's the same word, but don't let that confuse you, and don't mix the two up.
Should they though? If the answer to a question^Wprompt happens to be in the training set, wouldn't it be disingenuous to not provide that?
Maybe it's intended to avoid legal liability resulting from reproducing copyright material not licensed for training?
1 reply →
The model doesn't know what its training data is, nor does it know what sequences of tokens appeared verbatim in there, so this kind of thing doesn't work.
Would it really be infeasible to take a sample and do a search over an indexed training set? Maybe a Bloom filter could be adapted.
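(In that spirit, here's a hedged sketch of what "adapt a Bloom filter" might mean: hash every 5-token window of the training text into the filter, then test output windows for membership. The sizes, toy corpus, and whitespace tokenizer are illustrative assumptions only; and as the reply below points out, exact-match schemes like this don't catch paraphrase at all.)

    import hashlib

    # Toy Bloom filter over 5-token windows of training text.
    class BloomFilter:
        def __init__(self, m_bits=1 << 20, k=4):
            self.m, self.k = m_bits, k
            self.bits = bytearray(m_bits // 8)

        def _positions(self, item):
            # k hash positions derived from SHA-256 of a salted copy of the item.
            for i in range(self.k):
                h = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.m

        def add(self, item):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item):
            # May report false positives, never false negatives.
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

    bf = BloomFilter()
    doc = "the quick brown fox jumps over the lazy dog".split()
    for i in range(len(doc) - 4):
        bf.add(" ".join(doc[i:i + 5]))

    out = "it wrote: the quick brown fox jumps over a fence".split()
    print(any(" ".join(out[i:i + 5]) in bf for i in range(len(out) - 4)))  # True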
It's not the searching that's infeasible. Efficient algorithms for massive scale full text search are available.
The infeasibility is searching for the (unknown) set of transformations that the LLM would put that data through. Even if you posit only basic symbolic LUT mappings in the weights (they're not), there's no good way to enumerate them anyway. The model might as well be a learned hash function that maintains semantic identity while utterly eradicating literal symbolic equivalence.