Comment by byschii
3 months ago
Isn't this dangerous? Isn't the efficiency gained at the expense of safety and interpretability?
https://arxiv.org/abs/2412.14093 (Alignment faking in large language models)
https://joecarlsmith.com/2024/12/18/takes-on-alignment-fakin...
PS: I'm definitely not an expert.
> Isn't this dangerous? Isn't the efficiency gained at the expense of safety and interpretability?
The final text is only a small part of the model's thinking. It's produced from embeddings that probably carry much more information. Each next token depends not only on the previous tokens but on all the intermediate values for all tokens. We don't know those values, yet they are what actually matter and represent the inner 'thinking'. So the LLM is still a black box. The result is usually "A because of B": a sort of explanation for A, but where B came from we can only guess.
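To make the point concrete, here's a minimal sketch (my own illustration, not anything from the paper): the sampled token is a lossy projection of a much richer hidden state. The model choice (gpt2) and the prompt are placeholders.

```python
# Sketch: the emitted token is one scalar choice projected out of a
# high-dimensional hidden state plus a full distribution over the vocab.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

last_hidden = out.hidden_states[-1][0, -1]   # final hidden state, shape (hidden_size,)
logits = out.logits[0, -1]                   # next-token scores, shape (vocab_size,)

# Greedy decoding keeps only one token; the distribution (and the hidden
# state it came from) holds far more information than that single choice.
top = torch.topk(logits, k=5)
for score, idx in zip(top.values, top.indices):
    print(f"{tok.decode(idx.item()):>12}  logit={score.item():.2f}")
print("hidden-state dimensions discarded by emitting one token:", last_hidden.numel())
```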
Yes, but what do you think matters more:
- Safety and (in the long run) human lives
- More papers?
Turns out we are the main paperclip optimizers...
or goat compressors: https://x.com/GZilgalvis/status/1883107575010619649
Depends on whether we can interpret the final hidden layer. It's plausible we evolve models to _have_ interpretable (final/reasoning) hidden layers; they just aren't constrained to the (same representation of) the input/output domains (i.e. tokens). See the sketch below for one existing probe in that direction.
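One existing probe along these lines is a "logit lens"-style readout: project intermediate hidden states through the model's unembedding matrix and see which tokens they're closest to. This is a hedged sketch, not the paper's method; gpt2 and the prompt are placeholders, and `transformer.ln_f` is GPT-2-specific.

```python
# Sketch: read intermediate hidden states back out as token distributions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tok("The Eiffel Tower is located in", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

unembed = model.get_output_embeddings().weight   # (vocab_size, hidden_size)
ln_f = model.transformer.ln_f                     # final layer norm (GPT-2-specific)

# Which token is each layer's last-position hidden state "pointing at"?
for layer, hs in enumerate(out.hidden_states):
    vec = ln_f(hs[0, -1])
    probs = torch.softmax(vec @ unembed.T, dim=-1)
    top_id = probs.argmax().item()
    print(f"layer {layer:2d}: {tok.decode(top_id)!r}  p={probs[top_id]:.3f}")
```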
We should always be able to clearly understand and interpret all of the thinking leading to an action taken by an AI. What would the point be if we didn't know what it was doing, only that it was doing "something"?
I don't see how it is any more dangerous than the already existing black-box nature of DNNs.
The hidden tokens can be decoded back into English if the user wants to see the thinking process.
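For illustration only: if the latent "thought" vectors live in the model's embedding space, one crude way to surface them in English is a nearest-token lookup. This is an assumption on my part, not the actual decoding interface of any particular model; `latent_vec`, gpt2, and the similarity choice are all placeholders.

```python
# Sketch: gloss a latent vector as its nearest vocabulary tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

embed = model.get_input_embeddings().weight       # (vocab_size, hidden_size)
latent_vec = torch.randn(embed.shape[1])          # stand-in for a latent "thought"

# Nearest tokens by cosine similarity give a rough English gloss of the vector.
sims = torch.nn.functional.cosine_similarity(latent_vec.unsqueeze(0), embed, dim=-1)
top = torch.topk(sims, k=5)
print([tok.decode(i.item()) for i in top.indices])
```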
Yeah, agreed. Language is constrained by the limits of human minds. Letting these things reason outside of words is, to my intuition, a tactic with more abundant paths toward superintelligence, and exactly the sort of path we'll have a harder time monitoring (we'll need fancy tools to introspect instead of just watching it think).
My current thinking is that I would support a ban on this style of research. It's really hard to set lines for regulation, but this feels like an easy and intuitive place to exercise caution.