Comment by byschii
3 months ago
Isn't this dangerous? Isn't the efficiency gained at the expense of safety and interpretability?
https://arxiv.org/abs/2412.14093 (Alignment faking in large language models)
https://joecarlsmith.com/2024/12/18/takes-on-alignment-fakin...
PS: I'm definitely not an expert.
> Isn't this dangerous? Isn't the efficiency gained at the expense of safety and interpretability?
The final text is only a small part of the model's thinking. It's produced from embeddings that probably carry much more information. Each next token depends not only on the previous tokens but on all the intermediate values for all tokens. We don't know those values, yet they are what actually matter and represent the inner 'thinking'. So the LLM is still a black box. The result is usually "A because of B": a sort of explanation for A, but where B came from we can only guess.
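To make the point concrete, here's a minimal sketch (my own illustration, not anything from the paper): the sampled token is a lossy projection of a much richer hidden state. The model choice (gpt2) and the prompt are placeholders.

```python
# Sketch: the emitted token is one scalar choice projected out of a
# high-dimensional hidden state plus a full distribution over the vocab.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

last_hidden = out.hidden_states[-1][0, -1]   # final hidden state, shape (hidden_size,)
logits = out.logits[0, -1]                   # next-token scores, shape (vocab_size,)

# Greedy decoding keeps only one token; the distribution (and the hidden
# state it came from) holds far more information than that single choice.
top = torch.topk(logits, k=5)
for score, idx in zip(top.values, top.indices):
    print(f"{tok.decode(idx.item()):>12}  logit={score.item():.2f}")
print("hidden-state dimensions discarded by emitting one token:", last_hidden.numel())
```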
Yes, but what do you think matters more:
- Safety and (in the long run) human lives
- More papers?
Turns out we are the main paperclip optimizers...
or goat compressors: https://x.com/GZilgalvis/status/1883107575010619649
Depends on whether we can interpret the final hidden layer. It's plausible we evolve models to _have_ interpretable (final/reasoning) hidden layers; they just aren't constrained to the (same representation of) the input/output domains (i.e. tokens). See the sketch below for one existing probe in that direction.
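One existing probe along these lines is a "logit lens"-style readout: project intermediate hidden states through the model's unembedding matrix and see which tokens they're closest to. This is a hedged sketch, not the paper's method; gpt2 and the prompt are placeholders, and `transformer.ln_f` is GPT-2-specific.

```python
# Sketch: read intermediate hidden states back out as token distributions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tok("The Eiffel Tower is located in", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

unembed = model.get_output_embeddings().weight   # (vocab_size, hidden_size)
ln_f = model.transformer.ln_f                     # final layer norm (GPT-2-specific)

# Which token is each layer's last-position hidden state "pointing at"?
for layer, hs in enumerate(out.hidden_states):
    vec = ln_f(hs[0, -1])
    probs = torch.softmax(vec @ unembed.T, dim=-1)
    top_id = probs.argmax().item()
    print(f"layer {layer:2d}: {tok.decode(top_id)!r}  p={probs[top_id]:.3f}")
```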
We should always be able to clearly understand and interpret all of the thinking leading to an action taken by an AI. What would the point be if we didn't know what it was doing, only that it was doing "something"?
I don't see how it is any more dangerous than the already existing black-box nature of DNNs.
The hidden tokens can be decoded back into English if the user wants to see the thinking process.
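For illustration only: if the latent "thought" vectors live in the model's embedding space, one crude way to surface them in English is a nearest-token lookup. This is an assumption on my part, not the actual decoding interface of any particular model; `latent_vec`, gpt2, and the similarity choice are all placeholders.

```python
# Sketch: gloss a latent vector as its nearest vocabulary tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

embed = model.get_input_embeddings().weight       # (vocab_size, hidden_size)
latent_vec = torch.randn(embed.shape[1])          # stand-in for a latent "thought"

# Nearest tokens by cosine similarity give a rough English gloss of the vector.
sims = torch.nn.functional.cosine_similarity(latent_vec.unsqueeze(0), embed, dim=-1)
top = torch.topk(sims, k=5)
print([tok.decode(i.item()) for i in top.indices])
```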
Yeah, agreed. Language is constrained by the limits of human minds. Letting these things reason outside of words is, to my intuition, a tactic with more abundant paths toward superintelligence, and exactly the sort of path we'll have a harder time monitoring (we'll need fancy tools to introspect instead of just watching it think).
My current thinking is that I would support a ban on this style of research. It's really hard to set lines for regulation, but this feels like an easy and intuitive place to exercise caution.