Comment by ottah

3 days ago

It's a neat party trick, but explainability isn't a solution to any AI safety issue I care about. It's a distraction from the real problems, which are everything else around the model: the inflexible bureaucratic systems that make it hard to exercise rights and that deflect accountability.

op here. Important point, but I disagree. We see explainability/interpretability as a CORE need for AI safety. We believe you can't align/audit/debug/fix a system that you don't understand.

Just to give you some examples of what we can do:

1) We can find the training data that is causing a model to output toxic/unwanted text and correct it.

2) We know what high-level concepts the model is relying on for any group of tokens it generates, so reducing that generation is as simple as toggling the effect of that concept on the output.

Most AI safety techniques fall under fine-tuning. Our model allows you to do this without fine-tuning: you can toggle the presence of a concept directly.
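Toggling a concept without fine-tuning is typically done by projecting the model's activations onto a learned concept direction and subtracting (or scaling) that component. Here's a minimal NumPy sketch of the idea, assuming you already have a concept direction; the function name and random data are illustrative, not our actual API:

```python
import numpy as np

def ablate_concept(activations, concept_dir, strength=1.0):
    """Remove (strength=1.0) or dampen the component of each
    activation vector along a given concept direction."""
    d = concept_dir / np.linalg.norm(concept_dir)  # unit direction
    proj = activations @ d                         # per-token projection
    return activations - strength * np.outer(proj, d)

# toy stand-ins: 4 token activations in an 8-dim hidden space
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 8))
concept = rng.normal(size=8)

steered = ablate_concept(acts, concept)
# with strength=1.0 the result is orthogonal to the concept direction
```

With `strength` between 0 and 1 you dampen the concept rather than remove it; negative values amplify it, which is the "toggle" in both directions.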

For example, wouldn't you like to know why a model is being sycophantic? Or sandbagging? Is it a particular kind of training data that is causing this, or is it some high-level part of the model's representations? For any of this, our model can tell you exactly why the model generated that output. Over the coming weeks, we'll show exactly how you can do this!
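To make "what the model is relying on" concrete: one simple way to quantify reliance on a concept is the share of the activations' total energy that lies along that concept's direction. A rough NumPy sketch, with synthetic placeholders for the hidden states and the concept vector:

```python
import numpy as np

def concept_reliance(activations, concept_dir):
    """Share of the activations' total squared norm that lies
    along a single concept direction (a value in [0, 1])."""
    d = concept_dir / np.linalg.norm(concept_dir)  # unit direction
    proj = activations @ d                         # per-token projection
    return float((proj ** 2).sum() / (activations ** 2).sum())

# synthetic stand-ins for per-token hidden states and a concept vector
rng = np.random.default_rng(1)
acts = rng.normal(size=(16, 32))
concept = acts.mean(axis=0)  # placeholder "concept" direction

share = concept_reliance(acts, concept)
```

A statement like "84 percent of the representation relies on this concept" is this kind of measurement (our actual decomposition is richer than a single direction, but the intuition is the same).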

  • This is fantastic to read. LLMs feel like black boxes, and for the large ones especially I have a sense they genuinely form concepts, yet the internals have been opaque. I remember reading that LLMs cannot explain their own behaviour when asked.

    I feel this would give insight into all of that, including the degree of true conceptualisation. I'm curious whether this can also demonstrate what else the model is aware of when answering.

    • Our decomposition allows us to answer questions like: for 84 percent of the model's representation, we know it is relying on this concept to give an answer.

      We can also trace its behavior back to the training data that led to it, which shows us where some of these concepts come from.

  • > wouldn't you like to know why a model is being sycophantic? Or Sandbagging?

    Actually, emphatically no. The only thing I care about is that I have recourse. The reason shouldn't matter; in fact, explainability can be an impediment to accountability. It's just another plausible barrier to a remedy that a bureaucracy can use to deny changing a decision.

I work on ML problems in the healthcare/life sciences area, and anything that enhances explainability is helpful. To a regulator, it's not good enough to point at a black box and say you don't know why it gave the wrong answer this time. Regulators have an odd acceptance of human error, but very little tolerance for technological uncertainty.