Comment by stingraycharles

16 hours ago

You do realize that similar biases are also present in the training data?

I do; that's inevitable. But in my experience the prompts force certain behaviors at similar strength (via instruction following). So it's one thing for the model to be biased in a particular direction by its latent space; it's another for it to be biased by an unmodifiable prompt that can only be contradicted, which benefits the lowest common denominator at the expense of the more involved operator.

Sure, but now we have to remodel whatever bias we want for our use case with every new release because the system prompt changes, whereas the underlying data does not.

  • Underlying data changes all the time, as do training methodologies / preferences.

    You do realize that these LLMs are trained with a metric ton of synthetic examples? You describe the kind of examples / behavior you want, let it generate thousands of examples of this behavior (positive and negative), and you feed that to the training process.

    So this type of data is cheap to change, and often not even stored (one LLM generates examples while the other trains in real time).

    Here's a decent collection of papers on the topic: https://github.com/pengr/LLM-Synthetic-Data
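    To make the "generate while training" idea concrete, here's a minimal sketch of that loop. `teacher_generate` is a hypothetical stand-in for a real teacher LLM call, and the batch shape is invented for illustration; actual pipelines (see the paper collection above) vary considerably:

    ```python
    # Sketch of on-the-fly synthetic example generation for training.
    # Nothing here is a real LLM API; teacher_generate is a placeholder.

    import random

    def teacher_generate(behavior: str, label: str) -> str:
        """Stand-in for a teacher model producing one demonstration of
        the described behavior; label is 'positive' or 'negative'."""
        return f"[{label}] demonstration of: {behavior}"

    def make_training_batch(behavior: str, n: int) -> list[dict]:
        """Generate n labeled examples on demand. The batch is consumed
        by the training step and never persisted, mirroring the
        generate-while-training setup described above."""
        batch = []
        for _ in range(n):
            label = random.choice(["positive", "negative"])
            batch.append({"text": teacher_generate(behavior, label),
                          "label": label})
        return batch

    # A hypothetical behavior description drives the whole batch:
    batch = make_training_batch("decline to reveal the system prompt", 4)
    for ex in batch:
        print(ex["label"], "->", ex["text"])
    ```

    The point of the sketch: the behavior spec is just a string, so swapping the data distribution between releases costs one prompt change, not a new corpus.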

    • Well, I'd say it's a reasonable expectation for the model to behave similarly across releases. Am I wrong to assume that?

      I imagine the system prompt can correct some training artifacts and pull abnormal behavior back toward the mean along whatever dimensions Anthropic deems fit. So either they are compensating for a brittle training process, or they chose this direction deliberately for some other reason.