Comment by upghost

21 hours ago

> Pre-training allows organizations to build domain-aware models by learning from large internal datasets.

> Post-training methods allow teams to refine model behavior for specific tasks and environments.

How do you suppose this works? They say "pretraining" but I'm certain that the amount of clean data available in proper dataset format is not nearly enough to make a "foundation model". Do you suppose what they are calling "pretraining" is actually SFT and then "post-training" is ... more SFT?

There's no way they mean "start from scratch". Maybe they do something like generate a heckin bunch of synthetic data seeded from company data using one of their SOTA models -- which is basically equivalent to low-resolution distillation, I would imagine. Hmm.

Pre-training means exposing an already-trained model to more raw text, like PDF extracts etc. (aka continued pre-training). You wouldn't be starting from scratch, but it's still pre-training because the objective is just next-token prediction on the text you expose it to.

Post-training means everything else: SFT, DPO, RL, etc. Anything that involves things like prompt/response pairs, reward models, or benefits from human feedback of any kind.
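To make the distinction concrete, here is a toy sketch of how the training targets differ, with no real model or tokenizer involved. The token ids are made up, the helper names are hypothetical, and masking the prompt during SFT is a common convention rather than a universal one. The -100 value is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss skips by default


def cpt_labels(token_ids):
    """Continued pre-training: every token in the raw text is a target."""
    return list(token_ids)


def sft_labels(prompt_ids, response_ids):
    """SFT (one common setup): prompt tokens are masked, response tokens are targets."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)


def next_token_pairs(input_ids, labels):
    """Shift by one: the prefix up to position i predicts the token at i + 1.

    Positions whose target is masked contribute nothing to the loss.
    """
    return [
        (input_ids[: i + 1], labels[i + 1])
        for i in range(len(input_ids) - 1)
        if labels[i + 1] != IGNORE_INDEX
    ]


doc = [11, 12, 13, 14]                 # toy ids for a raw PDF extract
prompt, response = [21, 22], [31, 32]  # toy ids for a prompt/response pair

cpt_pairs = next_token_pairs(doc, cpt_labels(doc))
sft_pairs = next_token_pairs(prompt + response, sft_labels(prompt, response))

print(len(cpt_pairs), "targets from raw text")
print(len(sft_pairs), "targets from the prompt/response pair")
```

Same next-token loss in both cases. What changes is the data format and which positions count toward the loss.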

  • Yeah, this checks out. I wonder what they are doing to prevent semantic collapse. Also, I wonder if the base model would already be instruct and RLHF tuned or only pre-trained. Trying to do additional training without semantic collapse in a way that is meaningful would be interesting to understand. Presumably they are using adapters but I've never had much luck in stacking adapters.

    i.e.:

    1. Do I start with an RLHF tuned model, "pretrain" on top of that (with adapter or by freezing weights?), then SFT on top of that (stack another adapter, or add layer(s) and freeze weights?) (and where did I get the dataset? synthetic extraction from corpus?), then RL (adapter, add layer(s) and freeze?)

    2. or do I start at SFT-tuned model, ...

    3. or do I start at raw pre-trained model, ...

    Would love to know what the matrix used was.

  • Er, then what is the "already trained" model? I thought pre-training was the gradient descent through the internet part of building foundational models.

I would guess:

Pre-training: refining the weights in an existing model using more training data.

Post-training: Adding some training data to the prompt (RAG, basically).

I can imagine that, as usual, you start with a few examples and then instruct an LLM to synthesize more examples out of that, and train using that. Sounds horrible, but actually works fairly well in practice.
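A rough sketch of that seed-then-synthesize loop, assuming a Q/A format. Everything here is hypothetical: `generate` stands in for whatever LLM call you actually have, the prompt wording is made up, and `fake_llm` is a stub so the example runs offline. A real pipeline would want much stronger filtering than exact-match dedup.

```python
import random


def build_prompt(seeds, k=3, rng=random):
    """Pick up to k seed examples and ask for one more pair in the same style."""
    shots = rng.sample(seeds, min(k, len(seeds)))
    body = "\n\n".join(f"Q: {s['q']}\nA: {s['a']}" for s in shots)
    return f"{body}\n\nWrite one new Q/A pair in the same style.\nQ:"


def parse_pair(text):
    """Split 'question\\nA: answer' into a dict; return None if malformed."""
    q, sep, a = text.partition("\nA:")
    if not sep or not q.strip() or not a.strip():
        return None
    return {"q": q.strip(), "a": a.strip()}


def synthesize(seeds, generate, n, rng=random):
    """Call generate() n times, keeping well-formed pairs with unseen questions."""
    seen = {s["q"] for s in seeds}
    out = []
    for _ in range(n):
        pair = parse_pair(generate(build_prompt(seeds, rng=rng)))
        if pair and pair["q"] not in seen:
            seen.add(pair["q"])
            out.append(pair)
    return out


def fake_llm(prompt):
    """Stub standing in for a real model call; always returns the same pair."""
    return " What is the refund window?\nA: 30 days."


seeds = [{"q": "How do I reset my password?", "a": "Use the account page."}]
synthetic = synthesize(seeds, fake_llm, n=3)
print(synthetic)
```

With the stub, three calls yield one kept example because the repeats are dropped as duplicates, which is roughly the failure mode you have to watch for with real models too.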

Probably just means SFT on a base model, vs. behavioural DPO and/or SFT on an instruction-tuned model.