
Comment by WhitneyLand

2 years ago

In case it’s confusing for anyone to see “weight” as a verb and a noun so close together, there are indeed two different things going on:

1. There are the model weights, aka the parameters. These are what get adjusted during training to do the learning part. They always exist.

2. There are attention weights. These are part of the transformer architecture and they “weight” the context of the input. They are ephemeral: computed, used, and discarded, so they don’t always exist.

They’re both typically 32-bit floats, in case you’re curious, but they’re still different concepts.
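A minimal sketch of the distinction (PyTorch, with made-up shapes): the Linear layer’s weight is a parameter that persists across calls, while the attention weights are recomputed from each input and then thrown away.

    import torch
    import torch.nn.functional as F

    # 1. Model weights (parameters): created once, updated by training, persist.
    linear = torch.nn.Linear(64, 64)
    print(linear.weight.dtype)  # torch.float32 by default

    # 2. Attention weights: recomputed for every input, then discarded.
    q = torch.randn(1, 10, 64)  # (batch, sequence, dim) -- arbitrary example shapes
    k = torch.randn(1, 10, 64)
    scores = q @ k.transpose(-2, -1) / 64 ** 0.5
    attn_weights = F.softmax(scores, dim=-1)  # ephemeral: exists only during the forward pass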

I always thought the verb was "weigh" not "weight", but apparently the latter is also in the dictionary as a verb.

Oh well... it seems like it's more confusing than I thought https://www.merriam-webster.com/wordplay/when-to-use-weigh-a...

  • “To weight” is to assign a weight (e.g., to weight variables differently in a model), whereas “to weigh” is to observe and/or record a weight (as a scale does).

    • A few other cases of this sort of thing:

      affect (n). an emotion or feeling. "She has a positive affect."

      effect (n). a result or change due to some event. "The effect of her affect is to make people like her."

      affect (v). to change or modify [X], have an effect upon [X]. "The weather affects my affect."

      effect (v). to bring about [X] or cause [X] to happen. "Our protests are designed to effect change."

      Also:

      cost (v). to require a payment or loss of [X]. "That apple will cost $5." Past tense cost: "That apple cost $5."

      cost (v). to estimate the price of [X]. "The accounting department will cost the construction project at $5 million." Past tense costed. "The accounting department costed the construction project at $5 million."

I think in most deployments, they're not fp32 by the time you're doing inference on them; they've been quantized, possibly down to 4 bits or even fewer.

On the training side I wouldn't be surprised if they were bf16 rather than fp32.
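To make the quantization point concrete, here's a rough sketch of naive symmetric 4-bit quantization in NumPy. Real deployment schemes (GPTQ, AWQ, etc.) are group-wise and more sophisticated, so treat this as an illustration only.

    import numpy as np

    w = np.random.randn(8).astype(np.float32)                 # pretend these are fp32 weights
    scale = np.abs(w).max() / 7                               # signed 4-bit range is -8..7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)   # conceptually stored in 4 bits
    w_hat = q.astype(np.float32) * scale                      # dequantized on the fly at inference
    print(np.abs(w - w_hat).max())                            # the quantization error you pay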