Comment by sigmoid10
2 days ago
These probability shifts would only account for the final output layer (which may also have some shift), but I expect the largest shift to be in the activations in the intermediate latent space. There are a bunch of papers out there that try to get an offset vector using PCA or similar to tune certain model behaviours like vulgarity or friendliness. You don't even need much data for this as long as your examples capture the essence of the difference well. I'm pretty certain you could do this with "historicalness" too, but projecting it into the future by turning the "contemporariness" knob way up probably won't yield an accurate result. There are too many outside influences on language that won't be captured in historical trends.
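For the curious, a minimal sketch of the offset-vector idea, assuming a HuggingFace-style causal LM (a difference-of-means vector here; the papers often use PCA over paired activation differences instead). The model name, layer index, scaling factor and example prompts are all placeholders, not anything from the article:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder model
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    LAYER = 6  # which transformer block to steer (hypothetical choice)

    def mean_activation(prompts):
        # Average hidden state of the last token at the chosen block's output.
        vecs = []
        for p in prompts:
            ids = tok(p, return_tensors="pt").input_ids
            with torch.no_grad():
                out = model(ids, output_hidden_states=True)
            vecs.append(out.hidden_states[LAYER + 1][0, -1])
        return torch.stack(vecs).mean(dim=0)

    # Small contrastive sets that "capture the essence of the difference".
    historical = ["Hark, the carriage awaits upon the cobbled lane."]
    modern = ["The rideshare is out front, let's go."]

    steer = mean_activation(historical) - mean_activation(modern)
    ALPHA = 4.0  # steering strength; the sign and scale are the "knob"

    def add_offset(module, inputs, output):
        # Forward hook: shift the residual stream at this block during generation.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + ALPHA * steer
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    hook = model.transformer.h[LAYER].register_forward_hook(add_offset)
    # ... generate as usual, then hook.remove() to restore normal behaviour.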
On whether this accounts for only the final output layer -- once the first token is generated (i.e. selected according to the modified sampling procedure), and assuming a different token is selected than under standard sampling, all layers of the model are affected during generation of subsequent tokens, since that token becomes part of the context they condition on.
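To make that concrete, a rough sketch (the greedy loop and the per-token logit bias are illustrative assumptions, not the article's actual procedure):

    import torch

    def generate_with_bias(model, tok, prompt, bias, steps=20):
        # 'bias' is a hypothetical per-vocabulary-token logit offset
        # (e.g. boosting archaic word pieces). It only touches the output
        # distribution of the current step...
        ids = tok(prompt, return_tensors="pt").input_ids
        for _ in range(steps):
            with torch.no_grad():
                logits = model(ids).logits[0, -1]
            next_id = torch.argmax(logits + bias)
            # ...but the chosen token is appended to the context, so every
            # layer's activations at later positions differ from the
            # unbiased run from this point on.
            ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
        return tok.decode(ids[0])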
Done that way, it wouldn't be much better than instructing the model to elicit a particular behaviour via the system prompt. Limiting tokens to a subset of outputs is already common (and mathematically equivalent to a large shift in the output vector), e.g. for structured outputs, but it doesn't change the actual world representation inside the model. Doing it this way would also make the result very sensitive to your input prompt.
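For comparison, the structured-output style of constraint amounts to something like this (a toy masking function, not any particular library's implementation):

    import torch

    def mask_to_subset(logits, allowed_ids):
        # Equivalent to adding an arbitrarily large negative shift to every
        # token outside the allowed set. It acts purely on the final logits;
        # no hidden state inside the model is touched.
        masked = torch.full_like(logits, float("-inf"))
        masked[allowed_ids] = logits[allowed_ids]
        return masked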