Comment by tape_measure
10 months ago
WORDS IN CAPS are different tokens than lowercase, so maybe the lowercase tokens tie into more trained parts of the manifold.
That's a super interesting hypothesis. From an information theory perspective, rarer tokens carry more information (higher surprisal) than common ones. Maybe that results in the all-caps tokens being weighted more heavily by the attention mechanism.
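To make the "rarer tokens are more informative" part concrete, here is a rough sketch of surprisal, -log2(p), computed from token frequencies. The counts are made up for illustration, and `surprisal_bits` is just a hypothetical helper name. It only shows the information-theory side; whether attention actually weights such tokens higher is still a guess.

```python
import math
from collections import Counter

def surprisal_bits(counts):
    """Return -log2(p) for each token, with p estimated from raw counts."""
    total = sum(counts.values())
    return {tok: -math.log2(n / total) for tok, n in counts.items()}

# Made-up toy counts: the lowercase form is seen three times as often.
counts = Counter({"the": 6, "THE": 2})
bits = surprisal_bits(counts)

for tok, b in bits.items():
    print(f"{tok!r}: {b:.3f} bits")
```

With these toy counts the rarer all-caps token comes out at more bits than the lowercase one, which is all the information-theory claim says on its own.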