Comment by kadushka

4 days ago

Is there any evidence to support your hypothesis?

Good question! I did a small experiment: trained a small logistic regression from embedding vectors to the 1st/2nd/3rd character of a token: https://chatgpt.com/share/6871061a-7948-8007-ab53-5b0b697e90...

I got 0.863 (1st) / 0.559 (2nd) / 0.447 (3rd) accuracy on Qwen3 8B embeddings. Note the code is hacky and might be wrong in some ways, and in reality the transformer knows even more, since here I only use the embedding layer. Still, it shows there are very clear signals about a token's characters in its embedding vector.
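
Roughly, the probe looks like this (a simplified sketch, not the exact code from the link; the token filtering and hyperparameters here are my own choices):

    # Probe: predict a token's 1st character from its static embedding vector.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    name = "Qwen/Qwen3-8B"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float32)
    emb = model.get_input_embeddings().weight.detach()  # [vocab, hidden_dim]

    X, y = [], []
    for tok_id in range(len(tok)):
        s = tok.convert_ids_to_tokens(tok_id).lstrip("Ġ▁")  # drop space markers
        if s and s[0].isascii() and s[0].isalpha():
            X.append(emb[tok_id].numpy())
            y.append(s[0].lower())  # label = 1st character of the token

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # slow, but fine for a one-off probe
    print("1st-char accuracy:", clf.score(X_te, y_te))

The same thing with s[1] / s[2] as the label (over tokens that are long enough) gives the 2nd/3rd-character numbers.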

  • Thank you! I guess if there's enough spelling-related text in the dataset, a model is forced to learn some info about token composition in order to predict that text.

    I wonder if it would help to explicitly insert this info into the embedding vector, similar to how we encode word position info. For example, allocate the first 20 vector elements to represent the ASCII codes of the token's characters (in some normalized way).
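
    Something along these lines, as a hypothetical sketch (the 20-dim budget and the normalization are just placeholders):

      import torch

      NUM_CHAR_DIMS = 20

      def inject_char_codes(embedding: torch.Tensor, token_str: str) -> torch.Tensor:
          # Reserve the first NUM_CHAR_DIMS dims for normalized character codes;
          # unused slots stay at 0, the remaining dims stay learned as usual.
          out = embedding.clone()
          out[:NUM_CHAR_DIMS] = 0.0
          for i, ch in enumerate(token_str[:NUM_CHAR_DIMS]):
              out[i] = (min(ord(ch), 127) / 127.0) * 2 - 1  # ASCII code scaled to [-1, 1]
          return out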

    • Ok, bonus content #2.

      I took the Qwen3 1.7B model and did the same, but rather than using the embedding vector I used the hidden vector after the 1st/2nd/etc. layer (a rough sketch of the setup is at the end of this comment). Below are the accuracies for the 1st character:

      - embeddings: 0.855

      - 1st: 0.913

      - 2nd: 0.870

      - 3rd: 0.671

      - 16th: 0.676

      - 20th: 0.683

      And now mega bonus content: the same, but with the prefix "count letters in ":

      - 1st: 0.922

      - 2nd: 0.924

      - 3rd: 0.920

      - 16th: 0.877

      - 20th: 0.895

      And the same for the 2nd letter:

      - embeddings: 0.686

      - 1st: 0.679

      - 2nd: 0.682

      - 3rd: 0.674

      - 16th: 0.572
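
      For reference, the layer probe is roughly this (a simplified sketch; the exact prompt handling and pooling may differ from what I actually ran):

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        name = "Qwen/Qwen3-1.7B"
        tok = AutoTokenizer.from_pretrained(name)
        model = AutoModelForCausalLM.from_pretrained(name)
        model.eval()

        def token_vector(token_str, layer, prefix=""):
            # Hidden state at the last position after `layer`
            # (layer 0 = embeddings, layer 1 = after the 1st block, ...).
            ids = tok(prefix + token_str, return_tensors="pt")
            with torch.no_grad():
                out = model(**ids, output_hidden_states=True)
            return out.hidden_states[layer][0, -1]

        # Probe features, with and without the prefix:
        # token_vector("strawberry", layer=3)
        # token_vector("strawberry", layer=3, prefix="count letters in ")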

    • One way to do that is to use a one-hot encoding in the first (token length * alphabet length) dimensions.

      But to be frank, I don’t think it’s really needed; I bet the model learns everything it really needs by itself. If I had time I would’ve tried it, though :)
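
      Roughly what I mean, as a hypothetical sketch (the max length and alphabet here are arbitrary):

        import torch

        ALPHABET = "abcdefghijklmnopqrstuvwxyz"
        MAX_LEN = 8  # assumed number of encoded character positions per token

        def one_hot_chars(token_str):
            # One slot per (position, character) pair in the first
            # MAX_LEN * len(ALPHABET) dimensions.
            block = torch.zeros(MAX_LEN * len(ALPHABET))
            for pos, ch in enumerate(token_str.lower()[:MAX_LEN]):
                idx = ALPHABET.find(ch)
                if idx >= 0:
                    block[pos * len(ALPHABET) + idx] = 1.0
            return block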

      Bonus content, accuracies for other models (notice DeepSeek!):

      - Qwen3-32B: 0.873 / 0.585 / 0.467

      - Qwen3-235B-A22B: 0.857 / 0.607 / 0.502

      - DeepSeek-V3: 0.869 / 0.738 / 0.624