Comment by boroboro4
3 days ago
One way here is to use a one-hot encoding in the first (token length × alphabet length) dimensions.
But to be frank, I don't think it's really needed; I bet the model learns everything it really needs by itself. If I had time I would've tried it, though :)
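For concreteness, here's a minimal sketch of what that encoding could look like. Everything here is an assumption for illustration: `ALPHABET`, `MAX_TOKEN_LEN`, and `EMBED_DIM` are hypothetical parameters, not something from the original comment.

```python
# Minimal sketch: one-hot encode a token's spelling into the first
# (token length * alphabet length) dimensions of an embedding vector.
# ALPHABET, MAX_TOKEN_LEN, and EMBED_DIM are assumed values for illustration.
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed alphabet
MAX_TOKEN_LEN = 8                        # assumed max token length
EMBED_DIM = 512                          # assumed total embedding width

CHAR_TO_IDX = {c: i for i, c in enumerate(ALPHABET)}

def spelling_embedding(token: str) -> np.ndarray:
    """One-hot encode each character of `token` into the first
    MAX_TOKEN_LEN * len(ALPHABET) dimensions; the remaining
    dimensions stay free for learned features."""
    vec = np.zeros(EMBED_DIM)
    for pos, ch in enumerate(token[:MAX_TOKEN_LEN]):
        idx = CHAR_TO_IDX.get(ch)
        if idx is not None:
            vec[pos * len(ALPHABET) + idx] = 1.0
    return vec

print(spelling_embedding("cat")[:30])  # 'c' at position 0 -> dimension 2 is hot
```

The point of the fixed layout is that the model never has to learn spelling from scratch: each character's identity and position is directly readable off the first block of dimensions.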
Bonus content: accuracies for other models (notice DeepSeek!):
- Qwen3-32B: 0.873 / 0.585 / 0.467
- Qwen3-235B-A22B: 0.857 / 0.607 / 0.502
- DeepSeek-V3: 0.869 / 0.738 / 0.624