Comment by janalsncm
2 days ago
As I understand it, BLT (Byte Latent Transformer) uses a small neural network to chunk raw bytes into patches instead of a fixed tokenizer, but doesn't change the attention mechanism. MTA (Multi-Token Attention) keeps traditional BPE tokenization but changes the attention mechanism itself. You could use both (latency be damned!)
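
Rough sketch of how they'd stack (toy PyTorch, my own names, nothing from either paper's repo): a BLT-style front end uses a small byte LM's next-byte entropy to cut the stream into patches, and an MTA-style layer convolves the attention-score map over the (query, key) plane before the softmax. The threshold, kernel, and mean-pooling are placeholders, and this shows only MTA's query-key convolution, not its head mixing or gating.

    import torch
    import torch.nn.functional as F

    def entropy_patch_boundaries(byte_logits, threshold=5.0):
        # BLT-style patching: open a new patch wherever the small byte-level
        # model's next-byte entropy spikes above a threshold (the threshold
        # here is arbitrary; BLT tunes this on real data).
        probs = F.softmax(byte_logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)  # (seq,)
        boundaries = entropy > threshold
        boundaries[0] = True  # the first byte always opens a patch
        return boundaries

    def pool_patches(byte_embeds, boundaries):
        # Mean-pool byte embeddings into patch embeddings (a crude stand-in
        # for BLT's learned local encoder).
        patch_ids = boundaries.cumsum(0) - 1           # patch index per byte
        n_patches = int(patch_ids.max()) + 1
        out = torch.zeros(n_patches, byte_embeds.size(-1))
        out.index_add_(0, patch_ids, byte_embeds)
        counts = torch.bincount(patch_ids, minlength=n_patches).clamp_min(1)
        return out / counts.unsqueeze(-1)

    def multi_token_attention(q, k, v, kernel_size=3):
        # MTA-style score mixing: convolve the attention-logit map over the
        # (query, key) plane before softmax, so each weight can depend on
        # neighboring query/key pairs, not just one dot product.
        n, d = q.shape
        scores = q @ k.t() / d ** 0.5                  # (n, n) logits
        causal = torch.tril(torch.ones(n, n)).bool()
        scores = scores.masked_fill(~causal, 0.0)      # keep future pairs out of the conv
        kernel = torch.ones(1, 1, kernel_size, kernel_size) / kernel_size ** 2
        mixed = F.conv2d(scores[None, None], kernel, padding=kernel_size // 2)[0, 0]
        mixed = mixed.masked_fill(~causal, float("-inf"))
        return F.softmax(mixed, dim=-1) @ v

    seq, vocab, d = 32, 256, 16
    byte_logits = torch.randn(seq, vocab)  # stand-in for the small byte LM's output
    byte_embeds = torch.randn(seq, d)
    patches = pool_patches(byte_embeds, entropy_patch_boundaries(byte_logits))
    out = multi_token_attention(patches, patches, patches)
    print(patches.shape, out.shape)        # patching upstream, MTA downstream

The point being that the two changes live in different places: patching happens before the transformer ever sees the sequence, score mixing happens inside each layer. That's why combining them is architecturally straightforward, even if slow.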