Comment by npodbielski
21 days ago
Yes, I was thinking about the same approach because I have Strix Halo and it slows down with longer context so context with less than <10k tokens would be achievable this way. If this could be done with small model that have >50tk/s that would be huge.
Unfortunately I am caught up right now in other projects at work and otherwise and just tried few dozens of prompts to see if this is even achievable.
No comments yet
Contribute on Hacker News ↗