Comment by npodbielski

21 days ago

Yes, I was thinking about the same approach because I have Strix Halo and it slows down with longer context so context with less than <10k tokens would be achievable this way. If this could be done with small model that have >50tk/s that would be huge.

Unfortunately I am caught up right now in other projects at work and otherwise and just tried few dozens of prompts to see if this is even achievable.