Comment by amelius
10 hours ago
And a "prune here" button.
It often happens that the interesting information is in the first paragraph or so, and the remainder is all just the LLM not knowing when to stop. This is super annoying as a conversation then ends up being 90% noise.
Pruning an assistant's response like that would break prompt caching.
Prompt caching is probably the single most important thing that people building harnesses think about, and yet its mind share among end users is virtually zero. For almost any of the weirdest, most seemingly baffling design decisions in an AI product, the answer to "why" is probably "to not break prompt caching".
Grug says prompt caching just store KV-cache which is sequenced by token. Easy cut it back to just before edit. Then regenerate after is just like prefill but tiny.
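Grug's point can be sketched with a toy prefix cache (all names here are made up for illustration, not any real inference engine's API): the cache covers a token sequence, a prune keeps the shared prefix valid, and only the tokens after the cut need a fresh prefill.

```python
def common_prefix_len(a, b):
    """Length of the shared prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class PrefixKVCache:
    """Toy stand-in for a KV cache: one entry per cached token position."""

    def __init__(self):
        self.tokens = []  # tokens the cache currently covers

    def process(self, tokens):
        """Return how many tokens must be recomputed (the 'prefill' cost)."""
        keep = common_prefix_len(self.tokens, tokens)
        recomputed = len(tokens) - keep
        self.tokens = list(tokens)
        return recomputed

cache = PrefixKVCache()
convo = list(range(1000))      # stand-in token ids for a long conversation
first = cache.process(convo)   # first pass: all 1000 tokens prefilled

# Prune the tail of an assistant turn (drop tokens 600..999),
# then append a short new user turn of 3 tokens.
pruned = convo[:600] + [2000, 2001, 2002]
cost = cache.process(pruned)   # only the 3 tokens after the cut are recomputed
```

This is why cutting at the *end* of a response is cheap: the prefix up to the cut stays cached. Editing or deleting something in the *middle* of the history, by contrast, shortens the shared prefix and forces everything after the edit point to be prefilled again.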
Maybe so, but pruning is still a useful feature.
If it hurts performance that much, maybe pruning could just hide the text from the rendered conversation while leaving the cached context intact?