Comment by ksplicer

1 year ago

This is something we've been grappling with on my team. Many of the researchers in the org want to try all these reasoning techniques to increase performance, and my team keeps pushing back that we don't actually need the extra performance; we just want to decrease latency and cost.

So make it a requirement to use a cheaper, lower-latency model, and then try to bring its performance up to a satisfactory level. This assumes you aren't already using the cheapest/lowest-latency model.