To add to the sibling "good is relative": it also depends on what you're running, not just your tolerance for what counts as good. E.g. in a MoE the decode speedup means the prompt-processing delay is more noticeable for the same model size in RAM.
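A rough back-of-the-envelope sketch of that tradeoff, with entirely hypothetical throughput numbers (prompt processing is compute-bound and changes little between architectures, while MoE decode is much faster because fewer parameters are active per token):

```python
# Toy latency model: TTFT comes from prompt processing, the rest from decode.
# All tok/s figures below are made up for illustration only.

def latency(prompt_tokens: int, output_tokens: int,
            pp_tps: float, decode_tps: float) -> tuple[float, float]:
    ttft = prompt_tokens / pp_tps              # time to first token (seconds)
    total = ttft + output_tokens / decode_tps  # end-to-end latency
    return ttft, total

# Dense model: slow decode, so TTFT is a modest share of the total wait.
ttft, total = latency(8000, 500, pp_tps=200, decode_tps=10)
print(f"dense: TTFT {ttft:.0f}s of {total:.0f}s total")  # 40s of 90s

# MoE with a similar RAM footprint: decode is far faster, prompt processing
# barely changes, so TTFT now dominates the perceived delay.
ttft, total = latency(8000, 500, pp_tps=250, decode_tps=50)
print(f"moe:   TTFT {ttft:.0f}s of {total:.0f}s total")  # 32s of 42s
```

Same hardware, same memory footprint, but in the MoE case TTFT goes from under half the wait to the large majority of it, which is why prompt-processing improvements feel so much bigger there.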
Let’s say TTFT needed the most improvement. At some point, loading the model with a large enough context can take tens of seconds on some Macs.
Good is relative but first token was clearly the biggest limitation.
Yeah TTFT was terrible. I don’t think it’s unreasonable to benchmark the most-improved metric.