Comment by simianwords

12 hours ago

> The dataset for this insight combines data on large language model (LLM) API prices and benchmark scores from Artificial Analysis and Epoch AI. We used this dataset to identify the lowest-priced LLMs that match or exceed a given score on a benchmark. We then fit a log-linear regression model to the prices of these LLMs over time, to measure the rate of decrease in price. We applied the same method to several benchmarks (e.g. MMLU, HumanEval) and performance thresholds (e.g. GPT-3.5 level, GPT-4o level) to determine the variation across performance metrics

This should answer. In your case, GPT-3.5 definitely is cheaper per token than 4o but much much less capable. So they used a model that is cheaper than GPT-3.5 that achieved better performance for the analysis.