Comment by Tiberium

5 hours ago

I checked the current speed over the API, and so far I'm very impressed. Of course, models are usually less loaded on release day, but right now:

- Older GPT-5 Mini is about 55-60 tokens/s on API normally, 115-120 t/s when used with service_tier="priority" (2x cost).

- GPT-5.4 Mini averages about 180-190 t/s on API. Priority does nothing for it currently.

- GPT-5.4 Nano is at about 200 t/s.

To put this into perspective, Gemini 3 Flash is about 130 t/s on Gemini API and about 120 t/s on Vertex.

This is raw tokens/s for all models; it doesn't exclude reasoning tokens, but I ran the models with no/minimal reasoning effort where supported.
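For what it's worth, a raw tokens/s figure like the ones above can be computed from any streamed response. This is a minimal sketch over synthetic chunk events; the `(timestamp, n_tokens)` shape is an assumption for illustration, not any provider's actual SDK format:

```python
def tokens_per_second(events):
    """events: list of (timestamp_seconds, n_tokens) pairs for streamed chunks.
    Returns raw throughput over the whole stream, reasoning tokens included."""
    if len(events) < 2:
        raise ValueError("need at least two chunks to measure a rate")
    t_first, t_last = events[0][0], events[-1][0]
    total_tokens = sum(n for _, n in events)
    return total_tokens / (t_last - t_first)

# Synthetic stream: 10 chunks of 20 tokens each, 0.1 s apart
events = [(0.1 * i, 20) for i in range(10)]
print(round(tokens_per_second(events)))  # 200 tokens / 0.9 s ≈ 222
```

In a real measurement you'd record a wall-clock timestamp as each SSE chunk arrives and take the token counts from the provider's usage fields.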

And quick price comparisons:

- Claude: Opus 4.6 is $5/$25, Sonnet 4.6 is $3/$15, Haiku 4.5 is $1/$5

- GPT: 5.4 is $2.5/$15 ($5/$22.5 for >200K context), 5.4 Mini is $0.75/$4.5, 5.4 Nano is $0.2/$1.25

- Gemini: 3.1 Pro is $2/$12 ($3/$18 for >200K context), 3 Flash is $0.5/$3, 3.1 Flash Lite is $0.25/$1.5
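To make those per-million-token prices concrete, here's a quick sketch that prices a single hypothetical call (10K input / 2K output tokens, a made-up traffic shape) at the base rates listed above, ignoring the >200K-context tiers:

```python
# (input $/1M tokens, output $/1M tokens) from the list above
PRICES = {
    "Opus 4.6": (5.0, 25.0),
    "Sonnet 4.6": (3.0, 15.0),
    "Haiku 4.5": (1.0, 5.0),
    "GPT-5.4": (2.5, 15.0),
    "GPT-5.4 Mini": (0.75, 4.5),
    "GPT-5.4 Nano": (0.2, 1.25),
    "Gemini 3.1 Pro": (2.0, 12.0),
    "Gemini 3 Flash": (0.5, 3.0),
    "Gemini 3.1 Flash Lite": (0.25, 1.5),
}

def call_cost(model, input_tokens, output_tokens):
    in_price, out_price = PRICES[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Hypothetical call: 10K in, 2K out, cheapest first
for model in sorted(PRICES, key=lambda m: call_cost(m, 10_000, 2_000)):
    print(f"{model:22s} ${call_cost(model, 10_000, 2_000):.4f}")
```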

Tokens/sec is meaningless without the thinking level. If a model is fast but keeps rambling instead of getting to the point, it can still take far longer overall than a low-tokens/sec model with little or no thinking.

IME tok/s is only useful with the additional context of TTFT and total latency. At this point a given closed model does not exist in a vacuum but rather in a wider architecture that affects the actual performance profile for an API consumer.

This isn't usually an issue comparing models within the same provider, but it does mean cross-provider comparison using only tok/s is not apples-to-apples in terms of real-world performance.
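A toy illustration of that point: total wall time is TTFT plus generation time, so on short outputs a "slower" model with a quick first token can beat a faster one. The numbers below are invented for illustration:

```python
def total_latency(ttft_s, output_tokens, tokens_per_s):
    """Total wall time = time to first token + generation time."""
    return ttft_s + output_tokens / tokens_per_s

# 100-token response: high throughput but slow start vs. the reverse
slow_start = total_latency(ttft_s=2.0, output_tokens=100, tokens_per_s=180)
fast_start = total_latency(ttft_s=0.3, output_tokens=100, tokens_per_s=60)
print(round(slow_start, 2))  # 2.56 s at 180 t/s
print(round(fast_start, 2))  # 1.97 s at 60 t/s -- the "slower" model wins
```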

  • Exactly. It's really frustrating that they don't advertise TTFT etc., and that it's really hard to find any info on that for newer models.

    For voice agents, gpt-4.1 and gpt-4.1-mini seem to be the best low-latency models when you need to handle bigger data or more complex asks.

    But they are a year old, and trying to figure out whether these new models (instant, chat, realtime, mini, nano, wtf) are a good upgrade is very frustrating. AFAICT they aren't; the TTFT latencies are too high.

Curious to hear why people pick GPT and Claude over Google (when sometimes you’d think they have a natural advantage on costs, resources and business model etc)?

  • In my workplace, it's availability. We have to use US-only models for government-compliance reasons, so we have access to Opus 4.6 and GPT 5.4, but only Gemini 2.5, which isn't in the same class as the first two.

Man the lowest end pricing has been thoroughly hiked. It was convenient while it lasted.

I wish someone would also thoroughly measure prompt processing speeds across the major providers. Output speeds are useful as well, but they're already the more commonly measured metric.

  • In my use case for small models, I typically generate at most 100 tokens per API call, so prompt processing makes up the majority of the wait time from the user's perspective. I found OAI's models to be quite poor at this and switched to Anthropic's API just for that reason.

    I've found Haiku to be pretty fast at PP, but I'd be willing to investigate another provider if they offer faster speeds.
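One way to estimate prompt processing speed without provider cooperation is to measure TTFT at several prompt lengths and fit a line: the slope is seconds per input token, so its inverse approximates PP throughput, while fixed per-request overhead lands in the intercept. A sketch over synthetic measurements (the sample numbers are made up):

```python
def pp_speed_from_ttft(samples):
    """samples: list of (prompt_tokens, ttft_seconds).
    Least-squares slope of TTFT vs. prompt length; 1/slope ≈ prompt tokens/s."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in samples)  # covariance term
    den = sum((x - mean_x) ** 2 for x, _ in samples)            # variance term
    return den / num  # 1 / slope, i.e. prompt tokens per second

# Synthetic: 0.4 s fixed overhead + 1 ms per prompt token -> 1000 t/s PP
samples = [(1_000, 1.4), (4_000, 4.4), (16_000, 16.4)]
print(round(pp_speed_from_ttft(samples)))  # 1000
```

In practice you'd want several runs per prompt length, since cache hits and load variance make individual TTFT samples noisy.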