
Comment by rfoo

1 day ago

If you compare "schema validation error count" plus "Count of Finish Reason others", then SiliconFlow and Infinigence are in the same bucket too. Maybe their API layer detected incorrect tool calls and set the finish reason to something else?

IMO this is likely what you get from running the model correctly as-is (i.e. using the same weight and activation dtypes), so Together is not bad.

Moonshot AI themselves and Groq likely use some sampler tricks to eliminate schema validation errors.

So really the only thing this shows is: Nebius, Chutes, and AtlasCloud could be running something else (for example, a further quantized model). Or there are bugs.

Fair point. If Moonshot is holding back the true weights or inference techniques that affect correctness, then providers including Together should call them out on that. I for one would stop using Kimi if that is the case.

Anyway, Novita is doing significantly better than Together on the vendor verifier chart, so the low quality must be at least partially Together's fault.

  • I don't think it's the weights being different or special inference techniques; more likely they haven't been able to train the model to follow the tool schema perfectly yet, and both Moonshot and Groq decided to use something like https://github.com/noamgat/lm-format-enforcer to make sure at least the output format is correct (see the sketch below).
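
For concreteness, here's a rough sketch of what that kind of schema enforcement could look like with lm-format-enforcer's transformers integration. The API names follow the project's README; the schema, model name, and prompt are placeholders I made up, not anything Moonshot or Groq has confirmed using:

    # Sketch: constrain decoding so the output always matches a tool-call JSON schema.
    # Assumes lm-format-enforcer's HuggingFace transformers integration.
    from transformers import pipeline
    from lmformatenforcer import JsonSchemaParser
    from lmformatenforcer.integrations.transformers import (
        build_transformers_prefix_allowed_tokens_fn,
    )

    # Hypothetical tool-call schema the generated JSON must conform to.
    tool_call_schema = {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "arguments": {"type": "object"},
        },
        "required": ["name", "arguments"],
    }

    hf_pipeline = pipeline("text-generation", model="some/served-model")  # placeholder model

    # At each decoding step, only allow tokens that keep the output valid
    # against the schema; invalid continuations are masked out.
    parser = JsonSchemaParser(tool_call_schema)
    prefix_fn = build_transformers_prefix_allowed_tokens_fn(hf_pipeline.tokenizer, parser)

    out = hf_pipeline(
        "Call the weather tool for Berlin as JSON:",
        prefix_allowed_tokens_fn=prefix_fn,
        max_new_tokens=128,
    )
    print(out[0]["generated_text"])

Note this only guarantees well-formed output; it doesn't make the model pick the right tool or arguments, which is why it would hide schema validation errors without necessarily improving actual tool-call quality.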