Comment by seemaze
5 hours ago
> One thing I really didn't expect: the serving backend matters. Same Mistral-Nemo 12B weights produce 7% accuracy on llama-server with native function calling and 83% on Llamafile in prompt mode.
I thought Llamafile was just a model and llama.cpp bundled in to a single binary - is this the difference between Llamafile injecting a default sysmtem prompt vs hitting the raw llama-server endpoint with no harness?
That seems like comparing apples to apple pie, there's some ingredients missing.
I was surprised as well. I did go with an extreme (but true) example in the post. In this case, native function-calling template likely is in play.
However, that doesn't explain the Lamaserver prompt vs llamafile at ~ +4pts, or vs Ollama (at ~ +30ish pts) that sits almost perfectly between llamaserver native and llamafile.
The backend affects almost all model families, and was just something I've never seen really talked about.
Do you have any suspicion about what is different between the backends?
That's an absolutely bonkers statistic: it would mean spurious differences in hosting container overwhelm the performance differences between models.
I genuinely don't, sadly. I'm a mathematician originally, evolved organically into ML then AI - but I never really was a SWE.
I feel like there's some backend decoding or chat template thing going on at a much lower level than what I'm best at. Maybe it's injecting headers or something that eventually compounds to model confusion? I really have no idea.
I really hope folks better than me at backend stuff take a look and dive into it though because it's definitely under-reported and super consistent across model families and backends ranging from ollama, lama.cpp native, prompt, llamafile, and even vLLM that I didn't formally benchmark in the repo.
I wouldn't expect such difference