Comment by seemaze

5 hours ago

> One thing I really didn't expect: the serving backend matters. Same Mistral-Nemo 12B weights produce 7% accuracy on llama-server with native function calling and 83% on Llamafile in prompt mode.

I thought Llamafile was just a model and llama.cpp bundled in to a single binary - is this the difference between Llamafile injecting a default sysmtem prompt vs hitting the raw llama-server endpoint with no harness?

That seems like comparing apples to apple pie, there's some ingredients missing.

4 comments

seemaze

zambelli 5 hours ago

I was surprised as well. I did go with an extreme (but true) example in the post. In this case, native function-calling template likely is in play.

However, that doesn't explain the Lamaserver prompt vs llamafile at ~ +4pts, or vs Ollama (at ~ +30ish pts) that sits almost perfectly between llamaserver native and llamafile.

The backend affects almost all model families, and was just something I've never seen really talked about.

eob 4 hours ago
Do you have any suspicion about what is different between the backends?
That's an absolutely bonkers statistic: it would mean spurious differences in hosting container overwhelm the performance differences between models.
- zambelli 3 hours ago
  
  I genuinely don't, sadly. I'm a mathematician originally, evolved organically into ML then AI - but I never really was a SWE.
  I feel like there's some backend decoding or chat template thing going on at a much lower level than what I'm best at. Maybe it's injecting headers or something that eventually compounds to model confusion? I really have no idea.
  I really hope folks better than me at backend stuff take a look and dive into it though because it's definitely under-reported and super consistent across model families and backends ranging from ollama, lama.cpp native, prompt, llamafile, and even vLLM that I didn't formally benchmark in the repo.

imachine1980_ 5 hours ago

I wouldn't expect such difference