Comment by interleave

2 months ago

Technically my wife would be a perfect customer because we literally just prototyped your solution at home. But I'm confused.

For context:

My wife does leadership coaching and recently used vanilla GPT-4o via ChatGPT to summarize a transcript of an hour-long conversation.

Then, last weekend we thought... "Hey, let's test local LLMs for more privacy control. The open source models must be pretty good in 2025."

So I installed Ollama + Open WebUI plus the models on a 128GB MacBook Pro.
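
For anyone who wants to reproduce the setup, it was roughly this (the Docker invocation is the one from the Open WebUI README; tags and ports may differ on your machine):

    # Ollama natively on macOS; models pulled by tag
    brew install ollama
    ollama pull llama3.3
    ollama pull deepseek-r1

    # Open WebUI in Docker, talking to the host's Ollama
    docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
      -v open-webui:/app/backend/data --name open-webui \
      ghcr.io/open-webui/open-webui:main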

I am genuinely dumbfounded by the results we got today comparing ChatGPT/GPT-4o against Llama4, Llama3.3, Llama3.2, DeepSeekR1, and Gemma.

In short: Compared to our reference GPT-4o output, none (as in NONE, zero, zilch, nil) of the above-mentioned open source models were able to create even a basic summary based on the exact same prompt + text.

The open source summaries were offensively bad. They read like the most bland, generic, idiotic SEO slop I've seen since I last used Google. None of the obvious topics made it into the summary. Just blah. And I tested this with 5 models to boot!

I'm not an OpenAI fan per se, but if this is truly open-source SOTA, then we shouldn't even mention Llama4 or the others in the same breath as the newer OpenAI models.

What do you think?

Comment by tanya

Ollama does heavily quantize models and uses a very short context window by default, but this has not been my experience with unquantized, full-context versions of Llama3.3 70B and particularly DeepSeek R1, and that is reflected in the benchmarks. For instance, I used DeepSeek R1 671B as my daily driver for several months, and it was on par with o1 and unquestionably better than GPT-4o (o3 is certainly better than all of them, but typically we've seen open source models catch up within 6-9 months).
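
To make that concrete: the default window can be raised without touching the weights, either per session or per request. A minimal sketch, assuming a pulled llama3.3 tag (32768 is just an illustrative value):

    # inside an interactive `ollama run llama3.3` session:
    /set parameter num_ctx 32768

    # or per request against the local REST API:
    curl http://localhost:11434/api/generate -d '{
      "model": "llama3.3",
      "prompt": "Summarize the following transcript: ...",
      "options": { "num_ctx": 32768 }
    }'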

Please shoot me an email at tanya@tinfoil.sh; I'd love to work through your use cases.

  • Hey Tanya! Thank you for helping me understand the results better.

    I just posted the results of another basic interview analysis (4o vs. Llama4) here: https://x.com/SpringStreetNYC/status/1923774145633849780

    To your point: Do I understand correctly that, for example, running the default Llama4 model via ollama gives me a very short context window even when the model's stated context is, like, 10M? And that in order to "unlock" the full context version, I need to get the unquantized version? (See the Modelfile sketch below.)

    For reference, here's what `ollama show llama4` returns:

        parameters          108.6B        # llama4:scout
        context length      10485760      # 10M
        embedding length    5120
        quantization        Q4_K_M
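
    What I gather from the Ollama docs: quantization and the context window are independent knobs, so raising the window wouldn't require different weights, just a Modelfile like this (131072 is only an illustrative value; llama4-bigctx is an arbitrary local name):

        # Modelfile -- derive a larger-context local variant
        FROM llama4:scout
        PARAMETER num_ctx 131072

    and then:

        ollama create llama4-bigctx -f Modelfile
        ollama run llama4-bigctx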