← Back to context

Comment by skolos

10 hours ago

How many times did you try? Same model running multiple times can produce both very good and very bad results. In my benchmark even 10 runs often not enough to tell for sure if one model is better than another.

Usually just once (and I did just one test for this particular one), but I've found the overall quality to be relatively consistent.

There's too many confounding variables here, randomness just one of them. So I don't think of it as a definitive test (and reliable ordering), just another data point (along with actual benchmarks, pelicans, etc) to get a sense of the capabilities.

For example, I managed to get something out of DeepSeek 4 Flash quantized to 2-bit with Antirez' DwarfStar, used via Pi. Almost kinda worked! :) Which makes me optimistic for using local models for serious development soon - I'd say within a year.