Comment by creddit
7 hours ago
Playing with this some more and it's actively not good. Basic mathematical errors riddle its responses. I did some basic adversarial testing where its responses are analyzed by Gemini, and Gemini is finding basic math errors on every relatively simple ask I make (simple relative to what Opus, Gemini, or GPT can handle). Yikes.
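For concreteness, here's a minimal sketch of the kind of cross-check loop being described: one model answers, a second model grades the arithmetic. The `ask_target` and `ask_grader` helpers are placeholders for whatever client calls you actually use (Anthropic, Google SDKs, etc.), and the prompts are illustrative only.

```python
import re

# Placeholder helpers -- swap in your actual client calls.
# These names are illustrative, not a real API.
def ask_target(prompt: str) -> str:
    raise NotImplementedError("call the model under test here")

def ask_grader(prompt: str) -> str:
    raise NotImplementedError("call Gemini (or another grader model) here")

# Example prompts; use whatever "relatively simple asks" you care about.
PROMPTS = [
    "Split a $137.50 bill three ways and add a 20% tip per person.",
    "What is 17% of 2,340?",
]

GRADER_TEMPLATE = (
    "Check the arithmetic in the answer below. "
    "Reply with PASS if every calculation is correct, otherwise FAIL "
    "followed by the first incorrect step.\n\n"
    "Question: {q}\n\nAnswer: {a}"
)

def run_cross_check(prompts=PROMPTS):
    """Have the grader model flag arithmetic errors in the target's answers."""
    failures = []
    for q in prompts:
        answer = ask_target(q)
        verdict = ask_grader(GRADER_TEMPLATE.format(q=q, a=answer))
        if not re.match(r"\s*PASS\b", verdict):
            failures.append((q, answer, verdict))
    print(f"{len(failures)}/{len(prompts)} prompts flagged by the grader")
    return failures
```

Note this only shows where two models disagree; the grader can be wrong too, so flagged cases still need a human check.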
Post actual results, make a blog post. Don't just say "this sucks" without tangible evidence.
Otherwise you're doomed to "sample size of one" level of relevance.
Then your internal benchmarks will be in the post-training set and you’ll have to make new ones.
I take the opposite view: random HN/Reddit comments saying "this sucks" or "whoa, this is a huge improvement" are the only benchmark that means anything. Standard benchmarks are all gamed and don't capture the complexity of the real world.