Comment by creddit
7 hours ago
Playing with this some more and it's actively not good. Basic mathematical errors riddle its responses. I did some basic adversarial testing where its responses are analyzed by Gemini, and Gemini is finding basic math errors on every relatively simple ask I make (simple relative to what Opus, Gemini, or GPT can handle). Yikes.
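For concreteness, here's a minimal sketch of the kind of cross-check loop being described: one model answers, a second model grades the arithmetic. The `ask_target` and `ask_grader` helpers are placeholders for whatever client calls you actually use (Anthropic, Google SDKs, etc.), and the prompts are illustrative only.

```python
import re

# Placeholder helpers -- swap in your actual client calls.
# These names are illustrative, not a real API.
def ask_target(prompt: str) -> str:
    raise NotImplementedError("call the model under test here")

def ask_grader(prompt: str) -> str:
    raise NotImplementedError("call Gemini (or another grader model) here")

# Example prompts; use whatever "relatively simple asks" you care about.
PROMPTS = [
    "Split a $137.50 bill three ways and add a 20% tip per person.",
    "What is 17% of 2,340?",
]

GRADER_TEMPLATE = (
    "Check the arithmetic in the answer below. "
    "Reply with PASS if every calculation is correct, otherwise FAIL "
    "followed by the first incorrect step.\n\n"
    "Question: {q}\n\nAnswer: {a}"
)

def run_cross_check(prompts=PROMPTS):
    """Have the grader model flag arithmetic errors in the target's answers."""
    failures = []
    for q in prompts:
        answer = ask_target(q)
        verdict = ask_grader(GRADER_TEMPLATE.format(q=q, a=answer))
        if not re.match(r"\s*PASS\b", verdict):
            failures.append((q, answer, verdict))
    print(f"{len(failures)}/{len(prompts)} prompts flagged by the grader")
    return failures
```

Note this only shows where two models disagree; the grader can be wrong too, so flagged cases still need a human check.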
Post actual results, make a blog post. Don't just say "this sucks" without tangible evidence.
Otherwise you're doomed to "sample size of one" level of relevance.
Then your internal benchmarks will be in the post-training set and you’ll have to make new ones.
I take the opposite view: random HN/Reddit comments saying "this sucks" or "whoa, this is a huge improvement" are the only benchmark that means anything. Standard benchmarks are all gamed and don't capture the complexity of the real world.