Comment by simianwords

10 hours ago

I trust you. If it were happening that frequently, could you give me a single prompt that gets it to bullshit?

I did this in one attempt just now: https://gemini.google.com/share/b4e016be1f69

#8 has an incorrect answer (3 appearances according to Gemini, 2 according to reality https://en.wikipedia.org/wiki/Bowl_championship_series#BCS_a...)

So it works 95% of the time on a literally trivial use case. Imagine any other tech tool with that kind of reliability: `ls` displaying 95% of your files, your phone sending and receiving 95% of your text messages, or Microsoft Word saving 95% of the characters you typed. That's just not acceptable.

  • Hi! The challenge was ChatGPT, but even then it looks like you used the weakest version of Gemini.

    • >I stress test commercially deployed LLMs like Gemini and Claude with trivial tasks

      I did exactly what I said I did. I'm using these systems the way they're designed and advertised, following the happy path with tasks that are small, trivial, and easy to check. That is the charitable approach, yet the system creaks under the lightest load. If Google wants to put on a better show with stronger models, it should make those the default.

      You don't need to make excuses for shoddy engineering from multi-billion dollar corporations. And you're quite welcome to run the same prompt on ChatGPT and evaluate it on your own time.