Comment by sheepscreek

16 hours ago

It seems their tests rely on Claude alone. It’s not safe to assume that Codex or Gemini will behave the same way as Claude. I use all three and each has its own idiosyncrasies.

I've done very similar things with my custom agent that uses Gemini and have gotten very similar results. I'm working on the evals to back that claim up.