Comment by sheepscreek
11 hours ago
It seems their tests rely on Claude alone. It’s not safe to assume that Codex or Gemini will behave the same way as Claude. I use all three and each has its own idiosyncrasies.
I've done very similar things with my custom agent, which uses Gemini, and have gotten very similar results. I'm working on the evals to back that claim up.