Comment by nusl

6 months ago

Been using Gemini for a few months, somehow it's gotten much, much worse in that time. Hallucinations are very common, and it will argue with you when you point it out. So, don't have much confidence.

11 comments

panarky 6 months ago

In my experience with chat, Flash has gotten much, much better. It's my go-to model even though I'm paying for Pro.

Pro is frustrating because it too often won't search to find current information, and just gives stale results from before its training cutoff. Flash doesn't do this much anymore.

For coding I use Pro in Gemini CLI. It is amazing at coding, but I'm actually using it more to write design docs, decompose multi-week assignments into daily and hourly tasks, and then feed those docs back to Gemini CLI to have it work through each task sequentially.

With a little structure like this, it can basically write its own context.
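
A minimal sketch of that loop, assuming the design doc lists tasks as markdown checkboxes. `parse_tasks`, `run_sequentially`, and the `run_agent` callable are hypothetical names for illustration; `run_agent` stands in for however you hand a prompt to the model and is not a Gemini CLI API.

```python
import re
from typing import Callable, List


def parse_tasks(doc: str) -> List[str]:
    """Collect unchecked checklist items ("- [ ] ...") in document order."""
    pattern = re.compile(r"^[ \t]*- \[ \] (.+)$", re.MULTILINE)
    return [m.group(1).strip() for m in pattern.finditer(doc)]


def run_sequentially(doc: str, run_agent: Callable[[str], str]) -> List[str]:
    """Feed tasks to the agent one at a time, carrying earlier results forward."""
    notes: List[str] = []
    for i, task in enumerate(parse_tasks(doc), start=1):
        # Earlier results become context for the next task.
        context = "\n".join(notes) if notes else "(none yet)"
        prompt = (
            f"Design doc:\n{doc}\n\n"
            f"Completed so far:\n{context}\n\n"
            f"Now do task {i}: {task}"
        )
        result = run_agent(prompt)  # hypothetical hook; plug in your own call
        notes.append(f"{i}. {task} -> {result}")
    return notes
```

Each prompt includes the notes from the tasks before it, which is one way the model ends up writing its own context.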

declan_roberts 6 months ago

I like Flash because when it's wrong, it's wrong very quickly. You can either change the prompt or just solve the problem yourself. It works well for people who can spot the answer as being "wrong".

okdood64 6 months ago

> Flash has gotten much, much better. It's my go-to model even though I'm paying for Pro.

Same. I think Pro also got worse...

vicnov 6 months ago

Interesting. Out of all the "thinking models," I struggle with Gemini the most for coding. I just can't make it perform. I feel like they silently nerfed it over the last few months.

nusl 6 months ago

It does feel worse. I've swapped to Claude and it's massively better for my tasks. Jules just released so I'll see if that's useful.

arnaudsm 6 months ago

I feel the same, but cannot measure the effect in any context benchmark like fiction.livebench.

Are they aggressively quantizing, or are our expectations silently increasing?

nusl 6 months ago

Yeah, it's hard to measure. Not sure about our expectations, though I recall way better output when I first started using Gemini 2.5 vs now. It seems to be stupider and more headstrong somehow?

_proofs 6 months ago

my recent experience with flash and using it to prototype a c++ header i was developing:

- it was great to brainstorm with but it routinely introduced edits and dramatic code changes, often unnecessary and many times causing regressions to existing, tested code.
- numerous times recursion got introduced to revisions without being prompted or without any justified or good reason
- hallucinated a few times regarding c++ type deduction semantics

i eventually had to explicitly tell it to not introduce edits in any working code being iterated on without first discussing the changes, and then being prompted by me to introduce the edits.

all in all i found base chatgpt a lot more productive and accurate and ergonomic for iterating (on the same problem just working it in parallel with gemini).

- code changes were not always arbitrarily introduced or dramatic
- it attempted to always work with the given code rather than extrapolate and mind read
- hallucinated on some things but quickly corrected and moved forward
- was a lot more interactive and documenting
- almost always prompted me first before introducing a change (after providing annotated snippets and documentation as the basis for a proposed change or fix)

however, both were great tools to work with when it came to cleaning up or debugging existing code, especially unit testing or anything related to TDD

alecco 6 months ago

Same here. I stopped using Gemini Pro because, on top of its hard-to-follow verbosity, it was giving contradictory answers on things that Claude Sonnet 4 could answer.

Speaking of Sonnet, I feel like it's closing the gap to Opus. After the new quotas I started to try it before Opus and now it gets complex things right more often than not. This wasn't my experience just a couple of months ago.

quadrature 6 months ago

Is the problem mainly with tool use? And are you using it through AI Studio or through the API?

I've found that it hallucinates tool use for tools that aren't available and then gets very confident about the results.

nusl 6 months ago

Via the chat prompt mostly, and sometimes via Copilot. It was quoting me sources and links that didn't exist, and when I told it the links were wrong it doubled down forever, no matter how hard I tried to tell it otherwise. Even sent screenshots, etc.
Kinda just got stuck in a self-confident loop that time. Other times the output is just far worse than Claude for similar use cases, where a couple months back it was stronger, at least in my subjective experience.