Comment by nerdsniper
7 days ago
My experience also shows that Gemini has unique strength in “generalized” (read: not coding) tasks. Gemini 2.5 Pro and 3 Pro seem stronger at math and science for me, and their Deep Research usually works the hardest, as long as I run it during off-hours. Opus seems to beat Gemini almost “with one hand tied behind its back” in coding, but Gemini is so cheap that it’s usually my first stop for anything that I think is likely to be relatively simple. I never worry about my quota on Gemini like I do with Opus or ChatGPT.
Comparisons generally seem to change much faster than I can keep my mental model updated. But the performance lead of Gemini on more ‘academic’ explorations of science, math, engineering, etc. has been pretty stable for the past 4 months or so, which makes it one of the longer-lasting trends for me in comparing foundation models.
I do wish I could more easily get timely access to the “super” models like Deep Think or o3 pro. I never seem to get a response to requesting access, and have to wait for public access models to catch up, at which point I’m never sure if their capabilities have gotten diluted since the initial buzz died down.
They all still suck at writing an actually good essay/article/literary or research review, or other long-form things which require a lot of experienced judgement to come up with a truly cohesive narrative. I imagine this relates to their low performance in humor - there’s just so much nuance and these tasks represent the pinnacle of human intelligence. Few humans can reliably perform these tasks to a high degree of performance either. I myself am only successful some percentage of the time.
> their Deep Research usually works the hardest
That's sort of damning with faint praise, I think. So, for $work I needed to understand the legal landscape for some regulations (around employment screening), so I kicked off a deep research for all the different countries. That was fine-ish, but tended to go off the rails towards the end.
So, then I split it out into Americas, APAC and EMEA requirements. This time, I spent the time checking all of the references (or almost all anyways), and they were garbage. Like, it ~invented a term and started telling me about this new thing, and when I looked at the references they had no information about the thing it was talking about.
It linked to Reddit for an employment law question. When I read the Reddit thread, it didn't even have any support for the claims. It contradicted itself from the beginning to the end. It claimed something was true in Singapore, based on a Swedish source.
Like, I really want this to work as it would be a massive time-saver, but I reckon that right now, it only saves time if you don't want to check the sources, as they are garbage. And Google make a business of searching the web, so it's hard for me to understand why this doesn't work better.
I'm becoming convinced that this technology doesn't work for this purpose at the moment. I think that it's technically possible, but none of the major AI providers appear to be able to do this well.
Oh yeah, LLMs currently spew a lot of garbage. Everything has to be double-checked. I mainly use them for gathering sources and pointing out a few considerations I might have otherwise overlooked. I often run them a few times, because they go off the rails in different directions, but sometimes those directions are helpful for me in expanding my understanding.
I still have to synthesize everything from scratch myself. Every report I get back is like "okay well 90% of this has to be thrown out" and some of them elicit a "but I'm glad I got this 10%" from me.
For me it's less about saving time, and more about potentially unearthing good sources that my google searches wouldn't turn up, and occasionally giving me a few nuggets of inspiration / new rabbit holes to go down.
Also, Google changed their business from Search, to Advertising. Kagi does a much better job for me these days, and is easily worth the $5/mo I pay.
> For me it's less about saving time, and more about potentially unearthing good sources that my google searches wouldn't turn up, and occasionally giving me a few nuggets of inspiration / new rabbit holes to go down.
Yeah, I see the value here. And for personal stuff, that's totally fine. But these tools are being sold to businesses as productivity increasers, and I'm not buying it right now.
I really, really want this to work though, as it would be such a massive boost to human flourishing. Maybe LLMs are the wrong approach, though; certainly the current models aren't doing a good job.