Computer use in Gemini 3.5 Flash

5 hours ago (blog.google)

79 comments

swolpers

Today I asked Gemini to extract a table from a PDF appendix and create a C++ data table with its contents. After 15 or so iterations with corrections and new mistakes, it eventually gave up. I was floored when it said “I’m sorry, I cannot do this simple task, I’ve exceeded my error threshold and cannot do this task for you. My LLM prediction engine invents data instead of doing a simple data copy/reformat”.

Stunned to see that Gemini threw its digital arms in the air and gave up.
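
For what it's worth, the "simple data copy/reformat" half of that task can be done deterministically once the table exists as CSV (for example from an OCR or table-extraction step). Below is a minimal Python sketch of that idea. The names `csv_to_cpp_table`, `Row`, and `kTable` are made up for illustration, and it assumes a matching C++ struct is defined elsewhere; it is not what Gemini does internally.

```python
import csv
import io


def csv_to_cpp_table(csv_text: str, struct_name: str = "Row",
                     array_name: str = "kTable") -> str:
    """Render CSV rows as a C++ array initializer, copying values verbatim.

    Assumes the first CSV row is a header and that a struct named
    `struct_name` with matching fields exists in the C++ code (hypothetical).
    """
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, data = rows[0], rows[1:]
    lines = [
        f"// columns: {', '.join(header)}",
        f"constexpr {struct_name} {array_name}[] = {{",
    ]
    for row in data:
        cells = []
        for cell in row:
            cell = cell.strip()
            try:
                float(cell)  # parses as a number: emit exactly as written
                cells.append(cell)
            except ValueError:
                # anything else becomes an escaped C++ string literal
                escaped = cell.replace("\\", "\\\\").replace('"', '\\"')
                cells.append(f'"{escaped}"')
        lines.append("    {" + ", ".join(cells) + "},")
    lines.append("};")
    return "\n".join(lines)


if __name__ == "__main__":
    print(csv_to_cpp_table("name,value\nalpha,1\nbeta,2.5\n"))
```

The point of the design is that no value is ever regenerated by a model: every cell is copied from the input, so the only place errors can creep in is the extraction step before it.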

hashta 1 minute ago

That's interesting because my experience has been almost the opposite. A few months ago I tested Gemini on converting screenshots of tables from PDF files into CSV. I tried it on several different tables and it got every one right. It consistently outperformed ChatGPT.
fsmv 7 minutes ago

You should just have it OCR a screenshot of the PDF; that would probably work better.
staticman2 23 minutes ago

You didn't say whether you were using the App, but the App's performance seems to be severely throttled compared to the API.
base698 1 hour ago
That's better than the loop Grok got stuck in trying to use git and push the work it did, leading to a $15 API credit deduction.
- whh 1 hour ago
  
  Getting AI/ML to acknowledge "I don't know" is such a challenge.
  
staindk 34 minutes ago

We've been quite impressed with GCP Document AI. Not sure if it has a free tier but perhaps that's where Google is putting all the good OCR.
jjice 1 hour ago

I haven't heard any accounts of it doing that since Gemini 2.5, but it was pretty easy to get it to do it with a programming task back then after a few failed attempts. Very interesting to hear it'll still do it.

satvikpendem 5 hours ago

There's still no MCP support in the Gemini app, which is very useful to get various pieces of info as a user just via chatting. For example I recently wanted to get an Airbnb and wanted to filter by specific criteria including house image analysis and Gemini couldn't do it so I had to do it in Codex.

anticorporate 5 hours ago
Yeah, it seems like this is the biggest missing feature from the Gemini ecosystem.
If I can't connect MCP, there's really no selling point for me to use Gemini from my watch, car, smart speaker, etc. If I'm already bound to using my own front end, then I'm only evaluating Gemini as a model/API, at which point it has many competitors that may be cheaper or better fit for the task.
- thejaycampbell 5 hours ago
  
  agreed... this is where they lost me too
lil-lugger 1 hour ago

I think native apps are critical infrastructure in AI development, particularly around agents. The truth is there’s no good native interaction layer for custom agents. If you want to wire up and self-host an agent that has access to anything ever, your only option is a janky port to Telegram or Slack. I’ve been building vessels.app because I think it’s the missing piece to agent interaction. I need testers if anyone is interested!
mitchell_h 3 hours ago
I'm fairly convinced Claude's strongest point is the app. AI users aren't anywhere near as mature or smart as youtube/hn would have folks believe. The claude app is amazing for bridging that gap.
- dr_dshiv 3 hours ago
  
  Didn’t it take them like 2 days to build the first one?
solarkraft 3 hours ago

They only pretty recently fixed the bug where stopping the model mid-generation lost the entire session.
The Gemini apps suck.
tonyrice 5 hours ago
This is why I don't always use the official Gemini Web app. Lately I've found that it's more useful to utilize a CLI. I'm looking forward to the day they add MCP in the web.
- pregseahorses 4 hours ago
  
  Gemini CLI now requires an Antigravity subscription..
- singingtoday 4 hours ago
  
  CLI doesn't work with my subscription..

paganartifact 42 minutes ago

Who are these people talking about "agentic" stuff, and furthermore who are the people who can't stfu about "MCP"??

Anyone saying their LLM "threw its hands in the air and gave up" or anyone saying "MCP" must be a terribly misinformed influencer, doing a terrible job at that.

Literally 90%+ comments on HN personify their alleged use of AI in a way that is in NO WAY related to how the tool is really used.

Using LLMs for building software has NOTHING to do with those concepts. Nobody has "agents". That literally only exists in marketing. It's not even how it works.

AT ALL

Useless forum

mlmonkey 5 hours ago

It's funny how in their own graph, https://storage.googleapis.com/gweb-uniblog-publish-prod/ima... Gemini 3.5 Flash is beat hands down by both Opus 4.8 and GPT 5.5, and yet the graph is drawn as if Gemini wins ... :-D

IncreasePosts 2 minutes ago

It's amazing how designers of charts trying to show their product is close to the leader always remember to start the axis at zero, and designers of charts trying to show how big their lead is always forget that
mroche 4 hours ago

The graph has Gemini 3.5 Flash matching Sonnet 4.6, losing to Opus 4.8, and slightly behind GPT-5.5 by 0.3 points... That's not that much of a hands-down loss for Gemini for this specific workload benchmark.
The methodology used:
https://deepmind.google/models/evals-methodology/gemini-3-5-...
Methodology: All Gemini scores are pass @1 except where otherwise noted. "Single attempt" settings allow no majority voting or parallel test-time compute. All of the results are all run with the Gemini API for the model-id gemini-3.5-flash with default sampling settings unless indicated otherwise below. To reduce variance, we average over multiple trials for smaller benchmarks.
All the results for non-Gemini models are sourced from providers' self reported numbers unless otherwise mentioned below. For Claude Opus 4.7, Sonnet 4.6, and GPT-5.5 we default to reporting maximum thinking/reasoning settings available, but when reported results are not available we use best available reasoning results.
sheept 5 hours ago

It highlights the Gemini models blue since that's what the article is about. The bar heights seem consistent with the values.
data-ottawa 4 hours ago

I think 3.5 flash is trying to target agentic work, like Google Search or ADK (agent development kit) use cases.
It’s something cheap enough you’d put out in front of your customers, and Opus is expensive enough you wouldn’t.
gb2d_hn 4 hours ago

It's honest - people who know what they are looking at will take speed and token costs into account. I don't use Gemini 3.5 for coding, but I use it as something in between a search engine and agent.

revolvingthrow 3 hours ago

People using google’s models: am I holding it wrong or are the guardrails really overtuned?

I had the dubious pleasure of testing gemini of late and I kept running into refusals. How do I transfer a sim number from one provider to another? No. What should I consider when making backups on ntfs less prone to data loss and more bitrot resistant? No. Evaluate this piece of code? No.

I’m not sure if it’s cold feet from the mythos situation or what, but it reminds me of the dark days where you couldn’t use ai for much of anything. But then I go to chatgpt 5.5 and it does mostly everything I want outside of the usual cybersecurity boogeyman that you run into now and then.

Chu4eeno 3 hours ago

I've always found all versions of gemini to be (for lack of a better word) lazy.
I guess it's economical wrt. token use, but it often either refuses for absurd safety reasons or does other weird stuff, like responding that an LLM like itself isn't a suitable tool for the job, and it gives up very quickly.
Claude is on the other end of the spectrum, which makes it more noticeable when switching between them.
sva_ 2 hours ago

Interesting. I have the Google AI Pro plan and use Gemini several times each day and I don't remember the last time I got a refusal. I wonder what criteria go into that, like maybe how they rate your Google account?
dekhn 1 hour ago

If I type your first query into Gemini, it immediately spits out a long and plausible answer.
What exactly are you saying it's refusing? Can you give a screenshot or example?
kordlessagain 3 hours ago

I love antigravity. I’ve had zero issues with it.
k8sToGo 3 hours ago

The context window size is also very small if you use Gemini in the app. It starts to forget quite fast. In my opinion Gemini in the app is useless, in addition to the guardrails.
nout 3 hours ago

I just asked gemini the question about the sim number and it gave me a full step-by-step guide.
WarmWash 2 hours ago
Are you outside the US?
- esperent 2 hours ago
  
  I'm outside the US, use Gemini models quite a bit, and I've never run into any refusals of any kind. I'm using them for a fairly wide range of things, I'm sure at least as risqué as asking how to transfer a sim. As a matter of fact I actually asked its advice on how to transfer banking apps and auth apps from one phone to another about 3 weeks ago and got decent answers.
  
TacticalCoder 1 hour ago

> People using google’s models: am I holding it wrong or are the guardrails really overtuned?
They are quite insane. I was asking it to list candidate metal parts I could buy at a hardware store to add weight to 3D prints: stuff like angle brackets etc.
I wanted to know about bang for the buck and ease of insertion (at print time) / modelling in a 3D model.
Complete refusal as if I was a terrorist building a bomb.
Then there are the weird refusals that then are OK after all if you insist by asking it what's wrong about it:
"How should I cook eggs?"
"I'm sorry but I can't help you with that" (it formulates it differently but that's the idea)
"What, I'm just hungry, is explaining to me how to cook eggs really against your rules?"
And then it answers "No of course not, here's how to do it:..."
Really strange stuff.

airstrike 5 hours ago

Computer use is such a terrible idea. It's slow, insecure, error prone, expensive.

I guess if you're trying to get people to tokenmaxx it may look like a valid strategy, but ain't no way this will be delightful to users.

I think it's a symptom of just not understanding how LLMs should interface with the OS because we're still in their early days.

Eventually there'll be an iPhone moment for the ergonomics of LLM usage outside of coding

gdudeman 3 hours ago
Computer use is a great idea. It gets the job done when nothing else will.
If you're a person trying to get their job done at a big company, but half your job is in 1-2 proprietary tools or is stuck behind an API you can't program against, computer use can allow you, a non-techie, to do your job more efficiently.
I think it's an awesome way to circumvent gate keepers and the IT department to let people accomplish their goals.
- Rebelgecko 2 hours ago
  
  I think there's a sweet spot- a lot of the time you're probably better off with "reverse engineer this web page and build me an API or personalized chrome extension to meet my needs".
  I have an agent doing price checks for me for an item on a certain website. Instead of blasting through a zillion tokens processing the DOM over and over, it loaded the page once and figured out how to download a json with the price.
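
A rough Python sketch of that "load once, then hit the JSON directly" pattern is below. Everything specific in it is an assumption for illustration: the endpoint URL is a placeholder, and the `("product", "price")` key path is made up, since the real site and its payload shape are not given here.

```python
import json
import urllib.request


def extract_price(payload: str, path=("product", "price")) -> float:
    """Walk a key path through a JSON document and return the price.

    The default path is hypothetical; a real site will have its own layout.
    """
    node = json.loads(payload)
    for key in path:
        node = node[key]
    return float(node)


def fetch_price(url: str, path=("product", "price")) -> float:
    """Download the JSON the page itself loads and pull the price out of it."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return extract_price(resp.read().decode("utf-8"), path)


if __name__ == "__main__":
    # Offline demo with a sample payload; fetch_price("https://example.com/...")
    # would be the networked equivalent against a placeholder URL.
    sample = '{"product": {"name": "widget", "price": "19.99"}}'
    print(extract_price(sample))
```

Once the agent has discovered the endpoint, a script like this can run on a schedule with no model in the loop at all, which is where the token savings come from.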
- reacharavindh 2 hours ago
  
  How are folks using “computer use” to click things on intranet portals that are behind an SSO? Even the OP's example just shows visiting a URL and entering a search term… that is sort of useless.
  How can I automate things behind an SSO wall? Even if it means I manually authorize it once and watch it do things on its own..
  
- airstrike 2 hours ago
  
  That is an incredibly niche use case and comes with a boatload of footguns.
  Even then, an AI writing AHK scripts likely outperforms.
- uejfiweun 3 hours ago
  
  Yeah, it's not that computer use is the most theoretically optimal paradigm, but there's a reasonable case that, given the constraints of modern software systems and how they're built, it's the most realistically optimal paradigm.
thorum 4 hours ago
The “correct”, elegant way for AI to interact with existing software would take decades and billions of dollars to build. Someone would have to do the hard work of building new APIs, solving decades of accessibility issues, etc.
Or you can show an AI screenshots and ask it where to click.
- sarreph 4 hours ago
  
  I disagree if your application is networked. Most SaaS is built on RESTful APIs that can be converted trivially into interfaces / contracts for tool use.
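
As a minimal sketch of what that conversion could look like, the Python below maps a REST endpoint description onto a tool definition with JSON Schema parameters. The input format, the `endpoint_to_tool` helper, and the example endpoint are all hypothetical; the output shape (name, description, parameters) is loosely modeled on common function-calling conventions and is not any specific vendor's format.

```python
def endpoint_to_tool(name: str, method: str, path: str, params: dict) -> dict:
    """Build a tool definition from a REST endpoint description.

    `params` maps a parameter name to (json_type, description, required).
    This input format is an assumption made up for the example.
    """
    properties = {
        pname: {"type": ptype, "description": pdesc}
        for pname, (ptype, pdesc, _required) in params.items()
    }
    required = [pname for pname, (_t, _d, req) in params.items() if req]
    return {
        "name": name,
        "description": f"{method} {path}",
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


if __name__ == "__main__":
    tool = endpoint_to_tool(
        "list_invoices",
        "GET",
        "/v1/invoices",  # hypothetical endpoint
        {
            "customer_id": ("string", "Customer to filter by", True),
            "limit": ("integer", "Maximum number of results", False),
        },
    )
    print(tool)
```

The mapping itself is mechanical, which supports the "trivially" part; auth, pagination, and rate limits are the pieces a sketch like this leaves out.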
  
- jubilanti 3 hours ago
  
  it takes decades and billions of dollars to develop APIs?
orbital-decay 3 hours ago
Spreadsheet is such a terrible idea. It may look like a valid tool, but ain't no way it's delightful to users. Most of the time people need a database instead. Eventually there'll be an iPhone moment for this.
Meanwhile, the entire world economy:
- airstrike 2 hours ago
  
  I mean, your words not mine. You can't just claim I'm making a point I didn't.
  Spreadsheets are fucking glorious, powerful, clever, amazing and delightful, in my view.
dyauspitr 2 hours ago

We shouldn’t optimize for token use. We should build infrastructure to make tokens dirt cheap instead.
api 4 hours ago
It's great for testing and QA automation for UIs. It's also possibly good for the vision impaired.
- orbital-decay 3 hours ago
  
  UI QA only works well if your model plausibly matches the average user behavior and/or real-world edge cases. These models are far from that, and they are much less random than you'd like them to be for fuzzing (mode collapse).
  
nzach 4 hours ago
> Computer use is such a terrible idea. It's slow, insecure, error prone, expensive.
And yet having an agent able to use a computer on your behalf is really useful.
Recently I gave a NixOS VM to my hermes agent and it has been a good experience. I don't really care if he destroys the machine, since I can just roll back to an earlier version, and for any meaningful data he creates for me I make sure he creates a repo, commits, and pushes to my private Gitea instance.
- airstrike 4 hours ago
  
  > And yet having an agent able to use a computer on your behalf is really useful.
  It is, but there's no need for it to be viewing your screen, browsing websites and watching ads.
  That stuff is for humans, not for LLMs.
  
- dbbk 3 hours ago
  
  > And yet having an agent able to use a computer on your behalf is really useful.
  I honestly cannot think of a single use case
  

fridder 3 hours ago

I wonder if it will be better at building TUIs. It has been absolutely abysmal at interacting with them and building them.

chatmasta 3 hours ago
Claude can build UI but it sucks at testing it and iterating on it. Fable showed some improvements in this regard but alas.
- Chu4eeno 2 hours ago
  
  It seems to do it just fine in desktop applications using Qt, fwiw. It leverages all the standard Qt GUI testing stuff (and if you have the money you can just integrate Squish, which has LLM support now).

beastman82 5 hours ago

No UI like their competitors Claude CoWork or Codex. This is vaporware

knollimar 4 hours ago

Where is 3.5 pro?

squidbeak 2 hours ago
Google said June, and all its model updates seem to be on Tuesdays, Wednesdays or Thursdays. So unless the release is slipping, either tomorrow or Tuesday.
- WarmWash 2 hours ago
  
  Rumor is now July, although preliminary A/B tests people are getting show promise with whatever they have right now.

zuzululu 4 hours ago

performance is quite impressive given that it's 3x cheaper than 5.5

SoMomentary 1 hour ago

The speed was impressive when I tested it but unfortunately the accuracy left a lot to be desired. Be interesting to do the math on some of my normal workflows to see where the break even is between them, assuming the tasks you have can tolerate a couple of failures.

villgax 4 hours ago

Will it skip Ads lol

humblyCrazy 4 hours ago
I looked at their demo and it does not
- chatmasta 3 hours ago
  
  Better question might be will it skip recaptcha?
  
