GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

16 hours ago (arxiv.org)

Click coordinates. Agentic GUI work is really annoying when the multimodal agent cannot click on x,y coordinates.

I tested Qwen3.6, Gemma4, Nemotron3-nano-omni. They fully hallucinate x,y coords. (did not try GLM-5V yet)

GPT-5.5 can easily do it. But also Vocaela, a tiny 500M model, is quite good at it. Hope they improve the training for x,y clicking soon on the smallish multi-modals.

Recently slopped an HTTP service together just so my local models can click, instead of relying on all the wild ways agents currently hack into the browser (browser-use, browser-harness, agent-browser, dev-browser, etc.) https://github.com/julius/vocaela-click-coords-http
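A service like that can be tiny. Below is a minimal sketch (not the linked repo's actual API; endpoint shape, screen size, and the stubbed click backend are all assumptions) using only the standard library: it accepts a POST with `{"x": ..., "y": ...}`, clamps the coordinates to the screen, and is where a real backend such as `pyautogui.click(x, y)` or xdotool would fire:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def parse_click(body: bytes, screen_w: int = 1920, screen_h: int = 1080):
    """Validate a {"x": ..., "y": ...} payload and clamp it to the screen."""
    payload = json.loads(body)
    x, y = int(payload["x"]), int(payload["y"])
    return min(max(x, 0), screen_w - 1), min(max(y, 0), screen_h - 1)

class ClickHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            x, y = parse_click(self.rfile.read(length))
        except (ValueError, KeyError):
            self.send_response(400)
            self.end_headers()
            return
        # A real backend (pyautogui.click(x, y), xdotool, CDP, ...) would
        # perform the click here; this sketch just acknowledges it.
        self.send_response(200)
        self.end_headers()
        self.wfile.write(json.dumps({"clicked": [x, y]}).encode())

# To serve:
# HTTPServer(("127.0.0.1", 8080), ClickHandler).serve_forever()
```

The clamping matters in practice, since models that "fully hallucinate" coordinates will happily return values off-screen.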

  • Qwen3.5 is able to output click coordinates and bounding boxes just fine, as values normalized to 0..1000. I’d hope Qwen3.6 didn’t lose this capability.
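    For reference, mapping those 0..1000-normalized values back onto an actual screenshot is a one-liner (a sketch; the exact output format the model emits varies by version):

```python
def denorm(nx: float, ny: float, width: int, height: int) -> tuple[int, int]:
    """Map coordinates normalized to 0..1000 onto real pixel dimensions."""
    return round(nx / 1000 * width), round(ny / 1000 * height)

# e.g. a model-reported point of (500, 250) on a 1920x1080 screenshot
x, y = denorm(500, 250, 1920, 1080)  # -> (960, 270)
```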

  • This sounds a lot like another Hacker News post from the last few days. It's the same problem image generators have with a prompt like "produce the numbers 1-50 in a spiral pattern": they can't count properly. But if you break it into a raster/vector pipeline, where you have it first produce the visual content and then an SVG overlay, it's completely capable.

    Have you tried doing a two step: review the image, then render a vector?

    • Maybe there is a smart trick to get them to do the right thing, but the things I tried did not work.

      At one point I had some smaller model draw bounding boxes around everything that looked interactable, with labels like "e3" ... then asked the model to tell me "click on e3". It did not work; in my tests it was pretty much as bad as raw x,y.
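      The bookkeeping side of that label-then-click idea (similar to what the literature calls Set-of-Marks prompting) is simple once a detector has produced boxes; a hypothetical sketch, with all names my own:

```python
def label_boxes(boxes):
    """Assign ids e1, e2, ... to detected interactable boxes (x0, y0, x1, y1)."""
    return {f"e{i}": box for i, box in enumerate(boxes, start=1)}

def resolve_click(answer: str, labeled):
    """Map a model answer like 'click on e3' back to that box's center."""
    for label, (x0, y0, x1, y1) in labeled.items():
        if label in answer.split():
            return (x0 + x1) // 2, (y0 + y1) // 2
    raise ValueError("no known label in answer")

labeled = label_boxes([(10, 10, 50, 30), (60, 10, 120, 30), (10, 40, 50, 60)])
resolve_click("click on e3", labeled)  # -> (30, 50)
```

      The failure mode the parent describes is upstream of this code: the model picks the wrong label, which no amount of bookkeeping fixes.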


GLM-5V-Turbo is a model I wanted to like due to its speed and API reliability, but it didn't perform well in our coding and reasoning testing. More recent open-source models have made it obsolete. GLM 5.1 is so many light years ahead of it on everything except speed that I'm not sure why it's still being served.

Comprehensive evaluation results at https://gertlabs.com/rankings

  • >but it didn't perform well in our coding and reasoning testing

    >Comprehensive evaluation results at https://gertlabs.com/rankings

    But if you go to the linked site, it seems like the only thing that's part of the evaluation is how well the models play various games? I suppose that counts as "reasoning", but I don't see how coding ability is tested?

    • "Games" is loosely defined here, as we run the bench across hundreds of unique environments. For some, the models write code to play a game, either one-shot or via a harness where they can iterate and use tools. Some they play directly, making a decision on each game tick. Some are real-time, giving the models a harness where they can write code handlers or submit decisions to interact with the environment directly.

      Coding is what we test for most heavily. Testing this via a game format (instead of correct/incorrect answers) allows us to score code objectively, scale to smarter models, and directly compare performance to other models. When we built the first iteration last year, I was surprised by how well it mapped to subjective experience with using models for coding. Games really are great for measuring intelligence.
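      A decision-per-tick harness of the kind described above can be sketched in a few lines. This is not the benchmark's actual code, just an illustration of why game scoring is objective: the environment, not a grader, assigns the score; `policy` stands in for a model call:

```python
import random

def run_episode(policy, ticks: int = 20, seed: int = 0) -> int:
    """Score a policy objectively by playing a toy game tick by tick."""
    rng = random.Random(seed)
    score = 0
    for _ in range(ticks):
        target = rng.choice(["left", "right"])
        obs = {"hint": target}   # the observation the model would see
        action = policy(obs)     # one decision per tick
        score += 1 if action == target else 0
    return score

perfect = run_episode(lambda obs: obs["hint"])  # always follows the hint -> 20
```

      The same loop scales to harder environments by making the observation richer and the hint less direct, which is roughly how such a bench can keep up with smarter models.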

  • This may be a strange request, but is it at all possible to include Cursor's Composer models in your tests?

    • I am curious about the model, but for the most part, we have access to the same models that you do and only test models with standalone API releases.

  • I think the point is to use them both, with GLM 5.1 delegating vision tasks to GLM-5V-Turbo.

We just migrated an AI agent from Kimi to GLM and frankly I am surprised by the results. It feels premium.

However, both Kimi and GLM can end up in doom loops so be careful how you use them. Without a proper harness the agent can easily get into some tricky situations with no escape.

We had to develop new heuristics in our cloud harness just because of this, but I am really grateful that we did, as the platform now feels more robust.
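One simple heuristic of the kind a harness might use against doom loops (my own hypothetical sketch, not the commenter's system): abort the run when the agent emits the same action several times in a row. Real harnesses would also track cycles longer than one step, token budgets, and wall-clock limits.

```python
from collections import deque

class DoomLoopGuard:
    """Flag an agent run for abort when one action repeats too many times."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.recent = deque(maxlen=max_repeats)  # sliding window of actions

    def observe(self, action: str) -> bool:
        """Record an action; return True if the run should be aborted."""
        self.recent.append(action)
        return (len(self.recent) == self.max_repeats
                and len(set(self.recent)) == 1)

guard = DoomLoopGuard(max_repeats=3)
aborted = [guard.observe(a) for a in ["ls", "cat x", "cat x", "cat x"]]
# aborted == [False, False, False, True]
```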

A small price to pay for model plug & play!

Looks like this was not an open release; the latest GLM-xV release was 4.6V, and the Turbo models were never open.

I've been using GLM pretty much exclusively for the last 6-8 months. I have access to Anthropic and OpenAI models and others, but I always keep returning to GLM. It isn't the best (sometimes I go to Codex to help it out), but overall, especially with Turbo, it is a good everyday model.

Turbo makes a huge difference in everyday use because it saves you time, and you aren't always in the mood to wait endlessly.

  • > I've been using GLM pretty much exclusively for the last 6-8 months. I have access to Anthropic and OpenAI models and others, but I always keep returning to GLM. It isn't the best

    Very interesting. What sort of tasks do you use it for, and what client do you use?

    When you want to use a custom client and a coding plan to control costs (daily use, a few hundred USD/m budget), this is the landscape:

    - Anthropic/Google: Deterring custom clients actively

    - OpenAI: Grey area.

    - Z.ai: Technically only allows clients in their (large) approved list of clients. Likely won't actively ban custom clients.

    - Moonshot: Seem to allow custom clients?

    - DeepSeek/Alibaba: No coding plans at this time

z.ai will use quantized models in off hours. Buyer beware

  • This... doesn't make sense. Why would they use a quantized model when load is low and the full model when load is high???

  • I have a subscription and I have not seen any difference in performance during on/off hours. What exactly are you basing this on?

  • I hear a lot of people complaining, but I am on their Max plan, I never hit limits, I use it non-stop, and overall it has been a fantastic experience.

    • Same feeling here on the Pro plan. I'm still on the old plan without the weekly quotas, but I've never exhausted the 5-hour quota so far.

    • Has 5.1 reliability improved? I would love to use it again. The inference was just too unreliable when it was first released.

  • I was one of the people just absolutely in misery when the GLM-5.1 model dropped. It wasn't quantized, I don't think, but it had some very gnarly issues where it would hit a certain context size, then seemingly try to quantize, and fall apart. It was unusable. It went from being an excellent model all the way to 200k context, to falling apart at only 60k (where it couldn't write in sentences and definitely couldn't tool call), to 100k, to 120k. It was terrible, and I was so sad they had made my subscription so much worse, it felt like. https://news.ycombinator.com/item?id=47677853

    But very shortly after this submission/release of 5.1, after a mass pouring out of sadnesses, they fixed it. Things have been back to absolutely amazing. I joined right before 4.7, and 4.7 was incredible. 5.0 was fantastic. 5.1 has been a dream. GPT still catches a lot of stuff and is smarter, but man, GLM-5.1 is so capable, and it's frankly often a better writer, often better understands and captures purpose and notion, where-as GPT often feels dry and focused on narrow technicals. I really appreciate GLM-5.1.

    And I'm really glad Z.ai fixed the absurd damage they had in their systems. I do suspect they were trying to dynamically quantize as the context window grew, or some such trickery. It was not working at all, but somehow it took months to fix.