Comment by mkagenius

3 days ago

VLMs are great - I have been able to use them for a similar project too [1]. And they're only going to get better. Congratulations on the product launch! Which VLM are you using for this?

[1] A framework to use/control mobile phones via any LLM - https://github.com/BandarLabs/clickclickclick

We fine-tune our own VLMs for this. Unfortunately, we'd prefer not to share which ones we use specifically! ClickClickClick looks awesome. Have you heard of Ferret-UI (https://arxiv.org/pdf/2404.05719)? Pretty similar idea.

  • Yes, I tried a similar one called "OmniParser" - the issue was that it sometimes failed to annotate some UI elements. Gemini and Molmo, on the other hand, worked right out of the box without needing any fine-tuning.

I'm surprised you named your framework clickclickclick instead of taptaptap.