Google’s new agent controls web apps through clicks and keystrokes, arriving days after rivals with a narrower scope and speed-first pitch.
OpenAI and Anthropic are pushing agents that operate whole computers. Google just shipped one that only runs the browser—and says that’s exactly the point. The company released Gemini 2.5 Computer Use on October 7, positioning it as faster and more accurate on web tasks even if it can’t manage your desktop. That’s the tension.
What’s actually new
Gemini 2.5 Computer Use is a specialized version of Google’s flagship model that sees screenshots, reasons about on-screen elements and decides the next move. Developers send a goal, a screenshot and the recent action history; the model replies with a function call: click here, type there, scroll down. Client code performs the step, grabs a fresh screenshot and the loop repeats until the task completes—or a safety rule intervenes. It’s a tight loop by design.
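In pseudocode, that loop is simple to picture. The sketch below simulates it with hypothetical stand-in functions (`propose_action`, `execute_action`); the real client, request shape, and function-call format come from Google's AI Studio and Vertex AI docs, not from this example.

```python
# A minimal sketch of the screenshot -> reason -> act loop described above.
# Every helper here is a hypothetical stand-in, not the actual Gemini API.

def propose_action(goal, screenshot, history):
    """Stand-in for the model call: returns the next UI action as a dict."""
    # A real implementation would send goal + screenshot + history to the
    # model and parse the function call it returns.
    if "search box" not in history:
        return {"name": "click_at", "x": 640, "y": 120, "note": "search box"}
    if "typed query" not in history:
        return {"name": "type_text", "text": goal, "note": "typed query"}
    return {"name": "done", "note": "done"}

def execute_action(action):
    """Stand-in for client code that drives the browser (e.g. via Playwright)."""
    return f"executed {action['name']}"

def run_agent(goal, max_steps=10):
    history = []
    for _ in range(max_steps):
        screenshot = b"...png bytes..."   # would be a fresh capture each turn
        action = propose_action(goal, screenshot, history)
        if action["name"] == "done":
            break
        execute_action(action)            # client performs the step...
        history.append(action["note"])    # ...then the loop runs again on new state
    return history

print(run_agent("quarterly sales report"))
```

The key design point survives even in a toy: the model never touches the browser directly. It only proposes; client code disposes, which is what makes the per-step safety checks described later possible.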
Scope is explicit: this model is optimized for browsers. It supports a defined set of thirteen actions, from clicking coordinates and typing text to navigating URLs, triggering key combos, hovering menus, scrolling, and drag-and-drop. There’s no file manager, no desktop windows, no native apps. Just the web.
The browser bet
On paper, that looks like a handicap. Anthropic’s Claude can act across the operating system; OpenAI is extending ChatGPT’s reach into desktop apps. Google’s agent won’t rearrange your folders or tweak a spreadsheet in a native client. That’s a real gap.
But most enterprise work already happens in the browser: CRMs, ERPs, ticketing queues, payroll portals, procurement flows, expense tools. If you can reliably automate what screens demand—forms, filters, checkboxes, dropdowns—you cover the majority of repetitive digital labor without the unpredictability of OS-level control. Focus reduces blast radius. It also reduces weird edge cases.
Evidence vs. claims
Google says the browser-only focus yields better quality at lower latency on multiple web and mobile benchmarks, including Online-Mind2Web and WebVoyager. The company hasn’t published full scorecards, but it did ship with a public demo and references to internal deployments. Benchmarks are helpful. Real usage is better.
Internally, versions of this agent have been used for UI testing in Project Mariner, Firebase’s testing tools and Search’s AI Mode. Google’s payments team reports the model now “rehabilitates” a majority of flaky end-to-end UI tests that previously took days to fix. That suggests the loop is robust under brittle, real-world pages. It’s a meaningful claim.
External testers provide early signals too. Poke.com says it picked this model for speed; Autotab cites better reliability when parsing complex context. These are vendor-adjacent anecdotes, not neutral trials, but they align with the product thesis: do less, faster.
Safety architecture, not slogans
Agents that click buttons on the open web can also click the wrong ones. Google wired in two layers of friction. First, a per-step safety service reviews each proposed action. If the action is low-risk, client code can execute it. If the action crosses a threshold—think purchases, cookie banners, risky prompts—the model returns “requires confirmation,” and the developer must ask the user to approve before proceeding. Bypassing that prompt is prohibited in the terms.
Second, developers can set guardrails: exclude specific actions entirely, constrain navigation with allowlists, or add their own custom functions that require explicit consent. Observability and logging are encouraged, as is running the agent in a sandboxed browser profile or VM. Safety is a workflow, not a filter. Keep humans in the loop.
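Those guardrails are developer-side policy, not model behavior. A rough sketch of what that policy layer might look like, under the assumptions in the text (an action denylist, a navigation allowlist, and a confirmation gate); names like `EXCLUDED_ACTIONS` and the `"requires_confirmation"` flag shape are illustrative, not the actual API:

```python
# Illustrative guardrail check run by client code before executing any
# action the model proposes. All names and data shapes are assumptions.
from urllib.parse import urlparse

EXCLUDED_ACTIONS = {"drag_and_drop"}   # actions this app never allows
ALLOWED_HOSTS = {"crm.example.com"}    # hypothetical navigation allowlist
CONFIRM_ACTIONS = {"click_purchase"}   # steps that always need a human yes

def vet_action(action):
    """Return 'execute', 'confirm', or 'block' for a proposed action."""
    name = action["name"]
    if name in EXCLUDED_ACTIONS:
        return "block"
    if name == "navigate":
        host = urlparse(action["url"]).hostname
        if host not in ALLOWED_HOSTS:
            return "block"
    # Honor both app policy and a safety flag returned with the action.
    if name in CONFIRM_ACTIONS or action.get("safety") == "requires_confirmation":
        return "confirm"   # surface to the user before proceeding
    return "execute"

print(vet_action({"name": "navigate", "url": "https://crm.example.com/deals"}))  # execute
print(vet_action({"name": "navigate", "url": "https://evil.example.net"}))       # block
print(vet_action({"name": "click_at", "safety": "requires_confirmation"}))       # confirm
```

The point of the structure is that "block" and "confirm" decisions happen outside the model, in code the developer controls and can log, which is what the sandboxing and observability advice is protecting.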
One headline risk already produced confusion: CAPTCHAs. A widely shared demo appeared to show Gemini solving a Google CAPTCHA before running a search; the author later retracted the claim after learning the demo host, Browserbase, solved it under the hood. The model waited; the infrastructure clicked. That matters. CAPTCHA-busting would signal both impressive capability and clear misuse potential. This isn’t that.
Access and economics
Availability is straightforward: public preview via Google AI Studio and Vertex AI. Pricing matches Gemini 2.5 Pro's token rates, but there's no free tier during preview. That means teams pay from the first token. For many enterprises, that's fine; for hobbyists, it's a speed bump.
The credibility test
Three things will decide whether this becomes a template or a cul-de-sac. First, independent benchmarks. Google’s “outperforms leading alternatives” needs third-party replication with hardware, prompts and harnesses spelled out. Without that, it’s marketing.
Second, adoption outside Google’s walls. Internal teams had months to tune prompts and build guardrails; external developers start with docs and a demo environment. If the experience is rough, momentum will stall—even if the core loop is strong.
Third, enterprise traction. Browser automation maps neatly to compliance-heavy workflows that hate surprises. If agents can complete a purchase order without clicking the wrong radio button, they’ll earn trust. If they misfile or over-click, they won’t. Reliability is the product.
Bottom line
Google arrived later and narrower than rivals, but with a clear thesis: the web is the workbench, and speed plus restraint beats breadth plus fragility. If benchmarks and field reports confirm that the agent really does finish web tasks faster and with fewer mistakes, the browser-only constraint becomes a feature, not a flaw. If not, it’s just a missed target with good intentions. The next few weeks of third-party testing will tell. Choose focus, then prove it.
Why this matters
- Most business workflows live in the browser; reliable web automation could remove a large swath of repetitive digital work without OS-level risk.
- Safety that forces human confirmation on risky actions is more than optics; it’s how agents avoid costly mistakes in regulated environments.
❓ Frequently Asked Questions
Q: How does Gemini 2.5 Computer Use pricing compare to competitors?
A: It matches standard Gemini 2.5 Pro rates: $1.25-$2.50 per million input tokens and $10-$15 per million output tokens. The catch is there's no free tier during preview, unlike base Gemini. OpenAI and Anthropic pricing for computer use features varies by model tier, but both offer free trial access.
Q: What are the 13 actions the model can perform?
A: The model handles: open browser, wait 5 seconds, go back, go forward, search, navigate to URL, click at coordinates, hover, type text, press key combinations, scroll entire page, scroll at specific location, and drag-and-drop. Developers can exclude unwanted actions or add custom ones for mobile.
Q: Does it work on mobile devices or just desktop browsers?
A: Primarily desktop browsers, but Google tested it on mobile interfaces and reports strong performance on the AndroidWorld benchmark. Developers can adapt it for mobile by excluding browser-specific actions and adding custom mobile functions like "open_app" and "long_press_at." It's not optimized for full mobile OS control.
Q: Can developers test it before paying?
A: Browserbase hosts a free demo environment at gemini.browserbase.com where you can test the model without API access. For actual development, you need a Google AI Studio or Vertex AI account and pay from the first token—no free API tier exists during preview.
Q: What's the main difference between this and Claude or ChatGPT's computer control?
A: Scope. Claude and ChatGPT control entire operating systems—desktop apps, file management, system settings. Gemini 2.5 Computer Use only controls web browsers. Google claims this narrower focus delivers faster, more reliable results for web tasks, which cover most enterprise workflows anyway.