Google on Thursday released Gemini 3.1 Flash Live, a voice and audio model that the company says outperforms all previous Gemini audio models on three independent benchmarks. The model scores 90.8% on ComplexFuncBench Audio, a test measuring multi-step function calling through spoken commands, and leads Scale AI's Audio MultiChallenge with 36.1% on complex instruction-following during the interruptions and half-sentences of natural speech. Google is deploying it across developer tools, enterprise products, and consumer-facing services including a global expansion of Search Live to more than 200 countries.
The release lands on the same day that Cohere open-sourced a speech recognition model and Mistral shipped an open-source text-to-speech system. Three companies, three voice models, one Thursday. The companies building AI infrastructure now treat voice as a default interface, not something to experiment with later.
Key Takeaways
- Gemini 3.1 Flash Live scores 90.8% on ComplexFuncBench Audio and leads Scale AI's Audio MultiChallenge at 36.1%.
- Search Live expands to 200+ countries with multilingual voice and camera search powered by the new model.
- Cohere and Mistral shipped competing open-source voice models the same day, intensifying the voice AI infrastructure race.
- Enterprise customers including Kroger, Verizon, and The Home Depot are deploying Google voice agents in production.
What the benchmarks actually measure
The two headline numbers describe different problems. ComplexFuncBench Audio, originally a text-based evaluation from the ZAI research group, tests whether a model can execute a sequence of interdependent function calls. Book a hotel. Arrange a car to the airport. Adjust both when the flight changes. Google synthesized audio for each prompt and ran its realtime API against the published scoring apparatus. A score of 90.8% means the model completed nearly all multi-step chains without dropping a step or inventing a constraint.
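The interdependence is the hard part: each call consumes output from the one before it, so a single dropped step poisons the rest of the chain. A minimal sketch of that shape, with illustrative function names that are not taken from the benchmark itself:

```python
# Hypothetical sketch of the interdependent call chains that
# ComplexFuncBench-style evaluations score. All function names
# and fields here are illustrative, not from the benchmark.

def book_hotel(city, check_in):
    return {"hotel_id": "H1", "city": city, "check_in": check_in}

def book_car(pickup_time, destination):
    return {"car_id": "C1", "pickup_time": pickup_time, "destination": destination}

def reschedule(booking, new_time):
    # Update whichever time field this booking carries.
    updated = dict(booking)
    updated["check_in" if "check_in" in booking else "pickup_time"] = new_time
    return updated

# Steps 1 and 2: the car booking depends on state from the hotel booking.
hotel = book_hotel("Lisbon", "2026-03-10")
car = book_car("2026-03-10T07:00", hotel["city"])

# Step 3: the flight moves, so BOTH earlier bookings must be adjusted.
# A model that updates one and forgets the other fails the whole chain.
hotel = reschedule(hotel, "2026-03-11")
car = reschedule(car, "2026-03-11T07:00")
```

The benchmark's spoken version adds a layer: the model has to extract these constraints from synthesized audio before it can chain the calls at all.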
Audio MultiChallenge, built by Scale AI, aims at something harder. It throws long conversations at the model with corrections, digressions, and the kind of sentences people start but never finish. The 36.1% score sounds low until you read the benchmark's design: it punishes any failure to maintain coherence across extended dialogue where a human speaker changes direction mid-thought. Google enabled "thinking" mode for this test, letting the model reason before responding. Not how most voice assistants work in practice.
A third benchmark, BigBenchAudio, evaluated comprehension across 1,000 recordings spanning speech understanding, accent identification, and environmental sound recognition. Google showed bar graph comparisons suggesting improvement over prior versions but did not publish a standalone score.
The model card from DeepMind fills in the architecture: 3.1 Flash Live builds on Gemini 3 Pro, accepts audio, images, video, and text within a 128,000-token context window, and generates up to 64,000 tokens of audio or text output. Every piece of audio it produces carries a SynthID watermark, Google's steganographic system for flagging AI-generated content.
Enterprise customers are already in the building
Google is not positioning 3.1 Flash Live as a research preview. The model ships directly into Gemini Enterprise for Customer Experience, the product Google Cloud showed off at NRF in January. Kroger signed on. Lowe's and Papa Johns followed. The Home Depot and Woolworths too. What these companies share is phone volume: millions of calls a month where a human picks up, asks the same fifteen questions, and either solves the problem or transfers the caller. The platform replaces that loop with voice agents that take food orders, walk customers through returns, and close purchases at in-store kiosks. Real environments with real background noise.
Google says 3.1 Flash Live recognizes acoustic signals like pitch and pace better than 2.5 Flash Native Audio, the model it replaces. In a customer support call, the system can theoretically detect when a caller's voice tightens with frustration and adjust its tone before the situation escalates. Darshan Kantak, VP of Applied AI at Google Cloud, described the goal at NRF as "combining the best of Google Cloud's AI and infrastructure with a business's own institutional intelligence." The implication is clear: Google wants its model to sound less like a chatbot and more like a company's best employee.
Verizon, another named partner, already runs Google's earlier voice models across 28,000 customer care representatives and reports 96% accuracy in agent assistance. In April 2025, CEO Hans Vestberg predicted that AI-assisted routing would help retain 100,000 subscribers by identifying why they were calling and connecting them with the right representative. The Home Depot is testing voice agents in select retail locations. LiveKit, the infrastructure company that built the backend for ChatGPT's voice mode, lists the Gemini Live API as a supported integration and raised a $100 million Series C in January at a $1 billion valuation. Russ d'Sa, LiveKit's co-founder, wrote in January that voice AI applications "are realtime and stateful" and that "the whole stack has to be rebuilt" to support them. His bet: 2026 is the year voice AI moves from demos to deployment at scale.
The enterprise angle is where the money changes hands. Consumer voice assistants have existed for a decade, and none of them generated the kind of revenue that justifies a $1 billion valuation for an infrastructure provider. Enterprise voice agents that complete transactions, resolve billing disputes, and schedule appointments carry higher stakes. The math works differently when a dropped call costs you a customer.
But companies are more afraid of falling behind than of failing in public. That's the tell. The money is moving.
Search Live goes global, for real this time
The consumer piece of the launch centers on Search Live, Google's voice-and-camera search feature. Point your phone at something, ask a question out loud, get a spoken answer. Google says it now works in more than 200 countries and territories, powered by 3.1 Flash Live's multilingual engine.
Here's the thing about that timing. On March 18, Google told Engadget that Search Live was expanding globally, then retracted the announcement hours later, saying the feature "remains available in the US and India, with testing currently underway in additional markets." Eight days and one embarrassing retraction later, the expansion went live for real.
Outside the U.S., India got there first. English and Hindi launched initially; Bengali, Tamil, Telugu, and four other regional languages stacked on top. Nine languages where a year ago there was one. The global rollout puts Google's multilingual claims under immediate pressure, because a voice assistant that stumbles over Urdu conjugation in Lahore will generate complaints faster than any benchmark can predict.
In Gemini Live, Google's separate conversational AI mode, the new model doubles the length of sustained conversation compared to its predecessor. The company also claims it can detect frustration and confusion in a speaker's voice and shift responses to match. Whether that sensitivity holds up across dozens of languages and accents is another question entirely. Press releases are not user testing.
The competition is not waiting around
Google was not the only company shipping on March 26. Cohere released Transcribe, a 2-billion-parameter open-source speech recognition model under Apache 2.0. It supports 14 languages and tops the Hugging Face Open ASR leaderboard with a word error rate of 5.42, processing 525 minutes of audio per minute of compute time. Cohere offers it free through its API and plans to plug it into North, its enterprise agent platform. The company reportedly generates $240 million in annual recurring revenue and has hinted at an IPO.
Mistral released Voxtral TTS, an open-source text-to-speech model built on Ministral 3B. It supports nine languages, clones a voice from under five seconds of sample audio, and reaches a time-to-first-audio of 90 milliseconds on a 500-character input. Pierre Stock, Mistral's VP of science operations, said the team designed it to run on a smartwatch. "The cost of it is a fraction of anything else on the market, but it offers state-of-the-art performance," Stock told TechCrunch. Earlier this year Mistral launched transcription models, giving it a full voice stack that competes with Google's integrated offering.
The competitive picture extends beyond same-day launches. ElevenLabs, valued at $3.3 billion after a $180 million Series C in January, released a speech-to-text model that outperforms Google's Gemini 2.0 Flash and OpenAI's Whisper Large V3 across 99 languages. On the AA-WER benchmark from Artificial Analysis, ElevenLabs' Scribe v2 leads with a 2.3% word error rate, followed by Gemini 3 Pro at 2.9% and Mistral's Voxtral Small at 3.0%. OpenAI's Whisper Large V3 lands mid-pack at 4.2%.
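Word error rate, the metric behind all of these leaderboard numbers, is just word-level edit distance divided by the length of the reference transcript. A minimal implementation makes the arithmetic concrete:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution
            dp[i][j] = min(dp[i - 1][j] + 1,               # deletion
                           dp[i][j - 1] + 1,               # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)

# One substituted word out of eight: 12.5% WER.
print(wer("the quick brown fox jumps over the dog",
          "the quick brown fox jumped over the dog"))  # → 0.125
```

At Scribe v2's reported 2.3%, a model gets roughly one word in 43 wrong; the gap to Whisper's 4.2% means nearly twice as many errors per transcript.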
And Bland, a voice agent platform, used the day to launch Norm, a tool that generates production-ready voice agents from conversational prompts. CEO Isaiah Granet put it bluntly: "We're not going to ever be OpenAI in the sense of general intelligence, but what we will be is phone call intelligence better than anybody else in the world."
What shifts when voice becomes infrastructure
Three weeks ago, NVIDIA's PersonaPlex 7B showed that the traditional three-model voice pipeline, speech recognition to language model to text-to-speech, could collapse into a single model running on a MacBook. Google went the other direction. Rather than shrinking the pipeline to fit on a laptop, Google buried it inside a cloud API. You send audio in, you get audio back, and you never see the three models underneath.
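The architectural contrast can be sketched in a few lines; every function below is a stub standing in for a model, not a real SDK call from either vendor:

```python
# Two shapes for the same job. All functions are placeholder stubs,
# not real SDK calls from Google, NVIDIA, or anyone else.

def speech_to_text(audio):                 # model 1: recognition
    return audio.removeprefix("audio:")

def language_model(text):                  # model 2: understanding
    return f"reply to '{text}'"

def text_to_speech(text):                  # model 3: synthesis
    return "audio:" + text

def pipeline_reply(audio_in):
    """The traditional decomposition: three explicit hops, three models."""
    return text_to_speech(language_model(speech_to_text(audio_in)))

def realtime_api_reply(audio_in):
    """The cloud-API shape: one call, the same stages hidden inside."""
    return pipeline_reply(audio_in)        # caller sees audio in, audio out

print(realtime_api_reply("audio:what time is it"))
# → audio:reply to 'what time is it'
```

Whether the stages run as one fused model on a laptop or three models behind an endpoint, the caller's contract is identical; what changes is who controls the seams.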
Different strategy. Same conclusion. Voice has split off from text-based AI entirely. It has its own models now. Its own benchmarks. Its own pricing tiers that nobody has published yet. None of the companies that shipped on March 26 called their work experimental. The keyboard and the search bar still work. But the assumption behind every product announcement this week is that fewer people will use them.
Then there is the licensing question. Cohere and Mistral ship open weights. Any company can download those models, run them on its own servers, and keep every recorded customer call inside its own walls. A hospital running voice intake does not want patient audio on someone else's infrastructure. Neither does a bank processing loan applications by phone. Google's counter: one API call handles recognition, understanding, synthesis, and watermarking together. Less wiring. More lock-in.
Google has not published pricing for 3.1 Flash Live API access. Cohere and Mistral are leading with free tiers and open weights, pressuring Google to compete on quality rather than cost alone. For an enterprise buyer evaluating voice agent platforms right now, the choice narrows to a familiar tension: one vendor's complete stack, or the freedom to wire together open-source components you control.
That decision will shape which companies own the next interface layer. And how much it costs to let your customers talk to your software instead of typing at it.
Frequently Asked Questions
What is Gemini 3.1 Flash Live?
Google's newest voice and audio AI model, built on the Gemini 3 Pro architecture. It handles real-time spoken dialogue with a 128,000-token context window and produces audio output watermarked with SynthID for AI content detection.
How does Gemini 3.1 Flash Live perform on benchmarks?
It scores 90.8% on ComplexFuncBench Audio for multi-step function calling through voice commands and leads Scale AI's Audio MultiChallenge at 36.1% for complex instruction-following during natural speech interruptions and corrections.
What is Search Live and where is it available?
Search Live is Google's voice-and-camera search feature that lets users ask questions aloud or point their phone at objects for real-time answers. It now works in more than 200 countries and territories with multilingual support.
Which companies are using Gemini 3.1 Flash Live for enterprise applications?
Kroger, Lowe's, Papa Johns, The Home Depot, and Woolworths use it through Gemini Enterprise for Customer Experience. Verizon runs Google's earlier voice models across 28,000 customer care representatives with 96% accuracy.
What competing voice models launched the same day?
Cohere released Transcribe, a 2-billion-parameter open-source speech recognition model with a 5.42 word error rate. Mistral shipped Voxtral TTS, an open-source text-to-speech model small enough to run on a smartwatch with 90ms time-to-first-audio.