On Monday afternoon, Mira Murati's company put a stopwatch inside the chatbot. Thinking Machines Lab published a research preview of an interaction model, a system built around 200ms slices of input and output rather than the old sequence of prompt, wait, answer. Ten months earlier, investors had priced the lab at $12 billion after a $2 billion seed round. The first model preview shows what that money is chasing: conversation timing as a model problem.

The launch makes the interface layer part of model architecture. Thinking Machines is arguing that collaboration quality has to be trained into the model itself, across audio, video, and text, instead of assembled around a text model with speech recognition, turn detection, and synthetic voice. That is a product claim wrapped in a research post.

The wrapper becomes the target

Thinking Machines describes current systems as models that experience reality "in a single thread." The user speaks or types, the model waits, and then the model replies while its view of the world freezes. The company says that setup leaves "no room" for the human to stay in the loop during real work, where instructions change, mistakes surface late, and feedback often arrives before a sentence is finished.

The proposed fix is a multi-stream model trained from scratch for interaction. Thinking Machines says its model processes 200ms of input while generating 200ms of output, then repeats the cycle. In the architecture diagram, the useful oddity is the input row: Text, Frame, Audio, with 40x40 image patches beside dMel audio signals.

OpenAI's Realtime API already streams audio through a persistent WebSocket connection and can handle interruptions. Google says Project Astra can respond "without interrupting or time lag." Implicator.ai covered Google's move when Search Live brought camera and voice conversations into search. Thinking Machines is not first to real-time voice. Its claim is narrower: the wrapper should disappear into training.

The benchmark claim arrives with caveats

The company's benchmark chart gives the preview its sharpest edge. TML-Interaction-Small scored 77.8 on FD-bench v1.5 average quality, compared with 46.8 for GPT-realtime-2.0 minimal and 54.3 for Gemini 3.1 Flash Live minimal. On simple turn-taking latency, Thinking Machines reports 0.40 seconds, against 1.18 seconds for GPT-realtime-2.0 minimal and 0.57 seconds for Gemini's minimal live preview.

Those are company-reported numbers, and the same table is not a clean sweep. On QIVD video and audio accuracy, TML-Interaction-Small reports 54.0, below GPT-realtime-2.0 minimal at 57.5 and Qwen 3.5 OMNI-plus-realtime at 59.0. The broader field shows how hard spoken benchmarks are: Scale's Audio MultiChallenge work uses 452 conversations from 47 speakers and 1,712 rubrics, and the highest-performing model in its writeup reached 54.65 percent. Speech resists clean scoring because it contains restarts, repairs, backtracking, and speakers talking over one another.

Thinking Machines also discloses a constraint inside its own result. TML-Interaction-Small is a 276B-parameter mixture-of-experts system with 12B active parameters. The lab says its larger pretrained models are too slow to serve in this setting, even as it argues that interaction should improve as models scale.

"For interactivity to scale with intelligence, it must be part of the model itself."

The line is stronger because the limitation sits nearby: the larger model cannot yet do the job fast enough.

Murati treats interaction as infrastructure

Axios reported in March that Thinking Machines committed to at least one gigawatt of Nvidia Vera Rubin systems in 2027, while the company had grown from about 30 employees to roughly 120. TechCrunch reported the $2 billion seed round and $12 billion valuation.

That spending fits a company treating interaction as infrastructure. OpenAI sees the same fight from the distribution side. Implicator.ai reported in January that OpenAI had merged audio teams and targeted a faster voice architecture because distribution was moving into phones, glasses, cars, and ambient devices.

Murati's prior work fits the direction. TechCrunch noted that before OpenAI she worked on Tesla's Model X during early Autopilot releases and at Leap Motion on hand and finger tracking. Both jobs sat near the same boundary Thinking Machines is now trying to model: watching human action before it becomes a clean command.

Thinking Machines' earlier talent story still belongs in the frame. Meta hired co-founder Andrew Tulloch after the lab raised billions and shipped Tinker. The company can buy compute; productizing a new interaction regime still depends on people who know how model behavior fails under latency, noise, and user impatience.

The release test is practical

Users have seen the demo version of this story before. OpenAI showed GPT-4o answering audio in as little as 232ms with a 320ms average. Google launched Gemini Live with 10 voices and interruption support. Project Astra can interpret a camera view and is being tested with blind and low-vision users.

Thinking Machines is setting itself a harder test than novelty. The research preview is still closed, and the wider release is promised later this year. Its public demos include listening for animal mentions in a story, translating speech in real time, and telling a user when they are slouching.

The next test is an ordinary work session: code on screen, hesitation in the user's voice, search or tool use running in the background, and a correction arriving before the model has finished speaking. Thinking Machines says the interaction model can remain present while a background model handles slower reasoning.
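That division of labor, a fast foreground model that keeps responding while a slower model reasons in the background, can be sketched with ordinary threading. Everything here is an assumption about the pattern, not Thinking Machines' design: the function names, the queue handoff, and the timings are all invented for illustration.

```python
import queue
import threading
import time


def slow_reasoner(task, results):
    """Stand-in for the background model: slower, deeper work such as
    search, tool use, or code analysis."""
    time.sleep(0.05)                    # placeholder for heavy reasoning
    results.put(f"answer:{task}")


def interaction_session(chunks, cadence_s=0.1):
    """Stand-in for the foreground interaction model: it replies every
    cycle, and folds in the background result whenever it arrives."""
    results = queue.Queue()
    threading.Thread(
        target=slow_reasoner, args=("task", results), daemon=True
    ).start()
    replies = []
    for chunk in chunks:
        try:
            replies.append(results.get_nowait())   # deep answer is ready
        except queue.Empty:
            replies.append(f"ack:{chunk}")         # stay responsive meanwhile
        time.sleep(cadence_s)                      # placeholder for the 200ms cycle
    return replies
```

The design choice the sketch highlights: the foreground loop never blocks on the reasoner, so a correction arriving mid-sentence is handled at conversational speed even while the slow work continues.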

Murati's preview starts with 200ms chunks. The product test starts when a user changes their mind mid-sentence.

Frequently Asked Questions

What is an interaction model?

Thinking Machines uses the term for a model trained to handle real-time exchange across audio, video, and text, rather than waiting for a full user turn before responding.

How is this different from existing voice assistants?

OpenAI and Google already support real-time voice features. Thinking Machines claims the key behaviors should be native to the model, not added by external turn-detection and voice layers.

What does the 200ms figure mean?

The preview processes 200ms chunks of input while generating 200ms chunks of output, letting the model treat overlap, silence, and interruption as context.

Are the benchmark results independently verified?

The cited FD-bench and latency numbers are company-reported. The article treats them as useful signals, not independent proof of market leadership.

When can users try it?

Thinking Machines says the research preview is closed, with limited access in the coming months and a wider release planned later this year.

Editor-in-Chief and founder of Implicator.ai. Former ARD correspondent and senior broadcast journalist with 10+ years covering tech. Writes daily briefings on policy and market developments. Based in San Francisco. E-mail: [email protected]