Thinking Machines shows off preview of near-realtime AI voice and video conversation with new 'interaction models'

Is the era of "turn-based" AI conversation coming to an end?

Anyone who regularly uses AI models for work or in their personal life knows that the basic interaction mode across text, imagery, audio, and video remains the same: the human user provides an input, waits anywhere from milliseconds to minutes (or, for particularly difficult questions, hours or days), and the AI model provides an output.

But if AI is to truly take over jobs that require natural interaction, it will need to do more than this kind of "turn-based" interactivity – it will eventually need to react fluidly and naturally to human input, even while it is still taking in the next human input, whether text or any other format.

That, at least, appears to be the logic of Thinking Machines, a well-funded AI startup founded last year by former OpenAI chief technology officer Mira Murati and former OpenAI researcher and co-founder John Schulman.

Today, the firm announced a research preview of what it calls "interaction models," a new class of natively multimodal systems that treat interactivity as a first-class citizen of the model architecture rather than as an external software "harness." The approach delivered impressive gains on third-party benchmarks and reduced latency as a result.

However, the models are not yet available to the general public or even enterprises – the company said in its announcement blog post: "In the coming months, we will open a limited research preview to gather feedback, which will be released widely later this year."

‘Full-duplex’ simultaneous input/output processing

At the heart of this announcement is a fundamental shift in the way AI understands time and presence. Current frontier models generally perceive reality in discrete turns: they wait for the user to finish an input before starting to process it, and their perception pauses while they generate a response.

In their blog post, Thinking Machines researchers described the status quo as a limitation that forces humans to "contort themselves" to fit AI interfaces, phrasing questions like an email and batching their thoughts.

To solve this "collaboration barrier," Thinking Machines has moved away from the standard single-stream, turn-by-turn token sequence.

Instead, they use a multi-stream, micro-turn design that processes 200ms segments of input and output simultaneously.
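To make the idea concrete, here is a minimal sketch of what a micro-turn duplex loop could look like. This is purely illustrative: the `model.step()` call, which consumes one input segment and returns one output segment, is an assumption, as Thinking Machines has not published its implementation.

```python
import queue
import time

MICRO_TURN_S = 0.2  # the 200ms segment size described above

def duplex_loop(model, mic_in: queue.Queue, speaker_out: queue.Queue):
    """Consume one micro-turn of input while emitting one micro-turn of
    output, so the model never stops listening while it speaks."""
    while True:
        # Collect whatever audio/video arrived in the last 200ms window;
        # an empty segment still advances the model's sense of time.
        segment = []
        deadline = time.monotonic() + MICRO_TURN_S
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                segment.append(mic_in.get(timeout=remaining))
            except queue.Empty:
                break
        # Hypothetical API: one step ingests the segment and returns the
        # next 200ms of output (which may simply be silence).
        out = model.step(segment)
        if out is not None:
            speaker_out.put(out)
```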

it "full duplex" The architecture allows the model to hear, talk, and see in real time, enabling it to backchannel while the user is speaking or intervene when it sees visual cues—such as the user typing a bug in a code snippet or a friend entering a video frame. Technically, the model uses encoder-free initial fusion.

Instead of relying on a massive standalone encoder like Whisper for audio, the system passes the raw audio signal (as mel features) and 40×40 image patches through a lightweight embedding layer, co-training all components from scratch within the transformer.
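The post doesn't include code, but encoder-free early fusion can be sketched roughly like this: audio frames and flattened 40×40 image patches each pass through a single linear projection into a shared token space, rather than through a separate pretrained encoder. All dimensions below except the 40×40 patch size are invented for illustration.

```python
import torch
import torch.nn as nn

D_MODEL = 1024    # assumed transformer width
PATCH = 40        # 40x40 image patches, per the announcement
AUDIO_DIM = 128   # assumed per-frame audio feature size

class EarlyFusionEmbed(nn.Module):
    """Lightweight embedding layers in place of a standalone encoder like
    Whisper; everything downstream is one transformer trained from scratch."""

    def __init__(self) -> None:
        super().__init__()
        self.audio_proj = nn.Linear(AUDIO_DIM, D_MODEL)           # audio frames -> tokens
        self.patch_proj = nn.Linear(3 * PATCH * PATCH, D_MODEL)   # RGB patches -> tokens

    def forward(self, audio: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
        # audio:   (batch, n_frames, AUDIO_DIM)
        # patches: (batch, n_patches, 3 * PATCH * PATCH)
        a = self.audio_proj(audio)
        v = self.patch_proj(patches)
        # Concatenate both modalities into one sequence for the transformer.
        return torch.cat([a, v], dim=1)
```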

Dual-model system

The research preview introduces tml-interaction-small, a 276-billion-parameter mixture-of-experts (MoE) model with 12 billion active parameters. Because real-time conversation requires near-instantaneous response times that often conflict with deep reasoning, the company designed a two-part system:

  1. Interaction model: remains in constant exchange with the user, handling dialogue management, presence, and immediate responses.

  2. Background model: an asynchronous agent that handles persistent reasoning, web browsing, or complex tool calls, streaming its results back so they are woven naturally into the conversation.

This setup allows the AI to perform tasks like live translation or generating UI charts while continuing to listen to the user – a capability demonstrated in the announcement video, where the model reacts to various cues at roughly human reaction times while building a bar chart.
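A rough sketch of how such a two-part system could be orchestrated, again with invented names (`respond`, `solve`, and `weave_in` are assumptions, not a published API):

```python
import asyncio

async def run_session(interaction_model, background_model, user_stream):
    """Keep the fast interaction model in constant exchange with the user
    while the background model chews on slow tasks asynchronously."""
    pending: set[asyncio.Task] = set()

    async for event in user_stream:
        # Fast path: the interaction model answers on every micro-turn.
        reply = await interaction_model.respond(event)
        yield reply

        # Slow path: hand off browsing or tool calls without blocking.
        if reply.needs_background_work:
            pending.add(asyncio.create_task(background_model.solve(reply.request)))

        # Weave any finished background results back into the dialogue.
        done = {t for t in pending if t.done()}
        for task in done:
            yield await interaction_model.weave_in(task.result())
        pending -= done
```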

Impressive performance on key benchmarks against real-time models from other leading AI labs

To prove the efficacy of this approach, the lab used FD-Bench, a benchmark specifically designed to measure interaction quality rather than raw intelligence. The results show that tml-interaction-small significantly outperforms existing real-time systems:

  • Responsiveness: it achieved a turn-taking latency of 0.40 seconds, compared to 0.57 seconds for gemini-3.1-flash-live and 1.18 seconds for gpt-realtime-2.0 (mini).

  • Interaction quality: on FD-Bench V1.5, it scored 77.8, almost double the score of its primary competitor (gpt-realtime-2.0 mini scored 46.8).

  • Visual proactivity: in targeted tests like RepCount-A (counting physical repetitions in a video) and ProactiveVideoQA, the Thinking Machines model successfully engaged with the visual world while other frontier models stayed silent or gave incorrect answers.

| Metric | tml-interaction-small | gpt-realtime-2.0 (mini) | gemini-3.1-flash-live (mini) |
| --- | --- | --- | --- |
| Turn-taking latency (s) | 0.40 | 1.18 | 0.57 |
| Interaction quality (average) | 77.8 | 46.8 | 54.3 |
| IFEval (VoiceBench) | 82.1 | 81.7 | 67.6 |
| HarmBench (refusal %) | 99.0 | 99.5 | 99.0 |

Once the model becomes available, it could be a huge boon for enterprises

If made available to the enterprise sector, Thinking Machines’ interaction models would represent a fundamental shift in how businesses integrate AI into their operational workflows.

A native interaction model like tml-interaction-small would enable several enterprise capabilities that are currently impossible, or highly brittle, with standard multimodal models:

Current enterprise AI requires a "turn" to complete before data can be analyzed. In a manufacturing or laboratory setting, a native interaction model could monitor a video feed and proactively intervene when it detects a safety breach or a deviation from protocol – without waiting for a worker to ask for feedback.

The model’s success on visual benchmarks such as RepCount-A (accurate repetition counting) and ProactiveVideoQA (answering questions as visual evidence appears) suggests it could serve as a real-time auditor for high-risk physical tasks.
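As a thought experiment, a safety monitor built on such a model might look like the following; every name here is hypothetical, since no streaming API has been published:

```python
def page_supervisor(message: str) -> None:
    print(f"ALERT: {message}")  # stand-in for a real paging integration

def monitor_factory_feed(session, camera_frames) -> None:
    """Stream video continuously; the model speaks only when it decides to."""
    for frame in camera_frames:
        session.send_video(frame)             # input never waits for a "turn"
        for alert in session.poll_outputs():  # model-initiated, not prompted
            if alert.kind == "safety_breach":
                page_supervisor(alert.description)
```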

The primary friction in voice-based customer service is the one-to-two-second "processing" delay common in standard APIs as of 2026. Thinking Machines’ model achieved a turn-taking latency of 0.40 seconds, roughly the speed of a natural human conversation.

Because it handles simultaneous speech natively, an enterprise support bot could listen to a customer’s frustrations, provide "backchannel" signals (e.g. "got it" or "mm-hmm") without interrupting, and offer live translation that feels like a natural conversation rather than a series of disjointed recordings.
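In code, the difference from a turn-based bot is that output can be emitted while the input stream stays open; here is a sketch with invented method names (`read_audio`, `detects_pause`, `say`, and the session object are all assumptions):

```python
import random

BACKCHANNELS = ("mm-hmm", "got it", "right")

def support_turn(session) -> None:
    """Listen continuously and acknowledge without taking the floor."""
    while session.user_is_speaking():
        chunk = session.read_audio(ms=200)
        session.ingest(chunk)                  # never stop perceiving
        if session.detects_pause(chunk):       # a natural gap, not end of turn
            session.say(random.choice(BACKCHANNELS), interruptible=True)
    session.say(session.full_reply())          # only now take a full turn
```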

Standard LLMs lack an internal clock; they "know" the time only when it is provided in the text prompt. Interaction models are inherently time-aware, letting them manage time-sensitive requests such as "Remind me to check the temperature every 4 minutes" or "Alert me if this process takes longer than the previous one." This matters for industrial maintenance and pharmaceutical research, where time is an essential variable.
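Because a duplex model consumes a segment every 200ms whether or not anyone is talking, elapsed time falls out of the stream for free. A toy illustration of honoring an "every 4 minutes" request (hypothetical, not the product's actual mechanism):

```python
import time

class ReminderClock:
    """Track wall time across micro-turns so a request like 'remind me to
    check the temperature every 4 minutes' can fire on schedule."""

    def __init__(self, interval_s: float) -> None:
        self.interval_s = interval_s
        self.last_fired = time.monotonic()

    def tick(self) -> bool:
        # Call once per micro-turn; returns True when the reminder is due.
        if time.monotonic() - self.last_fired >= self.interval_s:
            self.last_fired = time.monotonic()
            return True
        return False

# e.g. clock = ReminderClock(4 * 60); speak the reminder on any
# micro-turn where clock.tick() returns True
```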

Background on Thinking Machines

This release is the second major milestone for Thinking Machines, following the October 2025 launch of Tinker, a managed API for fine-tuning language models that lets researchers and developers control their data and training methods while Thinking Machines handles the burden of distributed training infrastructure.

The company said Tinker supports both small and large open-weight models, including mixture-expert models, and early users include groups at Princeton, Stanford, Berkeley, and Redwood Research.

At launch in early 2025, Thinking Machines positioned itself as an AI research and product company seeking to make advanced AI systems "more widely understood, customizable, and generally capable."

In July 2025, Thinking Machines said it had raised nearly $2 billion at a $12 billion valuation in a round led by Andreessen Horowitz with participation from Nvidia, Accel, ServiceNow, Cisco, AMD, and Jane Street, which WIRED described as the largest seed funding round in history.

The Wall Street Journal reported in August 2025 that rival tech CEO Mark Zuckerberg approached Murati about acquiring Thinking Machines Lab and, after she declined, Meta moved to poach more than a dozen of the startup’s approximately 50 employees.

In March and April 2026, the company also became known for its compute ambitions: it announced an Nvidia partnership to deploy at least one gigawatt of next-generation Vera Rubin systems, then expanded its Google Cloud relationship to use Google’s AI hypercomputer infrastructure with Nvidia GB300 systems for model research, reinforcement learning workloads, frontier model training, and Tinker.

By April 2026, Business Insider reported that Meta had hired seven founding members away from Thinking Machines, including Mark Jenn and Yinghai Lu, while another Thinking Machines researcher, Tianyi Zhang, also moved to Meta. The same reporting said that Joshua Gross, who had helped build Thinking Machines’ flagship fine-tuning product Tinker, had joined Meta Superintelligence Labs, and that despite the departures the company had grown to about 130 employees.

Thinking Machines wasn’t only losing people, though: it also hired Meta veteran Soumith Chintala, creator of PyTorch, as CTO, and added other high-profile tech talent like Neil Wu. TechCrunch separately reported in April 2026 that Weiyao Wang, an eight-year Meta veteran who worked on multimodal perception systems, had joined Thinking Machines, underscoring that the talent flow was not one-way.

Thinking Machines has previously said it is committed to releasing "important open source components" to empower the research community. It is unclear whether the new interaction models will fall under the same ethos and release terms.

But one thing is certain: by making interactivity native to the model, Thinking Machines is betting that scaling in this direction will make AI a smarter and more effective collaborator than it is today.


