Ask Siri a question and three separate models wake up behind the curtain. One transcribes your voice into text. A second reads that text and writes a response. A third converts the response back into speech. Each handoff is a point of failure, another chance for the meaning of what you actually said to get lost in translation.
NVIDIA's PersonaPlex 7B was supposed to change that. Released in January as an open-source, full-duplex speech-to-speech model, it collapses those three stages into one. Audio goes in. Audio comes out. It listens while it talks. Interruptions don't break it. It throws in "uh-huh" and "right" at the places a human would, and it stays in character the whole time. One model where three used to sit.
That alone would be interesting but not the story.
What happened next is the part worth paying attention to. A developer named Ivan took NVIDIA's 16.7-gigabyte PyTorch checkpoint, quantized it to 5.3 GB at 4-bit precision, rewrote the inference stack in native Swift using Apple's MLX framework, and got it running faster than real-time on a MacBook. No Python. No server. No cloud. Just a laptop and a microphone.
The voice pipeline that powers every major assistant on the market just got an existence proof for its own replacement. And the replacement runs on hardware you already own.
The Breakdown
- NVIDIA's PersonaPlex 7B replaces the three-model voice pipeline (ASR→LLM→TTS) with a single full-duplex model that listens and speaks simultaneously.
- An independent developer ported it to Apple Silicon in native Swift via MLX, quantized to 5.3 GB, running faster than real-time on an M2 Max.
- Full-duplex accuracy still trails cascade pipelines, scoring 4.29/5 on task adherence. HN testers called it a proof of concept, not a product.
- Voice AI is migrating from cloud APIs to local hardware, with implications for privacy, cost, and who controls the stack.
The telephone game that ate voice AI
The short version: NVIDIA's PersonaPlex 7B collapses the three-model voice pipeline (ASR, LLM, TTS) into a single full-duplex model. An independent developer ported it to run on a MacBook in native Swift via Apple's MLX framework, quantized to 5.3 GB and faster than real-time. The architecture works. The accuracy doesn't, yet. But the shift from cloud to laptop changes who gets to solve that problem.
Every voice assistant you use today plays a version of telephone. Your words pass through an ASR model that strips out tone, hesitation, and emphasis to produce flat text. A language model reads that text with no memory of how you said it. A TTS model tries to put emotion back into a response it never heard the original context for.
The losses compound. Prosody disappears at step one. Emotional register vanishes. The subtle cues that make human conversation feel human (a rising inflection that signals uncertainty, a trailing "so..." that invites the other person to jump in) evaporate during transcription and never come back. You get a technically correct answer delivered in the vocal equivalent of a form letter.
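To see why nothing upstream can recover those cues, here is a minimal Swift sketch of the cascade. The types and names are hypothetical, not any vendor's actual API; the point is that the only thing crossing each stage boundary is a string, so anything a string can't carry is gone by stage two.

```swift
// Hypothetical cascade: each stage sees only what the previous one emits.
protocol ASR { func transcribe(_ audio: [Float]) -> String }  // tone, pauses, emphasis dropped here
protocol LLM { func respond(to text: String) -> String }      // reasons over flat text only
protocol TTS { func synthesize(_ text: String) -> [Float] }   // re-invents emotion it never heard

func cascade(_ audio: [Float], asr: any ASR, llm: any LLM, tts: any TTS) -> [Float] {
    let text = asr.transcribe(audio)    // prosody is lost at this boundary, permanently
    let reply = llm.respond(to: text)   // the LLM never hears how the user said it
    return tts.synthesize(reply)        // the TTS never hears the user at all
}
```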
This is why ChatGPT's advanced voice mode feels off even when the answers are right. One Hacker News commenter put it bluntly: "I just want hands-free conversations with SOTA models and don't care if I have to wait a couple of seconds for a reply." The speed is fine. The soul is missing.
PersonaPlex attacks this at the architecture level. Built on Kyutai's Moshi foundation, it processes 17 parallel token streams at 12.5 Hz, one frame every 80 milliseconds. Eight streams carry user audio, eight carry agent audio, and one handles text. The temporal transformer, all seven billion parameters of it, sees everything simultaneously. It doesn't translate between modalities. It thinks in audio.
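Spelled out as arithmetic, with constants taken from the description above (purely illustrative, not the model's actual code):

```swift
// Frame timing and stream layout as described above.
let frameRate = 12.5                  // frames per second
let frameDuration = 1.0 / frameRate   // 0.08 s, i.e. one frame every 80 ms
let userAudioStreams = 8              // codebook streams carrying the user's audio
let agentAudioStreams = 8             // codebook streams carrying the agent's audio
let textStreams = 1                   // the text stream
let totalStreams = userAudioStreams + agentAudioStreams + textStreams  // 17
// Every 80 ms, the 7B temporal transformer consumes one token from each of
// the 17 streams at once: it hears the user and itself in the same forward pass.
```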
The training data reflects the ambition. NVIDIA fed the model 7,303 real conversations from the Fisher English corpus, totaling 1,217 hours, then supplemented that with over 2,000 hours of synthetic dialogue covering customer service and teaching scenarios. The real conversations teach backchanneling and natural rhythm. The synthetic ones teach task adherence.
Five gigabytes and a MacBook
Set the model itself aside for a moment and look at the hardware. NVIDIA designed PersonaPlex for A100s and H100s. Those cards run $10,000-plus apiece, and the model needs 24 GB of VRAM just to load. Good luck fitting that in your office, let alone your backpack.
What matters is what happened when the model hit the open-source community. Ivan's qwen3-asr-swift library didn't just port PersonaPlex to Apple Silicon. It built an entire speech pipeline in native Swift through MLX: ASR via Qwen3, multilingual text-to-speech via CosyVoice, and speaker diarization via pyannote and WeSpeaker. The whole diarization stack weighs 32 megabytes. The quantized PersonaPlex model downloads once at 5.3 GB, then runs at 68 milliseconds per step on an M2 Max. That's an RTF of 0.87, meaning the model produces audio faster than a human can listen to it.
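Those figures translate into a simple per-frame latency budget. A back-of-envelope check using the numbers above (the small gap between 68/80 and the reported 0.87 presumably reflects per-step overhead in the end-to-end measurement):

```swift
// Per-frame budget at 12.5 Hz, using the reported M2 Max numbers.
let frameBudget = 0.080                // a frame must be ready every 80 ms
let stepTime = 0.068                   // reported generation time per step
let headroom = frameBudget - stepTime  // 12 ms of slack per frame
let rtf = stepTime / frameBudget       // ≈ 0.85 from the raw step time;
// the library reports 0.87 end to end. Anything below 1.0 streams
// continuously, with no buffering between frames.
```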
The 4-bit quantization tells the compression story. The Depformer, the component that generates audio codebooks step by step, shrank from 2.4 GB to 650 MB. A 3.7x reduction with no measurable quality loss in round-trip ASR tests. You can verify this yourself because the same library includes the ASR model that checks the speech-to-speech output, a closed loop running entirely on your machine.
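Those numbers are consistent with group-wise 4-bit quantization. A rough sanity check, assuming a bf16 baseline and an fp16 scale and bias per group of 64 weights (the group size is an assumption, not a documented detail):

```swift
// Depformer sanity check: 2.4 GB at bf16 (2 bytes/param) ≈ 1.2 B parameters.
let depformerParams = 2.4e9 / 2.0
// 4-bit weights plus a 16-bit scale and 16-bit bias per group of 64 weights:
let effectiveBits = 4.0 + (16.0 + 16.0) / 64.0               // ≈ 4.5 bits per weight
let quantizedGB = depformerParams * effectiveBits / 8 / 1e9  // ≈ 0.68 GB, close to 650 MB
```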
Apple's MLX framework is the quiet enabler here. Metal acceleration on the GPU, unified memory that eliminates tensor copying between CPU and GPU, and APIs in Swift that let you build native macOS and iOS apps without touching Python. If you've been wondering what Apple's on-device AI strategy actually looks like beyond marketing slides, this is it. Not CoreML. Not the Neural Engine SDK. A community-driven framework that researchers and indie developers are turning into a real runtime.
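A few lines of MLX Swift show the flavor. This is a minimal sketch using the framework's basic array API, nothing model-specific; the point is that there is no device-transfer step anywhere:

```swift
import MLX

// Arrays live in unified memory, visible to CPU and GPU alike, so there is
// no PyTorch-style .to(device) copy before a kernel runs on Metal.
let weights = MLXArray([0.5, 1.5, 2.5, 3.5] as [Float])
let scaled = weights * 2    // dispatched to the GPU, evaluated lazily
print(scaled)               // evaluation is forced here; still no transfer code
```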
The honest problem with full-duplex
Here is where the enthusiasm needs a correction. PersonaPlex is architecturally right and practically limited. The Hacker News discussion from this week made that clear.
One commenter who downloaded the 5 GB model discovered it only processes WAV files, not live microphone input. "I'd skip this for now," they wrote. "It does not allow any kind of interactive conversation." Another developer who has spent significant time working with voice agents pointed out that the full-duplex architecture "is still a bit off in terms of giving you good accuracy/performance, and it's quite difficult to train."
The cascade pipeline, for all its telephone-game losses, has a real advantage: composability. You can swap the ASR model without touching TTS. You can route complex queries to a large language model while keeping a tiny one for simple acknowledgments. You can mix local and cloud endpoints depending on the task. Full-duplex gives you none of that. One model, one capability profile, take it or leave it.
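In code terms, that composability is ordinary dependency injection. A sketch, reusing the hypothetical protocols from the cascade example earlier:

```swift
// Hypothetical routing: a small local model for quick acknowledgments,
// a large cloud model for everything else. A fused full-duplex model
// exposes no seam where this decision could be made.
func pickModel(for text: String, local: any LLM, cloud: any LLM) -> any LLM {
    text.split(separator: " ").count < 4 ? local : cloud
}
```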
And the response quality issue cuts deep. When asked "Can you guarantee that the replacement part will be shipped tomorrow?", PersonaPlex with a customer-service prompt replies: "I can't promise a specific time, but we'll do our best to get it out tomorrow." As one HN commenter observed, "It's not surprising that people have little interest in talking to AI if they're being lied to." The model learned to mimic customer service scripts. It didn't learn to be useful.
NVIDIA looks emboldened by the open-source reception but clearly knows the limitations. The model scores 90.8 on FullDuplexBench for conversational dynamics, close to Moshi's 95, but the task adherence metrics tell a different story. PersonaPlex hits 4.29 out of 5 on following instructions. That's passing. Barely.
Who wins when voice leaves the data center
The shift that PersonaPlex-on-MLX represents isn't about one model or one library. The entire voice stack is migrating from cloud APIs to local hardware, and it's happening faster than the companies selling those APIs would like.
Anthropic just added voice mode to Claude Code, rolling it out to 5% of users. OpenAI charges for voice API access per minute. Google's Gemini voice conversations are five times longer than text ones on average, which means five times the compute cost. Every major AI lab has an economic incentive to keep voice processing in their cloud.
But the math is shifting. Run that 5.3 GB model on a $2,000 laptop, and you pay nothing per minute. No API bill. No audio leaves the machine. Think about what that means for a hospital recording patient intake, or a bank logging compliance calls where every second of audio routed to a third-party server is a regulatory headache waiting to happen.
The developers building on this stack are anxious about different things than the labs. They worry about voice activity detection accuracy, speaker diarization quality, and whether forced alignment can produce word-level timestamps reliable enough for production. Granular, boring, essential problems. The kind that get solved by engineers with access to the weights, not by API customers filing feature requests.
The 80-millisecond bet
PersonaPlex's frame budget is 80 milliseconds. That's the window between one audio frame and the next at 12.5 Hz. Hit it consistently and the conversation flows. Miss it and the model stutters, buffers, breaks the illusion.
Right now the quantized model clears that bar on an M2 Max. The M4 generation will clear it with room to spare. And here is the real bet: by the time the accuracy problems get solved, the hardware will have lapped the requirements twice over.
The three-model pipeline won't disappear overnight. It's entrenched and composable, and engineers understand it. But every month that passes, another piece of the speech stack gets ported to MLX, quantized to fit in unified memory, and released under an Apache license. The diarization stack landed last week. Forced alignment the week before. The telephone game is losing players.
You can clone the library today and run it. That's the part that's different from every other "future of voice AI" story. This isn't a research paper or a demo video. It's a swift build -c release and a WAV file away from running on your desk. The model talks back. Not well enough yet, not for production, not for anything that matters today. But faster than real-time, on hardware that fits in a backpack, with every weight open and every line of inference code readable.
The pipeline is cracking. What comes through the cracks is still rough. But it fits on a laptop now, and that changes who gets to work on it.
Frequently Asked Questions
What is full-duplex speech-to-speech and how does it differ from current voice assistants?
Full-duplex means the model listens and speaks at the same time, like a human conversation. Current assistants use a three-step pipeline: transcribe speech to text, generate a text response, then synthesize speech. Each handoff adds latency and loses non-verbal cues like tone and hesitation.
What hardware do you need to run PersonaPlex locally on a Mac?
The 4-bit quantized model weighs 5.3 GB and runs on Apple Silicon Macs through the MLX framework. Benchmarks show an RTF of 0.87 on an M2 Max with 64 GB, meaning it produces audio faster than real-time. An M4 would clear the 80ms frame budget with headroom to spare.
What is Apple's MLX framework and why is it relevant here?
MLX is Apple's open-source machine learning framework built for Apple Silicon. It uses Metal GPU acceleration and unified memory to eliminate tensor copying between CPU and GPU. Unlike CoreML, it provides APIs in Swift, Python, C++, and C, letting developers build native on-device AI apps without cloud dependencies.
Why can't PersonaPlex replace traditional voice pipelines today?
Accuracy and composability. PersonaPlex scores 4.29/5 on task adherence and currently only processes WAV files, not live microphone input. The cascade pipeline lets you swap individual models, mix local and cloud endpoints, and route different queries to different-sized models. Full-duplex can't do that yet.
What does an RTF of 0.87 mean for real-time voice interaction?
RTF (Real-Time Factor) measures how fast the model generates audio relative to playback speed. Below 1.0 means faster than real-time. At 0.87, PersonaPlex produces each 80ms audio frame in about 68ms, leaving 12ms of headroom per step for streaming without buffering delays.