Arm bets mobile AI on CPUs, not NPUs

Arm challenges the smartphone industry's NPU rush with Lumex, betting CPU-based AI can deliver 5x performance gains across 3 billion devices by 2030. The platform's SME2 instructions target developer frustration with fragmented neural engines.


💡 TL;DR - The 30-Second Version

🎯 Arm launched the Lumex platform with SME2 instructions, claiming up to 5x AI performance gains on its CPUs over the prior generation, without dedicated neural processors

🔧 Four new C1 cores range from C1-Ultra (25% faster single-thread) to C1-Nano (26% more efficient), covering everything from flagship phones to wearables

📱 Platform targets 3-nanometer manufacturing, with devices expected in late 2025 to early 2026 and a projected reach of 3 billion devices by 2030

⚡ KleidiAI integration means thousands of Android apps gain SME2 acceleration without code changes across major frameworks

🥊 Strategy directly challenges Qualcomm, Apple, and MediaTek's NPU-focused approaches amid developer frustration with fragmented AI hardware

🌍 CPU-first AI could reshape mobile roadmaps, favoring software compatibility over specialized neural hardware efficiency gains

New Lumex platform touts SME2 and up to 5× AI speedups while sidestepping NPU fragmentation.

Arm is pushing against the smartphone industry’s NPU rush with Lumex, a new compute subsystem that puts most on-device AI work back on the CPU. In its launch, detailed in Arm’s official Lumex CSS announcement, the company claims up to 5× faster AI performance from Scalable Matrix Extension v2 (SME2) instructions integrated across its latest C1 CPU cores. The pitch: developers get predictable acceleration that works everywhere, without rewriting code for every vendor’s neural engine.

What’s actually new

At the heart of Lumex is SME2, Arm’s second-generation matrix math extension woven into a four-tier C1 core lineup: C1-Ultra for peak single-thread performance, C1-Premium for area efficiency, C1-Pro for sustained efficiency, and C1-Nano for wearables. Arm says the Ultra core delivers a 25% single-thread uplift over last year’s top core, while the Pro and Nano tiers emphasize steady clocks and lower power. That mix lets chipmakers compose clusters—think two C1-Ultra plus six C1-Pro—for flagship phones or scale down for rings and watches. It’s a configurable stack.

The platform arrives with a new C1-DSU fabric, tuned for power-aware scheduling and memory moves, and targets 3-nanometer nodes. Arm is designing for smartphone and PC tiers alike, with the same SME2 instruction model spanning both. One ISA, many form factors.

Evidence vs. claims

Arm’s performance numbers are eye-catching but situational. The company cites up to 5× gains on AI tasks, 4.7× lower latency on speech workloads, and 2.8× faster audio generation versus the prior generation. It also points to demos: a 2.4× boost in text-to-speech and roughly 40% faster large-language-model response times in collaborations with Alipay and vivo. Conditions matter. These are vendor-run tests, with limited disclosure on prompts, models, and thermals. Treat them as direction, not gospel.

The more pragmatic story sits in software. Arm’s KleidiAI libraries now plug into mainstream runtimes—PyTorch ExecuTorch, Google LiteRT, Alibaba MNN, ONNX Runtime—so thousands of Android apps can see SME2 without code changes, according to Arm. Google’s own apps (Gmail, YouTube, Photos) are “SME2-ready,” and the company says many of the same optimizations carry to Windows on Arm. Portability sells. So does low friction.
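
To make the "no code changes" claim concrete, here is a minimal sketch of ordinary ONNX Runtime inference in Python (the model file and input shape are hypothetical). Nothing in the script mentions SME2; on a device whose CPU supports it and whose runtime build includes the KleidiAI kernels, the acceleration happens underneath this unchanged code.

```python
# Plain ONNX Runtime inference; model path and input shape are hypothetical.
import numpy as np
import onnxruntime as ort

# Load a model exactly as on any other device; no SME2-specific setup.
session = ort.InferenceSession("model.onnx")

# Feed a dummy input matching the model's expected shape.
input_name = session.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)

# The same run() call dispatches to KleidiAI/SME2-optimized kernels
# when the runtime build and the CPU support them.
outputs = session.run(None, {input_name: x})
print(outputs[0].shape)
```

That is the whole pitch: the acceleration decision moves from the app developer to the runtime and the silicon.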

The countertrend: NPUs still matter

Arm’s stance pushes against a visible market current. Gartner’s definition of a “GenAI smartphone” centers on a built-in neural engine, and Qualcomm, Apple, MediaTek, and Samsung all tout dedicated NPUs as their AI workhorses. Qualcomm has already demoed multi-billion-parameter models on an Android device. Apple continues to expand its Neural Engine alongside CPU gains. The NPU narrative is familiar: best performance-per-watt for matrix math, crucial under battery constraints. That argument won’t fade.

Developer economics, not just FLOPs

Today’s mobile AI stack is fragmented, and developers feel it. As TECHnalysis Research’s Bob O’Donnell notes, many app teams default to CPU and GPU paths because NPU interfaces vary widely by vendor and generation. That slows adoption. Lumex leans into this reality: standardize around a CPU feature that is present on every Arm phone; accelerate the common case; keep the code portable. It’s a defensible, if conservative, bet. Sometimes boring wins.

The nuance: Arm isn’t anti-GPU or anti-NPU. It’s prioritizing a baseline that always exists. OEMs can still route camera or vision tasks to other accelerators as frameworks mature. The CPU is the floor, not the ceiling.

Graphics: the other accelerator in the box

Lumex also introduces the Mali G1-Ultra GPU with a redesigned Ray Tracing Unit v2. Arm claims a 2× uplift in ray tracing versus last year’s part, about 20% higher gaming performance in popular titles, and roughly 9% less energy per frame. The GPU also delivers up to 20% faster AI inference. That creates a healthy tension in Arm’s story: CPUs as the universal AI path, GPUs as opportunistic accelerators for both graphics and tensors. That’s fine. Phones are heterogeneous by design.

Ray tracing on mobile remains early. But fidelity expectations keep rising, and Arm is laying groundwork for when demand catches up. It’s a long game.

Timelines, nodes, and who moves first

Lumex targets 3-nanometer processes and flagship cycles first, which means expensive silicon and initial wins skewed to premium phones and PCs. Arm is giving licensees two paths: take the pre-tuned CSS “as is” for speed to market, or harden and customize the RTL for differentiation. Either way, this is a 2025–2026 story that scales down over time. No specific devices were named at launch. That’s typical for Arm.

The real question

Is “CPU-first AI” a bridge until NPU software finally standardizes, or a durable equilibrium? If developers keep finding that CPU paths “just work” across billions of phones, SME2 could anchor the everyday AI workload, with GPUs and NPUs reserved for peak bursts and vendor features. If, instead, the industry converges on stable, portable NPU APIs, the efficiency argument could reassert itself. Both futures are possible. The next two product cycles will tell.

For now, Arm has planted a clear flag: make AI fast where developers already ship code, and remove the sharp edges.

Why this matters:

  • CPU-first AI could reset mobile roadmaps, favoring portability and time-to-market over bespoke neural hardware gains.
  • If SME2 becomes the default target for app developers, it may dictate which accelerators get used—and funded—in future phones.

❓ Frequently Asked Questions

Q: What exactly is SME2 and how is it different from regular CPU instructions?

A: SME2 (Scalable Matrix Extension v2) adds dedicated matrix-math instructions to Arm CPUs. Conventional scalar and vector instructions work on single values or one-dimensional runs of data, while SME2 operates on two-dimensional tiles of a matrix at once, and matrix multiplication is the mathematical foundation of AI models. It effectively turns every CPU core into a small AI accelerator without needing separate neural hardware.
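
A rough Python/NumPy illustration of the distinction (this is not SME2 code, just the shape of the work): the triple loop below performs one multiply-accumulate per step, the element-at-a-time view, while the whole-matrix form expresses the two-dimensional computation that SME2-style instructions process as hardware tiles.

```python
import numpy as np

def matmul_scalar(a, b):
    """Element-at-a-time matrix multiply: one multiply-accumulate per loop step."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(m):
        for j in range(n):
            for p in range(k):
                c[i, j] += a[i, p] * b[p, j]  # one scalar MAC per iteration
    return c

a = np.random.rand(64, 64).astype(np.float32)
b = np.random.rand(64, 64).astype(np.float32)

# The whole-matrix form: one expression for the full 2-D computation,
# which is the granularity that matrix instructions target in hardware.
assert np.allclose(matmul_scalar(a, b), a @ b, atol=1e-2)
```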

Q: When can I actually buy a phone with Lumex chips?

A: Arm expects devices with Lumex chips to reach market in late 2025 or early 2026. The platform targets 3-nanometer manufacturing, which is expensive initially, so expect flagship phones first before the technology filters down to mid-range devices over the following years.

Q: How does this compare to Apple's Neural Engine approach?

A: Apple pairs dedicated Neural Engine hardware with CPU improvements in its A-series chips, while Arm puts AI acceleration directly into the CPU cores via SME2. Apple's approach may be more power-efficient for AI tasks, but Arm's works on any Android or Windows on Arm device whose CPU implements SME2, without requiring specific neural hardware.

Q: Why are developers frustrated with NPUs if they're supposedly better for AI?

A: NPU architectures vary widely between Qualcomm, MediaTek, Samsung, and other vendors, so developers must write different code for each NPU type, while CPU instructions are standardized. TECHnalysis Research notes that "few software developers are actually using the NPUs" because of this fragmentation; most default to CPUs and GPUs instead.

Q: Will SME2 drain my phone's battery faster than current chips?

A: Arm claims SME2 delivers AI performance gains while using less power than previous generations. The C1-Nano core offers 26% better efficiency than previous designs. However, real-world battery life will depend on how often AI features run and how device makers tune their power management systems.
