Alibaba on Monday released Qwen 3.5. The open-source model beats GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro on several benchmarks and cuts inference costs 60%, the company claims. Built on a sparse mixture-of-experts architecture, the model activates just 17 billion of its 397 billion total parameters per query, an efficiency-first design that recalls DeepSeek's breakout strategy from last year. Reuters reported the release caps a week in which nearly every major Chinese AI company shipped a new flagship model ahead of Lunar New Year.
The Breakdown
• Qwen 3.5 activates 17 billion of 397 billion parameters per query, cutting inference costs 60% versus its predecessor.
• Alibaba claims the model beats GPT-5.2 and Claude Opus 4.5 on vision and instruction-following benchmarks but trails on coding and math.
• Visual agent features let the model operate phone and desktop screens, though a shopping agent test crashed under 120 million orders.
• Five Chinese AI companies shipped six flagship models in a single week ahead of Lunar New Year.
Alibaba calls this the "agentic AI era." ByteDance used nearly identical language when it released Doubao 2.0 on Saturday. The shared framing is deliberate. Both companies are betting the next competitive front won't be chatbots answering questions but models that see screens and run multi-step tasks on their own, no human in the loop. "Built for the agentic AI era, Qwen 3.5 is designed to help developers and enterprises move faster and do more with the same compute," Alibaba said in a statement.
Fewer parameters, faster inference
The headline number in the release is what got smaller. Qwen 3.5's open-source model carries 397 billion parameters in total, but only 17 billion activate for any given query. The rest stay idle. Its predecessor, Qwen3-Max-Thinking, weighed in above one trillion parameters with a far larger active footprint. The newer, lighter model matches or beats it on every benchmark Alibaba published.
The architecture behind that claim fuses two mechanisms rarely combined at this scale: Gated Delta Networks, a form of linear attention that processes input sequences more cheaply than standard attention, and a high-sparsity mixture-of-experts layer that routes each query to a small fraction of the model's total capacity. If you picture a building with 397 billion light switches, only 17 billion flip on for any given task. The rest draw no power. DeepSeek's V3 model demonstrated last year that sparse activation could match denser architectures on performance while slashing costs. Alibaba is pushing the same idea further, with an even higher sparsity ratio.
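The routing idea behind sparse mixture-of-experts can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Alibaba's implementation: the expert functions, gate logits, and top-k of 2 are all made up. A gate scores every expert, only the top-k actually run, and the idle experts cost nothing.

```python
import math

def softmax(xs):
    """Convert raw gate logits into a probability distribution."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_moe(token, experts, gate_logits, k=2):
    """Route one token to the top-k experts; the rest stay idle."""
    probs = softmax(gate_logits)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    # Only the selected experts execute, the sparse-activation payoff.
    return sum(probs[i] / norm * experts[i](token) for i in top)

# Toy experts: scalar functions standing in for full feed-forward blocks.
experts = [lambda x: x * 2, lambda x: x + 10, lambda x: x ** 2, lambda x: -x]
gate_logits = [0.1, 2.0, 1.5, -1.0]  # produced by a learned router in practice
out = sparse_moe(3.0, experts, gate_logits, k=2)
```

With four experts and k=2, half the "switches" stay off for this token; at Qwen 3.5's claimed ratio, roughly 96% of parameters stay off per query.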
Each generation of Chinese AI models achieves more with less active compute. U.S. chip export controls have made high-end Nvidia GPUs harder and more expensive for Chinese companies to acquire, and the engineering response has been to extract maximum performance from every chip they can get. For labs like Alibaba, sparse activation has become less a design preference than an adaptation to hardware scarcity.
Alibaba claims decoding throughput runs 8.6 times faster than Qwen3-Max at 32,000-token contexts. Push that to 256,000 tokens and the multiplier hits 19. On cost, the company puts the savings at roughly 60% per query over its predecessor. For developers building applications on top of open-source models, cost per query determines whether a project is viable or dies on a spreadsheet. Performance benchmarks fill press releases. Inference bills fill inboxes.
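The spreadsheet math is simple. Here is a back-of-envelope sketch of what a 60% per-query cut does to a monthly inference bill; the query volume and prices are hypothetical, invented purely for illustration.

```python
# Hypothetical numbers for illustration only; not Alibaba's actual pricing.
queries_per_month = 10_000_000
old_cost_per_query = 0.004  # dollars, assumed
new_cost_per_query = old_cost_per_query * (1 - 0.60)  # the claimed 60% cut

old_bill = queries_per_month * old_cost_per_query  # monthly bill before
new_bill = queries_per_month * new_cost_per_query  # monthly bill after
```

At these assumed prices, the bill drops from $40,000 to $16,000 a month, the kind of difference that decides whether an application ships.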
There's also a closed-source version. Qwen 3.5-Plus lives on Alibaba Cloud's Model Studio. Context window: one million tokens. The model can also call APIs or run code without breaking the conversation. The token vocabulary grew to 250,000 from 150,000, and Alibaba added 82 new languages on top, most from South Asia, Oceania, and Africa. The vocabulary expansion alone boosts encoding efficiency by 10 to 60 percent depending on the language, the company says.
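Why does a bigger vocabulary boost encoding efficiency? A toy greedy tokenizer makes the effect concrete. The vocabularies and merge entries below are invented for illustration and have nothing to do with Qwen's actual tokenizer, but the compression mechanism is the same: more vocabulary entries mean the same text covered in fewer tokens.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenizer. Real BPE tokenizers are more
    subtle, but the vocabulary-size effect is the same."""
    tokens, i = [], 0
    while i < len(text):
        for n in range(len(text) - i, 0, -1):  # try longest match first
            piece = text[i:i + n]
            if piece in vocab:
                tokens.append(piece)
                i += n
                break
    return tokens

small_vocab = set("abcdefghijklmnopqrstuvwxyz ")                  # chars only
large_vocab = small_vocab | {"token", "izer", "vocab", "ulary"}   # adds merges

text = "tokenizer vocabulary"
short = tokenize(text, small_vocab)   # 20 single-character tokens
long_ = tokenize(text, large_vocab)   # 5 tokens for the same text
```

Fewer tokens per text means fewer decoding steps and a lower bill, which is why the jump from 150,000 to 250,000 vocabulary entries translates directly into the 10 to 60 percent efficiency gains Alibaba cites for some languages.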
The benchmark picture, honestly
Alibaba published comparison tables pitting Qwen 3.5 against GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro. Read selectively, the model looks like it matches or beats every frontier model out of the West. Read the full tables and the picture gets messier.
Where Qwen 3.5 leads: visual reasoning, instruction following, and multilingual tasks. On MathVision, a visual math benchmark, it scored 88.6, ahead of GPT-5.2 at 83.0 and Gemini 3 Pro at 86.6. IFBench measures how precisely a model follows complex instructions. Qwen 3.5 posted 76.5, edging past GPT-5.2 at 75.4 and leaving Claude's 58.0 far behind. On NOVA-63, a multilingual evaluation, it hit 59.1. Highest of any model in the test. OCR and text recognition were strong too: 93.1 on OCRBench, topping the field.
Where it trails: coding and hard math. GPT-5.2 scored 80.0 on SWE-bench Verified, a test that measures whether a model can fix real software bugs. Claude Opus 4.5 hit 80.9. Qwen 3.5 came in at 76.4. On AIME 2026 competition math, GPT-5.2 reached 96.7. Qwen 3.5 posted 91.3. On LiveCodeBench, Gemini 3 Pro topped everyone at 90.7 while Qwen 3.5 scored 83.6.
Qwen 3.5 leads where agents need to lead. A model built to operate phone screens and desktop applications needs to be excellent at vision, instruction following, and spatial reasoning. It does not need to win math competitions. Alibaba's strengths track precisely to its advertised use case: a visual agent that reads interfaces and acts on them.
The South China Morning Post pointed out the benchmarks are self-reported and noted the comparison "was not with the three US heavyweights' latest models." Self-reported scores from any AI lab, American or Chinese, should be treated as directional until independent testing confirms them. Alibaba isn't the first company to publish flattering internal benchmarks, and it won't be the last.
Visual agents as the competitive wedge
Qwen 3.5 can look at a phone or desktop screen, identify interface elements, and take actions, tapping buttons and filling out forms across applications. Alibaba calls this "visual agentic capabilities." It's the feature the company is marketing hardest, and the one that differentiates Qwen 3.5 from a conventional language model.
On OSWorld-Verified, a benchmark testing computer-use ability in real desktop environments, Qwen 3.5 scored 62.2. Claude Opus 4.5 scored 66.3, the highest in that category. On AndroidWorld, a mobile agent test, Qwen 3.5 reached 66.8. Competitive numbers, not dominant ones. But the efficiency argument matters here. Claude is a far larger active model. A system firing just 17 billion parameters that lands within a few points of Claude on desktop automation suggests the sparse architecture handles visual agent tasks well, at a fraction of the compute bill.
Alibaba's blog post showed demos that illustrate how far visual agents have come and how far they still need to go. One had the model autonomously filling missing Excel rows after a single instruction, then reverse-engineering a simple video game from footage and spitting out working HTML. The most striking demo involved driving video: the model identified traffic signals changing from green to amber, then explained why the vehicle didn't stop, reasoning through UK traffic rules and the physical distance to the stop line. Impressive on a demo reel. The real-world test came sooner than Alibaba might have liked.
Earlier this month, the company turned Qwen into a shopping agent, letting users order food and drinks through the chatbot. Backed by three billion yuan in subsidies, about $431 million, the campaign drove 120 million orders in six days and pushed a seven-fold jump in active users, according to Alibaba Cloud. It also crashed the system. Scaling agents is harder than scaling chatbots. When an agent fails, you don't get a wrong answer. You get a wrong order, a charged credit card, a delivery nobody requested.
ByteDance is making the same bet. Doubao 2.0, released Saturday, positions its nearly 200-million-user chatbot as an agent-first product. Every major Chinese AI company now wants to own the word "agent." The engineering hasn't caught up to the ambition. Not yet.
China's Lunar New Year model dump
Qwen 3.5 arrived inside a pile-up of Chinese AI announcements that felt excessive even by current standards. The holiday timing is strategic for companies and awkward for analysts. Launching before the week-long break guarantees attention but delays independent testing and developer feedback. By the time Chinese developers return to their keyboards, the news cycle will have moved on.
ByteDance shipped Doubao 2.0 on Saturday. Alibaba's DAMO Academy unveiled RynnBrain, a robotics model that can count oranges on a table and retrieve milk from a fridge, which puts Alibaba in direct competition with Nvidia and Google on physical AI. Kuaishou released Kling 3.0, a video generation model producing 15-second clips with native multilingual audio. Zhipu AI launched GLM-5, claiming it approaches Claude Opus 4.5 on coding benchmarks. MiniMax shipped M2.5 with enhanced agent tools. CNBC reported the wave showed Chinese firms keeping pace with American rivals across multiple AI categories.
Six flagship models from five companies in a single week.
Google DeepMind CEO Demis Hassabis told CNBC that Chinese AI models sit just "months" behind Western ones. The Lunar New Year releases suggest that estimate, even if accurate, should make Western labs nervous rather than comfortable. A few months is not a defensive position. It's a sprint distance.
For Alibaba, the mood is anxious. Qwen trails Doubao in users and DeepSeek in global visibility. DeepSeek became the first Chinese AI company to break through internationally last year, and Alibaba's initial response, releasing Qwen 2.5-Max, felt reactive. The subsidy blitz earlier this month and the Qwen 3.5 launch are bolder moves, pairing open-source distribution with aggressive price competition on inference costs.
Alibaba is building two products for two fights. Qwen 3.5 open-source is a developer play, meant to seed the infrastructure layer and attract builders who will lock into the Alibaba stack. Qwen 3.5-Plus is the enterprise product: a hosted model with a million-token window and built-in tools aimed at companies willing to pay for managed AI. The split mirrors what Meta did with Llama, giving away the base model and charging for the platform. Whether Alibaba can execute that playbook against DeepSeek, which gives away the weights and the training recipes, is the commercial question hanging over the whole release.
Qwen 3.5's weights are up on Hugging Face. The benchmark scores are posted. What Alibaba cannot publish is proof that those 17 billion lit switches hold up when 120 million orders hit the system in a single week. That answer comes from production servers, not leaderboards.
Frequently Asked Questions
Q: What is sparse mixture-of-experts architecture?
A: A design where a model contains many parameters but only activates a fraction for each query. Qwen 3.5 has 397 billion total parameters but fires just 17 billion per request. The idle parameters consume no compute. DeepSeek popularized this approach last year, showing it could match denser models at lower cost. Alibaba pushed the sparsity ratio even higher.
Q: How does Qwen 3.5 compare to GPT-5.2 on coding tasks?
A: It trails. GPT-5.2 scored 80.0 on SWE-bench Verified, which tests real bug-fixing ability. Claude Opus 4.5 scored 80.9. Qwen 3.5 posted 76.4. On LiveCodeBench, Gemini 3 Pro led at 90.7 versus Qwen 3.5's 83.6. Alibaba's model leads on vision and instruction following, not coding or hard math.
Q: What are Qwen 3.5's visual agent capabilities?
A: The model can look at a phone or desktop screen, identify buttons and form fields, and take actions across applications. On OSWorld-Verified, a desktop automation benchmark, it scored 62.2 compared to Claude Opus 4.5's 66.3. On AndroidWorld for mobile, it reached 66.8. Competitive but not dominant.
Q: What happened when Alibaba tested Qwen as a shopping agent?
A: Alibaba let users order food and drinks through the Qwen chatbot, backed by $431 million in subsidies. The campaign generated 120 million orders in six days and a seven-fold jump in active users. It also crashed the system, showing that scaling agents in production is harder than scaling chatbots.
Q: Is Qwen 3.5 open source?
A: The base model, Qwen 3.5-397B-A17B, is open-weight and available on Hugging Face. A separate closed-source version called Qwen 3.5-Plus is hosted on Alibaba Cloud with a one-million-token context window and built-in tool use. The open-weight release follows a strategy similar to Meta's Llama.



