Salesforce CEO Marc Benioff called for National Guard troops in San Francisco from his private plane, shocking his own PR team. The theatrical Trump embrace—timed before his big conference—tests whether loud loyalty beats quiet accommodation.
Coursera courses now live inside ChatGPT—world-class instructors summoned mid-conversation. No app switching, no browser tabs. EU users blocked by privacy laws while Wall Street bets big on AI-powered education. The learning shortcut arrived, but not for everyone.
Google's 1.3 quadrillion token milestone sounds massive—until you see growth rates halving and realize tokens measure server intensity, not customer demand. The slowdown reveals something uncomfortable about AI economics.
DeepSeek’s V3.1 goes open, targets parity at a fraction of the cost
DeepSeek's V3.1 delivers Claude-level coding performance at 1/70th the cost through open source. The Chinese startup's MIT-licensed model challenges American AI economics while optimizing for domestic chips—signaling parallel ecosystems.
👉 DeepSeek quietly released V3.1 this week, achieving 71.6% on the Aider coding benchmark while matching Claude Opus 4 performance
📊 The open-source model costs roughly $1 per coding task versus $70 for comparable closed alternatives—a 70x cost reduction
🏭 V3.1 features 671B total parameters with 37B active and includes FP8 optimization for upcoming Chinese domestic chips
🌍 MIT licensing allows teams to download, modify and deploy privately, eliminating data governance concerns with third-party APIs
🚀 The release pressures American AI companies to justify premium pricing through superior accuracy rather than mere access control
Chinese lab pairs faster “agentic” behavior with chip-friendly FP8—and one MIT-licensed download
DeepSeek quietly shipped V3.1 this week, positioning an open-weight model as a direct foil to closed rivals. Early testers cite code performance near the top of the pack and 128K context, while the V3.1 listing on Hugging Face confirms an MIT license and the same core architecture as V3. The bet is simple: match capability, crush price, and remove access friction.
What’s actually new
V3.1 keeps the Mixture-of-Experts foundation introduced with V3—671 billion total parameters with roughly 37 billion active per token—then layers post-training for longer context and tighter tool use. It also adds a hybrid inference mode. Users can run the model in fast, non-reasoning chats or flip a “deep thinking” setting for harder problems. That split matters for latency. Speed wins sessions.
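For developers hitting the hosted API rather than the app, that split maps onto two model names. A minimal sketch, assuming DeepSeek's OpenAI-compatible endpoint at api.deepseek.com and the documented `deepseek-chat` (fast) and `deepseek-reasoner` (deep thinking) names; treat the exact identifiers as the vendor's to confirm:

```python
from openai import OpenAI

# Assumes DeepSeek's OpenAI-compatible endpoint; model names per vendor docs.
client = OpenAI(api_key="YOUR_DEEPSEEK_KEY", base_url="https://api.deepseek.com")

def ask(prompt: str, deep_thinking: bool = False) -> str:
    # "deepseek-chat" = fast non-reasoning mode, "deepseek-reasoner" = deep-thinking mode.
    model = "deepseek-reasoner" if deep_thinking else "deepseek-chat"
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(ask("Summarize the bug in two sentences."))             # latency-sensitive path
print(ask("Prove the loop terminates.", deep_thinking=True))  # harder problem
```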
DeepSeek also tuned an FP8 format for domestic accelerators and published weights that developers can cast to BF16. It is a pragmatic engineering choice. Precision options broaden the hardware base.
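What casting looks like in practice, as a minimal PyTorch sketch. The tensor here is invented for illustration; converting the published checkpoints properly means using DeepSeek's own scripts and scale factors, not a bare cast:

```python
import torch

# Hypothetical single weight tensor; real checkpoints ship many sharded tensors.
w_fp8 = torch.randn(4096, 4096).to(torch.float8_e4m3fn)  # 1 byte per element

# Cast up to BF16 for hardware without native FP8 support (2 bytes per element).
w_bf16 = w_fp8.to(torch.bfloat16)

print(w_fp8.element_size(), w_bf16.element_size())  # 1 vs. 2 bytes per parameter
```

The trade is the obvious one: FP8 halves weight memory relative to BF16 at some precision cost, which is exactly why it suits accelerators with tighter memory budgets.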
Evidence, not vibes
Signals arrived quickly despite the low-key debut. Independent developers reported a 71.6% score on the Aider coding benchmark, placing V3.1 at or near state of the art for non-reasoning code tasks. Others noted the model’s 128K context handled “needle in a haystack” style prompts reliably. These are narrow tests, but they track with V3’s strengths. They also travel fast.
Crucially, the economics changed. Community runs put the cost of typical coding tasks near a dollar when self-hosted on commodity cloud instances, versus several dozen dollars for comparable closed-API work at published list prices. It is a directional claim, not a universal rate card. Yet the order of magnitude is the story.
On the product side, DeepSeek says V3.1 “outdoes” its own reasoning flagship on some agentic tasks and returns answers markedly faster. The company also flagged API price changes effective September 6 and rolled V3.1 across its app and web endpoints. Consolidation suggests fewer public model lines and a single, tuned workhorse.
The open challenge to closed economics
This release extends the most potent challenge to American AI business models: frontier-ish capability without paywall friction. Open-weight availability allows teams to test, fine-tune, and deploy privately, which trims both per-request costs and data-governance headaches. Enterprises that balked at sending sensitive prompts to third-party stacks now have options. That is leverage.
It also pressures incumbents on two fronts. First, pricing: if open alternatives deliver “good enough” for common workloads, premium tokens must earn their keep with standout accuracy, tools, or support. Second, velocity: an open model with broad contributor energy can absorb bug fixes, adapters, and quantizations at internet speed. Closed systems must compete with service quality and integration depth. The bar just moved.
Geopolitics in the weights
FP8 support tuned for “next-generation domestic chips” is not a throwaway line. It knits the model to a rising ecosystem that aims to de-risk foreign dependencies. If Chinese NPUs deliver acceptable throughput on FP8 workloads, local buyers can field modern LLMs without U.S. parts. Policy meets precision.
For Washington, this looks like capability seep through alternate channels rather than export waivers. For Beijing, it is proof that software choices—precision, kernels, runtimes—can widen the corridor for homegrown silicon. Both readings may be true.
Limits and unknowns
Two caveats deserve emphasis. First, public benchmarks are still a patchwork of community tests; production reliability, safety filters, and long-running agent behavior are less proven. Demos are not SLAs. Second, the parameter math invites confusion: the “685B” figure seen in some posts reflects the main model plus a separate multi-token prediction module bundled on Hugging Face. The core MoE remains in the 671B/37B-active configuration. Precision matters here.
Then there is the maintenance question. Open-weight releases are easy to clone and hard to sustain. Keeping a unified model fresh across chips, runtimes, and guardrails is grueling. We will know this worked if the V3.1 branch stays coherent rather than fracturing into a dozen incompatible forks. Watch the next two months.
Why this matters
Open-weight parity pressures closed-API pricing and forces incumbents to justify premiums with superior accuracy, trust, and enterprise support—not mere access.
FP8 tuning for domestic accelerators nudges AI toward parallel hardware ecosystems, diluting the bite of export controls and reshaping where value accrues in the stack.
❓ Frequently Asked Questions
Q: What exactly is FP8 and why does DeepSeek optimize for it?
A: FP8 (8-bit floating point) is a data format that lets AI models run faster while using less memory than traditional 16-bit formats. DeepSeek's FP8 optimization targets upcoming Chinese-made AI chips, allowing domestic hardware to run advanced models without relying on U.S. semiconductor technology.
Q: How does the "hybrid inference" mode actually work?
A: Users can toggle between fast standard responses and slower "deep thinking" mode via a button in DeepSeek's app. The fast mode handles routine queries instantly, while deep thinking mode activates complex reasoning for harder problems—similar to having both ChatGPT and OpenAI's o1 in one model.
Q: What hardware do I need to actually run this 700GB model?
A: The full V3.1 requires substantial computational resources—typically multiple high-end GPUs with combined memory exceeding 700GB. However, cloud providers and community developers will likely offer hosted versions and smaller quantized variants that run on consumer hardware within weeks.
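Rough back-of-envelope for that footprint, as a sketch that counts weights only and ignores KV cache, activations, and runtime overhead:

```python
# Approximate weight memory for DeepSeek-V3.1 at different precisions.
TOTAL_PARAMS = 671e9          # total MoE parameters
GPU_MEM_GB = 80               # e.g. one 80 GB accelerator

for name, bytes_per_param in [("FP8", 1), ("BF16", 2)]:
    weight_gb = TOTAL_PARAMS * bytes_per_param / 1e9
    gpus = -(-weight_gb // GPU_MEM_GB)  # ceiling division
    print(f"{name}: ~{weight_gb:.0f} GB of weights, >= {gpus:.0f} x {GPU_MEM_GB} GB GPUs")
```

At FP8 that is roughly 671 GB of weights—call it nine 80 GB cards before you account for anything else—which is where the ~700 GB figure comes from.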
Q: Why does MIT licensing matter more than other open source licenses?
A: MIT is among the most permissive licenses—companies can modify, commercialize, and redistribute V3.1 without restrictions or royalties. Unlike copyleft licenses such as the GPL, which require sharing modifications when you distribute them, MIT lets enterprises build proprietary products on top of DeepSeek's foundation without legal complications.
Q: What are the "agentic tasks" DeepSeek claims V3.1 excels at?
A: Agentic tasks involve AI systems performing multi-step workflows autonomously—like debugging code, conducting research across multiple sources, or managing complex business processes. V3.1's improved tool calling and reasoning capabilities let it chain actions together more reliably than previous versions.
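A hedged sketch of what tool calling means in code, using the OpenAI-style tools schema. The function name and arguments are invented for illustration, and whether V3.1's hosted endpoint accepts this exact payload is an assumption based on its OpenAI-compatible API:

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_KEY", base_url="https://api.deepseek.com")

# Hypothetical tool the model may choose to call as one step of an agentic workflow.
tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and return failures",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Debug the failing tests in ./src"}],
    tools=tools,
)

# If the model decides to call the tool, the arguments arrive as JSON to execute locally.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```

Chaining several of these call-execute-respond loops together, reliably and without a human nudging each step, is what "agentic" is shorthand for.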
Tech translator with German roots who fled to Silicon Valley chaos. Decodes startup noise from San Francisco. Launched implicator.ai to slice through AI's daily madness—crisp, clear, with Teutonic precision and sarcasm.
E-Mail: marcus@implicator.ai
ASML fills an 18-month CTO vacancy as the chip monopoly races to expand Eindhoven capacity, deploy AI tools, and navigate China export bans—while every AI model depends on machines only it can make. Succession meets systemic risk.
Microsoft licenses Harvard Medical School content for Copilot health queries while training models to replace OpenAI's infrastructure. The healthcare play addresses a billion-download gap and builds switching costs where credibility trumps speed.
Sam Altman laid out OpenAI's plan: one AI assistant that follows users everywhere, backed by a trillion-dollar compute buildout. The vision is coherent. The execution surface is vast, spanning chips, power, and partner risk across simultaneous bets.