Mercor’s APEX says GPT-5 scores 64% on real work—still not production-ready

Mercor's new benchmark tested AI on real work from Goldman Sachs, McKinsey, and top law firms. GPT-5 scored 64%—impressive on paper, but not enough for autonomous operation. The gap between 'helpful sometimes' and 'production-ready' remains wide.

💡 TL;DR - The 30-Second Version

🎯 Mercor's APEX benchmark tested 23 AI models on 200 real professional tasks from Goldman Sachs, McKinsey, Latham & Watkins, and Mount Sinai—GPT-5 topped out at 64.2%, Grok 4 at 61.3%, Gemini 2.5 Flash at 60.4%.

⚠️ None of the tested models meet the "production bar" for autonomous work—a model that gets only two-thirds of a financial model or diagnosis correct creates rework, not efficiency.

💰 Mercor spent over $500,000 developing the 200 tasks, recruiting roughly 100 experts averaging 7+ years of experience who earn $200+ hourly—matching their previous compensation at elite firms.

📊 Performance varies by domain: law scores highest (GPT-5 at 70.5%) because tasks reward structured synthesis, while medicine scores lowest (GPT-5 at 62.0%) because clinical judgment and error consequences matter most.

🏢 Results land amid growing skepticism—MIT found 95% of organizations see zero AI ROI, while Harvard-Stanford researchers identify "workslop" output that looks polished but fails to advance actual work.

🔮 Enterprise deployment economics hinge on cutting oversight costs, but APEX confirms expert review remains mandatory—partial completion at scale can help with drafts but doesn't eliminate the human in the loop.

A professional-grade benchmark shows progress, not parity.

AI vendors promise office automation. A new benchmark says the machines still need chaperones. Mercor's first AI Productivity Index (APEX), a professional-grade test of models on real deliverables, finds that even the leaders fall well short of hands-off reliability. (See the APEX benchmark overview.)

APEX v1.0 tested 23 models on 200 tasks written and reviewed by working experts from Goldman Sachs, McKinsey, Latham & Watkins, and Mount Sinai. These aren’t trivia items or coding katas. They are valuation builds, legal research memos, patient workups, and competitive analyses—work products that would take a human between one and eight hours and that come with source packets averaging 26,000 tokens. Every task has a granular rubric—roughly 29 criteria on average—so grading looks more like unit tests than vibes.
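
To make the "unit tests, not vibes" framing concrete, here is a minimal sketch of rubric-style grading: each criterion is a pass/fail check, and the task score is the fraction of checks passed. The criterion wording and the `Criterion`/`task_score` names are illustrative, not Mercor's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    """One rubric item, graded pass/fail like a unit test."""
    description: str
    passed: bool

def task_score(criteria: list[Criterion]) -> float:
    """Fraction of rubric criteria the deliverable satisfies."""
    return sum(c.passed for c in criteria) / len(criteria)

# Hypothetical valuation-build task with three (of roughly 29) criteria.
rubric = [
    Criterion("Revenue growth assumptions cite the sources packet", True),
    Criterion("WACC derivation shows each input", True),
    Criterion("Terminal value uses the stated exit multiple", False),
]
print(f"Task score: {task_score(rubric):.1%}")  # -> 66.7%
```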

Top-line: GPT-5 led with 64.2%. Grok 4 followed at 61.3%. Gemini 2.5 Flash landed at 60.4%. Mercor’s verdict is blunt: none of the tested systems clear the “production bar” for autonomous execution in the four professions covered. Put differently, a model that reliably gets only two-thirds of a deal model or a discharge plan right creates rework, not leverage. That stings more in domains where errors carry real risk.

What APEX actually measures

The benchmark tries to mirror how knowledge work is organized and valued. Mercor recruited about 100 vetted specialists—bankers averaging 8.7 years’ experience at bulge-bracket firms; Big Law attorneys from top programs; MBB consultants; primary-care physicians with frontline credentials. Experts mapped their time budgets by workflow and then weighted the test set accordingly: if financial modeling consumes 30% of an analyst’s week, it’s 30% of the banking eval. Each task ships with a sources packet and a prompt-specific checklist. That keeps grading objective and reduces the temptation to reward grandstanding.
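
As a rough illustration of that weighting logic, the sketch below combines hypothetical per-workflow scores using time-budget shares. The workflow names and numbers are invented for the example, not APEX data.

```python
# Hypothetical time budgets (share of an analyst's week) and per-workflow scores.
time_share = {
    "financial_modeling": 0.30,
    "diligence_memos": 0.25,
    "market_updates": 0.25,
    "client_materials": 0.20,
}
workflow_scores = {
    "financial_modeling": 0.58,
    "diligence_memos": 0.66,
    "market_updates": 0.70,
    "client_materials": 0.61,
}

# Domain score = each workflow's score weighted by how much of the week it consumes.
domain_score = sum(time_share[w] * workflow_scores[w] for w in time_share)
print(f"Weighted banking score: {domain_score:.1%}")  # -> 63.6%
```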

To scale scoring, APEX uses a panel of three LLM judges and takes the majority vote per criterion, a setup Mercor says aligns with human raters about 89% of the time. It's not perfect, but it's consistent enough to compare models across hundreds of long-form outputs. That matters.
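
The judging setup amounts to a per-criterion majority vote across three verdicts. A minimal sketch, assuming each judge returns a boolean pass/fail per criterion; the verdict data here is made up.

```python
from collections import Counter

def majority_vote(verdicts: list[bool]) -> bool:
    """A criterion passes if at least two of the three judges say it passes."""
    return Counter(verdicts).most_common(1)[0][0]

# Hypothetical verdicts from three LLM judges on four criteria of one task.
verdicts_per_criterion = [
    [True, True, False],   # passes (2 of 3)
    [False, False, True],  # fails
    [True, True, True],    # passes
    [False, True, False],  # fails
]
passed = [majority_vote(v) for v in verdicts_per_criterion]
print(passed, f"-> task score {sum(passed) / len(passed):.0%}")  # -> 50%
```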

The production gap, by domain

Performance clusters reflect the nature of the work. Law is currently the easiest hill: models averaged 56.9%, and GPT-5 reached 70.5% on legal tasks, which lean on retrieval and structured synthesis. Medicine was hardest, with a 47.5% average and GPT-5 at 62.0%, because clinical reasoning punishes hedging and shortcuts. Investment banking and consulting sat in the middle. The lesson is not “law is solved.” It’s that models do better when the task rewards organized citation and pattern recall rather than judgment with human stakes. Reliability, not brilliance, is the bar. And the bar is high.

Price tiers don’t guarantee performance

One awkward subplot: billing rate isn't destiny. Within model families, the "pro" or "bigger" variant didn't always win. Opus 4 trailed Sonnet 4 on most metrics. o3 Pro edged o3 by just 0.1 percentage point on mean score. Gemini 2.5 Flash slightly beat Gemini 2.5 Pro. Open-source models showed pockets of strength—Qwen 3 235B was seventh overall—but trailed closed systems on average by roughly nine points. "Thinking" modes correlate with higher scores but are confounded by recency and training differences. The neat ladder of capability and cost looks messy when judged against economically grounded tasks. That's useful signal for buyers.

Benchmarks meet enterprise reality

APEX lands in a skeptical moment. An MIT study recently reported that 95% of organizations saw no measurable return from AI initiatives. A Harvard-Stanford team coined “workslop” for outputs that look polished yet fail to advance the task. In response, labs are publishing their own yardsticks: OpenAI’s GDPval and Anthropic’s Economic Index both argue that well-scoped, repetitive slices of work can already be automated or accelerated. APEX threads the needle. It shows meaningful progress on hard, long-form professional tasks—while confirming that “substantial human oversight” remains a requirement, not a nicety.

That nuance matters for deployment math. Oversight time is overhead. If the human still has to audit every calculation, citation, and recommendation, the productivity win shrinks fast. Partial completion at scale can help—drafts, frameworks, first passes—but it can also flood teams with artifacts that require even more coordination and cleanup. Tools must raise the floor without filling the inbox.

The new cost of credible testing

Mercor says it spent more than $500,000 to build the first 200 cases. That’s roughly $2,500 per task once you include expert sourcing, interviews, trial assignments, separate reviewers, and rubric design. It points to where evaluation is headed: away from crowdwork and toward specialist-authored tests that mirror real deliverables. That should tighten the link between scores and economic value. It also raises barriers to entry and creates a new contamination surface; as these datasets become more valuable, the temptation to train on lookalike tasks increases. Keeping a held-out set helps. It’s not a cure-all.

Limits—and the obvious next step

APEX v1.0 is text-in, text-out. That’s intentional, but it understates how much real work relies on tools, files, and multi-turn coordination. Mercor plans “APEX World,” a simulated suite with SharePoint, Google Workspace, and APIs so models can search, calculate, and orchestrate. The harder leap is behavioral: moving from passing unit-tested criteria to sustaining reliability over hours of messy context and changing constraints. That’s where trust is earned—or lost. Progress is visible. Autonomy is not yet.

Why this matters

  • Reliability, not raw capability, determines whether AI saves time or creates rework; today’s 60–70% scores aren’t enough for unsupervised use in law, medicine, banking, or consulting.
  • Enterprise ROI hinges on cutting oversight costs; APEX shows value in drafts and scaffolds, but confirms that expert review remains mandatory for professional-grade outputs.

❓ Frequently Asked Questions

Q: How does APEX grade AI responses if tasks take humans 1-8 hours?

A: APEX uses three AI judges—o3, Gemini 2.5 Pro, and Sonnet 4—that vote on each of a task's roughly 29 criteria. The majority vote determines pass/fail for each criterion. This panel agrees with human graders about 89% of the time and lets Mercor score hundreds of long-form outputs without manual review of every model's work.

Q: Why did medicine score lowest at 47.5% when law scored 56.9%?

A: Medical tasks require clinical judgment where errors carry direct patient harm, while legal tasks emphasize research and structured synthesis—current model strengths. A consultant can iterate on a flawed market analysis, but a physician fact-checking an AI diagnostic report adds time rather than saving it. The stakes change the math on what counts as useful.

Q: What is "APEX World" and when will it launch?

A: APEX World is Mercor's planned upgrade that adds simulated work environments—SharePoint, Google Workspace, APIs—so models can search files, run calculations, and coordinate across tools instead of just processing text. No launch date announced. The shift tests whether models can sustain reliability over hours of messy context, not just pass unit-tested criteria on isolated tasks.

Q: Why does APEX cost $2,500 per task when other benchmarks use crowdworkers?

A: APEX requires experts averaging 7+ years at Goldman Sachs, McKinsey, or Mount Sinai earning $200+ hourly—matching their previous compensation. Each task needs source documents, workflow mapping, granular rubrics with ~29 criteria, and separate reviewer approval. Earlier benchmarks paid crowdworkers a few dollars hourly. Economic relevance requires economic-grade talent.

Q: How does APEX compare to academic benchmarks like MMLU or GPQA?

A: APEX correlates at 0.79 with existing benchmarks—similar to how MMLU, GPQA, and others correlate with each other (0.58-0.84 range). The difference isn't radical divergence but focus: APEX tests deliverables that generate revenue if completed correctly, while academic benchmarks test abstract capabilities like graduate-level physics or reasoning puzzles. Both measure intelligence; APEX measures economic utility.
