A customer support bot running 10,000 conversations a day burns through 400 million tokens before lunch. That number comes from a OneUptime analysis published this month, and it gets worse: those conversations also generate one million trace spans and four million metric data points. Every single day. If you are running LLM features in production and your cost visibility amounts to checking provider dashboards once a month, you are already behind.

Gartner estimates that only 15 percent of GenAI deployments have LLM observability in place right now. The rest fly on instruments that weren't built for this. Traditional APM tools track latency and error rates. They have no idea what a reasoning token is, why it costs four times more than input, or why your five-turn agent conversation just ran up a bill fifteen times larger than the first message alone.

That gap should make engineering leads nervous. Judging by the hiring surge in "AI FinOps" roles that barely existed twelve months ago, it does.

The market responded with a wave of specialized tools. Gateways, observability platforms, caching layers, preflight estimators, CLI loggers. They don't all solve the same problem, and grabbing the wrong category first wastes months. And here's the part nobody warns you about: monitoring LLM traffic is itself expensive. One infrastructure team described a 200 percent jump in their Datadog bill after instrumenting AI workloads. The reason is structural. A single LLM call generates trace spans, token counts, latency metrics, evaluation scores, retry logs. Multiply by the volume of a production chatbot and the telemetry dwarfs what a traditional API ever produced. Ten to fifty times more data, depending on how many agent steps your pipeline runs. You need tools worth their own overhead.

I've spent the last three weeks testing most of them. Some are worth your afternoon. Others will waste it.

The Tool Stack at a Glance

AI-generated summary, reviewed by an editor. More on our AI guidelines.

Start by seeing, not by capping

Most teams grab a gateway with budget caps before they can answer a basic question: where is the money going? Backwards. Find out where you're bleeding first. Then decide what to tourniquet. The tools below start with that foundation and work up.

Traces and cost attribution

Langfuse is where most teams start evaluating. ClickHouse acquired it earlier this year, and the backend now chews through millions of traces on Postgres-plus-ClickHouse. What makes Langfuse worth the effort: hierarchical tracing that pins each agent step to model-specific pricing. Hand the resulting per-user, per-feature cost breakdowns to a product lead and watch the conversation shift. Self-hosting means prompt data stays on your machines. MIT license, so no surprise walls down the road. The catch is operational: running Langfuse yourself means maintaining Postgres, ClickHouse, Redis, and S3. That's four services before you've traced a single token. CI/CD quality gates need custom wiring too. If data sovereignty is a hard requirement, the overhead pays for itself. If you just need a cost dashboard by Friday, look elsewhere.
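The core arithmetic behind that per-step attribution is simple enough to sketch. Here's a minimal illustration of summing model-specific costs across an agent trace; the prices and model names below are illustrative placeholders, not current list prices, and real tracing tools pull them from maintained pricing tables.

```python
# Toy sketch of per-step cost attribution, the core of what a tracing
# tool like Langfuse computes. Prices are illustrative, not current rates.
PRICE_PER_MTOK = {  # (input, output) USD per million tokens -- assumed values
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def step_cost(model, input_tokens, output_tokens):
    """Cost of a single agent step at that model's own rates."""
    p_in, p_out = PRICE_PER_MTOK[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

def trace_cost(steps):
    """Sum model-specific costs across every step of one trace."""
    return sum(step_cost(m, i, o) for m, i, o in steps)

trace = [
    ("gpt-4o", 1200, 300),       # planner step
    ("gpt-4o-mini", 4000, 800),  # retrieval summarization
    ("gpt-4o", 2500, 600),       # final answer
]
print(round(trace_cost(trace), 6))
```

Group traces by user or feature tag before summing and you get the per-feature breakdowns described above.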

Helicone takes a radically different path. It sits as a proxy between your app and the model provider. Change the base URL in your OpenAI client. Done. You have observability. No SDK integration, no code changes. Built on Rust and deployed at Cloudflare's edge, Helicone adds under 10 milliseconds per request. The built-in semantic cache is the real draw: it matches semantically similar queries and serves cached responses, cutting redundant API calls without touching your application logic. SOC 2 certified, data residency options for enterprise buyers. One caveat that compliance teams catch late: the "omit logs" header stops Helicone from storing your prompts, but the data still transits their backend before being discarded. "Not stored" and "never transmitted" are different promises. Also, proxy architecture means Helicone sits in your critical path. If it goes down, your LLM calls go down. Self-hosting solves this, though it adds the operational burden the proxy was supposed to eliminate.

Arize Phoenix lives in a Jupyter notebook. It runs locally, captures traces via OpenTelemetry, and specializes in RAG pipeline analysis: embedding drift, retrieval relevance, document-level metrics that simple token counters miss. Its "LLM-as-judge" feature benchmarks quality before and after cost optimization, answering the question nobody else asks: did cheaper prompts make the answers worse? The tradeoff is scope. Phoenix shines in experimentation and research-phase work, but hits walls elsewhere. You'll need a Postgres backend for persistence. The open-source version ships without enterprise RBAC. And the OpenTelemetry learning curve? Steeper than just swapping a base URL.

Gateways and budget enforcement

Once you know where tokens go, these tools let you set hard limits. Finance people sleep better when caps exist, though a cap is only as useful as the budget behind it.

Portkey just open-sourced its entire gateway, including governance features that used to hide behind a paywall. Traffic through Portkey already exceeds a trillion tokens per day, and each request emits 40-plus data points. Budgets lock at the team level, the project level, and down to individual API keys. PII redaction and "do not track" mode come standard, which matters more than most engineers realize until legal sends the first email. The old knock on Portkey was pricing. Going open-source flattens that complaint, though a few advanced enforcement features still sit behind paid tiers. And here's the part nobody mentions in the launch post: your uptime now depends on theirs.

LiteLLM is the open-source proxy self-hosting teams build on. It normalizes 100-plus providers behind an OpenAI-compatible API and exposes budget controls at the key, user, and team level. Virtual keys with hard spending caps keep one team's experiment from devouring everybody else's budget. Worth knowing: a March 2026 bug report documented budget limits getting bypassed under certain naming conventions. The fix shipped, but the lesson stands. Caps break. It's a system that can fail if misconfigured, especially across 100 providers with different billing models. The Python-based proxy can also choke under extreme concurrency. Production deployments need Postgres for state.
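The virtual-key mechanism reduces to a small amount of bookkeeping. Here's a minimal sketch of hard-cap enforcement per key; the key names and limits are hypothetical, and a production proxy like LiteLLM persists this state in Postgres rather than memory.

```python
# Minimal sketch of per-key hard budget caps, the enforcement pattern
# LiteLLM's virtual keys provide. Key names and limits are hypothetical.
class BudgetExceeded(Exception):
    pass

class KeyBudget:
    def __init__(self):
        self.caps = {}   # key -> hard cap in USD
        self.spend = {}  # key -> running spend

    def set_cap(self, key, usd):
        self.caps[key] = usd
        self.spend.setdefault(key, 0.0)

    def charge(self, key, usd):
        """Record spend; refuse the call once the hard cap would be hit."""
        if self.spend[key] + usd > self.caps[key]:
            raise BudgetExceeded(f"{key} would exceed ${self.caps[key]:.2f}")
        self.spend[key] += usd

budgets = KeyBudget()
budgets.set_cap("team-search-experiment", 5.00)
budgets.charge("team-search-experiment", 4.50)      # within the cap
try:
    budgets.charge("team-search-experiment", 1.00)  # would hit $5.50
except BudgetExceeded as e:
    print("blocked:", e)
```

The March 2026 bug mentioned above is a reminder that the hard part isn't this logic; it's making sure every request path actually routes through it.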

Bifrost is the speed answer. Written in Go by Maxim AI, it delivers 11-microsecond overhead at 5,000 requests per second. That's roughly 50 times faster than LiteLLM. Hierarchical budgets enforce limits from the organization down to individual keys. Native Model Context Protocol support governs tool-calling agents with the same token precision as standard completions. The tradeoff: smaller community, less ecosystem support. Adding a new provider means changing Go code, not editing a config file. And you'll need Maxim AI for quality evaluation, since Bifrost focuses on routing and governance, not scoring.

Ecosystem picks

LangSmith is the natural fit for LangChain teams. Trace-level token attribution breaks down cost by component of a multi-step agent workflow. The prompt hub versions templates like code, and input/output masking keeps sensitive data out of the logs. If your stack is not heavily LangChain, though, LangSmith feels more like a framework control plane than a vendor-neutral cost system. Token metadata can also break across SDK updates. LangChain's own issue tracker documents cases where cost fields simply disappeared after version bumps.

W&B Weave appeals to organizations already running Weights & Biases for ML experiment tracking. Adding LLM cost attribution feels like a natural extension, not a new vendor relationship. Custom cost overrides let you map internal chargeback rates. The catch: Weave bills for telemetry ingestion and storage on top of what you pay the model provider. At scale, "how much does it cost to know what we're spending?" becomes a real question.

The solo developer stack

Tokscale is a Rust-powered CLI that scrapes usage data from Claude Code, Codex CLI, Gemini CLI, Cursor, and other agents that don't always expose full usage in provider dashboards. The terminal UI shows daily breakdowns and model splits. Catches shadow AI usage that would otherwise be invisible. Single-machine only, and it depends on parsing log formats that providers can change without warning.

LLM by Simon Willison stores every prompt, response, and token count in a local SQLite database. Query it with SQL, browse it with Datasette, export it whenever you want. Your interaction history never touches a third party. For developers who distrust SaaS telemetry and want complete data ownership, nothing else comes close. No dashboards, no team features, no enforcement. Just clean data and total control.
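The payoff of local SQLite storage is that cost analysis becomes an ordinary SQL query. The sketch below uses a simplified toy schema, not llm's actual table layout, to show the shape of that workflow.

```python
import sqlite3

# Sketch of querying a local prompt log with SQL, the workflow the llm
# CLI enables. This schema is a simplified stand-in, not llm's actual
# table layout; the row data is made up for illustration.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE responses (
    model TEXT, prompt TEXT, response TEXT,
    input_tokens INTEGER, output_tokens INTEGER, datetime_utc TEXT)""")
rows = [
    ("gpt-4o-mini", "summarize ticket 1", "...", 900, 150, "2026-02-01T09:00:00"),
    ("gpt-4o", "draft release notes", "...", 2000, 600, "2026-02-01T10:30:00"),
    ("gpt-4o-mini", "summarize ticket 2", "...", 950, 140, "2026-02-01T11:00:00"),
]
db.executemany("INSERT INTO responses VALUES (?,?,?,?,?,?)", rows)

# Total tokens per model, heaviest first
for model, total in db.execute("""
    SELECT model, SUM(input_tokens + output_tokens) AS total
    FROM responses GROUP BY model ORDER BY total DESC"""):
    print(model, total)
```

Point Datasette at the same file and the identical query becomes a browsable dashboard, no SaaS involved.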

When you're paying for a subscription, not an API

Everything above assumes you're calling APIs and paying per token. But a growing number of professionals hit the same wall from the other side: you're on Claude Max or ChatGPT Pro, paying $100 to $200 a month for a subscription, and you still run into rate limits. Anthropic tightened usage caps even for Max subscribers earlier this year. The frustration is real, and it's worse because providers rarely show you a clear usage meter. You get throttled. You don't know how close you were to the line. You don't know what pushed you over. And good luck planning your workday around a limit you can't see.

GitHub noticed. A small ecosystem of subscription-tier trackers has exploded in early 2026, almost entirely focused on Claude Code users who burn through their daily allowance without warning.

ccusage is the breakout hit, with over 12,500 stars in just weeks. It parses the JSONL session files that Claude Code writes locally, tallies tokens by model and conversation, and renders the breakdown in your terminal or as a web dashboard. No API keys needed, no network calls. It reads what's already on your disk. The approach is clever and fragile in equal measure: Anthropic could change the JSONL format tomorrow and break every parser that depends on it. For now, ccusage gives Claude Code subscribers the usage visibility that Anthropic itself refuses to provide.
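The tallying itself is a few lines of JSONL parsing. The record shape below is illustrative only; the real on-disk format is undocumented and, as noted, can change without notice.

```python
import json
from collections import Counter

# Sketch of the per-model tallying a tool like ccusage performs on local
# JSONL session files. The record shape is illustrative; the real format
# is undocumented and can change without warning.
sample_jsonl = """\
{"model": "claude-sonnet-4", "usage": {"input_tokens": 1200, "output_tokens": 400}}
{"model": "claude-opus-4", "usage": {"input_tokens": 3000, "output_tokens": 900}}
{"model": "claude-sonnet-4", "usage": {"input_tokens": 800, "output_tokens": 250}}
"""

totals = Counter()
for line in sample_jsonl.splitlines():
    record = json.loads(line)
    usage = record["usage"]
    totals[record["model"]] += usage["input_tokens"] + usage["output_tokens"]

for model, tokens in totals.most_common():
    print(f"{model}: {tokens} tokens")
```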

Claude Code Usage Monitor takes a different tack. Instead of a one-time report, it runs continuously and tracks your burn rate against the specific limits of your subscription tier, whether that's Pro, Max at $100, or Max at $200. It calculates remaining capacity and warns you before you hit the wall. Think of it as a fuel gauge for your daily coding budget. Over 7,400 stars, which tells you how many people got tired of guessing.

If you want something that lives in your menu bar rather than your terminal, Claude Usage Tracker and TokenEater both sit in your macOS tray. Claude Usage Tracker parses local logs and understands Anthropic's rolling five-hour billing windows, showing remaining capacity per window. TokenEater takes a different path entirely: it uses Anthropic's OAuth API to pull usage data directly from your account, skipping the local file parsing altogether. Cleaner data, but it depends on an OAuth flow that Anthropic hasn't officially documented for this purpose.
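Rolling-window accounting is the common thread in these trackers, and it's easy to sketch. The five-hour window here mirrors the reverse-engineered billing window described above; the token limit is a made-up number, since real allowances aren't published.

```python
from datetime import datetime, timedelta

# Sketch of rolling-window usage accounting like the reverse-engineered
# five-hour billing windows these trackers model. The token limit is a
# made-up assumption; real allowances aren't published.
WINDOW = timedelta(hours=5)
LIMIT = 500_000  # hypothetical token allowance per window

def tokens_in_window(events, now):
    """Sum token counts for events inside the trailing window."""
    cutoff = now - WINDOW
    return sum(tokens for ts, tokens in events if ts > cutoff)

now = datetime(2026, 2, 1, 12, 0)
events = [
    (datetime(2026, 2, 1, 6, 30), 200_000),  # outside the 5h window
    (datetime(2026, 2, 1, 8, 0), 150_000),
    (datetime(2026, 2, 1, 11, 0), 120_000),
]
used = tokens_in_window(events, now)
print(f"used {used}, remaining {LIMIT - used}")
```

The subtlety the menu-bar tools handle is that old usage ages out continuously, so remaining capacity can go up while you're idle.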

For the speed-obsessed, toktrack rewrites the JSONL parsing in Rust, claiming processing times under 100 milliseconds for large session histories. Probably overkill unless you're running hundreds of Claude Code sessions. But Rust developers gonna Rust.

The honest truth: this entire category exists because providers treat usage data as their property, not yours. Every tool on this list is a workaround for a dashboard that should already exist. They parse local files that could change format, call undocumented APIs that could get locked down, and rely on reverse-engineered billing windows. They work today. Whether they work next month depends on whether Anthropic and OpenAI decide to help or hinder. Until providers ship real usage meters, these scrapers and widgets are all you've got.

What to build first

Forget the tools for a second. The question that actually matters is sequencing. StackSpend's research on enterprise AI spending found that the most common failure looks like this: an engineering team sets rate limits and budget caps, then tells finance "we've got cost management handled." Finance takes their word for it. Nobody discovers until Q3 that rate limits and financial visibility are completely different things. A cap prevents overspend in one dimension. It tells you nothing about whether the spending that did happen was worth it.

Spending under $5,000 a month on LLM APIs? Langfuse or Helicone for visibility. Tokscale if most usage runs through coding agents. See where the money goes first. Then decide what to cap.

Once the bill crosses $5,000 a month, bolt on a gateway. Portkey if governance is the priority. LiteLLM if you want full infrastructure control. Enforcement stacked on measurement. Never substituted for it.

Past $100,000, the stack gets real: traces for attribution, gateways for enforcement, caching to cut waste, preflight estimation to route expensive requests toward cheaper models, and OpenTelemetry conventions so your telemetry doesn't die with the next platform migration. At that scale the tool choice matters less than the org chart. Someone has to own the number. Not a committee. One person who sees engineering usage, finance invoices, and the product roadmap at the same time.
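Preflight estimation, the last piece of that stack, can be sketched in a few lines: guess the request's cost before sending, and downgrade to a cheaper model when it would blow a per-request budget. The 4-characters-per-token heuristic, the prices, and the threshold below are all rough assumptions, not provider guarantees.

```python
# Sketch of preflight cost estimation and routing. The chars-per-token
# heuristic, prices, and budget threshold are rough assumptions.
PRICE_IN = {"gpt-4o": 2.50, "gpt-4o-mini": 0.15}  # USD per M input tokens

def estimate_tokens(text):
    return len(text) // 4  # crude heuristic; exact counts need a tokenizer

def route(prompt, preferred="gpt-4o", fallback="gpt-4o-mini", max_usd=0.01):
    """Downgrade to the cheaper model when the estimate busts the budget."""
    est_cost = estimate_tokens(prompt) * PRICE_IN[preferred] / 1_000_000
    return preferred if est_cost <= max_usd else fallback

short = "Summarize this ticket in two sentences."
long_doc = "x" * 20_000  # stand-in for a large pasted document
print(route(short))     # cheap enough for the frontier model
print(route(long_doc))  # too expensive: routed to the cheaper model
```

Production routers layer quality constraints on top of this, which is exactly why the evaluation tooling earlier in the stack matters.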

Gartner projects the GenAI market will exceed $25 billion this year, reaching $75 billion by 2029. Token bills climb whether you watch the dashboard or not. But the teams that instrument early, in the right order, will be the ones who can explain the number when the board asks. Everyone else will be reconstructing last quarter from provider invoices and Slack threads.

Frequently Asked Questions

Which LLM token monitoring tool should I start with?

Langfuse or Helicone for cost visibility. Langfuse if you need self-hosting and data sovereignty. Helicone if you want the fastest setup with one line of code. Add enforcement (Portkey, LiteLLM) only after you know where the money goes.

How much does LLM monitoring itself cost?

AI workloads generate 10-50x more telemetry than traditional services. Teams report Datadog bill increases of 40-200% after adding AI monitoring. Open-source self-hosted tools like Langfuse and Phoenix avoid per-span SaaS charges.

What is the fastest LLM gateway for token monitoring?

Bifrost, written in Go, delivers 11-microsecond overhead at 5,000 requests per second, roughly 50x faster than Python-based LiteLLM. It trades community size and provider flexibility for raw performance.

Can I track AI token usage from coding agents like Claude Code or Cursor?

ccusage (12,500+ stars) parses Claude Code's local JSONL files and shows token breakdowns by model. Claude Code Usage Monitor tracks burn rate against your specific subscription tier. Tokscale covers Claude Code, Cursor, and Codex CLI. All run locally with zero cloud dependency.

What percentage of AI deployments have cost observability?

About 15% as of early 2026, according to Gartner's figures. The firm expects that to hit 50% by 2028, which means the vast majority of production AI right now runs with no cost instruments at all. Flying blind.



Editor-in-Chief and founder of Implicator.ai. Former ARD correspondent and senior broadcast journalist with 10+ years covering tech. Writes daily briefings on policy and market developments. Based in San Francisco. E-mail: [email protected]