A Chinese lab shrinks documents into vision tokens—then forces a bigger question: should LLMs read as images?
DeepSeek says a 1,000-word article can compress to ~100 vision tokens with ~97% fidelity. In practice, that means far longer documents fit into a model’s context without blowing up compute. The headline capability is OCR, but the provocation is deeper: if images encode text an order of magnitude more efficiently, why are we still feeding models discrete tokens at all?
What’s actually new
Not just better OCR. A compression strategy that challenges how language models represent information.
DeepSeek-OCR treats text as an image first. It then encodes that page into a small set of “vision tokens” that a decoder turns back into text—or Markdown, tables, and charts. The claim: roughly 10× fewer tokens at near-text accuracy. That’s the tension.
The Breakdown
• DeepSeek-OCR compresses documents to ~100 vision tokens with 97% accuracy—10× fewer than text tokens for equivalent content
• Beats GOT-OCR 2.0 using 100 tokens vs 256, and MinerU 2.0 with under 800 vs 6,000+ tokens per page
• A practitioner validated the system on mismatched hardware in about 40 minutes, getting usable results despite version conflicts
• Training models from scratch on visual text remains unsolved—current approach requires existing token-based models as foundation
How it works
The system has two main pieces. DeepEncoder handles the page; a lightweight text generator decodes it.
DeepEncoder combines local segmentation from SAM with global context from CLIP. A 16× compressor sits between them, slashing image tokens before they reach the heavy vision stack. Start with a 1,024×1,024 page (4,096 tokens). After SAM and the compressor, only 256 tokens go to CLIP. That’s where the savings show up.
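The arithmetic is easy to check. A minimal sketch, assuming standard 16×16 pixel patches (the patch size is an assumption; the 16× compression factor and the 4,096 → 256 figures are the ones quoted above):

```python
# Rough token arithmetic for the DeepEncoder pipeline (illustrative only).
# Assumes 16x16 pixel patches, which is how a 1,024x1,024 page yields 4,096 tokens.

def encoder_token_budget(side_px: int = 1024, patch_px: int = 16, compression: int = 16) -> dict:
    """Estimate tokens at each stage of a SAM -> compressor -> CLIP stack."""
    patch_tokens = (side_px // patch_px) ** 2        # 64 x 64 = 4,096 patch tokens into SAM
    compressed_tokens = patch_tokens // compression  # 16x compressor -> 256 tokens into CLIP
    return {
        "patch_tokens_into_SAM": patch_tokens,
        "tokens_into_CLIP": compressed_tokens,
    }

print(encoder_token_budget())
# {'patch_tokens_into_SAM': 4096, 'tokens_into_CLIP': 256}
```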
Compression adapts to the document. Simple slides can use 64 tokens. Books and reports run around 100. Dense newspapers can demand “Gundam mode” with up to 800. The decoder—built on a 3B-parameter MoE with ~570M active parameters—turns those tokens into text, Markdown, or structured outputs.
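As a rough illustration of those adaptive budgets (the lookup and its keys are invented for illustration; only the token counts come from the figures above):

```python
# Illustrative vision-token budgets per document type, using the numbers quoted above.
# The mapping itself is a toy, not the model's actual configuration.

TOKEN_BUDGETS = {
    "slide": 64,        # simple, sparse layouts
    "book_page": 100,   # typical books and reports
    "newspaper": 800,   # dense pages ("Gundam mode" ceiling)
}

def pick_budget(doc_type: str) -> int:
    """Return a vision-token budget for a document type, defaulting to the book setting."""
    return TOKEN_BUDGETS.get(doc_type, 100)

print(pick_budget("newspaper"))  # 800
```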
Benchmarks and scale
On OmniDocBench, DeepSeek beats GOT-OCR 2.0 using 100 vision tokens versus 256. It also tops MinerU 2.0 with under 800 tokens while MinerU needs more than 6,000 tokens per page. That’s a stark efficiency gap.
Throughput numbers are equally aggressive. DeepSeek reports more than 200,000 pages per day on a single NVIDIA A100. At twenty servers with eight A100s each, they claim ~33 million pages daily. If those figures hold, this is industrial OCR, not a demo.
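The cluster claim is straightforward multiplication; a quick sanity check using only the figures quoted above:

```python
# Back-of-envelope check of the reported throughput numbers.
pages_per_gpu_per_day = 200_000      # DeepSeek's figure for a single A100
servers, gpus_per_server = 20, 8

cluster_pages_per_day = pages_per_gpu_per_day * servers * gpus_per_server
print(f"{cluster_pages_per_day:,}")  # 32,000,000 -- consistent with the ~33 million claim
```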
The model targets real documents, not just clean scans. It handles text plus diagrams, chemical formulas, and geometric figures across ~100 languages. Formatting can pass through intact. When needed, it can also emit plain text.
A field test, warts and all
Developer Simon Willison pushed the release through mismatched hardware—and still got it running. He spun up a Docker container on an NVIDIA Spark box and delegated setup to an AI coding agent. PyTorch initially failed on the GPU’s newer compute capability, then succeeded after switching to a newer wheel. Inference ran.
Three modes stood out. “Free OCR” was quickest and extracted clean text. “Markdown” preserved structure at a modest speed cost. “Grounding” returned bounding boxes and coordinates along with text, trading speed for richer output. The important bit: the system produced usable results without a perfect environment. Reality rarely is.
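For a sense of how those three modes might sit behind one call, here is a hypothetical wrapper. Everything in it is a placeholder: run_ocr stands in for whatever inference entry point the released code exposes, and the real prompts, arguments, and return types will differ.

```python
# Hypothetical dispatch over the three output modes described above.
# run_ocr() is a stand-in for the actual DeepSeek-OCR inference call.

from dataclasses import dataclass

def run_ocr(image_path: str, want_layout: bool = False, want_boxes: bool = False):
    """Placeholder for the real model call; swap in the released inference code here."""
    raise NotImplementedError("replace with the actual DeepSeek-OCR entry point")

@dataclass
class OcrResult:
    text: str
    boxes: list | None = None  # populated only in grounding mode

def extract(image_path: str, mode: str = "free") -> OcrResult:
    """'free' = fastest plain text, 'markdown' = structure preserved,
    'grounding' = text plus bounding boxes (slowest, richest output)."""
    if mode == "free":
        return OcrResult(text=run_ocr(image_path))
    if mode == "markdown":
        return OcrResult(text=run_ocr(image_path, want_layout=True))
    if mode == "grounding":
        text, boxes = run_ocr(image_path, want_layout=True, want_boxes=True)
        return OcrResult(text=text, boxes=boxes)
    raise ValueError(f"unknown mode: {mode!r}")
```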
The paradox behind the trick
Images contain far more raw data than plain text. Yet DeepSeek shows a page-as-image can be represented inside a model in fewer tokens than the page-as-text. How?
Because token IDs are efficient for transmission but inefficient for representation. A text token is a single integer on the wire, then expands to a dense embedding inside the model—thousands of floating-point values encoding meaning and context. Image tokens start life as continuous embeddings. They can pack information more tightly than a sparse scatter of discrete text tokens.
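A rough way to see the asymmetry in numbers, assuming an illustrative 4,096-dimensional hidden size and roughly 1,300 text tokens for a 1,000-word article (both figures are assumptions, not from the paper):

```python
# Rough comparison of the representation budget inside the model.
# hidden_dim and the token counts are illustrative, not measured values.

hidden_dim = 4096

text_tokens = 1300                       # discrete IDs on the wire...
text_floats = text_tokens * hidden_dim   # ...but each expands to a dense vector inside the model

vision_tokens = 100                      # continuous embeddings from the start
vision_floats = vision_tokens * hidden_dim

print(f"text:   {text_floats:,} floats across {text_tokens} positions")
print(f"vision: {vision_floats:,} floats across {vision_tokens} positions")
# Same width per position, ~13x fewer positions: the savings come from packing
# more meaning into each continuous vision token, and attention cost scales
# with the number of positions.
```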
Analysts call this “optical compression.” It suggests a path to bigger effective contexts without bigger models. One idea from the paper: treat conversation history like memory. Keep fresh turns at high visual resolution. Blur older turns to cheaper, lower-resolution tokens. The model remembers what matters. The rest fades.
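A minimal sketch of that memory schedule, with the resolutions and cutoffs invented for illustration:

```python
# Toy "optical memory" schedule: newer turns keep high resolution,
# older turns are stored as smaller images and therefore fewer vision tokens.
# The specific resolutions and age cutoffs are invented for illustration.

def resolution_for_turn(turns_ago: int) -> int:
    """Pick an image side length (pixels) based on how old a conversation turn is."""
    if turns_ago < 5:
        return 1024   # recent: full fidelity
    if turns_ago < 20:
        return 640    # mid-range: blurrier, cheaper
    return 256        # old: barely legible, nearly free

def approx_vision_tokens(side_px: int, patch_px: int = 16, compression: int = 16) -> int:
    """Same token arithmetic as the encoder sketch above."""
    return ((side_px // patch_px) ** 2) // compression

for age in (0, 10, 50):
    side = resolution_for_turn(age)
    print(f"turn {age:>2} turns ago -> {side}px -> ~{approx_vision_tokens(side)} vision tokens")
# 1024px -> ~256 tokens, 640px -> ~100, 256px -> ~16
```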
The unanswered training question
DeepSeek shows how to decode compressed vision tokens into text. It does not show how to pretrain a foundation model that reads vision-first from the start. That’s a hard gap.
LLMs work because “predict the next token” is simple to specify and easy to grade. With images of text, what’s the training target? You can split a page into image-words and predict the next one, but it’s slow and messy to evaluate. Or you can predict text tokens after reading images—at which point you’ve reintroduced the very tokens this approach tries to avoid.
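For contrast, here is what the easy-to-grade objective looks like in practice: a minimal PyTorch sketch of next-token cross-entropy, with illustrative shapes.

```python
# Why "predict the next token" is easy to grade: the target is a single class ID,
# so standard cross-entropy gives an exact, cheap training signal.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 32_000, 8
logits = torch.randn(seq_len, vocab_size)           # model predictions, one row per position
targets = torch.randint(0, vocab_size, (seq_len,))  # the actual next tokens: unambiguous labels

loss = F.cross_entropy(logits, targets)             # one number, trivially comparable across runs
print(loss.item())

# Images of text have no equally clean target: "the next word-image" is a patch of
# pixels with many acceptable renderings, so grading a prediction means either pixel
# losses (blurry, slow) or decoding back to tokens, which reintroduces the tokenizer.
```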
Until someone demonstrates end-to-end pretraining on visual text, optical compression will sit beside, not replace, tokens. It’s a powerful adapter, not a new brain.
What changes if it sticks
If compression holds at scale, token economics change. Longer briefs, contracts, and books fit in context windows that used to choke. Structured extraction looks cleaner: financial charts become Markdown tables; figures become vectors. Enterprises care about that.
It also opens a more forgiving deployment story. Willison’s test shows this can work through warnings, version mismatches, and imperfect drivers. Systems that survive real-world entropy tend to spread.
The caveats
This is lab-reported performance plus an early practitioner run. Edge cases will surface: low-contrast scans, complex layouts, non-Latin scripts, degraded faxes. Vector graphics remain tricky. And compression is a dial, not a constant; “Gundam mode” costs tokens when pages get gnarly. One more caveat: benchmarks aren’t production.
Why this matters
- A 10× token reduction for documents reshapes context limits, pricing, and how enterprises move dense paperwork through AI.
- The first group to crack visual-first pretraining for text could reset the stack; until then, optical compression is the strongest bridge we have.
❓ Frequently Asked Questions
Q: What hardware do I need to run DeepSeek-OCR?
A: The model is 6.6GB and runs on PyTorch with CUDA. A single NVIDIA A100 processes 200,000+ pages daily. Willison got it working on an NVIDIA Spark using PyTorch 2.9.0 with CUDA 13.0 despite compute capability mismatches—the system threw warnings but ran inference successfully. Code and weights are open-source under MIT license on GitHub and Hugging Face.
Q: Why can't we train foundation models on images of text from scratch?
A: Current LLMs work because "predict next token" provides clear training targets and simple accuracy measurement. With images of text, there's no clean equivalent. Predicting the next word-image is slow and hard to evaluate accurately. Predicting text tokens after reading images just reintroduces tokenization. DeepSeek's approach fine-tunes existing token-based models to decode visual representations—it doesn't replace the token foundation.
Q: Does this work well for languages other than English?
A: DeepSeek trained on roughly 100 languages, including 25 million pages in Chinese and English specifically. The system handles multilingual text, but compression efficiency varies by script complexity. Dense character systems and right-to-left scripts may require higher token budgets. The benchmark tests primarily used English and Chinese documents, so edge-case performance for other languages remains less documented.
Q: How fast is this compared to traditional OCR systems?
A: DeepSeek processes a 3,503×1,668 pixel image in 24 seconds for basic text extraction ("Free OCR" mode). Structured Markdown output takes 39 seconds. Full grounding with bounding boxes needs 58 seconds. Traditional OCR often processes faster but requires thousands more tokens for equivalent accuracy—MinerU 2.0 uses over 6,000 tokens per page where DeepSeek uses under 800.
Q: Can this help with long conversation histories in chatbots?
A: Yes—DeepSeek proposes "visual decay" where older conversation turns get stored as progressively lower-resolution images, mimicking human memory fading. Recent exchanges stay at high fidelity while older context compresses further. This extends effective context windows without linear token cost increases. The paper suggests this approach for any application needing long-term memory, though production implementations haven't been documented yet.