Utkarsh Kanwat
· 5 min read

Mistral OCR 4: The Frontier of Document OCR

An independent benchmark of Mistral OCR 4 against frontier vision models and the open parsing stack on olmOCR-Bench. It lands in the top accuracy tier and is the only system here that also returns grounded, confidence-scored output an agent can cite.

Mistral put out OCR 4 last week. I ran some pages through the API and the response wasn't the markdown dump I expected. You get a list of blocks. Each block has a type, so it tags headings and tables and equations and signatures separately, plus pixel coordinates and a confidence score for every word. The actual text is just one field on the block. It handles 170 languages, reads PDFs and Office files with no conversion step, and runs in a single container if you'd rather not send documents to an API at all. They're shipping it as the ingestion layer for their Search Toolkit, so it's clearly meant to feed a retrieval system or an agent.
None of that matters if the read underneath is wrong, so I wanted to see how it holds up on hard pages against what people actually use. I ran nine systems through olmOCR-Bench, Allen AI's benchmark that grades specific assertions about each page rather than overall text similarity. The nine: OCR 4, four frontier vision models doing OCR (GPT-5.5, Gemini 3.1 Pro, Claude Opus 4.8, Qwen3-VL), and the parsers people actually reach for (Docling, Marker, unstructured, LlamaParse). Forty pages of the stuff that tends to wreck parsers, old scans, multi-column layouts, dense tables, equations, with 223 assertions across them. The harness and full scorecard are in the repo. I left out the specialized research models that top the raw-accuracy leaderboards, PaddleOCR-VL, MinerU, Chandra; they return structured output too and lead on other benchmarks, so rather than guess how they'd land here I scoped this to the production APIs and parsers people deploy.
Here's how they came out.
Parsertablesreading-orderscansmathOverall
Mistral OCR 494.373.357.468.474.9
Qwen3-VL 235B (open)90.053.361.771.172.6
GPT-5.592.956.753.268.471.3
Claude Opus 4.892.960.044.772.471.3
Gemini 3.1 Pro87.150.051.175.070.4
Marker65.766.748.944.755.2
Docling64.343.321.321.137.7
LlamaParse51.440.048.92.632.7
unstructured45.760.040.42.631.8
Overall pass rate on olmOCR-Bench, 40 docs and 223 tests
Overall pass rate on olmOCR-Bench, 40 docs and 223 tests
The numbers are lower than the 85.20 Mistral reports, which is expected, these are hard pages and I graded them tight, so everyone takes a hit. Two things stand out. The top of the table is tight: the five strongest systems sit within a few points of each other across 223 tests, close enough that I wouldn't read much into the exact ordering among them. But all five are well clear of the dedicated parsers, Marker, the best of those, is about twenty points back, and the rest trail by more. OCR 4 sits at the top, and it leads the two structure-heavy columns, tables and reading order. Math is the column I trust least, heuristic LaTeX matching rather than rendering, and the ordering holds with or without it.
The more useful split isn't the ranking, it's what comes back. The four vision models read the page about as well as OCR 4, but they hand back only text, no coordinates tied to what they read, so you take the transcription on trust. (Gemini and Qwen-VL have separate visual-grounding modes, but that's a different request, point at this object, not a typed, per-word layout for the whole page.) OCR 4 returns every block typed, boxed, and confidence-scored, the way a dedicated OCR engine does, while reading at the level of the frontier models. That combination, frontier-level reading plus grounded output, is the thing I didn't find anywhere else in the set.
Accuracy against structured grounding. OCR 4 is the only model in the accurate-and-grounded corner.
Accuracy against structured grounding. OCR 4 is the only model in the accurate-and-grounded corner.
Here's one page where it mattered. One of the test docs is a scanned immunisation table from a medical journal. Two of its cells, one reading "Local council only" and one about incomplete immunisations, got skipped by both GPT-5.5 and Claude. OCR 4 caught them, and it gave me the box they sit in, so I could pull up the page and confirm it read them right. With the vision models there's no box to check against, you just trust the text.
OCR 4's bounding box for a table cell two of the vision models dropped
OCR 4's bounding box for a table cell two of the vision models dropped
It doesn't win everything, which is what I'd expect; a clean sweep would make me doubt my own grading. Qwen3-VL, which is open weight, reads old scans better than OCR 4. Gemini and Claude come out ahead on math, the column I trust least. If you've got a pile of bad photocopies and you don't care where on the page things came from, a vision model is a fine pick. It's the moment you need to know where a value sits that the vision models drop out, the best of them on text still gives you nothing to locate it with.
The harness and the full scorecard are in the repo if you want to run it yourself or on more pages. Several of these models read a page about equally well now. Fewer give that read back as typed values with boxes and confidence scores, the structure an agent needs to point at where a number came from. OCR 4 is the one here that does both.

The harness, scoring code, and full scorecard are on GitHub. OCR 4's official scores (85.20 olmOCR-Bench, 93.07 OmniDocBench, 72% blind win rate) come from the OCR 4 announcement. No affiliation with any vendor here.

Discussion

Subscribe

New essays on AI engineering, by email. No spam. My writing has reached 400k+ readers so far.

No spam, unsubscribe anytime. I write about AI engineering.