TL;DR

LlamaIndex just released ParseBench, the first document parsing benchmark designed for AI agents — not for humans reading PDFs. It tests 14 parsers on 2,078 enterprise pages with 169,011 rules across 5 dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. Headline: LlamaParse Agentic leads at 84.88%; Gemini 3 Flash is the best external VLM at 71.0%; GPT-5 Mini and Anthropic Haiku 4.5 collapse on visual grounding (<10%). The benchmark is open-source (Apache-2.0) and available on HuggingFace, GitHub, and arXiv.

[Figure: ParseBench overview — 5 capability dimensions for evaluating document parsers]

What's new

The bar for OCR has shifted. As LlamaIndex puts it: from "good enough for a human to read" to "reliable enough for an agent to act on." Existing benchmarks like OmniDocBench, OCRBench v2, and olmOCR-Bench rely on text-similarity metrics (BLEU, edit distance) that miss agent-critical failures — a transposed table header, a chart reduced to raw OCR text, a strikethrough silently dropped. ParseBench introduces what the team calls semantic correctness: does the parsed output preserve enough structure and meaning for correct downstream decisions?

The benchmark covers ~2,000 human-verified pages from real enterprise documents — insurance (SERFF filings), financial reports, government submissions — stratified across 5 capability dimensions:

  • Tables — structural fidelity for merged cells, hierarchical headers, cross-page continuity.
  • Charts — exact data-point extraction with correct labels from bar/line/pie/compound charts.
  • Content Faithfulness — omissions, hallucinations, and reading-order violations.
  • Semantic Formatting — strikethrough, super/subscript, bold, hyperlinks (formatting that carries meaning).
  • Visual Grounding — every extracted element traceable back to its source location for auditability (a minimal example record follows this list).
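
To make the grounding idea concrete, here is a sketch of what a grounded element record could look like. This is our illustration, not ParseBench's actual output schema:

```python
# Hypothetical grounded-element record -- illustrative only, not
# ParseBench's real schema. The point: every extracted value carries
# a pointer back to where it came from.
grounded_element = {
    "type": "table_cell",
    "text": "$1,250.00",
    "page": 12,                        # source page in the original PDF
    "bbox": [0.42, 0.31, 0.58, 0.34],  # normalized (x0, y0, x1, y1)
}
# The Element Pass Rate scores whether extracted elements can be traced
# back to a region like this in the source document.
```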

Why it matters

In agentic workflows, small parsing errors become decision errors. An insurance agent approving a claim reads a specific cell in a coverage table — if the header is misaligned, it reads the wrong column. A financial analyst agent quoting a price might quote a struck-through (invalidated) price as the current one. These failures don't show up in BLEU scores, but they break production.

"What matters is not whether a parser produces text that looks similar to a reference, but whether it preserves the structure and meaning needed for correct downstream decisions."

Most prior benchmarks miss the mark on enterprise content. OmniDocBench draws only 6% of pages from enterprise sources; olmOCR-Bench skews 42% toward arXiv math papers. ParseBench is the first to score all 5 dimensions on the documents that actually drive automation revenue.

Technical facts

Dimension              Pages   Docs    Rules    Metric
Tables                   503    284      503    GTRM (GriTS + TableRecordMatch)
Charts                   568     99    4,864    ChartDataPointMatch
Content Faithfulness     506    506  141,322    Content Faithfulness Score
Semantic Formatting      476    476    5,997    Semantic Formatting Score
Visual Grounding         500    321   16,325    Element Pass Rate
Total (unique)         2,078  1,211  169,011

Two new metrics matter: TableRecordMatch treats a table as a bag of records (insensitive to column/row order, brutal on transposed headers), and ChartDataPointMatch verifies annotated data points in the parser's output table — tolerant of formatting differences (currency, units, separators) but unforgiving on missing values.
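
The post doesn't include reference implementations, so here is a minimal sketch of the bag-of-records idea, combined with the kind of value normalization ChartDataPointMatch is described as tolerating. Function names and normalization rules are our assumptions, not ParseBench's actual code:

```python
import re
from collections import Counter

def normalize(value: str) -> str:
    """Collapse formatting-only differences (currency symbols, thousands
    separators, whitespace) so "$1,250.00" and "1250.00" compare equal."""
    return re.sub(r"[$€£,\s]", "", value.strip().lower())

def record_match_score(predicted: list[dict], reference: list[dict]) -> float:
    """Score a parsed table as a bag of records: each row becomes the set of
    (header, normalized value) pairs, so row/column order is irrelevant, but
    transposed headers produce entirely different records and score zero."""
    def bag(rows):
        return Counter(
            frozenset((k.strip().lower(), normalize(str(v))) for k, v in row.items())
            for row in rows
        )
    pred, ref = bag(predicted), bag(reference)
    hits = sum((pred & ref).values())        # multiset intersection
    return hits / max(sum(ref.values()), 1)  # recall against reference rows

# Reordered columns still match; a transposed header would not.
ref  = [{"Plan": "Gold", "Premium": "$1,250.00"}]
pred = [{"Premium": "1250.00", "Plan": "Gold"}]
assert record_match_score(pred, ref) == 1.0
```

Under a scheme like this, a transposed table yields zero matching records while a purely cosmetic reformat scores perfectly, which is exactly the asymmetry both metrics are after.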

Comparison: the leaderboard

Method                      Overall   Tables   Charts   Content Faith.   Format   Visual Ground.
LlamaParse Agentic            84.88    90.74    89.68        —            85.24      80.62
LlamaParse Cost Effective     71.89    73.04      —          —              —          —
Google Gemini 3 Flash         71.0     89.9     64.8        86.2          58.4       56.0
Reducto                       67.8     70.3     57.0        86.4          56.8       68.7
Qwen 3 VL                     62.0     74.7     28.2        87.6          64.2       55.2
Azure Doc Intelligence        59.6     86.0      1.6        84.9          51.9       73.8
Dots OCR 1.5                  55.8     85.2      0.9        90.0          47.0       55.8
Docling (OSS)                 50.6     66.4     52.8        66.9           1.0       66.1
AWS Textract                  47.9     84.6      6.0        74.8           3.7       70.4
OpenAI GPT-5 Mini             46.8     69.8     30.1        82.3          45.8        6.2
Anthropic Haiku 4.5           45.2     77.2     13.8        78.7          49.4        6.7

(— = not reported in the source table.)

Three patterns jump out:

  • Charts are the great divider. Only 4 methods crack 50%. Most specialized parsers score 6% or below: they output raw OCR text instead of structured data tables.
  • Formatting is widely ignored. Range: Docling at 1.0% to LlamaParse Agentic at 85.24%. Most parsers strip strikethrough/superscripts as cosmetic.
  • Visual grounding separates VLMs from layout-aware systems. GPT-5 Mini and Haiku 4.5 score under 8%; Azure (73.8%) and Textract (70.4%) crush them because they were built around layout detection.

Use cases

ParseBench is built for agent workflows in industries where parsing errors compound into financial or compliance risk:

  • Insurance — claims approval reading specific table cells; SERFF regulatory filings with merged headers.
  • Finance — due diligence, financial models, analyst pipelines parsing 10-K filings and earnings reports.
  • Legal & contracts — strikethrough preservation matters (a struck-through clause is not the active clause).
  • Government/regulatory — submissions where every value must be traceable to source for audit.

Limitations & pricing

The headline finding: no method is consistently strong across all 5 dimensions. Even on "mostly solved" content faithfulness, the ~90% top scores mean agents still hit omissions or hallucinations on 1 in 10 pages — unacceptable for high-stakes workflows.

On compute-vs-quality: throwing more thinking budget at VLMs gives diminishing returns. Gemini gains ~5 points moving from minimal to high thinking — at 4× the cost. GPT-5 Mini and Haiku 4.5 see even smaller gains at 3–4× cost. Reducto's agentic mode at ~5¢/page (the most expensive option) yields only ~4 points over its base.

LlamaParse pricing sits on the Pareto frontier:

  • Agentic: ~1.2¢/page · 84.88% — outperforms all others at any cost level.
  • Cost Effective: <0.4¢/page · 71.89% — competitive with Gemini at minimal thinking.
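
A quick way to sanity-check the Pareto claim is to compute the frontier from the (cost, score) points the post actually quotes. This is our own sketch: Reducto's score below is derived from "base 67.8 plus ~4 points", and methods whose per-page cost isn't quoted are simply not in the comparison:

```python
def pareto_frontier(points: dict[str, tuple[float, float]]) -> list[str]:
    """Keep methods not dominated by a cheaper-or-equal, better-or-equal rival.
    points maps method name -> (cost in cents/page, overall score)."""
    return [
        name
        for name, (cost, score) in points.items()
        if not any(
            c <= cost and s >= score and (c, s) != (cost, score)
            for other, (c, s) in points.items()
            if other != name
        )
    ]

# Only the figures quoted in the post; Reducto agentic = 67.8 base + ~4.
points = {
    "LlamaParse Agentic":        (1.2, 84.88),
    "LlamaParse Cost Effective": (0.4, 71.89),
    "Reducto agentic":           (5.0, 71.8),
}
print(pareto_frontier(points))
# ['LlamaParse Agentic', 'LlamaParse Cost Effective'] -- Reducto is dominated
```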

Availability: Apache-2.0 license. Dataset on HuggingFace (llamaindex/ParseBench, 592 MB, 169,011 rows). Code: run-llama/ParseBench with 90+ pre-configured pipelines. Paper: arXiv:2604.08538. Website: parsebench.ai.
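
If you want to poke at the rules yourself, the dataset should load with the standard Hugging Face datasets API. The repo id comes from the post; the split name below is an assumption, so check the dataset card:

```python
from datasets import load_dataset

# "llamaindex/ParseBench" is the repo id given above; the "train" split
# name is an assumption -- consult the dataset card on HuggingFace.
ds = load_dataset("llamaindex/ParseBench", split="train")
print(len(ds), ds.column_names)  # expect ~169,011 rows of evaluation rules
```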

What's next

The team flags an official public leaderboard "soon," plus three roadmap directions: greater scale and broader enterprise domain coverage; extending beyond parsing into structured extraction and document classification/splitting; and harder evaluation settings — ultra-high-resolution pages, visually dense technical documents, adversarial enterprise cases.

If you're building an agent that touches PDFs, ParseBench gives you the first honest answer to a question that's been impossible to benchmark properly: which parser won't silently corrupt your agent's context?

Sources: LlamaIndex blog, arXiv paper, GitHub repo, HuggingFace dataset.