Benchmark Results
Phase 0 benchmark: 50 academic PDFs, 8 citation styles, 5 engines narrowed to a final 3-way showdown, AI-verified ground truth. The engine choice was settled here; capability layers (tables, sections, multi-format input) have been added since.
Decision: pdftotext default mode
After benchmarking 5 engines on 50 PDFs, with ground truth verified by Claude Opus 4.6 reading the actual PDF pages, we selected pdftotext default mode + 27-step Academic normalization + pdfplumber SMP recovery as the primary engine: 100% accuracy on 29 ground-truth passages, ~400ms per PDF, zero AGPL dependencies.
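"Default mode" here means invoking pdftotext with no -layout or -raw flag, letting it reflow multi-column pages into reading order. A minimal sketch of the call (the wrapper function is illustrative, not a library API):

```python
import subprocess

def extract_text(pdf_path: str) -> str:
    """Illustrative wrapper: pdftotext in default mode (no -layout/-raw),
    UTF-8 output, '-' to write the extracted text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")
```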
Since this benchmark
The text-extraction engine is unchanged. On top of it, the library now adds:

- Camelot stream-flavor table extraction, which replaced pdfplumber tables after a 5-option bake-off (2026-05-09); a sketch follows this list
- universal-coverage section identification (Abstract … References, Footnotes, Appendix)
- layout-aware footnote stripping (F0)
- render-to-markdown output
- parallel DOCX + HTML input pipelines that share the same normalization + section logic
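A minimal sketch of stream-flavor table extraction using Camelot's public API (the file path is illustrative, and the bake-off's selection thresholds are omitted):

```python
import camelot

# "stream" infers table structure from whitespace alignment instead of
# ruled lines, which suits academic PDFs that rarely draw cell borders.
tables = camelot.read_pdf("paper.pdf", flavor="stream", pages="all")
for table in tables:
    df = table.df                   # pandas DataFrame of detected cells
    report = table.parsing_report   # per-table accuracy/whitespace scores
```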
Final 3-Way Showdown (10 stats-heavy PDFs)
| Engine | Stats Found | Avg Time/PDF | η² Found | χ² Found | Garbled | Papers Won |
|---|---|---|---|---|---|---|
| pdftotext + normalization | 587 | 431ms | 8 | 4 | 0 | 4 |
| PyMuPDF + column_boxes | 587 | 2,139ms | 8 | 4 | 0 | 4 |
| pymupdf4llm | 564 | 9,248ms | 0 | 0 | 2 | 1 |
Why pdftotext Won
Speed: 5-20x faster
~400ms vs 2.1s (column_boxes) or 9.2s (pymupdf4llm). On Railway, this means lower cost and faster API responses.
No AGPL license
PyMuPDF (column_boxes) and pymupdf4llm are AGPL-licensed, and AGPL's network-use clause extends copyleft obligations to hosted services. pdftotext (GPL, invoked as a separate process) + pdfplumber (MIT) avoids AGPL entirely for an authenticated service.
Equal accuracy with normalization
pdftotext + our 27-step Academic normalization pipeline matches column_boxes on stat detection (587 each). The normalization pipeline is the differentiator, not the engine.
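The 27 steps aren't reproduced here; as a flavor, a sketch of three representative fixes (ligature expansion, soft-hyphen removal, de-hyphenation across line breaks). This is illustrative, not the pipeline's actual code:

```python
import re

LIGATURES = {"ﬁ": "fi", "ﬂ": "fl", "ﬀ": "ff", "ﬃ": "ffi", "ﬄ": "ffl"}

def normalize(text: str) -> str:
    # Expand typographic ligatures that break token matching ("ﬁnding").
    for lig, plain in LIGATURES.items():
        text = text.replace(lig, plain)
    # Drop soft hyphens (U+00AD) injected by justified layouts.
    text = text.replace("\u00ad", "")
    # Rejoin words hyphenated across line breaks: "cor-\nrelation".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    return text
```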
Robust recovery path
Auto-detects garbled Unicode (U+FFFD replacement characters) caused by Xpdf's lack of Supplementary Multilingual Plane (SMP) support, and recovers via pdfplumber. Zero garbled output across all 50 test PDFs.
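A sketch of the recovery shape, assuming a simple any-occurrence trigger (the shipped threshold and per-page handling may differ):

```python
import pdfplumber

REPLACEMENT = "\ufffd"  # U+FFFD, emitted where Xpdf can't map an SMP glyph

def recover_if_garbled(pdf_path: str, text: str) -> str:
    # If pdftotext emitted replacement characters, re-extract with
    # pdfplumber, whose pdfminer backend handles SMP code points.
    if REPLACEMENT not in text:
        return text
    with pdfplumber.open(pdf_path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)
```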
Engines Dropped
- pymupdf4llm: bold Markdown markers wrap statistical values; a raw < in "p < .001" gets interpreted as an HTML tag, eating 40K characters; 20x slower; AGPL license.
- PyMuPDF + column_boxes: equal accuracy, but its AGPL license is incompatible with an authenticated SaaS; 5x slower than pdftotext.
- Out-of-memory crash on CPU (std::bad_alloc). Even in FAST mode: 5GB+ RAM, 30-80s per PDF, crashed on paper #10.
- Camelot: too slow as a general-purpose text extractor. Later adopted as the dedicated table-extraction engine after a separate 5-option bake-off (2026-05-09); see "Since this benchmark" above.
Results by Citation Style (50 PDFs)
| Style | Papers | Best Engine | Notes |
|---|---|---|---|
| APA (Psychology) | 10 | pdftotext | Stats-heavy. Default mode handles 2-column correctly. |
| Vancouver (Medical) | 8 | Tie | Simple layouts. All engines work well. |
| Nature | 6 | pdftotext | Up to 400 ligatures per paper. Normalization essential. |
| IEEE | 6 | pdftotext | pymupdf4llm has false positives from figure HTML tags. |
| Harvard | 6 | Tie | Business journals. Moderate complexity. |
| AMA | 5 | pdftotext | Medical style, similar to Vancouver. |
| ASA | 5 | pdftotext | Sociology. 2-column layouts. |
| AOM | 4 | Tie | Management journals. Simple layouts. |
Key Finding
The normalization pipeline matters more than the extraction engine. pdftotext + normalization matches or exceeds engines that are 5-20x slower. The pipeline (originally consolidated from 5 projects: ESCIcheck, MetaESCI, Scimeto, MetaMisCitations, COREcoding; now 27 academic-level steps) addresses every real-world artifact we've encountered across 8,500+ academic PDFs.
Methodology
Test set: 50 PDFs from CitationGuard validation suite, spanning 8 citation styles (APA, IEEE, Nature, Vancouver, AMA, ASA, Harvard, AOM).
Ground truth: Claude Opus 4.6 reading PDF pages as images and counting statistical expressions (p-values, F-tests, t-tests, correlations, CIs, effect sizes).
Metrics: Statistical patterns found, extraction time, garbled character count, ligature count, column interleaving detection.
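To make the "statistical patterns found" metric concrete, here is an illustrative subset of detection regexes; the benchmark's actual pattern set is broader and more defensive:

```python
import re

STAT_PATTERNS = {
    "p_value":     re.compile(r"\bp\s*[<=>]\s*\.?\d+", re.IGNORECASE),
    "f_test":      re.compile(r"\bF\s*\(\s*\d+\s*,\s*\d+(\.\d+)?\s*\)\s*="),
    "t_test":      re.compile(r"\bt\s*\(\s*\d+(\.\d+)?\s*\)\s*="),
    "correlation": re.compile(r"\br\s*=\s*[-−]?\s*\.?\d+"),
    "ci":          re.compile(r"\b9[05]%\s*CI\b"),
    "effect_size": re.compile(r"(η²|η2|\bd)\s*=\s*[-−]?\s*\.?\d+"),
}

def count_stats(text: str) -> dict[str, int]:
    # Count every match of every pattern; overlaps between categories
    # are possible and would need de-duplication in a real harness.
    return {name: len(p.findall(text)) for name, p in STAT_PATTERNS.items()}
```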
Deep verification: Page-level comparison on 6 stats-heavy papers (chan_feldman, ip_feldman, chandrashekar, bmc_med_3, bmc_med_4, nat_comms_4).
Full benchmark data available on GitHub.