Benchmark Results

Phase 0 benchmark: 50 academic PDFs, 8 citation styles, 3 finalist engines (5 benchmarked in total), AI-verified ground truth. Engine choice settled here; capability layers (tables, sections, multi-format) added since.

Decision: pdftotext default mode

After benchmarking 5 engines on 50 PDFs and verifying with Claude Opus 4.6 reading actual PDF pages, we selected pdftotext default mode + 27-step Academic normalization + pdfplumber SMP recovery as the primary engine. 100% accuracy on 29 ground truth passages, ~400ms per PDF, zero AGPL dependencies.
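The selected combination implies a simple control flow: try pdftotext, fall back to pdfplumber only when the output is garbled, then normalize. A minimal sketch of that flow, with injectable extractor callables; the function and parameter names are illustrative, not the library's actual API:

```python
# Sketch of the primary-engine flow described above. Names are
# illustrative assumptions, not the library's real interface.

REPLACEMENT = "\ufffd"  # U+FFFD: marks glyphs pdftotext could not map

def extract_text(pdf_path, pdftotext_extract, pdfplumber_extract, normalize):
    """pdftotext first; pdfplumber only when output shows replacement chars."""
    text = pdftotext_extract(pdf_path)        # fast path, ~400ms typical
    if REPLACEMENT in text:                   # SMP glyphs Xpdf cannot represent
        text = pdfplumber_extract(pdf_path)   # slower, Unicode-complete recovery
    return normalize(text)                    # 27-step Academic normalization
```

In practice the two extractor arguments would wrap the pdftotext CLI and pdfplumber respectively; injecting them keeps the decision logic testable in isolation.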

Since this benchmark

The text-extraction engine is unchanged. On top of it, the library now adds: Camelot stream-flavor table extraction (replacing pdfplumber tables after a 5-option bake-off, 2026-05-09); universal-coverage section identification (Abstract … References, plus Footnotes and Appendix); layout-aware footnote stripping (F0); render-to-markdown output; and parallel DOCX and HTML input pipelines that share the same normalization and section logic.

Final 3-Way Showdown (10 stats-heavy PDFs)

| Engine | Stats Found | Avg Speed | Eta-sq | Chi-sq | Garbled | Papers Won |
|---|---|---|---|---|---|---|
| pdftotext + normalization | 587 | 431ms | 8 | 4 | 0 | 4 |
| PyMuPDF + column_boxes | 587 | 2,139ms | 8 | 4 | 0 | 4 |
| pymupdf4llm | 564 | 9,248ms | 0 | 0 | 2 | 1 |

Why pdftotext Won

Speed: 5-20x faster

~400ms vs 2.1s (column_boxes) or 9.2s (pymupdf4llm). On Railway, this means lower cost and faster API responses.

No AGPL license

PyMuPDF (column_boxes) and pymupdf4llm are AGPL-licensed. pdftotext (GPL) + pdfplumber (MIT) avoids AGPL for an authenticated service.

Equal accuracy with normalization

pdftotext + our 27-step Academic normalization pipeline matches column_boxes on stat detection (587 each). The normalization pipeline is the differentiator, not the engine.
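To make "the normalization pipeline is the differentiator" concrete, here are three representative steps in the spirit of the pipeline. These are illustrative sketches, not the actual 27-step implementation:

```python
import re

# Three illustrative normalization steps (the real pipeline has 27;
# these are sketches, not the library's code).

LIGATURES = {"\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl",
             "\ufb03": "ffi", "\ufb04": "ffl"}

def normalize(text: str) -> str:
    # 1. Expand typographic ligatures that break token matching ("eﬀect" -> "effect")
    for lig, ascii_form in LIGATURES.items():
        text = text.replace(lig, ascii_form)
    # 2. Rejoin words hyphenated across line breaks ("signifi-\ncant" -> "significant")
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # 3. Fold Unicode minus to ASCII hyphen so stats like "r = −.32" match
    text = text.replace("\u2212", "-")
    return text
```

Steps like these explain the Nature result below: a paper with hundreds of ligatures produces near-zero stat matches until ligatures are expanded.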

Robust recovery path

Auto-detects garbled Unicode (U+FFFD) from Xpdf's SMP limitation and recovers via pdfplumber. Zero garbled output across all 50 test PDFs.
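One reason recovery is tractable: the SMP glyphs that defeat Xpdf are largely mathematical alphanumerics (U+1D400-U+1D7FF), which NFKC normalization folds back to plain ASCII letters. A sketch of that fold, assuming the pdfplumber-recovered text still carries SMP math letters; this is not the library's actual recovery code:

```python
import unicodedata

# Fold SMP "mathematical alphanumeric" glyphs to ASCII so stat
# patterns match (illustrative sketch, not the real recovery step).

def fold_math_alphanumerics(text: str) -> str:
    """NFKC-normalize so U+1D45D (mathematical italic p) becomes plain 'p'."""
    return unicodedata.normalize("NFKC", text)
```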

Engines Dropped

pymupdf4llm

Bold Markdown markers wrap statistical values; the literal "<" in "p < .001" gets interpreted as an HTML tag open, eating 40K characters downstream; ~20x slower than pdftotext; AGPL license.
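The "<" failure mode is easy to reproduce: any downstream sanitizer that strips HTML-ish tags with a greedy to-the-next-">" pattern will consume everything from a bare "<" onward. A minimal demonstration of the hazard (illustrative only; not pymupdf4llm's code):

```python
import re

# Naive tag stripper: removes "<" ... ">" spans. A bare "<" from
# "p < .001" opens a span that swallows text until the next ">".

def naive_strip_tags(html: str) -> str:
    return re.sub(r"<[^>]*>", "", html)

report = "Results: p < .001 for the main effect. <b>Discussion</b> follows."
# naive_strip_tags(report) drops the p-value and everything up to "<b>".
```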

PyMuPDF column_boxes

Equal accuracy but AGPL license incompatible with authenticated SaaS. 5x slower than pdftotext.

Docling (IBM)

Out-of-memory crash on CPU (std::bad_alloc). Even in FAST mode: 5GB+ RAM, 30-80s per PDF, crashed on paper #10.

Camelot (as a text engine)

Too slow as a general-purpose text extractor. Note: later adopted as the dedicated TABLE-extraction engine after a separate 5-option bake-off (2026-05-09) — see "Since this benchmark" above.

Results by Citation Style (50 PDFs)

| Style | Papers | Best Engine | Notes |
|---|---|---|---|
| APA (Psychology) | 10 | pdftotext | Stats-heavy. Default mode handles 2-column correctly. |
| Vancouver (Medical) | 8 | Tie | Simple layouts. All engines work well. |
| Nature | 6 | pdftotext | Up to 400 ligatures per paper. Normalization essential. |
| IEEE | 6 | pdftotext | pymupdf4llm has false positives from figure HTML tags. |
| Harvard | 6 | Tie | Business journals. Moderate complexity. |
| AMA | 5 | pdftotext | Medical style, similar to Vancouver. |
| ASA | 5 | pdftotext | Sociology. 2-column layouts. |
| AOM | 4 | Tie | Management journals. Simple layouts. |

Key Finding

The normalization pipeline matters more than the extraction engine. pdftotext + normalization matches or exceeds engines that are 5-20x slower. The pipeline (originally consolidated from 5 projects: ESCIcheck, MetaESCI, Scimeto, MetaMisCitations, COREcoding; now 27 academic-level steps) addresses every real-world artifact we've encountered across 8,500+ academic PDFs.

Methodology

Test set: 50 PDFs from CitationGuard validation suite, spanning 8 citation styles (APA, IEEE, Nature, Vancouver, AMA, ASA, Harvard, AOM).

Ground truth: Claude Opus 4.6 reading PDF pages as images and counting statistical expressions (p-values, F-tests, t-tests, correlations, CIs, effect sizes).

Metrics: Statistical patterns found, extraction time, garbled character count, ligature count, column interleaving detection.
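The "statistical patterns found" metric can be sketched with a handful of regexes; these simplified patterns are illustrative, not the benchmark's exact definitions:

```python
import re

# Simplified counters for statistical expressions (illustrative
# sketches, not the benchmark's real pattern set).

STAT_PATTERNS = {
    "p_value":     re.compile(r"\bp\s*[<>=]\s*\.?\d+"),
    "f_test":      re.compile(r"\bF\s*\(\s*\d+\s*,\s*\d+\s*\)\s*="),
    "t_test":      re.compile(r"\bt\s*\(\s*\d+(?:\.\d+)?\s*\)\s*="),
    "correlation": re.compile(r"\br\s*=\s*-?\.?\d+"),
}

def count_stats(text: str) -> dict:
    """Count occurrences of each pattern family in extracted text."""
    return {name: len(rx.findall(text)) for name, rx in STAT_PATTERNS.items()}
```

Counts like these are what collapse when ligatures, column interleaving, or U+FFFD garbling corrupt the extracted text, which is why the metric discriminates between engines.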

Deep verification: Page-level comparison on 6 stats-heavy papers (chan_feldman, ip_feldman, chandrashekar, bmc_med_3, bmc_med_4, nat_comms_4).

Full benchmark data available on GitHub.