Benchmark Results
Phase 0 benchmark: 50 academic PDFs, 8 citation styles, 5 engines narrowed to a final 3-way showdown, AI-verified ground truth. The engine choice was settled here; capability layers (tables, sections, multi-format input) have been added since.
Decision: pdftotext default mode
After benchmarking 5 engines on 50 PDFs, with ground truth verified by Claude Opus 4.6 reading the actual PDF pages, we selected pdftotext default mode + 27-step Academic normalization + pdfplumber SMP recovery as the primary engine: 100% accuracy on 29 ground-truth passages, ~400ms per PDF, zero AGPL dependencies.
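"Default mode" here means invoking pdftotext with no -layout or -raw flag, letting it reflow multi-column pages into reading order. A minimal sketch of the call (the wrapper function is illustrative, not a library API):

```python
import subprocess

def extract_text(pdf_path: str) -> str:
    """Illustrative wrapper: pdftotext in default mode (no -layout/-raw),
    UTF-8 output, '-' to write the extracted text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")
```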
Since this benchmark
The text-extraction engine is unchanged. On top of it, the library now adds:

- Camelot stream-flavor table extraction, which replaced pdfplumber tables after a 5-option bake-off (2026-05-09); a sketch follows this list
- universal-coverage section identification (Abstract … References, Footnotes, Appendix)
- layout-aware footnote stripping (F0)
- render-to-markdown output
- parallel DOCX + HTML input pipelines that share the same normalization + section logic
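A minimal sketch of stream-flavor table extraction using Camelot's public API (the file path is illustrative, and the bake-off's selection thresholds are omitted):

```python
import camelot

# "stream" infers table structure from whitespace alignment instead of
# ruled lines, which suits academic PDFs that rarely draw cell borders.
tables = camelot.read_pdf("paper.pdf", flavor="stream", pages="all")
for table in tables:
    df = table.df                   # pandas DataFrame of detected cells
    report = table.parsing_report   # per-table accuracy/whitespace scores
```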
Final 3-Way Showdown (10 stats-heavy PDFs)
| Engine | Stats Found | Avg Time/PDF | η² Found | χ² Found | Garbled | Papers Won |
|---|---|---|---|---|---|---|
| pdftotext + normalization | 587 | 431ms | 8 | 4 | 0 | 4 |
| PyMuPDF + column_boxes | 587 | 2,139ms | 8 | 4 | 0 | 4 |
| pymupdf4llm | 564 | 9,248ms | 0 | 0 | 2 | 1 |
Why pdftotext Won
Speed: 5-20x faster
~400ms vs 2.1s (column_boxes) or 9.2s (pymupdf4llm). On Railway, this means lower cost and faster API responses.
No AGPL license
PyMuPDF (column_boxes) and pymupdf4llm are AGPL-licensed, and AGPL's network-use clause extends copyleft obligations to hosted services. pdftotext (GPL, invoked as a separate process) + pdfplumber (MIT) avoids AGPL entirely for an authenticated service.
Equal accuracy with normalization
pdftotext + our 27-step Academic normalization pipeline matches column_boxes on stat detection (587 each). The normalization pipeline is the differentiator, not the engine.
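The 27 steps aren't reproduced here; as a flavor, a sketch of three representative fixes (ligature expansion, soft-hyphen removal, de-hyphenation across line breaks). This is illustrative, not the pipeline's actual code:

```python
import re

LIGATURES = {"ﬁ": "fi", "ﬂ": "fl", "ﬀ": "ff", "ﬃ": "ffi", "ﬄ": "ffl"}

def normalize(text: str) -> str:
    # Expand typographic ligatures that break token matching ("ﬁnding").
    for lig, plain in LIGATURES.items():
        text = text.replace(lig, plain)
    # Drop soft hyphens (U+00AD) injected by justified layouts.
    text = text.replace("\u00ad", "")
    # Rejoin words hyphenated across line breaks: "cor-\nrelation".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    return text
```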
Robust recovery path
Auto-detects garbled Unicode (U+FFFD replacement characters) caused by Xpdf's lack of Supplementary Multilingual Plane (SMP) support, and recovers via pdfplumber. Zero garbled output across all 50 test PDFs.
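A sketch of the recovery shape, assuming a simple any-occurrence trigger (the shipped threshold and per-page handling may differ):

```python
import pdfplumber

REPLACEMENT = "\ufffd"  # U+FFFD, emitted where Xpdf can't map an SMP glyph

def recover_if_garbled(pdf_path: str, text: str) -> str:
    # If pdftotext emitted replacement characters, re-extract with
    # pdfplumber, whose pdfminer backend handles SMP code points.
    if REPLACEMENT not in text:
        return text
    with pdfplumber.open(pdf_path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)
```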
Engines Dropped
- pymupdf4llm: bold Markdown markers wrap statistical values; a raw < in "p < .001" gets interpreted as an HTML tag, eating 40K characters; 20x slower; AGPL license.
- PyMuPDF + column_boxes: equal accuracy, but its AGPL license is incompatible with an authenticated SaaS; 5x slower than pdftotext.
- Out-of-memory crash on CPU (std::bad_alloc). Even in FAST mode: 5GB+ RAM, 30-80s per PDF, crashed on paper #10.
- Camelot: too slow as a general-purpose text extractor. Later adopted as the dedicated table-extraction engine after a separate 5-option bake-off (2026-05-09); see "Since this benchmark" above.
Results by Citation Style (50 PDFs)
| Style | Papers | Best Engine | Notes |
|---|---|---|---|
| APA (Psychology) | 10 | pdftotext | Stats-heavy. Default mode handles 2-column correctly. |
| Vancouver (Medical) | 8 | Tie | Simple layouts. All engines work well. |
| Nature | 6 | pdftotext | Up to 400 ligatures per paper. Normalization essential. |
| IEEE | 6 | pdftotext | pymupdf4llm has false positives from figure HTML tags. |
| Harvard | 6 | Tie | Business journals. Moderate complexity. |
| AMA | 5 | pdftotext | Medical style, similar to Vancouver. |
| ASA | 5 | pdftotext | Sociology. 2-column layouts. |
| AOM | 4 | Tie | Management journals. Simple layouts. |
Key Finding
The normalization pipeline matters more than the extraction engine. pdftotext + normalization matches or exceeds engines that are 5-20x slower. The pipeline (originally consolidated from 5 projects: ESCIcheck, MetaESCI, Scimeto, MetaMisCitations, COREcoding; now 27 academic-level steps) addresses every real-world artifact we've encountered across 8,500+ academic PDFs.
Methodology
Test set: 50 PDFs from CitationGuard validation suite, spanning 8 citation styles (APA, IEEE, Nature, Vancouver, AMA, ASA, Harvard, AOM).
Ground truth: Claude Opus 4.6 reading PDF pages as images and counting statistical expressions (p-values, F-tests, t-tests, correlations, CIs, effect sizes).
Metrics: Statistical patterns found, extraction time, garbled character count, ligature count, column interleaving detection.
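To make the "statistical patterns found" metric concrete, here is an illustrative subset of detection regexes; the benchmark's actual pattern set is broader and more defensive:

```python
import re

STAT_PATTERNS = {
    "p_value":     re.compile(r"\bp\s*[<=>]\s*\.?\d+", re.IGNORECASE),
    "f_test":      re.compile(r"\bF\s*\(\s*\d+\s*,\s*\d+(\.\d+)?\s*\)\s*="),
    "t_test":      re.compile(r"\bt\s*\(\s*\d+(\.\d+)?\s*\)\s*="),
    "correlation": re.compile(r"\br\s*=\s*[-−]?\s*\.?\d+"),
    "ci":          re.compile(r"\b9[05]%\s*CI\b"),
    "effect_size": re.compile(r"(η²|η2|\bd)\s*=\s*[-−]?\s*\.?\d+"),
}

def count_stats(text: str) -> dict[str, int]:
    # Count every match of every pattern; overlaps between categories
    # are possible and would need de-duplication in a real harness.
    return {name: len(p.findall(text)) for name, p in STAT_PATTERNS.items()}
```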
Deep verification: Page-level comparison on 6 stats-heavy papers (chan_feldman, ip_feldman, chandrashekar, bmc_med_3, bmc_med_4, nat_comms_4).
Full benchmark data available on GitHub.