A universal extraction service built for academic papers. Handles multi-column layouts, ligatures, Unicode symbols, and statistical expressions that other tools break — across PDF, DOCX, and HTML inputs, with section identification, tables, figures, and Markdown rendering.
The problem
Every academic project reinvents PDF extraction. Multi-column layouts interleave text. Ligatures break pattern matching. Unicode symbols vanish. Statistical expressions split across lines. Docpluck solves this once, for all projects.
Features
The default pdftotext mode correctly linearizes two-column layouts. No more interleaved text from APA, Nature, or Elsevier papers.
Automatically converts the ff, fi, fl, ffi, and ffl ligatures (U+FB00–U+FB04) back to their component characters. A single Nature paper can contain up to 400 of them.
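A minimal sketch of that fold, using an explicit mapping table (the table below is illustrative, not Docpluck's internal code):

```python
LIGATURES = {
    "\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl",
    "\ufb03": "ffi", "\ufb04": "ffl",
}

def fold_ligatures(text: str) -> str:
    """Replace Latin typographic ligatures with their component letters."""
    return text.translate(str.maketrans(LIGATURES))
```

Unicode NFKC normalization would also fold these five code points, but it rewrites many other compatibility characters as well; an explicit table keeps the change scoped to ligatures.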
Mathematical Italic characters (U+1D434+) that break Xpdf are auto-detected and recovered via pdfplumber. Zero garbled output.
Rejoins split expressions like "p =\n0.001" and "OR\n1.399". Handles CIs, effect sizes, and test statistics from APA formatting.
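The rejoining step could be sketched with a regex along these lines (the pattern and the statistic list are my assumptions, not the shipped rule set):

```python
import re

# Rejoin statistical expressions that extraction split across a line
# break, e.g. "p =\n0.001" or "OR\n1.399". Statistic names and the
# operator set here are illustrative.
STAT_BREAK = re.compile(
    r"\b(?P<lhs>(?:p|r|t|F|z|d|OR|CI)\s*(?:[=<>≤≥]\s*)?)\n\s*(?P<rhs>[-−]?\d)"
)

def rejoin_stats(text: str) -> str:
    """Pull a number back onto the line of the statistic it belongs to."""
    return STAT_BREAK.sub(
        lambda m: m.group("lhs").rstrip() + " " + m.group("rhs"), text
    )
```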
Converts the Unicode MINUS SIGN (U+2212), en dash, and em dash to the ASCII hyphen. Critical for matching r = −0.73 in downstream parsers.
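A sketch of the dash fold with a plain translation table; the three code points follow the description above:

```python
DASHES = str.maketrans({
    "\u2212": "-",  # MINUS SIGN
    "\u2013": "-",  # EN DASH
    "\u2014": "-",  # EM DASH
})

def normalize_dashes(text: str) -> str:
    """Map dash-like code points to ASCII hyphen for exact matching."""
    return text.translate(DASHES)
```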
A common-word ratio detects garbled PDFs (broken font encoding). A composite 0–100 score with a confidence level is reported for every extraction.
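A toy version of a common-word-ratio check; the stop-word list and the scaling are made up here, and Docpluck's composite score combines more signals:

```python
COMMON = {"the", "of", "and", "in", "to", "a", "is", "that", "for", "with"}

def quality_score(text: str) -> int:
    """Crude 0-100 garble score: share of everyday English function words."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    if not words:
        return 0
    ratio = sum(w in COMMON for w in words) / len(words)
    # A readable English page typically lands near 0.3-0.5; scale and cap.
    return min(100, round(ratio / 0.4 * 100))
```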
Labels Abstract, Introduction, Methods, Results, Discussion, References, Footnotes, Appendix, and more. Universal-coverage invariant: every character is accounted for.
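Heading detection along these lines can be sketched with a multiline regex; the heuristics below are mine, and the real pipeline additionally enforces the universal-coverage invariant:

```python
import re

# Match a section label at the start of a line, optionally preceded by
# a numeric prefix like "2." — label set follows the list above.
SECTION = re.compile(
    r"^\s*(?:\d+\.?\s+)?(Abstract|Introduction|Methods?|Results|"
    r"Discussion|References|Footnotes|Appendix)\b",
    re.IGNORECASE | re.MULTILINE,
)

def label_sections(text: str) -> list[tuple[str, int]]:
    """Return (canonical label, character offset) for each heading found."""
    return [(m.group(1).title(), m.start()) for m in SECTION.finditer(text)]
```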
Camelot's stream flavor detects both ruled and unruled tables and exports them as HTML inside the Markdown output, preserving merged cells, multi-line headers, and group separators.
The F0 step uses pdfplumber geometry to separate body text from running headers and footnote appendices, then re-attaches the footnotes as a dedicated section.
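The geometric split can be illustrated over pdfplumber-style word dicts (the shape returned by `page.extract_words()`); the band thresholds below are placeholders, not the F0 step's actual parameters:

```python
def split_bands(words, page_height, header_frac=0.07, footer_frac=0.90):
    """Partition words into header, body, and footnote bands by y-position."""
    header, body, footer = [], [], []
    for w in words:
        if w["top"] < page_height * header_frac:
            header.append(w["text"])
        elif w["top"] > page_height * footer_frac:
            footer.append(w["text"])  # re-attached later as a Footnotes section
        else:
            body.append(w["text"])
    return " ".join(header), " ".join(body), " ".join(footer)
```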
render_pdf_to_markdown produces a single Markdown document with sections, inline tables, and figure references — ready for LLM pipelines.
extract_docx and extract_html share the same normalization + section pipeline, so DOCX manuscripts and HTML preprints come out identically structured.
How it works
1. Drop a PDF, DOCX, or HTML file via the web UI or send it through the API.
2. Format-aware extraction: pdftotext for PDFs, native parsing for DOCX/HTML.
3. The 27-step academic pipeline fixes ligatures, dashes, line breaks, and footnotes, then identifies sections and tables.
4. Quality check and Markdown rendering with sections, tables, and figure references.
Normalization
Every extraction reports exactly which steps were applied and what changed. Consumer apps know precisely what to expect.
none (0 steps): Unmodified text from pdftotext. Your app handles all normalization.
standard (16 steps): Ligatures, accents, quotes, dashes, Unicode minus, whitespace, headers/footers, watermark and front-matter strips.
academic (27 steps): Everything in standard, plus statistical line breaks, dropped decimals, CI delimiters, math symbols, body-integer fixes, and reference-list repair. Layout-aware footnote stripping when a PDF is provided.
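The step-reporting contract above can be sketched as a staged pipeline of pure text-to-text functions; the step names and implementations here are illustrative stand-ins for the real 27:

```python
import unicodedata

# Each step is text -> text; a step is "applied" only if it changed the text.
PIPELINE = [
    ("fold_ligatures", lambda t: unicodedata.normalize("NFKC", t)),
    ("ascii_dashes",   lambda t: t.replace("\u2212", "-")),
    ("join_breaks",    lambda t: t.replace("=\n", "= ")),
]

def run_pipeline(text: str) -> tuple[str, list[str]]:
    """Run steps in order and report exactly which ones changed the text."""
    applied = []
    for name, step in PIPELINE:
        new = step(text)
        if new != text:
            applied.append(name)
        text = new
    return text, applied
```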
For developers
Replace fragmented extraction code across ESCIcheck, MetaESCI, Scimeto, MetaMisCitations, and COREcoding with a single authenticated API. Python, R, and JavaScript examples included.
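A hypothetical Python shape for a call; the URL, field names, and header below are placeholders, so consult the API docs for the real contract:

```python
API_URL = "https://example.com/api/v1/extract"  # placeholder URL

def build_request(pdf_bytes: bytes, mode: str = "academic"):
    """Assemble the pieces of an authenticated extraction call (illustrative)."""
    assert mode in {"none", "standard", "academic"}
    headers = {"Authorization": "Bearer YOUR_TOKEN"}  # placeholder token
    fields = {"mode": mode}
    return API_URL, headers, fields, pdf_bytes
```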
Read API Docs
Free for researchers: 5 files/day. Academic users get higher limits.
Get Started Free