Open-source PDF extraction for academia

Extract clean text
from PDFs, Word, and HTML

A universal extraction service built for academic papers. Handles multi-column layouts, ligatures, Unicode symbols, and statistical expressions that other tools break — across PDF, DOCX, and HTML inputs, with section identification, tables, figures, and Markdown rendering.

3
Input formats
PDF, DOCX, HTML
~400ms
Average extraction time
Per academic PDF
27
Academic-level steps
S0–S9, A1–A8, W0, R2–R3 (+F0 with PDF layout)

The problem

Every academic project reinvents PDF extraction. Multi-column layouts interleave text. Ligatures break pattern matching. Unicode symbols vanish. Statistical expressions split across lines. Docpluck solves this once, for all projects.

Features

Built from real extraction failures

Extraction

Column-aware reading order

The default pdftotext mode reconstructs correct reading order from two-column layouts. No more interleaved text from APA, Nature, or Elsevier papers.

Normalization

Ligature expansion

Automatically expands the ff, fi, fl, ffi, and ffl ligatures (U+FB00–FB04) back to their component letters. A single Nature paper can contain up to 400 of them.
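In spirit, the expansion is a small character mapping. A minimal sketch (the table and function name are illustrative, not Docpluck's internals):

```python
# Presentation-form ligatures U+FB00-U+FB04 and their plain-letter expansions
LIGATURES = {
    "\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl",
    "\ufb03": "ffi", "\ufb04": "ffl",
}

def expand_ligatures(text: str) -> str:
    # Replace each ligature codepoint with its component letters
    for lig, plain in LIGATURES.items():
        text = text.replace(lig, plain)
    return text
```

Unicode's NFKC normalization performs the same folding, but an explicit table keeps the change auditable.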

Recovery

Unicode symbol recovery

Mathematical Italic characters (U+1D434+) that break Xpdf are auto-detected and recovered via pdfplumber. Zero garbled output.
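The Mathematical Alphanumeric Symbols block (U+1D400–U+1D7FF) carries Unicode compatibility mappings back to plain letters, so NFKC normalization can fold recovered characters to ASCII. A sketch of the detect-and-fold step (the function name is illustrative; the source's pdfplumber re-extraction itself is not shown):

```python
import unicodedata

MATH_ALNUM = range(0x1D400, 0x1D800)  # Mathematical Alphanumeric Symbols block

def fold_math_symbols(text: str) -> str:
    """If mathematical-italic letters are present, fold them to plain
    letters via NFKC compatibility normalization."""
    if any(ord(ch) in MATH_ALNUM for ch in text):
        return unicodedata.normalize("NFKC", text)
    return text
```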

Academic

Statistical line break repair

Rejoins split expressions like "p =\n0.001" and "OR\n1.399". Handles CIs, effect sizes, and test statistics from APA formatting.
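A rough sketch of the repair idea, as a single regex that rejoins a statistic's operator or abbreviation with the number on the next line (the pattern and abbreviation list are assumptions, not Docpluck's actual rules):

```python
import re

# Rejoin "p =\n0.001", "OR\n1.399", etc.: an operator or a common
# statistic abbreviation, a line break, then a (possibly signed) number.
STAT_BREAK = re.compile(r"(\b(?:OR|CI|SE|HR|RR)|[=<>])[ \t]*\n[ \t]*([\u2212-]?\d)")

def repair_stat_breaks(text: str) -> str:
    return STAT_BREAK.sub(r"\1 \2", text)
```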

Normalization

Minus sign normalization

Converts the Unicode minus sign (U+2212), en dash, and em dash to the ASCII hyphen. Critical for matching r = −0.73 in downstream parsers.
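The normalization amounts to a three-entry translation table; a minimal sketch (names illustrative):

```python
# U+2212 MINUS SIGN, U+2013 EN DASH, U+2014 EM DASH -> ASCII hyphen
DASHES = str.maketrans({"\u2212": "-", "\u2013": "-", "\u2014": "-"})

def normalize_minus(text: str) -> str:
    return text.translate(DASHES)
```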

Quality

Quality scoring

Common-word ratio detects garbled PDFs (broken font encoding). Composite 0–100 score with confidence level reported per extraction.
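The common-word ratio can be sketched as follows; the stop-word list and function are illustrative stand-ins for whatever Docpluck actually uses, but they show why broken font encodings score near zero:

```python
COMMON_WORDS = {
    "the", "of", "and", "to", "in", "is", "that", "for", "with", "was",
}

def common_word_ratio(text: str) -> float:
    """Fraction of alphabetic tokens that are common English words.
    Garbled extractions produce few recognizable tokens, so the
    ratio collapses toward zero."""
    words = [w.lower() for w in text.split() if w.isalpha()]
    return sum(w in COMMON_WORDS for w in words) / len(words) if words else 0.0
```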

Structure

Section identification

Labels Abstract, Introduction, Methods, Results, Discussion, References, Footnotes, Appendix, and more. Universal-coverage invariant: every character is accounted for.
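One way to picture the invariant: split the text at recognized headings so every character lands in exactly one section. A minimal sketch, assuming a simple heading regex (not Docpluck's actual heuristics, which would also handle numbered and repeated headings):

```python
import re

HEADING = re.compile(
    r"^(Abstract|Introduction|Methods?|Results|Discussion|References|Appendix)\b",
    re.IGNORECASE | re.MULTILINE,
)

def label_sections(text: str) -> dict:
    """Partition text at headings. Concatenating the values always
    reproduces the input: every character is accounted for."""
    sections, last, name = {}, 0, "front"
    for m in HEADING.finditer(text):
        sections[name] = text[last:m.start()]
        name, last = m.group(1).title(), m.start()
    sections[name] = text[last:]
    return sections
```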

Tables

Table extraction

Camelot's stream flavor detects ruled and unruled tables and exports them as HTML inside Markdown, preserving merged cells, multi-line headers, and group separators.

Layout

Layout-aware footnote stripping

F0 step uses pdfplumber geometry to separate body text from running headers and footnote appendices, then re-attaches footnotes as a dedicated section.
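The geometric split can be illustrated on pdfplumber-style word dicts (each with a `text` and a `top` coordinate). The threshold and function below are hypothetical, purely to show the shape of the idea:

```python
def split_body_footnotes(words, page_height, footnote_band=0.85):
    """Partition pdfplumber-style word dicts into body vs footnote text
    by vertical position. `footnote_band` is an assumed cutoff: words
    whose top edge falls in the bottom 15% of the page are treated as
    footnote candidates to be re-attached as a dedicated section."""
    cutoff = page_height * footnote_band
    body = [w["text"] for w in words if w["top"] < cutoff]
    foot = [w["text"] for w in words if w["top"] >= cutoff]
    return " ".join(body), " ".join(foot)
```

Real footnote detection also checks font size and rule lines; position alone is the simplest usable signal.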

Output

Render to Markdown

render_pdf_to_markdown produces a single Markdown document with sections, inline tables, and figure references — ready for LLM pipelines.

Multi-format

Word + HTML inputs

extract_docx and extract_html share the same normalization + section pipeline, so DOCX manuscripts and HTML preprints come out identically structured.

How it works

Four steps to clean text

01

Upload

Drop a PDF, DOCX, or HTML file via the web UI or send it through the API.

02

Extract

Format-aware extraction: pdftotext for PDFs, native parsing for DOCX/HTML.

03

Normalize + Structure

The 27-step Academic pipeline fixes ligatures, dashes, line breaks, and footnotes, then identifies sections and tables.

04

Score + Render

Quality check + Markdown rendering with sections, tables, and figure references.

Normalization

Choose your level

Every extraction reports exactly which steps were applied and what changed. Consumer apps know precisely what to expect.

none · 0 steps

Raw

Unmodified text from pdftotext. Your app handles all normalization.

standard · 16 steps

Standard

Ligatures, accents, quotes, dashes, Unicode minus, whitespace, headers/footers, watermark + front-matter strips.

academic · 27 steps

Academic

Standard + statistical line breaks, dropped decimals, CI delimiters, math symbols, body-integer fixes, reference-list repair. Layout-aware footnote stripping when a PDF is provided.

For developers

One API for all your projects

Replace fragmented extraction code across ESCIcheck, MetaESCI, Scimeto, MetaMisCitations, and COREcoding with a single authenticated API. Python, R, and JavaScript examples included.

Read API Docs
# Python
import requests

res = requests.post(
    "https://docpluck.vercel.app/api/extract",
    params={"normalize": "academic"},
    files={"file": open("paper.pdf", "rb")},
    headers={"Authorization": "Bearer dp_..."},
)
text = res.json()["text"]

Stop rebuilding PDF extraction

Free for researchers. 5 files/day. Academic users get higher limits.

Get Started Free