API Documentation

Integrate Docpluck into your research pipeline. Extract clean, normalized text from academic PDFs via a simple HTTP API.

Getting an API Key

API keys are issued by the administrator. Contact giladfel@gmail.com with your project name and use case. Keys follow the format dp_xxxxxxxxxxxx.
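Client code can sanity-check a key's shape before making requests. A minimal sketch; the exact character set and length after dp_ are our assumption from the dp_xxxxxxxxxxxx pattern above, so treat this as a shallow format check only:

```python
import re

def looks_like_docpluck_key(key: str) -> bool:
    """Shallow format check only; the server is the real authority.

    Assumes at least 12 alphanumeric characters after the dp_ prefix,
    inferred from the dp_xxxxxxxxxxxx pattern (not a documented spec).
    """
    return re.fullmatch(r"dp_[A-Za-z0-9]{12,}", key) is not None
```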

Extract Text

POST https://docpluck.vercel.app/api/extract

Query Parameters

Parameter | Default | Description
normalize | none    | Normalization level: none, standard, or academic
quality   | true    | Include quality scoring in response

Request Body

Send the PDF as multipart/form-data with the field name file. This works for files up to 4 MB.
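Because the multipart path only works up to 4 MB, a client can decide up front which path to take. A minimal sketch (the 4 MB threshold comes from this page; the helper name is ours):

```python
MULTIPART_LIMIT = 4 * 1024 * 1024  # multipart path works up to 4 MB

def needs_blob_upload(size_bytes: int) -> bool:
    """Return True when a file must go through the blob-upload path."""
    return size_bytes > MULTIPART_LIMIT
```

Usage: `needs_blob_upload(os.path.getsize("paper.pdf"))` before choosing an upload path.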

Files larger than 4 MB

Vercel's serverless platform caps request bodies at 4.5 MB. For larger PDFs, Docpluck supports a two-step upload via Vercel Blob storage. Step one mints a short-lived upload token, step two PUTs the bytes to blob storage, step three triggers extraction by URL. The blob is deleted as soon as extraction completes (or within 1 hour if you abandon it).

  1. POST /api/extract/upload-url with your Bearer token. Returns a signed Vercel Blob client-upload token.
  2. Client uploads the file directly to Vercel Blob using the SDK (@vercel/blob/client upload() handles steps 1 and 2 in one call).
  3. POST /api/extract with JSON body {"blob_url": "https://..."}. Server fetches, extracts, and deletes the blob.

Limits and rules:

  - 50 MB per file.
  - PDF and DOCX content-types only (no HTML on the blob path).
  - Maximum 5 unconsumed pending uploads per user.
  - Each upload counts against your daily quota at upload-token issuance.
  - Uploaded blobs that are never extracted are automatically deleted within 1 hour.

Security: every blob you upload is bound to your account in our database. If a different user obtains your blob URL (e.g. from a leaked log), they will get HTTP 403 when trying to extract it. The Docpluck blob store is not a free file host — it exists solely as a transit buffer between your client and the extraction engine, and we audit this assumption with hourly orphan sweeps.

Python example (large file)

import requests, os

API = "https://docpluck.vercel.app"
headers = {"Authorization": "Bearer dp_your_api_key"}
path = "big_paper.pdf"

# Steps 1 and 2: mirror the @vercel/blob/client upload() flow from Python:
# request a client token, then PUT the bytes ourselves.
r = requests.post(
    f"{API}/api/extract/upload-url",
    json={
        "type": "blob.generate-client-token",
        "payload": {
            "pathname": os.path.basename(path),
            "callbackUrl": f"{API}/api/extract/upload-url",
            "clientPayload": None,
            "multipart": False,
        },
    },
    headers=headers,
)
r.raise_for_status()
token = r.json()
upload_url = token["clientToken"]  # signed PUT URL

with open(path, "rb") as f:
    requests.put(upload_url, data=f,  # stream from disk instead of loading it all
                 headers={"Content-Type": "application/pdf"})

blob_url = token["url"]  # final blob URL returned in token response

# Step 3: trigger extraction
r = requests.post(
    f"{API}/api/extract?normalize=academic",
    json={"blob_url": blob_url},
    headers=headers,
)
print(r.json()["text"][:200])

Code Examples

Python

import requests

with open("paper.pdf", "rb") as f:
    response = requests.post(
        "https://docpluck.vercel.app/api/extract",
        params={"normalize": "academic"},
        files={"file": f},
        headers={"Authorization": "Bearer dp_your_api_key"},
    )
result = response.json()
text = result["text"]
quality = result["quality"]["score"]  # 0-100
print(f"Extracted {result['metadata']['chars']} chars, quality={quality}")

R

library(httr)

res <- POST(
  "https://docpluck.vercel.app/api/extract",
  query = list(normalize = "academic"),
  body = list(file = upload_file("paper.pdf")),
  add_headers(Authorization = "Bearer dp_your_api_key")
)
body <- content(res)
text <- body$text
quality <- body$quality$score

JavaScript / Node.js

import { readFileSync } from "node:fs";

const form = new FormData();
// The built-in fetch FormData does not accept fs streams; wrap the bytes in a Blob.
form.append("file", new Blob([readFileSync("paper.pdf")]), "paper.pdf");

const res = await fetch(
  "https://docpluck.vercel.app/api/extract?normalize=academic",
  {
    method: "POST",
    body: form,
    headers: { Authorization: "Bearer dp_your_api_key" },
  }
);
const { text, quality } = await res.json();

Normalization Levels

none (0 steps)

Raw text from pdftotext. No modifications. Your app handles all post-processing.

standard (10 steps)

General-purpose cleanup. Safe for any downstream use.

S0: Mathematical Italic Unicode to ASCII
S1: Encoding validation (null bytes, line endings)
S2: Accent recombination (standalone marks + vowels)
S3: Ligature expansion (fi, fl, ff, ffi, ffl)
S4: Quote normalization (curly to straight)
S5: Dash/minus normalization (U+2212, en-dash, em-dash to ASCII)
S6: Whitespace normalization (NBSP, thin space, collapse)
S7: Hyphenation repair (word-\nword to word)
S8: Line break joining (lowercase,\nlowercase)
S9: Header/footer removal (repeated lines + page numbers)
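For intuition, a few of the standard steps (S3, S5, S6, S7) can be approximated locally. This is an illustrative sketch only, not Docpluck's actual implementation:

```python
import re

def sketch_standard(text: str) -> str:
    """Rough local approximation of a few `standard` steps (illustrative only)."""
    # S3: ligature expansion
    for lig, plain in {"\ufb01": "fi", "\ufb02": "fl", "\ufb00": "ff",
                       "\ufb03": "ffi", "\ufb04": "ffl"}.items():
        text = text.replace(lig, plain)
    # S5: dash/minus normalization (U+2212, en dash, em dash to ASCII)
    text = text.replace("\u2212", "-").replace("\u2013", "-").replace("\u2014", "-")
    # S6: whitespace normalization (NBSP, thin space)
    text = text.replace("\u00a0", " ").replace("\u2009", " ")
    # S7: hyphenation repair (word-\nword to word)
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    return text
```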
academic (14 steps: standard + 4)

Everything in standard, plus statistical expression repair for academic papers.

A1: Statistical line break repair (p =\n0.001 joined)
A2: Dropped decimal repair (p = 484 to p = .484)
A3: Decimal comma normalization (European 0,05 to 0.05)
A4: CI delimiter harmonization ([0.1; 0.5] to [0.1, 0.5])
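The academic repairs can likewise be approximated with a few regexes. An illustrative local analogue of A1, A3, and A4, not the real pipeline (and deliberately naive, e.g. the A3 rule would also hit thousands separators):

```python
import re

def sketch_academic(text: str) -> str:
    """Illustrative local analogue of a few `academic` steps (not the real pipeline)."""
    # A1: join statistical line breaks ("p =\n0.001" -> "p = 0.001")
    text = re.sub(r"([=<>])\s*\n\s*(\d)", r"\1 \2", text)
    # A3: European decimal commas ("0,05" -> "0.05"); naive, also hits "1,000"
    text = re.sub(r"(\d),(\d)", r"\1.\2", text)
    # A4: CI delimiter harmonization ("[0.1; 0.5]" -> "[0.1, 0.5]")
    text = re.sub(r"\[([^\[\]]*);([^\[\]]*)\]", r"[\1,\2]", text)
    return text
```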

Response Format

{
  "text": "Extracted and normalized text...",
  "metadata": {
    "engine": "pdftotext_default",
    "extraction_time_ms": 412.3,
    "chars": 80721,
    "pages": 24,
    "pdf_hash": "sha256..."
  },
  "normalization": {
    "level": "academic",
    "version": "1.0.0",
    "steps_applied": ["S0_smp_to_ascii", "S1_encoding_validation", ...],
    "changes_made": { "ligatures_expanded": 12, "dashes_normalized": 3 }
  },
  "quality": {
    "score": 100,
    "confidence": "high",
    "garbled": false,
    "details": { "ligatures_remaining": 0, "garbled_chars": 0 }
  }
}
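Downstream code can gate on the quality block before trusting the text. A sketch; the score threshold of 70 is our own choice, not a documented cutoff:

```python
def usable(result: dict, min_score: int = 70) -> bool:
    """Decide whether an extraction result is trustworthy enough to use.

    min_score=70 is an assumed threshold, not an API-documented cutoff.
    """
    q = result.get("quality", {})
    return not q.get("garbled", True) and q.get("score", 0) >= min_score
```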

Rate Limits

Tier     | Limit       | How to get
Default  | 5 PDFs/day  | Sign in with GitHub or Google
Academic | 50 PDFs/day | Verified academic email (coming soon)
API Key  | Custom      | Contact admin

Error Codes

Code | Meaning
400  | invalid_request or corrupt_file: bad JSON, missing field, malformed PDF
401  | unauthorized: missing or invalid API key or session
403  | forbidden: blob URL is not yours, or was never tracked
410  | corrupt_file: blob already consumed and deleted
413  | file_too_large: multipart body over 4.5 MB; use blob upload instead
422  | password_protected, scanned_no_ocr, empty_text, etc.
429  | rate_limit_exceeded: daily quota or pending-upload cap (5)
502  | service_unavailable: Python extraction service down
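Clients can branch on these codes, e.g. switching to the blob path on 413 and backing off on 429. A minimal dispatch sketch (the retry policy is ours, not prescribed by the API):

```python
def next_action(status_code: int) -> str:
    """Map a Docpluck HTTP status to a client action (policy is ours, not the API's)."""
    if status_code == 413:
        return "switch_to_blob_upload"
    if status_code == 429:
        return "backoff_and_retry"   # daily quota or pending-upload cap hit
    if status_code in (400, 401, 403, 410, 422):
        return "fail_fast"           # client-side problem; retrying will not help
    if status_code == 502:
        return "retry_later"         # extraction service down
    return "ok" if status_code == 200 else "unexpected"
```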