API Documentation
Integrate Docpluck into your research pipeline. Extract clean, normalized text from academic PDFs via a simple HTTP API.
Getting an API Key
API keys are issued by the administrator. Contact giladfel@gmail.com with your project name and use case. Keys follow the format dp_xxxxxxxxxxxx.
Extract Text
POST https://docpluck.vercel.app/api/extract

Query Parameters
| Parameter | Default | Description |
|---|---|---|
| normalize | none | Normalization level: none, standard, or academic |
| quality | true | Include quality scoring in response |
Request Body
Send the PDF as multipart/form-data with the field name file. This works for files up to 4 MB.
Files larger than 4 MB
Vercel's serverless platform caps request bodies at 4.5 MB. For larger PDFs, Docpluck supports a three-step upload via Vercel Blob storage: step one mints a short-lived upload token, step two PUTs the bytes to blob storage, and step three triggers extraction by URL. The blob is deleted as soon as extraction completes (or within 1 hour if you abandon it).
1. POST /api/extract/upload-url with your Bearer token. Returns a signed Vercel Blob client-upload token.
2. Upload the file directly to Vercel Blob using the SDK (@vercel/blob/client's upload() handles steps 1 and 2 in one call).
3. POST /api/extract with JSON body {"blob_url": "https://..."}. The server fetches the blob, extracts, and deletes it.
Limits and rules:
- 50 MB per file.
- PDF and DOCX content-types only (no HTML on the blob path).
- Maximum of 5 unconsumed uploads pending per user.
- Each upload counts against your daily quota when the upload token is issued.
- Uploaded blobs that are never extracted are automatically deleted within 1 hour.
Security: every blob you upload is bound to your account in our database. If a different user obtains your blob URL (e.g. from a leaked log), they will get HTTP 403 when trying to extract it. The Docpluck blob store is not a free file host — it exists solely as a transit buffer between your client and the extraction engine, and we audit this assumption with hourly orphan sweeps.
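Putting the two upload paths together, a client can decide locally how to send a file before spending any quota. The helper below is a sketch, not part of the API: the 4 MB multipart threshold and the 50 MB blob cap are taken from the notes above, and choose_upload_path is a hypothetical name.

```python
import os

MULTIPART_LIMIT = 4 * 1024 * 1024        # multipart bodies should stay under ~4 MB
BLOB_LIMIT = 50 * 1024 * 1024            # blob path allows up to 50 MB per file
ALLOWED_EXTENSIONS = {".pdf", ".docx"}   # blob path accepts PDF and DOCX only

def choose_upload_path(path: str) -> str:
    """Hypothetical pre-flight check: pick 'multipart' or 'blob' for a local file."""
    ext = os.path.splitext(path)[1].lower()
    size = os.path.getsize(path)
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {ext}")
    if size > BLOB_LIMIT:
        raise ValueError("file exceeds the 50 MB per-file limit")
    return "multipart" if size <= MULTIPART_LIMIT else "blob"
```

Rejecting oversized or unsupported files locally avoids burning an upload token, and with it a unit of daily quota, on a request the API would refuse anyway.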
Python example (large file)
```python
import requests, os

API = "https://docpluck.vercel.app"
headers = {"Authorization": "Bearer dp_your_api_key"}
path = "big_paper.pdf"

# Steps 1 and 2: the @vercel/blob/client SDK does this in the browser;
# from Python, request a client-upload token, then PUT the bytes ourselves.
r = requests.post(
    f"{API}/api/extract/upload-url",
    json={
        "type": "blob.generate-client-token",
        "payload": {
            "pathname": os.path.basename(path),
            "callbackUrl": f"{API}/api/extract/upload-url",
            "clientPayload": None,
            "multipart": False,
        },
    },
    headers=headers,
)
token = r.json()
upload_url = token["clientToken"]  # signed PUT URL
with open(path, "rb") as f:
    requests.put(upload_url, data=f.read(),
                 headers={"Content-Type": "application/pdf"})
blob_url = token["url"]  # final blob URL returned in the token response

# Step 3: trigger extraction
r = requests.post(
    f"{API}/api/extract?normalize=academic",
    json={"blob_url": blob_url},
    headers=headers,
)
print(r.json()["text"][:200])
```

Code Examples
Python
```python
import requests

with open("paper.pdf", "rb") as f:
    response = requests.post(
        "https://docpluck.vercel.app/api/extract",
        params={"normalize": "academic"},
        files={"file": f},
        headers={"Authorization": "Bearer dp_your_api_key"},
    )

result = response.json()
text = result["text"]
quality = result["quality"]["score"]  # 0-100
print(f"Extracted {result['metadata']['chars']} chars, quality={quality}")
```

R
```r
library(httr)

res <- POST(
  "https://docpluck.vercel.app/api/extract",
  query = list(normalize = "academic"),
  body = list(file = upload_file("paper.pdf")),
  add_headers(Authorization = "Bearer dp_your_api_key")
)
result <- content(res)
text <- result$text
quality <- result$quality$score
```

JavaScript / Node.js
```javascript
import { readFile } from "node:fs/promises";

const form = new FormData();
form.append(
  "file",
  new Blob([await readFile("paper.pdf")], { type: "application/pdf" }),
  "paper.pdf"
);

const res = await fetch(
  "https://docpluck.vercel.app/api/extract?normalize=academic",
  {
    method: "POST",
    body: form,
    headers: { Authorization: "Bearer dp_your_api_key" },
  }
);
const { text, quality } = await res.json();
```

Normalization Levels
| Level | Steps | Description |
|---|---|---|
| none | 0 | Raw text from pdftotext. No modifications. Your app handles all post-processing. |
| standard | 10 | General-purpose cleanup. Safe for any downstream use. |
| academic | 14 (standard + 4) | Everything in standard, plus statistical expression repair for academic papers. |
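If you are unsure which level a corpus needs, one approach is to run the same document through each level and compare the results. The snippet below is a sketch of that idea using the multipart endpoint documented above; it only reads fields shown in the Response Format section below.

```python
import requests

API = "https://docpluck.vercel.app/api/extract"
HEADERS = {"Authorization": "Bearer dp_your_api_key"}

for level in ("none", "standard", "academic"):
    # One request per level; each counts against your daily quota.
    with open("paper.pdf", "rb") as f:
        r = requests.post(API, params={"normalize": level},
                          files={"file": f}, headers=HEADERS)
    body = r.json()
    print(f"{level:>8}: {body['metadata']['chars']} chars, "
          f"quality={body['quality']['score']}, "
          f"steps={len(body['normalization']['steps_applied'])}")
```

On the Default tier this comparison alone uses three of the five requests allowed per day, so it is best done once on a representative document rather than per file.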
Response Format
```json
{
  "text": "Extracted and normalized text...",
  "metadata": {
    "engine": "pdftotext_default",
    "extraction_time_ms": 412.3,
    "chars": 80721,
    "pages": 24,
    "pdf_hash": "sha256..."
  },
  "normalization": {
    "level": "academic",
    "version": "1.0.0",
    "steps_applied": ["S0_smp_to_ascii", "S1_encoding_validation", ...],
    "changes_made": { "ligatures_expanded": 12, "dashes_normalized": 3 }
  },
  "quality": {
    "score": 100,
    "confidence": "high",
    "garbled": false,
    "details": { "ligatures_remaining": 0, "garbled_chars": 0 }
  }
}
```
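The quality block is meant to be machine-checked. A minimal sketch of gating on it, using only the fields shown above (the 70-point threshold is an arbitrary choice for illustration, not an API recommendation):

```python
def is_usable(result: dict, min_score: int = 70) -> bool:
    """Gate an extraction on its quality block; threshold is arbitrary."""
    quality = result["quality"]
    if quality["garbled"]:
        return False                       # extractor flagged garbled output
    return quality["score"] >= min_score   # score is on a 0-100 scale

# With the example response shown above:
example = {"quality": {"score": 100, "confidence": "high", "garbled": False,
                       "details": {"ligatures_remaining": 0, "garbled_chars": 0}}}
print(is_usable(example))  # True
```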
Rate Limits

| Tier | Limit | How to get |
|---|---|---|
| Default | 5 PDFs/day | Sign in with GitHub or Google |
| Academic | 50 PDFs/day | Verified academic email (coming soon) |
| API Key | Custom | Contact admin |
Error Codes
| Code | Meaning |
|---|---|
| 400 | invalid_request or corrupt_file — bad JSON, missing field, malformed PDF |
| 401 | unauthorized — missing/invalid API key or session |
| 403 | forbidden — blob URL is not yours, or never tracked |
| 410 | corrupt_file — blob already consumed and deleted |
| 413 | file_too_large — multipart body >4.5 MB, use blob upload instead |
| 422 | password_protected, scanned_no_ocr, empty_text, etc. |
| 429 | rate_limit_exceeded — daily quota or pending-upload cap (5) |
| 502 | service_unavailable — Python extraction service down |
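For unattended pipelines it can help to translate these codes into explicit outcomes instead of treating every non-200 response the same. The mapping below is a sketch based on the table above; treating 502 as retryable is an assumption, not documented behavior.

```python
import requests

QUOTA = {429}                       # daily quota or pending-upload cap reached
RETRYABLE = {502}                   # extraction service down; assumed transient
DOCUMENT_ERRORS = {400, 413, 422}   # problems with this particular file

def classify(response: requests.Response) -> str:
    """Map a Docpluck response to a coarse action for a batch pipeline."""
    if response.ok:
        return "ok"
    if response.status_code in QUOTA:
        return "stop"    # quota errors affect the whole run, not just one file
    if response.status_code in RETRYABLE:
        return "retry"
    if response.status_code in DOCUMENT_ERRORS:
        return "skip"    # log it and move on to the next document
    return "fail"        # 401/403/410: fix credentials or the blob flow first
```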