Extract structured data from any document.

Send a document and a JSON Schema. Get typed JSON back that validates against your schema. One API call, any supported file format, pay-as-you-go.

50 pages free: no credit card
Any format: PDF, images, DOCX, EML
JSON out: schema-enforced
terminal
# Extract with a single curl
curl -X POST "https://api.clichefactory.com/v1/extract" \
  -H "X-API-KEY: $CF_KEY" \
  -F "file=@invoice.pdf" \
  -F 'schema={"type":"object","properties":{"total":{"type":"number"}}}'

# Or use the Python SDK
from clichefactory import factory
client = factory(api_key="your-key")
cliche = client.cliche({"type": "object", ...})
result = cliche.extract(file="invoice.pdf")
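
If you would rather not shell out to curl, the same request can be assembled with the Python standard library. This is a minimal sketch, not official client code: the endpoint, the X-API-KEY header, and the file and schema form fields come from the curl example above, while the helper name build_extract_request, the placeholder file bytes, and the octet-stream content type are assumptions made for illustration.

```python
import json
import uuid
import urllib.request

# Endpoint taken from the curl example
API_URL = "https://api.clichefactory.com/v1/extract"


def build_extract_request(api_key, filename, file_bytes, schema):
    """Build (but do not send) a multipart/form-data POST mirroring the curl call.

    Hypothetical helper: not part of the ClicheFactory SDK.
    """
    boundary = uuid.uuid4().hex
    # File part, equivalent to: -F "file=@invoice.pdf"
    file_part = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"  # assumed content type
    ).encode() + file_bytes + b"\r\n"
    # Schema part, equivalent to: -F 'schema={...}'
    schema_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="schema"\r\n\r\n'
        f"{json.dumps(schema)}\r\n"
    ).encode()
    body = file_part + schema_part + f"--{boundary}--\r\n".encode()
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "X-API-KEY": api_key,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


schema = {"type": "object", "properties": {"total": {"type": "number"}}}
# Placeholder bytes stand in for the real contents of invoice.pdf
request = build_extract_request("your-key", "invoice.pdf", b"placeholder bytes", schema)
```

Passing the request to urllib.request.urlopen would send it. The exact shape of the response body is not documented on this page, so parse it defensively.
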
The Pipeline

How it works

From unstructured file to strict JSON in seconds.

1. Send Document

Pass any file via API or SDK—PDFs, images, emails, or spreadsheets. No preprocessing required.

2. Apply Schema

Provide a JSON Schema or your saved ClicheFactory batch model JSON. Our engine handles the OCR and maps the fields automatically.

3. Receive JSON

Get typed, schema-validated data back in seconds, ready to insert into your database.
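
To make "typed" concrete, here is a small sketch of the kind of check a schema implies. It is a deliberately simplified stand-in that only looks at the type keyword on top-level properties. It is not ClicheFactory's validator and not a full JSON Schema implementation, and the sample values are made up.

```python
def check_types(data, schema):
    """Return a list of type errors for a flat object schema (simplified check)."""
    # bool is a subclass of int in Python, so exclude it from the numeric checks
    checks = {
        "string": lambda v: isinstance(v, str),
        "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
        "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
        "boolean": lambda v: isinstance(v, bool),
    }
    errors = []
    for name, spec in schema.get("properties", {}).items():
        if name not in data:
            continue  # missing fields are not checked in this sketch
        check = checks.get(spec.get("type"))
        if check is not None and not check(data[name]):
            errors.append(
                f"{name}: expected {spec['type']}, got {type(data[name]).__name__}"
            )
    return errors


schema = {
    "type": "object",
    "properties": {"total": {"type": "number"}, "vendor": {"type": "string"}},
}
good = {"total": 1249.5, "vendor": "Acme"}      # made-up sample values
bad = {"total": "1249.50", "vendor": "Acme"}    # total arrives as a string

print(check_types(good, schema))
print(check_types(bad, schema))
```

The second record is the failure a schema-enforced response is meant to rule out: a total that looks numeric but arrives as a string.
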

One API. Every document.

Unified extraction for your entire document pipeline.

PDF
JPG & PNG
DOC / DOCX / ODT
Excel / CSV
EML
WebP, GIF & BMP
Markdown & TXT

Define your extraction model as JSON Schema or batch JSON in the app — get structured, validated JSON back.

Single-call extraction handles documents up to ~100 pages. For longer files, the Python SDK chunks and merges them.
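
As a rough illustration of what chunk-and-merge involves, the sketch below splits a long document into page ranges of at most 100 pages and combines the per-chunk results. The function names and the merge policy (concatenate lists, keep the first non-null scalar) are assumptions chosen for the example. The SDK's actual strategy is not described here and may differ.

```python
def page_chunks(total_pages, max_pages=100):
    """Split pages 1..total_pages into consecutive (start, end) ranges."""
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


def merge_results(results):
    """Merge per-chunk dicts: concatenate lists, keep the first non-null scalar."""
    merged = {}
    for result in results:
        for key, value in result.items():
            if isinstance(value, list):
                existing = merged.get(key)
                merged[key] = (existing if isinstance(existing, list) else []) + value
            elif merged.get(key) is None:
                merged[key] = value
    return merged


chunks = page_chunks(250)
# Made-up per-chunk outputs, standing in for one extraction call per chunk
partials = [
    {"vendor": "Acme", "line_items": [{"sku": "A1"}]},
    {"vendor": None, "line_items": [{"sku": "B2"}], "total": 30.0},
]
combined = merge_results(partials)
```

Header-style fields such as a vendor name usually appear once, near the start, while repeated fields such as line items span every chunk, which is why the two kinds are merged differently in this sketch.
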

Strict Privacy? BYOK.

Keep sensitive documents on your own hardware. Use Local Mode to process files locally and bring your own LLM API key. ClicheFactory orchestrates the parsing, but your data never hits our servers.

Meet Developers Where They Are
  • MCP Server: Extract data directly inside Cursor or Claude Desktop.
  • CLI Native: Hook into CI/CD pipelines or shell scripts with cf extract.
  • Python SDK: JSON Schema–first workflows, batch extraction, and async out of the box.

Transparent pricing

Start free with 50 pages. Pay per page after that — no subscriptions.

Mode               | Full Service | BYOK       | Best For
Fast               | 10 credits   | 2 credits  | High-volume, latency-sensitive
Standard (Popular) | 40 credits   | 5 credits  | General use, best accuracy/cost
Robust             | 80 credits   | 10 credits | High-stakes, verification pass

1,000 credits = $1.00 USD. Credits never expire.
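
For budgeting, the arithmetic can be sketched as below. It assumes the credit figures in the pricing table are charged per page, which is how "pay per page" reads but is not stated outright, and it ignores the 50 free pages. The function name estimate_cost_usd is hypothetical.

```python
# Credits per page by mode, copied from the pricing table (assumed to be per page)
CREDITS_PER_PAGE = {
    "fast": {"full_service": 10, "byok": 2},
    "standard": {"full_service": 40, "byok": 5},
    "robust": {"full_service": 80, "byok": 10},
}
CREDITS_PER_USD = 1000  # 1,000 credits = $1.00 USD


def estimate_cost_usd(pages, mode="standard", byok=False):
    """Estimate the cost in USD of extracting `pages` pages in the given mode."""
    plan = "byok" if byok else "full_service"
    credits = pages * CREDITS_PER_PAGE[mode][plan]
    return credits / CREDITS_PER_USD


print(estimate_cost_usd(200))                     # Standard, Full Service
print(estimate_cost_usd(200, "fast", byok=True))  # Fast, BYOK
```

Under these assumptions, 200 pages cost $8.00 in Standard mode with Full Service and $0.40 in Fast mode with BYOK, not counting what you pay your own LLM provider.
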

Zero-Code Labeling UI

We built the labeling UI so you don't have to.

Need domain experts to review extractions or build ground-truth datasets? Don't build internal tools. Define your extraction model in the app or upload your own JSON Schema, and we generate a strict, type-validated web UI for your non-technical team.

  • Schema-Driven: Number, multiline text, and date inputs generated directly from your field types.
  • Type Validated: No more spreadsheets with strings where floats should be.
  • Client-Ready: Invite reviewers directly to your workspace to verify AI outputs safely.
# 1. Your model (JSON Schema)
{
  "title": "Invoice",
  "type": "object",
  "properties": {
    "total": { "type": "number" },
    "vendor": { "type": "string" },
    "invoice_date": {
      "type": "string",
      "format": "date"
    }
  }
}
# 2. Your reviewer sees a form generated from it: number, text, and date inputs for total, vendor, and invoice_date.
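
The schema-to-form idea can be sketched in a few lines. The mapping below covers only the three input kinds named in the feature list (number, multiline text, date). The function name input_for and the widget labels are illustrative assumptions, not the app's actual rendering logic.

```python
def input_for(field_schema):
    """Pick an input widget for one JSON Schema property (illustrative mapping)."""
    field_type = field_schema.get("type")
    if field_type in ("number", "integer"):
        return "number"
    if field_type == "string" and field_schema.get("format") == "date":
        return "date"
    return "multiline text"  # default for plain strings and anything unrecognized


# Same Invoice schema as shown above
invoice_schema = {
    "title": "Invoice",
    "type": "object",
    "properties": {
        "total": {"type": "number"},
        "vendor": {"type": "string"},
        "invoice_date": {"type": "string", "format": "date"},
    },
}

form = {name: input_for(spec) for name, spec in invoice_schema["properties"].items()}
print(form)
```

Because each input is derived from the field type, a reviewer has no way to type free text into a numeric field, which is where the type validation comes from.
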
Trained Pipelines (BYOK required)

Higher accuracy on your documents.

Bring your own LLM key (OpenAI, Gemini, or Anthropic) and train custom extraction pipelines on your document types. Upload labeled examples, train in minutes, deploy an artifact. Use it in the SDK, CLI, or API — one artifact ID, that's it.

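# Train via the web app, then use the artifact
from clichefactory import factory
client = factory(api_key="your-key")
# Invoice is your extraction model, defined elsewhere
cliche = client.cliche(Invoice, artifact_id="art_abc123")
result = cliche.extract(file="invoice.pdf")
# That's it: higher accuracy, same API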

Start extracting in minutes

50 free pages. No credit card. Full API access.