Extract structured data from any document.

Send a document and a JSON schema. Get perfectly typed JSON back. One API call, any file format, pay-as-you-go.


50 pages free: no credit card
Any format: PDF, images, DOCX, EML
JSON out: schema-enforced

terminal

# Extract with a single curl
curl -X POST "https://api.clichefactory.com/v1/extract" \
  -H "X-API-KEY: $CF_KEY" \
  -F "file=@invoice.pdf" \
  -F 'schema={"type":"object","properties":{"total":{"type":"number"}}}'

# Or use the Python SDK
from clichefactory import factory

client = factory(api_key="your-key")
cliche = client.cliche({"type": "object", ...})
result = cliche.extract(file="invoice.pdf")

The Pipeline

How it works

From unstructured file to strict JSON in seconds.

Send Document

Pass any file via API or SDK—PDFs, images, emails, or spreadsheets. No preprocessing required.

Apply Schema

Provide a JSON Schema or your saved ClicheFactory batch model JSON. Our engine handles the OCR and maps the fields automatically.

Receive JSON

Get typed, schema-validated data back in seconds, ready to drop straight into your database.
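For readers who prefer plain Python over curl, the three steps above can be sketched with the standard library alone. The endpoint, the X-API-KEY header, and the file and schema form fields are taken from the curl example; the helper name build_extract_request is hypothetical, and the shape of the response is an assumption (presumably JSON), so the network call is left commented out.

```python
import json
import uuid
import urllib.request

API_URL = "https://api.clichefactory.com/v1/extract"  # endpoint from the curl example


def build_extract_request(file_bytes, filename, schema, api_key):
    """Build (but do not send) a multipart POST mirroring the curl example."""
    boundary = uuid.uuid4().hex
    parts = [
        # "file" part: the raw document bytes
        (
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            'Content-Type: application/octet-stream\r\n\r\n'
        ).encode() + file_bytes + b"\r\n",
        # "schema" part: the JSON Schema serialized as a string
        (
            f'--{boundary}\r\n'
            'Content-Disposition: form-data; name="schema"\r\n\r\n'
            f'{json.dumps(schema)}\r\n'
        ).encode(),
        # closing boundary
        f'--{boundary}--\r\n'.encode(),
    ]
    return urllib.request.Request(
        API_URL,
        data=b"".join(parts),
        method="POST",
        headers={
            "X-API-KEY": api_key,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


schema = {"type": "object", "properties": {"total": {"type": "number"}}}
request = build_extract_request(b"%PDF-1.4 ...", "invoice.pdf", schema, "your-key")

# To send it for real (assumes the response body is JSON):
# with urllib.request.urlopen(request) as response:
#     data = json.loads(response.read())
```

Building the request separately from sending it keeps the sketch testable offline; in practice the SDK or curl call is likely the simpler route.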

One API. Every document.

Unified extraction for your entire document pipeline.

PDF

JPG & PNG

DOC / DOCX / ODT

Excel / CSV

EML

WebP, GIF & BMP

Markdown & TXT

Define your extraction model as JSON Schema or batch JSON in the app — get structured, validated JSON back.

Single-call extraction handles documents up to ~100 pages. For longer files, the Python SDK chunks and merges them.
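To make the chunking idea concrete, here is a minimal sketch of splitting a long document into page ranges that each fit a single call. It is not the SDK's actual implementation: the function name page_chunks is hypothetical, the limit of exactly 100 pages is an assumption based on the approximate figure above, and how the SDK merges per-chunk results is not described here.

```python
def page_chunks(total_pages, max_pages=100):
    """Return 1-based, inclusive (start, end) page ranges of at most max_pages each."""
    ranges = []
    start = 1
    while start <= total_pages:
        end = min(start + max_pages - 1, total_pages)
        ranges.append((start, end))
        start = end + 1
    return ranges


print(page_chunks(250))  # [(1, 100), (101, 200), (201, 250)]
```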

Strict Privacy? BYOK.

Keep sensitive documents on your own hardware. Use Local Mode to process files locally and bring your own LLM API key. ClicheFactory orchestrates the parsing, but your data never hits our servers.

Meet Developers Where They Are

MCP Server

Extract data directly inside Cursor or Claude Desktop.

CLI Native

Hook into CI/CD pipelines or shell scripts with cf extract.

Python SDK

JSON Schema–first workflows, batch extraction, and async out of the box.

Transparent pricing

Start free with 50 pages. Pay per page after that — no subscriptions.

Mode	Full Service	BYOK	Best For
Fast	10 credits	2 credits	High-volume, latency-sensitive
Standard (popular)	40 credits	5 credits	General use, best accuracy/cost
Robust	80 credits	10 credits	High-stakes, verification pass

1,000 credits = $1.00 USD. Credits never expire.
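As a rough worked example of the pricing above, the sketch below converts a page count into dollars. It assumes the credit figures in the table are charged per page (consistent with "pay per page", but not stated in the table itself), ignores the 50 free pages, and does not include whatever your own LLM provider bills you in BYOK mode. The function name estimate_cost_usd is hypothetical.

```python
# Credits per page by mode, copied from the pricing table (per-page is an assumption).
CREDITS_PER_PAGE = {
    "fast": {"full_service": 10, "byok": 2},
    "standard": {"full_service": 40, "byok": 5},
    "robust": {"full_service": 80, "byok": 10},
}
CREDITS_PER_USD = 1000  # 1,000 credits = $1.00 USD


def estimate_cost_usd(pages, mode, byok=False):
    """Estimate the ClicheFactory charge in USD for a number of pages."""
    plan = "byok" if byok else "full_service"
    credits = pages * CREDITS_PER_PAGE[mode][plan]
    return credits / CREDITS_PER_USD


print(estimate_cost_usd(200, "standard"))             # 8.0
print(estimate_cost_usd(200, "standard", byok=True))  # 1.0
```

Under these assumptions, 200 pages in Standard mode come to 8,000 credits ($8.00) with Full Service, or 1,000 credits ($1.00) with BYOK.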

Zero-Code Labeling UI

We built the labeling UI so you don't have to.

Need domain experts to review extractions or build ground-truth datasets? Don't build internal tools. Define your extraction model in the app or upload your own JSON Schema, and we generate a strict, type-validated web UI for your non-technical team.

Schema-Driven: Number, multiline text, and date inputs generated directly from your field types.
Type Validated: No more spreadsheets with strings where floats should be.
Client-Ready: Invite reviewers directly to your workspace to verify AI outputs safely.

1. Your model (JSON Schema)

{
  "title": "Invoice",
  "type": "object",
  "properties": {
    "total": { "type": "number" },
    "vendor": { "type": "string" },
    "invoice_date": {
      "type": "string",
      "format": "date"
    }
  }
}

2. Your reviewer sees this

total *

vendor *

Acme Supplies Inc.

invoice_date *
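The schema-to-form idea can be illustrated with a small sketch that picks an input widget from each field's type and rejects values of the wrong type. This is only an illustration of the concept, not the app's actual logic: the names widget_for and parse_input, the widget labels, and the choice of ISO dates are all assumptions.

```python
from datetime import date


def widget_for(field):
    """Pick an input widget from a JSON Schema property definition."""
    if field.get("type") == "number":
        return "number"
    if field.get("type") == "string" and field.get("format") == "date":
        return "date"
    return "text"


def parse_input(field, raw):
    """Convert a reviewer's raw text into a typed value, raising ValueError if invalid."""
    kind = widget_for(field)
    if kind == "number":
        return float(raw)
    if kind == "date":
        return date.fromisoformat(raw).isoformat()
    return raw


properties = {
    "total": {"type": "number"},
    "vendor": {"type": "string"},
    "invoice_date": {"type": "string", "format": "date"},
}

for name, field in properties.items():
    print(name, "->", widget_for(field))

print(parse_input(properties["total"], "1250.50"))  # 1250.5
```

A value such as "twelve" entered in the total field raises ValueError instead of being stored as a string, which is the kind of error a typed form is meant to catch.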

Trained Pipelines

BYOK required

Higher accuracy on your documents.

Bring your own LLM key (OpenAI, Gemini, or Anthropic) and train custom extraction pipelines on your document types. Upload labeled examples, train in minutes, deploy an artifact. Use it in the SDK, CLI, or API — one artifact ID, that's it.


# Train via the web app, then use the artifact

cliche = client.cliche(Invoice, artifact_id="art_abc123")