Early access · Invoice extraction

Documents to
JSON, automated.

The extraction API for PDFs, invoices, and documents — structured output, zero configuration.

Structured invoice data via API — per-field confidence scores, MCP-ready, zero configuration.

invoice_acme_2024.pdf → JSON
Input: invoice_acme_2024.pdf (PDF, 342 KB · 2 pages)
Extraction: 2-pass

Field           Value           Confidence
vendor_name     "Acme Corp"     0.98
invoice_date    "2024-11-15"    0.96
total_amount    4250.00         0.91
tax_amount      382.50          0.78
line_items      (not shown)     0.94

Confidence bands: ≥ 0.90 high · 0.70–0.89 medium
  • Launch feature: Invoices → JSON

    Shipping first. More document types will be added based on user feedback.

  • Accuracy: 2-pass validation

    A second extraction pass corrects low-confidence fields automatically.

  • Integration: MCP-ready API

    Works with Claude, Cursor, and any MCP-compatible agent; no custom glue code required.

  • Pricing: Subscription plans

    Monthly plans with a pay-as-you-go option for extra volume.

REST API · Authenticated

Invoice extraction API —
quick start

Same request/response shape as production — Bearer token and multipart upload.

POST /v1/invoices/extract · curl & JSON
Request (curl)
curl -X POST https://documenttojson.dev/v1/invoices/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@invoice.pdf"
Response (JSON)
{
  "job_id": "job_01hx…",
  "status": "completed",
  "data": {
    "vendor_name": {
      "value": "Acme Corp",
      "confidence": 0.98
    },
    "total_amount": {
      "value": 4250,
      "confidence": 0.91
    },
    "line_items": {
      "value": [ … ],
      "confidence": 0.94
    }
  },
  "document_confidence": 0.95
}
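
For callers who are not using curl, the same request can be sketched in Python with only the standard library. The endpoint, Bearer header, and `file` form field come from the curl example above; the function names and the `application/octet-stream` part type are illustrative assumptions, not part of the documented API.

```python
import json
import os
import urllib.request
import uuid

# Endpoint taken from the curl example above.
API_URL = "https://documenttojson.dev/v1/invoices/extract"


def build_extract_request(file_bytes: bytes, filename: str, api_key: str) -> urllib.request.Request:
    """Build the multipart POST without sending it (no network I/O)."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        # Assumption: a generic part type; the service may prefer a specific one.
        b"Content-Type: application/octet-stream\r\n\r\n",
        file_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def extract_invoice(path: str, api_key: str) -> dict:
    """Upload one file and return the decoded JSON response (performs network I/O)."""
    with open(path, "rb") as handle:
        request = build_extract_request(handle.read(), os.path.basename(path), api_key)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes it possible to inspect or unit-test the request without calling the live service.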

How it works

Upload an invoice, get back structured JSON. No configuration, no training data, no custom setup.

  1. Upload an invoice

    Send a PDF or image directly to the API — no preprocessing required. PDF, JPEG, PNG, WEBP, and TIFF are all supported. Documents are validated upfront, so only successfully processed files count toward your usage.

  2. Extraction runs automatically

    All standard invoice fields are extracted automatically (vendor details, line items, amounts, payment information, and dates), with values normalized to consistent formats. On higher-tier plans, a second pass corrects any low-confidence or missing fields.

  3. Receive structured JSON with confidence scores

    Every field returns a value, a confidence score, and the verbatim source text from the document. A document-level confidence score summarises overall extraction quality. Retrieve results immediately via the API or receive them via webhook when processing completes.
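
On the client side, the per-field structure described in step 3 could be loaded into typed records along these lines. This is a minimal sketch: `value` and `confidence` appear in the sample response, but the key name `source_text` for the verbatim source text is an assumption, since the sample does not show it.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: Any
    confidence: float
    # Assumption: the verbatim source text arrives under "source_text".
    source_text: Optional[str] = None


def parse_fields(response: dict) -> Dict[str, ExtractedField]:
    """Turn the "data" object of a response into ExtractedField records."""
    fields = {}
    for name, payload in response.get("data", {}).items():
        fields[name] = ExtractedField(
            name=name,
            value=payload.get("value"),
            confidence=float(payload["confidence"]),
            source_text=payload.get("source_text"),
        )
    return fields


# Values copied from the sample response on this page.
sample_response = {
    "status": "completed",
    "data": {
        "vendor_name": {"value": "Acme Corp", "confidence": 0.98},
        "total_amount": {"value": 4250, "confidence": 0.91},
    },
    "document_confidence": 0.95,
}
parsed = parse_fields(sample_response)
```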

What gets extracted

Invoice extraction is available at launch. Support for receipts, contracts, and additional document types is on the roadmap.

  • Vendor & buyer details

    Vendor and buyer names, each returned with a confidence score and the exact text found in the document.

  • Invoice identification

    Invoice number, invoice date, and due date — normalized to a consistent format regardless of locale or date style used in the original document.

  • Amounts & currency

    Subtotal, tax, and total as clean numbers with currency correctly identified — including ambiguous currency symbols resolved using document language context.

  • Line items

    Every line item with description, quantity, unit price, and line total — returned as a structured list with an overall confidence score.

  • Payment accounts

    Bank account and routing identifiers with automatic type identification (IBAN, SWIFT/BIC, sort codes, and regional formats), including the source text from the document.

  • Payment reference & notes

    Payment reference codes with type classification, plus any free-text payment notes found on the invoice.
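
Before acting on an extracted payment account, a caller may want an independent sanity check. The sketch below applies the standard IBAN mod-97 checksum to a value the API has typed as an IBAN. It is a client-side illustration only and says nothing about how the service itself classifies accounts; it also does not check country-specific lengths.

```python
import re


def is_valid_iban(raw: str) -> bool:
    """Check the general IBAN shape and its mod-97 checksum."""
    iban = re.sub(r"\s+", "", raw).upper()
    # Two letters, two check digits, then 11 to 30 alphanumerics (15 to 34 total).
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", iban):
        return False
    # Move the first four characters to the end, then map A..Z to 10..35.
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1
```

A failed checksum is a reasonable trigger for human review even when the field's confidence score is high, because a single misread digit usually breaks the check.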

Frequently asked questions

What file formats are supported?
PDF (single and multi-page), JPEG, PNG, WEBP, and TIFF. Files are uploaded directly to the API with a maximum size of 20 MB. Page limits vary by plan.
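
A pre-upload check along these lines can catch unsupported files before they reach the API. The extension list mirrors the formats named above; reading "20 MB" as 20 × 1024 × 1024 bytes is an assumption, and the function name is illustrative.

```python
from pathlib import Path
from typing import List

SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png", ".webp", ".tif", ".tiff"}
# Assumption: "20 MB" is read as 20 MiB; the service may count bytes differently.
MAX_BYTES = 20 * 1024 * 1024


def preflight_problems(filename: str, size_bytes: int) -> List[str]:
    """Return reasons the file would likely be rejected (empty list if none)."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_BYTES}")
    return problems
```

Page limits vary by plan and are not checked here.
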
Do I need to configure the extraction schema?
No. The invoice schema is predefined — every response returns the same consistent field set with values normalized to standard formats. Support for custom schemas is planned for a future release.
What are per-field confidence scores?
Every extracted field includes a confidence score from 0 to 1. High scores indicate reliable extraction; lower scores flag fields that may need a second look — useful for routing uncertain results to human review before writing to your system of record.
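
Routing by confidence can be sketched as follows. The 0.90 and 0.70 boundaries come from the bands shown in the demo on this page; the "low" label for anything under 0.70 and the default review threshold are assumptions to tune for your own workflow.

```python
from typing import Dict, List


def confidence_band(score: float) -> str:
    """Map a 0 to 1 score onto the bands shown in the demo."""
    if score >= 0.90:
        return "high"
    if score >= 0.70:
        return "medium"
    return "low"  # Assumption: the page does not name a band below 0.70.


def fields_needing_review(data: Dict[str, dict], threshold: float = 0.90) -> List[str]:
    """List field names whose confidence falls below the threshold."""
    return sorted(name for name, field in data.items() if field["confidence"] < threshold)


# Confidence values copied from the demo on this page.
demo = {
    "vendor_name": {"value": "Acme Corp", "confidence": 0.98},
    "invoice_date": {"value": "2024-11-15", "confidence": 0.96},
    "total_amount": {"value": 4250.00, "confidence": 0.91},
    "tax_amount": {"value": 382.50, "confidence": 0.78},
}
review_queue = fields_needing_review(demo)
```

With the demo values, only `tax_amount` falls below 0.90 and would be sent for review.
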
How does 2-pass extraction work?
On higher-tier plans, the API automatically runs a second extraction pass when critical fields are missing or have low confidence scores. The second pass targets only the failing fields rather than re-processing the entire document, improving accuracy without adding significant latency.
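
The idea of retrying only the failing fields can be illustrated with a small sketch. This is not the service's implementation: the `reextract` callable, the 0.70 threshold, the list of critical fields, and the keep-the-higher-confidence merge rule are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List

Fields = Dict[str, dict]


def failing_fields(data: Fields, critical: List[str], threshold: float = 0.70) -> List[str]:
    """Return critical fields that are missing, empty, or below the threshold."""
    failing = []
    for name in critical:
        field = data.get(name)
        if field is None or field.get("value") is None or field.get("confidence", 0.0) < threshold:
            failing.append(name)
    return failing


def second_pass(data: Fields, critical: List[str],
                reextract: Callable[[List[str]], Fields],
                threshold: float = 0.70) -> Fields:
    """Re-run extraction for failing fields only and merge the better results."""
    targets = failing_fields(data, critical, threshold)
    merged = dict(data)
    if not targets:
        return merged
    for name, candidate in reextract(targets).items():
        if name not in targets:
            continue
        current = merged.get(name)
        # Assumed merge rule: keep whichever result has the higher confidence.
        if current is None or candidate.get("confidence", 0.0) > current.get("confidence", 0.0):
            merged[name] = candidate
    return merged


# Illustrative data only; fake_reextract stands in for a real second pass.
first_pass = {
    "vendor_name": {"value": "Acme Corp", "confidence": 0.98},
    "tax_amount": {"value": 382.5, "confidence": 0.55},
}
calls = []


def fake_reextract(names: List[str]) -> Fields:
    calls.append(list(names))
    return {
        "tax_amount": {"value": 382.5, "confidence": 0.93},
        "invoice_date": {"value": "2024-11-15", "confidence": 0.96},
    }


result = second_pass(first_pass, ["vendor_name", "tax_amount", "invoice_date"], fake_reextract)
```

In this run the second pass is asked only for `tax_amount` and `invoice_date`; `vendor_name` is left untouched.
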
What does MCP-ready mean?
Model Context Protocol (MCP) is an open standard for connecting AI agents to external tools. doc-to-json ships a native MCP server, so any MCP-compatible AI agent can call invoice extraction directly with no custom integration code.
Is extracted data stored?
By default, uploaded files and extracted results are stored for 30 days so you can retrieve them later. Retention can be disabled on a per-job basis — useful for sensitive documents where you want no data kept after processing. Billing and audit records are always retained.
Can I use webhooks for async processing?
Yes. Provide a webhook URL and the API will deliver the result when extraction completes, fails, or returns a partial result. Webhook deliveries are signed so you can verify their authenticity. You can also poll the job status endpoint directly if you prefer.
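
Verifying a signed webhook usually means recomputing a keyed hash over the raw request body and comparing it to the signature that was sent. The page does not state the signing scheme, so the sketch below assumes HMAC-SHA256 with a shared secret and a hex-encoded signature; check the API reference for the actual algorithm and header name before relying on it.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, raw_body: bytes) -> str:
    """Compute a hex HMAC-SHA256 over the raw body (assumed scheme)."""
    return hmac.new(secret, raw_body, hashlib.sha256).hexdigest()


def is_authentic(secret: bytes, raw_body: bytes, received_signature: str) -> bool:
    """Compare signatures in constant time to avoid timing leaks."""
    expected = sign_payload(secret, raw_body)
    return hmac.compare_digest(expected.encode(), received_signature.encode())
```

Verify against the exact bytes received, before any JSON parsing, because re-serializing the payload can change whitespace and break the signature.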