Documents to JSON, automated.
The extraction API for PDFs, invoices, and documents — structured output, zero configuration.
Structured invoice data via API — per-field confidence scores, MCP-ready, zero configuration.
Invoice extraction API: quick start
Send an invoice PDF, get back structured JSON with per-field confidence scores.
Same request/response shape as production — Bearer token and multipart upload.
curl -X POST https://documenttojson.dev/v1/invoices/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@invoice.pdf"

How it works
Upload an invoice, get back structured JSON. No configuration, no training data, no custom setup.
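For callers who prefer Python to curl, the same upload might be sketched as follows. This is a minimal, standard-library-only illustration, not an official client. The endpoint URL, Bearer header, and `file` form field come from the quick start, and the format list and 20 MB cap come from the FAQ. The helper name `build_extract_request` and the client-side checks are assumptions added for illustration; the API performs its own validation.

```python
import mimetypes
import uuid
from pathlib import Path
from urllib.request import Request

API_URL = "https://documenttojson.dev/v1/invoices/extract"  # from the quick start
ALLOWED_SUFFIXES = {".pdf", ".jpg", ".jpeg", ".png", ".webp", ".tif", ".tiff"}
MAX_BYTES = 20 * 1024 * 1024  # 20 MB; treating "MB" as 2**20 bytes is an assumption


def build_extract_request(path: Path, api_key: str) -> Request:
    """Build (but do not send) the multipart POST for one invoice file."""
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported file type: {path.suffix}")
    data = path.read_bytes()
    if len(data) > MAX_BYTES:
        raise ValueError("file exceeds the 20 MB upload limit")

    # Hand-rolled multipart/form-data body with a single part named "file".
    boundary = uuid.uuid4().hex
    content_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{path.name}"\r\n'.encode(),
        f"Content-Type: {content_type}\r\n\r\n".encode(),
        data,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Building the request separately keeps the validation and encoding easy to inspect without touching the network.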
- 01 Upload an invoice
Send a PDF or image directly to the API; no preprocessing is required. PDF, JPEG, PNG, WEBP, and TIFF are all supported. Documents are validated upfront, so only successfully processed files count toward your usage.
- 02 Extraction runs automatically
Extraction uses a predefined invoice schema that covers all standard fields (vendor details, line items, amounts, payment information, and dates), with values normalized to consistent formats. On higher-tier plans, a second pass re-extracts any low-confidence or missing fields.
- 03 Receive structured JSON with confidence scores
Every field returns a value, a confidence score, and the verbatim source text from the document. A document-level confidence score summarizes overall extraction quality. Retrieve results immediately via the API or receive them via webhook when processing completes.
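To make the value / confidence / source-text triple concrete, here is a small sketch of how a client might consume such a response and flag weak fields for review. The JSON layout, the key names (`fields`, `value`, `confidence`, `source_text`), the sample values, and the 0.8 threshold are all assumptions made up for illustration; the actual response schema may differ.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: Any
    confidence: float
    source_text: str


def parse_fields(payload: dict) -> list[ExtractedField]:
    """Flatten the (assumed) "fields" mapping into typed records."""
    return [
        ExtractedField(name, f["value"], f["confidence"], f["source_text"])
        for name, f in payload["fields"].items()
    ]


def needs_review(fields: list[ExtractedField], threshold: float = 0.8) -> list[str]:
    """Return the names of fields whose confidence is below the threshold."""
    return [f.name for f in fields if f.confidence < threshold]


# Invented sample payload; every key and value here is hypothetical.
SAMPLE = json.loads("""
{
  "confidence": 0.91,
  "fields": {
    "invoice_number": {"value": "INV-1042", "confidence": 0.99,
                       "source_text": "Invoice No. INV-1042"},
    "due_date": {"value": "2024-03-15", "confidence": 0.62,
                 "source_text": "Due 15/3"},
    "total": {"value": 54.98, "confidence": 0.97,
              "source_text": "TOTAL 54,98"}
  }
}
""")

print(needs_review(parse_fields(SAMPLE)))  # ['due_date']
```

Fields returned by `needs_review` could be sent to a human reviewer along with their `source_text`, while the rest are written straight to the system of record.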
What gets extracted
Invoice extraction is available at launch. Support for receipts, contracts, and additional document types is on the roadmap.
Vendor & buyer details
Vendor and buyer names, each returned with a confidence score and the exact text found in the document.
Invoice identification
Invoice number, invoice date, and due date — normalized to a consistent format regardless of locale or date style used in the original document.
Amounts & currency
Subtotal, tax, and total returned as plain numeric values, with the currency identified. Ambiguous currency symbols are resolved using the document's language context.
Line items
Every line item with description, quantity, unit price, and line total — returned as a structured list with an overall confidence score.
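Because amounts and line items arrive as numbers, a client can cross-check them arithmetically before trusting a result. The sketch below is one possible client-side sanity check, not something the API is documented to do. The dictionary keys (`quantity`, `unit_price`, `line_total`) and the one-cent tolerance are assumptions, and invoices with discounts or shipping charges would need extra terms.

```python
from decimal import Decimal


def check_totals(line_items, subtotal, tax, total, tolerance=Decimal("0.01")):
    """Return a list of arithmetic mismatches (empty if everything adds up)."""
    problems = []
    for i, item in enumerate(line_items, start=1):
        expected = item["quantity"] * item["unit_price"]
        if abs(expected - item["line_total"]) > tolerance:
            problems.append(f"line {i}: {expected} != {item['line_total']}")
    items_sum = sum((item["line_total"] for item in line_items), Decimal("0"))
    if abs(items_sum - subtotal) > tolerance:
        problems.append(f"subtotal: {items_sum} != {subtotal}")
    if abs(subtotal + tax - total) > tolerance:
        problems.append(f"total: {subtotal + tax} != {total}")
    return problems
```

`Decimal` is used instead of `float` so that currency arithmetic does not pick up binary rounding errors. Values parsed from JSON would need converting first, for example with `Decimal(str(value))`.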
Payment accounts
Bank account identifiers with automatic type identification (IBAN, SWIFT/BIC, sort codes, and regional formats), including the source text from the document.
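As a rough illustration of what type identification involves, the sketch below classifies a string using publicly documented formats: the IBAN mod-97 checksum from ISO 13616, the 8- or 11-character BIC layout, and the UK `NN-NN-NN` sort code pattern. It is a simplified heuristic written for this example and says nothing about how the service itself classifies accounts. The function names and returned labels are hypothetical.

```python
import re

IBAN_RE = re.compile(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}")
BIC_RE = re.compile(r"[A-Z]{4}[A-Z]{2}[A-Z0-9]{2}(?:[A-Z0-9]{3})?")
SORT_CODE_RE = re.compile(r"\d{2}-\d{2}-\d{2}")


def iban_checksum_ok(iban: str) -> bool:
    """Mod-97 check: move the first 4 chars to the end, map A-Z to 10-35."""
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def classify_account(raw: str) -> str:
    """Return "iban", "bic", "sort_code", or "unknown" for a raw string."""
    compact = re.sub(r"\s+", "", raw).upper()
    if IBAN_RE.fullmatch(compact) and iban_checksum_ok(compact):
        return "iban"
    if BIC_RE.fullmatch(compact):
        return "bic"
    if SORT_CODE_RE.fullmatch(compact):
        return "sort_code"
    return "unknown"


print(classify_account("GB82 WEST 1234 5698 7654 32"))  # iban
```

The BIC pattern is purely structural, so any 8-letter uppercase word would also match. A production classifier would need surrounding context, such as a nearby "SWIFT" or "BIC" label, to avoid false positives.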
Payment reference & notes
Payment reference codes with type classification, plus any free-text payment notes found on the invoice.
Frequently asked questions
- What file formats are supported?
- PDF (single and multi-page), JPEG, PNG, WEBP, and TIFF. Files are uploaded directly to the API with a maximum size of 20 MB. Page limits vary by plan.
- Do I need to configure the extraction schema?
- No. The invoice schema is predefined — every response returns the same consistent field set with values normalized to standard formats. Support for custom schemas is planned for a future release.
- What are per-field confidence scores?
- Every extracted field includes a confidence score from 0 to 1. High scores indicate reliable extraction; lower scores flag fields that may need a second look — useful for routing uncertain results to human review before writing to your system of record.
- How does 2-pass extraction work?
- On higher-tier plans, the API automatically runs a second extraction pass when critical fields are missing or have low confidence scores. The second pass targets only the failing fields rather than re-processing the entire document, improving accuracy without adding significant latency.
- What does MCP-ready mean?
- Model Context Protocol (MCP) is an open standard for connecting AI agents to external tools. doc-to-json ships a native MCP server, so any MCP-compatible AI agent can call invoice extraction directly, with no custom integration code required.
- Is extracted data stored?
- By default, uploaded files and extracted results are stored for 30 days so you can retrieve them later. Retention can be disabled on a per-job basis — useful for sensitive documents where you want no data kept after processing. Billing and audit records are always retained.
- Can I use webhooks for async processing?
- Yes. Provide a webhook URL and the API will deliver the result when extraction completes, fails, or returns a partial result. Webhook deliveries are signed so you can verify their authenticity. You can also poll the job status endpoint directly if you prefer.
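Verifying a signed webhook usually means recomputing a keyed hash over the raw request body and comparing it with the signature sent alongside it. The page does not say which scheme is used, so the sketch below assumes a common one: an HMAC-SHA256 hex digest of the raw body, keyed with a shared secret. The function name, the secret, and the way the signature is transported are all hypothetical; check the API reference for the real algorithm and header.

```python
import hashlib
import hmac


def signature_is_valid(raw_body: bytes, received_hex: str, secret: bytes) -> bool:
    """Recompute the (assumed) HMAC-SHA256 signature and compare safely."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of the signature matched.
    return hmac.compare_digest(expected.encode(), received_hex.encode())
```

The comparison should run against the exact bytes received, before any JSON parsing, because re-serializing the payload can change whitespace or key order and invalidate the signature.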