How accurate is your invoice data extraction?

Quixyl uses enterprise-grade document intelligence with confidence scoring on every extracted field. Each field is assigned a score from 0-100%, and fields below the confidence threshold are automatically flagged for human review - so errors are caught before they reach your accounting software. This is significantly more reliable than manual data entry.

Is my invoice data encrypted?

Yes. All extracted data is encrypted with AES-256-GCM (bank-grade encryption) before storage. Additionally, original files are never stored - they're processed and deleted immediately for maximum security. This dual-layer approach minimizes data breach risk.

Can I upload multiple invoices at once?

Yes! Our batch upload feature allows you to process 100+ documents simultaneously. The system includes smart recovery, so your progress is saved even if you close your browser. This is 70% faster than sequential uploads and perfect for month-end invoice processing.

What file formats do you support?

We support PDF, JPEG, PNG, TIFF, Excel (.xlsx), and Word (.docx). Our OCR works on scanned documents, photos from mobile devices, and digital files. Maximum file size: 20MB per file. Multi-page PDFs are fully supported with automatic page detection.

Is there a free trial?

Yes! We offer a 14-day free trial with no credit card required. You get full access to all features including batch upload, AI normalization, PII redaction, and API access. Cancel anytime during the trial with no charges. After the trial, choose from our flexible pricing plans starting at $29/month.


22 February 2026 Quixyl Engineering Team Deep Dive 4 min read

Multi-Page PDF Extraction: Handling Complex Documents at Scale

How Quixyl processes multi-page invoices, mixed-document batches, and high-volume PDF files - including page splitting, document boundary detection, and batch throughput tuning.


Single-page invoices are the easy case. The interesting problems arise when you need to process a 200-page batch PDF from a logistics vendor, a scanned multi-page contract with embedded invoice tables, or 10,000 documents uploaded over a weekend run. This article covers how Quixyl handles the hard cases correctly.

Why multi-page extraction is different

A single-page PDF has a clear contract: one document, one extraction result. Multi-page PDFs break this contract in several ways:

Mixed documents in one file - Vendors routinely email a single PDF containing an invoice, a delivery note, and a remittance advice. An extractor that treats the file as one document will confuse fields from different documents.

Long invoices vs batches - A 12-page invoice for a complex construction project is fundamentally different from a 12-page batch of 12 separate invoices. The extractor needs to know which it is dealing with.

Page headers and footers - In multi-page invoices, each page repeats vendor name, invoice number, and “Page N of M” text. Deduplicating these without losing true line items requires careful logic.

Rotated and mixed-orientation pages - Scanned batches often contain portrait and landscape pages in the same file, or pages scanned upside-down.

Document boundary detection

The first job when Quixyl receives a multi-page PDF is determining whether it contains one document or many. The boundary detection model uses several signals:

Page-level signals

Header recurrence: If the same company name and document number appear on pages 1, 2, and 3, they are likely the same document
“Page N of M” markers: Explicit pagination is the clearest signal
Continuation language: Text like “continued”, “subtotals carried forward”, or “page total” indicates continuation
Running totals: A subtotal column that increments across pages suggests one document

Break signals

New document number: Invoice number change is a strong break indicator
New vendor: Vendor name change almost always signals a new document
“Thank you for your business” / “Terms and Conditions”: Common footer text that closes a document
Visual layout reset: A full header block after a footer block

When boundary detection confidence is above 92%, the split is applied automatically. Below 92%, the document is routed for human confirmation of the split points.
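
Two of the break signals and the 92% rule can be illustrated with a deliberately simplified sketch. This is not Quixyl's boundary model: the `PageHeader` type, the `group_pages` and `route_split` names, and the decision to look only at vendor and document number changes are assumptions made for illustration. The 0.92 threshold comes from the text above; sending a score of exactly 0.92 to review is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class PageHeader:
    """Header fields read from one page (hypothetical structure)."""
    vendor: str
    document_number: str


def group_pages(headers):
    """Group 1-based page numbers into documents.

    A new group starts whenever the vendor or the document number
    changes, mirroring two of the break signals listed above.
    """
    groups = []
    previous = None
    for page_number, header in enumerate(headers, start=1):
        key = (header.vendor, header.document_number)
        if key != previous:
            groups.append([])  # break signal: start a new document
        groups[-1].append(page_number)
        previous = key
    return groups


AUTO_SPLIT_THRESHOLD = 0.92  # from the article; the boundary case is assumed


def route_split(confidence):
    """Return 'auto' above the threshold, otherwise 'human_review'."""
    return "auto" if confidence > AUTO_SPLIT_THRESHOLD else "human_review"
```

A real detector would weigh all of the listed signals together; this sketch only shows how recurrence and change in header fields translate into page groups.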

Processing a mixed-document batch

Quixyl’s batch endpoint handles the full workflow:

curl -X POST https://api.quixyl.com/v1/extract/batch \
  -H "Authorization: Bearer $QUIXYL_API_KEY" \
  -F "file=@monthly-vendor-batch.pdf" \
  -F "mode=auto_split" \
  -F "notify_webhook=true"

The mode parameter controls splitting behavior:

Mode	Behavior
`auto_split`	Detect and split documents automatically (default)
`single_document`	Treat entire file as one document regardless of page count
`one_per_page`	Force each page to be extracted independently
`manual_split`	Return page images for human split-point selection

The webhook fires once per detected document, not once per file. A 50-page batch containing 48 invoices fires 48 extraction.complete events.
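
To make the difference between the modes concrete, here is a small sketch of how each one maps a file's pages to documents, and therefore to webhook events. The `plan_page_groups` function is hypothetical and is not part of any Quixyl SDK; returning no groups for `manual_split` assumes that nothing is extracted until a human has chosen the split points.

```python
def plan_page_groups(mode, page_count, detected=None):
    """Return the page groups (1-based) that a mode would extract.

    `detected` stands in for the output of boundary detection and is
    only used by auto_split.
    """
    pages = list(range(1, page_count + 1))
    if mode == "single_document":
        return [pages]                      # one document, all pages
    if mode == "one_per_page":
        return [[page] for page in pages]   # one document per page
    if mode == "auto_split":
        if detected is None:
            raise ValueError("auto_split needs detected boundaries")
        return detected
    if mode == "manual_split":
        return []                           # assumed: wait for human input
    raise ValueError(f"unknown mode: {mode}")


def expected_webhook_events(groups):
    """One event per detected document, not per file."""
    return len(groups)
```

For a 3-page file, `one_per_page` yields three groups and three events, while `single_document` yields one group and one event.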

Multi-page invoice handling

When Quixyl determines that all pages belong to the same invoice, it assembles the full line item table across pages before running field extraction:

Page 1: Invoice header + line items 1-15
Page 2: Line items 16-31 + subtotal/tax/total footer

The extractor merges these correctly:

{
  "invoice_number": "BATCH-2026-0147",
  "total_pages": 2,
  "line_items": [
    { "line": 1, "description": "Item A", ... },
    ...
    { "line": 31, "description": "Item AE", ... }
  ],
  "subtotal": 84200.00,
  "tax": 8420.00,
  "total": 92620.00
}

The total_pages field in the response tells downstream systems that the result spans multiple source pages.
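
The merge step can be pictured with a minimal sketch. It assumes each page has already been reduced to a list of line-item dictionaries with a `line` key, as in the response above; the `merge_line_items` and `totals_are_consistent` helpers are hypothetical, and deduplicating by line number is an assumption, not a description of Quixyl's internal logic.

```python
from decimal import Decimal


def merge_line_items(pages):
    """Concatenate per-page line items in page order.

    Items whose line number was already seen are skipped, which guards
    against a row repeated at a page break.
    """
    merged = []
    seen = set()
    for items in pages:
        for item in items:
            if item["line"] in seen:
                continue
            seen.add(item["line"])
            merged.append(item)
    return merged


def totals_are_consistent(subtotal, tax, total):
    """Check subtotal + tax == total using exact decimal arithmetic."""
    return Decimal(subtotal) + Decimal(tax) == Decimal(total)
```

With the figures from the example response, `totals_are_consistent("84200.00", "8420.00", "92620.00")` holds, which is a cheap sanity check to run before posting a merged invoice downstream.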

Performance characteristics at scale

Throughput depends on document complexity and plan tier. Benchmarks on typical commercial invoices:

Document type	Pages	Median processing time	P95 processing time
Single-page invoice (digital PDF)	1	1.8s	3.4s
Multi-page invoice (digital PDF)	8	4.2s	7.1s
Scanned single-page invoice	1	2.9s	5.2s
Scanned multi-page batch	40	18s	31s
Complex construction invoice	20	11s	19s

For batch jobs over 500 pages, Quixyl queues and processes up to 50 documents in parallel (Enterprise tier). A 5,000-page batch typically completes in 12-18 minutes.
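
As a rough planning aid, the 5,000-page figure can be turned into an implied aggregate throughput. The helpers below are hypothetical, and extrapolating linearly to other batch sizes is an assumption: actual times depend on document complexity and plan tier, as noted above.

```python
def implied_pages_per_second(pages, minutes):
    """Aggregate throughput implied by a batch size and its duration."""
    return pages / (minutes * 60)


def estimate_minutes(pages, pages_per_second):
    """Linear estimate of batch duration at a given throughput."""
    return pages / pages_per_second / 60


# 5,000 pages in 18 minutes is about 4.6 pages/s;
# 5,000 pages in 12 minutes is about 6.9 pages/s.
slow = implied_pages_per_second(5000, 18)
fast = implied_pages_per_second(5000, 12)
```

Under that linear assumption, a 20,000-page run would take roughly 48 to 72 minutes.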

Optimizing throughput for high-volume runs

Use async endpoints

For files over 10 pages, always use /v1/extract/async and webhook delivery. Synchronous requests for large files can hit the 30-second timeout before the job completes: in the benchmarks above, a 40-page scanned batch has a P95 processing time of 31s.
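
A client can apply this rule before uploading. The sketch below is a minimal example: `/v1/extract/async` is the path named above, while `/v1/extract` as the synchronous path and the `choose_endpoint` helper are assumptions.

```python
SYNC_PAGE_LIMIT = 10               # from the article: over 10 pages, go async
ASYNC_PATH = "/v1/extract/async"   # named in the article
SYNC_PATH = "/v1/extract"          # assumed synchronous path


def choose_endpoint(page_count):
    """Pick the endpoint path for a file based on its page count."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    return ASYNC_PATH if page_count > SYNC_PAGE_LIMIT else SYNC_PATH
```

The page count itself can be read locally, for example with `len(PdfReader(path).pages)` from pypdf.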

Pre-split before uploading

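If you already know the document boundaries (e.g., the invoices come from an ERP that generates one PDF per invoice), submit them as separate files, and split any file that has already been merged before uploading. Submitting 48 single-page PDFs is faster and more reliable than submitting one 48-page batch with auto_split. The function below writes one file per page, so it fits the case where every invoice is a single page: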

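from pathlib import Path

from pypdf import PdfReader, PdfWriter

def split_pdf(source: str, output_dir: str) -> list[Path]:
    """Write each page of `source` to its own single-page PDF in `output_dir`."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the target directory if missing
    reader = PdfReader(source)
    written = []
    for i, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()
        writer.add_page(page)
        target = out / f"page_{i:03}.pdf"  # zero-padded so files sort in page order
        with target.open("wb") as f:
            writer.write(f)
        written.append(target)
    return written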

Set document_type explicitly

Specifying document_type=invoice skips the classification step and saves 200-400ms per document. At 10,000 documents, that is 33-67 minutes of cumulative processing time; the wall-clock saving is smaller when documents are processed in parallel.
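
The arithmetic behind that estimate is shown below, together with the effect of parallel processing. The `classification_savings_minutes` helper is hypothetical, and dividing evenly by the degree of parallelism is a simplifying assumption.

```python
def classification_savings_minutes(documents, saved_ms_per_doc, parallelism=1):
    """Minutes saved by skipping classification.

    With parallelism=1 this is cumulative processing time; dividing by
    the number of documents processed in parallel approximates the
    wall-clock saving.
    """
    return documents * saved_ms_per_doc / 1000 / 60 / parallelism


low = classification_savings_minutes(10_000, 200)    # about 33.3 minutes
high = classification_savings_minutes(10_000, 400)   # about 66.7 minutes
low_parallel = classification_savings_minutes(10_000, 200, parallelism=50)
high_parallel = classification_savings_minutes(10_000, 400, parallelism=50)
```

At the 50-document parallelism mentioned for the Enterprise tier, the same saving works out to roughly 0.7 to 1.3 minutes of wall-clock time.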

Use the bulk endpoint

Rather than making 1,000 separate API calls:

# Less efficient - 1,000 separate requests
for file in invoices/*.pdf; do
  curl -X POST .../extract -F "file=@$file" &
done

# More efficient - one bulk submission
curl -X POST .../extract/bulk \
  -F "files[]=@invoices/001.pdf" \
  -F "files[]=@invoices/002.pdf" \
  ... \
  -F "notify_webhook=true"

The bulk endpoint accepts up to 100 files per call. Chunk your uploads accordingly.
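
Chunking is straightforward to do client-side. The sketch below only groups file paths; the `chunked` helper is hypothetical, and each resulting group would be sent as one bulk submission.

```python
BULK_LIMIT = 100  # from the article: up to 100 files per bulk call


def chunked(files, size=BULK_LIMIT):
    """Yield consecutive groups of at most `size` files."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(files), size):
        yield list(files[start:start + size])


files = [f"invoices/{n:03}.pdf" for n in range(1, 251)]
chunks = list(chunked(files))  # 250 files become groups of 100, 100, and 50
```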

Handling extraction failures in a batch

In a large batch run, some documents will fail - blurry scans, corrupted PDFs, or documents that simply cannot be parsed with high confidence. Build your pipeline to expect this:

// Webhook handler (the callback must be async because it uses await)
app.post('/hooks/quixyl', async (req, res) => {
	const event = req.body;

	switch (event.event) {
		case 'extraction.complete':
			await postToERP(event.data);
			break;
		case 'extraction.failed':
			await createReviewTask(event.extraction_id, event.error_code);
			break;
		case 'review.required':
			await routeToHumanReview(event.extraction_id, event.flagged_fields);
			break;
	}

	res.sendStatus(200);
});

A well-designed pipeline treats review.required as a normal event path, not an error state. Aim for a review rate under 5%; if yours is higher, adjust your confidence thresholds or add vendor-specific templates.
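
Tracking the review rate takes only a counter over the event names received. The sketch below assumes that every document produces exactly one of the three events handled above; if review.required can fire alongside extraction.complete for the same document, the denominator needs adjusting. The helper names are hypothetical.

```python
from collections import Counter

TARGET_REVIEW_RATE = 0.05  # from the article: aim for under 5%


def review_rate(event_names):
    """Share of documents that ended in review.required."""
    counts = Counter(event_names)
    total = (
        counts["extraction.complete"]
        + counts["extraction.failed"]
        + counts["review.required"]
    )
    return counts["review.required"] / total if total else 0.0


def misses_target(rate):
    """True when the rate is not under the 5% target."""
    return rate >= TARGET_REVIEW_RATE
```

A rate computed per vendor rather than per run makes it easier to see where a vendor-specific template would help.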

Audit trail for split documents

When Quixyl splits a batch PDF, the extraction record for each child document keeps a reference to the source file:

{
  "id": "ext_01HXYZ442B3R",
  "source_file": "s3://your-bucket/batch-2026-02.pdf",
  "source_pages": [3, 4],
  "document_type": "invoice",
  ...
}

This means you can always trace any extracted record back to the exact pages in the original file - essential for audit purposes.
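
A lookup in the other direction, from a page in the original file to the extraction records that cover it, is equally simple once the records are stored. The sketch below assumes records shaped like the example above; the `records_for_page` helper and the second record ID are hypothetical.

```python
def records_for_page(records, source_file, page):
    """Return the IDs of extraction records that include a given source page."""
    return [
        record["id"]
        for record in records
        if record["source_file"] == source_file and page in record["source_pages"]
    ]


records = [
    {
        "id": "ext_01HXYZ442B3R",
        "source_file": "s3://your-bucket/batch-2026-02.pdf",
        "source_pages": [3, 4],
    },
    {
        "id": "ext_example_2",  # hypothetical second child document
        "source_file": "s3://your-bucket/batch-2026-02.pdf",
        "source_pages": [5],
    },
]
```

An empty result for a page is itself useful: it flags pages in the source file that no child document claimed.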

Summary

Scenario	Recommended approach
Single invoices of 10 pages or fewer	Synchronous endpoint
Multi-page invoices over 10 pages	Async endpoint + webhook
Mixed-document batch PDFs	Batch endpoint + `auto_split` + webhook
High-volume overnight run	Bulk + async + webhook, pre-split if boundaries are known
Unknown document types	`document_type=auto` (default)