Deep Dive

Multi-Page PDF Extraction: Handling Complex Documents at Scale

How Quixyl processes multi-page invoices, mixed-document batches, and high-volume PDF files — including page splitting, document boundary detection, and batch throughput tuning.

February 22, 2026 · 11 min read · Quixyl Engineering Team · Tags: pdf, multi-page, batch processing, scale, technical

Single-page invoices are the easy case. The interesting problems arise when you need to process a 200-page batch PDF from a logistics vendor, a scanned multi-page contract with embedded invoice tables, or 10,000 documents uploaded over a weekend run. This article covers how Quixyl handles the hard cases correctly.


Why multi-page extraction is different

A single-page PDF has a clear contract: one document, one extraction result. Multi-page PDFs break this contract in several ways:

Mixed documents in one file — Vendors routinely email a single PDF containing an invoice, a delivery note, and a remittance advice. An extractor that treats the file as one document will confuse fields from different documents.

Long invoices vs batches — A 12-page invoice for a complex construction project is fundamentally different from a 12-page batch of 12 separate invoices. The extractor needs to know which it is dealing with.

Page headers and footers — In multi-page invoices, each page repeats vendor name, invoice number, and “Page N of M” text. Deduplicating these without losing true line items requires careful logic.

Rotated and mixed-orientation pages — Scanned batches often contain portrait and landscape pages in the same file, or pages scanned upside-down.
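
For the repeated page headers and footers described above, one common heuristic is to look only at the top and bottom few lines of each page and drop lines that recur on most pages, keeping the first occurrence. The sketch below illustrates that idea and is not Quixyl's actual logic; the function name, the `edge` window, and the 0.8 share are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def strip_repeated_edges(pages: list[list[str]], edge: int = 3,
                         min_share: float = 0.8) -> list[list[str]]:
    """Drop header/footer lines that repeat across pages, keeping the first copy."""

    def key(line: str) -> str:
        # Mask digits so "Page 1 of 3" and "Page 2 of 3" compare equal.
        return re.sub(r"\d+", "#", line.strip().lower())

    def edge_indexes(page: list[str]) -> set[int]:
        # Only the first and last `edge` lines are header/footer candidates.
        n = len(page)
        return set(range(min(edge, n))) | set(range(max(n - edge, 0), n))

    # Count each normalized edge line once per page.
    counts: Counter = Counter()
    for page in pages:
        counts.update({key(page[i]) for i in edge_indexes(page)})
    cutoff = max(2, math.ceil(min_share * len(pages)))
    repeated = {k for k, c in counts.items() if c >= cutoff}

    cleaned, kept_once = [], set()
    for page in pages:
        edges = edge_indexes(page)
        out = []
        for i, line in enumerate(page):
            k = key(line)
            if i in edges and k in repeated:
                if k in kept_once:
                    continue  # a later repeat: drop it
                kept_once.add(k)  # first occurrence: keep it
            out.append(line)
        cleaned.append(out)
    return cleaned
```

Restricting the check to page edges is what protects line items in the body of the page: two items that differ only in their numbers would otherwise look identical after digit masking.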


Document boundary detection

The first job when Quixyl receives a multi-page PDF is determining whether it contains one document or many. The boundary detection model uses several signals:

Page-level signals

  • Header recurrence: If the same company name and document number appear on pages 1, 2, and 3, they are likely the same document
  • “Page N of M” markers: Explicit pagination is the clearest signal
  • Continuation language: Text like “continued”, “subtotals carried forward”, or “page total” indicates continuation
  • Running totals: A subtotal column that increments across pages suggests one document

Break signals

  • New document number: Invoice number change is a strong break indicator
  • New vendor: Vendor name change almost always signals a new document
  • “Thank you for your business” / “Terms and Conditions”: Common footer text that closes a document
  • Visual layout reset: A full header block after a footer block

When boundary detection confidence is above 92%, the split is applied automatically. Otherwise, the document is routed for human confirmation of the split points.
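
To make the signals and the routing rule concrete, here is a hand-written, rule-based stand-in for the boundary model. It is a sketch only: the real detector is a model whose internals are not described here, and `PageSignals`, `starts_new_document`, and `route_split` are hypothetical names. Sending a confidence of exactly 92% to human review is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional

AUTO_SPLIT_THRESHOLD = 0.92  # splits above this confidence are applied automatically


@dataclass
class PageSignals:
    vendor: Optional[str] = None
    doc_number: Optional[str] = None
    page_n: Optional[int] = None      # the N in "Page N of M", if present
    continuation_text: bool = False   # e.g. "continued", "carried forward"


def starts_new_document(prev: PageSignals, curr: PageSignals) -> bool:
    """Return True if `curr` looks like the first page of a new document."""
    # Break signals: vendor or document number changed.
    if prev.vendor and curr.vendor and prev.vendor != curr.vendor:
        return True
    if prev.doc_number and curr.doc_number and prev.doc_number != curr.doc_number:
        return True
    # Continuation signals: explicit wording or consecutive pagination.
    if curr.continuation_text:
        return False
    if prev.page_n is not None and curr.page_n == prev.page_n + 1:
        return False
    # Pagination restarting at 1 suggests a new document.
    return curr.page_n == 1


def route_split(confidence: float) -> str:
    """Apply the threshold rule: auto-split only above 92% confidence."""
    return "auto_split" if confidence > AUTO_SPLIT_THRESHOLD else "human_review"
```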


Processing a mixed-document batch

Quixyl’s batch endpoint handles the full workflow:

curl -X POST https://api.quixyl.com/v1/extract/batch \
  -H "Authorization: Bearer $QUIXYL_API_KEY" \
  -F "file=@monthly-vendor-batch.pdf" \
  -F "mode=auto_split" \
  -F "notify_webhook=true"

The mode parameter controls splitting behavior:

Mode | Behavior
auto_split | Detect and split documents automatically (default)
single_document | Treat entire file as one document regardless of page count
one_per_page | Force each page to be extracted independently
manual_split | Return page images for human split-point selection

The webhook fires once per detected document, not once per file. A 50-page batch containing 48 invoices fires 48 extraction.complete events.
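
Because one uploaded file can produce many events, the receiving side usually needs to group results by source file. The sketch below shows one way to do that in Python. It assumes the `data` object of an extraction.complete event carries the `id` and `source_file` fields of the extraction record; that payload shape is an assumption, and `BatchTracker` is a hypothetical helper.

```python
from collections import defaultdict


class BatchTracker:
    """Group completed extractions by the file they were split from."""

    def __init__(self) -> None:
        self._docs: dict[str, list[str]] = defaultdict(list)

    def record(self, event: dict) -> None:
        # Ignore anything that is not a completed extraction.
        if event.get("event") != "extraction.complete":
            return
        data = event["data"]  # assumed to contain "id" and "source_file"
        self._docs[data["source_file"]].append(data["id"])

    def documents_for(self, source_file: str) -> list[str]:
        # .get avoids creating an empty entry for unknown files.
        return list(self._docs.get(source_file, []))
```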


Multi-page invoice handling

When Quixyl determines that all pages belong to the same invoice, it assembles the full line item table across pages before running field extraction:

Page 1: Invoice header + line items 1–15
Page 2: Line items 16–31 + subtotal/tax/total footer

The extractor merges these correctly:

{
  "invoice_number": "BATCH-2026-0147",
  "total_pages": 2,
  "line_items": [
    { "line": 1, "description": "Item A", ... },
    ...
    { "line": 31, "description": "Item AE", ... }
  ],
  "subtotal": 84200.00,
  "tax": 8420.00,
  "total": 92620.00
}

The total_pages field in the response tells downstream systems that the result spans multiple source pages.
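
The merge step can be pictured as concatenating per-page line items and taking header and footer fields from whichever page carries them. The following is a minimal sketch of that idea plus a sanity check on the totals, not Quixyl's implementation; the per-page dictionaries and both function names are illustrative assumptions. Amounts are passed as strings so `Decimal` can compare them exactly.

```python
from decimal import Decimal

SCALAR_FIELDS = ("invoice_number", "subtotal", "tax", "total")


def merge_pages(pages: list[dict]) -> dict:
    """Combine per-page results into one invoice-level result."""
    merged: dict = {"total_pages": len(pages), "line_items": []}
    for page in pages:
        merged["line_items"].extend(page.get("line_items", []))
        for field in SCALAR_FIELDS:
            # Keep the first value seen for each header/footer field.
            if field in page and field not in merged:
                merged[field] = page[field]
    return merged


def totals_consistent(invoice: dict) -> bool:
    """Check that subtotal + tax equals total, using exact decimal arithmetic."""
    return (Decimal(invoice["subtotal"]) + Decimal(invoice["tax"])
            == Decimal(invoice["total"]))
```

With the figures from the example response, 84200.00 + 8420.00 = 92620.00, so the check passes. A mismatch after merging is a useful hint that a page was dropped or a footer was read from the wrong page.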


Performance characteristics at scale

Throughput depends on document complexity and plan tier. Benchmarks on typical commercial invoices:

Document type | Pages | Median processing time | P95 processing time
Single-page invoice (digital PDF) | 1 | 1.8s | 3.4s
Multi-page invoice (digital PDF) | 8 | 4.2s | 7.1s
Scanned single-page invoice | 1 | 2.9s | 5.2s
Scanned multi-page batch | 40 | 18s | 31s
Complex construction invoice | 20 | 11s | 19s

For batch jobs over 500 pages, Quixyl queues and processes up to 50 documents in parallel (Enterprise tier). A 5,000-page batch typically completes in 12–18 minutes.
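
For capacity planning, the 12–18 minute figure can be turned into a rough page rate. The snippet below is a back-of-the-envelope sketch; the 20,000-page run is a hypothetical example, and the extrapolation assumes throughput scales linearly, which may not hold in practice.

```python
def pages_per_minute(pages: int, minutes: float) -> float:
    return pages / minutes


def estimate_minutes(pages: int, rate: float) -> float:
    return pages / rate


# Rates implied by "5,000 pages in 12-18 minutes".
fast = pages_per_minute(5_000, 12)  # about 417 pages/min
slow = pages_per_minute(5_000, 18)  # about 278 pages/min

# Hypothetical 20,000-page run, assuming linear scaling.
best = estimate_minutes(20_000, fast)
worst = estimate_minutes(20_000, slow)
print(f"{best:.0f}-{worst:.0f} minutes")  # prints "48-72 minutes"
```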


Optimizing throughput for high-volume runs

Use async endpoints

For files over 10 pages, use /v1/extract/async with webhook delivery. Synchronous requests for large files can hit the 30-second timeout before the job completes; the scanned 40-page batch in the benchmarks above has a P95 of 31s.

Pre-split before uploading

If you already know the document boundaries (e.g., your ERP generates one PDF per invoice, or a combined export always puts one invoice on each page), submit the documents as separate files. Submitting 48 single-page PDFs is faster and more reliable than submitting one 48-page batch with auto_split. For a combined file with one document per page, the split can be done locally:

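from pathlib import Path

from pypdf import PdfReader, PdfWriter

def split_pdf(source: str, output_dir: str) -> list[Path]:
    """Write each page of `source` to its own PDF (assumes one document per page)."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the target directory if missing
    reader = PdfReader(source)
    written = []
    for i, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()
        writer.add_page(page)
        out_path = out_dir / f"page_{i:03}.pdf"
        with open(out_path, "wb") as f:
            writer.write(f)
        written.append(out_path)
    return written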

Set document_type explicitly

Specifying document_type=invoice skips the classification step and saves 200–400ms per document. At 10,000 documents, that is 33–67 minutes of cumulative processing time; the wall-clock saving is smaller when documents are processed in parallel.
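
As a quick check of that estimate: the 33–67 minute figure is the total across all documents. The sketch below reproduces it and shows how it shrinks with parallel workers. Dividing evenly across 50 workers (the Enterprise-tier parallelism mentioned earlier) is an idealization, and `saved_minutes` is a hypothetical helper.

```python
def saved_minutes(documents: int, saved_ms: float, parallelism: int = 1) -> float:
    """Minutes saved by skipping classification, spread evenly over workers."""
    return documents * saved_ms / 1000 / 60 / parallelism


# Cumulative saving across 10,000 documents: roughly 33 to 67 minutes.
serial = (saved_minutes(10_000, 200), saved_minutes(10_000, 400))

# Idealized wall-clock saving with 50 parallel workers: roughly 0.7 to 1.3 minutes.
parallel = (saved_minutes(10_000, 200, 50), saved_minutes(10_000, 400, 50))
```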

Use the bulk endpoint

Rather than making 1,000 separate API calls:

# Less efficient — 1,000 separate requests
for file in invoices/*.pdf; do
  curl -X POST .../extract -F "file=@$file" &
done

# More efficient — one bulk submission
curl -X POST .../extract/bulk \
  -F "files[]=@invoices/001.pdf" \
  -F "files[]=@invoices/002.pdf" \
  ... \
  -F "notify_webhook=true"

The bulk endpoint accepts up to 100 files per call. Chunk your uploads accordingly.
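
Chunking to the 100-file limit is a few lines in Python. This is a minimal sketch; `chunked` and `BULK_LIMIT` are illustrative names, and the upload call itself is left out because it depends on your HTTP client.

```python
from pathlib import Path
from typing import Iterator

BULK_LIMIT = 100  # maximum files per bulk call, per the limit stated above


def chunked(paths: list[Path], size: int = BULK_LIMIT) -> Iterator[list[Path]]:
    """Yield consecutive slices of at most `size` paths."""
    for start in range(0, len(paths), size):
        yield paths[start:start + size]


# Example: 250 files become three bulk submissions of 100, 100, and 50 files.
files = [Path(f"invoices/{n:03}.pdf") for n in range(1, 251)]
batches = list(chunked(files))
```

Sorting the paths before chunking keeps the batches reproducible, which makes a failed submission easier to retry.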


Handling extraction failures in a batch

In a large batch run, some documents will fail — blurry scans, corrupted PDFs, or documents that simply cannot be parsed with high confidence. Build your pipeline to expect this:

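// Webhook handler
const express = require('express');

const app = express();
app.use(express.json()); // parse JSON bodies so req.body is populated

// postToERP, createReviewTask, and routeToHumanReview are your own functions.
// The callback must be async because it awaits them.
app.post('/hooks/quixyl', async (req, res) => {
  const event = req.body;

  try {
    switch (event.event) {
      case 'extraction.complete':
        await postToERP(event.data);
        break;
      case 'extraction.failed':
        await createReviewTask(event.extraction_id, event.error_code);
        break;
      case 'review.required':
        await routeToHumanReview(event.extraction_id, event.flagged_fields);
        break;
    }
    res.sendStatus(200);
  } catch (err) {
    // Without this, a rejected promise would leave the request hanging in Express 4.
    console.error('Webhook handling failed', err);
    res.sendStatus(500);
  }
});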

A well-designed pipeline treats review.required as a normal event path, not an error state. Aim for a review rate under 5%; if yours is higher, adjust your confidence thresholds or add vendor-specific templates.
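
Tracking the review rate only requires counting event names. The sketch below assumes each document produces exactly one of the three events handled above; if a reviewed document also emits extraction.complete later, the denominator would need adjusting. The function names are hypothetical.

```python
from collections import Counter

TARGET_REVIEW_RATE = 0.05  # the "under 5%" target
TERMINAL_EVENTS = ("extraction.complete", "extraction.failed", "review.required")


def review_rate(event_names: list[str]) -> float:
    """Share of documents that ended in review.required."""
    counts = Counter(event_names)
    total = sum(counts[name] for name in TERMINAL_EVENTS)
    return counts["review.required"] / total if total else 0.0


def needs_tuning(event_names: list[str]) -> bool:
    """True when the review rate is at or above the 5% target."""
    return review_rate(event_names) >= TARGET_REVIEW_RATE
```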


Audit trail for split documents

When Quixyl splits a batch PDF, the extraction record for each child document keeps a reference to the source file:

{
  "id": "ext_01HXYZ442B3R",
  "source_file": "s3://your-bucket/batch-2026-02.pdf",
  "source_pages": [3, 4],
  "document_type": "invoice",
  ...
}

This means you can always trace any extracted record back to the exact pages in the original file — essential for audit purposes.
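
The same fields support a simple completeness check: every page of the source file should belong to exactly one child record. The sketch below assumes `source_pages` is 1-based and that you know the page count of the original file; both are assumptions, and `page_coverage` is a hypothetical helper.

```python
from collections import Counter


def page_coverage(records: list[dict], total_pages: int) -> tuple[list[int], list[int]]:
    """Return (missing, duplicated) page numbers for one source file."""
    seen = Counter(page for record in records for page in record["source_pages"])
    missing = [p for p in range(1, total_pages + 1) if p not in seen]
    duplicated = sorted(p for p, n in seen.items() if n > 1)
    return missing, duplicated
```

An empty result for both lists means the split accounts for every page once; anything else is worth flagging before the batch is posted downstream.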


Summary

Scenario | Recommended approach
Single invoices under 10 pages | Synchronous endpoint
Multi-page invoices | Async endpoint + webhook
Mixed-document batch PDFs | Bulk + auto_split + webhook
High-volume overnight run | Bulk + async + webhook, pre-split if boundaries are known
Unknown document types | document_type=auto (default)
