Multi-Page PDF Extraction: Handling Complex Documents at Scale
How Quixyl processes multi-page invoices, mixed-document batches, and high-volume PDF files — including page splitting, document boundary detection, and batch throughput tuning.
Single-page invoices are the easy case. The interesting problems arise when you need to process a 200-page batch PDF from a logistics vendor, a scanned multi-page contract with embedded invoice tables, or 10,000 documents uploaded over a weekend run. This article covers how Quixyl handles the hard cases correctly.
Why multi-page extraction is different
A single-page PDF has a clear contract: one document, one extraction result. Multi-page PDFs break this contract in several ways:
Mixed documents in one file — Vendors routinely email a single PDF containing an invoice, a delivery note, and a remittance advice. An extractor that treats the file as one document will confuse fields from different documents.
Long invoices vs batches — A 12-page invoice for a complex construction project is fundamentally different from a 12-page batch of 12 separate invoices. The extractor needs to know which it is dealing with.
Page headers and footers — In multi-page invoices, each page repeats vendor name, invoice number, and “Page N of M” text. Deduplicating these without losing true line items requires careful logic.
Rotated and mixed-orientation pages — Scanned batches often contain portrait and landscape pages in the same file, or pages scanned upside-down.
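To make the header and footer problem concrete, here is a minimal sketch of one way to deduplicate repeated lines across pages. It is not Quixyl's implementation: the function name `strip_repeated_lines`, the 80% recurrence cutoff, and the input shape (a list of pages, each a list of text lines) are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Matches standalone pagination text such as "Page 2 of 5".
PAGE_MARKER = re.compile(r"^\s*page\s+\d+\s+of\s+\d+\s*$", re.IGNORECASE)


def strip_repeated_lines(pages, min_share=0.8):
    """Keep the first copy of lines that recur on most pages; drop the repeats.

    `pages` is a list of pages, each a list of text lines (assumed shape).
    `min_share` is a hypothetical cutoff, not a documented Quixyl setting.
    """
    # Count how many pages each normalized, non-empty line appears on.
    page_counts = Counter()
    for lines in pages:
        page_counts.update({line.strip().lower() for line in lines if line.strip()})

    repeated = set()
    if len(pages) > 1:
        repeated = {
            text for text, n in page_counts.items() if n / len(pages) >= min_share
        }

    cleaned, seen = [], set()
    for lines in pages:
        kept = []
        for line in lines:
            key = line.strip().lower()
            if PAGE_MARKER.match(line):
                continue  # pagination text is never a line item
            if key in repeated:
                if key in seen:
                    continue  # repeat of a header/footer line already kept once
                seen.add(key)
            kept.append(line)
        cleaned.append(kept)
    return cleaned
```

A text-only rule like this would also drop a genuine line item that appears verbatim on most pages, which is why a real extractor would likely need positional or layout cues as well.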
Document boundary detection
The first job when Quixyl receives a multi-page PDF is determining whether it contains one document or many. The boundary detection model uses several signals:
Page-level signals
- Header recurrence: If the same company name and document number appear on pages 1, 2, and 3, they are likely the same document
- “Page N of M” markers: Explicit pagination is the clearest signal
- Continuation language: Text like “continued”, “subtotals carried forward”, or “page total” indicates continuation
- Running totals: A subtotal column that increments across pages suggests one document
Break signals
- New document number: Invoice number change is a strong break indicator
- New vendor: Vendor name change almost always signals a new document
- “Thank you for your business” / “Terms and Conditions”: Common footer text that closes a document
- Visual layout reset: A full header block after a footer block
When boundary detection confidence is above 92%, the split is applied automatically. Below 92%, the document is routed for human confirmation of the split points.
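The signals and the 92% threshold above can be illustrated with a toy rule-based splitter. This is a rough sketch and not the Quixyl boundary model: `PageInfo`, `is_break`, `split_points`, and `route` are hypothetical names, and treating a confidence of exactly 92% as "needs confirmation" is an assumption, since the text only covers values above and below it.

```python
from dataclasses import dataclass
from typing import List, Optional

AUTO_SPLIT_THRESHOLD = 0.92  # threshold quoted in the article


@dataclass
class PageInfo:
    """Per-page fields a detector might have extracted (assumed shape)."""
    vendor: Optional[str]
    document_number: Optional[str]
    page_n: Optional[int] = None  # N from "Page N of M", if present


def is_break(prev: PageInfo, curr: PageInfo) -> bool:
    """Toy rule: does a new document start at `curr`?"""
    if prev.vendor and curr.vendor and prev.vendor != curr.vendor:
        return True  # new vendor
    if (prev.document_number and curr.document_number
            and prev.document_number != curr.document_number):
        return True  # new document number
    if curr.page_n == 1 and prev.page_n is not None:
        return True  # pagination restarted
    return False


def split_points(pages: List[PageInfo]) -> List[int]:
    """Return 0-based indices of pages that start a new document."""
    starts = [0] if pages else []
    for i in range(1, len(pages)):
        if is_break(pages[i - 1], pages[i]):
            starts.append(i)
    return starts


def route(confidence: float) -> str:
    """Apply the split automatically only when confidence is above the threshold."""
    if confidence > AUTO_SPLIT_THRESHOLD:
        return "apply_split"
    return "human_confirmation"  # assumption: exactly 0.92 also goes to review
```

For example, four pages carrying (Acme, INV-1), (Acme, INV-1), (Acme, INV-2), and (Globex, DN-7) would yield document starts at indices 0, 2, and 3.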
Processing a mixed-document batch
Quixyl’s batch endpoint handles the full workflow:
curl -X POST https://api.quixyl.com/v1/extract/batch \
  -H "Authorization: Bearer $QUIXYL_API_KEY" \
  -F "file=@monthly-vendor-batch.pdf" \
  -F "mode=auto_split" \
  -F "notify_webhook=true"
The mode parameter controls splitting behavior:
| Mode | Behavior |
|---|---|
| auto_split | Detect and split documents automatically (default) |
| single_document | Treat entire file as one document regardless of page count |
| one_per_page | Force each page to be extracted independently |
| manual_split | Return page images for human split-point selection |
The webhook fires once per detected document, not once per file. A 50-page batch containing 48 invoices fires 48 extraction.complete events.
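If you build the form fields in code rather than on the command line, it can help to validate the mode before uploading. The helper below is a small sketch: `build_batch_fields` is a hypothetical name, and sending the string "false" to disable the webhook is an assumption, since only `notify_webhook=true` appears above.

```python
from typing import Dict

# The four modes listed in the table above.
VALID_MODES = ("auto_split", "single_document", "one_per_page", "manual_split")


def build_batch_fields(mode: str = "auto_split", notify_webhook: bool = True) -> Dict[str, str]:
    """Return the non-file form fields for a batch upload, rejecting unknown modes."""
    if mode not in VALID_MODES:
        raise ValueError(
            f"unknown mode {mode!r}; expected one of {', '.join(VALID_MODES)}"
        )
    return {
        "mode": mode,
        # Assumption: the API accepts "false" as the opposite of "true".
        "notify_webhook": "true" if notify_webhook else "false",
    }
```

The returned dictionary can be passed as the form data of whichever HTTP client you use, alongside the file part.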
Multi-page invoice handling
When Quixyl determines that all pages belong to the same invoice, it assembles the full line item table across pages before running field extraction:
Page 1: Invoice header + line items 1–15
Page 2: Line items 16–31 + subtotal/tax/total footer
The extractor merges these correctly:
{
  "invoice_number": "BATCH-2026-0147",
  "total_pages": 2,
  "line_items": [
    { "line": 1, "description": "Item A", ... },
    ...
    { "line": 31, "description": "Item AE", ... }
  ],
  "subtotal": 84200.00,
  "tax": 8420.00,
  "total": 92620.00
}
The total_pages field in the response tells downstream systems that the result spans multiple source pages.
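On the consuming side, a merged result like this is easy to sanity-check. The sketch below shows the merge idea on per-page dictionaries and two checks a downstream system might run. The per-page input shape and the function names are assumptions for illustration, not the Quixyl response format.

```python
from decimal import Decimal
from typing import Dict, List


def merge_pages(pages: List[Dict]) -> Dict:
    """Concatenate line items across pages; later pages overwrite scalar fields."""
    merged: Dict = {"line_items": [], "total_pages": len(pages)}
    for page in pages:
        merged["line_items"].extend(page.get("line_items", []))
        for key, value in page.items():
            if key != "line_items":
                merged[key] = value  # e.g. totals found on the last page
    return merged


def totals_consistent(invoice: Dict) -> bool:
    """Check subtotal + tax == total using exact decimal arithmetic."""
    subtotal = Decimal(str(invoice["subtotal"]))
    tax = Decimal(str(invoice["tax"]))
    total = Decimal(str(invoice["total"]))
    return subtotal + tax == total


def missing_lines(invoice: Dict) -> List[int]:
    """Return line numbers absent from the 1..max sequence."""
    numbers = {item["line"] for item in invoice["line_items"]}
    if not numbers:
        return []
    return [n for n in range(1, max(numbers) + 1) if n not in numbers]
```

With the figures from the example response, 84200.00 + 8420.00 equals 92620.00, so `totals_consistent` passes. A gap in the line sequence would suggest that a page was dropped or misassigned during the split.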
Performance characteristics at scale
Throughput depends on document complexity and plan tier. Benchmarks on typical commercial invoices:
| Document type | Pages | Median processing time | P95 processing time |
|---|---|---|---|
| Single-page invoice (digital PDF) | 1 | 1.8s | 3.4s |
| Multi-page invoice (digital PDF) | 8 | 4.2s | 7.1s |
| Scanned single-page invoice | 1 | 2.9s | 5.2s |
| Scanned multi-page batch | 40 | 18s | 31s |
| Complex construction invoice | 20 | 11s | 19s |
For batch jobs over 500 pages, Quixyl queues and processes up to 50 documents in parallel (Enterprise tier). A 5,000-page batch typically completes in 12–18 minutes.
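For planning purposes, the 5,000-page figure can be turned into a rough throughput range. The sketch below assumes processing time scales linearly with page count, which the article does not claim. Treat the output as a back-of-envelope estimate only.

```python
from typing import Tuple

REFERENCE_PAGES = 5_000          # batch size quoted above
REFERENCE_MINUTES = (12, 18)     # typical completion range quoted above


def pages_per_minute() -> Tuple[float, float]:
    """Return (slow, fast) throughput implied by the reference batch."""
    fast = REFERENCE_PAGES / REFERENCE_MINUTES[0]
    slow = REFERENCE_PAGES / REFERENCE_MINUTES[1]
    return slow, fast


def estimate_minutes(pages: int) -> Tuple[float, float]:
    """Return (best, worst) minutes for `pages`, assuming linear scaling."""
    best = pages * REFERENCE_MINUTES[0] / REFERENCE_PAGES
    worst = pages * REFERENCE_MINUTES[1] / REFERENCE_PAGES
    return best, worst
```

The reference batch implies roughly 278 to 417 pages per minute. Under the linear assumption, a 20,000-page run would take about 48 to 72 minutes.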
Optimizing throughput for high-volume runs
Use async endpoints
For files over 10 pages, always use /v1/extract/async and webhook delivery. Synchronous requests time out after 30 seconds, and large files can take longer than that to process (the 40-page scanned batch in the benchmarks above has a P95 of 31s).
Pre-split before uploading
If you already know the document boundaries (e.g., you’re processing invoices from an ERP that generates one PDF per invoice), split the files before uploading. Submitting 48 single-page PDFs is faster and more reliable than submitting one 48-page batch with auto_split.
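from pathlib import Path

from pypdf import PdfReader, PdfWriter


def split_pdf(source: str, output_dir: str) -> None:
    """Write each page of `source` to its own single-page PDF."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    reader = PdfReader(source)
    for i, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()
        writer.add_page(page)
        # Zero-padded names keep the files in page order when sorted.
        with open(out / f"page_{i:03}.pdf", "wb") as f:
            writer.write(f)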
Set document_type explicitly
Specifying document_type=invoice skips the classification step and saves 200–400ms per document. At 10,000 documents, that adds up to 33–67 minutes of cumulative processing time (the wall-clock saving is smaller when documents are processed in parallel).
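The arithmetic behind that estimate is simple enough to check in a few lines. The sketch below also shows how the figure shrinks under parallel processing. The `parallelism` parameter and the assumption of perfectly even parallel execution are illustrative, not measured behavior.

```python
from typing import Tuple


def classification_savings_minutes(
    documents: int,
    saved_ms_low: float = 200,
    saved_ms_high: float = 400,
    parallelism: int = 1,
) -> Tuple[float, float]:
    """Return (low, high) minutes saved by skipping classification.

    `parallelism` divides the total evenly, which assumes ideal scheduling.
    """
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    low = documents * saved_ms_low / 1000 / 60 / parallelism
    high = documents * saved_ms_high / 1000 / 60 / parallelism
    return low, high
```

For 10,000 documents processed one at a time, this gives about 33 to 67 minutes. With 50 documents in flight at once, the same saving is closer to 40 to 80 seconds of elapsed time.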
Use the bulk endpoint
Rather than making 1,000 separate API calls:
# Less efficient: 1,000 separate requests
for file in invoices/*.pdf; do
  curl -X POST .../extract -F "file=@$file" &
done
# More efficient: one bulk submission
curl -X POST .../extract/bulk \
  -F "files[]=@invoices/001.pdf" \
  -F "files[]=@invoices/002.pdf" \
  ... \
  -F "notify_webhook=true"
The bulk endpoint accepts up to 100 files per call. Chunk your uploads accordingly.
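Chunking a file list to respect the 100-file limit might look like the sketch below. `chunk_files` is a hypothetical helper, and the upload call itself is left out because the exact request shape depends on your HTTP client.

```python
from pathlib import Path
from typing import Iterator, List, Sequence

BULK_LIMIT = 100  # maximum files per bulk call, as stated above


def chunk_files(files: Sequence[Path], size: int = BULK_LIMIT) -> Iterator[List[Path]]:
    """Yield consecutive groups of at most `size` files."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(files), size):
        yield list(files[start:start + size])
```

Usage would be along the lines of `for group in chunk_files(sorted(Path("invoices").glob("*.pdf"))): ...`, submitting one bulk request per group. A run of 1,000 files becomes 10 bulk calls instead of 1,000 individual ones.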
Handling extraction failures in a batch
In a large batch run, some documents will fail — blurry scans, corrupted PDFs, or documents that simply cannot be parsed with high confidence. Build your pipeline to expect this:
// Webhook handler (async, because each branch awaits a downstream call)
app.post('/hooks/quixyl', async (req, res) => {
  const event = req.body;
  switch (event.event) {
    case 'extraction.complete':
      await postToERP(event.data);
      break;
    case 'extraction.failed':
      await createReviewTask(event.extraction_id, event.error_code);
      break;
    case 'review.required':
      await routeToHumanReview(event.extraction_id, event.flagged_fields);
      break;
  }
  res.sendStatus(200);
});
A well-designed pipeline treats review.required as a normal event path, not an error state. Aim for a review rate under 5%; if yours is higher, adjust your confidence thresholds or add vendor-specific templates.
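Tracking the review rate against the 5% target can be done from the webhook event names alone. The sketch below assumes each document produces exactly one of the three events shown in the handler. That is an assumption about the event model rather than documented behavior.

```python
from collections import Counter
from typing import Iterable

TARGET_REVIEW_RATE = 0.05  # target quoted above

TERMINAL_EVENTS = ("extraction.complete", "extraction.failed", "review.required")


def review_rate(event_names: Iterable[str]) -> float:
    """Share of documents that ended in review.required (0.0 if none seen)."""
    counts = Counter(name for name in event_names if name in TERMINAL_EVENTS)
    total = sum(counts.values())
    return counts["review.required"] / total if total else 0.0


def over_target(event_names: Iterable[str]) -> bool:
    """True when the observed review rate reaches or exceeds the target."""
    return review_rate(event_names) >= TARGET_REVIEW_RATE
```

Computing this per vendor rather than per run would make it easier to see which senders need a template.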
Audit trail for split documents
When Quixyl splits a batch PDF, the extraction record for each child document keeps a reference to the source file:
{
  "id": "ext_01HXYZ442B3R",
  "source_file": "s3://your-bucket/batch-2026-02.pdf",
  "source_pages": [3, 4],
  "document_type": "invoice",
  ...
}
This means you can always trace any extracted record back to the exact pages in the original file — essential for audit purposes.
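The same fields also let you verify a split after the fact. The sketch below checks that every page of the source file is claimed by exactly one child record. It assumes `source_pages` is 1-based and that you know the page count of the original file. The function name and the record IDs in the usage note are hypothetical.

```python
from typing import Dict, List, Tuple


def audit_split(records: List[Dict], total_pages: int) -> Tuple[List[int], List[int]]:
    """Return (unassigned_pages, shared_pages) for one source file.

    Assumes 1-based page numbers in each record's "source_pages".
    """
    owners: Dict[int, List[str]] = {page: [] for page in range(1, total_pages + 1)}
    for record in records:
        for page in record["source_pages"]:
            owners.setdefault(page, []).append(record["id"])
    unassigned = sorted(page for page, ids in owners.items() if not ids)
    shared = sorted(page for page, ids in owners.items() if len(ids) > 1)
    return unassigned, shared
```

For a five-page file with records covering pages [1, 2], [3, 4], and [4], this reports page 5 as unassigned and page 4 as shared, both of which are worth a manual look.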
Summary
| Scenario | Recommended approach |
|---|---|
| Single invoices of up to 10 pages | Synchronous endpoint |
| Invoices over 10 pages | Async endpoint + webhook |
| Mixed-document batch PDFs | Batch endpoint + auto_split + webhook |
| High-volume overnight run | Bulk + async + webhook, pre-split if boundaries are known |
| Unknown document types | document_type=auto (default) |