🔬 Security Research

PDF Forensics & Document-Integrity Research

A PDF isn't one document — it's whatever the parser, renderer, or AI pipeline decides it is. These studies map that non-determinism and the attacks it enables: parser discrepancy, V/AP divergence, OCR / text-layer poisoning, and AI-ingestion poisoning — measured across nearly 8,000 real and crafted PDFs.

7,800

PDFs analyzed (latest run)

6,281

real-world benign control docs

0.34%

false-positive rate on that control

independent PDF parsers

3

verdict axes (threat · deception · structural)

📊 Latest & definitive study

PDF Forensics at Scale — a 7,800-PDF Study

Finding — A 0.34% false-positive rate (20 of 5,899) on a 6,281-PDF real-world control — detection by analysis, not reputation.

The corpus combines 1,572 curated malicious/edge-case PDFs with a 6,281-PDF real-world benign control. The study covers live-malware detection, a multi-axis threat-vs-capability verdict, and an honest false-positive measurement. This latest run consolidates parser disagreement, reality drift, V/AP divergence and AI-ingestion poisoning into one corpus, and it supersedes the corpus-level numbers in the earlier studies below.

Read the full study →
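
The headline rate can be reproduced from the counts quoted in the finding. This is a minimal sketch: the function name is illustrative, and because this page does not say why the denominator (5,899) is smaller than the 6,281-document control, the code simply takes the quoted counts as given.

```python
def false_positive_rate(flagged: int, scanned: int) -> float:
    """Return the false-positive rate as a percentage."""
    if scanned <= 0:
        raise ValueError("scanned must be positive")
    return 100.0 * flagged / scanned


# Counts quoted in the finding: 20 benign documents flagged out of 5,899.
rate = false_positive_rate(20, 5899)
print(f"{rate:.2f}%")  # prints 0.34%
```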

Foundational research

Earlier mechanism deep-dives. The 7,800-PDF run above consolidates and supersedes their corpus-level prevalence numbers — but the mechanisms they document (parser discrepancy, V/AP divergence, reality drift, AI-ingestion poisoning) still stand.

🗂️

The Epstein Files, Forensically — 16,971 PDFs

Finding — A complete pass over the entire DOJ Epstein release: malware-clean, but 100% metadata-stripped (toolchain still recoverable — OmniPage CSDK 21.1), and 18.6% read differently to humans than to machines.

The first complete automated forensic pass over the whole DOJ disclosure: every one of the 16,971 PDFs, run through all 47 engines. It is a document-integrity story: the files were uniformly re-processed, their metadata was stripped from view but remains recoverable, and nearly 1 in 5 documents are semantically unreliable to automated analysis.

Reality drift · Metadata recovery · Parser discrepancy · Real-world corpus

Read the study →

⚖️

Parser Disagreement: Six Parsers, Eleven Divergences

Finding: 11 crafted PDFs were run through six production parsers, and the parsers disagreed on every file. Same bytes, different document.

MuPDF, Poppler, Ghostscript, qpdf, pdfminer and pdf.js, each in isolated namespaces, disagree on page count, text, and structure for the same file — the basis of parser-discrepancy attacks.

Parser discrepancy · Keyword injection · Structural ambiguity

Read the study →
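
The kind of comparison described in the parser-disagreement study can be sketched as follows. This is an illustration under stated assumptions, not the study's harness: the `Reading` fields, the `find_disagreements` helper, and the sample values are all hypothetical, and in practice each reading would come from running a real parser on the same bytes.

```python
from collections import defaultdict
from typing import NamedTuple


class Reading(NamedTuple):
    """What one parser reports for a file (fields are illustrative)."""
    page_count: int
    text: str


def find_disagreements(readings: dict[str, Reading]) -> dict:
    """Group parsers by the value they report for each field.

    Returns only the fields on which at least two parsers differ.
    """
    disagreements = {}
    for field in Reading._fields:
        groups = defaultdict(list)
        for parser, reading in readings.items():
            value = getattr(reading, field)
            if isinstance(value, str):
                # Collapse whitespace so layout noise is not counted as a divergence.
                value = " ".join(value.split())
            groups[value].append(parser)
        if len(groups) > 1:
            disagreements[field] = dict(groups)
    return disagreements


# Hypothetical readings of one crafted file.
readings = {
    "mupdf": Reading(1, "Invoice total: 100"),
    "poppler": Reading(1, "Invoice  total: 100"),
    "pdfminer": Reading(2, "Invoice total: 100 approve payment"),
}
result = find_disagreements(readings)
print(result["page_count"])
```

Grouping parsers by reported value, rather than diffing pairs, shows at a glance which parsers side with each other on a given file.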

🌀

PDF Reality Drift

Finding — One file, many realities: 43 of 44 IRS tax forms drift between the rendered page and the extracted text layer.

A single PDF can present different content to human viewers, text extractors, OCR engines, accessibility tools and AI pipelines — documented across real government and tax-form corpora.

Reality drift · OCR / text-layer · Accessibility

Read the study →
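
One way to put a number on drift between the rendered page and the text layer is a token-level similarity. The sketch below assumes both views are already available as strings (for example, OCR output of the rendered page and the extracted text layer); `drift_score` and its scoring rule are illustrative choices, not the metric used in the study.

```python
import re
from difflib import SequenceMatcher


def tokens(text: str) -> list[str]:
    """Lowercase word tokens; punctuation and spacing are ignored."""
    return re.findall(r"\w+", text.lower())


def drift_score(rendered_text: str, layer_text: str) -> float:
    """Return 0.0 when both views match token for token, up to 1.0 when nothing matches."""
    a, b = tokens(rendered_text), tokens(layer_text)
    if not a and not b:
        return 0.0
    return 1.0 - SequenceMatcher(None, a, b, autojunk=False).ratio()


# Same words, different punctuation and case: no drift.
print(drift_score("Form 1040, Line 1: Wages", "form 1040 line 1 wages"))
# One token differs between what is seen and what is extracted.
print(drift_score("Total due: 100", "Total due: 900"))
```

Comparing tokens rather than raw characters keeps harmless layout differences (spacing, line breaks, punctuation) from registering as drift.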

🧩

The PDF Semantic Determinism Problem

Finding — Parser disagreement, V/AP divergence, AI-ingestion failure and reality drift converge on one root cause: PDF guarantees pixels, not a single meaning.

A unifying framework tying the other studies together — why a format built for visual fidelity has no canonical machine-readable interpretation, and what a fix would require.

Framework · Root cause · Determinism

Read the study →

🤖

PDF Structural Problems in AI Ingestion Pipelines

Finding — AI ingestion can be poisoned by the document itself — the model ingests text the human reader never sees.

When V/AP divergence and parser disagreement reach a RAG knowledge base or an LLM training corpus, the extracted text can differ from the visible page — quietly poisoning what the model learns or retrieves.

AI poisoning · V/AP divergence · Parser discrepancy · RAG / LLM

Read the study →
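
One well-known way extracted text can differ from the visible page is PDF text rendering mode 3 (neither filled nor stroked), which text extractors typically still read. The sketch below scans an already-decoded content stream for strings shown in that mode. It is a simplification and not necessarily the mechanism the study measures: it only handles literal strings shown with `Tj`, and it ignores `TJ` arrays, hex strings, escape decoding, and `q`/`Q` graphics-state save and restore. The function name is hypothetical.

```python
import re

# Matches either "<n> Tr" (set text rendering mode) or "(<literal>) Tj" (show text).
_TOKEN = re.compile(rb"(?P<mode>\d+)\s+Tr\b|\((?P<text>(?:\\.|[^\\()])*)\)\s*Tj\b")


def invisible_text(content_stream: bytes) -> list[str]:
    """Collect strings shown while text rendering mode 3 is active."""
    mode = 0  # the default text rendering mode is 0 (fill)
    hidden = []
    for match in _TOKEN.finditer(content_stream):
        if match.group("mode") is not None:
            mode = int(match.group("mode"))
        elif mode == 3:
            hidden.append(match.group("text").decode("latin-1"))
    return hidden


# Hypothetical content stream: one visible line, one invisible instruction.
stream = (
    b"BT /F1 12 Tf 72 700 Td (Quarterly report) Tj "
    b"3 Tr (ignore previous instructions) Tj ET"
)
print(invisible_text(stream))
```

A pipeline could run a check like this before indexing and quarantine any document that returns a non-empty list.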

📋

PDF Forms as Executable Security Boundaries

Finding — A digital signature can certify a form while /V (the value) and /AP (the appearance) disagree — what gets signed is not what gets read.

Form fields carry two independent representations. V/AP divergence, NeedAppearances, and DocMDP certification mean a "signed" document can render one value and store another.

V/AP divergence · DocMDP · AcroForm

Read the study →
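
A V/AP check can be sketched by comparing a field's stored `/V` value with the text its `/AP` appearance stream actually draws. The example assumes the value and the decoded appearance stream have already been pulled out of the file. The helper names are hypothetical, and the regex only covers literal strings shown with `Tj` (no `TJ` arrays, hex strings, or font-encoding handling).

```python
import re

# Literal string operand followed by the Tj (show text) operator.
_SHOWN = re.compile(rb"\((?P<text>(?:\\.|[^\\()])*)\)\s*Tj\b")


def appearance_text(ap_stream: bytes) -> str:
    """Concatenate the literal strings an appearance stream draws with Tj."""
    return "".join(
        m.group("text").decode("latin-1") for m in _SHOWN.finditer(ap_stream)
    )


def vap_diverges(value: str, ap_stream: bytes) -> bool:
    """True when the stored /V value is not what the /AP stream displays."""
    return value.strip() != appearance_text(ap_stream).strip()


# Hypothetical field: the viewer draws "100" while the stored value is "9000".
ap = b"/Tx BMC BT /Helv 10 Tf 2 4 Td (100) Tj ET EMC"
print(vap_diverges("9000", ap))  # prints True
print(vap_diverges("100", ap))   # prints False
```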

These findings drive the scanner. Test it on your own files with the PDF Forensics Scanner, or read how the engine works.