PDF Forensics & Document-Integrity Research
A PDF isn't one document — it's whatever the parser, renderer, or AI pipeline decides it is. These studies map that non-determinism and the attacks it enables: parser discrepancy, V/AP divergence, OCR / text-layer poisoning, and AI-ingestion poisoning — measured across nearly 8,000 real and crafted PDFs.
PDF Forensics at Scale — a 7,800-PDF Study
1,572 curated malicious/edge-case PDFs plus a 6,281-PDF real-world benign control. Live-malware detection, a multi-axis threat-vs-capability verdict, and an honest false-positive measurement. This latest run consolidates parser disagreement, reality drift, V/AP divergence and AI-ingestion poisoning into one corpus, and it supersedes the corpus-level numbers in the earlier studies below.
Read the full study →
Foundational research
Earlier mechanism deep-dives. The 7,800-PDF run above consolidates and supersedes their corpus-level prevalence numbers — but the mechanisms they document (parser discrepancy, V/AP divergence, reality drift, AI-ingestion poisoning) still stand.
The Epstein Files, Forensically — 16,971 PDFs
The first complete automated forensic pass over the whole DOJ disclosure: every one of its 16,971 PDFs, run through all 47 engines. A document-integrity story: the files were uniformly re-processed, their metadata was stripped from view but remains recoverable, and nearly 1 in 5 documents is semantically unreliable to automated analysis.
Parser Disagreement: Six Parsers, Eleven Divergences
MuPDF, Poppler, Ghostscript, qpdf, pdfminer and pdf.js, each in isolated namespaces, disagree on page count, text, and structure for the same file — the basis of parser-discrepancy attacks.
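As a rough illustration of how such disagreement might be tabulated, the sketch below compares per-parser reports for one file and returns only the attributes on which the parsers differ. It does not call any real parser: the function name `find_divergences`, the report layout, and the attribute names are all hypothetical, and the reports are assumed to have been collected separately.

```python
from typing import Dict, Hashable

# Hypothetical layout: {parser_name: {attribute_name: value}}.
# Values must be hashable (ints, strings, digests of extracted text, ...).
Reports = Dict[str, Dict[str, Hashable]]


def find_divergences(reports: Reports) -> Dict[str, Dict[str, Hashable]]:
    """Return {attribute: {parser: value}} for attributes the parsers disagree on."""
    # Collect every attribute that any parser reported.
    attributes = set().union(*(r.keys() for r in reports.values()))
    divergent = {}
    for attr in sorted(attributes):
        # A parser that did not report the attribute contributes None,
        # which is itself treated as a disagreement.
        values = {parser: r.get(attr) for parser, r in reports.items()}
        if len(set(values.values())) > 1:
            divergent[attr] = values
    return divergent
```

Given reports in which one parser counts four pages and the others count three, the result would contain a `pages` entry listing each parser's count; a file on which all parsers agree yields an empty dict.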
PDF Reality Drift
A single PDF can present different content to human viewers, text extractors, OCR engines, accessibility tools and AI pipelines — documented across real government and tax-form corpora.
The PDF Semantic Determinism Problem
A unifying framework tying the other studies together — why a format built for visual fidelity has no canonical machine-readable interpretation, and what a fix would require.
PDF Structural Problems in AI Ingestion Pipelines
When V/AP divergence and parser disagreement reach a RAG knowledge base or an LLM training corpus, the extracted text can differ from the visible page — quietly poisoning what the model learns or retrieves.
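One way an ingestion pipeline might guard against this is to compare the text-layer extraction with an independent reading of the rendered page (for example, OCR output) and hold back documents where the two drift apart. The sketch below is a minimal version of that idea using only the Python standard library; the function names and the 0.2 threshold are illustrative assumptions, not values taken from the studies.

```python
import difflib


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so layout noise does not count as drift."""
    return " ".join(text.lower().split())


def drift_score(extracted: str, visible: str) -> float:
    """Return 0.0 for identical text, approaching 1.0 as the two readings diverge."""
    matcher = difflib.SequenceMatcher(None, normalize(extracted), normalize(visible))
    return 1.0 - matcher.ratio()


def should_quarantine(extracted: str, visible: str, threshold: float = 0.2) -> bool:
    """Flag a page whose extracted text differs too much from what is visibly rendered.

    The threshold is an arbitrary placeholder and would need tuning on real data.
    """
    return drift_score(extracted, visible) > threshold
```

A page whose text layer carries extra hidden instructions that never appear in the rendering would score well above the threshold, while ordinary whitespace and case differences would not.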
PDF Forms as Executable Security Boundaries
Form fields carry two independent representations: the stored value (/V) and the rendered appearance stream (/AP). V/AP divergence, NeedAppearances, and DocMDP certification mean a "signed" document can render one value and store another.
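To make the divergence concrete, the sketch below compares a field's stored value (the /V entry) with the text its appearance stream (/AP) would draw. It is deliberately simplified and is an assumption-laden illustration rather than a real checker: it takes the value and the already-decoded stream as plain inputs, only recognises literal strings shown with the `Tj` operator, and ignores hex strings, `TJ` arrays, escape sequences and font encodings. The function names are hypothetical.

```python
import re

# Literal strings shown by the Tj operator, e.g. b"(1,000.00) Tj".
# Simplification: no nested parentheses, escapes, hex strings or TJ arrays.
_TJ_LITERAL = re.compile(rb"\((.*?)\)\s*Tj", re.DOTALL)


def appearance_text(ap_stream: bytes) -> str:
    """Concatenate the literal strings an appearance stream draws with Tj."""
    return "".join(m.decode("latin-1") for m in _TJ_LITERAL.findall(ap_stream))


def vap_diverges(value: str, ap_stream: bytes) -> bool:
    """True when the stored value and the drawn text differ after trimming."""
    return value.strip() != appearance_text(ap_stream).strip()
```

For instance, a field whose stored value is "9,000.00" but whose appearance stream draws "1,000.00" would be flagged, which is the render-one-value, store-another case described above.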