Executive Summary
This is an independent study, conducted entirely with our own forensic engine. We ran every PDF in the DOJ "Epstein Files Transparency Act" disclosure — all 16,971 documents, all 47 engines, zero files skipped or failed — and report what the scanner found at the object level.
The corpus is clean of malware. That is the expected, correct result for a government document production, and it is not the story. The story is provenance — what these files are, how they were made, and what they carry that a human reader never sees:
- Two distinct production pipelines partition the entire release, visible in the bytes (PDF 1.3 headers with compressed object streams and three %%EOF markers vs. PDF 1.5 headers with two).
- A dated production timeline — the documents were processed in distinct waves between December 2025 and February 2026, with a single 6,583-document batch on one day.
- The corpus underwent a digital-to-image transformation workflow before release — OCR-captured, rasterized, re-saved — with image and structural characteristics far more uniform than native scanner output.
- 96.7% carry data the renderer never shows: invisible text layers, OCR text that diverges from the image, hidden compressed objects, orphaned objects — including the original metadata the active document appears to have erased.
- 18.6% present different content to a human than to a machine, and 61.9% make PDF parsers disagree about the document's own structure.
The release is safe to open. Its risk is to interpretation. Note the deliberate limit of these findings: structure can show that a digital-to-image transformation occurred; it cannot, by itself, establish whether the underlying content was altered. We report the observable technical facts and stop there.
Corpus & Method
| Documents analyzed | 16,971 PDFs — the complete released corpus |
|---|---|
| Failed / unscanned | 0 |
| Total pages | 67,143 (mean 4.0, median 1, max 1,523) |
| Total PDF objects | 1,864,023 (mean 110/file, max 42,581) |
| Total streams | 437,548 (incl. 29,870 image XObjects) |
| Engine | PQ PDF forensic scanner — 47 independent passes per file |
| Verdict model | Multi-axis: threat (malware/exploit), deception (content-integrity), structural (neutral capability) — scored separately |
Each document received the full suite: a six-parser differential (MuPDF, Poppler, Ghostscript, qpdf, pdfminer, pdf.js), a dynamic behavioral sandbox, YARA, ClamAV, ML anomaly detection, JavaScript AST analysis, signature and revision forensics, xref-graph integrity, OCR text-layer integrity, reading-order analysis, object-stream decompression, font forensics, accessibility-tree analysis, physical-entropy topology, and content-integrity / AI-ingestion detection. Verdicts separate capability from malice, so a complex-but-legitimate scan is never mistaken for an attack.
Verdict Distribution
The malware (threat) axis is effectively empty; the high-risk population is driven entirely by content-integrity (deception).
Finding 1 — Two Distinct Production Pipelines
The release was not produced by a single process. The header PDF version partitions the corpus into two tightly-correlated populations, and the correlation is near-total:
| Trait | Pipeline A (header 1.3) | Pipeline B (header 1.5) |
|---|---|---|
| Documents | 10,718 | 5,828 (+425 at 1.4) |
| Compressed object streams (/ObjStm) | 10,499 | 14 |
| Mixed xref table + stream | 10,499 | 14 |
| Three %%EOF markers | 10,499 | 14 |
Pipeline A's documents claim PDF 1.3 in their header but use 1.5-era object streams internally — the inconsistency that makes parsers disagree (Finding 6). Both populations share the same OCR toolchain (Finding 4); the split reflects two assembly passes or source batches, cleanly separable from the bytes alone.
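The traits in the table above can be read from the raw bytes without a full PDF parser. The following is a minimal sketch, not the scanner's implementation: `StructuralTraits`, `structural_traits`, and `looks_like_pipeline_a` are hypothetical names, and the byte-pattern checks are a simplification that would miss traits hidden inside compressed streams.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class StructuralTraits:
    header_version: Optional[str]  # version declared in the %PDF-x.y header
    eof_markers: int               # number of %%EOF markers in the file
    has_object_streams: bool       # any /Type /ObjStm object present
    has_xref_table: bool           # classic "xref" keyword table present
    has_xref_stream: bool          # any /Type /XRef stream present


def structural_traits(data: bytes) -> StructuralTraits:
    """Collect byte-level structural traits from a PDF file's contents."""
    header = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    return StructuralTraits(
        header_version=header.group(1).decode("ascii") if header else None,
        eof_markers=data.count(b"%%EOF"),
        has_object_streams=re.search(rb"/Type\s*/ObjStm", data) is not None,
        # "xref" at the start of a line; "startxref" is excluded because the
        # character before "xref" must be a line break or the file start.
        has_xref_table=re.search(rb"(?:^|[\r\n])xref[\r\n ]", data) is not None,
        has_xref_stream=re.search(rb"/Type\s*/XRef", data) is not None,
    )


def looks_like_pipeline_a(traits: StructuralTraits) -> bool:
    """Assumed rule of thumb: 1.3 header, object streams, three EOF markers."""
    return (
        traits.header_version == "1.3"
        and traits.has_object_streams
        and traits.eof_markers == 3
    )
```

Reading the file with `open(path, "rb").read()` and passing the bytes to `structural_traits` is enough to tabulate a corpus; the classification rule is an illustration of the correlation reported above, not a definition taken from the study.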
Finding 2 — Universal Post-Creation Modification
All 16,971 documents (100%) contain incremental updates: objects appended after the original save, and existing object IDs re-defined in later revisions. 16,306 (96.1%) retain orphaned objects in the cross-reference table. Incremental injections average 13 new objects and reach 3,049 in one revision; 2,323 files (13.7%) carry a large final-pass injection — the Bates-numbering stamp applied as a separate generation. A measured 22.5% show an entropy cliff at a revision boundary, the physical trace of content appended in a later pass.
No document is in its original single-save state. Every one carries two or three revision layers; the cross-reference graph confirms it, with 21,012 superseded /XRef and 10,513 /ObjStm objects left orphaned across the corpus.
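One way to picture the entropy-cliff measurement is to split a file at its %%EOF markers and compare the byte entropy of consecutive revisions. The sketch below assumes that approach; the study does not publish its exact definition of a cliff, so the function names and the `threshold` value of 2.0 bits are illustrative assumptions.

```python
import math
from collections import Counter
from typing import List, Tuple


def shannon_entropy(chunk: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not chunk:
        return 0.0
    total = len(chunk)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(chunk).values()
    )


def revision_segments(data: bytes) -> List[bytes]:
    """Split a PDF into segments, each ending at a %%EOF marker."""
    marker = b"%%EOF"
    segments = []
    start = 0
    while True:
        idx = data.find(marker, start)
        if idx == -1:
            break
        end = idx + len(marker)
        segments.append(data[start:end])
        start = end
    return segments


def entropy_cliffs(data: bytes, threshold: float = 2.0) -> List[Tuple[int, float]]:
    """Return (revision_index, entropy_delta) where entropy jumps sharply.

    The threshold is an assumed value chosen for illustration only.
    """
    entropies = [shannon_entropy(s) for s in revision_segments(data)]
    cliffs = []
    for i in range(1, len(entropies)):
        delta = entropies[i] - entropies[i - 1]
        if abs(delta) >= threshold:
            cliffs.append((i, delta))
    return cliffs
```

A revision that consists mostly of compressed image data sits near 8 bits per byte, while an appended plain-text stamp or trailer sits much lower, which is why an appended pass tends to show up as a step between segments.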
Finding 3 — A Dated Production Timeline
The revision metadata that survived processing carries timestamps, and they cluster into distinct dated processing waves — not a single export. Of the ~10,947 documents whose dates are readable, the production ran from December 2025 into February 2026, dominated by one 6,583-document batch:
Two observations follow. First, the production is batch-processed in waves, consistent with the two-pipeline split in Finding 1. Second, 14 documents retain a pre-2025 original creation date (2013-04-16) — provenance that survived the transformation in the orphaned metadata, a reminder that the underlying documents predate this release by years.
Finding 4 — Metadata Stripped From View, but the Toolchain Is Recoverable
All 16,971 documents (100%) have empty Producer, Creator, Title and Author fields in their active metadata, and none carry XMP. On the surface the authoring fingerprints look erased.
They were not erased, only unlinked. The pipeline left the original /Info dictionary behind as an orphaned object inside a compressed object stream, unreferenced by the final trailer. Walking every cross-reference object recovers it. Across a sample spanning the DataSets, a majority of documents still carry a recoverable fingerprint, which reconstructs the pipeline end to end:
| Recovered field | Value | Stage |
|---|---|---|
| Creator | OmniPage CSDK 21.1 | Commercial OCR / document-capture SDK — scanning + text layer |
| Producer | Processing-CLI | Batch assembly / processing |
| Producer (subset) | pypdf | Python PDF library — final manipulation on a minority of files |
The release was assembled by an OmniPage OCR → Processing-CLI → (pypdf) pipeline. That provenance — which the visible metadata appears to remove — is recoverable from the orphaned object-stream dictionaries the pipeline left behind.
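The recovery step can be approximated with the standard library alone: inflate every stream that decompresses and look for /Creator and /Producer entries. This is a simplified sketch, not the scanner's cross-reference walk. `recover_metadata_fields` is a hypothetical name, only literal `(...)` strings are handled, and the function does not check whether the dictionary it finds is actually unreferenced by the trailer.

```python
import re
import zlib
from typing import Set, Tuple

# Raw stream bodies between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)
# /Producer or /Creator followed by a literal string; backslash escapes
# are tolerated, nested parentheses and hex strings are not handled.
FIELD_RE = re.compile(rb"/(Producer|Creator)\s*\(((?:\\.|[^\\)])*)\)")


def recover_metadata_fields(data: bytes) -> Set[Tuple[str, str]]:
    """Find Producer/Creator values in raw bytes and in inflated streams."""
    bodies = [data]
    for match in STREAM_RE.finditer(data):
        try:
            # decompressobj tolerates trailing bytes after the zlib stream.
            bodies.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            continue  # not Flate data (for example an image in another codec)
    found = set()
    for body in bodies:
        for key, value in FIELD_RE.findall(body):
            found.add((key.decode("ascii"), value.decode("latin-1")))
    return found
```

Comparing the result against the active /Info dictionary (empty in this corpus) is what distinguishes stripped-from-view metadata from metadata that was never present.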
Finding 6 — Parser Disagreement
10,499 files (61.9%) make independent PDF parsers disagree about the document's own declared version: MuPDF and Poppler read PDF 1.5; qpdf reads 1.3 — from the same bytes. The cause is structural and is Pipeline A's signature: a 1.3 header over 1.5-era object streams.
| Structural condition | Files | Share |
|---|---|---|
| Parser version mismatch (MuPDF/Poppler 1.5 vs qpdf 1.3) | 10,499 | 61.9% |
| Mixed xref table and xref stream | 10,513 | 61.9% |
| /ObjStm object streams | 10,513 | 61.9% |
| Data carried in PDF comments between revisions | 10,499 | 61.9% |
| Multiple %%EOF markers | 10,513 | 61.9% |
| Entropy cliff at a revision boundary | 3,814 | 22.5% |
Page counts and object counts were otherwise consistent across parsers; the disagreement is specifically about version and cross-reference structure, the layer most relevant to how a tool decides to decode the file. A document that independent parsers read as two different PDF versions has no single canonical machine reading, which is the structural root of the reality drift in Finding 7.
Finding 7 — Reality Drift: Human-Visible vs Machine-Readable
3,159 documents (18.6%), across 4,243 individual pages, contain a hidden OCR text layer that does not match the rendered image — measured by word-overlap (Jaccard) between what the page shows and what its text layer says. This is the single most consequential finding for automated use of the release:
- A human investigator reads the image.
- A search index, e-discovery tool, or AI/RAG pipeline reads the text layer.
- When they disagree, automated analysis silently ingests text that is not what the document visibly says — degrading search and poisoning any AI summarization or retrieval built on the corpus.
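The word-overlap measure named above is the Jaccard index: the size of the intersection of two word sets divided by the size of their union. A minimal sketch follows, assuming the image-side text comes from a fresh OCR of the rendered page and the other side from the embedded text layer. The tokenization and the `threshold` of 0.5 are assumptions; the study does not state either.

```python
import re
from typing import Set


def word_set(text: str) -> Set[str]:
    """Lower-cased alphanumeric tokens; punctuation and case are ignored."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(image_text: str, layer_text: str) -> float:
    """Jaccard word overlap: |A & B| / |A | B|, from 0.0 to 1.0."""
    a, b = word_set(image_text), word_set(layer_text)
    if not a and not b:
        return 1.0  # two empty pages agree trivially
    return len(a & b) / len(a | b)


def is_drift(image_text: str, layer_text: str, threshold: float = 0.5) -> bool:
    """Flag a page whose overlap falls below an assumed threshold."""
    return jaccard(image_text, layer_text) < threshold
```

For example, `jaccard("a b c d", "c d e f")` shares two words out of six distinct ones and scores one third, which this sketch would flag as drift.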
Most affected documents drift on a single page, but the tail runs deep — some carry mismatches across a dozen pages:
Finding 9 — A Digital-to-Image Transformation Workflow
The structural evidence converges on a single observable conclusion: the corpus underwent a digital-to-image transformation workflow before release, with content rendered to page images, OCR-captured, and re-assembled rather than emitted as native digital documents. Three independent measurements point the same way:
- Image rasters are extraordinarily uniform. Across 29,870 image XObjects the mean entropy is 1.01 (median 1.02), with 49% below 1.0 — heavily-compressible, low-variation raster data unlike the texture and sensor noise of native paper scanning.
- Skew is uniform and near-zero. On a rendered sample we measured page skew directly: 91% of pages sit within 0.5° of true horizontal, and the skew spread within a multi-page document averages 0.55°. Physical scanning of paper introduces per-page skew variation; this consistency is the signature of programmatic rendering, not a feed scanner.
- No born-digital structure, redundant color encoding. 99.9% carry no structure tree, no document embeds a font or attachment, and 33% store grayscale content in an RGB color space — a re-encoding artifact of a rasterization pass.
Taken together — uniform low-entropy rasters, near-zero uniform skew, redundant color encoding, and an OCR text layer in place of real fonts — the signature is a render-to-image pipeline applied at scale. We note one measurement that did not hold: page borders are not uniformly flat (border pixel variance is non-trivial), so we make no claim about the absence of edge artifacts; the conclusion rests on the skew, entropy, and structural evidence above.
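The redundant color encoding is the simplest of these signals to check. Given decoded pixel samples, an image is grayscale stored as RGB when every pixel has equal red, green, and blue values. The sketch below assumes 8-bit interleaved RGB samples that have already been decoded from the image XObject; decoding itself (Flate, DCT, and so on) is out of scope, and `is_grayscale_in_rgb` is a hypothetical name.

```python
def is_grayscale_in_rgb(pixels: bytes, tolerance: int = 0) -> bool:
    """True if every RGB pixel has (nearly) equal channel values.

    `pixels` is assumed to be 8-bit interleaved RGB: R, G, B, R, G, B, ...
    `tolerance` allows small channel differences such as those left by
    lossy compression; 0 demands exact equality.
    """
    if not pixels or len(pixels) % 3 != 0:
        raise ValueError("expected a non-empty multiple of 3 bytes")
    reds, greens, blues = pixels[0::3], pixels[1::3], pixels[2::3]
    return all(
        max(r, g, b) - min(r, g, b) <= tolerance
        for r, g, b in zip(reds, greens, blues)
    )
```

An image that passes this check carries three identical channels where one would do, which is the kind of re-encoding artifact the 33% figure refers to.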
A deliberate limit: this is an observation about process, not content. The structure shows a digital-to-image transformation occurred before release. It does not, on its own, establish that the underlying content was altered — that requires evidence beyond PDF structure. We report the technical finding and stop short of any claim about fabrication.
Finding 10 — The Absent Semantic Layer
Modern document workflows often leave behind a semantic layer (a tagged structure tree, headings, /Alt text) that humans never see but parsers and AI systems ingest. This release has almost none:
- 16,957 of 16,971 (99.9%) are untagged — no structure tree.
- 0 documents carry /Alt or /ActualText accessibility annotations.
The implication cuts two ways. There is no hidden semantic content injected for machines to misread — but there is also no reliable machine-readable structure at all. For a RAG or AI pipeline, the only machine-readable content is the OCR text layer — the same layer that diverges from the visible image in 18.6% of documents (Finding 7). The corpus is, by construction, a difficult and partly unreliable substrate for automated analysis.
Finding 11 — Malware: Clean
Across all 16,971 documents:
- 0 known-malware hash matches (offline threat intelligence, 6.4M+ indicators).
- 0 confirmed malicious executables, dropper chains, or weaponized JavaScript actions. No file carries a genuine exploit execution vector.
- On the first full pass, 467 files tripped a heap-spray byte heuristic (see the methodology note below). We verified these as false positives: the patterns are flat regions and incidental byte sequences inside compressed scanned-page images, not shellcode. There were no %u heap-spray strings, no NOP sled, and no executable JavaScript action, only image data. The engine was corrected so these image-entropy patterns no longer escalate, and every figure here reflects the corrected engine.
The release is safe to open. Its risks are to interpretation, not to the system.
Methodology note: a false positive we found and fixed
This finding is itself a result of the study. On the first full pass, 467 documents tripped a heap-spray byte heuristic, and four scored as "JavaScript + Heapspray Shellcode" with a maximum threat score. Rather than report them, we pulled the highest-scoring files and inspected the bytes directly.
The root cause was benign: the "heap-spray" pattern was a flat region inside a compressed scanned-page image, and the "/JS" was an incidental byte substring inside a Flate stream, with no %u heap-spray string, no NOP sled, and no JavaScript action dictionary.
We corrected three detector roots so image entropy can no longer masquerade as an exploit: the YARA NOP-sled literal was raised from 16 bytes to a 200-byte run (a real sled cannot survive Flate compression); the JavaScript test was changed from a raw substring to PDF action syntax (/S /JavaScript, /JS with a string/stream/indirect reference); and the raw heap-fill byte-run heuristics were gated on a genuine JavaScript spray context. The four flagged files dropped to a zero threat score, and every figure in this report reflects the corrected engine. We document the error because a forensic tool that hides its own false positives cannot be trusted to report anyone else's.
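The second correction, moving from a raw substring to PDF action syntax, can be illustrated with two small detectors. This is a sketch of the idea rather than the engine's rule: the regular expression and both function names are assumptions, and a syntax match inside compressed binary data remains possible, only far less likely than a bare three-byte match.

```python
import re

# /S /JavaScript, or /JS followed by a literal string "(", a hex string or
# dictionary "<", or an indirect reference such as "12 0 R".
JS_ACTION_RE = re.compile(
    rb"/S\s*/JavaScript\b"
    rb"|/JS\s*(?:\(|<|\d+\s+\d+\s+R\b)"
)


def naive_has_javascript(data: bytes) -> bool:
    """The original style of test: any occurrence of the bytes '/JS'."""
    return b"/JS" in data


def strict_has_javascript(data: bytes) -> bool:
    """The corrected style of test: '/JS' must look like action syntax."""
    return JS_ACTION_RE.search(data) is not None
```

On a byte run such as `b"\x9c\x01/JS\xff\x10"`, which is how an incidental match inside a Flate stream might look, the naive test fires and the strict test does not.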
Finding 12 — The Human-Visible Page Is Not the Canonical Document
The preceding findings are usually treated as separate defects. Read together, they make one structural argument that is the central result of this study:
For this corpus, the page a human sees is not the authoritative version of the document. The authoritative version is the machine-interpreted layer — and across thousands of documents that layer says something different from the page, or cannot be agreed upon at all.
Every content-integrity signal we measured is a different way the same thing is true:
| Signal | Prevalence | How the visible page and the machine layer diverge |
|---|---|---|
| Invisible text layer (render mode 3) | 83.4% | A whole text layer exists that is never painted to screen |
| OCR text vs image mismatch | 18.6% | The hidden layer disagrees with the visible image (median overlap 0.16) |
| Hidden layer carries email headers | 53% of drift pages | The machine reads From/To/Subject/Date fields that are degraded in the visible image |
| Parser disagreement | 61.9% | Different readers extract a structurally different document |
| Reading-order ambiguity | 44.7% | The order a machine reads text in is not determinate |
| Digital-to-image transformation | corpus-wide | The "page" is a rendered raster, not the original document |
| Orphaned metadata / objects | 96.1% | Superseded content (incl. original /Info) the reader never sees |
| Email-transport encoding residue | ~1.2% | Raw base64 / quoted-printable artifacts only a machine ingests |
This reframes what "reading" the Epstein files even means. A person opens a page image. A search index, an e-discovery platform, or an AI/RAG system ingests the text layer, the object graph, and the orphaned metadata — a different representation, one that for nearly one document in five does not match the visible page, and that no two PDF parsers fully agree on. There is no single canonical reading of this corpus: there is the human view and there is the machine view, and the gap between them is measurable, widespread, and — for automated analysis at scale — the most important property of the release.
The limit holds here too: that the machine layer diverges from the page is an observable technical fact. Whether any specific divergence reflects intent, OCR error, or conversion artifact requires evidence beyond PDF structure. We report the divergence and its scale, not a motive.
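The first signal in the table corresponds to a specific operator in the PDF content stream: `Tr` sets the text rendering mode, and mode 3 paints neither fill nor stroke, so the text exists for extraction but is never drawn. Mode 7 adds text to the clipping path only and is likewise not painted. The sketch below scans an already-decompressed content stream for those operators. It is a simplification with hypothetical function names: a proper tokenizer would be needed to ignore matches that happen to sit inside string literals.

```python
import re
from typing import Set

# An integer 0-7 followed by the Tr operator, not glued to other tokens.
TR_RE = re.compile(rb"(?<![\w.])([0-7])\s+Tr(?!\w)")

# Rendering modes that never paint glyphs: 3 (invisible) and 7 (clip only).
UNPAINTED_MODES = {3, 7}


def text_render_modes(content: bytes) -> Set[int]:
    """Collect every text rendering mode set in a content stream."""
    return {int(mode) for mode in TR_RE.findall(content)}


def has_invisible_text(content: bytes) -> bool:
    """True if the stream switches to a mode that never paints glyphs."""
    return bool(text_render_modes(content) & UNPAINTED_MODES)
```

For a stream such as `b"BT /F1 12 Tf 3 Tr 72 700 Td (hidden) Tj ET"`, the function reports invisible text; the same stream with `0 Tr` does not.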
What This Means
- Provenance is partly recoverable, partly gone. Active metadata is stripped, but the OmniPage→Processing-CLI toolchain, the dated processing waves, and a few original 2013 dates survive in orphaned objects.
- The release is a render-to-image production, processed in batches by two pipelines — an observation about process, not a claim about content.
- Nearly one document in five is semantically unreliable to machines: the OCR text diverges from the visible page, and there is no semantic layer to fall back on. Any party indexing, searching, or feeding this corpus to an AI system is, for thousands of documents, operating on text that does not match the visible record.
These are exactly the conditions a multi-axis forensic engine is built to surface — and they are invisible to malware-signature scanning, which would correctly report this entire corpus as clean.
Appendix A — Every Indicator, by Prevalence
All distinct forensic indicators raised across the corpus, as a share of 16,971 documents. Capability and structure indicators are reported on their own axes and do not drive the malware verdict.
| Indicator | Files | Share |
|---|---|---|
| Empty active metadata | 16,971 | 100% |
| Incremental update: new objects in later revisions | 16,971 | 100% |
| Incremental object override: existing OID re-defined | 16,971 | 100% |
| Orphan objects in xref | 16,306 | 96.1% |
| Invisible text (rendering mode 3/7) | 14,152 | 83.4% |
| Multiple %%EOF markers | 10,513 | 61.9% |
| /ObjStm object streams | 10,513 | 61.9% |
| Mixed xref table and xref stream | 10,513 | 61.9% |
| Data in PDF comment between revisions | 10,499 | 61.9% |
| Differential parsing: PDF version mismatch | 10,499 | 61.9% |
| Multi-column layout with ambiguous reading order | 7,587 | 44.7% |
| Grayscale image stored as RGB | 5,673 | 33.4% |
| Entropy cliff at revision boundary | 3,814 | 22.5% |
| OCR/text-layer mismatch (any page) | 3,159 | 18.6% |
| Large incremental injection in final revision | 2,323 | 13.7% |
| Overlapping text blocks | 1,293 | 7.6% |
| High-entropy streams | 984 | 5.8% |
| /Trans presentation dictionary | 439 | 2.6% |
| /ASCII85Decode | 425 | 2.5% |
A long tail of indicators appears on fewer than 0.2% of files (isolated CVE-pattern byte matches, a single /XFA, a single calculation-order array). All were individually reviewed; none corresponded to a working exploit.
Appendix B — Engine Coverage
All 47 analysis passes ran on every document. The provenance-relevant engines and what each contributed to this study:
| Engine | Contribution here |
|---|---|
| OCR Text-Layer Integrity | Jaccard image-vs-text divergence — the 18.6% reality-drift cohort |
| Six-Parser Differential | Version/structure disagreement — 61.9% |
| Revision History + Trailer Chain | Generations, Bates-stamp passes, dated production waves |
| XRef Integrity Graph | Orphaned & superseded objects (incl. recoverable /Info) |
| Metadata Analyzer + Reconciliation | Active-metadata stripping, orphaned-toolchain recovery, date correlation |
| Physical Entropy Topology | Image-stream uniformity, revision-boundary entropy cliffs |
| Reading-Order Analysis | Multi-column / overlapping-text ambiguity (44.7%) |
| Unicode & Invisible Text | Render-mode-3 hidden text layers (83.4%) |
| Accessibility-Tree Forensics | Absent semantic layer (99.9% untagged) |
| Object-Stream Analysis | Hidden compressed objects (61.9%) |
| Sandbox · YARA · ClamAV · ML · Threat-Intel | Malware axis — clean across the corpus |
Reproducibility
Every figure above derives from machine-readable per-document JSON — one result file per PDF, all 16,971, zero failures. The verdict model, indicator definitions, and per-engine output are the standard PQ PDF scanner outputs; any document's full forensic record can be regenerated by re-scanning the corresponding PDF with the PDF Forensics Scanner.