Are the DOJ Epstein PDFs dangerous to open?

No. Across all 16,971 documents the scanner found no embedded exploit JavaScript, no malicious executables, and no known-malware hash matches. The release is clean of malware. Its risks are to interpretation — not to the system that opens it.

What does “reality drift” mean for the Epstein files?

3,159 documents (18.6%) contain pages where the hidden OCR text layer disagrees with the rendered image. A human reads the scanned page; a search index, e-discovery tool, or AI pipeline reads the text layer. When they diverge, automated analysis silently ingests text that is not what the document visibly says.

Was the metadata removed from the Epstein files?

Yes, from the active metadata. All 16,971 documents have empty Producer, Creator, Title and Author fields and carry no XMP metadata. The original /Info dictionaries were unlinked rather than erased, however: the authoring toolchain and a set of processing timestamps remain recoverable from orphaned objects (Findings 3 and 4).

The Epstein Files, Forensically: 16,971 PDFs

Executive Summary

This is an independent study, conducted entirely with our own forensic engine. We scanned every PDF in the DOJ "Epstein Files Transparency Act" disclosure (all 16,971 documents, all 47 engines, zero files skipped or failed) and report what the scanner found at the object level.

The corpus is clean of malware. That is the expected, correct result for a government document production, and it is not the story. The story is provenance — what these files are, how they were made, and what they carry that a human reader never sees:

Two distinct production pipelines partition the entire release, visible in the bytes: a PDF 1.3 header with compressed object streams and three %%EOF markers, versus a PDF 1.5 header with two %%EOF markers.
A dated production timeline — the documents were processed in distinct waves between December 2025 and February 2026, with a single 6,583-document batch on one day.
The corpus underwent a digital-to-image transformation workflow before release — OCR-captured, rasterized, re-saved — with image and structural characteristics far more uniform than native scanner output.
96.7% carry data the renderer never shows: invisible text layers, OCR text that diverges from the image, hidden compressed objects, orphaned objects — including the original metadata the active document appears to have erased.
18.6% present different content to a human than to a machine, and 61.9% make PDF parsers disagree about the document's own structure.

The release is safe to open. Its risk is to interpretation. Note the deliberate limit of these findings: structure can show that a digital-to-image transformation occurred; it cannot, by itself, establish whether the underlying content was altered. We report the observable technical facts and stop there.

Corpus & Method

Documents analyzed	16,971 PDFs — the complete released corpus
Failed / unscanned	0
Total pages	67,143 (mean 4.0, median 1, max 1,523)
Total PDF objects	1,864,023 (mean 110/file, max 42,581)
Total streams	437,548 (incl. 29,870 image XObjects)
Engine	PQ PDF forensic scanner — 47 independent passes per file
Verdict model	Multi-axis: threat (malware/exploit), deception (content-integrity), structural (neutral capability) — scored separately

Each document received the full suite: a six-parser differential (MuPDF, Poppler, Ghostscript, qpdf, pdfminer, pdf.js), a dynamic behavioral sandbox, YARA, ClamAV, ML anomaly detection, JavaScript AST analysis, signature and revision forensics, xref-graph integrity, OCR text-layer integrity, reading-order analysis, object-stream decompression, font forensics, accessibility-tree analysis, physical-entropy topology, and content-integrity / AI-ingestion detection. Verdicts separate capability from malice, so a complex-but-legitimate scan is never mistaken for an attack.

Verdict Distribution

The malware (threat) axis is effectively empty; the high-risk population is driven entirely by content-integrity (deception).

Verdict across 16,971 documents. High-risk = content-integrity, not malware.

Finding 1 — Two Distinct Production Pipelines

The release was not produced by a single process. The header PDF version partitions the corpus into two tightly-correlated populations, and the correlation is near-total:

Trait	Pipeline A (header 1.3)	Pipeline B (header 1.5)
Documents	10,718	5,828 (plus 425 files with a 1.4 header)
Compressed object streams (`/ObjStm`)	10,499	14
Mixed xref table + stream	10,499	14
Three `%%EOF` markers	10,499	14

Most of Pipeline A's documents (10,499 of 10,718) claim PDF 1.3 in their header but use object streams, a feature introduced in PDF 1.5. That inconsistency is what makes parsers disagree (Finding 6). Both populations share the same OCR toolchain (Finding 4); the split reflects two assembly passes or source batches, cleanly separable from the bytes alone.
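
The byte-level traits in the table can be approximated without a PDF library. The following is a minimal sketch, not the scanner's implementation: the function names `fingerprint` and `tally` are hypothetical, and a plain byte search can miscount when `%%EOF` happens to occur inside stream data.

```python
import re
from collections import Counter

HEADER_RE = re.compile(rb"%PDF-(\d\.\d)")

def fingerprint(data: bytes) -> tuple:
    """Return (header_version, eof_count, has_objstm) from raw PDF bytes."""
    m = HEADER_RE.search(data[:1024])
    version = m.group(1).decode("ascii") if m else None
    # Each save (original or incremental) normally ends with one %%EOF marker.
    eof_count = data.count(b"%%EOF")
    # The dictionary of an object stream is stored uncompressed, so its
    # /Type /ObjStm entry is visible to a raw byte search.
    has_objstm = re.search(rb"/Type\s*/ObjStm", data) is not None
    return (version, eof_count, has_objstm)

def tally(blobs) -> Counter:
    """Count how many files share each structural fingerprint."""
    return Counter(fingerprint(b) for b in blobs)
```

Grouping a corpus by this tuple is one way to make a two-population split like the one above visible: files from the same assembly pass tend to land on the same fingerprint.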

Finding 2 — Universal Post-Creation Modification

All 16,971 documents (100%) contain incremental updates: objects appended after the original save, and existing object IDs re-defined in later revisions. 16,306 (96.1%) retain orphaned objects in the cross-reference table. Incremental injections average 13 new objects and reach 3,049 in one revision; 2,323 files (13.7%) carry a large final-pass injection — the Bates-numbering stamp applied as a separate generation. A measured 22.5% show an entropy cliff at a revision boundary, the physical trace of content appended in a later pass.

No document is in its original single-save state. Every one carries two or three revision layers; the cross-reference graph confirms it, with 21,012 superseded /XRef and 10,513 /ObjStm objects left orphaned across the corpus.
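
As an illustration of what "existing object IDs re-defined in later revisions" looks like in the bytes, here is a rough sketch under simplifying assumptions: it scans raw bytes with a regular expression, so digit sequences inside binary streams can produce false matches, and the name `revision_summary` is hypothetical.

```python
import re
from collections import Counter

# Matches an indirect object header such as "12 0 obj".
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")

def revision_summary(data: bytes) -> dict:
    """Count %%EOF-terminated revisions and object numbers defined more than once."""
    revisions = len(re.findall(rb"%%EOF", data))
    definitions = Counter(int(m.group(1)) for m in OBJ_RE.finditer(data))
    redefined = sorted(num for num, count in definitions.items() if count > 1)
    return {"revisions": revisions, "redefined": redefined}
```

An object number that appears in `redefined` was written at least twice; in an incrementally updated file, the later definition supersedes the earlier one while the earlier bytes stay in the file.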

Finding 3 — A Dated Production Timeline

The revision metadata that survived processing carries timestamps, and they cluster into distinct dated processing waves rather than a single export. Among the 10,947 documents whose dates are readable, production ran from December 2025 into February 2026 and was dominated by one 6,583-document batch:

Final-revision processing dates (share of the 10,947 dated documents). Distinct waves indicate batch processing across multiple sessions.

Two observations follow. First, the production is batch-processed in waves, consistent with the two-pipeline split in Finding 1. Second, 14 documents retain a pre-2025 original creation date (2013-04-16) — provenance that survived the transformation in the orphaned metadata, a reminder that the underlying documents predate this release by years.
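
Grouping recovered timestamps into waves is a simple bucketing step. The sketch below assumes the dates arrive as PDF date strings of the form `D:YYYYMMDDHHmmSS...`; the helper names are hypothetical, and it ignores the shorter date forms the PDF format also permits.

```python
import re
from collections import Counter
from datetime import date

PDF_DATE_RE = re.compile(r"D:(\d{4})(\d{2})(\d{2})")

def pdf_day(value: str):
    """Parse the calendar day out of a PDF date string, or return None."""
    m = PDF_DATE_RE.match(value.strip())
    if not m:
        return None
    year, month, day = map(int, m.groups())
    try:
        return date(year, month, day)
    except ValueError:
        return None  # e.g. month 13 or day 32

def waves(date_strings):
    """Return (day, document_count) pairs, largest batch first."""
    days = Counter()
    for s in date_strings:
        day = pdf_day(s)
        if day is not None:
            days[day] += 1
    return days.most_common()
```

A single day that dominates the output of `waves` is what a large one-day batch looks like in this representation.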

Finding 4 — Metadata Stripped From View, but the Toolchain Is Recoverable

All 16,971 documents (100%) have empty Producer, Creator, Title and Author fields in their active metadata, and none carry XMP. On the surface the authoring fingerprints look erased.

They were not erased — only unlinked. The pipeline left the original /Info dictionary behind as an orphaned object inside a compressed object stream, unreferenced by the final trailer. Walking every cross-reference object recovers it. Across a sample spanning the DataSets, a majority of documents still carry a recoverable fingerprint — reconstructing the pipeline end to end:

Recovered field	Value	Stage
Creator	`OmniPage CSDK 21.1`	Commercial OCR / document-capture SDK — scanning + text layer
Producer	`Processing-CLI`	Batch assembly / processing
Producer (subset)	`pypdf`	Python PDF library — final manipulation on a minority of files

The release was assembled by an OmniPage OCR → Processing-CLI → (pypdf) pipeline. That provenance — which the visible metadata appears to remove — is recoverable from the orphaned object-stream dictionaries the pipeline left behind.
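
To show how an unlinked /Info dictionary can still be read, here is a minimal sketch using only the Python standard library. It is not the scanner's method: it inflates every Flate-compressed stream it can find and searches the result for literal-string Producer and Creator values. The function names are hypothetical, and hex-encoded or UTF-16 strings are not handled.

```python
import re
import zlib

FIELD_RE = re.compile(rb"/(Producer|Creator)\s*\(((?:[^()\\]|\\.)*)\)")

def inflate_streams(data: bytes):
    """Yield the decompressed body of every stream that inflates cleanly."""
    for m in re.finditer(rb"stream\r?\n(.*?)endstream", data, re.DOTALL):
        try:
            # decompressobj tolerates the end-of-line bytes before "endstream".
            yield zlib.decompressobj().decompress(m.group(1))
        except zlib.error:
            continue  # not Flate data (for example a raw image stream)

def recover_toolchain(data: bytes) -> dict:
    """Collect non-empty Producer/Creator strings hidden inside compressed streams."""
    found = {}
    for chunk in inflate_streams(data):
        for key, value in FIELD_RE.findall(chunk):
            if value:
                found.setdefault(key.decode("ascii"), set()).add(value.decode("latin-1"))
    return found
```

Because the search ignores the trailer and the cross-reference table entirely, it finds dictionaries whether or not the active document still points at them.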

Finding 5 — Data the Renderer Doesn't Show

Open one of these documents and you see a scanned page. The file contains far more. 16,414 documents (96.7%) carry data present in the file but never painted to screen — text, objects, and metadata a human reader never encounters but a machine does:

Non-rendered data classes, share of the 16,971-document corpus.

Invisible text (83.4%) — a text layer drawn in PDF rendering mode 3 (invisible): the OCR output laid behind the page image. Humans see the scan; machines read this layer.
OCR-divergent text (18.6%) — where that layer does not match the image (Finding 7).
Hidden objects (61.9%) — objects packed inside compressed /ObjStm streams, invisible to byte-level inspection without decompression.
Orphaned objects (96.1%) — superseded objects (including the original /Info from Finding 4) retained in the file.
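
In a page content stream, the text rendering mode is set by the `Tr` operator, and mode 3 means the glyphs are neither filled nor stroked. A rough check for an invisible text layer, assuming the content stream has already been decompressed, might look like this (function names are hypothetical):

```python
import re

# A single digit 0-7 followed by the Tr operator, e.g. "3 Tr".
TR_RE = re.compile(rb"(?<![\w.])([0-7])\s+Tr\b")
SHOW_TEXT_RE = re.compile(rb"\b(?:Tj|TJ)\b")

def render_modes(content: bytes) -> set:
    """Return the set of text rendering modes selected in a content stream."""
    return {int(m.group(1)) for m in TR_RE.finditer(content)}

def has_invisible_text(content: bytes) -> bool:
    """True when the stream selects mode 3 and also shows text."""
    return 3 in render_modes(content) and SHOW_TEXT_RE.search(content) is not None
```

This is a token-level heuristic rather than a full content-stream interpreter, so it does not track which text-showing operator runs under which mode.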

Finding 6 — Parser Disagreement

10,499 files (61.9%) make independent PDF parsers disagree about the document's own declared version: MuPDF and Poppler read PDF 1.5; qpdf reads 1.3 — from the same bytes. The cause is structural and is Pipeline A's signature: a 1.3 header over 1.5-era object streams.

Structural condition	Files	Share
Parser version mismatch (MuPDF/Poppler 1.5 vs qpdf 1.3)	10,499	61.9%
Mixed xref table and xref stream	10,513	61.9%
`/ObjStm` object streams	10,513	61.9%
Data carried in PDF comments between revisions	10,499	61.9%
Multiple `%%EOF` markers	10,513	61.9%
Entropy cliff at a revision boundary	3,814	22.5%

Page counts and object counts were otherwise consistent across parsers. The disagreement is specifically about version and cross-reference structure, the layer most relevant to how a tool decides to decode the file. A document that independent parsers read as different PDF versions has no single canonical machine reading, which is the structural root of the reality drift in Finding 7.
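
The header-versus-feature conflict can be expressed as a small check. Object streams and cross-reference streams were introduced in PDF 1.5, so a file that uses them implies at least that version regardless of its header. The sketch below is illustrative only, and its function names are hypothetical.

```python
import re

def header_version(data: bytes):
    """Version declared in the %PDF-x.y header, or None."""
    m = re.search(rb"%PDF-(\d)\.(\d)", data[:1024])
    return (int(m.group(1)), int(m.group(2))) if m else None

def implied_min_version(data: bytes):
    """Lowest version implied by the features this sketch knows about."""
    if re.search(rb"/Type\s*/(?:ObjStm|XRef)\b", data):
        return (1, 5)
    return None

def version_conflict(data: bytes) -> bool:
    """True when the header declares a version older than the features require."""
    declared = header_version(data)
    implied = implied_min_version(data)
    if declared is None or implied is None:
        return False
    return declared < implied
```

A parser that trusts the header and one that infers the version from the structures it encounters can both be behaving reasonably and still report different answers for such a file.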

Finding 7 — Reality Drift: Human-Visible vs Machine-Readable

3,159 documents (18.6%), across 4,243 individual pages, contain a hidden OCR text layer that does not match the rendered image — measured by word-overlap (Jaccard) between what the page shows and what its text layer says. This is the single most consequential finding for automated use of the release:

A human investigator reads the image.
A search index, e-discovery tool, or AI/RAG pipeline reads the text layer.
When they disagree, automated analysis silently ingests text that is not what the document visibly says — degrading search and poisoning any AI summarization or retrieval built on the corpus.
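
Word-overlap (Jaccard) similarity is the size of the intersection of two word sets divided by the size of their union. A minimal sketch follows; the tokenization rule and the 0.5 threshold are assumptions made for illustration, not the values the scanner uses, and `image_text` is assumed to come from re-OCR of the rendered page.

```python
import re

def words(text: str) -> set:
    """Lowercased alphanumeric tokens, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """|A intersect B| / |A union B| over the word sets of two texts."""
    wa, wb = words(a), words(b)
    if not wa and not wb:
        return 1.0  # two empty texts are treated as identical
    return len(wa & wb) / len(wa | wb)

def drifts(image_text: str, layer_text: str, threshold: float = 0.5) -> bool:
    """Flag a page whose text layer overlaps too little with the visible text."""
    return jaccard(image_text, layer_text) < threshold
```

A score of 1.0 means the two word sets are identical and 0.0 means they share no words.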

Content-integrity signals across the corpus.

Most affected documents drift on a single page, but the tail runs deep — some carry mismatches across a dozen pages:

Distribution of OCR-mismatch pages per affected document.

Finding 8 — The Hidden Text Layer Often Says More Than the Page Shows

Reality drift (Finding 7) establishes that the hidden text layer and the visible image disagree. We went further and examined how they disagree, across all 4,243 divergent pages. The pattern is consistent and consequential: the machine-readable text layer is frequently cleaner and more complete than the page a human sees, and it routinely exposes born-digital structure the rasterized image degrades or hides.

How the hidden text layer differs from the visible image, across 4,243 divergent pages.

53% expose email header structure. On more than half of the divergent pages the hidden text layer contains From: / To: / Subject: / Date: / Attachments: fields — these are born-digital emails that were rendered to page images. The header text lives in the machine-readable layer even where the visible image is degraded.
On 25% of the divergent pages, the hidden layer is materially richer than re-OCR of the visible image: the text a machine ingests contains more than the page presents to a reader.
Email-encoding residue survives. In a stratified text-layer sample, ~1.2% of documents carry quoted-printable (=20, =0A) and a similar ~1.2% carry base64 fragments in the extracted text — raw email transport encoding that should never reach a reader. It is a minor but unambiguous fingerprint of an email-to-image conversion that left machine-only artifacts behind in the text layer.
The divergence is not marginal: the median word-overlap between hidden text and visible image on these pages is 0.16 (1.0 would be identical).
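
The email signals described in the list above can be approximated with a few regular expressions over the extracted text layer. This is a hedged sketch: the patterns, the two-field rule, and the 40-character base64 run length are illustrative assumptions, and ordinary text such as "n=20" or a long URL can trigger false hits.

```python
import re

HEADER_RE = re.compile(r"^(From|To|Subject|Date|Attachments):", re.MULTILINE)
QP_RE = re.compile(r"=(?:20|0A)")           # quoted-printable space / line feed
B64_RE = re.compile(r"[A-Za-z0-9+/]{40,}={0,2}")

def email_signals(text: str) -> dict:
    """Report email-header fields and transport-encoding residue in extracted text."""
    fields = {m.group(1) for m in HEADER_RE.finditer(text)}
    return {
        "header_fields": fields,
        "looks_like_email": len(fields) >= 2,
        "quoted_printable": QP_RE.search(text) is not None,
        "base64": B64_RE.search(text) is not None,
    }
```

Running this on the text layer alone, without the page image, is enough to see the born-digital structure that survives only in the machine-readable layer.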

The forensic significance is direct. A person reviewing these documents reads the image. A search index, an e-discovery platform, or an AI/RAG pipeline reads the hidden text layer — and for thousands of pages that layer says something different, and often more, than the visible record. (We report the existence and shape of this divergence, not its specific contents.)

Finding 9 — A Digital-to-Image Transformation Workflow

The structural evidence converges on a single observable conclusion: the corpus underwent a digital-to-image transformation workflow before release — content rendered to page images, OCR-captured, and re-assembled, rather than emitted as native digital documents. The signals:

Digital-to-image workflow signatures. Skew measured on a rendered sample; the rest are corpus-wide.

Three independent measurements point the same way:

Image rasters are extraordinarily uniform. Across 29,870 image XObjects the mean entropy is 1.01 (median 1.02), with 49% below 1.0 — heavily-compressible, low-variation raster data unlike the texture and sensor noise of native paper scanning.
Skew is uniform and near-zero. On a rendered sample we measured page skew directly: 91% of pages sit within 0.5° of true horizontal, and the skew spread within a multi-page document averages 0.55°. Physical scanning of paper introduces per-page skew variation; this consistency is the signature of programmatic rendering, not a feed scanner.
No born-digital structure, redundant color encoding. 99.9% carry no structure tree, no document embeds a font or attachment, and 33% store grayscale content in an RGB color space — a re-encoding artifact of a rasterization pass.

Taken together — uniform low-entropy rasters, near-zero uniform skew, redundant color encoding, and an OCR text layer in place of real fonts — the signature is a render-to-image pipeline applied at scale. We note one measurement that did not hold: page borders are not uniformly flat (border pixel variance is non-trivial), so we make no claim about the absence of edge artifacts; the conclusion rests on the skew, entropy, and structural evidence above.
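
Two of these measurements are easy to state precisely. The sketch below computes Shannon entropy in bits per byte and tests whether interleaved 8-bit RGB pixels are actually gray (R = G = B). Both are assumptions about the measurement: the report does not specify the entropy unit or whether the scanner measures raw or decoded image data, and the function names are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

def is_gray_in_rgb(pixels: bytes, tolerance: int = 0) -> bool:
    """True when every interleaved RGB pixel has equal channels (within tolerance)."""
    if not pixels or len(pixels) % 3 != 0:
        return False
    triples = zip(pixels[0::3], pixels[1::3], pixels[2::3])
    return all(max(p) - min(p) <= tolerance for p in triples)
```

Under this definition, a flat single-value region scores 0.0 and uniformly random bytes approach 8.0, which gives a sense of how little variation a mean near 1 represents if the unit is bits per byte.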

A deliberate limit: this is an observation about process, not content. The structure shows a digital-to-image transformation occurred before release. It does not, on its own, establish that the underlying content was altered — that requires evidence beyond PDF structure. We report the technical finding and stop short of any claim about fabrication.

Finding 10 — The Absent Semantic Layer

Modern document workflows often leave behind a semantic layer — a tagged structure tree, headings, /Alt text — that humans never see but parsers and AI systems ingest. This release has almost none:

16,957 of 16,971 (99.9%) are untagged — no structure tree.
0 documents carry /Alt or /ActualText accessibility annotations.

The implication cuts two ways. There is no hidden semantic content injected for machines to misread — but there is also no reliable machine-readable structure at all. For a RAG or AI pipeline, the only machine-readable content is the OCR text layer — the same layer that diverges from the visible image in 18.6% of documents (Finding 7). The corpus is, by construction, a difficult and partly unreliable substrate for automated analysis.

Finding 11 — Malware: Clean

Across all 16,971 documents:

0 known-malware hash matches (offline threat intelligence, 6.4M+ indicators).
0 confirmed malicious executables, dropper chains, or weaponized JavaScript actions. No file carries a genuine exploit execution vector.
On the first full pass, 467 files tripped a heap-spray byte heuristic. We verified these as false positives: the patterns are flat regions and incidental byte sequences inside compressed scanned-page images, not shellcode. There are no %u heap-spray strings, no NOP sled, and no executable JavaScript action, only image data. The engine was corrected so these image-entropy patterns no longer escalate, and every figure here reflects the corrected engine.

The release is safe to open. Its risks are to interpretation, not to the system.

Methodology note: a false positive we found and fixed

This finding is itself a result of the study. On the first full pass, 467 documents tripped a heap-spray byte heuristic, and four scored as "JavaScript + Heapspray Shellcode" with a maximum threat score. Rather than report them, we pulled the highest-scoring files and inspected the bytes directly. The root cause was benign: the "heap-spray" pattern was a flat region inside a compressed scanned-page image, and the "/JS" was an incidental byte substring inside a Flate stream — no %u heap-spray string, no NOP sled, no JavaScript action dictionary.

We corrected three detector roots so image entropy can no longer masquerade as an exploit: the YARA NOP-sled literal was raised from 16 bytes to a 200-byte run (a real sled cannot survive Flate compression); the JavaScript test was changed from a raw substring to PDF action syntax (/S /JavaScript, /JS with a string/stream/indirect reference); and the raw heap-fill byte-run heuristics were gated on a genuine JavaScript spray context. The four flagged files dropped to a zero threat score, and every figure in this report reflects the corrected engine. We document the error because a forensic tool that hides its own false positives cannot be trusted to report anyone else's.
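
The difference between a raw substring test and a syntax-aware test is easiest to see side by side. The patterns below are an illustrative approximation of the idea, not the engine's actual rules, and the function names are hypothetical.

```python
import re

# /S /JavaScript, or /JS followed by a literal string, a hex string,
# or an indirect reference such as "12 0 R".
JS_ACTION_RE = re.compile(rb"/S\s*/JavaScript\b|/JS\s*(?:\(|<|\d+\s+\d+\s+R\b)")
# A run of at least 200 consecutive 0x90 bytes.
NOP_RUN_RE = re.compile(rb"\x90{200,}")

def naive_js_hit(data: bytes) -> bool:
    """The old-style test: any occurrence of the bytes '/JS'."""
    return b"/JS" in data

def js_action_hit(data: bytes) -> bool:
    """Syntax-aware test: '/JS' must look like a PDF action entry."""
    return JS_ACTION_RE.search(data) is not None

def nop_run_hit(data: bytes) -> bool:
    """Long-run test for a NOP sled."""
    return NOP_RUN_RE.search(data) is not None
```

Compressed image data that happens to contain the three bytes `/JS` trips the naive test but not the syntax-aware one, which is the class of false positive described above.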

Finding 12 — The Human-Visible Page Is Not the Canonical Document

The preceding findings are usually treated as separate defects. Read together, they make one structural argument that is the central result of this study:

For this corpus, the page a human sees is not the authoritative version of the document. The authoritative version is the machine-interpreted layer — and across thousands of documents that layer says something different from the page, or cannot be agreed upon at all.

Every content-integrity signal we measured is a different way the same thing is true:

Signal	Prevalence	How the visible page and the machine layer diverge
Invisible text layer (render mode 3)	83.4%	A whole text layer exists that is never painted to screen
OCR text vs image mismatch	18.6%	The hidden layer disagrees with the visible image (median overlap 0.16)
Hidden layer carries email headers	53% of drift pages	Machine reads From/To/Subject/Date the image degrades
Parser disagreement	61.9%	Different readers extract a structurally different document
Reading-order ambiguity	44.7%	The order a machine reads text in is not determinate
Digital-to-image transformation	corpus-wide	The "page" is a rendered raster, not the original document
Orphaned metadata / objects	96.1%	Superseded content (incl. original /Info) the reader never sees
Email-transport encoding residue	~1.2%	Raw base64 / quoted-printable artifacts only a machine ingests

This reframes what "reading" the Epstein files even means. A person opens a page image. A search index, an e-discovery platform, or an AI/RAG system ingests the text layer, the object graph, and the orphaned metadata: a different representation, one that for nearly one document in five does not match the visible page, and that independent PDF parsers do not fully agree on. There is no single canonical reading of this corpus: there is the human view and there is the machine view, and the gap between them is measurable, widespread, and, for automated analysis at scale, the most important property of the release.

The limit holds here too: that the machine layer diverges from the page is an observable technical fact. Whether any specific divergence reflects intent, OCR error, or conversion artifact requires evidence beyond PDF structure. We report the divergence and its scale, not a motive.

What This Means

Provenance is partly recoverable, partly gone. Active metadata is stripped, but the OmniPage→Processing-CLI toolchain, the dated processing waves, and a few original 2013 dates survive in orphaned objects.
The release is a render-to-image production, processed in batches by two pipelines — an observation about process, not a claim about content.
Nearly one document in five is semantically unreliable to machines: the OCR text diverges from the visible page, and there is no semantic layer to fall back on. Any party indexing, searching, or feeding this corpus to an AI system is, for thousands of documents, operating on text that does not match the visible record.

These are exactly the conditions a multi-axis forensic engine is built to surface — and they are invisible to malware-signature scanning, which would correctly report this entire corpus as clean.

Appendix A — Every Indicator, by Prevalence

All distinct forensic indicators raised across the corpus, as a share of 16,971 documents. Capability and structure indicators are reported on their own axes and do not drive the malware verdict.

Indicator	Files	Share
Empty active metadata	16,971	100%
Incremental update: new objects in later revisions	16,971	100%
Incremental object override: existing OID re-defined	16,971	100%
Orphan objects in xref	16,306	96.1%
Invisible text (rendering mode 3/7)	14,152	83.4%
Multiple %%EOF markers	10,513	61.9%
/ObjStm object streams	10,513	61.9%
Mixed xref table and xref stream	10,513	61.9%
Data in PDF comment between revisions	10,499	61.9%
Differential parsing: PDF version mismatch	10,499	61.9%
Multi-column layout with ambiguous reading order	7,587	44.7%
Grayscale image stored as RGB	5,673	33.4%
Entropy cliff at revision boundary	3,814	22.5%
OCR/text-layer mismatch (any page)	3,159	18.6%
Large incremental injection in final revision	2,323	13.7%
Overlapping text blocks	1,293	7.6%
High-entropy streams	984	5.8%
/Trans presentation dictionary	439	2.6%
/ASCII85Decode	425	2.5%

A long tail of indicators appears on fewer than 0.2% of files (isolated CVE-pattern byte matches, a single /XFA, a single calculation-order array). All were individually reviewed; none corresponded to a working exploit.
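
Each share in the table is the file count divided by the 16,971-document corpus, rounded to one decimal place. As a quick consistency check, the arithmetic can be recomputed for a selection of rows copied from the table above (the helper names are hypothetical):

```python
TOTAL = 16_971

# (files, published share in percent), copied from the table above.
ROWS = {
    "Orphan objects in xref": (16_306, 96.1),
    "Invisible text (rendering mode 3/7)": (14_152, 83.4),
    "/ObjStm object streams": (10_513, 61.9),
    "Differential parsing: PDF version mismatch": (10_499, 61.9),
    "Multi-column layout with ambiguous reading order": (7_587, 44.7),
    "Grayscale image stored as RGB": (5_673, 33.4),
    "Entropy cliff at revision boundary": (3_814, 22.5),
    "OCR/text-layer mismatch (any page)": (3_159, 18.6),
    "Large incremental injection in final revision": (2_323, 13.7),
}

def share(files: int, total: int = TOTAL) -> float:
    """Percentage of the corpus, rounded to one decimal place."""
    return round(100 * files / total, 1)

def mismatches(rows: dict) -> list:
    """Names of rows whose published share differs from the recomputed one."""
    return [name for name, (files, pct) in rows.items() if share(files) != pct]
```

For the rows listed here, `mismatches(ROWS)` comes back empty, so the published percentages agree with the file counts.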

Appendix B — Engine Coverage

All 47 analysis passes ran on every document. The provenance-relevant engines and what each contributed to this study:

Engine	Contribution here
OCR Text-Layer Integrity	Jaccard image-vs-text divergence — the 18.6% reality-drift cohort
Six-Parser Differential	Version/structure disagreement — 61.9%
Revision History + Trailer Chain	Generations, Bates-stamp passes, dated production waves
XRef Integrity Graph	Orphaned & superseded objects (incl. recoverable /Info)
Metadata Analyzer + Reconciliation	Active-metadata stripping, orphaned-toolchain recovery, date correlation
Physical Entropy Topology	Image-stream uniformity, revision-boundary entropy cliffs
Reading-Order Analysis	Multi-column / overlapping-text ambiguity (44.7%)
Unicode & Invisible Text	Render-mode-3 hidden text layers (83.4%)
Accessibility-Tree Forensics	Absent semantic layer (99.9% untagged)
Object-Stream Analysis	Hidden compressed objects (61.9%)
Sandbox · YARA · ClamAV · ML · Threat-Intel	Malware axis — clean across the corpus

Reproducibility

Every figure above derives from machine-readable per-document JSON — one result file per PDF, all 16,971, zero failures. The verdict model, indicator definitions, and per-engine output are the standard PQ PDF scanner outputs; any document's full forensic record can be regenerated by re-scanning the corresponding PDF with the PDF Forensics Scanner.
