# PDF Forensics at Scale — reproducibility bundle

Companion data and code for https://pqpdf.com/pdf-forensics-at-scale.php — a forensic
measurement of a 47-module PDF scanner across two corpora totalling ~7,850 real PDFs:
a 1,572-document curated detection set (eight domains) and a 6,281-document GovDocs1
real-world benign control used to measure and harden the false-positive rate.

## Files — curated detection set (1,572)
- `per-file-results.jsonl` — one JSON record per file (all 1,572): multi-axis scores
  (threat / deception / structural), verdict band, and per-domain detection flags
  (V/AP, parser-disagreement, reading-order, accessibility, signed, JavaScript, XFA).
- `malware-detection-breakdown.csv` — per-sample result keyed by SHA-256: threat score,
  verdict band and driver for every one of the 400 malware samples (independently verifiable
  by retrieving the sample from MalwareBazaar by its hash and re-scanning).
- `malware-sha256.txt` — SHA-256 of the 400 live malware samples. The binaries are not
  redistributed; each is retrievable from MalwareBazaar (https://bazaar.abuse.ch/) by hash.
- `benign-corpus-manifest.txt` — exact file list of the benign and adversarial corpus,
  grouped by source (Mozilla pdf.js, corkami, PDF Association PDF 2.0, GovInfo, arXiv,
  real IRS/agency forms, and the hand-crafted V/AP and hidden-JS fixtures).
- `malware-per-engine-contribution.csv` — for each engine, the number of the 398
  fully-analysed malware samples on which it raised a high/critical indicator (note: the
  threat-intelligence hash lookup fired on 100%, evidencing the known-sample nature of the
  corpus).
- `parser-disagreement-gov-forms.json` / `.csv` — the per-form parser divergences for all
  46 government forms (every disagreement type with the per-parser values), so the
  '46/46 government forms disagree' result can be inspected and reproduced sample by sample.
- `scan_harness.py` — resumable, resource-bounded batch scan harness.
- `scan_governor.py` — adaptive resource governor (per-scan memory = 80% of available
  with a fixed OS reserve; concurrency scaled to free memory; progress-stall watchdog).
- `analyze.py` — discrepancy / per-domain aggregation over the results JSONL.
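
The memory policy described for `scan_governor.py` can be illustrated with a minimal sketch. This is not the bundled implementation: the function names, the 2 GiB reserve, the worker ceiling, and the per-worker footprint are illustrative assumptions, and the reserve is assumed to be subtracted before the 80% factor is applied. The progress-stall watchdog is left out.

```python
GIB = 1024 ** 3


def per_scan_memory_limit(available_bytes: int, os_reserve_bytes: int = 2 * GIB,
                          fraction: float = 0.8) -> int:
    """Memory cap for one scan: a fraction of what is left after the OS reserve.

    Assumption: the reserve is subtracted first; the real governor may differ.
    """
    usable = max(available_bytes - os_reserve_bytes, 0)
    return int(usable * fraction)


def worker_count(free_bytes: int, per_worker_bytes: int, max_workers: int = 8) -> int:
    """Scale concurrency to free memory, always keeping at least one worker."""
    if per_worker_bytes <= 0:
        raise ValueError("per_worker_bytes must be positive")
    return max(1, min(max_workers, free_bytes // per_worker_bytes))
```

Clamping to at least one worker keeps a batch moving slowly on a constrained machine rather than stopping outright.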

## Files — GovDocs1 real-world benign control (6,281)
- `govdocs1-benign-control-results.jsonl` — one record per GovDocs1 file (6,281 in total,
  6,073 with complete analysis). Each record gives the public Digital Corpora path
  (`govdocs1/NNN/NNNNNN.pdf`), size, the scanner's verdict band and driver, the multi-axis
  scores, `has_exec_vector`, and the high/critical engines. Re-fetch any path from Digital
  Corpora and re-scan to verify.
- `govdocs1-verdict-summary.csv` — the scanner's verdict-band distribution over the control
  (clean/low vs. suspicious-or-higher) and the breakdown of the flagged set into real malware
  carried by the web-crawled corpus, genuine content-integrity findings, and residual false
  positives (reported as the false-positive rate on genuinely benign documents).

The benign-control PDFs are the public GovDocs1 corpus (Digital Corpora,
https://digitalcorpora.org/corpora/files/govdocs1/) — not redistributed here, but every
file is retrievable by its `govdocs1/NNN/NNNNNN.pdf` path from the published zipfiles. The
scanner grades 96.7% of the control clean or low and flags 3.3% suspicious-or-higher; after
removing the real malware the corpus contains, the false-positive rate on genuinely-benign
documents is 0.34%.
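
The clean/low versus suspicious-or-higher split can be recomputed from `govdocs1-benign-control-results.jsonl` in a few lines. A minimal sketch, assuming each record stores its verdict band under a key named `verdict` and that the benign bands are spelled `clean` and `low`; both are guesses about the schema, so inspect a record and adjust. Whether records without a complete analysis belong in the denominator is not stated here, so the sketch counts every record it is given.

```python
import json
from collections import Counter
from typing import Iterable


def band_shares(lines: Iterable[str], key: str = "verdict",
                benign_bands: frozenset = frozenset({"clean", "low"})) -> dict:
    """Return the percentage of records in benign bands vs. everything else.

    `key` and `benign_bands` are assumed names, not the bundle's documented schema.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL
        band = json.loads(line).get(key)
        counts["benign" if band in benign_bands else "flagged"] += 1
    total = sum(counts.values())
    if total == 0:
        return {"benign": 0.0, "flagged": 0.0}
    return {name: 100.0 * counts[name] / total for name in ("benign", "flagged")}
```

Passing an open file handle as `lines` streams the records one at a time, so the whole file never has to be held in memory.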

## Files — novel-malware detection-by-analysis test
- `novel-malware-sha256.txt` — SHA-256 of 402 MalwareBazaar PDF samples outside our curated
  corpus. Binaries are not redistributed; each is retrievable from MalwareBazaar by hash.
- `novel-malware-analysis-detection.csv` — per sample: whether it carries an execution
  vector, and whether the scanner's NON-hash ANALYSIS engines (YARA, sandbox, JS-AST,
  exploit/structure, correlation) raised a high/critical (or any) indicator — i.e. detection
  WITHOUT the hash/reputation databases. Result: analysis alone caught 93% of exec-bearing
  malware; for social-engineering PDFs (no exec vector, ~90% of the set) it raised a
  high-confidence indicator on 16% and any signal on 63%. In production these
  social-engineering PDFs are caught by hash/URL reputation.
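
Because the binaries are retrieved by hash rather than shipped, it helps to confirm that a downloaded sample really is the one listed before scanning it. A minimal sketch using only the standard library; the helper names are illustrative, and the hash lists are assumed to hold one hex digest per line.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large samples need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_hash_list(path: Path) -> set:
    """Read a hash list, assuming one hex digest per line (format is an assumption)."""
    return {line.strip().lower() for line in path.read_text().splitlines() if line.strip()}


def is_listed(sample: Path, hash_list: Path) -> bool:
    """True if the sample's SHA-256 appears in the hash list."""
    return sha256_of(sample) in load_hash_list(hash_list)
```

For example, `is_listed(Path("sample.pdf"), Path("novel-malware-sha256.txt"))` checks one download, where `sample.pdf` is a placeholder name.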

## Reproducing
Run each PDF through the scanner (public tool at /tools/scan.php) with the harness or
governor, producing one result JSON per file; `analyze.py` aggregates the per-domain
numbers reported in the article. The scoring weights and verdict bands are documented in
the article's methodology section.
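
The "one result JSON per file" workflow can be made resumable by skipping any PDF whose result already exists. The sketch below shows that idea only and is not the code in `scan_harness.py`: `scan_one` is a hypothetical callable standing in for whatever invokes the scanner, and naming results after the PDF's stem is an assumption that holds only when stems are unique.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def scan_all(pdfs: Iterable[Path], out_dir: Path,
             scan_one: Callable[[Path], dict]) -> int:
    """Scan each PDF once, writing <stem>.json; return how many were scanned now.

    `scan_one` is a hypothetical stand-in for whatever invokes the scanner.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    scanned = 0
    for pdf in pdfs:
        result_path = out_dir / f"{pdf.stem}.json"
        if result_path.exists():
            continue  # finished in an earlier run; skipping is what makes this resumable
        result = scan_one(pdf)
        # Write to a temp file, then rename, so a crash never leaves a partial result.
        tmp_path = result_path.with_name(result_path.name + ".tmp")
        tmp_path.write_text(json.dumps(result), encoding="utf-8")
        tmp_path.replace(result_path)
        scanned += 1
    return scanned
```

Re-running the same call after an interruption picks up where the previous run stopped, since completed files are detected by their result JSON.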
