PDF Malware Scanner
47 Forensic Engines. Free. Online.
Malware & exploits · document-integrity tampering · content-integrity & AI-ingestion attacks — graded across four forensic axes.
PDF files are one of the most common malware delivery vectors — used in phishing campaigns, APT attacks, and exploit kits for decades. Most scanners can only find threats they have already seen. PQ PDF's 47 forensic engines detect both known and unknown threats: behavioral sandbox execution catches what a PDF does regardless of whether it has a signature, ML anomaly detection flags structurally abnormal files even with no prior example, and differential parsing exposes hidden objects whether or not they match any known exploit pattern.
But this is full document forensics, not just a malware scanner. It also detects integrity tampering
— signature forgery, shadow documents, post-signing injection — and content-integrity / semantic-determinism attacks,
where a file deliberately shows one thing to a human and a different thing to a parser or AI/RAG pipeline:
value-vs-appearance (V/AP) divergence, font glyph remapping, OCR text-layer poisoning, and /Alt & /ActualText
prompt injection. The verdict is graded across all four forensic axes — so a document-integrity or AI-ingestion attack that carries
no malware at all is still surfaced. All free, with zero data retention.
No account. No upload limit. File deleted immediately after analysis.
PDF Threats You Can't See by Opening the File
A PDF that opens and looks normal can still be malicious. The PDF specification is complex enough that attack vectors are buried in layers of structure that no viewer surfaces to the reader. Emotet used password-protected PDF lures to deliver macro-laced Word droppers. MuddyWater (Iranian APT) relied on PDF first-stage attachments throughout 2022–2024 campaigns against government targets. APT28 (Fancy Bear) distributed CVE-2015-2545 EPS-exploit PDFs in spear-phishing operations against NATO targets. More recently, QakBot and IcedID campaigns shifted entirely to PDF delivery after Microsoft disabled Office macros by default. And beyond classic malware, a newer class of attack exploits the fact that a PDF can present different content to a human than to a parser or AI pipeline — no exploit code required. We measured how widespread this is across nearly 8,000 PDFs in our PDF security research (parser disagreement, V/AP divergence, reality drift, and AI-ingestion poisoning). These are the threat categories PQ PDF detects in malicious and deceptive PDFs:
eval(unescape(...)) shellcode loaders, heap spray sequences, and multi-layer obfuscated scripts that execute silently when the file opens in a vulnerable viewer.
/SubmitForm actions. Hidden fields collect data without user knowledge.
ByteRange gaps in the signature specification to hide malicious objects outside the signed byte range.
/Launch actions or /EmbeddedFile streams.
U+202E), attackers reverse filenames and URLs in a way that looks legitimate to a casual reader.
/V-vs-/AS mismatch, blank or image-based appearance streams, and /NeedAppearances staleness all let the visible document disagree with its machine-readable value. No exploit code, just pure document fraud.
/ToUnicode map), hidden OCR/text-layer poisoning on scanned pages, and /Alt & /ActualText prompt injection, all invisible to readers but consumed verbatim by LLMs. PQ PDF measures cross-extractor semantic determinism to catch what no malware scanner looks for.
Who Should Scan a PDF Before Opening It?
PDF is the most common malware delivery format in targeted attacks. According to Verizon's DBIR, email attachments account for over 90% of malware delivery — and PDF is consistently in the top two formats alongside Office macros. These are the people who scan before they open:
PQ PDF vs. VirusTotal, Hybrid Analysis, Adobe & MetaDefender
Antivirus engines answer one question: "Have we seen this before?" If a threat has been catalogued, they find it. If it hasn't — a zero-day exploit, a freshly obfuscated payload, a novel XFA FormCalc attack, a new JS shellcode loader — it passes straight through. PQ PDF answers a different question: "What does this PDF actually do?" Behavioral execution, ML anomaly detection, structural differential analysis, entropy profiling, and AI ingestion integrity checks find dangerous files whether or not any signature for them exists anywhere. Here is how the tools compare:
| Capability | PQ PDF (Free · No account) | VirusTotal (Free with account · Online) | Hybrid Analysis (Free, limited · CrowdStrike) | Adobe Acrobat Pro (~$23/month) | MetaDefender (OPSWAT · Paid) |
|---|---|---|---|---|---|
| AV signature scanning | ✓ ClamAV 700k+ sigs | ✓ 70+ AV engines | ✓ CrowdStrike + partners | ✗ No AV scanning | ✓ 30+ AV engines |
| YARA rules (PDF-specific) | ✓ 24 custom PDF YARA rules | ⚠ Community rules, generic | ⚠ Generic YARA rules | ✗ No | ⚠ Limited, generic |
| Behavioral sandbox execution | ✓ 6 PDF renderers, isolated namespaces, strace | ⚠ General sandbox — not PDF-specific renderers | ✓ Good dynamic analysis, general sandbox | ✗ No sandbox | ⚠ Basic sandbox, limited PDF renderer coverage |
| PDF structural analysis (XRef, objects, streams) | ✓ 15 static engines built for PDF structure | ✗ AV engines scan bytes, not PDF structure | ✗ No structural PDF analysis | ✗ No structural analysis | ✗ No structural PDF analysis |
| JavaScript AST deobfuscation | ✓ Full AST deobfuscator + Acrobat API emulation | ✗ No | ⚠ Runtime observation only | ✗ No | ✗ No |
| XFA FormCalc parsing | ✓ Dedicated XFA parser engine | ✗ No | ✗ No | ✗ No | ✗ No |
| Signature forgery / Shadow Attack detection | ✓ ByteRange forensics engine | ✗ No | ✗ No | ✗ No | ✗ No |
| AcroForm exfiltration / hidden field analysis | ✓ Full field tree, SubmitForm targets, JS triggers | ✗ No | ✗ No | ✗ No | ✗ No |
| Six-parser differential comparison | ✓ MuPDF, Poppler, GS, qpdf, pdfminer, pdf.js | ✗ No | ✗ No | ✗ No | ✗ No |
| Machine learning anomaly detection | ✓ IsolationForest + RandomForest + LightGBM + SHAP | ✗ No | ✗ No | ✗ No | ✗ No |
| OCR vs. text layer divergence (hidden text poisoning) | ✓ Tesseract OCR vs. embedded text layer — Jaccard similarity per page | ✗ No | ✗ No | ✗ No | ✗ No |
| Reading order & spatial ambiguity (AI ingestion) | ✓ Multi-column layout detection, parser extraction order conflicts | ✗ No | ✗ No | ✗ No | ✗ No |
| Accessibility tree injection (/Alt, /ActualText) | ✓ /StructTreeRoot forensics — prompt injection in semantic layer | ✗ No | ✗ No | ✗ No | ✗ No |
| MITRE ATT&CK technique mapping | ✓ Every indicator mapped to technique IDs | ⚠ Some detections, not systematic | ✓ Good ATT&CK coverage | ✗ No | ⚠ Limited mapping |
| AI forensic narrative report | ✓ Self-hosted Qwen 2.5 — structured verdict + findings | ✗ No | ✗ No | ✗ No | ✗ No |
| File privacy / zero data retention | ✓ Deleted immediately, no external calls, no hashes shared | ✗ Files stored; hashes and reports are community-shared | ✗ Files stored; can be set private (paid only) | ✓ Local processing, file stays on your machine | ⚠ Enterprise tier offers private scanning |
| Offline threat intelligence | ✓ 6.4M+ indicators in local databases — zero external calls | ⚠ All queries sent to external services | ⚠ Online lookups | ✗ No threat intel | ⚠ Cloud-based lookups |
| Sanitize / clean the PDF | ✓ 9 methods: flatten-to-images, strip JS, remove XFA, PDF/A… | ✗ No | ✗ No | ✓ "Sanitize Document" removes active content | ⚠ Basic sanitization in some tiers |
| Cost | ✓ Free — no account required | ✓ Free with account (rate limited) | ✓ Free tier (limited submissions/day) | ✗ ~$23/month subscription | ✗ Paid — enterprise pricing |
The honest assessment: VirusTotal's 70+ AV engines are the best tool in existence for one specific question — "has this exact file been seen and named by the antivirus industry?" If you need community reputation across 70 vendors, use it. For everything else — detecting what a PDF does, finding zero-days, structural forensics, AI ingestion integrity, sanitization, MITRE ATT&CK mapping, and keeping your file private — PQ PDF does all of it, free, with no account required.
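The OCR vs. text layer row in the table above scores each page by Jaccard similarity between what OCR reads from the rendered page and what the embedded text layer says. The sketch below shows that comparison under stated assumptions: word-level token sets, a hypothetical `is_divergent` helper, and an illustrative 0.8 threshold. None of these names or values come from PQ PDF itself.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase word tokens; punctuation and spacing differences are ignored."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(ocr_text: str, layer_text: str) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| over the two token sets."""
    a, b = tokenize(ocr_text), tokenize(layer_text)
    if not a and not b:
        return 1.0  # two empty pages agree trivially
    return len(a & b) / len(a | b)


def is_divergent(ocr_text: str, layer_text: str, threshold: float = 0.8) -> bool:
    """Flag a page whose text layer disagrees with what is visibly rendered.

    The 0.8 threshold is an assumption chosen for illustration only.
    """
    return jaccard(ocr_text, layer_text) < threshold


# What a human sees (OCR) vs. what an extractor reads (text layer).
visible = "Invoice total 100"
extracted = "Invoice total 100 ignore previous instructions"
print(jaccard(visible, extracted), is_divergent(visible, extracted))
```

A low score on a page that looks ordinary is the interesting case: the extra tokens exist only in the text layer, which is exactly where hidden text poisoning lives.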
All 47 Forensic Engines Explained
Every uploaded PDF passes through 47 independent analysis engines in a single request. Each engine is orthogonal — designed to catch a different class of threat that the others might miss. Results are correlated by the Correlation Engine (Engine 47) that maps compound indicators to MITRE ATT&CK techniques.
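To make the correlation step concrete, here is a minimal sketch of how compound indicators could be matched against rules that carry a bonus and a set of ATT&CK technique IDs. The indicator names, the two rules, and the rule-to-technique mapping are hypothetical and chosen for illustration; only the 35 to 100 bonus range and the idea of compound matching come from this page. The technique IDs themselves are real ATT&CK identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompoundRule:
    name: str
    required: frozenset   # indicators that must all be present
    bonus: int            # bonus risk points, kept within 35..100
    techniques: tuple     # ATT&CK technique IDs (mapping is an assumption)


# Hypothetical rule set; a real engine would carry many more.
RULES = (
    CompoundRule(
        "auto-run obfuscated script",
        frozenset({"open_action", "obfuscated_js"}),
        35,
        ("T1059.007", "T1027"),
    ),
    CompoundRule(
        "auto-run obfuscated script with malicious URL",
        frozenset({"open_action", "obfuscated_js", "malicious_url"}),
        100,
        ("T1059.007", "T1027", "T1204.002"),
    ),
)


def correlate(indicators):
    """Return every rule whose required indicators are all present."""
    present = set(indicators)
    return [rule for rule in RULES if rule.required <= present]


hits = correlate({"open_action", "obfuscated_js", "malicious_url"})
for rule in hits:
    print(rule.name, rule.bonus, rule.techniques)
```

Subset matching is what makes a single `/OpenAction` harmless in this sketch: no rule fires until the other indicators from independent engines are also present.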
/JavaScript, /JS, /Launch, /OpenAction, /AA, /EmbeddedFile, /RichMedia, /XFA, /AcroForm, heap spray constants, and shellcode sequences.
javascript:, data:, file://, and vbscript: schemes. All URLs are passed to the Threat Intelligence engine.
/Widths arrays (historic heap-overflow vector), non-embedded fonts that trigger external font lookups, and suspicious glyph name mappings. Analyses ToUnicode CMap tables (the mapping from glyph IDs to Unicode codepoints), detecting remaps where a visually rendered ASCII character (e.g. A) resolves to a non-ASCII Unicode codepoint in the extracted text layer. These remaps make visible text differ from extracted text, corrupting entity extraction, compliance scanning, and AI embeddings without any visible change to the rendered document.
/JBIG2Decode (CVE-2009-0658), /JBIG2Globals exploit parameters, oversized /Widths arrays, and codec parameter combinations associated with heap-overflow and memory corruption CVEs in Adobe Reader and Foxit.
/Info dictionary, XMP metadata, embedded XML, and attachment timestamps. Detects desynchronization where these sources report conflicting dates (e.g. /Info creation date 2024 while XMP reports 2019), a strong indicator of document manipulation, backdating, or incremental-update tampering. Also independently confirms XFA form presence, surfaces embedded attachment flags, and detects creator/producer strings from known exploit-generation toolkits.
qpdf --check to validate cross-reference tables and trailer dictionaries from a second, independent parser. Intentionally malformed XRef tables are a hallmark of exploit kits designed to hide objects from basic parsers.
%u9090, 0x0c0c), CVE-specific byte sequences (CVE-2009-0658, CVE-2024-41869, CVE-2024-45112), obfuscated JS loaders, XFA+script combos, Cobalt Strike beacon signatures, PowerShell encoded commands, and multi-layer encoder chains.
Pdf.Exploit.* family covering CVE-2009-0927, CVE-2009-4324, and the Exploit.PDF-JS category. A ClamAV match means the file is a confirmed known threat.
FF D8 FF, ZIP PK\x03\x04, PNG, GIF, Gzip, OLE, RIFF) appears in the bytes before the %PDF- header. ISO 32000 §7.5.2 permits arbitrary bytes before %PDF-; attackers exploit this to create JPEG+PDF or ZIP+PDF polyglots that bypass format-based content filters: email gateways see a JPEG, the PDF payload executes. (2) Stream-level polyglot: scans every PDF stream (raw and decompressed) for embedded executable magic: ZIP, Windows PE, Linux ELF, Mach-O, Java class, OLE/CFBF, RAR, 7-Zip, WebAssembly, HTML, and PostScript. Polyglot files smuggle dropper payloads past content-type security controls.
strace. Detects: network beaconing, anonymous executable memory (shellcode), shell spawning, filesystem escape attempts, and process bombs. Static analysis sees structure; this engine sees what the PDF does.
mutool), Poppler, Ghostscript, qpdf, pdfminer, and pdf.js, and cross-compares eight structural dimensions: page count, object count, PDF version, JavaScript presence, encryption status, AcroForm presence, embedded file count, and OpenAction. Discrepancies mean the file exploits parser differences to hide objects, the signature of broken-xref exploit staging and incremental-update attacks. See the empirical parser disagreement tests for 11 reproducible examples with live scanner output.
eval/unescape layers, string-split obfuscation, hexadecimal encoding, and multi-pass encoder chains. Surfaces the final deobfuscated payload for manual review.
ByteRange coverage integrity (per ISO 32000 §12.8.1, offsets are from the %PDF- header: o1 must be 0, both segments within file bounds, inner gap must contain only the /Contents blob, and o2+l2 must reach at least %%EOF); shadow document detection (unsigned bytes beyond the signed region containing execution vectors, CVE-2019-14980 class); full-save rewrite detection (when o2+l2 < %%EOF and the unsigned trailing region contains xref/trailer structure without execution vectors, a PDF viewer performed a complete file rewrite; this invalidates the cryptographic signature while the visual signature appearance remains, a pattern used by DocuSign and similar tools); /Contents blob structural validation (all-zero placeholders, sub-32-byte blobs, missing DER SEQUENCE header); SubFilter deprecation (SHA-1 collision risk, legacy no-chain variants, unknown formats); and weak digest algorithm detection (MD5/SHA-1 vulnerable to collision-assisted forgery).
/Launch actions that auto-execute embedded files on viewer interaction.
/A and /AA field events (focus, blur, keystroke, validate), hidden NoExport fields, password-type fields (credential harvesting), /SubmitForm exfiltration targets, and calculation-order chain exploitation. Also performs Value / Appearance Stream (V/AP) divergence detection: flags /NeedAppearances true (stale AP, critical when signed), checkbox/radio /V vs /AS key mismatch (rendering-independent), text/listbox/combobox field AP stream text extraction with font encoding remap (resolves /Encoding /Differences tables so a font mapping byte 0x31 to glyph /nine is decoded correctly before comparison to /V, catching the custom-font evasion path), image-based AP stream detection (AP renders via Do image XObject with no text operators, so /V is not visually verifiable without image recognition; flagged high severity), and blank AP streams that hide a signed value from the viewer.
%%EOF boundary and extracts per-revision metadata: author, producer, modification date, and changed/new/deleted object counts per revision. Detects author identity changes, execution vectors injected after original creation, and automated exploit staging via large final-revision object injections.
javascript: URI schemes, JavaScript triggers on click/hover, /Launch actions that spawn programs, /GoToR remote links, and /SubmitForm in annotation actions: attack vectors completely invisible to byte-level scanners.
/Names /JavaScript), /AA additional actions, /OpenAction type classification, and /Perms and /UR3 permission restriction exploitation. Deep DocMDP forensics: parses the /P permission level (1 = no changes, 2 = form fill-ins, 3 = annotations, the most exploitable), validates /TransformParams and /Reference structure, checks /SigFlags AppendOnly bit, detects incremental updates violating MDP constraints, and flags multiple /DocMDP entries (validator confusion attack). FieldMDP per-signature field lock (ISO 32000 §12.8.2.4, "FieldMDP"). Distinct from DocMDP, FieldMDP locks specific named form fields per approval signature and can be selectively permissive: detects Action=Include with empty /Fields (locks nothing despite appearing to certify), Action=Exclude with named fields (those fields are explicitly unlocked), and incremental updates that modify form fields after a FieldMDP signature is in place.
exec (dynamic execution), run (file execution), token (string-to-code eval), setpagedevice (PostScript-to-system bridge). Also detects malformed /ICCBased color profiles of anomalous size (the CVE-2021-21017 class of heap buffer overflows).
/ObjStm containers, invisible to byte scanners. This engine decompresses every object stream and re-scans the content for JavaScript, Launch actions, EmbeddedFile references, and high-entropy payloads (entropy >7.5 bits) suggesting hidden encrypted content.
/J#61vaScript → /JavaScript, whitespace-split token injection, and null-byte injection in name objects. These bypass simple pattern matchers while remaining valid to the PDF renderer, a classic evasion technique found in real-world exploit kits.
/OpenAction → /AA → field actions → annotation triggers → named actions. Visualises multi-hop execution chains where a seemingly innocent trigger leads through a chain of named actions to a final exploit, invisible when examining any single action in isolation.
U+202E) that reverse displayed filenames and URLs, and zero-width joiners used to split and reassemble malicious keywords.
/JBIG2Decode + /JBIG2Globals combinations (CVE-2009-0658 class), abnormally large /Columns and /Rows values in CCITT streams, and unusual parameter combinations in /CCITTFaxDecode and /DCTDecode filters associated with historic heap overflow exploits.
app, this, util) to reveal what JavaScript does without a real viewer, catching payload assembly that requires runtime evaluation to surface. doc.getField() returns the actual /V field values from the PDF (collected by Engine 25) so conditional exploitation chains are correctly evaluated and SUBMIT_FORM events carry real field content.
seac operator abuse (out-of-bounds glyph lookup), stack exhaustion via deeply nested subroutine calls, and arithmetic overflow patterns in CharString arithmetic, a class of font-engine exploits affecting all major PDF viewers.
/StructTreeRoot accessibility structure and inspects all semantic elements: /Alt image descriptions, /ActualText character-level overrides, heading hierarchy, logical structure labels, and figure captions. These channels are increasingly preferred by AI document processors because they improve chunking quality, making them high-value injection targets. Detects prompt injection in /Alt and /ActualText attributes: semantic content that exists only in the accessibility tree, fully invisible in rendering but completely visible to LLM ingestion pipelines, tagged-PDF extractors, and screen-reader-style processors. Flags payloads containing instruction-override patterns ("ignore prior instructions", "system:", "INST") that would execute in downstream AI processing.
/OpenAction + embedded JavaScript + obfuscated URL + non-embedded font is a dangerous combination. The Correlation Engine awards bonus risk points (35–100) for such combinations and maps each compound pattern to MITRE ATT&CK technique IDs.
How the Risk Score Works
Every finding is classified onto one of four forensic axes, and the headline verdict is graded by what each finding actually means — not by a single undifferentiated count. This is what lets a feature-rich but legitimate document (a government form with field JavaScript, an academic paper with hundreds of embedded-font objects) read as clean while a genuine attack still scores.
- Exploit: code execution, memory corruption, malware/dropper delivery, confirmed-malicious (AV/threat-intel) hits.
- Tampering: integrity/authenticity violations such as signature forgery, shadow documents, and post-signing injection.
- Deception: content/semantic-determinism manipulation, including value-vs-appearance (V/AP) divergence, font glyph remapping, OCR text-layer poisoning, /Alt & /ActualText prompt injection, and homoglyphs.
- Structural / informational: neutral modern-PDF capability & structure (object streams, incremental updates, capability presence). Reported for context; never counted as a threat.
The headline Threat Score = exploit + integrity-tampering, and it drives the verdict band below. Because this is a full forensics tool and not just a malware scanner, a confirmed deception finding (e.g. the displayed value ≠ the stored/extracted value) grades the verdict on its own axis even when the malware threat score is zero — a document that shows one thing to a human and a different thing to a parser or LLM is a first-class finding. Deception and structural scores are reported separately and never inflate the malware verdict.
Severity tiers: Critical (+50 pts) · High (+25 pts) · Medium (+10 pts) · Low (+3 pts) — capped at 3 occurrences per indicator.
The Correlation Engine adds +35 to +100 bonus points for dangerous indicator combinations — a single
/OpenAction is low-risk, but /OpenAction + obfuscated JavaScript + a known-malicious URL is definitively dangerous.
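The arithmetic described above can be sketched as follows. The severity points, the cap of three occurrences per indicator, and the "threat = exploit + tampering" rule are taken from this page; the function name, the finding tuple shape, the indicator names, and the decision to add the correlation bonus to the headline threat score are assumptions made for illustration.

```python
from collections import Counter, defaultdict

SEVERITY_POINTS = {"critical": 50, "high": 25, "medium": 10, "low": 3}
MAX_OCCURRENCES = 3  # each indicator contributes at most three times


def score_findings(findings, correlation_bonus=0):
    """Score (indicator, axis, severity) tuples onto the four forensic axes.

    Assumption: the correlation bonus is added to the headline threat score.
    """
    seen = Counter()
    axis_scores = defaultdict(int)
    for indicator, axis, severity in findings:
        seen[indicator] += 1
        if seen[indicator] > MAX_OCCURRENCES:
            continue  # occurrences beyond the cap add nothing
        axis_scores[axis] += SEVERITY_POINTS[severity]

    return {
        # Headline score: exploit + integrity tampering (+ assumed bonus).
        "threat_score": axis_scores["exploit"]
        + axis_scores["tampering"]
        + correlation_bonus,
        # Reported separately; they never inflate the malware verdict.
        "deception": axis_scores["deception"],
        "structural": axis_scores["structural"],
    }


findings = [("open_action", "exploit", "low")] * 5 + [
    ("obfuscated_js", "exploit", "high"),
    ("signature_forgery", "tampering", "critical"),
    ("vap_divergence", "deception", "high"),
    ("object_streams", "structural", "low"),
]
print(score_findings(findings, correlation_bonus=35))
```

In this example the five repeated `open_action` findings count only three times, so a noisy but low-severity indicator cannot dominate the score on its own.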
Your File Never Leaves Our Server
Uploading a potentially malicious PDF to an online scanner is only sensible if the scanner's security model is trustworthy. PQ PDF is designed around the principle that the scanner must be as safe to use as the file is dangerous.
Frequently Asked Questions
prlimit resource limits, AppArmor MAC policy (pqpdf-unshare profile), Linux user + mount + network + PID namespaces, and a private tmpfs mount. The behavioral sandbox adds another nested namespace with its own isolated network stack. The file is deleted immediately after analysis — no copy, hash, or metadata is retained.
CVE-specific exploits: CVE-2009-0658 (JBIG2 heap overflow), CVE-2024-41869 (use-after-free), CVE-2024-45112 (type confusion), and 20+ others via 24 custom YARA rules.
Form-based attacks: AcroForm credential harvesting, hidden fields, SubmitForm exfiltration, XFA FormCalc exploits.
Structural attacks: XRef Shadow Attacks (signature forgery), OCG layer cloaking, Unicode invisible text, polyglot files, PDF token obfuscation.
Behavioral threats: Anything that causes network beaconing, shell spawning, or executable memory allocation when the PDF is rendered — caught by the behavioral sandbox regardless of whether a static signature exists.
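As an illustration of the PDF token obfuscation listed above, the sketch below decodes `#xx` hex escapes in PDF name objects so that `/J#61vaScript` is compared as `/JavaScript`. It is a simplified, assumption-laden example: the watchlist, the function names, and the regular expression for name tokens are hypothetical, and it scans raw bytes only, without decompressing streams or handling the whitespace-split and null-byte variants.

```python
import re

# A name token: "/" followed by regular characters (no whitespace or delimiters).
_NAME = re.compile(rb"/[^\x00\t\n\x0c\r /<>\[\]\(\)\{\}%]+")
_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")

# Hypothetical watchlist of names worth flagging when they appear obfuscated.
WATCHLIST = {b"/JavaScript", b"/JS", b"/Launch", b"/OpenAction"}


def normalize_pdf_name(raw: bytes) -> bytes:
    """Decode #xx hex escapes, e.g. b"/J#61vaScript" -> b"/JavaScript"."""
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)


def find_obfuscated_names(data: bytes):
    """Return (raw, normalized) pairs for watchlisted names that were hex-escaped."""
    hits = []
    for raw in _NAME.findall(data):
        normalized = normalize_pdf_name(raw)
        if normalized != raw and normalized in WATCHLIST:
            hits.append((raw, normalized))
    return hits


sample = b"<< /S /J#61vaScript /JS (app.alert(1)) >>"
print(find_obfuscated_names(sample))
```

Only names that change under normalization are reported, so a plain `/JS` key is left to the ordinary static checks while the escaped variant is surfaced as an evasion attempt.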