🛡️
PDF Threat Scanner
Static byte analysis, object-graph traversal, dynamic syscall tracing, YARA, ClamAV, ML classification, differential parser comparison, polyglot detection, and AST deobfuscation — 20 independent analysis passes, each observing a different dimension of the same file.
Multi-Modal Detection Architecture
20 orthogonal passes · static · dynamic · forensic · probabilistic · comparative · semantic
20 engines
①
Structure Validator
Inspects PDF header position, counts %%EOF markers (exploit PDFs often carry multiple), audits cross-reference table depth, linearisation flags, and excessive filter chains used for obfuscation.
②
Pattern Scanner
45+ byte-level signatures: /JavaScript, /Launch, /OpenAction, /EmbeddedFile, /JBIG2Decode, /XFA, /RichMedia; NOP sleds (%u9090, %u4141); heap-spray fills; and dangerous JS APIs: eval(), unescape(), Collab.getIcon(), util.printf().
③
Stream Inspector
Decompresses every FlateDecode stream via PyMuPDF and re-scans the decompressed content, catching JavaScript and shellcode hidden inside compressed objects that raw-byte scanners miss. Calculates Shannon entropy per stream; values above 7.2 bits per byte flag encrypted or packed payloads.
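As a rough illustration of the entropy check, the sketch below inflates a single Flate-compressed stream with Python's standard `zlib` module (a stand-in for the PyMuPDF extraction the scanner uses) and compares its Shannon entropy with the 7.2 threshold. The names `shannon_entropy`, `inspect_stream`, and `ENTROPY_THRESHOLD` are hypothetical and not taken from the scanner's code.

```python
import math
import zlib
from collections import Counter

ENTROPY_THRESHOLD = 7.2  # bits per byte, the cut-off quoted in the description


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def inspect_stream(compressed: bytes) -> dict:
    """Inflate one Flate-compressed stream and report its size and entropy."""
    raw = zlib.decompress(compressed)
    entropy = shannon_entropy(raw)
    return {
        "size": len(raw),
        "entropy": entropy,
        "high_entropy": entropy > ENTROPY_THRESHOLD,
    }
```

Plain text and ordinary JavaScript usually score well below the threshold, while encrypted or packed data approaches 8 bits per byte, which is why a single cut-off can separate the two reasonably well.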
④
Object Analyzer
Walks the full cross-reference object graph, resolving indirect references and checking every object dictionary for dangerous action-type combinations (/S /Launch, /S /JavaScript, /RichMedia, /XFA). Reports exact xref numbers of suspicious objects.
⑤
URL Extractor
Extracts all HTTP/HTTPS URLs from raw bytes and decompressed streams, de-duplicates them, and lists them so you can assess every domain the PDF may attempt to contact: phone-home callbacks, tracking pixels, and C2 beacons.
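A minimal sketch of this kind of extraction, assuming the raw file bytes and any decompressed streams are already available as `bytes` objects. The regular expression and the `extract_urls` helper are illustrative assumptions, not the scanner's actual pattern.

```python
import re

# Stop at whitespace, quotes, and PDF delimiters such as ")" that close a /URI string.
URL_RE = re.compile(rb"https?://[^\s<>()\[\]\"']+", re.IGNORECASE)


def extract_urls(*blobs: bytes) -> list[str]:
    """Collect unique URLs from several byte blobs, keeping first-seen order."""
    seen: dict[str, None] = {}
    for blob in blobs:
        for match in URL_RE.finditer(blob):
            # latin-1 maps every byte to a character, so decoding cannot fail.
            url = match.group().decode("latin-1").rstrip(".,;")
            seen.setdefault(url, None)
    return list(seen)
```

Calling `extract_urls(raw_bytes, *decompressed_streams)` yields one de-duplicated list covering both views of the file.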
⑥
Metadata Analyzer
Inspects Producer and Creator fields for known exploit-tool strings (Metasploit, Canvas, Core Impact), flags missing metadata (a hallmark of crafted exploits), and scans XMP streams for embedded script references.
⑦
Font Analyzer
Checks every font object for /JBIG2Decode usage (the codec exploited in CVE-2009-0658 and CVE-2010-0188) and for abnormally large /Widths arrays used in historic heap-overflow attacks against Acrobat's font engine.
⑧
CVE Pattern Matcher
Matches against known exploit signatures: CVE-2009-0658, CVE-2009-0927, CVE-2009-4324, CVE-2008-2992, CVE-2007-5659, CVE-2007-5020, CVE-2010-0188, and binary heap-spray NOP sleds (0x0C / 0x0D fill patterns).
⑨
Structural Statistics
Collects page count, object count, encryption status, embedded file count, form fields, annotations, and link count via PyMuPDF's document model, providing the full structural picture for the summary dashboard.
⑩
ExifTool Metadata Forensics
Runs exiftool for deep metadata extraction that complements PyMuPDF's view. Detects exploit-kit fingerprints in Creator/Producer/Author fields (Metasploit, msfvenom, Canvas, Core Impact), independently confirms XFA forms, and surfaces embedded attachment flags visible only via EXIF/XMP metadata layers. Results feed into the Correlation Engine.
⑪
qpdf Structural Integrity
Runs qpdf --check to validate cross-reference tables, trailer dictionaries, and overall document structure. Intentionally malformed or "damaged" PDFs, where xref tables are deliberately broken, are a hallmark of exploit kits designed to hide objects from basic parsers while still rendering in vulnerable viewers. Results feed into the Correlation Engine.
⑫
YARA Rule Engine
Applies 11 custom YARA rules targeting PDF-specific attack signatures: classic heap-spray patterns (%u9090, 0x0c0c fills, binary NOP sleds), CVE-specific byte combinations (CVE-2009-0658, CVE-2008-2992), JavaScript shellcode loaders (eval+unescape), hex-obfuscated keywords, auto-open executable combos, XFA+script exploits, and multi-layer encoder chains. Provides byte-level corroboration independent of PyMuPDF parsing. Results feed into the Correlation Engine.
⑬
PeePDF Deep Analysis
Independent PDF analysis using the PeePDF framework, a separate parser that builds its own object tree without relying on PyMuPDF. Identifies vulnerability patterns with exact object IDs, locates suspicious elements (/Launch, unescape, getIcon, printf, eval, /EmbeddedFile), and reports JavaScript object locations. Where the other engines inspect bytes and structure, PeePDF provides a full second-opinion parse. Results feed into the Correlation Engine.
⑭
Dynamic Behavioral Sandbox
The only engine that actually executes the PDF. Runs the file through three independent renderers (Ghostscript, a full PostScript interpreter; MuPDF; and Poppler), each inside a Linux process namespace with its own isolated network stack, PID space, and mount point. All syscalls are captured by strace. Detects: outbound network connections (a connection attempt from inside an isolated namespace is a strong indicator of malice), anonymous executable memory mappings (the runtime signature of shellcode), unauthorised process spawning (shell execution triggered by an exploit), filesystem escape attempts, and excessive fork/clone calls (process bombs). Static analysis sees the PDF's structure; this engine sees what it does.
⑮
Correlation Engine
Cross-references the findings of all 14 engines above. Individual indicators are scored conservatively; this engine identifies dangerous combinations that are far more serious than the sum of their parts. Classic patterns: /OpenAction+JS auto-exec, JBIG2+JS CVE confirmation, eval()+unescape() shellcode chains, heapspray+JS delivery. Cross-engine patterns: YARA heap-spray + JS, PeePDF vuln + JS, qpdf structural damage + active content, ExifTool exploit-kit fingerprint + execution. Dynamic sandbox patterns: live network beacon + JS, runtime shellcode + heap spray, dynamic shell spawn + trigger mechanism, dynamic exploitation + PeePDF confirmation. 30+ compound patterns with weighted bonus scoring.
⑯
ClamAV Signature Scanner
Runs the local ClamAV daemon against the file, matching 700,000+ signatures including the Pdf.Exploit.* family (CVE-2009-0927, CVE-2009-4324, Exploit.PDF-JS, and many more). Where the 15 engines above use heuristics and structural analysis to catch zero-days, ClamAV provides authoritative signature intelligence for known samples. A match here means the file matches the signature of a known threat.
⑰
ML Intelligence Engine
Sits above all 16 engines and applies three layers of intelligence. Bayesian contextual scoring adjusts risk based on document origin: a JasperReports or Microsoft Word creator is dampened; a Metasploit/msfvenom creator is amplified. IsolationForest provides unsupervised anomaly detection from the very first scan, flagging documents whose 38-feature vector is statistically unusual compared to the scanned population. A RandomForest classifier activates once 50+ labeled samples have accumulated, providing probability-based maliciousness scoring trained on real scan data. Explainable ML reports the top feature contributions for each scan. Every scan is persisted to PostgreSQL. User feedback (false positive / confirmed threat) feeds directly into retraining.
⑱
Differential Parsing
Runs three independent PDF parsers, MuPDF (mutool), Poppler (pdfinfo), and Ghostscript, against the same file and compares their structural views: page count, object count, and JavaScript presence. Malicious PDFs often abuse broken xref tables, hidden incremental updates, and duplicate object numbers so that one parser recovers hidden objects that another ignores entirely. A mismatch between parsers is a direct signal that the file may be exploiting parser-specific quirks, a common mechanism for evading detection. This technique is used by malware sandboxes and browser security teams (Chromium vs Firefox DOM comparison) for exactly the same reason.
⑲
Polyglot Detection
Scans every PDF stream, both raw and decompressed, for file magic byte signatures: PK\x03\x04 (ZIP), MZ (Windows PE executable), \x7fELF (Linux ELF binary), \xcf\xfa\xed\xfe (Mach-O), \xca\xfe\xba\xbe (Java class), \xd0\xcf\x11\xe0 (OLE/CFBF, Office binary), RAR, 7-Zip, and embedded PostScript. Polyglot files simultaneously satisfy the format rules of two or more file types: the file appears as a valid PDF to viewers while also containing a self-extracting archive or executable dropper that activates when saved to disk and opened by a compatible application. This technique is used to smuggle payloads past content-type-based security controls that only inspect the file header.
⑳
JS AST Deobfuscation
Extracts all JavaScript fragments from the PDF, both inline /JS strings and JavaScript-bearing compressed streams, then parses each with the Acorn parser to build a full Abstract Syntax Tree. Instead of scanning text for keywords, the AST walker detects structure and intent: eval() and execScript() dynamic execution entry points; String.fromCharCode() arrays that assemble shellcode from integer sequences at runtime; unescape() decode chains that deliver encoded payloads in two stages; numeric arrays of 150+ elements (the structural signature of heap spray); and new Function(string) dynamic code construction. Obfuscated variants of these patterns routinely evade regex-based scanners but remain visible at the AST level.
🧹 Sanitize Options
Flatten to Images
Renders every page to a 144 dpi raster image and rebuilds as a new PDF. Destroys all JavaScript, launch actions, embedded files, XFA forms, and object streams with absolute certainty. Text becomes non-searchable.
Maximum Safety
Strip Active Content
Re-processes the file through Ghostscript with -dSAFER, rewriting it without JavaScript, launch actions, embedded files, and rich media while preserving searchable text and document structure. Cannot guarantee removal of zero-day or heavily obfuscated structures.
Preserves Text
Drop your PDF here or click to browse
20 detection engines · ML classification · Differential parsing · Polyglot detection · YARA · Behavioral sandbox · Entropy analysis
① Structure
② Patterns
③ Streams
④ Objects
⑤ URLs
⑥ Metadata
⑦ Fonts
⑧ CVEs
⑨ Statistics
⑩ ExifTool
⑪ qpdf
⑫ YARA
⑬ PeePDF
⑭ Sandbox
⑮ Correlation
⑯ ClamAV
⑰ ML
⑱ Diff Parse
⑲ Polyglot
⑳ JS AST
🧹 Sanitize This PDF
Choose how to clean the file. Both methods produce a new PDF; the original is never modified.