- Magika is Google's open-source file-type detector.
- A 1MB deep-learning model, rewritten in Rust, classifies 200+ file types in ~5ms on a single CPU — and it already runs inside Gmail, Drive, Safe Browsing, and VirusTotal.
Attackers rename payload.exe to resume.pdf and hope your scanner trusts the extension. Google's answer: don't. Magika is an AI file-type detector that reads the actual bytes of a file, decides what it really is, and does it in about 5 milliseconds on a single CPU — with a deep-learning model that weighs only ~1MB.
Originally open-sourced in Feb 2024, Magika just hit its first stable release (1.0, Nov 6, 2025) with a full Rust rewrite, a native CLI, and support for 200+ content types.
TL;DR
- Open-source, Apache 2.0. Model ~1MB. Trained on ~100M files.
- ~99% average precision and recall across 200+ file types; ~5ms/file on a single CPU.
- Core engine rewritten in Rust → ~1,000 files/sec on a MacBook Pro M4.
- Runs in production at Google across Gmail, Drive, and Safe Browsing (hundreds of billions of files/week). Also pre-filters VirusTotal's Code Insight.
- Install via `pipx install magika`, `brew`, a Rust crate, or npm. Web demo runs locally in the browser.
What's new in 1.0
The alpha from 2024 was a Python-first tool with a TensorFlow.js model. Magika 1.0 is a different beast:
- Rust engine from the ground up. The native CLI extracts feature vectors in Rust and runs ML inference in C++ via the ONNX Runtime (`ort` crate). Tokio drives async parallelism.
- 200+ content types, up from ~100. Newly added: Jupyter notebooks, PyTorch models, ONNX, Parquet, HDF5, Swift, Kotlin, TypeScript, Dart, Zig, WebAssembly, Dockerfile, TOML, HashiCorp HCL, Bazel, YARA rules, SQLite, AutoCAD, Photoshop PSD, WOFF2.
- Granular distinctions old tools couldn't make: JSON vs JSONL, C vs C++, JavaScript vs TypeScript, TSV vs CSV, Apple binary plists vs XML plists.
- Revamped Python and TypeScript modules, plus a GoLang binding in progress.
Why it matters
File-type detection has been stuck on libmagic and the file utility for 50+ years. Both rely on handcrafted magic-byte heuristics, which are fragile against evolving formats and easily fooled by adversarially crafted payloads. For security pipelines, "trust the extension" is a known attack surface: anyone can rename a file.
Magika flips the assumption. Extensions are ignored. Only a bounded subset of the actual bytes feeds a small CNN trained to recognize the structural fingerprint of each format. Because the classifier is learned rather than hand-written, it generalizes, and it's cheap enough to drop into hot paths like email ingestion and upload handlers.
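To make the renamed-file problem concrete, here is a minimal sketch of the handcrafted approach: a tiny magic-byte table plus a check that flags files whose content disagrees with their extension. The table, the function names `sniff` and `extension_lies`, and the labels are all hypothetical and chosen for illustration; this is neither libmagic's rule set nor Magika's method.

```python
from pathlib import PurePath

# Hypothetical, deliberately tiny signature table: leading bytes -> label.
MAGIC_SIGNATURES = {
    b"MZ": "exe",           # DOS/PE executables start with "MZ"
    b"%PDF-": "pdf",        # PDF files start with "%PDF-"
    b"PK\x03\x04": "zip",   # ZIP local file header
}


def sniff(content: bytes) -> str:
    """Return a label based on the leading bytes, or "unknown" if no rule matches."""
    for magic, label in MAGIC_SIGNATURES.items():
        if content.startswith(magic):
            return label
    return "unknown"


def extension_lies(filename: str, content: bytes) -> bool:
    """True when the content is recognized and disagrees with the file extension."""
    claimed = PurePath(filename).suffix.lower().lstrip(".")
    actual = sniff(content)
    return actual != "unknown" and actual != claimed
```

Calling `extension_lies("resume.pdf", b"MZ\x90\x00")` returns `True`, which catches the renamed executable. The same table also flags a legitimate `.docx` file, because that format is a ZIP container, and it says nothing useful about plain-text formats such as source code. Gaps like these are what a learned classifier is meant to close.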
Technical facts
| Property | Value |
|---|---|
| Model size | ~1MB (Keras → ONNX) |
| Training data | ~100M files, ~3TB uncompressed (streamed via SedPack) |
| File types | 200+ (binary + textual) |
| Accuracy | ~99% average precision & recall |
| Inference latency | ~5ms per file, single CPU |
| Throughput | Hundreds/sec single-core, ~1,000/sec on M4 multi-core |
| Hardware | No GPU required |
| License | Apache 2.0 |
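As a rough sanity check on the latency and throughput rows, the arithmetic can be sketched as follows. It assumes each worker classifies one file at a time and that throughput scales linearly with workers, which is an idealization; the function names are hypothetical.

```python
import math


def files_per_second(latency_ms: float, workers: int = 1) -> float:
    """Idealized throughput: each worker handles one file per `latency_ms`."""
    if latency_ms <= 0 or workers < 1:
        raise ValueError("latency_ms must be positive and workers at least 1")
    return workers * 1000.0 / latency_ms


def workers_needed(target_per_second: float, latency_ms: float) -> int:
    """Smallest worker count that reaches the target under linear scaling."""
    if target_per_second <= 0 or latency_ms <= 0:
        raise ValueError("target_per_second and latency_ms must be positive")
    return math.ceil(target_per_second * latency_ms / 1000.0)
```

At ~5ms per file, `files_per_second(5.0)` gives 200 files/sec for one worker, consistent with "hundreds/sec single-core", and `workers_needed(1000, 5.0)` gives 5, so ~1,000 files/sec corresponds to roughly five cores' worth of perfectly parallel work.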
The trick behind the near-constant latency: Magika only reads a limited byte window of each file, not the whole thing. A 50MB archive classifies as fast as a 5KB snippet. Google also used Gemini to synthesize training samples for rare/legacy formats where real-world data was scarce.
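The bounded-window idea can be sketched in a few lines. This is an illustration under stated assumptions, not Magika's actual feature extraction: the helper name `read_windows`, the 1024-byte window, and the choice of sampling the start and the end of the file are all assumptions made for the example.

```python
import io
import os
from typing import BinaryIO, Tuple

WINDOW = 1024  # assumed window size in bytes, chosen only for illustration


def read_windows(stream: BinaryIO, window: int = WINDOW) -> Tuple[bytes, bytes]:
    """Return up to `window` bytes from the start and from the end of a seekable stream.

    At most 2 * window bytes are read, regardless of the total file size.
    """
    size = stream.seek(0, os.SEEK_END)   # seek() returns the new absolute position
    stream.seek(0)
    head = stream.read(window)
    stream.seek(max(size - window, 0))   # clamp so short files do not seek before 0
    tail = stream.read(window)
    return head, tail


# Usage: a large in-memory "file" is sampled with the same number of reads as a tiny one.
big = io.BytesIO(b"A" * 10 + b"x" * 50_000 + b"Z" * 10)
head, tail = read_windows(big, window=10)
```

Because only the two windows are read, the I/O and the model input stay the same size whether the file is 5KB or 50MB, which is why latency stays roughly flat.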
Comparison vs libmagic
| Metric | libmagic / file | Magika 1.0 |
|---|---|---|
| Approach | Handcrafted heuristics | Deep-learning classifier |
| Textual/code formats | Weak (often confuses JS/TS, C/C++) | Strong — granular per-language |
| Accuracy (Google 1M benchmark) | Baseline | ~20% better overall |
| Production uplift at Google | Baseline | +50% accuracy vs prior rule system |
| Adversarial robustness | Fragile | Learned from ~100M samples |
| Latency | Milliseconds | ~5ms (comparable) |
Real-world use cases
At Google, Magika routes incoming files across Gmail attachments, Drive uploads, and Safe Browsing downloads to the right security scanners. Switching from handcrafted rules to Magika boosted accuracy 50%, let Google scan 11% more files with specialized malicious-document AI scanners, and cut the "unidentified file" bucket down to 3%.
Outside Google:
- VirusTotal uses Magika as a pre-filter before running files through its generative-AI Code Insight malware analyzer.
- abuse.ch integrates it into threat-intel pipelines.
- File-upload endpoints — SaaS apps, CMSes, anywhere users can submit content. Don't trust the filename.
- SOC / DFIR — triage suspicious blobs at the start of an investigation.
- Dev tooling — static analyzers and editors that need to know whether a file is C or C++, JSON or JSONL.
Installing locally is a one-liner:
    pipx install magika
    magika -r ./suspicious-dir/
Limitations & pricing
Magika is free. Apache 2.0. No tier, no rate limit. Caveats:
- The README disclaims that Magika is not an official Google product — no support SLA, no warranty.
- On low-confidence inputs, the model returns generic labels like `Generic text document` or `Unknown binary data`. You need to handle those fallbacks in your pipeline rather than assume a specific type.
- The GoLang binding is still WIP; the JS/TS npm package is flagged experimental (it powers the web demo via TFJS, not the new Rust stack).
- Memory-safe outer loop in Rust, but ONNX Runtime inference is C++ — so the dependency tree still pulls in a native C++ runtime.
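Handling the generic fallback labels mentioned in the caveats can look like the sketch below. `Detection` is a hypothetical stand-in for whatever result object your binding returns, not Magika's real result type. The generic label strings are taken from the caveat above and may differ from the identifiers a given binding emits; the score threshold and scanner names are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical detector result; not Magika's real result type."""
    label: str
    score: float


# Generic fallbacks as described in the caveats (exact identifiers may differ).
GENERIC_LABELS = frozenset({"Generic text document", "Unknown binary data"})
MIN_SCORE = 0.80  # illustrative threshold, tune for your own pipeline

# Hypothetical routing table: specific content type -> specialized scanner.
ROUTES = {
    "pdf": "document-scanner",
    "javascript": "script-scanner",
}
FALLBACK = "generic-scanner"


def route(detection: Detection) -> str:
    """Pick a scanner, falling back when the type is generic, unmapped, or low-confidence."""
    if detection.label in GENERIC_LABELS or detection.score < MIN_SCORE:
        return FALLBACK
    return ROUTES.get(detection.label, FALLBACK)
```

The design choice here is to treat a generic label, a low score, and an unmapped type the same way: none of them is evidence that a file is safe, so all three go to the broadest scanner rather than being skipped.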
What's next
Google's roadmap for 1.0+ is community-driven: finishing GoLang bindings, expanding the file-type catalog via user requests, and accepting contributions on GitHub. The repo is at github.com/google/magika — already sitting at 16.2k stars and 1M+ monthly downloads. The research paper is on arXiv 2409.13768 (published at ICSE 2025).
For anyone building a security pipeline, a file-upload API, or a dev tool that needs to know what a blob actually is: 1MB of weights will probably out-classify 50 years of hand-written heuristics. It's worth the pipx install.
Sources: Google Open Source Blog (Magika 1.0), google/magika GitHub, InfoQ, arXiv paper.

