Google's 1MB AI Model Reads Files Like an X-Ray — And Catches Malware That Fakes Its Extension

D. Chu

@donniechublog·20 Apr

20 Apr 2026 · 7 min read

Highlights

Magika is Google's open-source file-type detector.
A 1MB deep-learning model, rewritten in Rust, classifies 200+ file types in ~5ms on a single CPU — and it already runs inside Gmail, Drive, Safe Browsing, and VirusTotal.

Attackers rename payload.exe to resume.pdf and hope your scanner trusts the extension. Google's answer: don't. Magika is an AI file-type detector that reads the actual bytes of a file, decides what it really is, and does it in about 5 milliseconds on a single CPU — with a deep-learning model that weighs only ~1MB.

Originally open-sourced in Feb 2024, Magika just hit its first stable release (1.0, Nov 6, 2025) with a full Rust rewrite, a native CLI, and support for 200+ content types.

TL;DR

Open-source, Apache 2.0. Model ~1MB. Trained on ~100M files.
~99% F1 accuracy across 200+ file types; ~5ms/file on single-CPU inference.
Core engine rewritten in Rust → ~1,000 files/sec on a MacBook Pro M4.
Runs in production at Google across Gmail, Drive, and Safe Browsing (hundreds of billions of files/week). Also pre-filters VirusTotal's Code Insight.
Install via pipx install magika, brew, a Rust crate, or npm. Web demo runs locally in the browser.

What's new in 1.0

The alpha from 2024 was a Python-first tool with a TensorFlow.js model. Magika 1.0 is a different beast:

Rust engine from the ground up. The native CLI extracts feature vectors in Rust and runs ML inference in C++ via the ONNX Runtime (ort crate). Tokio drives async parallelism.
200+ content types, up from ~100. Newly added: Jupyter notebooks, PyTorch models, ONNX, Parquet, HDF5, Swift, Kotlin, TypeScript, Dart, Zig, WebAssembly, Dockerfile, TOML, HashiCorp HCL, Bazel, YARA rules, SQLite, AutoCAD, Photoshop PSD, WOFF2.
Granular distinctions old tools couldn't make: JSON vs JSONL, C vs C++, JavaScript vs TypeScript, TSV vs CSV, Apple binary plists vs XML plists.
Revamped Python and TypeScript modules, plus a Go binding in progress.

Why it matters

File-type detection has been stuck on libmagic and the file utility for 50+ years. Both rely on handcrafted magic-byte heuristics — fragile against evolving formats and trivially fooled by adversarially-crafted payloads. For security pipelines, "trust the extension" is a known attack surface: anyone can rename a file.

Magika flips the assumption. Extensions are ignored. Only a bounded subset of the file's actual bytes feeds a small CNN trained to recognize the structural fingerprint of each format. Because the classifier is learned rather than hand-written, it generalizes to variations no rule author anticipated, and it's cheap enough to drop into hot paths like email ingestion and upload handlers.
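
To make the contrast concrete, here is a minimal sketch of what a handcrafted magic-byte rule set looks like. This is not libmagic's or Magika's code: the signature table, the `sniff_magic` and `extension_of` helpers, and the sample payload are all illustrative assumptions. It shows two things: renaming a file changes nothing about its bytes, and prefix rules have nothing to grab onto for textual formats.

```python
from pathlib import PurePath

# A handful of well-known magic-byte signatures. Real rule sets contain
# thousands of entries; this table is illustrative only.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"MZ", "pe_executable"),
    (b"PK\x03\x04", "zip"),
    (b"\x89PNG\r\n\x1a\n", "png"),
]


def sniff_magic(data: bytes) -> str:
    """Return a coarse type from the leading bytes, or 'unknown'."""
    for prefix, label in SIGNATURES:
        if data.startswith(prefix):
            return label
    return "unknown"


def extension_of(filename: str) -> str:
    """Return the lowercase extension without the dot ('' if none)."""
    return PurePath(filename).suffix.lower().lstrip(".")


# A Windows executable renamed to look like a PDF: the name changes,
# the bytes do not.
payload = b"MZ\x90\x00" + b"\x00" * 60
print(extension_of("resume.pdf"))  # what the filename claims
print(sniff_magic(payload))        # what the leading bytes say

# Textual formats have no fixed header, so source code falls through
# every prefix rule.
print(sniff_magic(b"const x: number = 1;\n"))
```

The last call is the interesting one: telling TypeScript from JavaScript, or JSON from JSONL, needs more than a prefix match, which is the gap a learned classifier is meant to fill.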

Technical facts

Property	Value
Model size	~1MB (Keras → ONNX)
Training data	~100M files, ~3TB uncompressed (streamed via SedPack)
File types	200+ (binary + textual)
Accuracy	~99% average precision & recall
Inference latency	~5ms per file, single CPU
Throughput	Hundreds/sec single-core, ~1,000/sec on M4 multi-core
Hardware	No GPU required
License	Apache 2.0
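
As a quick sanity check on the latency and throughput rows, the arithmetic below relates the two figures. It assumes both numbers describe the same per-file cost, which the article does not confirm (the ~5ms and ~1,000/sec figures may come from different setups), so treat the result as a rough consistency check rather than a benchmark.

```python
# Figures taken from the table above.
latency_ms = 5            # ~5 ms per file on a single CPU
observed_per_sec = 1000   # ~1,000 files/sec reported on an M4

# One core handling files back to back at 5 ms each.
per_core_per_sec = 1000 / latency_ms

# How many cores' worth of parallel work the multi-core figure implies,
# under the assumption that per-file cost stays at 5 ms.
implied_parallelism = observed_per_sec / per_core_per_sec

print(per_core_per_sec)     # files/sec on one core
print(implied_parallelism)  # equivalent number of busy cores
```

At 5 ms per file a single core lands at 200 files/sec, which matches "hundreds/sec single-core", and 1,000 files/sec corresponds to about five cores kept busy.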

The trick behind the near-constant latency: Magika only reads a limited byte window of each file, not the whole thing. A 50MB archive classifies as fast as a 5KB snippet. Google also used Gemini to synthesize training samples for rare/legacy formats where real-world data was scarce.
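
A rough sketch of the bounded-window idea follows. The window size, the choice of reading from the head and the tail, and the `read_windows` and `make_file` helpers are assumptions made for illustration, not Magika's actual feature extraction. The point is only that the number of bytes read stays fixed no matter how large the file is.

```python
import os
import tempfile

# Hypothetical window size; the real model's input size is not stated here.
WINDOW = 1024


def read_windows(path: str, window: int = WINDOW) -> tuple[bytes, bytes]:
    """Read at most `window` bytes from the start and from the end of a file."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(window)
        # Jump to the tail without reading anything in between.
        f.seek(max(size - window, 0))
        tail = f.read(window)
    return head, tail


def make_file(size: int) -> str:
    """Create a temp file of `size` bytes and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"A" * size)
    return path


small = make_file(5 * 1024)         # 5 KB
large = make_file(5 * 1024 * 1024)  # 5 MB
for p in (small, large):
    head, tail = read_windows(p)
    # Same number of bytes read regardless of file size.
    print(os.path.getsize(p), len(head) + len(tail))
    os.remove(p)
```

Because the work per file is capped by the window size, the classifier's input cost does not grow with the file, which is what makes per-file latency roughly constant.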

Comparison vs libmagic

Metric	libmagic / `file`	Magika 1.0
Approach	Handcrafted heuristics	Deep-learning classifier
Textual/code formats	Weak (often confuses JS/TS, C/C++)	Strong — granular per-language
Accuracy (Google 1M benchmark)	Baseline	~20% better overall
Production uplift at Google	Baseline	+50% accuracy vs prior rule system
Adversarial robustness	Fragile	Learned from ~100M samples
Latency	Milliseconds	~5ms (comparable)

Real-world use cases

At Google, Magika routes incoming files across Gmail attachments, Drive uploads, and Safe Browsing downloads to the right security scanners. Switching from handcrafted rules to Magika improved file-type identification accuracy by 50%, let Google scan 11% more files with specialized malicious-document AI scanners, and cut the share of unidentified files to 3%.

Outside Google:

VirusTotal uses Magika as a pre-filter before running files through its generative-AI Code Insight malware analyzer.
abuse.ch integrates it into threat-intel pipelines.
File-upload endpoints — SaaS apps, CMSes, anywhere users can submit content. Don't trust the filename.
SOC / DFIR — triage suspicious blobs at the start of an investigation.
Dev tooling — static analyzers and editors that need to know whether a file is C or C++, JSON or JSONL.
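
For the file-upload case, the "don't trust the filename" rule can be sketched as a gate that compares what the extension claims with what a content detector reports. Everything here is hypothetical: the `ALLOWED` mapping, the `check_upload` function, the label strings, and the `stub_detect` stand-in, which exists only so the example runs without any dependency.

```python
from pathlib import PurePath
from typing import Callable

# Hypothetical mapping from extension to the content labels accepted for it.
ALLOWED = {
    "pdf": {"pdf"},
    "png": {"png"},
    "json": {"json", "jsonl"},
}


def check_upload(filename: str, data: bytes,
                 detect: Callable[[bytes], str]) -> tuple[bool, str]:
    """Accept an upload only if the detected content type matches the
    type its extension claims. `detect` is any bytes -> label function."""
    ext = PurePath(filename).suffix.lower().lstrip(".")
    if ext not in ALLOWED:
        return False, f"extension '{ext}' not allowed"
    detected = detect(data)
    if detected not in ALLOWED[ext]:
        return False, f"claims '{ext}' but content looks like '{detected}'"
    return True, "ok"


def stub_detect(data: bytes) -> str:
    """Stand-in detector for the example; swap in a real one in practice."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"MZ"):
        return "executable"
    return "unknown"


print(check_upload("resume.pdf", b"%PDF-1.7\n", stub_detect))
print(check_upload("resume.pdf", b"MZ\x90\x00", stub_detect))
```

Passing the detector in as a callable keeps the gate independent of any one tool. With the Python package installed, `detect` could wrap a call such as `Magika().identify_bytes(data)`; check the package documentation for the result fields your version exposes.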

Installing locally is a one-liner:

pipx install magika
magika -r ./suspicious-dir/

Limitations & pricing

Magika is free and Apache 2.0 licensed, with no paid tier and no rate limit. Caveats:

The README disclaims that Magika is not an official Google product — no support SLA, no warranty.
On low-confidence inputs, the model returns generic labels like Generic text document or Unknown binary data. You need to handle those fallbacks in your pipeline rather than assume a specific type.
The Go binding is still a work in progress; the JS/TS npm package is flagged experimental (it powers the web demo via TFJS, not the new Rust stack).
The outer loop is memory-safe Rust, but ONNX Runtime inference is C++, so the dependency tree still pulls in a native C++ runtime.
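
The fallback caveat above can be handled with a small routing step. This is a sketch under stated assumptions: the `Detection` shape, the generic label strings in `GENERIC_LABELS`, the 0.9 threshold, and the queue names are all hypothetical, so check what your Magika version actually returns before relying on any of them.

```python
from dataclasses import dataclass

# Assumed label strings for the generic fallbacks (hypothetical).
GENERIC_LABELS = {"txt", "unknown"}


@dataclass
class Detection:
    label: str    # content-type label from the detector
    score: float  # detector confidence in [0, 1]


def route(d: Detection, min_score: float = 0.9) -> str:
    """Pick a downstream queue for a detection result.

    Generic fallbacks and low-confidence results never reach a
    type-specific scanner.
    """
    if d.label in GENERIC_LABELS:
        return "generic-scanner"
    if d.score < min_score:
        return "manual-review"
    return f"scanner:{d.label}"


print(route(Detection("pdf", 0.99)))
print(route(Detection("unknown", 0.99)))
print(route(Detection("pdf", 0.42)))
```

The design choice is to treat a generic label as a first-class outcome with its own path, rather than letting it fall through to whichever scanner happens to be the default.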

What's next

Google's roadmap for 1.0+ is community-driven: finishing the Go bindings, expanding the file-type catalog via user requests, and accepting contributions on GitHub. The repo is at github.com/google/magika, which already has 16.2k stars and 1M+ monthly downloads. The research paper is arXiv:2409.13768 (published at ICSE 2025).

For anyone building a security pipeline, a file-upload API, or a dev tool that needs to know what a blob actually is: 1MB of weights will probably out-classify 50 years of hand-written heuristics. It's worth the pipx install.

Sources: Google Open Source Blog (Magika 1.0), google/magika GitHub, InfoQ, arXiv paper.
