FlashDrive: Reasoning VLA cho xe tự lái chạy real-time — 716ms xuống 159ms, zero accuracy loss

TL;DR

Reasoning VLA (Vision-Language-Action) model giúp xe tự lái xử lý tốt hơn các tình huống long-tail phức tạp, nhưng cost autoregressive chain-of-thought decoding khiến nó quá chậm để deploy real-time. FlashDrive — dự án mới của Z Lab do Zhijian Liu dẫn dắt — giải bài toán đó bằng algorithm–system co-design: 716ms → 159ms trên RTX PRO 6000 (4.5× speedup), zero accuracy loss, transfer từ Jetson Thor in-car tới RTX 5090 với speedup 4.0×–5.7×.

Alpamayo 1.5 vs FlashDrive side-by-side, 4.5x faster trên RTX PRO 6000 Blackwell

What's new

FlashDrive là framework đầu tiên tối ưu cả bốn stage của VLA inference cùng lúc: vision encoding, prompt prefilling, reasoning token decoding, và action generation. Baseline là Alpamayo 1.5 — model VLA 10B dựa trên Qwen3-VL. Thay vì chọn một điểm nghẽn, team chọn đánh cả 4 stage bằng 4 kỹ thuật riêng — gain không saturate mà compound lại.

Trên RTX PRO 6000, latency từng stage: Encode 88ms → 12.5ms, Prefill 177.2ms → 52.5ms, Decode 263.8ms → 48.2ms, Action 187.4ms → 46.2ms. Không có single fix nào chiếm hơn một nửa tổng speedup.

Why it matters

Trước giờ reasoning VLA cho autonomous driving mắc ở trade-off kinh điển: bật chain-of-thought reasoning thì handle được edge case (ngã tư không lane marking, pedestrian rare, construction zone), nhưng latency vọt lên ~700ms — quá chậm cho control loop 10Hz. Các hướng song song như Reasoning-VLA né bằng parallel action prediction một bước, đánh đổi bỏ CoT. FlashDrive giữ nguyên reasoning structure — chỉ tối ưu system, không hy sinh quality.

Ý nghĩa thực tế: lần đầu tiên reasoning policy có thể ship vào xe production mà không fallback sang non-reasoning planner cho safety-critical loop.

Technical facts

Bốn kỹ thuật cốt lõi:

Streaming inference — loại 75% redundant vision computation nhờ reuse KV cache giữa các frame liên tiếp. Dùng pre-RoPE key caching với dynamic rotary embeddings, custom streaming attention mask cho thứ tự multi-camera view. Chỉ fine-tune action expert (không đụng VLM) để recover accuracy.
DFlash speculative reasoning — block diffusion model sinh cả block ~16 reasoning token song song thay vì autoregressive. Tận dụng reasoning sequence cho driving thường ngắn và có cấu trúc, entropy per-token thấp. Kết quả: zero quality loss so với AR decoding.
Adaptive-step flow matching — giảm denoising step từ 10 xuống ít hơn bằng cache velocity ở middle step. Velocity profile trong flow matching có dạng chữ U: endpoint mang signal, middle step dư thừa.
ParoQuant W4A8 — pairwise rotation quantization (weight 4-bit, activation 8-bit), đánh cả compute-bound prefill lẫn memory-bound decode. Dùng scaled pairwise rotation để suppress outlier trong CoT token — chính ParoQuant đã được accept ở ICLR 2026.

Thêm hai tối ưu system: CUDA Graphs khử CPU dispatch overhead trong autoregressive generation, và kernel fusion gộp projection + MLP operations.

Accuracy vẫn giữ

Trên Alpamayo 1.5 baseline:

Metric	Alpamayo 1.5	+ FlashDrive
ADE@6.4s	1.72 m	1.56 m (tốt hơn)
minADE@6.4s	0.77 m	0.84 m (within 0.1m tolerance)
End-to-end latency (RTX PRO 6000)	716 ms	159 ms

ADE (Average Displacement Error) thậm chí giảm nhẹ — không phải do tối ưu mà do action expert fine-tune lại trong quá trình adapt streaming inference.

Cross-platform

Cùng bộ optimization chuyển sang GPU khác:

GPU	Speedup
Jetson Thor (in-car)	4.0×
RTX 3090	4.9×
RTX 4090	5.7× (cao nhất)
RTX 5090	5.1×
RTX PRO 6000	4.5×

Việc transfer được xuống Jetson Thor quan trọng: đây là SoC embedded NVIDIA thiết kế cho AV production, không phải datacenter GPU.

Use cases

AV stack team đang dùng VLM/VLA làm planner, cần giảm latency để đạt control frequency production-grade.
OEM muốn ship reasoning model vào xe mà hardware budget giới hạn ở Jetson-class compute.
Research lab benchmark reasoning policy ở 10Hz+ — trước đây phải hạ frequency hoặc tắt CoT.
Các ứng dụng robotics nói chung dùng VLA architecture (manipulation, humanoid) — kỹ thuật có thể port.

Limitations & availability

Preview — project page + demo video đã public, paper đang "available shortly". Code FlashDrive chưa open source (DFlash và ParoQuant thì đã có trên Z-Lab GitHub).
Base test là Alpamayo 1.5 10B — chưa rõ generalize sang VLA architecture khác (AutoVLA, Reasoning-VLA) mức nào.
minADE@6.4s giảm 0.07m — trong tolerance nhưng không phải "zero loss" tuyệt đối ở mọi metric.
Chưa có closed-loop simulation hay real-vehicle deployment data.

What's next

Paper được Z Lab hứa "available shortly" — khi release sẽ có thêm ablation chi tiết và hopefully closed-loop driving benchmark. DFlash (1.9k star) và ParoQuant (ICLR 2026) đã open source, nên khả năng cao FlashDrive cũng sẽ theo. Xa hơn, pattern co-design 4-stage này có thể là blueprint cho mọi VLA deployment — không chỉ driving mà cả robot manipulation và humanoid reasoning.

Nguồn: FlashDrive project page, Zhijian Liu trên X, Z-Lab GitHub.

FlashDrive: Reasoning VLA cho xe tự lái chạy real-time — 716ms xuống 159ms, zero accuracy loss

TL;DR

What's new

Why it matters

Technical facts

Accuracy vẫn giữ

Cross-platform

Use cases

Limitations & availability

What's next

Tiếp tục lướt

Qwen3.6 35B chạy 164 tok/s trên creative writing với DFlash: kỷ lục mới của open-source MoE

DFlash đã chạy được trên llama.cpp: block-diffusion draft, speedup tới 8× cho Qwen3

Kimi K2.6 + DFlash trên 8x MI300X: 508 tok/s, nhanh gấp 5.6 lần mà không mất chất lượng

200 tok/s, 49W: Qwen3.6-27B-FP8 Runs Flagship Coding on a Single DGX Spark

NVIDIA Asset Harvester: Biến video lái xe thành 3D asset trong vài giây