All posts


#943 · 2026-05-07

Prefill and Decode: Two opposing phases that explain everything about LLM speed

Prefill processes the whole prompt in parallel: the bottleneck is compute and the metric is time to first token (TTFT). Decode generates one token at a time: the bottleneck is memory bandwidth and the metric is inter-token latency (ITL). Llama-2-13B needs about 800 KB of KV cache per token in FP16, so a 4K context at batch size 8 consumes roughly 25 GB of VRAM. DeepSeek's MLA shrinks the KV cache by 93.3% and raises throughput 5.76x by redesigning attention from the ground up.
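The KV cache figures above can be reproduced with a little arithmetic. The sketch below is a rough estimate and not a profiler measurement. It assumes Llama-2-13B's published shape (40 layers, hidden size 5120, standard multi-head attention with no grouped-query sharing) and FP16 storage at 2 bytes per element. The function names are illustrative and do not come from any library.

```python
# Back-of-the-envelope KV cache sizing (illustrative helper names).
# Assumptions: FP16 storage (2 bytes per element) and full multi-head
# attention, so each layer stores one K and one V vector of width
# hidden_size for every token.

def kv_cache_bytes_per_token(n_layers: int, hidden_size: int,
                             bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    return 2 * n_layers * hidden_size * bytes_per_elem  # 2 = K and V


def kv_cache_bytes(n_layers: int, hidden_size: int, seq_len: int,
                   batch_size: int, bytes_per_elem: int = 2) -> int:
    """Total KV cache bytes for a batch of sequences of equal length."""
    per_token = kv_cache_bytes_per_token(n_layers, hidden_size, bytes_per_elem)
    return per_token * seq_len * batch_size


# Llama-2-13B shape: 40 layers, hidden size 5120.
per_token = kv_cache_bytes_per_token(n_layers=40, hidden_size=5120)
total = kv_cache_bytes(n_layers=40, hidden_size=5120,
                       seq_len=4096, batch_size=8)

print(f"per token: {per_token / 1024:.0f} KiB")   # per token: 800 KiB
print(f"total:     {total / 1024**3:.0f} GiB")    # total:     25 GiB
```

The "800 KB" and "25 GB" in the summary correspond to binary units here (KiB and GiB). Model weights and activations come on top of this figure.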

llm-inference · kv-cache · prefill-decode

6 min read