DevOps vs MLOps vs LLMOps: 3 ops, 3 different problems. Don't apply the DevOps playbook to LLM apps

D. Chu

@donniechublog·24 Apr

24 Apr 2026 · 7 min read

Highlights

DevOps checks whether the code runs.
MLOps watches for data drift and model decay.
LLMOps has to watch hallucination, token cost, bias, and human feedback, and its evaluation loop feeds back into all three paths (prompt, RAG, fine-tune) at once.
It is no longer a linear pipeline.

TL;DR

Many teams are applying the DevOps playbook directly to LLM apps and getting stuck. The mistake is at the starting point: DevOps, MLOps, and LLMOps solve three entirely different problems.

DevOps revolves around code: deterministic, with a simple feedback loop (it runs or it does not).
MLOps revolves around the model: probabilistic, watching for data drift and model decay, with periodic retraining.
LLMOps revolves around a foundation model: non-deterministic, and instead of training from scratch you optimize through three paths: prompt engineering, RAG, and fine-tuning.

What separates LLMOps most clearly is not architecture but monitoring: the focus is no longer accuracy and drift but hallucination, bias, toxicity, token cost, and human feedback. And the evaluation loop feeds back into all three optimization paths at once, so it is no longer a linear pipeline.

DevOps — software-centric

Code → test → deploy. The feedback loop is binary: "Does the code work?" Output is deterministic: same input, same output. A bug is a bug, a fix is a fix.

The main artifacts: code and infrastructure. The biggest risk: deployment failure. Cost is relatively easy to forecast.

MLOps — model-centric

This is where many developers first meet the concept of model decay: the code does not change, but the model performs worse in production over time because the outside world changes (user behavior, the market, the distribution of inputs). The underlying cause is data drift.

The classic MLOps pipeline: collect data → train → evaluate → deploy → monitor → retrain. Artifacts: datasets, features, model binaries. Monitoring: accuracy, precision, recall, F1, RMSE, data drift, latency. These are all quantitative metrics, with model quality scored against a holdout set.
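
As a rough illustration of the "monitor" step, here is a minimal sketch of a drift check using the Population Stability Index (PSI). PSI is one common choice, not something the original post prescribes; the function name `psi`, the bin count, and the 0.25 alert threshold (a widely used rule of thumb) are all assumptions for this example.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a training sample and a production sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all training values are equal

    def histogram(values):
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range production values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [i / 100 for i in range(100)]               # roughly uniform on [0, 1)
prod_shifted = [0.5 + i / 200 for i in range(100)]  # squeezed into [0.5, 1)

if psi(train, prod_shifted) > 0.25:  # assumed alert threshold
    print("drift detected: consider retraining")
```

In a real pipeline this would run per feature on a schedule, and the alert would feed the retrain trigger.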

Use MLOps for: recommendation engines (Spotify, Netflix), fraud detection, churn prediction, sales forecasting, anywhere you learn patterns from structured data to produce one specific prediction. Cost is concentrated in training (GPU-hours). Inference is cheap.

LLMOps — foundation-model-centric

The big difference: you do not train from scratch. You pick a foundation model (GPT, Claude, Llama, Gemini...) and optimize through three paths:

1. Prompt Engineering

Write and version prompts the way you version code. Use zero-shot, or few-shot with one or a few hand-picked examples. This is the fastest way to control tone, format, and behavior. It is also brittle: a single model update can break a prompt.
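
One way to treat a prompt like versioned code is to derive its version from its content. The sketch below is a hypothetical `PromptVersion` class (not from any specific tool) that hashes the template plus its few-shot examples, so any edit produces a new version id you can log next to each response.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptVersion:
    name: str
    template: str
    examples: tuple = ()  # few-shot (input, output) pairs; empty means zero-shot

    @property
    def version(self) -> str:
        # Content hash: any edit to the template or examples yields a new id.
        payload = repr((self.template, self.examples)).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]

    def render(self, user_input: str) -> str:
        shots = "".join(f"Input: {i}\nOutput: {o}\n\n" for i, o in self.examples)
        return f"{self.template}\n\n{shots}Input: {user_input}\nOutput:"

zero_shot = PromptVersion("sentiment", "Classify the sentiment as positive or negative.")
few_shot = PromptVersion(
    "sentiment",
    "Classify the sentiment as positive or negative.",
    examples=(("I love it", "positive"), ("Total waste of money", "negative")),
)
print(zero_shot.version, few_shot.version)  # two different ids
print(few_shot.render("Works as advertised"))
```

Storing the version id with every logged call makes it possible to tell, after a vendor model update, which prompt versions regressed.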

2. Context / RAG Setup

Retrieve data from a vector DB or knowledge base at runtime and inject it into the context window. No retraining is needed when the data changes. Use it when you need factual grounding, access to proprietary data, or fewer hallucinations. According to the RAGOps study (arXiv 2506.03401), 60% of enterprise LLM compound systems already integrate RAG in some form.
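
To make the retrieve-then-inject flow concrete, here is a toy sketch. It assumes a bag-of-words count as a stand-in for a real embedding model and an in-memory list as a stand-in for a vector DB; `embed`, `retrieve`, and `build_prompt` are illustrative names only.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in for a real embedding model: a bag-of-words term count.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=2):
    # Rank documents by similarity to the query and keep the top k.
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query, documents, k=2):
    # Inject the retrieved chunks into the context portion of the prompt.
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

knowledge_base = [
    "refund policy allows returns within 30 days",
    "shipping takes 5 business days",
    "the office is closed on public holidays",
]
print(build_prompt("refund policy for returns", knowledge_base, k=1))
```

Updating `knowledge_base` changes the answers immediately, which is the point of RAG: the data changes, the model does not.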

3. Fine-Tuning

Continue training the foundation model on your domain data (LoRA, QLoRA, PEFT). Use it when you need to embed deep domain knowledge, brand voice, or regulatory alignment. It is usually the last resort: reach for it only when prompt + RAG is not enough, because cost and complexity are high.
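
For a sense of why LoRA-style methods keep fine-tuning cost down, here is a back-of-the-envelope calculation. LoRA freezes a weight matrix W of shape (d_out, d_in) and trains two low-rank factors, B (d_out x rank) and A (rank x d_in). The 4096 dimension and rank 8 below are assumed example values, not figures from the post.

```python
def lora_trainable_params(d_in, d_out, rank):
    # Trainable parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in).
    return rank * (d_in + d_out)

def full_params(d_in, d_out):
    # Parameters updated when the whole weight matrix is fine-tuned.
    return d_in * d_out

d = 4096
adapter = lora_trainable_params(d, d, rank=8)  # 65,536
full = full_params(d, d)                       # 16,777,216
print(f"{adapter / full:.2%} of the matrix is trainable")  # 0.39%
```

This covers a single matrix only; total savings depend on how many layers get adapters.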

Golden rule: fine-tuning for style/tone/reasoning, RAG for real-time facts. They complement each other rather than replace each other.

Monitoring is completely different

This is exactly what many teams miss when moving from MLOps to LLMOps.

Dimension	MLOps	LLMOps
Primary metrics	Accuracy, precision, recall, F1, RMSE	Hallucination rate, toxicity, relevance, faithfulness
Drift to watch	Data drift, model decay	Prompt drift, behavior drift, model-vendor updates
Testing	Holdout set, fixed benchmark	Golden prompts (15-20), human judge, LLM-as-judge
Cost line item	Training (GPU-hours)	Inference (tokens per query)
Security risk	Data leakage through input/output	Prompt injection, jailbreak, harmful content
Feedback loop	Retrain triggered on drift	Feedback flows into prompt / RAG / fine-tune simultaneously

The reason is simple: you cannot just check whether the output is right or wrong. The output must also be safe, grounded, and cost-effective. A factually correct passage can still be toxic, can still leak PII, and can still burn 10× the usual tokens.
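
A minimal sketch of what "more than right or wrong" looks like in code: one response scored on several independent dimensions. The word-overlap faithfulness proxy, the blocked-term list, and the 0.5 threshold are deliberately crude assumptions; production systems typically use an LLM judge or a trained classifier for these checks.

```python
def evaluate_response(response, context, token_count, token_budget, blocked_terms):
    words = [w.strip(".,!?").lower() for w in response.split()]
    context_words = {w.strip(".,!?").lower() for w in context.split()}
    # Crude faithfulness proxy: share of response words that also appear in the context.
    overlap = sum(w in context_words for w in words) / len(words) if words else 0.0
    return {
        "grounded": overlap >= 0.5,
        "safe": not any(term in words for term in blocked_terms),
        "within_budget": token_count <= token_budget,
    }

report = evaluate_response(
    response="The refund window is 30 days, you idiot.",
    context="The refund window is 30 days.",
    token_count=9000,
    token_budget=1000,
    blocked_terms={"idiot"},
)
print(report)  # {'grounded': True, 'safe': False, 'within_budget': False}
```

The example response is factually grounded yet fails on safety and cost, which a single accuracy metric would never surface.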

The evaluation loop is no longer a linear pipeline

In MLOps, a failed eval → retrain. A straight line.

In LLMOps, a failed eval can mean:

The prompt is not clear enough → fix the prompt
The context is missing or wrong → improve RAG (chunking, embedding model, reranker)
The model lacks domain capability → fine-tune
Or all three at once
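
The branching above can be sketched as a small routing table. The failure tags (`unclear_instruction`, `missing_context`, and so on) are hypothetical labels that a human reviewer or an LLM judge would attach to failed eval cases; they are not part of any standard.

```python
# Hypothetical failure tags mapped to the optimization path that addresses them.
REMEDIATION = {
    "unclear_instruction": "prompt",
    "missing_context": "rag",
    "wrong_context": "rag",
    "domain_gap": "fine_tune",
}

def route_failures(failure_tags):
    """Map diagnosed eval failures to the optimization paths that need work."""
    paths = {REMEDIATION[tag] for tag in failure_tags if tag in REMEDIATION}
    return sorted(paths)

print(route_failures(["missing_context"]))
# ['rag']
print(route_failures(["unclear_instruction", "wrong_context", "domain_gap"]))
# ['fine_tune', 'prompt', 'rag']
```

Unlike the MLOps case, one failed eval run can return several paths, so the loop fans out instead of following a single retrain arrow.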

This is why modern LLMOps platforms (W&B Weave, LangSmith, Arize Phoenix, TruLens) all track traces, the chain of inference steps from retrieval → prompt → generation → guardrail, rather than tracking a single output metric.
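
To show what a trace holds, here is a minimal hand-rolled sketch. It does not use or imitate the API of any platform named above; the `Trace` class, the stage names, and attributes such as `doc_ids` and `prompt_version` are assumptions for illustration.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field

@dataclass
class Trace:
    request_id: str
    spans: list = field(default_factory=list)

    @contextmanager
    def span(self, stage, **attributes):
        # Record one pipeline stage with its attributes and wall-clock duration.
        start = time.perf_counter()
        record = {"stage": stage, **attributes}
        try:
            yield record
        finally:
            record["duration_s"] = time.perf_counter() - start
            self.spans.append(record)

trace = Trace("req-1")
with trace.span("retrieval", top_k=3) as s:
    s["doc_ids"] = ["kb-12", "kb-40"]
with trace.span("prompt", prompt_version="a1b2c3"):
    pass
with trace.span("generation", model="small-model") as s:
    s["output_tokens"] = 128
with trace.span("guardrail") as s:
    s["blocked"] = False

print([s["stage"] for s in trace.spans])
```

With every stage recorded, a bad answer can be attributed to the retrieval, the prompt, the model, or the guardrail instead of a single opaque output score.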

Cost is the most underestimated dimension

Chawla puts the warning well: "In DevOps, compute cost is predictable. In LLMOps, one bad prompt can 10× token spend overnight." Teams have burned a month's budget in a few days simply because they did not track tokens per query.

LLMOps must have:

A token counter on every call (input + output)
Cost budget alerts at the tenant / feature / endpoint level
A prompt cache to cut redundant tokens
Model routing (a small model for simple tasks, a large model for complex ones)
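
Three of the four items above (per-call token counting, budget alerts, and model routing) fit in a short sketch. The prices, model names, the 80% alert threshold, and the word-count routing rule are all made-up assumptions; real prices and routing signals depend on your vendor and workload. Prompt caching is left out to keep the example small.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens; real prices depend on the vendor.
PRICES = {
    "small-model": {"input": 0.0005, "output": 0.0015},
    "large-model": {"input": 0.01, "output": 0.03},
}

class CostTracker:
    def __init__(self, budgets):
        self.budgets = budgets           # e.g. {"search": 50.0}, USD per feature
        self.spend = defaultdict(float)

    def record(self, feature, model, input_tokens, output_tokens):
        # Count input and output tokens on every call and convert to cost.
        price = PRICES[model]
        cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1000
        self.spend[feature] += cost
        return cost

    def over_budget(self, feature, threshold=0.8):
        # Alert once spend crosses a fraction of the feature's budget.
        return self.spend[feature] >= threshold * self.budgets[feature]

def pick_model(prompt, max_simple_words=50):
    # Naive routing rule: short prompts go to the cheaper model.
    return "small-model" if len(prompt.split()) <= max_simple_words else "large-model"

tracker = CostTracker({"search": 1.0})
prompt = "Summarize this ticket in one sentence."
tracker.record("search", pick_model(prompt), input_tokens=1200, output_tokens=80)
if tracker.over_budget("search"):
    print("budget alert for feature: search")
```

The same tracker keyed by tenant or endpoint gives the other alert levels from the list.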

Prompt versioning & RAG pipelines are first-class citizens

Just as data versioning became mandatory in MLOps, prompt versioning and the RAG pipeline are now first-class in LLMOps. Every prompt change is a deployable artifact, with CI, rollback, and golden tests.
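
A golden test gate for prompt changes might look like the sketch below. It assumes the simplest possible check (the output must contain an expected substring) and uses a `fake_model` stub so it runs offline; in CI you would pass your real generation function and probably a richer judge.

```python
def run_golden_tests(generate, golden_cases, min_pass_rate=1.0):
    """generate: callable(prompt) -> str. golden_cases: list of (prompt, must_contain)."""
    failures = []
    for prompt, must_contain in golden_cases:
        output = generate(prompt)
        if must_contain.lower() not in output.lower():
            failures.append(prompt)
    pass_rate = 1 - len(failures) / len(golden_cases)
    # "deploy" is the CI gate: block the prompt change if the pass rate is too low.
    return {"pass_rate": pass_rate, "failures": failures, "deploy": pass_rate >= min_pass_rate}

def fake_model(prompt):
    # Stand-in for a real LLM call so the sketch runs offline.
    return "Refunds are accepted within 30 days." if "refund" in prompt else "I am not sure."

golden = [
    ("What is the refund window?", "30 days"),
    ("What is the shipping time?", "5 business days"),
]
print(run_golden_tests(fake_model, golden))
```

A failed gate keeps the previous prompt version live, which is the rollback behavior described above.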

Guardrails have also become standard. According to the RAGOps framework, there are four stages:

Input rails: sanitize or reject malicious user prompts
Dialog rails: steer the conversation flow
Retrieval rails: filter or mask retrieved data (especially PII)
Output rails: block or adjust the response before it reaches the user
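
The four stages can be wired together roughly as follows. This is a toy sketch, not the RAGOps paper's implementation or any guardrails library's API: the blocked phrase, the email-only PII regex, the allowed topics, and all function names are assumptions.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
BLOCKED_INPUT = ("ignore previous instructions",)

def input_rail(user_prompt):
    # Reject prompts that match a known-malicious pattern.
    if any(p in user_prompt.lower() for p in BLOCKED_INPUT):
        raise ValueError("input rejected by input rail")
    return user_prompt

def dialog_rail(user_prompt, allowed_topics):
    # Keep the conversation on supported topics.
    return any(topic in user_prompt.lower() for topic in allowed_topics)

def retrieval_rail(chunks):
    # Mask PII (emails only, in this sketch) before chunks reach the model.
    return [EMAIL.sub("[REDACTED_EMAIL]", c) for c in chunks]

def output_rail(response, max_chars=500):
    # Last line of defense: mask PII again and cap the response length.
    return EMAIL.sub("[REDACTED_EMAIL]", response)[:max_chars]

def answer(user_prompt, retrieve, generate, allowed_topics=("refund", "shipping")):
    prompt = input_rail(user_prompt)
    if not dialog_rail(prompt, allowed_topics):
        return "Sorry, I can only help with refunds and shipping."
    chunks = retrieval_rail(retrieve(prompt))
    return output_rail(generate(prompt, chunks))

retrieve = lambda q: ["Contact jane.doe@example.com for refund help."]
generate = lambda q, chunks: "Per our docs: " + chunks[0]
print(answer("How do I get a refund?", retrieve, generate))
```

Here `retrieve` and `generate` are injected as plain callables, so each rail can be tested without a live model.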

Choose the ops layer that matches the system you are building

This is the most practical takeaway:

Building a regular SaaS → DevOps is enough.
Building fraud detection / recommendation / forecasting on structured data → MLOps.
Building a chatbot / copilot / RAG search / agent workflow on an LLM → LLMOps.
Building a system with highly autonomous agents → AgentOps (the layer above LLMOps).

They stack; they do not replace one another. DevOps still provides CI/CD and infrastructure for everything. MLOps extends DevOps for probabilistic models. LLMOps extends MLOps for foundation models. AgentOps extends LLMOps for autonomous agents. Each layer adds a new failure mode: deployment bug → model drift → hallucination → uncontrolled autonomy.

Over to you

What does your LLM monitoring stack look like? PromptLayer or Helicone for token tracking? RAGAS / TruLens for eval? Or are you still logging with console.log and praying?

Original post by @_avichawla. Further reading: Daily Dose of DS, ZenML, Neptune.ai, RAGOps paper (arXiv 2506.03401).
