- Blockify giam kich thuoc corpus xuong con 2.5% kich thuoc goc trong khi giu lai 99% factual integrity.
- Token tieu thu moi query giam 3.09x - tu 1,515 xuong 490 tokens.
- Do chinh xac vector search tang 2.29x so voi chunking truyen thong.
- Trong thu nghiem lam sang voi Llama 3.2 3B chay on-device, Blockify cai thien do chinh xac trung binh 261% va len den 650% voi truong hop DKA management.
TL;DR
Blockify la mot data preprocessing engine (open-source, do Iternal Technologies phat trien) chay giua document parser va vector store. Thay vi chunk van ban theo do dai co dinh, no dung AI chuyen moi doan thanh IdeaBlock - mot don vi kien thuc 2-3 cau, kem theo Critical Question, Trusted Answer va metadata (version, clearance level, entity type). Pipeline gom 2 stage: Ingest (raw text → IdeaBlocks) va Distill (dedup + merge duplicate thanh canonical unit). Ket qua: corpus giam 40x, token per query giam 3x, vector search chinh xac hon 2.29x.

Van de that su cua RAG truyen thong
Standard RAG chunk van ban theo do dai co dinh (thuong 1,000-2,000 ky tu), cat ngang ranh gioi tu nhien cua van ban. Mot chunk thuong chua thong tin tu nhieu chu de khac nhau - chi 25-40% la lien quan den query cua nguoi dung, phan con lai la vector noise.
Nghiem trong hon: embedding model ma hoa chunk cu va chunk moi theo cung mot cach. Khong co tin hieu nao cho biet phien ban nao la chinh thong, cai nao la draft loi thoi. Khi retrieval lay ca hai, LLM pha tron chung va ao giac. Van de khong nam o retrieval - no nam o representation. Don vi du lieu sai, va fix phai xay ra truoc retrieval, o tang du lieu.
Blockify giai quyet dung cho diem nay.
IdeaBlock hoat dong nhu the nao
Engine sit between document parser va vector store. Pipeline gom 2 stage chinh:
- Ingest: Context-aware splitter tim natural breaks (paragraph boundary, section break, topic shift). Mot LLM chuyen dung xu ly tung segment va trich xuat IdeaBlock - khoang 2-3 cau, chua: ten concept, Critical Question, Trusted Answer, va metadata (entity name, entity type, version, clearance level). Moi block con kem cap Q&A giup embedding cua query va block nam gan nhau trong vector space - tuong tu HyDE nhung thuc hien o tang du lieu voi data da validate thay vi patch o retrieval.
- Distill: Model thu hai cluster cac block tuong tu ngu nghia tren toan bo corpus, merge duplicate thanh 1 canonical unit truoc khi indexing. Trung binh enterprise co ti le trung lap du lieu 15:1 - Distill loai bo hoan toan van de nay.
IdeaBlock duoc index tra loi mot cau hoi cu the thay vi tra ve mot doan van co the chua cau tra loi o dau do ben trong.
Con so quan trong
| Chi so | Standard RAG | Blockify | Cai thien |
|---|---|---|---|
| Kich thuoc corpus | 100% | 2.5% | 40x nho hon |
| Tokens per query | 1,515 | 490 | 3.09x giam |
| Vector search precision | Baseline | +51-52% | 2.29x chinh xac |
| Aggregate LLM accuracy | Baseline | len toi 78x | - |
Trong benchmark thuc te voi tieu thuyet Dune (425 trang, 200K+ tu) chay tren Intel Gaudi 2 qua Denvr Cloud: xu ly 202 giay, 5,404 bytes/giay, tang 40x do chinh xac RAG va 51% precision vector search.
Thu nghiem trong y te
Medical whitepaper cua Iternal test 9 cau hoi lam sang (DKA, viem phoi, heart failure, dau dau red flags...) tren Llama 3.2 3B MLC-LLM Quantized - mo hinh chay hoan toan on-device, air-gapped:
- Trung binh: +261.11% do chinh xac va source fidelity so voi standard chunking
- DKA management: +650% - RAG truyen thong goi y "D5W" (dextrose) lam IV dich dau tien - sai nghiem trong ve mat lam sang; Blockify goi y "IV rehydration" dung phac do
- Pneumonia lab tests: +500%
- Headache red flags: +250%
Ket luan cua whitepaper: "Blockify ingestion is not optional but mandatory for RAG-powered LLMs in medicine."
Tich hop va trien khai
Blockify compose voi cac tool RAG pho bien:
- LangChain: swap TextSplitter/NodeParser bang Blockify, IdeaBlock nodes hoat dong binh thuong trong pipeline
- LlamaIndex: tuong tu, IdeaBlock nodes tich hop vao query engine
- Vector DBs: Milvus, Elastic, Pinecone, Azure AI Search, Zilliz, AWS
- Intel Xeon: optimized build qua OpenVINO cho production workloads
- Claude Code skill co trong repo, chay full Ingest + Distill pipeline voi tham chieu project documentation
Pricing: $0.25/1K tokens (Developer, kem $400 promo credit); $270/thang/user (Enterprise). Ho tro Cloud SaaS, Private Cloud (AWS/Azure/GCP), On-premises va Air-gapped.
Nhan xet va tien trinh
Van de Blockify giai quyet - representation quality truoc retrieval - la gap thuc su cua tat ca RAG pipeline hien tai. Con so 40x corpus reduction va 3x token savings khong chi la accuracy metric, no con la cost metric: voi 1 ty query/nam, Blockify tiet kiem uoc tinh ~$738,000 API cost (tinh tren gia LLAMA 3.3 70B). Roadmap ke tiep la Self-Healing Datasets - LLM agents tu dong draft IdeaBlock updates va route toi SME de approve.
Nguon: @_avichawla tren X, Blockify Benchmarks - Iternal, Medical Accuracy Case Study.


