TL;DR

NVIDIA released NCore — an open-source Python library and canonical data format for multi-sensor recordings (cameras, LiDAR, radar, poses, calibrations, labels) used in autonomous vehicles, robotics, and physical AI. The headline feature: a new single-file .itar storage format that hits ~9.8 GB/s on local SSD and delivers a 240 ms time-to-first-read from Amazon S3, compared with 8.7 seconds for Parquet. Install via pip install nvidia-ncore. Apache 2.0. Already powers NVIDIA NuRec, 3DGRUT, and gsplat.

NVIDIA/ncore GitHub repository preview

What's new

Sensor data in robotics and AV has been fragmented forever. Every team invents its own coordinate conventions, calibration representations, and storage layouts. Every dataset ships with a custom parser. Every pipeline rewrites motion compensation from scratch. NCore is NVIDIA's attempt to kill that chaos with one canonical format.

Released roughly two months ago as v18.5.0, the repo just hit v18.9.0 last week, with the latest update adding S3 streaming benchmarks and a tail-read cache for the indexed-tar store. It is developed inside NVIDIA's SIL Lab and published under Apache 2.0.

What makes the release notable is not just the canonical schema; it is the companion .itar (indexed tar) container, which packages Zarr chunks as sequential tar members and appends a compressed index at the end of the file. The result: the streaming efficiency of a tar file plus O(1) random access.
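The layout can be sketched in a few lines of stdlib Python. This is a toy illustration of the indexed-tar idea (sequential members plus a tail index), not NCore's actual on-disk spec — the real format stores Zarr chunks and uses its own index encoding:

```python
import io
import json
import tarfile
import zlib

def write_itar(path, members):
    """Write members ({name: bytes}) as sequential tar entries, then append
    a zlib-compressed JSON index plus an 8-byte footer holding its size."""
    index = {}
    with open(path, "wb") as f:
        with tarfile.open(fileobj=f, mode="w") as tar:
            for name, data in members.items():
                info = tarfile.TarInfo(name)
                info.size = len(data)
                # Payload starts 512 bytes after the tar header begins.
                index[name] = (f.tell() + 512, len(data))
                tar.addfile(info, io.BytesIO(data))
        blob = zlib.compress(json.dumps(index).encode())
        f.write(blob)
        f.write(len(blob).to_bytes(8, "little"))

def read_member(path, name):
    """O(1) random access: one tail read for the index, one seek for the data."""
    with open(path, "rb") as f:
        f.seek(-8, 2)                       # footer: index size
        size = int.from_bytes(f.read(8), "little")
        f.seek(-(8 + size), 2)              # tail read: the index itself
        index = json.loads(zlib.decompress(f.read(size)))
        offset, length = index[name]
        f.seek(offset)                      # direct seek to the payload
        return f.read(length)
```

Because the index and its size footer sit at the tail, a cloud reader needs only one small ranged read at the end of the object before it can fetch any member directly — which is where the fast time-to-first-read comes from.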

Why it matters

Physical AI — autonomous vehicles, humanoid robots, drones, reasoning VLA models like NVIDIA Alpamayo — is built on ingesting oceans of multi-sensor recordings. Getting that data from the fleet into a trainable form is the slowest, most expensive stage of the whole pipeline.

If you have ever tried to train a NeRF or 3D Gaussian Splatting model on Waymo Open plus a custom robot dataset, you know the pain: two datasets, two parsers, two coordinate conventions, two calibration schemas, zero shared tooling. NCore fixes that by making the format the contract, not the code.

Because the storage layer is optimized for cloud object stores, distributed training on GPU clusters no longer needs a local extraction step. You just point the loader at S3, GCS, or Azure Blob and stream.
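The mechanism behind "same API, no code changes" is ordinary URL-scheme dispatch. The sketch below is a stdlib-only stand-in (the mem:// scheme and _MEM_BUCKET store are invented for illustration); NCore itself delegates this job to UPath and fsspec, which supply real s3://, gs://, and az:// backends:

```python
import io
from urllib.parse import urlparse

# In-memory stand-in for a cloud bucket, used by the toy "mem" scheme below.
_MEM_BUCKET = {"mem://bucket/scene.bin": b"\x00" * 64}

def open_uri(uri, mode="rb"):
    """Toy scheme dispatcher: local paths and an in-memory scheme only.
    The real library gets s3://, gs://, and az:// for free via fsspec."""
    scheme = urlparse(uri).scheme
    if scheme in ("", "file"):
        return open(uri.removeprefix("file://"), mode)
    if scheme == "mem":
        return io.BytesIO(_MEM_BUCKET[uri])
    raise ValueError(f"no backend wired up for scheme {scheme!r}")

def first_kilobyte(uri):
    """Caller code is backend-agnostic: it just opens a URI and reads."""
    with open_uri(uri) as f:
        return f.read(1024)
```

The call site never changes; swapping local disk for an object store is purely a matter of which URI string you pass in.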

Technical facts

Numbers first, hype second. On a synthetic benchmark of 1,000 JPEG images at 2K and 4K (~4.5 GB total):

Storage format     S3 time-to-first-read
.itar (NCore)      240 ms
Parquet            8.7 s
tarfile (plain)    119 s

Local SSD throughput for .itar: ~9.8 GB/s sequential, ~9.5 GB/s random access. Streaming from S3: 53 MB/s sequential, 16 MB/s random — without any local extraction.

Other notable specs:

  • Cloud backends: local disk, Amazon S3, Google Cloud Storage, Azure Blob Storage — same API, no code changes (via UPath + fsspec).
  • Camera models: ftheta (NVIDIA polynomial, ultra-wide FOV), opencv-pinhole, opencv-fisheye (Kannala–Brandt 4-coefficient), bivariate-windshield for refraction through car glass.
  • LiDAR model: row-offset-spinning — structured spinning sensor (e.g. Hesai Pandar P128).
  • Radar component: raw detections with direction, distance, radial velocity, RCS, SNR.
  • Pose graph: unified SE(3) tree with microsecond timestamps and optional global ECEF anchor.
  • Rolling-shutter-aware projections: demonstrated on 10 non-synchronized rolling-shutter cameras on a single AV.
  • Non-redundant storage: raw ray bundles only; motion compensation computed on-demand so egomotion estimates can be swapped without rewriting sensor data.
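To make the pose-graph bullet concrete, here is a toy SE(3) tree in pure Python: each frame stores its parent and a 4x4 rigid transform, and a world pose is the composition of transforms up the tree. The frame names and tree layout are invented for illustration; NCore's actual pose-graph API will differ:

```python
import math

def matmul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def se3(yaw=0.0, t=(0.0, 0.0, 0.0)):
    """4x4 rigid transform: rotation about z by `yaw`, then translation t."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, t[0]],
            [s,  c, 0.0, t[1]],
            [0.0, 0.0, 1.0, t[2]],
            [0.0, 0.0, 0.0, 1.0]]

# Hypothetical pose tree: frame -> (parent, transform_to_parent).
TREE = {
    "world":  (None, se3()),
    "rig":    ("world", se3(yaw=math.pi / 2, t=(10.0, 0.0, 0.0))),
    "camera": ("rig", se3(t=(0.0, 0.0, 1.5))),
}

def to_world(frame):
    """Compose up the tree: T_world_frame = T_world_parent @ T_parent_frame."""
    parent, T = TREE[frame]
    return T if parent is None else matmul(to_world(parent), T)
```

A per-frame timestamp (microseconds, as in NCore's spec) would hang off each edge so that transforms can be interpolated in time before composition — the sketch above shows only the spatial half of that story.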

Comparison

How NCore's .itar stacks up against what teams use today:

Aspect          ROS bags / Parquet / HDF5                           NCore
Interchange     Fragmented, per-team                                Canonical open format
Storage         Often pre-computes motion-compensated point clouds  Raw rays, compute on demand
S3 first read   8.7 s (Parquet), 119 s (tar)                        240 ms
Modularity      Monolithic                                          Component-based, independent updates
Sensor models   CPU, per-project                                    GPU (CUDA/PyTorch), rolling-shutter-aware
Pose handling   Rigid-rig OR free-pose                              Unified pose graph (both)

Standard container formats like ROS bags (.bag), MCAP, and ASAM MDF4 still make sense for raw fleet ingest. NCore is the downstream reconstruction-ready layer you convert into.

Use cases

NCore is not a theoretical project — it is already the data backbone for three of NVIDIA's flagship reconstruction engines:

  • NVIDIA NuRec — NCore is the required input format. NuRec turns NCore data into Gaussian-splat photorealistic 3D scenes with independently controllable actors.
  • 3DGRUT — NVIDIA's hybrid rasterization + ray-tracing engine for Gaussian particles. Native NCore v4 training support was added in March 2026.
  • gsplat — the popular high-performance Gaussian splatting library; built-in NCore dataset loader.

On the production side, NCore sits at the center of the AWS + NVIDIA AV 3.0 reference architecture: raw fleet ROS/MCAP/MDF4 recordings are converted into NCore, reconstructed with NuRec into OpenUSD scenes on Amazon EC2 G7e instances (RTX PRO 6000 Blackwell, 96GB GPU memory), then used to train Alpamayo VLA models and validate them in AlpaSim closed-loop simulation.

Built-in converters handle Waymo Open, COLMAP/ScanNet++, the NVIDIA Physical AI Dataset, and PPISP. A converter base class makes it straightforward to add proprietary fleet formats.
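A hypothetical sketch of what such a converter subclass can look like (the class and method names here are invented, not NCore's real API): the base class owns the iteration loop, and a subclass only has to map its source format into canonical components:

```python
from abc import ABC, abstractmethod

class DatasetConverter(ABC):
    """Hypothetical converter base class: subclasses map one proprietary
    source format into canonical per-frame components."""

    @abstractmethod
    def read_frames(self, source):
        """Yield raw per-frame records from the source dataset."""

    @abstractmethod
    def to_components(self, frame):
        """Map one raw frame to canonical components,
        e.g. {"timestamp_us": ..., "camera/front": ...}."""

    def convert(self, source):
        # The shared loop: parse, normalize, emit.
        for frame in self.read_frames(source):
            yield self.to_components(frame)

class TupleConverter(DatasetConverter):
    """Toy subclass whose 'source format' is a list of (timestamp, image)
    tuples, standing in for a real fleet-log parser."""

    def read_frames(self, source):
        yield from source

    def to_components(self, frame):
        ts, image = frame
        return {"timestamp_us": ts, "camera/front": image}
```

The point of the pattern is that a team with a proprietary fleet format writes only the two mapping methods; everything downstream (storage, pose handling, sensor models) works unchanged.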

Limitations & pricing

NCore itself is free and Apache 2.0 licensed. The catch is downstream: to use the GPU-accelerated sensor models you need CUDA-capable NVIDIA GPUs, and the high-end AV reconstruction pipelines (NuRec, Asset Harvester, Fixer, Alpamayo) run best on cloud GPU instances like EC2 G7e with 96GB of VRAM — which is not cheap.

Documentation does not yet publish specific dataset-size limits or known scaling ceilings. The community is also still small (90 stars and 8 forks at the time of writing), though usage inside NVIDIA's own stack is already substantial.

Install is one line: pip install nvidia-ncore. Source on GitHub.

What's next

NVIDIA says more dataset converters are actively being developed beyond Waymo, COLMAP/ScanNet++, and Physical-AI-AV. Future NVIDIA Hyperion AV hardware variants (beyond 8/8.1) will be supported natively, and .itar optimizations continue shipping (last week's release added a tail-read cache to avoid duplicate I/O in S3 streams).

If NCore gets broad adoption outside NVIDIA's stack — especially by non-NVIDIA AV companies and open robotics datasets — it could become the de facto interchange standard for physical AI, the way Parquet became for analytics. The pieces are in place: Apache 2.0, PyPI install, multi-cloud support, and integration with the two most popular Gaussian-splat engines (3DGRUT, gsplat).

Sources: NVIDIA SIL, NVIDIA/ncore on GitHub, NCore docs, AWS Industries Blog.