The Cross-Simulator Gap: GR00T N1.6 scores 97.65% on LIBERO (MuJoCo) but 0% on RoboGate (Isaac Sim). Same model, same robot, same task — 97.65 percentage point gap.
GR00T N1.6, NVIDIA's own robot foundation model, achieves 97.65% on LIBERO (MuJoCo). The same model, fine-tuned on the same LIBERO-Spatial dataset, scores 0% on RoboGate's 68 industrial scenarios (Isaac Sim). This 97.65 percentage point gap proves that deployment-environment validation is not optional — it's essential.
All models evaluated on identical 68 scenarios · Franka Panda · Isaac Sim 5.1
| Model | Params | SR | Result | Conf. |
|---|---|---|---|---|
| Scripted Controller | — | 100% | 68/68 | 76 |
| GR00T N1.6 (LIBERO-finetuned) *new* | 3B | 0.0% | 0/68 | 49 |
| GR00T N1.7 LIBERO-10 (NVIDIA) *new* | 3B | 0.0% | 0/68 | 27 |
| GR00T N1.7 LIBERO-Goal (NVIDIA) *new* | 3B | 0.0% | 0/68 | 27 |
| GR00T N1.7 LIBERO-Object (NVIDIA) *new* | 3B | 0.0% | 0/68 | 27 |
| GR00T N1.7 LIBERO-Spatial (NVIDIA) *new* | 3B | 0.0% | 0/68 | 27 |
| PI0 Base (Physical Intelligence) | 3.5B | 0.0% | 0/68 | 27 |
| OpenVLA (Stanford + TRI) | 7B | 0.0% | 0/68 | 27 |
| GR00T N1.6 (base) | 3B | 0.0% | 0/68 | 1 |
| SmolVLA Base (HuggingFace) | 450M | 0.0% | 0/68 | 1 |
| Octo-Base (UC Berkeley) | 93M | 0.0% | 0/68 | 1 |
| Octo-Small (UC Berkeley) | 27M | 0.0% | 0/68 | 1 |
| CogACT (Embodied VLA) — mock | 7B | pending | — | — |
| X-VLA — mock | 4.5B | pending | — | — |
| OpenVLA-OFT — mock | 7B | pending | — | — |
Scaling from 27M to 7B parameters (260×) yields zero improvement, and even NVIDIA's official GR00T N1.6 (3B) scores 0%. The failure is not a matter of capacity; it is the distribution gap between robot pre-training data and RoboGate's adversarial industrial scenarios.
Scale tested: 27M → 7B (260×) · PI official: PI0 Base (3.5B) · NVIDIA official: GR00T N1.6 (3B) · Improvement: 0% · vs Scripted: 100-point gap
Six VLA models — including Physical Intelligence's PI0, NVIDIA's GR00T N1.6, and HuggingFace's SmolVLA — evaluated on the same 68 adversarial scenarios via two-process ZMQ pipeline (Isaac Sim ↔ VLA inference).
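In a two-process design like this, the simulator and the policy share only a serialized message channel. A minimal sketch of such a request/reply payload, with stdlib JSON standing in for the actual ZMQ transport (field names such as `joint_positions` and `rgb_shape` are illustrative assumptions, not RoboGate's wire format):

```python
import json

# Illustrative observation/action schema for a sim <-> policy bridge.
# Field names are assumptions for illustration, not RoboGate's actual format.

def encode_observation(rgb_shape, joint_positions, instruction):
    """Serialize one simulator observation into a JSON request."""
    return json.dumps({
        "rgb_shape": list(rgb_shape),        # camera image dimensions
        "joint_positions": joint_positions,  # 7-DoF Franka arm state
        "instruction": instruction,          # language goal for the VLA
    }).encode("utf-8")

def decode_action(payload):
    """Parse the policy's JSON reply into an action chunk."""
    msg = json.loads(payload.decode("utf-8"))
    return msg["action"]  # e.g. delta end-effector pose + gripper command

# One simulated round trip; in the real pipeline a ZMQ REQ/REP socket
# pair would carry these bytes between the two processes.
request = encode_observation((480, 640, 3), [0.0] * 7, "pick up the red block")
reply = json.dumps({"action": [0.01, 0.0, -0.02, 0.0, 0.0, 0.0, 1.0]}).encode("utf-8")
action = decode_action(reply)
print(action[:3])  # translation component of the action
```

Keeping the policy in its own process means a hung or crashed VLA inference server cannot take down the simulator, at the cost of per-step serialization latency.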
NVIDIA (LIBERO-Spatial 20K)
GR00T N1.6 fine-tuned on LIBERO-Spatial (20K steps, H100 80GB). Achieves 97.65% SR on LIBERO (MuJoCo) — NVIDIA's official benchmark result. But scores 0% on RoboGate's 68 Isaac Sim scenarios with Confidence 49/100 (zero collisions). The highest Confidence among VLAs, yet still 0% SR — proving the cross-simulator gap.
Physical Intelligence
Physical Intelligence's official 3.5B VLA, evaluated via OpenPI (official inference server). PaliGemma 3B vision + 315M Flow-Matching action expert. Zero collisions like OpenVLA, but 0% SR — the Flow-Matching architecture also cannot bridge the training-deployment distribution gap without fine-tuning.
NVIDIA
NVIDIA's official 3B foundation model for humanoid and manipulation. Built on an Eagle-2 vision encoder and Llama backbone with large-scale robot pre-training. Despite being the industry's flagship VLA from the GPU leader, it scores 0% SR with both grasp_miss and collision failures, showing that even a tier-1 vendor's model cannot bridge the cross-simulator distribution gap on adversarial scenarios.
Stanford + Toyota Research Institute
Open-source 7B VLA from Stanford + Toyota Research Institute. Built on Llama-2 backbone, fine-tuned on Open X-Embodiment. The largest model tested — yet 0% SR with a different failure profile: primarily grasp_miss with zero collisions, suggesting better spatial awareness but still unable to complete tasks.
UC Berkeley
93M parameter version of Octo from UC Berkeley. Trained on 800K episodes from Open X-Embodiment. 3.4× larger than Octo-Small but identical 0% SR and nearly identical failure distribution.
HuggingFace
HuggingFace's 450M parameter VLA built on SmolLM2 language model + SigLIP vision encoder. Designed for efficient on-device deployment. The fastest model tested (18ms/inference) — yet 0% SR, demonstrating that even purpose-built efficient VLAs cannot bridge the training-deployment gap.
UC Berkeley
27M parameter lightweight VLA from UC Berkeley. The smallest and fastest model. Same 0% result with 79.4% grasp_miss and 20.6% collision failures.
Scripted Baseline
68/68 PASS · Confidence 76/100
VLA models (all 6)
0/68 FAIL · Best Confidence: 27/100 (OpenVLA, PI0)
100-point gap
Same 0% SR, but different failure patterns yield different Confidence Scores:
- PI0 Base: Physical Intelligence's official 3.5B model (OpenPI). Zero collisions, the same pattern as OpenVLA; the Flow-Matching architecture also cannot bridge the distribution gap.
- GR00T N1.6: NVIDIA's official 3B model. Collisions present and grasp_miss dominant; despite large-scale robot pre-training, complete failure on industrial adversarial scenarios.
- OpenVLA: zero collisions; spatial awareness exists, but the model cannot grasp. Higher confidence means "safe but incapable."
- Octo: 20%+ collision rate, crashing into the table and obstacles. Low confidence means "incapable and dangerous."
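The failure profiles above imply a scoring rule in which collisions cost more confidence than grasp misses. A toy sketch of such a rule (the weights and formula are our illustration, not RoboGate's published Confidence Score):

```python
def confidence_score(n_scenarios, grasp_miss, collisions):
    """Toy confidence rule: penalize collisions (dangerous) more heavily
    than grasp misses (merely incapable). The weights below are
    illustrative assumptions, not RoboGate's actual metric."""
    score = 100.0
    score -= 70.0 * grasp_miss / n_scenarios   # safe-but-incapable penalty
    score -= 100.0 * collisions / n_scenarios  # dangerous-failure penalty
    return max(0.0, score)

# "Safe but incapable" (OpenVLA-like: zero collisions) versus
# "incapable and dangerous" (Octo-like: ~20% collisions).
safe = confidence_score(68, grasp_miss=68, collisions=0)
dangerous = confidence_score(68, grasp_miss=54, collisions=14)
print(safe, dangerous)
```

Any rule with this ordering reproduces the qualitative ranking in the table: identical 0% SR, yet the collision-free models land above the collision-prone ones.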
Models currently being integrated for evaluation:
- NVIDIA · 2B · Predict 2.5 · LIBERO 98.33% SOTA
- NVIDIA · 3B · LIBERO_PANDA · Isaac-GR00T API
- NVIDIA · 3B · libero_{10,goal,object,spatial} · Apache 2.0
- NVIDIA · DreamZero · Late 2026
On April 14, 2026, NVIDIA released Ising — an open AI model family for quantum computing — and framed it explicitly as "the control plane for quantum machines." The same pattern applies to Physical AI.
| | NVIDIA Ising (Quantum) | RoboGate (Physical AI) |
|---|---|---|
| Domain | Quantum Computing | Physical AI |
| Hardware noise | ~10⁻³ qubit errors | Physics sim-to-real gap |
| Validation method | 35B VLM + 3D CNN | 68-scenario benchmark |
| Benchmark | QCalEval (6 tests) | RoboGate Bench (68) |
| Key finding | 2.5× faster decoding | 97.65% → 0% gap |
| Target integration | CUDA-Q + NVQLink | Isaac Sim + Arena |
| Release | HF + GitHub (open) | HF + GitHub (open) |
*RoboGate is not affiliated with NVIDIA. This comparison illustrates a structural parallel: both serve as AI-based validation layers for fundamentally noisy systems. NVIDIA Ising is a trademark of NVIDIA Corporation.*
If you use this VLA benchmark in your research:
@misc{kim2026robogate,
  title  = {ROBOGATE: Adaptive Failure Discovery for Safe Robot Policy Deployment via Two-Stage Boundary-Focused Sampling},
  author = {{AgentAI Co., Ltd.}},
  year   = {2026},
  doi    = {10.5281/zenodo.19166967},
  url    = {https://robogate.io/paper},
  note   = {VLA Benchmark: 8 VLA models 0/68. Cross-simulator gap: GR00T N1.6 LIBERO 97.65\% → RoboGate 0\%. PI0, SmolVLA, OpenVLA, Octo-Base, Octo-Small also 0/68}
}

RoboGate's 68-scenario suite is open-source. Run your VLA model against the same adversarial conditions.
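Running a model against the suite reduces to an outcome loop over the 68 scenarios. A minimal harness sketch, where `run_scenario` is a hypothetical stand-in for a real Isaac Sim rollout:

```python
from collections import Counter

def run_scenario(policy, scenario_id):
    """Stand-in for one Isaac Sim rollout. A real harness would step the
    simulator with the policy's actions and classify the outcome."""
    return policy(scenario_id)  # -> "success", "grasp_miss", or "collision"

def evaluate(policy, n_scenarios=68):
    """Aggregate success rate and failure profile over the suite."""
    outcomes = Counter(run_scenario(policy, i) for i in range(n_scenarios))
    sr = outcomes["success"] / n_scenarios
    return sr, outcomes

# A trivially failing policy reproduces the 0/68 pattern from the table.
def always_miss(scenario_id):
    return "grasp_miss"

sr, outcomes = evaluate(always_miss)
print(f"SR = {sr:.1%}, failures = {dict(outcomes)}")
```

Separating outcome classification from aggregation makes the same loop reusable for the scripted baseline and every VLA under test.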