| arXiv | ✅ Endorsed — cs.RO (Submitting) |
| Zenodo | ✅ DOI: 10.5281/zenodo.19166967 |
| SSRN | ✅ Abstract ID: 6455499 |
Deploying learned robot manipulation policies in industrial settings requires rigorous pre-deployment validation, yet exhaustive testing across high-dimensional parameter spaces is intractable. We present RoboGate, a deployment risk management framework that combines physics-based simulation with a two-stage adaptive sampling strategy to efficiently discover failure boundaries in the operational parameter space. Stage 1 employs Latin Hypercube Sampling (LHS) across an 8-dimensional parameter space to establish a coarse failure landscape from 20,000 uniformly distributed experiments. Stage 2 applies boundary-focused sampling that concentrates 10,000 additional experiments in the 30–70% success rate transition zone, enabling precise failure boundary mapping. Using NVIDIA Isaac Sim with Newton physics, we evaluate a scripted pick-and-place controller on two robot embodiments—Franka Panda (7-DOF) and UR5e (6-DOF)—across 30,000 total experiments. Our logistic regression risk model achieves an AUC of 0.780 on the combined dataset (vs. 0.754 for Stage 1 alone), identifies a closed-form failure boundary equation μ*(m) = (1.469 + 0.419m)/(3.691 - 1.400m), and reveals four universal danger zones affecting both robot platforms. We further demonstrate the framework on VLA (Vision-Language-Action) model evaluation, where Octo-Small achieves 0.0% success rate on 68 scenarios versus 100% for the scripted baseline—a 100-point gap that underscores the challenge of deploying foundation models in industrial settings. RoboGate is open-source and runs on a single GPU workstation.
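The two-stage strategy described above can be sketched in plain Python. This is a minimal illustration, not the RoboGate implementation: the function names, the parameter bounds, the Gaussian jitter used for Stage 2, and `toy_success_rate` are all assumptions made for the example. In the real framework the success rate would come from simulated outcomes rather than from a closed-form toy function.

```python
import math
import random


def latin_hypercube(n_samples, bounds, rng):
    """Stage 1: split each dimension into n_samples equal strata and
    use every stratum exactly once per dimension."""
    columns = []
    for low, high in bounds:
        # One uniform draw inside each stratum of the unit interval.
        unit = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # Shuffle independently per dimension to decorrelate the axes.
        rng.shuffle(unit)
        columns.append([low + u * (high - low) for u in unit])
    return [tuple(col[i] for col in columns) for i in range(n_samples)]


def boundary_focused(points, success_rate, bounds, n_new, rng,
                     zone=(0.30, 0.70), jitter=0.05):
    """Stage 2 (assumed scheme): perturb Stage 1 points whose estimated
    success rate falls inside the transition zone."""
    seeds = [p for p in points if zone[0] <= success_rate(p) <= zone[1]]
    if not seeds:
        return []
    new_points = []
    for _ in range(n_new):
        base = rng.choice(seeds)
        # Gaussian jitter scaled to each axis range, clipped to the bounds.
        new_points.append(tuple(
            min(high, max(low, x + rng.gauss(0.0, jitter * (high - low))))
            for x, (low, high) in zip(base, bounds)
        ))
    return new_points


def toy_success_rate(point):
    """Hypothetical stand-in for an estimated success rate; it depends
    only on the first two coordinates (friction, mass)."""
    friction, mass = point[0], point[1]
    logit = 4.0 * (friction - 0.5) - 2.0 * (mass - 1.0)
    return 1.0 / (1.0 + math.exp(-logit))


# Assumed bounds: friction, mass in kg, and six unnamed normalized parameters.
BOUNDS = [(0.1, 1.0), (0.1, 2.0)] + [(0.0, 1.0)] * 6

rng = random.Random(0)
stage1 = latin_hypercube(200, BOUNDS, rng)
stage2 = boundary_focused(stage1, toy_success_rate, BOUNDS, 100, rng)
print(len(stage1), len(stage2))
```

The sample counts here are kept small so the sketch runs instantly; the abstract's 20,000 and 10,000 experiments would be the corresponding `n_samples` and `n_new` values.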
Main results from 50,000+ Isaac Sim experiments across 4 robots
Combined two-stage model reaches AUC 0.780, outperforming Stage 1 alone (0.754) by 3.4% in relative terms (+0.026 absolute)
Closed-form: μ*(m) = (1.469 + 0.419m) / (3.691 - 1.400m) separates PASS/FAIL regions
Strongest interaction effect — failure cascades from timeout → collision → grasp miss
Four danger zones where both robots fail: mass > 0.93 kg, friction < 0.492, the friction × mass interaction, and mass > 1.8 kg
UR5e suction gripper eliminates drop failures entirely vs. Franka parallel-jaw
All 4 VLA models (GR00T N1.6, OpenVLA, Octo-Base, Octo-Small) achieve 0.0% SR on all 68 scenarios vs. 100% scripted baseline — 100-point gap
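The closed-form boundary listed above can be evaluated directly. The sketch below is illustrative only: the function names are hypothetical, and treating friction above μ*(m) as the PASS side is an assumption, since the text does not say which side is which (it is at least consistent with low friction being listed as a danger zone). Note that the expression has a pole at m = 3.691 / 1.400, roughly 2.64 kg, so the sketch refuses to evaluate it at or beyond that mass.

```python
def friction_boundary(mass_kg):
    """Critical friction mu*(m) = (1.469 + 0.419 m) / (3.691 - 1.400 m)."""
    denominator = 3.691 - 1.400 * mass_kg
    if denominator <= 0:
        # At or past the pole the ratio is no longer a usable threshold.
        raise ValueError("mass is at or beyond the pole of the boundary formula")
    return (1.469 + 0.419 * mass_kg) / denominator


def predicts_pass(friction, mass_kg):
    """ASSUMPTION: PASS is the region with friction above the boundary."""
    return friction > friction_boundary(mass_kg)


for mass in (0.25, 0.5, 1.0, 1.5):
    print(f"m = {mass:.2f} kg -> mu* = {friction_boundary(mass):.3f}")
```

Under this reading, the required friction grows with mass: about 0.40 for a near-massless object and about 0.82 at 1.0 kg.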
Evaluated with real Octo-Small inference via ZMQ pipeline
| Model | Octo-Small (27M params) |
| Pipeline | Isaac Sim ↔ ZMQ ↔ Octo (JAX) |
| Result | 0/68 passed (0.0% SR) |
| Failures | grasp_miss 79.4%, collision 20.6% |
| Confidence | 1/100 (CRITICAL) |
| vs Scripted | 68/68 (100%), a 100-point gap |
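As a quick sanity check, the failure percentages can be converted back into episode counts. The counts below are inferred from the reported percentages and the 68-scenario total, not taken from the source, so treat them as implied values.

```python
TOTAL_SCENARIOS = 68
reported_pct = {"grasp_miss": 79.4, "collision": 20.6}

# Nearest whole number of episodes for each reported percentage.
implied_counts = {
    mode: round(pct / 100 * TOTAL_SCENARIOS)
    for mode, pct in reported_pct.items()
}

# Round-trip: recompute the percentages from the implied counts.
recomputed_pct = {
    mode: round(100 * count / TOTAL_SCENARIOS, 1)
    for mode, count in implied_counts.items()
}

print(implied_counts)
print(recomputed_pct)
```

The implied counts sum to 68 and reproduce the reported percentages to one decimal place, so the two failure modes account for every scenario.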
If you use RoboGate in your research, please cite:
@misc{kim2026robogate,
title = {ROBOGATE: Adaptive Failure Discovery for Safe Robot
Policy Deployment via Two-Stage Boundary-Focused
Sampling},
author = {{AgentAI Co., Ltd.}},
year = {2026},
eprint = {2603.XXXXX},
archivePrefix = {arXiv},
primaryClass = {cs.RO},
doi = {10.5281/zenodo.19166967},
url = {https://robogate.io/paper}
}