RESEARCH PAPER

RoboGate: Adaptive Failure Discovery for Safe Robot Policy Deployment via Two-Stage Boundary-Focused Sampling

AgentAI Co., Ltd.|March 2026|DOI: 10.5281/zenodo.19166967


Publication Status

arXiv	✅ Published — cs.RO (2603.22126) v4 — 2026-04-20
Zenodo	✅ DOI: 10.5281/zenodo.19166967
SSRN	✅ Submitted — Under Review (Abstract ID: 6455499)

Abstract

Deploying learned robot manipulation policies in industrial settings requires rigorous pre-deployment validation, yet exhaustive testing across high-dimensional parameter spaces is intractable. We present RoboGate, a deployment risk management framework that combines physics-based simulation with a two-stage adaptive sampling strategy to efficiently discover failure boundaries in the operational parameter space. Stage 1 employs Latin Hypercube Sampling (LHS) across an 8-dimensional parameter space to establish a coarse failure landscape from 20,000 uniformly distributed experiments. Stage 2 applies boundary-focused sampling that concentrates 10,000 additional experiments in the 30–70% success rate transition zone, enabling precise failure boundary mapping. Using NVIDIA Isaac Sim with Newton physics, we evaluate a scripted pick-and-place controller on two robot embodiments—Franka Panda (7-DOF) and UR5e (6-DOF)—across 30,000 total experiments. Our logistic regression risk model achieves an AUC of 0.780 on the combined dataset (vs. 0.754 for Stage 1 alone), identifies a closed-form failure boundary equation μ*(m) = (1.469 + 0.419m)/(3.691 - 1.400m), and reveals four universal danger zones affecting both robot platforms. We further demonstrate the framework on VLA (Vision-Language-Action) model evaluation, where Octo-Small achieves 0.0% success rate on 68 scenarios versus 83.8% (57/68) for the scripted baseline (100% on nominal scenarios)—an 84-point gap that underscores the challenge of deploying foundation models in industrial settings. RoboGate is open-source and runs on a single GPU workstation.
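The two-stage strategy described in the abstract can be sketched in plain Python. This is a minimal illustration, not the RoboGate implementation. The unit-hypercube parameterization, the Gaussian jitter width `sigma`, the scaled-down sample counts, and the `toy_success_rate` stand-in for simulated outcomes are all assumptions introduced here.

```python
import random

def latin_hypercube(n_samples, n_dims, rng):
    """Return n_samples points in the unit cube, one per stratum in every dimension."""
    columns = []
    for _ in range(n_dims):
        # One draw inside each stratum [k/n, (k+1)/n), then shuffle the strata.
        col = [(k + rng.random()) / n_samples for k in range(n_samples)]
        rng.shuffle(col)
        columns.append(col)
    return [tuple(col[i] for col in columns) for i in range(n_samples)]

def boundary_focused(seeds, success_rate, n_new, rng, lo=0.3, hi=0.7, sigma=0.05):
    """Draw n_new points near seeds whose estimated success rate lies in [lo, hi]."""
    transition = [p for p in seeds if lo <= success_rate(p) <= hi]
    if not transition:
        return []
    new_points = []
    for _ in range(n_new):
        base = rng.choice(transition)
        # Gaussian jitter around the seed, clipped back into the unit cube.
        new_points.append(
            tuple(min(1.0, max(0.0, x + rng.gauss(0.0, sigma))) for x in base)
        )
    return new_points

def toy_success_rate(point):
    # Hypothetical stand-in for a simulated success rate:
    # success falls linearly as the first coordinate grows.
    return 1.0 - point[0]

rng = random.Random(0)
stage1 = latin_hypercube(200, 8, rng)                           # coarse coverage
stage2 = boundary_focused(stage1, toy_success_rate, 100, rng)   # transition zone
print(len(stage1), len(stage2))
```

In this toy setup the Stage 2 points cluster where the first coordinate is between roughly 0.3 and 0.7, which mirrors the idea of spending the second budget on the 30–70% transition zone rather than on regions that already pass or fail reliably.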

Keywords: robot safety, deployment validation, failure analysis, adaptive sampling, sim-to-real, VLA evaluation

Key Findings

Main results from 50,000+ Isaac Sim experiments across 4 robots

Risk Model AUC: 0.780

The combined two-stage model outperforms the Stage 1 model alone (0.754) by 0.026 AUC, a 3.4% relative improvement

Boundary Equation: μ*(m)

Closed-form: μ*(m) = (1.469 + 0.419m) / (3.691 - 1.400m) separates PASS/FAIL regions

friction × mass: z = -10.00

Strongest interaction effect — failure cascades from timeout → collision → grasp miss

Universal Danger Zones: 4

mass > 0.93 kg, friction < 0.492, friction × mass interaction, mass > 1.8 kg → both robots fail

Suction Gripper: 0% drop

UR5e suction gripper eliminates drop failures entirely vs. Franka parallel-jaw

VLA vs Scripted: 0% vs 83.8%

Every VLA evaluated with real inference achieves 0.0% SR on all 68 scenarios — including 0% on nominal — vs. the scripted IK baseline's 83.8% (57/68, 100% on nominal). The VLAs fail even where scripted succeeds perfectly.
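The closed-form boundary listed among the findings is easy to evaluate directly. The sketch below copies the four coefficients from the text and assumes m is the object mass in kg, as in the danger-zone thresholds. The `assumed_logit` function is purely an assumption made here: the page does not say how the equation was derived, but a logistic model with a friction × mass interaction term has exactly this kind of 50% boundary, so the snippet checks that reading numerically.

```python
# Coefficients copied from the closed-form boundary equation in the text.
A, B = 1.469, 0.419      # numerator:   A + B*m
C, D = 3.691, 1.400      # denominator: C - D*m
POLE_MASS = C / D        # the denominator vanishes here (about 2.64)

def mu_star(m):
    """Boundary friction mu*(m) = (A + B*m) / (C - D*m)."""
    if m >= POLE_MASS:
        raise ValueError("mu*(m) is undefined at or beyond the pole of the equation")
    return (A + B * m) / (C - D * m)

def assumed_logit(mu, m):
    """ASSUMPTION (not from the source): an interaction logit whose zero
    level set reproduces mu*(m), determined only up to a common scale factor."""
    return -A + C * mu - B * m - D * mu * m

for mass in (0.0, 0.5, 1.0, 1.5):
    mu = mu_star(mass)
    # The assumed logit is zero (p = 0.5) on the boundary, up to rounding.
    print(f"m = {mass:.1f}  mu* = {mu:.3f}  logit = {assumed_logit(mu, mass):+.1e}")
```

Under these coefficients the boundary friction rises with mass, from roughly 0.40 at m = 0 to roughly 0.82 at m = 1, and the expression stops being meaningful as m approaches about 2.64, where the denominator reaches zero.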

VLA Evaluation (Real Results)

Evaluated with real Octo-Small inference through a ZMQ pipeline

Model	Octo-Small (27M params)
Pipeline	Isaac Sim ↔ ZMQ ↔ Octo (JAX)
Result	0/68 passed (0.0% SR)
Failures	grasp_miss 79.4%, collision 20.6%
Confidence	1/100 (CRITICAL)
vs Scripted	57/68 (83.8%); VLAs 0/68, an 84-point gap
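The reported percentages can be cross-checked with a few lines of arithmetic. The pass counts (57/68 and 0/68) come from the text. The per-category failure counts are not given in the source and are inferred here from the reported shares, so treat them as an assumption.

```python
TOTAL = 68           # scenarios per evaluation, from the text
SCRIPTED_PASS = 57   # scripted baseline passes, from the text
VLA_PASS = 0         # Octo-Small passes, from the text

def pct(count, total):
    """Percentage rounded to one decimal place, as reported on the page."""
    return round(100.0 * count / total, 1)

scripted_sr = pct(SCRIPTED_PASS, TOTAL)
vla_sr = pct(VLA_PASS, TOTAL)
gap_points = round(scripted_sr - vla_sr)

# ASSUMPTION: failure counts inferred from the reported 79.4% / 20.6% shares.
grasp_miss = round(0.794 * TOTAL)
collision = round(0.206 * TOTAL)

print(scripted_sr, vla_sr, gap_points)
print(grasp_miss, collision, pct(grasp_miss, TOTAL), pct(collision, TOTAL))
```

With all 68 VLA runs failing, the inferred split of 54 grasp misses and 14 collisions accounts for every scenario and reproduces the 79.4% and 20.6% shares.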

Citation

If you use RoboGate in your research, please cite:

@misc{kim2026robogate,
  title         = {ROBOGATE: Adaptive Failure Discovery for Safe Robot
                   Policy Deployment via Two-Stage Boundary-Focused
                   Sampling},
  author        = {{AgentAI Co., Ltd.}},
  year          = {2026},
  eprint        = {2603.22126},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  doi           = {10.5281/zenodo.19166967},
  url           = {https://robogate.io/paper}
}