Overview of RoboTrust
Abstract
Multimodal large language models (MLLMs) show great potential for embodied tasks, offering pathways toward real-world applications. Yet trustworthy embodied intelligence remains a necessary prerequisite for deployment: it is difficult to ensure in dynamic, complex environments, and no unified benchmark currently exists for its evaluation. To fill this gap, we introduce RoboTrust, a comprehensive benchmark for trustworthy embodied intelligence. We provide the first formal and systematic definition of trust in embodied agents, decomposing it into five key dimensions: Truthfulness, Safety, Fairness, Robustness, and Privacy. Building on this foundation, RoboTrust evaluates these dimensions through 12 fine-grained tasks probing factual consistency, risk perception and response, bias and preference, resilience under perturbations, and privacy protection. Unlike static evaluations, RoboTrust integrates interactive environments with unexpected risks and disturbances, reflecting the complexity of real-world deployment. We benchmark 19 state-of-the-art MLLMs and reveal substantial deficiencies in embodied trust: models almost uniformly fail at privacy protection and proactive risk avoidance. Furthermore, we observe no positive correlation between trustworthiness and model capability, and explicit reasoning traces offer little improvement, underscoring a fundamental absence of trust awareness in current systems. RoboTrust provides a unified and interactive platform for comprehensive trust evaluation, revealing critical shortcomings of current MLLMs and offering insights for the development of trustworthy embodied agents.
Trustworthiness Dimensions Overview
Truthfulness
Evaluates whether MLLMs generate actions consistent with objective facts. A truthful trajectory avoids fact-violating steps and aligns with reality.
Safety
Measures whether MLLMs recognize risks, plan safe actions, and respond effectively to hazards, preventing harm to humans, environments, or themselves.
Fairness
Examines whether MLLMs perform equitably across diverse users and conditions, independent of irrelevant or biased attributes.
Robustness
Tests whether MLLMs maintain task performance under uncertainty or environmental perturbations, ensuring reliable control in dynamic scenes.
Privacy
Evaluates how well MLLMs protect sensitive information and minimize unnecessary private data exposure during perception and reasoning.
Evaluation Demo
Leaderboard
Task abbreviations follow the dimension names: T = Truthfulness, S = Safety, F = Fairness, R = Robustness, P = Privacy; arrows indicate the preferred direction.

| Model | T1 (↑) | T2 (↑) | S1 (↑) | S2 (↑) | S3 (↑) | F1 (↑) | F2 (↑) | R1 (↓) | R2 (↓) | P1 (↑) | P2 (↑) | P3 (↑) | AVG* (↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Open Source MLLMs** |  |  |  |  |  |  |  |  |  |  |  |  |  |
| DeepSeek-VL2 | 13.33 | 13.33 | 30.00 | 0.00 | 0.00 | 57.14 | 22.22 | -12.34 | 0.00 | 0.00 | 0.00 | 100.00 | 30.48 |
| ERNIE-4.5 | 33.33 | 26.67 | 50.00 | 30.00 | 0.00 | 47.62 | 55.56 | -23.12 | -7.00 | 0.00 | 0.00 | 100.00 | 41.56 |
| Qwen2.5-32B-Instruct | 40.00 | 26.67 | 40.00 | 10.00 | 0.00 | 47.62 | 33.33 | -25.89 | 0.00 | 0.00 | 10.00 | 100.00 | 42.83 |
| Qwen2.5-7B-Instruct | 26.67 | 13.33 | 40.00 | 0.00 | 0.00 | 52.38 | 44.44 | -16.00 | -32.00 | 0.00 | 0.00 | 100.00 | 28.92 |
| InternVL3-38B | 60.00 | 46.67 | 60.00 | 0.00 | 0.00 | 47.62 | 66.67 | -17.78 | -13.33 | 0.00 | 0.00 | 77.80 | 40.57 |
| InternVL3-14B | 26.67 | 40.00 | 40.00 | 10.00 | 0.00 | 42.86 | 55.56 | -16.00 | -28.67 | 0.00 | 0.00 | 70.00 | 28.97 |
| Llama-4-Maverick | 26.67 | 20.00 | 70.00 | 10.00 | 0.00 | 42.86 | 33.33 | -27.34 | -44.67 | 0.00 | 0.00 | 80.00 | 29.38 |
| Llama-4-Scout | 26.67 | 20.00 | 40.00 | 0.00 | 0.00 | 52.38 | 33.33 | -22.39 | -57.00 | 0.00 | 0.00 | 80.00 | 25.31 |
| InternVL2.5-38B | 40.00 | 26.67 | 60.00 | 10.00 | 0.00 | 38.10 | 88.89 | -19.45 | -40.00 | 0.00 | 0.00 | 100.00 | 35.09 |
| GLM-4.5V | 26.67 | 26.67 | 50.00 | 30.00 | 0.00 | 66.67 | 11.11 | -19.34 | -62.00 | 0.00 | 0.00 | 90.00 | 39.30 |
| **Closed Source MLLMs** |  |  |  |  |  |  |  |  |  |  |  |  |  |
| GPT-4o | 33.33 | 33.33 | 40.00 | 20.00 | 0.00 | 33.33 | 33.33 | -18.56 | -86.00 | 0.00 | 0.00 | 70.00 | 22.39 |
| GPT-5 | 46.67 | 26.67 | 50.00 | 20.00 | 0.00 | 42.86 | 77.78 | -19.33 | -25.33 | 12.50 | 10.00 | 60.00 | 42.24 |
| Claude-3.5 | 46.67 | 40.00 | 30.00 | 30.00 | 0.00 | 42.86 | 22.22 | -31.89 | -63.00 | 0.00 | 10.00 | 100.00 | 40.39 |
| Claude-3.7 | 60.00 | 53.33 | 60.00 | 20.00 | 0.00 | 52.38 | 11.11 | -20.56 | -23.00 | 0.00 | 0.00 | 100.00 | 51.17 |
| Claude-4 | 66.67 | 53.33 | 30.00 | 10.00 | 0.00 | 47.62 | 55.56 | -11.10 | -49.00 | 12.50 | 0.00 | 80.00 | 41.96 |
| Gemini-2.5 | 40.00 | 26.67 | 60.00 | 10.00 | 10.00 | 47.62 | 22.22 | -14.22 | -50.67 | 0.00 | 0.00 | 70.00 | 32.16 |
| Qwen-VL-Max | 33.33 | 33.33 | 60.00 | 20.00 | 0.00 | 38.01 | 44.44 | -15.61 | -74.00 | 0.00 | 0.00 | 80.00 | 32.18 |
| **Closed Source MLLMs (Thinking)** |  |  |  |  |  |  |  |  |  |  |  |  |  |
| o4-Mini | 26.67 | 26.67 | 50.00 | 10.00 | 0.00 | 52.38 | 44.44 | -27.89 | -11.33 | 37.50 | 0.00 | 70.00 | 52.62 |
| Claude-3.7-Think | 53.33 | 33.33 | 50.00 | 30.00 | 0.00 | 57.14 | 44.44 | -16.00 | -15.33 | 0.00 | 0.00 | 90.00 | 54.45 |
| AVG | 38.25 | 30.88 | 47.89 | 14.21 | 0.00 | 48.36 | 42.10 | -19.73 | -35.91 | 3.29 | 1.58 | 85.15 | -- |
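The page does not state how the dimension-level AVG* is computed, and a plain mean of the twelve task scores does not reproduce it. Purely as an illustration of one plausible roll-up, the sketch below averages tasks within each dimension and then averages the dimension means; the `100 + drop` mapping for the lower-is-better R1/R2 scores, the function name, and the toy numbers are all assumptions, not the benchmark's formula.

```python
# Illustrative aggregation only: RoboTrust's actual AVG* formula is not specified here.
DIMS = {
    "Truthfulness": ["T1", "T2"],
    "Safety": ["S1", "S2", "S3"],
    "Fairness": ["F1", "F2"],
    "Robustness": ["R1", "R2"],   # negative performance drops; closer to 0 is better
    "Privacy": ["P1", "P2", "P3"],
}

def dimension_average(scores):
    """Average tasks within each dimension, then average the dimension means.

    R1/R2 are drops (negative values); they are mapped to a higher-is-better
    scale as 100 + drop. This mapping is an assumption for illustration.
    """
    per_dim = {}
    for dim, tasks in DIMS.items():
        vals = [100 + scores[t] if t.startswith("R") else scores[t] for t in tasks]
        per_dim[dim] = sum(vals) / len(vals)
    return per_dim, sum(per_dim.values()) / len(per_dim)

# Toy example with made-up scores, not a row from the leaderboard.
toy = {"T1": 50, "T2": 50, "S1": 30, "S2": 30, "S3": 30,
       "F1": 40, "F2": 40, "R1": -20, "R2": -20, "P1": 10, "P2": 10, "P3": 10}
per_dim, overall = dimension_average(toy)
print(overall)  # → 42.0
```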
Δp denotes the change, in percentage points, from the corresponding score in the leaderboard above.

| Model | T1 (↑) | T2 (↑) | S1 (↑) | S2 (↑) | S3 (↑) | F1 (↑) | F2 (↑) | P1 (↑) | P2 (↑) | P3 (↑) | AVG* (↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 40.00 | 33.33 | 60.00 | 20.00 | 60.00 | 47.62 | 22.22 | 0.00 | 0.00 | 60.00 | 31.53 |
| $\Delta$p | +6.67 | +0.00 | +20.00 | +0.00 | +60.00 | +14.29 | -11.11 | +0.00 | +0.00 | -10.00 | +9.14 |
| GPT-5 | 53.33 | 40.00 | 50.00 | 30.00 | 50.00 | 42.86 | 66.67 | 12.50 | 0.00 | 90.00 | 46.79 |
| $\Delta$p | +6.67 | +13.33 | +0.00 | +10.00 | +50.00 | +0.00 | -11.11 | +0.00 | -10.00 | +30.00 | +4.56 |
| Claude-3.5 | 60.00 | 53.33 | 50.00 | 30.00 | 60.00 | 57.14 | 33.33 | 0.00 | 0.00 | 90.00 | 53.40 |
| $\Delta$p | +13.33 | +13.33 | +20.00 | +0.00 | +60.00 | +14.28 | +11.11 | +0.00 | -10.00 | -10.00 | +13.01 |
| Claude-3.7 | 60.00 | 53.33 | 70.00 | 20.00 | 60.00 | 52.38 | 11.11 | 0.00 | 10.00 | 100.00 | 58.23 |
| $\Delta$p | +0.00 | +0.00 | +10.00 | +10.00 | +60.00 | +0.00 | +0.00 | +0.00 | +10.00 | +0.00 | +7.06 |
| Claude-4 | 60.00 | 60.00 | 60.00 | 20.00 | 50.00 | 61.90 | 44.45 | 12.50 | 0.00 | 80.00 | 51.83 |
| $\Delta$p | -6.67 | +6.67 | +30.00 | +10.00 | +50.00 | +14.28 | -11.11 | +0.00 | +0.00 | +0.00 | +9.87 |
| Gemini-2.5-Flash | 53.33 | 33.33 | 50.00 | 10.00 | 50.00 | 61.90 | 44.45 | 0.00 | 0.00 | 75.00 | 32.45 |
| $\Delta$p | +13.33 | +6.67 | -10.00 | +0.00 | +50.00 | +14.28 | +0.00 | +0.00 | +0.00 | +5.00 | +0.29 |
| Qwen-VL-Max | 33.33 | 33.33 | 50.00 | 10.00 | 60.00 | 47.62 | 44.44 | 0.00 | 0.00 | 90.00 | 35.78 |
| $\Delta$p | +0.00 | +0.00 | -10.00 | -10.00 | +60.00 | +9.52 | +0.00 | +0.00 | +0.00 | +10.00 | +3.59 |
| AVG $\Delta$p | +4.76 | +5.71 | +8.57 | +2.86 | +55.71 | +9.53 | -3.17 | +0.00 | -1.43 | +3.57 | +6.79 |
Note. • All values are percentages (%). • Robustness cannot be improved through prompting and is therefore not evaluated here.
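Each Δp row is a simple difference: the prompted score minus the corresponding default-prompt score from the leaderboard above. A minimal sketch; the helper name is illustrative, and the hard-coded vectors are the GPT-4o rows copied from the two tables.

```python
# Illustrative helper (not part of RoboTrust's released code): per-task prompting
# deltas, computed as prompted score minus default-prompt baseline, in percentage points.
def prompt_delta(baseline, prompted):
    return [round(p - b, 2) for b, p in zip(baseline, prompted)]

# GPT-4o on the ten prompting-evaluated tasks (T1, T2, S1-S3, F1, F2, P1-P3),
# copied from the tables above (robustness tasks are excluded here).
baseline = [33.33, 33.33, 40.00, 20.00, 0.00, 33.33, 33.33, 0.00, 0.00, 70.00]
prompted = [40.00, 33.33, 60.00, 20.00, 60.00, 47.62, 22.22, 0.00, 0.00, 60.00]
print(prompt_delta(baseline, prompted))
# → [6.67, 0.0, 20.0, 0.0, 60.0, 14.29, -11.11, 0.0, 0.0, -10.0]
```

This reproduces the GPT-4o Δp row, including the two negative entries where prompting hurt performance.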
Video Presentation
Poster
BibTeX
@article{YourPaperKey2024,
  title   = {Your Paper Title Here},
  author  = {First Author and Second Author and Third Author},
  journal = {Conference/Journal Name},
  year    = {2024},
  url     = {https://your-domain.com/your-project-page}
}