RoboTrust: Evaluating the Interaction Trustworthiness of Multi-modal Large Language Models in Embodied Agents

1Huazhong University of Science and Technology, 2Lehigh University

*Indicates Equal Contribution

Overview of RoboTrust


Abstract

Multimodal large language models (MLLMs) show great potential for embodied tasks, offering pathways toward real-world applications. Yet trustworthy embodied intelligence, which is difficult to ensure in dynamic and complex environments, remains a necessary prerequisite, and no unified benchmark currently exists for its evaluation. To fill this gap, we introduce RoboTrust, a comprehensive benchmark for trustworthy embodied intelligence. We provide the first formal and systematic definition of trust in embodied agents, decomposing it into five key dimensions—Truthfulness, Safety, Fairness, Robustness, and Privacy. Building on this foundation, RoboTrust evaluates these dimensions through 12 fine-grained tasks probing factual consistency, risk perception and response, bias and preference, resilience under perturbations, and privacy protection. Unlike static evaluations, RoboTrust integrates interactive environments with unexpected risks and disturbances, reflecting the complexity of real-world deployment. We benchmark 19 state-of-the-art MLLMs and reveal substantial deficiencies in embodied trust, with models almost uniformly failing on privacy protection and proactive risk avoidance. Furthermore, we observe no positive correlation between trustworthiness and model capability, and explicit reasoning traces offer little improvement, underscoring a fundamental absence of trust awareness in current systems. RoboTrust provides a unified and interactive platform for comprehensive trust evaluation, revealing critical shortcomings of current MLLMs and offering valuable insights for the development of trustworthy embodied agents.

Leaderboard

Comprehensive evaluation of MLLMs on RoboTrust. All values are percentages (%). ↑ indicates higher is better; ↓ indicates lower is better. Within each column, bold marks the best and underline the worst result across models.
Model T1 (↑) T2 (↑) S1 (↑) S2 (↑) S3 (↑) F1 (↑) F2 (↑) R1 (↓) R2 (↓) P1 (↑) P2 (↑) P3 (↑) AVG* (↑)
(Truthfulness: T1–T2 · Safety: S1–S3 · Fairness: F1–F2 · Robustness: R1–R2 · Privacy: P1–P3)
Open Source MLLMs
DeepSeek-VL2 13.33 13.33 30.00 0.00 0.00 57.14 22.22 -12.34 0.00 0.00 0.00 100.00 30.48
ERNIE-4.5 33.33 26.67 50.00 30.00 0.00 47.62 55.56 -23.12 -7.00 0.00 0.00 100.00 41.56
Qwen2.5-32B-Instruct 40.00 26.67 40.00 10.00 0.00 47.62 33.33 -25.89 0.00 0.00 10.00 100.00 42.83
Qwen2.5-7B-Instruct 26.67 13.33 40.00 0.00 0.00 52.38 44.44 -16.00 -32.00 0.00 0.00 100.00 28.92
InternVL3-38B 60.00 46.67 60.00 0.00 0.00 47.62 66.67 -17.78 -13.33 0.00 0.00 77.80 40.57
InternVL3-14B 26.67 40.00 40.00 10.00 0.00 42.86 55.56 -16.00 -28.67 0.00 0.00 70.00 28.97
Llama-4-Maverick 26.67 20.00 70.00 10.00 0.00 42.86 33.33 -27.34 -44.67 0.00 0.00 80.00 29.38
Llama-4-Scout 26.67 20.00 40.00 0.00 0.00 52.38 33.33 -22.39 -57.00 0.00 0.00 80.00 25.31
InternVL2.5-38B 40.00 26.67 60.00 10.00 0.00 38.10 88.89 -19.45 -40.00 0.00 0.00 100.00 35.09
GLM-4.5V 26.67 26.67 50.00 30.00 0.00 66.67 11.11 -19.34 -62.00 0.00 0.00 90.00 39.30
Closed Source MLLMs
GPT-4o 33.33 33.33 40.00 20.00 0.00 33.33 33.33 -18.56 -86.00 0.00 0.00 70.00 22.39
GPT-5 46.67 26.67 50.00 20.00 0.00 42.86 77.78 -19.33 -25.33 12.50 10.00 60.00 42.24
Claude-3.5 46.67 40.00 30.00 30.00 0.00 42.86 22.22 -31.89 -63.00 0.00 10.00 100.00 40.39
Claude-3.7 60.00 53.33 60.00 20.00 0.00 52.38 11.11 -20.56 -23.00 0.00 0.00 100.00 51.17
Claude-4 66.67 53.33 30.00 10.00 0.00 47.62 55.56 -11.10 -49.00 12.50 0.00 80.00 41.96
Gemini-2.5 40.00 26.67 60.00 10.00 10.00 47.62 22.22 -14.22 -50.67 0.00 0.00 70.00 32.16
Qwen-VL-Max 33.33 33.33 60.00 20.00 0.00 38.01 44.44 -15.61 -74.00 0.00 0.00 80.00 32.18
Closed Source MLLMs (Thinking)
o4-Mini 26.67 26.67 50.00 10.00 0.00 52.38 44.44 -27.89 -11.33 37.50 0.00 70.00 52.62
Claude-3.7-Think 53.33 33.33 50.00 30.00 0.00 57.14 44.44 -16.00 -15.33 0.00 0.00 90.00 54.45
AVG 38.25 30.88 47.89 14.21 0.00 48.36 42.10 -19.73 -35.91 3.29 1.58 85.15 --
Evaluation changes induced by prompting. Δp denotes the change relative to the default setting, where + indicates an increase and − indicates a decrease. Bold marks the best and underline the worst.
Model T1 (↑) T2 (↑) S1 (↑) S2 (↑) S3 (↑) F1 (↑) F2 (↑) P1 (↑) P2 (↑) P3 (↑) AVG* (↑)
(Truthfulness: T1–T2 · Safety: S1–S3 · Fairness: F1–F2 · Privacy: P1–P3)
GPT-4o 40.00 33.33 60.00 20.00 60.00 47.62 22.22 0.00 0.00 60.00 31.53
Δp +6.67 +0.00 +20.00 +0.00 +60.00 +14.29 -11.11 +0.00 +0.00 -10.00 +9.14
GPT-5 53.33 40.00 50.00 30.00 50.00 42.86 66.67 12.50 0.00 90.00 46.79
Δp +6.67 +13.33 +0.00 +10.00 +50.00 +0.00 -11.11 +0.00 -10.00 +30.00 +4.56
Claude-3.5 60.00 53.33 50.00 30.00 60.00 57.14 33.33 0.00 0.00 90.00 53.40
Δp +13.33 +13.33 +20.00 +0.00 +60.00 +14.28 +11.11 +0.00 -10.00 -10.00 +13.01
Claude-3.7 60.00 53.33 70.00 20.00 60.00 52.38 11.11 0.00 10.00 100.00 58.23
Δp +0.00 +0.00 +10.00 +10.00 +60.00 +0.00 +0.00 +0.00 +10.00 +0.00 +7.06
Claude-4 60.00 60.00 60.00 20.00 50.00 61.90 44.45 12.50 0.00 80.00 51.83
Δp -6.67 +6.67 +30.00 +10.00 +50.00 +14.28 -11.11 +0.00 +0.00 +0.00 +9.87
Gemini-2.5-Flash 53.33 33.33 50.00 10.00 50.00 61.90 44.45 0.00 0.00 75.00 32.45
Δp +13.33 +6.67 -10.00 +0.00 +50.00 +14.28 +0.00 +0.00 +0.00 +5.00 +0.29
Qwen-VL-Max 33.33 33.33 50.00 10.00 60.00 47.62 44.44 0.00 0.00 90.00 35.78
Δp +0.00 +0.00 -10.00 -10.00 +60.00 +9.52 +0.00 +0.00 +0.00 +10.00 +3.59
AVG Δp +4.76 +5.71 +8.57 +2.86 +55.71 +9.53 -3.17 +0.00 -1.43 +3.57 +6.79

Note. • All values are percentages (%). • Robustness cannot be improved through prompting and is therefore not evaluated here.
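The tables mix higher-is-better (↑) and lower-is-better (↓) metrics, which is why the robustness columns carry negative values while everything else is on a 0–100 scale. The sketch below illustrates one way such mixed-direction scores could be collapsed into a single average. Note that this is an assumption for illustration only: the exact AVG* weighting used by RoboTrust is not specified on this page, and the `aggregate` function, the 100 − |v| mapping for ↓ metrics, and the task-ID sets are all hypothetical.

```python
# Illustrative aggregation of per-task trust scores into a single average.
# ASSUMPTION: RoboTrust's exact AVG* formula is not given here; this sketch
# uses an unweighted mean in which lower-is-better (↓) metrics are mapped
# onto a 0-100 "higher is better" scale via 100 - |v|.

HIGHER_IS_BETTER = {"T1", "T2", "S1", "S2", "S3", "F1", "F2", "P1", "P2", "P3"}
LOWER_IS_BETTER = {"R1", "R2"}  # performance change under perturbation

def aggregate(scores: dict) -> float:
    """Collapse per-task percentages into one hypothetical average score."""
    adjusted = []
    for task, value in scores.items():
        if task in LOWER_IS_BETTER:
            # Smaller degradation under perturbation -> higher adjusted score.
            adjusted.append(100.0 - abs(value))
        elif task in HIGHER_IS_BETTER:
            adjusted.append(value)
        else:
            raise KeyError(f"unknown task id: {task}")
    return sum(adjusted) / len(adjusted)

# Example: one truthfulness score and two robustness degradations.
print(round(aggregate({"T1": 40.0, "R1": -20.0, "R2": -30.0}), 2))  # -> 63.33
```

Under this convention a model that loses less performance under perturbation (a smaller-magnitude R1/R2) is rewarded, which matches the pattern in the leaderboard where large negative robustness values accompany low AVG* scores.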

Video Presentation

Poster

BibTeX

@article{YourPaperKey2024,
  title={RoboTrust: Evaluating the Interaction Trustworthiness of Multi-modal Large Language Models in Embodied Agents},
  author={First Author and Second Author and Third Author},
  journal={Conference/Journal Name},
  year={2024},
  url={https://your-domain.com/your-project-page}
}