UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

The recent DeepSeek-R1 has showcased the emergence of reasoning capabilities in LLMs through reinforcement learning (RL) with rule-based rewards. Building on this idea, we are the first to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for graphic user interface (GUI) action prediction tasks. To this end, we curate a small yet high-quality dataset of 136 challenging tasks, encompassing five common action types on mobile devices. We also introduce a unified rule-based action reward, enabling model optimization via policy-based algorithms such as Group Relative Policy Optimization (GRPO). Experimental results demonstrate that our proposed data-efficient model, UI-R1-3B, achieves substantial improvements on both in-domain (ID) and out-of-domain (OOD) tasks. Specifically, on the ID benchmark AndroidControl, the action type accuracy improves by 15%, while grounding accuracy increases by 10.3%, compared with the base model (i.e. Qwen2.5-VL-3B). On the OOD GUI grounding benchmark ScreenSpot-Pro, our model surpasses the base model by 6.0% and achieves competitive performance with larger models (e.g., OS-Atlas-7B), which are trained via supervised fine-tuning (SFT) on 76K data. These results underscore the potential of rule-based reinforcement learning to advance GUI understanding and control, paving the way for future research in this domain.

Related collections

Author and article information

Journal

Publication date Created: 27 March 2025

Article

ArXiV ID: 2503.21620

SO-VID: 3fda76da-e6de-417e-901b-ee7bafb69a3e

License:

http://creativecommons.org/licenses/by/4.0/

History

Custom metadata

Categories cs.AI

ScienceOpen disciplines: Artificial intelligence

Data availability:

ScienceOpen disciplines: Artificial intelligence

UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning

Read this article at

Abstract

Related collections

Annual Reviews AI, Machine Learning, and Society

Author and article information

Journal

Article

History

Custom metadata

Comments

Comment on this article

Similar content 324