Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/3662006.3662062acmconferencesArticle/Chapter ViewAbstractPublication PagesmobisysConference Proceedingsconference-collections
short-paper

WiP: A Solution for Reducing MLLM-Based Agent Interaction Overhead

Published: 11 June 2024 Publication History
  • Get Citation Alerts
  • Abstract

    Current Multi-modal LLM-based mobile agents are associated with concerns over high inference time and cost. We propose to tackle these issues by developing a lightweight UI Transition Graph (UTG) and locally executing automatic tasks. Specifically, we build a lightweight HTML-based UTG on both system-level and third-party applications, enabling the avoidance of computational overhead and laboriousness. Then we simplify the interaction phase with the LLM, and perform a local shortest path search on the UTG after a target option is derived from the LLM. The small-scale experiments demonstrate the benefits of our method.

    References

    [1]
    Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. 2017. Rico: A Mobile App Dataset for Building Data-Driven Design Applications. In Proceedings of the 30th Annual Symposium on User Interface Software and Technology (UIST '17).
    [2]
    OpenAI. 2021. ChatGPT. https://openai.com/research/chatgpt.
    [3]
    OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
    [4]
    Oriol Vinyals Rohan Anil, Jeffrey Dean. 2024. Gemini: A Family of Highly Capable Multimodal Models. arXiv:2312.11805 [cs.CL]
    [5]
    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2021. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752 [cs.CV]
    [6]
    Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. 2024. Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception. arXiv preprint arXiv:2401.16158 (2024).
    [7]
    Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. 2024. AutoDroid: LLM-powered Task Automation in Android. arXiv:2308.15272 [cs.AI]
    [8]
    Chenfei Wu, Sheng-Kai Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. 2023. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models. ArXiv abs/2303.04671 (2023).
    [9]
    Jason Wu, Siyan Wang, Siman Shen, Yi-Hao Peng, Jeffrey Nichols, and Jeffrey P Bigham. 2023. WebUI: A Dataset for Enhancing Visual UI Understanding with Web Semantics. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1--14.
    [10]
    Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. 2023. mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration. arXiv:2311.04257 [cs.CL]
    [11]
    Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. 2022. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12104--12113.
    [12]
    Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Qi Zhang. 2024. UFO: A UI-Focused Agent for Windows OS Interaction. arXiv preprint arXiv:2402.07939 (2024).
    [13]
    Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. 2023. AppAgent: Multimodal Agents as Smartphone Users. arXiv:2312.13771 [cs.CV]

    Index Terms

    1. WiP: A Solution for Reducing MLLM-Based Agent Interaction Overhead

        Recommendations

        Comments

        Information & Contributors

        Information

        Published In

        cover image ACM Conferences
        EdgeFM '24: Proceedings of the Workshop on Edge and Mobile Foundation Models
        June 2024
        44 pages
        ISBN:9798400706639
        DOI:10.1145/3662006
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Sponsors

        In-Cooperation

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 11 June 2024

        Permissions

        Request permissions for this article.

        Check for updates

        Author Tags

        1. Artificial Intelligence
        2. Large Language Model
        3. Mobile Agent

        Qualifiers

        • Short-paper
        • Research
        • Refereed limited

        Conference

        MOBISYS '24
        Sponsor:

        Contributors

        Other Metrics

        Bibliometrics & Citations

        Bibliometrics

        Article Metrics

        • 0
          Total Citations
        • 56
          Total Downloads
        • Downloads (Last 12 months)56
        • Downloads (Last 6 weeks)51
        Reflects downloads up to 26 Jul 2024

        Other Metrics

        Citations

        View Options

        Get Access

        Login options

        View options

        PDF

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader

        Media

        Figures

        Other

        Tables

        Share

        Share

        Share this Publication link

        Share on social media