DOI: 10.1145/3641399.3641419
Short Paper

How much SPACE do metrics have in GenAI assisted software development?

Published: 22 February 2024

Abstract

    Large Language Models (LLMs) are changing how developers create software, with natural language prompts increasingly replacing hand-written code as the primary driver. While many initial assessments of such LLMs suggest that they improve developer productivity, other studies have pointed out areas of the Software Development Life Cycle (SDLC) and the developer experience where these tools perform poorly. Many studies are dedicated to evaluating LLM-based AI-assisted software tools, but the lack of standardization across studies and metrics hinders both the adoption of metrics and the reproducibility of results. The primary objective of this survey is to assess recent user studies and surveys that evaluate different aspects of the developer's experience of using code-based LLMs, and to highlight the gaps among them. We leverage the SPACE framework to enumerate and categorise metrics from studies that conducted some form of controlled user experiment. In a Generative AI assisted SDLC, the developer's experience should encompass the ability to perform the task at hand efficiently and effectively, with minimal friction from these LLM tools. Our exploration yields several critical insights: a complete absence of user studies on the collaboration aspects of teams, a bias towards certain LLM models and metrics, and a lack of diversity in metrics within the productivity dimensions. We also propose recommendations to the research community that would bring more conformity to the evaluation of such LLMs.
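
    As a rough illustration of the categorisation step the abstract describes, a minimal Python sketch follows. It assumes each surveyed study is reduced to a list of (dimension, metric) pairs; the five dimension names are the SPACE dimensions from Forsgren et al. (2021), while the example study and its metrics are hypothetical placeholders, not data from this paper.

        # A minimal sketch, assuming each study's metrics are catalogued
        # as (SPACE dimension, metric) pairs. The dimension names come from
        # Forsgren et al. (2021); "Study A" and its metrics are hypothetical
        # illustrations, not findings of this survey.
        from collections import defaultdict

        SPACE_DIMENSIONS = [
            "Satisfaction and well-being",
            "Performance",
            "Activity",
            "Communication and collaboration",
            "Efficiency and flow",
        ]

        def categorise(metrics_by_study):
            """Group each study's metrics into buckets, one per SPACE dimension."""
            buckets = defaultdict(list)
            for study, metrics in metrics_by_study.items():
                for dimension, metric in metrics:
                    if dimension not in SPACE_DIMENSIONS:
                        raise ValueError(f"unknown SPACE dimension: {dimension}")
                    buckets[dimension].append((study, metric))
            return buckets

        # Hypothetical example input: one study measuring suggestion
        # acceptance rate (Efficiency) and perceived usefulness (Satisfaction).
        example = {
            "Study A": [
                ("Efficiency and flow", "suggestion acceptance rate"),
                ("Satisfaction and well-being", "perceived usefulness (survey)"),
            ],
        }
        gaps = [d for d in SPACE_DIMENSIONS if d not in categorise(example)]
        print(gaps)  # dimensions with no metrics expose coverage gaps

    Dimensions whose buckets stay empty point to exactly the kind of coverage gap the abstract reports, such as the absence of user studies in the Communication and collaboration dimension.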


Cited By

    • (2024) The Role of Generative AI in Software Development Productivity: A Pilot Case Study. In Proceedings of the 1st ACM International Conference on AI-Powered Software, 131-138. https://doi.org/10.1145/3664646.3664773. Online publication date: 10-Jul-2024.


      Published In

      ISEC '24: Proceedings of the 17th Innovations in Software Engineering Conference
      February 2024
      144 pages
      ISBN: 9798400717673
      DOI: 10.1145/3641399
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Developer Productivity
      2. Generative AI
      3. Metrics
      4. SDLC
      5. Software

      Qualifiers

      • Short-paper
      • Research
      • Refereed limited

      Conference

      ISEC 2024

      Acceptance Rates

      Overall Acceptance Rate: 76 of 315 submissions, 24%

      Article Metrics

      • Downloads (last 12 months): 132
      • Downloads (last 6 weeks): 27
