DOI: 10.1007/978-981-96-0602-3_18
Article

MILE: A Mutation Testing Framework of In-Context Learning Systems

Published: 26 November 2024

Abstract

In-context learning (ICL) has achieved notable success in applications of large language models (LLMs). By adding only a few input-output pairs that demonstrate a new task, LLMs can efficiently learn the task during inference without modifying their parameters. This intriguing ability of LLMs has attracted great research interest in understanding, formatting, and improving in-context demonstrations, which nevertheless still suffer from drawbacks such as black-box mechanisms and sensitivity to the selection of examples. In this work, inspired by the established practice of applying testing techniques to machine learning (ML) systems, we propose a mutation testing framework designed to characterize the quality and effectiveness of test data for ICL systems. We first propose several mutation operators specialized for ICL demonstrations, together with corresponding mutation scores for ICL test sets. Through comprehensive experiments, we showcase the effectiveness of our framework in evaluating the reliability and quality of ICL test suites. Our code is available at https://github.com/weizeming/MILE.
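To make the idea concrete, the following is a minimal sketch of how a demonstration-level mutation operator and a mutation score for an ICL test suite might look. The operator `flip_label`, the `predict` interface, and the killed-mutant criterion are generic illustrations assumed for this sketch, not the operators or scores defined in the paper; the authors' actual implementation is available in the repository linked above.

```python
# Illustrative sketch only: generic stand-ins for an ICL mutation operator and
# mutation score, not the paper's actual operators or implementation.
import random
from typing import Callable, List, Tuple

Demo = Tuple[str, str]  # (input text, output label) pair used as an ICL demonstration


def flip_label(demos: List[Demo], labels: List[str], seed: int = 0) -> List[Demo]:
    """Hypothetical mutation operator: replace one demonstration's label with a
    different label from the label set, producing a mutant demonstration set."""
    rng = random.Random(seed)
    mutant = list(demos)
    i = rng.randrange(len(mutant))
    x, y = mutant[i]
    mutant[i] = (x, rng.choice([l for l in labels if l != y]))
    return mutant


def mutation_score(
    predict: Callable[[List[Demo], str], str],  # ICL system: (demonstrations, query) -> prediction
    demos: List[Demo],
    mutants: List[List[Demo]],
    test_inputs: List[str],
) -> float:
    """Fraction of mutants 'killed' by the test suite, i.e. mutants for which at
    least one test input changes the ICL system's prediction versus the original
    demonstrations. Higher scores indicate a more discriminative test suite."""
    killed = 0
    for mutant in mutants:
        if any(predict(mutant, x) != predict(demos, x) for x in test_inputs):
            killed += 1
    return killed / len(mutants) if mutants else 0.0
```

Under this sketch, a test suite that leaves every mutant's predictions unchanged scores 0 and would be considered weak at exposing faults injected into the demonstrations, whereas a suite that distinguishes every mutant from the original scores 1.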




Published In

Dependable Software Engineering. Theories, Tools, and Applications: 10th International Symposium, SETTA 2024, Hong Kong, China, November 26–28, 2024, Proceedings
Nov 2024
430 pages
ISBN: 978-981-96-0601-6
DOI: 10.1007/978-981-96-0602-3
Editors: Timothy Bourke, Liqian Chen, Amir Goharshady

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 26 November 2024

Author Tags

  1. In-context learning
  2. Mutation testing
  3. Large Language Models

Qualifiers

  • Article
