DOI: 10.1007/978-981-96-0602-3_18
Article

MILE: A Mutation Testing Framework of In-Context Learning Systems

Published: 26 November 2024

Abstract

In-context learning (ICL) has achieved notable success in applications of large language models (LLMs). By adding only a few input-output pairs that demonstrate a new task, LLMs can efficiently learn the task during inference without modifying their parameters. This intriguing ability of LLMs has attracted great research interest in understanding, formatting, and improving in-context demonstrations, which nevertheless still suffer from drawbacks such as black-box mechanisms and sensitivity to the selection of examples. In this work, inspired by the established practice of applying testing techniques to machine learning (ML) systems, we propose a mutation testing framework designed to characterize the quality and effectiveness of test data for ICL systems. We first propose several mutation operators specialized for ICL demonstrations, together with corresponding mutation scores for ICL test sets. Through comprehensive experiments, we showcase the effectiveness of our framework in evaluating the reliability and quality of ICL test suites. Our code is available at https://github.com/weizeming/MILE.
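To make the idea concrete, the following is a minimal sketch of how a demonstration-level mutation operator and a mutation score for an ICL test suite might look. The operator `flip_label`, the `predict` interface, and the killed-mutant criterion are generic illustrations assumed for this sketch, not the operators or scores defined in the paper; the authors' actual implementation is available in the repository linked above.

```python
# Illustrative sketch only: generic stand-ins for an ICL mutation operator and
# mutation score, not the paper's actual operators or implementation.
import random
from typing import Callable, List, Tuple

Demo = Tuple[str, str]  # (input text, output label) pair used as an ICL demonstration


def flip_label(demos: List[Demo], labels: List[str], seed: int = 0) -> List[Demo]:
    """Hypothetical mutation operator: replace one demonstration's label with a
    different label from the label set, producing a mutant demonstration set."""
    rng = random.Random(seed)
    mutant = list(demos)
    i = rng.randrange(len(mutant))
    x, y = mutant[i]
    mutant[i] = (x, rng.choice([l for l in labels if l != y]))
    return mutant


def mutation_score(
    predict: Callable[[List[Demo], str], str],  # ICL system: (demonstrations, query) -> prediction
    demos: List[Demo],
    mutants: List[List[Demo]],
    test_inputs: List[str],
) -> float:
    """Fraction of mutants 'killed' by the test suite, i.e. mutants for which at
    least one test input changes the ICL system's prediction versus the original
    demonstrations. Higher scores indicate a more discriminative test suite."""
    killed = 0
    for mutant in mutants:
        if any(predict(mutant, x) != predict(demos, x) for x in test_inputs):
            killed += 1
    return killed / len(mutants) if mutants else 0.0
```

Under this sketch, a test suite that leaves every mutant's predictions unchanged scores 0 and would be considered weak at exposing faults injected into the demonstrations, whereas a suite that distinguishes every mutant from the original scores 1.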




Published In

Dependable Software Engineering. Theories, Tools, and Applications: 10th International Symposium, SETTA 2024, Hong Kong, China, November 26–28, 2024, Proceedings
Nov 2024
430 pages
ISBN: 978-981-96-0601-6
DOI: 10.1007/978-981-96-0602-3
Editors: Timothy Bourke, Liqian Chen, Amir Goharshady

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 26 November 2024

Author Tags

  1. In-context learning
  2. Mutation testing
  3. Large Language Models

Qualifiers

  • Article
