DOI: 10.1145/3650105.3652297

MeTMaP: Metamorphic Testing for Detecting False Vector Matching Problems in LLM Augmented Generation

Published: 12 June 2024

Abstract

Augmented generation techniques such as Retrieval-Augmented Generation (RAG) and Cache-Augmented Generation (CAG) enhance large language model (LLM) outputs with external knowledge and cached information. However, the vector databases that serve as the backbone of these augmentations introduce critical challenges, particularly in ensuring accurate vector matching. False vector matches in these databases can significantly compromise the integrity and reliability of LLM outputs, leading to misinformation or erroneous responses. Despite the severity of these issues, there is a notable research gap in methods to effectively detect and address false vector matches in LLM-augmented generation.

This paper presents MeTMaP, a metamorphic testing framework for identifying false vector matching in LLM-augmented generation systems. The core of our method is a set of eight metamorphic relations (MRs), derived from six NLP datasets and built on the idea that semantically similar texts should match while dissimilar ones should not. MeTMaP uses these MRs to construct sentence triplets for testing, simulating real-world matching scenarios. Our evaluation of MeTMaP over 203 vector matching configurations, covering 29 embedding models and 7 distance metrics, uncovers significant inaccuracies: the best configuration reaches an accuracy of only 41.51% on our tests, compared to the original datasets. These results highlight how widespread false matches are in vector matching methods and the critical need for effective detection and mitigation in LLM-augmented applications.
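To make the matching-oriented MR idea concrete, the sketch below checks a single sentence triplet (an anchor, a paraphrase that should match, and a contradiction that should not) under one embedding model and one distance metric. It is a minimal illustration of the testing idea described in the abstract, not the MeTMaP implementation: the sentence-transformers library, the all-MiniLM-L6-v2 model, cosine distance, the triplet_violates_mr helper, and the example sentences are all assumptions chosen for the sketch.

```python
# Minimal sketch of a matching-oriented metamorphic check (not the MeTMaP code).
# Assumptions: the sentence-transformers library, the all-MiniLM-L6-v2 model,
# cosine distance as one plausible metric, and a hand-written example triplet;
# MeTMaP itself derives its triplets from six NLP datasets.
from scipy.spatial.distance import cosine
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # one of many possible embedding models

def triplet_violates_mr(anchor: str, paraphrase: str, contradiction: str) -> bool:
    """Return True if the contradiction matches the anchor at least as well as the paraphrase.

    Under the MR "semantically similar texts should match and dissimilar ones
    should not", the paraphrase must be strictly closer to the anchor than the
    contradiction; otherwise this configuration produces a false vector match.
    """
    emb_anchor, emb_para, emb_contra = model.encode([anchor, paraphrase, contradiction])
    dist_para = cosine(emb_anchor, emb_para)      # smaller distance = better match
    dist_contra = cosine(emb_anchor, emb_contra)
    return dist_contra <= dist_para

# Hypothetical triplet for illustration only.
print(triplet_violates_mr(
    "The medication should be taken after meals.",
    "Take the medication once you have eaten.",
    "The medication should not be taken after meals.",
))
```

Aggregating the violation rate of checks like this across many triplets, embedding models, and distance metrics is what yields configuration-level accuracy figures such as the 41.51% maximum reported above.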


      Published In

      FORGE '24: Proceedings of the 2024 IEEE/ACM First International Conference on AI Foundation Models and Software Engineering
      April 2024
      140 pages
      ISBN: 9798400706097
      DOI: 10.1145/3650105
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 12 June 2024


      Author Tags

      1. metamorphic testing
      2. vector matching
      3. augmented generation

      Qualifiers

      • Research-article
