DOI: 10.1145/3627673.3679888
Short paper
Open access

Compressed Models are NOT Miniature Versions of Large Models

Published: 21 October 2024

Abstract

Large neural models are often compressed before deployment. Model compression is necessary for many practical reasons, such as inference latency, memory footprint, and energy consumption. Compressed models are commonly assumed to be miniature versions of the corresponding large neural models, a belief we question in this work. We compare compressed models with their corresponding large neural models along four model characteristics: prediction errors, data representation, data distribution, and vulnerability to adversarial attacks. We perform experiments using the BERT-large model and five of its compressed versions. On all four characteristics, the compressed models differ significantly from the BERT-large model, and they also differ from one another. Beyond the expected loss in model performance, replacing large neural models with compressed models therefore has major side effects.
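
To make the comparison concrete, below is a minimal sketch of one way to contrast a large model with a compressed one on the data-representation axis. It is not the paper's protocol: the checkpoints (bert-large-uncased, distilbert-base-uncased), the example sentences, and the similarity-structure metric are illustrative assumptions. Because the two models have different hidden sizes (1024 vs. 768), the sketch compares each model's pairwise cosine-similarity matrix over [CLS] vectors rather than the vectors themselves.

import torch
from transformers import AutoModel, AutoTokenizer

SENTENCES = [
    "The movie was surprisingly good.",
    "The movie was a complete disappointment.",
    "Quarterly revenue rose by twelve percent.",
]

def cls_similarity_matrix(checkpoint: str) -> torch.Tensor:
    """Return the pairwise cosine-similarity matrix of [CLS] vectors for SENTENCES."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    model.eval()
    with torch.no_grad():
        batch = tokenizer(SENTENCES, padding=True, return_tensors="pt")
        cls = model(**batch).last_hidden_state[:, 0]   # one [CLS] vector per sentence
    cls = torch.nn.functional.normalize(cls, dim=-1)
    return cls @ cls.T                                 # cosine similarities

# Illustrative checkpoints only; the paper evaluates BERT-large and five compressed variants.
sim_large = cls_similarity_matrix("bert-large-uncased")
sim_small = cls_similarity_matrix("distilbert-base-uncased")

# If the compressed model were a faithful miniature, the two similarity structures
# would be close; a large mean gap is one crude proxy for representational divergence.
print("mean |sim_large - sim_small| =", (sim_large - sim_small).abs().mean().item())

The same scaffolding extends to the other three characteristics, for example counting prediction disagreements on a labeled test set or measuring attack success rates with an adversarial-attack toolkit, again with the caveat that the concrete setup above is assumed rather than taken from the paper.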




    Published In

    CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management
    October 2024
    5705 pages
    ISBN: 9798400704369
    DOI: 10.1145/3627673
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. BERT
    2. model characteristics
    3. model compression

    Qualifiers

    • Short-paper

    Conference

    CIKM '24

    Acceptance Rates

    Overall acceptance rate: 1,861 of 8,427 submissions (22%)
