DOI: 10.1145/3336191.3371792

Model Compression with Two-stage Multi-teacher Knowledge Distillation for Web Question Answering System

Published: 22 January 2020

Abstract

Deep pre-training and fine-tuning models (such as BERT and OpenAI GPT) have demonstrated excellent results on question answering tasks. However, due to the sheer number of model parameters, the inference speed of these models is very slow, and applying such complex models in real business scenarios is a challenging but practical problem. Previous model compression methods usually suffer from information loss during compression, leading to models that are inferior to the original. To tackle this challenge, we propose a Two-stage Multi-teacher Knowledge Distillation (TMKD for short) method for web question answering systems. We first develop a general Q&A distillation task for student model pre-training, and then fine-tune this pre-trained student model with multi-teacher knowledge distillation on downstream tasks (such as the Web Q&A task and the MNLI, SNLI, and RTE tasks from GLUE), which effectively reduces the overfitting bias of individual teacher models and transfers more general knowledge to the student model. The experimental results show that our method significantly outperforms the baseline methods and even achieves results comparable to the original teacher models, along with a substantial speedup in model inference.
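
The abstract describes distilling an ensemble of large teacher models into a single compact student, first through a general Q&A distillation pre-training stage and then through multi-teacher distillation on downstream tasks. As a rough illustration of a multi-teacher objective, the sketch below combines a hard-label loss with a soft-label loss against the averaged teacher predictions; the function name, the temperature, and the alpha weighting are illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal sketch (not the paper's exact objective) of a multi-teacher
# knowledge distillation loss in PyTorch.
import torch
import torch.nn.functional as F


def multi_teacher_distill_loss(student_logits, teacher_logits_list, labels,
                               temperature=2.0, alpha=0.5):
    """Combine a hard-label loss with a soft-label loss computed against
    the averaged teacher distribution.

    student_logits:      (batch, num_classes) logits from the student.
    teacher_logits_list: list of (batch, num_classes) logits, one per teacher.
    labels:              (batch,) ground-truth class indices.
    """
    # Hard-label cross-entropy on the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    # Average the teachers' softened distributions; combining several
    # teachers reduces the bias of any single one.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)

    # KL divergence between the student's softened distribution and the
    # averaged teacher distribution, scaled by T^2 as is conventional.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        teacher_probs,
        reduction="batchmean",
    ) * (temperature ** 2)

    return alpha * hard_loss + (1.0 - alpha) * soft_loss
```

Averaging the teachers' softened outputs is only one straightforward aggregation choice, but it matches the abstract's stated goal of reducing the overfitting bias of any individual teacher.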



    Published In

    WSDM '20: Proceedings of the 13th International Conference on Web Search and Data Mining
    January 2020
    950 pages
    ISBN: 9781450368223
    DOI: 10.1145/3336191

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 22 January 2020

    Author Tags

    1. distillation pre-training
    2. knowledge distillation
    3. model compression
    4. multi-teacher
    5. two-stage

    Qualifiers

    • Research-article

    Conference

    WSDM '20

    Acceptance Rates

    Overall Acceptance Rate 498 of 2,863 submissions, 17%

