DOI: 10.1145/3336191.3371792

Model Compression with Two-stage Multi-teacher Knowledge Distillation for Web Question Answering System

Published: 22 January 2020

Abstract

Deep pre-training and fine-tuning models (such as BERT and OpenAI GPT) have demonstrated excellent results on question answering tasks. However, due to the sheer number of model parameters, the inference speed of these models is very slow, and applying such complex models in real business scenarios is a challenging but practical problem. Previous model compression methods usually suffer from information loss during compression, leading to models that are inferior to the original. To tackle this challenge, we propose a Two-stage Multi-teacher Knowledge Distillation (TMKD for short) method for web question answering systems. We first develop a general Q&A distillation task for student model pre-training, and then fine-tune this pre-trained student model with multi-teacher knowledge distillation on downstream tasks (such as the Web Q&A task and the MNLI, SNLI, and RTE tasks from GLUE), which effectively reduces the overfitting bias of individual teacher models and transfers more general knowledge to the student model. The experimental results show that our method significantly outperforms the baseline methods and even achieves results comparable to the original teacher models, along with a substantial speedup in model inference.
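
The abstract describes distilling an ensemble of large teacher models into a single compact student, first through a general Q&A distillation pre-training stage and then through multi-teacher distillation on downstream tasks. As a rough illustration of a multi-teacher objective, the sketch below combines a hard-label loss with a soft-label loss against the averaged teacher predictions; the function name, the temperature, and the alpha weighting are illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal sketch (not the paper's exact objective) of a multi-teacher
# knowledge distillation loss in PyTorch.
import torch
import torch.nn.functional as F


def multi_teacher_distill_loss(student_logits, teacher_logits_list, labels,
                               temperature=2.0, alpha=0.5):
    """Combine a hard-label loss with a soft-label loss computed against
    the averaged teacher distribution.

    student_logits:      (batch, num_classes) logits from the student.
    teacher_logits_list: list of (batch, num_classes) logits, one per teacher.
    labels:              (batch,) ground-truth class indices.
    """
    # Hard-label cross-entropy on the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    # Average the teachers' softened distributions; combining several
    # teachers reduces the bias of any single one.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)

    # KL divergence between the student's softened distribution and the
    # averaged teacher distribution, scaled by T^2 as is conventional.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        teacher_probs,
        reduction="batchmean",
    ) * (temperature ** 2)

    return alpha * hard_loss + (1.0 - alpha) * soft_loss
```

Averaging the teachers' softened outputs is only one straightforward aggregation choice, but it matches the abstract's stated goal of reducing the overfitting bias of any individual teacher.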



    Published In

    WSDM '20: Proceedings of the 13th International Conference on Web Search and Data Mining
    January 2020
    950 pages
    ISBN: 9781450368223
    DOI: 10.1145/3336191

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 22 January 2020

    Author Tags

    1. distillation pre-training
    2. knowledge distillation
    3. model compression
    4. multi-teacher
    5. two-stage

    Qualifiers

    • Research-article

    Conference

    WSDM '20

    Acceptance Rates

    Overall Acceptance Rate 498 of 2,863 submissions, 17%

