DOI: 10.1145/3219819.3220007
Research Article • Open Access

Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts

Published: 19 July 2018

Abstract

Neural-based multi-task learning has been used successfully in many real-world, large-scale applications such as recommendation systems. For example, in movie recommendation, beyond suggesting movies that users are likely to purchase and watch, the system might also optimize for whether users like the movies afterwards. With multi-task learning, we aim to build a single model that learns these multiple goals and tasks simultaneously. However, the prediction quality of commonly used multi-task models is often sensitive to the relationships between tasks. It is therefore important to study the modeling tradeoffs between task-specific objectives and inter-task relationships. In this work, we propose a novel multi-task learning approach, the Multi-gate Mixture-of-Experts (MMoE), which explicitly learns to model task relationships from data. We adapt the Mixture-of-Experts (MoE) structure to multi-task learning by sharing the expert submodels across all tasks, while training a separate gating network to optimize each task. To validate the approach on data with different levels of task relatedness, we first apply it to a synthetic dataset in which we control the degree of task relatedness, and show that the proposed approach outperforms baseline methods when the tasks are less related. We also show that the MMoE structure brings an additional trainability benefit across different levels of randomness in the training data and model initialization. Finally, we demonstrate the performance improvements of MMoE on real tasks, including a binary classification benchmark and a large-scale content recommendation system at Google.
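The architecture described above is compact enough to express directly. Below is a minimal NumPy sketch of an MMoE forward pass, assuming single-layer ReLU experts, softmax gating networks, and linear per-task output towers; all layer sizes and names are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal MMoE forward-pass sketch (illustrative assumptions throughout:
# single-layer ReLU experts, softmax gates, linear towers, toy sizes).
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

d_in, d_hidden, n_experts, n_tasks = 16, 8, 4, 2

# Expert subnetworks (one ReLU layer each), shared by all tasks.
W_experts = rng.normal(size=(n_experts, d_in, d_hidden))
# One gating network per task, mapping the input to expert weights.
W_gates = rng.normal(size=(n_tasks, d_in, n_experts))
# One small tower (output head) per task.
W_towers = rng.normal(size=(n_tasks, d_hidden, 1))

def mmoe_forward(x):
    """x: (batch, d_in) -> list of per-task predictions, each (batch, 1)."""
    # All experts run on every input: (batch, n_experts, d_hidden).
    expert_out = np.maximum(0.0, np.einsum('bi,eih->beh', x, W_experts))
    outputs = []
    for k in range(n_tasks):
        # Task-specific softmax gate over experts: (batch, n_experts).
        gate = softmax(x @ W_gates[k])
        # Gate-weighted mixture of expert outputs: (batch, d_hidden).
        mixed = np.einsum('be,beh->bh', gate, expert_out)
        outputs.append(mixed @ W_towers[k])
    return outputs

y1, y2 = mmoe_forward(rng.normal(size=(3, d_in)))
print(y1.shape, y2.shape)  # (3, 1) (3, 1)
```

Because each task owns its gate, weakly related tasks can learn to weight different experts while strongly related tasks can converge on similar mixtures; this is the structural sense in which MMoE learns task relationships from data.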

Supplementary Material

MP4 File (ma_modeling_relationships.mp4)



Information

Published In

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2018
2925 pages
ISBN: 9781450355520
DOI: 10.1145/3219819
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 July 2018


Author Tags

  1. mixture of experts
  2. multi-task learning
  3. neural network
  4. recommendation system


Conference

KDD '18

Acceptance Rates

KDD '18 paper acceptance rate: 107 of 983 submissions, 11%
Overall acceptance rate: 1,133 of 8,635 submissions, 13%



Article Metrics

  • Downloads (last 12 months): 21,901
  • Downloads (last 6 weeks): 2,457
Reflects downloads up to 23 Dec 2024.

Cited By

  • (2025) A Collaborative Network for Multiple Hyperspectral Images Joint Classification. IEEE Transactions on Geoscience and Remote Sensing, Vol. 63, 1-14. DOI: 10.1109/TGRS.2024.3511618. Online publication date: 2025.
  • (2025) GatedNN: An accurate deep learning-based parameter extraction for BSIM-CMG. Solid-State Electronics, Vol. 224, 109044. DOI: 10.1016/j.sse.2024.109044. Online publication date: Feb 2025.
  • (2025) Hybrid contrastive multi-scenario learning for multi-task sequential-dependence recommendation. Neural Networks, Vol. 183, 106953. DOI: 10.1016/j.neunet.2024.106953. Online publication date: Mar 2025.
  • (2025) A user behavior-aware multi-task learning model for enhanced short video recommendation. Neurocomputing, Vol. 617, 129076. DOI: 10.1016/j.neucom.2024.129076. Online publication date: Feb 2025.
  • (2025) Mitigating gradient conflicts via expert squads in multi-task learning. Neurocomputing, Vol. 614, 128832. DOI: 10.1016/j.neucom.2024.128832. Online publication date: Jan 2025.
  • (2025) Causal-relationship representation enhanced joint extraction model for elements and relationships. Neurocomputing, Vol. 613, 128736. DOI: 10.1016/j.neucom.2024.128736. Online publication date: Jan 2025.
  • (2025) Cross-attention multi-perspective fusion network based fake news censorship. Neurocomputing, Vol. 611, 128695. DOI: 10.1016/j.neucom.2024.128695. Online publication date: Jan 2025.
  • (2025) Fusing temporal and semantic dependencies for session-based recommendation. Information Processing & Management, Vol. 62, 1, 103896. DOI: 10.1016/j.ipm.2024.103896. Online publication date: Jan 2025.
  • (2025) Enhancing road surface recognition via optimal transport and metric learning in task-agnostic intelligent driving environments. Expert Systems with Applications, Vol. 266, 125978. DOI: 10.1016/j.eswa.2024.125978. Online publication date: Mar 2025.
  • (2024) Restructuring the Landscape of Generative AI Research. In Impacts of Generative AI on the Future of Research and Education, 287-334. DOI: 10.4018/979-8-3693-0884-4.ch012. Online publication date: 20 Sep 2024.
