Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts

Authors:

Ed H. Chi

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Pages 1930 - 1939

https://doi.org/10.1145/3219819.3220007

Published: 19 July 2018

Abstract

Neural-based multi-task learning has been successfully used in many real-world large-scale applications such as recommendation systems. For example, in movie recommendations, beyond providing users movies which they tend to purchase and watch, the system might also optimize for users liking the movies afterwards. With multi-task learning, we aim to build a single model that learns these multiple goals and tasks simultaneously. However, the prediction quality of commonly used multi-task models is often sensitive to the relationships between tasks. It is therefore important to study the modeling tradeoffs between task-specific objectives and inter-task relationships. In this work, we propose a novel multi-task learning approach, Multi-gate Mixture-of-Experts (MMoE), which explicitly learns to model task relationships from data. We adapt the Mixture-of-Experts (MoE) structure to multi-task learning by sharing the expert submodels across all tasks, while also having a gating network trained to optimize each task. To validate our approach on data with different levels of task relatedness, we first apply it to a synthetic dataset where we control the task relatedness. We show that the proposed approach performs better than baseline methods when the tasks are less related. We also show that the MMoE structure results in an additional trainability benefit, depending on different levels of randomness in the training data and model initialization. Furthermore, we demonstrate the performance improvements by MMoE on real tasks including a binary classification benchmark, and a large-scale content recommendation system at Google.
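The structure described in the abstract, with experts shared across all tasks and one gating network per task, can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the single-layer ReLU experts, softmax gates, linear per-task output heads, random weights, and all names (`MMoE`, `gate_weights`, `forward`) are assumptions made for illustration only.

```python
import math
import random


def linear(x, w):
    """Multiply input vector x (length d_in) by weight matrix w (d_in x d_out)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class MMoE:
    """Illustrative multi-gate mixture-of-experts forward pass (assumed layout)."""

    def __init__(self, d_in, d_expert, n_experts, n_tasks, seed=0):
        rng = random.Random(seed)
        # Experts are shared by every task.
        self.experts = [rand_matrix(rng, d_in, d_expert) for _ in range(n_experts)]
        # Each task owns its own gate (and, as an assumption, a linear head).
        self.gates = [rand_matrix(rng, d_in, n_experts) for _ in range(n_tasks)]
        self.heads = [rand_matrix(rng, d_expert, 1) for _ in range(n_tasks)]

    def gate_weights(self, x, task):
        """Per-task distribution over the shared experts for input x."""
        return softmax(linear(x, self.gates[task]))

    def forward(self, x):
        # Expert outputs are computed once and reused by all tasks.
        expert_out = [relu(linear(x, w)) for w in self.experts]
        outputs = []
        for task in range(len(self.gates)):
            g = self.gate_weights(x, task)
            # Task-specific mixture: gate-weighted sum of expert outputs.
            mixed = [
                sum(g[e] * expert_out[e][k] for e in range(len(expert_out)))
                for k in range(len(expert_out[0]))
            ]
            outputs.append(linear(mixed, self.heads[task])[0])
        return outputs


model = MMoE(d_in=4, d_expert=3, n_experts=2, n_tasks=2)
x = [0.2, -0.1, 0.4, 0.7]
preds = model.forward(x)  # one scalar prediction per task
```

Because each task has its own gate, two tasks can weight the same experts differently for the same input, which is one way the model can accommodate tasks that are less related.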

Index Terms

Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Multi-task learning
    2. Machine learning approaches
      1. Neural networks
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Recommender systems

Published In

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

July 2018

2925 pages

ISBN:9781450355520

DOI:10.1145/3219819

General Chairs: Yike Guo (Imperial College London), Faisal Farooq (IBM)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

KDD '18

KDD '18: The 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 19 - 23, 2018

London, United Kingdom

Acceptance Rates

KDD '18 Paper Acceptance Rate 107 of 983 submissions, 11%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
