DOI: 10.1145/3485447.3512021
research-article
Open access

Can Small Heads Help? Understanding and Improving Multi-Task Generalization

Published: 25 April 2022
Abstract

    Multi-task learning aims to solve multiple machine learning tasks at the same time, with good solutions being both generalizable and Pareto optimal. A multi-task deep learning model consists of a shared representation learned to capture task commonalities, and task-specific sub-networks capturing the specificities of each task. In this work, we offer insights on the under-explored trade-off between minimizing task training conflicts in multi-task learning and improving multi-task generalization, i.e. the generalization capability of the shared representation across all tasks. The trade-off can be viewed as the tension between multi-objective optimization and shared representation learning: as a multi-objective optimization problem, sufficient parameterization is needed to mitigate task conflicts in a constrained solution space; however, from a representation learning perspective, over-parameterizing the task-specific sub-networks may give the model too many "degrees of freedom" and impede the generalizability of the shared representation.
    Specifically, we first present insights on the parameterization effect in multi-task deep learning models and empirically show that larger models are not necessarily better in terms of multi-task generalization. A delicate balance between mitigating task training conflicts and improving the generalizability of the shared representation is needed to achieve optimal performance across multiple tasks. Motivated by these findings, we propose attaching an under-parameterized self-auxiliary head alongside each task-specific sub-network during training, which automatically balances the aforementioned trade-off. Because the auxiliary heads are small and are discarded at inference time, the proposed method incurs minimal training cost and no additional serving cost. We conduct experiments with the proposed self-auxiliaries on two public datasets, as well as live experiments on one of the largest industrial recommendation platforms serving billions of users. The results demonstrate the effectiveness of the proposed method in improving predictive performance across multiple tasks in multi-task models.
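
    To make the described training setup concrete, below is a minimal sketch, assuming a PyTorch-style shared-bottom architecture. It is not the authors' released code: the class name, layer sizes, loss function, and the `aux_weight` knob are illustrative assumptions. Each task keeps its full-size tower, while a small auxiliary head per task contributes an extra training loss and is dropped at serving time.

```python
# Minimal sketch (illustrative, not the paper's exact configuration) of a
# shared-bottom multi-task model with small self-auxiliary heads.
import torch
import torch.nn as nn

class MultiTaskWithSelfAuxiliaries(nn.Module):
    def __init__(self, input_dim, shared_dim, num_tasks,
                 tower_hidden=256, aux_hidden=8):
        super().__init__()
        # Shared representation capturing task commonalities.
        self.shared = nn.Sequential(nn.Linear(input_dim, shared_dim), nn.ReLU())
        # Full-size task-specific sub-networks (kept at serving time).
        self.towers = nn.ModuleList([
            nn.Sequential(nn.Linear(shared_dim, tower_hidden), nn.ReLU(),
                          nn.Linear(tower_hidden, 1))
            for _ in range(num_tasks)])
        # Under-parameterized self-auxiliary heads, one per task; used only to
        # compute extra training losses and discarded at inference.
        self.aux_heads = nn.ModuleList([
            nn.Sequential(nn.Linear(shared_dim, aux_hidden), nn.ReLU(),
                          nn.Linear(aux_hidden, 1))
            for _ in range(num_tasks)])

    def forward(self, x):
        h = self.shared(x)
        main_logits = [tower(h) for tower in self.towers]
        aux_logits = [head(h) for head in self.aux_heads]
        return main_logits, aux_logits

def training_loss(model, x, labels, aux_weight=1.0):
    """Combine each task's main loss with its auxiliary-head loss.

    `aux_weight` is a hypothetical knob; the small heads act as a regularizer
    that encourages the shared representation to remain predictive on its own
    rather than relying on large task towers.
    """
    main_logits, aux_logits = model(x)
    criterion = nn.BCEWithLogitsLoss()
    loss = 0.0
    for m, a, y in zip(main_logits, aux_logits, labels):
        loss = loss + criterion(m, y) + aux_weight * criterion(a, y)
    return loss
```

    In this sketch, only `main_logits` would be used at inference, which is why the auxiliary heads add a small training cost but no serving cost.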




          Published In

          WWW '22: Proceedings of the ACM Web Conference 2022
          April 2022, 3764 pages
          ISBN: 9781450390965
          DOI: 10.1145/3485447
          This work is licensed under a Creative Commons Attribution 4.0 International License.


          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          Published: 25 April 2022


          Author Tags

          1. Pareto frontier
          2. auxiliary tasks
          3. multi-task learning
          4. neural networks

          Qualifiers

          • Research-article
          • Research
          • Refereed limited

          Conference

          WWW '22: The ACM Web Conference 2022
          April 25-29, 2022, Virtual Event / Lyon, France

          Acceptance Rates

          Overall Acceptance Rate 1,899 of 8,196 submissions, 23%


          Cited By

          • (2024) M3oE: Multi-Domain Multi-Task Mixture-of-Experts Recommendation Framework. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 893-902. DOI: 10.1145/3626772.3657686. Online publication date: 10-Jul-2024.
          • (2023) Multitask Ranking System for Immersive Feed and No More Clicks: A Case Study of Short-Form Video Recommendation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 4709-4716. DOI: 10.1145/3583780.3615489. Online publication date: 21-Oct-2023.
          • (2023) Deep Task-specific Bottom Representation Network for Multi-Task Recommendation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 1637-1646. DOI: 10.1145/3583780.3614837. Online publication date: 21-Oct-2023.
          • (2023) Optimizing Airbnb Search Journey with Multi-task Learning. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 4872-4881. DOI: 10.1145/3580305.3599881. Online publication date: 6-Aug-2023.
          • (2023) AdaTT: Adaptive Task-to-Task Fusion Network for Multitask Learning in Recommendations. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 4370-4379. DOI: 10.1145/3580305.3599769. Online publication date: 6-Aug-2023.
