DOI: 10.1145/3485447.3512021
research-article
Open access

Can Small Heads Help? Understanding and Improving Multi-Task Generalization

Published: 25 April 2022
Abstract

    Multi-task learning aims to solve multiple machine learning tasks at the same time, with good solutions being both generalizable and Pareto optimal. A multi-task deep learning model consists of a shared representation learned to capture task commonalities, and task-specific sub-networks capturing the specificities of each task. In this work, we offer insights on the under-explored trade-off between minimizing task training conflicts in multi-task learning and improving multi-task generalization, i.e. the generalization capability of the shared representation across all tasks. The trade-off can be viewed as the tension between multi-objective optimization and shared representation learning: as a multi-objective optimization problem, sufficient parameterization is needed to mitigate task conflicts in a constrained solution space; however, from a representation learning perspective, over-parameterizing the task-specific sub-networks may give the model too many "degrees of freedom" and impede the generalizability of the shared representation.
    Specifically, we first present insights on the parameterization effect in multi-task deep learning models and empirically show that larger models are not necessarily better in terms of multi-task generalization. A delicate balance between mitigating task training conflicts and improving the generalizability of the shared representation is needed to achieve optimal performance across multiple tasks. Motivated by these findings, we propose attaching an under-parameterized self-auxiliary head alongside each task-specific sub-network during training, which automatically balances the aforementioned trade-off. Because the auxiliary heads are small and are discarded at inference time, the proposed method incurs minimal training cost and no additional serving cost. We conduct experiments with the proposed self-auxiliaries on two public datasets, as well as live experiments on one of the largest industrial recommendation platforms serving billions of users. The results demonstrate the effectiveness of the proposed method in improving predictive performance across multiple tasks in multi-task models.
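
    To make the described training setup concrete, below is a minimal sketch, assuming a PyTorch-style shared-bottom architecture. It is not the authors' released code: the class name, layer sizes, loss function, and the `aux_weight` knob are illustrative assumptions. Each task keeps its full-size tower, while a small auxiliary head per task contributes an extra training loss and is dropped at serving time.

```python
# Minimal sketch (illustrative, not the paper's exact configuration) of a
# shared-bottom multi-task model with small self-auxiliary heads.
import torch
import torch.nn as nn

class MultiTaskWithSelfAuxiliaries(nn.Module):
    def __init__(self, input_dim, shared_dim, num_tasks,
                 tower_hidden=256, aux_hidden=8):
        super().__init__()
        # Shared representation capturing task commonalities.
        self.shared = nn.Sequential(nn.Linear(input_dim, shared_dim), nn.ReLU())
        # Full-size task-specific sub-networks (kept at serving time).
        self.towers = nn.ModuleList([
            nn.Sequential(nn.Linear(shared_dim, tower_hidden), nn.ReLU(),
                          nn.Linear(tower_hidden, 1))
            for _ in range(num_tasks)])
        # Under-parameterized self-auxiliary heads, one per task; used only to
        # compute extra training losses and discarded at inference.
        self.aux_heads = nn.ModuleList([
            nn.Sequential(nn.Linear(shared_dim, aux_hidden), nn.ReLU(),
                          nn.Linear(aux_hidden, 1))
            for _ in range(num_tasks)])

    def forward(self, x):
        h = self.shared(x)
        main_logits = [tower(h) for tower in self.towers]
        aux_logits = [head(h) for head in self.aux_heads]
        return main_logits, aux_logits

def training_loss(model, x, labels, aux_weight=1.0):
    """Combine each task's main loss with its auxiliary-head loss.

    `aux_weight` is a hypothetical knob; the small heads act as a regularizer
    that encourages the shared representation to remain predictive on its own
    rather than relying on large task towers.
    """
    main_logits, aux_logits = model(x)
    criterion = nn.BCEWithLogitsLoss()
    loss = 0.0
    for m, a, y in zip(main_logits, aux_logits, labels):
        loss = loss + criterion(m, y) + aux_weight * criterion(a, y)
    return loss
```

    In this sketch, only `main_logits` would be used at inference, which is why the auxiliary heads add a small training cost but no serving cost.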




          Published In

          WWW '22: Proceedings of the ACM Web Conference 2022
          April 2022, 3764 pages
          ISBN: 9781450390965
          DOI: 10.1145/3485447
          This work is licensed under a Creative Commons Attribution 4.0 International License.


          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          Published: 25 April 2022


          Author Tags

          1. Pareto frontier
          2. auxiliary tasks
          3. multi-task learning
          4. neural networks

          Qualifiers

          • Research-article
          • Research
          • Refereed limited

          Conference

          WWW '22: The ACM Web Conference 2022
          April 25-29, 2022, Virtual Event / Lyon, France

          Acceptance Rates

          Overall Acceptance Rate 1,899 of 8,196 submissions, 23%


          Cited By

          • (2024) M3oE: Multi-Domain Multi-Task Mixture-of-Experts Recommendation Framework. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 893-902. DOI: 10.1145/3626772.3657686. Online publication date: 10-Jul-2024.
          • (2023) Multitask Ranking System for Immersive Feed and No More Clicks: A Case Study of Short-Form Video Recommendation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 4709-4716. DOI: 10.1145/3583780.3615489. Online publication date: 21-Oct-2023.
          • (2023) Deep Task-specific Bottom Representation Network for Multi-Task Recommendation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 1637-1646. DOI: 10.1145/3583780.3614837. Online publication date: 21-Oct-2023.
          • (2023) Optimizing Airbnb Search Journey with Multi-task Learning. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 4872-4881. DOI: 10.1145/3580305.3599881. Online publication date: 6-Aug-2023.
          • (2023) AdaTT: Adaptive Task-to-Task Fusion Network for Multitask Learning in Recommendations. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 4370-4379. DOI: 10.1145/3580305.3599769. Online publication date: 6-Aug-2023.
