Optimizing data usage via differentiable rewards

Published: 13 July 2020, in ICML'20: Proceedings of the 37th International Conference on Machine Learning (JMLR.org)

Abstract

To acquire a new skill, humans learn better and faster if a tutor informs them of how much attention they should pay to particular content or practice problems based on their current knowledge level. Similarly, a machine learning model could potentially be trained better if data is presented in a way that adapts to its current learning state. In this paper, we examine the problem of training an adaptive scorer that weights data instances to maximally benefit learning. Training such a scorer efficiently is a challenging problem; in order to precisely quantify the effect of a data instance on the final model, a naive approach would require completing the entire training process and observing final performance. We propose an efficient alternative - Differentiable Data Selection (DDS) - that formulates a scorer as a learnable function of the training data that can be efficiently updated along with the main model being trained. Specifically, DDS updates the scorer with an intuitive reward signal: it should up-weight the data whose gradient is similar to that of a development set upon which we would finally like to perform well. Without significant computational overhead, DDS delivers consistent improvements over several strong baselines on two very different tasks of machine translation and image classification.
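
The gradient-alignment reward described above can be made concrete with a short sketch. The PyTorch-style code below is illustrative only and is not the authors' implementation: `dds_step`, `scorer`, `flat_grad`, and both optimizers are hypothetical names, the scorer is assumed to be a separate network from the main model, and the per-example gradient loop is a deliberately naive rendering of the reward (the paper's method avoids this per-example cost with a more efficient update).

```python
import torch
import torch.nn.functional as F


def flat_grad(loss, params):
    """Flatten the gradient of `loss` w.r.t. `params` into a single vector."""
    grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
    return torch.cat([
        (g if g is not None else torch.zeros_like(p)).reshape(-1)
        for g, p in zip(grads, params)
    ])


def dds_step(model, scorer, opt_model, opt_scorer, train_batch, dev_batch):
    """One joint update of the main model and the data scorer (illustrative sketch)."""
    params = [p for p in model.parameters() if p.requires_grad]
    x, y = train_batch
    x_dev, y_dev = dev_batch

    # Per-example training losses, and the scorer's distribution over the batch.
    losses = F.cross_entropy(model(x), y, reduction="none")
    weights = torch.softmax(scorer(x).squeeze(-1), dim=0)

    # Gradient of the loss on the held-out dev set: the direction we care about.
    dev_grad = flat_grad(F.cross_entropy(model(x_dev), y_dev), params)

    # Reward each training example by how well its gradient aligns with dev_grad.
    rewards = torch.stack([
        F.cosine_similarity(flat_grad(loss_i, params), dev_grad, dim=0).detach()
        for loss_i in losses
    ])

    # REINFORCE-style update of the scorer: raise the weight of well-aligned data.
    # The graph is retained because `losses` is reused for the model update below.
    opt_scorer.zero_grad()
    (-(rewards * torch.log(weights + 1e-8)).sum()).backward(retain_graph=True)
    opt_scorer.step()

    # Update the main model on the scorer-weighted training loss.
    opt_model.zero_grad()
    (weights.detach() * losses).sum().backward()
    opt_model.step()
```

In this sketch the reward is simply the cosine similarity between a training example's gradient and the dev-set gradient, which is the intuition the abstract states; how DDS amortizes this computation so that it adds little overhead is what the paper itself develops.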

