Optimizing data usage via differentiable rewards

Published: 13 July 2020, in ICML'20: Proceedings of the 37th International Conference on Machine Learning (JMLR.org)

Abstract

To acquire a new skill, humans learn better and faster if a tutor informs them of how much attention they should pay to particular content or practice problems based on their current knowledge level. Similarly, a machine learning model could potentially be trained better if data is presented in a way that adapts to its current learning state. In this paper, we examine the problem of training an adaptive scorer that weights data instances to maximally benefit learning. Training such a scorer efficiently is a challenging problem; in order to precisely quantify the effect of a data instance on the final model, a naive approach would require completing the entire training process and observing final performance. We propose an efficient alternative - Differentiable Data Selection (DDS) - that formulates a scorer as a learnable function of the training data that can be efficiently updated along with the main model being trained. Specifically, DDS updates the scorer with an intuitive reward signal: it should up-weight the data whose gradient is similar to that of a development set upon which we would finally like to perform well. Without significant computational overhead, DDS delivers consistent improvements over several strong baselines on two very different tasks of machine translation and image classification.
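
The gradient-alignment reward described above can be made concrete with a short sketch. The PyTorch-style code below is illustrative only and is not the authors' implementation: `dds_step`, `scorer`, `flat_grad`, and both optimizers are hypothetical names, the scorer is assumed to be a separate network from the main model, and the per-example gradient loop is a deliberately naive rendering of the reward (the paper's method avoids this per-example cost with a more efficient update).

```python
import torch
import torch.nn.functional as F


def flat_grad(loss, params):
    """Flatten the gradient of `loss` w.r.t. `params` into a single vector."""
    grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
    return torch.cat([
        (g if g is not None else torch.zeros_like(p)).reshape(-1)
        for g, p in zip(grads, params)
    ])


def dds_step(model, scorer, opt_model, opt_scorer, train_batch, dev_batch):
    """One joint update of the main model and the data scorer (illustrative sketch)."""
    params = [p for p in model.parameters() if p.requires_grad]
    x, y = train_batch
    x_dev, y_dev = dev_batch

    # Per-example training losses, and the scorer's distribution over the batch.
    losses = F.cross_entropy(model(x), y, reduction="none")
    weights = torch.softmax(scorer(x).squeeze(-1), dim=0)

    # Gradient of the loss on the held-out dev set: the direction we care about.
    dev_grad = flat_grad(F.cross_entropy(model(x_dev), y_dev), params)

    # Reward each training example by how well its gradient aligns with dev_grad.
    rewards = torch.stack([
        F.cosine_similarity(flat_grad(loss_i, params), dev_grad, dim=0).detach()
        for loss_i in losses
    ])

    # REINFORCE-style update of the scorer: raise the weight of well-aligned data.
    # The graph is retained because `losses` is reused for the model update below.
    opt_scorer.zero_grad()
    (-(rewards * torch.log(weights + 1e-8)).sum()).backward(retain_graph=True)
    opt_scorer.step()

    # Update the main model on the scorer-weighted training loss.
    opt_model.zero_grad()
    (weights.detach() * losses).sum().backward()
    opt_model.step()
```

In this sketch the reward is simply the cosine similarity between a training example's gradient and the dev-set gradient, which is the intuition the abstract states; how DDS amortizes this computation so that it adds little overhead is what the paper itself develops.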

