research-article

Balance-Subsampled Stable Prediction Across Unknown Test Data

Authors:

Yueting Zhuang,

Aijun Zhang

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 16, Issue 3

Article No.: 45, Pages 1 - 21

https://doi.org/10.1145/3477052

Published: 22 October 2021

Abstract

In data mining and machine learning, it is commonly assumed that training and test data share the same population distribution. In practice, this assumption is often violated because of sample selection bias, which can induce a distribution shift from training data to test data. Such a model-agnostic distribution shift usually leads to unstable predictions across unknown test data. This article proposes a novel balance-subsampled stable prediction (BSSP) algorithm based on the theory of fractional factorial design. The algorithm isolates the clear effect of each predictor from the confounding variables. A design-theoretic analysis shows that the proposed method reduces the confounding effects among predictors induced by the distribution shift, thereby improving both the accuracy of parameter estimation and the stability of prediction across unknown test data. Numerical experiments on synthetic and real-world datasets demonstrate that the BSSP algorithm significantly outperforms the baseline methods for stable prediction across unknown test data.
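The idea of balancing predictors through a fractional factorial design can be illustrated with a toy example. The sketch below is not the BSSP algorithm from the article; it is a minimal illustration under stated assumptions: three binary predictors in -1/+1 coding, a 2^(3-1) half-fraction design with generator C = A*B, and simulated data in which B is strongly tied to A. All function names (half_fraction_design, mean_product, balance_subsample, biased_data) are hypothetical and introduced only for this illustration.

```python
import itertools
import random


def half_fraction_design():
    """Return the 4 runs of a 2^(3-1) design in -1/+1 coding (generator C = A*B)."""
    return [(a, b, a * b) for a, b in itertools.product((-1, 1), repeat=2)]


def mean_product(rows, i, j):
    """Average of x_i * x_j over the rows; 0 means columns i and j are balanced."""
    return sum(r[i] * r[j] for r in rows) / len(rows)


def balance_subsample(data, design, rng):
    """Pick equally many observations matching each design run (hypothetical helper)."""
    groups = {run: [x for x in data if x == run] for run in design}
    k = min(len(g) for g in groups.values())
    if k == 0:
        raise ValueError("some design run has no matching observation")
    picked = []
    for run in design:
        picked.extend(rng.sample(groups[run], k))
    return picked


def biased_data(n, rng):
    """Simulate selection bias: B copies A 90% of the time, C is independent noise."""
    data = []
    for _ in range(n):
        a = rng.choice((-1, 1))
        b = a if rng.random() < 0.9 else -a
        c = rng.choice((-1, 1))
        data.append((a, b, c))
    return data


rng = random.Random(0)
design = half_fraction_design()
full = biased_data(2000, rng)
sub = balance_subsample(full, design, rng)

print("A-B association, full data:", round(mean_product(full, 0, 1), 3))
print("A-B association, subsample:", mean_product(sub, 0, 1))
```

Because every design run appears equally often in the subsample and the design columns are pairwise orthogonal, the association between any two predictors in the subsample is exactly zero, whereas it is strongly positive in the full biased data. The price is a smaller sample, since the rarest run limits how many observations can be kept.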

Cited By

Kimura M. (2023). On the Decomposition of Covariate Shift Assumption for the Set-to-Set Matching. IEEE Access 11, 120728-120740. Online publication date: 2023.
https://doi.org/10.1109/ACCESS.2023.3324044
Yuan J., Wu A., Kuang K., Li B., Wu R., Wu F., Lin L. (2022). Auto IV: Counterfactual Prediction via Automatic Instrumental Variable Decomposition. ACM Transactions on Knowledge Discovery from Data 16:4, 1-20. Online publication date: 8-Jan-2022.
https://dl.acm.org/doi/10.1145/3494568
Chen Z., Xiao T., Kuang K. (2022). BA-GNN: On Learning Bias-Aware Graph Neural Network. 2022 IEEE 38th International Conference on Data Engineering (ICDE), 3012-3024. Online publication date: May-2022.
https://doi.org/10.1109/ICDE53745.2022.00271

Index Terms

Balance-Subsampled Stable Prediction Across Unknown Test Data
1. Computing methodologies
  1. Artificial intelligence
    1. Knowledge representation and reasoning
      1. Causal reasoning and diagnostics
  2. Machine learning
    1. Machine learning approaches
      1. Logical and relational learning
        Statistical relational learning


Published In

ACM Transactions on Knowledge Discovery from Data Volume 16, Issue 3

June 2022

494 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/3485152

Editor:
Charu Aggarwal
IBM T. J. Watson Research, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 October 2021

Accepted: 01 July 2021

Revised: 01 May 2021

Received: 01 December 2020

Published in TKDD Volume 16, Issue 3

Qualifiers

Research-article
Refereed

Funding Sources

Hong Kong General Research Fund
National Key Research and Development Program of China
National Natural Science Foundation of China
Zhejiang Province Natural Science Foundation

