Balance-Subsampled Stable Prediction Across Unknown Test Data

Published: 22 October 2021

Abstract

In data mining and machine learning, it is commonly assumed that training and test data share the same population distribution. However, this assumption is often violated in practice because of sample selection bias, which can induce a distribution shift from training data to test data. Such a model-agnostic distribution shift usually leads to prediction instability across unknown test data. This article proposes a novel balance-subsampled stable prediction (BSSP) algorithm based on the theory of fractional factorial design. It isolates the clear effect of each predictor from the confounding variables. A design-theoretic analysis shows that the proposed method can reduce the confounding effects among predictors induced by the distribution shift, improving both the accuracy of parameter estimation and the stability of prediction across unknown test data. Numerical experiments on synthetic and real-world datasets demonstrate that the BSSP algorithm can significantly outperform the baseline methods for stable prediction across unknown test data.
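
The idea of subsampling toward a balanced design can be illustrated with a minimal sketch. This is not the paper's BSSP algorithm: it assumes two binary predictors and balances over the full factorial of their levels, as a simple stand-in for the fractional factorial designs the abstract mentions. The function name `balanced_subsample`, the synthetic data, and the 80% agreement rate are all hypothetical choices made for illustration.

```python
import numpy as np


def balanced_subsample(X, rng):
    """Return row indices with equally many rows per observed level combination.

    Illustrative only: assumes the columns of X are discrete (here binary)
    predictors. It is a full-factorial balance, not the paper's method.
    """
    # Group row indices by their combination of predictor levels.
    groups = {}
    for i, row in enumerate(X):
        groups.setdefault(tuple(row), []).append(i)
    # Every combination contributes as many rows as the rarest one has.
    m = min(len(rows) for rows in groups.values())
    picked = [rng.choice(rows, size=m, replace=False) for rows in groups.values()]
    return np.sort(np.concatenate(picked))


rng = np.random.default_rng(0)
n = 4000

# Synthetic "biased" training data: x2 agrees with x1 about 80% of the time,
# so the two predictors are strongly correlated in the training sample.
x1 = rng.integers(0, 2, size=n)
flip = rng.random(n) < 0.2
x2 = np.where(flip, 1 - x1, x1)
X = np.column_stack([x1, x2])

idx = balanced_subsample(X, rng)

corr_before = np.corrcoef(X[:, 0], X[:, 1])[0, 1]
corr_after = np.corrcoef(X[idx, 0], X[idx, 1])[0, 1]
print(f"correlation before subsampling: {corr_before:.3f}")
print(f"correlation after subsampling:  {corr_after:.3f}")
```

In the balanced subsample every combination of levels appears equally often, so the two predictors are uncorrelated and their effects can be estimated separately. A full factorial needs every combination to be present, which becomes impractical as the number of predictors grows; that is the usual motivation for fractional designs.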

Cited By

  • (2023) On the Decomposition of Covariate Shift Assumption for the Set-to-Set Matching. IEEE Access 11 (120728-120740). DOI: 10.1109/ACCESS.2023.3324044. Online publication date: 2023
  • (2022) Auto IV: Counterfactual Prediction via Automatic Instrumental Variable Decomposition. ACM Transactions on Knowledge Discovery from Data 16:4 (1-20). DOI: 10.1145/3494568. Online publication date: 8-Jan-2022
  • (2022) BA-GNN: On Learning Bias-Aware Graph Neural Network. 2022 IEEE 38th International Conference on Data Engineering (ICDE) (3012-3024). DOI: 10.1109/ICDE53745.2022.00271. Online publication date: May-2022

Published In

ACM Transactions on Knowledge Discovery from Data, Volume 16, Issue 3
June 2022
494 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/3485152

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 October 2021
Accepted: 01 July 2021
Revised: 01 May 2021
Received: 01 December 2020
Published in TKDD Volume 16, Issue 3

Author Tags

  1. Stable prediction
  2. distribution shift
  3. subsampling
  4. variable deconfounding
  5. fractional factorial design

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • Hong Kong General Research Fund
  • National Key Research and Development Program of China
  • National Natural Science Foundation of China
  • Zhejiang Province Natural Science Foundation

Article Metrics

  • Downloads (Last 12 months): 71
  • Downloads (Last 6 weeks): 6

Reflects downloads up to 09 Nov 2024
