research-article

Correcting Misclassification Bias in Regression Models with Variables Generated via Data Mining

Published: 01 June 2021

Abstract

Social science studies increasingly apply data mining methods to construct variables for regression analysis. For example, text classification has been used to label each review as subjective or objective, and the derived review subjectivity was then used as an independent variable in a regression to examine its impact on review helpfulness. In the classification phase of these studies, researchers must subjectively choose a classification performance metric to optimize. Whichever metric is chosen, the constructed variable still contains classification error because the variable cannot be classified perfectly. This misclassification leads to inconsistent estimators of the regression coefficients in the subsequent regression phase. To correct the inconsistency, we summarize and modify existing proofs in econometrics to derive theoretical formulas for consistent estimators in generalized linear models. The main implication of our theoretical result is that the inconsistency can be corrected by these formulas, even when classification accuracy is poor. We therefore propose that a classification algorithm be tuned to minimize the standard error of the focal coefficient derived from the corrected formula. As a result, researchers can obtain a consistent estimator that is also the most precise one available in generalized linear models.

Abstract

As a result of advances in data mining, more and more empirical studies in the social sciences apply classification algorithms to construct independent or dependent variables, which are then analyzed with standard regression methods. In the classification phase of these studies, the standard procedure requires researchers to subjectively choose a classification performance metric to optimize. Whichever metric is chosen, the constructed variable still contains classification error because such variables cannot be classified perfectly. This misclassification leads to inconsistent regression coefficient estimates in the subsequent regression phase, a problem documented in the econometrics literature as measurement error. Earlier work has provided pioneering discussions of the estimation inconsistency that misclassification causes in these studies. Our study systematically investigates the theoretical foundation of this problem when a newly constructed variable is used as the independent or dependent variable in linear and nonlinear regressions. Our theoretical analysis shows that consistent regression estimators can be recovered in all models studied in this paper. The main implication of our theoretical result is that researchers do not need to tune the classification algorithm to minimize the inconsistency of the estimated regression coefficients, because the inconsistency can be corrected by theoretical formulas even when classification accuracy is poor. Instead, we propose that a classification algorithm be tuned to minimize the standard error of the focal regression coefficient derived from the corrected formula. As a result, researchers can obtain a consistent estimator that is also the most precise one available in all models studied in this paper.
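
The inconsistency described above can be illustrated with a small simulation. The sketch below is not the paper's estimator. It uses the textbook moment correction for a misclassified binary regressor in a simple linear model, assuming nondifferential misclassification with known false positive and false negative rates (in practice these would have to be estimated, for example from a labeled validation sample). All function names, parameter names, and numeric settings are hypothetical choices made for illustration.

```python
import numpy as np


def naive_slope(y, w):
    """OLS slope of y on the observed (possibly misclassified) binary regressor w."""
    return np.cov(y, w, ddof=1)[0, 1] / np.var(w, ddof=1)


def corrected_slope(y, w, fp_rate, fn_rate):
    """Rescale the naive slope using assumed-known misclassification rates.

    fp_rate = P(w = 1 | x = 0), fn_rate = P(w = 0 | x = 1).
    Under nondifferential misclassification the naive slope converges to
    beta1 * p(1 - p)(1 - fp_rate - fn_rate) / (pi(1 - pi)),
    where p = P(x = 1) and pi = P(w = 1).
    """
    pi_hat = w.mean()
    # Recover the true class share from the observed share.
    p_hat = (pi_hat - fp_rate) / (1.0 - fp_rate - fn_rate)
    attenuation = (
        p_hat * (1.0 - p_hat) * (1.0 - fp_rate - fn_rate)
        / (pi_hat * (1.0 - pi_hat))
    )
    return naive_slope(y, w) / attenuation


def simulate(n=200_000, p=0.4, beta0=1.0, beta1=2.0,
             fp_rate=0.2, fn_rate=0.3, seed=0):
    """Draw y = beta0 + beta1 * x + noise and a noisy classifier label w for x."""
    rng = np.random.default_rng(seed)
    x = rng.random(n) < p                       # true binary variable
    flip_prob = np.where(x, fn_rate, fp_rate)   # error rate depends only on x
    flipped = rng.random(n) < flip_prob
    w = np.where(flipped, ~x, x).astype(float)  # observed classifier output
    y = beta0 + beta1 * x + rng.normal(size=n)
    return y, w


if __name__ == "__main__":
    y, w = simulate()
    print("naive slope:    ", naive_slope(y, w))
    print("corrected slope:", corrected_slope(y, w, fp_rate=0.2, fn_rate=0.3))
```

With these illustrative settings the attenuation factor is 0.5, so the naive slope converges to about half of the true coefficient of 2.0, while the rescaled slope recovers it even though the classifier is quite inaccurate. The correction inflates the sampling variance, which is consistent with the abstract's point that precision, not inconsistency, is what remains to be optimized.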

Published In

Information Systems Research, Volume 32, Issue 2
June 2021
380 pages
ISSN: 1526-5536
DOI: 10.1287/isre.2021.32.issue-2

Publisher

INFORMS
Linthicum, MD, United States

Publication History

Published: 01 June 2021
Accepted: 10 August 2020
Received: 27 December 2018

Author Tags

1. data mining
2. econometrics
3. measurement error
4. misclassification
5. statistical inference
6. performance metric
