Mind the Gap: Accounting for Measurement Error and Misclassification in Variables Generated via Data Mining

Authors:

Gediminas Adomavicius,

Yuqing Ren

Information Systems Research, Volume 29, Issue 1

Pages 4 - 24

https://doi.org/10.1287/isre.2017.0727

Published: 01 March 2018

Abstract

The application of predictive data mining techniques in information systems research has grown in recent years, likely because of their effectiveness and scalability in extracting information from large amounts of data. A number of scholars have sought to combine data mining with traditional econometric analyses. Typically, data mining methods are first used to generate new variables (e.g., text sentiment), which are then added to subsequent econometric models as independent regressors. However, because prediction is almost always imperfect, variables generated from the first-stage data mining models inevitably contain measurement error or misclassification. These errors, if ignored, can introduce systematic biases into the second-stage econometric estimations and threaten the validity of statistical inference. In this commentary, we examine the nature of this bias, both analytically and empirically, and show that it can be severe even when data mining models exhibit relatively high performance. We then show that this bias becomes increasingly difficult to anticipate as the functional form of the measurement error or the specification of the econometric model grows more complex. We review several methods for error correction and focus on two simulation-based methods, SIMEX and MC-SIMEX, which can be easily parameterized using standard performance metrics from data mining models, such as error variance or the confusion matrix, and can be applied under a wide range of econometric specifications. Finally, we demonstrate the effectiveness of SIMEX and MC-SIMEX through simulations and by applying the methods to econometric estimations that employ variables mined from three real-world data sets related to travel, social networking, and crowdfunding campaign websites.
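To make the idea behind SIMEX concrete, the following is a minimal sketch and not the authors' implementation. It assumes the simplest possible setting: a single mined regressor with classical additive error of known standard deviation, a linear second-stage model, and a quadratic extrapolation function. All names (`ols_slope`, `simex_slope`, `sigma_u`) and the simulated data are hypothetical and chosen only for illustration.

```python
import numpy as np

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with intercept)."""
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

def simex_slope(w, y, sigma_u, lambdas=(0.5, 1.0, 1.5, 2.0), n_sim=50, seed=0):
    """SIMEX sketch: inflate the error variance by (1 + lambda), track how the
    slope degrades, then extrapolate the trend back to lambda = -1 (no error)."""
    rng = np.random.default_rng(seed)
    lam_grid = [0.0]
    mean_slopes = [ols_slope(w, y)]  # lambda = 0 is the naive estimate
    for lam in lambdas:
        # Simulation step: add extra noise with variance lam * sigma_u**2.
        sims = [
            ols_slope(w + np.sqrt(lam) * sigma_u * rng.standard_normal(w.size), y)
            for _ in range(n_sim)
        ]
        lam_grid.append(lam)
        mean_slopes.append(np.mean(sims))
    # Extrapolation step: quadratic fit in lambda, evaluated at lambda = -1.
    coefs = np.polyfit(lam_grid, mean_slopes, deg=2)
    return np.polyval(coefs, -1.0)

# Hypothetical data: true slope is 2, the mined variable w is x plus noise.
rng = np.random.default_rng(42)
n, true_beta, sigma_u = 5000, 2.0, 0.5
x = rng.standard_normal(n)                  # error-free variable (unobserved)
w = x + sigma_u * rng.standard_normal(n)    # mined variable with measurement error
y = 1.0 + true_beta * x + rng.standard_normal(n)

naive = ols_slope(w, y)
corrected = simex_slope(w, y, sigma_u)
print(f"naive slope: {naive:.3f}, SIMEX slope: {corrected:.3f}")
```

In this setup the naive slope is attenuated toward roughly 2 / (1 + 0.25) = 1.6, the classical attenuation result for these variances. The quadratic extrapolation moves the estimate back toward 2 but does not fully recover it, because the true dependence on lambda is not exactly quadratic. MC-SIMEX follows the same simulate-then-extrapolate logic for misclassified categorical variables, using a confusion matrix in place of the error variance.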

The online appendix is available at https://doi.org/10.1287/isre.2017.0727.

Published In

Information Systems Research Volume 29, Issue 1

March 2018

254 pages

ISSN: 1526-5536

Publisher

INFORMS

Linthicum, MD, United States

Publication History

Published: 01 March 2018

Accepted: 24 April 2017

Received: 06 October 2015
