
Mind the Gap: Accounting for Measurement Error and Misclassification in Variables Generated via Data Mining

Published: 01 March 2018


The application of predictive data mining techniques in information systems research has grown in recent years, likely because of their effectiveness and scalability in extracting information from large amounts of data. A number of scholars have sought to combine data mining with traditional econometric analyses. Typically, data mining methods are first used to generate new variables (e.g., text sentiment), which are then added to subsequent econometric models as independent regressors. However, because prediction is almost always imperfect, variables generated from the first-stage data mining models inevitably contain measurement error or misclassification. These errors, if ignored, can introduce systematic biases into the second-stage econometric estimations and threaten the validity of statistical inference. In this commentary, we examine the nature of this bias, both analytically and empirically, and show that it can be severe even when data mining models exhibit relatively high performance. We then show that this bias becomes increasingly difficult to anticipate as the functional form of the measurement error or the specification of the econometric model grows more complex. We review several methods for error correction and focus on two simulation-based methods, SIMEX and MC-SIMEX, which can be easily parameterized using standard performance metrics from data mining models, such as error variance or the confusion matrix, and can be applied under a wide range of econometric specifications. Finally, we demonstrate the effectiveness of SIMEX and MC-SIMEX through simulations and by applying the methods to econometric estimations that employ variables mined from three real-world data sets related to travel, social networking, and crowdfunding campaign websites.
The online appendix is available at https://doi.org/10.1287/isre.2017.0727.
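
To make the simulation-extrapolation idea behind SIMEX concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a single continuous regressor contaminated by classical additive error with a known standard deviation (in practice this would have to be estimated, for example from the data mining model's error variance on labeled hold-out data), a linear second-stage model, and a quadratic extrapolation function. All names (`ols_slope`, `simex_slope`, `sigma_u`) and all simulated values are hypothetical and chosen only for illustration.

```python
import numpy as np


def ols_slope(x, y):
    """Slope of a simple OLS regression of y on x (with intercept)."""
    return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)


def simex_slope(w, y, sigma_u, lambdas=(0.0, 0.5, 1.0, 1.5, 2.0),
                n_rep=50, rng=None):
    """SIMEX-style corrected slope for y ~ w, where w = x + u.

    Simulation step: for each lambda, add extra noise with variance
    lambda * sigma_u**2 to w and record the average naive slope.
    Extrapolation step: fit a quadratic in lambda to those averages
    and evaluate it at lambda = -1 (the hypothetical no-error case).
    """
    rng = np.random.default_rng() if rng is None else rng
    lambdas = np.asarray(lambdas, dtype=float)
    mean_slopes = []
    for lam in lambdas:
        slopes = [
            ols_slope(
                w + np.sqrt(lam) * sigma_u * rng.standard_normal(w.shape[0]),
                y,
            )
            for _ in range(n_rep)
        ]
        mean_slopes.append(np.mean(slopes))
    coefs = np.polyfit(lambdas, mean_slopes, deg=2)
    return np.polyval(coefs, -1.0)


# Simulated example (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n = 20_000
beta0, beta1 = 1.0, 2.0          # true second-stage coefficients
sigma_u = 0.5                    # std. dev. of the measurement error

x = rng.standard_normal(n)                   # true (unobserved) regressor
w = x + sigma_u * rng.standard_normal(n)     # mined, error-prone regressor
y = beta0 + beta1 * x + rng.standard_normal(n)

naive_slope = ols_slope(w, y)                # attenuated toward zero
corrected_slope = simex_slope(w, y, sigma_u, rng=rng)

print(f"true slope:      {beta1:.3f}")
print(f"naive slope:     {naive_slope:.3f}")
print(f"SIMEX-corrected: {corrected_slope:.3f}")
```

In this setup the naive slope is attenuated toward zero, and the extrapolated estimate recovers most, though not all, of the gap, because a quadratic only approximates the true relationship between the added error variance and the slope. MC-SIMEX follows the same simulate-then-extrapolate logic for discrete variables, using the misclassification (confusion) matrix in place of the error variance; it is not shown here.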



Published In

Information Systems Research, Volume 29, Issue 1
March 2018
254 pages



Linthicum, MD, United States

Publication History

Published: 01 March 2018
Accepted: 24 April 2017
Received: 06 October 2015

Author Tags

  1. data mining
  2. econometrics
  3. measurement error
  4. misclassification
  5. statistical inference


  • Article


Cited By

  • (2024) Automation of Strategic Data Prioritization in System Model Calibration. INFORMS Journal on Computing 36:1 (163-184). DOI: 10.1287/ijoc.2022.0128. Online publication date: 1-Jan-2024
  • (2024) To be honest or positive? The effect of Airbnb host description on consumer behavior. Decision Support Systems 181:C. DOI: 10.1016/j.dss.2024.114200. Online publication date: 1-Jun-2024
  • (2023) sDTM. Information Systems Research 34:1 (137-156). DOI: 10.1287/isre.2022.1124. Online publication date: 1-Mar-2023
  • (2023) Are You What You Tweet? The Impact of Sentiment on Digital News Consumption and Social Media Sharing. Information Systems Research 34:1 (111-136). DOI: 10.1287/isre.2022.1112. Online publication date: 1-Mar-2023
  • (2023) Visual-audio correspondence and its effect on video tipping. Information Processing and Management: an International Journal 60:3. DOI: 10.1016/j.ipm.2023.103347. Online publication date: 1-May-2023
  • (2023) DeepEmotionNet. Information Processing and Management: an International Journal 60:3. DOI: 10.1016/j.ipm.2022.103151. Online publication date: 1-May-2023
  • (2022) Uncovering the Necessary Hard- and Soft-Skills to Get IT Personnel Jobs. Proceedings of the 2022 Computers and People Research Conference (1-7). DOI: 10.1145/3510606.3550213. Online publication date: 2-Jun-2022
  • (2022) Deep Transfer Learning & Beyond: Transformer Language Models in Information Systems Research. ACM Computing Surveys 54:10s (1-35). DOI: 10.1145/3505245. Online publication date: 13-Sep-2022
  • (2021) Correcting Misclassification Bias in Regression Models with Variables Generated via Data Mining. Information Systems Research 32:2 (462-480). DOI: 10.1287/isre.2020.0977. Online publication date: 1-Jun-2021
  • (2020) A Note on the Impact of Daily Deals on Local Retailers’ Online Reputation. Information Systems Research 31:4 (1132-1143). DOI: 10.1287/isre.2020.0935. Online publication date: 1-Dec-2020
