Frontmatter
This textbook provides future data analysts with the tools, methods, and skills needed to answer
data-focused, real-life questions; to carry out data analysis; and to visualize and interpret results to support
better decisions in business, economics, and public policy.
Data wrangling and exploration, regression analysis, machine learning, and causal analysis are
comprehensively covered, as well as when, why, and how the methods work, and how they relate to
each other.
As the most effective way to communicate data analysis, running case studies play a central role in
this textbook. Each case starts with an industry-relevant question and answers it by using real-world
data and applying the tools and methods covered in the textbook. Learning is then consolidated by
360 practice questions and 120 data exercises.
Extensive online resources, including raw and cleaned data and code for all analyses in Stata, R,
and Python, can be found at http://www.gabors-data-analysis.com.
Gábor Békés is an assistant professor at the Department of Economics and Business of the Central
European University, and Director of the Business Analytics Program. He is a senior fellow at KRTK
and a research affiliate at the Center for Economic Policy Research (CEPR). He has published in top
economics journals on multinational firm activities and productivity, business clusters, and innovation
spillovers. He has managed international data collection projects on firm performance and supply
chains. He has done policy advising (for the European Commission and the ECB) as well as private-sector
consultancy (in finance, business intelligence, and real estate). He has taught graduate-level data analysis
and economic geography courses since 2012.
Gábor Kézdi is a research associate professor at the University of Michigan’s Institute for Social
Research. He has published in top journals in economics, statistics, and political science on topics
including household finances, health, education, demography, and ethnic disadvantages and prejudice.
He has managed several data collection projects in Europe; currently, he is co-investigator of
the Health and Retirement Study in the USA. He has consulted for various governmental and
non-governmental institutions on the disadvantage of the Roma minority and the evaluation of social
interventions. He has taught data analysis, econometrics, and labor economics from undergraduate
to PhD levels since 2002, and supervised a number of MA and PhD students.
“This exciting new text covers everything today’s aspiring data scientist needs to know, managing to be comprehensive as well as
accessible. Like a good confidence interval, the Gabors have got you almost completely covered!”
Professor Joshua Angrist, Massachusetts Institute of Technology
“This is an excellent book for students learning the art of modern data analytics. It combines the latest techniques with practical
applications, replicating the implementation side of classroom teaching that is typically missing in textbooks. For example, they
used the World Management Survey data to generate exercises on firm performance for students to gain experience in handling
real data, with all its quirks, problems, and issues. For students looking to learn data analysis from one textbook, this is a great way
to proceed.”
Professor Nicholas Bloom, Department of Economics and Stanford Business School, Stanford University
“I know of few books about data analysis and visualization that are as comprehensive, deep, practical, and current as this one; and
I know of almost none that are as fun to read. Gábor Békés and Gábor Kézdi have created a most unusual and most compelling
beast: a textbook that teaches you the subject matter well and that, at the same time, you can enjoy reading cover to cover.”
Professor Alberto Cairo, University of Miami
“A beautiful integration of econometrics and data science that provides a direct path from data collection and exploratory analysis to
conventional regression modeling, then on to prediction and causal modeling. Exactly what is needed to equip the next generation
of students with the tools and insights from the two fields.”
Professor David Card, University of California–Berkeley
“This textbook is excellent at dissecting and explaining the underlying process of data analysis. Békés and Kézdi have masterfully
woven into their instruction a comprehensive range of case studies. The result is a rigorous textbook grounded in real-world
learning, at once accessible and engaging to novice scholars and advanced practitioners alike. I have every confidence it will be
valued by future generations.”
Professor Kerwin K. Charles, Yale School of Management
“This book takes you by the hand in a journey that will bring you to understand the core value of data in the fields of machine
learning and economics. The large amount of accessible examples combined with the intuitive explanation of foundational concepts
is an ideal mix for anyone who wants to do data analysis. It is highly recommended to anyone interested in the new way in which
data will be analyzed in the social sciences in the next years.”
Professor Christian Fons-Rosen, Barcelona Graduate School of Economics
“This sophisticatedly simple book is ideal for undergraduate- or Master’s-level Data Analytics courses with a broad audience. The
authors discuss the key aspects of examining data, regression analysis, prediction, Lasso, random forests, and more, using elegant
prose instead of algebra. Using well-chosen case studies, they illustrate the techniques and discuss all of them patiently and
thoroughly.”
Professor Carter Hill, Louisiana State University
“This is not an econometrics textbook. It is a data analysis textbook. And a highly unusual one - written in plain English, based on
simplified notation, and full of case studies. An excellent starting point for future data analysts or anyone interested in finding out
what data can tell us.”
Professor Beata Javorcik, University of Oxford
“A multifaceted book that considers many sides of data analysis, all of them important for the contemporary student and practitioner.
It brings together classical statistics, regression, and causal inference, sending the message that awareness of all three aspects
is important for success in this field. Many ’best practices’ are discussed in accessible language, and illustrated using interesting
datasets.”
Professor Ilya Ryzhov, University of Maryland
“This is a fantastic book to have. Strong data skills are critical for modern business and economic research, and this text provides
a thorough and practical guide to acquiring them. Highly recommended.”
Professor John van Reenen, MIT Sloan
“Energy and climate change is a major public policy challenge, where high-quality data analysis is the foundation of solid policy.
This textbook will make an important contribution to this with its innovative approach. In addition to the comprehensive treatment
of modern econometric techniques, the book also covers the less glamorous but crucial aspects of procuring and cleaning data,
and drawing useful inferences from less-than-perfect datasets. An important and practical combination for both academic and
policy professionals.”
Laszlo Varro, Chief Economist, International Energy Agency
DATA ANALYSIS
FOR BUSINESS,
ECONOMICS,
AND POLICY
Gábor Békés
Central European University, Vienna and Budapest
Gábor Kézdi
University of Michigan, Ann Arbor
BRIEF CONTENTS
I DATA EXPLORATION
1 Origins of Data
22 Difference-in-Differences
References
Index
CONTENTS
I DATA EXPLORATION
1 Origins of Data
1.1 What Is Data?
1.2 Data Structures
1.A1 CASE STUDY – Finding a Good Deal among Hotels: Data Collection
1.3 Data Quality
1.B1 CASE STUDY – Comparing Online and Offline Prices: Data Collection
1.C1 CASE STUDY – Management Quality and Firm Performance: Data Collection
1.4 How Data Is Born: The Big Picture
1.5 Collecting Data from Existing Sources
1.A2 CASE STUDY – Finding a Good Deal among Hotels: Data Collection
1.B2 CASE STUDY – Comparing Online and Offline Prices: Data Collection
1.6 Surveys
1.C2 CASE STUDY – Management Quality and Firm Size: Data Collection
1.7 Sampling
1.8 Random Sampling
1.B3 CASE STUDY – Comparing Online and Offline Prices: Data Collection
1.C3 CASE STUDY – Management Quality and Firm Size: Data Collection
1.9 Big Data
1.10 Good Practices in Data Collection
1.11 Ethical and Legal Issues of Data Collection
1.12 Main Takeaways
Practice Questions
Data Exercises
References and Further Reading
7.U1 Under the Hood: Derivation of the OLS Formulae for the Intercept and Slope Coefficients
7.U2 Under the Hood: More on Residuals and Predicted Values with OLS
13.5 Mean Squared Error (MSE) and Root Mean Squared Error (RMSE)
13.6 Bias and Variance of Predictions
13.7 The Task of Finding the Best Model
13.8 Finding the Best Model by Best Fit and Penalty: The BIC
13.9 Finding the Best Model by Training and Test Samples
13.10 Finding the Best Model by Cross-Validation
13.A3 CASE STUDY – Predicting Used Car Value with Linear Regressions
13.11 External Validity and Stable Patterns
13.A4 CASE STUDY – Predicting Used Car Value with Linear Regressions
13.12 Machine Learning and the Role of Algorithms
13.13 Main Takeaways
Practice Questions
Data Exercises
References and Further Reading
22 Difference-in-Differences
22.1 Conditioning on Pre-intervention Outcomes
22.2 Basic Difference-in-Differences Analysis: Comparing Average Changes
22.A1 CASE STUDY – How Does a Merger between Airlines Affect Prices?
22.3 The Parallel Trends Assumption
22.A2 CASE STUDY – How Does a Merger between Airlines Affect Prices?
22.4 Conditioning on Additional Confounders in Diff-in-Diffs Regressions
22.A3 CASE STUDY – How Does a Merger between Airlines Affect Prices?
22.5 Quantitative Causal Variable
22.A4 CASE STUDY – How Does a Merger between Airlines Affect Prices?
22.6 Difference-in-Differences with Pooled Cross-Sections
22.A5 CASE STUDY – How Does a Merger between Airlines Affect Prices?
22.7 Main Takeaways
Practice Questions
Data Exercises
References and Further Reading
23.B1 CASE STUDY – Immunization against Measles and Saving Children
23.8 Clustered Standard Errors
23.9 Panel Regression in First Differences
23.10 Lags and Leads in FD Panel Regressions
23.B2 CASE STUDY – Immunization against Measles and Saving Children
23.11 Aggregate Trend and Individual Trends in FD Models
23.B3 CASE STUDY – Immunization against Measles and Saving Children
23.12 Panel Regressions and Causality
23.13 First Differences or Fixed Effects?
23.14 Dealing with Unbalanced Panels
23.15 Main Takeaways
Practice Questions
Data Exercises
References and Further Reading
References
Index
When deciding on which methods to discuss and in what depth, we drew on our own experience
as well as the advice of many people. We have taught Data Analysis and Econometrics to students
in Master’s programs for years in Europe and the USA, and trained experts in business analytics,
economics, and economic policy. We used earlier versions of this textbook in many courses with
students who differed in background, interest, and career plans. In addition, we talked to many experts
both in academia and in industry: teachers, researchers, analysts, and users of data analysis results.
As a result, this textbook offers curated content that reflects the views of data analysts with
a wide range of experiences.
SIMPLIFIED NOTATION
ACKNOWLEDGMENTS
Let us first thank our students at the Central European University, at the University of Michigan, and
at the University of Reading. The idea of writing a textbook was born out of teaching and mentoring
them. We have learned a lot from teaching them, and many of them helped us write code, collect
data, read papers, and hunt for ideas.
Many colleagues helped us with their extremely valuable comments and suggestions. We thank
Eduardo Arino de la Rubia, Emily Blanchard, Imre Boda, Alberto Cairo, Gergely Daróczi, János Divényi,
Christian Fons-Rosen, Bonnie Kavoussi, Olivér Kiss, Miklós Koren, Mike Luca, Róbert Lieli, László
Mátyás, Tímea Laura Molnár, Arieda Muço, Jenő Pál, and Ádám Szeidl, as well as the anonymous reviewers
of the first draft of the textbook.
We have received help with our case studies from Alberto Cavallo, Daniella Scur, Nick Bloom, John
van Reenen, Anikó Kristof, József Keleti, Emily Oster, and MyChelle Andrews. We have learned a lot
from them.
Several people helped us a great deal with our manuscript. At Cambridge University Press, our
commissioning editor, Phil Good, encouraged us from the day we met. Our editors, Heather Brolly,
Jane Adams, and Nicola Chapman, guided us with kindness and steadfastness from first draft to
proofs. We are not native English speakers, and support from Chris Cartwright and Jon Billam was
very useful. We are grateful to Sarolta Rózsás, who read and edited endless versions of chapters,
checking consistency and clarity, and pushed us to make the text more coherent and accessible.
Creating the code base in Stata, R, and Python was a massive endeavor. Both of us are primarily
Stata users, and we needed R code that would be fairly consistent with Stata code. Plus, all graphs were
produced in R. So we needed help to have all our Stata code replicated in R, and a great deal of code
written from scratch. Zsuzsa Holler and Kinga Ritter have provided enormous development support,
spearheading this effort for years. Additional code and refactoring in R were contributed by Máté Tóth,
János Bíró, and Eszter Pázmándi. János and Máté also created the first version of the Python notebooks.
Additional coding, data collection, visualization, and editing were done by Viktória Kónya, Zsófia
Kőműves, Dániel Bánki, Abuzar Ali, Endre Borza, Imola Csóka, and Ahmed Al Shaibani.
The wonderful cover design is based on the work by Ágoston Nagy, his first but surely not his last.
Collaborating with many talented people, including our former students, and bringing them
together was one of the joys of writing this book.
Let us also shout out to the fantastic R user community – both online and offline – from whom
we learned tremendously. Special thanks to the Rstats and Econ Twitter community – we received
wonderful suggestions from tons of people we have never met.
We thank the Central European University for professional and financial support. Julius Horvath
and Miklós Koren as department heads provided massive support from the day we shared our plans.
Finally, let us thank those who were with us throughout the long, and often stressful, process of
writing a textbook. Békés thanks Saci; Kézdi thanks Zsuzsanna. We would not have been able to do
it without their love and support.