Guidebook on Exploratory Data Analysis
CONTENT
A. EDA Essential Concepts
   Step 1: Understanding the Structure and Summary of the Dataset
   Step 2: Checking for Missing Values
   Step 3: Identifying Outliers
   Step 4: Univariate Analysis (Individual Variable Exploration)
   Step 5: Bivariate Analysis (Exploring Relationships between Variables)
   Step 6: Dimensionality Reduction (Optional, for Large Datasets)
   Step 7: Data Cleaning as Part of EDA
   Step 8: Reporting Findings
B. Deeper Data Analysis
   Step 1: Define Your Research Questions and Hypotheses
   Step 2: Descriptive Statistics (Summarize Key Variables)
   Step 3: Inferential Statistics (Hypothesis Testing)
   Step 4: Building Predictive Models (If Applicable)
   Step 5: Model Validation
   Step 6: Data Interpretation and Contextualization
   Step 7: Reporting and Visualization
   Step 8: Ensuring Reproducibility
   Step 9: Review and Peer Feedback
C. Further Adjustments Given the Two Rounds of the PHL Data Collection
   1. Organize the File Structure
   2. Prepare Documentation
   3. Data Cleaning for Each Visit Separately
   4. Merging the Data from Both Visits
   5. Exploratory Data Analysis (EDA) for Panel Data
   6. Inferential Statistics and Hypothesis Testing for Panel Data
   7. Data Interpretation and Contextualization for Panel Data
   8. Reporting and Visualization
   9. Ensuring Reproducibility
D. Typical Reproducible Report Format
A. EDA Essential Concepts

Exploratory Data Analysis (EDA) is a crucial step in understanding the structure and characteristics of the PHL dataset before performing any in-depth analysis. It helps uncover patterns, detect outliers, and identify data issues. Here's an exhaustive EDA strategy, with a focus on both why and how each step is performed, and how data cleaning integrates into this process.
Step 1: Understanding the Structure and Summary of the Dataset

Why:
You need to first understand the overall structure of the dataset—what variables are present, their
data types, and a general sense of the data. This helps you determine how to handle each variable
during analysis.
How:
View Data: In SPSS, go to Data View and Variable View. This allows you to inspect your dataset
structure and variable details.
Check Variable Types: Are the variables numeric, string, or categorical? Understanding variable
types informs how to treat them during analysis.
Generate Descriptive Statistics: Use `Analyze > Descriptive Statistics > Descriptives` to get an
overview of key metrics (e.g., mean, median, standard deviation) for numeric variables, and `Analyze
> Descriptive Statistics > Frequencies` for categorical data.
Objective: Understand basic distributions, spot any unusual values, and see if variables have a lot of
missing data.
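The menu steps above can also be run as SPSS syntax, which keeps the EDA reproducible from the start. A minimal sketch, assuming hypothetical variable names (`income`, `age`, `gender`, `region`) that should be replaced with the actual PHL variables:

```spss
* Descriptive statistics for numeric variables.
DESCRIPTIVES VARIABLES=income age
  /STATISTICS=MEAN STDDEV MIN MAX.

* Frequency tables for categorical variables.
FREQUENCIES VARIABLES=gender region
  /ORDER=ANALYSIS.
```

Pasting the menu-generated syntax into a syntax file like this is the easiest way to document each EDA step.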
Step 2: Checking for Missing Values

Why:
Missing values can bias your results or reduce the accuracy of your models. It’s essential to identify
and decide how to handle missing data early in the process.
How:
Frequencies for Categorical Data: Run `Analyze > Descriptive Statistics > Frequencies` and check
the "Missing" section to quantify missing values.
Descriptives for Numeric Data: Use `Analyze > Descriptive Statistics > Descriptives` to inspect
missing values for numeric variables.
Objective: Assess the extent of missing data and understand if it occurs randomly or systematically.
Actions:
Remove Missing Values: If the proportion of missing data is small and does not appear systematic,
you can remove those rows.
Imputation: For larger amounts of missing data, consider imputing values using the mean, median,
or more advanced techniques like regression imputation (available through `Transform > Replace
Missing Values`).
Why: Removing too much data can reduce the power of your analysis, while imputation preserves the
dataset size.
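As a sketch in SPSS syntax, again with hypothetical variable names, missing-value counts and a simple mean imputation look like this:

```spss
* Valid and missing counts per variable, without the full tables.
FREQUENCIES VARIABLES=income gender
  /FORMAT=NOTABLE.

* Mean imputation: creates income_1 with missing values replaced
* by the series mean (equivalent to Transform > Replace Missing Values).
RMV /income_1=SMEAN(income).
```

More defensible imputations (e.g., regression-based) are available under the same menu; mean imputation is shown only because it is the simplest case.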
Step 3: Identifying Outliers

Why:
Outliers can skew your analysis and lead to misleading conclusions, especially in parametric analyses
where assumptions like normality are important.
How:
Boxplots: Use `Graphs > Chart Builder` to create boxplots for numerical variables. Boxplots
highlight outliers as points outside the whiskers.
Descriptive Statistics: Use `Analyze > Descriptive Statistics > Explore` to get a detailed view of
variable ranges and identify extreme values.
Objective: Identify outliers and unusual values that may require further investigation or treatment.
Actions:
Examine the Cause: Determine if outliers are legitimate data points or data entry errors.
Correct: If the outliers are errors, correct them by looking at raw data or using domain knowledge.
Cap or Transform: For extreme but valid outliers, you can use techniques like capping (setting a limit
on the maximum/minimum values) or log transformations to reduce their impact.
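A hedged SPSS-syntax sketch of both steps, using a hypothetical `income` variable and an assumed cap of 100000:

```spss
* Boxplot plus the five most extreme values at each end.
EXAMINE VARIABLES=income
  /PLOT=BOXPLOT
  /STATISTICS=EXTREME.

* Cap valid but extreme values at an assumed upper limit.
COMPUTE income_capped = MIN(income, 100000).
EXECUTE.
```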
Step 4: Univariate Analysis (Individual Variable Exploration)

Why:
Understanding each variable individually gives insight into its distribution, central tendency, and
variability. This helps you decide which transformations or adjustments may be needed for later
analysis.
How:
Numerical Variables:
Use Histograms (`Graphs > Chart Builder`) to visualize the distribution of continuous variables.
Run Descriptive Statistics (`Analyze > Descriptive Statistics > Descriptives`) to calculate central
tendency measures (mean, median) and dispersion (standard deviation).
Objective: Identify the skewness, kurtosis, and whether the data is normally distributed.
Categorical Variables:
Run Frequencies to check the distribution of categorical variables. Are some categories
underrepresented?
Bar Charts can help visualize the proportions of categories.
Objective: Understand the balance of your categorical data and see if some categories need
regrouping.
Actions:
Transformations: For skewed data, consider transformations like log or square root to normalize
distributions.
Recoding Variables: Recode categorical variables with too many levels into fewer, meaningful
categories using `Transform > Recode`.
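For illustration, a log transformation and a recode might look like the following in SPSS syntax; the variable names and category cut-points are assumptions:

```spss
* Log-transform a right-skewed variable (the +1 guards against zeros).
COMPUTE log_income = LN(income + 1).

* Collapse a many-level education variable into three broad groups.
RECODE education (1 THRU 6=1) (7 THRU 12=2) (13 THRU HI=3) INTO edu_group.
VALUE LABELS edu_group 1 'Primary' 2 'Secondary' 3 'Tertiary'.
EXECUTE.
```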
Step 5: Bivariate Analysis (Exploring Relationships between Variables)

Why:
You need to explore relationships between variables to understand correlations, patterns, or
dependencies. This step helps in identifying potential predictors for more complex modeling later.
How:
Numerical Variables:
Use Scatterplots (`Graphs > Chart Builder`) to visualize relationships between two continuous
variables.
Run Correlation Matrix (`Analyze > Correlate > Bivariate`) to calculate Pearson or Spearman
correlations between pairs of continuous variables.
Objective: Identify strong correlations, trends, or associations between numerical variables.
Categorical Variables:
Use Crosstabulations (`Analyze > Descriptive Statistics > Crosstabs`) to explore the relationships
between two categorical variables.
Perform Chi-square tests to see if associations between categorical variables are statistically
significant.
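Both analyses can be scripted; a sketch with hypothetical variables:

```spss
* Pearson correlations between pairs of continuous variables.
CORRELATIONS
  /VARIABLES=income age
  /PRINT=TWOTAIL.

* Crosstabulation with a chi-square test of association.
CROSSTABS
  /TABLES=gender BY region
  /STATISTICS=CHISQ.
```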
Step 6: Dimensionality Reduction (Optional, for Large Datasets)

Why:
In cases where you have many variables, dimensionality reduction techniques help simplify the data
while preserving its structure. This step isn’t always necessary but can improve interpretability and
model performance in complex datasets.
How:
Principal Component Analysis (PCA): Run PCA (`Analyze > Dimension Reduction > Factor`) to
reduce the number of variables while preserving most of the data’s variability.
Objective: Simplify your dataset by finding key underlying components.
Actions:
Use PCA results to focus on key variables or components for further analysis.
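A minimal PCA sketch, assuming a contiguous block of hypothetical variables `v1` to `v10`:

```spss
* Principal components, keeping components with eigenvalues above 1.
FACTOR
  /VARIABLES=v1 TO v10
  /EXTRACTION=PC
  /CRITERIA=MINEIGEN(1)
  /PRINT=INITIAL EXTRACTION
  /ROTATION=NOROTATE.
```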
Step 7: Data Cleaning as Part of EDA

Throughout EDA, you should continuously clean the data based on insights gained from each step.
Key cleaning steps include:
Outlier Handling: Correct or adjust outliers based on boxplots and scatterplot inspections.
Missing Data Handling: Impute or remove missing data as uncovered in the descriptive statistics.
Data Transformation: Normalize skewed distributions or reclassify categories based on your
univariate and bivariate analyses.
By documenting these cleaning steps in SPSS syntax files, you ensure that your work is reproducible.
Step 8: Reporting Findings

Why:
Sharing your EDA findings ensures transparency and lays the foundation for subsequent analysis. It
also serves as a checkpoint for understanding whether any additional cleaning or data transformations
are necessary.
How:
Graphs and Tables: Use SPSS to export tables and graphs for inclusion in a report. Highlight key
statistics (e.g., distributions, outliers, missing data).
Interpret Results: Summarize what the data tells you about each variable and how relationships
between variables might affect future analysis.
Final Checklist for EDA:
1. Inspect the data structure and types.
2. Handle missing values.
3. Detect and treat outliers.
4. Explore individual variables (univariate analysis).
5. Investigate relationships between variables (bivariate analysis).
6. Optionally, reduce dimensionality if you have a large dataset.
7. Clean data as part of the process and document every change for reproducibility.
8. Summarize your findings in a clear report, focusing on key insights for future analysis.
This thorough EDA approach ensures that the dataset is clean, understandable, and ready for further
analysis.
B. Deeper Data Analysis

Once the dataset is clean and ready for analysis, the next steps depend on your objectives—whether
you're looking to summarize/tabulate data, test hypotheses, or build predictive models. Here's a
logical, strategic approach to continue from where your Exploratory Data Analysis (EDA) ended, step
by step:
Step 1: Define Your Research Questions and Hypotheses

Why:
Clearly defining your objectives helps focus the analysis on answering relevant questions. Knowing
whether you're conducting descriptive, inferential, or predictive analysis will dictate the methods used.
How:
Summarize Key Questions: Based on the survey objectives, list key questions you're looking to answer
(e.g., "What factors predict customer satisfaction?" or "Is there a relationship between income and
education level?").
Formulate Hypotheses: Convert your research questions into testable hypotheses. For example:
Null Hypothesis (H0): There is no relationship between income and education level.
Alternative Hypothesis (H1): There is a positive relationship between income and education level.
Actions:
Prioritize hypotheses based on their importance to the overall goal of your analysis. Focus on the
most actionable insights first.
Step 2: Descriptive Statistics (Summarize Key Variables)

Why:
Descriptive statistics provide a summary of the main features of your dataset, offering insights into
the distribution, central tendency, and variability of key variables. This gives context for interpreting
results and helps to communicate findings clearly.
How:
Numeric Variables: Use `Analyze > Descriptive Statistics > Descriptives` to generate means,
medians, standard deviations, and ranges.
Categorical Variables: Use `Analyze > Descriptive Statistics > Frequencies` to summarize the
proportions or counts of categories.
Visuals: Create histograms, bar charts, or pie charts to visually represent your findings.
Actions:
Present key descriptive stats in tables and graphs.
Highlight any interesting patterns or notable differences in the data, which will help when moving to
more detailed analyses.
Step 3: Inferential Statistics (Hypothesis Testing)

Why:
Inferential statistics allow you to draw conclusions from your sample data and make inferences about
the larger population. This step is essential if you're testing relationships between variables or
differences between groups.
How:
For Categorical Variables:
Chi-Square Test: Use this test (`Analyze > Descriptive Statistics > Crosstabs > Statistics`) to check
if there are statistically significant relationships between two categorical variables.
For Continuous Variables:
T-tests: Use `Analyze > Compare Means > Independent-Samples T Test` to compare the means
of two groups (e.g., income between males and females).
ANOVA: Use `Analyze > Compare Means > One-Way ANOVA` if you're comparing means across
more than two groups (e.g., income across different education levels).
For Relationships Involving Continuous Variables:
Correlation Analysis: Use `Analyze > Correlate > Bivariate` to find correlations between numeric
variables (e.g., income and age).
Linear Regression: Run regression (`Analyze > Regression > Linear`) to predict a continuous
dependent variable (e.g., predicting income based on education and experience).
Actions:
Evaluate p-values to determine the statistical significance of your results. If p < 0.05, reject the null
hypothesis.
Document key findings (e.g., significant relationships or differences) that can lead to actionable
insights.
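The two group-comparison tests, sketched in syntax with hypothetical variables (`gender` coded 1/2, `edu_group` with several levels):

```spss
* Independent-samples t-test: mean income by a binary group.
T-TEST GROUPS=gender(1 2)
  /VARIABLES=income.

* One-way ANOVA: mean income across education groups.
ONEWAY income BY edu_group
  /STATISTICS=DESCRIPTIVES.
```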
Step 4: Building Predictive Models (If Applicable)

Why:
Predictive modeling allows you to build a model that can predict outcomes based on independent
variables. It’s useful for applications like forecasting or classification.
How:
Linear Regression (for predicting continuous outcomes):
Use `Analyze > Regression > Linear` if you want to predict a continuous variable (e.g., predicting
income based on education and experience).
Evaluate model fit using R², adjusted R², and residual plots to check for assumptions like linearity,
homoscedasticity, and normality of residuals.
Logistic Regression (for binary outcomes):
Use `Analyze > Regression > Binary Logistic` if your dependent variable is categorical (e.g.,
predicting whether someone will purchase a product based on age and income).
Classification Trees (for categorical outcomes):
Use decision tree algorithms (`Analyze > Classify > Decision Tree`) for more complex
classifications or nonlinear relationships.
Actions:
Assess model performance using metrics such as R² for regression, or accuracy, precision, and recall
for classification.
Refine the model as needed by adjusting variables or testing different models.
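A syntax sketch of the two regression models, with the example variable names used above (assumptions, not actual PHL variables):

```spss
* Linear regression with a residual histogram for assumption checks.
REGRESSION
  /DEPENDENT income
  /METHOD=ENTER education experience
  /RESIDUALS HISTOGRAM(ZRESID).

* Binary logistic regression for a 0/1 outcome.
LOGISTIC REGRESSION VARIABLES purchase
  /METHOD=ENTER age income.
```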
Step 5: Model Validation

Why:
Validating your model ensures its reliability and generalizability to new data. It’s a critical step for
confirming that your results aren’t just due to chance.
How:
Split Data: Use a training/testing split (`Data > Select Cases`) to build your model on one part of
the data and test it on the other.
Cross-Validation: If your sample size is large enough, use cross-validation techniques to ensure the
model performs well across different subsets of data.
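One simple way to do the split in syntax is a random flag plus a filter; the 70/30 proportion and the seed are arbitrary choices:

```spss
* Flag roughly 70% of cases for training, reproducibly.
SET SEED=12345.
COMPUTE train = (UNIFORM(1) < 0.70).
EXECUTE.

* Fit the model on training cases only, then release the filter
* and evaluate on the held-out cases.
FILTER BY train.
* ... run the model here ...
FILTER OFF.
```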
Actions:
Document the performance of your models (e.g., accuracy, R²).
Check for potential overfitting (a model fit too tightly to the training data) by comparing performance
between training and test sets.
Step 6: Data Interpretation and Contextualization

Why:
It’s crucial to interpret the statistical findings within the context of your research questions. Statistical
significance alone does not equate to practical significance.
How:
Compare Results to Hypotheses: Review the results of your hypothesis tests. What do they tell
you about the relationships between variables?
Provide Context: Use external knowledge or domain expertise to frame your results within a broader
context. For instance, a statistically significant difference in income based on education level may have
implications for policy recommendations.
Actions:
Summarize the key takeaways from your analysis.
Make recommendations based on the findings, such as further investigation into specific patterns or
policy changes.
Step 7: Reporting and Visualization

Why:
Reporting your findings in a clear, professional manner ensures that stakeholders or other researchers
can understand and replicate your analysis.
How:
Generate Graphs and Charts: Use SPSS to create visually appealing charts (e.g., bar graphs, scatter
plots) that summarize key findings.
Export Reports: Export output tables and graphs from SPSS as Word documents, PDFs, or Excel
files.
Write a Detailed Report: structure your report along the lines of the format in Section D (Typical Reproducible Report Format).
Step 8: Ensuring Reproducibility

Why:
Ensuring reproducibility is key for the credibility of your analysis and so that future researchers can
replicate the study.
How:
Save Syntax Files: Ensure all analysis steps (data cleaning, transformations, and statistical tests) are
saved in SPSS syntax files.
Create a Final Data Version: Save a final, cleaned dataset that is fully documented.
Actions:
Write a reproducibility log or document that outlines each step, how it was performed, and why
certain choices were made (e.g., handling missing data).
Step 9: Review and Peer Feedback

Why:
Before finalizing your analysis, reviewing your work or obtaining feedback from peers can help identify potential errors or areas for improvement.
How:
Internal Review: Go over the report, syntax, and outputs to ensure everything is accurate and
consistent.
Peer Feedback: If possible, share your findings with a colleague or peer for review. They may spot
issues or offer insights that you missed.
Actions:
Revise any steps or findings based on feedback.
Finalize the report, ensuring that it is clear and replicable.
By following these strategic steps, you’ll move from a clean dataset to well-founded conclusions. This
approach ensures that your analysis is thorough, reproducible, and actionable, whether for internal
decision making or academic purposes.
C. Further Adjustments given the two rounds of the PHL Data Collection
Given that your study involves two rounds of data collection (panel study), with almost the same
variables collected in both rounds, you will need to adjust the strategy to account for the repeated
measurements. Below is a slightly modified approach that incorporates the panel design (the two
visits), from organizing the files to producing a reproducible report.
1. Organize the File Structure

Since we are dealing with two rounds of data, it is important to separate the data for each round while maintaining a cohesive structure. Here's how the file organization could be done:
2. Prepare Documentation
3. Data Cleaning for Each Visit Separately

Why:
Each round of data needs to be cleaned independently before merging to avoid introducing bias or
errors from either visit.
How:
Handle Visit1 and Visit2 Separately:
Import Visit1 data and clean it as per the standard EDA process:
Check for missing values, outliers, and variable distributions.
Correct data entry errors, recode variables as needed, and apply any imputations for missing data.
Do the same for Visit2 data.
Ensure Consistency:
Pay special attention to variable names and coding across both visits. If a variable is coded differently
between visits (e.g., `gender` coded as 1/2 in one visit but as M/F in the other), standardize the coding.
Actions:
Create clean datasets for both visits and store them separately in the `Processed Data` folder.
4. Merging the Data from Both Visits

Why:
Merging the two rounds allows for a longitudinal or panel analysis, where changes between visits can
be measured.
How:
Unique Identifier: Ensure that each household has a unique identifier that links Visit1 and Visit2 data.
Merge in SPSS:
Use the Merge Files option (`Data > Merge Files > Add Variables`) in SPSS to combine the two visits into a single wide-format file, with the household ID as the key variable for the merge. `Add Cases` can be used instead when you want the visits stacked in long format (one row per household per visit).
Actions:
After merging, check for any discrepancies (e.g., households missing from one visit) and document
how such cases are handled (e.g., excluding them or filling missing data).
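A sketch of the wide-format merge in syntax; the file paths and the ID name `hh_id` are assumptions, and both files must already be sorted by the ID:

```spss
MATCH FILES
  /FILE='Processed Data\visit1_clean.sav'
  /FILE='Processed Data\visit2_clean.sav'
  /BY hh_id.
EXECUTE.
SAVE OUTFILE='Processed Data\merged_panel.sav'.
```

When the two files share variable names, add a /RENAME on one of the /FILE subcommands (e.g., renaming `income` to `income_visit2`) so the values from both visits are kept.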
5. Exploratory Data Analysis (EDA) for Panel Data

EDA for panel data requires additional considerations, as you're analyzing both cross-sectional and longitudinal changes.
How:
Cross-Sectional EDA (For Each Visit):
For Visit1 and Visit2 individually, perform EDA (distribution of variables, missing data, outliers,
etc.) as previously described.
Longitudinal EDA:
Change Over Time: Create new variables representing changes between the visits (e.g., `income_change = income_visit2 - income_visit1`).
Paired Comparisons: Use paired-samples t-tests to compare differences in continuous variables across the two visits.
In SPSS: `Analyze > Compare Means > Paired-Samples T Test`.
Correlation Across Time: Analyze how variables from Visit1 correlate with Visit2 variables if
relevant.
Actions:
Summarize changes across time and investigate patterns where relevant (e.g., is household labor generally increasing or decreasing from Visit1 to Visit2?).
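The change variable and the paired test, sketched with assumed `income_visit1`/`income_visit2` names:

```spss
* Change score between the two visits.
COMPUTE income_change = income_visit2 - income_visit1.
EXECUTE.

* Paired-samples t-test comparing the visits directly.
T-TEST PAIRS=income_visit1 WITH income_visit2 (PAIRED).
```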
6. Inferential Statistics and Hypothesis Testing for Panel Data

Why:
Panel data allows you to look not only at cross-sectional differences but also at changes over time, adding a longitudinal dimension to your analysis.
How:
Paired Comparisons:
Use paired t-tests or nonparametric equivalents to assess differences between Visit1 and Visit2 for
continuous variables.
Longitudinal Models:
Consider using repeated measures ANOVA (`Analyze > General Linear Model > Repeated
Measures`) if you’re interested in understanding how outcomes change across the visits.
For more complex modeling, mixed-effects models (also known as hierarchical linear models) can be used to account for both within-household and between-household variation over time.
Regression Models:
For longitudinal regression, run models that include both Visit1 and Visit2 variables as predictors.
For instance, use `area_visit1` to predict `area_visit2` while controlling for other variables like
`education` or `crop yield` (if relevant).
Actions:
Document all hypotheses and model assumptions clearly.
Validate your models using appropriate tests and measures of fit (e.g., R² for regression).
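Two of these models sketched in syntax, using the wide-format variable names assumed earlier:

```spss
* Repeated-measures ANOVA on an outcome measured at both visits.
GLM income_visit1 income_visit2
  /WSFACTOR=visit 2 Polynomial
  /WSDESIGN=visit.

* Longitudinal regression: Visit 2 outcome from Visit 1 predictors.
REGRESSION
  /DEPENDENT area_visit2
  /METHOD=ENTER area_visit1 education.
```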
7. Data Interpretation and Contextualization for Panel Data

Why:
Understanding how key outcomes change between the two visits provides insights into trends and the
impact of any interventions or seasonal changes.
How:
Examine Changes: Focus on understanding how key indicators (e.g., household size, crop yield,
losses, labor use) evolve between the two visits.
Contextualize Findings: Link findings to the time periods and potential external factors (e.g., seasonal
weather changes, market conditions).
Actions:
Compare your results with the survey objectives and discuss the implications of changes observed
between visits.
Highlight key insights that may inform policy recommendations or further research.
8. Reporting and Visualization

Why:
Clear reporting is essential for communicating findings from both cross-sectional and longitudinal perspectives.
How:
Cross-Sectional vs. Longitudinal:
Present results separately for each visit, followed by a section comparing changes over time.
Use side-by-side charts or line graphs to visualize trends over the two visits.
Graphical Representation:
Use line graphs to show changes in continuous variables (e.g., crop yield) between the two visits.
Bar charts can help compare categorical variables across the visits.
Actions:
Write a report that includes:
1. Executive Summary: Key findings from both visits and notable trends over time.
2. Introduction: Background and objectives of the panel study.
3. Methods: Explanation of data collection, cleaning, and merging for panel data.
4. Results: Separate sections for each visit, followed by an analysis of longitudinal changes.
5. Conclusion and Recommendations: Based on the observed trends and findings.
9. Ensuring Reproducibility
Why:
Reproducibility is critical, especially in panel studies, where the structure of the data is more complex.
How:
Save Syntax Files: Save all steps in SPSS syntax files, from cleaning to analysis.
Document Merging Process: Clearly document how the data from both visits were merged and
any issues encountered.
Actions:
Write a detailed reproducibility log that outlines the process of cleaning and analyzing data for both
rounds.
By adapting the strategy to the panel data structure (if necessary), we should be able to track changes
over time, make meaningful comparisons between the visits, and ensure that the analysis remains
reproducible. This approach allows for a thorough understanding of the data both within each visit
and across the two rounds.
D. Typical Reproducible Report Format

4.1.2 Visit 2
Repeat the above for Visit 2 data.
4.2 Comparative Analysis Between Visits
Variable Changes Over Time
Use tables and graphs to illustrate changes in key variables.
Paired-Sample Tests
Present results from paired t-tests or nonparametric tests comparing visits.
4.3 Correlation Analysis
Within-Visit Correlations
Correlation matrices for variables within each visit.
Cross-Visit Correlations
Correlations between variables from Visit 1 and Visit 2.
4.4 Visualization of Data
Graphs and Charts
Include histograms, box plots, and line graphs to visualize data distributions and trends.
5. Inferential Statistics and Modeling
5.1 Hypotheses Formulation
List of Hypotheses
Clearly state each hypothesis tested, e.g., "H1: There is a significant increase in crop losses from
Visit 1 to Visit 2."
5.3.1 Cross-Sectional Analysis

Visit 1 Results
Present findings from statistical tests on Visit 1 data.
Visit 2 Results
Present findings from statistical tests on Visit 2 data.
5.3.2 Longitudinal Analysis
Repeated Measures Analysis
Results from repeated measures ANOVA or equivalent tests.
Regression Models
Present models predicting outcomes in Visit 2 based on Visit 1 variables.
5.4 Interpretation of Results
Significance Levels
Discuss which results are statistically significant.
Practical Implications
Explain the real-world meaning of the findings.
6. Discussion
6.1 Overview of Key Findings
Summary
Recap the most important results from the EDA and inferential statistics.
Ethical Considerations
Note any restrictions due to confidentiality or consent agreements.
References
Citations
List all sources referenced, formatted consistently according to a chosen citation style.
Appendices
Appendix A: Detailed File Structure
Data Organization
Provide a diagram or detailed description of the file structure used.
Appendix B: Survey Instruments
Questionnaires
Include copies of the questionnaires used for both visits.
Appendix C: Codebooks
Variable Definitions
Detailed descriptions of all variables, including coding schemes.
Appendix D: SPSS Syntax Files
Data Cleaning Syntax
Include or reference the syntax files used for data cleaning.
Analysis Syntax
Include or reference the syntax files used for statistical analyses.
Appendix E: Additional Tables and Figures
Supplementary Material
Any additional tables, graphs, or analyses not included in the main text.
Formatting and Presentation Tips
Consistency
Ensure all tables, figures, and references are formatted consistently.
Clarity
Use clear, concise language throughout the report.
Numbering
By adhering to this structure, the report will not only document the findings but also serve as a valuable
resource for others interested in PHL agricultural studies and panel data analysis.