EDA Unit IV
UNIT IV
EXPLORATORY DATA ANALYSIS
BIVARIATE ANALYSIS - Relationships between Two Variables - Percentage Tables
- Analysing Contingency Tables - Handling Several Batches - Scatterplots and Resistant
Lines.
Relationship Between Two Variables in Bivariate Analysis
Bivariate analysis examines the relationship between two variables to understand how one
variable influences or is associated with another. This analysis can be performed using various
statistical and graphical methods. Below are some key concepts and methods used in bivariate
analysis:
Key Concepts
1. Correlation: Measures the strength and direction of the linear relationship between two
numerical variables.
2. Causation: Indicates that changes in one variable directly cause changes in another
variable.
3. Association: Describes a relationship between two variables, which may or may not be
causal.
4. Independence: Two variables are independent when knowing the value of one gives no
information about the distribution of the other.
Methods for Analyzing Relationships Between Two Variables
1. Scatterplots:
o A graphical representation where each point represents an observation with its x-
coordinate corresponding to one variable and its y-coordinate corresponding to
the other variable.
o Example: Plotting height vs. weight to see if there's a visual pattern or trend.
2. Correlation Coefficient:
o Quantifies the degree to which two variables are linearly related.
o Pearson's correlation coefficient (r) ranges from -1 to 1.
r = 1: Perfect positive linear relationship.
r = -1: Perfect negative linear relationship.
Rajalakshmi Institute of Technology
(An Autonomous Institution), Affiliated to Anna University, Chennai
Department of Computer Science and Business Systems
CCS346-Exploratory Data Analysis
r = 0: No linear relationship.
o Example: Calculating the correlation coefficient between study hours and exam
scores.
3. Regression Analysis:
o Examines the relationship between a dependent variable and one or more
independent variables.
o Simple Linear Regression: Models the relationship between two variables by
fitting a linear equation to the observed data.
Formula: y = mx + c, where y is the dependent variable, x is the
independent variable, m is the slope, and c is the y-intercept.
o Example: Predicting house prices based on square footage.
4. Crosstabulation and Chi-Square Test:
o Crosstabulation (Contingency Table): Displays the joint frequency distribution of
two categorical variables.
o Chi-Square Test: Tests the independence of two categorical variables.
o Example: Analyzing the relationship between gender and preference for a
particular product.
5. T-Test for Comparing Means:
o Compares the means of two groups to determine if they are statistically different
from each other.
o Independent Samples T-Test: Used when comparing means from two different
groups.
o Paired Samples T-Test: Used when comparing means from the same group at
different times.
o Example: Comparing average test scores between two different classes.
6. ANOVA (Analysis of Variance):
o Compares means among three or more groups to see if at least one group mean is
different from the others.
o Example: Comparing average income levels across different education levels.
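As an illustration of the correlation coefficient and simple linear regression described above, the following is a minimal sketch using NumPy. The study-hours and exam-score values are hypothetical and chosen only for demonstration; np.corrcoef and np.polyfit are used here as one convenient way to obtain r, m, and c.

```python
import numpy as np

# Hypothetical data, for illustration only
study_hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)
exam_scores = np.array([52, 55, 61, 64, 70, 74], dtype=float)

# Pearson's r: off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(study_hours, exam_scores)[0, 1]

# Least-squares line y = m*x + c (a degree-1 polynomial fit)
m, c = np.polyfit(study_hours, exam_scores, 1)

print(f"Pearson r = {r:.3f}")
print(f"Fitted line: y = {m:.2f}x + {c:.2f}")
```

A value of r close to 1 together with a positive slope m indicates a strong positive linear relationship in this sample.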
Percentage Tables
Key Concepts
Frequency Table: A table that displays the count of occurrences for each combination of
categories in two categorical variables.
Percentage Table: A table that converts the counts from a frequency table into percentages,
making it easier to interpret the proportions.
Creating Percentage Tables
Percentage tables can be constructed in different ways depending on the context and the type of
analysis:
Row Percentage Table: Shows each cell as a percentage of its row total, so the percentages
in every row sum to 100%.
Column Percentage Table: Shows each cell as a percentage of its column total, so the
percentages in every column sum to 100%.
Overall Percentage Table: Shows the percentage of each cell relative to the total number of
observations.
Example
Let's consider a dataset that includes two categorical variables: Gender and Product Preference.
Dataset:
Gender    Product Preference
Male      Product A
Female    Product B
Male      Product A
Female    Product C
Male      Product B
Female    Product A
Male      Product C
Female    Product B
Step-by-Step Construction
Implementation
import pandas as pd
# Sample data
data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
        'Product Preference': ['Product A', 'Product B', 'Product A', 'Product C',
                               'Product B', 'Product A', 'Product C', 'Product B']}
df = pd.DataFrame(data)
# Frequency table
frequency_table = pd.crosstab(df['Gender'], df['Product Preference'])
print("Frequency Table:\n", frequency_table)
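The row, column, and overall percentage tables can be derived from the same data. The following is a minimal sketch that uses the normalize argument of pd.crosstab; the variable names row_pct, col_pct, and overall_pct are chosen here for illustration.

```python
import pandas as pd

# Same sample data as in the implementation above
data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
        'Product Preference': ['Product A', 'Product B', 'Product A', 'Product C',
                               'Product B', 'Product A', 'Product C', 'Product B']}
df = pd.DataFrame(data)

# Row percentages: each cell divided by its row (Gender) total
row_pct = pd.crosstab(df['Gender'], df['Product Preference'], normalize='index') * 100

# Column percentages: each cell divided by its column (Product) total
col_pct = pd.crosstab(df['Gender'], df['Product Preference'], normalize='columns') * 100

# Overall percentages: each cell divided by the grand total
overall_pct = pd.crosstab(df['Gender'], df['Product Preference'], normalize='all') * 100

print("Row Percentage Table:\n", row_pct.round(2))
print("Column Percentage Table:\n", col_pct.round(2))
print("Overall Percentage Table:\n", overall_pct.round(2))
```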
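Row Percentage Table Analysis
For Male respondents, 50% prefer Product A, while 25% each prefer Product B and Product C.
For Female respondents, 50% prefer Product B, while 25% each prefer Product A and
Product C.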
Column Percentage Table Analysis
For Product A, 66.67% of the respondents are Male, and 33.33% are Female.
For Product B, 33.33% of the respondents are Male, and 66.67% are Female.
For Product C, 50% of the respondents are Male, and 50% are Female.
Overall Percentage Table Analysis
Out of the total respondents, 25% are Male preferring Product A, and 12.5% are Female
preferring Product A.
12.5% are Male preferring Product B, and 25% are Female preferring Product B.
Both Male and Female respondents have an equal preference for Product C at 12.5%
each.
Insights
1. Gender Preferences:
o Male respondents have a higher preference for Product A.
o Female respondents have a higher preference for Product B.
2. Product Popularity:
o Product A is more popular among Male respondents, while Product B is more
popular among Female respondents.
o Product C has an equal preference among both genders.
3. Marketing Strategy:
o Companies can use these insights to tailor their marketing strategies. For instance,
Product A could be targeted more towards males, and Product B towards
females.
Contingency Tables
A contingency table (also known as a cross-tabulation or crosstab) is a table in matrix
format that displays the joint frequency distribution of two or more variables. It is particularly
useful for examining the relationship between two categorical variables. Each cell in a contingency table
shows the count or frequency of occurrences for specific combinations of categories of the two
variables.
Key Concepts
Rows: Represent the categories of one variable.
Columns: Represent the categories of another variable.
Cells: Represent the frequency or count of observations falling into the corresponding
row and column categories.
Creating a contingency table in SPSS (Statistical Package for the Social Sciences) is a
straightforward process. Here’s a step-by-step guide:
Step 1: Open SPSS and Load Your Data
1. Open SPSS: Start SPSS on your computer.
2. Load Data: Open your dataset by going to File > Open > Data, then browse and select
your data file.
Step 2: Access the Crosstabs Function
1. Navigate to Crosstabs: Go to Analyze > Descriptive Statistics > Crosstabs.
Step 3: Define Rows and Columns
1. Select Variables:
o In the Crosstabs dialog box, move the variable you want in the rows to the Rows
box.
o Move the variable you want in the columns to the Columns box.
Step 4: Choose Statistics and Display Options
1. Statistics:
o Click on the Statistics button if you want additional statistics like Chi-square, Phi,
and Cramer's V.
o Check the boxes for the statistics you want and click Continue.
2. Cells:
o Click on the Cells button to select which counts you want displayed (e.g.,
observed, expected, row percentages, column percentages).
o Check the appropriate boxes and click Continue.
o Compare these with the observed frequencies to see if there is a large discrepancy.
2. Chi-square Test:
o The Chi-square value is 1.667 with 1 degree of freedom.
o The p-value is 0.197.
3. Interpret the p-value:
o Typically, a p-value less than 0.05 indicates a significant association between the
variables.
o In this case, the p-value (0.197) is greater than 0.05, so the test does not detect a
statistically significant association between Gender and Preference.
Step 4: Additional Statistics (if applicable)
1. Phi and Cramer's V:
o These are measures of association for nominal data.
o A value close to 0 indicates little to no association, while a value closer to 1
indicates a strong association.
2. Interpret Residuals:
o Look at the residuals to understand the differences between observed and
expected frequencies.
o Standardized residuals with an absolute value greater than 1.96 are considered
significant at the 0.05 level.
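The quantities discussed above (expected frequencies, the Chi-square statistic, Cramer's V, and standardized residuals) can also be computed in Python. The following is a minimal sketch using scipy.stats.chi2_contingency. The 2x2 counts are hypothetical, not the original SPSS data; they were chosen so that the statistic and p-value agree with the values reported above (1.667 and 0.197).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 counts (rows: Gender, columns: Preference); not the original data
observed = np.array([[3, 2],
                     [1, 4]])

# correction=False gives the plain Pearson chi-square (no Yates continuity correction)
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

# Standardized (Pearson) residuals: (observed - expected) / sqrt(expected)
std_residuals = (observed - expected) / np.sqrt(expected)

# Cramer's V; for a 2x2 table it equals the absolute value of Phi
n = observed.sum()
cramers_v = np.sqrt(chi2 / (n * (min(observed.shape) - 1)))

print(f"Chi-square = {chi2:.3f}, df = {dof}, p-value = {p_value:.3f}")
print("Expected counts:\n", expected)
print("Standardized residuals:\n", std_residuals.round(2))
print(f"Cramer's V = {cramers_v:.3f}")
```

With counts this small, every expected frequency is below 5, so in practice an exact test would usually be preferred; the sketch only illustrates how the reported numbers are obtained.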
Handling Several Batches
Handling several batches of data in SPSS can involve tasks such as running the same analysis on
different subsets of data or looping through multiple datasets. Here's a guide on how to handle
multiple batches of data in SPSS:
Approach 1: Using SPLIT FILE
The SPLIT FILE command in SPSS allows you to perform analyses on subsets of data defined
by one or more grouping variables.
1. Load Your Data: Open your dataset in SPSS.
2. Split File:
python
END PROGRAM.
This script loads each dataset and runs a crosstabs analysis.
Approach 4: Using Macros
SPSS macros can automate repetitive tasks.
1. Define Macro:
o Open the Syntax Editor.
o Define a macro to process multiple datasets:
spss
END PROGRAM.
2. Execute Script: Run the script in the SPSS Syntax Editor.
By using these approaches, you can efficiently handle and analyze multiple batches of data in
SPSS.
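The idea behind SPLIT FILE, running the same analysis on each subset defined by a grouping variable, can also be sketched in Python with pandas. The example below is an illustration only: the Batch, Gender, and Preference columns and their values are hypothetical.

```python
import pandas as pd

# Hypothetical data: the 'Batch' column marks which subset each row belongs to
df = pd.DataFrame({
    'Batch': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'Preference': ['Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'Yes', 'No'],
})

# Run the same crosstab on every batch and keep the results by batch name
tables = {
    batch: pd.crosstab(group['Gender'], group['Preference'])
    for batch, group in df.groupby('Batch')
}

for batch, table in tables.items():
    print(f"Batch {batch}:\n{table}\n")
```

Storing the per-batch tables in a dictionary keeps the results available for later comparison across batches.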
Scatterplots and Resistant Lines in Exploratory Data Analysis (EDA)
Scatterplots
Definition: A scatterplot is a graphical representation used to display the relationship between
two numerical variables. Each point on the scatterplot represents an observation from the dataset,
with its position determined by the values of the two variables.
Key Features:
Axes: The x-axis represents the independent variable, and the y-axis represents the
dependent variable.
Points: Each point represents an observation, with coordinates given by the values of the
two variables.
Trend Identification: Scatterplots help identify trends, patterns, and possible
correlations between variables.
Example: Consider a dataset containing students' study hours and their corresponding test
scores. A scatterplot can show the relationship between study hours (x-axis) and test scores (y-
axis).
python
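import matplotlib.pyplot as plt

# Example data
study_hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
test_scores = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
# Draw the scatterplot and label both axes
plt.scatter(study_hours, test_scores)
plt.xlabel('Study Hours')
plt.ylabel('Test Scores')
plt.show()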
This scatterplot shows that as study hours increase, test scores tend to increase, indicating a
positive correlation.
Resistant Lines
Definition: A resistant line is a straight line fitted to data in a way that is less sensitive to
outliers than ordinary least squares regression, for example by using medians of groups of points
or a robust regression method. Such a line often gives a better picture of the central trend when
outliers are present.
Key Features:
Robustness: Resistant lines are not greatly influenced by outliers, making them suitable
for datasets with extreme values.
Central Trend: They provide a reliable measure of the central trend, especially when
data includes anomalies or non-normal distributions.
Example: Using Python's statsmodels library, you can create a resistant line (e.g., a robust linear
model).
python
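import numpy as np
import statsmodels.api as sm
# Example data (study_hours and test_scores come from the scatterplot example)
x = np.array(study_hours)
y = np.array(test_scores)
# Add an intercept column, then fit a robust linear model
# (statsmodels uses Huber's T norm by default)
X = sm.add_constant(x)
model = sm.RLM(y, X)
results = model.fit()
# Predicted values
predicted = results.predict(X)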
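Besides a robust linear model, a resistant line can be built directly from medians. The following is a minimal sketch of the three-group median line, one common variant of the resistant line in EDA. The function name resistant_line, the group-splitting rule, and the outlier in the data are assumptions made for illustration; the sketch assumes at least three points and no tied x values across group boundaries.

```python
import numpy as np

def resistant_line(x, y):
    """Three-group median line: returns (slope, intercept)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x, kind="stable")
    x, y = x[order], y[order]

    # Split into left/middle/right groups; a leftover point goes to the middle
    # (remainder 1), two leftover points go to the outer groups (remainder 2)
    n = len(x)
    k, r = divmod(n, 3)
    sizes = {0: (k, k, k), 1: (k, k + 1, k), 2: (k + 1, k, k + 1)}[r]
    left_end = sizes[0]
    right_start = sizes[0] + sizes[1]
    groups = [slice(0, left_end), slice(left_end, right_start), slice(right_start, n)]

    # Median x and median y of each group (the summary points)
    xm = np.array([np.median(x[g]) for g in groups])
    ym = np.array([np.median(y[g]) for g in groups])

    # Slope from the two outer summary points; intercept averaged over all three
    slope = (ym[2] - ym[0]) / (xm[2] - xm[0])
    intercept = np.mean(ym - slope * xm)
    return slope, intercept

# Study-hours data with one hypothetical outlier (score of 20 at 8 hours)
study_hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
test_scores = [50, 55, 60, 65, 70, 75, 80, 20, 90, 95]

slope, intercept = resistant_line(study_hours, test_scores)
ols_slope, ols_intercept = np.polyfit(study_hours, test_scores, 1)

print(f"Resistant line:     y = {slope:.2f}x + {intercept:.2f}")
print(f"Least squares line: y = {ols_slope:.2f}x + {ols_intercept:.2f}")
```

Because the group medians ignore the single extreme score, the resistant line follows the trend of the remaining points, while the least squares line is pulled toward the outlier.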