EDA Unit IV

Uploaded by

Srinithi

Rajalakshmi Institute of Technology

(An Autonomous Institution), Affiliated to Anna University, Chennai


Department of Computer Science and Business Systems
CCS346-Exploratory Data Analysis

UNIT IV
EXPLORATORY DATA ANALYSIS
BIVARIATE ANALYSIS - Relationships between Two Variables - Percentage Tables
- Analysing Contingency Tables - Handling Several Batches - Scatterplots and Resistant
Lines.
Relationship Between Two Variables in Bivariate Analysis
Bivariate analysis examines the relationship between two variables to understand how one
variable influences or is associated with another. This analysis can be performed using various
statistical and graphical methods. Below are some key concepts and methods used in bivariate
analysis:
Key Concepts
1. Correlation: Measures the strength and direction of the linear relationship between two
numerical variables.
2. Causation: Indicates that changes in one variable directly cause changes in another
variable.
3. Association: Describes a relationship between two variables, which may or may not be
causal.
4. Independence: Indicates that the two variables are unrelated, so knowing the value of
one gives no information about the value of the other.
Methods for Analyzing Relationships Between Two Variables
1. Scatterplots:
o A graphical representation where each point represents an observation with its x-
coordinate corresponding to one variable and its y-coordinate corresponding to
the other variable.
o Example: Plotting height vs. weight to see if there's a visual pattern or trend.
2. Correlation Coefficient:
o Quantifies the degree to which two variables are linearly related.
o Pearson's correlation coefficient (r) ranges from -1 to 1.
▪ r = 1: Perfect positive linear relationship.
▪ r = -1: Perfect negative linear relationship.
▪ r = 0: No linear relationship.
o Example: Calculating the correlation coefficient between study hours and exam
scores.
3. Regression Analysis:
o Examines the relationship between a dependent variable and one or more
independent variables.
o Simple Linear Regression: Models the relationship between two variables by
fitting a linear equation to the observed data.
▪ Formula: y = mx + c, where y is the dependent variable, x is the
independent variable, m is the slope, and c is the y-intercept.
o Example: Predicting house prices based on square footage.
4. Crosstabulation and Chi-Square Test:
o Crosstabulation (Contingency Table): Displays the joint frequency distribution of
two categorical variables.
o Chi-Square Test: Tests the independence of two categorical variables.
o Example: Analyzing the relationship between gender and preference for a
particular product.
5. T-Test for Comparing Means:
o Compares the means of two groups to determine if they are statistically different
from each other.
o Independent Samples T-Test: Used when comparing means from two different
groups.
o Paired Samples T-Test: Used when comparing means from the same group at
different times.
o Example: Comparing average test scores between two different classes.
6. ANOVA (Analysis of Variance):
o Compares means among three or more groups to see if at least one group mean is
different from the others.
o Example: Comparing average income levels across different education levels.
7. Contingency Tables and Percentage Tables:


o Display the distribution of two categorical variables and show the percentages to
understand the proportion of occurrences.
o Example: Examining the relationship between education level and employment
status.
8. Resistant Lines:
o A line that is not unduly influenced by outliers.
o Used in scatterplots to identify the trend while minimizing the effect of extreme
values.
o Example: Drawing a resistant line through a scatterplot of years of experience vs.
salary.
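As a minimal sketch of methods 2 and 3 above, the snippet below computes Pearson's r and a least-squares line y = mx + c with NumPy. The study-hours and exam-score values are hypothetical and serve only as an illustration.

```python
import numpy as np

# Hypothetical data: study hours and exam scores for eight students
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
scores = np.array([52, 55, 61, 64, 70, 73, 79, 85], dtype=float)

# Pearson's r is the off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(hours, scores)[0, 1]

# Simple linear regression y = m*x + c, fitted by least squares
# (np.polyfit with degree 1 returns the slope first, then the intercept)
m, c = np.polyfit(hours, scores, 1)

print(f"Pearson r = {r:.3f}")
print(f"Fitted line: y = {m:.2f}x + {c:.2f}")
```

A value of r close to 1 together with a positive slope m indicates a strong positive linear relationship in this made-up sample.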
Percentage Tables in Bivariate Analysis
Percentage tables (also known as proportion tables or relative frequency tables) are used
to understand the distribution and relationship between two categorical variables. They provide a
clear and concise way to visualize the proportion of observations in each category combination
of the two variables. These tables help in identifying patterns, associations, and potential
dependencies between the variables.

Key Concepts
Frequency Table: A table that displays the count of occurrences for each combination of
categories in two categorical variables.
Percentage Table: A table that converts the counts from a frequency table into percentages,
making it easier to interpret the proportions.
Creating Percentage Tables
Percentage tables can be constructed in different ways depending on the context and the type of
analysis:

Row Percentage Table: Expresses each cell as a percentage of its row total, so the
percentages in every row sum to 100%.
Column Percentage Table: Expresses each cell as a percentage of its column total, so the
percentages in every column sum to 100%.
Overall Percentage Table: Expresses each cell as a percentage of the total number of
observations.
Example
Let's consider a dataset that includes two categorical variables: Gender and Product Preference.

Dataset:

Gender    Product Preference
Male      Product A
Female    Product B
Male      Product A
Female    Product C
Male      Product B
Female    Product A
Male      Product C
Female    Product B

Step-by-Step Construction
Implementation
import pandas as pd

# Sample data: one row per respondent
data = {
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'Product Preference': ['Product A', 'Product B', 'Product A', 'Product C',
                           'Product B', 'Product A', 'Product C', 'Product B'],
}

df = pd.DataFrame(data)

# Frequency table: counts for each Gender / Product Preference combination
frequency_table = pd.crosstab(df['Gender'], df['Product Preference'])
print("Frequency Table:\n", frequency_table)

# Row percentage table: divide each row by its row total
row_percentage_table = frequency_table.div(frequency_table.sum(axis=1), axis=0) * 100
print("\nRow Percentage Table:\n", row_percentage_table)

# Column percentage table: divide each column by its column total
column_percentage_table = frequency_table.div(frequency_table.sum(axis=0), axis=1) * 100
print("\nColumn Percentage Table:\n", column_percentage_table)

# Overall percentage table: divide each cell by the grand total
overall_percentage_table = frequency_table / frequency_table.values.sum() * 100
print("\nOverall Percentage Table:\n", overall_percentage_table)
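The same three percentage tables can also be obtained directly from pd.crosstab through its normalize parameter. The sketch below is an alternative to the manual division and rebuilds the same sample data so that it runs on its own.

```python
import pandas as pd

# Same sample data as in the example dataset
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"],
    "Product Preference": ["Product A", "Product B", "Product A", "Product C",
                           "Product B", "Product A", "Product C", "Product B"],
})

# normalize="index"   -> each row sums to 1 (row percentages)
# normalize="columns" -> each column sums to 1 (column percentages)
# normalize="all"     -> all cells sum to 1 (overall percentages)
row_pct = pd.crosstab(df["Gender"], df["Product Preference"], normalize="index") * 100
col_pct = pd.crosstab(df["Gender"], df["Product Preference"], normalize="columns") * 100
overall_pct = pd.crosstab(df["Gender"], df["Product Preference"], normalize="all") * 100

print(row_pct.round(2))
print(col_pct.round(2))
print(overall_pct.round(2))
```

Using normalize keeps the intent (row, column, or overall) visible in a single argument instead of in the axis arithmetic.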
Analysis and Insights
Frequency Table Analysis
▪ Male respondents prefer Product A the most (2 out of 4), followed by an equal
preference for Product B and Product C (1 each).
▪ Female respondents prefer Product B the most (2 out of 4), followed by an equal
preference for Product A and Product C (1 each).
Row Percentage Table Analysis
▪ For Male respondents, 50% prefer Product A, while 25% each prefer Product B and
Product C.
▪ For Female respondents, 50% prefer Product B, while 25% each prefer Product A and
Product C.
Column Percentage Table Analysis
▪ For Product A, 66.67% of the respondents are Male, and 33.33% are Female.
▪ For Product B, 33.33% of the respondents are Male, and 66.67% are Female.
▪ For Product C, 50% of the respondents are Male, and 50% are Female.
Overall Percentage Table Analysis
▪ Out of the total respondents, 25% are Male preferring Product A, and 12.5% are Female
preferring Product A.
▪ 12.5% are Male preferring Product B, and 25% are Female preferring Product B.
▪ Male and Female respondents preferring Product C each make up 12.5% of the total.
Insights
1. Gender Preferences:
o Male respondents have a higher preference for Product A.
o Female respondents have a higher preference for Product B.
2. Product Popularity:
o Product A is more popular among Male respondents, while Product B is more
popular among Female respondents.
o Product C has an equal preference among both genders.
3. Marketing Strategy:
o Companies can use these insights to tailor their marketing strategies. For instance,
Product A could be targeted more towards males, and Product B towards
females.
Contingency Tables
A contingency table (also known as a cross-tabulation or crosstab) is a table in matrix
format that displays the joint frequency distribution of variables. It is particularly useful for
examining the relationship between two categorical variables. Each cell in a contingency table
shows the count or frequency of occurrences for a specific combination of categories of the two
variables.
Key Concepts
▪ Rows: Represent the categories of one variable.
▪ Columns: Represent the categories of another variable.
▪ Cells: Represent the frequency or count of observations falling into the corresponding
row and column categories.
Creating a contingency table in SPSS (Statistical Package for the Social Sciences) is a
straightforward process. Here’s a step-by-step guide:
Step 1: Open SPSS and Load Your Data
1. Open SPSS: Start SPSS on your computer.
2. Load Data: Open your dataset by going to File > Open > Data, then browse and select
your data file.
Step 2: Access the Crosstabs Function
1. Navigate to Crosstabs: Go to Analyze > Descriptive Statistics > Crosstabs.
Step 3: Define Rows and Columns
1. Select Variables:
o In the Crosstabs dialog box, move the variable you want in the rows to the Rows
box.
o Move the variable you want in the columns to the Columns box.
Step 4: Choose Statistics and Display Options
1. Statistics:
o Click on the Statistics button if you want additional statistics like Chi-square, Phi,
and Cramer's V.
o Check the boxes for the statistics you want and click Continue.
2. Cells:
o Click on the Cells button to select which counts you want displayed (e.g.,
observed, expected, row percentages, column percentages).
o Check the appropriate boxes and click Continue.
Step 5: Create the Table


1. Create Table: Click OK to generate the contingency table.
Step 6: View and Interpret the Output
1. Output Window: The output window will display your contingency table along with any
selected statistics.
2. Interpretation: Analyze the table to understand the relationship between the variables.
Example
Suppose you have a dataset with the variables Gender (Male, Female) and Preference (Yes, No).
To create a contingency table showing the relationship between Gender and Preference:
1. Select Gender for Rows: Move Gender to the Rows box.
2. Select Preference for Columns: Move Preference to the Columns box.
3. Select Statistics: Optionally, click Statistics and choose Chi-square if you want to test for
independence.
4. Select Cell Display: Click Cells and check Observed and Row percentages to display
both counts and row percentages.
5. Generate Table: Click OK.
Your output might look something like this:

This table shows the distribution of preferences by gender.

Analyzing Contingency Tables
Analyzing a contingency table involves examining the relationship between the variables and
assessing its statistical significance. Here’s a step-by-step guide on how to analyze a
contingency table in SPSS:
Step 1: Generate the Contingency Table


As previously described, use the Crosstabs function in SPSS to create your contingency table.
Ensure you select the appropriate statistics, such as Chi-square, to test for independence.
Step 2: Interpret the Table
1. Observed Frequencies: Look at the counts in each cell. These are the observed
frequencies.
2. Row and Column Percentages: Examine the row and column percentages to understand
the distribution of data across categories.
Step 3: Statistical Tests
1. Chi-square Test of Independence:
o The Chi-square test determines if there is a significant association between the
two categorical variables.
o In SPSS, if you checked the Chi-square option, you will see a Chi-square table in
the output.
o Look at the Pearson Chi-square value and its significance level (p-value).
Example Interpretation
Let's say you have a contingency table showing the relationship between Gender (Male, Female)
and Preference (Yes, No):

               Preference: Yes   Preference: No   Row Total
Male           30 (60%)          20 (40%)         50
Female         25 (50%)          25 (50%)         50
Column Total   55                45               100


Chi-square Test Output

Test                 Value   df   Asymptotic Significance (2-sided)
Pearson Chi-Square   1.010   1    .315

Steps to Analyze:
1. Compare Observed vs. Expected Frequencies:
o SPSS provides expected frequencies in the output. For this table, the expected counts
in each row are 27.5 (Yes) and 22.5 (No).
o Compare these with the observed frequencies to see if there is a large discrepancy.
2. Chi-square Test:
o The Chi-square value is 1.010 with 1 degree of freedom.
o The p-value is 0.315.
3. Interpret the p-value:
o Typically, a p-value less than 0.05 indicates a significant association between the
variables.
o In this case, the p-value (0.315) is greater than 0.05, suggesting there is no
significant association between Gender and Preference.
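Outside SPSS, the Pearson chi-square statistic for the 2 x 2 Gender by Preference counts shown in the table can be checked with SciPy. This is a minimal sketch: correction=False is passed so that the uncorrected Pearson statistic is returned (by default SciPy applies Yates' continuity correction to 2 x 2 tables).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the Gender x Preference table
# rows: Male, Female; columns: Yes, No
observed = np.array([[30, 20],
                     [25, 25]])

# correction=False gives the plain Pearson chi-square (no Yates correction)
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

print("Expected counts:\n", expected)
print(f"Chi-square = {chi2:.3f}, df = {dof}, p-value = {p_value:.3f}")

# Decision at the conventional 0.05 level
if p_value < 0.05:
    print("Significant association between the two variables")
else:
    print("No significant association between the two variables")
```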
Step 4: Additional Statistics (if applicable)
1. Phi and Cramer's V:
o These are measures of association for nominal data.
o A value close to 0 indicates little to no association, while a value closer to 1
indicates a strong association.
2. Interpret Residuals:
o Look at the residuals to understand the differences between observed and
expected frequencies.
o Standardized residuals with an absolute value greater than 1.96 are considered
significant at the 0.05 level.
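The additional statistics above can be sketched by hand with NumPy for the same 2 x 2 Gender by Preference counts. The formulas used here are the standard ones (Cramer's V = sqrt(chi2 / (n * (k - 1))), with k the smaller of the number of rows and columns); the variable names are illustrative only.

```python
import numpy as np

# Observed counts: rows Male, Female; columns Yes, No
observed = np.array([[30, 20],
                     [25, 25]], dtype=float)
n = observed.sum()
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)

# Expected counts under independence: (row total * column total) / n
expected = np.outer(row_totals, col_totals) / n

# Pearson chi-square statistic
chi2 = ((observed - expected) ** 2 / expected).sum()

# Cramer's V (for a 2 x 2 table this equals the absolute value of Phi)
cramers_v = np.sqrt(chi2 / (n * (min(observed.shape) - 1)))

# Standardized (Pearson) residuals: (O - E) / sqrt(E)
std_residuals = (observed - expected) / np.sqrt(expected)

# Adjusted standardized residuals, which are approximately N(0, 1)
row_prop = (row_totals / n)[:, None]
col_prop = (col_totals / n)[None, :]
adj_residuals = (observed - expected) / np.sqrt(expected * (1 - row_prop) * (1 - col_prop))

print(f"Cramer's V = {cramers_v:.3f}")
print("Standardized residuals:\n", std_residuals.round(3))
print("Adjusted residuals:\n", adj_residuals.round(3))
```

Here Cramer's V is close to 0 and no residual comes near 1.96 in absolute value, which agrees with a weak, non-significant association.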
Handling Several Batches
Handling several batches of data in SPSS can involve tasks such as running the same analysis on
different subsets of data or looping through multiple datasets. Here's a guide on how to handle
multiple batches of data in SPSS:
Approach 1: Using SPLIT FILE
The SPLIT FILE command in SPSS allows you to perform analyses on subsets of data defined
by one or more grouping variables.
1. Load Your Data: Open your dataset in SPSS.
2. Split File:
o Go to Data > Split File.


o Select "Compare groups" or "Organize output by groups" depending on your
analysis needs.
o Move the grouping variable(s) (e.g., batch number) into the Groups Based on box.
o Click OK.
3. Run Analyses: Perform the desired analyses (e.g., contingency tables). SPSS will
produce separate results for each group defined by the grouping variable.
Approach 2: Using DO REPEAT
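The DO REPEAT command applies the same set of transformation commands (such as COMPUTE or
RECODE) to several variables in turn. Procedures such as FREQUENCIES or CROSSTABS cannot be
placed inside DO REPEAT, so they are run after the loop.
1. Syntax Editor:
o Open the Syntax Editor (File > New > Syntax).
2. Write DO REPEAT Syntax:
spss

* Recode 99 (a hypothetical missing-value code) to system-missing in each variable.
DO REPEAT v = var1 var2 var3.
RECODE v (99=SYSMIS).
END REPEAT.
EXECUTE.
FREQUENCIES VARIABLES=var1 var2 var3.
This example recodes var1, var2, and var3 in one loop and then runs the FREQUENCIES command
on all three variables.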
Approach 3: Looping with Python
SPSS includes a Python integration, allowing more advanced automation for handling multiple
datasets.
1. Enable Python:
o Go to Edit > Options.
o In the File Locations tab, ensure that the Python plugin is enabled.
2. Write Python Script:
o Open the Syntax Editor.
o Write a script to load and process multiple datasets:
python

BEGIN PROGRAM Python.
import spss

# Data files to process
datasets = ['data1.sav', 'data2.sav', 'data3.sav']

for dataset in datasets:
    # Load the dataset, then run the crosstabs analysis on it
    spss.Submit(f'GET FILE="{dataset}".')
    spss.Submit('CROSSTABS /TABLES=var1 BY var2 /STATISTICS=CHISQ.')
END PROGRAM.
This script loads each dataset and runs a crosstabs analysis.
Approach 4: Using Macros
SPSS macros can automate repetitive tasks.
1. Define Macro:
o Open the Syntax Editor.
o Define a macro to process multiple datasets:
spss

DEFINE !runAnalysis (dataset = !TOKENS(1) /var1 = !TOKENS(1) /var2 = !TOKENS(1))
GET FILE=!dataset.
CROSSTABS /TABLES=!var1 BY !var2 /STATISTICS=CHISQ.
!ENDDEFINE.
!runAnalysis dataset="data1.sav" var1=gender var2=preference.
!runAnalysis dataset="data2.sav" var1=gender var2=preference.
Example: Handling Multiple Batches


Suppose you have three datasets (batch1.sav, batch2.sav, batch3.sav) and you want to create
contingency tables for Gender and Preference in each batch.
Using SPLIT FILE
1. Load Combined Data: Combine all batches into a single file with a Batch variable
indicating the batch.
2. Split File: Split by Batch.
o Data > Split File.
o Select "Organize output by groups".
o Move Batch to Groups Based on.
o Click OK.
3. Run Analysis: Perform the crosstabs analysis.
Using Python Script
1. Write Python Script:
python

BEGIN PROGRAM Python.
import spss

# One data file per batch
datasets = ['batch1.sav', 'batch2.sav', 'batch3.sav']

for dataset in datasets:
    # Load the batch, then build its Gender by Preference contingency table
    spss.Submit(f'GET FILE="{dataset}".')
    spss.Submit('CROSSTABS /TABLES=Gender BY Preference /STATISTICS=CHISQ.')
END PROGRAM.
2. Execute Script: Run the script in the SPSS Syntax Editor.
By using these approaches, you can efficiently handle and analyze multiple batches of data in
SPSS.
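For readers working in Python rather than SPSS, the idea behind SPLIT FILE (one contingency table per batch) can be sketched with pandas groupby. The records below, including the Batch labels, are hypothetical and only illustrate the pattern.

```python
import pandas as pd

# Hypothetical combined data: a Batch column identifies the batch of each record
records = pd.DataFrame({
    "Batch": ["batch1"] * 4 + ["batch2"] * 4,
    "Gender": ["Male", "Male", "Female", "Female",
               "Male", "Female", "Male", "Female"],
    "Preference": ["Yes", "No", "Yes", "Yes",
                   "No", "No", "Yes", "Yes"],
})

# Build one Gender x Preference contingency table per batch
tables = {
    batch: pd.crosstab(group["Gender"], group["Preference"])
    for batch, group in records.groupby("Batch")
}

for batch, table in tables.items():
    print(f"\nContingency table for {batch}:")
    print(table)
```

Keeping the per-batch tables in a dictionary makes it easy to compare batches side by side or to run the same follow-up analysis on each one.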
Scatterplots and Resistant Lines in Exploratory Data Analysis (EDA)
Scatterplots
Definition: A scatterplot is a graphical representation used to display the relationship between
two numerical variables. Each point on the scatterplot represents an observation from the dataset,
with its position determined by the values of the two variables.
Key Features:
▪ Axes: The x-axis represents the independent variable, and the y-axis represents the
dependent variable.
▪ Points: Each point represents an observation, with coordinates given by the values of the
two variables.
▪ Trend Identification: Scatterplots help identify trends, patterns, and possible
correlations between variables.
Example: Consider a dataset containing students' study hours and their corresponding test
scores. A scatterplot can show the relationship between study hours (x-axis) and test scores (y-
axis).
python

import matplotlib.pyplot as plt

# Example data
study_hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
test_scores = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95]

# Creating the scatterplot


plt.scatter(study_hours, test_scores)
plt.title('Scatterplot of Study Hours vs Test Scores')
plt.xlabel('Study Hours')
plt.ylabel('Test Scores')
plt.show()
This scatterplot shows that as study hours increase, test scores tend to increase, indicating a
positive correlation.
Resistant Lines
Definition: A resistant line is a line fitted to data in a way that is less sensitive to outliers
than ordinary least squares regression. In classical EDA it is usually built from the medians of
groups of points; robust regression methods serve the same purpose. When outliers are present,
such a line often represents the central trend better than a least squares line.
Key Features:
▪ Robustness: Resistant lines are not greatly influenced by outliers, making them suitable
for datasets with extreme values.
▪ Central Trend: They provide a reliable measure of the central trend, especially when
data includes anomalies or non-normal distributions.
Example: Using Python's statsmodels library, you can create a resistant line (e.g., a robust linear
model).
python

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Example data (study_hours and test_scores from the scatterplot example)
x = np.array(study_hours)
y = np.array(test_scores)

# Add a constant term for the intercept
X = sm.add_constant(x)

# Fit a robust linear model using Huber's T norm
model = sm.RLM(y, X, M=sm.robust.norms.HuberT())
results = model.fit()

# Predicted values
predicted = results.predict(X)

# Plot the scatterplot and the resistant line
plt.scatter(study_hours, test_scores, label='Data')
plt.plot(study_hours, predicted, color='red', label='Resistant Line')
plt.title('Scatterplot with Resistant Line')
plt.xlabel('Study Hours')
plt.ylabel('Test Scores')
plt.legend()
plt.show()
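In this example, the red line represents the resistant line, showing the central trend of the data.
Note that this example data lies exactly on a straight line and contains no outliers, so the
benefit of a resistant fit only becomes visible once extreme values are present.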
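To see the resistance to outliers in action, the sketch below fits a three-group median line, a common textbook construction of the resistant line, and compares it with an ordinary least squares fit. It assumes a single pass without the iterative refinement some texts add, the helper name resistant_line is illustrative, and the last test score has been replaced by a hypothetical outlier (20 instead of 95).

```python
import numpy as np

def resistant_line(x, y):
    """Single-pass three-group median line; returns (slope, intercept)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)
    x, y = x[order], y[order]

    # Split the ordered points into left, middle and right groups
    left, middle, right = np.array_split(np.arange(len(x)), 3)

    # Summary point (median x, median y) of each group
    xl, yl = np.median(x[left]), np.median(y[left])
    xm, ym = np.median(x[middle]), np.median(y[middle])
    xr, yr = np.median(x[right]), np.median(y[right])

    # Slope from the outer summary points; intercept averaged over all three
    slope = (yr - yl) / (xr - xl)
    intercept = np.mean([yl - slope * xl, ym - slope * xm, yr - slope * xr])
    return slope, intercept

study_hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Same scores as before, except the last one is a hypothetical outlier
test_scores = [50, 55, 60, 65, 70, 75, 80, 85, 90, 20]

res_slope, res_intercept = resistant_line(study_hours, test_scores)
ols_slope, ols_intercept = np.polyfit(study_hours, test_scores, 1)

print(f"Resistant line:     y = {res_slope:.2f}x + {res_intercept:.2f}")
print(f"Least squares line: y = {ols_slope:.2f}x + {ols_intercept:.2f}")
```

Without the outlier both slopes would be 5. With it, the least squares slope is pulled far below 5, while the median-based slope stays much closer, because medians change little when a single point is extreme.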
Importance in EDA
Scatterplots:
▪ Visual Exploration: Help in visually exploring the relationship between two variables.
▪ Pattern Recognition: Aid in recognizing patterns, trends, and potential correlations.
▪ Outlier Detection: Facilitate the detection of outliers or anomalies in the data.
Resistant Lines:
▪ Robust Analysis: Provide a robust method for trend analysis, reducing the impact of
outliers.
▪ Accurate Representation: Offer a more accurate representation of the underlying trend,
especially in datasets with non-normal distributions or anomalies.
Unit IV completed
