Lab 2 - Basic Statistical Analysis

December 12, 2024

0.1 Imports
0.1.1 Step 1: Import Required Libraries
Import essential libraries for data manipulation, visualization, and statistics.
[1]: import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

0.1.2 Step 2: Load the Dataset


Load the student performance dataset from the specified CSV file.
[2]: file_path = "dataset/Student_performance_10k.csv"
data = pd.read_csv(file_path)

[3]: data.head()

[3]:   roll_no gender race_ethnicity parental_level_of_education  lunch \
     0  std-01   male        group D                some college    1.0
     1  std-02   male        group B                 high school    1.0
     2  std-03   male        group C             master's degree    1.0
     3  std-04   male        group D                some college    1.0
     4  std-05   male        group C                some college    0.0

        test_preparation_course  math_score  reading_score  writing_score \
     0                      1.0        89.0           38.0           85.0
     1                      0.0        65.0          100.0           67.0
     2                      0.0        10.0           99.0           97.0
     3                      1.0        22.0           51.0           41.0
     4                      1.0        26.0           58.0           64.0

        science_score  total_score grade
     0           26.0        238.0     C
     1           96.0        328.0     A
     2           58.0        264.0     B
     3           84.0        198.0     D
     4           65.0        213.0     C

0.2 Exploratory Data Analysis (EDA)
0.2.1 Step 3: Basic Dataset Information
Display basic information about the dataset, such as column names and data types.
[4]: print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 roll_no 9999 non-null object
1 gender 9982 non-null object
2 race_ethnicity 9977 non-null object
3 parental_level_of_education 9978 non-null object
4 lunch 9976 non-null float64
5 test_preparation_course 9977 non-null float64
6 math_score 9976 non-null float64
7 reading_score 9975 non-null float64
8 writing_score 9976 non-null float64
9 science_score 9977 non-null float64
10 total_score 9981 non-null float64
11 grade 9997 non-null object
dtypes: float64(7), object(5)
memory usage: 937.6+ KB
None

0.2.2 Step 4: Statistical Summary


Use describe() to compute summary statistics for numerical columns.
[5]: print("\nDescriptive Statistics:")
print(data.describe())

Descriptive Statistics:
             lunch  test_preparation_course   math_score  reading_score \
count  9976.000000              9977.000000  9976.000000    9975.000000
mean      0.644246                 0.388694    57.177125      70.125915
std       0.478765                 0.487478    21.746777      19.026245
min       0.000000                 0.000000     0.000000      17.000000
25%       0.000000                 0.000000    41.000000      57.000000
50%       1.000000                 0.000000    58.000000      71.000000
75%       1.000000                 1.000000    73.000000      85.000000
max       1.000000                 1.000000   100.000000     100.000000

       writing_score  science_score  total_score
count    9976.000000    9977.000000  9981.000000
mean       71.415798      66.063045   264.740908
std        18.245360      19.324331    42.304858
min        10.000000       9.000000    89.000000
25%        59.000000      53.000000   237.000000
50%        72.500000      67.000000   268.000000
75%        85.000000      81.000000   294.000000
max       100.000000     100.000000   383.000000

0.2.3 Step 5: Check for Missing Values


Identify the total number of missing values in each column.
[6]: missing_values = data.isnull().sum()
print("\nMissing Values in Each Column:")
print(missing_values)

Missing Values in Each Column:


roll_no 1
gender 18
race_ethnicity 23
parental_level_of_education 22
lunch 24
test_preparation_course 23
math_score 24
reading_score 25
writing_score 24
science_score 23
total_score 19
grade 3
dtype: int64
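
To put these counts in proportion, the sketch below converts them to percentages of the 10,000 rows. The counts are copied from the output above and the row count from the data.info() output; the names missing_counts, N_ROWS and missing_pct are illustrative and not part of the lab.

```python
# Missing-value counts copied from the isnull().sum() output above.
missing_counts = {
    "roll_no": 1,
    "gender": 18,
    "race_ethnicity": 23,
    "parental_level_of_education": 22,
    "lunch": 24,
    "test_preparation_course": 23,
    "math_score": 24,
    "reading_score": 25,
    "writing_score": 24,
    "science_score": 23,
    "total_score": 19,
    "grade": 3,
}
N_ROWS = 10_000  # "RangeIndex: 10000 entries" in the data.info() output

# Percentage of rows with a missing value, per column.
missing_pct = {col: 100 * n / N_ROWS for col, n in missing_counts.items()}

# Column with the largest share of missing values.
worst_col = max(missing_pct, key=missing_pct.get)
total_missing = sum(missing_counts.values())

for col, pct in missing_pct.items():
    print(f"{col:<28} {pct:.2f}%")
print(f"Most affected column: {worst_col} ({missing_pct[worst_col]:.2f}%)")
print(f"Total missing cells: {total_missing}")
```

With the DataFrame loaded, data.isnull().mean() * 100 should give the same percentages directly. Every column is missing well under 1% of its values.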

0.2.4 Step 6: Handle Missing Values


Simple approach: Drop rows with missing values (not preferred)
[7]: # Uncomment the following block to drop rows with missing values.
# data = data.dropna()

Better Approach: Fill missing numerical values with the mean and categorical values
with the mode.
[8]: numerical_cols = data.select_dtypes(include=[np.number]).columns
categorical_cols = data.select_dtypes(include=["object"]).columns

data[numerical_cols] = data[numerical_cols].fillna(data[numerical_cols].mean())
data[categorical_cols] = data[categorical_cols].fillna(
    data[categorical_cols].mode().iloc[0]
)
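
The following minimal sketch applies the same mean/mode imputation to a tiny toy DataFrame so the effect is easy to inspect. The toy values and the names toy, num_cols and cat_cols are hypothetical and not taken from the lab's CSV. It selects the categorical columns with exclude=[np.number] rather than include=["object"], a small substitution intended to keep the sketch independent of how pandas types string columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the lab's dataset).
toy = pd.DataFrame({
    "math_score": [50.0, np.nan, 70.0],
    "grade": ["B", "B", None],
})

num_cols = toy.select_dtypes(include=[np.number]).columns
cat_cols = toy.select_dtypes(exclude=[np.number]).columns

# Sample standard deviation before imputation (NaN is skipped by default).
std_before = toy["math_score"].std()

# Numerical columns: fill with the column mean.
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].mean())
# Categorical columns: mode() returns a DataFrame; its first row holds
# the most frequent value of each column.
toy[cat_cols] = toy[cat_cols].fillna(toy[cat_cols].mode().iloc[0])

std_after = toy["math_score"].std()

print(toy)
print(f"std before: {std_before:.2f}, std after: {std_after:.2f}")
```

Mean imputation leaves the column mean unchanged but pulls the standard deviation down, because every filled value sits exactly at the mean. With as few missing values as in this dataset the effect should be small, but it does feed into any later calculation that uses std().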

Verify that the missing values have been handled.
[9]: print("\nMissing Values After Handling:")
print(data.isnull().sum())

Missing Values After Handling:
roll_no 0
gender 0
race_ethnicity 0
parental_level_of_education 0
lunch 0
test_preparation_course 0
math_score 0
reading_score 0
writing_score 0
science_score 0
total_score 0
grade 0
dtype: int64

0.3 Visualization: Distributions


0.3.1 Step 7: Distribution of Grades
Plot the distribution of grades to understand grade trends.
[10]: plt.figure(figsize=(4, 2))
sns.countplot(x='grade', data=data, palette='magma', hue="grade")
plt.title('Distribution of Grades')
plt.xlabel('Grade')
plt.ylabel('Count')
plt.show()

0.3.2 Step 8: Individual Subject Score Distributions


Visualize the distribution of scores for each subject.

[11]: subjects = ['math_score', 'reading_score', 'writing_score', 'science_score']
for subject in subjects:
    plt.figure(figsize=(3, 2))
    ax = sns.histplot(data[subject], kde=True, bins=20)
    plt.title(f'Distribution of {subject.capitalize()}')
    plt.xlabel(subject.capitalize())
    plt.ylabel('Frequency')
    plt.show()

0.4 Probability & Statistics Questions
0.4.1 Step 10: Calculate Z-scores
Example: Calculate the probability that a student's total score is above 300.

Note: The expression 0.5 * (1 + math.erf(z / np.sqrt(2))) is the cumulative
distribution function of the standard normal distribution, i.e. the area under the
curve from −∞ to z, where z is the z-score. Using it here assumes that total scores
are approximately normally distributed. You can also look the value up in a Z-score
table.
[12]: def z_score(value):
    return (value - data['total_score'].mean()) / data['total_score'].std()

z = z_score(300)
probability_above_300 = 1 - (0.5 * (1 + math.erf(z / np.sqrt(2))))
print(f"Probability of scoring above 300: {probability_above_300 * 100:.2f}%")

Probability of scoring above 300: 20.21%
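
As a cross-check, the sketch below repeats the calculation with the standard library's statistics.NormalDist and compares it with the erf formula. The mean and standard deviation are copied from the describe() output, which was computed before imputation, so the result may differ slightly from the figure printed above. The names MEAN_TOTAL, STD_TOTAL and normal_cdf are illustrative and not part of the lab.

```python
import math
from statistics import NormalDist

# Summary statistics for total_score, copied from the describe() output
# (computed before missing values were filled in).
MEAN_TOTAL = 264.740908
STD_TOTAL = 42.304858


def normal_cdf(z):
    """Standard normal CDF: area under the curve from -infinity to z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Route 1: standardize to a z-score, then apply the erf-based CDF.
z = (300 - MEAN_TOTAL) / STD_TOTAL
p_erf = 1 - normal_cdf(z)

# Route 2: let NormalDist handle the standardization.
p_nd = 1 - NormalDist(MEAN_TOTAL, STD_TOTAL).cdf(300)

print(f"z-score for 300: {z:.4f}")
print(f"P(total > 300) via erf:        {p_erf * 100:.2f}%")
print(f"P(total > 300) via NormalDist: {p_nd * 100:.2f}%")
```

Both routes evaluate the same normal model, so they should agree to floating-point precision.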

0.4.2 Step 11: Solve Statistical Problems
Example: What percentage of students score between 250 and 350?
[13]: z1 = z_score(250)
z2 = z_score(350)
probability_between = (0.5 * (1 + math.erf(z2 / np.sqrt(2)))) - (0.5 * (1 + math.erf(z1 / np.sqrt(2))))

print(f"Percentage of students scoring between 250 and 350: {probability_between * 100:.2f}%")

Percentage of students scoring between 250 and 350: 61.45%
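
The percentage above comes from a normal model, not from counting students. One way to sanity-check such a figure is to compare it with the observed share of scores in the range. The sketch below does this on a small hypothetical list of scores; sample_scores and both helper functions are illustrative and not part of the lab.

```python
from statistics import NormalDist, mean, stdev


def share_between_empirical(scores, low, high):
    """Fraction of scores that actually fall in [low, high]."""
    inside = [s for s in scores if low <= s <= high]
    return len(inside) / len(scores)


def share_between_normal(scores, low, high):
    """Fraction predicted by a normal model fitted to the scores."""
    model = NormalDist(mean(scores), stdev(scores))
    return model.cdf(high) - model.cdf(low)


# Hypothetical total scores (not from the lab's dataset).
sample_scores = [180, 215, 240, 255, 262, 270, 275, 288, 301, 352]

empirical = share_between_empirical(sample_scores, 250, 350)
modelled = share_between_normal(sample_scores, 250, 350)

print(f"Empirical share in [250, 350]: {empirical * 100:.2f}%")
print(f"Normal-model share:            {modelled * 100:.2f}%")
```

In the notebook itself, data['total_score'].between(250, 350).mean() should give the empirical share. A large gap between the two numbers would suggest that the normal assumption fits the scores poorly.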
