Lab 2 - Basic Statistical Analysis

Uploaded by

078msdsa001.baikuntha

December 12, 2024

0.1 Imports
0.1.1 Step 1: Import Required Libraries
Import essential libraries for data manipulation, visualization, and statistics.
[1]: import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

0.1.2 Step 2: Load the Dataset

Load the student performance dataset from the specified CSV file.
[2]: file_path = "dataset/Student_performance_10k.csv"
data = pd.read_csv(file_path)

[3]: data.head()

[3]:    roll_no gender race_ethnicity parental_level_of_education  lunch \
     0   std-01   male        group D                some college    1.0
     1   std-02   male        group B                 high school    1.0
     2   std-03   male        group C             master's degree    1.0
     3   std-04   male        group D                some college    1.0
     4   std-05   male        group C                some college    0.0

        test_preparation_course  math_score  reading_score  writing_score \
     0                      1.0        89.0           38.0           85.0
     1                      0.0        65.0          100.0           67.0
     2                      0.0        10.0           99.0           97.0
     3                      1.0        22.0           51.0           41.0
     4                      1.0        26.0           58.0           64.0

        science_score  total_score  grade
     0           26.0        238.0      C
     1           96.0        328.0      A
     2           58.0        264.0      B
     3           84.0        198.0      D
     4           65.0        213.0      C

0.2 Exploratory Data Analysis (EDA)
0.2.1 Step 3: Basic Dataset Information
Display basic information about the dataset, such as column names and data types.
[4]: print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 roll_no 9999 non-null object
1 gender 9982 non-null object
2 race_ethnicity 9977 non-null object
3 parental_level_of_education 9978 non-null object
4 lunch 9976 non-null float64
5 test_preparation_course 9977 non-null float64
6 math_score 9976 non-null float64
7 reading_score 9975 non-null float64
8 writing_score 9976 non-null float64
9 science_score 9977 non-null float64
10 total_score 9981 non-null float64
11 grade 9997 non-null object
dtypes: float64(7), object(5)
memory usage: 937.6+ KB
None

0.2.2 Step 4: Statistical Summary

Use describe() to compute summary statistics for numerical columns.
[5]: print("\nDescriptive Statistics:")
print(data.describe())

Descriptive Statistics:
lunch test_preparation_course math_score reading_score \
count 9976.000000 9977.000000 9976.000000 9975.000000
mean 0.644246 0.388694 57.177125 70.125915
std 0.478765 0.487478 21.746777 19.026245
min 0.000000 0.000000 0.000000 17.000000
25% 0.000000 0.000000 41.000000 57.000000
50% 1.000000 0.000000 58.000000 71.000000
75% 1.000000 1.000000 73.000000 85.000000
max 1.000000 1.000000 100.000000 100.000000

writing_score science_score total_score
count 9976.000000 9977.000000 9981.000000
mean 71.415798 66.063045 264.740908
std 18.245360 19.324331 42.304858
min 10.000000 9.000000 89.000000
25% 59.000000 53.000000 237.000000
50% 72.500000 67.000000 268.000000
75% 85.000000 81.000000 294.000000
max 100.000000 100.000000 383.000000

0.2.3 Step 5: Check for Missing Values

Identify the total number of missing values in each column.
[6]: missing_values = data.isnull().sum()
print("\nMissing Values in Each Column:")
print(missing_values)

Missing Values in Each Column:

roll_no 1
gender 18
race_ethnicity 23
parental_level_of_education 22
lunch 24
test_preparation_course 23
math_score 24
reading_score 25
writing_score 24
science_score 23
total_score 19
grade 3
dtype: int64
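
Raw counts are easier to judge as a share of the rows. In the output above, the worst column (reading_score) is missing 25 of 10,000 values, which is 0.25%. The sketch below shows one way to compute that share per column. The `demo` frame is a hypothetical miniature stand-in for `data`, not the lab's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the lab's `data`.
demo = pd.DataFrame({
    "gender": ["male", None, "female", "male"],
    "math_score": [89.0, np.nan, 10.0, np.nan],
    "grade": ["C", "A", "B", "D"],
})

# isnull().mean() gives the fraction of missing entries in each column.
missing_pct = (demo.isnull().mean() * 100).round(2)

# Show the columns with the largest share of gaps first.
print(missing_pct.sort_values(ascending=False))
```

On the real dataset the same expression would be `data.isnull().mean() * 100`.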

0.2.4 Step 6: Handle Missing Values

Simple approach: Drop rows with missing values (not preferred)
[7]: # Uncomment the following block to drop rows with missing values.
# data = data.dropna()
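
Before deciding whether dropping is acceptable, it can help to count how many rows contain at least one gap, since dropna() removes every such row. This is a minimal sketch on a hypothetical `demo` frame, not the lab's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the lab's `data`.
demo = pd.DataFrame({
    "gender": ["male", None, "female", "male"],
    "math_score": [89.0, 65.0, np.nan, 22.0],
    "grade": ["C", "A", "B", "D"],
})

# A row is lost by dropna() if any of its columns is missing.
rows_with_gaps = int(demo.isnull().any(axis=1).sum())
share_lost = rows_with_gaps / len(demo)

print(f"dropna() would remove {rows_with_gaps} of {len(demo)} rows ({share_lost:.0%})")
```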

Better Approach: Fill missing numerical values with the mean and categorical values
with the mode.
[8]: numerical_cols = data.select_dtypes(include=[np.number]).columns
categorical_cols = data.select_dtypes(include=["object"]).columns

data[numerical_cols] = data[numerical_cols].fillna(data[numerical_cols].mean())
data[categorical_cols] = data[categorical_cols].fillna(data[categorical_cols].mode().iloc[0])
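
One caveat with a blanket mean/mode fill: lunch and test_preparation_course hold 0/1 values, so a mean fill inserts a fraction (the lunch mean in Step 4 is about 0.644), and a mode fill on roll_no copies an existing identifier. The sketch below is one possible alternative, not the lab's method. The column groupings and the use of the median for scores are assumptions made for illustration, and `demo` is a hypothetical stand-in for `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the lab's `data`.
demo = pd.DataFrame({
    "roll_no": ["std-01", "std-02", None, "std-04"],
    "lunch": [1.0, np.nan, 1.0, 0.0],
    "math_score": [89.0, 65.0, np.nan, 22.0],
    "grade": ["C", None, "C", "D"],
})

# Assumed groupings (illustrative, not taken from the lab).
binary_cols = ["lunch"]      # 0/1 indicators: fill with the mode
score_cols = ["math_score"]  # continuous scores: fill with the median
label_cols = ["grade"]       # categorical labels: fill with the mode

filled = demo.copy()
for col in binary_cols + label_cols:
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
for col in score_cols:
    filled[col] = filled[col].fillna(filled[col].median())

# roll_no is an identifier, so it is deliberately left unfilled here.
print(filled)
```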

Verify that no missing values remain.

[9]: print("\nMissing Values After Handling:")
print(data.isnull().sum())

Missing Values After Handling:
roll_no 0
gender 0
race_ethnicity 0
parental_level_of_education 0
lunch 0
test_preparation_course 0
math_score 0
reading_score 0
writing_score 0
science_score 0
total_score 0
grade 0
dtype: int64

0.3 Visualization: Distributions

0.3.1 Step 7: Distribution of Grades
Plot the distribution of grades to understand grade trends.
[10]: plt.figure(figsize=(4, 2))
sns.countplot(x='grade', data=data, palette='magma', hue="grade")
plt.title('Distribution of Grades')
plt.xlabel('Grade')
plt.ylabel('Count')
plt.show()
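
The bar heights can also be read off as numbers. This is a small sketch using value_counts() on a hypothetical `grades` series; on the lab's data the series would be `data['grade']`.

```python
import pandas as pd

# Hypothetical grade labels standing in for data['grade'].
grades = pd.Series(["C", "A", "B", "D", "C", "B", "C", "A"], name="grade")

# Absolute counts and percentage shares, both ordered by grade label.
counts = grades.value_counts().sort_index()
shares = (grades.value_counts(normalize=True).sort_index() * 100).round(1)

summary = pd.DataFrame({"count": counts, "percent": shares})
print(summary)
```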

0.3.2 Step 8: Individual Subject Score Distributions

Visualize the distribution of scores for each subject.

[11]: subjects = ['math_score', 'reading_score', 'writing_score', 'science_score']
for subject in subjects:
    plt.figure(figsize=(3, 2))
    ax = sns.histplot(data[subject], kde=True, bins=20)
    plt.title(f'Distribution of {subject.capitalize()}')
    plt.xlabel(subject.capitalize())
    plt.ylabel('Frequency')
    plt.show()
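
A histogram shows shape visually; sample skewness gives a rough numeric companion. Values near 0 suggest a symmetric distribution, and negative values suggest a longer left tail. The sketch below uses pandas' skew() on a hypothetical `demo` frame; on the lab's data the call would be `data[subjects].skew()`.

```python
import pandas as pd

# Hypothetical scores standing in for two of the lab's subject columns.
demo = pd.DataFrame({
    "math_score": [10.0, 20.0, 30.0, 40.0, 50.0],      # symmetric values
    "reading_score": [50.0, 90.0, 95.0, 98.0, 100.0],  # long left tail
})

# Sample skewness for each column.
skewness = demo.skew()
print(skewness)
```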

0.4 Probability & Statistics Questions
0.4.1 Step 10: Calculate Z-scores
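Example: Calculate the probability that a student's total score is above 300.

Note: This treats total_score as approximately normally distributed. The expression
(0.5 * (1 + math.erf(z / np.sqrt(2)))) is the standard normal cumulative distribution
function: the area under the curve from −∞ to z, where z is the z-score. Subtracting it
from 1 gives the probability of scoring above the value. You can also look the area up
in a Z-score table.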
[12]: def z_score(value):
    return (value - data['total_score'].mean()) / data['total_score'].std()

z = z_score(300)
probability_above_300 = 1 - (0.5 * (1 + math.erf(z / np.sqrt(2))))
print(f"Probability of scoring above 300: {probability_above_300 * 100:.2f}%")

Probability of scoring above 300: 20.21%
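
As a cross-check, the same tail probability can be obtained from Python's standard-library statistics.NormalDist. The sketch below plugs in the total_score mean and standard deviation printed in Step 4. Those figures were computed before the missing values were filled, so the result may differ slightly from the 20.21% above.

```python
from statistics import NormalDist

# Mean and standard deviation of total_score as printed by describe() in Step 4
# (computed before imputation, so only approximately equal to the values used above).
mean_total = 264.740908
std_total = 42.304858

# Normal model for total_score under the normality assumption.
total_model = NormalDist(mu=mean_total, sigma=std_total)

# P(score > 300) is one minus the cumulative probability at 300.
p_above_300 = 1 - total_model.cdf(300)
print(f"Probability of scoring above 300: {p_above_300 * 100:.2f}%")
```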

0.4.2 Step 11: Solve Statistical Problems
Example: What percentage of students score between 250 and 350?
[13]: z1 = z_score(250)
z2 = z_score(350)
probability_between = (0.5 * (1 + math.erf(z2 / np.sqrt(2)))) - (0.5 * (1 + math.erf(z1 / np.sqrt(2))))

print(f"Percentage of students scoring between 250 and 350: {probability_between * 100:.2f}%")

Percentage of students scoring between 250 and 350: 61.45%
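
The erf expression appears three times in the cells above, so it could be wrapped in small helpers. This is a sketch only: `normal_cdf` and `prob_between` are hypothetical names, and the mean and standard deviation are the pre-imputation figures printed in Step 4, so the result is close to, but not exactly, 61.45%.

```python
import math

def normal_cdf(z):
    """Standard normal CDF: area under the curve from -infinity to z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def prob_between(low, high, mean, std):
    """P(low < X < high) for X modelled as Normal(mean, std)."""
    z_low = (low - mean) / std
    z_high = (high - mean) / std
    return normal_cdf(z_high) - normal_cdf(z_low)

# total_score statistics as printed by describe() in Step 4.
p = prob_between(250, 350, mean=264.740908, std=42.304858)
print(f"Percentage of students scoring between 250 and 350: {p * 100:.2f}%")
```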
