Descriptive Analytics2.Ipynb - Colab

The document is a Jupyter notebook focused on descriptive analytics using a dataset containing household income and expense information. It includes data loading, statistical analysis, outlier detection, and various visualizations using Python libraries like pandas and matplotlib. Key insights include average income, expenses, and the distribution of family members and earning members.

Uploaded by

lsivakum

8/23/24, 11:48 AM descriptive analytics.ipynb - Colab

Descriptive analytics

1. statistics module
2. pandas

pandas

load dataset

import pandas as pd
data = pd.read_csv('/content/sample_data/Inc_Exp_Data.csv')
print('Dataset dimension:',data.shape)
print('Columns :\n',data.columns)

Dataset dimension: (50, 7)
Columns :
 Index(['Mthly_HH_Income', 'Mthly_HH_Expense', 'No_of_Fly_Members',
       'Emi_or_Rent_Amt', 'Annual_HH_Income', 'Highest_Qualified_Member',
       'No_of_Earning_Members'],
      dtype='object')

data.head()

   Mthly_HH_Income  Mthly_HH_Expense  No_of_Fly_Members  Emi_or_Rent_Amt  Annual_HH_I
0             5000              8000                  3             2000
1             6000              7000                  2             3000
2            10000              4500                  2                0  1
3            10000              2000                  1                0
4            12500             12000                  2             3000  1

The dataset contains 50 rows and 7 columns. Six features are numeric and the Highest_Qualified_Member feature is a string. There are no null values.

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Mthly_HH_Income           50 non-null     int64
 1   Mthly_HH_Expense          50 non-null     int64
 2   No_of_Fly_Members         50 non-null     int64
 3   Emi_or_Rent_Amt           50 non-null     int64
 4   Annual_HH_Income          50 non-null     int64
 5   Highest_Qualified_Member  50 non-null     object
 6   No_of_Earning_Members     50 non-null     int64
dtypes: int64(6), object(1)
memory usage: 2.9+ KB
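
The structure reported by `data.info()` can also be checked programmatically. The sketch below is illustrative only: `sample` is a stand-in frame built from the four rows (index 46 to 49) that this notebook prints in full further down, not the whole CSV, and the variable names are assumptions.

```python
import pandas as pd

# Stand-in data: the four rows (index 46-49) printed in full later in this notebook.
sample = pd.DataFrame(
    {
        'Mthly_HH_Income': [98000, 100000, 100000, 100000],
        'Mthly_HH_Expense': [25000, 30000, 50000, 40000],
        'No_of_Fly_Members': [5, 6, 4, 6],
        'Emi_or_Rent_Amt': [0, 0, 20000, 10000],
        'Annual_HH_Income': [1152480, 1404000, 1032000, 1320000],
        'Highest_Qualified_Member': ['Professional', 'Graduate',
                                     'Professional', 'Post-Graduate'],
        'No_of_Earning_Members': [2, 3, 2, 1],
    },
    index=[46, 47, 48, 49],
)

# Split the columns by dtype and count missing values across the whole frame.
numeric_cols = sample.select_dtypes(include='number').columns.tolist()
text_cols = sample.select_dtypes(exclude='number').columns.tolist()
null_total = int(sample.isna().sum().sum())

print('numeric columns:', len(numeric_cols))
print('non-numeric columns:', text_cols)
print('null values:', null_total)
```

On the full dataset the same three lines should agree with the `dtypes: int64(6), object(1)` summary above.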

data.describe()

       Mthly_HH_Income  Mthly_HH_Expense  No_of_Fly_Members  Emi_or_Rent_Amt  Annual_
count        50.000000         50.000000          50.000000        50.000000      5.0
mean      41558.000000      18818.000000           4.060000      3060.000000      4.9
std       26097.908979      12090.216824           1.517382      6241.434948      3.2
min        5000.000000       2000.000000           1.000000         0.000000      6.4
25%       23550.000000      10000.000000           3.000000         0.000000      2.5
50%       35000.000000      15500.000000           4.000000         0.000000      4.4
75%       50375.000000      25000.000000           5.000000      3500.000000      5.9
max      100000.000000      50000.000000           7.000000     35000.000000      1.4

visualize summary statistics

import matplotlib.pyplot as plt

bp = data[['Mthly_HH_Income', 'Mthly_HH_Expense','No_of_Fly_Members','No_of_Earning_Mem
plt.show()

Monthly income, monthly expenses and No_of_Earning_Members have outliers.

bp=plt.boxplot(data['Mthly_HH_Income'])
print(type(bp))
print(bp.keys())

<class 'dict'>
dict_keys(['whiskers', 'caps', 'boxes', 'medians', 'fliers', 'means'])
{'whiskers': [<matplotlib.lines.Line2D object at 0x7fb109d0e5c0>, <matplotlib.lines.L

remove outliers

# Each cap is a Line2D whose y-data is the whisker end (lower cap first, then upper).
mn, mx = [item.get_ydata()[1] for item in bp['caps']]
print('max:', mx,'min:', mn)

max: 90000 min: 5000
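
The cap values can be cross-checked without a plot. By default `plt.boxplot` uses `whis=1.5`, so each whisker ends at the most extreme data point within 1.5 * IQR of the quartiles. With the quartiles from `data.describe()` above (25% = 23550, 75% = 50375) the upper fence is 50375 + 1.5 * 26825 = 90612.5, which is consistent with the reported upper cap of 90000. The sketch below applies the same rule to a small hypothetical series; the values and the helper name `whisker_caps` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def whisker_caps(values, whis=1.5):
    """Return (lower_cap, upper_cap, outliers) under the whis * IQR rule."""
    s = pd.Series(values)
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - whis * iqr, q3 + whis * iqr
    # Caps are the extreme data points that still lie inside the fences.
    inside = s[(s >= lower_fence) & (s <= upper_fence)]
    outliers = s[(s < lower_fence) | (s > upper_fence)]
    return inside.min(), inside.max(), outliers.tolist()

# Hypothetical incomes, chosen only to illustrate the rule.
incomes = [5000, 20000, 25000, 30000, 35000, 40000, 45000, 50000, 90000, 200000]
lower_cap, upper_cap, outliers = whisker_caps(incomes)
print('min:', lower_cap, 'max:', upper_cap, 'outliers:', outliers)
```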

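# Drop, in place, every row whose monthly income lies above the upper whisker cap.
data.drop(data[data['Mthly_HH_Income']>mx].index, inplace=True)
# Redraw the box plot on the remaining rows.
plt.boxplot(data['Mthly_HH_Income'])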

{'whiskers': [<matplotlib.lines.Line2D at 0x7fb109ec6020>,
<matplotlib.lines.Line2D at 0x7fb109ec62c0>],
'caps': [<matplotlib.lines.Line2D at 0x7fb109ec6560>,
<matplotlib.lines.Line2D at 0x7fb109ec6800>],
'boxes': [<matplotlib.lines.Line2D at 0x7fb109ec5d80>],
'medians': [<matplotlib.lines.Line2D at 0x7fb109ec6aa0>],
'fliers': [<matplotlib.lines.Line2D at 0x7fb109ec6d40>],
'means': []}

print('outliers in monthly income:\n',data[data['Mthly_HH_Income']>90000])

outliers in monthly income:
     Mthly_HH_Income  Mthly_HH_Expense  No_of_Fly_Members  Emi_or_Rent_Amt  \
46            98000             25000                  5                0
47           100000             30000                  6                0
48           100000             50000                  4            20000
49           100000             40000                  6            10000

    Annual_HH_Income Highest_Qualified_Member  No_of_Earning_Members
46           1152480             Professional                      2
47           1404000                 Graduate                      3
48           1032000             Professional                      2
49           1320000            Post-Graduate                      1
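
`data.drop(..., inplace=True)` modifies `data` itself, so a later filter such as `data['Mthly_HH_Income']>90000` only returns rows if it runs before the drop. One way to avoid depending on cell order is to keep the outliers and the cleaned rows in separate frames. A minimal sketch, assuming a small stand-in frame (the first two incomes are hypothetical; `is_outlier`, `outliers` and `cleaned` are illustrative names):

```python
import pandas as pd

# Stand-in frame; only the last two incomes appear in the notebook output.
data = pd.DataFrame({'Mthly_HH_Income': [35000, 90000, 98000, 100000]})
mx = 90000  # upper whisker cap reported above

is_outlier = data['Mthly_HH_Income'] > mx
outliers = data[is_outlier]         # rows above the cap, kept for inspection
cleaned = data[~is_outlier].copy()  # independent copy without the outliers

print('outliers:\n', outliers)
print('rows kept:', len(cleaned), 'of', len(data))
```

Because `data` is left untouched, both views stay available regardless of the order in which cells are executed.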

print('Average & SD on monthly income: ',data['Mthly_HH_Income'].mean(), round(data['Mthl
print('Average & SD on monthly expenses: ',data['Mthly_HH_Expense'].mean(),round(data['Mt
print('Average & SD on annual income: ',data['Annual_HH_Income'].mean(),round(data['Annu
print('Average earning members: ',data['No_of_Earning_Members'].mean())

Average & SD on monthly income: 41558.0 26097.91
Average & SD on monthly expenses: 18818.0 12090.22
Average & SD on annual income: 490019.04 320135.79
Average earning members: 1.46
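
Note that `Series.std()` in pandas returns the sample standard deviation (`ddof=1`) by default, which is why the values match the `std` row of `data.describe()`. A minimal sketch of the difference, using the four monthly incomes listed in the outlier output above as stand-in data (variable names are illustrative):

```python
import pandas as pd

# Monthly incomes of rows 46-49 as printed in the outlier listing.
income = pd.Series([98000, 100000, 100000, 100000])

mean_income = income.mean()
sd_sample = round(income.std(), 2)            # divides by n - 1 (pandas default)
sd_population = round(income.std(ddof=0), 2)  # divides by n

print('Average & SD (sample):', mean_income, sd_sample)
print('SD (population):', sd_population)
```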

print('max & min monthly income: ',data['Mthly_HH_Income'].max(),data['Mthly_HH_Income'].
print('max & min monthly expenses: ',data['Mthly_HH_Expense'].max(),data['Mthly_HH_Expens
print('max & min earning members: ',data['No_of_Earning_Members'].max(),data['No_of_Earni
print('max & min annual income: ',data['Annual_HH_Income'].max(),data['Annual_HH_Income']

max & min monthly income: 100000 5000
max & min monthly expenses: 50000 2000
max & min earning members: 4 1
max & min annual income: 1404000 64200

categorical fields

print('Most occurring value in highest qualified : ',data['Highest_Qualified_Member'].mode
print('Most occurring value in no. of earning members: ',data['No_of_Earning_Members'].mod

Most occurring value in highest qualified : 0 Graduate
Name: Highest_Qualified_Member, dtype: object
Most occurring value in no. of earning members: 0 1
Name: No_of_Earning_Members, dtype: int64
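
`Series.mode()` returns a Series (there can be ties), which is why the output above shows an index `0` and a `dtype` line. Taking the first element gives a plain value. A minimal sketch on the four rows from the outlier listing, used here as stand-in data (variable names are illustrative):

```python
import pandas as pd

# Stand-in columns taken from rows 46-49 of the outlier listing.
qualification = pd.Series(['Professional', 'Graduate', 'Professional', 'Post-Graduate'])
earning_members = pd.Series([2, 3, 2, 1])

# mode() yields a Series of the most frequent value(s); .iloc[0] picks the first.
top_qualification = qualification.mode().iloc[0]
top_earning = int(earning_members.mode().iloc[0])

print('Most occurring qualification:', top_qualification)
print('Most occurring no. of earning members:', top_earning)
```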

visualizations

import matplotlib.pyplot as plt

earn_members = data['No_of_Earning_Members'].unique()
earn_members

array([1, 2, 3, 4])

# Histogram of the number of earning members per family.
earn_members = data['No_of_Earning_Members'].unique()
plt.hist(data['No_of_Earning_Members'])
plt.title('Number of earning members in the families')
plt.xlabel('No. of earning members')
plt.ylabel('Count of families')
plt.xticks(earn_members)  # one tick per distinct value
#plt.yticks(range(1,len(data),3))
plt.show()
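
`plt.hist` uses 10 equal-width bins by default, which leaves empty gaps between the integer values 1 to 4. Bin edges placed halfway between the integers give one bar per value. The sketch below checks such edges with `np.histogram`, rebuilding the column from the counts reported by `value_counts()` later in this notebook (33, 12, 4, 1); passing the same `edges` as `bins=` to `plt.hist` is one possible way to use them. The names `earning` and `edges` are illustrative.

```python
import numpy as np

# Rebuild the column from the value counts printed in the pie chart section.
earning = np.repeat([1, 2, 3, 4], [33, 12, 4, 1])

# Edges at 0.5, 1.5, ..., 4.5 so each integer sits in the middle of its own bin.
edges = np.arange(0.5, 5.5, 1.0)
counts, _ = np.histogram(earning, bins=edges)

print(dict(zip([1, 2, 3, 4], counts.tolist())))
```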

family_members = data['No_of_Fly_Members'].unique()
plt.hist(data['No_of_Fly_Members'])
plt.title('Number of Family Members in the families')
plt.xlabel('No. of family Members')
plt.ylabel('Count of families')
plt.xticks(family_members )
plt.show()

bar chart

x = range(len(data))
idx = [i+0.4 for i in x]
plt.bar(x,data['No_of_Fly_Members'], width=0.4, label='Family members')
plt.bar(idx,data['No_of_Earning_Members'], width=0.4,label='Earning members')
plt.title('No. of family members & earning members in each family')
plt.legend()
plt.xticks(range(0,51,5))
plt.ylabel('Count')
plt.show()

plt.plot(data['Mthly_HH_Income'], label='Income')
plt.plot(data['Mthly_HH_Expense'], label='Expenditure')
plt.legend()
plt.title('Family Income vs Expenditure')
plt.ylabel('Amount')
plt.show()

x = data['No_of_Earning_Members'].value_counts()
print(x)
plt.pie(x,labels=x.index, autopct='%.0f%%' )
plt.title('Proportion of No. of Earning members in the families ')
plt.show()

1 33
2 12
3 4
4 1
Name: No_of_Earning_Members, dtype: int64
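
The percentages drawn by `autopct='%.0f%%'` can be reproduced numerically with `value_counts(normalize=True)`. A minimal sketch, rebuilding the column from the counts printed above (`earning` and `share_pct` are illustrative names):

```python
import pandas as pd

# Rebuild the column from the printed counts: 33 ones, 12 twos, 4 threes, 1 four.
earning = pd.Series([1] * 33 + [2] * 12 + [3] * 4 + [4] * 1,
                    name='No_of_Earning_Members')

# normalize=True returns fractions of the total; scale to whole percentages.
share_pct = earning.value_counts(normalize=True).mul(100).round(0)
print(share_pct)
```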

x = data['Highest_Qualified_Member'].value_counts()
print(x)
plt.pie(x,labels=x.index,autopct='%.0f%%' )
plt.title('Proportion of highest qualified in the families ')
plt.show()

Graduate 19
Under-Graduate 10
Professional 10
Post-Graduate 6
Illiterate 5
Name: Highest_Qualified_Member, dtype: int64

https://colab.research.google.com/drive/1xk9I7-lnL2a6XcZmwsJ8BlsWA7Drd1do#scrollTo=MSxC8SClqcVL&printMode=true