
Dsbda Ass3


Data Science and Big Data Analytics Laboratory
Third Year 2019 Course
Assignment No 3

Prof.K.B.Sadafale
Assistant Professor
Computer Dept. GCOEAR, Avasari
Descriptive Statistics - Measures of Central Tendency and variability

Perform the following operations on any open source dataset (e.g., data.csv):
1. Provide summary statistics (mean, median, minimum,
maximum, standard deviation) for a dataset (age, income etc.)
with numeric variables grouped by one of the qualitative
(categorical) variables. For example, if your categorical variable is
age groups and quantitative variable is income, then provide
summary statistics of income grouped by the age groups. Create
a list that contains a numeric value for each response to the
categorical variable.
2. Write a Python program to display some basic statistical
details like percentile, mean, standard deviation etc. of the
species ‘Iris-setosa’, ‘Iris-versicolor’ and ‘Iris-virginica’ of the
iris.csv dataset.
Provide the codes with outputs and explain everything that you
do in this step.
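Task 1 can be sketched with pandas `groupby` and `agg`. The table below is a small made-up stand-in for data.csv; the column names `age_group` and `income` and all values are assumptions for illustration only. The final list is one possible reading of "a numeric value for each response to the categorical variable" (here, the mean income per group).

```python
import pandas as pd

# Hypothetical stand-in for data.csv: one categorical and one numeric column.
df = pd.DataFrame({
    "age_group": ["young", "young", "middle", "middle", "senior", "senior"],
    "income": [20000, 30000, 50000, 70000, 40000, 60000],
})

# Summary statistics of income, grouped by the categorical variable.
# Groups appear in sorted key order: middle, senior, young.
summary = df.groupby("age_group")["income"].agg(
    ["mean", "median", "min", "max", "std"]
)
print(summary)

# A list holding one numeric value (the mean income) per category.
mean_by_group = summary["mean"].tolist()
print(mean_by_group)
```

With a real dataset, only the `pd.read_csv(...)` call and the two column names would need to change.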
Descriptive Statistics
Create a DataFrame as follows:
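The table used on the original slide is not reproduced here, so the following is only a stand-in: a minimal sketch that builds a small DataFrame with one string column and two numeric columns. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data; the slide's own table is not available.
records = {
    "Name": ["Tom", "James", "Ricky", "Vin"],
    "Age": [25, 26, 25, 23],
    "Rating": [4.0, 3.0, 4.5, 3.5],
}
df = pd.DataFrame(records)
print(df)
```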
Output
sum()
✔ Returns the sum of the values for the requested axis.
✔ By default, axis is index (axis=0)
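print(df.sum())

Each column is summed individually (strings are concatenated).
With axis=1 the values in each row are added instead; numeric_only=True skips the string columns, which cannot be added to numbers.
print(df.sum(axis=1, numeric_only=True))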
mean()
✔ Returns the average value of each numeric column.
print(df.mean(numeric_only=True))
std()
✔ Returns the standard deviation of the numerical columns.
print(df.std(numeric_only=True))
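The three functions above can be tried together on a small frame. This is a sketch under assumptions: the DataFrame is hypothetical, and `numeric_only=True` is passed wherever a string column would otherwise get in the way in recent pandas versions.

```python
import pandas as pd

# Hypothetical frame with one string column and two numeric columns.
df = pd.DataFrame({
    "Name": ["Tom", "James", "Ricky", "Vin"],
    "Age": [25, 26, 25, 23],
    "Rating": [4.0, 3.0, 4.5, 3.5],
})

# Column-wise sum (axis=0): numeric columns are added, strings concatenated.
col_sums = df.sum()

# Row-wise sum (axis=1), restricted to the numeric columns.
row_sums = df.sum(axis=1, numeric_only=True)

# Mean and sample standard deviation (ddof=1) of the numeric columns.
means = df.mean(numeric_only=True)
stds = df.std(numeric_only=True)

print(col_sums, row_sums, means, stds, sep="\n\n")
```

Note that `std()` uses the sample formula (dividing by n - 1) by default.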
Functions & Description
✔ Let us now understand the functions under Descriptive
Statistics in Python Pandas.
✔ The following table lists the important functions.
✔ Note: Since a DataFrame is a heterogeneous data
structure, generic operations don’t work with all functions.
Summarizing Data

print(df.describe())
print(df.describe(include=['object']))

print(df.describe(include='all'))
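The three `describe()` calls differ only in which columns they summarize. A minimal sketch on a hypothetical frame follows; the `Name` column is created with `dtype="object"` explicitly so that `include=['object']` selects it regardless of how a given pandas version infers string columns.

```python
import pandas as pd

# Hypothetical frame; Name is forced to object dtype on purpose.
df = pd.DataFrame({
    "Name": pd.Series(["Tom", "James", "Ricky", "Vin"], dtype="object"),
    "Age": [25, 26, 25, 23],
    "Rating": [4.0, 3.0, 4.5, 3.5],
})

# Default: numeric columns only (count, mean, std, min, quartiles, max).
numeric_summary = df.describe()

# Object columns only (count, unique, top, freq).
object_summary = df.describe(include=["object"])

# Every column; statistics that do not apply to a column show as NaN.
full_summary = df.describe(include="all")

print(numeric_summary, object_summary, full_summary, sep="\n\n")
```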
Example
Read csv “mtcars”

Output
# Get the mean of each column

mtcars.mean()
Output
# Get the mean of each row
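The row-wise mean uses the same method with `axis=1`. A minimal sketch, assuming a three-row, four-column frame in the shape of mtcars as a stand-in for the full file (the slides do not show how the CSV is loaded, so the frame is built inline here):

```python
import pandas as pd

# Small stand-in for the mtcars table, indexed by car model.
mtcars = pd.DataFrame(
    {
        "mpg": [21.0, 22.8, 21.4],
        "cyl": [6, 4, 6],
        "disp": [160.0, 108.0, 258.0],
        "hp": [110, 93, 110],
    },
    index=["Mazda RX4", "Datsun 710", "Hornet 4 Drive"],
)

col_means = mtcars.mean()        # axis=0: one mean per column
row_means = mtcars.mean(axis=1)  # axis=1: one mean per row (per car)

print(col_means)
print(row_means)
```

A row mean mixes variables with different units, so it is mainly useful as a demonstration of the `axis` argument.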
Median
✔ The median of a distribution is the value where 50% of the
data lies below it and 50% lies above it.
✔ In essence, the median splits the data in half.
✔ The median is also known as the 50th percentile, since
50% of the observations are found below it.
✔ You can get the median using the df.median() function:

# Get the median of each column

mtcars.median()
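The link between the median and the 50th percentile can be checked directly. The sketch below uses a made-up series with one large value, which also shows that the median is much less affected by an extreme value than the mean.

```python
import pandas as pd

# Made-up values; 100 is a deliberate outlier.
values = pd.Series([1, 3, 5, 7, 100])

median = values.median()     # middle value of the sorted data
mean = values.mean()         # pulled upward by the outlier
p50 = values.quantile(0.5)   # 50th percentile, same as the median

print(median, mean, p50)
```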
Mode
✔ The mode of a variable is simply the value that appears
most frequently.
✔ Unlike mean and median, you can take the mode of a
categorical variable, and it is possible to have multiple
modes.
✔ Find the mode with df.mode():

mtcars.mode()

✔ Columns with multiple modes (multiple values with the
same count) return multiple values as the mode.
✔ Columns with fewer modes than the longest column are
padded with NaN. A column in which no value appears more
than once returns every value, since all are tied.
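How `df.mode()` lays out its result is easiest to see on a tiny example. The frame below is made up: column `a` has one mode, `b` has two, and `c` has no repeated value. In pandas, the result has as many rows as the column with the most modes, and shorter columns are filled with NaN.

```python
import pandas as pd

# Made-up frame with one mode, two modes, and no repeated value.
df = pd.DataFrame({
    "a": [1, 1, 2, 3, 4],
    "b": [1, 1, 2, 2, 3],
    "c": [1, 2, 3, 4, 5],
})

modes = df.mode()
print(modes)

# The mode also works on a categorical (string) variable.
colour_mode = pd.Series(["red", "blue", "red"]).mode()
print(colour_mode.tolist())
```

Calling `.dropna()` on a single column of the result gives just that column's modes.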
Example
✔ Write a Python program to display some basic statistical
details like percentile, mean, standard deviation etc. of the
species ‘Iris-setosa’, ‘Iris-versicolor’ and ‘Iris-virginica’
of the iris.csv dataset.
Iris.csv
data.describe()
• Problem Statement: Write a Python program to display some
basic statistical details like percentile, mean, standard
deviation etc. of the species ‘Iris-setosa’, ‘Iris-versicolor’ and
‘Iris-virginica’ of the iris.csv dataset.
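Code:

import pandas as pd

# Load the dataset; the species column is named 'target' in this copy of iris.csv
data = pd.read_csv('iris.csv')

# Boolean mask per species, then summary statistics of the matching rows
print('Iris-setosa')
setosa = data['target'] == 'Iris-setosa'
print(data[setosa].describe())

print('Iris-virginica')
virginica = data['target'] == 'Iris-virginica'
print(data[virginica].describe())

print('Iris-versicolor')
versicolor = data['target'] == 'Iris-versicolor'
print(data[versicolor].describe())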
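As an alternative to filtering one species at a time, the same statistics can be produced in a single call with `groupby`, and specific percentiles with `quantile`. This is a sketch under assumptions: the six rows below are a made-up sample shaped like iris.csv (the real file has 150 rows), and the column names `sepal_length` and `target` are assumed.

```python
import pandas as pd

# Tiny made-up sample in the shape of iris.csv.
data = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "target": [
        "Iris-setosa", "Iris-setosa",
        "Iris-versicolor", "Iris-versicolor",
        "Iris-virginica", "Iris-virginica",
    ],
})

# One describe() row per species.
per_species = data.groupby("target")["sepal_length"].describe()
print(per_species)

# Chosen percentiles (10th, 50th, 90th) per species.
percentiles = data.groupby("target")["sepal_length"].quantile([0.1, 0.5, 0.9])
print(percentiles)
```

With the full file, replacing the inline frame by `pd.read_csv('iris.csv')` and dropping the column selection would summarize every numeric column per species.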
