Data Science
Assignment 1: SET-A
Q.1: Write a Python program to create a data frame containing columns named Name, Age, and Percentage. Add 10 rows to the data frame. View the data frame.
import pandas as pd
df=pd.DataFrame(columns=['Name','Age','Percentage'])
df.loc[0]=['subhan',20,78]
df.loc[1]=['tofik',22,45]
df.loc[2]=['rayyan',21,15]
df.loc[3]=['sharif',21,65]
df.loc[4]=['alim',88,99]
df.loc[5]=['shoib',18,97]
df.loc[6]=['danish',19,49]
df.loc[7]=['mustakim',25,6]
df.loc[8]=['mosin',20,78]
df.loc[9]=['arbaz',22,15]
print(df)
Output:
Name Age Percentage
0 subhan 20 78
1 tofik 22 45
2 rayyan 21 15
3 sharif 21 65
4 alim 88 99
5 shoib 18 97
6 danish 19 49
7 mustakim 25 6
8 mosin 20 78
9 arbaz 22 15
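The next output block (shape, size, dtypes and column index, ending with a bound-method representation of describe) has no code listed with it. A minimal sketch that would produce the first part, assuming the same df built in Q.1; note that print(df.describe) without parentheses prints the bound method itself, and the corrected print(df.describe()) appears further below:
print(df.shape)     # (rows, columns) -> (10, 3)
print(df.size)      # total number of cells -> 30
print(df.dtypes)    # all object, since the frame was created empty and filled via .loc
print(df.columns)   # Index(['Name', 'Age', 'Percentage'], dtype='object')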
Output:
(10, 3)
30
Name object
Age object
Percentage object
dtype: object
Index(['Name', 'Age', 'Percentage'], dtype='object')
<bound method NDFrame.describe of Name Age Percentage
0 alim 20 78
1 tofik 22 45
2 rayyan 21 15
3 sharif 21 65
4 subhan 20 12
5 shoib 18 97
6 danish 19 49
7 mustakim 25 6
8 mosin 20 78
9 arbaz 22 15>
import pandas as pd
df=pd.DataFrame(columns=['Name','Age','Percentage'])
df.loc[0]=['subhan',20,78]
df.loc[1]=['tofik',22,45]
df.loc[2]=['rayyan',21,15]
df.loc[3]=['sharif',21,65]
df.loc[4]=['alim',20,12]
df.loc[5]=['shoib',18,97]
df.loc[6]=['danish',19,49]
df.loc[7]=['mustakim',25,6]
df.loc[8]=['mosin',20,78]
df.loc[9]=['arbaz',22,15]
print(df.describe())
Output:
Name Age Percentage
count 10 10 10
unique 10 6 8
top arbaz 20 15
freq 1 3 2
Q.4: Write a Python program to add 5 rows with duplicate values and missing values. Add a column Remark with empty values.
import pandas as pd
df=pd.DataFrame(columns=['Name','Age','Percentage'])
df.loc[0]=['alim',20,78]
df.loc[1]=['tofik',22,45]
df.loc[2]=['subhan',21,15]
df.loc[3]=[None,21,65]
df.loc[4]=['subhan',20,12]
df['Remarks']=None
print(df)
Output:
Name Age Percentage Remarks
0 alim 20 78 None
1 tofik 22 45 None
2 subhan 21 15 None
3 NaN 21 65 None
4 subhan 20 12 None
Q.5: Write a Python program to get the number of observations, missing values, and duplicate values.
import pandas as pd
df=pd.DataFrame(columns=['Name','Age','Percentage'])
df.loc[0]=['alim',20,15]
df.loc[1]=['tofik',22,45]
df.loc[2]=['subhan',20,15]
df.loc[3]=[None,21,65]
df.loc[4]=['subhan',20,15]
print(df['Name'].size)
missing=df.isnull()
print(missing)
dup=df.duplicated()
print(dup)
Output:
5
Name Age Percentage
0 False False False
1 False False False
2 False False False
3 True False False
4 False False False
0 False
1 False
2 False
3 False
4 True
dtype: bool
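The masks above only show where the problems are; if the counts asked for in the question are wanted, they could be obtained like this (a sketch reusing missing and dup from the code above):
print("number of missing values:", missing.sum().sum())   # -> 1 (the None in Name)
print("number of duplicate rows:", dup.sum())              # -> 1 (row 4 repeats row 2)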
Q.6: Write a Python program to drop the 'Remark' column from the dataframe. Also drop all null and empty values. Print the modified data.
import pandas as pd
df=pd.DataFrame(columns=['Name','Age','Percentage'])
df.loc[0]=['ALIM',20,15]
df.loc[1]=['SHOAIB',22,45]
df.loc[2]=['SHARIF',20,15]
df.loc[3]=[None,21,65]
df.loc[4]=['ALIM',20,15]
df['Remarks']=None
print(df)
df.drop(labels=['Remarks'],axis=1,inplace=True)
print(df)
df.dropna(axis=0,inplace=True)
print(df)
Output:
Name Age Percentage Remarks
0 ALIM 20 15 None
1 SHOAIB 22 45 None
2 SHARIF 20 15 None
3 NaN 21 65 None
4 ALIM 20 15 None
Name Age Percentage
0 ALIM 20 15
1 SHOAIB 22 45
2 SHARIF 20 15
3 NaN 21 65
4 ALIM 20 15
Name Age Percentage
0 ALIM 20 15
1 SHOAIB 22 45
2 SHARIF 20 15
4 ALIM 20 15
import pandas as pd
import matplotlib.pyplot as plt
df=pd.DataFrame(columns=['Name','age','percentage'])
df.loc[0]=['kashish',19,95]
df.loc[1]=['Ramiza',20,91]
df.loc[2]=['naki',7,90]
df.loc[3]=['Faisal',18,85]
df.loc[4]=['Aman',23,80]
df.loc[5]=['Anas',24,75]
df.loc[6]=['Fazil',21,70]
df.loc[7]=['Mustaqim',22,65]
df.loc[8]=['Alfiya',20,89]
df.loc[9]=['Aqsa',21,86]
print(df)
df.plot(x="Name",y="percentage")
plt.title('Line plot name vs percentage')
plt.xlabel('name of student')
plt.ylabel('percentage')
plt.show()
print(df)
Output:
Name age percentage
0 kashish 19 95
1 Ramiza 20 91
2 naki 7 90
3 Faisal 18 85
4 Aman 23 80
5 Anas 24 75
6 Fazil 21 70
7 Mustaqim 22 65
8 Alfiya 20 89
9 Aqsa 21 86
import pandas as pd
import matplotlib.pyplot as plt
df=pd.DataFrame(columns=['Name','age','percentage'])
df.loc[0]=['kashish',19,95]
df.loc[1]=['Ramiza',20,91]
df.loc[2]=['naki',7,90]
df.loc[3]=['Faisal',18,85]
df.loc[4]=['Aman',23,80]
df.loc[5]=['Anas',24,75]
df.loc[6]=['Fazil',21,70]
df.loc[7]=['Mustaqim',22,65]
df.loc[8]=['Alfiya',20,89]
df.loc[9]=['Aqsa',21,86]
print(df)
plt.scatter(x=df["Name"],y=df["percentage"])
plt.title('Scatter plot name vs percentage')
plt.xlabel('name of student')
plt.ylabel('percentage')
plt.show()
print(df)
Output:
Name age percentage
0 kashish 19 95
1 Ramiza 20 91
2 naki 7 90
3 Faisal 18 85
4 Aman 23 80
5 Anas 24 75
6 Fazil 21 70
7 Mustaqim 22 65
8 Alfiya 20 89
9 Aqsa 21 86
Assignment 1: SET-B
Q1) Download the heights and weights dataset and load the dataset from the given CSV file into a dataframe. Print the first 10 rows, the last 10 rows, and 20 random rows.
import pandas as pd
data=pd.read_csv('HeightWeight.csv')
df=pd.DataFrame(data)
print('\nfirst 10 rows')
print(df.head(10))
print('\nlast 10 rows')
print(df.tail(10))
print('\nrandom 20 rows')
print(df.sample(20))
Output:
first 10 rows
Index Height(Inches) Weight(Pounds)
0 1 65.78331 112.9925
1 2 71.51521 136.4873
2 3 69.39874 153.0269
3 4 68.21660 142.3354
4 5 67.78781 144.2971
5 6 68.69784 123.3024
6 7 69.80204 141.4947
7 8 70.01472 136.4623
8 9 67.90265 112.3723
9 10 66.78236 120.6672
last 10 rows
Index Height(Inches) Weight(Pounds)
24990 24991 69.97767 125.3672
24991 24992 71.91656 128.2840
24992 24993 70.96218 146.1936
24993 24994 66.19462 118.7974
24994 24995 67.21126 127.6603
24995 24996 69.50215 118.0312
24996 24997 64.54826 120.1932
24997 24998 64.69855 118.2655
24998 24999 67.52918 132.2682
24999 25000 68.87761 124.8742
random 20 rows
Index Height(Inches) Weight(Pounds)
18515 18516 71.15912 143.7729
3550 3551 65.95300 130.0755
16400 16401 67.20032 130.9151
10718 10719 70.79804 125.7816
4830 4831 66.24238 121.9611
9121 9122 68.05361 137.0546
11516 11517 68.67632 115.3375
3126 3127 67.59507 126.1888
15670 15671 68.55083 137.6187
20293 20294 65.96939 139.4453
5842 5843 68.92916 129.1092
21409 21410 69.27081 124.6497
15365 15366 67.18395 137.6251
16889 16890 69.07788 131.0112
12382 12383 69.50005 135.9850
14220 14221 68.49492 132.0698
2701 2702 69.25709 142.5795
4578 4579 68.91069 103.7011
11468 11469 65.93881 125.8178
19312 19313 68.47651 117.6580
Q2) Write a Python program to find the shape, size, and datatypes of the dataframe object.
import pandas as pd
data=pd.read_csv('HeightWeight.csv')
df=pd.DataFrame(data)
print('\n shape of dataframe',df.shape)
print('\n size of dataframe',df.size)
print('\n datatype of dataframe',df.dtypes)
Output:
shape of dataframe (25000, 3)
size of dataframe 75000
datatype of dataframe Index int64
Height(Inches) float64
Weight(Pounds) float64
dtype: object
Q3) Write a python program to view basic statistical details of the data.
import pandas as pd
data=pd.read_csv('HeightWeight.csv')
df=pd.DataFrame(data)
print("/n basic statistical details of a data:/n",df.describe())
Output:
basic statistical details of the data:
Index Height(Inches) Weight(Pounds)
count 25000.000000 25000.000000 25000.000000
mean 12500.500000 67.993114 127.079421
std 7217.022701 1.901679 11.660898
min 1.000000 60.278360 78.014760
25% 6250.750000 66.704397 119.308675
50% 12500.500000 67.995700 127.157750
75% 18750.250000 69.272958 134.892850
max 25000.000000 75.152800 170.924000
Q4) Write a Python program to get the number of observations, missing values, and NaN values.
import pandas as pd
import numpy as np
data=pd.read_csv('HeightWeight.csv')
df=pd.DataFrame(data)
print('\n number of observations:',df['Index'].size)
missing=df.isnull()              # boolean mask of missing cells
nan_values=np.isnan(df)          # boolean mask of NaN cells in the numeric columns
print("\n nan values:",nan_values.size)   # .size counts all cells, not just NaNs
Output:
number of observations : 25000
nan values : 75000
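As the comment above notes, .size counts every cell (25000 rows x 3 columns = 75000) rather than the number of NaN values. A sketch of how the actual counts could be obtained, reusing missing and nan_values from the code above:
print("missing values per column:\n", missing.sum())
print("total missing values:", missing.sum().sum())
print("total NaN values:", nan_values.sum().sum())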
Q5) Write a Python program to add a column “BMI” to the dataframe, calculated as weight/height^2.
import pandas as pd
data=pd.read_csv('HeightWeight.csv')
df=pd.DataFrame(data)
df['BMI']=(df['Weight(Pounds)']/df['Height(Inches)']**2)
print("after adding colom/n",df)
Output:
after adding column
        Index  Height(Inches)  Weight(Pounds)       BMI
0 1 65.78331 112.9925 2.950311
1 2 71.51521 136.4873 3.642400
2 3 69.39874 153.0269 4.862195
3 4 68.21660 142.3354 4.353572
4 5 67.78781 144.2971 4.531187
... ... ... ... ...
24995 24996 69.50215 118.0312 2.884013
24996 24997 64.54826 120.1932 3.467294
24997 24998 64.69855 118.2655 3.341389
24998 24999 67.52918 132.2682 3.836436
24999 25000 68.87761 124.8742 3.286921
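The maximum-BMI result shown next has no code listed with it; a minimal sketch that would produce it, assuming the BMI column added above:
print("Maximum of BMI =", df['BMI'].max())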
Output:
Maximum of BMI = 5.933879009339526
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('HeightWeight.csv')
df = pd.DataFrame(data)
plt.scatter(x=df['Height(Inches)'],y=df['Weight(Pounds)'],c='blue')
plt.title("Scatter Plot")
plt.xlabel("Height(Inches)")
plt.ylabel("Weight(Pounds)")
plt.show()
Output:
Assignment 2: SET-A
Q1) Create an array using numpy and display mean and median.
import numpy as np
demo = np.array([[30,75,70],[80,90,20],[50,95,60]])
print(demo)
print('\n')
print(np.mean(demo))
print('\n')
print(np.median(demo))
print('\n')
Output:
[[30 75 70]
 [80 90 20]
 [50 95 60]]
63.333333333333336
70.0
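The output block below (the concatenated names, Age 181, Rating 25.61) has no code listed with it; it matches a column-wise sum of the Name/Age/Rating dataframe defined in the describe() example that follows, so a minimal sketch that would reproduce it, assuming that same md dictionary, is:
df=pd.DataFrame(md)
print(df.sum())   # the string column is concatenated, the numeric columns are summed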
Output:
Name RamShamMeenaSeetaGeetaRakeshMadhav
Age 181
Rating 25.61
dtype: object
import pandas as pd
import numpy as np
md={'Name':pd.Series(['Ram','Sham','Meena','Seeta','Geeta','Rakesh','Madhav']),
'Age':pd.Series([25,26,25,23,30,29,23]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
df=pd.DataFrame(md)
print(df.describe())
Output:
Age Rating
count 7.000000 7.000000
mean 25.857143 3.658571
std 2.734262 0.698628
min 23.000000 2.560000
25% 24.000000 3.220000
50% 25.000000 3.800000
75% 27.500000 4.105000
max 30.000000 4.600000
import numpy as np
data=np.array([13,52,44,32,30,0,36,45])
print("Standard Deviation of sample is %s"%(np.std(data)))
Output:
Standard Deviation of sample is 16.263455967290593
Virat Rohit
92 89
97 87
85 67
74 55
71 47
55 72
85 76
63 79
42 44
32 92
71 99
55 47
import pandas as pd
import scipy.stats as s
score={'Virat':[92,97,85,74,71,55,85,63,42,32,71,55],'Rohit':
[89,87,67,55,47,72,76,79,44,92,99,47]}
df=pd.DataFrame(score)
print(df)
print("\nArithmetic Mean Values")
print("Score 1",s.tmean(df["Virat"]).round(2))
print("Score 2",s.tmean(df["Rohit"]).round(2))
Output:
Virat Rohit
0 92 89
1 97 87
2 85 67
3 74 55
4 71 47
5 55 72
6 85 76
7 63 79
8 42 44
9 32 92
10 71 99
11 55 47
import numpy as np
mydata=np.array([24,29,20,22,24,26,27,30,20,31,26,38,44,47])
q3,q1=np.percentile(mydata,[75,25])
iqrvalue=q3-q1
print(iqrvalue)
Output:
6.75
Q7) Write a Python program to find the maximum value of a given flattened array:
import numpy as np
arr=np.array([[25,26,45],[12,36,42],[8,50,65]])
print("\n Original flattened Array:\n",arr)
flat=arr.flatten()                     # flattened 1-D copy of arr
max_value=np.max(flat)                 # avoid shadowing the built-in max
print("\n Maximum value of flattened array:\n",max_value)
min_value=np.min(flat)                 # avoid shadowing the built-in min
print("\n Minimum value of flattened array:\n",min_value)
Output:
Original flattened Array:
[[25 26 45]
[12 36 42]
[ 8 50 65]]
Q8) Write a Python program to compute the Euclidean distance between two data points in a dataset.
import numpy as np
point1= np.array((1,2,3))
point2= np.array((1,1,1))
dist = np.linalg.norm(point1 - point2)
print("Euclidian Distance between two pints: ",dist)
Output:
Euclidean distance between two points: 2.23606797749979
Q.9: Create a dataframe of data values. Find the mean, range, and IQR for this data.
import pandas as pd
import numpy as np
import scipy.stats as s
d = [32,36,46,47,56,69,75,79,79,88,89,91,92,93,96,97,101,105,112,116]
data=pd.DataFrame(d)
print("\n DataFrame:\n",data)
print("\n Mean of dataframe : ",s.tmean(data))
data_range = np.max(data)-np.min(data)
print("\n Range of dataframe : ",data_range)
Q1 = np.median(data[:10])
Q3 = np.median(data[10:])
IQR = Q3 - Q1
print("\n Inter Quartile Range (IQR) of dataframe : ",IQR)
Output:
DataFrame:
0
0 32
1 36
2 46
3 47
4 56
5 69
6 75
7 79
8 79
9 88
10 89
11 91
12 92
13 93
14 96
15 97
16 101
17 105
18 112
19 116
Range of dataframe : 0 84
dtype: int64
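The Manhattan-distance result below has no question or code listed with it. A minimal sketch of how the sum of Manhattan (L1) distances over all pairs of points could be computed; the actual points behind the value 18 are not shown, so the points used here are hypothetical:
import numpy as np
points = np.array([[1, 2], [3, 4], [5, 6]])   # hypothetical data, not the original points
total = 0
for i in range(len(points)):
    for j in range(i + 1, len(points)):
        total += np.sum(np.abs(points[i] - points[j]))   # L1 distance between pair (i, j)
print("sum of Manhattan distance between all pairs of points is =", total)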
Output:
sum of Manhattan distance between all pairs of points is = 18
Q.12: Create a dataframe of students' information such as name, graduation percentage, and age. Display the average age of students and the average graduation percentage. Also describe all basic statistics of the data. (Hint: use describe().)
import pandas as pd
import scipy.stats as s
data={'Name':['sharif','shoaib','nafisa','alim'],'Age':[20,22,23,21],
'perc':[65.2,78.4,78.6,74.5]}
df=pd.DataFrame(data)
print("\n Average Age :",sum(df['Age']/len(df['Age'])))
print("\n Average Percentage : ",sum(df['perc']/len(df['perc'])))
print("\n Basic Stastistics of data :\n",df.describe())
Output:
Average Age : 21.5
Average Percentage : 74.17500000000001
Q.11: Write a NumPy program to compute the histogram of nums against bins.
Sample Output:
nums:[0.5 0.7 1.0 1.2 1.3 2.1]
bins:[0 1 2 3].
import numpy as np
import matplotlib.pyplot as plt
nums = np.array([0.5,0.7,1.0,1.2,1.3,2.1])
bins = np.array([0,1,2,3])
print("nums: ",nums)
print("bins: ",bins)
print("Result:", np.histogram(nums, bins))
plt.hist(nums, bins=bins)
plt.show()
Output:
nums: [0.5 0.7 1. 1.2 1.3 2.1]
bins: [0 1 2 3]
Result: (array([2, 3, 1]), array([0, 1, 2, 3]))
Assignment 2: SET-B
Q1) Download the Iris dataset file. Read this CSV file using the read_csv() function. Take a sample from the entire dataset. Display the maximum and minimum values of all numeric attributes.
import pandas as pd
data=pd.read_csv("Iris.csv")
df=pd.DataFrame(data)
sample=df.sample()
print("\n Sample from Dataset:\n",sample)
print("\n Maximum of sepal Length:",max(df['SepalLengthCm']))
print("\n Minimum of sepal Length:",min(df['SepalLengthCm']))
print("\n Maximum of sepal Width:",max(df['SepalWidthCm']))
print("\n Minimum of sepal Width:",min(df['SepalWidthCm']))
Output:
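The question asks for all numeric attributes, while the code above covers only the sepal columns; the petal columns could be handled the same way (a sketch, assuming the standard Iris column names PetalLengthCm and PetalWidthCm):
print("\n Maximum of petal Length:",max(df['PetalLengthCm']))
print("\n Minimum of petal Length:",min(df['PetalLengthCm']))
print("\n Maximum of petal Width:",max(df['PetalWidthCm']))
print("\n Minimum of petal Width:",min(df['PetalWidthCm']))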
Q.2: Continue with the above dataset; find the number of records for each distinct value of the class attribute. Consider the entire dataset and not the sample.
import pandas as pd
data=pd.read_csv("Iris.csv")
df=pd.DataFrame(data)
print("\n DataFrame:\n",df)
cnt=df['Species'].value_counts()
print("\n number of records for each distinct value of species attribute:\n",cnt)
Output:
Q.3: Display the column-wise mean and median for the Iris dataset. (Hint: Use the mean() and median() functions of the pandas dataframe.)
import pandas as pd
import scipy.stats as s
import statistics as st
data=pd.read_csv('Iris.csv')
df=pd.DataFrame(data)
print("\n DataFrame :\n",df)
print("\n Mean of sepal Length:", s.tmean(df['SepalLengthCm']))
print("\n Median of sepal Length:", st.median(df['SepalLengthCm']))
print("\n Mean of sepal Width:", s.tmean(df['SepalWidthCm']))
print("\n Median of sepal Width:", st.median(df['SepalWidthCm']))
OUTPUT:
DataFrame :
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \
0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2
.. ... ... ... ... ...
145 146 6.7 3.0 5.2 2.3
146 147 6.3 2.5 5.0 1.9
147 148 6.5 3.0 5.2 2.0
148 149 6.2 3.4 5.4 2.3
149 150 5.9 3.0 5.1 1.8
Species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
.. ...
145 Iris-virginica
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica
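The hint points at pandas' own mean() and median(); a column-wise version using them could look like this (a sketch; numeric_only=True is assumed here so the Species column is skipped):
print("\n Column-wise mean:\n", df.mean(numeric_only=True))
print("\n Column-wise median:\n", df.median(numeric_only=True))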
import pandas as pd
data=pd.read_csv("Data.csv")
df=pd.DataFrame(data)
print("\n Describing Dataset:\n",df.describe())
print("\n Shape of Dataset:\n",df.shape)
print("\n First three rows of Dataset:\n",df.head(3))
OUTPUT:
Describing Dataset:
Age Salary
count 9.000000 9.000000
mean 38.777778 63777.777778
std 7.693793 12265.579662
min 27.000000 48000.000000
25% 35.000000 54000.000000
50% 38.000000 61000.000000
75% 44.000000 72000.000000
max 50.000000 83000.000000
Shape of Dataset:
(10, 4)
import pandas as pd
data=pd.read_csv("Data.csv")
df=pd.DataFrame(data)
print("\n Displaying Dataset:\n",df)
data['Salary']= data['Salary'].fillna(data['Salary'].mean())
data['Age']= data['Age'].fillna(data['Age'].mean())
print("\n ****** Modified Dataset ******\n",df)
OUTPUT:
Displaying Dataset:
Country Age Salary Purchased
0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No
3 Spain 38.0 61000.0 No
4 Germany 40.0 NaN Yes
5 France 35.0 58000.0 Yes
6 Spain NaN 52000.0 No
7 France 48.0 79000.0 Yes
8 Germany 50.0 83000.0 No
9 France 37.0 67000.0 Yes
Q3) Data.csv has two categorical columns (the Country column and the Purchased column).
a. Apply one-hot encoding on the Country column.
b. Apply label encoding on the Purchased column.
import pandas as pd
from sklearn import preprocessing
data=pd.read_csv("Data.csv")
df=pd.DataFrame(data)
print("\n Describing Dataset:\n",df)
one_hot_encoded_data = pd.get_dummies(data, columns = ['Country'])
print("\n *******After applying OneHot coding on Country*******\n",one_hot_encoded_data)
label_encoder = preprocessing.LabelEncoder()
df['Purchased']= label_encoder.fit_transform(df['Purchased'])
df['Purchased'].unique()
print("\n *******After applying OneHot coding on Country**********\n",df)
OUTPUT:
Describing Dataset:
Country Age Salary Purchased
0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No
3 Spain 38.0 61000.0 No
4 Germany 40.0 NaN Yes
5 France 35.0 58000.0 Yes
6 Spain NaN 52000.0 No
7 France 48.0 79000.0 Yes
8 Germany 50.0 83000.0 No
9 France 37.0 67000.0 Yes
import pandas as pd
data=pd.read_csv("winequality_red.csv")
df=pd.DataFrame(data)
print("\n Dataset is : \n",df)
OUTPUT:
Dataset is :
fixed acidity volatile acidity citric acid residual sugar chlorides \
0 7.4 0.700 0.00 1.9 0.076
1 7.8 0.880 0.00 2.6 0.098
2 7.8 0.760 0.04 2.3 0.092
3 11.2 0.280 0.56 1.9 0.075
4 7.4 0.700 0.00 1.9 0.076
... ... ... ... ... ...
1594 6.2 0.600 0.08 2.0 0.090
1595 5.9 0.550 0.10 2.2 0.062
1596 6.3 0.510 0.13 2.3 0.076
1597 5.9 0.645 0.12 2.0 0.075
1598 6.0 0.310 0.47 3.6 0.067
alcohol quality
0 9.4 5
1 9.8 5
2 9.8 5
3 9.8 6
4 9.4 5
... ... ...
1594 10.5 5
1595 11.2 6
1596 11.0 6
1597 10.2 5
1598 11.0 6
OUTPUT:
alcohol quality
0 9.4 5
1 9.8 5
2 9.8 5
3 9.8 6
4 9.4 5
... ... ...
1594 10.5 5
1595 11.2 6
1596 11.0 6
1597 10.2 5
1598 11.0 6
Q.3) Standardizing Data (transform the data into a standard Gaussian distribution with a mean of 0 and a standard deviation of 1).
import pandas as pd
from sklearn import preprocessing
data=pd.read_csv("winequality-red.csv",sep=';')
df=pd.DataFrame(data)
print("\n Dataset is : \n",df)
standard = preprocessing.scale(df)
print("\n *********Standardized Data*********\n",standard)
OUTPUT:
Dataset is :
fixed acidity volatile acidity citric acid residual sugar chlorides \
0 7.4 0.700 0.00 1.9 0.076
1 7.8 0.880 0.00 2.6 0.098
2 7.8 0.760 0.04 2.3 0.092
3 11.2 0.280 0.56 1.9 0.075
4 7.4 0.700 0.00 1.9 0.076
... ... ... ... ... ...
1594 6.2 0.600 0.08 2.0 0.090
1595 5.9 0.550 0.10 2.2 0.062
1596 6.3 0.510 0.13 2.3 0.076
1597 5.9 0.645 0.12 2.0 0.075
1598 6.0 0.310 0.47 3.6 0.067
alcohol quality
0 9.4 5
1 9.8 5
2 9.8 5
3 9.8 6
4 9.4 5
... ... ...
1594 10.5 5
1595 11.2 6
1596 11.0 6
1597 10.2 5
1598 11.0 6
*********Standardized Data*********
[[-0.52835961 0.96187667 -1.39147228 ... -0.57920652 -0.96024611
-0.78782264]
[-0.29854743 1.96744245 -1.39147228 ... 0.1289504 -0.58477711
-0.78782264]
[-0.29854743 1.29706527 -1.18607043 ... -0.04808883 -0.58477711
-0.78782264]
...
[-1.1603431 -0.09955388 -0.72391627 ... 0.54204194 0.54162988
0.45084835]
[-1.39015528 0.65462046 -0.77526673 ... 0.30598963 -0.20930812
-0.78782264]
[-1.33270223 -1.21684919 1.02199944 ... 0.01092425 0.54162988
0.45084835]]
Q.4) Normalizing Data (rescale each observation to a length of 1, a unit norm; for this, use the Normalizer class).
import pandas as pd
import numpy as np
from sklearn import preprocessing
data=pd.read_csv("winequality-red.csv",sep=';')
df=pd.DataFrame(data)
print("\n Dataset is : \n",df)
normalized = preprocessing.normalize(df,norm='l2')
print("\n***********Normalized Data***********\n",normalized)
OUTPUT:
Dataset is :
fixed acidity volatile acidity citric acid residual sugar chlorides \
0 7.4 0.700 0.00 1.9 0.076
1 7.8 0.880 0.00 2.6 0.098
2 7.8 0.760 0.04 2.3 0.092
3 11.2 0.280 0.56 1.9 0.075
4 7.4 0.700 0.00 1.9 0.076
... ... ... ... ... ...
1594 6.2 0.600 0.08 2.0 0.090
1595 5.9 0.550 0.10 2.2 0.062
1596 6.3 0.510 0.13 2.3 0.076
1597 5.9 0.645 0.12 2.0 0.075
1598 6.0 0.310 0.47 3.6 0.067
alcohol quality
0 9.4 5
1 9.8 5
2 9.8 5
3 9.8 6
4 9.4 5
... ... ...
1594 10.5 5
1595 11.2 6
1596 11.0 6
1597 10.2 5
1598 11.0 6
***********Normalized Data***********
[[0.19347777 0.01830195 0. ... 0.01464156 0.24576906 0.13072822]
[0.10698874 0.01207052 0. ... 0.00932722 0.13442175 0.06858252]
[0.13494887 0.01314886 0.00069205 ... 0.01124574 0.16955114 0.08650569]
...
[0.1222319 0.00989496 0.00252225 ... 0.01455142 0.21342078 0.11641133]
[0.10524769 0.01150589 0.00214063 ... 0.0126654 0.18195363 0.08919296]
[0.12491328 0.00645385 0.00978487 ... 0.01374046 0.22900768 0.12491328]]
Q.5) Binarizing Data using the Binarizer class (using a binary threshold, it is possible to transform our data by marking the values above it as 1 and those equal to or below it as 0).
import pandas as pd
from sklearn.preprocessing import Binarizer
data_set = pd.read_csv('Data.csv')
age = data_set['Age'].values
salary = data_set['Salary'].values
print("\n Original age data values : \n",age)
print("\n Original salary data values : \n",salary)
x = age.reshape(1,-1)        # Binarizer expects a 2-D array
y = salary.reshape(1,-1)
binarizer_1 = Binarizer(threshold=35)       # ages above 35 -> 1, others -> 0
binarizer_2 = Binarizer(threshold=61000)    # salaries above 61000 -> 1, others -> 0
print("\n Binarized age : \n", binarizer_1.fit_transform(x))
print("\n Binarized salary : \n", binarizer_2.fit_transform(y))
OUTPUT:
Original age data values :
[44 27 30 38 40 35 58 48 50 37]
Binarized age :
[[1 0 0 1 1 0 1 1 1 1]]
Binarized salary :
[[1 0 0 0 0 0 0 1 1 1]]