Data Science
Assignment 1: SET-A
Q.1: Write a Python program to create a data frame containing columns named Name, Age, and Percentage. Add 10 rows to the data frame. View the data frame.
import pandas as pd
df=pd.DataFrame(columns=['Name','Age','Percentage'])
df.loc[0]=['subhan',20,78]
df.loc[1]=['tofik',22,45]
df.loc[2]=['rayyan',21,15]
df.loc[3]=['sharif',21,65]
df.loc[4]=['alim',88,99]
df.loc[5]=['shoib',18,97]
df.loc[6]=['danish',19,49]
df.loc[7]=['mustakim',25,6]
df.loc[8]=['mosin',20,78]
df.loc[9]=['arbaz',22,15]
print(df)
Output:
Name Age Percentage
0 subhan 20 78
1 tofik 22 45
2 rayyan 21 15
3 sharif 21 65
4 alim 88 99
5 shoib 18 97
6 danish 19 49
7 mustakim 25 6
8 mosin 20 78
9 arbaz 22 15
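The next output block (shape, size, dtypes and column index, ending with a bound-method representation of describe) has no code listed with it. A minimal sketch that would produce the first part, assuming the same df built in Q.1; note that print(df.describe) without parentheses prints the bound method itself, and the corrected print(df.describe()) appears further below:
print(df.shape)     # (rows, columns) -> (10, 3)
print(df.size)      # total number of cells -> 30
print(df.dtypes)    # all object, since the frame was created empty and filled via .loc
print(df.columns)   # Index(['Name', 'Age', 'Percentage'], dtype='object')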
Output:
(10, 3)
30
Name object
Age object
Percentage object
dtype: object
Index(['Name', 'Age', 'Percentage'], dtype='object')
<bound method NDFrame.describe of Name Age Percentage
0 alim 20 78
1 tofik 22 45
2 rayyan 21 15
3 sharif 21 65
4 subhan 20 12
5 shoib 18 97
6 danish 19 49
7 mustakim 25 6
8 mosin 20 78
9 arbaz 22 15>
import pandas as pd
df=pd.DataFrame(columns=['Name','Age','Percentage'])
df.loc[0]=['subhan',20,78]
df.loc[1]=['tofik',22,45]
df.loc[2]=['rayyan',21,15]
df.loc[3]=['sharif',21,65]
df.loc[4]=['alim',20,12]
df.loc[5]=['shoib',18,97]
df.loc[6]=['danish',19,49]
df.loc[7]=['mustakim',25,6]
df.loc[8]=['mosin',20,78]
df.loc[9]=['arbaz',22,15]
print(df.describe())
Output:
Name Age Percentage
count 10 10 10
unique 10 6 8
top arbaz 20 15
freq 1 3 2
Q.4: Write a Python program to add 5 rows with duplicate values and missing values. Add a column Remark with empty values.
import pandas as pd
df=pd.DataFrame(columns=['Name','Age','Percentage'])
df.loc[0]=['alim',20,78]
df.loc[1]=['tofik',22,45]
df.loc[2]=['subhan',21,15]
df.loc[3]=[None,21,65]
df.loc[4]=['subhan',20,12]
df['Remarks']=None
print(df)
Output:
Name Age Percentage Remarks
0 alim 20 78 None
1 tofik 22 45 None
2 subhan 21 15 None
3 NaN 21 65 None
4 subhan 20 12 None
Q.5: Write a Python program to get the number of observations, missing values, and duplicate values.
import pandas as pd
df=pd.DataFrame(columns=['Name','Age','Percentage'])
df.loc[0]=['alim',20,15]
df.loc[1]=['tofik',22,45]
df.loc[2]=['subhan',20,15]
df.loc[3]=[None,21,65]
df.loc[4]=['subhan',20,15]
print(df['Name'].size)
missing=df.isnull()
print(missing)
dup=df.duplicated()
print(dup)
Output:
5
Name Age Percentage
0 False False False
1 False False False
2 False False False
3 True False False
4 False False False
0 False
1 False
2 False
3 False
4 True
dtype: bool
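The masks above only show where the problems are; if the counts asked for in the question are wanted, they could be obtained like this (a sketch reusing missing and dup from the code above):
print("number of missing values:", missing.sum().sum())   # -> 1 (the None in Name)
print("number of duplicate rows:", dup.sum())              # -> 1 (row 4 repeats row 2)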
Q.6: Write a Python program to drop the 'Remark' column from the dataframe. Also drop all null and empty values. Print the modified data.
import pandas as pd
df=pd.DataFrame(columns=['Name','Age','Percentage'])
df.loc[0]=['ALIM',20,15]
df.loc[1]=['SHOAIB',22,45]
df.loc[2]=['SHARIF',20,15]
df.loc[3]=[None,21,65]
df.loc[4]=['ALIM',20,15]
df['Remarks']=None
print(df)
df.drop(labels=['Remarks'],axis=1,inplace=True)
print(df)
df.dropna(axis=0,inplace=True)
print(df)
Output:
Name Age Percentage Remarks
0 ALIM 20 15 None
1 SHOAIB 22 45 None
2 SHARIF 20 15 None
3 NaN 21 65 None
4 ALIM 20 15 None
Name Age Percentage
0 ALIM 20 15
1 SHOAIB 22 45
2 SHARIF 20 15
3 NaN 21 65
4 ALIM 20 15
Name Age Percentage
0 ALIM 20 15
1 SHOAIB 22 45
2 SHARIF 20 15
4 ALIM 20 15
import pandas as pd
import matplotlib.pyplot as plt
df=pd.DataFrame(columns=['Name','age','percentage'])
df.loc[0]=['kashish',19,95]
df.loc[1]=['Ramiza',20,91]
df.loc[2]=['naki',7,90]
df.loc[3]=['Faisal',18,85]
df.loc[4]=['Aman',23,80]
df.loc[5]=['Anas',24,75]
df.loc[6]=['Fazil',21,70]
df.loc[7]=['Mustaqim',22,65]
df.loc[8]=['Alfiya',20,89]
df.loc[9]=['Aqsa',21,86]
print(df)
df.plot(x="Name",y="percentage")
plt.title('Line plot name vs percentage')
plt.xlabel('name of student')
plt.ylabel('percentage')
plt.show()
print(df)
Output:
Name age percentage
0 kashish 19 95
1 Ramiza 20 91
2 naki 7 90
3 Faisal 18 85
4 Aman 23 80
5 Anas 24 75
6 Fazil 21 70
7 Mustaqim 22 65
8 Alfiya 20 89
9 Aqsa 21 86
import pandas as pd
import matplotlib.pyplot as plt
df=pd.DataFrame(columns=['Name','age','percentage'])
df.loc[0]=['kashish',19,95]
df.loc[1]=['Ramiza',20,91]
df.loc[2]=['naki',7,90]
df.loc[3]=['Faisal',18,85]
df.loc[4]=['Aman',23,80]
df.loc[5]=['Anas',24,75]
df.loc[6]=['Fazil',21,70]
df.loc[7]=['Mustaqim',22,65]
df.loc[8]=['Alfiya',20,89]
df.loc[9]=['Aqsa',21,86]
print(df)
plt.scatter(x=df["Name"],y=df["percentage"])
plt.title('Scatter plot name vs percentage')
plt.xlabel('name of student')
plt.ylabel('percentage')
plt.show()
print(df)
Output:
Name age percentage
0 kashish 19 95
1 Ramiza 20 91
2 naki 7 90
3 Faisal 18 85
4 Aman 23 80
5 Anas 24 75
6 Fazil 21 70
7 Mustaqim 22 65
8 Alfiya 20 89
9 Aqsa 21 86
Assignment 1: SET-B
Q1) Download the heights and weights dataset and load the dataset from the given CSV file into a dataframe. Print the first 10 rows, the last 10 rows, and 20 random rows.
import pandas as pd
data=pd.read_csv('HeightWeight.csv')
df=pd.DataFrame(data)
print('\nfirst 10 rows')
print(df.head(10))
print('\nlast 10 rows')
print(df.tail(10))
print('\nrandom 20 rows')
print(df.sample(20))
Output:
first 10 rows
Index Height(Inches) Weight(Pounds)
0 1 65.78331 112.9925
1 2 71.51521 136.4873
2 3 69.39874 153.0269
3 4 68.21660 142.3354
4 5 67.78781 144.2971
5 6 68.69784 123.3024
6 7 69.80204 141.4947
7 8 70.01472 136.4623
8 9 67.90265 112.3723
9 10 66.78236 120.6672
last 10 rows
Index Height(Inches) Weight(Pounds)
24990 24991 69.97767 125.3672
24991 24992 71.91656 128.2840
24992 24993 70.96218 146.1936
24993 24994 66.19462 118.7974
24994 24995 67.21126 127.6603
24995 24996 69.50215 118.0312
24996 24997 64.54826 120.1932
24997 24998 64.69855 118.2655
24998 24999 67.52918 132.2682
24999 25000 68.87761 124.8742
random 20 rows
Index Height(Inches) Weight(Pounds)
18515 18516 71.15912 143.7729
3550 3551 65.95300 130.0755
16400 16401 67.20032 130.9151
10718 10719 70.79804 125.7816
4830 4831 66.24238 121.9611
9121 9122 68.05361 137.0546
11516 11517 68.67632 115.3375
3126 3127 67.59507 126.1888
15670 15671 68.55083 137.6187
20293 20294 65.96939 139.4453
5842 5843 68.92916 129.1092
21409 21410 69.27081 124.6497
15365 15366 67.18395 137.6251
16889 16890 69.07788 131.0112
12382 12383 69.50005 135.9850
14220 14221 68.49492 132.0698
2701 2702 69.25709 142.5795
4578 4579 68.91069 103.7011
11468 11469 65.93881 125.8178
19312 19313 68.47651 117.6580
Q2) Write a Python program to find the shape, size, and datatypes of the dataframe object.
import pandas as pd
data=pd.read_csv('HeightWeight.csv')
df=pd.DataFrame(data)
print('\n shape of dataframe',df.shape)
print('\n size of dataframe',df.size)
print('\n datatype of dataframe',df.dtypes)
Output:
shape of dataframe (25000, 3)
size of dataframe 75000
datatype of dataframe Index int64
Height(Inches) float64
Weight(Pounds) float64
dtype: object
Q3) Write a python program to view basic statistical details of the data.
import pandas as pd
data=pd.read_csv('HeightWeight.csv')
df=pd.DataFrame(data)
print("/n basic statistical details of a data:/n",df.describe())
Output:
basic statistical details of the data:
Index Height(Inches) Weight(Pounds)
count 25000.000000 25000.000000 25000.000000
mean 12500.500000 67.993114 127.079421
std 7217.022701 1.901679 11.660898
min 1.000000 60.278360 78.014760
25% 6250.750000 66.704397 119.308675
50% 12500.500000 67.995700 127.157750
75% 18750.250000 69.272958 134.892850
max 25000.000000 75.152800 170.924000
Q4) Write a Python program to get the number of observations, missing values, and NaN values.
import pandas as pd
import numpy as np
data=pd.read_csv('HeightWeight.csv')
df=pd.DataFrame(data)
print('\n number of observations:',df['Index'].size)
missing=df.isnull()              # boolean mask of missing cells
nan_values=np.isnan(df)          # boolean mask of NaN cells in the numeric columns
print("\n nan values:",nan_values.size)   # .size counts all cells, not just NaNs
Output:
number of observations : 25000
nan values : 75000
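As the comment above notes, .size counts every cell (25000 rows x 3 columns = 75000) rather than the number of NaN values. A sketch of how the actual counts could be obtained, reusing missing and nan_values from the code above:
print("missing values per column:\n", missing.sum())
print("total missing values:", missing.sum().sum())
print("total NaN values:", nan_values.sum().sum())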
Q5) Write a Python program to add a column “BMI” to the dataframe, calculated as weight/height^2.
import pandas as pd
data=pd.read_csv('HeightWeight.csv')
df=pd.DataFrame(data)
df['BMI']=(df['Weight(Pounds)']/df['Height(Inches)']**2)
print("after adding colom/n",df)
Output:
after adding column
        Index  Height(Inches)  Weight(Pounds)       BMI
0 1 65.78331 112.9925 2.950311
1 2 71.51521 136.4873 3.642400
2 3 69.39874 153.0269 4.862195
3 4 68.21660 142.3354 4.353572
4 5 67.78781 144.2971 4.531187
... ... ... ... ...
24995 24996 69.50215 118.0312 2.884013
24996 24997 64.54826 120.1932 3.467294
24997 24998 64.69855 118.2655 3.341389
24998 24999 67.52918 132.2682 3.836436
24999 25000 68.87761 124.8742 3.286921
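The maximum-BMI result shown next has no code listed with it; a minimal sketch that would produce it, assuming the BMI column added above:
print("Maximum of BMI =", df['BMI'].max())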
Output:
Maximum of BMI = 5.933879009339526
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('HeightWeight.csv')
df = pd.DataFrame(data)
plt.scatter(x=df['Height(Inches)'],y=df['Weight(Pounds)'],c='blue')
plt.title("Scatter Plot")
plt.xlabel("Height(Inches)")
plt.ylabel("Weight(Pounds)")
plt.show()
Output:
Assignment 2: SET-A
Q1) Create an array using numpy and display mean and median.
import numpy as np
demo = np.array([[30,75,70],[80,90,20],[50,95,60]])
print(demo)
print('\n')
print(np.mean(demo))
print('\n')
print(np.median(demo))
print('\n')
Output:
[[30 75 70]
 [80 90 20]
 [50 95 60]]
63.333333333333336
70.0
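The output block below (the concatenated names, Age 181, Rating 25.61) has no code listed with it; it matches a column-wise sum of the Name/Age/Rating dataframe defined in the describe() example that follows, so a minimal sketch that would reproduce it, assuming that same md dictionary, is:
df=pd.DataFrame(md)
print(df.sum())   # the string column is concatenated, the numeric columns are summed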
Output:
Name RamShamMeenaSeetaGeetaRakeshMadhav
Age 181
Rating 25.61
dtype: object
import pandas as pd
import numpy as np
md={'Name':pd.Series(['Ram','Sham','Meena','Seeta','Geeta','Rakesh','Madhav']),
'Age':pd.Series([25,26,25,23,30,29,23]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
df=pd.DataFrame(md)
print(df.describe())
Output:
Age Rating
count 7.000000 7.000000
mean 25.857143 3.658571
std 2.734262 0.698628
min 23.000000 2.560000
25% 24.000000 3.220000
50% 25.000000 3.800000
75% 27.500000 4.105000
max 30.000000 4.600000
import numpy as np
data=np.array([13,52,44,32,30,0,36,45])
print("Standard Deviation of sample is %s"%(np.std(data)))
Output:
Standard Deviation of sample is 16.263455967290593
Virat Rohit
92 89
97 87
85 67
74 55
71 47
55 72
85 76
63 79
42 44
32 92
71 99
55 47
import pandas as pd
import scipy.stats as s
score={'Virat':[92,97,85,74,71,55,85,63,42,32,71,55],'Rohit':
[89,87,67,55,47,72,76,79,44,92,99,47]}
df=pd.DataFrame(score)
print(df)
print("\nArithmetic Mean Values")
print("Score 1",s.tmean(df["Virat"]).round(2))
print("Score 2",s.tmean(df["Rohit"]).round(2))
Output:
Virat Rohit
0 92 89
1 97 87
2 85 67
3 74 55
4 71 47
5 55 72
6 85 76
7 63 79
8 42 44
9 32 92
10 71 99
11 55 47
import numpy as np
mydata=np.array([24,29,20,22,24,26,27,30,20,31,26,38,44,47])
q3,q1=np.percentile(mydata,[75,25])
iqrvalue=q3-q1
print(iqrvalue)
Output:
6.75
Q7) Write a Python program to find the maximum value of a given flattened array:
import numpy as np
arr=np.array([[25,26,45],[12,36,42],[8,50,65]])
print("\n Original flattened Array:\n",arr)
flat=arr.flatten()                     # flattened 1-D copy of arr
max_value=np.max(flat)                 # avoid shadowing the built-in max
print("\n Maximum value of flattened array:\n",max_value)
min_value=np.min(flat)                 # avoid shadowing the built-in min
print("\n Minimum value of flattened array:\n",min_value)
Output:
Original flattened Array:
[[25 26 45]
[12 36 42]
[ 8 50 65]]
Q8) Write a Python program to compute the Euclidean distance between two data points in a dataset.
import numpy as np
point1= np.array((1,2,3))
point2= np.array((1,1,1))
dist = np.linalg.norm(point1 - point2)
print("Euclidian Distance between two pints: ",dist)
Output:
Euclidean distance between two points: 2.23606797749979
Q.9: Create a dataframe of data values. Find the mean, range, and IQR for this data.
import pandas as pd
import numpy as np
import scipy.stats as s
d = [32,36,46,47,56,69,75,79,79,88,89,91,92,93,96,97,101,105,112,116]
data=pd.DataFrame(d)
print("\n DataFrame:\n",data)
print("\n Mean of dataframe : ",s.tmean(data))
data_range = np.max(data)-np.min(data)
print("\n Range of dataframe : ",data_range)
Q1 = np.median(data[:10])
Q3 = np.median(data[10:])
IQR = Q3 - Q1
print("\n Inter Quartile Range (IQR) of dataframe : ",IQR)
Output:
DataFrame:
0
0 32
1 36
2 46
3 47
4 56
5 69
6 75
7 79
8 79
9 88
10 89
11 91
12 92
13 93
14 96
15 97
16 101
17 105
18 112
19 116
Range of dataframe : 0 84
dtype: int64
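The Manhattan-distance result below has no question or code listed with it. A minimal sketch of how the sum of Manhattan (L1) distances over all pairs of points could be computed; the actual points behind the value 18 are not shown, so the points used here are hypothetical:
import numpy as np
points = np.array([[1, 2], [3, 4], [5, 6]])   # hypothetical data, not the original points
total = 0
for i in range(len(points)):
    for j in range(i + 1, len(points)):
        total += np.sum(np.abs(points[i] - points[j]))   # L1 distance between pair (i, j)
print("sum of Manhattan distance between all pairs of points is =", total)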
Output:
sum of Manhattan distance between all pairs of points is = 18
Q.12: Create a dataframe of students' information such as name, graduation percentage, and age. Display the average age of students and the average graduation percentage. Also describe all basic statistics of the data. (Hint: use describe().)
import pandas as pd
import scipy.stats as s
data={'Name':['sharif','shoaib','nafisa','alim'],'Age':[20,22,23,21],
'perc':[65.2,78.4,78.6,74.5]}
df=pd.DataFrame(data)
print("\n Average Age :",sum(df['Age']/len(df['Age'])))
print("\n Average Percentage : ",sum(df['perc']/len(df['perc'])))
print("\n Basic Stastistics of data :\n",df.describe())
Output:
Average Age : 21.5
Average Percentage : 74.17500000000001
Q.11: Write a NumPy program to compute the histogram of nums against bins.
Sample Output:
nums:[0.5 0.7 1.0 1.2 1.3 2.1]
bins:[0 1 2 3].
import numpy as np
import matplotlib.pyplot as plt
nums = np.array([0.5,0.7,1.0,1.2,1.3,2.1])
bins = np.array([0,1,2,3])
print("nums: ",nums)
print("bins: ",bins)
print("Result:", np.histogram(nums, bins))
plt.hist(nums, bins=bins)
plt.show()
Output:
nums: [0.5 0.7 1. 1.2 1.3 2.1]
bins: [0 1 2 3]
Result: (array([2, 3, 1]), array([0, 1, 2, 3]))
Assignment 2: SET-B
Q1) Download the Iris dataset file. Read this CSV file using the read_csv() function. Take a sample from the entire dataset. Display the maximum and minimum values of all numeric attributes.
import pandas as pd
data=pd.read_csv("Iris.csv")
df=pd.DataFrame(data)
sample=df.sample()
print("\n Sample from Dataset:\n",sample)
print("\n Maximum of sepal Length:",max(df['SepalLengthCm']))
print("\n Minimum of sepal Length:",min(df['SepalLengthCm']))
print("\n Maximum of sepal Width:",max(df['SepalWidthCm']))
print("\n Minimum of sepal Width:",min(df['SepalWidthCm']))
Output:
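The question asks for all numeric attributes, while the code above covers only the sepal columns; the petal columns could be handled the same way (a sketch, assuming the standard Iris column names PetalLengthCm and PetalWidthCm):
print("\n Maximum of petal Length:",max(df['PetalLengthCm']))
print("\n Minimum of petal Length:",min(df['PetalLengthCm']))
print("\n Maximum of petal Width:",max(df['PetalWidthCm']))
print("\n Minimum of petal Width:",min(df['PetalWidthCm']))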
Q.2: Continue with the above dataset; find the number of records for each distinct value of the class attribute. Consider the entire dataset and not the sample.
import pandas as pd
data=pd.read_csv("Iris.csv")
df=pd.DataFrame(data)
print("\n DataFrame:\n",df)
cnt=df['Species'].value_counts()
print("\n number of records for each distinct value of species attribute:\n",cnt)
Output:
Q.3: Display the column-wise mean and median for the Iris dataset. (Hint: Use the mean() and median() functions of the pandas dataframe.)
import pandas as pd
import scipy.stats as s
import statistics as st
data=pd.read_csv('Iris.csv')
df=pd.DataFrame(data)
print("\n DataFrame :\n",df)
print("\n Mean of sepal Length:", s.tmean(df['SepalLengthCm']))
print("\n Median of sepal Length:", st.median(df['SepalLengthCm']))
print("\n Mean of sepal Width:", s.tmean(df['SepalWidthCm']))
print("\n Median of sepal Width:", st.median(df['SepalWidthCm']))
OUTPUT:
DataFrame :
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \
0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2
.. ... ... ... ... ...
145 146 6.7 3.0 5.2 2.3
146 147 6.3 2.5 5.0 1.9
147 148 6.5 3.0 5.2 2.0
148 149 6.2 3.4 5.4 2.3
149 150 5.9 3.0 5.1 1.8
Species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
.. ...
145 Iris-virginica
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica
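The hint points at pandas' own mean() and median(); a column-wise version using them could look like this (a sketch; numeric_only=True is assumed here so the Species column is skipped):
print("\n Column-wise mean:\n", df.mean(numeric_only=True))
print("\n Column-wise median:\n", df.median(numeric_only=True))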
import pandas as pd
data=pd.read_csv("Data.csv")
df=pd.DataFrame(data)
print("\n Describing Dataset:\n",df.describe())
print("\n Shape of Dataset:\n",df.shape)
print("\n First three rows of Dataset:\n",df.head(3))
OUTPUT:
Describing Dataset:
Age Salary
count 9.000000 9.000000
mean 38.777778 63777.777778
std 7.693793 12265.579662
min 27.000000 48000.000000
25% 35.000000 54000.000000
50% 38.000000 61000.000000
75% 44.000000 72000.000000
max 50.000000 83000.000000
Shape of Dataset:
(10, 4)
import pandas as pd
data=pd.read_csv("Data.csv")
df=pd.DataFrame(data)
print("\n Displaying Dataset:\n",df)
data['Salary']= data['Salary'].fillna(data['Salary'].mean())
data['Age']= data['Age'].fillna(data['Age'].mean())
print("\n ****** Modified Dataset ******\n",df)
OUTPUT:
Displaying Dataset:
Country Age Salary Purchased
0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No
3 Spain 38.0 61000.0 No
4 Germany 40.0 NaN Yes
5 France 35.0 58000.0 Yes
6 Spain NaN 52000.0 No
7 France 48.0 79000.0 Yes
8 Germany 50.0 83000.0 No
9 France 37.0 67000.0 Yes
Q3) Data.csv has two categorical columns (the Country column and the Purchased column).
a. Apply one-hot encoding on the Country column.
b. Apply label encoding on the Purchased column.
import pandas as pd
from sklearn import preprocessing
data=pd.read_csv("Data.csv")
df=pd.DataFrame(data)
print("\n Describing Dataset:\n",df)
one_hot_encoded_data = pd.get_dummies(data, columns = ['Country'])
print("\n *******After applying OneHot coding on Country*******\n",one_hot_encoded_data)
label_encoder = preprocessing.LabelEncoder()
df['Purchased']= label_encoder.fit_transform(df['Purchased'])
df['Purchased'].unique()
print("\n *******After applying OneHot coding on Country**********\n",df)
OUTPUT:
Describing Dataset:
Country Age Salary Purchased
0 France 44.0 72000.0 No
1 Spain 27.0 48000.0 Yes
2 Germany 30.0 54000.0 No
3 Spain 38.0 61000.0 No
4 Germany 40.0 NaN Yes
5 France 35.0 58000.0 Yes
6 Spain NaN 52000.0 No
7 France 48.0 79000.0 Yes
8 Germany 50.0 83000.0 No
9 France 37.0 67000.0 Yes
import pandas as pd
data=pd.read_csv("winequality_red.csv")
df=pd.DataFrame(data)
print("\n Dataset is : \n",df)
OUTPUT:
Dataset is :
fixed acidity volatile acidity citric acid residual sugar chlorides \
0 7.4 0.700 0.00 1.9 0.076
1 7.8 0.880 0.00 2.6 0.098
2 7.8 0.760 0.04 2.3 0.092
3 11.2 0.280 0.56 1.9 0.075
4 7.4 0.700 0.00 1.9 0.076
... ... ... ... ... ...
1594 6.2 0.600 0.08 2.0 0.090
1595 5.9 0.550 0.10 2.2 0.062
1596 6.3 0.510 0.13 2.3 0.076
1597 5.9 0.645 0.12 2.0 0.075
1598 6.0 0.310 0.47 3.6 0.067
alcohol quality
0 9.4 5
1 9.8 5
2 9.8 5
3 9.8 6
4 9.4 5
... ... ...
1594 10.5 5
1595 11.2 6
1596 11.0 6
1597 10.2 5
1598 11.0 6
OUTPUT:
alcohol quality
0 9.4 5
1 9.8 5
2 9.8 5
3 9.8 6
4 9.4 5
... ... ...
1594 10.5 5
1595 11.2 6
1596 11.0 6
1597 10.2 5
1598 11.0 6
Q.3) Standardizing Data (transform the data into a standard Gaussian distribution with a mean of 0 and a standard deviation of 1).
import pandas as pd
from sklearn import preprocessing
data=pd.read_csv("winequality-red.csv",sep=';')
df=pd.DataFrame(data)
print("\n Dataset is : \n",df)
standard = preprocessing.scale(df)
print("\n *********Standardized Data*********\n",standard)
OUTPUT:
Dataset is :
fixed acidity volatile acidity citric acid residual sugar chlorides \
0 7.4 0.700 0.00 1.9 0.076
1 7.8 0.880 0.00 2.6 0.098
2 7.8 0.760 0.04 2.3 0.092
3 11.2 0.280 0.56 1.9 0.075
4 7.4 0.700 0.00 1.9 0.076
... ... ... ... ... ...
1594 6.2 0.600 0.08 2.0 0.090
1595 5.9 0.550 0.10 2.2 0.062
1596 6.3 0.510 0.13 2.3 0.076
1597 5.9 0.645 0.12 2.0 0.075
1598 6.0 0.310 0.47 3.6 0.067
alcohol quality
0 9.4 5
1 9.8 5
2 9.8 5
3 9.8 6
4 9.4 5
... ... ...
1594 10.5 5
1595 11.2 6
1596 11.0 6
1597 10.2 5
1598 11.0 6
*********Standardized Data*********
[[-0.52835961 0.96187667 -1.39147228 ... -0.57920652 -0.96024611
-0.78782264]
[-0.29854743 1.96744245 -1.39147228 ... 0.1289504 -0.58477711
-0.78782264]
[-0.29854743 1.29706527 -1.18607043 ... -0.04808883 -0.58477711
-0.78782264]
...
[-1.1603431 -0.09955388 -0.72391627 ... 0.54204194 0.54162988
0.45084835]
[-1.39015528 0.65462046 -0.77526673 ... 0.30598963 -0.20930812
-0.78782264]
[-1.33270223 -1.21684919 1.02199944 ... 0.01092425 0.54162988
0.45084835]]
Q.4) Normalizing Data (rescale each observation to a length of 1, a unit norm; for this, use the Normalizer class).
import pandas as pd
import numpy as np
from sklearn import preprocessing
data=pd.read_csv("winequality-red.csv",sep=';')
df=pd.DataFrame(data)
print("\n Dataset is : \n",df)
normalized = preprocessing.normalize(df,norm='l2')
print("\n***********Normalized Data***********\n",normalized)
OUTPUT:
Dataset is :
fixed acidity volatile acidity citric acid residual sugar chlorides \
0 7.4 0.700 0.00 1.9 0.076
1 7.8 0.880 0.00 2.6 0.098
2 7.8 0.760 0.04 2.3 0.092
3 11.2 0.280 0.56 1.9 0.075
4 7.4 0.700 0.00 1.9 0.076
... ... ... ... ... ...
1594 6.2 0.600 0.08 2.0 0.090
1595 5.9 0.550 0.10 2.2 0.062
1596 6.3 0.510 0.13 2.3 0.076
1597 5.9 0.645 0.12 2.0 0.075
1598 6.0 0.310 0.47 3.6 0.067
alcohol quality
0 9.4 5
1 9.8 5
2 9.8 5
3 9.8 6
4 9.4 5
... ... ...
1594 10.5 5
1595 11.2 6
1596 11.0 6
1597 10.2 5
1598 11.0 6
***********Normalized Data***********
[[0.19347777 0.01830195 0. ... 0.01464156 0.24576906 0.13072822]
[0.10698874 0.01207052 0. ... 0.00932722 0.13442175 0.06858252]
[0.13494887 0.01314886 0.00069205 ... 0.01124574 0.16955114 0.08650569]
...
[0.1222319 0.00989496 0.00252225 ... 0.01455142 0.21342078 0.11641133]
[0.10524769 0.01150589 0.00214063 ... 0.0126654 0.18195363 0.08919296]
[0.12491328 0.00645385 0.00978487 ... 0.01374046 0.22900768 0.12491328]]
Q.5) Binarizing Data using the Binarizer class (using a binary threshold, it is possible to transform our data by marking the values above it as 1 and those equal to or below it as 0).
import pandas as pd
from sklearn.preprocessing import Binarizer
data_set = pd.read_csv('Data.csv')
age = data_set['Age'].values
salary = data_set['Salary'].values
print("\n Original age data values : \n",age)
print("\n Original salary data values : \n",salary)
x = age.reshape(1,-1)        # Binarizer expects a 2-D array
y = salary.reshape(1,-1)
binarizer_1 = Binarizer(threshold=35)       # ages above 35 -> 1, others -> 0
binarizer_2 = Binarizer(threshold=61000)    # salaries above 61000 -> 1, others -> 0
print("\n Binarized age : \n", binarizer_1.fit_transform(x))
print("\n Binarized salary : \n", binarizer_2.fit_transform(y))
OUTPUT:
Original age data values :
[44 27 30 38 40 35 58 48 50 37]
Binarized age :
[[1 0 0 1 1 0 1 1 1 1]]
Binarized salary :
[[1 0 0 0 0 0 0 1 1 1]]