Manual
9. Use a case study on a data set and apply the various EDA and visualization techniques and
present an analysis report.
EXP NO: 2
Perform Exploratory Data Analysis on Email Data Set
DATE:
AIM
To perform exploratory data analysis (EDA) on an email data set: export all your emails as a
dataset, import them into a pandas data frame, visualize them, and draw insights from the
data.
PROGRAM
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('D:/SIBIYA/DEV LAB/emails.csv')
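# Note: .info is referenced without parentheses, so this prints the bound method's
# repr (which embeds the first five rows); call df.info() for a column summary
print(df.head().info)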
# Feature Selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, roc_curve, auc, confusion_matrix
from sklearn.svm import SVC
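# Assumption: the split and model-fitting steps are missing from this listing.
# A Multinomial Naive Bayes fit on an 80/20 split is assumed here, using the
# column names shown in OUTPUT ('Email No.' and 'Prediction').
X = df.drop(columns=['Email No.', 'Prediction'])
y = df['Prediction']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = MultinomialNB()
model.fit(X_train, y_train)
predicted_labels = model.predict(X_test)
probs = model.predict_proba(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, predicted_labels)
print("Accuracy:", accuracy)
# Compute the ROC curve and AUC score
fpr, tpr, thresholds = roc_curve(y_test, probs[:, 1])
auc_score = auc(fpr, tpr)
print("AUC Score:", auc_score)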
# Using Support Vector Classifier (SVC) from scikit-learn
# Example feature vectors (X) and labels (y)
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
y = np.array(['spam', 'ham', 'spam', 'ham', 'spam'])
OUTPUT
<bound method DataFrame.info of Email No. the to ect and ... military allowing ff dry Prediction
0 Email 1 0 0 1 0 ... 0 0 0 0 0
1 Email 2 8 13 24 6 ... 0 0 1 0 0
2 Email 3 0 0 1 0 ... 0 0 0 0 0
3 Email 4 0 5 22 0 ... 0 0 0 0 0
4 Email 5 7 6 17 1 ... 0 0 1 0 0
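The AIM asks for insights from the data, while the listing only prints the first rows. A minimal sketch of two basic EDA summaries follows, assuming the layout visible in the OUTPUT above (an 'Email No.' identifier, one count column per word, and a 'Prediction' label). Only the five rows and four word columns visible in that output are reproduced here, so the numbers are illustrative rather than results for the full data set.

```python
import pandas as pd

# First five rows and four word columns copied from the OUTPUT above
df = pd.DataFrame({
    'Email No.': ['Email 1', 'Email 2', 'Email 3', 'Email 4', 'Email 5'],
    'the': [0, 8, 0, 0, 7],
    'to': [0, 13, 0, 5, 6],
    'ect': [1, 24, 1, 22, 17],
    'and': [0, 6, 0, 0, 1],
    'Prediction': [0, 0, 0, 0, 0],
})

# Word-count columns are everything except the identifier and the label
word_counts = df.drop(columns=['Email No.', 'Prediction'])

# Total occurrences of each word, most frequent first
top_words = word_counts.sum().sort_values(ascending=False)
print(top_words)

# Number of emails per class label
class_counts = df['Prediction'].value_counts()
print(class_counts)
```

On the full data set, top_words.head(20).plot(kind='bar') would give a quick view of the most frequent words.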
AIM
To work with NumPy arrays, Pandas data frames, and basic plots using Matplotlib.
NUMPY
PROGRAM
Write a NumPy program to create a null vector of size 10 and update sixth value to 11
import numpy as np
x = np.zeros(10)
print(x)
print("Update sixth value to 11")
x[5] = 11  # index 5 is the sixth element (indexing starts at 0)
print(x)
OUTPUT
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Update sixth value to 11
[ 0. 0. 0. 0. 0. 11. 0. 0. 0. 0.]
Write a NumPy program to create a 3x3 matrix with values ranging from 2 to 10
import numpy as np
x = np.arange(2, 11).reshape(3,3)
print(x)
OUTPUT
[[ 2 3 4]
[ 5 6 7]
[ 8 9 10]]
Write a NumPy program to convert an array to a float type
import numpy as np
x= np.array([[12, 12], [2, 7], [25, 36]])
print("Original array elements:")
print(x)
print("Convert to float values :")
print(x.astype(float))
OUTPUT
Original array elements:
[[12 12]
[ 2 7]
[25 36]]
Convert to float values :
[[12. 12.]
[ 2. 7.]
[25. 36.]]
Write a NumPy program to find the real and imaginary parts of an array of complex numbers
import numpy as np
x = np.sqrt([1+0j])
y = np.sqrt([0+1j])
print("Original array:x ",x)
print("Original array:y ",y)
print("Real part of the array:")
print(x.real)
print(y.real)
print("Imaginary part of the array:")
print(x.imag)
print(y.imag)
OUTPUT
Original array:x [1.+0.j]
Original array:y [0.70710678+0.70710678j]
Real part of the array:
[1.]
[0.70710678]
Imaginary part of the array:
[0.]
[0.70710678]
PANDAS
PROGRAM
Write a Pandas program to get the powers of array values element-wise. Note: here every element
of the data frame is raised to the same power, 2.
import pandas as pd
import numpy as np
df = pd.DataFrame({'X':[78,85,96,80,86], 'Y':[84,94,89,83,86],'Z':[86,97,96,72,83]})
print("Original array")
print(df)
print("First array elements raised to powers from second array, element-wise:")
print(np.power(df, 2))
OUTPUT
Original array
X Y Z
0 78 84 86
1 85 94 97
2 96 89 96
3 80 83 72
4 86 86 83
First array elements raised to powers from second array, element-wise:
X Y Z
0 6084 7056 7396
1 7225 8836 9409
2 9216 7921 9216
3 6400 6889 5184
4 7396 7396 6889
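The printed caption speaks of raising the first array to powers taken from a second array, whereas the program squares every element. A minimal sketch of the two-array case follows. The values in bases and exponents are illustrative assumptions, not taken from the exercise.

```python
import numpy as np
import pandas as pd

# Assumption: small illustrative values, not from the original exercise
bases = pd.Series([2, 3, 4])
exponents = pd.Series([3, 2, 1])

# Each base is raised to the exponent at the same position
result = np.power(bases, exponents)
print(result.tolist())  # [8, 9, 4]
```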
Write a Pandas program to select the specified columns and rows from a given data frame.
Select the 'score' and 'qualify' columns (column positions 1 and 3) in rows 1, 3, 5, 6 of the
following data frame, built from the exam_data dictionary and the labels list.
import pandas as pd
import numpy as np
exam_data = {
'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data, index=labels)
print("Select specific columns and rows:")
print(df.iloc[[1, 3, 5, 6], [1, 3]])
OUTPUT
Select specific columns and rows:
score qualify
b 9.0 no
d NaN no
f 20.0 yes
g 14.5 yes
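Position-based selection with iloc requires counting columns, which makes it easy to pick the wrong ones. A sketch of the label-based alternative with loc follows, assuming the goal is to pick the 'name' and 'score' columns for the same four rows (labels b, d, f, g).

```python
import numpy as np
import pandas as pd

exam_data = {
    'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily',
             'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
    'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
    'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes'],
}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data, index=labels)

# Rows are chosen by index label and columns by name, so no counting is needed
selected = df.loc[['b', 'd', 'f', 'g'], ['name', 'score']]
print(selected)
```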
Write a Pandas program to count the number of rows and columns of a DataFrame. Sample
Python dictionary data and list labels:
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura',
'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
total_rows=len(df.axes[0])
total_cols=len(df.axes[1])
print("Number of Rows: "+str(total_rows))
print("Number of Columns: "+str(total_cols))
OUTPUT
Number of Rows: 10
Number of Columns: 4
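The same counts are also available from the shape attribute of a data frame, which returns a (rows, columns) tuple. A short sketch on a reduced, illustrative frame:

```python
import pandas as pd

# Assumption: a three-row subset of the exam data, for brevity
df = pd.DataFrame({'name': ['Anastasia', 'Dima', 'Katherine'],
                   'attempts': [1, 3, 2]})

# shape is a (number of rows, number of columns) tuple
total_rows, total_cols = df.shape
print("Number of Rows: " + str(total_rows))
print("Number of Columns: " + str(total_cols))
```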
Write a Pandas program to group by the first column and get second column as lists in rows
import pandas as pd
df = pd.DataFrame( {'col1':['C1','C1','C2','C2','C2','C3','C2'], 'col2':[1,2,3,3,4,6,5]})
print("Original DataFrame")
print(df)
df = df.groupby('col1')['col2'].apply(list)
print("\nGroup on the col1:")
print(df)
OUTPUT
Original DataFrame
col1 col2
0 C1 1
1 C1 2
2 C2 3
3 C2 3
4 C2 4
5 C3 6
6 C2 5
Group on the col1:
col1
C1 [1, 2]
C2 [3, 3, 4, 5]
C3 [6]
Name: col2, dtype: object
Write a Pandas program to check whether a given column is present in a Data Frame or
not.
import pandas as pd
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)
print("Original DataFrame")
print(df)
if 'col4' in df.columns:
    print("Col4 is present in DataFrame.")
else:
    print("Col4 is not present in DataFrame.")
if 'col1' in df.columns:
    print("Col1 is present in DataFrame.")
else:
    print("Col1 is not present in DataFrame.")
OUTPUT
Original DataFrame
col1 col2 col3
0 1 4 7
1 2 5 8
2 3 6 12
3 4 9 1
4 7 5 11
Col4 is not present in DataFrame.
Col1 is present in DataFrame.
Write a Pandas program to select the rows where the number of attempts in the examination is
greater than 2.
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura',
'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts' : [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print("Number of attempts in the examination is greater than 2:")
print(df[df['attempts'] > 2])
OUTPUT
Number of attempts in the examination is greater than 2:
name score attempts qualify
b Dima 9.0 3 no
d James NaN 3 no
f Michael 20.0 3 yes
Write a Pandas program to select the rows where the score is missing, i.e. is NaN.
import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura',
'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
print("Rows where score is missing:")
print(df[df['score'].isnull()])
OUTPUT
Rows where score is missing:
name score attempts qualify
d James NaN 3 no
h Laura NaN 1 no
MATPLOTLIB
PROGRAM
MATPLOTLIB-1
import matplotlib.pyplot as plt
x = [1,2,3]
y = [2,4,1]
plt.plot(x, y)
plt.xlabel('x - axis')
plt.ylabel('y - axis')
plt.title('My first graph!')
plt.show()
OUTPUT
MATPLOTLIB-2
import matplotlib.pyplot as plt
a = [1, 2, 3, 4, 5]
b = [0, 0.6, 0.2, 15, 10, 8, 16, 21]
plt.plot(a)
# o is for circles and r is
# for red
plt.plot(b, "or")
plt.plot(list(range(0, 22, 3)))
# naming the x-axis
plt.xlabel('Day ->')
# naming the y-axis
plt.ylabel('Temp ->')
c = [4, 2, 6, 8, 3, 20, 13, 15]
plt.plot(c, label = '4th Rep')
# get current axes command
ax = plt.gca()
# get command over the individual
# boundary line of the graph body
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
# set the range or the bounds of
# the left boundary line to fixed range
ax.spines['left'].set_bounds(-3, 40)
# set the interval by which
# the x-axis set the marks
plt.xticks(list(range(-3, 10)))
# set the intervals by which y-axis
# set the marks
plt.yticks(list(range(-3, 20, 3)))
# legend denotes that what color
# signifies what
ax.legend(['1st Rep', '2nd Rep', '3rd Rep', '4th Rep'])
# annotate command helps to write
# ON THE GRAPH any text xy denotes
# the position on the graph
plt.annotate('Temperature V / s Days', xy = (1.01, -2.15))
# gives a title to the Graph
plt.title('All Features Discussed')
plt.show()
OUTPUT
MATPLOTLIB-3
import matplotlib.pyplot as plt
x = ['Java', 'Python', 'PHP', 'JavaScript', 'C#', 'C++']
popularity = [22.2, 17.6, 8.8, 8, 7.7, 6.7]
x_pos = [i for i, _ in enumerate(x)]
plt.bar(x_pos, popularity, color='blue')
plt.xlabel("Languages")
plt.ylabel("Popularity")
plt.title("Popularity of Programming Language\n" + "Worldwide, Oct 2017 compared to a year ago")
plt.xticks(x_pos, x)
# Turn on the grid
plt.minorticks_on()
plt.grid(which='major', linestyle='-', linewidth='0.5', color='red')
# Customize the minor grid
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()
OUTPUT
MATPLOTLIB-4
import matplotlib.pyplot as plt
days = [1, 2, 3, 4, 5]
sleeping = [7, 8, 6, 11, 7]
eating = [2, 3, 4, 3, 2]
working = [7, 8, 7, 2, 2]
playing = [8, 5, 7, 8, 13]
plt.plot([], [], color='m', label='Sleeping', linewidth=5)
plt.plot([], [], color='c', label='Eating', linewidth=5)
plt.plot([], [], color='r', label='Working', linewidth=5)
plt.plot([], [], color='k', label='Playing', linewidth=5)
plt.stackplot(days, sleeping, eating, working, playing, colors=['m', 'c', 'r', 'k'])
plt.xlabel('x')
plt.ylabel('y')
plt.title('Stack Plot')
plt.legend()
plt.show()
OUTPUT
RESULT
Thus, the programs on NumPy arrays, Pandas data frames, and basic plots using Matplotlib are
executed and verified successfully.
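The stack plot in MATPLOTLIB-4 draws four empty lines only to create legend entries. A sketch of an alternative follows, assuming the labels argument of stackplot is used instead. The Agg backend line is an assumption so that the sketch also runs without a display; it can be removed for interactive use.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5]
sleeping = [7, 8, 6, 11, 7]
eating = [2, 3, 4, 3, 2]
working = [7, 8, 7, 2, 2]
playing = [8, 5, 7, 8, 13]

fig, ax = plt.subplots()
# labels= attaches a legend entry to each stacked area directly
areas = ax.stackplot(days, sleeping, eating, working, playing,
                     labels=['Sleeping', 'Eating', 'Working', 'Playing'],
                     colors=['m', 'c', 'r', 'k'])
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Stack Plot')
legend = ax.legend(loc='upper left')

# Collect the legend text to confirm each area received its label
legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)
```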
EXP NO: 4
Variable and Row Filters in R
DATE:
AIM
To explore various variable and row filters in R for cleaning data, and to apply various plot
features in R on sample data sets and visualize them.
PROGRAM
OUTPUT
RESULT
Thus, variable and row filters for cleaning data are explored, and plot features in R are
applied and executed successfully.
EXP NO: 5
Perform Time Series Analysis and Apply the Visualization Techniques
DATE:
AIM
To perform Time Series Analysis and apply the various visualization techniques.
PROGRAM
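import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset. The file name below is a placeholder (the original listing
# does not show it); the data must contain 'Open' and 'Low' columns.
df = pd.read_csv('stock_data.csv')
# Display the first five rows of the dataset
print(df.head())
# Plot the lag-2 difference of the "Low" column
df.Low.diff(2).plot(figsize=(6, 6))
plt.show()
# Find the trend in the "Open" column using the moving average method
window_size = 50
rolling_mean = df['Open'].rolling(window_size).mean()
rolling_mean.plot()
plt.show()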
OUTPUT
Displaying The First Five Rows of Dataset
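To see what diff and rolling compute, here is a minimal sketch on a short synthetic series. The dates and prices are made-up assumptions standing in for the 'Open' column, and a window of 3 replaces the 50 used in the program so the result can be checked by hand.

```python
import numpy as np
import pandas as pd

# Assumption: a short synthetic price series stands in for the 'Open' column
dates = pd.date_range('2024-01-01', periods=6, freq='D')
prices = pd.Series([10.0, 12.0, 11.0, 13.0, 15.0, 14.0], index=dates, name='Open')

# Lag-2 difference: value at t minus value at t-2 (first two entries are NaN)
diff2 = prices.diff(2)

# 3-point moving average (first two entries are NaN until the window fills)
rolling_mean = prices.rolling(3).mean()

print(diff2)
print(rolling_mean)
```

The leading NaN values explain why a plotted moving average starts later than the raw series: with window_size = 50, the first 49 points are undefined.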
AIM
To perform data analysis and representation on a map using various map data sets, with mouse
rollover effects and user interaction.
PROGRAM
import pandas as pd
import geopandas as gpd
import math
import matplotlib.pyplot as plt
import folium
from folium import Choropleth, Circle, Marker
from folium.plugins import HeatMap, MarkerCluster
# Create a map
m_1 = folium.Map(location=[42.32, -71.0589], tiles='openstreetmap', zoom_start=10)
# Save the map to an HTML file (open it in a browser to view)
m_1.save("D:/SIBIYA/DEV LAB/map_1.html")
# Load the data
crimes = pd.read_csv("D:/SIBIYA/DEV LAB/crime.csv", encoding='latin-1')
# Drop rows with missing locations
crimes.dropna(subset=['Lat', 'Long', 'DISTRICT'], inplace=True)
# Focus on major crimes in 2018
crimes = crimes[crimes.OFFENSE_CODE_GROUP.isin([
'Larceny', 'Auto Theft', 'Robbery', 'Larceny From Motor Vehicle', 'Residential Burglary',
'Simple Assault', 'Harassment', 'Ballistics', 'Aggravated Assault', 'Other Burglary',
'Arson', 'Commercial Burglary', 'HOME INVASION', 'Homicide', 'Criminal Harassment',
'Manslaughter'])]
crimes = crimes[crimes.YEAR >= 2018]
# Print the first five rows of the table
print(crimes.head())
daytime_robberies = crimes[((crimes.OFFENSE_CODE_GROUP == 'Robbery') & \
(crimes.HOUR.isin(range(9, 18))))]
# Create a map
m_2 = folium.Map(location=[42.32, -71.0589], tiles='cartodbpositron', zoom_start=13)
# Add points to the map
for idx, row in daytime_robberies.iterrows():
    Marker([row['Lat'], row['Long']]).add_to(m_2)
# Save the map to an HTML file
m_2.save("D:/SIBIYA/DEV LAB/map_2.html")
# Create the map
m_3 = folium.Map(location=[42.32, -71.0589], tiles='cartodbpositron', zoom_start=13)
# Add points to the map
mc = MarkerCluster()
for idx, row in daytime_robberies.iterrows():
    if not math.isnan(row['Long']) and not math.isnan(row['Lat']):
        mc.add_child(Marker([row['Lat'], row['Long']]))
m_3.add_child(mc)
# Save the map to an HTML file
m_3.save("D:/SIBIYA/DEV LAB/map_3.html")
# Create a base map
m_4 = folium.Map(location=[42.32, -71.0589], tiles='cartodbpositron', zoom_start=13)
def color_producer(val):
    # Green for robberies at or before noon, dark red for the afternoon
    if val <= 12:
        return 'forestgreen'
    else:
        return 'darkred'
# Add a bubble map to the base map
for i in range(0, len(daytime_robberies)):
    Circle(
        location=[daytime_robberies.iloc[i]['Lat'], daytime_robberies.iloc[i]['Long']],
        radius=20,
        color=color_producer(daytime_robberies.iloc[i]['HOUR'])).add_to(m_4)
# Save the map to an HTML file
m_4.save("D:/SIBIYA/DEV LAB/map_4.html")
OUTPUT
RESULT
Thus, the Data Analysis and representation on a Map using various Map data sets is executed
and verified successfully.
EXP NO: 7
Build Cartographic Visualization
DATE:
AIM
To build cartographic visualization for multiple datasets involving various countries of the
world; states and districts in India.
PROGRAM
# Import Libraries
import pandas as pd
import geopandas
import folium
import geodatasets
import matplotlib.pyplot as plt
from folium import plugins
# Iterate through list and add a marker for each volcano, color-coded by its type.
i = 0
for coordinates in geo_df_list:
    # assign a color marker for the type of volcano, Strato being the most common
    if geo_df.Type[i] == "Stratovolcano":
        type_color = "green"
    elif geo_df.Type[i] == "Complex volcano":
        type_color = "blue"
    elif geo_df.Type[i] == "Shield volcano":
        type_color = "orange"
    elif geo_df.Type[i] == "Lava dome":
        type_color = "pink"
    else:
        type_color = "purple"
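The if/elif chain above maps a volcano type to a marker colour. The same mapping can be sketched as a dictionary lookup with a default. The helper name color_for and the sample types are hypothetical, introduced only for illustration.

```python
# Colours taken from the if/elif chain above
TYPE_COLORS = {
    "Stratovolcano": "green",
    "Complex volcano": "blue",
    "Shield volcano": "orange",
    "Lava dome": "pink",
}

def color_for(volcano_type):
    """Return the marker colour for a volcano type, defaulting to purple."""
    return TYPE_COLORS.get(volcano_type, "purple")

# Hypothetical sample types for illustration
sample_types = ["Stratovolcano", "Lava dome", "Caldera"]
colors = [color_for(t) for t in sample_types]
print(colors)  # ['green', 'pink', 'purple']
```

A lookup table keeps the colour scheme in one place, so adding a new volcano type needs one new entry rather than another elif branch.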
AIM
To perform EDA on Wine Quality Data Set.
PROGRAM
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('D:/SIBIYA/DEV LAB/winequality.csv')
print(df.head())
print(df.info())
print(df.describe().T)
print(df.isnull().sum())
for col in df.columns:
if df[col].isnull().sum() > 0:
df[col] = df[col].fillna(df[col].mean())
print(df.isnull().sum().sum())
df.hist(bins=20, figsize=(10, 10))
plt.show()
plt.bar(df['quality'], df['alcohol'])
plt.xlabel('quality')
plt.ylabel('alcohol')
plt.show()
plt.figure(figsize=(12, 12))
sb.heatmap(df.corr() > 0.7, annot=True, cbar=False)
plt.show()
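The heatmap above is drawn from df.corr() > 0.7, a True/False table rather than the correlation values themselves. A minimal sketch of what that mask contains follows, together with a per-quality mean of alcohol as an alternative summary to the bar chart. The small table is a made-up assumption with wine-like column names, not the real data set.

```python
import pandas as pd

# Assumption: a tiny made-up table with wine-like column names
df = pd.DataFrame({
    'alcohol': [9.0, 10.0, 11.0, 12.0, 13.0],
    'density': [1.000, 0.998, 0.996, 0.994, 0.992],
    'quality': [5, 5, 6, 7, 7],
})

corr = df.corr()

# Boolean mask: True only where the correlation exceeds +0.7
strong = corr > 0.7
print(strong)

# Mean alcohol content for each quality score
mean_alcohol = df.groupby('quality')['alcohol'].mean()
print(mean_alcohol)
```

In this toy table alcohol and density are almost perfectly negatively correlated, yet the mask marks the pair False, because the test only looks for values above +0.7. Using corr.abs() > 0.7 would flag strong negative relationships as well.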
AIM
To carry out a case study on a data set, apply the various EDA and visualization techniques,
and present an analysis report.
PROGRAM
import pandas as pd
import matplotlib.pyplot as plt
# Create a plot that shows the number of parking violations for each vehicle year.
plt.plot(df_vehicle_year)
plt.show()
# Calculate the number of parking violations in each county as a percentage of all parking violations.
df_county['Percentage'] = df_county['Percentage'] / df_county['Percentage'].sum() * 100
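The percentage step above divides each county's count by the total and multiplies by 100. A minimal sketch of the whole calculation follows, starting from one row per violation. The column name 'County' and the county codes are illustrative assumptions, since the listing does not show how df_county is built.

```python
import pandas as pd

# Assumption: made-up county codes, one row per parking violation
violations = pd.DataFrame(
    {'County': ['NY', 'K', 'NY', 'Q', 'NY', 'K', 'BX', 'NY']}
)

# Number of violations per county
counts = violations['County'].value_counts()

# Share of all violations, as a percentage
percentages = counts / counts.sum() * 100
print(percentages)
```

The percentages always sum to 100, which is a quick sanity check before plotting them.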