Manual

CONTENT
SL.NO PROGRAM PAGE NO MARK SIGNATURE
1. Install the data analysis and visualization tool: R / Python / Tableau Public / Power BI.
2. Perform exploratory data analysis (EDA) on datasets such as an email data set. Export all your emails as a dataset, import them into a pandas data frame, visualize them and get different insights from the data.
3. Working with NumPy arrays, Pandas data frames, basic plots using Matplotlib.
4. Explore various variable and row filters in R for cleaning data. Apply various plot features in R on sample data sets and visualize.
5. Perform Time Series Analysis and apply the various visualization techniques.
6. Perform Data Analysis and representation on a Map using various Map data sets with Mouse Rollover effect, user interaction, etc.
7. Build cartographic visualization for multiple datasets involving various countries of the world; states and districts in India, etc.
8. Perform EDA on Wine Quality Data Set.
9. Use a case study on a data set and apply the various EDA and visualization techniques and present an analysis report.
EXP NO: 2
Perform Exploratory Data Analysis on Email Data Set
DATE:
AIM
To perform exploratory data analysis (EDA) on an email data set: export the emails as a
dataset, import them into a pandas data frame, visualize them and get different insights from the
data.
PROGRAM
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('D:/SIBIYA/DEV LAB/emails.csv')
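# Note: .info is referenced without parentheses, so this prints the bound method
# together with the first five rows; call df.info() for a column summary instead
print(df.head().info)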
# Feature Selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, roc_curve, auc, confusion_matrix
from sklearn.svm import SVC
# The dataset holds one word-count column per word (there is no "text" column),
# so drop any rows that contain missing values
df = df.dropna()
# Separate the features (words) and the target variable (spam)
X = df.drop(["Email No.", "Prediction"], axis=1) # Exclude Email_no. and spam columns
y = df["Prediction"]
# Perform feature selection using mutual information
selector = SelectKBest(score_func=mutual_info_classif, k=1500) # Select top 1500 features
X_selected = selector.fit_transform(X, y)
# Get the selected feature names

selected_feature_names = X.columns[selector.get_support()].tolist()
# Create a new dataframe with the selected features

df_selected = df[["Email No.", "Prediction"] + selected_feature_names]
# Print the shape of the new dataframe

print("New dataframe shape:", df_selected.shape)
# Perform train-test split
# Note: the split uses the full feature set X (3000 columns), not X_selected
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Print the shape of the train and test sets

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
# Fit the model on the training data

model = MultinomialNB()
model.fit(X_train, y_train)
# Predict probabilities for the test data
probs = model.predict_proba(X_test)
# Predict labels for the test data
predicted_labels = model.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, predicted_labels)
print("Accuracy:", accuracy)
# Plotting ROC-AUC Curve
# Compute the ROC curve and AUC score
fpr, tpr, thresholds = roc_curve(y_test, probs[:, 1])
auc_score = auc(fpr, tpr)
print("AUC Score:", auc_score)
# Plot the ROC curve

plt.figure()
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score)
plt.plot([0, 1], [0, 1], 'k--') # Diagonal line indicating random chance
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
# Plotting Confusion Matrix
# Example predicted labels (illustrative values, not the model's predictions)
predicted_labels = np.array(['spam', 'ham', 'spam', 'ham', 'spam'])
# Example true labels
true_labels = np.array(['spam', 'ham', 'ham', 'ham', 'spam'])
# Define the classes and the order of the confusion matrix
classes = ['spam', 'ham']
# Create confusion matrix

cm = confusion_matrix(true_labels, predicted_labels, labels=classes)
# Plot confusion matrix

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=classes, yticklabels=classes)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()
# Using Support Vector Classifier (SVC) from scikit-learn
# Example feature vectors (X) and labels (y)
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
y = np.array(['spam', 'ham', 'spam', 'ham', 'spam'])
# Create an instance of SVC classifier

model = SVC()
# Fit the model to the training data

model.fit(X, y)
# Predict labels for the same data

predicted_labels = model.predict(X)
# Calculate accuracy
accuracy = accuracy_score(y, predicted_labels)
print("Accuracy:", accuracy)
# Confusion Matrix
# Create confusion matrix
cm = confusion_matrix(y, predicted_labels)
# Plot confusion matrix

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()
OUTPUT
<bound method DataFrame.info of Email No. the to ect and ... military allowing ff dry Prediction
0 Email 1 0 0 1 0 ... 0 0 0 0 0
1 Email 2 8 13 24 6 ... 0 0 1 0 0
2 Email 3 0 0 1 0 ... 0 0 0 0 0
3 Email 4 0 5 22 0 ... 0 0 0 0 0
4 Email 5 7 6 17 1 ... 0 0 1 0 0
[5 rows x 3002 columns]>

New dataframe shape: (5172, 1502)
X_train shape: (4137, 3000)
X_test shape: (1035, 3000)
y_train shape: (4137,)
y_test shape: (1035,)
Accuracy: 0.9545893719806763
AUC Score: 0.9793548623047947
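The program above concentrates on classification. For the exploratory part of the aim, the following is a minimal sketch, assuming the same layout as emails.csv (one count column per word plus a 0/1 Prediction column). The small frame `sample` and all of its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical miniature of the email table: one column per word, plus the label
sample = pd.DataFrame({
    "Email No.": ["Email 1", "Email 2", "Email 3", "Email 4"],
    "the": [0, 8, 0, 7],
    "free": [3, 0, 4, 0],
    "meeting": [0, 2, 0, 1],
    "Prediction": [1, 0, 1, 0],
})

# Class balance: how many emails carry each label
class_counts = sample["Prediction"].value_counts().sort_index()
print(class_counts)

# Average count of every word within each class
word_means = sample.drop(columns="Email No.").groupby("Prediction").mean()
print(word_means)

# Words whose average count differs most between the two classes
gap = (word_means.loc[1] - word_means.loc[0]).abs().sort_values(ascending=False)
print(gap)
```

On the real data the same three steps would give the spam/ham balance and a ranked list of the words that separate the two classes most strongly.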
RESULT
Thus, exploratory data analysis (EDA) on the email data set was executed and verified
successfully.
EXP NO: 3
Matplotlib
DATE:
AIM
To work with NumPy arrays, Pandas data frames, and basic plots using Matplotlib.
NUMPY
PROGRAM
Write a NumPy program to create a null vector of size 10 and update the value at index 6 to 11
(NumPy indexing starts at 0, so index 6 is the seventh element)
import numpy as np
x = np.zeros(10)
print(x)
print("Update sixth value to 11")
x[6] = 11
print(x)
OUTPUT
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Update sixth value to 11
[ 0. 0. 0. 0. 0. 0. 11. 0. 0. 0.]
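Because NumPy indices start at 0, "sixth value" and "index 6" name different positions. A short sketch of the difference (the variable names are only illustrative):

```python
import numpy as np

x = np.zeros(10)
x[5] = 11   # index 5 is the sixth element

y = np.zeros(10)
y[6] = 11   # index 6 is the seventh element

print(x)
print(y)
# Convert the zero-based position of the changed entry to a one-based position
print(int(np.argmax(x)) + 1, int(np.argmax(y)) + 1)
```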
Write a NumPy program to create a 3x3 matrix with values ranging from 2 to 10
import numpy as np
x = np.arange(2, 11).reshape(3,3)
print(x)
OUTPUT
[[ 2 3 4]
[ 5 6 7]
[ 8 9 10]]
Write a NumPy program to convert an array to a float type
import numpy as np
x= np.array([[12, 12], [2, 7], [25, 36]])
print("Original array elements:")
print(x)
print("Convert to float values :")
print(x.astype(float))
OUTPUT
Original array elements:
[[12 12]
[ 2 7]
[25 36]]
Convert to float values :
[[12. 12.]
[ 2. 7.]
[25. 36.]]
Write a NumPy program to create an empty and a full array

import numpy as np
# Create an empty array
x = np.empty((3,4))
print(x)
# Create a full array
y = np.full((3,3),6)
print(y)
OUTPUT
[[6.23042070e-307 4.67296746e-307 1.69121096e-306 4.45042613e-307]
[6.11926293e-308 3.56043054e-307 7.56595733e-307 1.60216183e-306]
[8.45596650e-307 2.33644299e-307 1.89144180e-307 3.26083326e-322]]
[[6 6 6]
[6 6 6]
[6 6 6]]
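The values printed for the empty array are whatever happened to be in memory, so they will differ from run to run. A small sketch contrasting `np.empty` with `np.zeros` and `np.full`:

```python
import numpy as np

e = np.empty((3, 4))     # contents are arbitrary until assigned
z = np.zeros((3, 4))     # guaranteed to be all zeros
f = np.full((3, 3), 6)   # every entry set to 6

e[:] = 0                 # assign values before reading from an empty array
print(np.array_equal(e, z))
print(f.sum())
```

`np.empty` is only worth using when every element will be overwritten before it is read.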
Write a NumPy program to find the real and imaginary parts of an array of complex numbers
import numpy as np
x = np.sqrt([1+0j])
y = np.sqrt([0+1j])
print("Original array:x ",x)
print("Original array:y ",y)
print("Real part of the array:")
print(x.real)
print(y.real)
print("Imaginary part of the array:")
print(x.imag)
print(y.imag)
OUTPUT
Original array:x [1.+0.j]
Original array:y [0.70710678+0.70710678j]
Real part of the array:
[1.]
[0.70710678]
Imaginary part of the array:
[0.]
[0.70710678]
PANDAS
PROGRAM
Write a Pandas program to get the powers of an array values element-wise. Note: the program
below raises every element of the data frame to the power 2 (it squares each value)
import pandas as pd
import numpy as np
df = pd.DataFrame({'X':[78,85,96,80,86], 'Y':[84,94,89,83,86],'Z':[86,97,96,72,83]});
print("Original array")
print(df)
print("First array elements raised to powers from second array, element-wise:")
print(np.power(df, 2))
OUTPUT
Original array
X Y Z
0 78 84 86
1 85 94 97
2 96 89 96
3 80 83 72
4 86 86 83
First array elements raised to powers from second array, element-wise:
X Y Z
0 6084 7056 7396
1 7225 8836 9409
2 9216 7921 9216
3 6400 6889 5184
4 7396 7396 6889
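The program above uses a fixed exponent of 2. To raise the elements of one data frame to powers taken from a second one, as the task wording suggests, a minimal sketch could look like this. The frames `base` and `exponent` are made-up examples, and `DataFrame.pow` is used here in place of `np.power`.

```python
import pandas as pd

base = pd.DataFrame({"X": [2, 3, 4], "Y": [5, 6, 7]})
exponent = pd.DataFrame({"X": [3, 2, 1], "Y": [2, 1, 0]})

# Element-wise power: each value in base is raised to the matching value in exponent
result = base.pow(exponent)
print(result)
```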
Write a Pandas program to select the specified columns and rows from a given data frame.
Select two columns in rows 1, 3, 5, 6 from the following data frame (columns 'name', 'score',
'attempts', 'qualify'; row labels in labels). Note: the column positions [1, 3] used below pick 'score'
and 'qualify'.
import pandas as pd
import numpy as np
exam_data = {
'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data, index=labels)
print("Select specific columns and rows:")
print(df.iloc[[1, 3, 5, 6], [1, 3]])
OUTPUT
Select specific columns and rows:
score qualify
b 9.0 no
d NaN no
f 20.0 yes
g 14.5 yes
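To pick the 'name' and 'score' columns by name rather than by position, `loc` can be used with row and column labels. A minimal sketch on the same data:

```python
import numpy as np
import pandas as pd

exam_data = {
    'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily',
             'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
    'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
    'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes'],
}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data, index=labels)

# Label-based selection: rows b, d, f, g are positions 1, 3, 5, 6
selected = df.loc[['b', 'd', 'f', 'g'], ['name', 'score']]
print(selected)

# Equivalent position-based selection, looking the column positions up by name
cols = [df.columns.get_loc('name'), df.columns.get_loc('score')]
by_position = df.iloc[[1, 3, 5, 6], cols]
```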
Write a Pandas program to count the number of rows and columns of a DataFrame. Sample
Python dictionary data and list labels:
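import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura',
'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
total_rows=len(df.axes[0])
total_cols=len(df.axes[1])
print("Number of Rows: "+str(total_rows))
print("Number of Columns: "+str(total_cols))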
OUTPUT
Number of Rows: 10
Number of Columns: 4
Write a Pandas program to group by the first column and get second column as lists in rows
import pandas as pd
df = pd.DataFrame( {'col1':['C1','C1','C2','C2','C2','C3','C2'], 'col2':[1,2,3,3,4,6,5]})
print("Original DataFrame")
print(df)
df = df.groupby('col1')['col2'].apply(list)
print("\nGroup on the col1:")
print(df)
OUTPUT
Original DataFrame
col1 col2
0 C1 1
1 C1 2
2 C2 3
3 C2 3
4 C2 4
5 C3 6
6 C2 5
Group on the col1:
col1
C1 [1, 2]
C2 [3, 3, 4, 5]
C3 [6]
Name: col2, dtype: object
Write a Pandas program to check whether a given column is present in a Data Frame or
not.
import pandas as pd
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)
print("Original DataFrame")
print(df)
if 'col4' in df.columns:
    print("Col4 is present in DataFrame.")
else:
    print("Col4 is not present in DataFrame.")
if 'col1' in df.columns:
    print("Col1 is present in DataFrame.")
else:
    print("Col1 is not present in DataFrame.")
OUTPUT
Original DataFrame
col1 col2 col3
0 1 4 7
1 2 5 8
2 3 6 12
3 4 9 1
4 7 5 11
Col4 is not present in DataFrame.
Col1 is present in DataFrame.
Write a Pandas program to select the rows where the number of attempts in the examination is
greater than 2.
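import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura',
'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts' : [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data, index=labels)
print("Number of attempts in the examination is greater than 2:")
print(df[df['attempts'] > 2])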
OUTPUT
Number of attempts in the examination is greater than 2:
name score attempts qualify
b Dima 9.0 3 no
d James NaN 3 no
f Michael 20.0 3 yes
Write a Pandas program to select the rows where the score is missing, i.e. is NaN.
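import pandas as pd
import numpy as np
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura',
'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data, index=labels)
print("Rows where score is missing:")
print(df[df['score'].isnull()])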
OUTPUT
Rows where score is missing:
name score attempts qualify
d James NaN 3 no
h Laura NaN 1 no
MATPLOTLIB
PROGRAM
MATPLOTLIB-1
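# pyplot is needed by all the Matplotlib programs in this experiment
import matplotlib.pyplot as plt
x = [1,2,3]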
y = [2,4,1]
plt.plot(x, y)
plt.xlabel('x - axis')
plt.ylabel('y - axis')
plt.title('My first graph!')
plt.show()
OUTPUT
MATPLOTLIB-2
a = [1, 2, 3, 4, 5]
b = [0, 0.6, 0.2, 15, 10, 8, 16, 21]
plt.plot(a)
# o is for circles and r is
# for red
plt.plot(b, "or")
plt.plot(list(range(0, 22, 3)))
# naming the x-axis
plt.xlabel('Day ->')
# naming the y-axis
plt.ylabel('Temp ->')
c = [4, 2, 6, 8, 3, 20, 13, 15]
plt.plot(c, label = '4th Rep')
# get current axes command
ax = plt.gca()
# get command over the individual
# boundary line of the graph body
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
# set the range or the bounds of
# the left boundary line to fixed range
ax.spines['left'].set_bounds(-3, 40)
# set the interval by which
# the x-axis set the marks
plt.xticks(list(range(-3, 10)))
# set the intervals by which y-axis
# set the marks
plt.yticks(list(range(-3, 20, 3)))
# legend denotes that what color
# signifies what
ax.legend(['1st Rep', '2nd Rep', '3rd Rep', '4th Rep'])
# annotate command helps to write
# ON THE GRAPH any text xy denotes
# the position on the graph
plt.annotate('Temperature v/s Days', xy = (1.01, -2.15))
# gives a title to the Graph
plt.title('All Features Discussed')
plt.show()
OUTPUT
MATPLOTLIB-3
x = ['Java', 'Python', 'PHP', 'JavaScript', 'C#', 'C++']
popularity = [22.2, 17.6, 8.8, 8, 7.7, 6.7]
x_pos = [i for i, _ in enumerate(x)]
plt.bar(x_pos, popularity, color='blue')
plt.xlabel("Languages")
plt.ylabel("Popularity")
plt.title("Popularity of Programming Language\n" + "Worldwide, Oct 2017 compared to a year ago")
plt.xticks(x_pos, x)
# Turn on the grid
plt.minorticks_on()
plt.grid(which='major', linestyle='-', linewidth='0.5', color='red')
# Customize the minor grid
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()
OUTPUT
MATPLOTLIB-4
days = [1, 2, 3, 4, 5]
sleeping = [7, 8, 6, 11, 7]
eating = [2, 3, 4, 3, 2]
working = [7, 8, 7, 2, 2]
playing = [8, 5, 7, 8, 13]
plt.plot([], [], color='m', label='Sleeping', linewidth=5)
plt.plot([], [], color='c', label='Eating', linewidth=5)
plt.plot([], [], color='r', label='Working', linewidth=5)
plt.plot([], [], color='k', label='Playing', linewidth=5)
plt.stackplot(days, sleeping, eating, working, playing, colors=['m', 'c', 'r', 'k'])
plt.xlabel('x')
plt.ylabel('y')
plt.title('Stack Plot')
plt.legend()
plt.show()
OUTPUT
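MATPLOTLIB-4 builds its legend from four empty line plots. As an alternative sketch, `stackplot` also accepts a `labels` argument, which gives the same legend without the placeholder lines. The Agg backend and the output file name stack_plot.png are assumptions made so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5]
sleeping = [7, 8, 6, 11, 7]
eating = [2, 3, 4, 3, 2]
working = [7, 8, 7, 2, 2]
playing = [8, 5, 7, 8, 13]

fig, ax = plt.subplots()
# labels= names each stacked layer, so no empty placeholder plots are needed
ax.stackplot(days, sleeping, eating, working, playing,
             labels=['Sleeping', 'Eating', 'Working', 'Playing'],
             colors=['m', 'c', 'r', 'k'])
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Stack Plot')
legend = ax.legend(loc='upper left')
fig.savefig('stack_plot.png')  # hypothetical file name
```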
RESULT
Thus, the programs on NumPy arrays, Pandas data frames and basic plots using Matplotlib were
executed and verified successfully.
EXP NO: 4
Variable and Row Filters in R
DATE:
AIM
To explore various variable and row filters in R for cleaning data. Apply various plot features in
R on sample data sets and visualize.
PROGRAM
OUTPUT
RESULT
Thus, the exploration on variable and row filters for cleaning data, and applying plot features in
R is done and executed successfully.
EXP NO: 5
Perform Time Series Analysis and Apply the Visualization Techniques
DATE:
AIM
To perform Time Series Analysis and apply the various visualization techniques.
PROGRAM
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Read the dataset; "Date" stays an ordinary column until the box plot is done
df = pd.read_csv(r"D:\SIBIYA\DEV LAB\stock_data.csv")
# Display the first five rows of the dataset
print(df.head())
# Box Plot in Time Series
# Drop the leftover index column, if the file has one
df = df.drop(columns='Unnamed: 0', errors='ignore')
df['Date'] = pd.to_datetime(df['Date'])
# extract year from date column
df["Year"] = df["Date"].dt.year
# box plot grouped by year
sns.boxplot(data=df, x="Year", y="Open")
plt.show()
# Use the dates as the index for the remaining time series plots
df = df.set_index("Date")
# Plotting Line plot for Time Series data.
df['Volume'].plot()
plt.show()
# plot all other columns using a subplot
df.plot(subplots=True, figsize=(4, 4))
plt.show()
# Plot the change in "Low" over a lag of two rows
df.Low.diff(2).plot(figsize=(6, 6))
plt.show()
# Finding the trend in the "Open" column using moving average method
window_size = 50
rolling_mean = df['Open'].rolling(window_size).mean()
rolling_mean.plot()
plt.show()
OUTPUT
Displaying the first five rows of the dataset
Box plot in time series dataset
Line plot for time series data
Plot all other columns using subplots (figsize=(4, 4))
Lag-2 difference of the "Low" column (figsize=(6, 6))
Trend in the "Open" column (moving average)

RESULT
Thus, the Time Series Analysis and applying the various visualization techniques are executed
and verified successfully.
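The moving average and the lagged difference used in the program can be checked by hand on a small series. The sketch below uses a made-up ten-day series rather than stock_data.csv, with a window of 3 instead of 50 so that the numbers are easy to follow.

```python
import numpy as np
import pandas as pd

# Hypothetical ten-day "Open" series: 1.0, 2.0, ..., 10.0
dates = pd.date_range("2023-01-01", periods=10, freq="D")
prices = pd.Series(np.arange(1.0, 11.0), index=dates, name="Open")

window_size = 3
# The first window_size - 1 entries are NaN because the window is not yet full
rolling_mean = prices.rolling(window_size).mean()
print(rolling_mean)

# min_periods=1 averages over whatever values are available at the start
partial = prices.rolling(window_size, min_periods=1).mean()

# diff(2) subtracts the value two rows earlier (a lag-2 difference)
lagged_change = prices.diff(2)
print(lagged_change)
```

With a window of 50, as in the program, the first 49 points of the trend line are therefore missing from the plot.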
EXP NO: 6
Perform Data Analysis and Representation on a Map Data Sets
DATE:
AIM
To perform Data Analysis and representation on a Map using various Map data sets with Mouse
Rollover effect, user interaction.
PROGRAM
import pandas as pd
import geopandas as gpd
import math
import folium
from folium import Choropleth, Circle, Marker
from folium.plugins import HeatMap, MarkerCluster
# Create a map
m_1 = folium.Map(location=[42.32, -71.0589], tiles='openstreetmap', zoom_start=10)
# Save the map as HTML; open the file in a browser to display it
m_1.save("D:/SIBIYA/DEV LAB/map_1.html")
# Load the data
crimes = pd.read_csv("D:/SIBIYA/DEV LAB/crime.csv", encoding='latin-1')
# Drop rows with missing locations
crimes.dropna(subset=['Lat', 'Long', 'DISTRICT'], inplace=True)
# Focus on major crimes from 2018 onwards
crimes = crimes[crimes.OFFENSE_CODE_GROUP.isin([
    'Larceny', 'Auto Theft', 'Robbery', 'Larceny From Motor Vehicle', 'Residential Burglary',
    'Simple Assault', 'Harassment', 'Ballistics', 'Aggravated Assault', 'Other Burglary',
    'Arson', 'Commercial Burglary', 'HOME INVASION', 'Homicide', 'Criminal Harassment',
    'Manslaughter'])]
crimes = crimes[crimes.YEAR >= 2018]
# Print the first five rows of the table
print(crimes.head())
# Robberies reported between 09:00 and 17:59
daytime_robberies = crimes[(crimes.OFFENSE_CODE_GROUP == 'Robbery') &
                           (crimes.HOUR.isin(range(9, 18)))]
# Create a map
m_2 = folium.Map(location=[42.32, -71.0589], tiles='cartodbpositron', zoom_start=13)
# Add points to the map
for idx, row in daytime_robberies.iterrows():
    Marker([row['Lat'], row['Long']]).add_to(m_2)
# Display the map (evaluate m_2 in a notebook cell, or save it as done for m_1)
# Create the map (this line is missing in the source; assumed to mirror m_2)
m_3 = folium.Map(location=[42.32, -71.0589], tiles='cartodbpositron', zoom_start=13)
# Add points to the map, grouped into clusters
mc = MarkerCluster()
for idx, row in daytime_robberies.iterrows():
    if not math.isnan(row['Long']) and not math.isnan(row['Lat']):
        mc.add_child(Marker([row['Lat'], row['Long']]))
m_3.add_child(mc)
# Display the map
# Create a base map (this line is missing in the source; assumed to mirror m_2)
m_4 = folium.Map(location=[42.32, -71.0589], tiles='cartodbpositron', zoom_start=13)
def color_producer(val):
    # Green for hours up to 12, dark red for later hours
    if val <= 12:
        return 'forestgreen'
    else:
        return 'darkred'
# Add a bubble map to the base map
for i in range(0, len(daytime_robberies)):
    Circle(
        location=[daytime_robberies.iloc[i]['Lat'], daytime_robberies.iloc[i]['Long']],
        radius=20,
        color=color_producer(daytime_robberies.iloc[i]['HOUR'])).add_to(m_4)
# Display the map
OUTPUT
RESULT
Thus, the Data Analysis and representation on a Map using various Map data sets is executed
and verified successfully.
EXP NO: 7
Build Cartographic Visualization
DATE:
AIM
To build cartographic visualization for multiple datasets involving various countries of the
world; states and districts in India.
PROGRAM
# Import Libraries
import pandas as pd
import geopandas
import folium
import geodatasets
import matplotlib.pyplot as plt
from folium import plugins
df1 = pd.read_csv("D:/SIBIYA/DEV LAB/volcano_data_2010.csv")
# Keep only relevant columns
df = df1.loc[:, ["Year", "Name", "Country", "Latitude", "Longitude", "Type"]]
df.info()
# Create point geometries
geometry = geopandas.points_from_xy(df.Longitude, df.Latitude)
geo_df = geopandas.GeoDataFrame(
    df[["Year", "Name", "Country", "Latitude", "Longitude", "Type"]], geometry=geometry)
geo_df.head()
world = geopandas.read_file(geodatasets.get_path("naturalearth.land"))
df.Type.unique()
# Static plot of the volcanoes over the world land outline
fig, ax = plt.subplots(figsize=(24, 18))
world.plot(ax=ax, alpha=0.4, color="grey")
geo_df.plot(column="Type", ax=ax, legend=True)
plt.title("Volcanoes")
plt.show()
# Note: the "Stamen ..." tile names below may not be available as built-in tiles in
# recent folium releases; a custom tile URL with an attribution may be needed instead
# Stamen Terrain
map = folium.Map(location=[13.406, 80.110], tiles="Stamen Terrain", zoom_start=9)
map.save("D:/SIBIYA/DEV LAB/map1.html")
# OpenStreetMap
map = folium.Map(location=[13.406, 80.110], tiles="OpenStreetMap", zoom_start=9)
# Stamen Toner
map = folium.Map(location=[13.406, 80.110], tiles="Stamen Toner", zoom_start=9)
# Use terrain map layer to see volcano terrain
map = folium.Map(location=[4, 10], tiles="Stamen Terrain", zoom_start=3)
# Create a [latitude, longitude] list from the GeoDataFrame
geo_df_list = [[point.xy[1][0], point.xy[0][0]] for point in geo_df.geometry]
# Iterate through list and add a marker for each volcano, color-coded by its type.
i=0
for coordinates in geo_df_list:
    # assign a color marker for the type of volcano, Strato being the most common
    if geo_df.Type[i] == "Stratovolcano":
        type_color = "green"
    elif geo_df.Type[i] == "Complex volcano":
        type_color = "blue"
    elif geo_df.Type[i] == "Shield volcano":
        type_color = "orange"
    elif geo_df.Type[i] == "Lava dome":
        type_color = "pink"
    else:
        type_color = "purple"
    # Place the markers with the popup labels and data
    map.add_child(
        folium.Marker(
            location=coordinates,
            popup="Year: "
            + str(geo_df.Year[i]) + "<br>"
            + "Name: "
            + str(geo_df.Name[i]) + "<br>"
            + "Country: "
            + str(geo_df.Country[i]) + "<br>"
            + "Type: "
            + str(geo_df.Type[i]) + "<br>"
            + "Coordinates: "
            + str(geo_df_list[i]),
            icon=folium.Icon(color="%s" % type_color),
        ) )
    i=i+1
# This example uses heatmaps to visualize the density of volcanoes
# which is more in some parts of the world compared to others.
map = folium.Map(location=[15, 30], tiles="Cartodb dark_matter", zoom_start=2)
heat_data = [[point.xy[1][0], point.xy[0][0]] for point in geo_df.geometry]
print(heat_data)
plugins.HeatMap(heat_data).add_to(map)
OUTPUT
RESULT
Thus, the cartographic visualization for multiple datasets involving various countries of the
world; states and districts in India is visualized and verified successfully.
EXP NO: 8
Perform Exploratory Data Analysis on Wine Quality Data Set
DATE:
AIM
To perform EDA on Wine Quality Data Set.
PROGRAM
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
warnings.filterwarnings('ignore')
df = pd.read_csv('D:/SIBIYA/DEV LAB/winequality.csv')
print(df.head())
df.info()
print(df.describe().T)
print(df.isnull().sum())
# Fill missing values in each column with that column's mean
for col in df.columns:
    if df[col].isnull().sum() > 0:
        df[col] = df[col].fillna(df[col].mean())
print(df.isnull().sum().sum())
df.hist(bins=20, figsize=(10, 10))
plt.show()
plt.bar(df['quality'], df['alcohol'])
plt.xlabel('quality')
plt.ylabel('alcohol')
plt.show()
plt.figure(figsize=(12, 12))
# numeric_only=True (pandas 1.5 or later) skips text columns such as the wine type
sb.heatmap(df.corr(numeric_only=True) > 0.7, annot=True, cbar=False)
plt.show()
df = df.drop('total sulfur dioxide', axis=1)
# Binary target: 1 for quality above 5, otherwise 0
df['best quality'] = [1 if x > 5 else 0 for x in df.quality]
# Encode the wine type as a number
df = df.replace({'white': 1, 'red': 0})
features = df.drop(['quality', 'best quality'], axis=1)
target = df['best quality']
xtrain, xtest, ytrain, ytest = train_test_split(
    features, target, test_size=0.2, random_state=40)
print(xtrain.shape, xtest.shape)
norm = MinMaxScaler()
xtrain = norm.fit_transform(xtrain)
xtest = norm.transform(xtest)
models = [LogisticRegression(), XGBClassifier(), SVC(kernel='rbf')]
for i in range(3):
    models[i].fit(xtrain, ytrain)
    print(f'{models[i]} : ')
    # These scores are ROC AUC values computed from the predicted labels, not accuracies
    print('Training ROC AUC : ', metrics.roc_auc_score(ytrain, models[i].predict(xtrain)))
    print('Validation ROC AUC : ', metrics.roc_auc_score(
        ytest, models[i].predict(xtest)))
    print()
# plot_confusion_matrix was removed from scikit-learn; ConfusionMatrixDisplay replaces it
metrics.ConfusionMatrixDisplay.from_estimator(models[1], xtest, ytest)
plt.show()
print(metrics.classification_report(ytest, models[1].predict(xtest)))
OUTPUT
RESULT
Thus, the EDA on Wine Quality Data Set is performed and executed successfully.
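The heatmap of `df.corr() > 0.7` in the program is used to spot highly correlated features, which is why 'total sulfur dioxide' is dropped afterwards. The same check can be listed as pairs instead of read off a plot. This is a sketch on synthetic data; the column names `a`, `b`, `c` and the random values are assumptions, not the wine data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.1, size=200),  # nearly a scaled copy of "a"
    "c": rng.normal(size=200),                      # unrelated to "a"
})

corr = toy.corr()
# Keep only the upper triangle so each pair appears once and the diagonal is skipped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
high = pairs[pairs > 0.7]
print(high)
```

One feature from each listed pair is a candidate for removal before model fitting.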
EXP NO: 9
Case Study
DATE:
AIM
To carry out a case study on a data set, apply the various EDA and visualization techniques and
present an analysis report.
PROGRAM
import pandas as pd
import matplotlib.pyplot as plt
# Read the file and import all rows.

df = pd.read_csv('D:/SIBIYA/DEV LAB/violations.csv')
# Change the data type of the 'Issue Date' column to date.

df['Issue Date'] = pd.to_datetime(df['Issue Date'])
# Print out the number of rows imported from the file.

print('Number of Rows: ' + str(len(df)))
# Remove rows containing invalid data.

df = df[(df['Registration State'] != "99") & (df['Plate Type'] != "999")
        & (df['Issue Date'] >= '2020-04-01') & (df['Issue Date'] <= '2020-11-30')
        & (df['Violation Code'] != 0) & (df['Vehicle Make'].notnull())
        & (df['Violation Time'].notnull()) & (df['Vehicle Year'] != 0) & (df['Vehicle Year'] <= 2020)]
# Print out the number of rows remaining in the dataset.
print('Number of Rows: ' + str(len(df)))
# Isolate the data to be used in the plot.
df_vehicle_year = df.groupby('Vehicle Year')['Summons Number'].count()
# Create a plot that shows the number of parking violations for each vehicle year.
plt.plot(df_vehicle_year)
plt.show()
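# Five most frequent violation codes for vehicles registered outside NY
df[df['Registration State'] != 'NY'].groupby('Violation Code')['Summons Number'].count().nlargest(5).reset_index(name='Count')
# The next statement is truncated in the source after ['Summons; only the column name is completed here
df[df['Vehicle Make'] == 'HONDA'].groupby('Street Name')['Summons Number']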
# Subset for only rows where the Registration State is NY.

df_ny = df[df['Registration State'] == 'NY']
# Calculate the ratio of non-passenger plates to all plates, grouped by year.

df_ny_notpas = df_ny[df_ny['Plate Type'] != 'PAS'].groupby('Vehicle Year')['Summons Number'].count()
df_ny_all = df_ny.groupby('Vehicle Year')['Summons Number'].count()
ratio = df_ny_notpas / df_ny_all
# Replace nulls with 0.

ratio.fillna(0, inplace = True)
# Create and show plot.

plt.plot(ratio)
plt.show()
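# The next two statements are truncated in the source after ['Summons; only the column name is completed here
df[df['Plate Type'] == 'PAS'].groupby('Vehicle Color')['Summons Number']
df[df['Plate Type'] == 'COM'].groupby('Vehicle Color')['Summons Number']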
print('Number of Registration States: ' + str(df['Registration State'].nunique()))

print('Average Number of Parking Violations per Registration State: ' +
str(df.groupby('Registration State')['Summons Number'].count().mean()))
df.groupby('Violation Code')['Plate Type'].apply(lambda x:
x.value_counts().head(1)).reset_index(name='Count').rename(columns={'level_1': 'Plate Type'})
# Count the number of parking violations in each county.

df_county = df.groupby('Violation County')['Summons Number'].count().reset_index(name='Percentage')
# Calculate the number of parking violations in each county as a percentage of all parking violations.
df_county['Percentage'] = df_county['Percentage'] / df_county['Percentage'].sum() * 100
# Sort and display the resulting dataframe.

df_county.sort_values(by='Percentage', ascending=False).reset_index(drop=True)
OUTPUT
RESULT
Thus, a case study on the data set, applying the various EDA and visualization techniques, was carried out.
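The county percentages in the program are computed by counting and then dividing by the total. As an alternative sketch, `value_counts(normalize=True)` gives the same shares in one step. The frame `toy` and its county codes are made up for illustration and are not taken from violations.csv.

```python
import pandas as pd

# Hypothetical extract with one row per summons
toy = pd.DataFrame({
    "Violation County": ["NY", "NY", "K", "Q", "K", "NY", "BX", "NY"],
    "Summons Number": list(range(1001, 1009)),
})

# Share of all violations per county, as a percentage, largest first
share = (toy["Violation County"].value_counts(normalize=True) * 100).rename("Percentage")
print(share)
```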
