Lab Manual (DAV)
Experiment 1: Data Visualization Tools in Python
Objective:
To understand and explore different data visualization tools in Python, specifically
Matplotlib, Seaborn, Plotly, and Pandas, and to demonstrate basic plots using each of these
libraries.
Implementation:
1. Matplotlib – A basic plotting library for creating static, animated, and interactive
visualizations.
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [10, 20, 15, 25, 30]
# Line Plot
plt.plot(x, y, marker='o', color='blue', linestyle='-')
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Matplotlib Line Plot")
plt.show()
2. Seaborn – A statistical data visualization library built on top of Matplotlib, useful for
creating informative and attractive graphics.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
# Sample data
data = {'Category': ['A', 'B', 'C', 'D', 'E'],
        'Values': [15, 30, 25, 10, 20]}
df = pd.DataFrame(data)
# Bar Plot
sns.barplot(x='Category', y='Values', data=df, palette="viridis")
plt.title("Seaborn Bar Plot")
plt.show()
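3. Plotly – A library for creating interactive visualizations that support zooming,
hovering, and clicking.
import plotly.express as px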
# Sample data
df = px.data.iris() # Built-in iris dataset
# Scatter Plot
fig = px.scatter(df, x='sepal_width', y='sepal_length',
color='species', title="Plotly Scatter Plot")
fig.show()
4. Pandas Plotting – Uses Matplotlib as a backend to create simple visualizations
directly from DataFrames.
import pandas as pd
# Sample data
df = pd.DataFrame({'Category': ['A', 'B', 'C', 'D', 'E'], 'Values':
[15, 30, 25, 10, 20]})
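The listing above builds the DataFrame but stops before any plotting call. A minimal sketch of how the plot might be completed is given here, assuming a bar chart was intended; the chart type, title, and output file name are assumptions and are not taken from the original.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Same sample data as the listing above
df = pd.DataFrame({'Category': ['A', 'B', 'C', 'D', 'E'],
                   'Values': [15, 30, 25, 10, 20]})

# DataFrame.plot delegates to Matplotlib and returns the Axes it drew on
ax = df.plot(x='Category', y='Values', kind='bar', legend=False,
             title="Pandas Bar Plot")  # title is an assumed label
ax.set_ylabel("Values")
plt.tight_layout()
plt.savefig("pandas_bar_plot.png")  # assumed file name
```

In a notebook or interactive session, the `matplotlib.use("Agg")` line can be dropped and `plt.savefig(...)` replaced with `plt.show()`.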
Expected Output:
Experiment 2: Descriptive Statistics
Objective:
To calculate and understand basic descriptive statistics such as mean, median, mode,
variance, and standard deviation.
Implementation Code:
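import numpy as np
from scipy import stats
# Sample data
data = [12, 15, 14, 10, 18, 20, 12, 14, 13, 15]
# Calculating descriptive statistics
mean = np.mean(data)
median = np.median(data)
# stats.mode reports the smallest of tied values; atleast_1d copes with
# both the array result of older SciPy and the scalar result of newer SciPy
mode = np.atleast_1d(stats.mode(data).mode)[0]
variance = np.var(data)  # population variance (ddof=0)
std_dev = np.std(data)   # population standard deviation (ddof=0)
# Displaying Results
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print(f"Variance: {variance:.2f}")
print(f"Standard Deviation: {std_dev:.2f}")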
Explanation:
Expected Output:
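Mean: 14.3
Median: 14.0
Mode: 12
Variance: 7.81
Standard Deviation: 2.79
The variance and standard deviation shown are the population versions (dividing by n, which
is NumPy's default, ddof=0); the sample versions (ddof=1) are approximately 8.68 and 2.95.
The values 12, 14, and 15 each occur twice, so the smallest of them, 12, is reported as the
mode. The values will vary based on the input data, but this code covers the basic steps for
calculating the most commonly used descriptive statistics.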
Experiment 3: Hypothesis Testing Using Chi-Square Test
Objective:
To perform a chi-square test of independence to check if two categorical variables are
independent.
The chi-square test of independence checks whether there is a significant association between
two categorical variables. The null hypothesis (H₀) assumes that the two variables are
independent, and the alternative hypothesis (H₁) assumes that they are not independent.
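Formula:
χ² = Σ (O − E)² / E
where O is the observed frequency in each cell of the contingency table, E is the expected
frequency of that cell under H₀ (row total × column total / grand total), and the sum runs
over all cells.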
Implementation Code:
from scipy.stats import chi2_contingency
import numpy as np
# Displaying Results
print(f"Chi-Square Statistic: {chi2}")
print(f"p-value: {p}")
print(f"Degrees of Freedom: {dof}")
print(f"Expected Frequencies:\n{expected}")
Expected Output:
Chi-Square Statistic: 22.5
p-value: 4.573286448626054e-06
Degrees of Freedom: 1
Expected Frequencies:
[[25. 15.]
[25. 35.]]
In this case, the p-value is less than 0.05, meaning we reject the null hypothesis and conclude
that there is a significant association between gender and product preference.
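A complete, runnable sketch of the test is given here for reference. The contingency table is hypothetical: the counts are invented for illustration and the row and column meanings (gender by product preference) are assumed from the conclusion above, so its statistic and p-value will differ from the figures listed in the expected output. Note that `chi2_contingency` applies Yates' continuity correction to 2×2 tables by default.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts (illustrative only):
# rows = gender (male, female), columns = product preference (A, B)
observed = np.array([[30, 10],
                     [20, 40]])

# Returns the statistic, p-value, degrees of freedom, and expected counts;
# for a 2x2 table the Yates continuity correction is applied by default
chi2, p, dof, expected = chi2_contingency(observed)

print(f"Chi-Square Statistic: {chi2:.4f}")
print(f"p-value: {p:.6f}")
print(f"Degrees of Freedom: {dof}")
print(f"Expected Frequencies:\n{expected}")

# Decision at the 5% significance level
alpha = 0.05
if p < alpha:
    print("Reject H0: the two variables appear to be associated.")
else:
    print("Fail to reject H0: no evidence of an association.")
```

Passing `correction=False` to `chi2_contingency` gives the uncorrected statistic, which matches the formula above term for term.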
Experiment 4: Regression Analysis, Fitting a Linear Model, and Making
Predictions
Objective:
To perform regression analysis by fitting a linear model to the data and making predictions
based on the fitted model.
Explanation:
In linear regression, the goal is to fit a model that predicts a target variable (dependent
variable) based on one or more input variables (independent variables). In simple linear
regression, the model is of the form:
y = mx + b
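Where:
• y is the dependent (target) variable
• x is the independent (input) variable
• m is the slope (coefficient) of the line
• b is the intercept (the value of y when x = 0)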
Implementation Code:
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
# Making predictions
y_pred = model.predict(X)
In the plot, you’ll see a scatter plot of the data points (blue) and the fitted regression line
(red). The output will include the coefficient (slope) and intercept of the line, as well as the
predictions for new values of X.
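The steps described above (fitting the model, reading off the slope and intercept, predicting, and plotting) can be sketched end to end as follows. The data points, the new X values, and the output file name are hypothetical and chosen only for illustration; they are not the original experiment's data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data (illustrative only); scikit-learn expects X as a 2-D array
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# Fit y = m*x + b by ordinary least squares
model = LinearRegression()
model.fit(X, y)
slope = model.coef_[0]
intercept = model.intercept_
print(f"Coefficient (slope): {slope:.2f}")
print(f"Intercept: {intercept:.2f}")

# Predictions on the training inputs and on new, unseen inputs
y_pred = model.predict(X)
X_new = np.array([[6], [7]])
y_new = model.predict(X_new)
print(f"Predictions for new X values: {y_new}")

# Data points in blue, fitted line in red
plt.scatter(X, y, color='blue', label='Data points')
plt.plot(X, y_pred, color='red', label='Fitted line')
plt.xlabel("X")
plt.ylabel("y")
plt.title("Linear Regression Fit")
plt.legend()
plt.savefig("linear_regression.png")  # assumed file name
```

In an interactive session, the `matplotlib.use("Agg")` line can be dropped and `plt.savefig(...)` replaced with `plt.show()`.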
Experiment 5: Plotly and Visualization Libraries in Python
Objective:
To explore and create interactive visualizations using Plotly, and compare it with other
popular visualization libraries like Matplotlib and Seaborn.
Explanation:
Plotly is a powerful library for creating interactive plots. Unlike static plots generated by
Matplotlib or Seaborn, Plotly allows you to create visualizations that users can interact with,
such as zooming, hovering, and clicking.
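Example 1: Scatter Plot
import plotly.express as px
# Sample data
df = px.data.iris() # Built-in iris dataset
# Scatter Plot
fig = px.scatter(df, x='sepal_width', y='sepal_length', color='species',
                 title="Interactive Scatter Plot (Plotly)")
fig.show()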
Expected Output:
Example 2: Line Plot
import pandas as pd
import plotly.express as px
# Sample data for Line Plot
df_line = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [10, 15, 13, 18, 20]
})
# Line Plot
fig_line = px.line(df_line, x='x', y='y',
                   title="Interactive Line Plot (Plotly)")
fig_line.show()
Expected Output:
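Example 3: Line Plot (Matplotlib, for comparison)
import matplotlib.pyplot as plt
# Sample data (same values as the Plotly line plot)
x = [1, 2, 3, 4, 5]
y = [10, 15, 13, 18, 20]
# Line Plot
plt.plot(x, y, marker='o', color='blue', linestyle='-', label='Data')
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Matplotlib Line Plot")
plt.legend()
plt.show()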
Expected Output:
Matplotlib is great for static visualizations, and this simple line plot shows how data can be
represented.
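Key Differences:
• Plotly: produces interactive plots that users can zoom, hover over, and click.
• Matplotlib: produces static plots by default and is highly customizable.
• Seaborn: built on top of Matplotlib and aimed at statistical graphics.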
Experiment 6: Data Visualization Using Matplotlib
Objective:
To create visualizations using Matplotlib, a powerful plotting library in Python.
Explanation:
Matplotlib is used for creating static, animated, and interactive visualizations in Python. It is
highly customizable, allowing users to create a wide variety of plots such as line plots, bar
plots, histograms, and more.
A line plot is useful for visualizing the relationship between two continuous variables.
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
# Line Plot
plt.plot(x, y, marker='o', color='b', linestyle='-', label='Line 1')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Simple Line Plot using Matplotlib')
plt.legend()
plt.grid(True)
plt.show()
Explanation:
A bar plot is used to compare different categories. It displays data with rectangular bars.
# Sample data
categories = ['A', 'B', 'C', 'D']
values = [5, 7, 3, 6]
# Bar Plot
plt.bar(categories, values, color='orange')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Bar Plot using Matplotlib')
plt.show()
Explanation:
import numpy as np
# Sample data: 1000 draws from a standard normal distribution
data = np.random.randn(1000)
# Histogram
plt.hist(data, bins=30, color='g', edgecolor='black')
plt.xlabel('Data Range')
plt.ylabel('Frequency')
plt.title('Histogram using Matplotlib')
plt.show()
Explanation:
# Sample data
labels = ['Apple', 'Banana', 'Cherry', 'Date']
sizes = [15, 25, 35, 25]
# Pie Chart
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90,
colors=['red', 'yellow', 'pink', 'brown'])
plt.title('Fruit Distribution using Pie Chart')
plt.show()
Explanation:
Matplotlib offers extensive customization options for each type of plot, allowing you to tailor
the visualizations to your needs.
Experiment 7: Data Visualization Using Pandas
Objective:
To explore data visualization capabilities using Pandas, which has built-in support for basic
plotting through Matplotlib. Pandas makes it easy to visualize data directly from DataFrames.
Explanation:
Pandas allows you to create simple yet powerful visualizations like line plots, bar charts,
histograms, and more, all using data from a Pandas DataFrame. Below are three examples to
demonstrate this functionality.
A line plot is useful for visualizing trends over time or continuous data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Sample data
data = {'Year': [2015, 2016, 2017, 2018, 2019],
        'Sales': [150, 200, 250, 300, 350]}
df = pd.DataFrame(data)
# Line Plot
df.plot(x='Year', y='Sales', kind='line', marker='o',
        title="Sales Over Years")
plt.xlabel('Year')
plt.ylabel('Sales')
plt.show()
Explanation:
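# Sample data (illustrative values)
data = {'Category': ['A', 'B', 'C', 'D'],
        'Values': [5, 7, 3, 6]}
df = pd.DataFrame(data)
# Bar Plot
df.plot(x='Category', y='Values', kind='bar', color='skyblue',
        title="Category Values")
plt.xlabel('Category')
plt.ylabel('Values')
plt.show()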
Explanation:
# Generate random data for Histogram
data = np.random.randn(1000)
# Create a DataFrame
df = pd.DataFrame(data, columns=['Value'])
# Histogram
df.plot(kind='hist', bins=30, color='orange',
        title="Distribution of Random Data")
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Explanation:
1. Line Plot: A line plot showing sales trends over the years.
2. Bar Plot: A bar chart comparing values for categories A, B, C, and D.
3. Histogram: A histogram showing the distribution of random data.
Pandas makes it easy to visualize data with minimal code, which is especially useful for quick
exploratory analysis.
Experiment 8: Working with Time Series Data in Python
Objective:
To demonstrate how to work with time series data, including loading, plotting, and analyzing
time series data using Python.
Explanation:
Time series data consists of observations taken at specific time intervals. In Python, we often
use libraries like Pandas and Matplotlib to handle and visualize time series data. Pandas
provides datetime-aware types such as Timestamp and DatetimeIndex to handle dates and times
efficiently.
1. Loading Time Series Data: Often from a CSV or directly from a generated series.
2. Plotting: Visualizing the data to detect trends, seasonality, etc.
3. Basic Analysis: Handling missing values, resampling, etc.
Implementation Code:
Let's start with generating a simple time series dataset and plotting it.
import pandas as pd
import matplotlib.pyplot as plt
• pd.date_range() creates a range of dates starting from 2020-01-01 for 100 days.
• pd.Series() creates the time series with the dates as the index.
• plot() is used to plot the time series data.
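The three calls described in the list above can be sketched as follows. The values are a seeded random walk chosen purely for illustration, so the variable names (`dates`, `ts`), the series itself, and the output file name are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# 100 consecutive daily timestamps starting at 2020-01-01
dates = pd.date_range(start='2020-01-01', periods=100, freq='D')

# Hypothetical values: a seeded random walk (illustrative only)
rng = np.random.default_rng(42)
values = rng.normal(size=100).cumsum()

# The dates become the index of the series
ts = pd.Series(values, index=dates)

# Series.plot uses the DatetimeIndex for the x-axis
ax = ts.plot(title="Generated Time Series", figsize=(10, 4))
ax.set_xlabel("Date")
ax.set_ylabel("Value")
plt.tight_layout()
plt.savefig("time_series.png")  # assumed file name
```

In an interactive session, the `matplotlib.use("Agg")` line can be dropped and `plt.savefig(...)` replaced with `plt.show()`.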
Sometimes, time series data is collected at high frequencies (e.g., minute-level data). We
might want to resample the data to a lower frequency (e.g., daily or monthly averages).
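A minimal sketch of resampling daily data to monthly averages is given here. The series is hypothetical (the values 0 to 99, so the monthly means are easy to check by hand), and the `'MS'` (month start) frequency alias is one possible choice of monthly label.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical daily series: values 0..99 over 100 days from 2020-01-01
dates = pd.date_range(start='2020-01-01', periods=100, freq='D')
ts = pd.Series(np.arange(100, dtype=float), index=dates)

# Group the daily values by calendar month and average each group;
# 'MS' labels each monthly bin by its first day
monthly_mean = ts.resample('MS').mean()
print(monthly_mean)

ax = monthly_mean.plot(marker='o', title="Monthly Mean (Resampled)")
ax.set_xlabel("Month")
ax.set_ylabel("Mean value")
plt.tight_layout()
plt.savefig("monthly_mean.png")  # assumed file name
```

The last bin (April) covers only nine days, which is worth keeping in mind when comparing it against full months.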
Time series data often contains missing values. Here's how you can handle them.
• ffill() is a method to forward fill the missing values (propagates the last valid
observation).
• We also plot the original data with missing values and the filled data for comparison.
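The forward-fill step and the comparison plot described above can be sketched as follows. The series and the positions of the missing values are hypothetical and chosen only so the effect of `ffill()` is easy to see.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical daily series with values 0..9
dates = pd.date_range(start='2020-01-01', periods=10, freq='D')
ts = pd.Series(np.arange(10, dtype=float), index=dates)

# Knock out a few observations to simulate missing data
ts_missing = ts.copy()
ts_missing.iloc[[2, 3, 7]] = np.nan

# Forward fill: each gap takes the last valid observation before it
ts_filled = ts_missing.ffill()
print(ts_filled)

# Plot the gappy series and the filled series on the same axes
ax = ts_missing.plot(marker='o', label='Original (with gaps)')
ts_filled.plot(ax=ax, linestyle='--', label='Forward filled')
ax.set_title("Handling Missing Values with ffill()")
ax.legend()
plt.tight_layout()
plt.savefig("missing_values.png")  # assumed file name
```

Forward filling leaves a leading gap untouched, since there is no earlier observation to propagate; the sketch avoids that case by keeping the first value intact.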
Expected Output:
1. Time Series Plot: A line plot showing the generated time series data.
2. Resampled Data Plot: A line plot showing monthly mean values of the time series data.
3. Missing Data Handling: A plot showing the original data with missing values and the filled
data.
Summary:
In this example, we demonstrated how to generate a simple time series, plot it, resample the
data to a different frequency, and handle missing values. These are some of the common
tasks when working with time series data in Python.
Experiment 9: Analysis and Visualization of IPL Cricket Data
Objective:
To perform data analysis and visualize the performance of teams and players in the Indian
Premier League (IPL) cricket tournament. The focus will be on exploring various statistics,
such as runs, wickets, and team performance over the years.
Explanation:
In this experiment, we can work with a dataset containing IPL match statistics, which
includes data such as runs scored by teams, wickets taken, players’ performances, and more.
The goal is to use Pandas for data analysis and Matplotlib/Seaborn for visualization.
For this example, let’s assume we have a dataset ipl_matches.csv containing IPL match
data, such as:
• Match Date
• Teams (Home and Away)
• Runs scored by each team
• Wickets taken by each team
• Player performances (runs, wickets, etc.)
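We will perform:
1. Team performance analysis (total runs per team per season)
2. Analysis of team wins over the years
3. Player performance analysis (top batsmen by total runs in a season)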
We can analyze the performance of teams by looking at their total wins, runs, and wickets
across different seasons.
plt.figure(figsize=(12, 6))
for team in team_runs['team1'].unique():
    team_data = team_runs[team_runs['team1'] == team]
    plt.plot(team_data['year'], team_data['team1_runs'], label=team)
Explanation:
• groupby(['year', 'team1']) groups the data by year and team1 to calculate the sum
of runs.
• We plot the total runs scored by each team over the years.
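The grouping and plotting steps described above can be sketched on a small in-memory table. The rows, team names, and run totals here are invented for illustration and stand in for `ipl_matches.csv`, whose exact layout is not shown; only the column names `year`, `team1`, and `team1_runs` come from the listing.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical match records (illustrative only)
matches = pd.DataFrame({
    'year': [2019, 2019, 2019, 2020, 2020, 2020],
    'team1': ['Team A', 'Team B', 'Team A', 'Team A', 'Team B', 'Team B'],
    'team1_runs': [160, 175, 150, 180, 165, 170],
})

# Total runs per team per season
team_runs = (matches.groupby(['year', 'team1'])['team1_runs']
             .sum()
             .reset_index())
print(team_runs)

# One line per team, season on the x-axis
plt.figure(figsize=(12, 6))
for team in team_runs['team1'].unique():
    team_data = team_runs[team_runs['team1'] == team]
    plt.plot(team_data['year'], team_data['team1_runs'], marker='o', label=team)
plt.xlabel('Year')
plt.ylabel('Total Runs')
plt.title('Total Runs by Team Across Seasons')
plt.legend()
plt.savefig("team_runs.png")  # assumed file name
```

With a real file, the `matches` table would instead come from `pd.read_csv('ipl_matches.csv')`, assuming it exposes the same three columns.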
Explanation:
Explanation:
• We group the player data by player name and calculate their total runs.
• We plot the top 10 batsmen with the highest runs for a specific year (2020).
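The two steps described above can be sketched as follows. The table `player_df`, its column names (`year`, `player`, `runs`), and all values are hypothetical, since the original player data is not shown.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical per-innings player records (illustrative only)
player_df = pd.DataFrame({
    'year':   [2020, 2020, 2020, 2020, 2019, 2020],
    'player': ['Player A', 'Player B', 'Player A', 'Player C', 'Player A', 'Player C'],
    'runs':   [45, 60, 30, 80, 99, 10],
})

# Keep one season, total the runs per player, and take the 10 highest totals
season = player_df[player_df['year'] == 2020]
top_batsmen = season.groupby('player')['runs'].sum().nlargest(10)
print(top_batsmen)

# Bar plot of the leading run scorers
ax = top_batsmen.plot(kind='bar', color='steelblue',
                      title='Top Batsmen by Total Runs (2020)')
ax.set_xlabel('Player')
ax.set_ylabel('Total Runs')
plt.tight_layout()
plt.savefig("top_batsmen.png")  # assumed file name
```

`nlargest(10)` returns fewer than ten rows when the season has fewer than ten players, as in this small example.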
Expected Output:
1. Total Runs by Team: A line plot showing the total runs scored by each team across different
seasons.
2. Team Wins: A line plot showing the number of matches won by each team over the years.
3. Top Batsmen: A bar plot showing the top 10 batsmen based on total runs in IPL 2020.
Key Functions for Analysis:
Summary:
In this experiment, we analyzed IPL cricket data to explore team performance, wins over the
years, and top player performances. We used Pandas for data manipulation and Matplotlib for
data visualization. This process provides insights into the performance trends of teams and
players across IPL seasons.
Experiment 10: Analysis and Visualization of Patient Data Set
Objective:
To perform data analysis and visualization on a patient dataset. We will explore patient
demographics, medical history, diagnosis, and other health-related metrics using Pandas and
Matplotlib/Seaborn.
Explanation:
Before analysis, ensure the data is clean (e.g., handle missing values, incorrect data types).
# Drop rows with missing values (or you can fill them with mean/median)
patient_df.dropna(inplace=True)
Explanation:
• We plot a histogram of the age column to visualize the distribution of patients by age.
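The age histogram described above can be sketched as follows. The patient table is synthetic (random ages generated with a fixed seed), and the column name `age`, the bin count, and the styling are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic patient data (illustrative only): 200 ages between 18 and 89
rng = np.random.default_rng(0)
patient_df = pd.DataFrame({'age': rng.integers(18, 90, size=200)})

# Histogram of the age column with 10 equal-width bins
fig, ax = plt.subplots(figsize=(8, 6))
patient_df['age'].plot(kind='hist', bins=10, color='skyblue',
                       edgecolor='black', ax=ax)
ax.set_title('Age Distribution of Patients')
ax.set_xlabel('Age')
ax.set_ylabel('Number of Patients')
fig.savefig("age_distribution.png")  # assumed file name
```

The bin count controls how coarse the picture is; fewer bins smooth out noise, while more bins show finer structure at the cost of emptier bars.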
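Example 2: Gender Distribution
# Plot the gender distribution
gender_counts = patient_df['gender'].value_counts()
plt.figure(figsize=(8, 6))
gender_counts.plot(kind='bar', color='lightgreen')
plt.title('Gender Distribution of Patients')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()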
Explanation:
• We use value_counts() to count the occurrences of each gender and plot the result in a
bar chart.
Example 3: Diagnosis Distribution
# Plot the distribution of diagnoses
diagnosis_counts = patient_df['diagnosis'].value_counts()
plt.figure(figsize=(10, 6))
diagnosis_counts.plot(kind='bar', color='coral')
plt.title('Diagnosis Distribution')
plt.xlabel('Diagnosis')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
Explanation:
• We use value_counts() to count the number of patients in each diagnosis category and
visualize it with a bar plot.
We will explore how BMI, blood pressure, and cholesterol levels are related.
Explanation:
• sns.scatterplot() creates a scatter plot to visualize the relationship between BMI and
blood pressure, color-coded by diagnosis.
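The scatter plot described above can be sketched as follows. The data is synthetic and the column names `bmi` and `blood_pressure`, as well as the diagnosis labels, are assumptions; only `sns.scatterplot()` and the `diagnosis` column are named in the text.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic patient data (illustrative only)
rng = np.random.default_rng(1)
n = 150
bmi = rng.normal(27, 4, size=n)
patient_df = pd.DataFrame({
    'bmi': bmi,
    # Blood pressure loosely increases with BMI, plus noise
    'blood_pressure': 90 + 1.2 * bmi + rng.normal(0, 8, size=n),
    'diagnosis': rng.choice(['Diabetes', 'Hypertension', 'Healthy'], size=n),
})

# One point per patient, colored by diagnosis
fig, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(data=patient_df, x='bmi', y='blood_pressure',
                hue='diagnosis', ax=ax)
ax.set_title('BMI vs. Blood Pressure by Diagnosis')
ax.set_xlabel('BMI')
ax.set_ylabel('Blood Pressure')
fig.savefig("bmi_vs_bp.png")  # assumed file name
```

Because the diagnosis labels are assigned at random here, the colors will not separate into clusters; with real data, visible grouping by color would suggest that diagnosis is related to the two measurements.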
We can calculate and visualize the correlation between numerical features like BMI, blood
pressure, and cholesterol levels.
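A minimal sketch of the correlation step is given here, assuming the heatmap is drawn with Seaborn's `heatmap()`. The data is synthetic and the column names `bmi`, `blood_pressure`, and `cholesterol` are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic patient data (illustrative only)
rng = np.random.default_rng(2)
n = 200
bmi = rng.normal(27, 4, size=n)
patient_df = pd.DataFrame({
    'bmi': bmi,
    'blood_pressure': 90 + 1.2 * bmi + rng.normal(0, 8, size=n),
    'cholesterol': 150 + 2.0 * bmi + rng.normal(0, 20, size=n),
})

# Pairwise Pearson correlations between the numerical columns
corr = patient_df[['bmi', 'blood_pressure', 'cholesterol']].corr()
print(corr)

# Heatmap with the coefficient printed in each cell; the color scale is
# fixed to [-1, 1] so that zero correlation sits at the midpoint
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1, ax=ax)
ax.set_title('Correlation Heatmap')
fig.tight_layout()
fig.savefig("correlation_heatmap.png")  # assumed file name
```

The matrix is symmetric with ones on the diagonal, so only the off-diagonal cells carry information about how the features relate.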
Explanation:
Expected Output:
1. Age Distribution: A histogram showing the distribution of ages in the patient dataset.
2. Gender Distribution: A bar chart showing the number of male and female patients.
3. Diagnosis Distribution: A bar chart showing the count of patients diagnosed with different
conditions.
4. BMI vs. Blood Pressure: A scatter plot showing the relationship between BMI and blood
pressure for different diagnoses.
5. Correlation Heatmap: A heatmap showing the correlations between BMI, blood pressure,
and cholesterol levels.
Summary: