Lab Manual (DAV)

Uploaded by

Paras Chaturvedi


Experiment 1: Introduction to Various Data Visualization Tools

Objective:
To understand and explore different data visualization tools in Python, specifically
Matplotlib, Seaborn, Plotly, and Pandas, and to demonstrate basic plots using each of these
libraries.

Implementation:

1. Matplotlib – A basic plotting library for creating static, animated, and interactive
visualizations.

import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y = [10, 20, 15, 25, 30]

# Line Plot
plt.plot(x, y, marker='o', color='blue', linestyle='-')
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Matplotlib Line Plot")
plt.show()

2. Seaborn – A statistical data visualization library built on top of Matplotlib, useful for
creating informative and attractive graphics.

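import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Sample data
data = {'Category': ['A', 'B', 'C', 'D', 'E'], 'Values': [15, 30, 25, 10, 20]}
df = pd.DataFrame(data)

# Bar Plot
# (seaborn 0.13+ warns when palette is given without hue; also passing
# hue='Category', legend=False avoids the warning)
sns.barplot(x='Category', y='Values', data=df, palette="viridis")
plt.title("Seaborn Bar Plot")
plt.show()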

3. Plotly – An interactive visualization library that supports dynamic plots and
dashboards.

import plotly.express as px

# Sample data
df = px.data.iris() # Built-in iris dataset

# Scatter Plot
fig = px.scatter(df, x='sepal_width', y='sepal_length',
color='species', title="Plotly Scatter Plot")
fig.show()

4. Pandas Plotting – Uses Matplotlib as a backend to create simple visualizations
directly from DataFrames.

import matplotlib.pyplot as plt
import pandas as pd

# Sample data
df = pd.DataFrame({'Category': ['A', 'B', 'C', 'D', 'E'],
                   'Values': [15, 30, 25, 10, 20]})

# Plotting with Pandas
df.plot(kind='bar', x='Category', y='Values', color='orange', legend=False)
plt.title("Pandas Bar Plot")
plt.xlabel("Category")
plt.ylabel("Values")
plt.show()

Expected Output:

• Matplotlib: A simple line plot.

• Seaborn: A colorful bar plot.
• Plotly: An interactive scatter plot.
• Pandas Plotting: A bar plot using Pandas’ built-in plotting capabilities.
Experiment 2: Descriptive Statistics (Mean, Median, Mode, Variance, etc.)

Objective:
To calculate and understand basic descriptive statistics such as mean, median, mode,
variance, and standard deviation.

Implementation Code:

import numpy as np
import pandas as pd
from scipy import stats

# Sample data
data = [12, 15, 14, 10, 18, 20, 12, 14, 13, 15]

# Using NumPy for Mean, Variance, and Standard Deviation

mean = np.mean(data)
variance = np.var(data)
std_dev = np.std(data)

# Using Pandas for Median

median = pd.Series(data).median()

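# Using SciPy for Mode
# (12, 14 and 15 each occur twice; SciPy reports the smallest of the tied values)

# Older SciPy returns the mode as a one-element array, newer SciPy as a scalar;
# np.atleast_1d handles both cases
mode = np.atleast_1d(stats.mode(data).mode)[0]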

# Displaying Results
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print(f"Variance: {variance}")
print(f"Standard Deviation: {std_dev}")

Explanation:

1. Mean: The average of all the numbers.

2. Median: The middle value when the data is sorted in ascending order.
3. Mode: The most frequent value in the data.
4. Variance: The measure of how far the numbers in the dataset are from the mean.
5. Standard Deviation: The square root of the variance, representing how spread out
the numbers are.

Expected Output:

Mean: 14.3
Median: 14.0
Mode: 12
Variance: 7.81
Standard Deviation: 2.79

(Variance and standard deviation are shown rounded to two decimals.)

The values will vary based on the input data, but this code covers the basic steps for
calculating the most commonly used descriptive statistics.
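
np.var() and np.std() use the population formulas by default (divide by n), while pandas' Series.var() and Series.std() default to the sample formulas (divide by n - 1). The sketch below uses only the standard-library statistics module to show both variances on the same data, and lists every tied mode. It is an illustration added for comparison, not part of the original program.

```python
import statistics

data = [12, 15, 14, 10, 18, 20, 12, 14, 13, 15]

# Population variance divides by n (what np.var(data) computes by default)
pop_var = statistics.pvariance(data)

# Sample variance divides by n - 1 (what pd.Series(data).var() computes by default)
sample_var = statistics.variance(data)

# Three values are tied for "most frequent" in this data set
modes = sorted(statistics.multimode(data))

print(f"Population variance: {pop_var:.2f}")   # 7.81
print(f"Sample variance:     {sample_var:.2f}")  # 8.68
print(f"All modes:           {modes}")           # [12, 14, 15]
```

Which variance is appropriate depends on whether the ten values are treated as the whole population or as a sample from a larger one.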
Experiment 3: Hypothesis Testing Using Chi-Square Test

Objective:
To perform a chi-square test of independence to check if two categorical variables are
independent.

Chi-Square Test Explanation:

The chi-square test of independence checks whether there is a significant association between
two categorical variables. The null hypothesis (H₀) assumes that the two variables are
independent, and the alternative hypothesis (H₁) assumes that they are not independent.

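Formula:

χ² = Σ (O - E)² / E

where the sum runs over every cell of the contingency table, O is the observed count in a cell, and E is the count expected in that cell if the variables are independent.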

Implementation Code:
from scipy.stats import chi2_contingency
import numpy as np

# Sample data in a contingency table (rows represent categories of one
# variable, columns represent categories of the second variable)
# Example: Relationship between Gender and Preference for a product
# (Male/Female vs. Likes/Dislikes)
data = np.array([[30, 10],   # 30 Males like, 10 Males dislike
                 [20, 40]])  # 20 Females like, 40 Females dislike

# Perform Chi-Square test
# (for a 2x2 table, SciPy applies Yates' continuity correction by default)
chi2, p, dof, expected = chi2_contingency(data)

# Displaying Results
print(f"Chi-Square Statistic: {chi2}")
print(f"p-value: {p}")
print(f"Degrees of Freedom: {dof}")
print(f"Expected Frequencies:\n{expected}")

# Interpretation based on p-value
alpha = 0.05
if p < alpha:
    print("\nReject the null hypothesis: There is a significant association between the variables.")
else:
    print("\nFail to reject the null hypothesis: There is no significant association between the variables.")

Explanation:

• Observed Frequencies (O): The actual count of occurrences in each category.

• Expected Frequencies (E): The frequencies we would expect if the null hypothesis is
true (i.e., the variables are independent).
• Chi-Square Statistic: The test statistic that measures the difference between observed
and expected values.
• Degrees of Freedom: This is calculated as (number of rows - 1) * (number of
columns - 1).
• p-value: The probability, assuming the null hypothesis is true, of obtaining a chi-square
statistic at least as large as the one observed.
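
The quantities in this list can be worked out by hand for the 2x2 table used in this experiment. The sketch below uses plain Python only; the variable names are illustrative. It computes the statistic both without and with Yates' continuity correction, since SciPy's chi2_contingency applies that correction by default when there is one degree of freedom.

```python
observed = [[30, 10],   # males: like, dislike
            [20, 40]]   # females: like, dislike

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count per cell under independence: row total * column total / grand total
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pair every observed count with its expected count
pairs = [(o, e)
         for o_row, e_row in zip(observed, expected)
         for o, e in zip(o_row, e_row)]

chi2_plain = sum((o - e) ** 2 / e for o, e in pairs)
chi2_yates = sum((abs(o - e) - 0.5) ** 2 / e for o, e in pairs)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

print(expected)              # [[20.0, 20.0], [30.0, 30.0]]
print(round(chi2_plain, 2))  # 16.67
print(round(chi2_yates, 2))  # 15.04
print(dof)                   # 1
```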

Expected Output:
Chi-Square Statistic: 15.04 (rounded)
p-value: about 0.0001
Degrees of Freedom: 1
Expected Frequencies:
[[20. 20.]
 [30. 30.]]

Reject the null hypothesis: There is a significant association between the variables.

In this case, the p-value is less than 0.05, meaning we reject the null hypothesis and conclude
that there is a significant association between gender and product preference.
Experiment 4: Regression Analysis, Fitting a Linear Model, and Making
Predictions

Objective:
To perform regression analysis by fitting a linear model to the data and making predictions
based on the fitted model.

Explanation:

In linear regression, the goal is to fit a model that predicts a target variable (dependent
variable) based on one or more input variables (independent variables). In simple linear
regression, the model is of the form:

y = mx + b

Where:

• y = dependent variable (target)

• x = independent variable (predictor)
• m = slope of the line
• b = y-intercept
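
For a single predictor, the least-squares slope and intercept have a closed form: m = Sxy / Sxx and b = ȳ - m·x̄, where Sxy is the sum of products of deviations from the means and Sxx is the sum of squared deviations of x. The sketch below applies it to the five sample points used in this experiment's implementation, as a hand check of what the fitted model should report. Variable names are illustrative.

```python
xs = [1, 2, 3, 4, 5]
ys = [1, 2, 1.3, 3.75, 2.25]

x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)

# Sum of cross-deviations and sum of squared x-deviations
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
s_xx = sum((x - x_mean) ** 2 for x in xs)

m = s_xy / s_xx          # slope
b = y_mean - m * x_mean  # intercept

print(f"m = {m:.3f}, b = {b:.3f}")           # m = 0.425, b = 0.785
print(f"prediction at x = 6: {m * 6 + b:.3f}")  # 3.335
```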

Implementation Code:
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

# Sample data (Independent Variable: X, Dependent Variable: y)

# Reshaping to a 2D array for sklearn
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([1, 2, 1.3, 3.75, 2.25])

# Create a linear regression model and fit it

model = LinearRegression()
model.fit(X, y)

# Making predictions
y_pred = model.predict(X)

# Output the coefficients and intercept

print(f"Coefficient (m): {model.coef_}")
print(f"Intercept (b): {model.intercept_}")

# Visualizing the results

plt.scatter(X, y, color='blue', label='Actual data')
plt.plot(X, y_pred, color='red', label='Fitted line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression - Fitting a Model')
plt.legend()
plt.show()
# Making predictions for new values
new_X = np.array([6, 7]).reshape(-1, 1)
new_predictions = model.predict(new_X)
print(f"Predictions for new values: {new_predictions}")

Explanation of the Code:

1. X and y: The independent and dependent variables. We define X as a 2D array (for
sklearn compatibility) and y as a 1D array of target values.
2. LinearRegression(): The model is created using Scikit-learn’s LinearRegression()
class.
3. model.fit(X, y): The model is trained using the data (X, y).
4. model.predict(X): Once the model is trained, we use predict() to generate the
predicted values (y_pred) based on the input X.
5. Visualization: A scatter plot is created to show the actual data, and the fitted line is
plotted on top of it.
Expected Output:
Coefficient (m): [0.425]
Intercept (b): 0.785
Predictions for new values: [3.335 3.76]

(Values are shown rounded to three decimals.)

In the plot, you’ll see a scatter plot of the data points (blue) and the fitted regression line
(red). The output will include the coefficient (slope) and intercept of the line, as well as the
predictions for new values of X.
Experiment 5: Plotly and Visualization Libraries in Python

Objective:
To explore and create interactive visualizations using Plotly, and compare it with other
popular visualization libraries like Matplotlib and Seaborn.

Explanation:

Plotly is a powerful library for creating interactive plots. Unlike static plots generated by
Matplotlib or Seaborn, Plotly allows you to create visualizations that users can interact with,
such as zooming, hovering, and clicking.

1. Plotly - Interactive Visualizations

Example 1: Scatter Plot

import plotly.express as px

# Sample data (Iris dataset)

df = px.data.iris()

# Scatter Plot
fig = px.scatter(df, x='sepal_width', y='sepal_length', color='species',
title="Interactive Scatter Plot (Plotly)")
fig.show()

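Expected Output:
An interactive scatter plot of sepal width against sepal length, with points colored by species.

Example 2: Line Plot

import pandas as pd
import plotly.express as px

# Sample data for Line Plot
df_line = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [10, 15, 13, 18, 20]
})

# Line Plot
fig_line = px.line(df_line, x='x', y='y', title="Interactive Line Plot (Plotly)")
fig_line.show()

Expected Output:
An interactive line plot of y against x.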

2. Matplotlib - Static Visualizations

Example: Line Plot

import matplotlib.pyplot as plt

# Sample data for Line Plot

x = [1, 2, 3, 4, 5]
y = [10, 15, 13, 18, 20]

# Line Plot
plt.plot(x, y, marker='o', color='blue', linestyle='-', label='Data')
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Matplotlib Line Plot")
plt.legend()
plt.show()
Expected Output:

Matplotlib is great for static visualizations, and this simple line plot shows how data can be
represented.

3. Seaborn - Statistical Plots

Example: Scatter Plot with Regression Line

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Sample data for Scatter Plot

df_seaborn = pd.DataFrame({
"x": [1, 2, 3, 4, 5],
"y": [10, 15, 13, 18, 20]
})

# Scatter Plot with Regression Line

sns.regplot(x='x', y='y', data=df_seaborn, scatter_kws={"color": "blue"},
line_kws={"color": "red"})
plt.title("Seaborn Scatter Plot with Regression Line")
plt.show()
Expected Output:

Seaborn is a statistical visualization library built on top of Matplotlib that automatically
handles many plot aesthetics and includes statistical features.

Key Differences:

• Plotly: Interactive, web-based visualizations suitable for dashboards and presentations.

• Matplotlib: Static visualizations that are useful for printed reports or simple visualizations in
scripts.
• Seaborn: Built on top of Matplotlib, it provides enhanced statistical plots and aesthetic
improvements.
Experiment 6: Applying Matplotlib in Python

Objective:
To create visualizations using Matplotlib, a powerful plotting library in Python.

Explanation:

Matplotlib is used for creating static, animated, and interactive visualizations in Python. It is
highly customizable, allowing users to create a wide variety of plots such as line plots, bar
plots, histograms, and more.

Implementation Examples using Matplotlib

Example 1: Line Plot

A line plot is useful for visualizing the relationship between two continuous variables.

import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# Line Plot
plt.plot(x, y, marker='o', color='b', linestyle='-', label='Line 1')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Simple Line Plot using Matplotlib')
plt.legend()
plt.grid(True)
plt.show()
Explanation:

• plt.plot() is used to create the line plot.

• xlabel(), ylabel(), and title() add labels to the axes and a title.
• legend() adds a legend to the plot, and grid() adds grid lines.

Example 2: Bar Plot

A bar plot is used to compare different categories. It displays data with rectangular bars.

# Sample data
categories = ['A', 'B', 'C', 'D']
values = [5, 7, 3, 6]

# Bar Plot
plt.bar(categories, values, color='orange')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Bar Plot using Matplotlib')
plt.show()

Explanation:

• plt.bar() creates a bar plot.

• We pass the categories on the x-axis and the corresponding values on the y-axis.

Example 3: Histogram

A histogram is used to represent the distribution of a dataset.

import numpy as np

# Generate random data

# 1000 random numbers from a normal distribution (mean 0, standard deviation 1)
data = np.random.normal(0, 1, 1000)

# Histogram
plt.hist(data, bins=30, color='g', edgecolor='black')
plt.xlabel('Data Range')
plt.ylabel('Frequency')
plt.title('Histogram using Matplotlib')
plt.show()

Explanation:

• np.random.normal() generates random data from a normal distribution.

• plt.hist() creates the histogram. We specify the number of bins and the color.
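
To see the numbers behind the bars, the same kind of binning can be inspected with np.histogram(), which returns the count in each bin and the bin edges. The sketch below is illustrative; the seed is an arbitrary choice so that repeated runs give the same data.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for repeatable output
data = rng.normal(0, 1, 1000)

# counts[i] is the number of samples between edges[i] and edges[i + 1]
counts, edges = np.histogram(data, bins=30)

print(len(counts))    # 30 bins
print(len(edges))     # 31 edges, one more than the number of bins
print(counts.sum())   # 1000, every sample falls in exactly one bin
```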
Example 4: Pie Chart

A pie chart is useful for showing the proportions of a whole.

# Sample data
labels = ['Apple', 'Banana', 'Cherry', 'Date']
sizes = [15, 25, 35, 25]

# Pie Chart
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90,
colors=['red', 'yellow', 'pink', 'brown'])
plt.title('Fruit Distribution using Pie Chart')
plt.show()

Explanation:

• plt.pie() creates the pie chart.

• autopct is used to display percentages.
• startangle rotates the chart for better visual alignment.
Expected Output:

1. Line Plot: A plot of y = 2x, with a line connecting the points.

2. Bar Plot: A bar chart comparing values for different categories (A, B, C, D).
3. Histogram: A plot showing the distribution of the randomly generated data.
4. Pie Chart: A pie chart showing the proportions of different fruits in the dataset.

Key Matplotlib Functions:

• plt.plot(): Creates line plots.

• plt.bar(): Creates bar charts.
• plt.hist(): Creates histograms.
• plt.pie(): Creates pie charts.
• plt.xlabel(), plt.ylabel(): Add labels to axes.
• plt.title(): Adds a title to the plot.
• plt.legend(): Adds a legend to the plot.
• plt.show(): Displays the plot.

Matplotlib offers extensive customization options for each type of plot, allowing you to tailor
the visualizations to your needs.
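
One common customization is placing several plots in a single figure. The sketch below uses Matplotlib's object-oriented interface (plt.subplots() and Axes methods) to put the line plot and the bar plot from this experiment side by side. The Agg backend and the output file name are assumptions made so the script also runs without a display; remove the backend line and call plt.show() for an interactive window.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove for an interactive window
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
categories = ['A', 'B', 'C', 'D']
values = [5, 7, 3, 6]

# One figure with two Axes arranged in a single row
fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

ax_line.plot(x, y, marker='o', color='b')
ax_line.set_xlabel('X-axis')
ax_line.set_ylabel('Y-axis')
ax_line.set_title('Line Plot')
ax_line.grid(True)

ax_bar.bar(categories, values, color='orange')
ax_bar.set_xlabel('Categories')
ax_bar.set_ylabel('Values')
ax_bar.set_title('Bar Plot')

fig.tight_layout()
fig.savefig('matplotlib_examples.png')  # hypothetical output file name
```

Each Axes object carries its own labels and title, which avoids the ambiguity of plt.xlabel() and plt.title() acting on whichever plot happens to be current.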
Experiment 7: Data Visualization Using Pandas

Objective:
To explore data visualization capabilities using Pandas, which has built-in support for basic
plotting through Matplotlib. Pandas makes it easy to visualize data directly from DataFrames.

Explanation:

Pandas allows you to create simple yet powerful visualizations like line plots, bar charts,
histograms, and more, all using data from a Pandas DataFrame. Below are three examples to
demonstrate this functionality.

Example 1: Line Plot

A line plot is useful for visualizing trends over time or continuous data.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Sample data
data = {'Year': [2015, 2016, 2017, 2018, 2019],
        'Sales': [150, 200, 250, 300, 350]}

df = pd.DataFrame(data)

# Line Plot
df.plot(x='Year', y='Sales', kind='line', marker='o', title="Sales Over Years")
plt.xlabel('Year')
plt.ylabel('Sales')
plt.show()
Explanation:

• df.plot() creates a plot using DataFrame columns.

• kind='line' specifies that it's a line plot.
• x='Year' and y='Sales' specify which columns to use for the x and y axes.

Example 2: Bar Plot

A bar plot is useful for comparing categories.

data = {'Category': ['A', 'B', 'C', 'D'],

'Values': [23, 45, 56, 78]}

df = pd.DataFrame(data)

# Bar Plot
df.plot(x='Category', y='Values', kind='bar', color='skyblue',
title="Category Values")
plt.xlabel('Category')
plt.ylabel('Values')
plt.show()

Explanation:

• kind='bar' specifies that we are creating a bar plot.

• This plot compares the values for different categories (A, B, C, D).
Example 3: Histogram

Histograms are used to visualize the distribution of numeric data.

# Generate random data for Histogram
data = np.random.randn(1000)

# Create a DataFrame
df = pd.DataFrame(data, columns=['Value'])

# Histogram
df.plot(kind='hist', bins=30, color='orange', title="Distribution of Random Data")
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

Explanation:

• kind='hist' specifies that we are creating a histogram.

• bins=30 controls the number of bins in the histogram.
• We use randomly generated data to visualize its distribution.
Expected Output:

1. Line Plot: A line plot showing sales trends over the years.
2. Bar Plot: A bar chart comparing values for categories A, B, C, and D.
3. Histogram: A histogram showing the distribution of random data.

Key Pandas Plotting Functions:

• df.plot(kind='line'): Line plot (default type).

• df.plot(kind='bar'): Bar chart.
• df.plot(kind='hist'): Histogram.
• You can also use df.plot(x='col_name', y='col_name') to specify which
columns to plot.
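
df.plot() returns the Matplotlib Axes it drew on, so labels and titles can also be set on that object instead of through plt. The sketch below repeats the bar plot from Example 2 in that style. The Agg backend and the file name are assumptions for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove for an interactive window
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'C', 'D'],
                   'Values': [23, 45, 56, 78]})

# df.plot() hands back the Axes object, which can be customized directly
ax = df.plot(x='Category', y='Values', kind='bar', color='skyblue', legend=False)
ax.set_xlabel('Category')
ax.set_ylabel('Values')
ax.set_title('Category Values')

ax.figure.savefig('pandas_bar.png')  # hypothetical output file name
```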

Pandas makes it easy to visualize data with minimal code, which is especially useful for
quick exploratory analysis.
Experiment 8: Working with Time Series Data in Python

Objective:
To demonstrate how to work with time series data, including loading, plotting, and analyzing
time series data using Python.

Explanation:

Time series data consists of observations taken at specific time intervals. In Python, we often
use libraries like Pandas and Matplotlib to handle and visualize time series data. Pandas
provides the datetime functionality to handle dates and times efficiently.

Here’s a simple workflow for working with time series data:

1. Loading Time Series Data: Often from a CSV or directly from a generated series.
2. Plotting: Visualizing the data to detect trends, seasonality, etc.
3. Basic Analysis: Handling missing values, resampling, etc.

Implementation Code:

1. Generating and Plotting Time Series Data

Let's start with generating a simple time series dataset and plotting it.

import pandas as pd
import matplotlib.pyplot as plt

# Generate a date range

dates = pd.date_range('2020-01-01', periods=100, freq='D')

# Generate some sample data: a steadily increasing series 1, 2, ..., 100
# (real data, e.g. stock prices, would normally be loaded from a file)

data = pd.Series(range(1, 101), index=dates)

# Plot the time series data

data.plot(title="Time Series Data Example", figsize=(10, 6))
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()
Explanation:

• pd.date_range() creates a range of dates starting from 2020-01-01 for 100 days.
• pd.Series() creates the time series with the dates as the index.
• plot() is used to plot the time series data.

2. Resampling Time Series Data

Sometimes, time series data is collected at high frequencies (e.g., minute-level data). We
might want to resample the data to a lower frequency (e.g., daily or monthly averages).

# Resample the data to monthly frequency and take the mean
# (pandas 2.2+ prefers the alias 'ME' for month-end and warns on 'M')

monthly_data = data.resample('M').mean()

# Plot the resampled data

monthly_data.plot(title="Resampled Monthly Data", figsize=(10, 6))
plt.xlabel('Date')
plt.ylabel('Mean Value')
plt.show()
Explanation:

• .resample('M') resamples the data to a monthly frequency (M = Month).

• .mean() calculates the mean for each month.
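
The monthly means for the generated series can be checked with an equivalent grouping: converting each date to its calendar month with to_period('M') and averaging within each month. This is a sketch for verification rather than a replacement for resample(); unlike resample(), the result is labeled by month period instead of month-end date.

```python
import pandas as pd

dates = pd.date_range('2020-01-01', periods=100, freq='D')
data = pd.Series(range(1, 101), index=dates)

# Group the daily values by calendar month and average each group
monthly_mean = data.groupby(data.index.to_period('M')).mean()

print(monthly_mean.tolist())  # [16.0, 46.0, 76.0, 96.0]
```

The 100 days cover January to 9 April 2020, so the last value is the mean of only nine days (92 to 100).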

3. Handling Missing Data in Time Series

Time series data often contains missing values. Here's how you can handle them.

# Introduce missing values (NaN) into the data
# (cast to float first, because an integer Series cannot hold NaN)

data_with_nan = data.astype(float)
data_with_nan.iloc[::10] = float('nan')  # Introduce NaN at every 10th point

# Fill missing values using forward fill method

filled_data = data_with_nan.ffill()

# Plot the original and filled data

plt.figure(figsize=(10, 6))
plt.plot(data_with_nan, label='Original with NaN', color='red',
linestyle='--')
plt.plot(filled_data, label='Filled Data', color='green', linestyle='-')
plt.legend()
plt.title("Handling Missing Data in Time Series")
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()
Explanation:

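• ffill() is a method to forward fill the missing values (propagates the last valid
observation). A missing value at the very start of the series has no earlier observation to
copy, so it stays missing.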
• We also plot the original data with missing values and the filled data for comparison.
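
Forward fill is only one option. For data that changes smoothly, Series.interpolate() (linear by default) may be a better fit because it estimates values between the neighboring observations instead of repeating the last one. The small series below is made up to show the difference.

```python
import pandas as pd

# Made-up series with gaps, for illustration only
s = pd.Series([1.0, None, 3.0, None, None, 6.0])

forward_filled = s.ffill()       # repeat the last valid value
interpolated = s.interpolate()   # draw a straight line between valid neighbors

print(forward_filled.tolist())   # [1.0, 1.0, 3.0, 3.0, 3.0, 6.0]
print(interpolated.tolist())     # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```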

Expected Output:

1. Time Series Plot: A line plot showing the generated time series data.
2. Resampled Data Plot: A line plot showing monthly mean values of the time series data.
3. Missing Data Handling: A plot showing the original data with missing values and the filled
data.

Key Pandas Functions for Time Series:

• pd.date_range(): Generate a range of dates.

• Series.resample(): Resample the data to different frequencies (e.g., 'D' for daily, 'M' for
monthly).
• Series.ffill(): Forward fill missing values.
• Series.plot(): Plot the time series data.

Summary:

In this example, we demonstrated how to generate a simple time series, plot it, resample the
data to a different frequency, and handle missing values. These are some of the common
tasks when working with time series data in Python.
Experiment 9: Analysis and Visualization of IPL Cricket Data

Objective:
To perform data analysis and visualize the performance of teams and players in the Indian
Premier League (IPL) cricket tournament. The focus will be on exploring various statistics,
such as runs, wickets, and team performance over the years.

Explanation:

In this experiment, we can work with a dataset containing IPL match statistics, which
includes data such as runs scored by teams, wickets taken, players’ performances, and more.
The goal is to use Pandas for data analysis and Matplotlib/Seaborn for visualization.

For this example, let’s assume we have a dataset ipl_matches.csv containing IPL match
data, such as:

• Match Date
• Teams (Home and Away)
• Runs scored by each team
• Wickets taken by each team
• Player performances (runs, wickets, etc.)

We will perform:

1. Data cleaning and processing.

2. Analysis of team performance over the years.
3. Visualization of match statistics and player performances.

Step 1: Load the IPL Dataset

import pandas as pd

# Load IPL data (assuming a dataset in CSV format)

# Replace with the path to your dataset
ipl_df = pd.read_csv('ipl_matches.csv')

# Show the first few rows of the dataset

print(ipl_df.head())

Assuming the dataset includes columns like:

• date: Date of the match

• team1: Home team
• team2: Away team
• team1_runs: Runs scored by team1
• team2_runs: Runs scored by team2
• team1_wickets: Wickets taken by team1
• team2_wickets: Wickets taken by team2
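
To try the analysis without the real file, a tiny stand-in frame with the assumed columns can be built by hand. The rows below are made up purely for illustration and are not real IPL results. The sketch also shows one way to total runs per team and year when a team can appear as either team1 or team2: stack both sides into one row per team per match, then group.

```python
import pandas as pd

# Made-up rows for illustration only; not real IPL results
ipl_df = pd.DataFrame({
    'date': ['2019-04-01', '2019-04-05', '2020-09-20', '2020-09-25'],
    'team1': ['Team A', 'Team B', 'Team A', 'Team C'],
    'team2': ['Team B', 'Team C', 'Team C', 'Team B'],
    'team1_runs': [160, 150, 170, 140],
    'team2_runs': [155, 152, 165, 141],
})
ipl_df['year'] = pd.to_datetime(ipl_df['date']).dt.year

# One row per team per match, regardless of which side the team was listed on
home = ipl_df[['year', 'team1', 'team1_runs']].rename(
    columns={'team1': 'team', 'team1_runs': 'runs'})
away = ipl_df[['year', 'team2', 'team2_runs']].rename(
    columns={'team2': 'team', 'team2_runs': 'runs'})

team_runs = pd.concat([home, away]).groupby(['year', 'team'])['runs'].sum()
print(team_runs)
```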
Step 2: Analyzing Team Performance Over the Years

We can analyze the performance of teams by looking at their total wins, runs, and wickets
across different seasons.

Example: Total Runs Scored by Each Team Over Time

import matplotlib.pyplot as plt

# Extract the year from the match date
ipl_df['year'] = pd.to_datetime(ipl_df['date']).dt.year

# Stack both sides into one table with one row per team per match, so a team's
# runs are counted whether it is listed as team1 or team2
home = ipl_df[['year', 'team1', 'team1_runs']].rename(
    columns={'team1': 'team', 'team1_runs': 'runs'})
away = ipl_df[['year', 'team2', 'team2_runs']].rename(
    columns={'team2': 'team', 'team2_runs': 'runs'})

# Total runs scored by each team in each year
team_runs = (pd.concat([home, away])
             .groupby(['year', 'team'])['runs'].sum()
             .reset_index())  # Reset index for easier plotting

# Plot the data
plt.figure(figsize=(12, 6))
for team in team_runs['team'].unique():
    team_data = team_runs[team_runs['team'] == team]
    plt.plot(team_data['year'], team_data['runs'], label=team)

plt.title('Total Runs Scored by Each Team Over the Years')
plt.xlabel('Year')
plt.ylabel('Total Runs')
plt.legend()
plt.grid(True)
plt.show()

Explanation:

• The team1 and team2 columns are stacked into a single team column, so every match
contributes one row per team.
• groupby(['year', 'team']) groups the data by year and team to calculate the sum of runs.
• We plot the total runs scored by each team over the years.

Step 3: Team Wins and Performance

Example: Number of Matches Won by Each Team

# Count the number of wins for each team in each year.
# Assumption: the dataset has a 'winner' column holding the name of the winning
# team. A 'result' column that only says 'win' cannot tell which of team1 or
# team2 won, so it is not enough on its own.
team_wins = (ipl_df.groupby(['year', 'winner'])
             .size()
             .reset_index(name='wins'))  # Reset index for easier plotting

# Plot the number of wins for each team
plt.figure(figsize=(12, 6))
for team in team_wins['winner'].unique():
    win_data = team_wins[team_wins['winner'] == team]
    plt.plot(win_data['year'], win_data['wins'], label=team)

plt.title('Number of Matches Won by Each Team Over the Years')
plt.xlabel('Year')
plt.ylabel('Matches Won')
plt.legend()
plt.grid(True)
plt.show()

Explanation:

• We count the number of wins by grouping by year and winning team; size() counts the
matches in each group.
• We plot the number of matches won by each team over the years.

Step 4: Player Performance Analysis

Example: Runs Scored by Top Batsmen in a Specific Year

# Assume we have player performance data in the columns 'player_name',
# 'runs', 'year'
player_data = ipl_df[['year', 'player_name', 'runs']]

# Filter for a specific year (e.g., 2020)
player_data_2020 = player_data[player_data['year'] == 2020]

# Group by player and sum the runs
top_batsmen_2020 = (player_data_2020.groupby('player_name')['runs']
                    .sum()
                    .sort_values(ascending=False))

# Plot the top 10 batsmen of 2020

top_batsmen_2020.head(10).plot(kind='bar', figsize=(10, 6),
color='skyblue')
plt.title('Top 10 Batsmen in IPL 2020')
plt.xlabel('Player Name')
plt.ylabel('Total Runs')
plt.xticks(rotation=45)
plt.show()

Explanation:

• We group the player data by player name and calculate their total runs.
• We plot the top 10 batsmen with the highest runs for a specific year (2020).

Expected Output:

1. Total Runs by Team: A line plot showing the total runs scored by each team across different
seasons.
2. Team Wins: A line plot showing the number of matches won by each team over the years.
3. Top Batsmen: A bar plot showing the top 10 batsmen based on total runs in IPL 2020.
Key Functions for Analysis:

• groupby(): Group data by one or more columns.

• sum(), count(): Aggregate the data by summing or counting the occurrences.
• sort_values(): Sort the data based on a column.
• plot(): Plot the data using Matplotlib.
• reset_index(): Convert a multi-index dataframe back to a regular dataframe for easier
plotting.

Summary:

In this experiment, we analyzed IPL cricket data to explore team performance, wins over the
years, and top player performances. We used Pandas for data manipulation and Matplotlib for
data visualization. This process provides insights into the performance trends of teams and
players across IPL seasons.
Experiment 10: Analysis and Visualization of Patient Data Set

Objective:
To perform data analysis and visualization on a patient dataset. We will explore patient
demographics, medical history, diagnosis, and other health-related metrics using Pandas and
Matplotlib/Seaborn.

Explanation:

A typical patient dataset might contain the following columns:

• Patient ID: Unique identifier for each patient.

• Age: Age of the patient.
• Gender: Gender of the patient.
• BMI (Body Mass Index): Health indicator.
• Blood Pressure: Health indicator.
• Diagnosis: The primary diagnosis (e.g., 'Hypertension', 'Diabetes', 'Healthy').
• Treatment: Type of treatment received.
• Cholesterol Level: A health indicator.

Our goal will be to:

1. Load and clean the dataset.

2. Analyze and visualize basic statistics, like age distribution, gender ratio, common diagnoses,
etc.
3. Explore correlations between health indicators like BMI, blood pressure, and cholesterol.

Step 1: Loading the Patient Dataset

import pandas as pd

# Load the patient dataset (assuming a CSV file)

# Replace with the actual path to your dataset
patient_df = pd.read_csv('patient_data.csv')

# Show the first few rows of the dataset

print(patient_df.head())

Assuming the dataset includes columns like:

• patient_id, age, gender, bmi, blood_pressure, cholesterol_level, diagnosis, etc.

Step 2: Data Cleaning and Preprocessing

Before analysis, ensure the data is clean (e.g., handle missing values, incorrect data types).

# Check for missing values

print(patient_df.isnull().sum())

# Drop rows with missing values (or you can fill them with mean/median)
patient_df.dropna(inplace=True)

# Convert columns to appropriate data types (e.g., age as int)

patient_df['age'] = patient_df['age'].astype(int)
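
Dropping rows discards every other value in those rows. As the comment above notes, filling is an alternative. The sketch below fills numeric columns with their median and a categorical column with its most frequent value. The small frame is made up for illustration; the column names follow the ones assumed in this experiment.

```python
import pandas as pd

# Made-up patient rows with gaps, for illustration only
patient_df = pd.DataFrame({
    'age': [34, 51, None, 45],
    'bmi': [22.5, None, 27.1, 30.2],
    'gender': ['F', 'M', 'M', None],
})

# Numeric columns: fill each gap with that column's median
numeric_cols = ['age', 'bmi']
patient_df[numeric_cols] = patient_df[numeric_cols].fillna(
    patient_df[numeric_cols].median())

# Categorical column: fill gaps with the most frequent value
patient_df['gender'] = patient_df['gender'].fillna(patient_df['gender'].mode()[0])

# Age can be stored as an integer once no values are missing
patient_df['age'] = patient_df['age'].astype(int)

print(patient_df)
```

The median is less sensitive to outliers than the mean, which is why it is often preferred for skewed measurements.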

Step 3: Analyzing Demographics

We will explore the distribution of age, gender, and diagnosis categories.

Example 1: Age Distribution

import matplotlib.pyplot as plt

# Plot the distribution of age

plt.figure(figsize=(10, 6))
patient_df['age'].plot(kind='hist', bins=20, color='skyblue',
edgecolor='black')
plt.title('Age Distribution of Patients')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

Explanation:

• We plot a histogram of the age column to visualize the distribution of patients by age.

Example 2: Gender Distribution

# Plot the gender distribution
gender_counts = patient_df['gender'].value_counts()

plt.figure(figsize=(8, 6))
gender_counts.plot(kind='bar', color='lightgreen')
plt.title('Gender Distribution of Patients')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()

Explanation:

• We use value_counts() to count the occurrences of each gender and plot the result in a
bar chart.
Example 3: Diagnosis Distribution
# Plot the distribution of diagnoses
diagnosis_counts = patient_df['diagnosis'].value_counts()

plt.figure(figsize=(10, 6))
diagnosis_counts.plot(kind='bar', color='coral')
plt.title('Diagnosis Distribution')
plt.xlabel('Diagnosis')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

Explanation:

• We use value_counts() to count the number of patients in each diagnosis category and
visualize it with a bar plot.

Step 4: Exploring Relationships Between Health Indicators

We will explore how BMI, blood pressure, and cholesterol levels are related.

Example 4: BMI vs. Blood Pressure

import seaborn as sns

# Scatter plot of BMI vs. Blood Pressure

plt.figure(figsize=(10, 6))
sns.scatterplot(x='bmi', y='blood_pressure', data=patient_df,
hue='diagnosis', palette='Set1')
plt.title('BMI vs Blood Pressure')
plt.xlabel('BMI')
plt.ylabel('Blood Pressure')
plt.legend(title='Diagnosis')
plt.show()

Explanation:

• sns.scatterplot() creates a scatter plot to visualize the relationship between BMI and
blood pressure, color-coded by diagnosis.

Example 5: Correlation Heatmap

We can calculate and visualize the correlation between numerical features like BMI, blood
pressure, and cholesterol levels.

# Calculate the correlation matrix

correlation_matrix = patient_df[['bmi', 'blood_pressure',
'cholesterol_level']].corr()

# Plot the correlation heatmap

plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1,
vmax=1, fmt='.2f')
plt.title('Correlation Heatmap of Health Indicators')
plt.show()

Explanation:

• corr() computes the correlation matrix between numeric columns.

• sns.heatmap() visualizes the correlation matrix using a heatmap.
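
By default, corr() computes the Pearson correlation coefficient for each pair of columns: the sum of products of deviations from the means, divided by the square root of the product of the two sums of squared deviations. The sketch below computes one such coefficient by hand. The function name and the sample values are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Made-up values, for illustration only
bmi = [22.0, 25.0, 28.0, 31.0]
blood_pressure = [115.0, 120.0, 130.0, 135.0]

r = pearson_r(bmi, blood_pressure)
print(f"r = {r:.3f}")  # r = 0.990
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little or no linear relationship.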

Expected Output:

1. Age Distribution: A histogram showing the distribution of ages in the patient dataset.
2. Gender Distribution: A bar chart showing the number of male and female patients.
3. Diagnosis Distribution: A bar chart showing the count of patients diagnosed with different
conditions.
4. BMI vs. Blood Pressure: A scatter plot showing the relationship between BMI and blood
pressure for different diagnoses.
5. Correlation Heatmap: A heatmap showing the correlations between BMI, blood pressure,
and cholesterol levels.

Key Functions Used:

• value_counts(): To count occurrences of unique values.

• plot(kind='hist'): To create a histogram for numerical data.
• sns.scatterplot(): To create a scatter plot with two continuous variables.
• sns.heatmap(): To plot a heatmap for the correlation matrix.

Summary:

In this experiment, we analyzed a patient dataset to explore demographics, health indicators,
and their relationships. Using Pandas for data manipulation and Seaborn/Matplotlib for
visualization, we gained insights into patient age distribution, gender ratio, common
diagnoses, and correlations between health metrics.

No ratings yet
Data Science Lab Experiments
32 pages
Lab Experiments Vi Sem-1
No ratings yet
Lab Experiments Vi Sem-1
10 pages
AD3411 (2)
No ratings yet
AD3411 (2)
28 pages
ml2020 Pythonlab02
No ratings yet
ml2020 Pythonlab02
3 pages
ML LAB(R22) MANUAL (4)
No ratings yet
ML LAB(R22) MANUAL (4)
25 pages
DV Lab Manual 2022-23
No ratings yet
DV Lab Manual 2022-23
10 pages
Rahul ML file'[1] 2
No ratings yet
Rahul ML file'[1] 2
30 pages
KJD ML File
No ratings yet
KJD ML File
45 pages
PML Ex3
No ratings yet
PML Ex3
20 pages
AD3411 - 1 To 5
No ratings yet
AD3411 - 1 To 5
11 pages
CS F320 - Assignment II - Draft (Subject to a Few Changes in the Description of Problems)
No ratings yet
CS F320 - Assignment II - Draft (Subject to a Few Changes in the Description of Problems)
12 pages
fdsa lab manual final
No ratings yet
fdsa lab manual final
70 pages
lecture4
No ratings yet
lecture4
60 pages
Machine Learning Lab File: Submitted To: Submitted by
No ratings yet
Machine Learning Lab File: Submitted To: Submitted by
9 pages
DAV Guidelines
No ratings yet
DAV Guidelines
4 pages
Principles of AI Laboratory Varshadr
No ratings yet
Principles of AI Laboratory Varshadr
54 pages
FDS Lab 1 Manuel .1..1new
No ratings yet
FDS Lab 1 Manuel .1..1new
34 pages
Data Science
No ratings yet
Data Science
18 pages
DSA lab manual pgms_fINAL
No ratings yet
DSA lab manual pgms_fINAL
34 pages
MACHINE LEARNING LAB WORD 12-1-2025. DOCUMENT
No ratings yet
MACHINE LEARNING LAB WORD 12-1-2025. DOCUMENT
68 pages
Lab 11,12 - Copy
No ratings yet
Lab 11,12 - Copy
7 pages
Data Science Manual
No ratings yet
Data Science Manual
16 pages
Ad3411 - Student
No ratings yet
Ad3411 - Student
27 pages
AIML Assignment_merged
No ratings yet
AIML Assignment_merged
7 pages
Experiment Number: 3: Aim:-Study of The Linear Regression in The Machine Learning Using The Boston Housing Dataset. 1)
No ratings yet
Experiment Number: 3: Aim:-Study of The Linear Regression in The Machine Learning Using The Boston Housing Dataset. 1)
14 pages
Python Code - Summary Statistics
No ratings yet
Python Code - Summary Statistics
6 pages
CS3362 Data Science Laboratory Manual 2022-23
No ratings yet
CS3362 Data Science Laboratory Manual 2022-23
54 pages
Time Series Analysis Group 9
No ratings yet
Time Series Analysis Group 9
16 pages
machinelearning_lab manual
No ratings yet
machinelearning_lab manual
26 pages
ML_lab
No ratings yet
ML_lab
30 pages
message (3)
No ratings yet
message (3)
2 pages
Python for Data Science: Data Science Mastery by Nikhil Khan, #1
From Everand
Python for Data Science: Data Science Mastery by Nikhil Khan, #1
Nikhil Khan
No ratings yet
Chapter 3
No ratings yet
Chapter 3
11 pages
Final Research
No ratings yet
Final Research
45 pages
Complete Download Exercising Essential Statistics Fourth Edition Evan Berman PDF All Chapters
100% (5)
Complete Download Exercising Essential Statistics Fourth Edition Evan Berman PDF All Chapters
50 pages
SIP Report
No ratings yet
SIP Report
45 pages
Chapter Four Result and Findings 4.1: Age of The Respodent
No ratings yet
Chapter Four Result and Findings 4.1: Age of The Respodent
13 pages
(Ebook) Using Statistical Methods in Social Science Research: With a Complete SPSS Guide by Soleman H. Abu-Bader ISBN 9780197522431, 9780197522486, 0197522432, 0197522483 - The ebook in PDF/DOCX format is available for instant download
100% (2)
(Ebook) Using Statistical Methods in Social Science Research: With a Complete SPSS Guide by Soleman H. Abu-Bader ISBN 9780197522431, 9780197522486, 0197522432, 0197522483 - The ebook in PDF/DOCX format is available for instant download
66 pages
Chi Square Test
No ratings yet
Chi Square Test
32 pages
Nama: Dadan Noviandri NIM: 312020019 Kelas: B (S1 Keperawatan LJ)
No ratings yet
Nama: Dadan Noviandri NIM: 312020019 Kelas: B (S1 Keperawatan LJ)
5 pages
MBA MAP Project report
No ratings yet
MBA MAP Project report
63 pages
A Study On Investors Behavior Towards Mutual Funds Ijariie6493
No ratings yet
A Study On Investors Behavior Towards Mutual Funds Ijariie6493
12 pages
Best Fit Probability Distributions For Monthly Radiosonde Weather Data
No ratings yet
Best Fit Probability Distributions For Monthly Radiosonde Weather Data
8 pages
Chi-Square Assignment
No ratings yet
Chi-Square Assignment
4 pages
Respuesta Proyecto Unidad Didáctica 4
No ratings yet
Respuesta Proyecto Unidad Didáctica 4
2 pages
Chapter 5 g5
No ratings yet
Chapter 5 g5
4 pages
Biostatistics and Orthodontics
50% (2)
Biostatistics and Orthodontics
72 pages
AsiaPacificJournalofResearch Muddassir
No ratings yet
AsiaPacificJournalofResearch Muddassir
7 pages
1 Tapasya Julka Malhotra
No ratings yet
1 Tapasya Julka Malhotra
14 pages
DE GUZMAN, ISAIAH Q._MMEM
No ratings yet
DE GUZMAN, ISAIAH Q._MMEM
19 pages
Odds Ratio
No ratings yet
Odds Ratio
16 pages
Respotas Da Materia Statistics 4
No ratings yet
Respotas Da Materia Statistics 4
11 pages
Afmg PSM Test Fmgeau20
No ratings yet
Afmg PSM Test Fmgeau20
16 pages
Download Complete Even You Can Learn Statistics and Analytics: An Easy to Understand Guide 4th Edition David Levine PDF for All Chapters
100% (4)
Download Complete Even You Can Learn Statistics and Analytics: An Easy to Understand Guide 4th Edition David Levine PDF for All Chapters
66 pages
Thesis Dependent and Independent Variables
100% (3)
Thesis Dependent and Independent Variables
5 pages
Non - Business - GST IMPACT
No ratings yet
Non - Business - GST IMPACT
75 pages
Chi Squared Tests Applied To Ecology
No ratings yet
Chi Squared Tests Applied To Ecology
3 pages
STATS - DOANE - Chapter 15 Chi-Square Tests
100% (1)
STATS - DOANE - Chapter 15 Chi-Square Tests
189 pages
A Study On Stress Related Problem Faced by The Employees
No ratings yet
A Study On Stress Related Problem Faced by The Employees
96 pages
Chi Square Test Course Requirement
No ratings yet
Chi Square Test Course Requirement
5 pages
Statistics For The Behavioral Sciences 4th Edition Susan A. Nolan All Chapter Instant Download
100% (2)
Statistics For The Behavioral Sciences 4th Edition Susan A. Nolan All Chapter Instant Download
52 pages
Consumer Preference Between Local Brands and International Brands
No ratings yet
Consumer Preference Between Local Brands and International Brands
48 pages