Lab Manual (DAV)

Uploaded by

Paras Chaturvedi
Copyright
© All Rights Reserved

Experiment 1: Introduction to Various Data Visualization Tools

Objective:
To understand and explore different data visualization tools in Python, specifically
Matplotlib, Seaborn, Plotly, and Pandas, and to demonstrate basic plots using each of these
libraries.

Implementation:

1. Matplotlib – A basic plotting library for creating static, animated, and interactive
visualizations.

import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y = [10, 20, 15, 25, 30]

# Line Plot
plt.plot(x, y, marker='o', color='blue', linestyle='-')
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Matplotlib Line Plot")
plt.show()

2. Seaborn – A statistical data visualization library built on top of Matplotlib, useful for
creating informative and attractive graphics.

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Sample data
data = {'Category': ['A', 'B', 'C', 'D', 'E'], 'Values': [15, 30, 25,
10, 20]}
df = pd.DataFrame(data)

# Bar Plot
sns.barplot(x='Category', y='Values', data=df, palette="viridis")
plt.title("Seaborn Bar Plot")
plt.show()

3. Plotly – An interactive visualization library that supports dynamic plots and dashboards.

import plotly.express as px

# Sample data
df = px.data.iris() # Built-in iris dataset

# Scatter Plot
fig = px.scatter(df, x='sepal_width', y='sepal_length',
color='species', title="Plotly Scatter Plot")
fig.show()
4. Pandas Plotting – Uses Matplotlib as a backend to create simple visualizations
directly from DataFrames.

import matplotlib.pyplot as plt
import pandas as pd

# Sample data
df = pd.DataFrame({'Category': ['A', 'B', 'C', 'D', 'E'],
                   'Values': [15, 30, 25, 10, 20]})

# Plotting with Pandas
df.plot(kind='bar', x='Category', y='Values', color='orange',
        legend=False)
plt.title("Pandas Bar Plot")
plt.xlabel("Category")
plt.ylabel("Values")
plt.show()

Expected Output:

• Matplotlib: A simple line plot.


• Seaborn: A colorful bar plot.
• Plotly: An interactive scatter plot.
• Pandas Plotting: A bar plot using Pandas’ built-in plotting capabilities.
Experiment 2: Descriptive Statistics (Mean, Median, Mode, Variance, etc.)

Objective:
To calculate and understand basic descriptive statistics such as mean, median, mode,
variance, and standard deviation.

Implementation Code:

import numpy as np
import pandas as pd
from scipy import stats

# Sample data
data = [12, 15, 14, 10, 18, 20, 12, 14, 13, 15]

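# Using NumPy for Mean, Variance, and Standard Deviation
# (np.var and np.std compute the population statistics, ddof=0, by default)
mean = np.mean(data)
variance = np.var(data)
std_dev = np.std(data)

# Using Pandas for Median
median = pd.Series(data).median()

# Using SciPy for Mode
# stats.mode() returns an array in older SciPy releases and a scalar in newer
# ones; np.atleast_1d handles both cases.
mode = np.atleast_1d(stats.mode(data).mode)[0]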

# Displaying Results
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print(f"Variance: {variance}")
print(f"Standard Deviation: {std_dev}")

Explanation:

1. Mean: The arithmetic average of all the values.
2. Median: The middle value of the sorted data. For an even number of values, it is the
average of the two middle values.
3. Mode: The most frequent value in the data. When several values tie, stats.mode() returns
the smallest of them.
4. Variance: The average squared deviation from the mean. np.var() divides by n (population
variance) unless ddof=1 is passed.
5. Standard Deviation: The square root of the variance, expressed in the same units as the
data.

Expected Output:

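Mean: 14.3
Median: 14.0
Mode: 12
Variance: 7.81
Standard Deviation: 2.79

The variance and standard deviation are shown rounded to two decimal places; the script prints
more digits. In this data set 12, 14, and 15 each occur twice, and the reported mode is the
smallest of them. The values will vary based on the input data, but this code covers the basic
steps for calculating the most commonly used descriptive statistics.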
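
The difference between population and sample statistics can be checked with Python's standard statistics module. The following is a minimal sketch on the same sample data; the variable names are chosen here for illustration only.

```python
import statistics

data = [12, 15, 14, 10, 18, 20, 12, 14, 13, 15]

# Population variance divides by n (the same convention as np.var's default)
pop_var = statistics.pvariance(data)

# Sample variance divides by n - 1 (the same convention as np.var(data, ddof=1))
sample_var = statistics.variance(data)

# multimode() returns every value that ties for the highest frequency
modes = sorted(statistics.multimode(data))

print(f"Population variance: {pop_var:.2f}")
print(f"Sample variance: {sample_var:.2f}")
print(f"All modes: {modes}")
```

Use the sample variance when the data is a sample drawn from a larger population, and the population variance when the data covers the whole population of interest.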
Experiment 3: Hypothesis Testing Using Chi-Square Test

Objective:
To perform a chi-square test of independence to check if two categorical variables are
independent.

Chi-Square Test Explanation:

The chi-square test of independence checks whether there is a significant association between
two categorical variables. The null hypothesis (H₀) assumes that the two variables are
independent, and the alternative hypothesis (H₁) assumes that they are not independent.

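Formula:

χ² = Σ (O − E)² / E

where the sum runs over all cells of the contingency table, O is the observed frequency of a
cell, and E is its expected frequency under independence.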

Implementation Code:
from scipy.stats import chi2_contingency
import numpy as np

# Sample data in a contingency table (rows represent categories of one
# variable, columns represent categories of the second variable)
# Example: Relationship between Gender and Preference for a product
# (Male/Female vs. Likes/Dislikes)
data = np.array([[30, 10],   # 30 Males like, 10 Males dislike
                 [20, 40]])  # 20 Females like, 40 Females dislike

# Perform Chi-Square test
# (for a 2x2 table, chi2_contingency applies Yates' continuity correction by default)
chi2, p, dof, expected = chi2_contingency(data)

# Displaying Results
print(f"Chi-Square Statistic: {chi2}")
print(f"p-value: {p}")
print(f"Degrees of Freedom: {dof}")
print(f"Expected Frequencies:\n{expected}")

# Interpretation based on p-value
alpha = 0.05
if p < alpha:
    print("\nReject the null hypothesis: There is a significant association "
          "between the variables.")
else:
    print("\nFail to reject the null hypothesis: There is no significant "
          "association between the variables.")
Explanation:

• Observed Frequencies (O): The actual count of occurrences in each category.


• Expected Frequencies (E): The frequencies we would expect if the null hypothesis is
true (i.e., the variables are independent).
• Chi-Square Statistic: The test statistic that measures the difference between observed
and expected values.
• Degrees of Freedom: This is calculated as (number of rows - 1) * (number of
columns - 1).
• p-value: The probability of obtaining a test statistic at least as extreme as the one
observed, assuming the null hypothesis is true.

Expected Output:
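Chi-Square Statistic: 15.04
p-value: 0.000105
Degrees of Freedom: 1
Expected Frequencies:
[[20. 20.]
 [30. 30.]]

Reject the null hypothesis: There is a significant association between the variables.

The statistic and p-value are shown rounded; the script prints more digits. Each expected
frequency is (row total × column total) / grand total, for example 40 × 50 / 100 = 20.
Because the table is 2×2, chi2_contingency applies Yates' continuity correction by default;
with correction=False the statistic is 16.67. In this case, the p-value is less than 0.05,
meaning we reject the null hypothesis and conclude that there is a significant association
between gender and product preference.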
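
The expected frequencies and the test statistic can also be worked out by hand from the formula. The following is a minimal sketch in plain Python, using the same table; the helper chi_square is a name introduced here for illustration, and its correction is a simplified form of Yates' adjustment that assumes every |O − E| is at least 0.5.

```python
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell under independence:
# (row total * column total) / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square(obs, exp, yates=False):
    """Sum of (|O - E| - correction)^2 / E over all cells."""
    correction = 0.5 if yates else 0.0
    return sum((abs(o - e) - correction) ** 2 / e
               for obs_row, exp_row in zip(obs, exp)
               for o, e in zip(obs_row, exp_row))


chi2_plain = chi_square(observed, expected)
chi2_yates = chi_square(observed, expected, yates=True)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

print(expected)
print(round(chi2_plain, 2), round(chi2_yates, 2), dof)
```

Comparing chi2_yates with the statistic reported by chi2_contingency is a quick way to confirm which convention a library call is using.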
Experiment 4: Regression Analysis, Fitting a Linear Model, and Making
Predictions

Objective:
To perform regression analysis by fitting a linear model to the data and making predictions
based on the fitted model.

Explanation:

In linear regression, the goal is to fit a model that predicts a target variable (dependent
variable) based on one or more input variables (independent variables). In simple linear
regression, the model is of the form:

y = mx + b

Where:

• y = dependent variable (target)


• x = independent variable (predictor)
• m = slope of the line
• b = y-intercept

Implementation Code:
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

# Sample data (Independent Variable: X, Dependent Variable: y)
# X is reshaped to a 2D array because sklearn expects one row per sample
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([1, 2, 1.3, 3.75, 2.25])

# Create a linear regression model and fit it


model = LinearRegression()
model.fit(X, y)

# Making predictions
y_pred = model.predict(X)

# Output the coefficients and intercept


print(f"Coefficient (m): {model.coef_}")
print(f"Intercept (b): {model.intercept_}")

# Visualizing the results


plt.scatter(X, y, color='blue', label='Actual data')
plt.plot(X, y_pred, color='red', label='Fitted line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression - Fitting a Model')
plt.legend()
plt.show()
# Making predictions for new values
new_X = np.array([6, 7]).reshape(-1, 1)
new_predictions = model.predict(new_X)
print(f"Predictions for new values: {new_predictions}")

Explanation of the Code:

1. X and y: The independent and dependent variables. We define X as a 2D array (for
sklearn compatibility) and y as a 1D array of target values.
2. LinearRegression(): The model is created using Scikit-learn’s LinearRegression()
class.
3. model.fit(X, y): The model is trained using the data (X, y).
4. model.predict(X): Once the model is trained, we use predict() to generate the
predicted values (y_pred) based on the input X.
5. Visualization: A scatter plot is created to show the actual data, and the fitted line is
plotted on top of it.
Expected Output:
Coefficient (m): [0.425]
Intercept (b): 0.785
Predictions for new values: [3.335 3.76 ]

The values above are rounded; floating-point output may show additional digits. In the plot,
you’ll see a scatter plot of the data points (blue) and the fitted regression line (red). The
output will include the coefficient (slope) and intercept of the line, as well as the
predictions for new values of X.
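
For simple linear regression, the slope and intercept have a closed form, which makes the fitted values easy to verify. The following is a minimal sketch in plain Python on the same data; the names s_xy, s_xx, and predict are introduced here for illustration.

```python
x = [1, 2, 3, 4, 5]
y = [1, 2, 1.3, 3.75, 2.25]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Sum of cross-deviations and sum of squared deviations of x
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)

m = s_xy / s_xx          # slope
b = y_mean - m * x_mean  # intercept: the line passes through (x_mean, y_mean)


def predict(value):
    """Evaluate the fitted line y = m*x + b at a new x value."""
    return m * value + b


print(round(m, 3), round(b, 3))
print(round(predict(6), 3), round(predict(7), 3))
```

The rounded results should agree with the coefficient, intercept, and predictions reported by LinearRegression for the same data.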
Experiment 5: Plotly and Visualization Libraries in Python

Objective:
To explore and create interactive visualizations using Plotly, and compare it with other
popular visualization libraries like Matplotlib and Seaborn.

Explanation:

Plotly is a powerful library for creating interactive plots. Unlike static plots generated by
Matplotlib or Seaborn, Plotly allows you to create visualizations that users can interact with,
such as zooming, hovering, and clicking.

1. Plotly - Interactive Visualizations

Example 1: Scatter Plot


import plotly.express as px

# Sample data (Iris dataset)


df = px.data.iris()

# Scatter Plot
fig = px.scatter(df, x='sepal_width', y='sepal_length', color='species',
title="Interactive Scatter Plot (Plotly)")
fig.show()

Expected Output: An interactive scatter plot of sepal width against sepal length, with points
colored by species.

Example 2: Line Plot

import pandas as pd
import plotly.express as px

# Sample data for Line Plot
df_line = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [10, 15, 13, 18, 20]
})

# Line Plot
fig_line = px.line(df_line, x='x', y='y',
                   title="Interactive Line Plot (Plotly)")
fig_line.show()

Expected Output: An interactive line plot of y against x.

2. Matplotlib - Static Visualizations

Example: Line Plot


import matplotlib.pyplot as plt

# Sample data for Line Plot


x = [1, 2, 3, 4, 5]
y = [10, 15, 13, 18, 20]

# Line Plot
plt.plot(x, y, marker='o', color='blue', linestyle='-', label='Data')
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Matplotlib Line Plot")
plt.legend()
plt.show()
Expected Output: A static line plot with a circular marker at each data point.

Matplotlib is well suited to static visualizations, and this simple line plot shows how data
can be represented.

3. Seaborn - Statistical Plots

Example: Scatter Plot with Regression Line


import seaborn as sns
import pandas as pd

# Sample data for Scatter Plot


df_seaborn = pd.DataFrame({
"x": [1, 2, 3, 4, 5],
"y": [10, 15, 13, 18, 20]
})

# Scatter Plot with Regression Line


sns.regplot(x='x', y='y', data=df_seaborn, scatter_kws={"color": "blue"},
line_kws={"color": "red"})
plt.title("Seaborn Scatter Plot with Regression Line")
plt.show()
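Expected Output: A scatter plot of the five points in blue with a fitted regression line in
red. By default, regplot() also shades a confidence band around the line.

Seaborn is a statistical visualization library built on top of Matplotlib that automatically
handles many plot aesthetics and includes statistical features.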

Key Differences:

• Plotly: Interactive, web-based visualizations suitable for dashboards and presentations.


• Matplotlib: Static visualizations that are useful for printed reports or simple visualizations in
scripts.
• Seaborn: Built on top of Matplotlib, it provides enhanced statistical plots and aesthetic
improvements.
Experiment 6: Applying Matplotlib in Python

Objective:
To create visualizations using Matplotlib, a powerful plotting library in Python.

Explanation:

Matplotlib is used for creating static, animated, and interactive visualizations in Python. It is
highly customizable, allowing users to create a wide variety of plots such as line plots, bar
plots, histograms, and more.

Implementation Examples using Matplotlib

Example 1: Line Plot

A line plot is useful for visualizing the relationship between two continuous variables.

import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# Line Plot
plt.plot(x, y, marker='o', color='b', linestyle='-', label='Line 1')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Simple Line Plot using Matplotlib')
plt.legend()
plt.grid(True)
plt.show()
Explanation:

• plt.plot() is used to create the line plot.


• xlabel(), ylabel(), and title() add labels to the axes and a title.
• legend() adds a legend to the plot, and grid() adds grid lines.

Example 2: Bar Plot

A bar plot is used to compare different categories. It displays data with rectangular bars.

# Sample data
categories = ['A', 'B', 'C', 'D']
values = [5, 7, 3, 6]

# Bar Plot
plt.bar(categories, values, color='orange')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Bar Plot using Matplotlib')
plt.show()

Explanation:

• plt.bar() creates a bar plot.


• The categories are placed along the x-axis, and each bar's height is the corresponding value.
Example 3: Histogram

A histogram is used to represent the distribution of a dataset.

import numpy as np

# Generate random data: 1000 random numbers from a standard normal distribution
data = np.random.normal(0, 1, 1000)

# Histogram
plt.hist(data, bins=30, color='g', edgecolor='black')
plt.xlabel('Data Range')
plt.ylabel('Frequency')
plt.title('Histogram using Matplotlib')
plt.show()

Explanation:

• np.random.normal() generates random data from a normal distribution.


• plt.hist() creates the histogram. We specify the number of bins and the color.
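
To see what a histogram with 30 bins actually computes, the counts can be produced without plotting. The following is a minimal sketch that assumes np.histogram() performs the same equal-width binning that plt.hist() draws; the seeded generator is an addition here so that the run is repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator (an assumption for repeatability)
data = rng.normal(0, 1, 1000)

# counts[i] is the number of samples that fall between edges[i] and edges[i + 1]
counts, edges = np.histogram(data, bins=30)

print(len(counts))    # number of bins
print(len(edges))     # number of bin edges, one more than the number of bins
print(counts.sum())   # every sample falls into exactly one bin
```

Increasing bins gives a finer but noisier picture of the distribution, while decreasing it smooths the shape at the cost of detail.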
Example 4: Pie Chart

A pie chart is useful for showing the proportions of a whole.

# Sample data
labels = ['Apple', 'Banana', 'Cherry', 'Date']
sizes = [15, 25, 35, 25]

# Pie Chart
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90,
colors=['red', 'yellow', 'pink', 'brown'])
plt.title('Fruit Distribution using Pie Chart')
plt.show()

Explanation:

• plt.pie() creates the pie chart.


• autopct is used to display percentages.
• startangle=90 rotates the start of the first wedge by 90 degrees counterclockwise from the
positive x-axis.
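
The autopct string is an ordinary printf-style format applied to each wedge's percentage of the total. The following is a minimal sketch in plain Python of that formatting step, using the same sizes; the variable names are introduced here for illustration.

```python
sizes = [15, 25, 35, 25]
total = sum(sizes)

autopct = '%1.1f%%'   # one decimal place, followed by a literal percent sign

# Each wedge's share of the whole, formatted the way the pie chart labels it
labels = [autopct % (100 * s / total) for s in sizes]

print(labels)
```

For these sizes the labels are 15.0%, 25.0%, 35.0%, and 25.0%. A format such as '%1.0f%%' would drop the decimal place.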
Expected Output:

1. Line Plot: A plot of y = 2x, with a line connecting the points.


2. Bar Plot: A bar chart comparing values for different categories (A, B, C, D).
3. Histogram: A plot showing the distribution of the randomly generated data.
4. Pie Chart: A pie chart showing the proportions of different fruits in the dataset.

Key Matplotlib Functions:

• plt.plot(): Creates line plots.


• plt.bar(): Creates bar charts.
• plt.hist(): Creates histograms.
• plt.pie(): Creates pie charts.
• plt.xlabel(), plt.ylabel(): Add labels to axes.
• plt.title(): Adds a title to the plot.
• plt.legend(): Adds a legend to the plot.
• plt.show(): Displays the plot.

Matplotlib offers extensive customization options for each type of plot, allowing you to tailor
the visualizations to your needs.
Experiment 7: Data Visualization Using Pandas

Objective:
To explore data visualization capabilities using Pandas, which has built-in support for basic
plotting through Matplotlib. Pandas makes it easy to visualize data directly from DataFrames.

Explanation:

Pandas allows you to create simple yet powerful visualizations like line plots, bar charts,
histograms, and more, all using data from a Pandas DataFrame. Below are three examples to
demonstrate this functionality.

Example 1: Line Plot

A line plot is useful for visualizing trends over time or continuous data.

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Sample data
data = {'Year': [2015, 2016, 2017, 2018, 2019],
        'Sales': [150, 200, 250, 300, 350]}

df = pd.DataFrame(data)

# Line Plot
df.plot(x='Year', y='Sales', kind='line', marker='o',
        title="Sales Over Years")
plt.xlabel('Year')
plt.ylabel('Sales')
plt.show()
Explanation:

• df.plot() creates a plot using DataFrame columns.


• kind='line' specifies that it's a line plot.
• x='Year' and y='Sales' specify which columns to use for the x and y axes.

Example 2: Bar Plot

A bar plot is useful for comparing categories.

data = {'Category': ['A', 'B', 'C', 'D'],


'Values': [23, 45, 56, 78]}

df = pd.DataFrame(data)

# Bar Plot
df.plot(x='Category', y='Values', kind='bar', color='skyblue',
title="Category Values")
plt.xlabel('Category')
plt.ylabel('Values')
plt.show()

Explanation:

• kind='bar' specifies that we are creating a bar plot.


• This plot compares the values for different categories (A, B, C, D).
Example 3: Histogram

Histograms are used to visualize the distribution of numeric data.

# Generate random data for Histogram
data = np.random.randn(1000)

# Create a DataFrame
df = pd.DataFrame(data, columns=['Value'])

# Histogram
df.plot(kind='hist', bins=30, color='orange',
        title="Distribution of Random Data")
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

Explanation:

• kind='hist' specifies that we are creating a histogram.


• bins=30 controls the number of bins in the histogram.
• We use randomly generated data to visualize its distribution.
Expected Output:

1. Line Plot: A line plot showing sales trends over the years.
2. Bar Plot: A bar chart comparing values for categories A, B, C, and D.
3. Histogram: A histogram showing the distribution of random data.

Key Pandas Plotting Functions:

• df.plot(kind='line'): Line plot (default type).


• df.plot(kind='bar'): Bar chart.
• df.plot(kind='hist'): Histogram.
• You can also use df.plot(x='col_name', y='col_name') to specify which
columns to plot.

Pandas makes it easy to visualize data with minimal code, which is especially useful for quick
exploratory analysis.
Experiment 8: Working with Time Series Data in Python

Objective:
To demonstrate how to work with time series data, including loading, plotting, and analyzing
time series data using Python.

Explanation:

Time series data consists of observations taken at specific time intervals. In Python, we often
use libraries like Pandas and Matplotlib to handle and visualize time series data. Pandas
provides the datetime functionality to handle dates and times efficiently.

Here’s a simple workflow for working with time series data:

1. Loading Time Series Data: Often from a CSV or directly from a generated series.
2. Plotting: Visualizing the data to detect trends, seasonality, etc.
3. Basic Analysis: Handling missing values, resampling, etc.

Implementation Code:

1. Generating and Plotting Time Series Data

Let's start with generating a simple time series dataset and plotting it.

import pandas as pd
import matplotlib.pyplot as plt

# Generate a date range


dates = pd.date_range('2020-01-01', periods=100, freq='D')

# Generate sample values (a simple increasing sequence standing in for real measurements)
data = pd.Series(range(1, 101), index=dates)

# Plot the time series data


data.plot(title="Time Series Data Example", figsize=(10, 6))
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()
Explanation:

• pd.date_range() creates a range of dates starting from 2020-01-01 for 100 days.
• pd.Series() creates the time series with the dates as the index.
• plot() is used to plot the time series data.

2. Resampling Time Series Data

Sometimes, time series data is collected at high frequencies (e.g., minute-level data). We
might want to resample the data to a lower frequency (e.g., daily or monthly averages).

# Resample the data to monthly frequency and take the mean


monthly_data = data.resample('M').mean()

# Plot the resampled data


monthly_data.plot(title="Resampled Monthly Data", figsize=(10, 6))
plt.xlabel('Date')
plt.ylabel('Mean Value')
plt.show()
Explanation:

• .resample('M') groups the data into calendar months, labeled by month end. In pandas 2.2
and later the alias 'ME' is preferred and 'M' is deprecated.
• .mean() calculates the mean for each month.
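
To check what the monthly means should be, the same grouping can be done with groupby() on the calendar month. The following is a minimal sketch, not the resample() call itself: it swaps in to_period('M') as the grouping key and also counts the observations per month. The names monthly_mean and monthly_count are introduced here for illustration.

```python
import pandas as pd

dates = pd.date_range('2020-01-01', periods=100, freq='D')
data = pd.Series(range(1, 101), index=dates)

# Map each date to its calendar month and aggregate within each month
month_key = data.index.to_period('M')
monthly_mean = data.groupby(month_key).mean()
monthly_count = data.groupby(month_key).size()

print(monthly_mean)
print(monthly_count)
```

The 100 days cover January through early April 2020, so the last month's mean is based on only 9 observations. Partial periods like this are worth checking before comparing monthly values.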

3. Handling Missing Data in Time Series

Time series data often contains missing values. Here's how you can handle them.

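# Introduce missing values (NaN) into the data
data_with_nan = data.astype(float)        # NaN requires a float dtype
data_with_nan.iloc[::10] = float('nan')   # Introduce NaN at every 10th point

# Fill missing values using forward fill method
# (the very first point stays NaN because there is no earlier value to propagate)
filled_data = data_with_nan.ffill()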

# Plot the original and filled data


plt.figure(figsize=(10, 6))
plt.plot(data_with_nan, label='Original with NaN', color='red',
linestyle='--')
plt.plot(filled_data, label='Filled Data', color='green', linestyle='-')
plt.legend()
plt.title("Handling Missing Data in Time Series")
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()
Explanation:

• ffill() is a method to forward fill the missing values (propagates the last valid
observation).
• We also plot the original data with missing values and the filled data for comparison.
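
Forward fill can only propagate a value that has already been seen, so a gap at the very start of a series is left untouched. The following is a minimal sketch on a small hypothetical series that shows this, together with one common remedy, a back fill applied afterwards.

```python
import math
import pandas as pd

# Hypothetical series with a leading gap and an interior gap
s = pd.Series([float('nan'), 2.0, float('nan'), float('nan'), 5.0])

forward = s.ffill()        # the leading NaN remains: nothing precedes it
both = s.ffill().bfill()   # a back fill then covers the leading gap

print(forward.tolist())
print(both.tolist())
```

Whether back filling is appropriate depends on the data: it copies a later observation into an earlier time, which may not be acceptable when the order of events matters.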

Expected Output:

1. Time Series Plot: A line plot showing the generated time series data.
2. Resampled Data Plot: A line plot showing monthly mean values of the time series data.
3. Missing Data Handling: A plot showing the original data with missing values and the filled
data.

Key Pandas Functions for Time Series:

• pd.date_range(): Generate a range of dates.


• Series.resample(): Resample the data to different frequencies (e.g., 'D' for daily, 'M' for
monthly).
• Series.ffill(): Forward fill missing values.
• Series.plot(): Plot the time series data.

Summary:

In this example, we demonstrated how to generate a simple time series, plot it, resample the
data to a different frequency, and handle missing values. These are some of the common
tasks when working with time series data in Python.
Experiment 9: Analysis and Visualization of IPL Cricket Data

Objective:
To perform data analysis and visualize the performance of teams and players in the Indian
Premier League (IPL) cricket tournament. The focus will be on exploring various statistics,
such as runs, wickets, and team performance over the years.

Explanation:

In this experiment, we can work with a dataset containing IPL match statistics, which
includes data such as runs scored by teams, wickets taken, players’ performances, and more.
The goal is to use Pandas for data analysis and Matplotlib/Seaborn for visualization.

For this example, let’s assume we have a dataset ipl_matches.csv containing IPL match
data, such as:

• Match Date
• Teams (Home and Away)
• Runs scored by each team
• Wickets taken by each team
• Player performances (runs, wickets, etc.)

We will perform:

1. Data cleaning and processing.


2. Analysis of team performance over the years.
3. Visualization of match statistics and player performances.

Step 1: Load the IPL Dataset


import pandas as pd

# Load IPL data (assuming a dataset in CSV format)


# Replace with the path to your dataset
ipl_df = pd.read_csv('ipl_matches.csv')

# Show the first few rows of the dataset


print(ipl_df.head())

Assuming the dataset includes columns like:

• date: Date of the match


• team1: Home team
• team2: Away team
• team1_runs: Runs scored by team1
• team2_runs: Runs scored by team2
• team1_wickets: Wickets taken by team1
• team2_wickets: Wickets taken by team2
Step 2: Analyzing Team Performance Over the Years

We can analyze the performance of teams by looking at their total wins, runs, and wickets
across different seasons.

Example: Total Runs Scored by Each Team Over Time


import matplotlib.pyplot as plt

# Extract the year from the match date
ipl_df['year'] = pd.to_datetime(ipl_df['date']).dt.year

# Total runs scored by each team per year, once as team1 and once as team2
runs_as_team1 = (ipl_df.groupby(['year', 'team1'])['team1_runs'].sum()
                 .rename_axis(['year', 'team']))
runs_as_team2 = (ipl_df.groupby(['year', 'team2'])['team2_runs'].sum()
                 .rename_axis(['year', 'team']))

# Add the two series; fill_value=0 covers teams that appear in only one role in a year
team_runs = runs_as_team1.add(runs_as_team2, fill_value=0)

# Reset index for better visualization
team_runs = team_runs.reset_index(name='total_runs')

# Plot the data
plt.figure(figsize=(12, 6))
for team in team_runs['team'].unique():
    team_data = team_runs[team_runs['team'] == team]
    plt.plot(team_data['year'], team_data['total_runs'], label=team)

plt.title('Total Runs Scored by Each Team Over the Years')
plt.xlabel('Year')
plt.ylabel('Total Runs')
plt.legend()
plt.grid(True)
plt.show()

Explanation:

• groupby(['year', 'team1']) groups the data by year and team1 to calculate the sum
of runs.
• We plot the total runs scored by each team over the years.
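
When two grouped series are added, pandas aligns them on their index labels, and any label present on only one side produces NaN. The following is a minimal sketch on a hypothetical three-match table, using the column names assumed above, that contrasts the + operator with add(..., fill_value=0).

```python
import pandas as pd

# Hypothetical miniature data set with the column names assumed above
matches = pd.DataFrame({
    'year': [2019, 2019, 2020],
    'team1': ['A', 'B', 'A'],
    'team2': ['B', 'C', 'C'],
    'team1_runs': [150, 160, 170],
    'team2_runs': [140, 155, 165],
})

# Give both grouped series the same index level names so they align cleanly
as_team1 = (matches.groupby(['year', 'team1'])['team1_runs'].sum()
            .rename_axis(['year', 'team']))
as_team2 = (matches.groupby(['year', 'team2'])['team2_runs'].sum()
            .rename_axis(['year', 'team']))

plain_sum = as_team1 + as_team2                     # NaN where a team is missing on one side
filled_sum = as_team1.add(as_team2, fill_value=0)   # the missing side counts as 0

print(plain_sum)
print(filled_sum)
```

In this table only team B in 2019 appears in both roles, so the plain sum keeps a value for that row alone, while the filled sum keeps a total for every team and year.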

Step 3: Team Wins and Performance

Example: Number of Matches Won by Each Team


# Count the number of wins for each team per year.
# Assumption: the dataset has a 'winner' column holding the name of the winning
# team; adjust the column name to match your dataset.
team_wins = (ipl_df.groupby(['year', 'winner'])
             .size()
             .reset_index(name='wins'))

# Plot the number of wins for each team
plt.figure(figsize=(12, 6))
for team in team_wins['winner'].unique():
    win_data = team_wins[team_wins['winner'] == team]
    plt.plot(win_data['year'], win_data['wins'], label=team)
plt.title('Number of Matches Won by Each Team Over the Years')
plt.xlabel('Year')
plt.ylabel('Matches Won')
plt.legend()
plt.grid(True)
plt.show()

Explanation:

• We count the number of wins by grouping by year and team.


• We plot the number of matches won by each team over the years.

Step 4: Player Performance Analysis

Example: Runs Scored by Top Batsmen in a Specific Year


# Assume we have player performance data in the columns 'player_name',
# 'runs', 'year'
player_data = ipl_df[['year', 'player_name', 'runs']]

# Filter for a specific year (e.g., 2020)
player_data_2020 = player_data[player_data['year'] == 2020]

# Group by player and sum the runs
top_batsmen_2020 = (
    player_data_2020.groupby('player_name')['runs']
    .sum()
    .sort_values(ascending=False)
)

# Plot the top 10 batsmen of 2020


top_batsmen_2020.head(10).plot(kind='bar', figsize=(10, 6),
color='skyblue')
plt.title('Top 10 Batsmen in IPL 2020')
plt.xlabel('Player Name')
plt.ylabel('Total Runs')
plt.xticks(rotation=45)
plt.show()

Explanation:

• We group the player data by player name and calculate their total runs.
• We plot the top 10 batsmen with the highest runs for a specific year (2020).

Expected Output:

1. Total Runs by Team: A line plot showing the total runs scored by each team across different
seasons.
2. Team Wins: A line plot showing the number of matches won by each team over the years.
3. Top Batsmen: A bar plot showing the top 10 batsmen based on total runs in IPL 2020.
Key Functions for Analysis:

• groupby(): Group data by one or more columns.


• sum(), count(): Aggregate the data by summing or counting the occurrences.
• sort_values(): Sort the data based on a column.
• plot(): Plot the data using Matplotlib.
• reset_index(): Convert a multi-index dataframe back to a regular dataframe for easier
plotting.

Summary:

In this experiment, we analyzed IPL cricket data to explore team performance, wins over the
years, and top player performances. We used Pandas for data manipulation and Matplotlib for
data visualization. This process provides insights into the performance trends of teams and
players across IPL seasons.
Experiment 10: Analysis and Visualization of Patient Data Set

Objective:
To perform data analysis and visualization on a patient dataset. We will explore patient
demographics, medical history, diagnosis, and other health-related metrics using Pandas and
Matplotlib/Seaborn.

Explanation:

A typical patient dataset might contain the following columns:

• Patient ID: Unique identifier for each patient.


• Age: Age of the patient.
• Gender: Gender of the patient.
• BMI (Body Mass Index): Health indicator.
• Blood Pressure: Health indicator.
• Diagnosis: The primary diagnosis (e.g., 'Hypertension', 'Diabetes', 'Healthy').
• Treatment: Type of treatment received.
• Cholesterol Level: A health indicator.

Our goal will be to:

1. Load and clean the dataset.


2. Analyze and visualize basic statistics, like age distribution, gender ratio, common diagnoses,
etc.
3. Explore correlations between health indicators like BMI, blood pressure, and cholesterol.

Step 1: Loading the Patient Dataset


import pandas as pd

# Load the patient dataset (assuming a CSV file)


# Replace with the actual path to your dataset
patient_df = pd.read_csv('patient_data.csv')

# Show the first few rows of the dataset


print(patient_df.head())

Assuming the dataset includes columns like:

• patient_id, age, gender, bmi, blood_pressure, cholesterol_level, diagnosis, etc.
Step 2: Data Cleaning and Preprocessing

Before analysis, ensure the data is clean (e.g., handle missing values, incorrect data types).

# Check for missing values


print(patient_df.isnull().sum())

# Drop rows with missing values (or you can fill them with mean/median)
patient_df.dropna(inplace=True)

# Convert columns to appropriate data types (e.g., age as int)


patient_df['age'] = patient_df['age'].astype(int)
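
Dropping rows discards every other value recorded for those patients, so filling gaps is often preferred for numeric columns. The following is a minimal sketch of median imputation on a small hypothetical table; the values are made up, and the column names follow the ones assumed above.

```python
import pandas as pd

# Hypothetical records using the column names assumed above
patients = pd.DataFrame({
    'age': [34, None, 58, 41],
    'bmi': [22.5, 27.1, None, 30.4],
})

# Fill the gaps in each numeric column with that column's median
filled = patients.fillna(patients.median())

# With no missing values left, age can safely be converted to int
filled['age'] = filled['age'].astype(int)

print(filled)
```

The median is less sensitive to extreme values than the mean, which is why it is a common default for skewed measurements.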

Step 3: Analyzing Demographics

We will explore the distribution of age, gender, and diagnosis categories.

Example 1: Age Distribution


import matplotlib.pyplot as plt

# Plot the distribution of age


plt.figure(figsize=(10, 6))
patient_df['age'].plot(kind='hist', bins=20, color='skyblue',
edgecolor='black')
plt.title('Age Distribution of Patients')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

Explanation:

• We plot a histogram of the age column to visualize the distribution of patients by age.

Example 2: Gender Distribution


# Plot the gender distribution
gender_counts = patient_df['gender'].value_counts()

plt.figure(figsize=(8, 6))
gender_counts.plot(kind='bar', color='lightgreen')
plt.title('Gender Distribution of Patients')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()

Explanation:

• We use value_counts() to count the occurrences of each gender and plot the result in a
bar chart.
Example 3: Diagnosis Distribution
# Plot the distribution of diagnoses
diagnosis_counts = patient_df['diagnosis'].value_counts()

plt.figure(figsize=(10, 6))
diagnosis_counts.plot(kind='bar', color='coral')
plt.title('Diagnosis Distribution')
plt.xlabel('Diagnosis')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

Explanation:

• We use value_counts() to count the number of patients in each diagnosis category and
visualize it with a bar plot.

Step 4: Exploring Relationships Between Health Indicators

We will explore how BMI, blood pressure, and cholesterol levels are related.

Example 4: BMI vs. Blood Pressure


import seaborn as sns

# Scatter plot of BMI vs. Blood Pressure


plt.figure(figsize=(10, 6))
sns.scatterplot(x='bmi', y='blood_pressure', data=patient_df,
hue='diagnosis', palette='Set1')
plt.title('BMI vs Blood Pressure')
plt.xlabel('BMI')
plt.ylabel('Blood Pressure')
plt.legend(title='Diagnosis')
plt.show()

Explanation:

• sns.scatterplot() creates a scatter plot to visualize the relationship between BMI and
blood pressure, color-coded by diagnosis.

Example 5: Correlation Heatmap

We can calculate and visualize the pairwise correlation between numerical features such as
BMI, blood pressure, and cholesterol level.

# Calculate the correlation matrix


correlation_matrix = patient_df[['bmi', 'blood_pressure',
'cholesterol_level']].corr()

# Plot the correlation heatmap


plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1,
vmax=1, fmt='.2f')
plt.title('Correlation Heatmap of Health Indicators')
plt.show()

Explanation:

• corr() computes the correlation matrix between numeric columns.


• sns.heatmap() visualizes the correlation matrix using a heatmap.
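
By default, corr() computes the Pearson correlation coefficient for each pair of columns. The following is a minimal sketch in plain Python of that coefficient; pearson_r is a helper introduced here for illustration, and the three lists are hypothetical values chosen so that the relationships are exactly linear.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical values: blood pressure rises and cholesterol falls linearly with BMI
bmi = [20.0, 24.0, 28.0, 32.0]
blood_pressure = [110.0, 118.0, 126.0, 134.0]
cholesterol = [210.0, 200.0, 190.0, 180.0]

print(round(pearson_r(bmi, blood_pressure), 2))
print(round(pearson_r(bmi, cholesterol), 2))
```

A value near 1 marks a strong increasing linear relationship, a value near -1 a strong decreasing one, and a value near 0 little linear relationship. These are the extremes that the heatmap's vmin=-1 and vmax=1 color range represents.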

Expected Output:

1. Age Distribution: A histogram showing the distribution of ages in the patient dataset.
2. Gender Distribution: A bar chart showing the number of male and female patients.
3. Diagnosis Distribution: A bar chart showing the count of patients diagnosed with different
conditions.
4. BMI vs. Blood Pressure: A scatter plot showing the relationship between BMI and blood
pressure for different diagnoses.
5. Correlation Heatmap: A heatmap showing the correlations between BMI, blood pressure,
and cholesterol levels.

Key Functions Used:

• value_counts(): To count occurrences of unique values.


• plot(kind='hist'): To create a histogram for numerical data.
• sns.scatterplot(): To create a scatter plot with two continuous variables.
• sns.heatmap(): To plot a heatmap for the correlation matrix.

Summary:

In this experiment, we analyzed a patient dataset to explore demographics, health indicators,


and their relationships. Using Pandas for data manipulation and Seaborn/Matplotlib for
visualization, we gained insights into patient age distribution, gender ratio, common
diagnoses, and correlations between health metrics.
