Unit _Data Visualization
Introduction to Graphic
Representation of Data
• Data visualization is the process of representing data
through charts, graphs, and other visual formats.
• It makes complex data more accessible,
understandable, and usable.
• Different types of charts are suitable for different data
types: single variable, two variables, and more than two
variables.
Characteristics of Effective
Graphical Displays
• Clarity: The graph should be easy to understand, with clear labels,
scales, and legends.
• Accuracy: The graph must accurately represent the data without
distortion or misleading interpretations.
• Simplicity: Avoid unnecessary decorations or complexity that may
obscure the data’s meaning.
• Relevance: The chosen graph type should be appropriate for the
data being displayed.
• Consistency: Use consistent scales and labels, especially when
comparing multiple graphs.
• Visual Appeal: Aesthetic design should not overshadow clarity
but enhance the viewer's ability to interpret the data.
Single variable (Univariate)
• Single variable plots, also known as univariate plots, are
used to visualize the distribution and characteristics of a
single variable within a dataset.
• These plots are essential in understanding the central
tendency, spread, and overall shape of the data.
Chart Types:
Charts for Single Variables
(Univariate Data)
• Dot Plot: Displays data points as dots; useful for small
data sets.
• Example: Distribution of test scores in a class.
• Jitter Plot: Similar to a dot plot but with added random
noise to spread out overlapping points.
• Example: Scores of multiple tests with slight overlaps.
• Pie Chart: Shows proportions of categories as parts of
a circle.
• Example: Market share of companies in a sector.
• Note: Use pie charts sparingly; they are less effective for
precise comparisons.
• Box-and-Whisker Plot: Shows distribution using
quartiles; highlights outliers.
• Example: Income distribution in a region.
• Histogram: Represents the frequency of data within
intervals (bins).
• Example: Age distribution of people in a city.
1. Dot Plot
• Description: A dot plot displays individual data points along a single
axis. Each dot represents one observation, and dots may be stacked
to represent frequency.
• When to Use: Dot plots are useful for small to medium-sized
datasets to visualize the frequency and distribution of data points.
They are especially helpful when you want to see individual values
and their distribution.
• Pros:
• Shows individual data points.
• Easy to interpret for small datasets.
• Cons:
• Becomes cluttered with large datasets.
• Example Use Case: Displaying the number of students scoring each
grade in a small class.
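The stacking idea described above can be sketched in Python with Matplotlib. This is a minimal illustration, not a definitive implementation: the grades are made-up data, and the choice of Matplotlib is an assumption. Each observation becomes one dot, and repeated values are stacked vertically.

```python
from collections import Counter

import matplotlib.pyplot as plt

# Hypothetical grades for a small class (illustrative data only).
grades = [6, 7, 7, 8, 8, 8, 9, 9, 10]

# Stack one dot per observation above its value.
counts = Counter(grades)
xs, ys = [], []
for value, n in sorted(counts.items()):
    for level in range(1, n + 1):
        xs.append(value)
        ys.append(level)

fig, ax = plt.subplots()
ax.scatter(xs, ys, s=80)
ax.set_xlabel("Grade")
ax.set_ylabel("Number of students")
ax.set_yticks(range(1, max(counts.values()) + 1))
ax.set_title("Dot plot of grades")
```

Calling `plt.show()` (or `fig.savefig(...)`) afterwards displays or saves the figure.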
2. Jitter Plot
• Description: A jitter plot is similar to a dot plot but with random
noise added to prevent overlapping of data points (jittering). This
makes it easier to see individual points when many values are the
same.
• When to Use: Jitter plots are useful when you have a large number
of identical data points that would overlap in a standard dot plot.
The jitter helps to spread out the points and avoid overplotting.
• Pros:
• Prevents overplotting.
• Useful for large datasets with many identical values.
• Cons:
• The added noise can slightly distort the perception of data distribution.
• Example Use Case: Visualizing the distribution of exam scores
where many students have the same scores.
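A jitter plot can be sketched as follows, assuming Matplotlib and NumPy and using made-up exam scores. In this sketch the random noise is applied only to the vertical axis, which carries no data, so the scores themselves are not shifted; jittering the data axis instead is also common but introduces the distortion noted above.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # fixed seed so the jitter is reproducible

# Hypothetical exam scores with many ties (illustrative data only).
scores = np.array([70] * 15 + [75] * 20 + [80] * 10)

# Jitter only the axis that carries no data, so the scores stay exact.
jitter = rng.uniform(-0.2, 0.2, size=scores.size)

fig, ax = plt.subplots()
ax.scatter(scores, jitter, alpha=0.6)
ax.set_ylim(-1, 1)
ax.set_yticks([])  # the vertical position has no meaning
ax.set_xlabel("Exam score")
ax.set_title("Jitter plot of exam scores")
```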
3. Error Bar Plot
• Description: An error bar plot shows the mean or median of data points
along with error bars that represent variability (e.g., standard deviation,
standard error).
• When to Use: Error bar plots are used when you want to show not only
the central tendency of the data but also the variability or uncertainty
around it. They are common in scientific and experimental data analysis.
• Pros:
• Provides insight into data variability and reliability.
• Useful for comparing the central tendency and spread across different groups.
• Cons: Interpretation can be complex, especially with overlapping error
bars.
• Example Use Case: Comparing the average response time in different
experimental conditions with error bars representing standard deviation.
• Key parameters (as used, for example, by Matplotlib's errorbar
function):
• x: The data points for the x-axis (required).
• y: The data points for the y-axis (required).
• yerr: The error values in the y-direction (optional).
• xerr: The error values in the x-direction (optional).
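The parameters x, y, and yerr match the arguments of Matplotlib's `errorbar` function, so a minimal sketch can use it directly. The response times below are made-up data, and the condition names are hypothetical. The error bars show the sample standard deviation of each group.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical response times (ms) under three conditions (illustrative only).
samples = {
    "A": np.array([310, 295, 330, 305, 320]),
    "B": np.array([280, 270, 300, 290, 275]),
    "C": np.array([350, 340, 365, 355, 345]),
}

labels = list(samples)
x = np.arange(len(labels))
means = np.array([samples[k].mean() for k in labels])
stds = np.array([samples[k].std(ddof=1) for k in labels])  # sample standard deviation

fig, ax = plt.subplots()
ax.errorbar(x, means, yerr=stds, fmt="o", capsize=4)
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.set_xlabel("Condition")
ax.set_ylabel("Mean response time (ms)")
ax.set_title("Mean response time with standard deviation")
```

Passing `xerr` in the same way adds horizontal error bars.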
4. Box-and-Whisker Plot (Box
Plot)
• Description: A box plot displays the distribution of data based on five
summary statistics: minimum, first quartile (Q1), median, third quartile (Q3),
and maximum. The box spans the interquartile range (IQR = Q3 - Q1), and the
"whiskers" extend to the smallest and largest values that lie within 1.5 times
the IQR of the box edges. Points beyond the whiskers are drawn individually as
outliers.
• When to Use: Box plots are ideal for comparing distributions between
groups or identifying outliers. They provide a concise summary of data
spread and central tendency.
• Pros:
• Summarizes data distribution.
• Identifies outliers.
• Useful for comparing multiple groups.
• Cons:
• Does not show individual data points.
• Example Use Case: Comparing the distribution of salaries across different
departments in a company.
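The 1.5 times IQR rule can be checked by hand before plotting. The sketch below assumes Matplotlib and NumPy and uses made-up salary figures for two hypothetical departments. It computes the outlier fences for one group and then draws box plots for both.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical salaries (in thousands) for two departments (illustrative only).
sales = np.array([40, 42, 45, 47, 50, 52, 55, 58, 60, 120])
engineering = np.array([60, 65, 68, 70, 72, 75, 78, 80, 85, 90])

# Outlier fences for one group, using the 1.5 * IQR rule.
q1, q3 = np.percentile(sales, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sales[(sales < lower_fence) | (sales > upper_fence)]

fig, ax = plt.subplots()
ax.boxplot([sales, engineering])  # whis=1.5 is Matplotlib's default
ax.set_xticks([1, 2])
ax.set_xticklabels(["Sales", "Engineering"])
ax.set_ylabel("Salary (thousands)")
ax.set_title("Salary distribution by department")
```

Here the single value of 120 falls above the upper fence, so it is drawn as an individual outlier point rather than being covered by the whisker.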
5. Histogram
• Description: A histogram divides the data into bins (intervals) and
counts the number of observations in each bin. It displays the
frequency distribution of a continuous variable.
• When to Use: Histograms are used to visualize the shape of the
distribution of a dataset, such as whether it is normal, skewed, or
multimodal. They are particularly useful for large datasets.
• Pros:
• Shows the shape of the distribution.
• Can handle large datasets.
• Cons:
• Choice of bin width can affect the interpretation.
• Does not show individual data points.
• Example Use Case: Displaying the distribution of household
incomes in a city.
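The effect of bin width can be seen by drawing the same data with two different bin counts. This sketch assumes Matplotlib and NumPy; the "incomes" are synthetic values drawn from a lognormal distribution purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Synthetic, right-skewed "incomes" (lognormal draw; illustrative only).
incomes = rng.lognormal(mean=10.5, sigma=0.5, size=1000)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, bins in zip(axes, (10, 50)):
    # hist returns the bin counts, the bin edges, and the drawn patches.
    counts, edges, _ = ax.hist(incomes, bins=bins, edgecolor="black")
    ax.set_title(f"{bins} bins")
    ax.set_xlabel("Household income")
    ax.set_ylabel("Frequency")
```

With few bins the skew is visible but detail is lost; with many bins the shape is finer but noisier.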
Choosing the Right Plot
• Small Datasets:
• Dot Plot: Use when you want to visualize individual data points and their
frequencies.
• Jitter Plot: Use when there is significant overlap in data points in a dot plot.
• Data with Variability Information:
• Error Bar Plot: Use when you want to convey the central tendency and variability
or uncertainty in the data.
• Comparing Distributions:
• Box-and-Whisker Plot: Use for summarizing and comparing distributions across
multiple groups, while also identifying outliers.
• Visualizing Distribution Shape:
• Histogram: Use when you want to understand the overall shape, skewness, and
modality of the data distribution.
• Each of these plots serves a specific purpose and helps in understanding
different aspects of the data. The choice of plot depends on the dataset
size, the type of data (categorical or continuous), and the specific insights
you aim to gain.
Two-variable plots
• Two-variable plots, also known as bivariate plots, are
used to explore the relationship between two variables.
• These plots help in identifying patterns, correlations,
trends, and potential anomalies between the variables.
• Below is a detailed explanation of various two-variable
plots, when to use them, and which graphs are most
suitable for different types of data.
1. Bar Chart
• Description: A bar chart displays data using rectangular bars
where the length of the bar is proportional to the value of the
variable. Bar charts can compare two variables by using grouped
or stacked bars.
• When to Use: Bar charts are used when comparing categorical
data between two groups. They are particularly useful when you
want to show differences in magnitude between categories.
• Pros:
• Easy to interpret.
• Effective for comparing discrete categories.
• Cons: Not suitable for continuous data.
• Example Use Case: Comparing the sales figures of different
products across two different years.
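A grouped bar chart for the use case above might look like the following sketch. It assumes Matplotlib and NumPy; the product names, years, and sales figures are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sales figures (illustrative data only).
products = ["A", "B", "C"]
sales_2022 = [120, 90, 150]
sales_2023 = [140, 85, 170]

x = np.arange(len(products))
width = 0.35  # width of each bar; two bars share one category slot

fig, ax = plt.subplots()
bars_2022 = ax.bar(x - width / 2, sales_2022, width, label="2022")
bars_2023 = ax.bar(x + width / 2, sales_2023, width, label="2023")
ax.set_xticks(x)
ax.set_xticklabels(products)
ax.set_xlabel("Product")
ax.set_ylabel("Units sold")
ax.set_title("Sales by product and year")
ax.legend()
```

For stacked rather than grouped bars, the second call can instead use the same x positions with `bottom=sales_2022`.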
2. Scatter Plot
• Description: A scatter plot displays individual data points plotted on a
two-dimensional graph with one variable on the x-axis and the other
on the y-axis. It shows how one variable is related to another.
• When to Use: Scatter plots are ideal for examining the relationship or
correlation between two continuous variables. They are useful for
detecting patterns, trends, clusters, and outliers.
• Pros:
• Visualizes relationships and correlations.
• Identifies patterns, clusters, and outliers.
• Cons:
• Can be cluttered with large datasets.
• Example Use Case: Exploring the relationship between advertising
spend and sales revenue.
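A scatter plot is often paired with a correlation coefficient. The sketch below assumes Matplotlib and NumPy and generates synthetic advertising and revenue figures; the linear relationship is built into the fake data, so it illustrates the technique rather than any real finding.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Synthetic data: revenue rises with ad spend, plus noise (illustrative only).
ad_spend = rng.uniform(10, 100, size=50)
revenue = 3.0 * ad_spend + rng.normal(0, 20, size=50)

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(ad_spend, revenue)[0, 1]

fig, ax = plt.subplots()
ax.scatter(ad_spend, revenue, alpha=0.7)
ax.set_xlabel("Advertising spend")
ax.set_ylabel("Sales revenue")
ax.set_title(f"Advertising spend vs. revenue (r = {r:.2f})")
```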
3. Line Plot
• Description: A line plot connects individual data points with lines,
typically used to display trends over time. Each data point represents
the value of a variable at a specific time or ordered sequence.
• When to Use: Line plots are used for visualizing trends in time
series data or any sequential data. They are particularly useful for
showing changes and trends over time.
• Pros:
• Effective for visualizing trends over time.
• Clear representation of changes in data.
• Cons: Not suitable for non-sequential data.
• Example Use Case: Tracking the monthly temperature change over
a year.
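A line plot for the monthly temperature example can be sketched as below, assuming Matplotlib. The temperature values are made up for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly average temperatures in degrees Celsius (illustrative only).
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
temps = [5, 7, 11, 15, 19, 23, 26, 25, 21, 15, 10, 6]

fig, ax = plt.subplots()
# Points are connected in the order given, so the data must already be ordered.
(line,) = ax.plot(months, temps, marker="o")
ax.set_xlabel("Month")
ax.set_ylabel("Average temperature (°C)")
ax.set_title("Monthly temperature over a year")
```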
4. Log-Log Plot
• Description: A log-log plot is a scatter plot where both the x-axis
and y-axis are on a logarithmic scale. This type of plot is used when
the data spans several orders of magnitude.
• When to Use: Log-log plots are used when the two variables have a
multiplicative (power-law) relationship of the form y = a·x^k, which
appears as a straight line with slope k on log-log axes, or when the
data spans multiple orders of magnitude. They are commonly used in
scientific data analysis.
• Pros:
• Useful for visualizing power-law relationships.
• Handles wide-ranging data scales.
• Cons:
• Can be difficult to interpret without proper knowledge.
• Example Use Case: Analyzing the relationship between the size of
an earthquake and the energy released.
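Taking logarithms of a power law y = a·x^k gives log y = log a + k·log x, which is why such data falls on a straight line in a log-log plot. The sketch below assumes Matplotlib and NumPy and uses synthetic data with a = 2 and k = 1.5 (arbitrary illustrative values); it recovers both constants from a straight-line fit in log space.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic power-law data y = a * x**k with a = 2.0, k = 1.5 (illustrative only).
x = np.logspace(0, 4, 30)  # 30 points from 1 to 10,000
y = 2.0 * x ** 1.5

# A straight-line fit in log space recovers the exponent and the prefactor.
slope, intercept = np.polyfit(np.log10(x), np.log10(y), 1)
a_estimate = 10 ** intercept

fig, ax = plt.subplots()
ax.loglog(x, y, "o")
ax.set_xlabel("x (log scale)")
ax.set_ylabel("y (log scale)")
ax.set_title(f"Power law: fitted exponent = {slope:.2f}")
```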
Choosing the Right Plot
• Comparing Categorical Data:
• Bar Chart: Use when you need to compare categorical variables across
different groups. For example, comparing test scores between two different
classes.
• Exploring Relationships Between Continuous Variables:
• Scatter Plot: Ideal for visualizing the correlation or relationship between
two continuous variables, such as height and weight of individuals.
• Visualizing Trends Over Time:
• Line Plot: Best for showing trends and changes over time, such as stock
prices over months or years.
• Analyzing Data with Wide Ranges:
• Log-Log Plot: Use when your data spans several orders of magnitude and
you suspect a multiplicative relationship between the variables.
Summary of When to Use Which
Plot:
• Bar Chart: Compare categories between two variables
(especially categorical variables).
• Scatter Plot: Explore relationships and correlations between two
continuous variables.
• Line Plot: Track changes or trends over time or ordered
sequences.
• Log-Log Plot: Analyze data that spans several orders of
magnitude with multiplicative relationships.
• These two-variable plots are fundamental tools in data analysis,
helping to uncover insights about relationships, trends, and
patterns between variables. The choice of plot depends on the
nature of the data and the specific analytical goals.
Plots for more than two variables
• Plots of more than two variables, also known as
multivariate plots, are used to visualize relationships
between three or more variables in a dataset.
• These plots help in understanding complex interactions
and patterns that cannot be captured by two-variable
plots.
• Below is a detailed explanation of various multivariate
plots, when to use them, and which graphs are most
suitable for different types of data.
1. Stacked Plot
• Description: A stacked plot visualizes the cumulative contribution of multiple
variables over a single dimension, often time. Each segment (or "stack") of the
plot represents one variable's contribution, with the segments stacked on top of
each other.
• When to Use: Stacked plots are used to show how different variables contribute
to a total over time or another dimension. They are useful for understanding the
proportion of each variable relative to the whole.
• Pros:
• Shows the composition of multiple variables.
• Useful for displaying cumulative totals.
• Cons:
• Can be difficult to interpret if there are too many variables.
• Changes in individual variables are harder to track.
• Example Use Case: Visualizing the sales contribution of different product
categories over several months.
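A stacked area plot for the use case above can be sketched with Matplotlib's `stackplot`. The category names and sales figures are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical monthly sales per product category (illustrative data only).
months = np.arange(1, 7)
electronics = np.array([20, 22, 25, 27, 30, 32])
clothing = np.array([15, 14, 16, 18, 17, 19])
groceries = np.array([30, 31, 29, 32, 33, 35])

# The top edge of the stack equals the total across categories.
total = electronics + clothing + groceries

fig, ax = plt.subplots()
layers = ax.stackplot(
    months, electronics, clothing, groceries,
    labels=["Electronics", "Clothing", "Groceries"],
)
ax.set_xlabel("Month")
ax.set_ylabel("Sales")
ax.set_title("Sales contribution by category")
ax.legend(loc="upper left")
```

Only the bottom layer sits on a flat baseline, which is why changes in the upper layers are harder to read.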
2. Parallel Coordinate Plot
• Description: A parallel coordinate plot visualizes multiple variables by plotting
each variable on a separate vertical axis. Each data point is represented as a
line connecting the axes.
• When to Use: Parallel coordinate plots are used when you need to compare
many variables across different observations simultaneously. They are
particularly useful for identifying patterns, correlations, and outliers in high-
dimensional data.
• Pros:
• Handles high-dimensional data.
• Good for comparing multiple variables simultaneously.
• Cons:
• Can become cluttered with large datasets.
• Interpretation can be challenging without careful design.
• Example Use Case: Analyzing the characteristics of different types of cars,
such as engine size, fuel efficiency, and price.
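A parallel coordinate plot can be built by hand with Matplotlib, as in the sketch below. The car data is made up, and rescaling each variable to the range 0 to 1 is a design choice made here so that axes with different units can share one plot; it is not the only option.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical car data (illustrative only): one row per car.
names = ["Engine size (L)", "Fuel efficiency (mpg)", "Price ($1000)"]
cars = np.array([
    [1.6, 38.0, 18.0],
    [2.0, 32.0, 25.0],
    [3.5, 22.0, 45.0],
    [5.0, 16.0, 70.0],
])

# Rescale each variable to [0, 1] so axes with different units are comparable.
mins, maxs = cars.min(axis=0), cars.max(axis=0)
scaled = (cars - mins) / (maxs - mins)

positions = np.arange(len(names))
fig, ax = plt.subplots()
for row in scaled:
    ax.plot(positions, row, marker="o")  # one line per observation
for p in positions:
    ax.axvline(p, color="gray", linewidth=0.8)  # one vertical axis per variable
ax.set_xticks(positions)
ax.set_xticklabels(names)
ax.set_ylabel("Scaled value (0 = min, 1 = max)")
ax.set_title("Parallel coordinates of car characteristics")
```

Lines that cross between two adjacent axes suggest a negative relationship between those variables, as with engine size and fuel efficiency in this fake data.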
3. Scatter Matrix (Pair Plot)
• Description: A scatter matrix is a grid of scatter plots for each pair of
variables in a dataset. Each cell in the grid shows the relationship between
two variables, with histograms along the diagonal to show the distribution of
individual variables.
• When to Use: Scatter matrices are useful when you want to explore the
relationships between all pairs of variables in a dataset. They are particularly
helpful in identifying correlations, patterns, and potential multicollinearity.
• Pros:
• Provides a comprehensive view of relationships between multiple variables.
• Helps in identifying correlations and patterns.
• Cons:
• Becomes overwhelming with too many variables.
• Can be hard to interpret without careful analysis.
• Example Use Case: Exploring relationships between various financial
indicators such as revenue, profit, expenses, and stock price.
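The grid layout of a scatter matrix can be sketched directly with Matplotlib subplots, assuming synthetic financial data generated for illustration. Libraries such as pandas (`pandas.plotting.scatter_matrix`) and seaborn (`pairplot`) offer ready-made versions; the manual loop below just makes the structure explicit.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
n = 100

# Synthetic financial indicators (illustrative only).
revenue = rng.normal(100, 15, n)
expenses = 0.6 * revenue + rng.normal(0, 5, n)
profit = revenue - expenses
data = {"revenue": revenue, "expenses": expenses, "profit": profit}

names = list(data)
k = len(names)
fig, axes = plt.subplots(k, k, figsize=(8, 8))
for i, row_name in enumerate(names):
    for j, col_name in enumerate(names):
        ax = axes[i, j]
        if i == j:
            ax.hist(data[row_name], bins=15)  # distribution on the diagonal
        else:
            ax.scatter(data[col_name], data[row_name], s=8)  # pairwise relationship
        if i == k - 1:
            ax.set_xlabel(col_name)
        if j == 0:
            ax.set_ylabel(row_name)
fig.suptitle("Scatter matrix of financial indicators")
```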
4. Heatmap
• Description: A heatmap is a two-dimensional representation of data
where the individual values are represented by colors. It is commonly used
to visualize correlations between multiple variables in a matrix format.
• When to Use: Heatmaps are ideal for visualizing the correlation or
relationship between many variables in a compact and intuitive manner.
They are particularly useful in identifying clusters, patterns, and
correlations.
• Pros:
• Compact and easy to interpret.
• Effective for visualizing large datasets.
• Cons:
• May oversimplify complex relationships.
• Choice of color scale can impact interpretation.
• Example Use Case: Displaying the correlation matrix of various economic
indicators like GDP, inflation rate, unemployment rate, and interest rate.
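A correlation heatmap can be sketched with NumPy's `corrcoef` and Matplotlib's `imshow`. The indicators below are synthetic series generated for illustration, with some relationships deliberately built in; they do not represent real economic data.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
n = 200

# Synthetic economic indicators (illustrative only).
gdp_growth = rng.normal(2.5, 1.0, n)
unemployment = 6.0 - 0.8 * gdp_growth + rng.normal(0, 0.5, n)
inflation = rng.normal(2.0, 0.7, n)
interest_rate = 1.0 + 0.9 * inflation + rng.normal(0, 0.4, n)

names = ["GDP growth", "Unemployment", "Inflation", "Interest rate"]
corr = np.corrcoef([gdp_growth, unemployment, inflation, interest_rate])

fig, ax = plt.subplots()
# A diverging colormap fixed to [-1, 1] keeps zero correlation at the midpoint.
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
fig.colorbar(im, ax=ax, label="Correlation")
ax.set_xticks(range(len(names)))
ax.set_xticklabels(names, rotation=45, ha="right")
ax.set_yticks(range(len(names)))
ax.set_yticklabels(names)
ax.set_title("Correlation matrix of economic indicators")
```

Fixing the color limits to the full correlation range addresses the color-scale concern listed above, since the same color then always means the same value.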
Choosing the Right Plot
• Understanding Composition and Contribution:
• Stacked Plot: Use when you want to show how different
variables contribute to a total over a single dimension, such as
time.
• Comparing Multiple Variables:
• Parallel Coordinate Plot: Ideal for comparing multiple variables
across different observations, especially in high-dimensional datasets.
• Scatter Matrix: Use when you want to explore pairwise relationships
between all variables in a dataset.
• Visualizing Relationships and Correlations:
• Heatmap: Best for visualizing correlations between many variables in a
compact and intuitive format.
Summary of When to Use Which
Plot:
• Stacked Plot: Use for cumulative contributions over time or
another dimension.
• Parallel Coordinate Plot: Best for high-dimensional data
comparisons.
• Scatter Matrix: Useful for pairwise exploration of relationships.
• Heatmap: Ideal for visualizing correlations.
• These multivariate plots are essential tools in understanding
complex datasets with multiple variables. The choice of plot
depends on the nature of the data, the specific insights you aim
to gain, and the ease of interpretation required for your
analysis.
Conclusion:
• Choosing the right chart type is essential for effective
data communication.
• Consider the nature of the data (univariate, bivariate,
multivariate) and the message you wish to convey.
• Following best practices ensures that your visualizations
are clear, accurate, and engaging.
• Refer to the notebook Visualization_plots for all plots.