UNIT-3
EXPLORATORY DATA ANALYSIS
AND THE DATA SCIENCE PROCESS
C.KAVITHA
ASSISTANT PROFESSOR
DEPARTMENT OF CSE
1 06/02/2024
SYLLABUS
UNIT 3 EXPLORATORY DATA ANALYSIS AND THE DATA
SCIENCE PROCESS
Exploratory Data Analysis and the Data Science Process - Basic tools
(plots, graphs and summary statistics) of EDA -Philosophy of EDA - The
Data Science Process – Data Visualization - Basic principles, ideas and
tools for data visualization - Examples of exciting projects- Data
Visualization using Tableau.
TEXT / REFERENCE BOOKS
1. Cathy O’Neil and Rachel Schutt. Doing Data Science, Straight Talk
From The Frontline. O’Reilly. 2014.
2. Introduction to Linear Algebra - By Gilbert Strang, Wellesley-
Cambridge Press, 5th Edition.2016.
3. Applied Statistics and Probability For Engineers – By Douglas
Montgomery.2016.
4. Jure Leskovek, Anand Rajaraman and Jeffrey Ullman. Mining of
Massive Datasets. v2.1, Cambridge University Press. 2014. (free online)
5. Avrim Blum, John Hopcroft and Ravindran Kannan. Foundations of
Data Science.
6. Jiawei Han, Micheline Kamber and Jian Pei. Data Mining: Concepts and
Techniques, 3rd Edition. ISBN 0123814790, 2011.
7. Trevor Hastie, Robert Tibshirani and Jerome Friedman. Elements of
Statistical Learning, 2nd Edition. ISBN 0387952845. 2009. (free online)
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is an approach to
analyzing datasets to summarize their main
characteristics, often with visual methods. EDA is used
for seeing what the data can tell us before the
modeling task.
EDA assists data science professionals in various ways:
1. Getting a better understanding of the data
2. Identifying various data patterns
3. Getting a better understanding of the problem statement
Descriptive Statistics
Descriptive statistics is a helpful way to understand the
characteristics of your data and to get a quick summary
of it. Pandas in Python provides the describe()
method for this. The describe() function applies basic
statistical computations to the dataset, such as extreme
values, the count of data points, and the standard deviation.
Any missing or NaN value is automatically skipped.
describe() gives a good picture of the distribution of the
data.
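As a minimal sketch of describe() on a small hypothetical DataFrame (the column names here are made up for illustration):

```python
import pandas as pd

# hypothetical toy dataset; any numeric columns behave the same way
df = pd.DataFrame({
    "age": [22, 35, 41, None, 29],                   # one NaN, skipped automatically
    "salary": [30000, 52000, 61000, 45000, 38000],
})

summary = df.describe()   # count, mean, std, min, quartiles, max per numeric column
print(summary)
print(summary.loc["count", "age"])   # the NaN is excluded from the count
```

Note that "count" for age is 4.0, not 5, because the missing value is skipped.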
Grouping data
Group by is a useful operation available in pandas
that can help us figure out the effect of different
categorical attributes on other data variables.
Correlation and correlation computation
Correlation is a statistical relationship between two
variables in a context such that one variable affects the
other. Correlation is different from causation.
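A minimal sketch combining both ideas, on a hypothetical sales table (the column names are invented for illustration):

```python
import pandas as pd

# hypothetical data: one categorical attribute and two numeric variables
df = pd.DataFrame({
    "region":  ["north", "south", "north", "south", "north"],
    "units":   [10, 20, 30, 40, 50],
    "revenue": [100, 210, 290, 405, 510],
})

# effect of the categorical attribute on a numeric variable
print(df.groupby("region")["revenue"].mean())

# pairwise Pearson correlation between the numeric variables
corr = df[["units", "revenue"]].corr()
print(corr.loc["units", "revenue"])
```

A high correlation here does not imply that units cause revenue (or vice versa); correlation is not causation.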
ANOVA
ANOVA stands for Analysis of Variance. It is
performed to figure out the relation between
different groups of categorical data.
Under ANOVA we have two measures as a result:
– F-test score: shows the variation of the group
means over the within-group variation
– p-value: shows the statistical significance of the result
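A hedged sketch of the F-score computation using NumPy on three made-up groups; in practice scipy.stats.f_oneway returns both the F-score and the p-value directly:

```python
import numpy as np

# three hypothetical groups of a categorical variable
groups = [np.array([5.0, 6.0, 7.0]),
          np.array([8.0, 9.0, 10.0]),
          np.array([5.5, 6.5, 7.5])]

k = len(groups)                   # number of groups
n = sum(len(g) for g in groups)   # total number of observations
grand_mean = np.concatenate(groups).mean()

# between-group variation: group means vs. the grand mean
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# within-group variation: each observation vs. its own group mean
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_score = (ssb / (k - 1)) / (ssw / (n - k))
print(f_score)
```

A large F-score means the variation between the group means is large relative to the variation within the groups.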
TYPES OF EDA
• There are broadly two categories of EDA, graphical and non-graphical;
each can be univariate or multivariate, giving four types:
Univariate non-graphical
Multivariate non-graphical
Univariate graphical
Multivariate graphical
Univariate non-graphical
This is the simplest form of data analysis, as we use
just one variable to analyze the data. The
standard goal of univariate non-graphical EDA is to
understand the underlying distribution of the sample data
and to make observations about the population. Outlier
detection is also part of the analysis.
Multivariate non-graphical
Multivariate non-graphical EDA techniques are generally used to show the
relationship between two or more variables in the form of either
cross-tabulation or statistics.
For categorical data, an extension of tabulation called cross-tabulation is
extremely useful. For two variables, cross-tabulation is performed by making a
two-way table with column headings that match the levels of one variable
and row headings that match the levels of the other variable, then
filling in the counts of all subjects that share the same pair of levels.
For one categorical variable and one quantitative variable, we compute
statistics for the quantitative variable separately for each level of the
categorical variable, then compare those statistics across the levels of the
categorical variable.
Comparing means is an informal version of ANOVA, and comparing
medians is a robust version of one-way ANOVA.
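The cross-tabulation described above can be sketched with pandas (the two categorical columns here are hypothetical):

```python
import pandas as pd

# hypothetical categorical data with two variables
df = pd.DataFrame({
    "gender": ["M", "F", "M", "F", "M", "F"],
    "bought": ["yes", "yes", "no", "yes", "no", "no"],
})

# rows are the levels of one variable, columns the levels of the other;
# each cell counts the subjects sharing that pair of levels
table = pd.crosstab(df["gender"], df["bought"])
print(table)
```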
Multivariate graphical:
A graphical representation always gives you a better
understanding of relationships, especially among
multiple variables.
Other common sorts of multivariate graphics are:
Scatterplot: For two quantitative variables, the essential
graphical EDA technique is the scatterplot, which has
one variable on the x-axis and one on the y-axis, with
a point for every case in your dataset.
TOOLS IN EDA
The most commonly used software tools to perform EDA are Python and R.
Graphical exploratory data analysis employs visual tools to display data, such as:
Box plots
Box plots are used where there is a need to summarize data on an interval scale, such as
stock-market data, where the ticks observed over one whole day may be represented in
a single box highlighting the lowest, highest, and median values and the outliers.
Heatmap
Heatmaps are most often used for the representation of the correlation between
variables. For example, in a wine-quality dataset a heatmap can reveal a strong
correlation between density and residual sugar and almost no correlation between
alcohol and residual sugar.
Histograms
The histogram is the graphical representation of numerical data that splits the data into
ranges. The taller the bar, the greater the number of data points falling in that range. A
good example here is the height data of a class of students. You would notice that the
height data looks like a bell curve for a particular class, with most of the data lying within
a certain range and a few values outside it. There will be outliers too, either very
short or very tall.
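The binning behind a histogram can be sketched with NumPy (the heights below are invented):

```python
import numpy as np

# hypothetical heights (cm) for a class of students
heights = np.array([150, 152, 155, 158, 160, 161, 162, 163, 165, 170, 172, 190])

# split the range into 4 equal-width bins and count the data points in each
counts, edges = np.histogram(heights, bins=4)
print(counts)   # the taller "bars" are the bins holding more data points
print(edges)    # bin boundaries from the minimum to the maximum height
```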
Line graphs: one of the most basic types of charts that plots data
points on a graph; has a wealth of uses in almost every field of
study.
Pictograms: replace numbers with images to visually explain data.
They’re common in the design of infographics, as well as visuals
that data scientists can use to explain complex findings to non-data-
scientist professionals and the public.
Scattergrams or scatterplots: typically used to display two
variables in a set of data and then look for correlations among the
data. For example, scientists might use it to evaluate the presence of
two particular chemicals or gases in marine life in an effort to look
for a relationship between the two variables.
Non-graphical exploratory data analysis involves data collection
and reporting in nonvisual or non-pictorial formats.
Some of the most common data science tools used to create an EDA
include:
Python: An interpreted, object-oriented programming language with
dynamic semantics. Its high-level, built-in data structures, combined
with dynamic typing and dynamic binding, make it very attractive for
rapid application development, as well as for use as a scripting or glue
language to connect existing components together. Python and EDA can
be used together to identify missing values in a data set, which is
important so you can decide how to handle missing values for machine
learning.
R: An open-source programming language and free software
environment for statistical computing and graphics supported by the R
Foundation for Statistical Computing. The R language is widely used
among statisticians in data science in developing statistical observations
and data analysis.
Philosophy of EDA
EDA is not identical to statistical graphics, although the two terms
are used almost interchangeably. Statistical graphics is a collection
of techniques, all graphically based and all focusing on one data
characterization aspect. EDA encompasses a larger venue: EDA is an
approach to data analysis that postpones the usual assumptions about
what kind of model the data follow, taking the more direct approach of
allowing the data itself to reveal its underlying structure and model.
EDA is not a mere collection of techniques; EDA is a philosophy as
to how we dissect a data set: what we look for, how we look, and
how we interpret. It is true that EDA heavily uses the collection of
techniques that we call "statistical graphics", but it is not identical to
statistical graphics.
DATA SCIENCE PROCESS
Data cleaning, on the other hand, is the process of detecting and correcting errors so
that your given data set is error-free, consistent, and usable: identifying any errors
or corruptions in the data, correcting or deleting them, or manually processing them
as needed to prevent the errors from corrupting the final analysis.
Data cleaning is necessary because you need to sanitize data while gathering it. The
following are some of the most typical causes of data inconsistencies and
errors:
Duplicate entries produced when combining data from a variety of databases.
Errors in the precision of the input data.
Changes, updates, and deletions made to the data entries.
Variables with missing values across multiple databases.
DATA CLEANING – MISSING VALUES
DATA CLEANING – OUTLIERS
◾ What is an Outlier? – an anomaly – We will generally define outliers as samples that are exceptionally far
from the mainstream of the data.
◾ Outliers can have many causes, such as:
◾ Standard Deviation - If we know that the distribution of values in the sample is Gaussian or Gaussian-like,
we can use the standard deviation of the sample as a cut-off for identifying outliers
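A minimal sketch of the standard-deviation cut-off on synthetic Gaussian data with two planted outliers (the 3-sigma threshold is a common but arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
# Gaussian-like sample around 50 with two planted extreme values
data = np.concatenate([rng.normal(50, 5, 500), [95.0, 4.0]])

mean, std = data.mean(), data.std()
cutoff = 3 * std                                  # 3 standard deviations
outliers = data[np.abs(data - mean) > cutoff]     # samples far from the mean
print(outliers)
```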
DATA CLEANING – OUTLIER DETECTION
◾ Outlier detection methods – Numeric
◾ Inter-Quartile Range (IQR) / Box-plot – IQR is a concept in statistics that is used to measure the statistical
dispersion and data variability by dividing the dataset into quartiles.
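A hedged IQR/box-plot sketch on a small made-up sample:

```python
import numpy as np

# hypothetical sample with one extreme value
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102], dtype=float)

q1, q3 = np.percentile(data, [25, 75])            # first and third quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # the usual box-plot fences
outliers = data[(data < lower) | (data > upper)]
print(outliers)
```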
DATA CLEANING – OUTLIER TREATMENT
◾ Mean/Median or random imputation – same as missing value treatment
◾ Dropping values – remove the observations with outliers - same as missing value treatment
◾ Discretization (Binning)
DATA CLEANING – DUPLICATE DATA
◾ be sure they are not real data that coincidentally have values that are identical
◾ try to figure out why you have duplicates in your data (is it due to class imbalance?)
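A minimal pandas sketch for spotting and dropping exact duplicates (toy records):

```python
import pandas as pd

# hypothetical records containing one exact duplicate row
df = pd.DataFrame({
    "id":   [1, 2, 2, 3],
    "name": ["a", "b", "b", "c"],
})

print(df.duplicated())           # True only for the repeated row
deduped = df.drop_duplicates()   # keeps the first occurrence of each row
print(len(deduped))
```

Before dropping, check (as the bullet above advises) whether the rows are genuine distinct records that merely coincide.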
DEPARTMENT OF COMPUTER SCIENCE &
ENGINEERING
Feature Scaling
The final step of data preprocessing is to apply the
very important feature scaling.
Feature scaling is a technique to standardize the
independent features present in the data to a fixed
range. It is performed during data pre-processing.
Why scaling: most of the time, your dataset will
contain features that vary highly in magnitude, units,
and range. Since most machine learning
algorithms use the Euclidean distance between two data
points in their computations, this is a problem.
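The two most common scalings can be sketched directly with NumPy (in practice, libraries such as scikit-learn provide StandardScaler and MinMaxScaler for this):

```python
import numpy as np

# hypothetical feature with a wide range of magnitudes
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# standardization: zero mean, unit variance
standardized = (x - x.mean()) / x.std()

# min-max scaling: values mapped into the fixed range [0, 1]
minmax = (x - x.min()) / (x.max() - x.min())

print(standardized)
print(minmax)
```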
1) Supervised Learning
It is based on the results of previous operations related to the existing
business operation. Based on previous patterns, Supervised Learning aids in the
prediction of an outcome. Some of the Supervised Learning Algorithms are:
Linear Regression
Random Forest
Support Vector Machines
2) Unsupervised Learning
This form of learning has no pre-existing consequence or
pattern. Instead, it concentrates on examining the
interactions and connections between the presently
available Data points. Some of the Unsupervised Learning
Algorithms are:
K-means Clustering
Hierarchical Clustering
Anomaly Detection
3) Reinforcement Learning
It is a fascinating Machine Learning technique that
uses a dynamic Dataset that interacts with the real
world. In simple terms, it is a mechanism by which a
system learns from its mistakes and improves over
time. Some of the Reinforcement Learning Algorithms
are:
Q-Learning
State-Action-Reward-State-Action (SARSA)
Deep Q Network
This is the next phase, and it’s crucial to check that our Data Science
Modelling efforts meet the expectations. The Data Model is applied to the Test
Data to check if it’s accurate and houses all desirable features. You can further
test your Data Model to identify any adjustments that might be required to
enhance the performance and achieve the desired results. If the required
precision is not achieved, you can go back to Step 5 (Machine Learning
Algorithms), choose an alternate Data Model, and then test the model again.
Example
Import the dataset & Libraries
The first step is usually importing the libraries that will be needed in the program. A
library is essentially a collection of modules that can be called and used.
Pandas offers tools for cleaning and processing your data. It is the most popular
Python library used for data analysis. In pandas, a data table is called a
dataframe.
This replaces every variable's null values with the respective column mean;
'inplace=True' applies the changes to the dataset in place.
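The idea can be sketched on a tiny hypothetical DataFrame (column names invented):

```python
import numpy as np
import pandas as pd

# hypothetical dataset with one missing value per column
df = pd.DataFrame({"Age": [25.0, np.nan, 35.0],
                   "Salary": [100.0, 200.0, np.nan]})

# replace each column's NaNs with that column's mean, in place
df.fillna(df.mean(numeric_only=True), inplace=True)
print(df)
```

Here the missing Age becomes 30.0 (the mean of 25 and 35) and the missing Salary becomes 150.0.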
Data Visualization
Data visualization is the process of translating large data sets and
metrics into visual representations such as charts and graphs.
Data Visualization Tools
1. Tableau
It is a business intelligence service that aids people in visualizing and understanding their
data; it is also one of the most widely used services in the field of business intelligence. It
allows you to design interactive report dashboards and worksheets to obtain business
insights, and it has outstanding visualization capabilities and great performance.
Pros:
Outstanding visual library
User friendly
Great performance
Connectivity to data
Powerful computation
Quick insights
Cons:
Inflexible pricing
No option for auto-refresh
Restrictive imports
Manual updates for static features
2. Power BI
Power BI, Microsoft's easy-to-use data visualization tool, is available for both on-
premise installation and deployment on the cloud infrastructure. Power BI is one of the
most complete data visualization tools that supports a myriad of backend databases,
including Teradata, Salesforce, PostgreSQL, Oracle, Google Analytics, Github, Adobe
Analytics, Azure, SQL Server, and Excel. The enterprise-level tool creates stunning
visualizations and delivers real-time insights for fast decision-making.
The Pros of Power BI:
No requirement for specialized tech support
Easily integrates with existing applications
Personalized, rich dashboard
High-grade security
No speed or memory constraints
Compatible with Microsoft products
The Cons of Power BI:
Cannot work with varied, multiple datasets
3. Dundas BI
Dundas BI offers highly-customizable data visualizations with interactive
scorecards, maps, gauges, and charts, optimizing the creation of ad-hoc, multi-page
reports. By providing users full control over visual elements, Dundas BI simplifies
the complex operation of cleansing, inspecting, transforming, and modeling big
datasets.
The Pros of Dundas BI:
Exceptional flexibility
A large variety of data sources and charts
Wide range of in-built features for extracting, displaying, and modifying data
The Cons of Dundas BI:
No option for predictive analytics
3D charts not supported
4. JupyteR
A web-based application, JupyteR, is one of the top-rated data visualization tools
that enable users to create and share documents containing visualizations,
equations, narrative text, and live code. JupyteR is ideal for data cleansing and
transformation, statistical modeling, numerical simulation, interactive computing,
and machine learning.
The Pros of JupyteR:
Rapid prototyping
Visually appealing results
Facilitates easy sharing of data insights
The Cons of JupyteR:
Tough to collaborate
At times code reviewing becomes complicated
5. Zoho Reports
Zoho Reports, also known as Zoho Analytics, is a comprehensive data
visualization tool that integrates Business Intelligence and online reporting
services, which allow quick creation and sharing of extensive reports in
minutes. The high-grade visualization tool also supports the import of Big
Data from major databases and applications.
The Pros of Zoho Reports:
Effortless report creation and modification
Includes useful functionalities such as email scheduling and report sharing
Plenty of room for data
Prompt customer support.
The Cons of Zoho Reports:
User training needs to be improved
The dashboard becomes confusing when there are large volumes of data
6. Google Charts
One of the major players in the data visualization market space, Google Charts, coded
with SVG and HTML5, is famed for its capability to produce graphical and pictorial
data visualizations. Google Charts offers zoom functionality, and it provides users
with unmatched cross-platform compatibility with iOS, Android, and even the earlier
versions of the Internet Explorer browser.
The Pros of Google Charts:
User-friendly platform
Easy to integrate data
Visually attractive data graphs
Compatibility with Google products.
The Cons of Google Charts:
The export feature needs fine-tuning
Inadequate demos on tools
Lacks customization abilities
Network connectivity required for visualization
7. Sisense
Regarded as one of the most agile data visualization tools, Sisense gives users access
to instant data analytics anywhere, at any time. The best-in-class visualization tool
can identify key data patterns and summarize statistics to help decision-makers make
data-driven decisions.
The Pros of Sisense:
Ideal for mission-critical projects involving massive datasets
Reliable interface
High-class customer support
Quick upgrades
Flexibility of seamless customization
The Cons of Sisense:
Developing and maintaining analytic cubes can be challenging
Does not support time formats
Limited visualization versions
8. Plotly
An open-source data visualization tool, Plotly offers full integration with
analytics-centric programming languages like Matlab, Python, and R, which
enables complex visualizations. Widely used for collaborative work,
disseminating, modifying, creating, and sharing interactive, graphical data,
Plotly supports both on-premise installation and cloud deployment.
The Pros of Plotly:
Allows online editing of charts
High-quality image export
Highly interactive interface
Server hosting facilitates easy sharing
The Cons of Plotly:
Speed is a concern at times
Free version has multiple limitations
Various screen-flashings create confusion and distraction
9. Data Wrapper
Data Wrapper is one of the very few data visualization tools on the market that
is available for free. It is popular among media enterprises because of its
inherent ability to quickly create charts and present graphical statistics on Big
Data. Featuring a simple and intuitive interface, Data Wrapper allows users to
create maps and charts that they can easily embed into reports.
The Pros of Data Wrapper:
Does not require installation for chart creation
Ideal for beginners
Free to use
The Cons of Data Wrapper:
Building complex charts like Sankey is a problem
Security is an issue as it is an open-source tool
10. QlikView
A major player in the data visualization market, QlikView provides solutions to over
40,000 clients in 100 countries. QlikView's data visualization tool, besides enabling
accelerated, customized visualizations, also incorporates a range of solid features,
including analytics, enterprise reporting, and Business Intelligence capabilities.
The Pros of QlikView:
User-friendly interface
Appealing, colorful visualizations
Trouble-free maintenance
A cost-effective solution
The Cons of QlikView:
RAM limitations
Poor customer support
Does not include the 'drag and drop' feature
Data Visualization with python
Python offers multiple great graphing libraries that come
packed with lots of different features.
Here are a few popular plotting libraries:
Matplotlib: low level, provides lots of freedom
Pandas Visualization: easy to use interface, built on
Matplotlib
Seaborn: high-level interface, great default styles
ggplot: based on R’s ggplot2, uses
Grammar of Graphics
Plotly: can create interactive plots
Matplotlib
Matplotlib is a visualization library in Python for 2D
plots of arrays. Matplotlib is written in Python and
makes use of the NumPy library. It can be used in
Python and IPython shells, Jupyter notebook, and web
application servers. Matplotlib comes with a wide
variety of plots like line, bar, scatter, histogram, etc.,
which can help us deep-dive into understanding trends,
patterns, and correlations. It was introduced by John
Hunter in 2002.
Seaborn
Conceptualized and built originally at Stanford
University, this library sits on top of matplotlib. In a sense,
it has some flavors of matplotlib while, from the
visualization point of view, it is much better than matplotlib
and has added features as well. Below are its advantages:
Built-in themes aid better visualization
Statistical functions aiding better data insights
Better aesthetics and built-in plots
Helpful documentation with effective examples
Bokeh
Bokeh is an interactive visualization library for
modern web browsers. It is suitable for large or
streaming data assets and can be used to develop
interactive plots and dashboards. There is a wide array
of intuitive graphs in the library which can be
leveraged to develop solutions. It works closely with
PyData tools. The library is well-suited for creating
customized visuals according to required use-cases.
The visuals can also be made interactive to serve a
what-if scenario model. All the code is open source
and available on GitHub.
plotly
plotly.py is an interactive, open-source, high-level,
declarative, and browser-based visualization library for
Python. It holds an array of useful visualizations, including
scientific charts, 3D graphs, statistical charts, and financial
charts, among others. Plotly graphs can be viewed
in Jupyter notebooks, standalone HTML files, or hosted
online. Plotly library provides options for interaction and
editing. The robust API works perfectly in both local and
web browser mode.
ggplot
ggplot is a Python implementation of the grammar of graphics.
The Grammar of Graphics refers to the mapping of data to
aesthetic attributes (colour, shape, size) and geometric objects
(points, lines, bars). The basic building blocks according to the
grammar of graphics are data, geom (geometric objects), stats
(statistical transformations), scale, coordinate system, and facet.
Using ggplot in Python allows you to develop informative
visualizations incrementally, understanding the nuances of the
data first, and then tuning the components to improve the visual
representations.
Data Visualization in python using Seaborn
Box plot:
A box plot (or box-and-whisker plot) is a visual
representation of groups of numerical data
through their quartiles, plotted against continuous/categorical
data.
A box plot consists of 5 things.
Minimum
First Quartile or 25%
Median (Second Quartile) or 50%
Third Quartile or 75%
Maximum
Representation of box plot.
Syntax:
seaborn.boxplot(x=None, y=None, hue=None,
data=None)
Parameters:
x, y, hue: Inputs for plotting long-form data.
data: Dataset for plotting. If x and y are absent, this is
interpreted as wide-form.
Returns: It returns the Axes object with the plot drawn
onto it.
Pandas and Seaborn are packages that make
importing and analyzing data much easier.
# import modules
import seaborn as sns
import pandas
data = pandas.read_csv("nba.csv")   # continuation assumed; the slide showed only the plot
sns.boxplot(x=data['Age'])
Example 2:
# import module
import seaborn as sns
import pandas
Violin Plot:
A violin plot is similar to a box plot. It shows the
distribution of quantitative data across one or more
categorical variables such that those distributions can be
compared.
Syntax: seaborn.violinplot(x=None, y=None,
hue=None, data=None)
Parameters:
x, y, hue: Inputs for plotting long-form data.
data: Dataset for plotting.
# import modules
import seaborn
import pandas
seaborn.set(style = 'whitegrid')
data = pandas.read_csv("nba.csv")   # continuation assumed; the slide showed only the plot
seaborn.violinplot(x=data['Age'])
Line plot:
The line plot is the most popular plot to draw a relationship
between x and y, with the possibility of several
semantic groupings.
Syntax : sns.lineplot(x=None, y=None)
Parameters:
x, y: Input data variables; must be numeric. Can pass
data directly or reference columns in data.
# import modules
import seaborn as sns
import pandas
# loading csv
data = pandas.read_csv("nba.csv")
# plotting lineplot (keyword arguments; positional x/y are deprecated in newer seaborn)
sns.lineplot(x=data['Age'], y=data['Weight'])
Example 2: Use the hue parameter for plotting the graph.
# import modules
import seaborn as sns
import pandas
# plot
data = pandas.read_csv("nba.csv")
sns.lineplot(x=data['Age'], y=data['Weight'], hue=data["Position"])
Scatter Plot
A scatter plot (or scatter graph) is a bivariate plot that bears a strong
resemblance to a line graph in the way it is built.
A scatterplot can be used with several semantic groupings, which can
help in understanding a graph against continuous/categorical
data. It draws a two-dimensional graph.
Syntax: seaborn.scatterplot(x=None, y=None)
Parameters:
x, y: Input data variables that should be numeric.
Returns: This method returns the Axes object with the plot drawn
onto it.
Advantages of a scatter plot
Displays correlation between variables
Suitable for large data sets
Easier to find data clusters
Better representation of each data point
# import modules
import seaborn
import pandas
# load csv
data = pandas.read_csv("nba.csv")
# plotting (keyword arguments)
seaborn.scatterplot(x=data['Age'], y=data['Weight'])
Example 2: Use the hue parameter for plotting the graph
import seaborn
import pandas
data = pandas.read_csv("nba.csv")
# continuation assumed from the example's title
seaborn.scatterplot(x=data['Age'], y=data['Weight'], hue=data['Position'])
Bar plot:
Barplot represents an estimate of central tendency for a numeric
variable with the height of each rectangle and provides some indication
of the uncertainty around that estimate using error bars.
Syntax : seaborn.barplot(x=None, y=None, hue=None, data=None)
Parameters :
x, y : These parameters take names of variables in data or vector data;
inputs for plotting long-form data.
hue : (optional) Takes the column name for colour encoding.
data : (optional) Takes a DataFrame, array, or list of
arrays as the dataset for plotting. If x and y are absent, this is interpreted as
wide-form; otherwise it is expected to be long-form.
Returns : Returns the Axes object with the plot drawn onto it.
# import modules
import seaborn
import pandas
seaborn.set(style = 'whitegrid')
data = pandas.read_csv("nba.csv")   # continuation assumed; the slide showed only the plot
seaborn.barplot(x=data['Age'], y=data['Weight'])
Example 2
# import module
import seaborn
seaborn.set(style = 'whitegrid')
Point plot:
A point plot is used to show point estimates and confidence intervals
using scatter plot glyphs. A point plot represents an estimate of
central tendency for a numeric variable by the position of scatter
plot points and provides some indication of the uncertainty around
that estimate using error bars.
Syntax: seaborn.pointplot(x=None, y=None, hue=None, data=None)
Parameters:
x, y: Inputs for plotting long-form data.
hue: (optional) column name for color encoding.
data: dataframe as a Dataset for plotting.
Return: The Axes object with the plot drawn onto it.
# import modules
import seaborn
import pandas
seaborn.set(style = 'whitegrid')
data = pandas.read_csv("nba.csv")   # continuation assumed; the slide showed only the plot
seaborn.pointplot(x=data['Age'], y=data['Weight'])
Countplot
A countplot is a plot between a categorical variable and a
continuous variable, the continuous variable in this
case being the number of times the category is
present, or simply the frequency. In a sense, a count plot
can be said to be closely linked to a histogram or a bar
graph.
Syntax : seaborn.countplot(x=None, y=None, hue=None,
data=None)
Parameters :
x, y: (optional) These parameters take names of variables in data or
vector data; inputs for plotting long-form data.
hue : (optional) Takes the column name for
color encoding.
data : (optional) Takes a DataFrame, array,
or list of arrays as the dataset for plotting. If x and y are
absent, this is interpreted as wide-form; otherwise, it is
expected to be long-form.
Returns: Returns the Axes object with the plot drawn onto it.
# import modules
import seaborn
import pandas
seaborn.set(style = 'whitegrid')
data = pandas.read_csv("nba.csv")   # continuation assumed; the slide showed only the plot
seaborn.countplot(x=data['Position'])
Output Countplot
Bivariate and Univariate data using seaborn and pandas:
Bivariate data: This type of data involves two different
variables. The analysis of this type of data deals with
causes and relationships and the analysis is done to find
out the relationship between the two variables.
Univariate data: This type of data consists of only one
variable. The analysis of univariate data is thus the
simplest form of analysis since the information deals with
only one quantity that changes. It does not deal with
causes or relationships and the main purpose of the
analysis is to describe the data and find patterns that exist
within it.
Example 1: An example of bivariate data distribution
using the box plot.
# import modules
import seaborn as sns
import pandas
data = pandas.read_csv("nba.csv")   # continuation assumed; the slide showed only the plot
sns.boxplot(x=data['Position'], y=data['Age'])
Let’s see an example of univariate data distribution:
Example: Using the dist plot
# import modules
import seaborn as sns
import pandas
data = pandas.read_csv("nba.csv")   # data load assumed from the earlier examples
sns.distplot(data['Age'])   # distplot is deprecated in newer seaborn; use histplot/displot
Correlation heatmap
A correlation heatmap is a heatmap that shows a 2D correlation
matrix between two discrete dimensions, using colored cells to
represent data, usually on a monochromatic scale. The values
of the first dimension appear as the rows of the table, while those of
the second dimension appear as the columns.
The color of each cell is proportional to the strength of the
correlation between the corresponding pair of variables. This makes
correlation heatmaps ideal for data analysis, since they make
patterns easily readable and highlight the differences and
variation in the same data.
A correlation heatmap, like a regular heatmap, is accompanied by a
colorbar, making the data easily readable and comprehensible.
Syntax: seaborn.heatmap(data, vmin, vmax, center, cmap, …)
Example:
For the example given below, here a dataset
downloaded from kaggle.com is being used. The plot
shows data related to bestseller novels on amazon.
106 06/02/2024
# import modules
import matplotlib.pyplot as mp
import pandas as pd
import seaborn as sb
# load the dataset and plot its correlation matrix
# (the file name is illustrative)
data = pd.read_csv('bestsellers.csv')
sb.heatmap(data.corr(numeric_only=True), annot=True)
# displaying heatmap
mp.show()
Histogram
Histograms display counts of data and are hence similar to a bar
chart. A histogram plot can also tell us how close a data distribution
is to a normal curve. When applying statistical methods it is often
important that the data is normally, or close to normally, distributed.
Note, however, that histograms are univariate in nature, whereas bar
charts are typically used to compare categories.
A bar graph charts actual counts against categories, e.g. the height of
the bar indicates the number of items in that category, whereas a
histogram displays the same variable grouped into bins.
Bins are an integral part of building a histogram: they control which
data points fall within each range. As a widely accepted choice we
usually limit the number of bins to roughly 5-20, although this is
ultimately governed by the data points present.
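The effect of the bin count described above can be sketched quickly with Matplotlib. The data here is synthetic, generated only for illustration of how a coarse and a fine bin count display the same values:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=120, scale=15, size=500)  # synthetic blood-pressure-like values

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# Too few bins hide the shape of the distribution;
# a moderate bin count reveals the near-normal curve.
counts_coarse, _, _ = axes[0].hist(data, bins=5)
counts_fine, _, _ = axes[1].hist(data, bins=20)
axes[0].set_title("bins=5")
axes[1].set_title("bins=20")

# Whatever the bin count, the per-bin counts sum to the sample size
print(int(counts_coarse.sum()), int(counts_fine.sum()))  # 500 500
```

Changing only `bins` changes how much detail of the distribution is visible, which is why the 5-20 rule of thumb is just a starting point.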
Example: Diabetes dataset
# illustrate histogram
# (assumes the dataset has already been loaded,
#  e.g. diabetes = pd.read_csv('diabetes.csv'))
features = ['BloodPressure', 'SkinThickness']
diabetes[features].hist(figsize=(10, 4))
Some examples using Matplotlib
Bar chart using Matplotlib - Titanic dataset
# Creating the dataset
# (assumes import seaborn as sns and import matplotlib.pyplot as plt)
df = sns.load_dataset('titanic')
df = df.groupby('who')['fare'].sum().to_frame().reset_index()
# drawing the bar chart
plt.bar(df['who'], df['fare'])
plt.show()
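The Iris walkthrough that follows assumes a DataFrame named iris_data already exists. A minimal setup is sketched below, assuming scikit-learn is available; seaborn's `load_dataset('iris')` gives an equivalent frame but needs a network connection:

```python
import pandas as pd
from sklearn.datasets import load_iris

# Build the iris_data frame used in the following slides
iris = load_iris(as_frame=True)
iris_data = iris.frame.rename(columns={
    "sepal length (cm)": "sepal_length",
    "sepal width (cm)": "sepal_width",
    "petal length (cm)": "petal_length",
    "petal width (cm)": "petal_width",
})
# replace the integer target column with the species name
iris_data["species"] = iris.target_names[iris.target]
iris_data = iris_data.drop(columns="target")
print(iris_data.shape)  # (150, 5)
```

The resulting frame has the four measurement columns plus the categorical species column used throughout the EDA below.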
Gaining information from data
iris_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal_length 150 non-null float64
1 sepal_width 150 non-null float64
2 petal_length 150 non-null float64
3 petal_width 150 non-null float64
4 species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
We can see that only one column has categorical data and all the other columns are of
numeric type with non-null entries.
Data Insights:
1 No column has any null entries
2 Four columns are of numerical type
3 Only a single column is of categorical type
Statistical Insight
iris_data.describe()
Data Insights:
Mean values
Standard deviation
Minimum values
Maximum values
Checking Missing Values
We will check if our data contains any missing values or not. Missing
values can occur when no information is provided for one or more items or
for a whole unit. We will use the isnull() method.
iris_data.isnull().sum()
Checking For Duplicate Entries
iris_data[iris_data.duplicated()]
There are 3 duplicates, therefore we must check whether the count of
each species is still balanced or not.
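The duplicate check above can be sketched on a toy frame. The values are made up purely to show how `duplicated()` flags repeats and `drop_duplicates()` removes them:

```python
import pandas as pd

# Toy frame with one exact duplicate row (illustrative values)
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 5.1],
    "species": ["setosa", "setosa", "setosa"],
})

dupes = df[df.duplicated()]     # rows that repeat an earlier row exactly
print(len(dupes))               # 1

deduped = df.drop_duplicates()  # keeps the first occurrence of each row
print(deduped.shape)            # (2, 2)
```

Whether duplicates should actually be dropped depends on the data: in Iris, two flowers can legitimately share identical measurements.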
Checking the balance
iris_data['species'].value_counts()
Data Visualization
Visualizing the target column
Our target column will be the species column, because
at the end we will need the result according to the
species only. Note: we will use the Matplotlib and
Seaborn libraries for the data visualization.
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
plt.title('Species Count')
sns.countplot(x='species', data=iris_data)
Data Insight:
This further visualizes that the species are well balanced:
each species (Iris virginica, setosa, versicolor) has a count of 50.
Data Insights:
The Iris setosa species has a smaller sepal length but a higher sepal width.
Versicolor lies almost in the middle for both length and width.
Virginica has larger sepal lengths and smaller sepal widths.
Comparison between various species based on petal length and width
plt.figure(figsize=(16, 9))
plt.title('Comparison between various species based on petal length and width')
sns.scatterplot(x=iris_data['petal_length'],
y=iris_data['petal_width'], hue=iris_data['species'], s=50)
Data Insights
Setosa species have the smallest petal length as well as
petal width
Versicolor species have average petal length and petal
width
Virginica species have the highest petal length as well
as petal width
Let's plot all the columns' relationships using a pairplot.
It can be used for multivariate analysis.
sns.pairplot(iris_data, hue="species", height=4)
Data Insights:
High correlation between the petal length and width columns.
Setosa has both low petal length and width.
Versicolor has both average petal length and width.
Virginica has both high petal length and width.
Sepal width for setosa is high and its length is low.
Versicolor has average values for the sepal dimensions.
Virginica has a small sepal width but a large sepal length.
The heatmap is a data visualization technique that is
used to analyze the dataset as colors in two dimensions.
Basically, it shows a correlation between all numerical
variables in the dataset. In simpler terms, we can plot the
above-found correlation using the heatmaps.
Checking Correlation
plt.figure(figsize=(10, 11))
sns.heatmap(iris_data.corr(numeric_only=True), annot=True)
plt.show()
Heatmap
Data Insights:
Sepal Length and Sepal Width features are slightly
correlated with each other
Checking Mean & Median Values for each species
iris_data.groupby('species').agg(['mean', 'median'])
The output shows the mean and median of each feature; next we
visualize the distribution, mean and median using box plots & violin plots.
Box plots to know about distribution
A boxplot shows how the categorical feature 'species' is
distributed against all four input variables:
fig, axes = plt.subplots(2, 2, figsize=(16, 9))
sns.boxplot(y='petal_width', x='species', data=iris_data,
orient='v', ax=axes[0, 0])
sns.boxplot(y='petal_length', x='species', data=iris_data,
orient='v', ax=axes[0, 1])
sns.boxplot(y='sepal_length', x='species', data=iris_data,
orient='v', ax=axes[1, 0])
sns.boxplot(y='sepal_width', x='species', data=iris_data,
orient='v', ax=axes[1, 1])
plt.show()
Data Insights:
Setosa has smaller features and is less spread out.
Versicolor is distributed in an average manner and has
average features.
Virginica is highly spread out, with a large number of values
and wide feature ranges.
The mean/median values are clearly shown by each plot for the
various features (sepal length & width, petal length & width).
Violin Plot for checking distribution
The violin plot shows the density of the length and width values
for each species. The thinner parts denote lower density, whereas
the fatter parts convey higher density.
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
sns.violinplot(y='petal_width', x='species', data=iris_data,
orient='v', ax=axes[0, 0], inner='quartile')
sns.violinplot(y='petal_length', x='species', data=iris_data,
orient='v', ax=axes[0, 1], inner='quartile')
sns.violinplot(y='sepal_length', x='species', data=iris_data,
orient='v', ax=axes[1, 0], inner='quartile')
sns.violinplot(y='sepal_width', x='species', data=iris_data,
orient='v', ax=axes[1, 1], inner='quartile')
plt.show()
Data Insights:
Setosa has less spread and density in the case of
petal length & width.
Versicolor is distributed in an average manner, with
average features, in the case of petal length & width.
Virginica is highly spread out, with a large range of values,
in the case of sepal length & width.
High-density regions indicate the mean/median values;
for example, Iris setosa has its highest density at
5.0 cm for the sepal length feature, which is also the median
value (5.0) as per the table.
Mean / Median Table for reference
Plotting the Histogram & Probability Density
Function (PDF)
We plot the probability density function (PDF) with
each feature as a variable on the X-axis, and its histogram
and corresponding kernel density estimate on the Y-axis.
sns.FacetGrid(iris_data, hue="species", height=5) \
.map(sns.distplot, "sepal_length") \
.add_legend()
sns.FacetGrid(iris_data, hue="species", height=5) \
.map(sns.distplot, "sepal_width") \
.add_legend()
sns.FacetGrid(iris_data, hue="species", height=5) \
.map(sns.distplot, "petal_length") \
.add_legend()
sns.FacetGrid(iris_data, hue="species", height=5) \
.map(sns.distplot, "petal_width") \
.add_legend()
plt.show()
Plot 1 | Classification feature : Sepal Length
Plot 2 | Classification feature : Sepal Width
Plot 3 | Classification feature : Petal Length
Plot 4 | Classification feature : Petal Width
Data Insights:
Plot 1 shows that there is a significant amount of overlap between the
species on sepal length, so it is not an effective classification feature.
Plot 2 shows even higher overlap between the species on sepal width,
so it is not an effective classification feature either.
Plot 3 shows that petal length is a good classification feature, as it
clearly separates the species. The overlap is minimal (between
Versicolor and Virginica), and Setosa is well separated from the other two.
Just like Plot 3, Plot 4 shows that petal width is also a good
classification feature. The overlap is small (between
Versicolor and Virginica), and Setosa is well separated from the other two.
Choosing Plot 3 (classification feature: petal length)
to distinguish among the species
Plot 3 | Classification feature : Petal Length
Data Insights:
The PDF curve of Iris setosa ends roughly at 2.1.
If petal length < 2.1, the species is Iris setosa.
The point of intersection between the PDF curves of
Versicolor and Virginica is roughly at 4.8.
If 2.1 < petal length < 4.8, the species is Iris versicolor.
If petal length > 4.8, the species is Iris virginica.
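The rules above can be sketched as a tiny rule-based classifier. The thresholds 2.1 and 4.8 are the approximate values read off the PDF plots, not exact decision boundaries:

```python
def classify_by_petal_length(petal_length):
    """Rule-based species guess using the cut-offs read off the PDF plots."""
    if petal_length < 2.1:
        return "setosa"
    elif petal_length < 4.8:
        return "versicolor"
    else:
        return "virginica"

print(classify_by_petal_length(1.4))  # setosa
print(classify_by_petal_length(4.0))  # versicolor
print(classify_by_petal_length(5.5))  # virginica
```

Because Versicolor and Virginica overlap around 4.8, this one-feature rule will misclassify a handful of flowers, which is exactly the residual overlap visible in Plot 3.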
Titanic Data EDA
Titanic Dataset –
It is one of the most popular datasets used for understanding machine
learning basics. It contains information about the passengers aboard the
RMS Titanic, which was unfortunately shipwrecked. This dataset can be
used to predict whether a given passenger survived or not.
Features: The titanic dataset has roughly the following types of features:
Categorical/Nominal: Variables that can be divided into multiple categories but
that have no order or priority.
Eg. Embarked (C = Cherbourg; Q = Queenstown; S = Southampton)
Binary: A subtype of categorical features, where the variable has only two
categories.
Eg: Sex (Male/Female)
Ordinal: They are similar to categorical features but they have an order(i.e can be
sorted).
Eg. Pclass (1, 2, 3)
Continuous: They can take up any value between the minimum and maximum
values in a column.
Eg. Age, Fare
Count: They represent the count of a variable.
Eg. SibSp, Parch
Useless: They don’t contribute to the final outcome of an ML model.
Here, PassengerId, Name, Cabin and Ticket might fall into this category.
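A quick first pass at the categories above is to split the columns by pandas dtype. The sketch below uses two made-up rows mirroring the Titanic schema; dtype alone cannot distinguish ordinal from count features, so it is only a starting point:

```python
import pandas as pd

# Toy rows mirroring part of the Titanic schema (hypothetical values)
df = pd.DataFrame({
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
    "Embarked": ["S", "C"],
})

# Split columns by dtype: numeric vs. object (string-like)
numeric_cols = df.select_dtypes(include="number").columns.tolist()
object_cols = df.select_dtypes(include="object").columns.tolist()
print(numeric_cols)  # ['Pclass', 'Age']
print(object_cols)   # ['Sex', 'Embarked']
```

Note that Pclass lands in the numeric bucket even though it is really ordinal, which is why the manual categorization above is still needed.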
PROCEDURE
Import the relevant Python libraries for the analysis
Load the train and test datasets and set the index if applicable
Visually inspect the head of the dataset; examine the train dataset to
understand in particular whether the data is tidy, the shape of the
dataset, its datatypes, missing values and unique counts, and build a
data dictionary dataframe
Run descriptive statistics on the object and numerical datatypes, and
finally transform datatypes accordingly
Carry out univariate, bivariate and multivariate analysis using graphical
and non-graphical (some numbers representing the data) mediums
Feature Engineering: extract the title from the name; extract new features
from name, age, fare, sibsp, parch and cabin
Preprocessing: prepare the data for statistical modeling
Statistical Modelling
Import the relevant Python libraries for the analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Load the train and test datasets and set the index if
applicable
# load the train dataset
df = pd.read_csv('train.csv')
#inspect the first few rows of the train dataset
df.head()
df.shape
(891, 12)
# identify datatypes
df.dtypes
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
# identify missing values
df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
# Identify the number of unique values
# (shown here together with the dtype and missing-value count of each column)
df.nunique()
Column Dtype MissingVal NUnique
Survived int64 0 2
Pclass int64 0 3
Name object 0 891
Sex object 0 2
Age float64 177 88
SibSp int64 0 7
Parch int64 0 7
Ticket object 0 681
Fare float64 0 248
Cabin object 687 147
Embarked object 2 3
df.describe()
Carry out univariate and multivariate analysis using
graphical and non-graphical (some numbers representing the
data) mediums
df.Survived.value_counts(normalize=True)
Output:
0 0.616162
1 0.383838
Name: Survived, dtype: float64
Only about 38% of the passengers survived, whereas the
majority, about 62%, did not survive the disaster.
Univariate Analysis
fig, axes = plt.subplots(2, 4, figsize=(16, 10))
sns.countplot('Survived',data=df,ax=axes[0,0])
sns.countplot('Pclass',data=df,ax=axes[0,1])
sns.countplot('Sex',data=df,ax=axes[0,2])
sns.countplot('SibSp',data=df,ax=axes[0,3])
sns.countplot('Parch',data=df,ax=axes[1,0])
sns.countplot('Embarked',data=df,ax=axes[1,1])
sns.distplot(df['Fare'], kde=True,ax=axes[1,2])
sns.distplot(df['Age'].dropna(),kde=True,ax=axes[1,3])
Output:
<matplotlib.axes._subplots.AxesSubplot at
0x7ffa6366bdd8>
Bivariate EDA
We can clearly see that the male survival rate is around 20%,
whereas the female survival rate is about 75%, which suggests
that gender has a strong relationship with survival.
There is also a clear relationship between Pclass and
survival, referring to the first plot below. Passengers in
Pclass 1 had a better survival rate of approx 60%, whereas
passengers in Pclass 3 had the worst survival rate of approx
22%.
There is also a marginal relationship between the fare and the
survival rate.
figbi, axesbi = plt.subplots(2, 4, figsize=(16, 10))
df.groupby('Pclass')
['Survived'].mean().plot(kind='barh',ax=axesbi[0,0],xlim=[0,1])
df.groupby('SibSp')
['Survived'].mean().plot(kind='barh',ax=axesbi[0,1],xlim=[0,1])
df.groupby('Parch')
['Survived'].mean().plot(kind='barh',ax=axesbi[0,2],xlim=[0,1])
df.groupby('Sex')
['Survived'].mean().plot(kind='barh',ax=axesbi[0,3],xlim=[0,1])
df.groupby('Embarked')
['Survived'].mean().plot(kind='barh',ax=axesbi[1,0],xlim=[0,1])
sns.boxplot(x="Survived", y="Age", data=df,ax=axesbi[1,1])
sns.boxplot(x="Survived", y="Fare", data=df,ax=axesbi[1,2])
Multivariate EDA
Construct a correlation matrix of the int64 and float64
feature types:
There is a positive correlation between Fare and
Survived, and a negative correlation between Pclass
and Survived.
There is a negative correlation between Fare and
Pclass, and between Age and Pclass.
import seaborn as sns
f, ax = plt.subplots(figsize=(10, 8))
corr = df.corr(numeric_only=True)
sns.heatmap(corr,
mask=np.zeros_like(corr, dtype=bool),  # np.bool is removed in recent NumPy
cmap=sns.diverging_palette(220, 10, as_cmap=True),
square=True, ax=ax)
Having missing values in a dataset can cause errors with
some machine learning algorithms, so the rows that have
missing values should either be removed or imputed.
Imputing refers to using a model to replace missing values.
There are many options we could consider when replacing
a missing value, for example:
a constant value that has meaning within the domain, such as 0,
distinct from all other values;
a value from another randomly selected record;
the mean, median or mode value for the column;
a value estimated by another predictive model.
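The first three options in the list above can be sketched directly with `fillna`. The ages below are made-up values with gaps, just to show each strategy side by side:

```python
import pandas as pd

# Ages with gaps (illustrative values)
s = pd.Series([22.0, None, 38.0, None, 26.0])

constant_fill = s.fillna(0)          # a constant with domain meaning
mean_fill = s.fillna(s.mean())       # column mean
median_fill = s.fillna(s.median())   # column median, robust to outliers
mode_fill = s.fillna(s.mode()[0])    # most frequent value
# (a value from a random record, or one predicted by a model,
#  would follow the same fillna pattern)

print(int(mean_fill.isna().sum()))   # 0: no missing values remain
```

For the Titanic data, the mean or median is the usual choice for Fare and Age, while a constant placeholder is common for a mostly-empty column like Cabin.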
# impute the missing Fare values with the mean Fare value
df.Fare.fillna(df.Fare.mean(), inplace=True)
Statistical Modelling
df.columns
Output:
Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Name_len',
'Ticket_First', 'FamilyCount', 'Cabin_First', 'title'],
dtype='object')
trainML = df[['Survived', 'Pclass', 'Name', 'Sex', 'Age',
'SibSp', 'Parch', 'Ticket', 'Fare', 'Embarked', 'Name_len',
'Ticket_First', 'FamilyCount', 'title']]
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Embarked 0
Name_len 0
Ticket_First 0
FamilyCount 0
title 0
dtype: int64
Data Visualization using Tableau
Tableau is a data visualization tool that is widely used for
Business Intelligence but is not limited to it. It helps create
interactive graphs and charts in the form of dashboards and
worksheets to gain business insights, and all of this is made
possible with gestures as simple as drag and drop.
Tableau is a powerful and fast-growing data visualization
tool used in the Business Intelligence industry. It helps in
simplifying raw data into a very easily understandable format.
Tableau helps present data so that it can be understood by
professionals at any level in an organization. It also allows non-
technical users to create customized dashboards.
Data analysis is very fast with the Tableau tool, and the
visualizations created are in the form of dashboards
and worksheets.
The best features of Tableau software are:
Data Blending
Real-time analysis
Collaboration of data
What Products does Tableau offer?
Tableau Product Suite
The Tableau Product Suite consists of
Tableau Desktop
Tableau Public
Tableau Online
Tableau Server
Tableau Reader
For a clear understanding, data analytics in the Tableau tool can
be classified into two sections.
Developer Tools: The Tableau tools that are used for
development such as the creation of dashboards, charts,
report generation, visualization fall into this category. The
Tableau products, under this category, are the Tableau
Desktop and the Tableau Public.
Sharing Tools: As the name suggests, the purpose of these
Tableau products is sharing the visualizations, reports,
dashboards that were created using the developer tools.
Products that fall into this category are Tableau Online,
Server, and Reader.
Tableau Desktop
Tableau Desktop has a rich feature set and allows you to code and customize
reports. Right from creating the charts, reports, to blending them all together to
form a dashboard, all the necessary work is created in Tableau Desktop.
For live data analysis, Tableau Desktop provides connectivity to Data
Warehouse, as well as other various types of files. The workbooks and the
dashboards created here can be either shared locally or publicly.
Based on the connectivity to the data sources and publishing option, Tableau
Desktop is classified into
Tableau Desktop Personal: The development features are similar to Tableau Desktop.
Personal version keeps the workbook private, and the access is limited. The workbooks
cannot be published online. Therefore, it should be distributed either Offline or in
Tableau Public.
Tableau Desktop Professional: It is pretty much similar to Tableau Desktop. The
difference is that work created in Tableau Desktop Professional can be published online
or to Tableau Server. Also, in the Professional version there is full access to all sorts
of datatypes. It is best suited for those who wish to publish their work to Tableau Server.
Tableau Public
It is the Tableau version built specially for cost-
conscious users. The word "Public" means that the
workbooks created cannot be saved locally; instead, they
must be saved to Tableau's public cloud, where they
can be viewed and accessed by anyone.
There is no privacy for files saved to the cloud, since
anyone can download and access them. This
version is best for individuals who want to
learn Tableau and for those who want to share their
data with the general public.
Tableau Server
The software is specifically used to share the workbooks and
visualizations that are created in the Tableau Desktop application
across the organization. To share dashboards on Tableau Server,
you must first publish your work in Tableau Desktop. Once the
work has been uploaded to the server, it will be accessible only to
the licensed users.
However, it is not necessary for the licensed users to have
Tableau Server installed on their machines. They just require the
login credentials with which they can check reports via a web
browser. Security is high in Tableau Server, and it is well suited
for quick and effective sharing of data within an organization.
The admin of the organization will always have full control over the
server. The hardware and the software are maintained by the
organization.
Tableau Online
As the name suggests, it is an online sharing tool of Tableau. Its
functionalities are similar to Tableau Server, but the data is
stored on servers hosted in the cloud which are maintained by
the Tableau group.
There is no storage limit on the data that can be published to
Tableau Online. Tableau Online creates a direct link to over 40
data sources that are hosted in the cloud, such as MySQL,
Hive, Amazon Aurora, Spark SQL and many more.
To publish, both Tableau Online and Server require the
workbooks created by Tableau Desktop. Data that is streamed
from the web applications say Google Analytics,
Salesforce.com are also supported by Tableau Server and
Tableau Online.
Tableau Reader
Tableau Reader is a free tool which allows you to view
the workbooks and visualizations created using
Tableau Desktop or Tableau Public. The data can be
filtered but editing and modifications are restricted.
The security level is zero in Tableau Reader as anyone
who gets the workbook can view it using Tableau
Reader.
If you want to share the dashboards that you have
created, the receiver should have Tableau Reader to
view the document.
How does Tableau work?
Tableau connects to and extracts data stored in various places.
It can pull data from almost any platform imaginable: from a
simple source such as an Excel file or PDF, to a complex database
like Oracle, to a database in the cloud such as Amazon Web Services,
Microsoft Azure SQL Database or Google Cloud SQL, and various
other data sources.
When Tableau is launched, ready data connectors are available
which allow you to connect to any database. Depending on the
version of Tableau that you have purchased, the number of data
connectors supported by Tableau will vary.
The pulled data can either be connected live or extracted into
Tableau's data engine by Tableau Desktop. This is where data
analysts and data engineers work with the pulled data and
develop visualizations. The created dashboards are shared with
users as static files, which the recipients view using Tableau Reader.
The data from Tableau Desktop can be published to Tableau
Server. This is an enterprise platform where collaboration,
distribution, governance, security and automation features are
supported. With Tableau Server, end users have a better
experience accessing the files from any location, be it a desktop,
mobile or email.
Tableau Uses - Following are the main uses and applications
of Tableau:
Business Intelligence
Data Visualization
Data Collaboration
Data Blending
Real-time data analysis
Query translation into visualization
To import large volumes of data
To create no-code data queries
To manage large metadata
Excel Vs. Tableau
Both Excel and Tableau are data analysis tools, but
each tool has its own unique approach to data exploration.
However, the analysis in Tableau is more potent than in
Excel.
Excel works with rows and columns in spreadsheets,
whereas Tableau enables exploring Excel data using
its drag-and-drop feature. Tableau formats the data into
graphs and pictures that are easily understandable.
Parameters: Excel vs. Tableau
Purpose. Excel: a spreadsheet application used for manipulating data.
Tableau: a visualization tool used for analysis.
Usage. Excel: most suitable for statistical analysis of structured data.
Tableau: most suitable for quick and easy representation of big data,
which helps in resolving big-data issues.
Performance. Excel: moderate speed, with no option to quicken.
Tableau: moderate speed, with options to optimize and enhance the
progress of an operation.
Security. Excel: the inbuilt security feature is weak compared to
Tableau; security updates need to be installed on a regular basis.
Tableau: extensive options to secure data without scripting; security
features like row-level security and permissions are inbuilt.
User Interface. Excel: to utilize Excel to its full potential, macro and
Visual Basic scripting knowledge is required. Tableau: the tool can be
used without any coding knowledge.
Business need. Excel: best for preparing one-off reports with small
data. Tableau: best while working with big data.
Products. Excel: bundled with MS Office tools. Tableau: comes in
different versions such as Tableau Server, Cloud, and Desktop.
Integration. Excel: integrates with around 60 applications.
Tableau: is integrated with over 250 applications.
Real-time data exploration. Excel: when you are working in Excel,
you need to have an idea of where your data takes you in order to get
to the insights. Tableau: you are free to explore data without even
knowing the answer that you want; with the in-built features like data
blending and drill-down, you will be able to determine the variations
and data patterns.
Easy visualizations. Excel: you first manipulate the data that is
present, and then the visualizations such as the different charts and
graphs are created manually; to make the visualizations easily
understandable, you should understand the features of Excel well.
Tableau: the data is visualized from the beginning.
Tableau provides various services according to our business needs.
Installing Tableau on your System
Tableau Desktop, Tableau Public and Tableau Online all offer
data visual creation. The choice of Tableau depends upon the type of work.
Variants of Tableau
Tableau Desktop is a program that allows you to execute complicated
data analysis tasks and generate dynamic, interactive representations to
explain the results. Tableau also lets you share your analysis and
visualizations with the rest of your company, allowing everyone from
coworkers to top management to look into the data that matters to them.
After installation, if you see this home screen, you are
good to go:
Load Data in Tableau
We will be using global superstore data which is
perfect for learning purposes.
Tableau supports various data formats, which can be loaded by choosing the
corresponding options. Under the file section we see options to load data from
the local directory, and under the server section we see options to load data
from cloud servers. For loading CSV files we select the Text file option; for
Excel and SQL files we choose their respective options.
Understanding different Sections in Tableau
With Tableau loaded with the global-superstore data, we can
now see the Tableau work page, which consists of different
sections.
Menu Bar: Here you'll find various commands such as File, Data, and Format.
Toolbar: The toolbar contains a number of buttons that enable you to perform various
tasks with a click, such as Save, Undo, and New Worksheet.
Dimension Shelf: This shelf contains all the categorical columns, for example
categories, segments, gender, name, etc.
Measure Shelf: This shelf contains all the numerical columns, like profit, total sales,
discount, etc.
Page Shelf: This shelf is used for joining pages and creating animations; we will come
to it later.
Filter Shelf: You can choose which data to include and exclude using the Filters shelf.
For example, you might want to analyze the profit for each customer segment, but only
for certain shipping containers and delivery times. You can make a view like this by
putting fields on the Filters shelf.
Marks Card: The visualization can be designed using the Marks card. It can be used to
change the data components of the visualization, such as color, size, shape, path,
label, and tooltip.
Worksheet: In the workbook, the worksheet is where the actual visualization may be
seen. The worksheet contains information about the visual's design and functionality.
•Data Source: Using Data Source we can add new data, and modify or
remove existing data.
•Current Sheet: The current sheets are those we have created, and we
can give them names.
•New Sheet: If we want to create a new worksheet (a blank canvas), we
can do so using this tab.
•New Dashboard: This button is used to create a dashboard canvas.
•New Storyboard: It is used to create a new story.
Creating Visuals in Tableau
Let's begin with real data visualization using Tableau.
Tableau supports the following data types:
Boolean: True and False values can be stored in this data type.
Date/Datetime: This data type helps in leveraging Tableau's default
date hierarchy behavior when applied to valid Date or DateTime fields.
Number: These are numeric values; they can be integers
or floating-point numbers (numbers with decimals).
String: A sequence of characters encased in single or double
quotation marks.
Geolocation: These are the values that we need to plot maps.
Follow these steps:
Drag the dimension and measure into the row and column input fields,
and Tableau will automatically suggest the graph best fitted to the data.
You can change the graph by clicking on the Show Me button and
selecting whichever graph you want.
You can also remove an axis just by dragging and dropping it
under the Marks card (remove field).
Show Me: When you click this label, a palette appears, giving
you rapid access to many options for showing the selected types
of fields. The palette changes depending on the fields in the
worksheet you've selected or that are active.
From the above image, you might have observed that the default aggregation
on the measure is sum, but you can change the aggregation to sum, avg, min,
max, etc. You can also customize the axis name, orientation and size, and
show or hide the axis.
Enhancing the Analysis:
In order to create a beautiful interactive visual, you must understand the following features:
a. Marks card
The Marks card is very important for plotting graphs. In the Marks card we have:
The Colour button, which is used to give different colors to different categories and
measures.
The Size button, which is used to size marks depending on how big a value is; the
bigger the value, the bigger the size of a particular mark.
The Label button, which is used to show labels on graphs; clicking on the Label
button opens settings where you can set the formatting of labels.
Tooltips, where you can add information (profit, quantity, sales, discount, category,
state, etc.) that will be visible on hovering over the graph.
The Details button, which allows you to display more information without affecting
the table's structure; dragging a field onto the Details button will show the details of
that point. This feature is mostly used on maps to show more details about a
particular point.
b. Filter
After creating some plots you may want to apply filters. To do so, follow these steps:
Drag any measure or dimension you want to filter on to the Filters shelf.
As you drop the field, a dialog box appears where you can select particular categories,
keep the top-N rows by a measure value, write a condition to select rows, or use a
parameter.
After configuring the filter, click Show Filter to display it on the worksheet.
If you apply multiple filters, you may need to put an earlier filter into context by
clicking Add to Context; a context filter is a Tableau filter that is applied before all
other filters.
Separately, you can choose Standard, Fit Width, Fit Height, or Entire View from the
toolbar to fit the visualization into the worksheet.
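The ordering rule above (a context filter runs before all other filters) can be sketched in plain Python; the rows and the two filter rules here are invented for illustration:

```python
# Hypothetical rows from a sales table.
rows = [
    {"category": "Furniture", "state": "Texas", "sales": 100},
    {"category": "Furniture", "state": "Ohio", "sales": 80},
    {"category": "Technology", "state": "Texas", "sales": 300},
]

# 1. The context filter is applied first, narrowing the data set...
context = [r for r in rows if r["category"] == "Furniture"]

# 2. ...and the top-N filter then operates only on that subset, so
# "top 1 by sales" means the top row within Furniture, not overall.
top1 = sorted(context, key=lambda r: r["sales"], reverse=True)[:1]
print(top1)  # [{'category': 'Furniture', 'state': 'Texas', 'sales': 100}]
```

Without Add to Context, both filters would see the full data set independently, and the top row by sales would be the Technology row instead.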
c. Hierarchy
Tableau lets you quickly establish hierarchies to keep your data organized.
A hierarchy is a nesting of related fields of the same type; Tableau's built-in date
hierarchy is an example.
A date field in Tableau is automatically a hierarchy that can be drilled down from year ->
quarter -> month -> day by clicking the "+" button on the pill.
You can also create your own hierarchy, such as country -> state -> city -> postal code, by
dragging one field onto another; clicking the "+" button then drills down further to state,
city, and postal code.
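The drill-down can be mimicked in plain Python by mapping a date to a grouping key at each level of the hierarchy; the function name and level names below are our own choices for illustration:

```python
from datetime import date

def drill_down(d, level):
    """Return the grouping key for a date at a given drill level,
    mimicking Tableau's year -> quarter -> month -> day hierarchy."""
    if level == "year":
        return (d.year,)
    if level == "quarter":
        return (d.year, (d.month - 1) // 3 + 1)
    if level == "month":
        return (d.year, d.month)
    return (d.year, d.month, d.day)  # day level

print(drill_down(date(2024, 8, 15), "quarter"))  # (2024, 3)
```

Grouping rows by such keys at progressively finer levels is exactly what clicking "+" on a date pill does to the view.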
d. Parameter
A parameter is a workbook variable (a number, date, or string) that the
user can readily change to replace a constant value in a calculation.
In the image above, the goal is to show the top N countries by sales while
letting the user choose how many countries to list. To accomplish this, create
a parameter as follows:
Click the down arrow to the right of Dimensions in the Data pane, select
Create Parameter from the pop-up menu that appears, and give it a name,
e.g. variable1.
Select a data type from the Data Type drop-down menu; here, an integer
with a range of 1–100 and a current value of 5.
Click Show Parameter to display the parameter with a slider. At this point
it is not yet connected to anything.
To choose the top N countries by sales, drag the country field to the
Filters shelf, choose the Top tab, select variable1 in the By Field
section, and choose SUM(Sales).
Now slide the parameter and observe how the view changes.
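The top-N-by-sales logic that this parameter drives can be sketched in plain Python; the country and sales figures below are made up for illustration:

```python
# Hypothetical sales records: (country, sale amount).
sales = [
    ("USA", 500), ("India", 300), ("USA", 200),
    ("Germany", 250), ("India", 150), ("France", 100),
]

def top_n_countries(records, n):
    """Sum sales per country, then keep the n largest totals: the
    same effect as Tableau's top-N filter driven by a parameter."""
    totals = {}
    for country, amount in records:
        totals[country] = totals.get(country, 0) + amount
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Changing n here mirrors sliding the parameter in Tableau.
print(top_n_countries(sales, 2))  # [('USA', 700), ('India', 450)]
```

The parameter simply substitutes a user-chosen value for the constant `n` in an otherwise fixed calculation.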
e. Calculated field
Tableau gives us the option to create a calculated field, i.e. a new
field (column) of our own. Tableau comes with many functions (IF/ELSE,
CASE, DATEDIFF, level-of-detail expressions, etc.) that are used
extensively in visualization, for example:
To segment data
Level of detail (LOD) expressions
To change a field's data type, for example from a string to a date
To aggregate data
To handle date-times
To filter results
To calculate ratios
Creating a Calculated Field:
Here the goal is to calculate delivery days from the order
date and ship date:
Select Analysis > Create Calculated Field and name the
field delivery days.
Enter the rule in the formula box; here we use the
DATEDIFF function to subtract the two dates:
delivery days = DATEDIFF('day', [Order Date], [Ship Date])
(DATEDIFF returns the end date minus the start date, so the
order date comes first to give a positive delivery time.)
Now drag the delivery days field onto Rows or Columns.
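The same calculation can be checked outside Tableau with a short Python sketch using only the standard library; the dates are invented for illustration:

```python
from datetime import date

def delivery_days(order_date, ship_date):
    """Day difference between order and ship dates: ship date minus
    order date, so a later ship date gives a positive delivery time,
    matching DATEDIFF('day', [Order Date], [Ship Date]) in Tableau."""
    return (ship_date - order_date).days

print(delivery_days(date(2024, 1, 3), date(2024, 1, 8)))  # 5
```

Putting the arguments in the other order would negate the result, which is a quick way to sanity-check the direction of any date subtraction.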
f. Format
Formatting in Tableau is straightforward: just click the
Format option wherever you want to make a change. You can
format text, numbers, percentages, decimals, date-time
formats, label color and size, axis line color, the
worksheet, columns, headers, etc., as shown in the
above image.
Data Analytics in Tableau
The Analytics tab offers several analytical tools, such as
forecasting, clustering, trend lines, average lines, and
constant lines.
Steps to perform Analytics
From the Analytics tab on the left side, you can choose
various options:
Dragging and dropping a constant line onto the X- or
Y-axis draws a line at a given constant value.
Dragging Forecast onto your sheet produces a time-series
forecast of a given measure, which you can edit by
right-clicking the forecasted region; there you can
choose the confidence interval, the number of time steps
to forecast, the forecast model, and so on.
A trend line is not the same as a forecast: the trend line
only tells us whether the overall trend is increasing or decreasing.
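The "increasing or decreasing" answer a trend line gives is just the sign of a least-squares slope, which can be sketched in plain Python; the monthly figures below are made up for illustration:

```python
def trend_slope(values):
    """Least-squares slope of values against their index. Its sign
    says whether the overall trend is increasing or decreasing,
    which is all a trend line conveys (unlike a forecast, which
    extrapolates future values)."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

monthly_sales = [100, 120, 115, 140, 160, 155]  # invented figures
print("increasing" if trend_slope(monthly_sales) > 0 else "decreasing")
# prints "increasing"
```

A forecast, by contrast, fits a model (e.g. exponential smoothing in Tableau) and projects values beyond the data, with a confidence interval around each projection.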
Summary on Tableau
Tableau is a powerful and fast-growing data visualization tool used in the Business
Intelligence industry.
The Tableau product suite consists of 1) Tableau Desktop 2) Tableau Public 3) Tableau
Online 4) Tableau Server and 5) Tableau Reader.
Tableau Desktop has a rich feature set and allows you to code and customize reports.
In Tableau Public, workbooks cannot be saved locally; instead they must be saved to
Tableau's public cloud, where anyone can view and access them.
Tableau Server is used to share workbooks and visualizations created in the Tableau
Desktop application across an organization.
Tableau Online has the same functionality as Tableau Server, but the data is stored on
cloud-hosted servers maintained by Tableau.
Tableau Reader is a free tool that lets you view workbooks and visualizations created
with Tableau Desktop or Tableau Public.
Tableau connects to and extracts data stored in various places, and can pull data from a
wide range of platforms.
A spreadsheet application is used for manipulating data, while Tableau is a visualization
tool well suited to analysis.
THANK YOU