
Unit 3


SCSA3016 DATA SCIENCE

UNIT-3
EXPLORATORY DATA ANALYSIS
AND THE DATA SCIENCE PROCESS
C.KAVITHA
ASSISTANT PROFESSOR
DEPARTMENT OF CSE

SYLLABUS
UNIT 3 EXPLORATORY DATA ANALYSIS AND THE DATA
SCIENCE PROCESS
Exploratory Data Analysis and the Data Science Process - Basic tools
(plots, graphs and summary statistics) of EDA -Philosophy of EDA - The
Data Science Process – Data Visualization - Basic principles, ideas and
tools for data visualization - Examples of exciting projects- Data
Visualization using Tableau.
TEXT / REFERENCE BOOKS
1. Cathy O'Neil and Rachel Schutt. Doing Data Science: Straight Talk from the Frontline. O'Reilly, 2014.
2. Gilbert Strang. Introduction to Linear Algebra, 5th Edition. Wellesley-Cambridge Press, 2016.
3. Douglas Montgomery. Applied Statistics and Probability for Engineers, 2016.
4. Jure Leskovec, Anand Rajaraman and Jeffrey Ullman. Mining of Massive Datasets, v2.1. Cambridge University Press, 2014. (free online)
5. Avrim Blum, John Hopcroft and Ravindran Kannan. Foundations of Data Science.
6. Jiawei Han, Micheline Kamber and Jian Pei. Data Mining: Concepts and Techniques, 3rd Edition. ISBN 0123814790, 2011.
7. Trevor Hastie, Robert Tibshirani and Jerome Friedman. The Elements of Statistical Learning, 2nd Edition. ISBN 0387952845, 2009. (free online)
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics, often with visual methods. EDA is used to see what the data can tell us before the modeling task.
EDA assists data science professionals in various ways:
1. Getting a better understanding of the data
2. Identifying various data patterns
3. Getting a better understanding of the problem statement

EDA is important to:
 Detect outliers and anomalies
 Determine the quality of the data
 Determine what statistical models can fit the data
 Find out whether the assumptions about the data that you or your team started out with are correct or way off
 Extract variables or dimensions on which the data can be pivoted
 Determine whether to apply univariate or multivariate analytical techniques
EDA is typically used for these four goals:
 Exploring a single variable and looking at trends over time
 Checking data for errors
 Checking assumptions
 Looking at relationships between variables


Various exploratory data analysis methods include:
 Descriptive statistics, a way of giving a brief overview of the dataset we are dealing with, including some measures and features of the sample
 Grouping data (basic grouping with group by)
 ANOVA (Analysis of Variance), a computational method to divide the variation in a set of observations into different components
 Correlation and correlation methods


Descriptive Statistics
Descriptive statistics is a helpful way to understand the characteristics of your data and to get a quick summary of it. Pandas in Python provides a useful method, describe(). The describe() function applies basic statistical computations to the dataset, such as extreme values, the count of data points, the standard deviation, etc. Any missing or NaN value is automatically skipped. describe() gives a good picture of the distribution of the data.
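A minimal sketch of describe() in use (the file name "data.csv" is an illustrative assumption):
# quick summary statistics with pandas
import pandas as pd

df = pd.read_csv("data.csv")          # any tabular dataset with numeric columns
print(df.describe())                  # count, mean, std, min, quartiles, max per numeric column
print(df.describe(include="all"))     # optionally include categorical columns as well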


Grouping data
Group by is a useful operation available in pandas that can help us figure out the effect of different categorical attributes on other data variables.
Correlation and correlation computation
Correlation is a simple relationship between two variables in a context such that one variable affects the other. Correlation is different from causation: two variables can be correlated without one causing the other.
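A minimal sketch of both ideas (the file name "data.csv" and the column names "Team", "Age" and "Salary" are illustrative assumptions):
# grouping and correlation with pandas
import pandas as pd

df = pd.read_csv("data.csv")

# effect of a categorical attribute: mean of numeric columns per group
print(df.groupby("Team").mean(numeric_only=True))

# Pearson correlation between two numeric columns (association, not causation)
print(df["Age"].corr(df["Salary"]))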


ANOVA
ANOVA stands for Analysis of Variance. It is performed to figure out the relation between different groups of categorical data.
ANOVA gives two measures as a result:
– F-test score: shows the variation of the group means over the overall variation
– p-value: shows the significance of the result
This can be performed using the Python module scipy, with the method f_oneway().
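A minimal sketch of f_oneway() in use (the three sample groups are made-up illustrative data):
# one-way ANOVA with scipy
from scipy.stats import f_oneway

group_a = [85, 86, 88, 75, 78, 94]
group_b = [91, 92, 93, 85, 87, 84]
group_c = [79, 78, 88, 94, 92, 85]

f_score, p_value = f_oneway(group_a, group_b, group_c)
print(f_score, p_value)   # a large F-score / small p-value suggests the group means differ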
TYPES OF EDA
• There are broadly two categories of EDA, graphical and non-graphical, each of which may be univariate or multivariate:
 Univariate non-graphical
 Multivariate non-graphical
 Univariate graphical
 Multivariate graphical

 Non-graphical methods are quantitative and objective, but they do not give a complete picture of the data; therefore graphical methods, which involve a degree of subjective analysis, are also required.

Common sorts of univariate graphics are:
 Histogram: The most basic graph is a histogram, a barplot in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values. Histograms are one of the simplest ways to quickly learn a lot about your data, including central tendency, spread, modality, shape and outliers.
 Stem-and-leaf plots: An easy substitute for a histogram is the stem-and-leaf plot. It shows all data values and the shape of the distribution.
 Boxplots: Another very useful univariate graphical technique is the boxplot. Boxplots are excellent at presenting information about central tendency, show robust measures of location and spread, and provide information about symmetry and outliers, although they can be misleading about aspects like multimodality. One of the best uses of boxplots is in the form of side-by-side boxplots.
 Quantile-normal plots: The final univariate graphical EDA technique is the most intricate: the quantile-normal (QN) plot or, more generally, the quantile-quantile (QQ) plot. It is used to see how well a particular sample follows a particular theoretical distribution. It allows detection of non-normality and diagnosis of skewness and kurtosis.


Univariate non-graphical
This is the simplest form of data analysis, as it uses just one variable to examine the data. The standard goal of univariate non-graphical EDA is to understand the underlying sample distribution of the data and to make observations about the population. Outlier detection is also part of the analysis.


The characteristics of a population distribution include:
 Central tendency: The central tendency or location of a distribution has to do with its typical or middle values. The commonly useful measures of central tendency are the mean, median, and sometimes the mode, of which the most common is the mean. For a skewed distribution, or when there is concern about outliers, the median may be preferred.
 Spread: Spread is an indicator of how far from the center we are likely to find data values. The standard deviation and variance are two useful measures of spread. The variance is the mean of the squares of the individual deviations, and the standard deviation is the square root of the variance.
 Skewness and kurtosis: Two more useful univariate descriptors are the skewness and kurtosis of the distribution. Skewness is a measure of asymmetry, and kurtosis is a more subtle measure of peakedness compared to a normal distribution.


Multivariate non-graphical
 Multivariate non-graphical EDA techniques are generally used to show the relationship between two or more variables, in the form of either cross-tabulation or statistics.
 For categorical data, an extension of tabulation called cross-tabulation is extremely useful. For two variables, cross-tabulation is performed by making a two-way table with column headings that match the levels of one variable and row headings that match the levels of the other variable, then filling in the counts of all subjects that share the same pair of levels.
 For one categorical variable and one quantitative variable, we compute statistics for the quantitative variable separately for each level of the categorical variable, and then compare the statistics across the levels of the categorical variable.
 Comparing the means is an informal version of ANOVA, and comparing medians is a robust version of one-way ANOVA.


Multivariate graphical:
A graphical representation always gives you a better understanding of a relationship, especially among multiple variables.
Common sorts of multivariate graphics are:
Scatterplot: For two quantitative variables, the essential graphical EDA technique is the scatterplot, which has one variable on the x-axis, one on the y-axis, and a point for every case in your dataset.


Run chart: a line graph of data plotted over time.
Heat map: a graphical representation of data where values are depicted by color.
Multivariate chart: a graphical representation of the relationships between factors and a response.
Bubble chart: a data visualization that displays multiple circles (bubbles) in a two-dimensional plot.

TOOLS IN EDA
 The most commonly used software tools to perform EDA are Python and R.
 Graphical exploratory data analysis employs visual tools to display data, such as:
Box plots
 Box plots are used where there is a need to summarize data on an interval scale, such as stock-market ticks, where the ticks observed in one whole day may be represented in a single box, highlighting the lowest value, the highest value, the median and the outliers.
Heatmap
 Heatmaps are most often used to represent the correlation between variables. In a typical wine-quality example, the chart shows a strong correlation between density and residual sugar, and essentially no correlation between alcohol and residual sugar.
Histograms
 The histogram is a graphical representation of numerical data that splits the data into ranges. The taller the bar, the greater the number of data points falling in that range. A good example is the height data of a class of students: the height data looks like a bell curve for a particular class, with most of the data lying within a certain range and a few values outside it. There will be outliers too, either very short or very tall.

 Line graphs: one of the most basic types of charts, plotting data points on a graph; line graphs have a wealth of uses in almost every field of study.
 Pictograms: replace numbers with images to visually explain data. They are common in the design of infographics, as well as in visuals that data scientists can use to explain complex findings to non-data-scientist professionals and the public.
 Scattergrams or scatterplots: typically used to display two variables in a set of data and then look for correlations among the data. For example, scientists might use one to evaluate the presence of two particular chemicals or gases in marine life in an effort to look for a relationship between the two variables.
Non-graphical exploratory data analysis involves data collection and reporting in non-visual or non-pictorial formats.

Some of the most common data science tools used to create an EDA
include:
 Python: An interpreted, object-oriented programming language with
dynamic semantics. Its high-level, built-in data structures, combined
with dynamic typing and dynamic binding, make it very attractive for
rapid application development, as well as for use as a scripting or glue
language to connect existing components together. Python and EDA can
be used together to identify missing values in a data set, which is
important so you can decide how to handle missing values for machine
learning.
 R: An open-source programming language and free software
environment for statistical computing and graphics supported by the R
Foundation for Statistical Computing. The R language is widely used
among statisticians in data science in developing statistical observations
and data analysis.
Philosophy of EDA
EDA is not identical to statistical graphics, although the two terms are used almost interchangeably. Statistical graphics is a collection of techniques, all graphically based and all focusing on one data characterization aspect. EDA encompasses a larger venue: EDA is an approach to data analysis that postpones the usual assumptions about what kind of model the data follow, taking the more direct approach of allowing the data itself to reveal its underlying structure and model. EDA is not a mere collection of techniques; EDA is a philosophy as to how we dissect a data set, what we look for, how we look, and how we interpret. It is true that EDA heavily uses the collection of techniques that we call "statistical graphics", but it is not identical to statistical graphics.
DATA SCIENCE PROCESS


Process of data science


 The key steps involved in Data Science Modelling are:
Step 1: Understanding the Problem
Step 2: Data Extraction
Step 3: Data Cleaning
Step 4: Exploratory Data Analysis
Step 5: Feature Selection
Step 6: Incorporating Machine Learning Algorithms
Step 7: Testing the Models
Step 8: Deploying the Model


Step 1: Understanding the Problem

The first step involved in Data Science Modelling is understanding the problem. A Data Scientist listens for keywords and phrases when interviewing a line-of-business expert about a business challenge. The Data Scientist breaks the problem down into a procedural flow that always involves a holistic understanding of the business challenge, the Data that must be collected, and the various Artificial Intelligence and Data Science approaches that can be used to address the problem.


Step 2: Data Extraction

 The next step in Data Science Modelling is Data Extraction: not just any Data, but the Data, often unstructured, that is relevant to the business problem you are trying to address. Data Extraction is done from various sources: online, surveys, and existing Databases.

Gathering the data

 Data is raw information; it is the representation of both human and machine observation of the world. The dataset you need depends entirely on what type of problem you want to solve; each problem in machine learning has its own unique approach.
Some websites to get datasets:
 Kaggle:
https://www.kaggle.com/datasets
 UCI Machine Learning Repository: one of the oldest sources on the web to get datasets.
http://mlr.cs.umass.edu/ml/
 This awesome GitHub repository has high-quality datasets:
https://github.com/awesomedata/awesome-public-datasets

Step 3: Data Pre-processing and Cleaning

Steps in Data Preprocessing:
 Gathering the data
 Importing the dataset & libraries
 Dealing with missing values
 Dividing the dataset into dependent & independent variables
 Dealing with categorical values
 Splitting the dataset into training and test sets
 Feature scaling


Data cleaning, on the other hand, is the process of detecting and correcting errors and ensuring that your given data set is error-free, consistent and usable, by identifying any errors or corruptions in the data, correcting or deleting them, or manually processing them as needed to prevent the errors from corrupting our final analysis.
Data Cleaning is necessary because you need to sanitize Data while gathering it. The following are some of the most typical causes of Data Inconsistencies and Errors:
 Duplicate items collected from a variety of Databases.
 Input Data errors in terms of precision.
 Changes, updates, and deletions made to the Data entries.
 Variables with missing values across multiple Databases.

DATA CLEANING – MISSING VALUES
DATA CLEANING – OUTLIERS

◾ What is an Outlier? An anomaly. We will generally define outliers as samples that are exceptionally far from the mainstream of the data.
◾ Outliers can have many causes, such as:

• Measurement or input error
• Data corruption
• True outlier observation (e.g., Michael Jordan in basketball)

◾ Outliers can occur in numeric, string, and date fields:

◾ String – e.g., a city name entered in a Country field, or gibberish data
◾ Date – e.g., a date appearing outside the period of analysis in scope
DATA CLEANING – OUTLIERS
◾ Outliers can be:

◾ Univariate outliers – found when we look at the distribution of a single variable
◾ Multivariate outliers – found in n-dimensional space, e.g., when examining the relationship between height and weight
DATA CLEANING – OUTLIER DETECTION

◾ Outlier detection methods – Numeric

◾ Standard deviation – if we know that the distribution of values in the sample is Gaussian or Gaussian-like, we can use the standard deviation of the sample as a cut-off for identifying outliers
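A minimal sketch of the standard-deviation cut-off (the data are synthetic; 3 standard deviations is a common but not universal choice):
# flag values more than 3 standard deviations from the mean
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(np.append(rng.normal(10, 1, 100), 50))   # 100 Gaussian points plus one injected outlier

mean, std = s.mean(), s.std()
outliers = s[(s - mean).abs() > 3 * std]
print(outliers)   # only the injected value (50) should be flagged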
DATA CLEANING – OUTLIER DETECTION
◾ Outlier detection methods – Numeric

◾ Inter-Quartile Range (IQR) / box-plot – the IQR is a concept in statistics that is used to measure statistical dispersion and data variability by dividing the dataset into quartiles
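A minimal sketch of the IQR rule (the usual 1.5 × IQR fences; the sample values are illustrative):
# IQR-based outlier detection
import pandas as pd

s = pd.Series([10, 12, 11, 9, 13, 10, 11, 12, 95])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(s[(s < lower) | (s > upper)])   # 95 falls outside the upper fence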
DATA CLEANING – OUTLIER TREATMENT
◾ Mean/Median or random imputation – same as missing value treatment

◾ Dropping values – remove the observations with outliers - same as missing value treatment

◾ Top, Bottom, and Zero Coding

◾ Discretization (Binning)
DATA CLEANING – DUPLICATE DATA

◾ Before removing duplicates from the data, make sure:

◾ they are not real data that coincidentally have identical values
◾ you try to figure out why you have duplicates in your data (is it due to class imbalance?)

Feature Scaling
The final step of data preprocessing is to apply the very important feature scaling.
Feature scaling is a technique to standardize the independent features present in the data to a fixed range. It is performed during data pre-processing.
Why scaling: most of the time, your dataset will contain features that vary greatly in magnitude, units and range. Since most machine learning algorithms use the Euclidean distance between two data points in their computations, this is a problem.

Standardization and Normalization

Data standardization and normalization is a common practice in machine learning.
Standardization is a scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resulting distribution has a unit standard deviation.
Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as min-max scaling.
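A minimal sketch contrasting the two (scikit-learn is assumed to be available; the toy columns are illustrative):
# standardization vs. min-max normalization with scikit-learn
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = pd.DataFrame({"Age": [21, 35, 48, 60],
                  "Salary": [20000, 55000, 90000, 120000]})

print(StandardScaler().fit_transform(X))   # each column: mean 0, unit standard deviation
print(MinMaxScaler().fit_transform(X))     # each column rescaled to the [0, 1] range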

Step 4: Exploratory Data Analysis


Exploratory Data Analysis (EDA) is a robust technique
for familiarising yourself with Data and extracting
useful insights. Data Scientists sift through
Unstructured Data to find patterns and infer
relationships between Data elements. Data Scientists
use Statistics and Visualisation tools to summarise
Central Measurements and variability to perform EDA.


Step 5: Feature Selection

 Feature Selection is the process of identifying and selecting, either automatically or manually, the features that contribute the most to the prediction variable or output that you are interested in.
 The presence of irrelevant features in your Data can reduce Model accuracy and cause your Model to train on irrelevant features. Conversely, if the features are strong enough, the Machine Learning Algorithm will give excellent outcomes.
 Two types of characteristics must be addressed:
 Consistent characteristics that are unlikely to change.
 Variable characteristics whose values change over time.


Step 6: Incorporating Machine Learning Algorithms


 This is one of the most crucial processes in Data Science Modelling, as the Machine Learning Algorithm aids in creating a usable Data Model. There are a lot of algorithms to pick from; the Model is selected based on the problem. There are three types of Machine Learning methods that are incorporated:

1) Supervised Learning
 It is based on the results of a previous operation that is related to the existing
business operation. Based on previous patterns, Supervised Learning aids in the
prediction of an outcome. Some of the Supervised Learning Algorithms are:
 Linear Regression
 Random Forest
 Support Vector Machines


2) Unsupervised Learning
 This form of learning has no pre-existing consequence or
pattern. Instead, it concentrates on examining the
interactions and connections between the presently
available Data points. Some of the Unsupervised Learning
Algorithms are:
K-means Clustering
Hierarchical Clustering
Anomaly Detection


3) Reinforcement Learning
It is a fascinating Machine Learning technique that
uses a dynamic Dataset that interacts with the real
world. In simple terms, it is a mechanism by which a
system learns from its mistakes and improves over
time. Some of the Reinforcement Learning Algorithms
are:
Q-Learning
State-Action-Reward-State-Action (SARSA)
Deep Q Network


Step 7: Testing the Models

 This is the next phase, and it is crucial to check that our Data Science Modelling efforts meet expectations. The Data Model is applied to the Test Data to check whether it is accurate and has all the desirable features. You can further test your Data Model to identify any adjustments that might be required to enhance performance and achieve the desired results. If the required precision is not achieved, you can go back to Step 6 (Incorporating Machine Learning Algorithms), choose an alternate Data Model, and then test the model again.

Step 8: Deploying the Model


 The Model which provides the best result based on the test findings is finalized and deployed in the production environment once the desired result is achieved through proper testing, as per the business needs.

Example
Import the dataset & Libraries
 The first step is usually importing the libraries that will be needed in the program. A library is essentially a collection of modules that can be called and used.
 Pandas offers tools for cleaning and processing your data. It is the most popular Python library used for data analysis. In pandas, a data table is called a dataframe.
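A minimal sketch of this step (the file name "Data.csv" is an illustrative assumption):
# import the libraries
import numpy as np
import pandas as pd

# import the dataset into a pandas dataframe
dataset = pd.read_csv("Data.csv")
print(dataset.head())   # visually inspect the first rows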


Dealing with Missing Values

 Sometimes we may find that some data are missing in the dataset. If so, we either remove those rows or calculate the mean, mode or median of the feature and replace the missing values with it. This is an approximation, which can add variance to the dataset.
# Check for null values – use dataset.isna() or dataset.isnull() to see the null values in the dataset.
# Drop null values – pandas provides a dropna() function that can be used to drop either rows or columns with missing data.
# Replacing null values with a strategy – for replacing null values we use a strategy that can be applied to a feature with numeric data: calculate the mean, median or mode of the feature and replace the missing values with it.
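A minimal sketch of the three strategies above (the file name "Data.csv" and the column "Age" are illustrative assumptions):
# handling missing values with pandas
import pandas as pd

dataset = pd.read_csv("Data.csv")

print(dataset.isnull().sum())        # check: count of nulls per column

dropped = dataset.dropna()           # drop: remove rows containing any null

# replace: impute a numeric feature with its mean
dataset["Age"] = dataset["Age"].fillna(dataset["Age"].mean())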


Drop Null values –

The rows at index 4 and 6 have null values.

After applying dropna(), both rows with missing data have been removed.


Replacing Null values


This replaces each variable's null values with the respective mean; 'inplace=True' applies the changes directly to the dataset.


De-duplicate means removing all duplicate values.

There is no need for duplicate values in data analysis; these values only affect the accuracy and efficiency of the analysis result. To find duplicate values in the dataset we use a simple dataframe function, duplicated(). Let's see the example:
dataset.duplicated()
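A minimal sketch (the file name "Data.csv" is an illustrative assumption):
# finding and removing duplicate rows with pandas
import pandas as pd

dataset = pd.read_csv("Data.csv")
print(dataset.duplicated().sum())    # number of exact duplicate rows
dataset = dataset.drop_duplicates()  # keep only the first occurrence of each row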

Data Visualization
 Data visualization is the process of translating large data sets and metrics into charts, graphs and other visuals.
 The resulting visual representation of data makes it easier to identify and share real-time trends, outliers, and new insights about the information represented in the data.
 Data visualization is one of the steps of the data science process, which states that after data has been collected, processed and modeled, it must be visualized for conclusions to be made.
 Data visualization is also an element of the broader data presentation architecture (DPA) discipline, which aims to identify, locate, manipulate, format and deliver data in the most efficient way possible.

Data Visualization Tools
1. Tableau
 Tableau is a business intelligence service that helps people visualize and understand their data; it is also one of the most widely used services in the field of business intelligence. It allows you to design interactive report dashboards and worksheets to obtain business insights, and it has outstanding visualization capabilities and great performance.
Pros:
 Outstanding visual library
 User friendly
 Great performance
 Connectivity to data
 Powerful computation
 Quick insights
Cons:
 Inflexible pricing
 No option for auto-refresh
 Restrictive imports
 Manual updates for static features

2. Power BI
 Power BI, Microsoft's easy-to-use data visualization tool, is available for both on-
premise installation and deployment on the cloud infrastructure. Power BI is one of the
most complete data visualization tools that supports a myriad of backend databases,
including Teradata, Salesforce, PostgreSQL, Oracle, Google Analytics, Github, Adobe
Analytics, Azure, SQL Server, and Excel. The enterprise-level tool creates stunning
visualizations and delivers real-time insights for fast decision-making.
The Pros of Power BI:
 No requirement for specialized tech support
 Easily integrates with existing applications
 Personalized, rich dashboard
 High-grade security
 No speed or memory constraints
 Compatible with Microsoft products
The Cons of Power BI:
 Cannot work with varied, multiple datasets

3. Dundas BI
 Dundas BI offers highly-customizable data visualizations with interactive
scorecards, maps, gauges, and charts, optimizing the creation of ad-hoc, multi-page
reports. By providing users full control over visual elements, Dundas BI simplifies
the complex operation of cleansing, inspecting, transforming, and modeling big
datasets.
The Pros of Dundas BI:
 Exceptional flexibility
 A large variety of data sources and charts
 Wide range of in-built features for extracting, displaying, and modifying data
The Cons of Dundas BI:
 No option for predictive analytics
 3D charts not supported

4. JupyteR
 A web-based application, JupyteR, is one of the top-rated data visualization tools
that enable users to create and share documents containing visualizations,
equations, narrative text, and live code. JupyteR is ideal for data cleansing and
transformation, statistical modeling, numerical simulation, interactive computing,
and machine learning.
The Pros of JupyteR:
 Rapid prototyping
 Visually appealing results
 Facilitates easy sharing of data insights
The Cons of JupyteR:
 Tough to collaborate
 At times code reviewing becomes complicated

5. Zoho Reports
 Zoho Reports, also known as Zoho Analytics, is a comprehensive data
visualization tool that integrates Business Intelligence and online reporting
services, which allow quick creation and sharing of extensive reports in
minutes. The high-grade visualization tool also supports the import of Big
Data from major databases and applications.
The Pros of Zoho Reports:
 Effortless report creation and modification
 Includes useful functionalities such as email scheduling and report sharing
 Plenty of room for data
 Prompt customer support.
The Cons of Zoho Reports:
 User training needs to be improved
 The dashboard becomes confusing when there are large volumes of data

6. Google Charts
 One of the major players in the data visualization market space, Google Charts, coded
with SVG and HTML5, is famed for its capability to produce graphical and pictorial
data visualizations. Google Charts offers zoom functionality, and it provides users
with unmatched cross-platform compatibility with iOS, Android, and even the earlier
versions of the Internet Explorer browser.
The Pros of Google Charts:
 User-friendly platform
 Easy to integrate data
 Visually attractive data graphs
 Compatibility with Google products.
The Cons of Google Charts:
 The export feature needs fine-tuning
 Inadequate demos on tools
 Lacks customization abilities
 Network connectivity required for visualization

7. Sisense
Regarded as one of the most agile data visualization tools, Sisense gives users access
to instant data analytics anywhere, at any time. The best-in-class visualization tool
can identify key data patterns and summarize statistics to help decision-makers make
data-driven decisions.
The Pros of Sisense:
 Ideal for mission-critical projects involving massive datasets
 Reliable interface
 High-class customer support
 Quick upgrades
 Flexibility of seamless customization
The Cons of Sisense:
 Developing and maintaining analytic cubes can be challenging
 Does not support time formats
 Limited visualization versions

8. Plotly
 An open-source data visualization tool, Plotly offers full integration with
analytics-centric programming languages like Matlab, Python, and R, which
enables complex visualizations. Widely used for collaborative work,
disseminating, modifying, creating, and sharing interactive, graphical data,
Plotly supports both on-premise installation and cloud deployment.
The Pros of Plotly:
 Allows online editing of charts
 High-quality image export
 Highly interactive interface
 Server hosting facilitates easy sharing
The Cons of Plotly:
 Speed is a concern at times
 Free version has multiple limitations
 Various screen-flashings create confusion and distraction

9. Data Wrapper
 Data Wrapper is one of the very few data visualization tools on the market that
is available for free. It is popular among media enterprises because of its
inherent ability to quickly create charts and present graphical statistics on Big
Data. Featuring a simple and intuitive interface, Data Wrapper allows users to
create maps and charts that they can easily embed into reports.
The Pros of Data Wrapper:
 Does not require installation for chart creation
 Ideal for beginners
 Free to use
The Cons of Data Wrapper:
 Building complex charts like Sankey is a problem
 Security is an issue as it is an open-source tool

10. QlikView

A major player in the data visualization market, Qlikview provides solutions to over
40,000 clients in 100 countries. Qlikview's data visualization tool, besides enabling
accelerated, customized visualizations, also incorporates a range of solid features,
including analytics, enterprise reporting, and Business Intelligence capabilities.
The Pros of QlikView:
 User-friendly interface
 Appealing, colorful visualizations
 Trouble-free maintenance
 A cost-effective solution
The Cons of QlikView:
 RAM limitations
 Poor customer support
 Does not include the 'drag and drop' feature

Data Visualization with Python
Python offers multiple great graphing libraries that come packed with lots of different features.
Here are a few popular plotting libraries:
Matplotlib: low level, provides lots of freedom
Pandas Visualization: easy-to-use interface, built on Matplotlib
Seaborn: high-level interface, great default styles
ggplot: based on R's ggplot2, uses the Grammar of Graphics
Plotly: can create interactive plots

Matplotlib
Matplotlib is a visualization library in Python for 2D plots of arrays. Matplotlib is written in Python and makes use of the NumPy library. It can be used in Python and IPython shells, Jupyter notebooks, and web application servers. Matplotlib comes with a wide variety of plots like line, bar, scatter, histogram, etc., which can help us deep-dive into understanding trends, patterns and correlations. It was introduced by John Hunter in 2002.

Seaborn
 Conceptualized and originally built at Stanford University, this library sits on top of matplotlib. In a sense, it has some flavors of matplotlib, while from the visualization point of view it is much better than matplotlib and has added features as well. Below are its advantages:
Built-in themes aid better visualization
Statistical functions aiding better data insights
Better aesthetics and built-in plots
Helpful documentation with effective examples

Bokeh
Bokeh is an interactive visualization library for
modern web browsers. It is suitable for large or
streaming data assets and can be used to develop
interactive plots and dashboards. There is a wide array
of intuitive graphs in the library which can be
leveraged to develop solutions. It works closely with
PyData tools. The library is well-suited for creating
customized visuals according to required use-cases.
The visuals can also be made interactive to serve a
what-if scenario model. All the codes are open source
and available on GitHub.
plotly
plotly.py is an interactive, open-source, high-level,
declarative, and browser-based visualization library for
Python. It holds an array of useful visualization which
includes scientific charts, 3D graphs, statistical charts,
financial charts among others. Plotly graphs can be viewed
in Jupyter notebooks, standalone HTML files, or hosted
online. Plotly library provides options for interaction and
editing. The robust API works perfectly in both local and
web browser mode.

ggplot
 ggplot is a Python implementation of the grammar of graphics.
The Grammar of Graphics refers to the mapping of data to
aesthetic attributes (colour, shape, size) and geometric objects
(points, lines, bars). The basic building blocks according to the
grammar of graphics are data, geom (geometric objects), stats
(statistical transformations), scale, coordinate system, and facet.
 Using ggplot in Python allows you to develop informative
visualizations incrementally, understanding the nuances of the
data first, and then tuning the components to improve the visual
representations.

Data Visualization in Python using Seaborn
Box plot:
A box plot (or box-and-whisker plot) is a visual representation of groups of numerical data through their quartiles, plotted against continuous/categorical data.
A box plot consists of 5 things:
Minimum
First Quartile or 25%
Median (Second Quartile) or 50%
Third Quartile or 75%
Maximum

(Figure: representation of a box plot; box plot representing multivariate categorical variables)
Syntax:
seaborn.boxplot(x=None, y=None, hue=None,
data=None)
Parameters:
x, y, hue: Inputs for plotting long-form data.
data: Dataset for plotting. If x and y are absent, this is
interpreted as wide-form.
Returns: It returns the Axes object with the plot drawn
onto it.

Pandas and Seaborn make importing and analyzing data much easier.
# import modules
import seaborn as sns
import pandas

# read csv and plot (newer seaborn versions require keyword arguments)
data = pandas.read_csv("nba.csv")
sns.boxplot(x=data['Age'])

Example 2:
# import modules
import seaborn as sns
import pandas

# read csv and plot
data = pandas.read_csv("nba.csv")
sns.boxplot(x=data['Age'], y=data['Weight'])

Violin Plot:
A violin plot is similar to a boxplot. It shows the distribution of quantitative data across one or more categorical variables such that those distributions can be compared.
Syntax: seaborn.violinplot(x=None, y=None, hue=None, data=None)
Parameters:
x, y, hue: Inputs for plotting long-form data.
data: Dataset for plotting.

# import modules
import pandas
import seaborn

seaborn.set(style = 'whitegrid')

# read csv and plot
data = pandas.read_csv("nba.csv")
seaborn.violinplot(x ="Age", y ="Weight", data = data)

Line plot:
The lineplot is the most popular plot for drawing a relationship between x and y, with the possibility of several semantic groupings.
Syntax: sns.lineplot(x=None, y=None)
Parameters:
x, y: Input data variables; must be numeric. Can pass data directly or reference columns in data.

# import modules
import seaborn as sns
import pandas

# loading csv
data = pandas.read_csv("nba.csv")

# plotting lineplot (keyword arguments)
sns.lineplot(x=data['Age'], y=data['Weight'])

Example 2: Use the hue parameter for plotting the graph.
# import modules
import seaborn as sns
import pandas

# read the csv data
data = pandas.read_csv("nba.csv")

# plot
sns.lineplot(x=data['Age'], y=data['Weight'], hue=data["Position"])

Scatter Plot
 A scatter plot (or scatter graph) is a bivariate plot with a strong resemblance to a line graph in the way it is built.
 A scatterplot can be used with several semantic groupings, which can help in understanding a graph against continuous/categorical data. It draws a two-dimensional graph.
 Syntax: seaborn.scatterplot(x=None, y=None)
 Parameters:
x, y: Input data variables that should be numeric.
 Returns: This method returns the Axes object with the plot drawn
onto it.

Advantages of a scatter plot
Displays correlation between variables
Suitable for large data sets
Easier to find data clusters
Better representation of each data point

# import modules
import seaborn
import pandas

# load csv
data = pandas.read_csv("nba.csv")

# plotting (keyword arguments)
seaborn.scatterplot(x=data['Age'], y=data['Weight'])

Example 2: Use the hue parameter for plotting the graph.
import seaborn
import pandas

data = pandas.read_csv("nba.csv")
seaborn.scatterplot(x=data['Age'], y=data['Weight'], hue=data["Position"])

Bar plot:
 Barplot represents an estimate of central tendency for a numeric
variable with the height of each rectangle and provides some indication
of the uncertainty around that estimate using error bars.
 Syntax : seaborn.barplot(x=None, y=None, hue=None, data=None)
 Parameters:
 x, y: names of variables in data or vector data; inputs for plotting long-form data.
 hue: (optional) column name for colour encoding.
 data: (optional) DataFrame, array, or list of arrays; the dataset for plotting. If x and y are absent, this is interpreted as wide-form; otherwise it is expected to be long-form.
 Returns: the Axes object with the plot drawn onto it.

# import modules
import pandas
import seaborn

seaborn.set(style = 'whitegrid')

# read csv and plot
data = pandas.read_csv("nba.csv")
seaborn.barplot(x =data["Age"])

Example 2
# import modules
import pandas
import seaborn

seaborn.set(style = 'whitegrid')

# read csv and plot
data = pandas.read_csv("nba.csv")
seaborn.barplot(x ="Age", y ="Weight", data = data)

Point plot:
 A point plot is used to show point estimates and confidence intervals using scatter plot glyphs. A point plot represents an estimate of central tendency for a numeric variable by the position of the scatter plot points, and provides some indication of the uncertainty around that estimate using error bars.
Syntax: seaborn.pointplot(x=None, y=None, hue=None, data=None)
 Parameters:
 x, y: Inputs for plotting long-form data.
 hue: (optional) column name for color encoding.
 data: dataframe as a Dataset for plotting.
 Return: The Axes object with the plot drawn onto it.

# import modules
import pandas
import seaborn

seaborn.set(style = 'whitegrid')

# read csv and plot
data = pandas.read_csv("nba.csv")
seaborn.pointplot(x = "Age", y = "Weight", data = data)

Countplot
A countplot is a plot between a categorical and a continuous variable, the continuous variable in this case being the number of times the category is present, i.e. its frequency. In this sense, a count plot is closely linked to a histogram or a bar graph.

Syntax: seaborn.countplot(x=None, y=None, hue=None, data=None)
Parameters:
x, y: (optional) names of variables in data or vector data; inputs for plotting long-form data.
hue: (optional) column name for color encoding.
data: (optional) DataFrame, array, or list of arrays; the dataset for plotting. If x and y are absent, this is interpreted as wide-form; otherwise, it is expected to be long-form.
Returns: the Axes object with the plot drawn onto it.
# import modules
import pandas
import seaborn

seaborn.set(style = 'whitegrid')

# read csv and plot
data = pandas.read_csv("nba.csv")
seaborn.countplot(x=data["Age"])

(Output: countplot figure)
Bivariate and Univariate data using seaborn and pandas:
Bivariate data: This type of data involves two different
variables. The analysis of this type of data deals with
causes and relationships and the analysis is done to find
out the relationship between the two variables.
Univariate data: This type of data consists of only one
variable. The analysis of univariate data is thus the
simplest form of analysis since the information deals with
only one quantity that changes. It does not deal with
causes or relationships and the main purpose of the
analysis is to describe the data and find patterns that exist
within it.
Example 1: an example of bivariate data distribution using the box plot.
# import modules
import seaborn as sns
import pandas

# read csv and plot
data = pandas.read_csv("nba.csv")
sns.boxplot(x=data['Age'], y=data['Height'])

Let's see an example of univariate data distribution:
Example: using the dist plot
# import modules
import seaborn as sns
import pandas

# read the top 5 rows
data = pandas.read_csv("nba.csv").head()

# distplot is deprecated in newer seaborn; histplot/displot are its successors
sns.distplot(data['Age'])
Correlation heatmap
 A correlation heatmap is a heatmap that shows a 2D correlation matrix between two discrete dimensions, using colored cells to represent data, usually from a monochromatic scale. The values of the first dimension appear as the rows of the table, and the values of the second dimension as the columns.
 The color of each cell encodes the strength of the correlation for that pair of dimensions. This makes correlation heatmaps ideal for data analysis, since they make patterns easily readable and highlight differences and variation in the same data.
 A correlation heatmap, like a regular heatmap, is accompanied by a colorbar, making the data easily readable and comprehensible.
Syntax: seaborn.heatmap(data, vmin=None, vmax=None, center=None, cmap=None, ...)
Example:
For the example given below, a dataset downloaded from kaggle.com is used. The plot shows data related to bestselling novels on Amazon.

# import modules
import matplotlib.pyplot as mp
import pandas as pd
import seaborn as sb

# import file with data
data = pd.read_csv('bestsellers.csv')

# print the data that will be plotted;
# the columns shown here are selected by corr() since
# they are the ones suitable for the plot
print(data.corr())

# plotting correlation heatmap
dataplot = sb.heatmap(data.corr(), cmap="YlGnBu", annot=True)

# displaying heatmap
mp.show()

Histogram
 Histograms display counts of data and are hence similar to a bar chart. A histogram plot can also tell us how close a data distribution is to a normal curve. When working with statistical methods, it is often very important that the data is normally, or close to normally, distributed. Note, however, that histograms are univariate in nature, whereas bar charts are bivariate.
 A bar graph charts actual counts against categories (e.g. the height of the bar indicates the number of items in that category), whereas a histogram displays the same categorical variables in bins.
 Bins are an integral part of building a histogram: they control which data points fall within each range. As a widely accepted choice we usually limit the number of bins to 5-20; however, this is entirely governed by the data points present.
Example: Diabetes dataset
# illustrate histograms (assumes the diabetes data is loaded from a file such as 'diabetes.csv')
import pandas as pd
import matplotlib.pyplot as plt

diabetes = pd.read_csv('diabetes.csv')
features = ['BloodPressure', 'SkinThickness']
diabetes[features].hist(figsize=(10, 4))
plt.show()
Some examples using Matplotlib
Bar chart using Matplotlib – Titanic dataset
# imports
import matplotlib.pyplot as plt
import seaborn as sns

#Creating the dataset
df = sns.load_dataset('titanic')
df = df.groupby('who')['fare'].sum().to_frame().reset_index()

#Creating the bar chart
plt.barh(df['who'], df['fare'], color = ['#F0F8FF','#E6E6FA','#B0E0E6'])

#Adding the aesthetics
plt.title('Chart title')
plt.xlabel('X axis title')
plt.ylabel('Y axis title')

#Show the plot
plt.show()
Line chart using Matplotlib – Iris dataset
# imports
import matplotlib.pyplot as plt
import seaborn as sns

#Creating the dataset
df = sns.load_dataset("iris")
df = df.groupby('sepal_length')['sepal_width'].sum().to_frame().reset_index()

#Creating the line chart
plt.plot(df['sepal_length'], df['sepal_width'])

#Adding the aesthetics
plt.title('Chart title')
plt.xlabel('X axis title')
plt.ylabel('Y axis title')

#Show the plot
plt.show()
Examples of exciting projects – Exploratory Data Analysis: Iris Dataset
Importing relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
sns.set()
Source of data
The data is stored inside a csv file, namely 'iris.csv'.
Loading data
iris_data = pd.read_csv('iris.csv')
iris_data
Getting Information about the Dataset
We will use the shape attribute to get the shape of the dataset.
iris_data.shape
Output:
(150, 6)
We can see that the dataframe contains 6 columns and 150 rows.

Gaining information from data
iris_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal_length 150 non-null float64
1 sepal_width 150 non-null float64
2 petal_length 150 non-null float64
3 petal_width 150 non-null float64
4 species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
We can see that only one column has categorical data and all the other columns are of the
numeric type with non-Null entries.

Data Insights:
1. No column has any null entries
2. Four columns are of numerical type
3. Only a single column is of categorical type

Statistical Insight
iris_data.describe()
Data Insights:
Mean values
Standard Deviation ,
Minimum Values
Maximum Values

Checking Missing Values
We will check whether our data contains any missing values. Missing values can occur when no information is provided for one or more items or for a whole unit. We will use the isnull() method.
iris_data.isnull().sum()

We can see that no column has any missing values.

Checking for Duplicate Entries
iris_data[iris_data.duplicated()]

There are 3 duplicates; therefore we must check whether each species' data set is balanced in counts or not.

Checking the balance
iris_data['species'].value_counts()

Therefore we shouldn't delete these entries, as doing so might imbalance the data sets and hence prove less useful for valuable insights.

Data Visualization
Visualizing the target column
Our target column will be the species column, because in the end we will need the result according to the species only. Note: we will use the Matplotlib and Seaborn libraries for data visualization.

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

plt.title('Species Count')
sns.countplot(x=iris_data['species'])

Data Insight:
This further visualizes that the species are well balanced;
each species (Iris virginica, setosa, versicolor) has a count of 50.

(Figure: Iris flower species)
Uni-variate Analysis
Comparison between various species based on sepal length and width
plt.figure(figsize=(17,9))
plt.title('Comparison between various species based on sepal length and width')
sns.scatterplot(x=iris_data['sepal_length'], y=iris_data['sepal_width'], hue=iris_data['species'], s=50)

Data Insights:
The Iris Setosa species has a smaller sepal length but a higher sepal width.
Versicolor lies almost in the middle for length as well as width.
Virginica has larger sepal lengths and smaller sepal widths.
Comparison between various species based on petal length and width:
plt.figure(figsize=(16,9))
plt.title('Comparison between various species based on petal length and width')
sns.scatterplot(x=iris_data['petal_length'], y=iris_data['petal_width'], hue=iris_data['species'], s=50)

Data Insights
Setosa species have the smallest petal length as well as
petal width
Versicolor species have average petal length and petal
width
Virginica species have the highest petal length as well
as petal width

Let's plot all the columns' relationships using a pairplot. It can be used for multivariate analysis.
sns.pairplot(iris_data, hue="species", height=4)

Data Insights:
High correlation between the petal length and width columns.
Setosa has both low petal length and width.
Versicolor has both average petal length and width.
Virginica has both high petal length and width.
Sepal width for setosa is high and sepal length is low.
Versicolor has average values for the sepal dimensions.
Virginica has a small sepal width but a large sepal length.

The heatmap is a data visualization technique used to analyze a dataset as colors in two dimensions. Basically, it shows the correlation between all numerical variables in the dataset. In simpler terms, we can plot the above-found correlations using a heatmap.
Checking Correlation
plt.figure(figsize=(10,11))
sns.heatmap(iris_data.corr(), annot=True)
plt.show()

Heatmap

Data Insights:
The sepal length and sepal width features are slightly correlated with each other.
Checking mean & median values for each species:
iris_data.groupby('species').agg(['mean', 'median'])
(mean and median outputs)

Visualizing the distribution, mean and median using box plots & violin plots
Box plots to know about the distribution
 Boxplots show how the categorical feature "species" is distributed with respect to the four input variables:
fig, axes = plt.subplots(2, 2, figsize=(16,9))
sns.boxplot(y="petal_width", x="species", data=iris_data, orient='v', ax=axes[0, 0])
sns.boxplot(y="petal_length", x="species", data=iris_data, orient='v', ax=axes[0, 1])
sns.boxplot(y="sepal_length", x="species", data=iris_data, orient='v', ax=axes[1, 0])
sns.boxplot(y="sepal_width", x="species", data=iris_data, orient='v', ax=axes[1, 1])
plt.show()
Data Insights:
Setosa has smaller features and is less spread out.
Versicolor is distributed in an average manner, with average features.
Virginica is highly distributed, with a large number of values and features.
The mean/median values are clearly shown by each plot for the various features (sepal length & width, petal length & width).

Violin Plot for checking distribution
 The violin plot shows the density of the length and width for each species. The thinner parts denote lower density, whereas the fatter parts convey higher density:
fig, axes = plt.subplots(2, 2, figsize=(16,10))
sns.violinplot(y="petal_width", x="species", data=iris_data, orient='v', ax=axes[0, 0], inner='quartile')
sns.violinplot(y="petal_length", x="species", data=iris_data, orient='v', ax=axes[0, 1], inner='quartile')
sns.violinplot(y="sepal_length", x="species", data=iris_data, orient='v', ax=axes[1, 0], inner='quartile')
sns.violinplot(y="sepal_width", x="species", data=iris_data, orient='v', ax=axes[1, 1], inner='quartile')
plt.show()

Data Insights:
Setosa has less spread and density in the case of petal length & width.
Versicolor is distributed in an average manner, with average features, in the case of petal length & width.
Virginica is highly distributed, with a large number of values and features, in the case of sepal length & width.
High-density regions depict the mean/median values; for example, Iris Setosa has its highest density at 5.0 cm (sepal length feature), which is also the median value (5.0) as per the table.
Mean / Median Table for reference

Plotting the Histogram & Probability Density Function (PDF)
We plot the probability density function (PDF) with each feature as a variable on the x-axis, and its histogram and corresponding kernel density plot on the y-axis.

sns.FacetGrid(iris_data, hue="species", height=5) \
 .map(sns.distplot, "sepal_length") \
 .add_legend()
sns.FacetGrid(iris_data, hue="species", height=5) \
 .map(sns.distplot, "sepal_width") \
 .add_legend()
sns.FacetGrid(iris_data, hue="species", height=5) \
 .map(sns.distplot, "petal_length") \
 .add_legend()
sns.FacetGrid(iris_data, hue="species", height=5) \
 .map(sns.distplot, "petal_width") \
 .add_legend()
plt.show()
Plot 1 | Classification feature : Sepal Length

Plot 2 | Classification feature : Sepal Width

Plot 3 | Classification feature : Petal Length

Plot 4 | Classification feature : Petal Width

Data Insights:
 Plot 1 shows that there is a significant amount of overlap between the species on sepal length, so it is not an effective classification feature.
 Plot 2 shows that there is even higher overlap between the species on sepal width, so it is not an effective classification feature.
 Plot 3 shows that petal length is a good classification feature, as it clearly separates the species. The overlap is extremely small (between Versicolor and Virginica), and Setosa is well separated from the other two.
 Just like Plot 3, Plot 4 shows that petal width is a good classification feature. The overlap is significantly small (between Versicolor and Virginica), and Setosa is well separated from the other two.

Choosing Plot 3 (classification feature: petal length) to distinguish among the species

The PDF curve of Iris Setosa ends roughly at 2.1.
Plot 3 | Classification feature: Petal Length
Data Insights:
The PDF curve of Iris Setosa ends roughly at 2.1.
If petal length < 2.1, the species is Iris Setosa.
The point of intersection between the PDF curves of Versicolor and Virginica is roughly at 4.8.
If petal length > 2.1 and petal length < 4.8, the species is Iris Versicolor.
If petal length > 4.8, the species is Iris Virginica.

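These thresholds translate directly into a tiny rule-based classifier; a minimal sketch (the cut-offs 2.1 and 4.8 are read off the PDF plot above, and the function name is illustrative):
# simple if/else classifier derived from the petal-length PDF plot
def classify_iris(petal_length):
    if petal_length < 2.1:
        return 'Iris-setosa'
    elif petal_length < 4.8:
        return 'Iris-versicolor'
    else:
        return 'Iris-virginica'

print(classify_iris(1.4))   # setosa
print(classify_iris(4.5))   # versicolor
print(classify_iris(5.5))   # virginica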
Titanic Data EDA
 Titanic Dataset –
It is one of the most popular datasets used for understanding machine
learning basics. It contains information about all the passengers aboard the
RMS Titanic, which unfortunately was shipwrecked. This dataset can be
used to predict whether a given passenger survived or not.

Conduct exploratory data analysis (EDA) and statistical modeling on the Titanic dataset in order to gather insights and eventually predict survival (0 = Not Survived, 1 = Survived). Out of the 891 passengers that went on board the Titanic, approximately 38% survived, whereas the majority, 62%, did not survive the disaster.

The csv file can be downloaded from https://www.kaggle.com/c/titanic/data

Features: The titanic dataset has roughly the following types of features:
 Categorical/Nominal: Variables that can be divided into multiple categories but
having no order or priority.
Eg. Embarked (C = Cherbourg; Q = Queenstown; S = Southampton)
 Binary: A subtype of categorical features, where the variable has only two
categories.
Eg: Sex (Male/Female)
 Ordinal: They are similar to categorical features, but they have an order (i.e., they can be
sorted).
Eg. Pclass (1, 2, 3)
 Continuous: They can take up any value between the minimum and maximum
values in a column.
Eg. Age, Fare
 Count: They represent the count of a variable.
Eg. SibSp, Parch
 Useless: They don’t contribute to the final outcome of an ML model.
Here, PassengerId, Name, Cabin and Ticket might fall into this category.
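A small pandas sketch of how these feature types might be encoded once the data is loaded into a DataFrame df (the column names are from the dataset; the exact casts are an assumption):
import pandas as pd

df = pd.read_csv('train.csv')
# nominal/binary features become unordered categoricals
for col in ['Sex', 'Embarked']:
    df[col] = df[col].astype('category')
# Pclass is ordinal, so make the 1 < 2 < 3 ordering explicit
df['Pclass'] = pd.Categorical(df['Pclass'], categories=[1, 2, 3], ordered=True)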
PROCEDURE
 Import the relevant Python libraries for the analysis
 Load the train and test datasets and set the index if applicable
 Visually inspect the head of the dataset; examine the train dataset to understand, in particular, whether the data is tidy, the shape of the dataset, the datatypes, missing values and unique counts, and build a data dictionary dataframe
 Run descriptive statistics of object and numerical datatypes, and finally transform datatypes accordingly
 Carry out univariate, bivariate and multivariate analysis using graphical and non-graphical (some numbers representing the data) mediums
 Feature Engineering: extract title from name; extract new features from name, age, fare, sibsp, parch and cabin
 Preprocessing: prepare the data for statistical modeling
 Statistical Modelling

Import the relevant python libraries for the analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Load the train and test dataset and set the index if
applicable
# load the train dataset
df = pd.read_csv('train.csv')

#inspect the first few rows of the train dataset
df.head()

df.shape
(891, 12)

# identify datatypes
df.dtypes

PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object

# identify missing values
df.isnull().sum()

PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64

# build a data dictionary: dtype, missing values and unique counts per column
pd.DataFrame({'dtype': df.dtypes, 'MissingVal': df.isnull().sum(),
              'NUnique': df.nunique()})

              dtype    MissingVal  NUnique
PassengerId   int64    0           891
Survived      int64    0           2
Pclass        int64    0           3
Name          object   0           891
Sex           object   0           2
Age           float64  177         88
SibSp         int64    0           7
Parch         int64    0           7
Ticket        object   0           681
Fare          float64  0           248
Cabin         object   687         147
Embarked      object   2           3

df.describe()

Carry out univariate and multivariate analysis using graphical and non-graphical (some numbers representing the data) mediums
df.Survived.value_counts(normalize=True)

Output:
0 0.616162
1 0.383838
Name: Survived, dtype: float64
Only 38% of the passengers survived, whereas the majority, 62%, did not survive the disaster
Univariate Analysis
fig, axes = plt.subplots(2, 4, figsize=(16, 10))
sns.countplot('Survived',data=df,ax=axes[0,0])
sns.countplot('Pclass',data=df,ax=axes[0,1])
sns.countplot('Sex',data=df,ax=axes[0,2])
sns.countplot('SibSp',data=df,ax=axes[0,3])
sns.countplot('Parch',data=df,ax=axes[1,0])
sns.countplot('Embarked',data=df,ax=axes[1,1])
sns.distplot(df['Fare'], kde=True,ax=axes[1,2])
sns.distplot(df['Age'].dropna(),kde=True,ax=axes[1,3])
Output:
<matplotlib.axes._subplots.AxesSubplot at
0x7ffa6366bdd8>

Bivariate EDA
We can clearly see that the male survival rate is around 20%, whereas the female survival rate is about 75%, which suggests that gender has a strong relationship with survival.
There is also a clear relationship between Pclass and survival, referring to the first plot below. Passengers in Pclass 1 had a better survival rate of approx. 60%, whereas passengers in Pclass 3 had the worst survival rate of approx. 22%.
There is also a marginal relationship between the fare and the survival rate.

figbi, axesbi = plt.subplots(2, 4, figsize=(16, 10))
df.groupby('Pclass')['Survived'].mean().plot(kind='barh', ax=axesbi[0,0], xlim=[0,1])
df.groupby('SibSp')['Survived'].mean().plot(kind='barh', ax=axesbi[0,1], xlim=[0,1])
df.groupby('Parch')['Survived'].mean().plot(kind='barh', ax=axesbi[0,2], xlim=[0,1])
df.groupby('Sex')['Survived'].mean().plot(kind='barh', ax=axesbi[0,3], xlim=[0,1])
df.groupby('Embarked')['Survived'].mean().plot(kind='barh', ax=axesbi[1,0], xlim=[0,1])
sns.boxplot(x="Survived", y="Age", data=df, ax=axesbi[1,1])
sns.boxplot(x="Survived", y="Fare", data=df, ax=axesbi[1,2])

Multivariate EDA
Construct a correlation matrix of the int64 and float64 feature types
There is a positive correlation between Fare and Survived, and a negative correlation between Pclass and Survived
There is a negative correlation between Fare and Pclass, and between Age and Pclass

import seaborn as sns

f, ax = plt.subplots(figsize=(10, 8))
corr = df.corr()
sns.heatmap(corr,
            mask=np.zeros_like(corr, dtype=bool),  # np.bool was removed in newer NumPy; use the builtin bool
            cmap=sns.diverging_palette(220, 10, as_cmap=True),
            square=True, ax=ax)

Having missing values in a dataset can cause errors with some machine learning algorithms, so the rows that have missing values should either be removed or the values imputed.
Imputing refers to using a model to replace missing values. There are many options we could consider when replacing a missing value, for example:
a constant value that has meaning within the domain, such as 0, distinct from all other values
a value from another randomly selected record
the mean, median or mode value for the column
a value estimated by another predictive model
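As a hedged illustration of the mean/median/mode option, the two columns with missing values in this dataset could be imputed like this (median for the numeric Age, mode for the categorical Embarked; the choice of statistic is an assumption):
# median for a skewed numeric column (Age has 177 missing values)
df['Age'] = df['Age'].fillna(df['Age'].median())
# mode for a categorical column (Embarked has 2 missing values)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])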

# impute the missing Fare values with the mean Fare value
df.Fare.fillna(df.Fare.mean(), inplace=True)

# validate that no null values are present after the imputation
df[df.Fare.isnull()]

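The modelling step below references engineered columns (Name_len, Ticket_First, FamilyCount, Cabin_First, title) whose construction is not shown on these slides; one plausible sketch of how they could be derived is:
# length of the passenger's name
df['Name_len'] = df['Name'].str.len()
# first character of the ticket string
df['Ticket_First'] = df['Ticket'].str[0]
# family size = siblings/spouses + parents/children + the passenger
df['FamilyCount'] = df['SibSp'] + df['Parch'] + 1
# deck letter from the cabin (NaN where Cabin is missing)
df['Cabin_First'] = df['Cabin'].str[0]
# honorific extracted from the name, e.g. Mr, Mrs, Miss
df['title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)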
Statistical Modelling
df.columns

Output:
Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Name_len',
'Ticket_First', 'FamilyCount', 'Cabin_First', 'title'],
dtype='object')

trainML = df[['Survived', 'Pclass', 'Name', 'Sex', 'Age',
'SibSp', 'Parch', 'Ticket', 'Fare', 'Embarked', 'Name_len',
'Ticket_First', 'FamilyCount', 'title']]

# drop rows with missing values
trainML = trainML.dropna()

# check whether the dataframe has any missing values
trainML.isnull().sum()

Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Embarked 0
Name_len 0
Ticket_First 0
FamilyCount 0
title 0
dtype: int64

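The slides stop short of fitting a model; a minimal sketch of the statistical modelling step, assuming scikit-learn is available (the encoding and model choice are assumptions, not the original author's method), could be:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# one-hot encode the categorical columns; drop free-text fields
X = pd.get_dummies(
    trainML.drop(columns=['Survived', 'Name', 'Ticket']),
    columns=['Sex', 'Embarked', 'title', 'Ticket_First'],
    drop_first=True)
y = trainML['Survived']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print('test accuracy:', model.score(X_test, y_test))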
Data Visualization using Tableau
 Tableau is a Data Visualisation tool that is widely used for Business Intelligence but is not limited to it. It helps create interactive graphs and charts in the form of dashboards and worksheets to gain business insights. And all of this is made possible with gestures as simple as drag and drop.
 Tableau is a powerful and fast-growing data visualization tool used in the Business Intelligence industry. It helps in simplifying raw data into a very easily understandable format. Tableau helps create visualizations that can be understood by professionals at any level in an organization. It also allows non-technical users to create customized dashboards.

Data analysis is very fast with the Tableau tool, and the visualizations created are in the form of dashboards and worksheets.
The best features of Tableau software are
Data Blending
Real time analysis
Collaboration of data

What Products does Tableau offer?

Tableau Product Suite
The Tableau Product Suite consists of
Tableau Desktop
Tableau Public
Tableau Online
Tableau Server
Tableau Reader

For a clear understanding, data analytics in the Tableau tool can be classified into two sections.
Developer Tools: The Tableau tools that are used for
development such as the creation of dashboards, charts,
report generation, visualization fall into this category. The
Tableau products, under this category, are the Tableau
Desktop and the Tableau Public.
Sharing Tools: As the name suggests, the purpose of these
Tableau products is sharing the visualizations, reports,
dashboards that were created using the developer tools.
Products that fall into this category are Tableau Online,
Server, and Reader.
 Tableau Desktop
Tableau Desktop has a rich feature set and allows you to code and customize
reports. Right from creating the charts, reports, to blending them all together to
form a dashboard, all the necessary work is created in Tableau Desktop.
 For live data analysis, Tableau Desktop provides connectivity to Data
Warehouse, as well as other various types of files. The workbooks and the
dashboards created here can be either shared locally or publicly.
 Based on the connectivity to the data sources and publishing option, Tableau
Desktop is classified into
 Tableau Desktop Personal: The development features are similar to Tableau Desktop. The Personal version keeps the workbook private, and access is limited. The workbooks cannot be published online; therefore, they should be distributed either offline or in Tableau Public.
 Tableau Desktop Professional: It is pretty much similar to Tableau Desktop. The difference is that work created in Tableau Desktop can be published online or to Tableau Server. Also, in the Professional version, there is full access to all sorts of datatypes. It is best suited for those who wish to publish their work in Tableau Server.

Tableau Public
It is the Tableau version specially built for cost-conscious users. The word “Public” means that the workbooks created cannot be saved locally; instead, they must be saved to Tableau’s public cloud, which can be viewed and accessed by anyone.
There is no privacy for the files saved to the cloud, since anyone can download and access them. This version is best for individuals who want to learn Tableau and for those who want to share their data with the general public.

Tableau Server
 The software is specifically used to share the workbooks and visualizations that are created in the Tableau Desktop application across the organization.
you must first publish your work in the Tableau Desktop. Once the
work has been uploaded to the server, it will be accessible only to
the licensed users.
 However, licensed users do not need to have Tableau Server installed on their machine. They just require the login credentials with which they can check reports via a web browser. Security is high in Tableau Server, and it is well suited for quick and effective sharing of data in an organization.
 The admin of the organization will always have full control over the
server. The hardware and the software are maintained by the
organization.
Tableau Online
 As the name suggests, it is an online sharing tool of Tableau. Its
functionalities are similar to Tableau Server, but the data is
stored on servers hosted in the cloud which are maintained by
the Tableau group.
 There is no storage limit on the data that can be published in the
Tableau Online. Tableau Online creates a direct link to over 40
data sources that are hosted in the cloud such as the MySQL,
Hive, Amazon Aurora, Spark SQL and many more.
 To publish, both Tableau Online and Server require the
workbooks created by Tableau Desktop. Data that is streamed
from the web applications say Google Analytics,
Salesforce.com are also supported by Tableau Server and
Tableau Online.
Tableau Reader
Tableau Reader is a free tool which allows you to view
the workbooks and visualizations created using
Tableau Desktop or Tableau Public. The data can be
filtered but editing and modifications are restricted.
The security level is zero in Tableau Reader as anyone
who gets the workbook can view it using Tableau
Reader.
If you want to share the dashboards that you have
created, the receiver should have Tableau Reader to
view the document.

How does Tableau work?
 Tableau connects to and extracts the data stored in various places. It can pull data from almost any platform: from a simple source such as an Excel file or a PDF, to a complex database like Oracle, to a database in the cloud such as Amazon Web Services, Microsoft Azure SQL Database or Google Cloud SQL; these and various other data sources can be extracted by Tableau.
 When Tableau is launched, ready data connectors are available which allow you to connect to any database. Depending on the version of Tableau that you have purchased, the number of data connectors supported by Tableau will vary.

 The pulled data can either be connected live or extracted into Tableau’s data engine, Tableau Desktop. This is where the data analyst or data engineer works with the data that was pulled and develops visualizations. The created dashboards are shared with the users as a static file. The users who receive the dashboards view the file using Tableau Reader.
 The data from Tableau Desktop can be published to the Tableau Server. This is an enterprise platform where collaboration, distribution, governance, security-model and automation features are supported. With the Tableau Server, the end users have a better experience in accessing the files from all locations, be it desktop, mobile or email.

Tableau Uses- Following are the main uses and applications
of Tableau:
 Business Intelligence
 Data Visualization
 Data Collaboration
 Data Blending
 Real-time data analysis
 Query translation into visualization
 To import large size of data
 To create no-code data queries
 To manage large size metadata

Excel Vs. Tableau
Both Excel and Tableau are data analysis tools, but each tool has its unique approach to data exploration. However, the analysis in Tableau is more powerful than in Excel.
Excel works with rows and columns in spreadsheets, whereas Tableau enables exploring Excel data using its drag-and-drop feature. Tableau formats the data in graphs and pictures that are easily understandable.

Parameters | Excel | Tableau
Purpose | Spreadsheet application used for manipulating the data. | Perfect visualization tool used for analysis.
Usage | Most suitable for statistical analysis of structured data. | Most suitable for quick and easy representation of big data, which helps in resolving big-data issues.
Performance | Moderate speed with no option to quicken. | Moderate speed with options to optimize and enhance the progress of an operation.
Security | The inbuilt security feature is weak when compared to Tableau; security updates need to be installed on a regular basis. | Extensive options to secure data without scripting; security features like row-level security and permissions are inbuilt.
User Interface | To utilize Excel to its full potential, macro and Visual Basic scripting knowledge is required. | The tool can be used without any coding knowledge.
Business need | Best for preparing on-off reports with small data. | Best while working with big data.
Products | Bundled with MS Office tools. | Comes with different versions such as Tableau Server, Cloud, and Desktop.
Integration | Excel integrates with around 60 applications. | Tableau is integrated with over 250 applications.
Real-time data exploration | When working in Excel, you need to have an idea of where your data takes you to get to the insights. | In Tableau, you are free to explore data without even knowing the answer that you want; with in-built features like data blending and drill-down, you will be able to determine the variations and data patterns.
Easy visualizations | When working in Excel, we first manipulate the data that is present and then create visualizations such as the different charts and graphs manually; to make the visualizations easily understandable, you should understand the features of Excel well. | In Tableau, the data is visualized from the beginning.
Tableau provides various services according to business needs.
Installing Tableau on your System

Tableau Desktop, Tableau Public, and Tableau Online all offer data visual creation. The choice of Tableau depends upon the type of work.

Variants of Tableau
Tableau Desktop is a program that allows you to execute complicated data analysis tasks and generate dynamic, interactive representations to explain the results. Tableau also lets you share your analysis and visualizations with the rest of your company, allowing everyone from coworkers to top management to look into the data that matters to them.

After installation, if you find this Homescreen you are
good to go:

Load Data in Tableau
We will be using global superstore data which is
perfect for learning purposes.

 Tableau supports various data formats, which can be loaded by choosing the corresponding options. Under “To a File” we see various options to load data from the local directory, and under “To a Server” we see options to load data from cloud servers. For loading CSV files we select the Text file option; for Excel and SQL files we choose their respective options.

Connect Tableau to the data file:

 To open the application, click the Tableau icon on your desktop (or in your Start menu).
 In the Connect panel at the left side of the Start page, click the Excel link under the “To a File” heading to open the file selection option.
 Using the file selection box, select the Excel worksheet that you want to open, and then click the Open button to continue.
 Select the Orders sheet from the navigation menu on the left and drag it onto the Drag Sheets Here area, as shown in the above gif.
 After loading, we can perform data cleaning, data preprocessing and feature extraction to some extent.

Understanding different Sections in Tableau
Tableau is now loaded with the global-superstore data, and we can see the Tableau work-page.
 The Tableau work-page consists of different sections.

 Menu Bar: Here you’ll find various commands such as File, Data, and Format.
 Toolbar Icon: The toolbar contains a number of buttons that enable you to perform various tasks with a click, such as Save, Undo, and New Worksheet.
 Dimension Shelf: This shelf contains all the categorical columns under it, for example: categories, segments, gender, name, etc.
 Measure Shelf: This shelf contains all the numerical columns under it, like profit, total sales, discount, etc.
 Page Shelf: This shelf is used for joining pages and creating animations; we will come to it later.
 Filter Shelf: You can choose which data to include and exclude using the Filters shelf; for example, you might want to analyze the profit for each customer segment, but only for certain shipping containers and delivery times. You can make a view like this by putting fields on the Filters shelf.
 Marks Card: The visualization can be designed using the Marks card. The Marks card can be used to change the data components of the visualization, such as color, size, shape, path, label, and tooltip.
 Worksheet: In the workbook, the worksheet is where the real visualization may be seen. The worksheet contains information about the visual’s design and functionality.

•Data Source: Using Data Source we can add new data, and modify or remove data.
•Current Sheet: The current sheets are those we have created, and we can give them names.
•New Sheet: If we want to create a new worksheet (blank canvas), we can do so using this tab.
•New Dashboard: This button is used to create a dashboard canvas.
•New Storyboard: It is used to create a new story.

Creating Visuals in Tableau
Let’s begin with real data visualization using Tableau.
 Tableau supports the following data types:
 Boolean: True and false can be stored in this data type.
 Date/Datetime: This data type can help in leveraging Tableau’s default date hierarchy behavior when applied to valid Date or DateTime fields.
 Number: These are values that are numeric. Values can be integers
or floating-point numbers (numbers with decimals).
 String: This is a sequence of characters encased in single or double
quotation marks.
 Geolocation: These are values that we need to plot maps.

Follow these steps:
 Drag the dimension and measure into the row and column input fields, and Tableau will automatically suggest a graph best fitted to the data.
 You can change the graph by clicking on the Show Me button and selecting whichever graph you want.
 You can also remove an axis just by dragging and dropping it under the Marks card (remove field).
 Show Me: When you click this label, a palette appears, giving you rapid access to many options for showing the selected types of fields. The palette changes depending on the fields in the worksheet you’ve selected or are active.

From the above image, you might have observed that the default aggregation on the measure is sum, but you can change the aggregation to sum, avg, min, max, etc. You can also customize the axis name, orientation and size, and show or hide the axis.

Enhancing the Analysis:
 In order to create a beautiful interactive visual, you must understand the following features:
a. Marks card
The Marks card is very important for plotting graphs. In the Marks card we have:
 The Colour button, which is used to give different colors to different categories and measures.
 The Size button, which is used to give a size that depends on how big a value is; the bigger the value, the bigger the size of a particular mark.
 The Label button, which is used to show labels on graphs; clicking on the Label button opens settings where you can set the formatting of labels.
 Tooltips, where you can add information (profit, quantity, sales, discount, category, state, etc.) which will be visible on hovering over the graph.
 The Details button, which allows you to display more information without affecting the table’s structure. Dragging a field onto the Details button will show the details of that point; this feature is mainly used for maps to show more details of a particular point.

209 06/02/2024
b. Filter
 After creating some plots you might want to use different filters; to do so, follow these steps:
 On the Filter shelf, you can drag any measure or dimension you want to apply a filter on.
 As you drop the field, a box will appear; now you can select any particular category, or the top-n rows according to measure values, or you can write some rules to select top rows, or use some parameters.
 Now click on Show Filter after selecting the filter you just applied.
 You may want to apply multiple filters; to do so, you will need to add previous filters into context by clicking on Add to Context. A Context Filter is a Tableau filter that is applied before all other filters. You can choose different options (standard, fit width, fit height, entire view) from the toolbar in order to fit the visualization into the worksheet.
c. Hierarchy
 You can quickly establish hierarchies with Tableau to keep your data organized.
 A hierarchy is basically nesting the same type of related data together. Tableau calendar data is an example of a hierarchy.

Date-time (calendar) data is in the form of a hierarchy in Tableau, which can be drilled down to year -> quarter -> month -> day by clicking on the “+” button on the features tab.
You can also create your own hierarchy, like country -> state -> city -> postal code, just by dragging features onto one another; when needed, clicking on the “+” button lets you drill down further to state, city and postal code.
d. Parameter
 A parameter is a workbook variable like a number, date, or string that can be
readily managed by the user to replace a constant value in a calculation.

In the above image, our goal was to choose the top N countries having maximum sales, but here we wanted to let the user select how many top countries they want to list. To accomplish this, we'll need to create the following parameter:

 Click the down arrow to the right of Dimensions in the Data pane and select Create Parameter from the pop-up menu that appears, as shown in the above image, and give it the name variable1.
 Select a data type from the Data Type drop-down menu; in my case, I have chosen an integer range from 1–100, and the current value will be 5.
 Clicking on Show Parameter will show the parameter with a slider, but it is not yet connected to anything.
 Here we wanted to choose the top-N countries based on sales: drag the country field to the Filter shelf, choose the Top tab, then choose variable1 in the By Field section and choose SUM(Sales).
 Now slide the parameter value and observe the difference.

e. Calculated field
 Tableau gives us the option to create a calculated field where we can create our own new field (column). Tableau comes with many functions like if-else, switch-case, DATEDIFF and level of detail (LOD) expressions, which are extensively used in visualization:
To segment data
Level of detail (LOD) calculations
To change a field's data type, for example, from a string to a date
To aggregate data
To handle date-time
To filter results
To calculate ratios

215 06/02/2024
Creating a Calculated Field:
Here our goal is to calculate delivery days using the order date and ship date:
Select Analysis > Create Calculated Field and give it the name delivery days
Give the rule to calculate delivery days in the rule box; here we will use the DATEDIFF function to take the difference between the two dates (DATEDIFF returns the end date minus the start date, so the order date comes first).
Type the rule: delivery days = DATEDIFF('day', [Order Date], [Ship Date])
Now drag the delivery days field into rows or columns.
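As another hedged example for the "calculate ratios" use above, a profit-ratio calculated field in Tableau's formula language might look like this (the field names assume the superstore sample data):
// Calculated field: Profit Ratio (an aggregate calculation)
SUM([Profit]) / SUM([Sales])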
f. Format
Formatting in Tableau is very easy. Just click on the Format button wherever you want to format. We can format text, numbers, percentages, decimals, date-time formats, label color, label size, axis line color, worksheet, columns, headers, etc., as shown in the above image.
Data Analytics in Tableau
In the Analytics tab, we have several analytical tools like forecasting, clustering, trend lines, average lines, constant lines, etc.
Steps to perform Analytics
 From the Analytics tab on the left side, you can choose various options.
 Dragging and dropping a constant line on a particular X- or Y-axis draws a line at a given constant value.
 Dragging Forecast onto your sheet will give you a time forecast of a given measure, which you can edit by right-clicking on the forecasted part; there you can choose the confidence interval, the time steps to be forecasted, the forecast model, etc.
 The trend line is not the same as forecasting. The trend line only tells us whether the overall trend is increasing or decreasing.
Summary on Tableau
 Tableau is a powerful and fast-growing data visualization tool used in the Business Intelligence industry.
 The Tableau Product Suite consists of 1) Tableau Desktop 2) Tableau Public 3) Tableau Online 4) Tableau Server and 5) Tableau Reader.
 Tableau Desktop has a rich feature set and allows you to code and customize reports.
 In Tableau Public, workbooks created cannot be saved locally; instead, they must be saved to Tableau's public cloud, which can be viewed and accessed by anyone.
 Tableau Server is specifically used to share the workbooks and visualizations created in the Tableau Desktop application across the organization.
 Tableau Online has all the functionalities of Tableau Server, but the data is stored on servers hosted in the cloud, which are maintained by the Tableau group.
 Tableau Reader is a free tool which allows you to view the workbooks and visualizations created using Tableau Desktop or Tableau Public.
 Tableau connects to and extracts data stored in various places; it can pull data from almost any platform.
 Excel is a spreadsheet application used for manipulating data, while Tableau is a visualization tool used for analysis.

THANK YOU

