2024 BA Pre-Read

PRE-READ 2024
BUSINESS ANALYTICS
PREPARED BY PREPARATION COMMITTEE’24, IIFT DELHI

S. No. Topics
1 Overview Of Business Analytics
2 Generic Flow of Solving Analytics Usecases
3 Types of Analytics
4 Data and Types of Data
5 Measures of Data
6 Data Distributions
7 Basic Charts
8 Advanced Charts
9 Inferential Statistics
10 Linear Regression
11 Logistic Regression
12 Time Series Analysis
13 FAQ
© Preparation Committee’24 Pre Read-Business Analytics

What is Analytics?
Analytics involves the systematic computational examination of data or its
statistics. Its purpose is to uncover and interpret significant patterns within
data, as well as to utilize these patterns for informed decision-making.
What is Business Analytics?
Business analytics refers to the practice of using various methods,
techniques, and tools to analyse data related to business operations such as
sales and marketing, supply chain, HR, and finance. It is used to gain
insights into areas such as performance, trends, customer behaviour, and
market dynamics. The goal of business analytics is to enable data-driven
decision-making within an organization, helping to optimize
strategies, improve performance, and achieve business objectives.
Who is a Business Analyst?
A business analyst is a professional who plays a crucial role in bridging the
gap between business needs and technology solutions within an
organization. Their primary responsibility is to analyse business processes,
gather requirements, and translate them into actionable specifications for
IT systems, projects, or initiatives.
Business analysts typically possess a blend of business acumen and
analytical skills to effectively bridge the gap between business and
technology domains. They often work closely with project managers,
subject matter experts, and IT professionals to deliver successful outcomes.
Broad Responsibilities of a Business Analyst

Requirement Gathering: Collaborating with stakeholders to understand their business
objectives, processes, and requirements for new business goals.
Analysis and Documentation: Analysing gathered requirements, identifying gaps, and
documenting detailed business requirements and use cases.
Solution Design: Working with developers, architects, and other team members to design
solutions that meet business needs and align with organizational goals.
Stakeholder Communication: Facilitating communication between business stakeholders
and technical teams to ensure a common understanding of project objectives.
Testing and Validation: Participating in testing activities to validate that developed
solutions meet the specified requirements and address business needs.
Change Management: Supporting organizational change by assessing the impact of
proposed changes, identifying risks, and helping stakeholders navigate transitions.

[Figure: Types of Data Professionals (Business Analyst, ML Engineer, Data Engineer, Data Scientist)]
In the landscape of data professionals (Data Engineers, Data Scientists, ML Engineers, and
Business Analysts), the latter play a major role in data visualization, business insights,
storytelling, and reporting. Business Analysts are pivotal in bridging the gap between business
and technology. Hence, a Business Analyst should be proficient in basic probability, statistics,
data visualization, storytelling, basic machine learning algorithms, and business domains.

Generic Flow of Solving Analytics Usecases
1. Understanding the use case
2. Identifying variables
3. Data sourcing
4. Data cleaning
Handling Missing Values
Handling Outliers
Standardising Values
5. Data wrangling / Data munging
Transformations
Aggregations if necessary
6. Exploratory data analysis
Univariate analysis: categorical ordered and unordered variables, statistics on numerical variables
Multivariate analysis: numeric vs numeric, correlation vs causation, categorical vs numerical
7. Test Train split
8. Fitting an Analytical/Statistical model on train data
Classic machine learning models based on objective: Linear regression, Logistic regression,
Clustering, Decision tree, Random forest, Principal component analysis
9. Testing the model performance on test data
10. Adjusting model parameters for robustness
11. Deploy in the Production
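
The modelling steps in the flow above (test-train split, fitting on train data, testing on test data) can be sketched in a few lines of Python. This is only an illustrative sketch: the spend and sales figures are simulated, the 80/20 split is an assumed choice, and a one-variable linear regression stands in for whichever model suits the use case.

```python
import random
import statistics

# Simulated (hypothetical) data: advertising spend and resulting sales.
random.seed(42)
spend = [float(i) for i in range(1, 51)]
sales = [3.0 * x + 10.0 + random.gauss(0, 2.0) for x in spend]

# Test-train split: shuffle the row indices and keep 80% for training.
indices = list(range(len(spend)))
random.shuffle(indices)
cut = int(0.8 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]

# Fit a simple linear regression (least squares) on the training rows only.
x_train = [spend[i] for i in train_idx]
y_train = [sales[i] for i in train_idx]
mean_x = statistics.mean(x_train)
mean_y = statistics.mean(y_train)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_train, y_train)) / sum(
    (x - mean_x) ** 2 for x in x_train
)
intercept = mean_y - slope * mean_x

# Test the model on unseen rows using the mean absolute error.
errors = [abs(sales[i] - (slope * spend[i] + intercept)) for i in test_idx]
test_mae = statistics.mean(errors)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, test MAE={test_mae:.2f}")
```

Keeping the test rows out of the fitting step is what makes the error on them an honest estimate of how the model would behave on new data.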

Data is like a treasure trove, but without the right tools and know-how, it remains a pile of rocks.
Enter data analytics, the art of extracting valuable insights from information. Here's a breakdown of
different analytics types, their applications, and some helpful tools:
1. Descriptive Analytics: Making Sense of Your Data

Descriptive analytics summarizes past data, providing a clear picture of what has happened.
Imagine you have a giant box filled with receipts from your store. Descriptive analytics is like sorting
through that box and organizing everything neatly. It helps you understand what has happened in
the past by summarizing your data and answering basic questions like:
What: What are my top-selling products?
When: When do sales typically peak?
Where: Where are my customers located?
Who: Who are my typical customers (age, gender, etc.)?
How Much: How much revenue did I generate last month?

Tools for Descriptive Analytics:
Spreadsheets (Microsoft Excel, Google Sheets), Data visualization tools (Tableau, Power BI)
Industry Example (Retail):

Let's see how an online store uses descriptive analytics to gain insights:
Total Sales by Product Category: Using descriptive analytics, they can see which product
categories (e.g., clothing, electronics) generate the most revenue. This helps them identify their
best-selling products and areas for potential expansion (e.g., adding more variety in popular
categories).
Website Traffic by Region: By analyzing website traffic data, they can understand where their
customers are coming from geographically. This allows them to tailor their marketing efforts to
specific regions (e.g., focusing on social media advertising popular in certain areas).
Customer Demographics: Descriptive analytics can reveal insights about their customer base,
such as age, gender, and interests. This information can be used to create targeted marketing
campaigns (e.g., emails with special offers relevant to specific customer segments) and
personalize the customer experience (e.g., recommending products based on past purchases).
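
As a rough illustration of the retail example, the summaries "total sales by product category" and "orders by region" are simple aggregations. The order records below are made-up values used only to show the idea.

```python
from collections import defaultdict

# Hypothetical order records: (product category, customer region, order value).
orders = [
    ("clothing", "North", 120.0),
    ("electronics", "South", 450.0),
    ("clothing", "South", 80.0),
    ("electronics", "North", 300.0),
    ("home", "North", 60.0),
]

sales_by_category = defaultdict(float)
orders_by_region = defaultdict(int)
for category, region, value in orders:
    sales_by_category[category] += value  # "What": revenue per category
    orders_by_region[region] += 1         # "Where": order count per region

top_category = max(sales_by_category, key=sales_by_category.get)
print(dict(sales_by_category))
print(dict(orders_by_region))
print("Top category:", top_category)
```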
2. Diagnostic Analytics: Diving Deeper for Causes

Diagnostic analytics goes beyond description, asking "why" things
happen. It analyzes past data to identify the root cause of trends or
patterns.
Tools for Diagnostic Analytics:
Online Analytical Processing (OLAP) tools, Data mining software
(RapidMiner, KNIME)
Industry Example:
"Sole Mates," a popular shoe store, faces a mystery – running shoe sales are slumping!
Descriptive analytics showed the "what," but not the "why."
Here's where diagnostic analytics steps in:
Finding the Culprit: "Sole Mates" uses various data sources to diagnose the
problem:
Customer Reviews: They analyze reviews to see if there are consistent
complaints about recent running shoe models (comfort, design,
performance).
Website Traffic: They check if traffic to running shoe pages has
decreased. Are users bouncing off quickly, suggesting navigation issues?
Sales Data: They examine sales data to see if specific models or sizes are
underperforming.

Possible Causes: Based on the analysis, the culprit could be:
Product Issues: Negative reviews and declining sales for specific models might indicate quality
concerns.
Shifting Preferences: A drop in website traffic for running shoe pages, with no major product
complaints, could suggest changing customer preferences towards a different type of footwear.
Website Woes: If customer reviews don't highlight product issues, but website traffic and sales
data show a decline, the culprit might be website navigation problems.
Taking Action:
Once diagnosed, "Sole Mates" can take corrective actions to improve sales, like addressing quality
issues, promoting specific models, or enhancing website navigation.
By using diagnostic analytics, "Sole Mates" can move beyond the "what" and uncover the "why,"
allowing them to make data-driven decisions for success.
3. Predictive Analytics: Peeking into the Future

Predictive analytics uses historical data, statistical models, and machine learning algorithms to
forecast future events with a certain degree of probability.
Predictive analytics is used across various industries, from retail forecasting future sales trends to
healthcare predicting disease outbreaks. It's a powerful tool that helps businesses make informed
decisions based on data-driven insights about the future.
Tools for Predictive Analytics:

Statistical software (R, Python), Machine learning libraries (Scikit-learn, TensorFlow)
Making Predictions with Data:

Before we dive into how it works, let's understand two common techniques used in predictive
analytics:
Decision Trees:
Imagine a decision tree like a giant "if-then" statement. It asks a series of "yes" or "no" questions
about your data to arrive at a prediction. For example, a bank might use a decision tree to predict
loan defaults.
The tree might ask questions like:

Does the borrower have a high income? (Yes/No)
Do they have a good credit history? (Yes/No)
Have they defaulted on loans in the past? (Yes/No)
Based on the answers to these questions, the decision tree would predict whether the borrower is
likely to repay the loan or not.
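
The "giant if-then statement" idea can be written out directly. The sketch below is a hand-written toy tree: the order of the questions and the outcome at each branch are assumptions for illustration, whereas a real decision tree learns them from historical loan data.

```python
def predict_repayment(high_income: bool, good_credit_history: bool, past_default: bool) -> str:
    """Toy decision tree; branch order and outcomes are illustrative assumptions."""
    if past_default:
        # A past default is treated as the strongest warning sign.
        return "likely to default"
    if good_credit_history:
        return "likely to repay"
    if high_income:
        # No credit history to rely on, but income suggests capacity to repay.
        return "likely to repay"
    return "likely to default"


print(predict_repayment(high_income=False, good_credit_history=True, past_default=False))
```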

Random Forests:
While a single decision tree can be helpful, imagine
planting a whole forest of them! Each tree in the
forest asks slightly different questions and makes its
own prediction. Random forests combine the
predictions from all these trees to get a more
accurate overall picture. Think of it like taking a vote
among all the decision trees for the final prediction.
Industry Example:
Bank XYZ can use data analysis to build a credit default risk model to improve their loan approval
process. This model can be built using decision trees or random forests. Decision trees ask a series of
questions about a new loan applicant's data to predict their default risk. Random forests involve
creating multiple decision trees, each with slight variations, and basing the final decision on the
majority vote from all the trees. This approach can provide a more robust and reliable prediction
compared to a single decision tree.
By using a credit default risk model, Bank XYZ can make more informed decisions about approving or
rejecting loan applications. They can also adjust loan terms (interest rate, loan amount) based on the
predicted risk.
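
The majority-vote step of a random forest can be sketched as follows. The three one-question "trees" and the applicant record are hypothetical; a real random forest trains many full trees on random subsets of the data, so this shows only how the votes are combined.

```python
from collections import Counter

# Three deliberately simple, hand-written "trees" (assumed for illustration).
def tree_history(applicant):
    return "default" if applicant["past_default"] else "repay"

def tree_income(applicant):
    return "repay" if applicant["high_income"] else "default"

def tree_credit(applicant):
    return "repay" if applicant["good_credit_history"] else "default"

TREES = [tree_history, tree_income, tree_credit]

def forest_predict(applicant):
    """Return the majority prediction along with the individual votes."""
    votes = [tree(applicant) for tree in TREES]
    winner = Counter(votes).most_common(1)[0][0]
    return winner, votes

applicant = {"high_income": True, "good_credit_history": True, "past_default": True}
prediction, votes = forest_predict(applicant)
print(prediction, votes)
```

Using an odd number of trees avoids ties when there are only two possible outcomes.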
4. Prescriptive Analytics: Taking Action Based on Insights

Prescriptive analytics is the most advanced form of data analysis. It not only predicts future events
but also recommends the optimal course of action based on those predictions.
Predictive analytics: Tells you it might rain tomorrow.
Prescriptive analytics: Advises you to pack an umbrella and wear rain boots.
Google Maps analyzes live traffic, weather, and your preferences to predict delays and suggest
faster routes. It then prescribes detours, rest stops, and even restaurants, making your trip a
breeze - all powered by prescriptive analytics.

Tools for Prescriptive Analysis:
Optimization Software: Imagine having a software assistant that analyzes
all your options and suggests the most efficient way to achieve your goals.
(Examples: FICO Xpress Optimization Suite, IBM ILOG CPLEX Optimizer)
Simulation Modeling Tools: Think of these as virtual test environments.
You can plug in different scenarios (e.g., launching a new product) and see
how your decisions might play out before taking real action. (Examples:
AnyLogic, Rockwell Automation Arena)
Industry Example: Factory Fix-It

Let's see how a factory uses prescriptive analytics to avoid problems:
Predicting Machine Breakdowns: Sensors on machines collect data on temperature, vibration, and
other factors. Prescriptive analytics analyzes this data and predicts when a machine might fail.
Recommending Maintenance: Based on the prediction, the system might recommend preventive
maintenance, like replacing a worn-out part. This helps minimize downtime and keep production
running smoothly.
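
One simple way to turn a failure prediction into a recommendation is to compare expected costs. The sketch below assumes a predicted failure probability is already available and uses made-up cost figures; real prescriptive systems typically optimize over many more options and constraints.

```python
def recommend_action(failure_probability: float, maintenance_cost: float, breakdown_cost: float) -> str:
    """Recommend maintenance when the expected breakdown cost exceeds the maintenance cost."""
    expected_cost_if_run = failure_probability * breakdown_cost
    if expected_cost_if_run > maintenance_cost:
        return "schedule preventive maintenance"
    return "keep running"


# Hypothetical figures: maintenance costs 1,000; an unplanned breakdown costs 10,000.
print(recommend_action(0.30, 1000, 10000))
print(recommend_action(0.05, 1000, 10000))
```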
Common techniques by analytics type and business function:
Descriptive
Marketing: Bar plot, Pie chart, Heat Map, Spider chart, Bubble chart
Finance: Bar, Pie, Gann, Box & Whisker
Trade: Line, Violin, Candle Stick
Supply chain: Visualization and OLAP Operations, Data Modeling
Diagnostic
Marketing: Scatter plot, Correlation
Finance: T-Test, Chi-sq Test, Linear regression
Trade: Clustering, Tree Mapping
Supply chain: OLAP operations, Gann Chart, Linear Programming
Predictive
Marketing: MLR, SEM, K-Means Clustering
Finance: ARIMA, MLR
Trade: ARIMA
Supply chain: Time Series Methods, Random Forests, Clustering Algorithms
Prescriptive
Marketing: Decision Trees
Finance: Lasso, Monte Carlo simulation
Trade: Decision Trees
Supply chain: Decision Trees and Fuzzy Rule-Based Systems, Switching Neural Networks

Data is an unprocessed group of text, images, and voice representing quantities, actions, and
objects. Data must be processed to extract meaningful insights from it.
Types of Data:
Based on characteristics -
Qualitative Data - It is unstructured data collected in the form of text, images, and videos.
Generally involves rigorous data collection for a deeper understanding of the research objective.
Methods of collection - Interview, Focus groups.
Purpose - Creating new theories, generating hypotheses from sensitive and complex data.
Quantitative Data - It is structured data collected numerically, including anything that can be
counted, measured, or given a numerical value. Statistical analysis can be conducted.
Methods of collection - Survey.
Purpose - Testing specific hypotheses and understanding cause-and-effect relationships,
providing insights and understanding.
Based on Methods of collection
Cross-Sectional Data - Data collected from multiple subjects at a
single point in time.
Ex: Data collected on ice cream flavour preferences among college
students from different areas.
Applications:
1. Comparing Diverse group - Income levels of different regions.
2. Market Analysis - Customer preferences in choosing laptops with
different features.
3. Snapshot Analysis - Effect of a new vaccine across different age
groups.
Time Series Data - Data collected from a single subject at
different points in time.
Ex: Daily data collected on Adani stock over a period of 5 years.
Applications:
1. Weather Forecasting - Data collected on temperature over several
years.
2. Energy consumption - Amount of energy units used in a day over
365 days.
3. Stock Analysis - Value of a stock collected daily over 10 years.
Panel Data - Data collected from multiple subjects at different
points in time.
Ex: Data collected on the health status of patients suffering from
chronic diseases over several years.
Applications:
1. Macroeconomics - GDP, Inflation of different countries over 10 years.
2. Investment - Best performing stocks in the past 8 quarters.
3. Demand - Tracking of fuel demand of different countries over
several years.
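
The three collection types relate to each other in a simple way: a panel holds many subjects over many periods, a cross-section is one period of it, and a time series is one subject of it. The sketch below uses placeholder countries and made-up fuel demand values purely to show the slicing.

```python
# Hypothetical panel: (country, year) -> fuel demand in arbitrary units.
panel = {
    ("Country A", 2021): 100, ("Country A", 2022): 104, ("Country A", 2023): 109,
    ("Country B", 2021): 250, ("Country B", 2022): 245, ("Country B", 2023): 251,
}

# Cross-sectional data: many subjects, one point in time.
cross_section_2022 = {country: value for (country, year), value in panel.items() if year == 2022}

# Time series data: one subject, many points in time.
time_series_a = {year: value for (country, year), value in panel.items() if country == "Country A"}

print(cross_section_2022)
print(time_series_a)
```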

Measures of Data:
Measures of data are numerical summaries of a dataset's characteristics, used to represent,
summarize, and analyze data for various business and research purposes.
1. Measures of Central Tendency –

A measure of central tendency is the most representative value of the dataset, around which the
data points of a variable tend to accumulate.
Mean– The average value of the dataset. It is affected by very small or large values, which can pull the
mean towards them, making it less effective as a central measure. The rare instances of these
extreme values (outliers) make the mean less representative.
Ex: Performance of students measured by average marks scored, average number of customers
coming to a shop in a day.
Median – The middle value in an ordered dataset. It is not affected by extreme values.
Ex: Consider The Great Khali and five female air hostesses; the mean of their weights is more than
the weight of each of the five air hostesses, making the mean a non-representative value. The
median stays close to the weight of a typical air hostess.
Mode – The most frequently occurring value in the dataset. It is not affected by extreme values and is
mostly used for categorical data.
Ex: A footwear manufacturer mass-produces the most common foot size to save production costs.
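
The effect of an outlier on the mean, and the robustness of the median and mode, can be checked quickly with Python's statistics module. The weights and shoe sizes below are invented for illustration.

```python
import statistics

# Hypothetical weights in kg: five air hostesses and one heavyweight wrestler.
weights = [50, 52, 54, 55, 56, 157]
mean_weight = statistics.mean(weights)      # pulled upwards by the single outlier
median_weight = statistics.median(weights)  # stays near the typical weight

# Hypothetical shoe sizes sold in a day; the mode is the most common size.
shoe_sizes = [7, 8, 8, 9, 8, 10]
most_common_size = statistics.mode(shoe_sizes)

print(f"mean={mean_weight:.2f}, median={median_weight}, mode={most_common_size}")
```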
2. Measures of Dispersion –
These measures describe the spread of the data around the central value, indicating whether data is
scattered or concentrated.
Ex: The mean score of 5 students is 80, but all of them may not score 80. Some may score 100, and
some 40. We need the spread of these values for accurate analysis.
Variance – A measure of spread calculated around the mean of the data. The variance is the
average of the squared deviations from the arithmetic mean for a set of numbers.
Standard deviation- The square root of variance. While variance is in squared units, standard
deviation is in the same units as the data, making it useful for comparison.
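
As a small worked example of the definitions above, the sketch below computes the variance as the average of squared deviations from the mean and then takes its square root. The five scores are hypothetical, chosen so that the mean is 80.

```python
import statistics

# Hypothetical scores of 5 students with a mean of 80.
scores = [100, 40, 80, 90, 90]

mean_score = statistics.mean(scores)
# Population variance: the average of squared deviations from the mean.
variance = sum((s - mean_score) ** 2 for s in scores) / len(scores)
std_dev = variance ** 0.5

# The standard library gives the same results.
assert variance == statistics.pvariance(scores)
print(f"mean={mean_score}, variance={variance}, std dev={std_dev:.2f}")
```

When the data is a sample rather than the whole population, the sum of squared deviations is usually divided by n - 1 instead of n (statistics.variance and statistics.stdev do this).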
3. Skewness –
It is a measure of the asymmetry of a distribution,
typically seen in a graph. A distribution with a
longer tail on the left (most data points on the
right) is called left-skewed, while one with a longer
tail on the right is called right-skewed.
4. Kurtosis -
It is a measure of the tailedness of the distribution
curve, often described as its peakedness. Higher
kurtosis indicates heavier tails and a sharper peak
around the mean than a normal distribution.
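
Skewness and kurtosis can be computed from the central moments of the data. The sketch below uses the simple moment-based definitions (third moment divided by the 1.5 power of the second moment, and fourth moment divided by the square of the second moment); other textbooks and software apply small-sample corrections, so treat the exact values as one convention among several. The sample data is made up.

```python
def central_moment(data, order):
    """Average of (x - mean) raised to the given power."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** order for x in data) / len(data)

def skewness(data):
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def kurtosis(data):
    # A normal distribution has kurtosis 3 under this definition.
    return central_moment(data, 4) / central_moment(data, 2) ** 2

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]    # long tail to the right
left_skewed = [-x for x in right_skewed]    # mirror image: long tail to the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed), skewness(left_skewed), skewness(symmetric))
print(kurtosis(symmetric))
```

A positive result indicates a right-skewed sample, a negative result a left-skewed one, and a value near zero a roughly symmetric one.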

Data Measurement Scale:
There are many data points representing various objects. The data points recorded on these objects
may serve different purposes, so they can’t be interpreted in the same way. The scores of students
are averaged to get the overall performance of the students, but averaging their ID numbers doesn’t
represent anything. ID numbers are for counting and unique identification.
There are 4 types of measurement scale:

Nominal
The data points of an object are measured with values for categorization and classification.
Ex: Employee IDs given to people in a company help to identify and differentiate employees. The ID
of any employee doesn’t represent any superiority over other employees.
Ordinal
The data points of an object are measured with numerical values that have an ordering. Objects are
given ranks, but the intervals between any two consecutive ranks are not necessarily the same.
Ex: Three employees are given ranks 1, 2, and 3 based on productivity, but the differences in the
amount of productivity between workers ranked 1, 2, and 3 are not necessarily the same.
Interval
An ordinal scale in which the difference between two consecutive values is equal, but which
does not have an absolute zero as a reference point.
Ex: Temperature measured in degrees Celsius. Each degree is an equal step, but 0 does not mean
the absence of temperature; 0, -1, etc., are also temperatures.
Ratio
The interval scale, along with an absolute zero, a reference point for measurement.
Ex: Weight of two students, 100 kg and 50 kg. The weight of one student is twice that of the other, as
both weights are measured from 0; 100 is twice 50.
Data Distributions -
Data collected in an experiment or research often follows a standard distribution, such as
being concentrated or sparse, with unique characteristics that aid in quick evaluation and precise
analysis.
Normal Distribution -
The data collected is more centered around the
mean of the data. The graph looks like a bell-
shaped curve.
It is a continuous distribution.
It is a symmetrical distribution about its mean.
It is asymptotic to the horizontal axis.
It is unimodal.
Area under the curve is 1.
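Normal Distribution Formula: $f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$, where $\mu$ is the mean and $\sigma$ is the standard deviation.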
Ex: The weight of new born babies
tends to follow a normal distribution.
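
The properties listed above can be checked numerically with Python's statistics.NormalDist. The mean of 3.3 kg and standard deviation of 0.5 kg for newborn weights are assumed values for illustration, not figures from a study.

```python
from statistics import NormalDist

# Hypothetical newborn weights: mean 3.3 kg, standard deviation 0.5 kg.
birth_weight = NormalDist(mu=3.3, sigma=0.5)

# Share of babies within one and two standard deviations of the mean.
within_one_sd = birth_weight.cdf(3.8) - birth_weight.cdf(2.8)
within_two_sd = birth_weight.cdf(4.3) - birth_weight.cdf(2.3)

# Symmetry about the mean: half of the area lies below it.
below_mean = birth_weight.cdf(3.3)

print(f"{within_one_sd:.4f} {within_two_sd:.4f} {below_mean:.2f}")
```

Roughly 68% of values fall within one standard deviation of the mean and about 95% within two, whatever the mean and standard deviation are.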

Uniform Distribution -
The number of observations is the same across the
range of the data. The probability of occurrence of each
point is the same. The graph looks rectangular.
It is a symmetrical distribution about its mean.
It is bounded: the density is zero outside the range of the data.
Area under the curve is 1.
Uniform Distribution Formula: $f(x) = \frac{1}{b - a}$ for $a \le x \le b$, and $0$ otherwise.

Ex: Insurance premiums are equally likely to fall
anywhere between 41,000 and 47,000.
Poisson Distribution -
Poisson Distribution Formula: $P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$ for $k = 0, 1, 2, \dots$, where $\lambda$ is the expected number of occurrences per interval.

The distribution of the number of occurrences of a rare event in a fixed interval.
It is a discrete distribution.
It describes rare events.
Each occurrence is independent of the other
occurrences.
The expected number of occurrences must hold
constant throughout the experiment.
Ex: The number of defective items manufactured in
an hour in a production plant.
Exponential Distribution -
Exponential Formula: $f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$.

The distribution of the time between successive occurrences of a
rare event.
It is a family of distributions, one for each value of $\lambda$.
It is skewed to the right.
The x-values range from zero to infinity.
Ex: The time between occurrences of defective
items in a production plant.
Note : The parameter $\lambda$ is the same in the Poisson and exponential distributions when both describe the same process.
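
The link between the two distributions can be seen numerically. In the sketch below the rate of 2 defective items per hour is an assumed value; the probability of zero defects in an hour (Poisson) equals the probability that the wait for the next defect exceeds one hour (exponential), because both use the same rate parameter.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k occurrences in one interval."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def exponential_cdf(t: float, lam: float) -> float:
    """Probability that the waiting time until the next occurrence is at most t."""
    return 1 - math.exp(-lam * t)

lam = 2.0  # hypothetical rate: 2 defective items per hour

p_no_defect_in_hour = poisson_pmf(0, lam)
p_wait_longer_than_hour = 1 - exponential_cdf(1.0, lam)
mean_wait_hours = 1 / lam  # average time between defects

print(p_no_defect_in_hour, p_wait_longer_than_hour, mean_wait_hours)
```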

Bar Plot Definition : Bar plot is a visual representation that uses rectangular bars to
display data in a categorical manner. Each bar's length represents the
frequency, count, or proportion of a particular category or numerical value.
Bar plots are frequently utilized to compare discrete data points or categories,
aiding in the identification of patterns, trends, or differences among them.
They are widely employed in various fields such as statistics, economics,
finance, and social sciences for their simplicity and effectiveness in conveying
information visually.
Benefits (use a bar plot when):
Comparing parts of a bigger set of data, highlighting different categories,
or showing change over time.
Category labels are long, since a bar plot offers more space for them.
The dataset contains both positive and negative values.
Limitations (avoid a bar plot when):
You're using multiple data points.
You have many categories; avoid overloading your graph. Your graph
shouldn't have more than 10 bars.
For further reading refer - A complete guide to bar charts
Pie Chart Definition : A pie chart is a circular visualization tool used to represent data
proportions by dividing a circle into sectors. Each sector's size corresponds to
the proportion of the whole dataset it represents, typically expressed as
percentages. Pie charts provide a clear and concise way to illustrate the
relative contributions or distributions of different categories within a dataset.
They are commonly employed in presentations, reports, and dashboards to
convey the composition or distribution of data in a visually accessible format.
Benefits (use a pie chart when):
You have a total number that can be split up into 2-5 categories.
One category outweighs the others by a significant margin.
Limitations (avoid a pie chart when):
Your dimension has too many categories.
Similar percentages/numbers exist between different values within the
chosen dimension.
Data doesn't represent a uniform "whole", or the percentages don't
add up to 100 percent.
There are negative values or complex fractions in your measure value.
For further reading refer - Pie Charts: Using, Examples, and Interpreting

Histogram Definition : A histogram is a graphical representation of the distribution of
numerical data. It consists of a series of adjacent rectangular bars, where the
area of each bar corresponds to the frequency or proportion of data values
falling within a specific range or "bin" on the horizontal axis. Histograms are
commonly used to visualize the frequency distribution of continuous data,
allowing analysts to identify patterns, central tendencies, and variability within
the dataset. They are particularly useful for exploring the shape and
characteristics of distributions, such as whether they are symmetric, skewed,
or multimodal. Histograms are widely employed in fields such as statistics,
data analysis, and quality control to gain insights into the underlying structure
of data.
Benefits of Histogram
Visualize the distribution of a continuous dataset and frequency
Identify patterns like skewness, modality and kurtosis.
Detect outliers or anomalies that deviate from the rest of the data.
Limitations
Loss of individual data points due to binning.
Bin size choice impacts interpretation and can be misleading.
Not suitable for small datasets.
Limited to displaying one-dimensional data.
Difficult to compare multiple distributions on a single plot.
For further reading refer - Effective Use of Histograms
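
The point that bin size affects interpretation can be seen by counting the same data with two bin widths. The customer ages below are made up, and the helper function is a minimal sketch rather than a replacement for a plotting library.

```python
def histogram_counts(values, bin_start, bin_width, n_bins):
    """Count how many values fall in each bin [start, start + width)."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - bin_start) // bin_width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts

# Hypothetical customer ages.
ages = [21, 22, 25, 29, 31, 34, 35, 38, 41, 47, 52, 58]

narrow_bins = histogram_counts(ages, bin_start=20, bin_width=10, n_bins=4)
wide_bins = histogram_counts(ages, bin_start=20, bin_width=20, n_bins=2)
print(narrow_bins, wide_bins)
```

With 10-year bins the drop in customers after age 40 is visible; with 20-year bins that detail is merged away.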
Line Chart Definition : Line chart is a graphical representation of data points connected
by lines, used to visualize trends and changes over time or across a
continuous variable. The horizontal axis typically represents time or the
independent variable, while the vertical axis represents the dependent
variable.
Benefits of Line Graphs:
Simplify complex data sets into a digestible format.
Highlight relevant trends in large data sets, reducing cognitive
overload.
Easier interpretation and friendlier for non-technical users.
Compellingly communicate stories through the progression of data
points.
Aid in predicting future scenarios by analyzing historical trends.
Limitations of Line Graphs:

Less effective for analyzing complex trends with multiple variables.
Best suited for continuous or sequential data, not categorical or
discrete data.
Not suitable when there's no meaningful reason to connect data
points, such as representing discrete categories or unrelated data.
For further reading refer - A Complete Guide To The Power Of Line Graphs

Scatter Plot Definition : A scatter plot is a graphical representation that displays
individual data points plotted on a two-dimensional Cartesian coordinate
system. Each point represents the values of two variables, with one
variable on the x-axis and the other on the y-axis. Scatter plots are used
to visualize relationships, correlations, and patterns between the two
variables.
Benefits of Scatter Plots:

Visualize relationships between two variables, identifying patterns,
correlations, and trends.
Easily identify outliers for data cleaning and anomaly detection.
Assess correlation strength and direction by examining data point
clustering.
Suitable for large datasets.
Limitations of Scatter Plots:

Limited to two variables
Not suitable for categorical data
Limited in detecting non-linear relationships
For further reading refer - what is a scatterplot?
Box & Whisker Plot Definition : A box and whisker plot is a graphical representation of the
distribution of a dataset. It consists of a box that spans the interquartile range (IQR) of the
data, with a line inside the box representing the median. "Whiskers" extend from the box to
indicate the minimum and maximum values within a certain range, often defined by a
multiple of the IQR. Outliers, or data points that fall outside this range, are
typically plotted individually. Box plots provide a visual summary of central
tendency, variability, and the presence of outliers in the data.
Benefits of Box Plots:

Summarize data distribution: median, quartiles, and range.
Identify central tendency and variability
Detect outliers
Compare multiple datasets
Can easily represent skewed distributions
Limitations of Box Plots:

Simplified representation
Limited to univariate analysis
For further reading refer - Understanding and using Box and Whisker Plots
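
The numbers behind a box plot can be computed directly. The sketch below assumes the common convention of placing the whisker limits at 1.5 times the IQR beyond the quartiles and uses the "inclusive" quartile method from Python's statistics module; other tools may use slightly different quartile rules. The data values are invented.

```python
import statistics

# Hypothetical order values with one unusually large order.
data = sorted([12, 15, 17, 19, 20, 22, 24, 25, 60])

q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Whiskers end at the most extreme points that are not outliers.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(q1, median, q3, iqr, outliers, whisker_low, whisker_high)
```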

Heat Map Definition : A heatmap is a graphical representation of data where values
in a matrix are represented as colors. Each cell in the matrix corresponds to
a specific data point, and its color intensity represents the value of that data
point. Heatmaps are often used to visualize the magnitude of relationships
between two variables, with darker colors indicating higher values and
lighter colors indicating lower values. They are particularly useful for
identifying patterns, trends, and correlations within large datasets, making
complex data more interpretable at a glance. Heatmaps are commonly
used in fields such as data analysis, statistics, biology, and finance to explore
and communicate relationships in multidimensional data.
Benefits of Heatmaps:
Visualize large datasets concisely.
Identify patterns, trends, and correlations.
Facilitate exploration of multidimensional data.
Enhance decision-making with key insights.
Communicate complex information effectively to a wide audience.
Limitations of Heatmaps:
Limited to visualizing relationships between two variables.
Interpretation can be subjective and vary by viewer.
Sensitive to outliers, which can skew color representation.
Risk of overinterpretation and identifying spurious correlations.
For further reading refer - heat map (heatmap)

Tree Map
Definition :
A tree map is a hierarchical visualization method that displays data as nested
rectangles. Each rectangle's size represents a quantitative value, and the hierarchy is shown
through the placement and size of these rectangles.
When to Use :
Hierarchical data
Multiple levels of categories
Comparing the relative size of categories.
For Example- Visualizing the breakdown of expenses in a budget, where larger
rectangles represent higher spending categories.
Spider Chart
Definition :
A spider or radar chart displays multivariate data in a two-dimensional chart. Each variable
is represented by an axis starting from the same point, and the data points are connected
to form a polygon.
When to Use :
Comparing performance or characteristics of multiple entities across different variables,
especially when the variables are not directly comparable.
For example, evaluating the strengths and weaknesses of athletes in various sports
based on attributes like speed, endurance, and accuracy.
Gantt Chart
Definition :
A Gantt chart is a bar chart used for project management. It illustrates a project schedule
by showing the start and finish dates of tasks or elements as horizontal bars.
When to Use :
Visualizing project schedules
Dependencies
Progress over time
For Example- Managing production schedules in manufacturing industries to
ensure timely delivery.

Bubble Charts
Definition :
A bubble chart is a type of chart that displays three dimensions of data using bubbles. The x
and y-axis represent two variables, and the size of each bubble represents the third variable.
When To Use :
Relationships between three variables
Size of the data points is significant and adds an extra dimension.
For Example, analyzing the correlation between population density, GDP per
capita, and life expectancy in demographic studies.
Area Charts
Definition :
An area chart is a variation of a line chart where the area below the line is filled with
color or shading. It is commonly used to represent cumulative data over time.
When To Use :
Displaying trends over time
Emphasizing the cumulative aspect of the data.
For Example, visualizing the trend of website traffic or user engagement over
weeks, months, or years.
Violin Chart
Definition :
A violin chart is a method of plotting numeric data and a probability density function. It
resembles a box plot with a rotated kernel density plot on each side.
When To Use :
To visualize the distribution of data
To compare it across different categories or groups
When the data is non-parametric or lacks a normal distribution.
For Example, analysing the distribution of healthcare outcomes or patient recovery
times.

Ogive
Definition:
An ogive is a graph that represents the cumulative frequency distribution of a dataset.
It plots cumulative frequencies on the y-axis against corresponding data values on the x-axis.
When To Use :
To understand the cumulative distribution of data
To visualize the proportion of values below a certain threshold, especially in statistical
analysis and quality control processes.
For Example, analyzing the cumulative distribution of product delivery times or
customer wait times.
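
The points plotted on an ogive are running totals of the class frequencies. The sketch below uses made-up delivery-time data, with each class identified by its upper bound in days, to show how the cumulative frequencies and the share of deliveries below a threshold are obtained.

```python
from itertools import accumulate

# Hypothetical delivery times: class upper bounds (days) and orders in each class.
upper_bounds = [1, 2, 3, 4, 5]
frequencies = [5, 12, 20, 8, 5]

cumulative = list(accumulate(frequencies))
total = cumulative[-1]

# Points to plot: x = class upper bound, y = cumulative frequency.
ogive_points = list(zip(upper_bounds, cumulative))

# Proportion of orders delivered within 3 days.
share_within_3_days = cumulative[upper_bounds.index(3)] / total

print(ogive_points, share_within_3_days)
```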
For further exploration, refer to the links below:
Tableau library and reference for advanced charts
Data Visualization Catalogue
YouTube videos references

Inferential statistics is a branch of statistics that allows you to draw conclusions about a larger
population by analysing data from a smaller sample. It essentially makes educated guesses about
the whole based on a part.
Population -
The entire set of persons/objects/events sharing some characteristic about which the researcher
wants to make an inference.
Sample -
A subset of the population, generally used to make inferences about the population.
Ex- A soft drink company produces 280 gallons of juice, packed in 2000 bottles (the population).
To test the sugar content (20g), they select a sample of 20 bottles, check their sugar levels, and
on that basis conclude whether all 2000 bottles meet the required sugar content. Checking every
bottle would be wasteful.
There are two main goals of inferential statistics:
Estimation - Estimating population parameters, like the average income in a city, using sample
statistics like the average income in a small neighbourhood, representing entire population.
Hypothesis Testing -Testing hypotheses about some characteristics of population, like whether a
new fertilizer increases crop yield, by analysing data from a controlled experiment.
How to test? -
Using hypothesis statements
Null Hypothesis - States that any difference seen in the sample data arose by random chance, i.e. there
is no real effect.
(in words) The mean sugar content of the juice bottles is 20g.
Alternate Hypothesis - States that the difference seen in the sample data is due to a real effect, not
random chance.
(in words) The mean sugar content of the juice bottles is not 20g.
Test Statistic -
A numeric value that summarises how far the observed sample measure is from the value stated in the
null hypothesis, relative to the variation expected by chance.
Ex: In the juice bottle example, the test statistic is computed from the mean sugar content of the 20
sampled bottles.
Significance level -
The risk (probability) of rejecting the null hypothesis when it is true. The experimenter sets this value
before the test; if the p-value falls below it, the null hypothesis should be rejected.
Ex: If the significance level is 0.05, the null hypothesis (the sugar level is 20g) is rejected only when the
sample mean is so far from 20g that such a result would occur less than 5% of the time if the null
hypothesis were true.
Every test yields a p-value: the probability, assuming the null hypothesis is true, of getting a test
statistic at least as extreme as the one observed. If the p-value is 0.04 < 0.05, we reject the null hypothesis.
Confidence Level - 100% minus the significance level, e.g. 95% for a significance level of 0.05.

1. T-tests:
Used to compare the means of two groups, helpful for testing if a new treatment has a significant
effect. Independent Samples T-test (compares means of two independent groups):
Test Statistic: t = (x̄₁ - x̄₂) / (Sp√(1/n₁ + 1/n₂))
x̄₁ and x̄₂ are the sample means of groups 1 and 2.
Sp is the pooled standard deviation: Sp = √(((n₁ - 1)s₁² + (n₂ - 1)s₂²) / (n₁ + n₂ - 2)), where s₁² and s₂² are the sample variances.
n₁ and n₂ are the sample sizes of groups 1 and 2.
Example
A paired t-test (the t-test variant for two measurements on the same units) compares the average sales
before and after a campaign for the same group of stores: the average sales of 50 stores for the month
before the campaign are compared with the sales for the month after the campaign.
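The independent-samples test statistic given for the t-test can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name and the two small data sets are hypothetical and are not taken from the campaign example.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(group1, group2):
    """Return (t, degrees of freedom) for an independent-samples t-test."""
    n1, n2 = len(group1), len(group2)
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    sp = math.sqrt(pooled_var)
    t = (mean(group1) - mean(group2)) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Hypothetical sales figures for two independent groups of stores
treated = [12, 14, 16, 18, 20]
control = [10, 12, 14, 16, 18]
t_stat, dof = pooled_t_statistic(treated, control)
print(f"t = {t_stat:.2f} with {dof} degrees of freedom")
```

The resulting t value is then compared with the t distribution for the given degrees of freedom, which is usually done with statistical software.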
2. Z-tests:
Used to compare means when the population standard deviation is known, helpful for testing if a new
treatment has a significant effect. One-Sample Z-test (compares a sample mean to a hypothesized
population mean):
Test Statistic: z = (x̄ - μ) / (σ/√n)
x̄ - sample mean.
μ - hypothesized population mean.
σ - population standard deviation (for large samples, often approximated by the sample standard deviation if unknown).
n - the sample size.
Example
A one-sample Z-test applies if the sample size is large (n > 30) and the population standard deviation is known.
The average spend per customer in the existing stores is $100 with a known standard deviation of $15.
The new store's average spend is $105 from a sample of 100 customers. The Z-test will help determine
if this difference is statistically significant.
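As a worked illustration of the store example, the sketch below computes the z statistic and a two-sided p-value using only Python's standard library. The function name and the 0.05 significance level are assumptions made for the illustration.

```python
import math

def one_sample_z_test(sample_mean, pop_mean, pop_sd, n):
    """Return the z statistic and two-sided p-value for a one-sample z-test."""
    standard_error = pop_sd / math.sqrt(n)
    z = (sample_mean - pop_mean) / standard_error
    # Two-sided p-value under the standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

z, p_value = one_sample_z_test(sample_mean=105, pop_mean=100, pop_sd=15, n=100)
alpha = 0.05  # assumed significance level
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("Reject null hypothesis" if p_value < alpha else "Fail to reject null hypothesis")
```

Here z = (105 - 100) / (15/√100) ≈ 3.33, so the p-value is well below 0.05 and the difference would be judged statistically significant.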
3. Chi-Square Test:
Used to analyse relationships between categorical variables, useful for seeing if there's a connection
between hair colour and eye colour.
Test Statistic: χ² = Σ (Oi - Ei)² / Ei
χ² (chi-square) is the test statistic.
Oi - observed frequencies in each category.
Ei - expected frequencies in each category (calculated based on the null hypothesis).
Example
Chi-square test of independence to see if customer satisfaction levels (satisfied, neutral,
dissatisfied) are independent of the product category (electronics, clothing, groceries). Survey data
from 500 customers is analysed to check if the proportion of satisfied customers differs across
different product categories.
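The chi-square statistic for a test of independence can be computed by hand as sketched below. The counts are hypothetical (invented to total 500 customers) and the function name is an assumption; expected frequencies are taken as row total × column total / grand total, which is what independence implies.

```python
def chi_square_statistic(observed):
    """Return (chi-square, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total  # count expected under independence
            chi2 += (o - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, dof

# Hypothetical survey counts; columns: satisfied, neutral, dissatisfied
observed = [
    [120, 50, 30],  # electronics
    [80, 40, 30],   # clothing
    [100, 30, 20],  # groceries
]
chi2, dof = chi_square_statistic(observed)
print(f"chi-square = {chi2:.2f}, degrees of freedom = {dof}")
```

For these made-up counts the statistic is about 6.04 with 4 degrees of freedom, below the 5% critical value of 9.488, so independence would not be rejected.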
4. ANOVA (Analysis of Variance):
Compares means of more than two groups, helpful for studies with multiple treatment groups. The
F-statistic compares the variance between groups with the variance within groups. The specific formula
involves sums of squares within and between groups and is generally computed via software packages.
Example
One-way ANOVA to compare the means of sales across three teams. Sales data from Team A,
Team B, and Team C over a quarter is collected. ANOVA will help determine if the differences in
average sales among the three teams are statistically significant.
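To make the "sums of squares within and between groups" concrete, here is a minimal sketch of the one-way ANOVA F-statistic. The team sales figures and the function name are hypothetical; in practice a statistics package would also report the p-value.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the F-statistic for a one-way ANOVA over a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical monthly sales for three teams
team_a = [10, 12, 14]
team_b = [14, 16, 18]
team_c = [18, 20, 22]
f_stat = one_way_anova_f([team_a, team_b, team_c])
print(f"F = {f_stat:.2f}")
```

A large F value means the team means differ by more than the spread inside each team would explain.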
Regression is a supervised Machine Learning technique that builds a function to predict a variable
from other variables using historical (or) collected data. The variable being predicted is called the
dependent (or) response variable. The variable(s) used to predict are called independent (or)
explanatory variables.
Correlation - Strength of relationship between 2 variables
Causation - Unidirectional effect of one variable on other
Regression - Statistical relationship between two variables
Linear Regression: It is a type of regression in which a linear function is formed between the dependent
and independent variables (fitting a straight line through the dependent and independent data
points). Example: Price of a house depends on the square ft area.
The estimation is done using an equation writing price as a linear function of square ft area. The
coefficients are estimated by using the “Ordinary Least Squares Method”.
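For a single independent variable, the Ordinary Least Squares estimates have a simple closed form: slope = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², intercept = ȳ - slope × x̄. The sketch below applies it to a few hypothetical house records; the data and the function name are illustrative assumptions, not the data behind the pre-read's equation.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data: square ft area and house price
area = [1000, 1500, 2000, 2500]
price = [75, 85, 90, 100]
intercept, slope = fit_simple_ols(area, price)
print(f"Price of House = {intercept:.1f} + {slope:.3f}*(Square ft area)")
```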
Applications:
1) Forecasting - It can be used for forecasting the dependent variable.
2) Testing Significance - To check whether a variable can be explained in terms of other variables or not.
Equation:
Price of House = 57.4 + 0.017*(Square ft area)
[Figure: Price of House plotted against Square feet area]
Multiple Linear regression :
Type of regression to predict the value of one dependent variable based on multiple independent
variables by forming a linear relationship between them.
Example: Price of a house depends not only on square ft area but also on water index, age of the house etc.
Equation:
Price of House = 57.4 + 0.017*(Square ft area) - 0.666*(Age)
Error (or) Residual : The difference between actual value and estimated value.
Error = Actual Value - Estimated Value
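A short sketch of how the multiple regression equation and the residual are used together. The coefficients come from the equation above; the house (2000 sq ft, 10 years old) and its actual price are hypothetical values chosen for illustration, in the same units as the equation.

```python
def predict_price(square_feet, age):
    """Estimated price from the pre-read's fitted equation."""
    return 57.4 + 0.017 * square_feet - 0.666 * age

# Hypothetical house: 2000 sq ft, 10 years old, with an assumed actual price
estimated = predict_price(square_feet=2000, age=10)
actual = 90.0
residual = actual - estimated  # Error = Actual Value - Estimated Value
print(f"Estimated = {estimated:.2f}, Residual = {residual:.2f}")
```

A positive residual means the model under-estimated the price; a negative one means it over-estimated.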

Assumptions of Linear Regression:
Linear regression gives accurate predictions only when the following assumptions hold.
[Figures: Residual vs Actual plots, one per assumption]
1. Linear Relation: There exists a linear relationship between the dependent variable and the
independent variables. This can be verified by making plots.
2. Independence of errors: The errors should not depend on one another, e.g. by following an
increasing or decreasing trend, which would make the current error depend on the previous error.
3. Homoskedasticity: The error variance is constant across the values of the independent variables.
This ensures that the estimation error is uniform.
4. Normality: The residuals (difference between actual and calculated data points) are normally
distributed around 0. This keeps significance tests and confidence intervals on the coefficients reliable.
5. No Multicollinearity: The independent variables should not be strongly correlated with each other.
This ensures no cascading effect, stable coefficient estimates, and accurate modelling and prediction.
Interpretation of Regression Output:
A study was conducted to estimate house prices based on the square feet area and the age of the house.
Here are the results.
1) ANOVA Result: Tests whether at least one independent variable contributes significantly to
linearly predicting the dependent variable.
As P-Value < 0.05, we reject the Null Hypothesis, i.e. at least one of Square Feet and Age is
significant in predicting the price of the house.
2) Coefficient of Determination, R^2: The proportion of the variation in the dependent variable that
is explained by the independent variables. It is between 0 and 1.
74.1% is good; generally anything > 50% is considered good. Sometimes a lower R^2 is acceptable.
3) Adjusted R^2: R^2 adjusted for the number of independent variables. It falls below R^2 when
redundant variables are included, because they add little explanatory power.
71.52%, a decrease of about 2.6 percentage points from 74.1%, indicates that Square Feet and Age are
not very redundant.
4) Independent variable coefficients: Each coefficient gives the size and direction of the effect of an
independent variable on the dependent variable, and its p-value tells whether that effect is significant.
As the p-values of all variables are < 0.05, Square Feet and Age are both significant in predicting the
price of the house.
5) Equation: The linear equation can be formed using the coefficients and intercept.
Y = 57.4 + 0.017*x1 - 0.666*x2
Price of House = 57.4 + 0.017*(Square Feet) - 0.666*(Age)
With price measured in thousands of dollars: if the area increases by 1 sq ft, the price increases by
about $17, holding age constant.
If the age increases by 1 year, the price decreases by $666, holding area constant.
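The link between R^2 and Adjusted R^2 in the output above follows the standard formula Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1), where n is the number of observations and k the number of independent variables. The sketch below is illustrative only: the pre-read does not state the sample size, so n = 23 is a hypothetical value chosen for the demonstration.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Adjusted R^2: penalises R^2 for the number of independent variables used."""
    penalty = (n_observations - 1) / (n_observations - n_predictors - 1)
    return 1 - (1 - r_squared) * penalty

# R^2 = 0.741 with 2 predictors (Square Feet, Age); n = 23 is an assumed sample size
adj = adjusted_r_squared(r_squared=0.741, n_observations=23, n_predictors=2)
print(f"Adjusted R^2 = {adj:.4f}")
```

Adding a predictor always raises the penalty term, so Adjusted R^2 only improves if the new variable explains enough extra variation.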
Practical usage of Linear Regression:
Forecasting sales - The number of ice creams sold in a month can be estimated by regressing on the
number of cold sore cases. At first glance this may seem odd: do people with cold sores eat more ice
cream? Cold sores are caused by the herpes simplex virus, but outbreaks are triggered by extreme heat
and exposure to sunlight, which are also the reason for eating ice creams.
Estimating sales - Suppose a company makes a high-quality salt which is a premium product. It
wants to predict the sales, so it can estimate them by regressing on the number of cars and ACs present
in the locality. For a premium product, the target customers are higher-income people who have an AC
and a car.
What variables (or) factors to take for regression?
1) Measurable - Variables that are measurable on a ratio scale give better results.
2) Related or partially related - Take variables that are related to the dependent variable, either
positively or negatively; variables that have little effect on the dependent variable should not be taken.
3) Linearity - Make a scatter plot w.r.t. each independent variable and check if the points form a line.
How to find these factors - Doing Qualitative Analysis.
Note: Increasing the no. of independent variables decreases the error but also decreases the
explainability of the model. The analyst must trade off between them to get the best model.
FAQ
1) What if a variable is nominal, like gender?
A) Code it as a dummy (0/1) variable, e.g. Male as 1 and Female as 0.
2) What if a variable like Job has 3 nominal categories: “IT”, “Manufacturing”, “Dev Ops”?
A) Do not code them as 1, 2, 3, as that makes one category greater than another. Instead take two of
them, “IT” and “Manufacturing”, as 0/1 independent variables; if a record is neither of these two, the job
must be “Dev Ops”, so a third variable for it would be redundant and would only make the model more
complex. Generally, variables with 3 or more nominal categories are not considered; otherwise, cluster
the categories until 2 clusters remain, coded 0 and 1.
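The coding described in the second answer can be sketched as follows. The helper name and the sample records are hypothetical; the last category ("Dev Ops") is left out on purpose because it is implied when both indicators are 0.

```python
def dummy_code(values, categories):
    """Code a nominal variable as 0/1 indicators, dropping the last category."""
    kept = categories[:-1]  # the dropped category becomes the all-zeros baseline
    return [[1 if value == category else 0 for category in kept] for value in values]

# Hypothetical records; indicator columns are [is_IT, is_Manufacturing]
jobs = ["IT", "Manufacturing", "Dev Ops", "IT"]
encoded = dummy_code(jobs, ["IT", "Manufacturing", "Dev Ops"])
for job, row in zip(jobs, encoded):
    print(job, row)
```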
Regression Diagnostics - What should be done in case the regression model is not significant?
Increase or decrease the power of independent factors - Ex: taking X^2 instead of X. Before doing
regression, draw a scatter plot to check whether the points appear to be linear; if not, change the
power of factor X (e.g. X to X^2) until the plot looks like a straight line.
[Figure: according to the plot, change the power of the independent variable]
Interaction Effect - Sometimes the interaction of independent variables also has an effect. Consider
estimating forest growth, which depends on rainfall and temperature. Trees require both water and
sunlight; one without the other does not give growth, so a new factor, the product of temperature and
rainfall, should be taken for estimating the growth of forests.
Post Regression Validating Model: After regression, the model is only valid in case
1) Errors are normally distributed, and the error variance remains the same across the predictor variables
2) Errors are not autocorrelated (if they are, Time Series Regression needs to be done)

What is Logistic Regression?
Logistic Regression is a supervised classification model. It allows you to make predictions from
labelled data, if the target (output) variable is categorical. The logistic regression model predicts
the probability that a given input belongs to a certain class, by fitting a logistic curve (sigmoid
function) to the data, which maps any real-valued input to output in the range [0, 1].
When is Logistic Regression model used?

Logistic regression is used for binary classification tasks, where the outcome variable (dependent
variable) is categorical and has only two possible values, such as "yes" or "no", "0" or "1", "spam" or
"not spam", etc. It is widely used in various fields such as finance (to predict credit default),
marketing (to predict customer churn), and more. It's a fundamental tool in the machine learning
toolbox due to its simplicity, interpretability, and effectiveness in many practical scenarios.
How does the Logistic Regression model look graphically?
A sigmoid curve models the logistic regression. Maximum likelihood estimation is used to determine
β0 and β1, similar to the least squares estimation in linear regression.
How to interpret the Logistic Regression model?
ln(p/(1-p)) is called the log odds. The odds of passing the exam, p/(1-p), indicate how much more
likely a person is to pass the exam than to fail. Near the middle of the sigmoid curve, a small increase
in the independent variable increases the probability of success sharply.
If hours studied is 0, the log odds are -3 (a low value), which indicates that the probability of passing is
lower than that of failing. If hours studied is 6, the log odds are 6 and the probability of passing is much
higher.
On rearranging, we get the probability of success, p = 1/(1 + e^(-log odds)):
If Hours studied = 0, probability of pass ≈ 0.0474
If Hours studied = 6, probability of pass ≈ 0.9975
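The conversion from log odds to probability can be checked with a few lines of Python. The coefficients are an assumption: an intercept of -3 and a slope of 1.5 per hour are inferred from the two stated log odds (-3 at 0 hours, 6 at 6 hours), not quoted from the pre-read.

```python
import math

def pass_probability(hours, beta0=-3.0, beta1=1.5):
    """Probability of passing from the logistic model: log odds = beta0 + beta1 * hours."""
    log_odds = beta0 + beta1 * hours
    return 1 / (1 + math.exp(-log_odds))  # sigmoid turns log odds into a probability

for hours in (0, 2, 6):
    print(f"Hours studied = {hours}: probability of pass = {pass_probability(hours):.4f}")
```

At 2 hours the log odds are 0, which corresponds to a probability of exactly 0.5, the point where passing and failing are equally likely.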
What is multivariate Logistic Regression?
Multivariate logistic regression is basically just an extension of the univariate equation to more
than one independent variable.

What are the methods of evaluating logistic regression models?
Logistic Regression models are evaluated by metrics like accuracy, sensitivity/recall and
specificity, which are derived from the confusion matrix. Based on the purpose of the model, the
classification threshold needs to be adjusted to get acceptable values of these metrics.

Confusion matrix (Total population = P + N)
                      Predicted Positive (PP)   Predicted Negative (PN)
Actual Positive (P)   True positive (TP)        False negative (FN)
Actual Negative (N)   False positive (FP)       True negative (TN)
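The three metrics named above follow directly from the four cells of the confusion matrix. The sketch below uses hypothetical counts and an illustrative function name.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity (recall) and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),  # share of all predictions that are correct
        "sensitivity": tp / (tp + fn),                # share of actual positives caught
        "specificity": tn / (tn + fp),                # share of actual negatives caught
    }

# Hypothetical counts for 100 customers
metrics = classification_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Lowering the classification threshold typically raises sensitivity at the cost of specificity, which is why the threshold is tuned to the purpose of the model.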
What is Time Series data?

Time series data is a sequential collection of data points recorded over time intervals.
These intervals can be regular or irregular and typically include measurements of one
or more variables. Examples include daily stock prices, monthly sales figures, hourly
temperature readings, or the number of orders made on a food ordering app per day.
What is Time Series Analysis?

Performing analysis on time-series data to find useful insights and patterns is termed time
series analysis. Let's take a food ordering app example. The app might have the data for every day
logged per hour. They might notice that in this data, the number of orders is significantly
higher in, say, the 1-2 PM time slot but is significantly lower in the 3-4 PM time slot. This
information might be useful for them as they would then be able to estimate the number of
delivery boys required at a particular time of the day. Hence, time series analysis is indispensable
while working with any time series data.

What is Time Series Forecasting?
Time series forecasting is basically looking at the past data to make predictions into the future.
Say that the food ordering app wants to predict the number of orders per day for the next month
in order to plan the resources better. For this, they will look at tons of past data and use it in order
to forecast accurately.
Basic Steps for forecasting
1. Define the problem
2. Collect the data
3. Analyze the data
4. Build and evaluate the forecast model
Three important attributes that every time series data must exhibit
Relevant: The time-series data should be relevant for the set objective that we want to achieve.
Accurate: The data should be accurate in terms of capturing the timestamps and capturing the
observation correctly.
Long enough: The data should be long enough to forecast. This is because it is important to identify
all the patterns in the past and forecast which patterns repeat in the future.
Components that contribute to the behaviour of time series data

1. Level
Level is the baseline value of the time series data. It represents the central value around which
data points fluctuate. The level can be thought of as the average value of the series over a period.

2. Trend
Trend describes the long-term direction in the data. It shows whether the data points are
generally increasing, decreasing, or remaining constant over time. Trends can be linear or
nonlinear.
3. Seasonality
Seasonality refers to regular, repeating patterns or cycles in the data that occur at fixed intervals
(such as monthly, quarterly, or annually). These patterns are often driven by seasonal factors like
weather, holidays, or other periodic events.
4. Cyclicity
Cyclicity involves longer-term fluctuations that do not occur at regular intervals and are often
related to economic or business cycles. Unlike seasonality, cyclic patterns do not have a fixed
period and can be irregular and unpredictable.
5. Noise
Noise represents the random variations or "white noise" in the data. These are the residuals that
cannot be attributed to the level, trend, seasonality, or cyclicity. Noise is essentially the
unpredictable part of the time series.

The final time series data is a combination of level, trend, seasonality, cyclicity, and noise.
These components interact to form the observed data.
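One way to see how the components combine is to build a synthetic series from them. The sketch below assumes an additive combination (components simply summed), a linear trend, and a sine-shaped yearly season; cyclicity is left out for brevity. All parameter values and the function name are hypothetical.

```python
import math
import random

def synthetic_series(n_periods, level=100.0, slope=2.0, season_length=12,
                     amplitude=10.0, noise_sd=3.0, seed=0):
    """Build an additive series: level + trend + seasonality + noise."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    series = []
    for t in range(n_periods):
        trend = slope * t
        seasonality = amplitude * math.sin(2 * math.pi * t / season_length)
        noise = rng.gauss(0, noise_sd)
        series.append(level + trend + seasonality + noise)
    return series

monthly_sales = synthetic_series(n_periods=24)
print([round(value, 1) for value in monthly_sales[:6]])
```

Setting `amplitude` or `noise_sd` to 0 removes that component, which is a quick way to see what each one contributes.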
Basic Time series Forecasting methods
1. Naive Method
Naive Method assumes that the forecast for the next period is equal to the last observed value. It's
simple but can be effective for very short-term forecasts.
Example: If sales were 100 units this month, the naive forecast for next month is also 100 units.
2. Moving Average
Moving Average method smooths the time series by averaging the data points from several
consecutive periods. This helps to reduce noise and highlight underlying trends.
Example: A 3-period moving average for sales data (100, 150, 200) would be (100+150+200)/3 = 150.
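The first two methods are simple enough to sketch directly. The function names are illustrative; the sales figures are the ones used in the moving average example.

```python
def naive_forecast(series):
    """Naive method: next value equals the last observed value."""
    return series[-1]

def moving_average_forecast(series, window):
    """Moving average: mean of the most recent `window` observations."""
    recent = series[-window:]
    return sum(recent) / window

sales = [100, 150, 200]
print("Naive forecast:", naive_forecast(sales))
print("3-period moving average:", moving_average_forecast(sales, window=3))
```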
3. Exponential Smoothing
Exponential Smoothing assigns exponentially decreasing weights to past observations, giving more
importance to recent data. This method is good for short-term forecasts and can be extended to handle
trends (Holt’s method) and seasonality (Holt-Winters method).
Example: If the last observation was 100 units and the smoothing parameter is 0.5, the forecast might
be 0.5 * 100 + 0.5 * (previous forecast).
4. ARIMA (Auto Regressive Integrated Moving Average)
ARIMA combines autoregression (AR), differencing (I), and moving average (MA) models to handle
various types of time series data. It can model data with trends and seasonality by including seasonal
components.
Example: ARIMA(1,1,1) means using one lag for autoregression, differencing once to make the series
stationary, and using one lag of the forecast error.
5. Linear Regression
Linear Regression models the relationship between the dependent time series variable and one or
more independent variables. It can be used for trend analysis and forecasting when there is a clear
linear relationship.
Example: Forecasting sales based on advertising spend, where sales = a + b * (advertising spend).
6. Seasonal Decomposition of Time Series (STL)
STL decomposition separates a time series into seasonal, trend, and residual components. Each
component can be analysed and forecasted individually before recombining them for the final forecast.
Example: Decomposing monthly sales data to understand and forecast the seasonal pattern, trend, and
irregular components.
7. Simple Exponential Smoothing (SES)
SES is suitable for time series without trends or seasonality. It uses a weighted average of past
observations, where the weights decrease exponentially as the observations get older.
Example: Forecast = alpha * (last observation) + (1 - alpha) * (last forecast).
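The SES update rule can be applied repeatedly over a series as sketched below. Starting the forecast at the first observation is one common initialisation and is an assumption here, as are the function name and the sample data.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the one-step-ahead SES forecast after processing the whole series."""
    forecast = series[0]  # assumed initialisation: start from the first observation
    for observation in series:
        # Forecast = alpha * (last observation) + (1 - alpha) * (last forecast)
        forecast = alpha * observation + (1 - alpha) * forecast
    return forecast

sales = [100, 120, 110]  # hypothetical sales
next_forecast = simple_exponential_smoothing(sales, alpha=0.5)
print("SES forecast for next period:", next_forecast)
```

A higher alpha makes the forecast react faster to recent observations; a lower alpha gives a smoother forecast.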
8. Holt’s Linear Trend Model
Holt’s Linear Trend Model extends simple exponential smoothing to capture linear trends by
incorporating two equations: one for the level and one for the trend.
Example: Level and trend equations are used to update forecasts by taking both recent data and trend
changes into account.
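A minimal sketch of the two Holt equations follows. The initialisation (level = first observation, trend = first difference) is one common choice and is an assumption, as are the function name, smoothing parameters and data.

```python
def holt_linear_forecast(series, alpha, beta, horizon=1):
    """Forecast `horizon` steps ahead with Holt's linear trend method."""
    level = series[0]
    trend = series[1] - series[0]  # assumed initial trend: first difference
    for observation in series[1:]:
        previous_level = level
        # Level equation: blend the new observation with the previous projection
        level = alpha * observation + (1 - alpha) * (previous_level + trend)
        # Trend equation: blend the latest change in level with the previous trend
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return level + horizon * trend

sales = [10, 12, 14, 16]  # hypothetical series growing by 2 each period
print("Next period forecast:", holt_linear_forecast(sales, alpha=0.5, beta=0.5))
```

For a perfectly linear series like this one, the method simply continues the line, so the next forecast is 18.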
9. Holt-Winters Seasonal Method
Holt-Winters Seasonal Method extends Holt’s model to include seasonality. It has three components:
level, trend, and seasonal.
Example: Used for data with both trend and seasonality, like monthly sales data showing seasonal
fluctuations and a trend.

Key Words
Data Mining
Data Lake
Predictive Modeling
Exploratory Data Analysis (EDA)
Key Performance Indicators (KPIs)
Hypothesis Testing
Correlation vs. Causation
Data Visualisation
Dimensionality Reduction
Outliers
Segmentation
Regression Analysis
Churn Analysis
ROC Curve
Lift Chart
Random Forest
Support Vector Machine (SVM)
Bayesian Inference
Anomaly Detection
Sentiment Analysis
ETL (Extract, Transform, Load)
Time Series Analysis
Clustering
Market Basket Analysis
Neural Networks
Decision Trees
Logistic Regression
Feature Engineering
Overfitting
Cross-Validation
Regularisation
Principal Component Analysis (PCA)
A/B Testing
Confusion Matrix
F1 Score
Precision and Recall
Seasonal Decomposition of Time Series (STL)
Simple Exponential Smoothing (SES)
Holt-Winters Seasonal Method
Big Data Analytics

FAQs
What is business analytics?
What is the difference between descriptive, predictive, and prescriptive analytics?
What are the main steps in a typical data analysis project?
How do you handle missing data in a dataset?
What is data normalisation and why is it important?
What is the difference between data mining and data lake?
Explain the concept of overfitting and how to prevent it.
How do you measure the performance of a predictive model?
What is a confusion matrix and what information does it provide?
What are the key differences between supervised and unsupervised learning?
What is logistic regression and when would you use it?
Can you explain what a decision tree is and how it works?
What is a time series analysis and where is it commonly applied?
What are clustering algorithms and what are some common examples?
How would you explain the term "Big Data"?
How do you handle multicollinearity in regression models?
Explain the concept of p-value in hypothesis testing.
What is the significance of the R-squared value in linear regression?
How do you interpret the coefficients of a regression model?
How do you decide which type of chart or graph to use for a given dataset?
What are some common data preprocessing steps you take before analysis?
