2024 BA Pre-Read
BUSINESS ANALYTICS
3 Types of Analytics
5 Measures of Data
6 Data Distributions
7 Basic Charts
8 Advanced Charts
9 Inferential Statistics
10 Linear Regression
11 Logistic Regression
13 FAQ
Analysis and Documentation: Analysing gathered requirements, identifying gaps, and documenting detailed business requirements and use cases.
Testing and Validation: Participating in testing activities to validate that developed solutions meet the specified requirements and address business needs.
Solution Design: Working with developers, architects, and other team members to design solutions that meet business needs and align with organizational goals.
Change Management: Supporting organizational change by assessing the impact of proposed changes, identifying risks, and helping stakeholders navigate transitions.
[Figure: process diagram; recoverable labels: "Deployment", "Adjusting model parameters for robustness"]
Imagine you have a giant box filled with receipts from your store. Descriptive analytics is like sorting
through that box and organizing everything neatly. It helps you understand what has happened in
the past by summarizing your data and answering basic questions like:
What: What are my top-selling products?
When: When do sales typically peak?
Where: Where are my customers located?
Who: Who are my typical customers (age, gender, etc.)?
How Much: How much revenue did I generate last month?
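The questions above can be sketched as simple summaries over a list of receipts. This is a minimal illustration only; the receipt records, product names, and cities are hypothetical and not taken from any real store.

```python
from collections import Counter, defaultdict

# Hypothetical receipts: (date "YYYY-MM-DD", product, customer city, amount)
receipts = [
    ("2024-01-05", "running shoes", "Pune", 120.0),
    ("2024-01-17", "sandals", "Mumbai", 40.0),
    ("2024-02-02", "running shoes", "Pune", 120.0),
    ("2024-02-11", "sneakers", "Delhi", 80.0),
    ("2024-02-20", "running shoes", "Mumbai", 120.0),
]

# What: count units sold per product and pick the top seller
units_by_product = Counter(product for _, product, _, _ in receipts)
top_product = units_by_product.most_common(1)[0][0]

# When / How much: total revenue per month (first 7 characters are "YYYY-MM")
revenue_by_month = defaultdict(float)
for date, _, _, amount in receipts:
    revenue_by_month[date[:7]] += amount
peak_month = max(revenue_by_month, key=revenue_by_month.get)
feb_revenue = revenue_by_month["2024-02"]

# Where: number of purchases per city
customers_by_city = Counter(city for _, _, city, _ in receipts)

print("Top product:", top_product)
print("Peak month:", peak_month)
print("Purchases by city:", dict(customers_by_city))
```

Each summary only describes what already happened; none of them explains why, which is the job of diagnostic analytics.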
Industry Example:
"Sole Mates," a popular shoe store, faces a mystery – running shoe sales are slumping!
Descriptive analytics showed the "what," but not the "why."
Here's where diagnostic analytics steps in:
Finding the Culprit: "Sole Mates" uses various data sources to diagnose the
problem:
Customer Reviews: They analyze reviews to see if there are consistent
complaints about recent running shoe models (comfort, design,
performance).
Website Traffic: They check if traffic to running shoe pages has
decreased. Are users bouncing off quickly, suggesting navigation issues?
Sales Data: They examine sales data to see if specific models or sizes are
underperforming.
Taking Action:
Once diagnosed, "Sole Mates" can take corrective actions to improve sales, like addressing quality
issues, promoting specific models, or enhancing website navigation.
By using diagnostic analytics, "Sole Mates" can move beyond the "what" and uncover the "why,"
allowing them to make data-driven decisions for success.
Predictive analytics is used across various industries, from retail forecasting future sales trends to
healthcare predicting disease outbreaks. It's a powerful tool that helps businesses make informed
decisions based on data-driven insights about the future.
Industry Example:
Bank XYZ can use data analysis to build a credit default risk model to improve their loan approval
process. This model can be built using decision trees or random forests. Decision trees ask a series of
questions about a new loan applicant's data to predict their default risk. Random forests involve
creating multiple decision trees, each with slight variations, and basing the final decision on the
majority vote from all the trees. This approach can provide a more robust and reliable prediction
compared to a single decision tree.
By using a credit default risk model, Bank XYZ can make more informed decisions about approving or
rejecting loan applications. They can also adjust loan terms (interest rate, loan amount) based on the
predicted risk.
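The majority-vote idea behind a random forest can be sketched as follows. The three "trees" here are hand-written rules with hypothetical thresholds and field names (income, debt_ratio, missed_payments); a real random forest would learn its trees from historical loan data rather than use fixed rules.

```python
from collections import Counter

# Each hypothetical "tree" asks one question about the applicant.
def tree_income(applicant):
    return "default" if applicant["income"] < 30000 else "repay"

def tree_debt(applicant):
    return "default" if applicant["debt_ratio"] > 0.5 else "repay"

def tree_history(applicant):
    return "default" if applicant["missed_payments"] >= 2 else "repay"

# An odd number of trees avoids ties between the two classes.
FOREST = [tree_income, tree_debt, tree_history]

def forest_predict(applicant):
    """Return the class predicted by the majority of trees."""
    votes = Counter(tree(applicant) for tree in FOREST)
    return votes.most_common(1)[0][0]

applicant = {"income": 45000, "debt_ratio": 0.6, "missed_payments": 0}
print(forest_predict(applicant))
```

For this applicant two trees vote "repay" and one votes "default", so the combined prediction is "repay" even though a single tree (the debt rule) disagrees.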
Recommending Maintenance: Based on the prediction, the system might recommend preventive
maintenance, like replacing a worn-out part. This helps minimize downtime and keep production
running smoothly.
Types of Data:
Based on characteristics -
Qualitative Data - It is unstructured data collected in the form of text, images, and videos. Generally involves rigorous data collection for a deeper understanding of the research objective.
Methods of collection - Interviews, focus groups.
Purpose - Creating new theories, generating hypotheses from sensitive and complex data.
Mean– The average value of the dataset. It is affected by very small or large values, which can pull the
mean towards them, making it less effective as a central measure. The rare instances of these
extreme values (outliers) make the mean less representative.
Ex: Performance of students measured by average marks scored, average number of customers
coming to a shop in a day.
Median – The middle value in an ordered dataset. It is not affected by extreme values.
Ex: Consider The Great Khali and five female air hostesses; the mean of their weights is more than the weight of any of the five air hostesses, making the mean a non-representative value, whereas the median stays close to a typical air hostess's weight.
Mode – The most frequently occurring value in the dataset. It is not affected by extreme values and is
mostly used for categorical data.
Ex: The most common foot size is used to plan mass production of footwear, saving production costs.
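A short sketch of the three measures, using Python's standard `statistics` module. The weights and shoe sizes below are hypothetical numbers chosen only to mirror the examples above.

```python
import statistics

# Hypothetical weights in kg: one very heavy wrestler and five air hostesses
weights = [157, 50, 52, 55, 58, 60]

mean_weight = statistics.mean(weights)      # pulled upwards by the outlier
median_weight = statistics.median(weights)  # middle of the ordered values

# Hypothetical shoe sizes sold in a day; the mode is the most frequent size
shoe_sizes = [7, 8, 8, 9, 8, 10, 9]
most_common_size = statistics.mode(shoe_sizes)

print("Mean weight:", mean_weight)
print("Median weight:", median_weight)
print("Most common shoe size:", most_common_size)
```

The mean (72 kg) is higher than every air hostess's weight, while the median (56.5 kg) stays near the typical value.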
2. Measures of Dispersion –
These measures describe the spread of the data around the central value, indicating whether data is
scattered or concentrated.
Ex: The mean score of 5 students is 80, but not all of them need to have scored 80. Some may have scored 100, and some 40. We need the spread of these values for accurate analysis.
Variance – An arithmetic method to calculate spread using the mean of the data. The variance is the
average of the squared deviations from the arithmetic mean for a set of numbers.
Standard deviation - The square root of variance. While variance is in squared units, standard deviation is in the same units as the data, making it useful for comparison.
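As a worked illustration, the sketch below computes variance and standard deviation for five hypothetical scores whose mean is 80. It uses the population formulas (`pvariance`, `pstdev`), which match the definition above of variance as the average of squared deviations.

```python
import statistics

# Hypothetical scores of 5 students; the mean is 80
scores = [100, 40, 90, 80, 90]

mean_score = statistics.mean(scores)

# Variance computed by hand: average of squared deviations from the mean
squared_deviations = [(s - mean_score) ** 2 for s in scores]
variance_by_hand = sum(squared_deviations) / len(scores)

# The same quantities from the standard library
variance = statistics.pvariance(scores)
std_dev = statistics.pstdev(scores)

print("Mean:", mean_score)
print("Variance:", variance)
print("Standard deviation:", round(std_dev, 2))
```

The variance is 440 (in squared marks), while the standard deviation is about 20.98 marks, which can be compared directly with the scores.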
3. Skewness –
It is the asymmetry of a data distribution, typically seen in a graph. Distributions with a longer tail on the left side (most data points concentrated on the right) are called left-skewed, while those with a longer tail on the right side (most data points concentrated on the left) are called right-skewed.
4. Kurtosis -
It is commonly described as the measure of the peak of the distribution curve, and it also reflects how heavy the tails are. It signifies the accumulation of points around the mean. The higher the peak, the more data points are present around the mean.
Ordinal
The data points of an object are measured with numerical values that have an ordering. Objects are
given ranks, but the intervals between any two consecutive ranks are not the same.
Ex: Three employees are given ranks 1, 2, and 3 based on productivity, but the differences in the
amount of productivity between workers ranked 1, 2, and 3 are not necessarily the same.
Interval
An ordinal scale in which the difference between two consecutive ordered values is equal, but which does not have an absolute zero as a reference point.
Ex: Temperature in degrees Celsius. The intervals between consecutive degrees are equal, but 0 does not mean an absence of temperature; 0, -1, etc., are also temperatures.
Ratio
The interval scale, along with an absolute zero, a reference point for measurement.
Ex: Weight of two students, 100 kg and 50 kg. The weight of one student is twice that of the other, as
both weights are measured from 0; 100 is twice 50.
Data Distributions -
Data collected in an experiment or research study often approximately follows some standard distribution, such as being concentrated or sparse, with characteristics that aid in quick evaluation and precise analysis.
Normal Distribution -
The data collected is concentrated around the mean of the data. The graph looks like a bell-shaped curve.
It is a continuous distribution.
It is a symmetrical distribution about its mean.
It is asymptotic to the horizontal axis.
It is unimodal.
Area under the curve is 1.
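Normal Distribution Formula: f(x) = (1 / (σ√(2π))) * e^(-(x - μ)² / (2σ²)), where μ is the mean and σ is the standard deviation.
Ex: The weight of newborn babies tends to follow a normal distribution.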
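The properties listed above can be checked numerically with a small sketch. The mean of 3.3 kg and standard deviation of 0.5 kg for newborn weights are assumed values for illustration, not figures from this document.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Assumed parameters for newborn weights in kg (illustrative only)
mu, sigma = 3.3, 0.5
dist = NormalDist(mu, sigma)

# Symmetry about the mean: equal density at equal distances on either side
left = normal_pdf(mu - 0.4, mu, sigma)
right = normal_pdf(mu + 0.4, mu, sigma)

# Share of values within one standard deviation of the mean
within_one_sd = dist.cdf(mu + sigma) - dist.cdf(mu - sigma)

print("Density at mean - 0.4:", round(left, 4))
print("Density at mean + 0.4:", round(right, 4))
print("Share within one standard deviation:", round(within_one_sd, 4))
```

Roughly 68% of values fall within one standard deviation of the mean, whatever the values of the mean and standard deviation.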
Benefits:
Comparing parts of a bigger set of data, highlighting different categories,
or showing change over time.
You have long category labels; a bar chart offers more space for them.
Illustration of both positive and negative values in the dataset.
Limitations:
If you’re using multiple data points.
If you have many categories, avoid overloading your graph. Your graph
shouldn’t have more than 10 bars.
Pie Chart Definition : A pie chart is a circular visualization tool used to represent data
proportions by dividing a circle into sectors. Each sector's size corresponds to
the proportion of the whole dataset it represents, typically expressed as
percentages. Pie charts provide a clear and concise way to illustrate the
relative contributions or distributions of different categories within a dataset.
They are commonly employed in presentations, reports, and dashboards to
convey the composition or distribution of data in a visually accessible format.
Benefits
You have a total number that can be split up into 2-5 categories.
One category outweighs the other by a significant margin.
Limitations
Your dimension has too many categories.
Similar percentages/numbers exist between different values within the
chosen dimension.
Data doesn’t represent a uniform “whole”, or the percentages don’t
measure to 100 percent.
There are negative values or complex fractions in your measure value.
For further reading refer - Pie Charts: Using, Examples, and Interpreting
Benefits of Histogram
Visualize the distribution of a continuous dataset and frequency
Identify patterns like skewness, modality and kurtosis.
Detect outliers or anomalies that deviate from the rest of the data.
Limitations
Loss of individual data points due to binning.
Bin size choice impacts interpretation and can be misleading.
Not suitable for small datasets.
Limited to displaying one-dimensional data.
Difficult to compare multiple distributions on a single plot.
Line Chart Definition : Line chart is a graphical representation of data points connected
by lines, used to visualize trends and changes over time or across a
continuous variable. The horizontal axis typically represents time or the
independent variable, while the vertical axis represents the dependent
variable.
Benefits of Line Graphs:
Simplify complex data sets into a digestible format.
Highlight relevant trends in large data sets, reducing cognitive
overload.
Easier interpretation and friendlier for non-technical users.
Compellingly communicate stories through the progression of data
points.
Aid in predicting future scenarios by analyzing historical trends.
For further reading refer - Understanding and using Box and Whisker Plots
Benefits of Heatmaps:
Visualize large datasets concisely.
Identify patterns, trends, and correlations.
Facilitate exploration of multidimensional data.
Enhance decision-making with key insights.
Communicate complex information effectively to a wide audience.
Limitations of Heatmaps:
Limited to visualizing relationships between two variables.
Interpretation can be subjective and vary by viewer.
Sensitive to outliers, which can skew color representation.
Risk of overinterpretation and identifying spurious correlations.
Tree Map
Definition :
A tree map is a hierarchical visualization method that displays data as nested rectangles. Each rectangle's size represents a quantitative value, and the hierarchy is shown through the placement and size of these rectangles.
When to Use :
Hierarchical data
Multiple levels of categories
Comparing the relative size of categories.
For Example- Visualizing the breakdown of expenses in a budget, where larger rectangles represent higher spending categories.
Spider Chart
Definition :
A spider or radar chart displays multivariate data in a two-dimensional chart. Each variable is represented by an axis starting from the same point, and the data points are connected to form a polygon.
When to Use :
Comparing performance or characteristics of multiple entities across different variables, especially when the variables are not directly comparable.
For example, evaluating the strengths and weaknesses of athletes in various sports based on attributes like speed, endurance, and accuracy.
Gantt Chart
Definition :
A Gantt chart is a bar chart used for project management. It illustrates a project schedule by showing the start and finish dates of tasks or elements as horizontal bars.
When to Use :
Visualizing project schedules
Dependencies
Progress over time
For Example- Managing production schedules in manufacturing industries to ensure timely delivery.
Bubble Chart
Definition :
A bubble chart is a type of chart that displays three dimensions of data using bubbles. The x and y-axis represent two variables, and the size of each bubble represents the third variable.
When To Use :
Relationships between three variables
The size of the data points is significant and adds an extra dimension.
For Example, analyzing the correlation between population density, GDP per capita, and life expectancy in demographic studies.
Area Charts
Definition :
An area chart is a variation of a line chart where the area below the line is filled with color or shading. It is commonly used to represent cumulative data over time.
When To Use :
Displaying trends over time
Emphasizing the cumulative aspect of the data.
For Example, visualizing the trend of website traffic or user engagement over weeks, months, or years.
Violin Chart
Definition :
A violin chart is a method of plotting numeric data and a probability density function. It resembles a box plot with a rotated kernel density plot on each side.
When To Use :
To visualize the distribution of data
To compare it across different categories or groups
When the data is non-parametric or lacks a normal distribution.
For Example, analysing the distribution of healthcare outcomes or patient recovery times.
Ogive
Definition :
An ogive is a graph that represents the cumulative frequency distribution of a dataset. It plots cumulative frequencies on the y-axis against corresponding data values on the x-axis.
When To Use :
To understand the cumulative distribution of data
To visualize the proportion of values below a certain threshold, especially in statistical analysis and quality control processes.
For Example, analyzing the cumulative distribution of product delivery times or customer wait times.
Test Statistic -
A numeric value that measures how close the observed sample result is to the level required (the value assumed under the null hypothesis) in an experiment.
Ex: In juice bottle sugar content example, test statistic
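Significance level -
The risk (as a percentage) of rejecting the null hypothesis when it is true. The experimenter sets this value in advance; if the test statistic crosses the corresponding threshold, the null hypothesis should be rejected.
Ex (simplified): If the significance level is 0.05 and the hypothesised sugar level is 20 g, then a sample mean above 21 or below 19 (20 ± 5% of 20) leads to rejecting the null hypothesis, i.e. the sugar level is not 20 g. In practice, the rejection thresholds depend on the standard error of the sample mean rather than on a fixed percentage of the mean.
Every test yields a p-value: the probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one obtained. If the p-value is below the significance level (for example, 0.04 < 0.05), we reject the null hypothesis.
Confidence level - 100% minus the significance level (for example, 95% for a 5% significance level).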
2. Z-tests:
Used to compare the means of two groups, helpful for testing if a new treatment has a significant effect.
One-Sample Z-test (compares a sample mean to a hypothesized population mean):
Test Statistic: z = (x̄ - μ) / (σ / √n)
x̄ - sample mean.
μ - hypothesized population mean.
σ - population standard deviation (estimated using sample standard deviation if unknown).
n - the sample size.
Example
Use a one-sample Z-test if the sample size is large (n > 30) and the population standard deviation is known.
The average spend per customer in the existing stores is $100 with a known standard deviation of $15. The new store's average spend is $105 from a sample of 100 customers. The Z-test will help determine if this difference is statistically significant.
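The store example can be worked through numerically: the standard error is 15 / √100 = 1.5, so z = (105 - 100) / 1.5 ≈ 3.33. The sketch below assumes a two-sided test at a 0.05 significance level; both choices are assumptions, since the example does not state them.

```python
import math
from statistics import NormalDist

# Figures from the store example
population_mean = 100.0   # average spend in existing stores ($)
population_sd = 15.0      # known standard deviation ($)
sample_mean = 105.0       # average spend in the new store ($)
n = 100                   # sample size

# z = (sample mean - hypothesized mean) / (sigma / sqrt(n))
standard_error = population_sd / math.sqrt(n)
z = (sample_mean - population_mean) / standard_error

# Two-sided p-value from the standard normal distribution (assumed test direction)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha

print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject null hypothesis: {reject_null}")
```

Under these assumptions the p-value is well below 0.05, so the difference in average spend would be judged statistically significant.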
3. Chi-Square Test:
Used to analyse relationships between categorical variables, useful for seeing if there's a connection
between hair colour and eye colour.
Test Statistic: χ² = Σ [(Oi - Ei)² / Ei]
χ² (chi-square) is the test statistic.
Oi - observed frequencies in each category.
Ei - expected frequencies in each category (calculated based on the null hypothesis).
Example
Use a chi-square test of independence to see if customer satisfaction levels (satisfied, neutral, dissatisfied) are independent of the product category (electronics, clothing, groceries). Survey data from 500 customers is analysed to check if the proportion of satisfied customers differs across different product categories.
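A minimal sketch of the calculation for a survey like this one. The observed counts are hypothetical (chosen to total 500), and the decision uses the standard chi-square critical value of 9.488 for 4 degrees of freedom at a 0.05 significance level; the significance level itself is an assumption.

```python
# Hypothetical observed counts: [satisfied, neutral, dissatisfied] per category
observed = {
    "electronics": [90, 40, 30],
    "clothing": [80, 50, 40],
    "groceries": [110, 40, 20],
}

row_totals = {category: sum(counts) for category, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(row_totals.values())

# chi-square = sum over all cells of (Oi - Ei)^2 / Ei,
# where Ei = row total * column total / grand total under independence
chi2 = 0.0
for category, counts in observed.items():
    for j, o in enumerate(counts):
        e = row_totals[category] * col_totals[j] / grand_total
        chi2 += (o - e) ** 2 / e

# Degrees of freedom = (rows - 1) * (columns - 1)
dof = (len(observed) - 1) * (len(col_totals) - 1)

# Critical value for dof = 4 at an assumed 0.05 significance level
CRITICAL_VALUE_DF4_ALPHA_05 = 9.488
reject_independence = chi2 > CRITICAL_VALUE_DF4_ALPHA_05

print(f"chi-square = {chi2:.2f}, dof = {dof}, reject independence: {reject_independence}")
```

If the statistic exceeds the critical value, the data suggest that satisfaction is not independent of product category for these (made-up) counts.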
Linear Regression: It is a type of regression in which a linear function is fitted between the dependent and independent variables (fitting a straight line through the data points). Example: The price of a house depends on its square ft area.
The estimation is done using an equation that writes price as a linear function of square ft area. The coefficients are estimated using the “Ordinary Least Squares Method”.
Applications:
1) Forecasting - It can be used for forecasting the dependent variable.
2) Testing Significance - To check whether a variable can be explained in terms of other variables.
[Figure: plots of Price of House against Square feet area]
Equation (one independent variable):
Price of House = 57.4 + 0.017*(Square ft area)
Equation (two independent variables):
Price of House = 57.4 + 0.017*(Square ft area) - 0.666*(Age)
Error (or) Residual : The difference between actual value and estimated value.
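A minimal sketch of Ordinary Least Squares for one independent variable, followed by the residuals. The five data points are hypothetical, chosen so the fitted line lands near the house-price equation in this section; they are not the data behind that equation.

```python
# Hypothetical data: square ft area and house price
areas = [1000, 1500, 2000, 2500, 3000]
prices = [75, 82, 92, 100, 108]

n = len(areas)
mean_x = sum(areas) / n
mean_y = sum(prices) / n

# OLS estimates for a straight line: slope = Sxy / Sxx, intercept = mean_y - slope * mean_x
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(areas, prices))
s_xx = sum((x - mean_x) ** 2 for x in areas)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Residual = actual value - estimated value
estimates = [intercept + slope * x for x in areas]
residuals = [y - y_hat for y, y_hat in zip(prices, estimates)]

print(f"Price of House = {intercept:.1f} + {slope:.4f}*(Square ft area)")
print("Residuals:", [round(r, 2) for r in residuals])
```

With an intercept in the model, OLS residuals always sum to zero (up to rounding), which is a quick sanity check on the fit.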
1. Linearity:
There is a linear relationship between the dependent variable and independent variables. This can be verified by making plots.
2. No Autocorrelation:
The errors should not follow an increasing or decreasing trend. Such a trend makes the current error depend on the previous error.
3. Homoskedasticity:
The error variance is constant across the set of independent variables. This ensures that the estimation error is uniform.
4. Normality of Errors:
The errors (the residual at each data point) are normally distributed around 0. This ensures that the prediction error is low.
5. No Multicollinearity:
The independent variables should not be strongly correlated with each other. This ensures no cascading effect and accurate modelling and prediction.
[Figures: each assumption was illustrated with a plot of Residual against Actual values]
Estimating sales - Suppose a company makes a high-quality salt, which is a premium product. It wants to predict sales, so it can estimate them by regressing sales on the number of cars and ACs present in the locality. For a premium product, the target customers are higher-income people who own an AC and a car.
Regression Diagnostics - What should be done if the regression model is not significant? According to the plot, change the power of the independent variable.
Post-Regression Model Validation: The regression model is valid only if
1) Errors are normally distributed, and the error variance remains the same across the predictor variable.
2) Errors are not autocorrelated (if they are, Time Series Regression needs to be done).
If hours studied is 0, then the log odds is -3 (a low value), which indicates that the probability of passing is less than that of failing. If hours studied is 6, then the log odds is 6 and the probability of passing is higher.
On rearranging, we get the probability of success, p.
If Hours studied = 0, probability of pass = 0.047
If Hours studied = 6, probability of pass = 0.997
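The conversion from log odds to probability can be sketched as below. The equation log odds = -3 + 1.5 * (hours studied) is an assumption inferred from the two points given (log odds of -3 at 0 hours and 6 at 6 hours); the fitted equation itself is not shown in this section.

```python
import math

def log_odds(hours_studied):
    """Assumed linear model for the log odds of passing (inferred, not from the source)."""
    return -3 + 1.5 * hours_studied

def probability_of_pass(hours_studied):
    """Rearranging log(p / (1 - p)) = log odds gives p = 1 / (1 + e^(-log odds))."""
    return 1 / (1 + math.exp(-log_odds(hours_studied)))

for hours in (0, 2, 6):
    print(f"Hours studied = {hours}: log odds = {log_odds(hours):.1f}, "
          f"probability of pass = {probability_of_pass(hours):.4f}")
```

At 2 hours the assumed log odds is exactly 0, which corresponds to a probability of 0.5: passing and failing are equally likely.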
Confusion matrix (rows: actual condition, columns: predicted condition)
Total population = P + N | Predicted Positive (PP) | Predicted Negative (PN)
Actual Positive (P)      | True positive (TP)      | False negative (FN)
Actual Negative (N)      | False positive (FP)     | True negative (TN)
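The four cells of the matrix can be counted directly from actual and predicted labels, as in this minimal sketch. The label lists are hypothetical (1 = positive, 0 = negative), and accuracy is included as one standard summary of the matrix that is not defined in the table itself.

```python
# Hypothetical labels: 1 = positive, 0 = negative
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives

# Row and column totals from the table
actual_positive = tp + fn      # P
actual_negative = fp + tn      # N
predicted_positive = tp + fp   # PP
predicted_negative = fn + tn   # PN
total_population = actual_positive + actual_negative

# Accuracy: share of all cases that were classified correctly
accuracy = (tp + tn) / total_population

print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}, accuracy={accuracy:.2f}")
```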
Three important attributes that every time series data must exhibit
3. Seasonality
Seasonality refers to regular, repeating patterns or cycles in the data that occur at fixed intervals
(such as monthly, quarterly, or annually). These patterns are often driven by seasonal factors like
weather, holidays, or other periodic events.
4. Cyclicity
Cyclicity involves longer-term fluctuations that do not occur at regular intervals and are often
related to economic or business cycles. Unlike seasonality, cyclic patterns do not have a fixed
period and can be irregular and unpredictable.
5. Noise
Noise represents the random variations or "white noise" in the data. These are the residuals that
cannot be attributed to the level, trend, seasonality, or cyclicity. Noise is essentially the
unpredictable part of the time series.
Method and Example
2. Moving Average
Moving Average method smooths the time series by averaging the data points from several consecutive periods. This helps to reduce noise and highlight underlying trends.
Example: A 3-period moving average for sales data (100, 150, 200) would be (100+150+200)/3 = 150.
3. Exponential Smoothing
Exponential Smoothing assigns exponentially decreasing weights to past observations, giving more importance to recent data. This method is good for short-term forecasts and can be extended to handle trends (Holt’s method) and seasonality (Holt-Winters method).
Example: If the last observation was 100 units and the smoothing parameter is 0.5, the forecast might be 0.5 * 100 + 0.5 * (previous forecast).
8. Holt’s Linear Trend Model
Holt’s Linear Trend Model extends simple exponential smoothing to capture linear trends by incorporating two equations: one for the level and one for the trend.
Example: Level and trend equations are used to update forecasts by taking both recent data and trend changes into account.
9. Holt-Winters Seasonal Method
Holt-Winters Seasonal Method extends Holt’s model to include seasonality. It has three components: level, trend, and seasonal.
Example: Used for data with both trend and seasonality, like monthly sales data showing seasonal fluctuations and a trend.
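The first three methods can be sketched in a few lines each. The previous forecast of 90, the smoothing parameters, and the small linear series are hypothetical, and the way Holt's method is initialised here (level = first value, trend = first difference) is just one simple choice among several.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

def exponential_smoothing_forecast(last_observation, previous_forecast, alpha):
    """Next forecast = alpha * last observation + (1 - alpha) * previous forecast."""
    return alpha * last_observation + (1 - alpha) * previous_forecast

def holt_forecast(values, alpha, beta, horizon=1):
    """Holt's linear trend: update a level and a trend, then extrapolate."""
    level = values[0]
    trend = values[1] - values[0]  # simple initialisation (an assumption)
    for y in values[1:]:
        previous_level = level
        # Level equation: blend the new observation with the previous projection
        level = alpha * y + (1 - alpha) * (previous_level + trend)
        # Trend equation: blend the latest change in level with the old trend
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return level + horizon * trend

print(moving_average([100, 150, 200], 3))
print(exponential_smoothing_forecast(100, 90, 0.5))  # 90 is a hypothetical previous forecast
print(holt_forecast([10, 12, 14, 16], alpha=0.5, beta=0.5, horizon=1))
```

On a perfectly linear series such as 10, 12, 14, 16, Holt's method simply continues the line, giving 18 for the next period. Holt-Winters adds a third, seasonal equation and is not sketched here.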