Business Analytics Process and Data Exploration

This document discusses the key steps in a business analytics life cycle for data exploration and model building. It covers collecting and preprocessing data, exploring the data through visualization, choosing appropriate modeling techniques, evaluating models, and deploying successful models. The goal is to understand a business problem, derive insights from data to inform decisions, and solve the problem through an iterative eight-phase process.

Course Overview
• This chapter covers data exploration, validation, and cleaning
required for data analysis. You’ll learn the purpose of data
cleaning, why you need data preparation, how to go about
handling missing values, and some of the data-cleaning
techniques used in the industry.
Course Contents
• Business Analytics Life Cycle
• Understanding the Business Problem
• Collecting and Integrating the Data
• Preprocessing the Data
• Exploring and Visualizing the Data
• Using Modeling Techniques and Algorithms
• Evaluating the Model
• Presenting a Management Report and Review
• Deploying the Model
Business Analytics Life Cycle

• The purpose is to derive information from data in order to make
appropriate business decisions.
• Consists of eight phases:
a. Understand the Business Problem
b. Collect and Integrate the Data
c. Preprocess the Data
d. Explore and Visualize the Data
e. Choose Modeling Techniques and Algorithms
f. Evaluate the Model
g. Report to Management and Review
h. Deploy the Model


Phase 1: Understand the Business Problem

• The focus is to understand the problem, objectives, and
requirements from the perspective of the business.
• The business problem is then converted into a data analytics
problem, with the aim of solving it by using appropriate methods
to achieve the objective.

Phase 2: Collect and Integrate the Data

• Data is collected from various sources in this important phase.
• If the data is not available in a current database or data
warehouse, a survey may be required.


Phase 3: Preprocess the Data

• Data is cleaned and normalized.
• This process may be repeated several times, depending on the
quality of the data you get and the accuracy of the model obtained.

Phase 4: Explore and Visualize the Data

• The goal is to understand the characteristics of the data, such as
its distribution, trends, and relationships among variables.


Phase 5: Choose Modeling Techniques and Algorithms

• Decide whether to use unsupervised or supervised
machine-learning techniques.
• These choices depend on both the business requirements and the
data you have.

Phase 6: Evaluate the Model

• Evaluate the model by using standard methods that measure the
accuracy of the model and its performance in the field.
• It is important to be certain that the model achieves the
business objectives specified by business leaders.


Phase 7: Report to Management and Review

• Present the mathematical model to the business leaders.

Phase 8: Deploy the Model

• This is a challenging phase of the project.
• The model is deployed for end users in a production
environment, where it analyzes live data.

Understanding the Business Problem
• Key purpose: to solve a business problem
• The problem must be thoroughly understood from a business
perspective before attempting to solve it
Collecting and Integrating the Data
• The most important factor determining the accuracy of the results
is the quality of the data.
• Data can come from either a primary source or a secondary source.
• Most organizations have data spread across various databases.
• Sampling: a smaller collection of units from a population, used
to determine truths about that population (Field, 2005)
• There are several variations of sampling:
• Random sampling: A sample is picked randomly, and every
member has an equal opportunity to be selected.
• Stratified sampling: The population is divided into groups
(strata), and data is selected randomly from each group.
• Systematic sampling: You select members systematically (say,
every tenth member) in that particular time or event.
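The three sampling variations can be sketched in R as follows. This is a minimal illustration, not the course's own code: the data frame `population`, its `region` column, and all sample sizes are hypothetical.

```r
# Hypothetical population of 1,000 members in four regions (illustrative only).
set.seed(42)
population <- data.frame(
  id     = 1:1000,
  region = rep(c("north", "south", "east", "west"), each = 250)
)

# Random sampling: every member has the same chance of selection.
random_sample <- population[sample(nrow(population), 100), ]

# Stratified sampling: draw 25 members at random within each region.
strata <- split(population, population$region)
stratified_sample <- do.call(
  rbind,
  lapply(strata, function(g) g[sample(nrow(g), 25), ])
)

# Systematic sampling: every tenth member, starting from a random offset.
start <- sample(10, 1)
systematic_sample <- population[seq(start, nrow(population), by = 10), ]
```

Each approach returns 100 rows here; only the stratified sample guarantees equal representation of every region.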
• $n = \left(\frac{z \cdot \sigma}{E}\right)^2$ if the standard deviation is known
• $n = p(1 - p)\left(\frac{z}{E}\right)^2$ if the standard deviation is unknown
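A short worked example of the two sample-size formulas may help. It assumes the usual reading of the symbols, which the slides do not spell out: z is the z-score for the chosen confidence level, E is the acceptable margin of error, sigma is the population standard deviation, and p is the expected proportion. The input values are illustrative only.

```r
# Sample size when the population standard deviation (sigma) is known.
sample_size_known_sd <- function(z, sigma, E) ceiling((z * sigma / E)^2)

# Sample size for a proportion p when the standard deviation is unknown.
sample_size_proportion <- function(z, p, E) ceiling(p * (1 - p) * (z / E)^2)

z <- qnorm(0.975)  # about 1.96 for a 95% confidence level

sample_size_known_sd(z, sigma = 15, E = 2)      # 217
sample_size_proportion(z, p = 0.5, E = 0.05)    # 385
```

Rounding up with `ceiling()` is a design choice: a fractional record cannot be collected, and rounding down would miss the target margin of error.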
• Variable Selection
• The more predictor variables there are, the more records are needed.
• The more records we have, the better the prediction results.
Preprocessing the Data
• Data type: qualitative or quantitative
• Qualitative data: not numerical (e.g., type of car, favorite color)
• Quantitative data: numeric. Can be divided into discrete data or
continuous data
• Discrete data: a variable that can take only certain values that
are separate and distinct
• Continuous data: a variable that can take any numeric value
within a specific range or interval
• Handling Missing Values
• Methods used to resolve missing values:
a. Ignore the values (not a very effective method)
b. Fill in the values with the average value or mode (the simplest
method)
c. Fill in the values with the attribute mean of the same bin
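The three ways of resolving missing values can be sketched in base R. The data frame and its columns are hypothetical, and `segment` stands in for the "bin" only as an assumption about how records are grouped.

```r
# Hypothetical data with missing values (NA); columns are illustrative only.
df <- data.frame(
  segment = c("a", "a", "a", "b", "b", "b"),
  income  = c(10, NA, 14, 30, 34, NA),
  color   = c("red", "blue", NA, "red", "red", "blue"),
  stringsAsFactors = FALSE
)

# a. Ignore: drop every row that contains a missing value.
complete_rows <- na.omit(df)

# b. Fill with the overall mean (numeric) or mode (categorical).
income_mean_filled <- ifelse(is.na(df$income),
                             mean(df$income, na.rm = TRUE), df$income)
mode_value <- names(which.max(table(df$color)))  # table() skips NA by default
color_mode_filled <- ifelse(is.na(df$color), mode_value, df$color)

# c. Fill with the mean of the same bin (here, the same segment).
income_bin_filled <- ave(df$income, df$segment,
                         FUN = function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))
```

Method a leaves only three of the six rows in this example, which illustrates why ignoring values is rarely effective on small data sets.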
• Handling Duplicates, Junk, and Null Values
• These should be cleaned from the database before the analytics process.
• The process is the same as for handling missing values.
• The challenge is identifying the junk characters.
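One possible way to find duplicates, null-like entries, and junk characters in base R is sketched below. The records and the "expected" character patterns are assumptions made for illustration; what counts as junk depends on the actual data.

```r
# Hypothetical raw records with a duplicate row, junk characters, and a null-like entry.
raw <- data.frame(
  name  = c("Ann", "Bob", "Bob", "C@rl#", ""),
  sales = c("100", "200", "200", "3x0", "NULL"),
  stringsAsFactors = FALSE
)

# Remove exact duplicate rows.
deduped <- raw[!duplicated(raw), ]

# Treat empty strings and the literal text "NULL" as missing.
deduped[deduped == "" | deduped == "NULL"] <- NA

# Flag values containing characters outside the expected pattern.
junk_name  <- !is.na(deduped$name)  & grepl("[^A-Za-z ]", deduped$name)
junk_sales <- !is.na(deduped$sales) & grepl("[^0-9.]", deduped$sales)
```

Flagged rows can then be corrected or handled with the same methods used for missing values.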
• Data preprocessing with R covers the following methods:
a. Understanding the variable types
b. Changing the variable types
c. Finding missing and null values
d. Cleaning missing values with appropriate methods
• The following are the basic data types in R:
a. Numeric: real numbers
b. Integer: whole numbers
c. Factor: categorical data that defines various categories
d. Character: strings of characters
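The four preprocessing steps might look like this in base R. The data frame `autos` and its columns are hypothetical; mean filling in step d is just one of the possible cleaning methods.

```r
# Hypothetical data frame; column names are illustrative only.
autos <- data.frame(
  price = c(21000.5, 18500, NA),
  doors = c("2", "4", "4"),
  type  = c("sedan", "suv", "sedan"),
  stringsAsFactors = FALSE
)

# a. Understand the variable types.
str(autos)
sapply(autos, class)

# b. Change the variable types.
autos$doors <- as.integer(autos$doors)   # character -> integer
autos$type  <- as.factor(autos$type)     # character -> factor

# c. Find missing values, counted per column.
colSums(is.na(autos))

# d. Clean missing values, here by filling with the column mean.
autos$price[is.na(autos$price)] <- mean(autos$price, na.rm = TRUE)
```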
Exploring and Visualizing the Data
• Tables: View() command
• Summary tables
• Box plots
• Scatter plots
• Scatter plot matrices: use the pairs() function on the loaded data, e.g.
hou <- read.table("housing.data", header=TRUE, sep="\t") followed by pairs(hou)
• Scatter plot matrices: Trellis plot
• Scatter plot matrices: correlation plot
• Scatter plot matrices: density by class
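A brief sketch of these exploration steps in base R follows. The housing data file is not available here, so the built-in `mtcars` data set is used as a stand-in; the chosen columns are arbitrary.

```r
# mtcars ships with R; it stands in for the housing data, which is not included here.
data(mtcars)

head(mtcars)       # tabular view; View(mtcars) opens a spreadsheet-style viewer
summary(mtcars)    # summary table: min, quartiles, mean, max per variable

# Box plot of fuel economy by number of cylinders.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Cylinders", ylab = "Miles per gallon")

# Scatter plot of two variables, then a scatter plot matrix of four.
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")
pairs(mtcars[, c("mpg", "wt", "hp", "disp")])

# Correlation table for the same four variables.
round(cor(mtcars[, c("mpg", "wt", "hp", "disp")]), 2)
```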
• Normalization: some techniques:
a. Z-score normalization: the new value is created based on the
mean and standard deviation: $A' = \frac{A - \text{mean}_A}{\text{SD}_A}$
b. Min-max normalization: values are transformed into the
specified range: $A' = \frac{A - \min_A}{\max_A - \min_A} \cdot (\text{range of } A') + \min_{A'}$,
where range of $A' = \max_{A'} - \min_{A'}$
c. Data aggregation: sometimes a new variable may be required to
better understand the data: $A' = \frac{A^{\lambda} - 1}{\lambda},\ \lambda > 1$
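Z-score and min-max normalization can be sketched in a few lines of R. The vector `A` is illustrative, and the helper `min_max` with its `new_min` and `new_max` parameters is a hypothetical name introduced for this example.

```r
A <- c(2, 4, 6, 8, 10)   # illustrative values

# Z-score normalization: subtract the mean, divide by the standard deviation.
z_score <- (A - mean(A)) / sd(A)

# Min-max normalization into a target range [new_min, new_max].
min_max <- function(x, new_min = 0, new_max = 1) {
  (x - min(x)) / (max(x) - min(x)) * (new_max - new_min) + new_min
}

min_max(A)            # 0.00 0.25 0.50 0.75 1.00
min_max(A, 10, 20)    # same values rescaled to the range 10 to 20
```

After z-score normalization the values have mean 0 and standard deviation 1; after min-max normalization they fall exactly within the chosen range.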
Using Modeling Techniques and Algorithms
• Descriptive analytics: explains the patterns hidden in data.
• Patterns such as the number of market segments or sales numbers
by region are based purely on historical data.

• Predictive analytics: consists of two methods:

a. Classification: a basic form of data analysis in which data is
assigned to classes
b. Regression: predicting the value of a numerical variable
• Machine learning: making computers learn and perform tasks
better based on historical data
• Two types of machine learning:
a. Supervised machine learning: a machine builds a predictive
model under supervision, that is, with the help of a training
data set.
b. Unsupervised machine learning: there is no training data to
learn from, so there is no target variable to predict.
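The difference between the two types can be illustrated with the built-in `iris` data set. The choice of linear regression and k-means here is an assumption made for the example, not something the slides prescribe.

```r
data(iris)

# Supervised: Petal.Width is a known target, so the model learns from
# rows where both the predictor and the target are available.
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
predict(fit, newdata = data.frame(Petal.Length = 4.0))

# Unsupervised: no target variable; k-means groups rows using
# the four measurements alone.
set.seed(1)
clusters <- kmeans(iris[, 1:4], centers = 3)
table(clusters$cluster)
```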
[Figure: supervised machine learning vs. unsupervised machine learning]
Evaluating the Model
• Evaluating model performance is key to understanding how good
your predictions are when the model is applied to new data.
• The data set is divided into three partitions:
a. Training data partition: used to train the model
b. Test data partition: a subset of the data set that is not in the
training set, used to test the model
c. Validation data partition: used to fine-tune the model
performance and reduce overfitting problems
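A minimal sketch of a three-way partition in R, using the built-in `mtcars` data. The 60/20/20 proportions are an assumption; the slides do not specify split sizes.

```r
data(mtcars)
set.seed(123)

# Shuffle the row indices, then cut them 60% / 20% / 20% (assumed proportions).
n       <- nrow(mtcars)
idx     <- sample(n)
n_train <- floor(0.6 * n)
n_valid <- floor(0.2 * n)

train <- mtcars[idx[1:n_train], ]
valid <- mtcars[idx[(n_train + 1):(n_train + n_valid)], ]
test  <- mtcars[idx[(n_train + n_valid + 1):n], ]
```

Because the indices are shuffled once and then sliced, no row can appear in more than one partition.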
• Cross-validation (k-fold cross-validation): divide the data into k
folds, build the model using k - 1 folds, and use the remaining fold
for testing. This is repeated k times so that each fold serves once
as the test set.
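K-fold cross-validation can be sketched in base R as below. The linear model, the predictors, and k = 5 are assumptions chosen for illustration on the built-in `mtcars` data.

```r
data(mtcars)
set.seed(123)
k <- 5

# Assign each row to one of k folds at random.
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# For each fold: fit on the other k - 1 folds, test on the held-out fold.
fold_rmse <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)
  sqrt(mean((predict(fit, newdata = test) - test$mpg)^2))
})

mean(fold_rmse)   # cross-validated estimate of the prediction error
```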
Classification Model Evaluation
• Confusion Matrix
• The simplest way of measuring the performance of a classifier is by
counting the number of mistakes, which gives the misclassification error.
• The classification matrix, also referred to as a confusion matrix,
gives an estimate of the true classification and misclassification rates.
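A confusion matrix and the misclassification error can be computed with `table()` in base R. The actual and predicted labels below are made up for illustration.

```r
# Hypothetical actual and predicted labels (illustrative only).
actual    <- factor(c("yes", "yes", "yes", "yes", "no", "no", "no", "no", "no", "no"),
                    levels = c("yes", "no"))
predicted <- factor(c("yes", "yes", "yes", "no", "no", "no", "no", "no", "yes", "yes"),
                    levels = c("yes", "no"))

# Rows are predictions, columns are the true classes.
cm <- table(Predicted = predicted, Actual = actual)
cm

accuracy <- sum(diag(cm)) / sum(cm)   # 0.7: correct predictions on the diagonal
error    <- 1 - accuracy              # 0.3: misclassification error
```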
• Lift chart: commonly used for marketing problems
• The lift curve helps determine how effectively an online
advertisement campaign can be run by selecting a relatively small
group and getting the maximum number of responders. The lift is a
measure of the effectiveness of the model, calculated as the ratio of
the results obtained with and without the model.
• A confusion matrix evaluates the effectiveness of the model on the
whole population, whereas the lift chart evaluates a portion of the
population.
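The lift for a selected group can be worked out by hand. In this sketch the scores, the responses, and the 20% cutoff are all hypothetical; lift is taken as the response rate in the model-selected group divided by the response rate in the whole population.

```r
# Hypothetical scored customers: model score and actual response (1 = responded).
score    <- c(0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10)
response <- c(1,    1,    0,    1,    0,    0,    1,    0,    0,    0)

# Rank customers by score and take the top 20%.
ord   <- order(score, decreasing = TRUE)
top_n <- round(0.2 * length(score))

top_rate      <- mean(response[ord][1:top_n])   # response rate with the model
baseline_rate <- mean(response)                 # response rate without the model

lift <- top_rate / baseline_rate                # 2.5
```

A lift of 2.5 means that contacting the top-scored 20% finds responders 2.5 times as often as contacting customers at random.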
• ROC (receiver operating characteristic) chart: similar to a lift
chart.
• It is a plot of the true-positive rate on the y axis against the
false-positive rate on the x axis.
• The true-positive rate is the sensitivity, and the false-positive
rate is 1 - specificity.
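The points of an ROC chart can be computed directly in base R by sweeping a classification threshold. The scores and labels below are hypothetical.

```r
# Hypothetical scores and true labels (1 = positive); illustrative only.
score <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2)
label <- c(1,   1,   0,   1,   0,   1,   0,   0)

# Sweep the threshold over every score, from strictest to most lenient.
thresholds <- sort(unique(score), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) sum(score >= t & label == 1) / sum(label == 1))
fpr <- sapply(thresholds, function(t) sum(score >= t & label == 0) / sum(label == 0))

# ROC curve: false-positive rate on x, true-positive rate on y.
plot(c(0, fpr), c(0, tpr), type = "b",
     xlab = "False-positive rate (1 - specificity)",
     ylab = "True-positive rate (sensitivity)")
abline(0, 1, lty = 2)   # the diagonal corresponds to random guessing
```

The further the curve bends toward the top-left corner, away from the diagonal, the better the classifier separates the two classes.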
Regression Model Evaluation
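• A regression model has many criteria for measuring its performance.
• Root-mean-square error (RMSE): a regression line predicts the y
value for a given x value, and the observed values are scattered
around these predictions. RMSE summarizes the size of the
prediction errors:
• $RMSE = \sqrt{\frac{\sum_{k=1}^{n} (\hat{y}_k - y_k)^2}{n}}$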
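RMSE is straightforward to compute in R. The helper name `rmse` and the example values are illustrative; the model fitted on the built-in `mtcars` data is only a stand-in.

```r
# Root-mean-square error between predicted and actual values.
rmse <- function(predicted, actual) sqrt(mean((predicted - actual)^2))

# Small hand-checkable case: errors are 1, 0, -2, so RMSE = sqrt((1 + 0 + 4) / 3).
rmse(c(2, 4, 6), c(1, 4, 8))

# RMSE of a simple regression line on its own training data.
data(mtcars)
fit <- lm(mpg ~ wt, data = mtcars)
rmse(predict(fit), mtcars$mpg)
```

RMSE is expressed in the same units as the target variable, which makes it easy to compare against the typical size of the values being predicted.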
Presenting a Management Report and Review
• Problem Description
• Data Set Used
• Data Cleaning Carried Out
• Method Used to Create the Model
• Model Deployment Prerequisites
• Model Deployment and Usage
• Issue Handling
Deploying the Model
• This is a challenging phase of the project.
• The model is deployed for end users in a production
environment, where it analyzes live data.
• Success of the deployment depends on the following:
a. Proper sizing of the hardware, ensuring the required performance
b. Proper programming to handle the capabilities of the hardware
c. Proper data integration and cleaning
d. Effective reports, dashboards, views, decisions, and
interventions to be used by end users or end-user systems
e. Effective training for the users of the model
Questions & Answers
