Business Analytics Process and Data Exploration
Course Overview
• This chapter covers data exploration, validation, and cleaning
required for data analysis. You’ll learn the purpose of data
cleaning, why you need data preparation, how to go about
handling missing values, and some of the data-cleaning
techniques used in the industry.
Course Contents
• Business Analytics Life Cycle
• Understanding the Business Problem
• Collecting and Integrating the Data
• Preprocessing the Data
• Exploring and Visualizing the Data
• Using Modeling Techniques and Algorithms
• Evaluating the Model
• Presenting a Management Report and Review
• Deploying the Model
Business Analytics Life Cycle
Understanding the Business Problem
• The key purpose is to solve a business problem.
• You need to thoroughly understand the problem from a business
perspective before solving it.
Collecting and Integrating the Data
• The most important factor determining the accuracy of the results
is the quality of the data.
• Data can come from either a primary source or a secondary source.
• Most organizations have data spread across various databases.
• Sampling: selecting a smaller collection of units from a population,
used to determine truths about that population (Field, 2005)
• There are several variations of sampling:
• Random sampling: A sample is picked randomly, and every
member has an equal opportunity to be selected.
• Stratified sampling: The population is divided into groups, and
data is selected randomly from a group, or strata.
• Systematic sampling: You select members systematically, say
every tenth member, within a particular time frame or event.
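The three sampling variations above can be sketched in a few lines of Python (the chapter's own examples use R; the population here is a hypothetical list of 100 member IDs):

```python
import random

population = list(range(1, 101))  # hypothetical population of member IDs
random.seed(42)  # fixed seed so the sketch is reproducible

# Random sampling: every member has an equal chance of selection
random_sample = random.sample(population, 10)

# Stratified sampling: split the population into strata,
# then sample randomly within each stratum (5 from each here)
strata = {"low": [x for x in population if x <= 50],
          "high": [x for x in population if x > 50]}
stratified_sample = [random.choice(group)
                     for group in strata.values() for _ in range(5)]

# Systematic sampling: pick every tenth member
systematic_sample = population[::10]
print(systematic_sample)  # IDs 1, 11, 21, ..., 91
```

Fixing the seed keeps the random draws repeatable; in practice you would omit it or manage it explicitly.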
• n = (z × sigma / E)² if the standard deviation is known
• n = (z / E)² × p(1 − p) if the standard deviation is unknown
where z is the z-value for the desired confidence level and E is the
margin of error
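As a sketch, the two sample-size formulas can be computed directly, assuming z is the z-value for the chosen confidence level and E the margin of error (function names are illustrative):

```python
import math

def sample_size_known_sd(z, sigma, E):
    """n = (z * sigma / E)**2, when the population SD is known."""
    return math.ceil((z * sigma / E) ** 2)

def sample_size_unknown_sd(z, E, p=0.5):
    """n = (z / E)**2 * p * (1 - p); p = 0.5 gives the most conservative n."""
    return math.ceil((z / E) ** 2 * p * (1 - p))

# 95% confidence (z ~ 1.96), margin of error E = 0.05
print(sample_size_unknown_sd(1.96, 0.05))  # 385, a common survey sample size
```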
• Variable Selection
• The more predictor variables you have, the more records you need
• The more records you have, the better the prediction results
Preprocessing the Data
• Data type: qualitative or quantitative
• Qualitative data is not numerical (e.g., type of car, favorite color)
• Quantitative data is numeric and can be divided into discrete data
or continuous data
• Discrete data: a variable that can take only certain values that are
separate and distinct
• Continuous data: a variable that can take any numeric value
within a specific range or interval
• Handling Missing Values
• Methods used to resolve missing values:
a. Ignore the values (not a very effective method)
b. Fill in the values with the average value or the mode (the simplest
method)
c. Fill in the values with the mean of the attribute for records
belonging to the same bin
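Method (b) can be sketched in Python (the chapter's own examples use R; the income column below is made up for illustration, with None marking missing values):

```python
from statistics import mean, mode

# Hypothetical income column with missing values (None)
incomes = [1200, None, 1500, 1800, None, 1500]

# Method (b): fill missing entries with the mean of the observed values
observed = [v for v in incomes if v is not None]
fill_mean = mean(observed)  # mean of 1200, 1500, 1800, 1500 = 1500
filled = [v if v is not None else fill_mean for v in incomes]

# Alternatively, fill with the mode (most frequent observed value)
fill_mode = mode(observed)  # 1500
```

Method (c) works the same way, except the mean is computed only over records in the same bin as the record with the missing value.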
• Handling Duplicates, Junk, and Null Values
• Duplicates, junk, and null values should be cleaned from the
database before the analytics process.
• The process is the same as for handling missing values.
• The challenge is identifying the junk characters.
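A minimal Python sketch of this cleanup, assuming a hypothetical list of name records where "junk" means any non-alphanumeric character:

```python
import re

records = ["Alice", "Bob##", "Alice", None, "C@rol?"]

# Drop nulls and exact duplicates while preserving order
seen, cleaned = set(), []
for r in records:
    if r is not None and r not in seen:
        seen.add(r)
        cleaned.append(r)

# Strip junk (non-alphanumeric) characters with a regex
cleaned = [re.sub(r"[^A-Za-z0-9 ]", "", r) for r in cleaned]
print(cleaned)  # ['Alice', 'Bob', 'Crol']
```

Note that a blunt regex can mangle legitimate values ("C@rol?" becomes "Crol"), which is exactly why identifying what counts as junk is the hard part.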
• Data preprocessing with R covers these methods:
a. Understanding the variable types
b. Changing the variable types
c. Finding missing and null values
d. Cleaning missing values with appropriate methods
• The following are the basic data types in R:
a. Numeric: real numbers
b. Integer: whole numbers
c. Factor: categorical data used to define categories
d. Character: strings of characters
Exploring and Visualizing the Data
• Tables: the View() command displays the data in a spreadsheet-style
viewer
• Summary Tables
• Box plots
• Scatter plots
• Scatter plot matrices: use the pairs() function
> hou <- read.table("housing.data", header=TRUE, sep="\t")
• Scatter plot matrices: Trellis plot
• Scatter plot matrices: correlation plot
• Scatter plot matrices: density by class
• Normalization techniques:
a. Z-score normalization: the new value is created from the mean
and standard deviation: A' = (A − mean_A) / SD_A
b. Min-max normalization: values are transformed into a specified
range: A' = ((A − min_A) / (max_A − min_A)) × (range of A') + new min_A,
where range of A' = new max_A − new min_A
c. Data aggregation: sometimes a new variable may be required to
better understand the data: A' = (A^λ − 1) / λ, λ > 1
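The z-score and min-max formulas above can be sketched in Python (the values are illustrative; the chapter's own examples use R):

```python
from statistics import mean, stdev

A = [10.0, 20.0, 30.0, 40.0, 50.0]  # illustrative attribute values

# Z-score normalization: A' = (A - mean_A) / SD_A
m, s = mean(A), stdev(A)
z_scored = [(a - m) / s for a in A]

# Min-max normalization into the new range [0, 1]
lo, hi = min(A), max(A)
min_max = [(a - lo) / (hi - lo) for a in A]
print(min_max)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```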
Using Modeling Techniques and Algorithms
• Descriptive Analytics explains the patterns hidden in data.
• Patterns such as the number of market segments or sales numbers
by region are based purely on historical data.
• RMSE = √( (1/n) Σₖ (ŷₖ − yₖ)² ), where ŷₖ is the predicted value and
yₖ the actual value, summed over the n predictions
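A minimal implementation of the RMSE formula above:

```python
import math

def rmse(actual, predicted):
    """Root mean square error: sqrt of the mean squared prediction error."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)

print(rmse([3, 5, 7], [2, 5, 9]))  # sqrt((1 + 0 + 4) / 3) ~ 1.29
```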
Presenting a Management Report and Review
• Problem Description
• Data Set Used
• Data Cleaning Carried Out
• Method Used to Create The Model
• Model Deployment Prerequisites
• Model Deployment and Usage
• Issues Handling
Deploying the Model
• A challenging phase of the project.
• The model is now deployed for end users and runs in a production
environment, analyzing live data.
• Success of the deployment depends on the following:
a. Proper sizing of the hardware, ensuring required performance
b. Proper programming to handle the capabilities of the hardware
c. Proper data integration and cleaning
d. Effective reports, dashboards, views, decisions, and
interventions to be used by end users or end-user systems
e. Effective training to the users of the model
Question & Answers