Unit 4 Big Data Complete Notes
Unit IV
Introduction
Data are key ingredients for any analytical exercise. Hence, it is important to
thoroughly consider and list all data sources that are of potential interest before starting the
analysis.
Sampling
It is a statistical analysis technique used to select, manipulate and analyse a
representative subset of data points to identify patterns and trends in the larger data set being
examined.
The aim of sampling is to take a subset of past customer data and use that to build an
analytical model.
A key requirement for a good sample is that it should be representative of the future
customers on which the analytical model will be run.
Definitions:
1. Population:
It is a collection of observations about which the user would like to make an inference.
2. Sample:
It is the specific group of individuals from which data will be collected.
3. Sampling Frame:
The sampling frame is the actual list of individuals from which the sample will be drawn.
E.g., when studying working conditions at Company X, the population is all 1000 employees of
the company; the sampling frame is the company's HR database, which lists the names and
contact details of every employee.
4. Sample Size:
The number of individuals in the sample depends on the size of the population and on
how precisely it must represent the population as a whole.
Types of Sampling
1. Simple Random Sampling:
In a Simple Random Sample, every member of the population has an equal chance of
being selected.
Sampling frame should include the whole population.
Tools like random number generators are used.
E.g., to select a simple random sample of 100 employees of Company X, assign a
number from 1 to 1000 to every employee in the company database, and use a random
number generator to select 100 numbers.
2. Systematic Sampling:
It is similar to simple random sampling, but it is usually slightly easier to conduct.
Every member of the population is listed with a number, but instead of randomly
generating numbers, individuals are chosen at regular intervals.
E.g., all employees of the company are listed in alphabetical order; from the first 10 numbers
a starting point is selected at random, say number 6.
From number 6 onwards every 10th person on the list is selected (6, 16, 26, 36........).
3. Stratified Sampling:
It involves dividing the population into sub-populations that may differ in important ways,
i.e., dividing the population into subgroups called strata based on the relevant
characteristics.
It allows to draw more precise conclusions by ensuring that every subgroup is
properly represented in the sample.
E.g., the company has 1000 employees; based on the stratum gender, the population is divided
into 800 females and 200 males.
4. Cluster Sampling:
It involves dividing the population into subgroups, but each subgroup should have
similar characteristics to the whole sample.
Instead of sampling individuals from each subgroup, entire subgroups are randomly
selected.
E.g., the company has offices in 10 cities across the country, each with roughly the same
number of employees in similar roles. Instead of travelling to every office to collect the data,
random sampling is used to select 3 offices; these are the clusters.
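A minimal sketch of the first three selection schemes in Python, assuming 1000 employees simply identified by the numbers 1 to 1000 and the 800/200 female/male split from the stratified example:

import random

random.seed(42)                                   # fixed seed so the draws are reproducible
employees = list(range(1, 1001))                  # every employee numbered 1..1000

# 1. Simple random sampling: 100 ids, each equally likely
simple_sample = random.sample(employees, 100)

# 2. Systematic sampling: random start in the first 10, then every 10th employee
start = random.randint(1, 10)
systematic_sample = list(range(start, 1001, 10))

# 3. Stratified sampling: proportional allocation over the gender strata
strata = {"female": employees[:800], "male": employees[800:]}
stratified_sample = []
for group, members in strata.items():
    n = round(100 * len(members) / len(employees))   # 80 females, 20 males
    stratified_sample.extend(random.sample(members, n))

print(len(simple_sample), len(systematic_sample), len(stratified_sample))   # 100 100 100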
2. Categorical Data:
It represents the characteristics of the data.
It can also take numerical values.
E.g., a person's gender, language, etc.
Categorical data has been classified into
a. Nominal Data
b. Ordinal Data
c. Binary Data
a. Nominal Data:
These are data elements that can only take on a limited set of values with no
meaningful ordering in between.
E.g., marital status: yes or no.
Nominal data has no order; therefore, changing the order of its values would not change
the meaning.
b. Ordinal Data:
These are data elements that can only take on a limited set of values with a
meaningful ordering in between.
E.g., Age coded as young, middle aged and old.
c. Binary Data:
These are data elements that can only take on two values.
E.g., Employment status
2. Bar Charts-It represents the frequency of each of the values either absolute or relative as
bars.
3. Histogram-It provides an easy way to visualise the central tendency and to determine the
variability or spread of the data.
4. Scatter Plots-It allows visualizing one variable against another to see whether there are
any correlation patterns in the data.
A next step after visual analysis could be inspecting some basic statistical
measurements such as averages, standard deviations, minimum, maximum, confidence intervals, etc.
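A quick sketch of inspecting such summary statistics with pandas, assuming a small illustrative DataFrame with a numeric column named 'age':

import pandas as pd

df = pd.DataFrame({"age": [23, 35, 41, 29, 52, 47, 38, 61]})   # toy data
print(df["age"].describe())   # count, mean, std, min, quartiles and max in one call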
Missing Values
Missing value occurs when no data value is stored for the variable in an
observation.
Missing data is a common problem and challenge for analysts.
Some analytical techniques, such as decision trees, can deal directly with missing
values.
Missing values can occur for various reasons; for instance, the information can be
non-applicable.
E.g.,
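Whatever the reason, one common treatment is to impute the missing values; a minimal pandas sketch, assuming an illustrative numeric column named 'income':

import pandas as pd

df = pd.DataFrame({"income": [2500, None, 3100, None, 4200]})
df["income"] = df["income"].fillna(df["income"].mean())   # replace missing entries with the column mean
print(df)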
Types of Outliers:
1. Univariate Outlier: It can be found when looking at a distribution of values in a
single feature space.
2. Multivariate Outlier:
It can be found in an n-dimensional space (n features).
Multivariate outliers are very difficult to handle and can often only be detected once a model has been trained.
Detection and Treatment are the two important steps in dealing with the outliers.
The first check for outliers is to calculate the minimum and maximum values for each
of the data elements.
Various tools can be used to detect outliers, such as:
1. Histograms
2. Box-Plots
3. Z-Score
1. Histograms:
It is a graphical representation that organizes a group of data points into user-specified
ranges.
It is similar in appearance to a bar graph
Box-plots:
A box plot is a standardized way of displaying the distribution of data based on three
key quartiles of the data:
a. First quartile (Q1): 25% of observations lie below it
b. Median (Q2): 50% of observations lie below it
c. Third quartile (Q3): 75% of observations lie below it
The quartiles are represented as a box. The minimum and maximum values are added
unless they are too far away from the edges of the box.
Z-Score:
It is a numerical measurement that describes a value’s relationship to the mean of a
group of values.
It is measured in terms of standard deviations from the mean.
If Z-score is 0, it indicates that the data point score is identical to the mean score.
Formula: z = (x − μ) / σ, where x is the data value, μ is the mean, and σ is the standard deviation.
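A short sketch of computing z-scores and flagging outliers with NumPy, assuming a simple cutoff of |z| > 2 for this tiny illustrative sample:

import numpy as np

values = np.array([10, 12, 11, 13, 12, 95], dtype=float)   # 95 is an obvious outlier
z = (values - values.mean()) / values.std()                # z-score of each value
print(values[np.abs(z) > 2])                               # values far from the mean -> [95.]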
Standardizing Data:
It is a data pre-processing activity targeted at scaling variables to a similar range.
It focuses on transforming raw data into usable information before its analyzed.
Raw data can contain variations in entries that are meant to be the same, which could
later affect data analysis.
By standardizing, the data are changed to be consistent across all entries.
Once the information in the dataset is consistent and standardized, it will be
significantly easier to analyze and use.
Standardization processes create compatibility, similarity, measurement and symbol
standards.
Standardization is used especially for regression-based approaches.
Min/Max Standardization:
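The usual formula is X_new = (X − min) / (max − min) × (newmax − newmin) + newmin; a minimal sketch of scaling to the range [0, 1] with NumPy, using illustrative values:

import numpy as np

x = np.array([150.0, 200.0, 400.0, 1000.0])      # raw values on an arbitrary scale
x_scaled = (x - x.min()) / (x.max() - x.min())   # min/max standardization to [0, 1]
print(x_scaled)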
Categorization
It is also known as coarse classification, classing, grouping, & binning.
It helps to identify and assign categories to a collection of data to allow for more
accurate analysis.
Categorization helps the user with knowledge discovery and future planning.
E.g., Email classifying as ‘spam’ or ‘not spam’.
Binning
Binning is used to minimize the effects of small observation errors.
The original data values are divided into small intervals known as bins.
It has a smoothing effect on the input data and may also reduce the chances of
overfitting in the case of small datasets.
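A short sketch of binning a numeric variable into intervals with pandas (pd.cut); the bin edges and labels here are illustrative and echo the earlier age example:

import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 67])
bins = pd.cut(ages, bins=[18, 30, 50, 70], labels=["young", "middle aged", "old"])
print(bins.value_counts())   # number of observations falling in each bin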
Chi-Squared Analysis
It is a more sophisticated way to do coarse classification.
Chi-Square test is a test of statistical significance for categorical variables.
The chi-square test is a useful measure for comparing experimentally obtained results with
those expected theoretically on the basis of the hypothesis.
Formula: χ² = Σ (O − E)² / E, where O is the observed frequency and E is the expected frequency.
If there is no difference between the observed and expected frequencies, the value of chi-
square is zero.
If there is a difference between the observed and expected frequencies, then the value of
chi-square is greater than zero.
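A small sketch of the calculation in Python, using illustrative observed and expected frequencies:

observed = [30, 14, 34, 45, 57, 20]   # hypothetical observed frequencies
expected = [20, 20, 30, 60, 60, 10]   # hypothetical expected frequencies
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_square, 2))           # 0 only if observed and expected match exactly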
Benefits of WOE
1. It can treat Outliers
2. It can handle missing values as missing values can be binned separately
3. It helps to build a strict linear relationship with the log odds.
Variable Selection
Variable selection means selecting which variables to include in the model.
Various filter measures are used in the selection of variables, and they are:
1. Pearson’s Correlation: Pearson’s Correlation Coefficient is the test statistics that
measures the relationship or association between two continuous variables.
It measures the linear dependency between two variables and always varies between -1 and +1.
2. Fisher Score: It is a test statistic that helps to measure the relationship between a continuous
variable and a categorical (good/bad) target.
Formula:
Here X̄G and X̄B are the average values of the variable for the Goods and the Bads.
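A quick sketch of computing the Pearson correlation filter with NumPy for two illustrative continuous variables:

import numpy as np

income = np.array([1800, 2400, 3100, 4000, 5200], dtype=float)
spend = np.array([900, 1100, 1500, 2100, 2500], dtype=float)
r = np.corrcoef(income, spend)[0, 1]   # Pearson correlation coefficient
print(round(r, 3))                     # close to +1 -> strong positive linear relationship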
Information Value
It is one of the most useful techniques for selecting important variables in a predictive
model.
It helps to rank variables on the basis of their importance.
Information Value is calculated as IV = Σ (% of Goods − % of Bads) × WOE, summed over the
bins of the variable, where WOE = ln(% of Goods / % of Bads).
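A minimal sketch of the WOE and IV calculation for one binned variable, assuming purely illustrative good/bad counts per bin:

import math

# hypothetical counts of (goods, bads) in each bin of a variable
bins = {"young": (400, 100), "middle aged": (350, 50), "old": (250, 50)}
total_good = sum(g for g, b in bins.values())
total_bad = sum(b for g, b in bins.values())

iv = 0.0
for name, (good, bad) in bins.items():
    pct_good = good / total_good
    pct_bad = bad / total_bad
    woe = math.log(pct_good / pct_bad)   # weight of evidence for the bin
    iv += (pct_good - pct_bad) * woe     # contribution of the bin to the information value
print(round(iv, 4))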
Segmentation
Segmentation is the process of taking the data and dividing it up and grouping similar
data together based on the chosen parameter.
Analytics
1. Source Data
3. Data Cleaning
It helps to get rid of all inconsistencies such as missing values, outliers and
duplicate data.
4. Data transformation
5. Analytics
The last step is interpretation and evaluation of the results.
1. Data Scientist: Data Scientists are analytical experts who utilize their skills in
both technology & social science to find trends and manage data.
2. Data Miner: Persons who explore and analyze large blocks of information to
glean meaningful patterns and trends.
3. Data Analyst: Persons who understand data & use it to make strategic business
decisions.
1. Business Performance:
The analytical model should solve the business problem for which it was developed.
2. Statistical Performance:
3. Interpretability: It refers to understanding the patterns that the analytical model captures.
4. Operational Efficiency:
It refers to the efforts needed to collect the data, preprocess it, evaluate the model and
feed its outputs to the business application.
5. Economic Cost:
Software costs, human & computing resources should be taken into consideration.
Types of Analytics
1. Predictive Analytics
2. Descriptive Analytics
Predictive Analytics
Predictive analytics mines the data using statistical algorithms and machine learning
techniques, and uses historical data and the patterns in that data to predict the future.
It creates models based on patterns in the data to predict the probability of
something happening in the future.
The better the model and the training data, the better the prediction.
1. Retail: Predictive analytics is used in retail, which is always looking to improve its sales position
and forge better relations with customers.
2. Health: Used in predicting epidemics or public health issues based on probability.
1. Regression
2. Classification
Regression
Variables are-
a. Independent Variables:
Independent variables are those values that are varied or controlled and do not depend on
the other variables in the system; they are also called predictor variables.
b. Dependent Variables:
Dependent variables are those whose values change as a consequence of changes in the
other values in the system.
They are also called criterion variables.
Types of Regression
1. Linear regression
2. Logistic regression
Linear Regression
Linear regression is used where the relationship between the variables can be described
with a straight line of the form y = b0 + b1x.
Slope: The slope of the line (b1) is the change in y for a one-unit increase in x.
Y-Intercept: It (b0) is the height at which the line crosses the vertical axis, and it is obtained by
setting x = 0 in the above equation.
Example:
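The worked example itself is not reproduced in these notes; as a stand-in, a minimal sketch of fitting a straight line by least squares with NumPy on illustrative data:

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)    # independent variable
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])       # dependent variable
slope, intercept = np.polyfit(x, y, 1)        # least-squares straight line y = b0 + b1*x
print(round(slope, 2), round(intercept, 2))   # change in y per unit x, and the value at x = 0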
Logistic Regression:
Logistic regression is the analysis conducted when the dependent variable is binary.
It is used to describe data and to explain the relationship between one dependent
binary variable and one or more nominal or ordinal independent variables.
Formula: p(x) = 1 / (1 + e^−(β0 + β1x))
Example:
A linear regression model can generate predicted values ranging from negative to
positive infinity, whereas the probability of an outcome can only lie between 0 and 1, i.e., 0 < p(x) < 1.
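A brief sketch of fitting a logistic regression on a binary outcome with scikit-learn; the data and the single feature are purely illustrative:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[21], [25], [30], [38], [45], [52], [60]])   # single feature, e.g. age
y = np.array([0, 0, 0, 1, 1, 1, 1])                        # binary target
model = LogisticRegression().fit(X, y)
print(model.predict_proba([[35]])[0, 1])                   # predicted probability, always between 0 and 1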
Decision Trees
Decision Trees are the graphical representation for getting all the possible solutions
to a problem/decision based on given conditions.
It can be used for both classification & regression problems but mostly used for
classification problems.
Decision tree is a tree structured classifier, where internal nodes represent the features
of a dataset, branches represent the decision rules and each leaf node represents the
outcome.
A decision tree simply asks a question and, based on the answer (Yes/No), further splits
the tree into sub-trees.
In Decision tree, there are two nodes,
a. Decision Node - Decision nodes are used to make any decision and have
multiple branches,
b. Leaf Node-Leaf nodes are the output of those decisions and do not contain
any further branches.
a. Root Node:
It is the node from which the decision tree starts; it represents the entire dataset, which
is then divided into two or more homogeneous sets.
b. Leaf Node:
These are the final output nodes, and the tree cannot be segregated further after
a leaf node is reached.
c. Branch/Sub-Tree: A tree formed by splitting the main tree.
d. Parent/Child Node:
The root node of the tree is called parent node & other nodes are called the
child nodes.
Process
1. Splitting:
It is the process of dividing the node/root into sub-nodes according to the condition.
2. Pruning: It is the process of removing unwanted branches from the tree.
Example:
Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or Not. So, to solve this problem, the decision tree starts with the root
node (Salary attribute by ASM). The root node splits further into the next decision node
(distance from the office) and one leaf node based on the corresponding labels. The next
decision node further gets split into one decision node (Cab facility) and one leaf node.
Finally, the decision node splits into two leaf nodes (Accepted offers and Declined offer).
Consider the below diagram:
While implementing a decision tree, the main issue that arises is how to select the best
attribute for the root node and for the sub-nodes. To solve such problems there is a technique
called the Attribute Selection Measure, or ASM. With this measure, we can easily
select the best attribute for the nodes of the tree.
1. Information Gain
2. Gini Index
Information Gain:
Information gain measures the change in entropy after the dataset is split on an attribute:
Information Gain = Entropy(S) − [(weighted average) × Entropy(each feature)]
Entropy:
Entropy is a metric that measures the impurity in the data:
Entropy(S) = −P(yes)·log2 P(yes) − P(no)·log2 P(no)
Where S is the total number of samples, P(yes) is the probability of yes, and P(no) is the
probability of no.
Gini Index:
Gini index is a measure of impurity or purity used while creating a decision tree in
the CART (Classification and Regression Tree) algorithm.
An attribute with the low Gini index should be preferred as compared to the high
Gini index.
It only creates binary splits, and the CART algorithm uses the Gini index to create
binary splits.
Gini index can be calculated using the formula: Gini Index = 1 − Σj Pj²
Advantages of Decision Trees:
It is simple to understand, as it follows the same process that a human follows while
making a decision in real life.
It can be very useful for solving decision-related problems.
It helps to think about all the possible outcomes for a problem.
There is less requirement of data cleaning compared to other algorithms.
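A compact sketch of training a CART-style decision tree (Gini criterion) with scikit-learn on toy data loosely inspired by the job-offer example; the feature encoding is an assumption for illustration:

from sklearn.tree import DecisionTreeClassifier

# toy features: [salary_in_lakhs, distance_to_office_km, cab_facility (1 = yes)]
X = [[8, 5, 1], [3, 20, 0], [9, 25, 1], [4, 4, 0], [10, 30, 0], [7, 6, 1]]
y = [1, 0, 1, 0, 0, 1]   # 1 = offer accepted, 0 = offer declined
tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
print(tree.predict([[8, 10, 1]]))   # predicted decision for a new offer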
Neural Network
Neural networks are inspired from the biological neurons within the human body
which activate under certain circumstances resulting in a related action performed
by the body in response.
Neural nets consist of various layers of interconnected artificial neurons powered
by activation functions which help in switching them ON/OFF.
Here,
x0, x1,x2 are the inputs
w0, w1, w2 are the weights
F-Activation Function
Weights are numeric values that are multiplied with the inputs.
Activation Function: a mathematical formula that helps the neuron to switch
ON/OFF.
Bias Component (B): the neural network takes the inputs, computes the weighted
sum of the inputs, and adds a bias component.
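A tiny sketch of one artificial neuron computing the weighted sum of its inputs plus a bias and passing the result through a sigmoid activation; all numbers are illustrative:

import math

x = [0.5, 0.3, 0.8]    # inputs x0, x1, x2
w = [0.4, -0.6, 0.9]   # weights w0, w1, w2
b = 0.1                # bias component B

z = sum(xi * wi for xi, wi in zip(x, w)) + b   # weighted sum of inputs plus bias
output = 1 / (1 + math.exp(-z))                # activation function F (sigmoid) switches the neuron ON/OFF smoothly
print(round(output, 3))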
Architecture of Neural Network
1. Input Layer:
Input layer represents dimensions of the input vector.
It accepts the inputs in several different formats provided by the programmers.
2. Hidden layer:
It represents the intermediary nodes that divide the input space into regions with (soft)
boundaries.
It takes in a set of weighted inputs and produces output through an activation function.
The hidden layer lies between the input and output layers.
Activation Functions:
Activation functions help to normalize the output of each neuron to a range between 0
and 1 or between -1 and 1.
The most popular transformation functions are:
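The list itself is not reproduced in these notes; as an illustration, minimal sketches of two widely used activation functions (sigmoid and tanh), which squash outputs into the ranges mentioned above:

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))   # output between 0 and 1

def tanh(z):
    return math.tanh(z)             # output between -1 and 1

print(sigmoid(2.0), tanh(2.0))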
Advantages of Perceptron
Perceptrons can implement Logic Gates like AND, OR, or NAND
Disadvantages of Perceptron
Perceptrons can only learn linearly separable problems such as the Boolean AND
problem. For non-linearly separable problems, such as the Boolean XOR problem, they do not work.
In a feed-forward network, input data travels in one direction only, passing through the neural nodes.
c. Multilayer Perceptron
In this model, the input data travels through various layers of neurons.
Every single node is connected to all neurons in the next layer which makes it a fully
connected neural network.
Input and output layers are present, along with multiple hidden layers.
It has bi-directional propagation (forward propagation of inputs and backward propagation of errors).
This model is used in speech recognition and machine translation.
A Radial Basis Function Network consists of an input vector followed by a layer of RBF
neurons.
Classification is performed by measuring the input's similarity to data points from the
training set.
When a new input vector (the n-dimensional vector that you are trying to classify)
needs to be classified, each neuron calculates the Euclidean distance between the new
input and its prototype.
In a Recurrent Neural Network, the output of a layer is saved and fed back to the input to help
in predicting the outcome of the layer.
This model is used in text processing such as auto-suggest, grammar checks, and text-to-
speech processing.
Descriptive Analytics
It is analytics that creates a summary of historical data to yield useful information and
possibly prepare the data for further analysis.
The aim is to describe patterns of customer behaviour.
It serves as a preliminary step in the business intelligence process, creating a
foundation for further analysis & understanding.
This analysis seeks answers about what happened, without performing more
complex analysis.
E.g., Summarising the event such as sales & operations data.
The three most common types of Descriptive Analytics are:
Association rules
Sequence rules
Clustering
Association Rules
Association Rules helps to detect frequently occurring patterns between the items.
E.g., Market Basket Analysis is one of the key techniques used by large retailers to show
associations between items.
It allows retailers to identify relationships between the items that people frequently buy
together.
Implications of Association Rules:
X ⇒ Y
Basic Definitions:
1. Support Count (σ): Frequency of occurrence of an itemset.
E.g., σ({Milk, Bread}) = 1
2. Frequent Itemset: An itemset whose support is greater than or equal to a minimum support
threshold.
2. Confidence (c): The strength of the association; it measures how often items in Y appear in
transactions that contain X.
Problem: Suppose that the given support threshold is 3 and the required confidence is 80%.
The following rules can be obtained from the size-two frequent itemsets (2-frequent
itemsets):
Since our required confidence is 80%, only rules 1 and 4 are included in the result.
Therefore, it can be concluded that customers who bought item two (I2) always bought item
three (I3) with it, and customers who bought item four (I4) always bought item 3 (I3) with it.
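A minimal sketch of computing the support count and confidence of one candidate rule in Python; the transaction list below is illustrative, not the data set from the missing table:

transactions = [
    {"I1", "I2", "I3"},
    {"I2", "I3"},
    {"I2", "I3", "I4"},
    {"I1", "I4"},
    {"I3", "I4"},
]

antecedent, consequent = {"I2"}, {"I3"}                              # candidate rule I2 => I3
both = sum((antecedent | consequent) <= t for t in transactions)    # transactions containing I2 and I3
only_x = sum(antecedent <= t for t in transactions)                 # transactions containing I2
confidence = both / only_x
print(both, confidence)   # support count 3, confidence 1.0 -> every buyer of I2 also bought I3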
Sequence Rules
Sequence rules are used for finding statistically relevant patterns between the data
where the values are delivered in a sequence.
The goal of sequence rules is to find maximal sequences among all sequences that have a certain
user-specified minimum support and confidence.
E.g., Sequence of webpage visits in Web Analytics.
Consider the example of a transactions data set in web analytics,
where the letters A, B, C, ... refer to the webpages.
Representation:
1. Sociograms: Social networks can be represented as sociograms.
Sociograms are good for small-scale networks.
The color of the nodes corresponds to a specific status.
2. Closeness: The average distance of a node to all other nodes in the network
Example: Fernando & Garth are the closest to all others. They are the best positioned to
communicate messages that need to flow quickly through to all other nodes in the network.
3. Betweenness: Counts the number of times a node or connection lies on the shortest path
between any two nodes in the network.
Example: Heather has the highest betweenness. She sits in between two important
communities. She plays a broker role between both communities but is also a single point of
failure.
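A short sketch of computing closeness and betweenness with networkx on a small illustrative graph (not the Fernando/Garth/Heather network from the examples):

import networkx as nx

# a toy sociogram: nodes are people, edges are connections
G = nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("C", "F")])
print(nx.closeness_centrality(G))     # based on the average distance of a node to all other nodes
print(nx.betweenness_centrality(G))   # fraction of shortest paths that pass through each node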