UNIT - I
Big Data
Types of Big Data
Characteristics of Big Data
Growth of Big Data
Sources of Big Data
Risks in Big Data
Big Data
o Big Data is a term used to describe a collection of data that is huge in size and yet
growing exponentially with time.
o A collection of data sets so large and complex that it becomes difficult to process
using on-hand database management tools or traditional data processing applications.
Types of Big Data:
Structured
Unstructured
Semi-structured
Structured Data
o Any data that can be stored, accessed, and processed in a fixed format is termed
'structured' data.
Unstructured Data
o Any data with unknown form or structure is classified as unstructured data.
o Because of its huge size, unstructured data poses multiple challenges in terms of
processing it to derive value from it.
o An example of unstructured data is a heterogeneous data source containing a combination
of simple text files, images, videos, etc.
Semi-structured Data
o Semi-structured data does not conform to a fixed schema, but contains tags or markers
that separate and label its elements.
o Common examples are XML and JSON documents.
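To make the idea concrete, here is a minimal Python sketch (the records and field names are hypothetical): each JSON record labels its own fields, but the records need not share a fixed schema.

import json

# Semi-structured data carries its own tags/keys, but records need not
# share a fixed schema (the second record lacks "email" and "age").
records = [
    '{"name": "Asha", "age": 31, "email": "asha@example.com"}',
    '{"name": "Ravi", "city": "Chennai"}',
]

for raw in records:
    rec = json.loads(raw)                       # parse the JSON text into a dict
    print(rec.get("name"), rec.get("email", "<no email>"))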
Sources of Big Data:
• Users
• Applications
• Systems
• Sensors
Risks in Big Data:
• Organizations will be overwhelmed by the volume of data
– Need the right people to solve the right problems
• Costs can escalate too fast
– It isn't necessary to capture 100% of the data
• Many sources of big data raise privacy concerns
– Self-regulation
– Legal regulation
A Big Data platform is an integrated IT solution for Big Data management that combines
several software systems, software tools, and hardware to provide an easy-to-use system
to enterprises. Leading Big Data platforms include:
a. Hadoop
b. Cloudera
c. Amazon Web Services
d. Hortonworks
e. MapR
f. IBM Open Platform
g. Microsoft HDInsight
h. Intel Distribution for Apache Hadoop
i. Datastax Enterprise Analytics
j. Teradata Enterprise Access for Hadoop
k. Pivotal HD
Conventional System
Comparison of Big Data with Conventional Data
Challenges of Conventional Systems
Challenges of Big Data
Conventional System
o A conventional system is a traditional data-management system, such as a relational
database, designed to store and process well-structured data on fixed, limited
infrastructure.
o Big data is a huge amount of data that is beyond the processing capacity of
conventional database systems to manage and analyze within a specific time
interval.
The following challenges dominate conventional systems in real-time scenarios:
Intelligent Data Analysis (IDA)
– Used for extracting useful information from large quantities of online data, and for
extracting desirable knowledge or interesting patterns from existing databases.
– Goal: the goal of intelligent data analysis is to extract useful knowledge; the process
demands a combination of extraction, analysis, conversion, classification, organization,
reasoning, and so on.
• Data Engineering
• Database mining techniques, tools and applications
• Use of domain knowledge in data analysis
• Big Data applications
• Evolutionary algorithms
• Machine Learning (ML)
• Neural nets
• Fuzzy logic
• Statistical pattern recognition
• Knowledge Filtering
• Post-processing
NATURE OF DATA
Data
Properties of Data
Types of Data
Data Conversion
Data Selection
Data is nothing but facts and statistics, stored or free-flowing over a network;
it is generally raw and unprocessed.
3 Actions on Data:
Capture
Transform
Store
Properties of Data
Clarity
Accuracy
Essence
Aggregation
Compression
Refinement
TYPES OF DATA:
1. Nominal Scales:
Nominal scales classify observations into named categories that have no inherent order
(e.g., gender, colour, department).
2. Ordinal Scales:
At the risk of providing a tautological definition, ordinal scales measure, well, order. So,
our characteristics for ordinal scales are:
– Observations can be ranked in a meaningful order.
– The distances between adjacent ranks are not necessarily equal.
3. Interval Scales:
Interval scales provide insight into the variability of the observations or data. Classic
interval scales are Likert scales (e.g., 1 - strongly agree and 9 - strongly disagree) and
Semantic Differential scales (e.g., 1 - dark and 9 - light).
4. Ratio Scales:
Ratio scales are interval scales with a true zero. They have the following
characteristics:
– Observations can be ranked, and the intervals between values are equal.
– There is a true zero, so ratios of values are meaningful (e.g., 20 kg is twice 10 kg).
The table below summarizes the characteristics of all four types of scales.

Scale    | Named categories | Ordered | Equal intervals | True zero
---------|------------------|---------|-----------------|----------
Nominal  | Yes              | No      | No              | No
Ordinal  | Yes              | Yes     | No              | No
Interval | Yes              | Yes     | Yes             | No
Ratio    | Yes              | Yes     | Yes             | Yes
Data Conversion
We can convert or transform our data from ratio to interval to ordinal to nominal; however,
we cannot convert or transform our data from nominal to ordinal to interval to ratio.
Example:
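A minimal Python sketch of this one-way conversion, using hypothetical ages; the bin boundaries are arbitrary choices for illustration.

# Exact ages (ratio scale) can be binned into ordered categories
# (ordinal scale), but the exact ages cannot be recovered afterwards.
ages = [23, 35, 47, 61, 18, 52]          # hypothetical ratio-scale data

def to_age_group(age):
    if age < 30:
        return "young"
    elif age < 50:
        return "middle-aged"
    return "older"

groups = [to_age_group(a) for a in ages]
print(groups)   # ordinal labels only; the conversion is one-way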
Data Selection
a) Scaled
b) Ordinal
Example:
1. Deployment
2. Business Understanding
3. Data Exploration
4. Data Preparation
5. Data Modeling
6. Data Evaluation
Step 1: Deployment:
– For the further process, we need to gather initial data, describe and explore it,
and verify its quality to ensure it contains the data we require.
– Data collected from the various sources is described in terms of its
application and the need for the project in this phase.
– This is also known as data exploration.
– This step is necessary to verify the quality of the collected data.
Step 4: Data Preparation:
– We need to select data as per the need, clean it, construct it to get useful
information, and then integrate it all.
– Finally, we need to format the data to get the appropriate data.
– Data is selected, cleaned, and integrated into the format finalized for the
analysis in this phase.
Step 5: Data Modeling:
– Select a modeling technique, generate a test design, build a model, and assess the
model built.
– The data model is built to analyze relationships between the various selected
objects in the data.
– Test cases are built for assessing the model, and the model is tested and
implemented on the data in this phase.
Step 6: Data Evaluation
Analytical Tools
– Big data tools for HPC and supercomputing
– MPI
– Big data tools on clouds
– MapReduce model
– Iterative MapReduce model
– DAG model
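To illustrate the MapReduce model named above, here is a minimal in-memory Python sketch of the classic word-count job; it mimics the map, shuffle, and reduce phases but is not Hadoop itself, and the documents are hypothetical.

from collections import defaultdict

documents = ["big data tools", "big data on clouds"]

# Map: emit (key, value) pairs from each input record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the emitted values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: aggregate the values for each key.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)   # {'big': 2, 'data': 2, 'tools': 1, 'on': 1, 'clouds': 1}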
Analysis
Reporting
Differences between Analysis and Reporting
Analysis
The process of exploring data and reports in order to extract meaningful insights, which
can be used to better understand and improve business performance.
Reporting
The process of organizing data into informational summaries in order to monitor how
different areas of a business are performing.
Stream processing – analyzing (and predicting from) data as and when it transpires.
– The key strength of stream processing is that it can provide insights faster,
often within milliseconds to seconds.
– It helps in understanding hidden patterns in millions of data records in real time.
– It translates into processing data from single or multiple sources, in real or near-real
time, applying the desired business logic, and emitting the processed information to
the sink (a toy sketch follows below).
– Stream processing serves multiple roles in today's business arena.
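As a toy illustration of this source → business logic → sink pipeline, here is a minimal Python sketch; the event fields and the threshold rule are hypothetical stand-ins for real business logic.

import random
import time

def event_source():
    # Stand-in for events arriving over time from a sensor or log.
    for _ in range(5):
        yield {"sensor": "s1", "value": random.uniform(0, 100)}
        time.sleep(0.1)

def process(stream, threshold=80.0):
    # The "business logic": flag readings above a threshold as they arrive.
    for event in stream:
        if event["value"] > threshold:
            yield {"alert": "high reading", **event}

# The "sink" here is simply stdout; nothing may print if no value exceeds 80.
for alert in process(event_source()):
    print(alert)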
Real-time data streaming tools are:
c) Kinesis
– Kinesis is an out-of-the-box streaming data tool from Amazon Web Services.
– Kinesis comprises shards, which Kafka calls partitions.
– For organizations that want real-time or near-real-time access to large stores of data,
Amazon Kinesis is a good fit.
– Kinesis Streams solves a variety of streaming data problems.
– One common use is the real-time aggregation of data, followed by loading the
aggregated data into a data warehouse.
– Data is put into Kinesis streams, which ensures durability and elasticity.
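A hedged sketch of putting one record into a Kinesis stream with boto3 (the AWS SDK for Python); it assumes boto3 is installed, AWS credentials are configured, and a stream named "example-clickstream" (a hypothetical name) already exists.

import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

record = {"user_id": "u42", "action": "click", "page": "/home"}

response = kinesis.put_record(
    StreamName="example-clickstream",         # hypothetical stream name
    Data=json.dumps(record).encode("utf-8"),  # payload must be bytes
    PartitionKey=record["user_id"],           # routes the record to a shard
)
print(response["ShardId"], response["SequenceNumber"])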
3. Interactive Analysis - Big Data Tools
The interactive analysis presents the data in an interactive environment, allowing
users to undertake their own analysis of information.
Users are directly connected to the computer and hence can interact with it in
real time.
The data can be reviewed, compared, and analyzed in tabular or graphic format,
or both at the same time.
IA - Big Data Tools:
a) Google's Dremel:
– Dremel is Google's interactive ad-hoc query system for analysis of read-only
nested data.
– It is the technology behind Google BigQuery.
b) Apache Drill:
– Drill is an open-source Apache SQL query engine for Big Data exploration.
– It is similar to Google's Dremel.
Other major Tools:
Fundamental Statistics
Elements in Statistics.
Types of Statistics
Statistics Vs Statistical Analysis
Basic Statistical Operations
Application of Statistical Concepts
Fundamental Statistics
Statistics is the methodology which scientists and mathematicians have developed for
collecting, analyzing, interpreting, and drawing conclusions from data.
Elements in Statistics
1. Experimental unit
• Object upon which we collect data
2. Population
• All items of interest
3. Variable
• Characteristic of an individual experimental unit
4. Sample
• Subset of the units of a population
5. Statistical Inference
• Estimate or prediction or generalization about a population based on information contained
in a sample
6. Measure of Reliability
• Statement (usually qualified) about the degree of uncertainty associated with a statistical
inference
The study of statistics has two major branches: descriptive statistics and inferential
statistics.
Descriptive statistics:
– Methods for organizing, summarizing, and presenting data in an informative way
(e.g., tables, graphs, and summary measures such as the mean).
Inferential statistics: –
– The methods used to determine something about a population on the basis of a sample:
– Population –The entire set of individuals or objects of interest or the measurements obtained
from all individuals or objects of interest
Mean: A measure of central tendency for quantitative data, i.e., the long-term average
value.
Median: A measure of central tendency for quantitative data, i.e., the half-way point.
Mode: The most frequently occurring value (discrete), or where the probability density
function peaks (continuous).
Minimum: The smallest value.
Maximum: The largest value.
Interquartile range: Can be thought of as the middle 50% of the (quantitative) data;
used as a measure of spread.
Variance: Used as a measure of spread; may be thought of as the moment of inertia.
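These operations can be computed directly with Python's standard statistics module (quantiles requires Python 3.8+); the data below are a hypothetical sample.

import statistics

data = [4, 8, 6, 5, 3, 8, 9, 7, 8, 5]   # hypothetical quantitative sample

print("mean:    ", statistics.mean(data))
print("median:  ", statistics.median(data))
print("mode:    ", statistics.mode(data))
print("minimum: ", min(data))
print("maximum: ", max(data))

# Interquartile range: the spread of the middle 50% of the data.
q1, q2, q3 = statistics.quantiles(data, n=4)
print("IQR:     ", q3 - q1)

print("variance:", statistics.variance(data))   # sample variance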
Statistical Concepts :
Application Areas :
• Economics
– Forecasting
– Demographics
• Sports
– Individual & Team Performance
• Engineering
– Construction
– Materials
• Business
– Consumer Preferences
– Financial Trends
Sampling Distribution
Sample
Types of Samples
Examples of Sampling Distribution
Errors on Sampling Distribution.
Sample
– A sample is a subset of the units of a population, selected so that it is
representative of that population.
Types of Samples
1. Stratified Samples
2. Cluster Samples
3. Systematic Samples
4. Convenience Sample
1. Stratified Samples
A stratified sample has members from each segment of a population. This ensures that each
segment from the population is represented.
2. Cluster Samples :
A cluster sample has all members from randomly selected segments of a population. This is
used when the population falls into naturally occurring subgroups
3. Systematic Samples:
A systematic sample is chosen by selecting a random starting point in the population and
then selecting every k-th member after it.
4. Convenience Samples:
A convenience sample consists only of the members of the population that are easy to
reach.
Example:
You are doing a study to determine the number of years of education each teacher at your college
has.
Identify the sampling technique used if you select the samples listed.
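As a small illustration of two of the techniques above, here is a Python sketch using a hypothetical population of 100 teacher IDs.

import random

population = list(range(1, 101))   # hypothetical teacher IDs

# Systematic sample: pick a random start, then every k-th member after it.
k = 10
start = random.randint(0, k - 1)
systematic = population[start::k]

# Convenience sample: simply take the members easiest to reach (the first few).
convenience = population[:10]

print("systematic: ", systematic)
print("convenience:", convenience)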
Re-Sampling
Re-Sampling
Re-Sampling in Statistics
Need for Re-Sampling
Re-Sampling Methods
Re-Sampling
• Re-sampling is the method that consists of drawing repeated samples from the original
data samples.
Re-Sampling in statistics
• In statistics, re-sampling is any of a variety of methods for doing one of the following:
– Estimating the precision of sample statistics (e.g., the bootstrap and jackknife);
– Performing significance tests by exchanging labels on data points (permutation tests);
– Validating models by using random subsets of the data (cross-validation).
• Re-sampling involves the selection of randomized cases with replacement from the
original data sample, in such a manner that each sample drawn has a number of cases
that is similar to the original data sample.
• Due to replacement, the samples drawn by the method of re-sampling may contain
repeated cases.
• Re-sampling generates a unique sampling distribution on the basis of the actual data.
Re-Sampling Methods
• Re-sampling methods are processes of repeatedly drawing samples from a data set and
refitting a given model on each sample, with the goal of learning more about the
fitted model.
• Re-sampling methods can be expensive since they require repeatedly performing the same
statistical methods on N different subsets of the data.
• Re-sampling methods refit a model of interest to samples formed from the training set,
– in order to obtain additional information about the fitted model.
• For example, they provide estimates of test-set prediction error, and the standard deviation
and bias of our parameter estimates.
1. Permutation:
• Permutation tests repeatedly shuffle the labels on the observations and recompute the
test statistic, building a null distribution against which the observed statistic is
compared.
2. Bootstrap:
• The bootstrap is a widely applicable tool that can be used to quantify the uncertainty
associated with a given estimator or statistical method, such as the standard error of
a sample mean.
Bootstrap Types
a) Parametric Bootstrap: samples are drawn from a parametric distribution fitted to
the data.
b) Non-parametric Bootstrap: samples are drawn with replacement directly from the
observed data.
3. Jackknife Method:
• The jackknife systematically leaves out one observation at a time, recomputes the
estimate on the remaining data, and uses the spread of these estimates to assess bias
and variance.
Bootstrap
• Yields slightly different results when repeated on the same data (when estimating the
standard error).
• Not bound to theoretical distributions.
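A minimal sketch of the non-parametric bootstrap, estimating the standard error of the sample mean; the data and the number of replicates are hypothetical choices.

import random
import statistics

data = [12.1, 9.8, 11.4, 10.2, 13.0, 9.5, 10.9, 11.7]   # hypothetical sample
B = 1000                                                  # number of bootstrap samples

boot_means = []
for _ in range(B):
    resample = random.choices(data, k=len(data))   # draw WITH replacement
    boot_means.append(statistics.mean(resample))

# The standard deviation of the bootstrap means estimates the standard error.
print("bootstrap SE of the mean:", statistics.stdev(boot_means))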
4. Cross validation:
• Cross-validation repeatedly splits the data into a training set used to fit the model
and a held-out validation set used to test it; k-fold cross-validation rotates the
held-out fold so that every observation is used for both fitting and testing.
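A bare-bones Python sketch of k-fold cross-validation; to keep it self-contained, the "model" is simply the training-set mean, a deliberately trivial stand-in for a real model, and the data are hypothetical.

import random
import statistics

data = [3.1, 2.9, 3.4, 3.0, 2.7, 3.3, 3.2, 2.8, 3.5, 3.0]
k = 5

# Shuffle the indices and split them into k folds.
indices = list(range(len(data)))
random.shuffle(indices)
folds = [indices[i::k] for i in range(k)]

errors = []
for fold in folds:
    train = [data[i] for i in indices if i not in fold]
    test = [data[i] for i in fold]
    prediction = statistics.mean(train)            # "fit" the trivial model
    errors.extend((y - prediction) ** 2 for y in test)

print("cross-validated MSE:", statistics.mean(errors))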
Statistical Inference
Inference
Statistical Inference
Types of Statistical Inference
Inference:
– An inference is a conclusion reached on the basis of evidence and reasoning.
Statistical Inference:
– Statistical inference is the process of drawing an estimate, prediction, or
generalization about a population based on information contained in a sample.
There are two most common types of statistical inference:
– Confidence Intervals (a small sketch follows below)
– Hypothesis Testing
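As a small illustration of the first of these, here is a Python sketch of a large-sample 95% confidence interval for a population mean, using the usual estimate plus or minus 1.96 standard errors; the measurements are hypothetical.

import math
import statistics

sample = [68, 72, 65, 70, 74, 69, 71, 66, 73, 70]   # hypothetical measurements

n = len(sample)
mean = statistics.mean(sample)
se = statistics.stdev(sample) / math.sqrt(n)        # estimated standard error

# z = 1.96 corresponds to 95% confidence for a large sample.
lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(f"95% CI for the mean: ({lower:.2f}, {upper:.2f})")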
Prediction Error
o A prediction error is the failure of some expected event to occur.
o When predictions fail, humans can use metacognitive functions, examining prior
predictions and failures.
o For example, they can examine whether there are correlations and trends, such as
consistently being unable to foresee outcomes accurately in particular situations.
o Applying that type of knowledge can inform decisions and improve the quality of
future predictions.
– Errors are an inescapable element of predictive analytics that should also be quantified and
presented along with any model, often in the form of a confidence interval that indicates how
accurate its predictions are expected to be.
– Analysis of prediction errors from similar or previous models can help determine
confidence intervals.
– In statistics, the mean squared prediction error (MSE) of a smoothing or curve-fitting
procedure is the expected value of the squared difference between the fitted values
implied by the predictive function and the values of the (unobservable) true function g.
– The MSE is a measure of the quality of an estimator; it is always non-negative, and
values closer to zero are better.
– The Root-Mean-Square Error or Root-Mean-Square Deviation (RMSE or RMSD) is the square
root of the MSE.
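A short Python sketch computing MSE and RMSE for a set of hypothetical observed and predicted values.

import math

observed  = [3.0, 4.5, 5.1, 6.2, 7.8]    # hypothetical actual values
predicted = [2.8, 4.9, 5.0, 6.5, 7.2]    # hypothetical model outputs

# MSE: the average of the squared differences; RMSE: its square root.
squared_errors = [(y - yhat) ** 2 for y, yhat in zip(observed, predicted)]
mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)

print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")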
Prediction Error in Regression
– In regression analysis, the prediction error for an observation is the residual: the
difference between the observed value and the value predicted by the fitted
regression model.