
DS4015 BDA UNIT I KVL Notes

Master of Computer Applications (Anna University)


DS4015 BIG DATA ANALYTICS

UNIT - I INTRODUCTION TO BIG DATA

Introduction to Big Data Platform – Challenges of Conventional Systems – Intelligent Data Analysis – Nature of Data – Analytic Processes and Tools – Analysis vs. Reporting – Modern Data Analytic Tools – Statistical Concepts: Sampling Distributions – Re-Sampling – Statistical Inference – Prediction Error.

UNIT - II SEARCH METHODS AND VISUALIZATION

Search by Simulated Annealing – Stochastic, Adaptive Search by Evolution – Evolution Strategies – Genetic Algorithm – Genetic Programming – Visualization – Classification of Visual Data Analysis Techniques – Data Types – Visualization Techniques – Interaction Techniques – Specific Visual Data Analysis Techniques.

UNIT - III MINING DATA STREAMS

Introduction to Stream Concepts – Stream Data Model and Architecture – Stream Computing – Sampling Data in a Stream – Filtering Streams – Counting Distinct Elements in a Stream – Estimating Moments – Counting Ones in a Window – Decaying Window – Real Time Analytics Platform (RTAP) Applications – Case Studies – Real Time Sentiment Analysis, Stock Market Predictions.

UNIT - IV FRAMEWORKS

MapReduce – Hadoop, Hive, MapR – Sharding – NoSQL Databases – S3 – Hadoop Distributed File Systems – Case Study: Preventing Private Information Inference Attacks on Social Networks – Grand Challenge: Applying Regulatory Science and Big Data to Improve Medical Device Innovation.

UNIT - V R LANGUAGE

Overview; Programming structures: Control statements – Operators – Functions – Environment and scope issues – Recursion – Replacement functions; R data structures: Vectors – Matrices and arrays – Lists – Data frames – Classes; Input/output; String manipulations.


REFERENCE:

1. Michael Berthold, David J. Hand, Intelligent Data Analysis, Springer, 2007.
2. Anand Rajaraman and Jeffrey David Ullman, Mining of Massive Datasets, Cambridge University Press, 3rd edition, 2020.
3. Norman Matloff, The Art of R Programming: A Tour of Statistical Software Design, No Starch Press, USA, 2011.
4. Bill Franks, Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics, John Wiley & Sons, 2012.
5. Glenn J. Myatt, Making Sense of Data, John Wiley & Sons, 2007.


UNIT - I INTRODUCTION TO BIG DATA

Introduction to Big Data Platform – Challenges of Conventional Systems – Intelligent Data Analysis – Nature of Data – Analytic Processes and Tools – Analysis vs. Reporting – Modern Data Analytic Tools – Statistical Concepts: Sampling Distributions – Re-Sampling – Statistical Inference – Prediction Error.

INTRODUCTION TO BIG DATA

Big Data
Types of Big Data
Characteristics of Big Data
Growth of Big Data
Sources of Big Data
Risks in Big Data

Big Data

o Big Data is a term used to describe a collection of data that is huge in size and yet
growing exponentially with time.
o A collection of data sets so large and complex that it becomes difficult to process
using on-hand database management tools or traditional data processing applications.

Examples of Big Data generation include:

– stock exchanges,
– social media sites,
– jet engines, etc.

Types Of Big Data:

Structured
Unstructured
Semi-structured

Structured Data

o Any data that can be stored, accessed, and processed in a fixed format is
termed 'structured' data.



o Data stored in a relational database management system is one example of
'structured' data.

An 'Employee' table in a database is an example of Structured Data

Unstructured Data

o Any data with unknown form or structure is classified as unstructured data.
o Being huge in size, unstructured data poses multiple challenges in terms of
processing it to derive value.
o An example of unstructured data is a heterogeneous data source containing a combination
of simple text files, images, videos, etc.

Example of Unstructured data


The output returned by 'Google Search'

Semi-structured Data

o Semi-structured data can contain both forms of data.

o Semi-structured data appears structured in form, but it is not actually defined with,
for example, a table definition as in a relational DBMS.

An example of semi-structured data is data represented in an XML file.


Personal data stored in an XML file.
<rec>
<name>Prashant Rao</name>
<sex>Male</sex>
<age>35</age>
</rec>
<rec>



<name>Seema R.</name>
<sex>Female</sex>
<age>41</age>
</rec>
<rec>
<name>Satish Mane</name>
<sex>Male</sex>
<age>29</age>
</rec>
<rec>
<name>Subrato Roy</name>
<sex>Male</sex>
<age>26</age>
</rec>
<rec>
<name>Jeremiah J.</name>
<sex>Male</sex>
<age>35</age>
</rec>

Characteristics of Big Data (the 3 Vs)

The three characteristics (Vs) of Big Data are:

 Volume – data quantity
 Velocity – data speed
 Variety – data types

Growth of Big Data:



Storing Big Data

• Analyzing your data characteristics


– Selecting data sources for analysis
– Eliminating redundant data
– Establishing the role of NoSQL
• Overview of Big Data stores
– Data models: key value, graph, document, column-family
– Hadoop Distributed File System (HDFS)
– HBase
– Hive

Processing Big Data

• Integrating disparate data stores


– Mapping data to the programming framework
– Connecting and extracting data from storage
– Transforming data for processing
– Subdividing data in preparation for Hadoop MapReduce
• Employing Hadoop MapReduce
– Creating the components of Hadoop MapReduce jobs
– Distributing data processing across server farms
– Executing Hadoop MapReduce jobs
– Monitoring the progress of job flows

Growth of Big Data is needed


– Increase of storage capacities
– Increase of processing power
– Availability of data(different data types)
– Every day we create 2.5 quintillion bytes of data; 90% of the data in the
world today has been created in the last two years alone

Huge storage need in Real Time Applications


– FB (Facebook) generates 10 TB of data daily
– Twitter generates 7 TB of data daily
– IBM claims 90% of today’s stored data was generated in just the last two years.



How Is Big Data Different?

1) Automatically generated by a machine (e.g. a sensor embedded in an engine)
2) Typically an entirely new source of data (e.g. use of the internet)
3) Not designed to be friendly (e.g. text streams)
4) May not have much value
– Need to focus on the important part

Sources of Big Data

• Users
• Application
• Systems
• Sensors

Risk in Big Data

• Organizations can be overwhelmed by the volume of data
– Need the right people solving the right problems
• Costs can escalate too fast
– It isn’t necessary to capture 100% of the data
• Many sources of big data raise privacy concerns
– Self-regulation
– Legal regulation

INTRODUCTION TO BIG DATA PLATFORM

Big Data Platform


Features of Big Data Platform
List of Big Data Platform

Big Data Platform

A Big Data Platform is an integrated IT solution for Big Data management that combines
several software systems, software tools, and hardware to provide easy-to-use tools
to enterprises.

Features of Big Data Platform

1. It should support linear scale-out
2. It should have the capability for rapid deployment



3. It should support a variety of data formats
4. It should provide data analysis and reporting tools
5. It should provide real-time data analysis software
6. It should have tools for searching through large data sets

List of BigData Platforms

a. Hadoop
b. Cloudera
c. Amazon Web Services
d. Hortonworks
e. MapR
f. IBM Open Platform
g. Microsoft HDInsight
h. Intel Distribution for Apache Hadoop
i. Datastax Enterprise Analytics
j. Teradata Enterprise Access for Hadoop
k. Pivotal HD

CHALLENGES OF CONVENTIONAL SYSTEM

Conventional System
Comparison of Big Data with Conventional Data
Challenges of Conventional System.
Challenges of Big Data

Conventional System

o A conventional system is a traditional data management system, such as a relational
database, that stores and processes structured data using a fixed schema.
o Big data is a huge amount of data which is beyond the processing capacity of
conventional database systems to manage and analyze within a specific time
interval.


Comparison of Big Data with Conventional Data

List of challenges of Conventional Systems:

The following challenges dominate in the case of conventional systems in real-time
scenarios:



1. Uncertainty of the data management landscape
2. The Big Data talent gap that exists in the industry
3. Getting data into the big data platform
4. The need for synchronization across data sources
5. Getting important insights through the use of Big Data analytics

Big Data Challenges

– The challenges include capture, curation, storage, search, sharing, transfer,
analysis, and visualization.

Challenges of Big Data:

1. Dealing with outliers
2. Addressing data quality
3. Understanding the data
4. Visualization helps organizations perform analyses
5. Meeting the need for speed
6. Dealing with an increasing degree of granularity
7. Displaying meaningful results

INTELLIGENT DATA ANALYSIS

Intelligent Data Analysis


Benefits of Intelligent Data Analysis
Intelligent Data Analysis – Knowledge Acquisition
Evaluation of Intelligent Data Analysis Results

Intelligent Data Analysis (IDA)

– used for extracting useful information from large quantities of online data;
extracting desirable knowledge or interesting patterns from existing databases;

– interdisciplinary study concerned with the effective analysis of data;

Goal: The goal of intelligent data analysis is to extract useful knowledge; the process
demands a combination of extraction, analysis, conversion, classification, organization,
reasoning, and so on.



Uses / Benefits of IDA:

• Data Engineering
• Database mining techniques, tools and applications
• Use of domain knowledge in data analysis
• Big Data applications
• Evolutionary algorithms
• Machine Learning(ML)
• Neural nets
• Fuzzy logic
• Statistical pattern recognition
• Knowledge Filtering and
• Post-processing

Intelligent Data Analysis: Knowledge Acquisition

The process of eliciting, analyzing, transforming, classifying, organizing and integrating


knowledge and representing that knowledge in a form that can be used in a computer
system. Knowledge in a domain can be expressed as a number of rules

A Rule : A formal way of specifying a recommendation, directive, or strategy, expressed


as "IF premise THEN conclusion" or "IF condition THEN action".

Evaluation of IDA results:

• Absolute& relative accuracy


• Sensitivity& specificity
• False positive & false negative
• Error rate
• Reliability of rule

NATURE OF DATA

Data
Properties of Data
Types of Data
Data Conversion
Data Selection



Data:

Data is a set of values of qualitative or quantitative variables; restated, pieces of data


are individual pieces of information.

Data is measured, collected, reported, and analyzed, whereupon it can be
visualized using graphs or images.

Data is nothing but facts and statistics stored or flowing freely over a network;
generally it is raw and unprocessed.

When data are processed, organized, structured or presented in a given context so as to


make them useful, they are called Information.

3 Actions on Data:
Capture
Transform
Store
Properties of Data

 Clarity
 Accuracy
 Essence
 Aggregation
 Compression
 Refinement

TYPES OF DATA:



1. Nominal scales:

Measure categories and have the following characteristics:

• Order: The order of the responses or observations does not matter.


• Distance: Nominal scales do not hold distance. The distance between a 1 and a 2 is
not the same as a 2 and 3.
• True Zero: There is no true or real zero. In a nominal scale, zero is
uninterpretable.
• Appropriate statistics for nominal scales: mode, count, frequencies
• Displays: histograms or bar charts

2. Ordinal Scales:

At the risk of providing a tautological definition, ordinal scales measure, well, order. So,
our characteristics for ordinal scales are:

 Order: The order of the responses or observations matters.


 Distance: Ordinal scales do not hold distance. The distance between first and
second is unknown as is the distance between first and third along with all
observations.
 True Zero: There is no true or real zero. An item, observation, or category
cannot finish in zeroth place.
 Appropriate statistics for ordinal scales: count, frequencies, mode
 Displays: histograms or bar charts

3 .Interval Scales:

Interval scales provide insight into the variability of the observations or data. Classic
interval scales are Likert scales (e.g., 1 - strongly agree and 9 - strongly disagree) and
Semantic Differential scales (e.g., 1 - dark and 9 - light).

 Order: The order of the responses or observations does matter.


 Distance: Interval scales do offer distance.
 True Zero: There is no true zero with interval scales.
 Appropriate statistics for interval scales: count, frequencies, mode, median,
mean, standard deviation (and variance), skewness, and kurtosis.
 Displays: histograms or bar charts, line charts, and scatter plots.



4. Ratio Scales:

Ratio scales appear as interval scales with a true zero. They have the following
characteristics:

– Order: The order of the responses or observations matters.
– Distance: Ratio scales do have an interpretable distance.
– True Zero: There is a true zero.
– Appropriate statistics for ratio scales: count, frequencies, mode, median, mean,
standard deviation (and variance), skewness, and kurtosis.
– Displays: histograms or bar charts, line charts, and scatter plots.

The table below summarizes the characteristics of all four types of scales.

Scale    | Order matters | Distance | True zero | Appropriate statistics                                        | Displays
Nominal  | No            | No       | No        | mode, count, frequencies                                      | histograms, bar charts
Ordinal  | Yes           | No       | No        | count, frequencies, mode                                      | histograms, bar charts
Interval | Yes           | Yes      | No        | count, frequencies, mode, median, mean, SD, skewness, kurtosis | histograms, bar charts, line charts, scatter plots
Ratio    | Yes           | Yes      | Yes       | count, frequencies, mode, median, mean, SD, skewness, kurtosis | histograms, bar charts, line charts, scatter plots
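The scale types above map naturally onto R's data types. The following is a minimal R sketch (the variable names and values are hypothetical, not from the notes) showing how each scale is commonly represented and which summaries are appropriate:

# Hypothetical observations for each scale type
colour <- factor(c("red", "blue", "red", "green"))            # nominal: unordered categories
place  <- factor(c("1st", "2nd", "3rd", "1st"),
                 levels = c("1st", "2nd", "3rd"),
                 ordered = TRUE)                               # ordinal: ordered categories
rating <- c(1, 5, 7, 9)                                        # interval: Likert-style, no true zero
speed  <- c(60.0, 72.5, 80.0, 55.4)                            # ratio: true zero (miles per hour)

table(colour)              # nominal: counts / frequencies, mode
median(as.integer(place))  # ordinal: median rank
mean(rating); sd(rating)   # interval: mean and standard deviation
mean(speed); sd(speed)     # ratio: mean and standard deviation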

Data Conversion

We can convert or transform our data from ratio to interval to ordinal to nominal; however,
we cannot convert or transform our data from nominal to ordinal to interval to ratio.

 Scaled data can be measured in exact amounts.

For example, 60 degrees , 12.5 feet, 80 Miles per hour

 Scaled data can be measured with equal intervals.

For example, between 0 and 1 is 1 inch; between 13 and 14 is also 1 inch.

Ordinal or ranked data provides comparative amounts.

Example:


1st Place    2nd Place    3rd Place

 Not equal intervals:

1st Place    2nd Place    3rd Place
19.6 feet    18.2 feet    12.4 feet

Data Selection

Example – Average Driving Speed

a) Scaled
b) Ordinal

Scaled – Speed: speed can be measured in exact amounts with equal intervals.

Example :

60 degrees 12.5 feet 80 Miles per hour

 Ordinal or ranked data provides comparative amounts.

For example, 1st Place 2nd Place 3rd Place

ANALYTIC PROCESS AND TOOLS:

There are 6 analytic processes:

1. Deployment
2. Business Understanding
3. Data Exploration
4. Data Preparation
5. Data Modeling
6. Data Evaluation


Step 1: Deployment :

– Plan the deployment, monitoring, and maintenance; we need to produce a
final report and review the project.
– In this phase, we deploy the results of the analysis; this is also known as
reviewing the project.

Step 2: Business Understanding :

– The very first step consists of business understanding.


– Whenever any requirement occurs, we first need to determine the business
objective, assess the situation, determine data mining goals, and then
produce the project plan as per the requirement.
– Business objectives are defined in this phase.

Step 3: Data Exploration :

The second step consists of Data understanding.

– For the further process, we need to gather initial data, describe and explore the data
and verify data quality to ensure it contains the data we require.
– Data collected from the various sources is described in terms of its
application and the need for the project in this phase.
– This is also known as data exploration.
– This is necessary to verify the quality of data collected.


Step 4: Data Preparation:

– We need to select data as per the need, clean it, construct it to get useful
information, and then integrate it all.
– Finally, we need to format the data to get the appropriate data.
– Data is selected, cleaned, and integrated into the format finalized for the
analysis in this phase.

Step 5: Data Modeling:

– Select a modeling technique, generate a test design, build a model, and assess the
model built.
– The data model is built to analyze relationships between the various selected
objects in the data.
– Test cases are built for assessing the model, and the model is tested and
implemented on the data in this phase.
Step 6: Data Evaluation

Where is processing hosted?
• Distributed servers / cloud (e.g. Amazon EC2)
Where is data stored?
– Distributed storage (e.g. Amazon S3)
What is the programming model?
– Distributed processing (e.g. MapReduce)
How is data stored and indexed?
– High-performance schema-free databases (e.g. MongoDB)
What operations are performed on data?
– Analytic / semantic processing

Analytical Tools
– Big data tools for HPC and supercomputing
– MPI
– Big data tools on clouds
– MapReduce model
– Iterative MapReduce model
– DAG model



– Graph model
– Collective model
– Other BDA tools
– SAS
– R
– Hadoop

ANALYSIS AND REPORTING

Analysis
Reporting
Differences between Analysis and Reporting

Analysis

The process of exploring data and reports in order to extract meaningful insights, which
can be used to better understand and improve business performance.
Reporting

Reporting is the process of organizing data into informational summaries, in order to


monitor how different areas of a business are performing.

Differences between Analysis and Reporting



 Reporting translates raw data into information.
 Analysis transforms data and information into insights.
 Reporting shows you what is happening, while analysis focuses on explaining why it is
happening and what you can do about it.

MODERN ANALYTIC TOOLS:


Modern Analytic Tools: Current analytic tools concentrate on three classes:

1. Batch processing tools


2. Stream Processing tools and
3. Interactive Analysis tools.

1. Batch processing system :


Batch Processing System involves :
– collecting a series of processing jobs and carrying them out periodically as a
group (or batch) of jobs.
– It allows a large volume of jobs to be processed at the same time.
– An organization can schedule batch processing for a time when there is little
activity on its computer systems.
– One of the most famous and powerful batch process-based Big Data tools is
Apache Hadoop.
– It provides infrastructures and platforms for other specific Big Data
applications.
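As a purely conceptual illustration of the map and reduce phases (this is plain base R, not the Hadoop API), the following sketch counts words in a small, hypothetical set of text lines:

# Hypothetical input: a few lines of text
lines <- c("big data tools", "big data platforms", "stream and batch tools")

# Map phase: emit one word per record for every line
words <- unlist(lapply(lines, function(l) strsplit(l, " ")[[1]]))

# Shuffle and reduce phase: group by key (the word) and sum the counts
word_counts <- tapply(rep(1, length(words)), words, sum)
print(word_counts)

In Hadoop, the same two phases run in parallel across the machines of a cluster, with the framework handling the grouping (shuffle) between map and reduce.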
2. Stream Processing tools :

– Stream processing analyzes and acts on data as and when it transpires, i.e. while it is in motion.
– The key strength of stream processing is that it can provide insights faster,
often within milliseconds to seconds.
– It helps in understanding hidden patterns in millions of data records in real time.
– It translates into processing of data from single or multiple sources in real or near-
real time, applying the desired business logic and emitting the processed
information to the sink.
– Stream processing serves multiple roles in today’s business arena.
Real time data streaming tools are:



a) Storm
 Storm is a stream processing engine without batch support,
 a true real-time processing framework,
 taking in a stream as an entire 'event' instead of a series of small batches. Apache
Storm is a distributed real-time computation system.
 Its applications are designed as directed acyclic graphs.
b) Apache Flink:
 Apache Flink is an open-source platform:
 a streaming dataflow engine that provides communication, fault tolerance, and
data distribution for computations over data streams.
 Flink is a top-level Apache project; it is a scalable data analytics
framework that is fully compatible with Hadoop.
 Flink can execute both stream processing and batch processing easily.
 Flink was designed as an alternative to MapReduce.

c) Kinesis
 Kinesis is an out-of-the-box streaming data tool.
 Kinesis comprises shards, which Kafka calls partitions.
 For organizations that take advantage of real-time or near real-time access to large
stores of data,
 Amazon Kinesis is great.
 Kinesis Streams solves a variety of streaming data problems.
 One common use is the real-time aggregation of data, which is followed by
loading the aggregated data into a data warehouse.
 Data is put into Kinesis streams.
 This ensures durability and elasticity.
3. Interactive Analysis – Big Data Tools
 The interactive analysis presents – the data in an interactive environment,
– allowing users to undertake their own analysis of information.
 Users are directly connected to – the computer and hence can interact with it in
real time.
 The data can be : – reviewed, compared and analyzed in tabular or graphic
format or both at the same time.
Interactive Analysis – Big Data Tools:

a) Google’s Dremel:



 Dremel is an interactive analysis system proposed by Google in 2010,
 which is scalable for processing nested data.
 Dremel provides a very fast SQL-like interface to the data by using a different
technique from MapReduce.

b) Apache drill:
Apache drill is:
 Drill is an Apache open-source SQL query engine for Big Data
exploration
 It is similar to Google’s Dremel.
Other major Tools:

a) AWS b) Big Data c) Cassandra d) Data Warehousing e) DevOps f) HBase
g) Hive h) MongoDB i) NiFi j) Tableau k) Talend l) ZooKeeper.

Categories of Modern Analytic Tools

a) Big data tools for HPC and supercomputing
• MPI (Message Passing Interface, 1992)
– Provides standardized function interfaces for communication
between parallel processes.
• Collective communication operations
– Broadcast, Scatter, Gather, Reduce, Allgather, Allreduce, Reduce-scatter.
• Popular implementations
– MPICH (2001)
– OpenMPI (2004)
b) Big data tools on clouds
 MapReduce model
 Iterative MapReduce model
 DAG model
 Graph model
 Collective model



STATISTICAL CONCEPTS

Fundamental Statistics
Elements in Statistics.
Types of Statistics
Statistics Vs Statistical Analysis
Basic Statistical Operations
Application of Statistical Concepts

Fundamental Statistics

Statistics is the methodology for collecting, analyzing, interpreting and drawing conclusions
from information.

Statistics is the methodology which scientists and mathematicians have developed for
interpreting and drawing conclusions from collected data.

Statistics provides methods for:

1. Design: Planning and carrying out research studies.


2. Description: Summarizing and exploring data.
3. Inference: Making predictions and generalizing about phenomena represented by the data.

Elements in Statistics

1. Experimental unit
• Object upon which we collect data

2. Population
• All items of interest

3. Variable
• Characteristic of an individual experimental unit

4. Sample
• Subset of the units of a population

• P in Population & Parameter


• S in Sample & Statistic


5. Statistical Inference
• Estimate or prediction or generalization about a population based on information contained
in a sample

6. Measure of Reliability
• Statement (usually qualified) about the degree of uncertainty associated with a statistical
inference

Example for Statistics


o Agricultural problem: Is new grain seed or fertilizer more productive?
o Medical problem: What is the right amount of dosage of drug to treatment?
o Political science: How accurate are the Gallup and other opinion polls?
o Economics: What will be the unemployment rate next year?
o Technical problem: How to improve quality of product?

Types or Branches of Statistics:

The study of statistics has two major branches: descriptive statistics and inferential
statistics.

Descriptive statistics: –

– Methods of organizing, summarizing, and presenting data in an informative way.


– Involves: collecting data, presenting data, and characterizing data



– Purpose: to describe data

Inferential statistics: –

– The methods used to determine something about a population on the basis of a sample:

– Population –The entire set of individuals or objects of interest or the measurements obtained
from all individuals or objects of interest

– Sample – A portion, or part, of the population of interest

Statistics Vs Statistical Analysis

• Statistics :- The science of


– collecting,
– organizing,
– presenting,
– analyzing, and
– interpreting data
to assist in making more effective decisions.

• Statistical analysis: – used to


– manipulate, summarize, and
– investigate data,
so that useful decision-making information results.

Basic Statistical Operations

Mean: A measure of central tendency for quantitative data, i.e. the long-term average
value.
Median: A measure of central tendency for quantitative data, i.e. the half-way point.
Mode: The most frequently occurring value (discrete), or where the probability density
function peaks (continuous).
Minimum: The smallest value.
Maximum: The largest value.
Interquartile range: Can be thought of as the middle 50% of the (quantitative) data,
used as a measure of spread.
Variance: Used as a measure of spread; may be thought of as the moment of inertia.


Standard deviation: A measure of spread, the square root of the variance.
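A short R sketch computing these basic operations on a small hypothetical sample (R has no built-in mode function for data, so a simple one is defined here):

x <- c(12, 15, 15, 18, 20, 22, 22, 22, 27, 31)    # hypothetical sample

mean(x)             # long-term average value
median(x)           # half-way point
stat_mode <- function(v) {                        # most frequently occurring value
  tab <- table(v)
  as.numeric(names(tab)[which.max(tab)])
}
stat_mode(x)
min(x); max(x)      # smallest and largest values
IQR(x)              # interquartile range: middle 50% of the data
var(x)              # variance (measure of spread)
sd(x)               # standard deviation = square root of the variance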

Application of Statistical Concepts and Areas

Statistical Concepts :

• Finance – correlation and regression, index numbers, time series analysis


• Marketing – hypothesis testing, chi-square tests, nonparametric statistics
• Personnel – hypothesis testing, chi-square tests, nonparametric tests
• Operating management – hypothesis testing, estimation, analysis of variance, time series
analysis

Application Areas :

• Economics
– Forecasting
– Demographics
• Sports
– Individual & Team Performance
• Engineering
– Construction
– Materials
• Business
– Consumer Preferences
– Financial Trends

Sampling Distribution

Sample
Types of Samples
Examples of Sampling Distribution
Errors on Sampling Distribution.

Sample

A sample is “a smaller (but hopefully representative) collection of units from a population


used to determine truths about that population”


Types of Samples

1. Stratified Samples
2. Cluster Samples
3. Systematic Samples
4. Convenience Sample

1. Stratified Samples

A stratified sample has members from each segment of a population. This ensures that each
segment from the population is represented.

2. Cluster Samples :

A cluster sample has all members from randomly selected segments of a population. This is
used when the population falls into naturally occurring subgroups

3. Systematic Samples:


A systematic sample is a sample in which each member of the population is assigned a


number. A starting number is randomly selected and sample members are selected at regular
intervals.

4. Convenience Samples: A convenience sample consists only of available members of the


population.

Example:
You are doing a study to determine the number of years of education each teacher at your college
has.
Identify the sampling technique used if you select the samples listed.
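As an illustration of two of these techniques, here is a small R sketch (with hypothetical data, not the samples referred to above) that draws a systematic sample and a stratified sample from a population of teachers:

set.seed(1)
teachers <- data.frame(id   = 1:200,
                       dept = rep(c("CS", "Maths", "Physics", "Chemistry"), each = 50),
                       years_edu = sample(16:24, 200, replace = TRUE))  # hypothetical population

# Systematic sample: random starting point, then every k-th member
k     <- 20
start <- sample(1:k, 1)
systematic_sample <- teachers[seq(start, nrow(teachers), by = k), ]

# Stratified sample: 5 members from each department (segment)
stratified_sample <- do.call(rbind,
  lapply(split(teachers, teachers$dept),
         function(g) g[sample(nrow(g), 5), ]))

nrow(systematic_sample); table(stratified_sample$dept)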

Examples of Sampling Distribution

1) Your sample says that a candidate gets support from 47%.


2) Inferential statistics allow you to say that
– (a) the candidate gets support from 47% of the population
– (b) with a margin of error of +/- 4%
– This means that the support in the population is likely somewhere between 43% and
51%.

Errors on Sampling Distribution

• The margin of error is taken directly from a sampling distribution.

• In the example above, it is the ±4% band around the 47% sample estimate.
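A small R sketch (the numbers are hypothetical) that simulates the sampling distribution of a sample proportion and computes the margin of error for the 47% example:

set.seed(42)
p_true <- 0.47      # hypothetical true support in the population
n      <- 600       # hypothetical sample size

# Sampling distribution: proportion observed in many repeated samples
sample_props <- replicate(10000, mean(rbinom(n, 1, p_true)))
hist(sample_props, main = "Sampling distribution of the sample proportion")

# 95% margin of error for one observed sample proportion
p_hat <- 0.47
margin_of_error <- 1.96 * sqrt(p_hat * (1 - p_hat) / n)   # roughly +/- 4% here
c(lower = p_hat - margin_of_error, upper = p_hat + margin_of_error)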


Re-Sampling

Re-Sampling
Re-Sampling in Statistics
Need for Re-Sampling
Re-Sampling Methods

Re-Sampling

• Re-sampling is:
– the method that consists of drawing repeated samples from the original data
samples.

• The method of Resampling is


– a nonparametric method of statistical inference.

• The method of resampling uses:


– experimental methods, rather than analytical methods, to generate the unique
sampling distribution.

Re-Sampling in statistics

• In statistics, re-sampling is any of a variety of methods for doing one of the following:

– Estimating the precision of sample statistics (medians, variances, percentiles)


– by using subsets of available data (jackknifing) or drawing randomly with
replacement from a set of data points (bootstrapping)


Need for Re-Sampling

• Re-sampling involves:
– the selection of randomized cases with replacement from the original data sample
• in such a manner that each sample drawn has a number of cases
that is similar to the original data sample.
• Due to replacement:
– the drawn number of samples that are used by the method of re-sampling consists of
repetitive cases.

• Re-sampling generates a unique sampling distribution on the basis of the actual data.

• The method of re-sampling uses


– experimental methods, rather than analytical methods, to generate the unique
sampling distribution.

• The method of re-sampling yields


– unbiased estimates as it is based on the unbiased samples of all the possible results
of the data studied by the researcher.

Re-Sampling Methods

– processes of repeatedly drawing samples from a data set and refitting a given model
on each sample with the goal of learning more about the fitted model.
• Re-sampling methods can be expensive since they require repeatedly performing the same
statistical methods on N different subsets of the data.
• Re-sampling methods refit a model of interest to samples formed from the training set,
– in order to obtain additional information about the fitted model.
• For example, they provide estimates of test-set prediction error, and the standard deviation
and bias of our parameter estimates.

There are four major re-sampling methods available and are:


1. Permutation
2. Bootstrap
3. Jackknife
4. Cross validation


1. Permutation:

The term permutation refers to a mathematical calculation of the number of ways a


particular set can be arranged.

Permutation Re-sampling Processes:

Step 1: Collect Data from Control & Treatment Groups


Step 2: Merge samples to form a pseudo population
Step 3: Sample without replacement from the pseudo population to simulate the control and
treatment groups
Step 4: Compute the target statistic for each resample
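A minimal R sketch of these steps, applied to a permutation test of the difference in means between two hypothetical groups:

set.seed(7)
control   <- c(5.1, 4.8, 6.0, 5.5, 5.2)            # Step 1: data from the control group
treatment <- c(6.2, 6.8, 5.9, 7.1, 6.5)            # Step 1: data from the treatment group

observed <- mean(treatment) - mean(control)        # target statistic on the real groups
pooled   <- c(control, treatment)                  # Step 2: merge into a pseudo population

perm_stats <- replicate(10000, {                   # Steps 3-4: resample without replacement
  idx <- sample(length(pooled), length(treatment))
  mean(pooled[idx]) - mean(pooled[-idx])
})

p_value <- mean(abs(perm_stats) >= abs(observed))  # two-sided permutation p-value
p_value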

2. Bootstrap :

• The bootstrap is
– a widely applicable tool that



– can be used to quantify the uncertainty associated with a given estimator or
statistical learning approach, including those for which it is difficult to obtain a measure of
variability.
• The bootstrap generates:
– distinct data sets by repeatedly sampling observations from the original data set.
– These generated data sets can be used to estimate variability in lieu of sampling
independent data sets from the full population.

Bootstrap Types

a) Parametric Bootstrap
b) Non-parametric Bootstrap
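A short R sketch of the non-parametric bootstrap, estimating the standard error of the mean of a small hypothetical sample by repeatedly sampling observations with replacement:

set.seed(123)
x <- c(12, 15, 15, 18, 20, 22, 22, 22, 27, 31)     # hypothetical original sample

boot_means <- replicate(10000, mean(sample(x, replace = TRUE)))

mean(boot_means)                        # bootstrap estimate of the mean
sd(boot_means)                          # bootstrap estimate of the standard error of the mean
quantile(boot_means, c(0.025, 0.975))   # simple 95% percentile confidence interval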

3. Jackknife Method:

Jackknife method was introduced by Quenouille (1949) to estimate the bias of an


estimator.
The method is later shown to be useful in reducing the bias as well as in estimating the
variance of an estimator.
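A minimal R sketch of the jackknife, using leave-one-out estimates of the mean to approximate its bias and standard error (the sample is hypothetical):

x <- c(12, 15, 15, 18, 20, 22, 22, 22, 27, 31)     # hypothetical sample
n <- length(x)

theta_hat  <- mean(x)                               # estimator on the full sample
jack_means <- sapply(1:n, function(i) mean(x[-i]))  # leave-one-out estimates

jack_bias <- (n - 1) * (mean(jack_means) - theta_hat)                     # bias estimate
jack_se   <- sqrt((n - 1) / n * sum((jack_means - mean(jack_means))^2))   # standard error
c(bias = jack_bias, se = jack_se)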

A comparison of the Bootstrap & Jackknife

Bootstrap

– Yields slightly different results when repeated on the same data (when estimating the
standard error)
– Not bound to theoretical distributions



Jackknife

– Less general technique


– Explores sample variation differently
– Yields the same result each time
– Similar data requirement

4. Cross validation:

Cross-validation is a technique used to protect against overfitting in a predictive
model, particularly in a case where the amount of data may be limited.

In cross-validation, you make a fixed number of folds (or partitions) of the data, run
the analysis on each fold, and then average the overall error estimate.
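A compact R sketch of k-fold cross-validation for a simple linear regression on hypothetical data, averaging the mean squared prediction error over the folds:

set.seed(99)
df   <- data.frame(x = runif(100, 0, 10))
df$y <- 3 + 2 * df$x + rnorm(100)                  # hypothetical data

k     <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))   # assign each row to one of k folds

fold_mse <- sapply(1:k, function(i) {
  train <- df[folds != i, ]
  test  <- df[folds == i, ]
  fit   <- lm(y ~ x, data = train)                 # refit the model on each training split
  mean((test$y - predict(fit, newdata = test))^2)  # prediction error on the held-out fold
})

mean(fold_mse)   # cross-validated estimate of the prediction error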

Statistical Inference

Inference
Statistical Inference
Types of Statistical Inference

Inference:

Use a random sample to learn something about a larger population.

There are two ways to make an inference, described below.

Statistical Inference:

The process of making guesses about the truth from a sample.


Statistical inference is the process through which inferences about a population are
made based on certain statistics calculated from a sample of data drawn from that
population.


Types of Statistical Inference

There are Two most common types of Statistical Inference and they are:

– Confidence intervals and


– Tests of significance.

Confidence Intervals

A confidence interval is a range of values within which the population mean μ is expected to lie.

• For a 95% confidence interval, there is a 95% probability that μ will fall within the
range; this probability is the level of confidence.

Test of Significance ( Hypothesis testing):

A statistical method that uses: – sample data to evaluate a hypothesis about a


population parameter.

• A hypothesis is an assumption about the population parameter.


– A parameter is a population mean or proportion.
– The parameter must be identified before analysis.

Hypothesis Testing

• Is also called significance testing


• Tests a claim about a parameter using evidence (data in a sample)
• The technique is introduced by considering a one-sample z test
• The procedure is broken into four steps
• Each element of the procedure must be understood



Hypothesis Testing Steps

A. Null and alternative hypotheses


B. Test statistic
C. P-value and interpretation
D. Significance level (optional)
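A small R sketch of these four steps for a one-sample z test with a 95% confidence interval; the sample, the null value, and the known population standard deviation are all hypothetical (base R has no built-in z test, so the statistic is computed directly):

x     <- c(102, 98, 110, 105, 99, 104, 108, 101, 107, 103)  # hypothetical sample
mu0   <- 100         # A. null hypothesis: population mean equals 100
sigma <- 5           # assumed known population standard deviation
n     <- length(x)

z       <- (mean(x) - mu0) / (sigma / sqrt(n))              # B. test statistic
p_value <- 2 * pnorm(-abs(z))                               # C. two-sided p-value
ci_95   <- mean(x) + c(-1, 1) * 1.96 * sigma / sqrt(n)      # 95% confidence interval for the mean

c(z = z, p = p_value); ci_95                                # D. compare p-value with the significance level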

Prediction Error

Error in Predictive Analysis


Prediction Error in Statistics
Prediction Error in Regression

Prediction Error
o A prediction error is the failure of some expected event to occur.
o Errors are an inescapable element of predictive analytics that should also be
quantified and presented along with any model, often in the form of a confidence
interval that indicates how accurate its predictions are expected to be.
o When predictions fail, humans can use metacognitive functions, examining prior
predictions and failures, for example, to see whether there are correlations and trends,
such as consistently being unable to foresee outcomes accurately in particular situations.
o Applying that type of knowledge can inform decisions and improve the quality of
future predictions.

Error in Predictive Analysis

– Errors are an inescapable element of predictive analytics that should also be quantified and
presented along with any model, often in the form of a confidence interval that indicates how
accurate its predictions are expected to be.
– Analysis of prediction errors from similar or previous models can help determine
confidence intervals.

Prediction Error in Statistics


1. Standard Error of the Estimate

The standard error of the estimate is a measure of the accuracy of predictions.



Recall that the regression line is the line that minimizes the sum of squared deviations
of prediction (also called the sum of squares error).

2. Mean squared prediction error

– In statistics the mean squared prediction error or mean squared error of the predictions of a
smoothing or curve fitting procedure is the expected value of the squared difference between
the fitted values implied by the predictive function and the values of the (unobservable)
function g.
– The MSE is a measure of the quality of an estimator: it is always non-negative, and values
closer to zero are better.
– The root-mean-square error or root-mean-square deviation (RMSE or RMSD) is the square
root of the MSE.

Prediction Error in Regression

(Figure: regressions differing in accuracy of prediction.)


The standard error of the estimate is a measure of the accuracy of predictions.
Recall that the regression line is the line that minimizes the sum of squared deviations of
prediction (also called the sum of squares error).
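A short R sketch on hypothetical data showing the standard error of the estimate and the root-mean-square error for a fitted regression line:

set.seed(11)
x <- runif(50, 0, 10)
y <- 2 + 1.5 * x + rnorm(50, sd = 2)     # hypothetical data

fit <- lm(y ~ x)
res <- resid(fit)                        # deviations of prediction
sse <- sum(res^2)                        # sum of squares error, minimized by the regression line

n      <- length(y)
se_est <- sqrt(sse / (n - 2))            # standard error of the estimate (accuracy of predictions)
rmse   <- sqrt(mean(res^2))              # root-mean-square error
c(standard_error_of_estimate = se_est, RMSE = rmse)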
