
MODULE:I

Data Analytics
INTRODUCTION
The Definition
 Data Analytics (DA) is the process of examining data
sets in order to find trends and draw conclusions
about the information they contain.
 Data analytics is the science of analyzing raw data
to make conclusions about that information.
 Data analytics helps individuals and organizations make sense of data. DA typically analyzes raw data for insights and trends.
 Data analytics helps a business optimize its performance, maximize profit, or make more strategically guided decisions.
MODULE-I DATA ANALYTICS 2
Why Big Data ?

MODULE-I DATA ANALYTICS 3


Why Big Data?
 According to IBM, "90% of the data in the world today was created in the last two years."
 As data continues to grow, so does the need to organize it. Collecting such a huge amount of data would just be a waste of time, effort, and storage space if it could not be put to any logical use.
 The need to sort, organize, analyze, and offer this critical data in a systematic manner has led to the rise of the much-discussed term, Big Data.
MODULE-I DATA ANALYTICS 4
Source of Data
 Sensors used to collect the climate information
 Post to social media sites
 Digital pictures and videos
 Purchase transaction records
 Cell phone GPS signals
 Web logs
 Chat History

MODULE-I DATA ANALYTICS 5


Examples

 Mobile devices (tracking all objects all the time)
 Social media and networks (all of us are generating data)
 Scientific instruments (collecting all sorts of data)
 Sensor technology and networks (measuring all kinds of data)

MODULE-I DATA ANALYTICS 6


Real World Examples
 Consumer product companies and retail
organizations are observing data on social
media websites such as Facebook and Twitter.
 Customer behaviour, preferences, and product perception are analyzed, and the companies can accordingly line up their products to gain profit.

MODULE-I DATA ANALYTICS 7


Real World Examples
 Manufacturers are also monitoring social networks, but with a different goal: they use them to detect after-market support issues before a warranty failure becomes publicly detrimental.
 Financial service organizations are using the data mined from customer interactions to slice and dice their users into finely tuned segments. This enables these financial institutions to create increasingly relevant and sophisticated offers.

MODULE-I DATA ANALYTICS 8


Real World Examples
 Advertising and marketing agencies are tracking social media to understand responsiveness to campaigns, promotions, and other advertising media.
 Insurance companies are using data analysis to
see which home insurance applications can be
immediately processed and which ones need
a validating in-person visit from an agent.

MODULE-I DATA ANALYTICS 9


Real World Examples
 Hospitals are analyzing medical data and
patient records to predict those patients that
are likely to seek readmission within a few
months of discharge. The hospital can then
intervene in hopes of preventing another costly
hospital stay.
 Health Bands/ Personal Fitness Device

MODULE-I DATA ANALYTICS 10


Real World Examples
Google Analytics
 Google Analytics is a free web analytics tool offered by Google to help you analyze your website traffic.
 If you are running any marketing activities such as search ads or social media ads, your users are most likely going to visit your website somewhere along their user journey.
 Google Analytics is a free tool that can help you track your digital marketing effectiveness.

MODULE-I DATA ANALYTICS 11


Google Analytics (Customer Behaviour Analytics)
 Google Analytics puts several lines of tracking code (scripts) into the code of your website. The code records various activities of your users when they visit your website, along with the attributes (such as age, gender, interests) of those users. It then sends all that information to the GA (Google Analytics) server once the user exits your website.
 Metrics collected include the number of users, bounce rates, average session durations, items added to cart, goal completions, pages per session, impulsive buying, etc.
Next, Google Analytics aggregates the data collected from your website in multiple ways, primarily at four levels (these feed reports such as Audience, Acquisition, and Behaviour):
 User level (related to actions by each user)
 Session level (each individual visit, e.g. sessions and bounce rate)
 Page-view level (each individual page visited)
 Event level (button clicks, video views, etc.)
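
The aggregation levels above can be illustrated with a small, self-contained Python sketch. This is not Google Analytics' actual implementation; the hit records, field names, and the one-page bounce rule are assumptions made for illustration. It rolls raw page-view/event hits up into session-level metrics such as bounce rate and average session duration.

from collections import defaultdict

# Hypothetical raw hits: (user_id, session_id, timestamp_seconds, hit_type)
hits = [
    ("u1", "s1", 0,  "pageview"),
    ("u1", "s1", 40, "pageview"),
    ("u1", "s1", 70, "event"),      # e.g. an add-to-cart click
    ("u2", "s2", 10, "pageview"),   # single-page visit, counted as a bounce
]

# Session level: group hits by session id.
sessions = defaultdict(list)
for user, session, ts, hit_type in hits:
    sessions[session].append((ts, hit_type))

durations, bounces = [], 0
for session, session_hits in sessions.items():
    times = [ts for ts, _ in session_hits]
    pageviews = sum(1 for _, kind in session_hits if kind == "pageview")
    durations.append(max(times) - min(times))
    if pageviews <= 1:
        bounces += 1

print("sessions:", len(sessions))                               # 2
print("bounce rate:", bounces / len(sessions))                  # 0.5
print("avg session duration (s):", sum(durations) / len(durations))  # 35.0

The same raw hits could equally be rolled up per user or per page, which is the point of keeping the data at the lowest (event) level.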

MODULE-I DATA ANALYTICS 12


Uber/Ola - Data Analytics
 All of this data is collected, crunched, analyzed, and used to predict everything from the customer's wait time and Estimated Time of Arrival (ETA) to recommending where drivers should place themselves (via heat maps) in order to take advantage of the best fares and the most passengers.

MODULE-I DATA ANALYTICS 13


Amazon Uses Data Analytics
 Personalized Recommendation System
 Book Recommendations from Kindle Highlighting
 One-Click Ordering

MODULE-I DATA ANALYTICS 14


The Vs of Big Data

MODULE-I DATA ANALYTICS 15


Volume

MODULE-I DATA ANALYTICS 16


Volume
The EarthScope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data. It analyzes seismic slips in the San Andreas fault, as well as the plume of magma underneath Yellowstone, and much, much more.

MODULE-I DATA ANALYTICS 17


The 5 Vs

MODULE-I DATA ANALYTICS 18


Variety

MODULE-I DATA ANALYTICS 19


Variety
 Data is generated from internal, external, social, and behavioural sources.
 It comes in different formats, such as images, text, videos, etc.

Data Source | Definition        | Source   | Application
Internal    | Structured data   | CRM, ERP | Supports daily business operations
External    | Unstructured data | Internet | Understand customers, competitors, markets

MODULE-I DATA ANALYTICS 20


Variety

MODULE-I DATA ANALYTICS 21


Structuring Big Data

 In simple terms, structuring is arranging the available data in a format such that it becomes easy to study, analyze, and derive conclusions from it.

MODULE-I DATA ANALYTICS 22


Why is structuring required?
In our daily life, we come across questions like:
‒ How do I use to my advantage the vast amount of data and information I come across?
‒ Which news articles should I read of the thousands I come across?
‒ How do I choose a book of the millions available on my favorite sites or stores?
‒ How do I keep myself updated about new events, sports, inventions, and discoveries taking place across the globe?
MODULE-I DATA ANALYTICS 23
Structuring Big Data
Solutions to these questions can be found in information processing systems.
 Analysis can be based on:
 What you searched for
 What you looked at
 How long you stayed at a particular website
 Structuring data helps in understanding user behaviour, requirements, and preferences to make personalized recommendations for every individual.

MODULE-I DATA ANALYTICS 24


Characteristics of Data

 Composition: deals with the structure of the data, i.e. the source, the granularity, the type, and the nature (static or real-time streaming).
 Condition: deals with the state of the data, i.e. its usability for analysis. Does it require cleaning for further enhancement and enrichment?
 Context: deals with where the data has been generated, why it was generated, how sensitive it is, what the associated events are, and so on.

MODULE-I DATA ANALYTICS 25


Classification of Digital Data
Digital data is classified into the following categories:
 Structured data
 Semi-structured data
 Unstructured data

Approximate percentage distribution of digital data

MODULE-I DATA ANALYTICS 26


Structured Data
 It is defined as data that has a well-defined, repeating pattern; this pattern makes it easier for any program to sort, read, and process the data.
 This data is in an organized form (e.g., in rows and columns) and can be easily used by a computer program.
 Relationships exist between entities of data.
 Structured data:
 Organizes data in a pre-defined format.
 Is stored in a tabular form.
 Is data that resides in fixed fields within a record or file.
 Is formatted data that has entities and their attributes mapped.
 Is used to query and report against predetermined data types.
 Sources: relational databases, multidimensional databases, legacy databases, and flat files.
MODULE-I DATA ANALYTICS 27
Ease with Structured Data
 Insert/Update/Delete: DML operations provide the required ease with data input, storage, access, processing, analysis, etc.
 Security: encryption and tokenization solutions warrant the security of information throughout its life cycle. Organizations are able to retain control and maintain compliance adherence by ensuring that only authorized users are able to decrypt and view sensitive information.
 Indexing: indexing speeds up data retrieval operations at the cost of additional writes and storage space, but the benefits that ensue in search operations are worth the additional writes and storage space.
 Scalability: the storage and processing capabilities of a traditional DBMS can easily be scaled up by increasing the horsepower of the database server.
 Transaction Processing: an RDBMS supports the ACID properties of transactions to ensure accuracy, completeness, and data integrity.
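
As a rough illustration of the ease described above, here is a minimal Python sketch using the standard sqlite3 module (the slide does not name any particular product); the table, column names, and rows are hypothetical. It shows the DML operations and an index on a structured table.

import sqlite3

# In-memory relational database; the table layout is illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Insert / Update / Delete: the DML operations the slide refers to.
conn.execute("INSERT INTO customers VALUES (1, 'Asha', 'Pune')")
conn.execute("INSERT INTO customers VALUES (2, 'Ravi', 'Delhi')")
conn.execute("UPDATE customers SET city = 'Mumbai' WHERE id = 2")
conn.execute("DELETE FROM customers WHERE id = 1")

# Indexing: speeds up retrieval at the cost of extra writes and storage.
conn.execute("CREATE INDEX idx_city ON customers(city)")

print(conn.execute("SELECT * FROM customers WHERE city = 'Mumbai'").fetchall())
# [(2, 'Ravi', 'Mumbai')]
conn.close()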
MODULE-I DATA ANALYTICS 28
MODULE-I DATA ANALYTICS 29
Semi-structured Data
 Semi-structured data, also known as schema-less or self-describing data, refers to a form that does not conform to a data model as in a relational database but has some structure.
 In other words, it cannot be stored consistently in the rows and columns of a database.
 It is not in a form that can be used easily by a computer program.
 Examples: emails, XML, markup languages like HTML, etc. Metadata for this data is available but is not sufficient.
 Sources: XML, JSON, other markup languages, and web data in the form of cookies.

MODULE-I DATA ANALYTICS 30


XML, JSON, BSON format

Source (XML & JSON): http://sqllearnergroups.blogspot.com/2014/03/how-to-get-json-format-through-sql.html


Source (JSON & BSON): http://www.expert-php.fr/mongodb-bson/

MODULE-I DATA ANALYTICS 31


JSON & XML
 JSON (JavaScript Object Notation) example, written as a JavaScript object literal:
 var myObj = {name: "John", age: 31, city: "New York"};
 JSON is used primarily to transmit data between a server and a web application.
 XML (eXtensible Markup Language)
 XML is a software- and hardware-independent tool for storing and transporting data.
 Refer: https://www.w3schools.com/xml/xml_whatis.asp
 Example of XML: see the sketch below.
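
Since the original XML example on this slide was an image, here is a hedged Python sketch using the standard json and xml.etree.ElementTree modules to write the slide's illustrative record in both formats; the XML tag names are assumptions, not taken from the original slide.

import json
import xml.etree.ElementTree as ET

record = {"name": "John", "age": 31, "city": "New York"}

# JSON: the same object as the slide's JavaScript literal.
print(json.dumps(record))
# {"name": "John", "age": 31, "city": "New York"}

# XML: one possible element layout for the same data (tag names are illustrative).
person = ET.Element("person")
for key, value in record.items():
    child = ET.SubElement(person, key)
    child.text = str(value)
print(ET.tostring(person, encoding="unicode"))
# <person><name>John</name><age>31</age><city>New York</city></person>

Both representations carry the same information; JSON is lighter for transmission between a server and a web application, while XML carries the data inside self-describing tags.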

MODULE-I DATA ANALYTICS 32


Characteristics of Semi-structured Data

 Inconsistent structure
 Self-describing (label/value pairs)
 Other schema information is blended with data values
 Data objects may have different attributes that are not known beforehand
MODULE-I DATA ANALYTICS 33
Cookies
 Cookies allow us to understand who has seen which
pages, and to determine the most popular areas of
our web site.
 We also use cookies to store visitors’ preferences,
and to record session information, such as length of
visit.
 Depending on the type of cookie we use, cookies also allow us to make our web site more user friendly; for example, permanent cookies allow us to save your password so that you do not have to re-enter it every time you visit our web site.

MODULE-I DATA ANALYTICS 34


Cookies
 These cookies are used to collect information about how visitors use our site.
 We use the information to compile reports and to help us improve the site. The cookies collect information in an anonymous form, including the number of visitors to the site, where visitors have come to the site from, and the pages they visited.

MODULE-I DATA ANALYTICS 35


Web Data
 It refers to the data that is publicly available on web sites.
 Web data includes documents in PDF, DOC, and plain text, as well as images, music, and videos.
 The most widely used and best-known source of big data today is the detailed data collected from web sites.
 This data is unstructured and inappropriate for access by software applications, and hence is converted to either a semi-structured or structured format that is well suited for both humans and machines.
MODULE-I DATA ANALYTICS 36
Unstructured Data
 Unstructured data is a set of data that might or might not have any logical or repeating patterns and is not recognized in a pre-defined manner.
 About 80 percent of enterprise data consists of unstructured content.
 Unstructured data:
 Typically consists of metadata, i.e. additional information related to the data.
 Comprises inconsistent data, such as data obtained from files, social media websites, satellites, etc.
 Consists of data in different formats such as e-mails, text, audio, video, or images.
 Sources: body of emails; text both internal and external to the organization; chats and text messages; mobile data; social media data; images, audios, and videos.
MODULE-I DATA ANALYTICS 37
Unstructured Data
 The CCTV footage in a supermarket is thoroughly analyzed to identify:
 The routes customers take to navigate through the store
 Customer behaviour at bottlenecks in store traffic
 Places where customers typically halt while shopping
 This unstructured data is combined with structured data comprising the details obtained from the billing counters: products sold, the amount, the nature of payment, etc.
 This helps the management provide a pleasant shopping experience to customers as well as improve sales figures.
MODULE-I DATA ANALYTICS 38
MODULE-I DATA ANALYTICS 39
Challenges associated with Unstructured Data
Working with unstructured data poses certain challenges, which are as follows:
 Identifying the unstructured data that can be processed.
 Sorting, organizing, and arranging unstructured data in different sets and formats.
 Combining and linking unstructured data in a more structured format to derive any logical conclusions out of the available information.
 Costs in terms of storage space and the human resources needed to deal with the exponential growth of unstructured data.
Data Analysis of Unstructured Data
The complexity of unstructured data lies within the language that created it. Human language is quite different from the language used by machines, which prefer structured information. Unstructured data analysis refers to the process of analyzing data objects that do not follow a predefined data model and/or are unorganized. It is the analysis of any data that is stored over time within an organizational data repository without any intent for its orchestration, pattern, or categorization.
MODULE-I DATA ANALYTICS 40
Dealing with Unstructured Data
 Data Mining (DM)
 Natural Language Processing (NLP)
 Text Analytics (TA)
 Noisy Text Analytics

MODULE-I DATA ANALYTICS 41


Velocity (Speed)

MODULE-I DATA ANALYTICS 42


Velocity
 eBay analyzes around 5 million transactions per day in real time to detect and prevent fraud arising from the use of PayPal.
 Other examples include social media messages going viral in minutes and the speed at which credit card transactions are checked for fraudulent activity.
 Big data technology now allows us to analyze data while it is being generated, without ever putting it into databases.

MODULE-I DATA ANALYTICS 43


Real-Time Analytics

MODULE-I DATA ANALYTICS 44


Veracity
 Veracity refers to the messiness or trustworthiness (quality) of data. With many forms of big data, quality and accuracy are less controllable, for example Twitter posts with hashtags, abbreviations, and typos.
 Big data and analytics technology now allows us to work with these types of data. The volumes often make up for the lack of quality or accuracy.

MODULE-I DATA ANALYTICS 45


Veracity
 It refers to the uncertainty of data:
 Is the data correct and consistent?
 Big data is messy in nature (unstructured and semi-structured forms).
 The data must be cleaned before further analysis.

MODULE-I DATA ANALYTICS 46


Value
 All the volumes of fast-moving data of different variety and veracity have to be turned into value!
 This is why value is the one V of big data that matters the most.

MODULE-I DATA ANALYTICS 47


Value
 Value is defined as the usefulness of data for an
enterprise.
 The value characteristic is intuitively related to the
veracity characteristic in that the higher the data
fidelity, the more value it holds for the business.
 Value is also dependent on how long data processing
takes because analytics results have a shelf-life; for
example, a 20 minute delayed stock quote has little to
no value for making a trade compared to a quote that
is 20 milliseconds old.
 Data that has high veracity and can be analyzed quickly has more value to a business.
MODULE-I DATA ANALYTICS 48
The Vs (Extended)

MODULE-I DATA ANALYTICS 49


The 7 Vs

Representation of data: data clustering, or using tree maps, sunbursts, parallel coordinates, circular network diagrams, or cone trees.

MODULE-I DATA ANALYTICS 50


Definition of Big Data

Big Data is high-volume, high-velocity,


and high-variety information assets that
demand cost effective, innovative forms
of information processing for enhanced
insight and decision making.
Source: Gartner IT Glossary

MODULE-I DATA ANALYTICS 51


What is Big Data?
Think of the following:
 Every second, there are around 822 tweets on Twitter.
 Every minute, nearly 510 comments are posted, 293K statuses are updated, and 136K photos are uploaded on Facebook.
 Every hour, Walmart, a global discount department store chain, handles more than 1 million customer transactions.
 Every day, consumers make around 11.5 million payments using PayPal.
In the digital world, data is increasing rapidly because of the ever-increasing use of the internet, sensors, and heavy machines. The sheer volume, variety, velocity, and veracity of such data is signified by the term 'Big Data'.

Structured Data + Semi-structured Data + Unstructured Data = Big Data
MODULE-I DATA ANALYTICS 52


Why Big Data?
More data for analysis results in greater analytical accuracy and greater confidence in the decisions based on the analytical findings. This entails a greater positive impact in terms of enhancing operational efficiencies, reducing cost and time, and innovating on new products, new services, and optimizing existing services.

More data leads to more accurate analysis, which leads to greater confidence in decision making, which in turn leads to greater operational efficiencies, cost reduction, time reduction, new product development, optimized offerings, etc.

MODULE-I DATA ANALYTICS 53


Challenges of Traditional Systems
The main challenge in the traditional approach is that computing systems cannot manage 'Big Data' because of the immense speed and volume at which it is generated. Some of the challenges are:
The traditional approach cannot work on unstructured data efficiently.
The traditional approach is built on top of the relational data model; relationships between the subjects of interest are created inside the system, and the analysis is done based on them. This approach is not adequate for big data.
The traditional approach is batch oriented and needs to wait for nightly ETL (extract, transform and load) and transformation jobs to complete before the required insight is obtained.
Traditional data management, warehousing, and analysis systems fail to analyze this type of data. Due to its complexity, big data is processed with parallelism. Parallelism in a traditional system is achieved through costly hardware like MPP (Massively Parallel Processing) systems.
Inadequate support for aggregated summaries of data.

MODULE-I DATA ANALYTICS 54


Challenges of Traditional Systems
cont’d
Other challenges can be categorized as:
Data Challenges:
 Volume, velocity, veracity, variety
 Data discovery and comprehensiveness
 Scalability

Process challenges
 Capturing Data
 Aligning data from different sources
 Transforming data into suitable form for data analysis
 Modeling data (mathematically, through simulation)

Management Challenges:
 Security
 Privacy
 Governance
 Ethical issues MODULE-I DATA ANALYTICS 55
Evolution of Analytics Scalability
 As the amount of data organizations process continues to increase, the world of big data requires new levels of scalability. Organizations need to update their technology to provide a higher level of scalability.
 Luckily, there are multiple technologies available that address
different aspects of the process of taming big data and making
use of it in analytic processes.
 The technologies are:
 MPP (massively parallel processing)
 Cloud computing (Appendix)
 Grid computing
 MapReduce (Hadoop)

MODULE-I DATA ANALYTICS 56


Traditional Analytics Architecture

[Figure: Database 1 to Database n feed an analytic server through an extract step.]

The heavy processing occurs in the analytic environment. This may even be a PC.

MODULE-I DATA ANALYTICS 57


Modern In-Database Analytics Architecture

[Figure: Database 1 to Database n are consolidated into an Enterprise Data Warehouse (EDW); the analytic server submits requests to the EDW, which does the processing.]

In an in-database environment, the processing stays in the database where the data has been consolidated. EDWs collect and aggregate data from multiple sources, acting as a repository for most or all organizational data to facilitate broad access and analysis. The user's machine just submits the request; it doesn't do the heavy lifting.

MODULE-I DATA ANALYTICS 58


Distributed vs. Parallel Computing
 Parallel computing: a shared-memory system; multiple processors share a single bus and memory unit; data exchange between processors is of the order of Tbps; limited scalability.
 Distributed computing: a distributed-memory system; autonomous computer nodes connected via a network; data exchange between nodes is of the order of Gbps; better scalability and cheaper. Distributed computing in a local network is called cluster computing; distributed computing in a wide-area network is called grid computing.
MODULE-I DATA ANALYTICS 59
EDW & MPP
 Enterprise Data Warehouse: An enterprise data warehouse (EDW) is a database,
or collection of databases, that centralizes a business's information from
multiple sources and applications, and makes it available for analytics and use
across the organization. EDWs can be housed in an on-premise server or in the
cloud. The data stored in this type of digital warehouse can be one of a business’s
most valuable assets, as it represents much of what is known about the business, its
employees, its customers, and more.

 Massively Parallel Processing (MPP): It is a storage structure designed to handle


the coordinated processing of program operations by multiple processors. This
coordinated processing can work on different parts of a program, with each
processor using its own operating system and memory. This allows MPP
databases to handle massive amounts of data and provide much faster analytics
based on large datasets.

MODULE-I DATA ANALYTICS 60


MPP Analytics Architecture

Massively parallel processing (MPP) database systems are the most mature, proven, and widely deployed mechanism for storing and analyzing large amounts of data. An MPP database spreads data out into independent pieces managed by independent storage and central processing unit (CPU) resources. Conceptually, it is like having pieces of data loaded onto multiple network-connected personal computers around a house. The data in an MPP system gets split across a variety of disks managed by a variety of CPUs spread across a number of servers.

[Figure: instead of a single overloaded server, an MPP database breaks the data into independent chunks, each with its own disk and CPU, spread across multiple lightly loaded servers.]
MODULE-I DATA ANALYTICS 61


MPP Database Example

[Figure: a one-terabyte table is split into ten 100-gigabyte chunks; a traditional database queries the one-terabyte table one row at a time, while the MPP database runs 10 simultaneous 100-gigabyte queries.]

An MPP database is based on the principle of SHARE THE WORK!
• An MPP database spreads data out across multiple sets of CPU and disk space.
• This allows much faster query execution, since many independent smaller queries run simultaneously instead of just one big query.
• If more processing power and more speed are required, just bolt on additional capacity in the form of additional processing units.
• MPP systems build in redundancy to make recovery easy, and have resource management tools to manage the CPU and disk space.
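
A toy Python sketch of the "share the work" idea (not a real MPP engine): the same aggregate query runs over independent chunks in parallel using the standard multiprocessing module, and the partial results are combined. The chunk contents and the query are made up.

from multiprocessing import Pool

# Ten hypothetical chunks (stand-ins for the 100-gigabyte pieces),
# here just small lists of sales amounts.
chunks = [[i, i + 1, i + 2] for i in range(0, 30, 3)]

def query_chunk(chunk):
    # Each worker scans only its own chunk, like one MPP processing unit.
    return sum(chunk)

if __name__ == "__main__":
    with Pool(processes=5) as pool:
        partials = pool.map(query_chunk, chunks)   # chunks scanned concurrently
    print("total:", sum(partials))                 # combine partial results -> 435

In a real MPP database the splitting is by disk and CPU across servers rather than by Python processes, but the pattern of independent scans followed by a combine step is the same.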
MODULE-I DATA ANALYTICS 62
MPP Database Example cont'd

[Figure: an MPP system breaks the job into pieces and allows the different sets of CPU and disk to run the process concurrently: a parallel process instead of a single-threaded process.]
MODULE-I DATA ANALYTICS 63
Grid Computing
 Grid Computing can be defined as a network of computers working
together to perform a task that would rather be difficult for a single
machine.
 The task that they work on may include analyzing huge datasets or
simulating situations which require high computing power.
 Computers on the network contribute resources like processing
power and storage capacity to the network.
 Grid Computing is a subset of distributed computing, where a virtual supercomputer comprises machines on a network connected by some bus, mostly Ethernet or sometimes the Internet.
 It can also be seen as a form of parallel computing where instead of
many CPU cores on a single machine, it contains multiple cores
spread across various locations.

MODULE-I DATA ANALYTICS 64


How Grid Computing works?

MODULE-I DATA ANALYTICS 65


Hadoop
Hadoop is an open-source project of the Apache Software Foundation. Apache Hadoop is written in Java and is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data and uses Google's MapReduce and the Google File System as its foundation.

Hadoop
Apache open-source software framework

Inspired by:
- Google MapReduce
- Google File System

Hadoop provides various tools and technologies, collectively termed as Hadoop


ecosystem, to enable development and deployment of Big Data solutions. It
accomplishes two tasks namely i) Massive data storage, and ii) Faster data
processing. 66
Flood of Data / Sources of Big Data
A few statistics to get an idea of how much data gets generated every day, every minute, and every second.
Every day
 NYSE generates 1.5 billion shares and trade data
 Facebook stores 2.7 billion comments and likes
 Google processes about 24 petabytes of data
Every minute
 Facebook users share nearly 2.5 million pieces of content
 Amazon generates over $80,000 in online sales
 Twitter users tweet nearly 300,000 times
 Instagram users post nearly 220,000 new photos
 Apple users download nearly 50,000 apps
 Email users send over 2,000 million messages
 YouTube users upload 72 hours of new video content
Every second
 Banking applications process more than 10,000 credit card transactions
67
Data Challenges
To process, analyze, and make sense of these different kinds of data, a system is needed that scales and addresses the challenges shown below:

"I am flooded with data." How to store terabytes of mounting data?
"I have data from various sources, rich in variety: structured, semi-structured, and unstructured." How to work with data that is so very different?
"I need this data to be processed quickly. My decision is pending." How to access the information quickly?

68
Why Hadoop
Hadoop has the capability to handle massive amounts of data and different categories of data, fairly quickly.
Considerations

69
Hadoop History

Hadoop was created by Doug Cutting, the creator of Apache Lucene (a text search library). Hadoop began as part of Apache Nutch (an open-source web search engine), itself a part of the Lucene project. The name Hadoop is not an acronym; it's a made-up name.
70
Key Aspects of Hadoop

71
Hadoop Components

72
Hadoop Components cont’d
Hadoop Core Components:
 HDFS
 Storage component
 Distributed data across several nodes
 Natively redundant
 MapReduce
 Computational Framework
 Splits a task across multiple nodes
 Process data in parallel

Hadoop Ecosystems: These are support projects to enhance the functionality


of Hadoop Core components. The projects are as follows:
 Hive  Flume  HBase
 Pig  Oozie
 Sqoop  Mahout
73
Hadoop Ecosystem

Data Management

Data Access

Data Processing

Data Storage
74
Versions of Hadoop
There are 3 versions of Hadoop available: Hadoop 1.x, Hadoop 2.x, and Hadoop 3.x.
YARN (Yet Another Resource Negotiator) is the resource management (allocating resources to various applications) and job/task scheduling technology.

Hadoop 1.x vs. Hadoop 2.x
 Hadoop 1.x: MapReduce handles both data processing and resource management; HDFS provides distributed file storage (redundant, reliable storage).
 Hadoop 2.x: MapReduce and other data processing frameworks handle data processing only; YARN handles resource management; HDFS2 provides distributed file storage (redundant, highly-available, reliable storage).
75
Hadoop 2.x vs. Hadoop 3.x
 Minimum supported Java version: Hadoop 2.x requires Java 7; Hadoop 3.x requires Java 8.
 Fault tolerance: Hadoop 2.x handles it by replication (which wastes space); Hadoop 3.x handles it by erasure coding.
 Data balancing: Hadoop 2.x uses the HDFS balancer; Hadoop 3.x uses the intra-DataNode balancer, invoked via the HDFS disk balancer CLI.
 Storage scheme: Hadoop 2.x uses a 3x replication scheme (e.g., 6 blocks occupy the space of 18 blocks because of replication); Hadoop 3.x supports erasure encoding in HDFS (e.g., 6 blocks occupy the space of 9 blocks: 6 data blocks and 3 parity blocks).
 Scalability: Hadoop 2.x scales up to 10,000 nodes per cluster; Hadoop 3.x scales to more than 10,000 nodes per cluster.
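
A small Python sketch of the arithmetic in the Storage Scheme row above, assuming 3x replication versus a 6-data/3-parity erasure-coding layout as in the example; the numbers are the slide's, not measured values.

data_blocks = 6

# Hadoop 2.x style: every block is stored 3 times.
replication_factor = 3
blocks_with_replication = data_blocks * replication_factor   # 18 blocks

# Hadoop 3.x style erasure coding: 6 data blocks plus 3 parity blocks.
parity_blocks = 3
blocks_with_erasure_coding = data_blocks + parity_blocks     # 9 blocks

print(blocks_with_replication, blocks_with_erasure_coding)   # 18 9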
76
High Level Hadoop 2.0 Architecture
Hadoop has a distributed master-slave architecture. A client interacts with two layers: HDFS for distributed data storage and YARN for distributed data processing.

 HDFS (distributed data storage):
 Master node: Active NameNode, with a Standby NameNode / Secondary NameNode
 Slave nodes: DataNode 1 to DataNode n
 YARN (distributed data processing):
 Master node: Resource Manager
 Slave nodes: Node Manager 1 to Node Manager n


77
High Level Hadoop 2.0 Architecture cont'd

[Figure: YARN, with a Resource Manager coordinating Node Managers; the HDFS cluster, with a NameNode coordinating DataNodes.]

78
Hadoop HDFS
 The Hadoop Distributed File System (HDFS) is the primary data storage
system used by Hadoop applications.
 HDFS holds very large amount of data and employs a NameNode and
DataNode architecture to implement a distributed file system that provides
high-performance access to data across highly scalable Hadoop clusters.
 To store such huge data, the files are stored across multiple machines.
 These files are stored in redundant fashion to rescue the system from
possible data losses in case of failure.
 It’s run on commodity hardware.
 Unlike other distributed systems, HDFS is highly fault-tolerant and
designed using low-cost hardware.

79
Hadoop HDFS Key points
Some key points of HDFS are as follows:
1. Storage component of Hadoop.
2. Distributed File System.
3. Modeled after Google File System.
4. Optimized for high throughput (HDFS leverages large block size and
moves computation where data is stored).
5. One can configure the number of times a file is replicated, which makes it tolerant to both software and hardware failures.
6. Re-replicates data blocks automatically on nodes that have failed.
7. Sits on top of native file system

80
HDFS Physical Architecture
Key components of HDFS are as follows:
1. NameNode 3. Secondary NameNode
2. DataNodes 4. Standby NameNode
Blocks: Generally the user data is stored in the files of HDFS. HDFS breaks a
large file into smaller pieces called blocks. In other words, the minimum
amount of data that HDFS can read or write is called a block. By default
the block size is 128 MB in Hadoop 2.x and 64 MB in Hadoop 1.x, but it can be changed as per need in the HDFS configuration.
Example (200 MB file, abc.txt):
 Hadoop 2.x (128 MB blocks): Block 1 = 128 MB, Block 2 = 72 MB (2 blocks)
 Hadoop 1.x (64 MB blocks): Block 1 = 64 MB, Block 2 = 64 MB, Block 3 = 64 MB, Block 4 = 8 MB (4 blocks)

Why is the block size large?
1. To reduce the cost of seek time, and 2. For proper usage of storage space
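
A small Python sketch of the block-split arithmetic above; the helper function is illustrative only and is not part of HDFS.

def split_into_blocks(file_mb, block_mb):
    """Return the size of each HDFS block for a file of file_mb megabytes."""
    full = file_mb // block_mb
    blocks = [block_mb] * full
    if file_mb % block_mb:
        blocks.append(file_mb % block_mb)   # the last block holds only the remainder
    return blocks

print(split_into_blocks(200, 128))  # Hadoop 2.x: [128, 72]       -> 2 blocks
print(split_into_blocks(200, 64))   # Hadoop 1.x: [64, 64, 64, 8] -> 4 blocks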
81
Rack
A rack is a collection of 30 or 40 nodes that are physically stored close together
and are all connected to the same network switch. Network bandwidth between
any two nodes in rack is greater than bandwidth between two nodes on
different racks. A Hadoop Cluster is a collection of racks.
[Figure: Rack 1 to Rack N, each containing Node 1 to Node N connected to the rack's own switch; the rack switches connect to a common switch.]
82


NameNode
1. NameNode is the centerpiece of HDFS.
2. NameNode is also known as the Master.
3. NameNode only stores the metadata of HDFS – the directory tree of all files in the
file system, and tracks the files across the cluster.
4. NameNode does not store the actual data or the dataset. The data itself is actually
stored in the DataNodes
5. NameNode knows the list of the blocks and its location for any given file in HDFS.
With this information NameNode knows how to construct the file from blocks.
6. NameNode is usually configured with a lot of memory (RAM).
7. NameNode is so critical to HDFS and when the NameNode is down, HDFS/Hadoop
cluster is inaccessible and considered down.
8. NameNode is a single point of failure in Hadoop cluster.
Configuration
Processors: 2 Quad Core CPUs running @ 2 GHz
RAM: 128 GB
Disk: 6 x 1TB SATA
Network: 10 Gigabit Ethernet
83
NameNode Metadata
1. Metadata stored about the file consists of file name, file path, number of
blocks, block Ids, replication level.
2. This metadata information is stored on the local disk. Namenode uses two
files for storing this metadata information.
 FsImage  EditLog
3. NameNode in HDFS also keeps in it’s memory, location of the DataNodes
that store the blocks for any given file. Using that information Namenode
can reconstruct the whole file by getting the location of all the blocks
of a given file.

Example
(File Name, numReplicas, rack-ids, machine-ids, block-ids, …)
/user/in4072/data/part-0, 3, r:3, M3, {1, 3}, …
/user/in4072/data/part-1, 3, r:2, M1, {2, 4, 5}, …
/user/in4072/data/part-2, 3, r:1, M2, {6, 9, 8}, …
84
DataNode
1. DataNode is responsible for storing the actual data in HDFS.
2. DataNode is also known as the Slave
3. NameNode and DataNode are in constant communication.
4. When a DataNode starts up, it announces itself to the NameNode along with the list of blocks it is responsible for.
5. When a DataNode is down, it does not affect the availability of data or the
cluster. NameNode will arrange for replication for the blocks managed
by the DataNode that is not available.
6. DataNode is usually configured with a lot of hard disk space. Because the
actual data is stored in the DataNode.
Configuration
Processors: 2 Quad Core CPUs running @ 2 GHz
RAM: 64 GB
Disk: 12-24 x 1TB SATA
Network: 10 Gigabit Ethernet
85
Secondary NameNode
1. Secondary NameNode in Hadoop is more of a helper to NameNode, it is not
a backup NameNode server which can quickly take over in case of
NameNode failure.
2. EditLog– All the file write operations done by client applications are first
recorded in the EditLog.
3. FsImage– This file has the complete information about the file system
metadata when the NameNode starts. All the operations after that are
recorded in EditLog.
4. When the NameNode is restarted it first takes metadata information
from the FsImage and then apply all the transactions recorded in
EditLog. NameNode restart doesn’t happen that frequently so EditLog
grows quite large. That means merging of EditLog to FsImage at the time of
startup takes a lot of time keeping the whole file system offline during that
process.
5. The Secondary NameNode takes over this job of merging FsImage and EditLog and keeps the FsImage current to save a lot of time. Its main function is to checkpoint the file system metadata stored on the NameNode.
Secondary NameNode cont’d
The process followed by Secondary NameNode to periodically merge the
fsimage and the edits log files is as follows:
1.Secondary NameNode pulls the latest FsImage and EditLog files from the
primary NameNode.
2.Secondary NameNode applies each transaction from EditLog file to FsImage to
create a new merged FsImage file.
3.Merged FsImage file is transferred back to primary NameNode.

1
2
Secondary
NameNode
NameNode
3

It’s been an
hour, provide
your metadata
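
A toy Python sketch of the checkpoint idea (not the real HDFS code): the FsImage is modeled as a snapshot dictionary, the EditLog as a list of operations, and merging replays the log over the snapshot. The paths, metadata fields, and operation names are made up for illustration.

# FsImage snapshot: file path -> metadata (illustrative structure).
fsimage = {"/data/part-0": {"blocks": [1, 3]}}

# EditLog: operations recorded since the last checkpoint.
editlog = [
    ("create", "/data/part-1", {"blocks": [2, 4]}),
    ("delete", "/data/part-0", None),
]

def checkpoint(fsimage, editlog):
    """Replay each logged transaction over the snapshot to get a new FsImage."""
    merged = dict(fsimage)
    for op, path, meta in editlog:
        if op == "create":
            merged[path] = meta
        elif op == "delete":
            merged.pop(path, None)
    return merged

print(checkpoint(fsimage, editlog))   # {'/data/part-1': {'blocks': [2, 4]}}

Doing this merge periodically keeps the EditLog small, so a NameNode restart does not have to replay a huge log.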
87
Standby NameNode
With Hadoop 2.0, built into the platform, HDFS now has automated failover
with a hot standby, with full stack resiliency.
1.Automated Failover: Hadoop pro-actively detects NameNode host and
process failures and will automatically switch to the standby NameNode to
maintain availability for the HDFS service. There is no need for human
intervention in the process – System Administrators can sleep in peace!
2.Hot Standby: Both Active and Standby NameNodes have up to date HDFS
metadata, ensuring seamless failover even for large clusters – which means no
downtime for your HDP cluster!
3.Full Stack Resiliency: The entire Hadoop stack (MapReduce, Hive, Pig,
HBase, Oozie etc.) has been certified to handle a NameNode failure scenario
without losing data or the job progress. This is vital to ensure long running jobs
that are critical to complete on schedule will not be adversely affected during a
NameNode failure scenario.

88
Replication
HDFS provides a reliable way to store huge data in a distributed environment as
data blocks. The blocks are also replicated to provide fault tolerance. The
default replication factor is 3, which is configurable. Therefore, if a file of 128 MB is to be stored in HDFS using the default configuration, it would occupy a space of 384 MB (3 x 128 MB), as the blocks will be replicated three times and each replica will reside on a different DataNode.

89
Rack Awareness
All machines in a rack are connected using the same network switch, and if that network goes down then all machines in that rack will be out of service. Rack Awareness was introduced by Apache Hadoop to overcome this issue. In Rack Awareness, the NameNode chooses a DataNode that is in the same rack or a nearby rack. The NameNode maintains the rack id of each DataNode to obtain this rack information, and chooses DataNodes based on it. The NameNode in Hadoop also ensures that all the replicas are not stored on the same rack or a single rack. The default replication factor is 3. Therefore, according to the Rack Awareness algorithm (see the sketch below):
When the Hadoop framework creates a new block, it places the first replica on the local node, the second one in a different rack, and the third one on a different node in the same remote rack.
When re-replicating a block, if the number of existing replicas is one, place the second on a different rack.
When the number of existing replicas is two, and the two replicas are in the same rack, place the third one on a different rack.
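
A hedged Python sketch of the default placement rule described above for a new block with replication factor 3; the cluster map, node names, and rack names are invented, and this is not the actual HDFS placement code.

import random

# Hypothetical cluster map: rack id -> data nodes on that rack.
cluster = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
    "rack3": ["dn7", "dn8", "dn9"],
}

def place_replicas(local_node, local_rack):
    """First replica on the writer's node, second and third on one remote rack."""
    replicas = [(local_rack, local_node)]
    remote_rack = random.choice([r for r in cluster if r != local_rack])
    second, third = random.sample(cluster[remote_rack], 2)   # two distinct nodes
    replicas.append((remote_rack, second))
    replicas.append((remote_rack, third))
    return replicas

print(place_replicas("dn2", "rack1"))
# e.g. [('rack1', 'dn2'), ('rack3', 'dn7'), ('rack3', 'dn9')]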
90
Rack Awareness & Replication

[Figure: a file with blocks B1, B2, and B3, each replicated three times across DataNodes on Rack 1, Rack 2, and Rack 3, so that no single rack holds all replicas of any block.]


91
Hadoop Ecosystem
Following are the components that collectively form a Hadoop ecosystem:

 HDFS: Hadoop Distributed File System


 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing

 Spark: In-memory data processing


 PIG, HIVE: query based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm libraries
 Solr, Lucene: Searching and indexing
 Zookeeper: Managing cluster
 Oozie: Job Scheduling
 Sqoop: Data transfer between Hadoop and RDBMS or mainframes
 HCatalog: Metadata services
92
Hadoop Ecosystem cont…
PIG
 Pig was developed by Yahoo. It works with Pig Latin, a query-based language similar to SQL.
 It is a platform for structuring the data flow, processing and analyzing huge data
sets.
 Pig does the work of executing commands and in the background, all the activities
of MapReduce are taken care of. After the processing, pig stores the result in
HDFS.
 Pig Latin language is specially designed for this framework which runs on Pig
Runtime. Just the way Java runs on the JVM.
 Pig helps to achieve ease of programming and optimization and hence is a major
segment of the Hadoop Ecosystem.
Hbase
 It’s a NoSQL database which supports all kinds of data and thus capable of handling
anything of Hadoop Database. It provides capabilities of Google’s BigTable, thus
able to work on Big Data sets effectively.
 At times when we need to search or retrieve the occurrences of something small in a huge database, the request must be processed within a short span of time. At such times, HBase comes in handy as it gives us a tolerant way of storing limited data.
93
Hadoop Ecosystem cont…
 HBase is a distributed column-oriented database built on top
of the Hadoop file system. It is an open-source project and is
horizontally scalable.
 HBase is a data model that is similar to Google's BigTable, designed to provide quick random access to huge amounts of structured data.
Storage Mechanism :
 Table is a collection of rows.
 Row is a collection of column families.
 Column family is a collection of columns.
 Column is a collection of key value pairs.

HBase maps (rowkey, column family, column, timestamp) to a value.
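
A minimal Python sketch (not the HBase API) of the mapping described above: (rowkey, column family, column, timestamp) maps to a value, and a read returns the latest version of a cell. The table name, row keys, and values are made up.

# A toy, in-memory stand-in for the HBase data model described above.
table = {}

def put(rowkey, family, column, timestamp, value):
    table[(rowkey, family, column, timestamp)] = value

def get_latest(rowkey, family, column):
    """Return the value with the highest timestamp for this cell, if any."""
    versions = [(ts, v) for (rk, cf, col, ts), v in table.items()
                if (rk, cf, col) == (rowkey, family, column)]
    return max(versions)[1] if versions else None

put("row1", "info", "city", 100, "Pune")
put("row1", "info", "city", 200, "Delhi")   # a newer version of the same cell
print(get_latest("row1", "info", "city"))   # Delhi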

MODULE-I DATA ANALYTICS 94


Hadoop Ecosystem cont…
 HBase is schema-less; it doesn't have the concept of a fixed column schema and defines only column families.

HBase is a database built on top of the HDFS.


HBase provides fast lookups for larger tables (random access).
HBase is horizontally scalable.
MODULE-I DATA ANALYTICS 95
Hadoop Ecosystem cont…
HIVE
 With the help of an SQL-like methodology and interface, HIVE performs reading and writing of large data sets. However, its query language is called HQL (Hive Query Language).
 It is highly scalable as it allows real-time processing and batch processing
both. Also, all the SQL datatypes are supported by Hive thus, making the query
processing easier.
 Similar to the Query Processing frameworks, HIVE too comes with two components:
JDBC Drivers and HIVE Command Line. JDBC, along with ODBC drivers work on
establishing the data storage permissions and connection whereas HIVE Command
line helps in the processing of queries.

Sqoop
 It is a tool designed to transfer data between Hadoop and relational databases.
 It is used to import data from relational databases such as MySQL, Oracle to
Hadoop HDFS, and export from Hadoop file system to relational databases.

96
MapReduce
1. MapReduce is a processing technique and a programming model for distributed computing based on Java. It is built on the divide-and-conquer approach.
2. In MapReduce Programming, the input dataset is split into independent
chunks.
3. It contains two important tasks, namely Map and Reduce.
4. Map takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs). The processing
primitive is called mapper. The processing is done in parallel manner. The
output produced by the map tasks serves as intermediate data and is stored on
the local disk of that server.
5. Reduce task takes the output from a map as an input and combines those
data tuples into a smaller set of tuples. The processing primitive is called
reducer. The input and output are stored in a file system.
6. Reduce task is always performed after the map job.
7. The major advantage of MapReduce is that it is easy to scale data processing
over multiple computing nodes and takes care of other tasks such as scheduling,
monitoring, re-executing failed tasks etc. 97
MapReduce cont’d

98
MapReduce cont’d
 The main advantage is that once we write an application in the MapReduce form, it can be scaled to run over hundreds, thousands, or even tens of thousands of machines in a cluster with only a configuration change.
 MapReduce program executes in three stages: map stage, shuffle &
sorting stage, and reduce stage.
 Map Stage: The map or mapper’s job is to process the input data. Generally
the input data is in the form of file or directory and is stored in the Hadoop
file system (HDFS). The input file is passed to the mapper function line by
line. The mapper processes the data and creates several small chunks of
data.
 Shuffle & Sorting Stage: Shuffle phase in Hadoop transfers the map output
from Mapper to a Reducer in MapReduce. Sort phase in MapReduce covers
the merging and sorting of map outputs.
 Reducer Stage: The Reducer’s job is to process the data that comes from
the mapper. After processing, it produces a new set of output, which will be
stored in the HDFS. 99
MapReduce: The Big Picture

MODULE-I DATA ANALYTICS 100


How Does MapReduce Work?
At the crux of MapReduce are two functions: Map and Reduce. They are
sequenced one after the other.
The Map function takes input from the disk as <key,value> pairs, processes
them, and produces another set of intermediate <key,value> pairs as output.
The Reduce function also takes inputs as <key,value> pairs, and produces
<key,value> pairs as output.
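
A minimal in-process Python sketch of the Map, shuffle and sort, and Reduce flow described above, using word count as the task; it imitates the style of Hadoop Streaming mappers and reducers but does not use the Hadoop API, and the input lines are made up.

from collections import defaultdict

lines = ["big data is big", "data is data"]

# Map: each input line -> a list of intermediate (key, value) pairs.
def map_fn(line):
    return [(word.lower(), 1) for word in line.split()]

# Shuffle & sort: group all intermediate values by key.
groups = defaultdict(list)
for line in lines:
    for key, value in map_fn(line):
        groups[key].append(value)

# Reduce: combine the grouped values into one output pair per key.
def reduce_fn(key, values):
    return key, sum(values)

print(dict(reduce_fn(k, v) for k, v in sorted(groups.items())))
# {'big': 2, 'data': 3, 'is': 2}

In Hadoop itself the shuffle and sort step is performed by the framework between the map tasks and the reduce tasks; here it is written out explicitly only to make the flow visible.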

101
MapReduce Example

MODULE-I DATA ANALYTICS 102


More examples
Draw the MapReduce process to generate the total sales

MODULE-I DATA ANALYTICS 103


More examples

MODULE-I DATA ANALYTICS 104


Example Contd…

MODULE-I DATA ANALYTICS 105


Example Comtinued…

MODULE-I DATA ANALYTICS 106


Working of MapReduce
The types of keys and values differ based on the use case. All inputs and outputs
are stored in the HDFS. While the map is a mandatory step to filter and sort the
initial data, the reduce function is optional.
<k1, v1> -> Map() -> list(<k2, v2>)
<k2, list(v2)> -> Reduce() -> list(<k3, v3>)
Mappers and Reducers are the Hadoop servers that run the Map and Reduce
functions respectively. It doesn’t matter if these are the same or different
servers.
Map: The input data is first split into smaller blocks. Each block is then
assigned to a mapper for processing. For example, if a file has 100 records to be
processed, 100 mappers can run together to process one record each. Or maybe
50 mappers can run together to process two records each. The Hadoop
framework decides how many mappers to use, based on the size of the data to
be processed and the memory block available on each mapper server.
107
Working of MapReduce cont’d
 Reduce: After all the mappers complete processing, the framework shuffles
and sorts the results before passing them on to the reducers. A reducer
cannot start while a mapper is still in progress. All the map output values
that have the same key are assigned to a single reducer, which then
aggregates the values for that key.
Class Exercise 1: Draw the MapReduce process to count the number of words for the input:
Dog Cat Rat
Car Car Rat
Dog car Rat
Rat Rat Rat

Class Exercise 2: Draw the MapReduce process to find the maximum electrical consumption for each year (the year/consumption input table is not reproduced here).

108
Hadoop Limitations
 Not fit for small data: Hadoop does not suit for small data. HDFS lacks the ability
to efficiently support the random reading of small files because of its high capacity
design. The solution to this drawback of Hadoop to deal with small file issue is
simple. Just merge the small files to create bigger files and then copy bigger files to
HDFS.
 Security concerns: Managing a complex application such as Hadoop is challenging. If the person managing the platform does not know how to enable its security features, data can be at huge risk. At the storage and network levels, Hadoop is missing encryption, which is a major point of concern. Hadoop supports Kerberos authentication, which is hard to manage. Spark provides a security bonus to overcome these limitations of Hadoop.
 Vulnerable by nature: Hadoop is entirely written in Java, one of the most widely used languages; hence Java has been heavily exploited by cyber criminals and, as a result, implicated in numerous security breaches.
 No caching: Hadoop is not efficient for caching. In Hadoop, MapReduce cannot
cache the intermediate data in memory for a further requirement which
diminishes the performance of Hadoop. Spark can overcome this limitation.
109
NoSQL
 NoSQL stands for "Not Only SQL" or "Not SQL."
 It is a non-relational database that does not require a fixed schema and avoids joins.
 It is used for distributed data stores and is specifically targeted at big data, for example at Google or Facebook, which collect terabytes of data every day for their users.
 A traditional RDBMS uses SQL syntax to store and retrieve data for further insights. Instead, a NoSQL database system encompasses a wide range of database technologies that can store structured, semi-structured, and unstructured data.
 It adheres to Brewer's CAP theorem.
 The tables are stored as ASCII files, with each field separated by tabs.
 The data scales horizontally.
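
A toy Python sketch (not a real NoSQL engine) of the dynamic-schema idea mentioned above: documents in the same collection need not share fields, and lookups go by key. The collection name, document ids, and fields are made up.

# A toy document store: collection name -> {document id -> document}.
store = {"users": {}}

# Documents in the same collection can have different fields (no fixed schema).
store["users"]["u1"] = {"name": "Asha", "city": "Pune"}
store["users"]["u2"] = {"name": "Ravi", "followers": 120, "tags": ["sports"]}

def get(collection, doc_id):
    return store[collection].get(doc_id)

print(get("users", "u2"))
# {'name': 'Ravi', 'followers': 120, 'tags': ['sports']}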

110
NoSQL cont…

[Figure: a classification tree in which Database branches into RDBMS and NoSQL, with OLAP and OLTP shown beneath.]
111
RDBMS vs. NoSQL
 RDBMS: relational database. NoSQL: non-relational, distributed database.
 RDBMS: relational model. NoSQL: model-less approach.
 RDBMS: pre-defined schema. NoSQL: dynamic schema for unstructured data.
 RDBMS: table-based databases. NoSQL: document-based, graph-based, wide-column store, or key-value pair databases.
 RDBMS: vertically scalable (by increasing system resources). NoSQL: horizontally scalable (by creating a cluster of commodity machines).
 RDBMS: uses SQL. NoSQL: uses UnQL (Unstructured Query Language).
 RDBMS: not preferred for large datasets. NoSQL: largely preferred for large datasets.
 RDBMS: not a best fit for hierarchical data. NoSQL: best fit for hierarchical storage as it follows the key-value pair style of storing data, similar to JSON.
 RDBMS: emphasis on ACID properties. NoSQL: follows Brewer's CAP theorem.
112
RDBMS vs. NoSQL cont'd
 RDBMS: excellent support from vendors. NoSQL: relies heavily on community support.
 RDBMS: supports complex querying and data-keeping needs. NoSQL: does not have good support for complex querying.
 RDBMS: can be configured for strong consistency. NoSQL: a few support strong consistency (e.g., MongoDB); a few others can be configured for eventual consistency (e.g., Cassandra).
 RDBMS examples: Oracle, DB2, MySQL, MS SQL, PostgreSQL, etc. NoSQL examples: MongoDB, HBase, Cassandra, Redis, Neo4j, CouchDB, Couchbase, Riak, etc.

113
MODULE-I DATA ANALYTICS 114
