Module I (Introduction) Part I (1)
Data Analytics
INTRODUCTION
The Definition
Data Analytics (DA) is the process of examining data
sets in order to find trends and draw conclusions
about the information they contain.
Data analytics is the science of analyzing raw data
to draw conclusions about that information.
Data analytics helps individuals and organizations
make sense of data. Data analysts typically analyze raw
data for insights and trends.
Data analytics helps a business optimize its
performance, maximize profit, and make more
strategically guided decisions.
Why Big Data?
Mobile devices (tracking all objects all the time)
Semi-structured data: inconsistent structure; self-describing (label/value pairs); schema information blended with the data values
Examples: network traffic; the places where customers typically halt while shopping
[Diagram: Structured Data + Semi-structured Data + Unstructured Data → Big Data (more data)]
Process challenges
Capturing Data
Aligning data from different sources
Transforming data into a suitable form for data analysis
Modeling data (mathematical modeling, simulation)
Management Challenges:
Security
Privacy
Governance
Ethical issues
Evolution of Analytics Scalability
As the amount of data organizations process continues to
increase, the world of big data requires new levels of
scalability. Organizations need to update their technology to
provide this higher level of scalability.
Luckily, there are multiple technologies available that address
different aspects of the process of taming big data and making
use of it in analytic processes.
The technologies are:
MPP (massively parallel processing)
Cloud computing (Appendix)
Grid computing
MapReduce (Hadoop)
[Diagrams: (1) traditional analytics — the analytic server extracts data from Database 1, Database 2, Database 3; (2) in-database analytics — the analytic server submits a request to the consolidated database (Database 1 … Database n) and receives the consolidated result.]
In an in-database environment, the processing stays in the database where the data
has been consolidated. EDWs (enterprise data warehouses) collect and aggregate data
from multiple sources, acting as a repository for most or all organizational data to
facilitate broad access and analysis.
The user’s machine just submits the request; it doesn’t do the heavy lifting.
Massively parallel processing (MPP) database systems are the most mature, proven, and
widely deployed mechanism for storing and analyzing large amounts of data. An MPP
database spreads data out into independent pieces managed by independent
storage and central processing unit (CPU) resources. Conceptually, it is like having
pieces of data loaded onto multiple network-connected personal computers
around a house. The data in an MPP system gets split across a variety of disks managed
by a variety of CPUs spread across a number of servers.
Instead of a single overloaded database (a single overloaded server), an MPP database
breaks the data into independent chunks with independent disk and CPU, spread across
multiple lightly loaded servers.
[Diagram: a one-terabyte table is split into 100-gigabyte chunks; the MPP system breaks the job into pieces, turning a single-threaded process into a parallel process.]
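To make the chunk-and-parallelize idea concrete, here is a minimal Python sketch (not MPP database code; the table, chunk count, and partial-sum aggregate are all invented for illustration) in which independent workers each process their own chunk and a final step consolidates the partial results:

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # Each worker (think: one MPP server with its own CPU and disk)
    # analyzes only its own chunk of the table, e.g. a partial sum.
    return sum(chunk)

if __name__ == "__main__":
    # Hypothetical large table, here just a list of numeric values.
    table = list(range(1_000_000))

    # Split the table into independent chunks (like the 100 GB pieces).
    n_chunks = 10
    size = len(table) // n_chunks
    chunks = [table[i * size:(i + 1) * size] for i in range(n_chunks)]

    # Parallel process: each chunk is handled by a separate worker.
    with Pool(processes=n_chunks) as pool:
        partial_results = pool.map(process_chunk, chunks)

    # Consolidate the partial results, as the analytic server would.
    print("total =", sum(partial_results))
```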
Grid Computing
Grid computing can be defined as a network of computers working
together to perform a task that would be difficult for a single
machine.
The tasks they work on may include analyzing huge datasets or
simulating situations that require high computing power.
Computers on the network contribute resources like processing
power and storage capacity to the network.
Grid computing is a subset of distributed computing, where a
virtual supercomputer is composed of machines on a network
connected by Ethernet or, sometimes, the Internet.
It can also be seen as a form of parallel computing where, instead of
many CPU cores on a single machine, the cores are spread across
machines in various locations.
Hadoop
Apache open-source software framework
Inspired by:
- Google MapReduce
- Google File System
Why Hadoop
Hadoop can handle massive amounts of data and different categories of data,
fairly quickly.
Considerations
Hadoop History
Hadoop was created by Doug Cutting, the creator of Apache Lucene (a text search
library). Hadoop was originally part of Apache Nutch (an open-source web search
engine) and the Lucene project; much of its early development happened at Yahoo!.
The name Hadoop is not an acronym; it’s a made-up name.
Key Aspects of Hadoop
Hadoop Components
Hadoop Components cont’d
Hadoop Core Components:
HDFS
Storage component
Distributes data across several nodes
Natively redundant
MapReduce
Computational framework
Splits a task across multiple nodes
Processes data in parallel
Data Management
Data Access
Data Processing
Data Storage
Versions of Hadoop
There are 3 versions of Hadoop available: Hadoop 1.x, Hadoop 2.x, and Hadoop 3.x.
YARN (Yet Another Resource Negotiator) is the resource management (allocating resources to various applications) and job/task scheduling technology.
Hadoop 1.x vs. Hadoop 2.x
Hadoop 2.x vs. Hadoop 3.x
Minimum supported Java version: Hadoop 2.x — Java 7; Hadoop 3.x — Java 8.
Fault tolerance: Hadoop 2.x — handled by replication (which wastes space); Hadoop 3.x — handled by erasure coding.
Data balancing: Hadoop 2.x — uses the HDFS balancer; Hadoop 3.x — uses the intra-DataNode balancer, invoked via the HDFS disk balancer CLI.
Storage scheme: Hadoop 2.x — 3x replication scheme (e.g., 6 blocks occupy 18 blocks of space because of replication); Hadoop 3.x — support for erasure coding in HDFS (e.g., 6 blocks occupy 9 blocks of space: 6 data blocks plus 3 for parity); see the sketch after this table.
Scalability: Hadoop 2.x — scales up to 10,000 nodes per cluster; Hadoop 3.x — scales to more than 10,000 nodes per cluster.
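A small Python sketch of the storage arithmetic in the table above; the 6-data/3-parity erasure-coding layout is taken from the table's example, and the helper functions are illustrative, not Hadoop APIs:

```python
import math

def replication_storage(data_blocks, replication_factor=3):
    # Hadoop 2.x style: every block is stored replication_factor times.
    return data_blocks * replication_factor

def erasure_coding_storage(data_blocks, data_units=6, parity_units=3):
    # Hadoop 3.x style: for every group of data_units blocks,
    # parity_units parity blocks are added.
    groups = math.ceil(data_blocks / data_units)
    return data_blocks + groups * parity_units

print(replication_storage(6))     # 18 blocks, as in the table
print(erasure_coding_storage(6))  # 9 blocks: 6 data + 3 parity
```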
High Level Hadoop 2.0 Architecture
Hadoop is a distributed master-slave architecture.
[Diagram: a client interacts with the Hadoop cluster; HDFS (NameNode and DataNodes) provides distributed data storage, and YARN provides distributed data processing.]
Hadoop HDFS
The Hadoop Distributed File System (HDFS) is the primary data storage
system used by Hadoop applications.
HDFS holds very large amounts of data and employs a NameNode and
DataNode architecture to implement a distributed file system that provides
high-performance access to data across highly scalable Hadoop clusters.
To store such huge data, files are spread across multiple machines.
They are stored in a redundant fashion to protect the system from
possible data loss in case of failure.
It runs on commodity hardware.
Unlike other distributed systems, HDFS is highly fault-tolerant and
designed to use low-cost hardware.
Hadoop HDFS Key points
Some key points of HDFS are as follows:
1. Storage component of Hadoop.
2. Distributed File System.
3. Modeled after Google File System.
4. Optimized for high throughput (HDFS leverages large block size and
moves computation where data is stored).
5. The number of times a file is replicated is configurable, which makes HDFS
tolerant of both software and hardware failure.
6. Automatically re-replicates data blocks from nodes that have failed.
7. Sits on top of the native file system.
HDFS Physical Architecture
Key components of HDFS are as follows:
1. NameNode
2. DataNodes
3. Secondary NameNode
4. Standby NameNode
Blocks: User data is stored in the files of HDFS. HDFS breaks a large file into
smaller pieces called blocks; in other words, a block is the minimum amount of
data that HDFS can read or write. By default the block size is 128 MB in
Hadoop 2.x and 64 MB in Hadoop 1.x, but it can be increased as needed by
changing the HDFS configuration.
Example: a 200 MB file abc.txt.
Hadoop 2.x (128 MB blocks): Block 1 — 128 MB, Block 2 — 72 MB.
Hadoop 1.x (64 MB blocks): the same file is split into 64 MB blocks.
(See the sketch below.)
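A short Python sketch of the block-splitting arithmetic, assuming only the block sizes stated above (128 MB for Hadoop 2.x, 64 MB for Hadoop 1.x); the helper function is illustrative, not an HDFS API:

```python
def split_into_blocks(file_size_mb, block_size_mb):
    # Return the sizes of the HDFS blocks a file would occupy.
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

print(split_into_blocks(200, 128))  # Hadoop 2.x: [128, 72]
print(split_into_blocks(200, 64))   # Hadoop 1.x: [64, 64, 64, 8]
```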
Example
(File Name, numReplicas, rack-ids, machine-ids, block-ids, …)
/user/in4072/data/part-0, 3, r:3, M3, {1, 3}, …
/user/in4072/data/part-1, 3, r:2, M1, {2, 4, 5}, …
/user/in4072/data/part-2, 3, r:1, M2, {6, 9, 8}, …
DataNode
1. DataNode is responsible for storing the actual data in HDFS.
2. DataNode is also known as the slave.
3. NameNode and DataNode are in constant communication.
4. When a DataNode starts up, it announces itself to the NameNode along with
the list of blocks it is responsible for.
5. When a DataNode goes down, it does not affect the availability of data or the
cluster. The NameNode will arrange replication for the blocks managed
by the DataNode that is not available.
6. DataNode is usually configured with a lot of hard disk space, because the
actual data is stored on the DataNode.
Configuration
Processors: 2 Quad Core CPUs running @ 2 GHz
RAM: 64 GB
Disk: 12-24 x 1TB SATA
Network: 10 Gigabit Ethernet
Secondary NameNode
1. Secondary NameNode in Hadoop is more of a helper to the NameNode; it is not
a backup NameNode server that can quickly take over in case of
NameNode failure.
2. EditLog– All the file write operations done by client applications are first
recorded in the EditLog.
3. FsImage– This file has the complete information about the file system
metadata when the NameNode starts. All the operations after that are
recorded in EditLog.
4. When the NameNode is restarted, it first takes metadata information
from the FsImage and then applies all the transactions recorded in the
EditLog. NameNode restarts don’t happen frequently, so the EditLog
grows quite large. That means merging the EditLog into the FsImage at
startup takes a lot of time, keeping the whole file system offline during that
process.
5. Secondary NameNode takes over this job of merging the FsImage and EditLog and keeps
the FsImage current, saving a lot of time. Its main function is to checkpoint the file
system metadata stored on the NameNode.
Secondary NameNode cont’d
The process followed by the Secondary NameNode to periodically merge the
FsImage and EditLog files (sketched below) is as follows:
1. Secondary NameNode pulls the latest FsImage and EditLog files from the
primary NameNode.
2. Secondary NameNode applies each transaction from the EditLog file to the FsImage to
create a new merged FsImage file.
3. The merged FsImage file is transferred back to the primary NameNode.
[Diagram: every hour (“It’s been an hour, provide your metadata”), the Secondary NameNode (1) pulls the FsImage and EditLog from the NameNode, (2) merges them, and (3) sends the merged FsImage back.]
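A toy Python sketch of the checkpointing idea, assuming a simplified metadata model (the dictionary-based FsImage and tuple-based EditLog operations are invented for illustration, not the real HDFS formats):

```python
# Toy metadata: path -> list of block ids (purely illustrative).
fsimage = {"/user/data/part-0": [1, 3]}

# EditLog: operations recorded since the FsImage was written.
editlog = [
    ("create", "/user/data/part-1", [2, 4, 5]),
    ("delete", "/user/data/part-0", None),
]

def checkpoint(fsimage, editlog):
    # Merge the EditLog into the FsImage, as the Secondary NameNode does.
    merged = dict(fsimage)                # 1. pull a copy of the FsImage
    for op, path, blocks in editlog:      # 2. replay every transaction
        if op == "create":
            merged[path] = blocks
        elif op == "delete":
            merged.pop(path, None)
    return merged, []                     # 3. new FsImage, empty EditLog

fsimage, editlog = checkpoint(fsimage, editlog)
print(fsimage)   # {'/user/data/part-1': [2, 4, 5]}
```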
Standby NameNode
With Hadoop 2.0, HDFS now has automated failover with a hot standby and
full-stack resiliency built into the platform.
1. Automated Failover: Hadoop proactively detects NameNode host and
process failures and will automatically switch to the Standby NameNode to
maintain availability of the HDFS service. There is no need for human
intervention in the process – system administrators can sleep in peace!
2. Hot Standby: Both the Active and Standby NameNodes have up-to-date HDFS
metadata, ensuring seamless failover even for large clusters – which means no
downtime for your HDP cluster!
3. Full Stack Resiliency: The entire Hadoop stack (MapReduce, Hive, Pig,
HBase, Oozie, etc.) has been certified to handle a NameNode failure scenario
without losing data or job progress. This is vital to ensure that long-running jobs
which are critical to complete on schedule are not adversely affected during a
NameNode failure scenario.
Replication
HDFS provides a reliable way to store huge data in a distributed environment as
data blocks. The blocks are also replicated to provide fault tolerance. The
default replication factor is 3, which is configurable. Therefore, if a 128 MB file
is stored in HDFS using the default configuration, it occupies 384 MB
(3 × 128 MB) of space, as the blocks are replicated three times and
each replica resides on a different DataNode.
Rack Awareness
All machines in a rack are connected through the same network switch, and if that
switch goes down, all machines in that rack are out of service. Rack
Awareness was introduced by Apache Hadoop to overcome this issue. With Rack
Awareness, the NameNode chooses a DataNode in the same rack or a nearby rack.
The NameNode maintains the rack ID of each DataNode to obtain this rack
information, and DataNodes are chosen based on it. The NameNode also ensures
that all the replicas are not stored on the same (single) rack. The default
replication factor is 3. Therefore, according to the rack awareness algorithm
(see the sketch below):
When the Hadoop framework creates a new block, it places the first replica on the
local node, the second replica on a node in a different (remote) rack, and the third
replica on a different node in that same remote rack.
When re-replicating a block, if the number of existing replicas is one, place the
second on a different rack.
When the number of existing replicas is two and the two replicas are in the same
rack, place the third one on a different rack.
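A simplified Python sketch of the default placement rules above, assuming a made-up three-rack topology; real HDFS placement also accounts for node load and available space:

```python
import random

# Hypothetical cluster topology: rack id -> DataNodes on that rack.
topology = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
    "rack3": ["dn7", "dn8", "dn9"],
}

def rack_of(node):
    return next(rack for rack, nodes in topology.items() if node in nodes)

def place_replicas(writer_node):
    # First replica on the local node, second on a node in a remote rack,
    # third on a different node in that same remote rack.
    replicas = [writer_node]
    remote_rack = random.choice([r for r in topology if r != rack_of(writer_node)])
    second = random.choice(topology[remote_rack])
    replicas.append(second)
    third = random.choice([n for n in topology[remote_rack] if n != second])
    replicas.append(third)
    return replicas

print(place_replicas("dn2"))   # e.g. ['dn2', 'dn5', 'dn6']
```

With this layout, the data survives the loss of an entire rack while two of the three replicas stay close together, limiting cross-rack traffic.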
Rack Awareness & Replication
[Diagram: three racks, each with DataNodes DN 1–DN 4; blocks B1, B2, and B3 each have three replicas spread across DataNodes in different racks.]
Sqoop
Sqoop is a tool designed to transfer data between Hadoop and relational databases.
It is used to import data from relational databases such as MySQL and Oracle into
Hadoop HDFS, and to export data from the Hadoop file system to relational databases.
MapReduce
1. MapReduce is a processing technique and a programming model for distributed
computing based on Java. It is built on the divide-and-conquer approach.
2. In MapReduce Programming, the input dataset is split into independent
chunks.
3. It contains two important tasks, namely Map and Reduce.
4. Map takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs). The processing
primitive is called the mapper, and the processing is done in a parallel manner. The
output produced by the map tasks serves as intermediate data and is stored on
the local disk of that server.
5. Reduce task takes the output from a map as an input and combines those
data tuples into a smaller set of tuples. The processing primitive is called
reducer. The input and output are stored in a file system.
6. Reduce task is always performed after the map job.
7. The major advantage of MapReduce is that it is easy to scale data processing
over multiple computing nodes, and it takes care of other tasks such as scheduling,
monitoring, and re-executing failed tasks.
MapReduce cont’d
MapReduce cont’d
The main advantage is that once we write an application in the MapReduce
form, it can be scaled to run over hundreds, thousands, or even tens
of thousands of machines in a cluster with only a configuration change.
A MapReduce program executes in three stages: the map stage, the shuffle &
sort stage, and the reduce stage.
Map Stage: The map or mapper’s job is to process the input data. Generally
the input data is in the form of file or directory and is stored in the Hadoop
file system (HDFS). The input file is passed to the mapper function line by
line. The mapper processes the data and creates several small chunks of
data.
Shuffle & Sorting Stage: Shuffle phase in Hadoop transfers the map output
from Mapper to a Reducer in MapReduce. Sort phase in MapReduce covers
the merging and sorting of map outputs.
Reducer Stage: The Reducer’s job is to process the data that comes from
the mapper. After processing, it produces a new set of output, which will be
stored in the HDFS.
MapReduce: The Big Picture
MapReduce Example
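A classic word-count example, sketched here in plain Python (rather than Hadoop's Java MapReduce API) to show the map, shuffle & sort, and reduce stages on two made-up input lines:

```python
from collections import defaultdict

lines = ["big data needs big tools", "hadoop processes big data"]

# Map stage: each input line is turned into (word, 1) key/value pairs.
def mapper(line):
    return [(word, 1) for word in line.split()]

intermediate = [pair for line in lines for pair in mapper(line)]

# Shuffle & sort stage: sort the intermediate pairs and group them by key.
groups = defaultdict(list)
for word, count in sorted(intermediate):
    groups[word].append(count)

# Reduce stage: combine each group into a smaller set of tuples.
def reducer(word, counts):
    return (word, sum(counts))

result = [reducer(word, counts) for word, counts in groups.items()]
print(result)   # [('big', 3), ('data', 2), ('hadoop', 1), ...]
```

In Hadoop, the framework runs the mapper and reducer on many nodes and performs the shuffle & sort automatically; only the two functions change from problem to problem.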
Hadoop Limitations
Not fit for small data: Hadoop does not suit small data. HDFS lacks the ability
to efficiently support random reading of small files because of its high-capacity
design. A simple workaround for the small-file issue is to merge the small files
into bigger files and then copy the bigger files to HDFS.
Security concerns: Managing a complex application such as Hadoop is challenging. If
whoever is managing the platform does not know how to enable its security features,
data can be at huge risk. At the storage and network levels, Hadoop is missing encryption,
which is a major point of concern. Hadoop supports Kerberos authentication, which
is hard to manage. Spark provides security features that help overcome these
limitations of Hadoop.
Vulnerable by nature: Hadoop is written entirely in Java, one of the most widely
used languages; Java has been heavily exploited by cyber criminals and, as a result,
implicated in numerous security breaches.
No caching: Hadoop is not efficient for caching. In Hadoop, MapReduce cannot
cache intermediate data in memory for further use, which
diminishes Hadoop’s performance. Spark can overcome this limitation.
NoSQL
NoSQL stands for "Not Only SQL" or "Not SQL."
It is a non-relational database that does not require a fixed schema and avoids joins.
It is used for distributed data stores and is specifically targeted at big data, for
example at Google or Facebook, which collect terabytes of data about their
users every day.
A traditional RDBMS uses SQL syntax to store and retrieve data for further insights.
Instead, a NoSQL database system encompasses a wide range of database
technologies that can store structured, semi-structured, and unstructured data.
It adheres to Brewer’s CAP theorem.
In some simple NoSQL stores, tables are stored as ASCII files, with each field separated by tabs.
The data scales horizontally.
NoSQL cont…
[Diagram: databases classified into RDBMS and NoSQL, with OLAP and OLTP workloads.]
RDBMS vs. NoSQL
RDBMS: relational database. NoSQL: non-relational, distributed database.
RDBMS: relational model. NoSQL: model-less approach.
RDBMS: pre-defined schema. NoSQL: dynamic schema for unstructured data.
RDBMS: table-based databases. NoSQL: document-based, graph-based, wide-column store, or key-value pair databases.
RDBMS: vertically scalable (by increasing system resources). NoSQL: horizontally scalable (by creating a cluster of commodity machines).
RDBMS: uses SQL. NoSQL: uses UnQL (Unstructured Query Language).
RDBMS: not preferred for large datasets. NoSQL: largely preferred for large datasets.
RDBMS: not a best fit for hierarchical data. NoSQL: best fit for hierarchical storage as it follows the key-value pair style of storing data, similar to JSON (see the sketch after this table).
RDBMS: emphasis on ACID properties. NoSQL: follows Brewer’s CAP theorem.
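A tiny Python sketch contrasting a fixed relational row with schema-free, JSON-like documents; the collection and field names are invented for illustration:

```python
import json

# RDBMS: every row must follow the same pre-defined schema, e.g.
#   CREATE TABLE users (id INT, name VARCHAR(50), city VARCHAR(50));
relational_row = (1, "Asha", "Pune")
print(relational_row)

# NoSQL document store: documents in the same collection may differ
# in shape, nest freely, and need no schema change to add fields.
users_collection = [
    {"_id": 1, "name": "Asha", "city": "Pune"},
    {"_id": 2, "name": "Ravi", "orders": [{"item": "book", "qty": 2}]},
]
print(json.dumps(users_collection, indent=2))
```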
RDBMS vs. NoSQL cont’d
RDBMS: excellent support from vendors. NoSQL: relies heavily on community support.
RDBMS: supports complex querying and data-keeping needs. NoSQL: does not have good support for complex querying.
RDBMS: can be configured for strong consistency. NoSQL: a few systems support strong consistency (e.g., MongoDB); a few others can be configured for eventual consistency (e.g., Cassandra).
RDBMS examples: Oracle, DB2, MySQL, MS SQL, PostgreSQL, etc. NoSQL examples: MongoDB, HBase, Cassandra, Redis, Neo4j, CouchDB, Couchbase, Riak, etc.