Hadoop
                 | Operational       | Analytical
-----------------+-------------------+-------------------------
Latency          | 1 ms - 100 ms     | 1 min - 100 min
Concurrency      | 1,000 - 100,000   | 1 - 10
Access Pattern   | Writes and Reads  | Reads
Queries          | Selective         | Unselective
Data Scope       | Operational       | Retrospective
End User         | Customer          | Data Scientist
Technology       | NoSQL             | MapReduce, MPP Database
Big Data Challenges
• The major challenges associated with big data are as follows −
• Capturing data
• Curation
• Storage
• Searching
• Sharing
• Transfer
• Analysis
• Presentation
• To address these challenges, organizations typically rely on
enterprise servers.
Introduction to Hadoop
• Hadoop is an open-source framework that allows users to
store and process big data in a distributed
environment across clusters of computers using
simple programming models.
• It is designed to scale up from a single server to
thousands of machines, each offering local
computation and storage.
Introduction to Hadoop
• Hadoop is named after a toy elephant belonging to
developer Doug Cutting's son.
• It was derived from web-indexing software called Nutch. (Nutch
is an open-source web search engine that can be used at
global, local, and even personal scale.)
• Google released two white papers, on the Google File
System (2003) and MapReduce (2004), that inspired Hadoop's design.
• By 2011, Yahoo! was running its search engine across 42,000
Hadoop nodes.
Introduction to Hadoop
• Hadoop is an Apache open-source framework, written in
Java, that allows distributed processing of large datasets
across clusters of computers using simple programming
models.
• A Hadoop framework application works in an
environment that provides distributed storage and
computation across clusters of computers.
Hadoop Architecture
• At its core, Hadoop has two major
layers −
• Computation layer / data processing engine
(MapReduce)
• Storage layer / storage engine (Hadoop
Distributed File System, HDFS)
MapReduce
• MapReduce is a parallel programming model for writing distributed
applications, devised at Google for efficient processing of large
amounts of data (multi-terabyte datasets) on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-
tolerant manner. In Hadoop, MapReduce programs run on top of the
framework's storage layer; the classic word-count job is sketched below.
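
The simplest way to see the model is a word count: the map function tokenizes each line of its input split and emits (word, 1) pairs; the framework groups the pairs by key, and the reduce function sums the counts for each word. Below is a minimal sketch against Hadoop's org.apache.hadoop.mapreduce Java API, following the classic word-count example from the Hadoop documentation; the WordCount class name and the command-line input/output paths are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel on each input split, emitting (word, 1)
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: receives all counts for one word and sums them
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a JAR, such a job would be submitted with something like: hadoop jar wordcount.jar WordCount /input /output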
Hadoop Distributed File System
• The Hadoop Distributed File System (HDFS) is based on the Google
File System (GFS) and provides a distributed file system that is
designed to run on commodity hardware.
• It has many similarities with existing distributed file systems;
however, the differences from other distributed file systems are
significant. HDFS is highly fault-tolerant and is designed to be deployed
on low-cost hardware. It provides high-throughput access to
application data and is suitable for applications with large
datasets.
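
To show what access to HDFS looks like from a client's point of view, here is a minimal sketch using Hadoop's Java FileSystem API: it copies a local file into HDFS and streams it back. The HdfsExample class name and the paths are hypothetical; the client picks up the actual cluster address from fs.defaultFS in its configuration files.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // Loads core-site.xml etc. from the classpath; fs.defaultFS names the cluster
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical paths: copy a local file into HDFS
    Path local = new Path("/tmp/sample.txt");
    Path remote = new Path("/user/demo/sample.txt");
    fs.copyFromLocalFile(local, remote);

    // Stream the file back; blocks are served by whichever DataNodes hold them
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(remote)))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
    fs.close();
  }
}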
Other Hadoop Modules
• Apart from the two core components mentioned above, the Hadoop
framework also includes the following two modules −
• Hadoop Common − the Java libraries and utilities required by
other Hadoop modules.
• Hadoop YARN − a framework for job scheduling and cluster
resource management.
Advantages of Hadoop
• The Hadoop framework allows the user to quickly write and test
distributed systems. It is efficient: it automatically distributes the
data and work across the machines and, in turn, utilizes the underlying
parallelism of the CPU cores.