Lesson 1 - Introduction To Big Data and Hadoop
IBM reported that 2.5 billion gigabytes of data were generated every day in 2012. It was predicted that by 2020:
• About 1.7 megabytes of new information would be generated for every human, every second
• Facebook users would send 31.25 million messages and view 2.77 million videos every minute
Big data refers to large volumes of structured and unstructured data. Analyzing big data leads to better insights for businesses.
Big Data: Case Study
NETFLIX
Netflix is one of the largest providers of commercial streaming video in the US with a customer base of
over 29 million.
Traditionally, the analysis of such data was done on a single computer, using an algorithm designed to produce a correct solution for any given instance.
As the data grew, the work was spread across a series of networked computers. Such an arrangement is known as a distributed system.
Distributed Systems
https://en.wikipedia.org/wiki/Distributed_computing
How Does a Distributed System Work?
In a distributed system, a large data set is split into smaller chunks that are stored and processed in parallel across many networked machines, and the partial results are then combined into a single answer.
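The sketch below illustrates this scatter-and-gather idea in miniature, using local Python processes in place of separate machines; the data set, worker count, and word-count task are all illustrative.

# A minimal sketch of the distributed idea: split a data set into chunks,
# process each chunk in parallel on a separate worker, then merge the
# partial results. Real distributed systems run workers on separate
# machines; here, for illustration only, workers are local processes.
from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    """Worker task: count words in one chunk of the data."""
    return Counter(chunk.split())

if __name__ == "__main__":
    documents = [
        "big data needs distributed storage",
        "distributed processing splits work across machines",
        "each machine returns a partial result",
    ]
    with Pool(processes=3) as pool:
        partial_counts = pool.map(count_words, documents)  # scatter
    total = sum(partial_counts, Counter())                 # gather/merge
    print(total.most_common(3))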
Doug Cutting created Hadoop and named it after his son's yellow toy elephant. It was inspired by technical papers published by Google.
https://twitter.com/cutting
Characteristics of Hadoop
Hadoop is reliable, economical, flexible, and scalable.
Hadoop Core
(Diagram) Hadoop Core consists of three layers: storage, YARN resource management, and data processing.
Topic 3—Components of Hadoop Ecosystem
Components of Hadoop Ecosystem
(Diagram) The ecosystem is built on Hadoop Core and YARN cluster resource management, with components grouped into data ingestion (e.g., Sqoop), data processing, and workflow systems.
Components of Hadoop Ecosystem
HDFS (HADOOP DISTRIBUTED FILE SYSTEM)
• HDFS is the storage layer of Hadoop, suitable for distributed storage and processing.
• It provides file permissions, authentication, and streaming access to file system data.
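A minimal sketch of working with HDFS from Python by shelling out to the standard hdfs dfs commands; it assumes a running cluster with the Hadoop binaries on PATH, and the file names and paths are illustrative.

# Interact with HDFS via the standard `hdfs dfs` command-line tool.
import subprocess

def hdfs(*args):
    """Run an `hdfs dfs` subcommand and return its output."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

hdfs("-mkdir", "-p", "/user/demo")                # create a directory
hdfs("-put", "local_events.log", "/user/demo/")   # copy a local file in
print(hdfs("-ls", "/user/demo"))                  # list directory contents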
HBASE
• HBase is a NoSQL, or non-relational, database that stores data in HDFS.
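A minimal sketch of writing and reading HBase rows from Python using the third-party happybase client, which talks to HBase's Thrift server; the host, table name, and column family are illustrative and assume the table already exists.

# Store and fetch a row in HBase. Keys and values are raw bytes.
import happybase

connection = happybase.Connection("localhost")  # Thrift server host
table = connection.table("users")

# Write one row with two columns in the 'info' column family.
table.put(b"user-001", {b"info:name": b"Avery", b"info:plan": b"basic"})

# Read it back as a dict of column -> value.
row = table.row(b"user-001")
print(row[b"info:name"])  # b'Avery'
connection.close()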
SQOOP
• Sqoop is a tool designed to transfer data between Hadoop and relational database servers.
• It is used to import data from relational databases such as Oracle and MySQL into HDFS, and to export data from HDFS back to relational databases.
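Sqoop is driven from the command line. Below is a minimal sketch of launching an import from Python; the JDBC URL, credentials, table, and target directory are all illustrative.

# Import a relational table into HDFS with the standard `sqoop` CLI.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/sales",  # source RDBMS
    "--username", "etl_user",
    "--password-file", "/user/etl/.sqoop_pw",       # avoid passwords on the CLI
    "--table", "orders",                            # table to import
    "--target-dir", "/user/etl/orders",             # destination in HDFS
    "--num-mappers", "4",                           # parallel map tasks
], check=True)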
Components of Hadoop Ecosystem
FLUME
• Flume is a distributed service for ingesting streaming data, suited to event data from multiple systems.
• It has a simple and flexible architecture based on streaming data flows.
• It uses a simple, extensible data model that allows for online analytic applications.
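Flume agents are configured with a properties file naming a source, a channel, and a sink. Below is a minimal sketch in the canonical single-node quickstart layout, written out from Python for illustration; the agent name (a1), component names, and port are illustrative.

# Write a minimal Flume agent configuration to a file.
flume_conf = """
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source: listen for events on a TCP port.
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Sink: log events (swap for an HDFS sink in real pipelines).
a1.sinks.k1.type = logger

# Channel: buffer events in memory between source and sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Wire source and sink to the channel.
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""

with open("example-agent.conf", "w") as f:
    f.write(flume_conf)
# Then run, e.g.: flume-ng agent --conf-file example-agent.conf --name a1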
Components of Hadoop Ecosystem
SPARK
• Spark is an open-source cluster-computing framework that supports machine learning, business intelligence, streaming, and batch processing.
• Spark solves similar problems as Hadoop MapReduce but uses a fast, in-memory approach and a clean, functional-style API.
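A minimal PySpark sketch of the classic word count, showing the in-memory, functional-style API; the input path is illustrative.

# Count word frequencies with Spark's DataFrame API.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("WordCount").getOrCreate()

lines = spark.read.text("hdfs:///user/demo/input.txt")  # one column: 'value'
counts = (lines
          .select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
          .groupBy("word")
          .count()
          .orderBy(F.desc("count")))
counts.show(10)
spark.stop()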
Components of Hadoop Ecosystem
HADOOP MAPREDUCE
• Hadoop MapReduce is a framework that processes data. It is the original Hadoop processing engine and is primarily Java-based.
PIG
• Once the data is processed, it can be analyzed with Pig, an open-source, high-level dataflow system.
• Pig converts its scripts to Map and Reduce code, reducing the effort of writing complex MapReduce programs.
• Ad hoc operations such as Filter and Join, which are difficult to perform in MapReduce, can be done easily using Pig.
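A minimal sketch of the MapReduce idea using Hadoop Streaming, which lets the map and reduce steps be written in Python rather than Java; file names and HDFS paths are illustrative, and the two scripts are shown together for brevity. Hadoop sorts the mapper's output by key before it reaches the reducer.

# --- mapper.py: read lines from stdin, emit "word<TAB>1" per word ---
# import sys
# for line in sys.stdin:
#     for word in line.split():
#         print(f"{word}\t1")

# --- reducer.py: sum the counts for each word (input arrives sorted) ---
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    word, count = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

# Driver (shell): hadoop jar hadoop-streaming.jar \
#   -input /user/demo/input -output /user/demo/output \
#   -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py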
Components of Hadoop Ecosystem
IMPALA
• Impala is an open-source, high-performance SQL engine that runs on the Hadoop cluster.
• It is ideal for interactive analysis and has very low latency, which can be measured in milliseconds.
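A minimal sketch of an interactive Impala query from Python using the third-party impyla client; the host and table are illustrative, and 21050 is Impala's default port for the HiveServer2 protocol.

# Run an ad hoc query against Impala and print the result rows.
from impala.dbapi import connect

conn = connect(host="impala-host", port=21050)
cursor = conn.cursor()
cursor.execute("SELECT plan, COUNT(*) FROM users GROUP BY plan")
for row in cursor.fetchall():
    print(row)
conn.close()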
HIVE
• Hive is an abstraction layer on top of Hadoop that executes queries using MapReduce.
• It is preferred for data processing, ETL (Extract, Transform, Load), and ad hoc queries.
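Hive queries are plain SQL (HiveQL). The sketch below runs a Hive-style aggregation through Spark SQL with Hive support enabled; the database and table names are illustrative, and a classic Hive deployment would run the same statement through HiveServer2/beeline instead.

# Query a Hive metastore table through Spark SQL.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("HiveETL")
         .enableHiveSupport()      # read/write Hive metastore tables
         .getOrCreate())

spark.sql("""
    SELECT country, COUNT(*) AS orders
    FROM sales.orders
    GROUP BY country
    ORDER BY orders DESC
""").show(5)
spark.stop()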
Components of Hadoop Ecosystem
CLOUDERA SEARCH
• Cloudera Search is a fully integrated data processing platform. It uses the flexible, scalable, and robust storage system included with CDH, Cloudera's Distribution including Apache Hadoop.
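Cloudera Search is built on Apache Solr, so a generic Solr client can index and query it. Below is a minimal sketch with the third-party pysolr library; the URL and collection name are illustrative.

# Index a document into a Solr collection, then search it back.
import pysolr

solr = pysolr.Solr("http://search-host:8983/solr/logs", timeout=10)

solr.add([{"id": "evt-1", "message": "user login failed", "level": "WARN"}],
         commit=True)
for hit in solr.search("message:login"):
    print(hit["id"], hit["message"])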
Components of Hadoop Ecosystem
OOZIE
• Oozie is a workflow and coordination system used to manage Hadoop jobs.
• The Oozie coordinator can trigger jobs by time (frequency) and by data availability.
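Oozie workflows are defined in XML and submitted to the Oozie server. A minimal sketch of submitting a job from Python via the standard oozie CLI; the server URL and properties file are illustrative, and job.properties is assumed to point at a workflow.xml stored in HDFS.

# Submit and run an Oozie workflow job.
import subprocess

subprocess.run([
    "oozie", "job",
    "-oozie", "http://oozie-host:11000/oozie",   # Oozie server URL
    "-config", "job.properties",                 # references workflow.xml in HDFS
    "-run",
], check=True)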
Components of Hadoop Ecosystem
OOZIE APPLICATION LIFECYCLE
(Diagram) A workflow proceeds from Start through a sequence of actions (Action1, Action2, Action3) to End.
Components of Hadoop Ecosystem
HUE (HADOOP USER EXPERIENCE)
• Hue is an acronym for Hadoop User Experience. It is an open-source web interface for analyzing data with Hadoop.
• It provides SQL editors for Hive, Impala, MySQL, Oracle, PostgreSQL, Spark SQL, and Solr SQL.
Big Data Processing
Components of the Hadoop ecosystem work together to process big data. There are four stages of big data processing:
1. Ingest: bring data in from external sources (Sqoop, Flume)
2. Process: store and process the data (HDFS, HBase, MapReduce, Spark)
3. Analyze: query and analyze the processed data (Pig, Hive, Impala)
4. Access: explore and report on the results (Hue, Cloudera Search)
Key Takeaways
Core components of Hadoop include HDFS for storage, YARN for cluster-resource
management, and MapReduce or Spark for processing.
The Hadoop ecosystem includes multiple components that support each stage of big data processing: ingestion (Sqoop, Flume), processing (MapReduce, Spark), analysis (Pig, Hive, Impala), and access (Hue, Cloudera Search).
QUIZ
1. What is a Distributed system?
c. A Traditional system
d. In-memory computation
QUIZ
3. Which of the following is NOT a key characteristic of Hadoop?
a. Economical
b. Adaptable
c. Flexible
d. Reliable
QUIZ
4. Which of the following is used in the storage stage of big data processing?
a. Impala
b. Spark
c. Hive
d. HDFS/HBase
a. import data from relational databases to Hadoop HDFS and export from Hadoop file system to relational databases
c. enable non-technical users to search and explore data stored in or ingested into Hadoop and HBase