
Big Data Hadoop and Spark Developer

Lesson 1—Introduction to Big Data and Hadoop

© Simplilearn. All rights reserved.


Learning Objectives

Discuss the basics of big data with a case study

Explain the basics of Hadoop

Describe the components of the Hadoop Ecosystem


Introduction to Big Data and Hadoop
Topic 1—Introduction to Big Data
Data Is Exploding

IBM reported that 2.5 billion gigabytes of data was generated every day in 2012. It is predicted that by 2020:

• About 1.7 megabytes of new information will be generated for every human, every second

• 40,000 search queries will be performed on Google every second

• 300 hours of video will be uploaded to YouTube every minute

• 31.25 million messages will be sent and 2.77 million videos viewed by Facebook users every minute

• 80% of photos will be taken on smartphones

• At least a third of all data will pass through the cloud


Data Is Exploding (Contd.)

By 2020, data will show an exponential rise!

[Chart: global data volume, measured in zettabytes (ZB), rising sharply toward 2020]


What Is Big Data?

Big data refers to very large volumes of structured and unstructured data. The analysis
of big data leads to better insights for business.
Big Data: Case Study
NETFLIX

Netflix is one of the largest providers of commercial streaming video in the US with a customer base of
over 29 million.

It receives a huge volume of behavioral data, such as:

• When do users watch a show?


• Where do they watch it?
• On which device do they watch the show?
• How often do they pause a program?
• How often do they re-watch a program?
• Do they skip the credits?
• What are the keywords searched?
Big Data: Case Study
NETFLIX

Traditionally, the analysis of such data was done on a single computer, using an algorithm
designed to produce a correct solution for any given instance.

As the data started to grow, multiple computers were employed to do the analysis. Such
groups of networked computers are known as distributed systems.
Distributed Systems

A distributed system is a model in which components located on networked computers
communicate and coordinate their actions by passing messages.

https://en.wikipedia.org/wiki/Distributed_computing
How Does a Distributed System Work?

[Diagram: a 1-terabyte dataset is split across multiple machines, each processing its portion in parallel]
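To make the idea concrete, here is a minimal single-machine simulation of the pattern, with threads standing in for networked nodes: the dataset is split into chunks, each worker processes its chunk independently, and the partial results are combined. The data and the word-counting task are purely illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DivideAndProcess {
    public static void main(String[] args) throws Exception {
        // Stand-in for a large dataset, pre-split into chunks (one per "node")
        List<String> chunks = List.of("big data", "hadoop spark", "hdfs yarn hive");

        ExecutorService nodes = Executors.newFixedThreadPool(chunks.size());
        List<Future<Integer>> partials = new ArrayList<>();

        // Each "node" counts the words in its own chunk, in parallel
        for (String chunk : chunks) {
            partials.add(nodes.submit(() -> chunk.split("\\s+").length));
        }

        // Combine the partial results into the final answer
        int total = 0;
        for (Future<Integer> f : partials) {
            total += f.get();
        }
        nodes.shutdown();
        System.out.println("Total words: " + total); // prints 7
    }
}
```

A real distributed system does the same thing across machines, which is exactly what introduces the failure, bandwidth, and programming-complexity challenges described next.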

In recent times, custom-built distributed systems have largely been replaced by Hadoop.


Challenges of Distributed Systems

1. High chances of system failure

2. Limited bandwidth

3. High programming complexity

HADOOP is used to overcome these challenges!


Introduction to Big Data and Hadoop
Topic 2—Introduction to Hadoop
What Is Hadoop?

Hadoop is a framework that allows distributed processing of large datasets across
clusters of computers using simple programming models.

Doug Cutting created Hadoop and named it after his son's yellow toy
elephant. It was inspired by technical papers published by Google.

https://twitter.com/cutting
Characteristics of Hadoop

Scalable: Supports both horizontal and vertical scaling

Reliable: Stores copies of the data on different machines and is resistant to hardware failure

Flexible: Stores a lot of data and enables you to use it later

Economical: Ordinary computers can be used for data processing
Traditional Database Systems vs. Hadoop

Traditional Database Systems: Data is stored in a central location and sent to the processor at run time.
Hadoop: The program goes to the data. Hadoop initially distributes the data to multiple systems and later runs the computation wherever the data is located.

Traditional Database Systems: They cannot be used to process and store a large amount of data (big data).
Hadoop: Hadoop works better when the data size is big. It can process and store a large amount of data easily and effectively.

Traditional Database Systems: Traditional RDBMS is used to manage only structured and semi-structured data; it cannot be used to manage unstructured data.
Hadoop: Hadoop has the ability to process and store a variety of data, whether it is structured or unstructured.
Hadoop Core Components

[Diagram: the Hadoop core — storage (HDFS), resource management (YARN), and data processing (MapReduce)]
Introduction to Big Data and Hadoop
Topic 3—Components of Hadoop Ecosystem
Components of Hadoop Ecosystem

[Diagram: the Hadoop ecosystem layered over the Hadoop core — Sqoop and Flume for data ingestion; the distributed file system (HDFS) and a NoSQL store (HBase) for storage; YARN for cluster resource management; Spark and MapReduce for data processing; Pig, Hive, and Impala for data analysis; Hue and Cloudera Search for data exploration; Oozie as the workflow system]
Components of Hadoop Ecosystem
HDFS (HADOOP DISTRIBUTED FILE SYSTEM)

• HDFS is the storage layer of Hadoop, suitable for distributed storage and processing.

• It provides file permissions, authentication, and streaming access to file system data.

• HDFS can be accessed through the Hadoop command-line interface.
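Besides the command line, HDFS exposes a Java API. Below is a minimal sketch, assuming the cluster settings (core-site.xml, hdfs-site.xml) are on the classpath; the path and file contents are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);     // connects to the configured file system

        // Write a small file (hypothetical path), overwriting if it exists
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, HDFS!");
        }

        // Read back its metadata
        System.out.println("Exists: " + fs.exists(file));
        System.out.println("Size: " + fs.getFileStatus(file).getLen() + " bytes");
    }
}
```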


Components of Hadoop Ecosystem
HBase

• HBase is a NoSQL (non-relational) database that stores data in HDFS.

• It supports high volumes of data and high throughput.

• It is used when you need random, real-time read/write access to your big data.

• HBase tables can have thousands of columns.
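A minimal sketch of that random, real-time read/write access through the HBase Java client is shown below; the table name, column family, and values are hypothetical, and the table is assumed to already exist.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table

            // Random write: row key "user1", column family "info", qualifier "name"
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random read of the same row by key
            Result row = table.get(new Get(Bytes.toBytes("user1")));
            System.out.println(Bytes.toString(
                row.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}
```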


Components of Hadoop Ecosystem
SQOOP

• Sqoop is a tool designed to transfer data between Hadoop and relational database servers.

• It is used to import data from relational databases such as Oracle and MySQL to HDFS and to export data from HDFS to relational databases.
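Sqoop is normally driven from the command line. As a sketch, the hypothetical import below (connection string, credentials, table, and directories are all placeholders) passes the same arguments to Sqoop's Java entry point, which the Sqoop 1.x jar exposes as Sqoop.runTool.

```java
import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Equivalent to: sqoop import --connect ... --table ... --target-dir ...
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost:3306/shop", // hypothetical MySQL database
            "--username", "etl_user",
            "--password-file", "/user/etl/.db_password",  // keeps the secret out of argv
            "--table", "orders",                          // hypothetical source table
            "--target-dir", "/user/etl/orders",           // HDFS destination directory
            "--num-mappers", "4"                          // parallel import tasks
        };
        System.exit(Sqoop.runTool(importArgs));
    }
}
```

The reverse direction works the same way with the export tool and an --export-dir pointing at the HDFS data.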
Components of Hadoop Ecosystem
FLUME

• Flume is a distributed service for ingesting streaming data, suited for event data from multiple systems.

• It has a simple and flexible architecture based on streaming data flows.

• It is robust and fault tolerant and has tunable reliability mechanisms.

• It uses a simple extensible data model that allows for online analytic application.
Components of Hadoop Ecosystem
SPARK

Spark is an open-source cluster computing framework that supports machine learning,
business intelligence, streaming, and batch processing.

Spark solves similar problems as Hadoop MapReduce but with a fast in-memory
approach and a clean functional-style API.

Spark and MapReduce will be discussed in the upcoming lessons.


Components of Hadoop Ecosystem
SPARK: COMPONENTS

[Diagram: Apache Spark components — Spark Core and Resilient Distributed Datasets (RDDs), Spark SQL, Spark Streaming, the Machine Learning Library (MLlib), and GraphX]
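As a taste of that functional-style API, here is a minimal word count on Spark Core's Java API, run in local mode; the input path is hypothetical. Spark is covered in depth in later lessons.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Load a text file into an RDD (hypothetical HDFS path)
            JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input.txt");

            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // line -> words
                .mapToPair(word -> new Tuple2<>(word, 1))                      // word -> (word, 1)
                .reduceByKey(Integer::sum);                                    // sum counts per word

            counts.take(10).forEach(t -> System.out.println(t._1() + ": " + t._2()));
        }
    }
}
```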
Components of Hadoop Ecosystem
HADOOP MAPREDUCE

• Hadoop MapReduce is the framework that processes data. It is the original Hadoop processing engine and is primarily Java-based.

• It is based on the map and reduce programming model.

• It has extensive and mature fault tolerance.

• Hive and Pig are built on the MapReduce model.
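The canonical illustration of the map and reduce model is word count: the mapper emits a (word, 1) pair for every word it sees, and the framework groups the pairs by word so the reducer can sum them. A minimal sketch is shown below (mapper and reducer only; the job driver is omitted for brevity).

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: emit (word, 1) for each word in the input line
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the 1s for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```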


Components of Hadoop Ecosystem
PIG

• Once the data is processed, it is analyzed using an open-source high-level dataflow system called Pig.

• Pig converts its scripts to Map and Reduce code, reducing the effort of writing complex map-reduce programs.

• Ad-hoc queries like Filter and Join, which are difficult to perform in MapReduce, can be easily done using Pig.
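To keep the examples in one language, the sketch below drives a small Pig Latin dataflow (a Filter followed by a Group and Count) from Java through Pig's PigServer class in local mode; the input file, its schema, and the output directory are hypothetical. The same four statements would normally live in a standalone .pig script.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL); // local mode: no cluster needed

        // Hypothetical tab-separated log file with two fields: user, url
        pig.registerQuery("logs = LOAD 'logs.txt' AS (user:chararray, url:chararray);");
        pig.registerQuery("hits = FILTER logs BY url MATCHES '.*checkout.*';");
        pig.registerQuery("by_user = GROUP hits BY user;");
        pig.registerQuery("counts = FOREACH by_user GENERATE group AS user, COUNT(hits);");

        // Pig compiles the dataflow to Map and Reduce jobs and writes the result
        pig.store("counts", "checkout_counts");
    }
}
```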
Components of Hadoop Ecosystem
IMPALA

• Impala is an open-source high-performance SQL engine that runs on the Hadoop cluster.

• It is ideal for interactive analysis and has very low latency, which can be measured in milliseconds.

• Impala supports a dialect of SQL, so data in HDFS is modeled as a database table.


Components of Hadoop Ecosystem
HIVE

• Hive is an abstraction layer on top of Hadoop that executes queries using MapReduce.

• It is preferred for data processing, ETL (Extract, Transform, Load), and ad hoc queries.
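Hive queries are written in HiveQL, a SQL dialect. A minimal sketch of running one from Java through the standard HiveServer2 JDBC driver is shown below; the endpoint, credentials, and table are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 JDBC driver

        // Hypothetical HiveServer2 endpoint and credentials
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT category, COUNT(*) FROM products GROUP BY category")) {
            while (rs.next()) {
                // Hive translates the query into MapReduce jobs behind the scenes
                System.out.println(rs.getString(1) + ": " + rs.getLong(2));
            }
        }
    }
}
```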
Components of Hadoop Ecosystem
CLOUDERA SEARCH

• Cloudera Search is Cloudera's near-real-time access product that enables non-technical users to search and explore data stored in or ingested into Hadoop and HBase.

• Cloudera Search is a fully integrated data processing platform. It uses the flexible, scalable, and robust storage system included with CDH, Cloudera's Distribution including Apache Hadoop.
Components of Hadoop Ecosystem
OOZIE

• Oozie is a workflow or coordination system used to manage Hadoop tasks.

• The Oozie coordinator can trigger jobs by time (frequency) and data availability.
Components of Hadoop Ecosystem
OOZIE APPLICATION LIFECYCLE

[Diagram: Oozie application lifecycle — the Oozie Coordinator Engine triggers the Oozie Workflow Engine, which runs a workflow of actions from Start to End]
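Oozie workflows are defined in XML and submitted to the Oozie server. As a sketch, the snippet below submits a hypothetical workflow application using the OozieClient class from Oozie's Java client library; the server URL, HDFS paths, and cluster addresses are placeholders that would normally match your workflow.xml.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical Oozie server endpoint
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs:///user/demo/app"); // dir holding workflow.xml
        conf.setProperty("nameNode", "hdfs://namenode:8020");            // placeholder values
        conf.setProperty("jobTracker", "resourcemanager:8032");          // referenced by the workflow

        String jobId = oozie.run(conf); // submit and start the workflow
        System.out.println("Workflow job ID: " + jobId);
        System.out.println("Status: " + oozie.getJobInfo(jobId).getStatus());
    }
}
```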
Components of Hadoop Ecosystem
HUE (HADOOP USER EXPERIENCE)

• Hue is an acronym for Hadoop User Experience. It is an open-source web interface for analyzing data with Hadoop.

• It provides SQL editors for Hive, Impala, MySQL, Oracle, PostgreSQL, Spark SQL, and Solr SQL.
Big Data Processing

Components of the Hadoop ecosystem work together to process big data. There are four stages of big
data processing: ingesting data (Flume, Sqoop), storing it (HDFS, HBase), processing and analyzing it
(Spark, MapReduce, Pig, Hive, Impala), and accessing or exploring the results (Hue, Cloudera Search).
Key Takeaways

Hadoop is a framework for distributed storage and processing.

Core components of Hadoop include HDFS for storage, YARN for cluster-resource
management, and MapReduce or Spark for processing.

The Hadoop ecosystem includes multiple components that support each stage of
big data processing:

• Flume and Sqoop ingest data
• HDFS and HBase store data
• Spark and MapReduce process data
• Pig, Hive, and Impala analyze data
• Hue and Search help to explore data
• Oozie manages the workflow of Hadoop tasks
Quiz
QUIZ
1. What is a distributed system?

a. One machine processing a file

b. Multiple machines processing a file

c. A traditional system

d. In-memory computation

The correct answer is b.
In distributed systems, you use multiple machines to process one file.
QUIZ
2. What is Hadoop?

a. It is an in-memory tool used in Mahout algorithm computing.

b. It is a computing framework used for resource management.

c. It is a framework that allows for distributed processing of large datasets across clusters of commodity computers using a simple programming model.

d. It is a search and analytics tool that provides access to analyze data.

The correct answer is c.
Hadoop is a framework that allows for distributed processing of large datasets across clusters of
commodity computers using a simple programming model.
QUIZ
3. Which of the following is NOT a key characteristic of Hadoop?

a. Economical

b. Adaptable

c. Flexible

d. Reliable

The correct answer is b.
The four key characteristics of Hadoop are that it is economical, reliable, scalable, and flexible.
QUIZ
4. Which of the following is used in the data storage stage of big data processing?

a. Impala

b. Spark

c. Hive

d. HDFS/HBase

The correct answer is d.
HDFS and HBase are used in the data storage stage of big data processing.
QUIZ
5. Sqoop is used to _______.

a. import data from relational databases to Hadoop HDFS and export from the Hadoop file system to relational databases

b. execute queries using MapReduce

c. enable non-technical users to search and explore data stored in or ingested into Hadoop and HBase

d. stream event data from multiple systems

The correct answer is a.
Sqoop is used to import data from relational databases to Hadoop HDFS and to export from the
Hadoop file system to relational databases.
This concludes “Introduction to Big Data and
Hadoop.”
The next lesson is “HDFS and YARN.”

©Simplilearn. All rights reserved
