
Big Data Hadoop and Spark Developer

Lesson 1—Introduction to Big Data and Hadoop

© Simplilearn. All rights reserved.


Learning Objectives

Discuss the basics of big data with a case study

Explain the basics of Hadoop

Describe the components of the Hadoop Ecosystem


Introduction to Big Data and Hadoop
Topic 1—Introduction to Big Data
Data Is Exploding

IBM reported that 2.5 billion gigabytes of data was generated every day in 2012. It is predicted that by 2020:

• About 1.7 megabytes of new information will be generated for every human, every second

• 40,000 search queries will be performed on Google every second

• 300 hours of video will be uploaded to YouTube every minute

• 31.25 million messages will be sent and 2.77 million videos viewed by Facebook users every minute

• 80% of photos will be taken on smartphones

• At least a third of all data will pass through the cloud


Data Is Exploding (Contd.)

By 2020, data will show an exponential rise!

[Chart: global data volume, measured in zettabytes (ZB), rising sharply toward 2020]


What Is Big Data?

Big data refers to very large volumes of structured and unstructured data. The analysis
of big data leads to better insights for business.
Big Data: Case Study
NETFLIX

Netflix is one of the largest providers of commercial streaming video in the US with a customer base of
over 29 million.

It receives a huge volume of behavioral data, such as:

• When do users watch a show?


• Where do they watch it?
• On which device do they watch the show?
• How often do they pause a program?
• How often do they re-watch a program?
• Do they skip the credits?
• What are the keywords searched?
Big Data: Case Study
NETFLIX

Traditionally, the analysis of such data was done on a single computer, using an algorithm
designed to produce a correct solution for any given instance.

As the data started to grow, multiple computers were employed to do the analysis. Such
groups of networked computers are known as distributed systems.
Distributed Systems

A distributed system is a model in which components located on networked computers
communicate and coordinate their actions by passing messages.

https://en.wikipedia.org/wiki/Distributed_computing
How Does a Distributed System Work?

[Diagram: a 1-terabyte dataset is split across multiple machines, each processing its portion in parallel]
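To make the idea concrete, here is a minimal single-machine simulation of the pattern, with threads standing in for networked nodes: the dataset is split into chunks, each worker processes its chunk independently, and the partial results are combined. The data and the word-counting task are purely illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DivideAndProcess {
    public static void main(String[] args) throws Exception {
        // Stand-in for a large dataset, pre-split into chunks (one per "node")
        List<String> chunks = List.of("big data", "hadoop spark", "hdfs yarn hive");

        ExecutorService nodes = Executors.newFixedThreadPool(chunks.size());
        List<Future<Integer>> partials = new ArrayList<>();

        // Each "node" counts the words in its own chunk, in parallel
        for (String chunk : chunks) {
            partials.add(nodes.submit(() -> chunk.split("\\s+").length));
        }

        // Combine the partial results into the final answer
        int total = 0;
        for (Future<Integer> f : partials) {
            total += f.get();
        }
        nodes.shutdown();
        System.out.println("Total words: " + total); // prints 7
    }
}
```

A real distributed system does the same thing across machines, which is exactly what introduces the failure, bandwidth, and programming-complexity challenges described next.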

In recent times, custom-built distributed systems have largely been replaced by Hadoop.


Challenges of Distributed Systems

1. High chances of system failure

2. Limited bandwidth

3. High programming complexity

HADOOP is used to overcome these challenges!


Introduction to Big Data and Hadoop
Topic 2—Introduction to Hadoop
What Is Hadoop?

Hadoop is a framework that allows distributed processing of large datasets across
clusters of computers using simple programming models.

Doug Cutting created Hadoop and named it after his son's yellow toy
elephant. It was inspired by technical papers published by Google.

https://twitter.com/cutting
Characteristics of Hadoop

Scalable: Supports both horizontal and vertical scaling

Reliable: Stores copies of the data on different machines and is resistant to hardware failure

Flexible: Stores a lot of data and enables you to use it later

Economical: Ordinary computers can be used for data processing
Traditional Database Systems vs. Hadoop

Traditional Database Systems: Data is stored in a central location and sent to the processor at run time.
Hadoop: The program goes to the data. Hadoop initially distributes the data to multiple systems and later runs the computation wherever the data is located.

Traditional Database Systems: They cannot be used to process and store a large amount of data (big data).
Hadoop: Hadoop works better when the data size is big. It can process and store a large amount of data easily and effectively.

Traditional Database Systems: Traditional RDBMS is used to manage only structured and semi-structured data; it cannot be used to manage unstructured data.
Hadoop: Hadoop has the ability to process and store a variety of data, whether it is structured or unstructured.
Hadoop Core Components

[Diagram: the Hadoop core — storage (HDFS), resource management (YARN), and data processing (MapReduce)]
Introduction to Big Data and Hadoop
Topic 3—Components of Hadoop Ecosystem
Components of Hadoop Ecosystem

[Diagram: the Hadoop ecosystem layered over the Hadoop core — Sqoop and Flume for data ingestion; the distributed file system (HDFS) and a NoSQL store (HBase) for storage; YARN for cluster resource management; Spark and MapReduce for data processing; Pig, Hive, and Impala for data analysis; Hue and Cloudera Search for data exploration; Oozie as the workflow system]
Components of Hadoop Ecosystem
HDFS (HADOOP DISTRIBUTED FILE SYSTEM)

• HDFS is the storage layer of Hadoop, suitable for distributed storage and processing.

• It provides file permissions, authentication, and streaming access to file system data.

• HDFS can be accessed through the Hadoop command-line interface.
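Besides the command line, HDFS exposes a Java API. Below is a minimal sketch, assuming the cluster settings (core-site.xml, hdfs-site.xml) are on the classpath; the path and file contents are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);     // connects to the configured file system

        // Write a small file (hypothetical path), overwriting if it exists
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, HDFS!");
        }

        // Read back its metadata
        System.out.println("Exists: " + fs.exists(file));
        System.out.println("Size: " + fs.getFileStatus(file).getLen() + " bytes");
    }
}
```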


Components of Hadoop Ecosystem
HBase

• HBase is a NoSQL (non-relational) database that stores data in HDFS.

• It supports high volumes of data and high throughput.

• It is used when you need random, real-time read/write access to your big data.

• HBase tables can have thousands of columns.
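A minimal sketch of that random, real-time read/write access through the HBase Java client is shown below; the table name, column family, and values are hypothetical, and the table is assumed to already exist.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table

            // Random write: row key "user1", column family "info", qualifier "name"
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random read of the same row by key
            Result row = table.get(new Get(Bytes.toBytes("user1")));
            System.out.println(Bytes.toString(
                row.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}
```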


Components of Hadoop Ecosystem
SQOOP

• Sqoop is a tool designed to transfer data between Hadoop and relational database servers.

• It is used to import data from relational databases such as Oracle and MySQL to HDFS and to export data from HDFS to relational databases.
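Sqoop is normally driven from the command line. As a sketch, the hypothetical import below (connection string, credentials, table, and directories are all placeholders) passes the same arguments to Sqoop's Java entry point, which the Sqoop 1.x jar exposes as Sqoop.runTool.

```java
import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Equivalent to: sqoop import --connect ... --table ... --target-dir ...
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost:3306/shop", // hypothetical MySQL database
            "--username", "etl_user",
            "--password-file", "/user/etl/.db_password",  // keeps the secret out of argv
            "--table", "orders",                          // hypothetical source table
            "--target-dir", "/user/etl/orders",           // HDFS destination directory
            "--num-mappers", "4"                          // parallel import tasks
        };
        System.exit(Sqoop.runTool(importArgs));
    }
}
```

The reverse direction works the same way with the export tool and an --export-dir pointing at the HDFS data.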
Components of Hadoop Ecosystem
FLUME

• Flume is a distributed service for ingesting streaming data, suited for event data from multiple systems.

• It has a simple and flexible architecture based on streaming data flows.

• It is robust and fault tolerant and has tunable reliability mechanisms.

• It uses a simple extensible data model that allows for online analytic application.
Components of Hadoop Ecosystem
SPARK

Spark is an open-source cluster computing framework that supports machine learning,
business intelligence, streaming, and batch processing.

Spark solves similar problems as Hadoop MapReduce but with a fast in-memory
approach and a clean functional-style API.

Spark and MapReduce will be discussed in the upcoming lessons.


Components of Hadoop Ecosystem
SPARK: COMPONENTS

[Diagram: Apache Spark components — Spark Core and Resilient Distributed Datasets (RDDs), Spark SQL, Spark Streaming, the Machine Learning Library (MLlib), and GraphX]
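As a taste of that functional-style API, here is a minimal word count on Spark Core's Java API, run in local mode; the input path is hypothetical. Spark is covered in depth in later lessons.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Load a text file into an RDD (hypothetical HDFS path)
            JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input.txt");

            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // line -> words
                .mapToPair(word -> new Tuple2<>(word, 1))                      // word -> (word, 1)
                .reduceByKey(Integer::sum);                                    // sum counts per word

            counts.take(10).forEach(t -> System.out.println(t._1() + ": " + t._2()));
        }
    }
}
```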
Components of Hadoop Ecosystem
HADOOP MAPREDUCE

• Hadoop MapReduce is the framework that processes data. It is the original Hadoop processing engine and is primarily Java-based.

• It is based on the map and reduce programming model.

• It has extensive and mature fault tolerance.

• Hive and Pig are built on the MapReduce model.
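The canonical illustration of the map and reduce model is word count: the mapper emits a (word, 1) pair for every word it sees, and the framework groups the pairs by word so the reducer can sum them. A minimal sketch is shown below (mapper and reducer only; the job driver is omitted for brevity).

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: emit (word, 1) for each word in the input line
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the 1s for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```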


Components of Hadoop Ecosystem
PIG

• Once the data is processed, it is analyzed using an open-source high-level dataflow system called Pig.

• Pig converts its scripts to Map and Reduce code, reducing the effort of writing complex map-reduce programs.

• Ad-hoc queries like Filter and Join, which are difficult to perform in MapReduce, can be easily done using Pig.
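To keep the examples in one language, the sketch below drives a small Pig Latin dataflow (a Filter followed by a Group and Count) from Java through Pig's PigServer class in local mode; the input file, its schema, and the output directory are hypothetical. The same four statements would normally live in a standalone .pig script.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL); // local mode: no cluster needed

        // Hypothetical tab-separated log file with two fields: user, url
        pig.registerQuery("logs = LOAD 'logs.txt' AS (user:chararray, url:chararray);");
        pig.registerQuery("hits = FILTER logs BY url MATCHES '.*checkout.*';");
        pig.registerQuery("by_user = GROUP hits BY user;");
        pig.registerQuery("counts = FOREACH by_user GENERATE group AS user, COUNT(hits);");

        // Pig compiles the dataflow to Map and Reduce jobs and writes the result
        pig.store("counts", "checkout_counts");
    }
}
```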
Components of Hadoop Ecosystem
IMPALA

• Impala is an open-source high-performance SQL engine that runs on the Hadoop cluster.

• It is ideal for interactive analysis and has very low latency, which can be measured in milliseconds.

• Impala supports a dialect of SQL, so data in HDFS is modeled as a database table.


Components of Hadoop Ecosystem
HIVE

• Hive is an abstraction layer on top of Hadoop that executes queries using MapReduce.

• It is preferred for data processing, ETL (Extract, Transform, Load), and ad hoc queries.
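Hive queries are written in HiveQL, a SQL dialect. A minimal sketch of running one from Java through the standard HiveServer2 JDBC driver is shown below; the endpoint, credentials, and table are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 JDBC driver

        // Hypothetical HiveServer2 endpoint and credentials
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT category, COUNT(*) FROM products GROUP BY category")) {
            while (rs.next()) {
                // Hive translates the query into MapReduce jobs behind the scenes
                System.out.println(rs.getString(1) + ": " + rs.getLong(2));
            }
        }
    }
}
```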
Components of Hadoop Ecosystem
CLOUDERA SEARCH

• Cloudera Search is Cloudera's near-real-time access product that enables non-technical users to search and explore data stored in or ingested into Hadoop and HBase.

• Cloudera Search is a fully integrated data processing platform. It uses the flexible, scalable, and robust storage system included with CDH, Cloudera's Distribution including Apache Hadoop.
Components of Hadoop Ecosystem
OOZIE

• Oozie is a workflow or coordination system used to manage Hadoop tasks.

• The Oozie coordinator can trigger jobs by time (frequency) and data availability.
Components of Hadoop Ecosystem
OOZIE APPLICATION LIFECYCLE

[Diagram: Oozie application lifecycle — the Oozie Coordinator Engine triggers the Oozie Workflow Engine, which runs a workflow of actions from Start to End]
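Oozie workflows are defined in XML and submitted to the Oozie server. As a sketch, the snippet below submits a hypothetical workflow application using the OozieClient class from Oozie's Java client library; the server URL, HDFS paths, and cluster addresses are placeholders that would normally match your workflow.xml.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical Oozie server endpoint
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs:///user/demo/app"); // dir holding workflow.xml
        conf.setProperty("nameNode", "hdfs://namenode:8020");            // placeholder values
        conf.setProperty("jobTracker", "resourcemanager:8032");          // referenced by the workflow

        String jobId = oozie.run(conf); // submit and start the workflow
        System.out.println("Workflow job ID: " + jobId);
        System.out.println("Status: " + oozie.getJobInfo(jobId).getStatus());
    }
}
```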
Components of Hadoop Ecosystem
HUE (HADOOP USER EXPERIENCE)

• Hue is an acronym for Hadoop User Experience. It is an open-source web interface for analyzing data with Hadoop.

• It provides SQL editors for Hive, Impala, MySQL, Oracle, PostgreSQL, Spark SQL, and Solr SQL.
Big Data Processing

Components of the Hadoop ecosystem work together to process big data. There are four stages of big
data processing: ingesting data (Flume, Sqoop), storing it (HDFS, HBase), processing and analyzing it
(Spark, MapReduce, Pig, Hive, Impala), and accessing or exploring the results (Hue, Cloudera Search).
Key Takeaways

Hadoop is a framework for distributed storage and processing.

Core components of Hadoop include HDFS for storage, YARN for cluster-resource
management, and MapReduce or Spark for processing.

The Hadoop ecosystem includes multiple components that support each stage of
big data processing:

• Flume and Sqoop ingest data
• HDFS and HBase store data
• Spark and MapReduce process data
• Pig, Hive, and Impala analyze data
• Hue and Search help to explore data
• Oozie manages the workflow of Hadoop tasks
Quiz
QUIZ
1. What is a distributed system?

a. One machine processing a file

b. Multiple machines processing a file

c. A traditional system

d. In-memory computation

The correct answer is b.
In distributed systems, you use multiple machines to process one file.
QUIZ
2. What is Hadoop?

a. It is an in-memory tool used in Mahout algorithm computing.

b. It is a computing framework used for resource management.

c. It is a framework that allows for distributed processing of large datasets across clusters of commodity computers using a simple programming model.

d. It is a search and analytics tool that provides access to analyze data.

The correct answer is c.
Hadoop is a framework that allows for distributed processing of large datasets across clusters of
commodity computers using a simple programming model.
QUIZ
3. Which of the following is NOT a key characteristic of Hadoop?

a. Economical

b. Adaptable

c. Flexible

d. Reliable

The correct answer is b.
The four key characteristics of Hadoop are that it is economical, reliable, scalable, and flexible.
QUIZ
4. Which of the following is used in the data storage stage of big data processing?

a. Impala

b. Spark

c. Hive

d. HDFS/HBase

The correct answer is d.
HDFS and HBase are used in the data storage stage of big data processing.
QUIZ
5. Sqoop is used to _______.

a. import data from relational databases to Hadoop HDFS and export from the Hadoop file system to relational databases

b. execute queries using MapReduce

c. enable non-technical users to search and explore data stored in or ingested into Hadoop and HBase

d. stream event data from multiple systems

The correct answer is a.
Sqoop is used to import data from relational databases to Hadoop HDFS and to export from the
Hadoop file system to relational databases.
This concludes “Introduction to Big Data and
Hadoop.”
The next lesson is “HDFS and YARN.”

©Simplilearn. All rights reserved
