U1-Lec 4

BIG DATA

UNIT 1

Lecture 4
Introducing Apache Hadoop

Prepared By
Mrs.J.Gokulapriya
Assistant Professor- CS
Department of Computer Science
Rathinam College of Arts and Science

22MCS3CB - Big Data Analytics - Lecture 4


INTRODUCTION

• Hadoop is developed under the Apache Software Foundation, and
its co-founders are Doug Cutting and Mike Cafarella. Doug
Cutting named it after his son's toy elephant. In
October 2003, the first of the papers that inspired it, on the
Google File System, was released. In January 2006, the MapReduce
and file system code was moved out of Apache Nutch into the new
Hadoop project; it consisted of around 6,000 lines of code for
MapReduce and around 5,000 lines of code for HDFS. In April 2006,
Hadoop 0.1.0 was released.

• What is Hadoop?
• Hadoop is an open-source software framework for
storing large amounts of data and running
computations on it. It is written mainly in Java,
with some native code in C and shell scripts.
• Hadoop stores and processes data in a distributed
computing environment. It is designed to handle big
data and is based on the MapReduce programming
model, which allows large datasets to be processed
in parallel.
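
The MapReduce model mentioned above can be illustrated with a minimal single-machine sketch in plain Python. It does not use Hadoop itself: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example, and on a real cluster the framework would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

Because each line is mapped independently and each key is reduced independently, both steps can be spread over many machines, which is what makes the model suitable for large datasets.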

• Hadoop has two main components:
• HDFS (Hadoop Distributed File System): This is the
storage component of Hadoop, which allows for the
storage of large amounts of data across multiple
machines. It is designed to work with commodity
hardware, which makes it cost-effective.
• YARN (Yet Another Resource Negotiator): This is the
resource management component of Hadoop, which
manages the allocation of resources (such as CPU and
memory) for processing the data stored in HDFS.

• The Hadoop ecosystem also includes several related
projects that provide additional functionality, such
as Hive (a data warehouse with a SQL-like query
language), Pig (a high-level platform for creating
MapReduce programs), and HBase (a non-relational,
distributed database).
• Hadoop is commonly used in big data scenarios such
as data warehousing, business intelligence, and
machine learning. It is also used for data processing,
data analysis, and data mining. It enables the
distributed processing of large data sets across
clusters of computers using a simple programming
model.
• Features of Hadoop:
• 1. It is fault tolerant.
• 2. It is highly available.
• 3. It is easy to program.
• 4. It offers huge, flexible storage.
• 5. It is low cost.

Hadoop Distributed File System

• Advantages of HDFS: It is inexpensive, stores data
reliably, tolerates faults, and is scalable. Data is
immutable once written, and files are stored as
blocks, so a large amount of data can be processed
simultaneously.
• Disadvantages of HDFS: Its biggest disadvantage is
that it is not suited to small quantities of data. It
can also have stability issues and can be restrictive
and rough to work with.
• The Hadoop ecosystem also includes a wide range of
software packages such as Apache Flume, Apache
Oozie, Apache HBase, Apache Sqoop, Apache Spark,
Apache Storm, Apache Pig, Apache Hive, Apache
Phoenix, and Cloudera Impala.
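
The block-structured, fault-tolerant storage described above can be sketched roughly as follows. This is an illustration only, not HDFS code: the 128 MB block size and replication factor of 3 are assumed here as commonly cited defaults, the plan_blocks function and node names are hypothetical, and the round-robin placement is a simplification that stands in for the real placement policy.

```python
import math

BLOCK_SIZE_MB = 128  # assumed block size for this sketch
REPLICATION = 3      # assumed number of copies kept of each block


def plan_blocks(file_size_mb, nodes,
                block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Split a file into blocks and assign each block to several nodes.

    Placement is simple round-robin (a hypothetical policy for
    illustration); every block is placed on `replication` distinct nodes.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(n_blocks):
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    plan = plan_blocks(300, ["node1", "node2", "node3", "node4"])
    for block, holders in plan.items():
        print(f"block {block}: {holders}")
```

Since every block lives on several nodes, losing one node does not lose data. The sketch also hints at why small files are a poor fit: each file takes at least one block entry of its own, however little data it holds.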
• The Hadoop framework is made up of the following
modules:
• Hadoop MapReduce - a programming model for
handling and processing large data sets.
• Hadoop Distributed File System - distributes files
across the nodes of a cluster.
• Hadoop YARN - a platform that manages
computing resources.
• Hadoop Common - contains the packages and
libraries used by the other modules.

• Advantages:
• Ability to store a large amount of data.
• High flexibility.
• Cost effective.
• High computational power.
• Tasks are independent.
• Linear scaling.

• Disadvantages:
• Not very effective for small data.
• Cluster management is difficult.
• Has stability issues.
• Security concerns.

Thank You
