
STUDY MATERIAL

FOR

BIG DATA ANALYTICS

Department of CSE

Vadlamudi, AP, India - 522201

The material is for internal circulation only and not for commercial purposes. The contents of this material were taken from the internet as one source and some contents from the textbook titled Big Data Analytics, Seema Acharya, Subhashini Chellappan, Wiley. Thanks to the authors.

Big Data Analytics


Syllabus

UNIT – I : Introduction to big data: Data, Characteristics of data and Types of digital data, Sources of data, Working with unstructured data, Evolution and Definition of big data, Characteristics and Need of big data, Challenges of big data

Big data analytics: Overview of business intelligence, Data science and Analytics, Meaning
and Characteristics of big data analytics, Need of big data analytics, Classification of
analytics, Challenges to big data analytics, Importance of big data analytics, Basic
terminologies in big data environment.

UNIT – II: Introduction to Hadoop : Introducing Hadoop, need of Hadoop, limitations of


RDBMS, RDBMS versus Hadoop, Distributed Computing Challenges, History of Hadoop, Hadoop Overview, Use Case of Hadoop, Hadoop Distributors, HDFS (Hadoop Distributed File System), Processing Data with Hadoop, Managing Resources and Applications with Hadoop YARN (Yet Another Resource Negotiator), Interacting with Hadoop Ecosystem

UNIT – III: Introduction to MAPREDUCE Programming: Introduction, Mapper, Reducer, Combiner, Partitioner, Searching, Sorting, Compression, Real time applications using MapReduce, Data serialization and Working with common serialization formats, Big data serialization formats

UNIT – IV: Introduction to Hive: Introduction to Hive, Hive Architecture , Hive Data
Types, Hive File Format, Hive Query Language (HQL), User-Defined Function (UDF) in
Hive.

Introduction to Pig: Introduction to Pig, The Anatomy of Pig , Pig on Hadoop , Pig
Philosophy , Use Case for Pig: ETL Processing , Pig Latin Overview , Data Types in Pig ,
Running Pig , Execution Modes of Pig, HDFS Commands, Relational Operators, Piggy
Bank , Word Count Example using Pig , Pig at Yahoo!, Pig versus Hive

UNIT – V: Spark: Introduction to data analytics with Spark, Programming with RDDS,
Working with key/value pairs, advanced spark programming.

Text Books:

Big Data Analytics, Seema Acharya, Subhashini Chellappan, Wiley

Learning Spark: Lightning-Fast Big Data Analysis, Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia, O'Reilly Media, Inc.

Reference Books:

Boris Lublinsky, Kevin T. Smith, Alexey Yakubovich, “Professional Hadoop Solutions”, Wiley, ISBN: 9788126551071, 2015.

Chris Eaton, Dirk deRoos et al., “Understanding Big Data”, McGraw Hill, 2012.

Tom White, “HADOOP: The Definitive Guide”, O'Reilly, 2012.

Vignesh Prajapati, “Big Data Analytics with R and Hadoop”, Packt Publishing, 2013.

UNIT - I

a. What are the various types of digital data? Explain how to deal with unstructured data.

b. How is a traditional BI environment different from a Big Data environment?

Digital data is the data that is stored in a computer system or digitally. Digital data is
classified into the following categories:

Structured data

Semi-structured data

Unstructured data

Structured data: This is the data which is in an organized form (e.g., in rows and columns) and can be easily used by a computer program. Data stored in databases is an example, e.g., data in an RDBMS or in spreadsheets.

Sources of Structured data:

Ease of working with structured data:

Semi-structured data: This is the data which does not conform to a data model but has
some structure. However, it is not in a form which can be used easily by a computer
program.

Eg. XML, markup languages like HTML, etc. Metadata for this data is available but is not
sufficient.

Sources of semi-structured data:

JSON (JavaScript Object Notation)

Characteristics of semi-structured data:

Unstructured data:

This is the data which does not conform to a data model or is not in a form which can be
used easily by a computer program.

About 80–90% of an organization's data is in this format.

Example: memos, chat rooms, PowerPoint presentations, images, videos, letters, research reports, white papers, the body of an email, etc.

Sources of unstructured data:

Document

Dealing with unstructured data:

Unstructured Information Management Architecture (UIMA)
Data Mining: Knowledge discovery in databases, popular Mining algorithms are
Association rule mining, Regression Analysis, and Collaborative filtering

Natural Language Processing: It is related to HCI. It is about enabling computers to understand human or natural language input.

Text Analytics: Text mining is the process of gleaning high quality and meaningful
information from text. It includes tasks such as text categorization, text clustering, sentiment
analysis and concept/entity extraction.

Noisy text analytics: The process of extracting structured or semi-structured information from noisy unstructured data such as chats, blogs, wikis and emails, which contain spelling mistakes, abbreviations, fillers (uh, hm) and non-standard words.

Manual Tagging with meta data: This is about tagging manually with adequate meta data
to provide the requisite semantics to understand unstructured data.

Parts of Speech Tagging: POS tagging is the process of reading text and tagging each word in the sentence as belonging to a particular part of speech such as noun, verb, adjective, etc.

Unstructured Information Management Architecture: An open source platform from IBM used for real-time content analytics.

B. Traditional Business Intelligence versus Big Data:

1. In a traditional BI environment, all the enterprise's data is housed in a central server, whereas in a Big Data environment data resides in a distributed file system.

The distributed file system scales by scaling in or out horizontally, as compared to a typical database server that scales vertically.

2. In traditional BI, data is generally analyzed in an offline mode whereas in Big data, it is
analyzed both real time as well as in offline mode.

3. Traditional BI is about structured data, and the data is taken to the processing functions (move data to code), whereas Big Data is about variety: structured, semi-structured and unstructured data, and here the processing functions are taken to the data (move code to data).

What is Big Data? Explain the challenges and evolution of Big Data.

"Big data" is high-volume, -velocity and -variety information assets that demand cost-
effective, innovative forms of information processing for enhanced insight and decision
making.

“Big data is high-volume, -velocity and -variety information assets” talks about
voluminous data that may have great variety(structured, semi-structured and unstructured)
and will require a good speed/pace for storage, preparation, processing and analysis.

“Cost-effective, innovative forms of information processing” talks about new techniques


and technologies to capture, store, process, persist, integrate, and visualize high volume,
high variety and high velocity data.

“Enhanced insight and decision making” talks about deriving deeper, richer and more meaningful insights and then using these insights to make faster and better decisions to gain business value and thus a competitive edge.

Volume: Bits -> Bytes -> KBs -> MBs -> GBs -> TBs -> PBs -> Exabytes -> Zettabytes -> Yottabytes

Where does this data get generated? There are multiple sources of Big Data. An XLS, a DOC, a PDF, etc. is unstructured data; a video on YouTube, a chat conversation on an internet messenger, a customer feedback form on an online retail website is unstructured data; CCTV coverage and weather forecast data are also unstructured data.

The sources of Big data:

Typical internal data sources: data present within an organization’s firewall.

Data storage: File systems, SQL (RDBMSs – Oracle, MS SQL Server, DB2, MySQL, PostgreSQL, etc.), NoSQL (MongoDB, Cassandra, etc.) and so on.

Archives: Archives of scanned documents, paper archives, customer correspondence records, patients' health records, students' admission records, students' assessment records, and so on.

External data sources: data residing outside an organization’s Firewall.


Public web: Wikipedia, regulatory, compliance, weather, census etc.,

Both (internal + external sources)

Sensor data, machine log data, social media, business apps, media and docs.

Velocity: we have moved from the days of batch processing to Real-time processing:

Batch -> periodic ->Near real time ->Real time processing.

Variety: Variety deals with the wide range of data types and sources of data. Structured,
semi-structured and Unstructured.

Structured data: From traditional transaction processing systems and RDBMS, etc.

Semi-structured data: For example Hypertext Markup Language (HTML), eXtensible


Markup Language (XML).

Unstructured data: For example unstructured text documents, audios, videos, emails, photos,
PDFs , social media, etc.

Data: Big in volume, variety and velocity.

The challenges with big data:

Data today is growing at an exponential rate. Most of the data that we have today has been generated in the last two years. The key questions are: will all this data be useful for analysis, and how do we separate knowledge from noise?

How to host big data solutions outside the organization.

The period of retention of big data.

Dearth of skilled professionals who possess a high level of proficiency in data science that is
vital in implementing Big data solutions.

Challenges with respect to capture, curation, storage, search, sharing, transfer, analysis,
privacy violations and visualization.

Shortage of data visualization experts.

Evolution of Big Data

The 1970s and before was the era of mainframes. The data was essentially primitive and structured. Relational databases evolved in the 1980s and 1990s; that era was one of data-intensive applications. The World Wide Web and the Internet of Things (IoT) have led to an onslaught of structured, unstructured, and multimedia data.

Fig. Evolution of big data (data generation and storage, data utilization, data-driven era):

1970s and before: Mainframes; primitive and structured data; basic data storage.

1980s and 1990s: Relational databases; complex and relational data; data-intensive applications.

2000s and beyond: Structured, unstructured and multimedia data; complex and unstructured; data-driven applications.

3. a. What are the various types of analytics? What is Big Data Analytics? Why is it important? Discuss the top challenges facing Big Data.
b. What is analytics 3.0? What can we expect from analytics 3.0?

Big data Analytics is the process of examining big data to uncover patterns, unearth trends,
and find unknown correlations and other useful information to make faster and better
decisions.

Analytics begins with analyzing all available data.

Analyze all available data

Websites Billing (POS) ERP CRM RFID Social media

Figure : Types of structured data available for analysis

A few top analytics tools are: MS Excel, SAS, IBM SPSS Modeler, R analytics, Statistica, World Programming Systems (WPS), and Weka.

The open source analytics tools are: R analytics and Weka.

Classification of Analytics: There are basically two schools of thought:

Those that classify analytics into basic, operational, advanced and monetized.

Those that classify analytics into analytics 1.0, analytics 2.0 and analytics 3.0.
First school of thought:

Basic analytics: This is primarily slicing and dicing of data to help with basic business insights. This is about reporting on historical data, basic visualization, etc.

Operationalized Analytics: It is operationalized analytics if it gets woven into the enterprise's business processes.

Advanced Analytics: This largely is about forecasting for the future by way of predictive and
prescriptive modeling.

Monetized analytics: This is analytics in use to derive direct business revenue.

Second school of thought:

Analytics 1.0 (Era: 1950s to 2009):
Descriptive statistics (report events, occurrences, etc. of the past).
Key questions asked: What happened? Why did it happen?
Data from legacy systems, ERP, CRM and third-party applications.
Small and structured data sources; data stored in enterprise data warehouses or data marts.
Data was internally sourced.
Technology: relational databases.

Analytics 2.0 (Era: 2005 to 2012):
Descriptive statistics + predictive statistics (use data from the past to make predictions for the future).
Key questions asked: What will happen? Why will it happen?
Big Data.
Big data is being taken up seriously. Data is mainly unstructured, arriving at a higher pace. This fast flow of big-volume data had to be stored and processed rapidly, often on massively parallel servers running Hadoop.
Data was often externally sourced.
Technology: database applications, Hadoop clusters, SQL-to-Hadoop environments, etc.

Analytics 3.0 (Era: 2012 to present):
Descriptive statistics + predictive statistics + prescriptive statistics (use data from the past to make prophecies for the future and at the same time make recommendations to leverage the situation to one's advantage).
Key questions asked: What will happen? When will it happen? Why will it happen? What should be the action taken to take advantage of what will happen?
A blend of big data and data from legacy systems, ERP, CRM and third-party applications.
A blend of big data and traditional analytics to yield insights and offerings with speed and impact.
Data is both internally and externally sourced.
Technology: in-memory analytics, in-database processing, agile analytical methods, machine learning techniques, etc.
Benefits of Big Data Analytics

Organizations decide to deploy big data analytics for a wide variety of reasons, including the
following:

Business Transformation: In general, executives believe that big data analytics offers tremendous potential to revolutionize their organizations.

Competitive Advantage: According to a survey, 57 percent of enterprises said their use of analytics was helping them achieve competitive advantage, up from 51 percent who said the same thing in 2015.

Innovation Big data analytics can help companies develop products and services that appeal
to their customers, as well as helping them identify new opportunities for revenue
generation.

Lower Costs In the NewVantage Partners Big Data Executive Survey 2017, 49.2 percent of
companies surveyed said that they had successfully decreased expenses as a result of a big
data project.

Improved Customer Service Organizations often use big data analytics to examine social


media, customer service, sales and marketing data. This can help them better gauge customer
sentiment and respond to customers in real time.

Increased Security Another key area for big data analytics is IT security. Security software
creates an enormous amount of log data.

Top Challenges facing Big Data:

Scale : The storage of data is becoming a challenge for everyone.

Security: The production of more and more data increases  security and privacy concerns.

Schema: there is no place for rigid schema, need of dynamic schema.

Continuous availability: How to provide 24X7 support

Consistency: Should one opt for consistency or eventual consistency?

Partition tolerance: How to build partition-tolerant systems that can take care of both hardware and software failures.

Data quality: Inconsistent data, duplicates, logic conflicts, and missing data all result in data
quality challenges.

To resolve the challenges of big data analytics, the first requirement is technology that provides high storage at a cheap cost. We need faster processors to help quicker processing of the data. Affordable and economical open-source software is required, along with parallel processing, high connectivity and high throughput rather than low latency. Cloud computing and other flexible resource allocation arrangements are also required to meet the challenges of big data.

4. Explain the following terminology of Big Data

In-Memory Analytics

In-Database processing

Symmetric Mulit-processor system

Massively parallel processing

Shared nothing architecture

CAP Theorem

In-memory Analytics: Data access from non-volatile storage such as hard disk is a slow process. This problem has been addressed using in-memory analytics. Here all the relevant data is stored in Random Access Memory (RAM), or primary storage, thus eliminating the need to access the data from hard disk. The advantages are faster access, rapid deployment, better insights, and minimal IT involvement.

In-Database Processing: In-database processing is also called in-database analytics. It works by fusing data warehouses with analytical systems. Typically the data from various enterprise OLTP systems, after cleaning up through the process of ETL, is stored in the enterprise data warehouse or data marts. The huge data sets are then exported to analytical programs for complex and extensive computations.

Symmetric Multi-Processor System: In this architecture there is a single common main memory that is shared by two or more identical processors. The processors have full access to all I/O devices and are controlled by a single operating system instance.

SMP systems are tightly coupled multiprocessor systems. Each processor has its own high-speed memory, called cache memory, and the processors are connected using a system bus.

Massively Parallel Processing:

Massively Parallel Processing (MPP) refers to the coordinated processing of programs by a number of processors working in parallel. The processors each have their own OS and dedicated memory. They work on different parts of the same program. The MPP processors communicate using some sort of messaging interface.

MPP is different from symmetric multiprocessing in that SMP works with processors sharing the same OS and same memory. SMP is also referred to as tightly coupled multiprocessing.

Shared Nothing Architecture: The three most common types of architecture for multiprocessor systems are:

Shared memory

Shared disk

Shared nothing.

In shared memory architecture, a common central memory is shared by multiple processors.


In shared disk architecture, Multiple processors share a common collection of disks while
having their own private memory. In shared nothing architecture, neither memory nor disk is
shared among multiple processors.

Advantages of shared nothing architecture:

Fault isolation:

Scalability:

CAP Theorem: The CAP theorem is also called Brewer's theorem. It states that in a distributed computing environment, it is impossible to simultaneously provide all three of the following guarantees; at most two of the three can be guaranteed:

Consistency

Availability

Partition tolerance

Consistency implies that every read fetches the last write.

Availability implies that reads and writes always succeed. In other words, each non-failing
node will return response in a reasonable amount of time.

Partition tolerance implies that the system will continue to function when network partition
occurs.

Definition – What does Basically Available, Soft State, Eventual Consistency (BASE) mean?

Basically Available, Soft State, Eventual Consistency (BASE) is a data system design
philosophy that prizes availability over consistency of operations. BASE may be explained
in contrast to another design philosophy - Atomicity, Consistency, Isolation, Durability
(ACID). The ACID model promotes consistency over availability, whereas BASE promotes
availability over consistency.

BIG DATA ANALYTICS

UNIT – II : Introduction to Hadoop

Introducing Hadoop, need of Hadoop, limitations of RDBMS, RDBMS versus Hadoop,


Distributed Computing Challenges, History of Hadoop , Hadoop Overview, Use Case of
Hadoop, Hadoop Distributors, HDFS, Processing Data with Hadoop, Managing Resources
and Applications with Hadoop YARN, Interacting with Hadoop Ecosystem.

Explain the differences between Hadoop and RDBMS

System: RDBMS is a relational database management system; Hadoop is a node-based flat structure.

Data: RDBMS is suitable for structured data; Hadoop is suitable for structured and unstructured data and supports a variety of formats (XML, JSON).

Processing: RDBMS supports OLTP; Hadoop supports analytical, big data processing.

Choice: RDBMS is used when the data needs consistent relationships; Hadoop is used for big data processing, which does not require any consistent relationships between data.

Processor: RDBMS needs expensive hardware or high-end processors to store huge volumes of data; in a Hadoop cluster, a node requires only commodity hardware (a processor, a network card and a few hard drives).

Cost: RDBMS costs around $10,000 to $14,000 per terabyte of storage; Hadoop costs around $4,000 per terabyte of storage.

What is Hadoop? Explain features of hadoop.

Hadoop is an open source framework that is meant for storage and processing of big data in
a distributed manner.

It is the best solution for handling big data challenges.

Some important features of Hadoop are –

Open Source – Hadoop is an open source framework which means it is available free of
cost. Also, the users are allowed to change the source code as per their requirements.

Distributed Processing – Hadoop supports distributed processing of data i.e. faster


processing. The data in Hadoop HDFS is stored in a distributed manner and MapReduce is
responsible for the parallel processing of data.
Fault Tolerance – Hadoop is highly fault-tolerant. It creates three replicas for each block
(default) at different nodes.

Reliability – Hadoop stores data on the cluster in a reliable manner that is independent of
machine. So, the data stored in Hadoop environment is not affected by the failure of the
machine.

Scalability – It is compatible with other hardware and we can easily add new hardware to the nodes.

High Availability – The data stored in Hadoop is available to access even after the hardware
failure. In case of hardware failure, the data can be accessed from another node.

The core components of Hadoop are –

HDFS: (Hadoop Distributed File System) – HDFS is the basic storage system of Hadoop.
The large data files running on a cluster of commodity hardware are stored in HDFS. It can
store data in a reliable manner even when hardware fails. The key aspects of HDFS are:

Storage component

Distributes data across several nodes

Natively redundant.

Map Reduce: MapReduce is the Hadoop layer that is responsible for data processing. It
writes an application to process unstructured and structured data stored in HDFS.

It is responsible for the parallel processing of high volume of data by dividing data into
independent tasks. The processing is done in two phases Map and Reduce.

The Map is the first phase of processing that specifies complex logic code and the 

Reduce is the second phase of processing that specifies light-weight operations.

The key aspects of Map Reduce are:

Computational frame work

Splits a task across multiple nodes

Processes data in parallel

Explain features of HDFS.Discuss the design of Hadoop distributed file system and
concept in detail.

HDFS: (Hadoop Distributed File System) – HDFS is the basic storage system of Hadoop.
The large data files running on a cluster of commodity hardware are stored in HDFS. It can
store data in a reliable manner even when hardware fails. The key aspects of HDFS are:

HDFS was developed taking inspiration from the Google File System (GFS).

Storage component: Stores data in hadoop

Distributes data across several nodes: divides large file into blocks and stores in various data
nodes.

Natively redundant: replicates the blocks in various data nodes.

High Throughput Access: Provides access to data blocks which are nearer to the client.

Re-replicates data blocks when nodes fail.

Fig. Features of HDFS

HDFS Daemons:

NameNode

The NameNode is the master of HDFS that directs the slave DataNodes to perform I/O tasks.

Blocks: HDFS breaks large file into smaller pieces called blocks.

rackID: NameNode uses rackID to identify data nodes in the rack.

(A rack is a collection of DataNodes within the cluster.) The NameNode keeps track of the blocks of a file.

File System Namespace: NameNode is the book keeper of HDFS. It keeps track of how
files are broken down into blocks and which DataNode stores these blocks. It is a collection
of files in the cluster.

FsImage: file system namespace includes mapping of blocks of a file, file properties and is
stored in a file called FsImage.

EditLog: namenode uses an EditLog (transaction log) to record every transaction that
happens to the file system metadata.

NameNode is single point of failure of Hadoop cluster.

HDFS key points: block-structured file system; default replication factor: 3; default block size: 64 MB/128 MB.

Fig. HDFS Architecture

DataNode
There are multiple DataNodes per cluster. Each slave machine in the cluster has a DataNode daemon for reading and writing HDFS blocks of the actual files on the local file system.

During pipeline read and write DataNodes communicate with each other.

It also continuously sends a “heartbeat” message to the NameNode to ensure the connectivity between the NameNode and the DataNode.

If no heartbeat is received for a period of time, the NameNode assumes that the DataNode has failed, and its blocks are re-replicated.

Fig. Interaction between NameNode and DataNode.

(iii) Secondary NameNode

Takes a snapshot of the HDFS metadata at intervals specified in the Hadoop configuration.

The memory requirement of the secondary NameNode is the same as that of the NameNode.

But the secondary NameNode runs on a different machine.

In case of NameNode failure, the secondary NameNode can be configured manually to bring up the cluster, i.e., we make the secondary NameNode the NameNode.

File Read operation:

The steps involved in the File Read are as follows:

The client opens the file that it wishes to read from by calling open() on the DFS.

The DFS communicates with the NameNode to get the location of data blocks. The NameNode returns the addresses of the DataNodes that the data blocks are stored on.

Subsequent to this, the DFS returns an FSDataInputStream to the client to read from the file.

Client then calls read() on the stream DFSInputStream, which has addresses of DataNodes
for the first few block of the file.

Client calls read() repeatedly to stream the data from the DataNode.

When the end of the block is reached, DFSInputStream closes the connection with the
DataNode. It repeats the steps to find the best DataNode for the next block and subsequent
blocks.

When the client completes the reading of the file, it calls close() on the FSDataInputStream to close the connection.
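The same read path can be exercised with the HDFS Java API. Below is a minimal sketch, assuming a local HDFS at hdfs://localhost:9000; the URI and the file path (reused from the command examples later in this material) are placeholders, not prescribed values.

// Minimal HDFS read sketch (assumed URI and path, for illustration only)
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        InputStream in = null;
        try {
            // open() returns the FSDataInputStream mentioned in the steps above
            in = fs.open(new Path("/chp/abc1.txt"));
            IOUtils.copyBytes(in, System.out, 4096, false);  // stream the blocks to stdout
        } finally {
            IOUtils.closeStream(in);  // close() releases the DataNode connection
        }
    }
}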

Fig. File Read Anatomy

File Write operation:

The client calls create() on DistributedFileSystem to create a file.

An RPC call to the namenode happens through the DFS to create a new file.

As the client writes data, the data is split into packets by DFSOutputStream, which then writes them to an internal queue, called the data queue. The DataStreamer consumes the data queue.

Data streamer streams the packets to the first DataNode in the pipeline. It stores packet and
forwards it to the second DataNode in the pipeline.

In addition to the internal queue, DFSOutputStream also manages an “ack queue” of packets that are waiting to be acknowledged by DataNodes.

When the client finishes writing the file, it calls close() on the stream.
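A matching minimal write sketch using the same Java API (again, the URI and path are assumed placeholders):

// Minimal HDFS write sketch (assumed URI and path, for illustration only)
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        // create() triggers the RPC to the NameNode described in the steps above
        FSDataOutputStream out = fs.create(new Path("/chp/hello.txt"));
        out.writeUTF("Hello HDFS");  // data is packetized and pipelined to the DataNodes
        out.close();                 // close() flushes remaining packets and waits for acks
    }
}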

Fig. File Write Anatomy

Special features of HDFS:

Data Replication: There is absolutely no need for a client application to track all blocks. The NameNode directs the client to the nearest replica to ensure high performance.

Data Pipeline: A client application writes a block to the first DataNode in the pipeline. Then
this DataNode takes over and forwards the data to the next node in the pipeline. This process
continues for all the data blocks, and subsequently all the replicas are written to the disk.

Fig. File Replacement Strategy

Explain basic HDFS File operations with an example.

Creating a directory:

Syntax: hdfs dfs -mkdir <path>

Eg. hdfs dfs -mkdir /chp

Remove a file in specified path:

Syntax: hdfs dfs -rm <src>

Eg. hdfs dfs -rm /chp/abc.txt

Copy file from local file system to hdfs:

Syntax: hdfs dfs -copyFromLocal <src> <dst>

Eg. hdfs dfs -copyFromLocal /home/vignan/sample.txt /chp/abc1.txt

To display list of contents in a directory:

Syntax: hdfs dfs -ls <path>

Eg. hdfs dfs -ls /chp

To display contents in a file:

Syntax: hdfs dfs -cat <path>

Eg. hdfs dfs -cat /chp/abc1.txt

Copy file from hdfs to local file system:

Syntax: hdfs dfs -copyToLocal <src> <dst>

Eg. hdfs dfs -copyToLocal /chp/abc1.txt /home/vignan/Desktop/sample.txt

To display last few lines of a file:

Syntax: hdfs dfs -tail <path>

Eg. hdfs dfs -tail /chp/abc1.txt

Display aggregate length of file in bytes:

Syntax: hdfs dfs -du <path>

Eg. hdfs dfs -du /chp

To count no. of directories, files and bytes under given path:

Syntax: hdfs dfs -count <path>

Eg. hdfs dfs -count /chp

o/p: 1 1 60 (directory count, file count, content size in bytes)

Remove a directory from hdfs:

Syntax: hdfs dfs -rmr <path>

Eg. hdfs dfs -rmr /chp

Explain the importance of MapReduce in Hadoop environment for processing data.

MapReduce programming helps to process massive amounts of data in parallel.

Input data set splits into independent chunks. Map tasks process these independent chunks
completely in a parallel manner.

The Reduce task provides a reduced output by combining the output of the various mappers. There are two daemons associated with MapReduce programming: JobTracker and TaskTracker.

JobTracker:

JobTracker is a master daemon responsible for executing the overall MapReduce job.

It provides connectivity between Hadoop and application.

Whenever code submitted to a cluster, JobTracker creates the execution plan by deciding
which task to assign to which node.

It also monitors all the running tasks. When a task fails, it automatically re-schedules the task to a different node after a predefined number of retries.

There will be one JobTracker process running on a single Hadoop cluster. JobTracker processes run in their own Java Virtual Machine process.

Fig. Job Tracker and Task Tracker interaction

TaskTracker:

This daemon is responsible for executing individual tasks that is assigned by the Job
Tracker.

Task Tracker continuously sends heartbeat message to job tracker. When a job tracker fails
to receive a heartbeat message from a TaskTracker, the JobTracker assumes that the
TaskTracker has failed and resubmits the task to another available node in the cluster.

MapReduce Framework

Phases:
Map: Converts input into key-value pairs.
Reduce: Combines the output of mappers and produces a reduced result set.

Daemons:
JobTracker: Master, schedules tasks.
TaskTracker: Slave, executes tasks.

MapReduce working:

MapReduce divides a data analysis task into two parts – Map and Reduce. In the example given below there are two mappers and one reducer.
Each mapper works on the partial data set that is stored on that node and the reducer
combines the output from the mappers to produce the reduced result set.

Steps:

First, the input dataset is split into multiple pieces of data.

Next, the framework creates a master and several slave processes and executes the worker
processes remotely.

Several map tasks work simultaneously and read pieces of data that were assigned to each
map task.

Map worker uses partitioner function to divide the data into regions.

When the map slaves complete their work, the master instructs the reduce slaves to begin
their work.

When all the reduce slaves complete their work, the master transfers the control to the user
program.

Fig. MapReduce Programming Architecture

A MapReduce program written in Java requires three classes:

Driver Class: This class specifies the job configuration details.

Mapper Class: This class overrides the map function based on the problem statement.

Reducer Class: This class overrides the reduce function based on the problem statement.

NOTE: Based on marks given write MapReduce example if necessary with program.
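For reference against the NOTE above, the following is a compact word-count program showing the three classes. It is a sketch written against the standard Hadoop MapReduce Java API; the input and output HDFS paths are supplied as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper class: emits (word, 1) for every word in the input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer class: sums the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver class: job configuration details
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner acts as a local reducer
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

It can be packaged into a jar and run with, for example: hadoop jar wordcount.jar WordCount /input /output.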

Explain difference between Hadoop1X and Hadoop2X

Limitations of Hadoop 1.0: HDFS and MapReduce are core components, while other
components are built around the core.

A single NameNode is responsible for the entire namespace.

It has a restricted processing model which is suitable for batch-oriented MapReduce jobs.

It is not suitable for interactive analysis.

It is not suitable for machine learning algorithms, graphs, and other memory-intensive algorithms.

MapReduce is responsible for cluster resource management and data Processing.

HDFS Limitation: The NameNode can quickly become overwhelmed as load on the system increases. In Hadoop 2.x this problem is resolved.

Hadoop 2: Hadoop 2.x has a YARN-based architecture. It is a general processing platform. YARN is not constrained to MapReduce only. One can run multiple applications in Hadoop 2.x, all of which share common resource management.

Hadoop 2.x can be used for various types of processing such as Batch, Interactive, Online,
Streaming, Graph and others.

HDFS 2 consists of two major components

a) NameSpace: Takes care of file related operations such as creating files, modifying files
and directories

b) Block storage service: It handles data node cluster management and replication.

HDFS 2 Features:

Horizontal scalability: HDFS Federation uses multiple independent NameNodes for horizontal scalability. The DataNodes are common storage for blocks and are shared by all NameNodes. All DataNodes in the cluster register with each NameNode in the cluster.

High availability: High availability of NameNode is obtained with the help of Passive
Standby NameNode.

Active-Passive NameNodes handle failover automatically. All namespace edits are recorded to shared NFS (Network File System) storage and there is a single writer at any point of time.

Passive NameNode reads edits from shared storage and keeps updated metadata information.

In case of Active NameNode failure, the Passive NameNode becomes the Active NameNode automatically. It then starts writing to the shared storage.

Fig. Active and Passive NameNode interaction (the Active NameNode writes to the shared edit logs; the Passive NameNode reads from them).

Hadoop 1.x versus Hadoop 2.x:

1. Hadoop 1.x supports the MapReduce (MR) processing model only and does not support non-MR tools. Hadoop 2.x allows working with MR as well as other distributed computing models like Spark and HBase coprocessors.

2. In Hadoop 1.x, MR does both processing and cluster-resource management. In Hadoop 2.x, YARN does cluster resource management and processing is done using different processing models.

3. Hadoop 1.x has limited scaling of nodes (limited to 4000 nodes per cluster). Hadoop 2.x has better scalability (scalable up to 10000 nodes per cluster).

4. Hadoop 1.x works on the concept of slots – a slot can run either a Map task or a Reduce task only. Hadoop 2.x works on the concept of containers – containers can run generic tasks.

5. Hadoop 1.x has a single NameNode to manage the entire namespace. Hadoop 2.x has multiple NameNode servers managing multiple namespaces.

6. Hadoop 1.x has a Single Point of Failure (SPOF) because of the single NameNode. Hadoop 2.x has a feature to overcome SPOF with a standby NameNode; in the case of NameNode failure, it is configured for automatic recovery.

7. The MR API is compatible with Hadoop 1.x: a program written for Hadoop 1 executes in Hadoop 1.x without any additional files. The MR API in Hadoop 2.x requires additional files for a program written for Hadoop 1.x to execute.

8. Hadoop 1.x has a limitation to serve as a platform for event processing, streaming and real-time operations. Hadoop 2.x can serve as a platform for a wide variety of data analytics – it is possible to run event processing, streaming and real-time operations.

9. Hadoop 1.x does not support Microsoft Windows. Hadoop 2.x added support for Microsoft Windows.
Explain in detail about YARN?

The fundamental idea behind the YARN (Yet Another Resource Negotiator) architecture is to split the JobTracker's responsibilities of resource management and job scheduling/monitoring into separate daemons.

Daemons that are part of YARN architecture are:

1. Global Resource Manager: The main responsibility of the Global Resource Manager is to distribute resources among various applications.

It has two main components:

Scheduler: The pluggable scheduler of the ResourceManager decides the allocation of resources to various running applications. The scheduler is just that, a pure scheduler, meaning it does NOT monitor or track the status of the application.

Application Manager: It is responsible for:

Accepting job submissions.

Negotiating resources (containers) for executing the application-specific ApplicationMaster.

Restarting the ApplicationMaster in case of failure.

2. NodeManager: This is a per-machine slave daemon. The NodeManager's responsibility is to launch the application containers for application execution.

The NodeManager monitors resource usage such as memory, CPU, disk, network, etc.

It then reports the usage of resources to the global ResourceManager.

3. Per-Application ApplicationMaster: The per-application ApplicationMaster is an application-specific entity. Its responsibility is to negotiate the required resources for execution from the ResourceManager.

It works along with the NodeManager for executing and monitoring component tasks.

Basic concepts of YARN are: Application and Container.

Application is a job submitted to system.

Ex: MapReduce job.

Container: The basic unit of allocation. It replaces the fixed map/reduce slots and provides fine-grained resource allocation across multiple resource types.

Eg. Container_0: 2GB,1CPU

Container_1: 1GB,6CPU

Fig. YARN Architecture

The steps involved in YARN architecture are:

The client program submits an application.

The Resource Manager launches the Application Master by assigning some container.

The Application Master registers with the Resource manager.

On successful container allocations, the Application Master launches the container by providing the container launch specification to the NodeManager.

The NodeManager executes the application code.

During the application execution, the client that submitted the job directly communicates
with the Application Master to get status, progress updates.

Once the application has been processed completely, the Application Master deregisters with the ResourceManager and shuts down, allowing its own container to be repurposed.

Explain Hadoop Ecosystem in detail.

The following are the components of Hadoop ecosystem:

HDFS: Hadoop Distributed File System. It simply stores data files as close to the original
form as possible.

HBase: It is Hadoop’s distributed column based database. It supports structured data storage
for large tables.

Hive: It is a Hadoop’s data warehouse, enables analysis of large data sets using a language
very similar to SQL. So, one can access data stored in hadoop cluster by using Hive.

Pig: Pig is an easy-to-understand data flow language. It helps with the analysis of large data sets on Hadoop without writing code in the MapReduce paradigm.

ZooKeeper: It is an open source application that configures and synchronizes distributed systems.

Oozie: It is a workflow scheduler system to manage Apache Hadoop jobs.

Mahout: It is a scalable Machine Learning and data mining library.

Chukwa: It is a data collection system for managing large distributed systems.

Sqoop: it is used to transfer bulk data between Hadoop and structured data stores such as
relational databases.

Ambari: it is a web based tool for provisioning, Managing and Monitoring Apache Hadoop
clusters.

Explain the following

Modules of Apache Hadoop framework

There are four basic or core components:

Hadoop Common: It is a set of common utilities and libraries which handle other Hadoop
modules. It makes sure that the hardware failures are managed by Hadoop cluster
automatically.

Hadoop YARN: It allocates resources which in turn allow different users to execute various
applications without worrying about the increased workloads.

HDFS: It is a Hadoop Distributed File System that stores data in the form of small memory
blocks and distributes them across the cluster. Each data is replicated multiple times to
ensure data availability.

Hadoop MapReduce: It executes tasks in a parallel fashion by distributing the data as small
blocks.

Hadoop Modes of Installations

Standalone, or local mode: one of the least commonly used environments, used only for running and debugging MapReduce programs. This mode does not use HDFS, nor does it launch any of the Hadoop daemons.

Pseudo-distributed mode (cluster of one), which runs all daemons on a single machine. It is most commonly used in development environments.

Fully distributed mode, which is most commonly used in production environments. This
mode runs all daemons on a cluster of machines rather than single one.
XML file configurations in Hadoop.

core-site.xml – This configuration file contains Hadoop core configuration settings, for example, I/O settings that are common to MapReduce and HDFS.

mapred-site.xml – This configuration file specifies a framework name for MapReduce by setting mapreduce.framework.name.

hdfs-site.xml – This configuration file contains HDFS daemons configuration settings. It


also specifies default block permission and replication checking on HDFS.

yarn-site.xml – This configuration file specifies configuration settings for ResourceManager


and NodeManager.

Describe differences between SQL and MapReduce

Access: SQL supports interactive and batch access; MapReduce supports batch access only.

Structure: SQL uses a static structure (schema); MapReduce works with a dynamic structure.

Updates: In SQL, data is read and written many times; in MapReduce, data is written once and read many times.

Integrity: High in SQL; low in MapReduce.

Scalability: Nonlinear in SQL; linear in MapReduce.
Explain Hadoop Architecture with a neat sketch.

Fig. Hadoop Architecture

Hadoop Architecture is a distributed Master-slave architecture.

Master HDFS: Its main responsibility is partitioning the data storage across the slave nodes. It also keeps track of the locations of data on the DataNodes.

Master Map Reduce: It decides and schedules computation task on slave nodes.

NOTE: Based on marks for the question explain hdfs daemons and mapreduce daemons.

BDA UNIT – III

Syllabus : Introduction to MAPREDUCE Programming: Introduction , Mapper,


Reducer, Combiner, Partitioner , Searching, Sorting , Compression, Real time applications
using MapReduce, Data serialization and Working with common serialization formats, Big
data serialization formats

In MapReduce programming, Jobs(applications) are split into a set of map tasks and reduce
tasks. Then these tasks are executed in a distributed fashion on Hadoop cluster. Each task
processes small subset of data that has been assigned to it. This way, Hadoop distributes the
load across the cluster. Map Reduce job takes a set of files that is stored in HDFS as input.

Map task takes care of loading, parsing, transforming and filtering. The responsibility of
reduce task is grouping and aggregating data that is produced by map tasks to generate final
output. Each map task is broken down into the following phases:

Record Reader

Mapper

Combiner

Partitioner.

The output produced by the map task is known as intermediate keys and values. These
intermediate keys and values are sent to reducer. The reduce tasks are broken down into the
following phases:

Shuffle.

Sort

Reducer

Output format.

Hadoop assigns map tasks to the DataNode where the actual data to be processed resides. This way, Hadoop ensures data locality. Data locality means that data is not moved over the network; only the computational code is moved to process the data, which saves network bandwidth.

Mapper Phases:

Mapper maps the input key-value pairs into a set of intermediate key-value pairs.

Each map task is broken into following phases:

RecordReader: Converts the byte-oriented view of the input into a record-oriented view and presents it to the Mapper tasks as keys and values.

Mapper: Map function works on the key-value pair produced by RecordReader and
generates intermediate (key, value) pairs.

Combiner: It takes the intermediate key-value pairs provided by a mapper and applies a user-specific aggregate function to the output of only that one mapper. It is also known as a local reducer.

Partitioner: Takes the intermediate key-value pairs produced by the mapper and splits them into partitions using a user-defined condition, as in the sketch below.
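A minimal custom partitioner sketch follows. The routing rule (keys starting with a–m go to the first reducer, the rest to the second) is purely hypothetical and only illustrates a user-defined condition; Hadoop's default HashPartitioner partitions by the hash of the key.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical user-defined condition on the key, assuming two reducers
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions < 2 || key.toString().isEmpty()) {
            return 0;  // single reducer (or empty key): everything goes to partition 0
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        return (first <= 'm') ? 0 : 1;  // route by first letter of the key
    }
}

It would be registered in the driver with job.setPartitionerClass(AlphabetPartitioner.class) together with job.setNumReduceTasks(2).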

Fig. MapReduce without combiner

Fig. MapReduce with combiner

Each reduce task is broken into following phases:

Shuffle & Sort:


Downloads the grouped key-value pairs onto the local machine, where the Reducer is
running.

The individual key-value pairs are sorted by key into a larger data list.

The data list groups the equivalent keys together so that their values can be iterated easily in
the Reducer task

Reducer:

The Reducer takes the grouped key-value paired data as input and runs a Reducer function
on each one of them.

Here, the data can be aggregated, filtered, and combined in a number of ways, and it requires
a wide range of processing.

Once the execution is over, it gives zero or more key-value pairs to the final step.

Output format:  

In the output phase, we have an output formatter that translates the final key-value pairs
from the Reducer function and writes them onto a file using a record writer.

Compression: Compression provides two benefits as follows:

Reduces the space to store files

Speeds up data transfer across the network
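A small driver-side sketch of where compression is typically switched on in a MapReduce job; Gzip is just one assumed codec choice, and the map-output property name follows the Hadoop 2.x naming.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfig {
    public static void configure(Job job) {
        // Compress the final job output files
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        // Compress intermediate map output to speed up the shuffle
        Configuration conf = job.getConfiguration();
        conf.setBoolean("mapreduce.map.output.compress", true);
    }
}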

What are the Real time applications using MapReduce Programming:

Social networks

Media and Entertainment

Health Care

Business

Banking

Stock Market

Weather Forecasting

Data Serialization:

Data Serialization is the process of converting object data into byte stream data for
transmission over a network across different nodes in a cluster or for persistent data storage.
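In Hadoop, such serialization of objects is done through the Writable interface. Below is a minimal custom Writable sketch; the type and field names are illustrative, not taken from this material.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A custom value type that Hadoop can serialize into a byte stream
public class PageViewWritable implements Writable {
    private String url;
    private long views;

    public PageViewWritable() { }  // required no-arg constructor

    public void write(DataOutput out) throws IOException {    // serialize
        out.writeUTF(url);
        out.writeLong(views);
    }

    public void readFields(DataInput in) throws IOException { // deserialize
        url = in.readUTF();
        views = in.readLong();
    }
}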

MapReduce offers straightforward, well-documented support for working with simple data
formats such as log files.

But the use of MapReduce has evolved beyond log files to more sophisticated data
serialization formats—such as text, XML, and JSON—to the point that its documentation
and built-in support runs dry.

The goal of this topic is how to work with common data serialization formats, as well as to
examine more structured serialization formats and compare their fitness for use with
MapReduce.

Working with XML and JSON in MapReduce, however, poses two equally important
challenges:

Though MapReduce requires classes that can support reading and writing a particular data
serialization format, there’s a good chance it doesn’t have such classes to support the
serialization format you’re working with.

MapReduce’s power lies in its ability to parallelize reading your input data. If your input
files are large (think hundreds of megabytes or more), it’s crucial that the classes reading
your serialization format be able to split your large files so multiple map tasks can read them
in parallel.

Data serialization support in MapReduce is a property of the input and output classes
that read and write MapReduce data.

Working with common serialization formats:

XML and JSON are industry-standard data interchange formats. Their ubiquity in the
technology industry is evidenced by their heavy adoption in data storage and exchange.

XML has existed since 1998 as a mechanism to represent data that’s readable by machine
and human alike. It became a universal language for data exchange between systems. It’s
employed by many standards today such as SOAP (simple object Access Protocol) and RSS,
and used as an open data format for products such as Microsoft Office.

MapReduce and XML

While MapReduce comes bundled with an InputFormat that works with text, it doesn’t come
with one that supports XML. Working on a single XML file in parallel in MapReduce is
tricky because XML doesn’t contain a synchronization marker in its data format.

Problem You want to work with large XML files in MapReduce and be able to split and
process them in parallel.

Solution Mahout’s XMLInputFormat can be used to work with XML files in HDFS with
MapReduce. It reads records that are delimited by a specific XML begin and end tag. This
technique also covers how XML can be emitted as output in MapReduce output.
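A hedged configuration sketch of how such an input format is typically wired into a job is shown below. It assumes Mahout's XmlInputFormat with its xmlinput.start and xmlinput.end configuration keys, and a <property> record tag chosen purely for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class XmlJobSetup {
    public static Job createJob() throws Exception {
        Configuration conf = new Configuration();
        // Begin/end tags that delimit one logical record inside the large XML file
        conf.set("xmlinput.start", "<property>");
        conf.set("xmlinput.end", "</property>");
        Job job = Job.getInstance(conf, "xml input example");
        // job.setInputFormatClass(...) would point at the XML input format class
        // bundled with Mahout (or a copy of it placed on the job classpath).
        return job;
    }
}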

JSON shares the machine- and human-readable traits of XML, and has existed since the
early 2000s. It’s less verbose than XML, and doesn’t have the rich typing and validation
features available in XML.

MapReduce and JSON Imagine you have some code that’s downloading JSON data from a
streaming REST service and every hour writes a file into HDFS. The data amount that’s
being downloaded is large, so each file being produced is multiple gigabytes in size. You’ve
been asked to write a MapReduce job that can take as input these large JSON files.

Characteristics of big data serialization formats:

Code generation—The ability to generate Java classes and utilities that can be used for
serialization and deserialization.

Versioning—The ability for the file format to support backward or forward compatibility.

Language support—The programming languages supported by the library.

Transparent compression—The ability for the file format to handle compressing records
internally.

Splittability—The ability of the file format to support multiple input splits.

Native support in MapReduce—The input/output formats that support reading and writing
files in their native format (that is, produced directly from the data format library).

Pig and Hive support—The Pig Store and Load Functions (referred to as Funcs) and Hive
SerDe classes to support the data format.

UNIT – IV: HIVE & PIG

HIVE:

Hive is data warehousing tool and is used to query structured data built on top of
Hadoop for providing data summarization, query, and analysis. Hive Provides HQL
(Hive Query Language) which is similar to SQL. Hive compiles SQL queries into
MapReduce jobs and then runs the job in the Hadoop cluster.

Features of Hive:

It is similar to SQL

HQL is easy to code

Hive supports rich datatypes such as structs, lists and maps

Hive supports SQL filters, group-by and order-by clauses

Custom types and custom functions can be defined.
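As a sketch of the "custom functions" point above, here is a minimal Hive UDF written in Java against the classic org.apache.hadoop.hive.ql.exec.UDF API; the function name and behaviour are illustrative only.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A simple UDF that converts a string column to lower case
public class ToLowerUDF extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;
        }
        return new Text(input.toString().toLowerCase());
    }
}

After packaging it into a jar, it would be registered in Hive with ADD JAR and CREATE TEMPORARY FUNCTION before being used in HQL queries.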

Hive Data Units:

Databases: The name space for tables

Tables: set of records that have similar schema

Partitions: Logical separations of data based on the classification of given information as per specific attributes.

Buckets or clusters: Similar to partitions but uses hash function to segregate data and
determines the cluster or bucket into which the record should be placed.

Hive Architecture:

External Interfaces – CLI, Web UI, JDBC, ODBC programming interfaces.

Hive CLI: The most commonly used interface to interact with Hadoop.

Hive Web Interface: It is a simple graphical interface to interact with Hive and to execute queries.

Thrift Server – Cross-language service framework. This is an optional server. It can be used to submit Hive jobs from a remote client.

JDBC/ODBC: Jobs can be submitted from a JDBC client. One can write Java code to connect to Hive and submit jobs on it, as in the sketch below.
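A minimal sketch of such Java JDBC code, assuming a HiveServer2 endpoint at jdbc:hive2://localhost:10000/default; the connection details and credentials are placeholders, and the customer table is the one created in the HQL example later in this unit.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClient {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");          // HiveServer2 JDBC driver
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");   // placeholder endpoint
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT * FROM customer LIMIT 5");
        while (rs.next()) {
            System.out.println(rs.getInt(1) + "\t" + rs.getString(2));
        }
        con.close();
    }
}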

Metastore- Meta data about the Hive tables, partitions. A metastore consists of Meta store
service and Database.

There are three kinds of Metastore:

Embedded Meta store

Local Metastore

Remote Metastore

Driver- Brain of Hive! Hive queries are sent to the driver for Compilation, Optimization
and Execution

Apache Hive Data Types :

Hive Data types are used for specifying the column/field type in Hive tables.

Types of Data Types in Hive

Mainly Hive Data Types are classified into 5 major categories, let’s discuss them one by
one:

a. Primitive Data Types in Hive

Primitive Data Types also divide into 4 types which are as follows:

Numeric Data Type

Date/Time Data Type

String Data Type

Miscellaneous Data Type

Numeric Data Type

The Hive Numeric Data types also classified into two types-

Integral Data Types

Floating Data Types

* Integral Data Types

Integral Hive data types are as follows-

TINYINT (1-byte (8-bit) signed integer, from -128 to 127)

SMALLINT (2-byte (16-bit) signed integer, from -32,768 to 32,767)

INT (4-byte (32-bit) signed integer, from -2,147,483,648 to 2,147,483,647)

BIGINT (8-byte (64-bit) signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807)

* Floating Data Types

Floating Hive data types are as follows-

FLOAT (4-byte (32-bit) single-precision floating-point number)

DOUBLE (8-byte (64-bit) double-precision floating-point number)

DECIMAL (Arbitrary-precision signed decimal number)

ii. Date/Time Data Type

The second category of Apache Hive primitive data types is the Date/Time data types. The following Hive data types come into this category –

TIMESTAMP (Timestamp with nanosecond precision)

DATE (date)

INTERVAL

iii. String Data Type

String data types are the third category under Hive data types. Below are the data types that
come into this-

STRING (Unbounded variable-length character string)

VARCHAR (Variable-length character string)

CHAR (Fixed-length character string)

iv. Miscellaneous Data Type

There are two miscellaneous Hive data types –

BOOLEAN (True/false value)

BINARY (Byte array)

b. Complex Data Types in Hive

In this category of Hive data types following data types are come-

Array

MAP

STRUCT

UNION

ARRAY

An ordered collection of fields. The fields must all be of the same type.

Syntax: ARRAY<data_type>

E.g. array (1, 2)

ii. MAP

An unordered collection of key-value pairs. Keys must be primitives; values may be any type. For a particular map, the keys must be the same type, and the values must be the same type.

Syntax: MAP<primitive_type, data_type>

E.g.map(‘a’, 1, ‘b’, 2).

iii. STRUCT

A collection of named fields. The fields may be of different types.

Syntax: STRUCT<col_name :data_type [COMMENT col_comment],…..>

E.g. struct('a', 1, 1.0), named_struct('col1', 'a', 'col2', 1, 'col3', 1.0)

iv. UNION

A value that may be one of a number of defined data types. The value is tagged with an integer (zero-indexed) representing its data type in the union.

Syntax: UNIONTYPE<data_type, data_type, …>

E.g.create_union(1, ‘a’, 63)

c. Column Data Types in Hive

Column Hive data types are further divided into 6 categories:

Integral Type

Strings

Timestamp

Dates

Decimals

Union Types

HIVE file format:

The file formats in Hive specify how records are encoded in a file.

The file formats are :

Text File: The default file format is text file. In this format, each record is a line in the file.

Sequential file: Sequential files are flat files that store binary key-value pairs. It includes
compression support which reduces the CPU, I/O requirement.

RC File (Record Columnar File): RCFile stores the data in a column-oriented manner, which ensures that aggregation is not an expensive operation. Instead of only partitioning the table horizontally like a row-oriented DBMS, RCFile partitions the table first horizontally and then vertically to serialize the data.

HIVE Query Language (HQL):

Hive query language provides basic SQL like operations.

Basic tasks of HQL are:

Create and manage tables and partitions

Support various relational, arithmetic and logical operations

Evaluate functions

Download the contents of a table to a local directory or the results of queries to an HDFS directory.

HIVE DDL Statements:

These statements are used to build and modify the tables and other objects in the
database.

Create/Drop/Alter Database

Create/Drop/truncate Table
Alter Table/partition/column

Create/Drop/Alter view

Create/Drop/Alter index

Show

Describe

HIVE DML Statements:

These statements are used to retrieve, store, modify, delete and update data in the database. The DML commands are:

Loading files into table

Inserting data into Hive table from queries

HIVE Example – 1: (Joins)

CREATE TABLE customer (id INT, name STRING, address STRING)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '#';

CREATE TABLE order_cust (id INT, cus_id INT, prod_id INT, price INT)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

SELECT * FROM customer c JOIN order_cust o ON (c.id = o.cus_id);

SELECT c.id, c.name, c.address, ce.exp

FROM customer c JOIN (SELECT cus_id, sum(price) AS exp

FROM order_cust

GROUP BY cus_id) ce ON (c.id = ce.cus_id);

HIVE Example – 2:

To create join between student and department tables where we use RollNo from both the
tables as the join key

CREATE TABLE IF NOT EXISTS STUDENT (rollno INT, name STRING, gpa FLOAT)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv' OVERWRITE INTO TABLE STUDENT;

CREATE TABLE IF NOT EXISTS DEPARTMENT (rollno INT, deptno INT, name STRING)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH '/root/hivedemos/department.tsv' OVERWRITE INTO TABLE DEPARTMENT;

SELECT a.rollno, a.name, a.gpa, b.deptno FROM STUDENT a JOIN DEPARTMENT b ON a.rollno = b.rollno;

HIVE example – 3:

Write HQL sub-query to count occurrence of similar words in the file

CREATE TABLE docs (line STRING);

LOAD DATA LOCAL INPATH '/root/hivedemos/line.txt' OVERWRITE INTO TABLE docs;

CREATE TABLE word_count AS

SELECT word, count(1) AS count FROM

(SELECT explode(split(line, ' ')) AS word FROM docs) w

GROUP BY word

ORDER BY word;

SELECT * FROM word_count;

The explode function takes an array as input and outputs the elements of the array as
separate rows.

PIG

History of Pig:

In 2006, Apache Pig was developed as a research project at Yahoo, especially to create and
execute MapReduce jobs on every dataset.

In 2007, Apache Pig was open sourced via Apache incubator. In 2008, the first release of
Apache Pig came out.
In 2010, Apache Pig graduated as an Apache top-level project.

Features of Pig:

Apache Pig is an abstraction over MapReduce.

It is a tool/platform which is used to analyze larger sets of data representing them as data
flows.

Pig is generally used with Hadoop; we can perform all the data manipulation operations in
Hadoop using Apache Pig.

To write data analysis programs, Pig provides a high-level language known as Pig Latin.

This language provides various operators using which programmers can develop their own
functions for reading, writing, and processing data.

To analyze data using Apache Pig, programmers need to write scripts using Pig Latin
language.

All these scripts are internally converted to Map and Reduce tasks.

Apache Pig has a component known as Pig Engine that accepts the Pig Latin scripts as input
and converts those scripts into MapReduce jobs.

Why Do We Need Apache Pig?

Programmers who are not so good at Java normally used to struggle working with Hadoop,
especially while performing any MapReduce tasks. Apache Pig is a boon for all such
programmers.

Features of Pig:

Rich set of operators: It provides many operators to perform operations like join, sort, filter,
etc.

Ease of programming: Pig Latin is similar to SQL and it is easy to write a Pig script if you
are good at SQL.

Optimization opportunities: The tasks in Apache Pig optimize their execution automatically, so
programmers need to focus only on the semantics of the language.

Extensibility: Using the existing operators, users can develop their own functions to read,
process, and write data.

UDF’s: Pig provides the facility to create User-defined Functions in other programming
languages such as Java and invoke or embed them in Pig Scripts.

Handles all kinds of data: Apache Pig analyzes all kinds of data, both structured as well as
unstructured. It stores the results in HDFS.

Pig Vs Hive:

Apache Pig uses a language called Pig Latin, originally created at Yahoo. Hive uses a language
called HiveQL, originally created at Facebook.

Pig Latin is a data flow language. HiveQL is a query processing language.

Pig Latin is a procedural language and fits in the pipeline paradigm. HiveQL is a declarative
language.

Apache Pig can handle structured, unstructured, and semi-structured data. Hive is mostly for
structured data.

Pig Vs SQL:

Pig Latin is a procedural language. SQL is a declarative language.

In Apache Pig, schema is optional; we can store data without designing a schema (values are
referenced positionally as $0, $1, etc.). In SQL, schema is mandatory.

The data model in Apache Pig is nested relational. The data model used in SQL is flat relational.

Apache Pig provides limited opportunity for query optimization. There is more opportunity for
query optimization in SQL.

PIG architecture:

The language used to analyze data in Hadoop using Pig is known as Pig Latin.

It is a high-level data processing language which provides a rich set of data types and
operators to perform various operations on the data.

To perform a particular task using Pig, programmers need to write a Pig script in the Pig Latin
language and execute it using any of the execution mechanisms (Grunt shell, script, or embedded).

After execution, these scripts will go through a series of transformations applied by the Pig
Framework, to produce the desired output.

Internally, Apache Pig converts these scripts into a series of MapReduce jobs, thus making the
programmer's job easy. The main components of the Apache Pig architecture are described below.

Parser: Initially, the Pig scripts are handled by the Parser. It checks the syntax of the script,
does type checking, and performs other miscellaneous checks. The output of the parser is a DAG
(directed acyclic graph), which represents the Pig Latin statements and logical operators.

In the DAG, the logical operators of the script are represented as the nodes and the data
flows are represented as edges.

Optimizer: The logical plan (DAG) is passed to the logical optimizer, which carries out the
logical optimizations such as projection and pushdown.

Compiler: The compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine: The MapReduce jobs are submitted to Hadoop in a sorted order. Finally, these
MapReduce jobs are executed on Hadoop, producing the desired results.

PIG Latin Data model: The data model of Pig Latin is fully nested and it allows complex
non-atomic data types such as map and tuple. The elements of Pig Latin's data model are
described below.

PIG Latin Data Types:

Atom: Any single value in Pig Latin, irrespective of its data type, is known as an Atom.

It is stored as a string and can be used as a string or a number. int, long, float, double,
chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is
known as a field.

Tuple: A record that is formed by an ordered set of fields is known as a tuple; the fields can
be of any type. A tuple is similar to a row in a table of an RDBMS.

Example: (Raja, 30)

Bag: A collection of tuples (non-unique) is known as a bag. Each tuple can have any
number of fields (flexible schema). A bag is represented by '{}'.

It is similar to a table in an RDBMS, but unlike a table in an RDBMS, it is not necessary that
every tuple contains the same number of fields or that the fields in the same position (column)
have the same type.

Example: {(Raja, 30), (Mohammad, 45)}

A bag can be a field in a relation; in that context, it is known as inner bag.

Example: (Raja, 30, {(9848022338), (raja@gmail.com)})

Relation: A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no
guarantee that tuples are processed in any particular order).

Map: A map (or data map) is a set of key-value pairs. The key needs to be of type chararray
and should be unique. The value can be of any type. It is represented by '[]'.

Example: [name#Raja, age#30]

Apache PIG execution modes:

You can run Apache Pig in two modes, namely, Local Mode and HDFS mode.

Local Mode: In this mode, all the files are installed and run from your local host and local
file system. There is no need for Hadoop or HDFS. This mode is generally used for testing
purposes.

MapReduce Mode: MapReduce mode is where we load or process the data that exists in the
Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we execute the Pig
Latin statements to process the data, a MapReduce job is invoked in the back-end to perform
a particular operation on the data that exists in the HDFS.

Apache Pig scripts can be executed in three ways (Execution Mechanisms):

Interactive Mode (Grunt shell) – You can run Apache Pig in interactive mode using the
Grunt shell. In this shell, you can enter the Pig Latin statements and get the output (using
Dump operator).

Batch Mode (Script) – You can run Apache Pig in batch mode by writing the Pig Latin
script in a single file with the .pig extension.

Embedded Mode (UDF) – Apache Pig provides the provision of defining our own functions
(User Defined Functions) in programming languages such as Java, and using them in our
script.

Grunt Shell:

After invoking the Grunt shell, you can run your Pig scripts in the shell. In addition, the Grunt
shell provides a number of useful shell and utility commands, described below.

Shell Commands: The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts. We can
also invoke any shell commands from it using sh and fs.

Utility commands:

The Grunt shell provides a set of utility commands. These include utility commands such as
clear, help, history, quit, and set; and commands such as exec, kill, and run to control Pig
from the Grunt shell.

PIG: ETL Processing

Pig is widely used for ETL (Extract, Transform and Load). Pig can extract data from
different sources such as ERP systems, accounting systems, flat files, etc. Pig then makes use of
various operators to transform the data and subsequently loads it into the data warehouse.

PIG Philosophy:

Pigs Eat Anything: Pig can process different kinds of data, both structured and
unstructured. It can also easily be extended to operate on data beyond files, including
key/value stores, databases, etc.

Pigs Live Anywhere: Pig not only processes files in HDFS, it also processes files in other
sources such as files in the local file system.

Pigs Are Domestic Animals: Pig allows us to develop user-defined functions, which can be
included in the script for complex operations.
Pigs Fly: Pig processes data quickly.

PIG on Hadoop:

Pig runs on Hadoop. Pig uses both HDFS and the MapReduce programming model. By default,
Pig reads input files from HDFS. Pig stores the intermediate data (data produced
by MapReduce jobs) and the output in HDFS. However, Pig can also read input from, and
place output to, other sources.

PIG supports the following:

HDFS commands

UNIX Shell commands

Relational operators: FILTER, FOREACH, GROUP, DISTINCT, LIMIT, ORDER BY, JOIN,
SPLIT, SAMPLE

Positional parameters

Common mathematical functions

Custom functions

Complex data structures

PIG Latin Relational operators: The commonly used relational operators in Pig Latin are LOAD, STORE, FILTER, FOREACH, GROUP, JOIN, ORDER BY, DISTINCT, LIMIT, SPLIT and SAMPLE.

Exercise Problem:

How to find the number of occurrences of the words in a file using the pig script?

lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);

words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;

grouped = GROUP words BY word;

wordcount = FOREACH grouped GENERATE group, COUNT(words);

DUMP wordcount;

The above Pig script first splits each line into words using the TOKENIZE function, which
creates a bag of words. Using the FLATTEN operator, the bag is converted into a tuple of
words. In the third statement, the words are grouped together so that the count can be
computed, which is done in the fourth statement.

UNIT – V

Introduction to SPARK PROGRAMMING

Apache Spark is a lightning-fast cluster computing technology designed for fast computation.

It was built on top of Hadoop MapReduce and it extends the MapReduce model to
efficiently use more types of computations which includes Interactive Queries and Stream
Processing.

Spark was introduced by Apache Software Foundation for speeding up the Hadoop
computational computing software process.

As against a common belief, Spark is not a modified version of Hadoop. Hadoop is just one
of the ways to implement Spark.

Spark uses Hadoop in two ways – one is storage and second is processing.

Since Spark has its own cluster management computation, it uses Hadoop for storage
purpose only.

Why Spark:

As we know, there was no general-purpose computing engine in the industry, since:

To perform batch processing, we were using Hadoop MapReduce.

Also, to perform stream processing, we were using Apache Storm / S4.

Moreover, for interactive processing, we were using Apache Impala / Apache Tez.

To perform graph processing, we were using Neo4j / Apache Giraph.

Hence, there was no powerful engine in the industry that could process data in both real-time
and batch mode. There was also a requirement for an engine that could respond in sub-second
time and perform in-memory processing.

Apache Spark is a powerful open-source engine that offers real-time stream processing,
interactive processing, graph processing, in-memory processing as well as batch processing.

Components of Apache Spark:

a. Spark Core: Spark Core is a central point of Spark. Basically, it provides an execution
platform for all the Spark applications.

b. Spark SQL: On the top of Spark, Spark SQL enables users to run SQL/HQL queries.

c. Spark Streaming: Spark Streaming enables powerful interactive and streaming data analytics
applications.

d. Spark MLlib: The machine learning library delivers both efficiency and high-quality
algorithms.

e. Spark GraphX

Basically, Spark GraphX is the graph computation engine built on top of Apache Spark that
enables processing of graph data at scale.

f. SparkR

It is an R package that provides a light-weight frontend to use Spark from R.
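
As a rough, minimal PySpark sketch of how these components sit on top of Spark Core, the following lines create a SparkSession, use Spark Core through an RDD, and run a query through Spark SQL. The application name, sample values and column names here are illustrative assumptions, not part of any standard example.

from pyspark.sql import SparkSession

# SparkSession is the entry point; it wraps Spark Core and exposes Spark SQL
spark = SparkSession.builder.appName("ComponentsDemo").getOrCreate()
sc = spark.sparkContext   # handle to Spark Core

# Spark Core: a distributed collection (RDD) and a simple transformation
squares = sc.parallelize([1, 2, 3, 4, 5]).map(lambda x: x * x)
print(squares.collect())  # [1, 4, 9, 16, 25]

# Spark SQL: the same engine queried with SQL/HQL-style syntax
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.createOrReplaceTempView("t")
spark.sql("SELECT id, label FROM t WHERE id > 1").show()

spark.stop()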

Features of Spark:

a. Swift Processing: Apache Spark offers high data processing speed, about 100x faster in
memory and 10x faster on disk. This is possible because Spark reduces the number of
read/write operations to disk.

b. Dynamic in Nature: It is possible to develop parallel applications in Spark easily, since
there are around 80 high-level operators available in Apache Spark.

c. In-Memory Computation in Spark: The increase in processing speed is possible due to
in-memory processing, which enhances the processing speed.

d. Reusability: We can easily reuse Spark code for batch processing, join streams against
historical data, or run ad-hoc queries on stream state.

e. Spark Fault Tolerance: Spark offers fault tolerance. It is possible through Spark's core
abstraction, the RDD.

f. Real-Time Stream Processing: We can do real-time stream processing in Spark.


Basically, Hadoop does not support real-time processing.
g. Lazy Evaluation in Spark: All the transformations we make on Spark RDDs are lazy in
nature; that is, they do not give the result right away. Instead, a new RDD is formed from the
existing one, and computation happens only when an action is called. This increases the
efficiency of the system.

h. Support for Multiple Languages: Spark supports multiple languages, such as Java, R,
Scala and Python. It also overcomes a limitation of Hadoop MapReduce, which can build
applications only in Java.

i. Support for Sophisticated Analysis: Apache Spark provides dedicated tools for streaming
data, interactive/declarative queries and machine learning, which add to the basic map and
reduce operations.

j. Integrated with Hadoop: Spark is flexible; it can run independently and also on the Hadoop
YARN cluster manager, and it can read existing Hadoop data.

k. Spark GraphX: In Spark, we have GraphX, a component for graph and graph-parallel
computation.

l. Cost Efficient: For big data problems, as in Hadoop, a large amount of storage and a large
data center are required during replication. Spark's in-memory processing reduces this need,
so Spark programming turns out to be a cost-effective solution.

RDD (Resilient Distributed Dataset):

The key abstraction of Spark is the RDD. RDD is an acronym for Resilient Distributed Dataset.
It is the fundamental unit of data in Spark: a distributed collection of elements across cluster
nodes, on which parallel operations can be performed.

Ways to create spark RDD:

Basically, there are 3 ways to create Spark RDDs

i. Parallelized collections

By invoking the parallelize method on an existing collection in the driver program, we can create parallelized collections.

ii. External datasets

One can create Spark RDDs by calling the textFile method. This method takes the URL or path of
the file and reads it as a collection of lines.

iii. Existing RDDs

Moreover, we can create a new RDD in Spark by applying a transformation operation on an
existing RDD. A short PySpark sketch of all three ways is given below.
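
A minimal PySpark sketch of the three ways, assuming a local Spark installation; the HDFS path passed to textFile is only a placeholder.

from pyspark import SparkContext

sc = SparkContext(appName="RDDCreation")

# i. Parallelized collection: distribute a local Python list across the cluster
nums = sc.parallelize([1, 2, 3, 4, 5])

# ii. External dataset: read a text file (placeholder path) as an RDD of lines
lines = sc.textFile("hdfs:///user/hadoop/HDFS_File.txt")

# iii. Existing RDD: applying a transformation to an existing RDD gives a new RDD
squares = nums.map(lambda x: x * x)

print(squares.collect())   # [1, 4, 9, 16, 25]
sc.stop()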

Spark RDD Operations:

There are two types of operations, which Spark RDDs supports:

i. Transformation Operations

It creates a new Spark RDD from the existing one. Moreover, it passes the dataset to the
function and returns new dataset.

ii. Action Operations

In Apache Spark, an Action returns the final result to the driver program or writes it to an
external data store. A short sketch contrasting the two kinds of operations is given below.
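
The following small PySpark sketch (with arbitrary sample numbers) contrasts the two: the transformations build up a new RDD lazily, and nothing is computed until an action is called.

from pyspark import SparkContext

sc = SparkContext(appName="RDDOperations")
data = sc.parallelize([1, 2, 3, 4, 5, 6])

# Transformations: each returns a new RDD; no computation happens yet
evens = data.filter(lambda x: x % 2 == 0)
doubled = evens.map(lambda x: x * 2)

# Actions: trigger the computation and return a result to the driver program
print(doubled.collect())   # [4, 8, 12]
print(doubled.count())     # 3

sc.stop()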

RDD Transformations

RDD transformations return a pointer to a new RDD and allow you to create dependencies
between RDDs. Each RDD in the dependency chain (string of dependencies) has a function
for calculating its data and a pointer (dependency) to its parent RDD. This lineage can be
inspected as shown in the sketch below.
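
As a rough illustration of such a dependency chain, the sketch below builds a short chain of transformations and prints the lineage with toDebugString(); the exact text of the output depends on the Spark version.

from pyspark import SparkContext

sc = SparkContext(appName="LineageDemo")

rdd = sc.parallelize(range(10))          # parent RDD
filtered = rdd.filter(lambda x: x > 3)   # depends on rdd
mapped = filtered.map(lambda x: x * 10)  # depends on filtered

# toDebugString() describes this RDD and its recursive dependencies (the lineage)
print(mapped.toDebugString())

sc.stop()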

Actions: Actions trigger execution of the accumulated transformations and return a value to the driver program or write data to external storage. Commonly used actions include collect(), count(), first(), take(n), reduce() and saveAsTextFile().

c. Sparkling Features of Spark RDD

There are various advantages of using RDD. Some of them are

i. In-memory computation: While storing data in an RDD, the data is kept in memory for as long
as you want. Keeping the data in memory improves performance by an order of magnitude.

ii. Lazy Evaluation: Lazy evaluation means the data inside RDDs is not evaluated on the go;
the computation is performed only after an action triggers it. Therefore, Spark limits how much
work it has to do.

iii. Fault Tolerance: If any worker node fails, by using lineage of operations, we can re-
compute the lost partition of RDD from the original one. Hence, it is possible to recover lost
data easily.

iv. Immutability: Immutability means once we create an RDD, we can not manipulate it.
Moreover, we can create a new RDD by performing any transformation. Also, we achieve
consistency through immutability.

v. Persistence: We can store frequently used RDDs in memory and retrieve them directly from
memory without going to disk, which speeds up execution. Moreover, we can perform multiple
operations on the same data. This is only possible by storing the data explicitly in memory by
calling the persist() or cache() function (see the short sketch after this list).
vi. Partitioning: RDDs partition the records logically and distribute the data across various
nodes in the cluster. The logical divisions are only for processing; internally the data has no
division. Hence, it provides parallelism.

vii. Parallel: When we talk about parallel processing, RDDs process the data in parallel over
the cluster.

viii. Location-Stickiness: To compute partitions, RDDs are capable of defining placement
preference, which refers to information about the location of the RDD. The DAGScheduler places
the partitions in such a way that a task is as close to the data as possible, which speeds up
computation.

ix. Coarse-grained Operation: Generally, we apply coarse-grained transformations to Spark
RDDs. This means the operation applies to the whole dataset, not to a single element in the
dataset.

x. Typed: Spark RDDs are typed; for example, RDD[Int], RDD[Long], RDD[String].

xi. No limitation: There is no limit on the number of Spark RDDs we can use. In practice, the
limit depends on the size of disk and memory.
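
A minimal PySpark sketch of persist()/cache(), using arbitrary sample words; the storage level shown is one common choice, not the only option.

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="PersistDemo")

pairs = sc.parallelize(["big", "data", "big", "analytics"]).map(lambda w: (w, 1))

# Keep the RDD in memory so later actions reuse it instead of recomputing it
pairs.persist(StorageLevel.MEMORY_ONLY)   # cache() is a shorthand that uses the default level

print(pairs.count())                                      # first action materializes and caches
print(pairs.reduceByKey(lambda a, b: a + b).collect())    # reuses the cached RDD

pairs.unpersist()
sc.stop()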

BDA Question Bank

List various types of digital data?

Structured, Semi-structured and unstructured

Why an email placed in the Unstructured category?

Because it contains hyperlinks, attachments, videos, images, free flowing text...

What category will you place a CCTV footage into? unstructured

You have just got a book issued from the library. What are the details about the book that
can be placed in an RDBMS table.

Ans: Title, author, publisher, year, no.of pages, type of book, price, ISBN, with CD or not.

Which category would you place the consumer complaints and feedback? Unstructured.

Which category (structured, semi-structured or Unstructured) will you place a web page in?
Unstructured

Which category (structured, semi-structured or Unstructured) will you place a Power point
presentation in? Unstructured

Which category (structured, semi-structured or Unstructured) will you place a word
document in? Unstructured

Doug Laney__________, a Gartner analyst, coined the term Big Data

Volatility____________is the characteristic of data dealing with its retention.

Data lakes____________is a large data repository that stores data in its native format until it
is needed.

Variability_________ is the characteristic of data that explains the spikes in data.

In-memory Analytics____________technology helps query data that resides in a computer's
random access memory (RAM) rather than data stored on physical disks.

Eventual consistency is a consistency model used in distributed computing to achieve high
______Availability

A collection of independent computers that appear to its users as a single coherent system is
__________Distributed systems.

CAP Theorem is also called as _________Brewer's theorem_______

System will continue to function even when network partition occurs is
called_______Partition tolerance_

Every read fetches the most recent write is called ___________Consistency__

A non failing node will return a reasonable response within a reasonable amount of time is
called_______Availability

What is BASE?

State few examples of human generated and machine generated data.

What are the characteristics of data?

Mention few top analytics tools.

Mention few open source analytics tools

Hadoop is ___node__based flat structure

RDBMS is best choice when _consistency_______ is the main concern.

RDBMS supports _______structured____data formats.

In Hadoop, Data is processed in ____parallel__________.

HDFS can be deployed on __________low cost hardware____.

NameNode uses____Fsimage____to store file system namespace.

NameNode uses_____editlog___to record every transaction.

SecondaryNameNode is a ___helper or housekeeping_________daemon.

Data node is responsible for __read/write_______file operation.

Hadoop 2.x is based on ____YARN____architecture.

YARN is responsible for _________CLUSTER MANAGEMENT_____.

HDFS has a ___MASTER__________ / ___SLAVE_____________ architecture.

HDFS is built using ___JAVA_____ language.

The ___Name node_______maintains the files system Namespace.

The number of copies of a file is called the ___Replication factor____of that file.

The typical block size used by HDFS is _____64 MB____

Hadoop 2.x is based on ________architecture.


YARN is responsible for ______________.

One ______gigabytes are there in one Exabyte.

__________open source software was developed from Google MapReduce concept.

The MapReduce programming model widely used in analytics was developed at ______

___________created the popular Hadoop software framework for storage and processing of
large data sets.

_______traditional IT company is the largest Big Data vendor in the world.

According to a study by IBM, approximately ______amount of data existed in the digital
universe in 2012.


HDFS has a _____________ / ________________ architecture.

HDFS is built using ________ language.

The __________maintains the files system Namespace.

The number of copies of a file is called the _______of that file.

A ______contains a list of all blocks on a data node.

The blocks of a file are replicated for ______tolerance.

The typical block size used by HDFS is _________

________perform block creation, deletion and replication upon instruction from the ______

_________is a single point of failure of Hadoop cluster.

_______is a book keeper of HDFS.

There is only one _________daemon per hadoop cluster.

There is a single _______per slave mode.

Hadoop is best used as a _______once and _____many times type of data store.

How many NameNodes can run on a single Hadoop cluster?

How many data nodes can run on a single Hadoop cluster?

Hadoop runs on a large clusters of ________

________is the official development and production platform for Hadoop.
Partitioner phase belongs to _____ task

Combiner is also known as________

What is RecordReader in MapReduce?

MapReduce sorts the intermediate values based on _____

In MapReduce programming, the reduce function is applied ________group at a time.

The metastore consists of ______ and a ____________

The most commonly used interface to interact with Hive is _________

The default metastore for Hive is _________

Metastore contains _________of Hive tables.

_________is responsible for compilation, optimization and execution of Hive queries.

PIG is ______language

In Pig, _________ is used to specify data flow.

Pig provides an ________to execute data flow

_________ and __________ are execution modes of Pig.

The interactive mode of Pig is _______________.

__________,__________and _________are complex data types of Pig.

Pig is used in ___________process.

What is SERDE in HIVE?

The DISTINCT keyword removes duplicate tuples.

The LIMIT keyword is used to display a limited number of tuples in Pig.

ORDER BY is used for sorting.

Transformations are operations on RDDs that return a new RDD (True/False)______.

ETL means ______________.

List the different ways in which RDDs are created.

What does RDD.persist() do?

List any two features of Spark.

Define RDD.

Compare MapReduce and Spark

What does a Spark engine do?

List any two transformations of RDD

List any two actions of RDD.

List any two numeric RDD operations

10 Mark Questions

What is Big Data? Explain the evolution and Challenges of Big Data.

a. What are various types of digital data? Explain How to deal with unstructured data.

b. How is traditional BI environment different from Big data environment.

a.What are the various types of analytics? What is Big Data Analytics? Why it is important?
Discuss the top challenges facing Big Data.

b. What is analytics 3.0? What can we expect from analytics 3.0?

Explain the following in terms of Big Data:

a. In-Memory Analytics b. In-Database Processing

c. Symmetric Multiprocessor System d. Massively Parallel Processing

e. Shared-Nothing Architecture f. CAP Theorem

a. Explain the difference between Hadoop and RDBMS.

b. Explain the core components of Hadoop. Discuss the design of Hadoop distributed file
system and concept in detail.

Explain how to manage resources and applications with Hadoop YARN.

Discuss about the interaction with Hadoop eco system.

a. Write a MapReduce program to arrange the data on user-id, then within the user id sort
them in increasing order of the page count.

b. Illustrate the Mapper task and Reducer task of MapReduce programming with a simple
example.

a. Explain HIVE architecture in detail

b. Write HQL statements to create join between student and department tables where we use
RollNo from both the tables as the join key

c. Write HQL sub-query to count occurrence of similar words in the file

a. Explain the architecture of Pig. Discuss various data types in Pig.

b. Write a word count program in Pig to count the occurrence of similar words in a file.

a. Explain in detail how Hive is different from Pig.

b. Perform the following operations using Hive Query language

Create a database named “STUDENTS” with comments and database properties,

Display a list of databases

Describe a database

To make the databases current working database

To delete or remove a database

What is RDD? Explain the features of RDD. Discuss any five transformation functions and
five actions on pair RDDs.

What is Spark? State the advantages of using Apache Spark over Hadoop MapReduce for big
data processing, with an example.

a. Explain the spark components in detail. Also list the features of Spark.

b. Write a brief note on : Spark Unified Stack.

