Big Data Analytics Notes
FOR
Department of CSE
The material is for internal circulation only and not for commercial purpose. The
contents in this material were taken from internet as one source and some contents
from the textbook titled Big Data Analytics, Seema Acharya, Subhashini Chellappan,
Wiley. Thanks to authors.
UNIT – I: Introduction to big data: Data, Characteristics of data and Types of digital data, Sources of data, Working with unstructured data, Evolution and Definition of big
data, Characteristics and Need of big data, Challenges of big data
Big data analytics: Overview of business intelligence, Data science and Analytics, Meaning
and Characteristics of big data analytics, Need of big data analytics, Classification of
analytics, Challenges to big data analytics, Importance of big data analytics, Basic
terminologies in big data environment.
File System) , Processing Data with Hadoop, Managing Resources and Applications with
Hadoop YARN (Yet another Resource Negotiator), Interacting with Hadoop Ecosystem
UNIT – IV: Introduction to Hive: Introduction to Hive, Hive Architecture , Hive Data
Types, Hive File Format, Hive Query Language (HQL), User-Defined Function (UDF) in
Hive.
Introduction to Pig: Introduction to Pig, The Anatomy of Pig , Pig on Hadoop , Pig
Philosophy , Use Case for Pig: ETL Processing , Pig Latin Overview , Data Types in Pig ,
Running Pig , Execution Modes of Pig, HDFS Commands, Relational Operators, Piggy
Bank , Word Count Example using Pig , Pig at Yahoo!, Pig versus Hive
UNIT – V: Spark: Introduction to data analytics with Spark, Programming with RDDS,
Working with key/value pairs, advanced spark programming.
Text Books:
Learning Spark: Lightning-Fast Big Data Analysis, Holden Karau, Andy Konwinski, Patrick
Wendell, Matei Zaharia, O'Reilly Media, Inc.
Reference Books:
UNIT - I
a. What are the various types of digital data? Explain how to deal with unstructured data.
Digital data is the data that is stored in a computer system or digitally. Digital data is
classified into the following categories:
Structured data
Semi-structured data
Unstructured data
Structured data: This is data which is in an organized form (e.g., in rows and columns) and can be easily used by a computer program. Examples include data stored in RDBMS tables and spreadsheets.
Semi-structured data: This is the data which does not conform to a data model but has
some structure. However, it is not in a form which can be used easily by a computer
program.
E.g., XML and other markup languages such as HTML. Metadata for this data is available but is not sufficient.
Unstructured data:
This is the data which does not conform to a data model or is not in a form which can be
used easily by a computer program.
Sources of unstructured data:
Document
Text Analytics: Text mining is the process of gleaning high quality and meaningful
information from text. It includes tasks such as text categorization, text clustering, sentiment
analysis and concept/entity extraction.
Manual tagging with metadata: This is about tagging the data manually with adequate metadata to provide the requisite semantics to understand unstructured data.
Parts-of-Speech (POS) Tagging: POS tagging is the process of reading text and tagging each word in a sentence as belonging to a particular part of speech, such as noun, verb, adjective, etc.
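As a small illustrative sketch (not from the textbook), POS tagging can be tried out in Python with the third-party NLTK library; the library, its model downloads, and the sample sentence below are assumptions.
import nltk

nltk.download('punkt')                        # tokenizer model (one-time download)
nltk.download('averaged_perceptron_tagger')   # POS tagger model (one-time download)

sentence = "Unstructured data is growing at an exponential rate."
tokens = nltk.word_tokenize(sentence)   # split the sentence into word tokens
print(nltk.pos_tag(tokens))             # tags each token, e.g. ('data', 'NNS'), ('growing', 'VBG')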
1. In a traditional BI environment, all the enterprise's data is housed in a central server, whereas in a Big data environment data resides in a distributed file system.
2. In traditional BI, data is generally analyzed in an offline mode, whereas Big data is analyzed in both real-time and offline modes.
Traditional BI is about structured data and the data is taken to process functions (move data
to code).
Whereas Big data is about variety: structured, semi-structured, and unstructured data, and here the processing functions are taken to the data (move code to data).
What is Big Data? Explain the challenges and evolution of Big Data.
"Big data" is high-volume, -velocity and -variety information assets that demand cost-
effective, innovative forms of information processing for enhanced insight and decision
making.
“Big data is high-volume, -velocity and -variety information assets” talks about
voluminous data that may have great variety(structured, semi-structured and unstructured)
and will require a good speed/pace for storage, preparation, processing and analysis.
“Enhanced insight and decision making” talks about deriving deeper, richer, and
meaningful insights and then using these insights to make faster and better decisions to gain
business value and this competitive edge.
Where does this data get generated? There are multiple sources of Big data. An XLS, a DOC, a PDF, etc. is unstructured data; a video on YouTube, a chat conversation on an internet messenger, a customer feedback form on an online retail website is unstructured data; CCTV coverage and weather forecast data are also unstructured data.
Data storage: File systems, SQL (RDBMSs such as Oracle, MS SQL Server, DB2, MySQL, PostgreSQL, etc.), NoSQL (MongoDB, Cassandra, etc.) and so on.
Sensor data, machine log data, social media, business apps, media and docs.
Velocity: We have moved from the days of batch processing to real-time processing.
Variety: Variety deals with the wide range of data types and sources of data. Structured,
semi-structured and Unstructured.
Structured data: From traditional transaction processing systems and RDBMS, etc.
Unstructured data: For example, unstructured text documents, audio, video, emails, photos, PDFs, social media, etc.
Data: Big in volume, variety and velocity.
Data today is growing at an exponential rate. Most of the data that we have today has been generated in the last two years. The key questions are: will all this data be useful for analysis, and how will we separate knowledge from noise?
Dearth of skilled professionals who possess a high level of proficiency in data science that is
vital in implementing Big data solutions.
Challenges with respect to capture, curation, storage, search, sharing, transfer, analysis,
privacy violations and visualization.
The 1970s and before was the era of mainframes. The data was essentially primitive and structured. Relational databases evolved in the 1980s and 1990s; that was the era of data-intensive applications. The World Wide Web and the Internet of Things (IoT) have led to an onslaught of structured, unstructured, and multimedia data.
3. a. What are the various types of analytics? What is Big Data Analytics? Why is it important? Discuss the top challenges facing Big Data.
b. What is analytics 3.0? What can we expect from analytics 3.0?
Big data Analytics is the process of examining big data to uncover patterns, unearth trends,
and find unknown correlations and other useful information to make faster and better
decisions.
A few top analytics tools are: MS Excel, SAS, IBM SPSS Modeler, R analytics, Statistica,
World Programming Systems (WPS), and Weka.
There are two schools of thought on classifying analytics:
1. Those that classify analytics into basic, operational, advanced and monetized analytics.
2. Those that classify analytics into analytics 1.0, analytics 2.0 and analytics 3.0.
First school of thought:
Basic analytics: This is primarily slicing and dicing of data to help with basic business insights. This is about reporting on historical data, basic visualization, etc.
Advanced Analytics: This largely is about forecasting for the future by way of predictive and
prescriptive modeling.
Analytics 1.0: data was internally sourced; the technology used was relational databases.
Analytics 2.0: data was often externally sourced; the technology included database applications, Hadoop clusters, SQL-to-Hadoop environments, etc.
Analytics 3.0: data is being both internally and externally sourced; the technology includes in-memory analytics, in-database processing, agile analytical methods, machine learning techniques, etc.
Benefits of Big Data Analytics
Organizations decide to deploy big data analytics for a wide variety of reasons, including the
following:
Business Transformation: In general, executives believe that big data analytics offers tremendous potential to revolutionize their organizations.
Innovation Big data analytics can help companies develop products and services that appeal
to their customers, as well as helping them identify new opportunities for revenue
generation.
Lower Costs In the NewVantage Partners Big Data Executive Survey 2017, 49.2 percent of
companies surveyed said that they had successfully decreased expenses as a result of a big
data project.
Increased Security Another key area for big data analytics is IT security. Security software
creates an enormous amount of log data.
Security: The production of more and more data increases security and privacy concerns.
Partition tolerance: how to build partition-tolerant systems that can take care of both hardware and software failures.
Data quality: Inconsistent data, duplicates, logic conflicts, and missing data all result in data
quality challenges.
To resolve the challenges of big data analytics, the first requirement is a technology that provides high storage at cheap cost. We need faster processors to help quicker processing of the data. Affordable and economical open-source software is required, along with parallel processing, high connectivity and high throughput rather than low-capacity systems. Cloud computing and other flexible resource allocation arrangements are required to meet the challenges of big data.
In-Memory Analytics
In-Database processing
CAP Theorem
In-memory Analytics: Data access from non-volatile storage such as hard disk is a slow process. This problem has been addressed using in-memory analytics. Here all the relevant data is stored in Random Access Memory (RAM), or primary storage, thus eliminating the need to access the data from hard disk. The advantages are faster access, rapid deployment, better insights, and minimal IT involvement.
Symmetric Multi-Processor (SMP) System: In this there is a single common main memory that is shared by two or more identical processors. The processors have full access to all I/O devices and are controlled by a single operating system instance.
SMPs are tightly coupled multiprocessor systems. Each processor has its own high-speed memory, called cache memory, and the processors are connected using a system bus.
Massively Parallel Processing (MPP): a large number of processors work on different parts of the data in parallel, each processor with its own operating system and dedicated memory. MPP is different from symmetric multiprocessing in that, in SMP, the processors share the same OS and the same memory. SMP is also referred to as tightly coupled multiprocessing.
There are three ways in which resources can be shared across the nodes of such systems:
Shared memory
Shared disk
Shared nothing
Fault isolation:
Scalability:
CAP Theorem: The CAP theorem is also called Brewer's theorem. It states that in a distributed computing environment, it is impossible to simultaneously provide more than two of the following three guarantees:
Consistency
Availability
Partition tolerance
Consistency implies that every read fetches the most recently written value; all nodes see the same data at the same time.
Availability implies that reads and writes always succeed. In other words, each non-failing
node will return response in a reasonable amount of time.
Partition tolerance implies that the system will continue to function when network partition
occurs.
Basically Available, Soft State, Eventual Consistency (BASE) is a data system design
philosophy that prizes availability over consistency of operations. BASE may be explained
in contrast to another design philosophy - Atomicity, Consistency, Isolation, Durability
(ACID). The ACID model promotes consistency over availability, whereas BASE promotes
availability over consistency.
BIG DATA ANALYTICS
Hadoop is an open source framework that is meant for storage and processing of big data in
a distributed manner.
Open Source – Hadoop is an open source framework which means it is available free of
cost. Also, the users are allowed to change the source code as per their requirements.
Reliability – Hadoop stores data on the cluster in a reliable manner that is independent of
machine. So, the data stored in Hadoop environment is not affected by the failure of the
machine.
Scalability – It is compatible with commodity hardware and we can easily add new hardware to the nodes.
High Availability – The data stored in Hadoop is available to access even after the hardware
failure. In case of hardware failure, the data can be accessed from another node.
HDFS: (Hadoop Distributed File System) – HDFS is the basic storage system of Hadoop.
The large data files running on a cluster of commodity hardware are stored in HDFS. It can
store data in a reliable manner even when hardware fails. The key aspects of HDFS are:
Storage component
Natively redundant.
MapReduce: MapReduce is the Hadoop layer that is responsible for data processing. Applications are written in MapReduce to process unstructured and structured data stored in HDFS.
It is responsible for the parallel processing of a high volume of data by dividing the data into independent tasks. The processing is done in two phases, Map and Reduce.
The Map is the first phase of processing, which specifies complex logic code, and Reduce is the second phase of processing, which specifies light-weight operations such as aggregation and summation.
Explain the features of HDFS. Discuss the design of the Hadoop Distributed File System and its concepts in detail.
HDFS: (Hadoop Distributed File System) – HDFS is the basic storage system of Hadoop.
The large data files running on a cluster of commodity hardware are stored in HDFS. It can
store data in a reliable manner even when hardware fails. The key aspects of HDFS are:
Distributes data across several nodes: divides large file into blocks and stores in various data
nodes.
High Throughput Access: Provides access to data blocks which are nearer to the client.
HDFS Daemons:
NameNode
The NameNode is the master of HDFS that directs the slave DataNodes to perform I/O tasks.
Blocks: HDFS breaks a large file into smaller pieces called blocks (a rack is a collection of DataNodes within the cluster). The NameNode keeps track of the blocks of a file.
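For example, with the default HDFS block size of 128 MB (Hadoop 2.x onwards), a 500 MB file is stored as four blocks (three of 128 MB and one of 116 MB), and with the default replication factor of 3 the cluster holds 12 block replicas of that file in total.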
File System Namespace: The NameNode is the bookkeeper of HDFS. It keeps track of how files are broken down into blocks and which DataNode stores these blocks. The file system namespace is the collection of files in the cluster.
FsImage: file system namespace includes mapping of blocks of a file, file properties and is
stored in a file called FsImage.
EditLog: namenode uses an EditLog (transaction log) to record every transaction that
happens to the file system metadata.
DataNode
Multiple DataNodes run per cluster. Each slave machine in the cluster has a DataNode daemon for reading and writing HDFS blocks of the actual file on the local file system.
During pipeline read and write, DataNodes communicate with each other.
If no heartbeat is received for a period of time, the NameNode assumes that the DataNode has failed, and its blocks are re-replicated on other nodes.
Secondary NameNode: Takes snapshots of HDFS metadata at intervals specified in the Hadoop configuration.
In case of NameNode failure, the Secondary NameNode can be configured manually to bring up the cluster, i.e., we make the Secondary NameNode the NameNode.
The client opens the file that it wishes to read from by calling open() on the DFS.
The DFS communicates with the NameNode to get the location of data blocks. NameNode
returns with the addresses of the DataNodes that the data blocks are stored on.
Subsequent to this, the DFS returns an FSDataInputStream to the client to read from the file.
The client then calls read() on the stream DFSInputStream, which has the addresses of the DataNodes for the first few blocks of the file.
Client calls read() repeatedly to stream the data from the DataNode.
When the end of the block is reached, DFSInputStream closes the connection with the
DataNode. It repeats the steps to find the best DataNode for the next block and subsequent
blocks.
When the client completes the reading of the file, it calls close() on the FSDataInputStream to close the connection.
An RPC call to the namenode happens through the DFS to create a new file.
As the client writes data, the data is split into packets by DFSOutputStream, which are then written to an internal queue, called the data queue. The DataStreamer consumes the data queue.
The DataStreamer streams the packets to the first DataNode in the pipeline, which stores each packet and forwards it to the second DataNode in the pipeline.
In addition to the internal queue, DFSOutputStream also manages an "ack queue" of the packets that are waiting to be acknowledged by the DataNodes.
When the client finishes writing the file, it calls close() on the stream.
Data Replication: There is absolutely no need for a client application to track all blocks. The NameNode directs the client to the nearest replica to ensure high performance.
Data Pipeline: A client application writes a block to the first DataNode in the pipeline. Then
this DataNode takes over and forwards the data to the next node in the pipeline. This process
continues for all the data blocks, and subsequently all the replicas are written to the disk.
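The read/write anatomy above refers to the Java DFS client. As a rough, hedged sketch only, the same open-write-read-close pattern can also be exercised over WebHDFS from Python using the third-party hdfs package; the package, host name, port and path below are assumptions, not part of these notes.
from hdfs import InsecureClient  # third-party WebHDFS client, assumed installed (pip install hdfs)

# 9870 is the default NameNode web port in Hadoop 3.x (50070 in Hadoop 2.x)
client = InsecureClient('http://namenode-host:9870', user='hadoop')

# Write: data is streamed to HDFS, which replicates each block through the DataNode pipeline
client.write('/user/hadoop/demo.txt', data=b'hello hdfs\n', overwrite=True)

# Read: the NameNode supplies block locations and the bytes are streamed back from a DataNode
with client.read('/user/hadoop/demo.txt') as reader:
    print(reader.read())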
Fig. File Replacement Strategy
Creating a directory in HDFS: hdfs dfs -mkdir <directory path>
Removing a directory from HDFS: hdfs dfs -rm -r <directory path>
Input data set splits into independent chunks. Map tasks process these independent chunks
completely in a parallel manner.
Reduce task: provides reduced output by combining the output of the various mappers. There are two daemons associated with MapReduce programming: JobTracker and TaskTracker.
JobTracker:
Whenever code is submitted to a cluster, the JobTracker creates the execution plan by deciding which task to assign to which node.
It also monitors all the running tasks. When a task fails, it automatically re-schedules the task to a different node after a predefined number of retries.
There will be one JobTracker process running on a single Hadoop cluster. JobTracker processes run on their own Java Virtual Machine process.
Fig. Job Tracker and Task Tracker interaction
TaskTracker:
This daemon is responsible for executing individual tasks that is assigned by the Job
Tracker.
Task Tracker continuously sends heartbeat message to job tracker. When a job tracker fails
to receive a heartbeat message from a TaskTracker, the JobTracker assumes that the
TaskTracker has failed and resubmits the task to another available node in the cluster.
MapReduce working:
MapReduce divides a data analysis task into two parts – Map and Reduce. In the example given below there are two mappers and one reducer.
Each mapper works on the partial data set that is stored on that node and the reducer
combines the output from the mappers to produce the reduced result set.
Steps:
First, the input data is split into pieces. Next, the framework creates a master and several slave processes and executes the worker processes remotely.
Several map tasks work simultaneously and read pieces of data that were assigned to each
map task.
Map worker uses partitioner function to divide the data into regions.
When the map slaves complete their work, the master instructs the reduce slaves to begin
their work.
When all the reduce slaves complete their work, the master transfers the control to the user
program.
Mapper class: this class overrides the map() function based on the problem statement.
Reducer Class: This class overrides the Reduce function based on the problem statement.
NOTE: Based on marks given write MapReduce example if necessary with program.
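As one hedged illustration for the note above (not the textbook's program), the classic word count can be written for Hadoop Streaming in Python: the mapper and reducer read lines from standard input and emit tab-separated key-value pairs. File names, paths and the streaming-jar location are assumptions.
# mapper.py - emits (word, 1) for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# reducer.py - Hadoop sorts the mapper output by key, so counts for a word arrive contiguously
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(count))
        current_word, count = word, int(value)
if current_word is not None:
    print(current_word + "\t" + str(count))

# Submitted with the Hadoop Streaming jar, for example:
# hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py \
#   -mapper "python mapper.py" -reducer "python reducer.py" -input /in -output /out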
Limitations of Hadoop 1.0: HDFS and MapReduce are core components, while other
components are built around the core.
Not suitable for Machine learning algorithms, graphs, and other memory intensive
algorithms
HDFS Limitation: The NameNode can quickly become overwhelmed as load on the system increases. In Hadoop 2.x this problem is resolved.
Hadoop 2.x can be used for various types of processing such as Batch, Interactive, Online,
Streaming, Graph and others.
a) NameSpace: Takes care of file related operations such as creating files, modifying files
and directories
b) Block storage service: It handles DataNode cluster management and replication.
HDFS 2 Features:
High availability: High availability of NameNode is obtained with the help of Passive
Standby NameNode.
Active-Passive NameNodes handle failover automatically. All namespace edits are recorded to shared NFS (Network File System) storage and there is a single writer at any point of time.
Passive NameNode reads edits from shared storage and keeps updated metadata information.
Hadoop 1.x vs Hadoop 2.x:
1. Hadoop 1.x supports the MapReduce (MR) processing model only and does not support non-MR tools; Hadoop 2.x allows working in MR as well as other distributed computing models like Spark and HBase coprocessors.
2. In Hadoop 1.x, MR does both processing and cluster-resource management; in Hadoop 2.x, YARN does cluster resource management and processing is done using different processing models.
3. Hadoop 1.x has limited scaling of nodes (limited to 4000 nodes per cluster); Hadoop 2.x has better scalability (scalable up to 10,000 nodes per cluster).
4. Hadoop 1.x works on the concept of slots: a slot can run either a Map task or a Reduce task only; Hadoop 2.x works on the concept of containers, which can run generic tasks.
5. Hadoop 1.x has a single NameNode to manage the entire namespace; Hadoop 2.x has multiple NameNode servers managing multiple namespaces.
6. Hadoop 1.x has a Single Point of Failure (SPOF) because of the single NameNode; Hadoop 2.x overcomes the SPOF with a standby NameNode, and in case of NameNode failure it is configured for automatic recovery.
7. The MR API is compatible with Hadoop 1.x: a program written for Hadoop 1 executes in Hadoop 1.x without any additional files; in Hadoop 2.x, the MR API requires additional files for a program written in Hadoop 1.x to execute.
8. Hadoop 1.x has limitations in serving as a platform for event processing, streaming and real-time operations; Hadoop 2.x can serve as a platform for a wide variety of data analytics, making it possible to run event processing, streaming and real-time operations.
9. Hadoop 1.x does not support Microsoft Windows; Hadoop 2.x added support for Microsoft Windows.
Explain in detail about YARN?
The fundamental idea behind the YARN (Yet Another Resource Negotiator) architecture is to split the JobTracker's responsibilities of resource management and job scheduling/monitoring into separate daemons.
ResourceManager: the master daemon that arbitrates cluster resources among all the applications in the system.
NodeManager:
The NodeManager runs on each slave node and monitors resource usage such as memory, CPU, disk, network, etc.
Per-Application ApplicationMaster: The per-application ApplicationMaster is an application-specific entity. Its responsibility is to negotiate the required resources for execution from the ResourceManager.
It works along with the NodeManager for executing and monitoring component tasks.
Container: The basic unit of allocation. It replaces the fixed map/reduce slots and enables fine-grained resource allocation across multiple resource types, e.g., Container_1: 1 GB RAM, 6 CPUs.
The Resource Manager launches the Application Master by assigning some container.
The Application Master registers with the Resource manager.
During the application execution, the client that submitted the job directly communicates
with the Application Master to get status, progress updates.
Once the application has been processed completely, the ApplicationMaster deregisters with the ResourceManager and shuts down, allowing its own container to be repurposed.
HDFS: Hadoop Distributed File System. It simply stores data files as close to the original
form as possible.
HBase: It is Hadoop’s distributed column based database. It supports structured data storage
for large tables.
Hive: It is a Hadoop’s data warehouse, enables analysis of large data sets using a language
very similar to SQL. So, one can access data stored in hadoop cluster by using Hive.
Pig: Pig is an easy-to-understand data flow language. It helps with the analysis of large data sets, which is quite the order of the day with Hadoop, without writing code in the MapReduce paradigm.
ZooKeeper: It is an open source application that configures and synchronizes the distributed systems.
Sqoop: it is used to transfer bulk data between Hadoop and structured data stores such as
relational databases.
Ambari: it is a web based tool for provisioning, Managing and Monitoring Apache Hadoop
clusters.
Hadoop Common: It is a set of common utilities and libraries which handle other Hadoop
modules. It makes sure that the hardware failures are managed by Hadoop cluster
automatically.
Hadoop YARN: It allocates resources which in turn allow different users to execute various
applications without worrying about the increased workloads.
HDFS: The Hadoop Distributed File System stores data in the form of small blocks and distributes them across the cluster. Each block is replicated multiple times to ensure data availability.
Hadoop MapReduce: It executes tasks in a parallel fashion by distributing the data as small
blocks.
Standalone, or local mode: one of the least commonly used environments, used only for running and debugging MapReduce programs. This mode does not use HDFS, nor does it launch any of the Hadoop daemons.
Pseudo-distributed mode: all the Hadoop daemons run on a single machine; it is commonly used for development and testing.
Fully distributed mode: the mode most commonly used in production environments. This mode runs all daemons on a cluster of machines rather than a single one.
XML file configurations in Hadoop:
core-site.xml – This configuration file contains Hadoop core configuration settings, for
example, I/O settings, very common for MapReduce and HDFS. mapred-site.xml – This
configuration file specifies a framework name for MapReduce by setting
mapreduce.framework.name
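A minimal illustrative sketch of such entries (the host and port in the value are assumptions; the property names are the standard Hadoop ones):
core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>  <!-- assumed single-node NameNode address -->
  </property>
</configuration>
mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>  <!-- run MapReduce jobs on YARN -->
  </property>
</configuration>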
Hadoop Architecture is a distributed Master-slave architecture.
Master HDFS: Its main responsibility is partitioning the data storage across the slave nodes. It also keeps track of the locations of data on the DataNodes.
Master Map Reduce: It decides and schedules computation task on slave nodes.
NOTE: Based on marks for the question explain hdfs daemons and mapreduce daemons.
BDA UNIT – III
In MapReduce programming, jobs (applications) are split into a set of map tasks and reduce tasks. Then these tasks are executed in a distributed fashion on the Hadoop cluster. Each task processes a small subset of the data that has been assigned to it. This way, Hadoop distributes the
load across the cluster. Map Reduce job takes a set of files that is stored in HDFS as input.
Map task takes care of loading, parsing, transforming and filtering. The responsibility of
reduce task is grouping and aggregating data that is produced by map tasks to generate final
output. Each map task is broken down into the following phases:
Record Reader
Mapper
Combiner
Partitioner.
The output produced by the map task is known as intermediate keys and values. These
intermediate keys and values are sent to reducer. The reduce tasks are broken down into the
following phases:
Shuffle.
Sort
Reducer
Output format.
Hadoop assigns map tasks to the DataNode where the actual data to be processed resides. This way, Hadoop ensures data locality. Data locality means that data is not moved over the network; only the computational code is moved to process the data, which saves network bandwidth.
Mapper Phases:
Mapper maps the input key-value pairs into a set of intermediate key-value pairs.
RecordReader: converts the byte-oriented view of the input into a record-oriented view and presents it to the Mapper tasks. It presents the tasks with keys and values.
Mapper: Map function works on the key-value pair produced by RecordReader and
generates intermediate (key, value) pairs.
Combiner: It takes the intermediate key-value pairs provided by the mapper and applies a user-specific aggregate function to the output of only one mapper; it is also known as a local Reducer.
Partitioner: Takes the intermediate key-value pairs produced by the mapper and splits them into partitions using a user-defined condition.
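A rough sketch of the idea behind the default partitioning (illustrative Python, not Hadoop's actual Java HashPartitioner):
# Each intermediate (key, value) pair is routed to one of R reduce tasks by hashing its key,
# so all values for the same key end up at the same reducer.
def partition(key, num_reduce_tasks):
    return hash(key) % num_reduce_tasks

print(partition("hadoop", 4))  # every occurrence of "hadoop" maps to the same partition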
Shuffle and Sort: The individual key-value pairs are sorted by key into a larger data list. The data list groups the equivalent keys together so that their values can be iterated easily in the Reducer task.
Reducer:
The Reducer takes the grouped key-value paired data as input and runs a Reducer function
on each one of them.
Here, the data can be aggregated, filtered, and combined in a number of ways, and it requires
a wide range of processing.
Once the execution is over, it gives zero or more key-value pairs to the final step.
Output format:
In the output phase, we have an output formatter that translates the final key-value pairs
from the Reducer function and writes them onto a file using a record writer.
Speeds up data transfer across the network
Social networks
Health Care
Business
Banking
Stock Market
Weather Forecasting
Data Serialization:
Data Serialization is the process of converting object data into byte stream data for
transmission over a network across different nodes in a cluster or for persistent data storage.
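For instance, a minimal Python sketch (the record shown is made up) of turning an in-memory object into a byte stream and back:
import json

record = {"user": "u42", "page_count": 7}        # hypothetical in-memory object
payload = json.dumps(record).encode("utf-8")     # serialization: object -> byte stream
restored = json.loads(payload.decode("utf-8"))   # deserialization: byte stream -> object
assert restored == record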
MapReduce offers straightforward, well-documented support for working with simple data
formats such as log files.
But the use of MapReduce has evolved beyond log files to more sophisticated data
serialization formats—such as text, XML, and JSON—to the point that its documentation
and built-in support runs dry.
The goal of this topic is how to work with common data serialization formats, as well as to
examine more structured serialization formats and compare their fitness for use with
MapReduce.
Working with XML and JSON in MapReduce, however, poses two equally important
challenges:
Though MapReduce requires classes that can support reading and writing a particular data
serialization format, there’s a good chance it doesn’t have such classes to support the
serialization format you’re working with.
MapReduce’s power lies in its ability to parallelize reading your input data. If your input
files are large (think hundreds of megabytes or more), it’s crucial that the classes reading
your serialization format be able to split your large files so multiple map tasks can read them
in parallel.
Data serialization support in MapReduce is a property of the input and output classes
that read and write MapReduce data.
XML and JSON are industry-standard data interchange formats. Their ubiquity in the
technology industry is evidenced by their heavy adoption in data storage and exchange.
XML has existed since 1998 as a mechanism to represent data that’s readable by machine
and human alike. It became a universal language for data exchange between systems. It’s
employed by many standards today such as SOAP (Simple Object Access Protocol) and RSS,
and used as an open data format for products such as Microsoft Office.
While MapReduce comes bundled with an InputFormat that works with text, it doesn’t come
with one that supports XML. Working on a single XML file in parallel in MapReduce is
tricky because XML doesn’t contain a synchronization marker in its data format.
Problem You want to work with large XML files in MapReduce and be able to split and
process them in parallel.
Solution Mahout’s XMLInputFormat can be used to work with XML files in HDFS with
MapReduce. It reads records that are delimited by a specific XML begin and end tag. This
technique also covers how XML can be emitted as output in MapReduce output.
JSON shares the machine- and human-readable traits of XML, and has existed since the
early 2000s. It’s less verbose than XML, and doesn’t have the rich typing and validation
features available in XML.
MapReduce and JSON Imagine you have some code that’s downloading JSON data from a
streaming REST service and every hour writes a file into HDFS. The data amount that’s
being downloaded is large, so each file being produced is multiple gigabytes in size. You’ve
been asked to write a MapReduce job that can take as input these large JSON files.
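One hedged sketch of such a job's map side (not the book's solution): a Hadoop Streaming mapper in Python that parses newline-delimited JSON records and emits one key-value pair per record; the field names are assumptions.
# json_mapper.py - assumes one JSON object per input line (newline-delimited JSON)
import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        record = json.loads(line)
    except ValueError:
        continue  # skip malformed records instead of failing the whole task
    # hypothetical fields: emit (user, bytes) so a reducer can sum traffic per user
    print(str(record.get("user", "unknown")) + "\t" + str(record.get("bytes", 0)))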
Code generation—The ability to generate Java classes and utilities that can be used for
serialization and deserialization.
Versioning—The ability for the file format to support backward or forward compatibility.
Transparent compression—The ability for the file format to handle compressing records
internally.
Native support in MapReduce—The input/output formats that support reading and writing
files in their native format (that is, produced directly from the data format library).
Pig and Hive support—The Pig Store and Load Functions (referred to as Funcs) and Hive
SerDe classes to support the data format.
UNIT – IV HIVE & PIG
HIVE:
Hive is a data warehousing tool built on top of Hadoop and is used to query structured data, providing data summarization, querying, and analysis. Hive provides HQL (Hive Query Language), which is similar to SQL. Hive compiles HQL queries into MapReduce jobs and then runs the jobs in the Hadoop cluster.
Features of Hive:
It is similar to SQL
Buckets or clusters: Similar to partitions but uses hash function to segregate data and
determines the cluster or bucket into which the record should be placed.
Hive Architecture:
External interfaces – CLI, Web UI, JDBC, ODBC programming interfaces.
Hive CLI: The most commonly used interface to interact with Hive.
Hive Web Interface: A simple graphical interface to interact with Hive and to execute queries.
Thrift Server – A cross-language service framework. This is an optional server; it can be used to submit Hive jobs from a remote client.
JDBC/ODBC: Jobs can be submitted from a JDBC client. One can write Java code to connect to Hive and submit jobs on it (a small Python-based sketch follows after this component list).
Metastore- Meta data about the Hive tables, partitions. A metastore consists of Meta store
service and Database.
Local Metastore
Remote Metastore
Driver- Brain of Hive! Hive queries are sent to the driver for Compilation, Optimization
and Execution
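As a rough Python-side counterpart to the JDBC route mentioned above (an assumption, not part of these notes), the third-party PyHive package can connect to HiveServer2 over Thrift and run HQL:
from pyhive import hive  # third-party package, assumed installed (pip install pyhive)

conn = hive.connect(host="hiveserver2-host", port=10000, username="hadoop")  # assumed host/user
cursor = conn.cursor()
cursor.execute("SELECT rollno, name, gpa FROM student LIMIT 10")  # hypothetical table
for row in cursor.fetchall():
    print(row)
cursor.close()
conn.close()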
Hive Data types are used for specifying the column/field type in Hive tables.
Types of Data Types in Hive
Mainly Hive Data Types are classified into 5 major categories, let’s discuss them one by
one:
Primitive data types are further divided into 4 types, which are as follows:
The Hive numeric data types are also classified into two types, integral and floating-point. For example:
SMALLINT (2-byte (16-bit) signed integer, from -32,768 to 32,767)
The second category of Apache Hive primitive data types is the Date/Time data types. The following Hive data types come into this category:
DATE (date)
INTERVAL
String data types are the third category under Hive data types. Below are the data types that
come into this-
The fourth category is complex data types. The following data types come into this category:
Array
MAP
STRUCT
UNION
ARRAY
An ordered collection of fields. The fields must all be of the same type.
Syntax: ARRAY<data_type>
ii. MAP
An unordered collection of key-value pairs. Keys must be primitives; values may be any
type. For a particular map, the keys must be the same type, and the values must be the same
type.
iii. STRUCT
iv. UNION
A value that may be one of a number of defined data types. The value is tagged with an integer (zero-indexed) representing its data type in the union.
Integral Type
Strings
Timestamp
Dates
Decimals
Union Types
The file formats in Hive specify how records are encoded in a file.
Text File: The default file format is text file. In this format, each record is a line in the file.
Sequential file: Sequential files are flat files that store binary key-value pairs. It includes
compression support which reduces the CPU, I/O requirement.
RC File ( Record Columnar File): RCFile stores the data in column oriented manner which
ensures that Aggregation operation is not an expensive operation.
RC File (Record Columnar File): Instead of only partitioning the table horizontally like a row-oriented DBMS, RCFile partitions the table first horizontally and then vertically to serialize the data.
Evaluate functions
Download the contents of a table to a local directory, or the results of queries to an HDFS directory.
DDL (Data Definition Language) statements: These statements are used to build and modify the tables and other objects in the database.
Create/Drop/Alter Database
Create/Drop/truncate Table
Alter Table/partition/column
Create/Drop/Alter view
Create/Drop/Alter index
Show
Describe
DML (Data Manipulation Language) statements: These statements are used to retrieve, store, modify, delete and update data in the database. The DML commands are:
FROM order_cust
HIVE Example – 2:
To create a join between the student and department tables, where we use RollNo from both the tables as the join key:
CREATE TABLE IF NOT EXISTS STUDENT (rollno INT, name STRING, gpa FLOAT);
LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv' OVERWRITE INTO TABLE STUDENT;
CREATE TABLE IF NOT EXISTS DEPARTMENT (rollno INT, deptno INT, name STRING);
The join itself can then be written as:
SELECT s.rollno, s.name, s.gpa, d.deptno FROM STUDENT s JOIN DEPARTMENT d ON (s.rollno = d.rollno);
HIVE example – 3: word count using explode (assuming a table docs with a single STRING column named line):
SELECT word, COUNT(*) AS cnt
FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
The explode function takes an array as input and outputs the elements of the array as
separate rows.
PIG
History of Pig:
In 2006, Apache Pig was developed as a research project at Yahoo, especially to create and
execute MapReduce jobs on every dataset.
In 2007, Apache Pig was open sourced via Apache incubator. In 2008, the first release of
Apache Pig came out.
In 2010, Apache Pig graduated as an Apache top-level project.
Features of Pig:
It is a tool/platform which is used to analyze larger sets of data representing them as data
flows.
Pig is generally used with Hadoop; we can perform all the data manipulation operations in
Hadoop using Apache Pig.
To write data analysis programs, Pig provides a high-level language known as Pig Latin.
This language provides various operators using which programmers can develop their own
functions for reading, writing, and processing data.
To analyze data using Apache Pig, programmers need to write scripts using Pig Latin
language.
All these scripts are internally converted to Map and Reduce tasks.
Apache Pig has a component known as Pig Engine that accepts the Pig Latin scripts as input
and converts those scripts into MapReduce jobs.
Programmers who are not so good at Java normally used to struggle working with Hadoop,
especially while performing any MapReduce tasks. Apache Pig is a boon for all such
programmers.
Features of Pig:
Rich set of operators: It provides many operators to perform operations like join, sort, filter, etc.
Ease of programming: Pig Latin is similar to SQL and it is easy to write a Pig script if you
are good at SQL.
Extensibility: Using the existing operators, users can develop their own functions to read,
process, and write data.
UDF’s: Pig provides the facility to create User-defined Functions in other programming
languages such as Java and invoke or embed them in Pig Scripts.
Handles all kinds of data: Apache Pig analyzes all kinds of data, both structured as well as
unstructured. It stores the results in HDFS.
Pig vs Hive:
Apache Pig can handle structured, unstructured, and semi-structured data, whereas Hive is mostly for structured data.
Pig vs SQL:
The data model in Apache Pig is nested relational, whereas the data model used in SQL is flat relational.
PIG architecture:
The language used to analyze data in Hadoop using Pig is known as Pig Latin.
It is a high-level data processing language which provides a rich set of data types and
operators to perform various operations on the data.
To perform a particular task using Pig, programmers need to write a Pig script in the Pig Latin language and execute it using any of the execution mechanisms (Grunt shell, UDFs, embedded).
After execution, these scripts will go through a series of transformations applied by the Pig
Framework, to produce the desired output.
Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus, it
makes the programmer’s job easy. The architecture of Apache Pig is shown below.
Parser :Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script,
does type checking, and other miscellaneous checks. The output of the parser will be a DAG
(directed acyclic graph), which represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes and the data
flows are represented as edges.
Optimizer: The logical plan (DAG) is passed to the logical optimizer, which carries out the
logical optimizations such as projection and pushdown.
Compiler :The compiler compiles the optimized logical plan into a series of MapReduce
jobs.
Execution engine: Finally the MapReduce jobs are submitted to Hadoop in a sorted order.
Finally, these MapReduce jobs are executed on Hadoop producing the desired results.
PIG Latin Data model: The data model of Pig Latin is fully nested and it allows complex
non-atomic datatypes such as map and tuple. Given below is the diagrammatical
representation of Pig Latin’s data model.
Atom: Any single value in Pig Latin, irrespective of its data type, is known as an Atom.
It is stored as string and can be used as string and number. int, long, float, double, chararray,
and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is
known as a field.
Tuple :A record that is formed by an ordered set of fields is known as a tuple, the fields can
be of any type. A tuple is similar to a row in a table of RDBMS.
Bag : A collection of tuples (non-unique) is known as a bag. Each tuple can have any
number of fields (flexible schema). A bag is represented by ‘{}’.
It is similar to a table in RDBMS, but unlike a table in RDBMS, it is not necessary that
every tuple contain the same number of fields or that the fields in the same position (column)
have the same type.
Relation :A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no
guarantee that tuples are processed in any particular order).
Map :A map (or data map) is a set of key-value pairs. The key needs to be of type chararray
and should be unique. The value might be of any type. It is represented by ‘[]’
You can run Apache Pig in two modes, namely, Local Mode and HDFS mode.
Local Mode :In this mode, all the files are installed and run from your local host and local
file system. There is no need of Hadoop or HDFS. This mode is generally used for testing
purpose.
MapReduce Mode :MapReduce mode is where we load or process the data that exists in the
Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we execute the Pig
Latin statements to process the data, a MapReduce job is invoked in the back-end to perform
a particular operation on the data that exists in the HDFS.
Interactive Mode (Grunt shell) – You can run Apache Pig in interactive mode using the
Grunt shell. In this shell, you can enter the Pig Latin statements and get the output (using
Dump operator).
Batch Mode (Script) – You can run Apache Pig in batch mode by writing the Pig Latin script in a single file with the .pig extension.
Embedded Mode (UDF) – Apache Pig provides the provision of defining our own functions
(User Defined Functions) in programming languages such as Java, and using them in our
script.
Grunt Shell:
After invoking the Grunt shell, you can run your Pig scripts in the shell. In addition to that,
there are certain useful shell and utility commands provided by the Grunt shell. This chapter
explains the shell and utility commands provided by the Grunt shell.
Note: In some portions of this chapter, commands like Load and Store are used. Refer to the respective chapters for detailed information on them.
Shell Commands: The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts. Prior to that, we can invoke any shell commands using sh and fs.
Utility commands:
The Grunt shell provides a set of utility commands. These include utility commands such as
clear, help, history, quit, and set; and commands such as exec, kill, and run to control Pig
from the Grunt shell. Given below is the description of the utility commands provided by the
Grunt shell.
Pig is widely used for ETL (Extract, Transform and Load). Pig can extract data from different sources such as ERP systems, accounting systems, flat files, etc. Pig then makes use of various operators to perform transformations on the data and subsequently loads it into the data warehouse.
PIG Philosophy:
Pigs Eat Anything: Pig can process different kinds of data such as Structured or
unstructured. And it can easily be extended to operate on data beyond files, including
key/value stores, databases, etc.
Pigs Live Anywhere: Pig not only processes files in HDFS, it also processes files in other
sources such as files in the local file system.
Pigs Are Domestic Animals: Pig allows us to develop user-defined functions, and the same can be included in the script for complex operations.
Pigs Fly: Pig processes data quickly.
PIG on Hadoop:
Pig runs on Hadoop. Pig uses both HDFS and MapReduce programming. By default, Pig reads input files from HDFS. Pig stores the intermediate data (data produced by MapReduce jobs) and the output in HDFS. However, Pig can also read input from, and place output to, other sources.
HDFS commands
Relational operators: FILTER, FOREACH, GROUP, distinct, limit, order by, join,
split, sample
Positional parameters
Custom functions
Exercise Problem:
How do you find the number of occurrences of the words in a file using a Pig script? A reconstruction of the script described below (the input file name lines.txt is an assumption):
lines = LOAD 'lines.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
DUMP wordcount;
The above Pig script first splits each line into words using the TOKENIZE operator. The TOKENIZE function creates a bag of words; using the FLATTEN function, the bag is converted into tuples. In the third statement, the words are grouped together so that the count can be computed, which is done in the fourth statement.
UNIT – V
Apache Spark was built on top of Hadoop MapReduce and it extends the MapReduce model to efficiently use more types of computations, which include interactive queries and stream processing.
Spark was introduced by Apache Software Foundation for speeding up the Hadoop
computational computing software process.
As against a common belief, Spark is not a modified version of Hadoop. Hadoop is just one
of the ways to implement Spark.
Spark uses Hadoop in two ways – one is storage and second is processing.
Since Spark has its own cluster management computation, it uses Hadoop for storage
purpose only.
Why Spark:
As we know, there was no general-purpose computing engine in the industry: batch processing was done with Hadoop MapReduce, while stream processing needed separate specialized engines.
Moreover, for interactive processing, we were using Apache Impala / Apache Tez.
Hence there was no powerful engine in the industry that could process the data both in real-time and batch mode. Also, there was a requirement that one engine can respond in sub-
second and perform in-memory processing.
Apache Spark is a powerful open source engine. Since, it offers real-time stream processing,
interactive processing, graph processing, in-memory processing as well as batch processing.
a. Spark Core: Spark Core is a central point of Spark. Basically, it provides an execution
platform for all the Spark applications.
b. Spark SQL: On the top of Spark, Spark SQL enables users to run SQL/HQL queries.
c. Spark Streaming: Spark Streaming enables powerful interactive and analytical applications over live streaming data.
d. Spark MLlib: The machine learning library delivers both efficiency as well as high-quality algorithms.
e. Spark GraphX
Basically, Spark GraphX is the graph computation engine built on top of Apache Spark that
enables to process graph data at scale.
f. SparkR: It is an R package that provides a light-weight frontend to use Apache Spark from R.
Features of Spark:
a. Swift Processing :Apache Spark offers high data processing speed. That is about 100x
faster in memory and 10x faster on the disk. However, it is only possible by reducing the
number of read-write to disk.
d. Reusability: We can easily reuse spark code for batch-processing or join stream against
historical data. Also to run ad-hoc queries on stream state.
e. Spark Fault Tolerance: Spark offers fault tolerance. It is possible through Spark’s core
abstraction-RDD.
i. Support for Sophisticated Analysis: There are dedicated tools in Apache Spark. Such as
for streaming data interactive/declarative queries, machine learning which add-on to map
and reduce.
j. Integrated with Hadoop: As we know Spark is flexible. It can run independently and
also on Hadoop YARN Cluster Manager. Even it can read existing Hadoop data.
l. Cost Efficient: For Big data problem as in Hadoop, a large amount of storage and the
large data center is required during replication. Hence, Spark programming turns out to be a
cost-effective solution
The key abstraction of Spark is RDD. RDD is an acronym for Resilient Distributed Dataset.
It is the fundamental unit of data in Spark. Basically, it is a distributed collection of elements across cluster nodes that supports parallel operations.
There are two ways to create RDDs:
i. Parallelized collections: By invoking the parallelize method in the driver program, we can create parallelized collections.
ii. External datasets: One can create Spark RDDs by calling the textFile method. This method takes the URL of a file and reads it as a collection of lines.
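A minimal PySpark sketch of both creation routes (the application name and file path are assumptions):
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-creation-demo")

# 1. Parallelized collection: distribute an existing Python list across the cluster
numbers = sc.parallelize([1, 2, 3, 4, 5])

# 2. External dataset: read a text file as an RDD of lines
lines = sc.textFile("hdfs:///user/hadoop/input.txt")  # assumed path; a local path also works

print(numbers.count())  # action that triggers the computation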
i. Transformation Operations
It creates a new Spark RDD from the existing one. Moreover, it passes the dataset to the
function and returns new dataset.
ii. Action Operations
In Apache Spark, Action returns final result to driver program or write it to the external data
store.
RDD Transformations
RDD transformations return a pointer to a new RDD and allow you to create dependencies between RDDs. Each RDD in the dependency chain (string of dependencies) has a function for calculating its data and has a pointer (dependency) to its parent RDD.
Actions:
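A short PySpark sketch of a few transformations followed by actions, including a pair-RDD reduceByKey (the collection is made up; the output order of collect() may vary):
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-ops-demo")
words = sc.parallelize(["big", "data", "big", "spark", "data", "big"])

# Transformations are lazy: each returns a new RDD that remembers its parent (lineage)
pairs = words.map(lambda w: (w, 1))             # pair RDD of (word, 1)
counts = pairs.reduceByKey(lambda a, b: a + b)  # aggregate the values per key
longer = words.filter(lambda w: len(w) > 3)     # keep only words longer than 3 characters

# Actions are eager: they run the whole lineage and return results to the driver
print(counts.collect())  # e.g. [('big', 3), ('data', 2), ('spark', 1)]
print(longer.count())    # 2
print(words.first())     # 'big'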
c. Sparkling Features of Spark RDD
i. In-memory computation: Basically, while storing data in RDD, data is stored in memory
for as long as you want to store. It improves the performance by an order of magnitudes by
keeping the data in memory.
ii. Lazy Evaluation: Spark Lazy Evaluation means the data inside RDDs are not evaluated
on the go. Basically, only after an action triggers all the changes or the computation is
performed. Therefore, it limits how much work it has to do.
iii. Fault Tolerance: If any worker node fails, by using lineage of operations, we can re-
compute the lost partition of RDD from the original one. Hence, it is possible to recover lost
data easily.
iv. Immutability: Immutability means once we create an RDD, we can not manipulate it.
Moreover, we can create a new RDD by performing any transformation. Also, we achieve
consistency through immutability.
v. Persistence: In in-memory, we can store the frequently used RDD. Also, we can retrieve
them directly from memory without going to disk. It results in the speed of the execution.
Moreover, we can perform multiple operations on the same data. It is only possible by
storing the data explicitly in memory by calling persist() or cache() function.
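A small illustrative sketch of explicit persistence (the log path is an assumption):
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "persist-demo")
logs = sc.textFile("hdfs:///logs/app.log")          # assumed path
errors = logs.filter(lambda line: "ERROR" in line)

errors.persist(StorageLevel.MEMORY_ONLY)  # errors.cache() is shorthand for this level
print(errors.count())   # first action computes the RDD and stores it in memory
print(errors.count())   # subsequent actions reuse the in-memory copy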
vi. Partitioning: Basically, RDDs partition the records logically and distribute the data across various nodes in the cluster. Moreover, the logical divisions are only for processing; internally there is no physical division. Hence, it provides parallelism.
vii. Parallel: While we talk about parallel processing, RDDs process the data in parallel over the cluster.
x. Typed: There are several types of Spark RDDs, such as RDD[Int], RDD[Long], RDD[String].
xi. No limitation: There is no limit on the number of Spark RDDs we can use. Basically, the limit depends on the size of disk and memory.
BDA Question Bank
You have just got a book issued from the library. What are the details about the book that
can be placed in an RDBMS table.
Ans: Title, author, publisher, year, no.of pages, type of book, price, ISBN, with CD or not.
Which category would you place the consumer complaints and feedback? Unstructured.
Which category (structured, semi-structured or Unstructured) will you place a web page in?
Unstructured
Which category (structured, semi-structured or Unstructured) will you place a Power point
presentation in? Unstructured
Data lakes____________is a large data repository that stores data in its native format until it
is needed.
A collection of independent computers that appear to its users as a single coherent system is
__________Distributed systems.
System will continue to function even when network partition occurs is
called_______Partition tolerance_
A non failing node will return a reasonable response within a reasonable amount of time is
called_______Availability
What is BASE?
The number of copies of a file is called the ___Replication factor____of that file.
The MapReduce programming model widely used in analytics was developed at ______
___________created the popular Hadoop software framework for storage and processing of
large data sets.
are foundation.
________perform block creation, deletion and replication upon instruction from the ______
Hadoop is best used as a _______once and _____many times type of data store.
________is the official development and production platform for Hadoop.
Partitioner phase belongs to _____ task
PIG is ______language
Define RDD.
10 Mark Questions
What is Big Data? Explain the evolution and Challenges of Big Data.
a. What are various types of digital data? Explain How to deal with unstructured data.
a. What are the various types of analytics? What is Big Data Analytics? Why is it important?
Discuss the top challenges facing Big Data.
b. Explain the core components of Hadoop. Discuss the design of Hadoop distributed file
system and concept in detail.
a. Write a MapReduce program to arrange the data on user-id, then within the user id sort
them in increasing order of the page count.
b. illustrate the Mapper task and Reduce task of MapReduce programming with a simple
example.
b. Write HQL statements to create join between student and department tables where we use
RollNo from both the tables as the join key
b. Write a word count program in Pig to count the occurrence of similar words in a file.
Describe a database
What is RDD? Explain the features of RDD. Discuss any five transformation functions and
five actions on pair RDD’s.
What is spark? State the advantages of using Apache spark over Hadoop MapReduce for Big
data processing with example.
a. Explain the spark components in detail. Also list the features of Spark.