Introduction to Big Data
Objectives
After studying this unit, you will be able to:
• understand what Big Data is
• understand applications of Big Data
• learn tools used in Big Data
• know challenges in Big Data
Introduction
The quantity of data created by humans is increasing rapidly every year as a result of the
introduction of new technologies, gadgets, and communication channels such as social networking
sites. Big data is a collection of enormous datasets that cannot be handled with typical computing methods.
It is no longer a single technique or tool; rather, it has evolved into a comprehensive subject
involving a variety of tools, techniques, and frameworks. Data itself consists of quantities, characters, or symbols on which a
computer performs operations, and which can be stored and communicated as electrical signals and
recorded on magnetic, optical, or mechanical media.
Big data is a term that describes the massive amount of structured and unstructured data that a
company encounters on a daily basis.
Note
It may be studied for insights that lead to improved business choices and strategic
movements.
It is a collection of structured, semi-structured, and unstructured data that may be mined
for information and utilized in machine learning, predictive modelling, and other
advanced analytics initiatives.
Volume
The term 'Big Data' refers to a massive amount of information. The term "volume" refers to a large
amount of data. The magnitude of data plays a critical role in determining its worth. When the
amount of data is extremely vast, it is referred to as 'Big Data.'
This means that the volume of data determines whether or not a set of data may be classified as Big
Data. As a result, while dealing with Big Data, it is vital to consider a certain 'Volume.'
Example:
In 2016, worldwide mobile traffic was estimated at 6.2 exabytes (6.2 billion GB) per month, and by 2020 the world was expected to hold about 40,000 exabytes of data.
Velocity
The term "velocity" refers to the rapid collection of data. Data comes in at a high rate from
machines, networks, social media, mobile phones, and other sources in Big Data velocity. A large
and constant influx of data exists. This influences the data's potential, or how quickly data is
created and processed in order to satisfy needs. Data sampling can assist in dealing with issues
such as'velocity.' For instance, Google receives more than 3.5 billion queries every day. In addition,
the number of Facebook users is growing at a rate of around 22% every year.
Variety
Variety refers to the different forms data can take. Structured data is data that has been organized; it usually refers to data that has been specified
in terms of length and format.
Semi-structured data is data that is only partially organized. It does not follow
the traditional data structure; log files are a typical example of this sort of data.
Unstructured data is data that has not been organized. It usually refers to data that doesn't fit
cleanly into a relational database's standard row and column structure. Texts, pictures, videos, etc.
are examples of unstructured data that cannot be stored in the form of rows and columns.
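To make the three varieties concrete, the following is a minimal Python sketch contrasting a structured record, a semi-structured JSON log entry, and unstructured text; the field names and values are made up for illustration.

```python
import json

# Structured data: fixed schema, fits rows and columns (e.g. a relational table).
customer_row = ("C1001", "Asha Verma", "2016-03-21", 4599.00)

# Semi-structured data: self-describing but flexible, e.g. a JSON log entry.
log_entry = json.loads('{"ts": "2016-03-21T10:15:00Z", "level": "INFO", '
                       '"msg": "order placed", "order_id": "O-778"}')

# Unstructured data: free text (or images, video) with no predefined fields.
review_text = "Delivery was quick, but the packaging could have been better."

print(customer_row[1])           # access by position, schema known in advance
print(log_entry["order_id"])     # access by key, keys may vary per record
print(len(review_text.split()))  # needs text analytics to extract meaning
```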
3. Improved customer service (traditional customer feedback systems are being replaced by
new systems designed with Big Data technologies).
4. Improved customer insight (in these new systems, Big Data and natural language processing
technologies are used to read and evaluate consumer responses).
5. Early identification of risk to the product/services, if any
6. Better operational efficiency
Big Data technologies can be used for creating a staging area or landing zone for new data before
identifying what data should be moved to the data warehouse. In addition, such integration of Big
Data technologies and data warehouse helps an organization to offload infrequently accessed data.
Using Big Data Analytics to Solve Advertisers' Problems and Offer Marketing
Insights
All company activities are shaped by big data analytics. It allows businesses to meet client
expectations. Big data analytics aids in the modification of a company's product range. It
guarantees that marketing initiatives are effective.
• Big Data Analytics as a Driver of Innovations and Product Development
Companies may use big data to innovate and revamp their goods.
There are three major types of business applications of Big Data: (1) monitoring and tracking applications, (2) analysis and insight applications, and (3) new product development.
• Asset Tracking
The US Department of Defense is encouraging industry to devise a tiny RFID chip that
could prevent the counterfeiting of electronic parts that end up in avionics or circuit boards
for other devices. Airplanes are among the heaviest users of sensors, which track every
aspect of the performance of every part of the plane. The data can be displayed on a
dashboard as well as stored for later detailed analysis. Working with communicating
devices, these sensors can produce a torrent of data. Theft by shoppers and employees is a
major source of revenue loss for retailers. All valuable items in a store can be assigned
RFID tags, and the gates of the store can be equipped with RF readers. This can help
secure the products and reduce leakage (theft) from the store.
• Supply chain monitoring
All containers on ships communicate their status and location using RFID tags. Thus
retailers and their suppliers can gain real-time visibility into inventory throughout the
global supply chain. Retailers can know exactly where the items are in the warehouse, and
so can bring them into the store at the right time. This is particularly relevant for seasonal
items that must be sold on time, or else be sold at a discount. With item-level
RFID tags, retailers also gain full visibility of each item and can serve their customers
better.
Barack Obama's presidential campaign created personal profiles of millions of supporters, recording what they had done and could do
for the campaign. The data was used to identify undecided voters who could be converted
to the campaign's side. The phone numbers of these undecided voters were given to volunteers, and
the results of the calls were recorded in real time using interactive web applications.
Obama himself used his Twitter account to communicate his message directly with his
millions of followers. After the elections, Obama converted his list of tens of millions of
supporters into an advocacy machine that would provide grassroots support for the
president's initiatives. Since then, almost all campaigns have used big data.
Senator Bernie Sanders used the same big data playbook to build an effective national
political machine powered entirely by small donors. Election analyst Nate Silver created
sophisticated predictive models using inputs from many political polls and surveys, outperforming
pundits in successfully predicting the winner of US elections. Nate was, however,
unsuccessful in predicting Donald Trump's rise and ultimate victory, which shows the
limits of big data.
Personal health
Medical knowledge and technology are growing by leaps and bounds. IBM's Watson
system is a big data analytics engine that ingests and digests all the medical information in
the world and then applies it intelligently to an individual situation. Watson can provide a
detailed and accurate medical diagnosis using current symptoms, patient history, medical
history, environmental trends, and other parameters. Similar products might be
offered as an app to licensed doctors, and even individuals, to improve productivity and
accuracy in health care.
Recommendation service
E-commerce has been a fast-growing industry in the last couple of decades. A variety of
products are sold and shared over the internet. Web users' browsing and purchase history
on e-commerce sites is used to learn about their preferences and needs, and to advertise
relevant product and pricing offers in real time. Amazon uses a personalized
recommendation engine to suggest additional products to consumers based
on affinities between various products.
Netflix also uses a recommendation engine to suggest entertainment options to its users. Big
data is valuable across all industries.
There are three major types of data sources of big data: people-to-people communications,
people-to-machine communications, and machine-to-machine communications, and each type has many sources
of data. There are also three types of applications: monitoring, analysis, and
new product development. They have an impact on the efficiency, effectiveness, and even disruption of
industries.
Apache Hadoop
The Apache Hadoop software library is a big data framework. It enables massive data sets to be
processed across clusters of computers in a distributed manner. It is one of the most powerful big
data technologies, with the ability to scale from a single server to thousands of machines.
Features
• Authentication is improved when utilising an HTTP proxy server.
• Specification of the Hadoop Compatible Filesystem effort.
• Extended attributes for POSIX-style filesystems are supported.
• Its big data technologies and tools offer a robust ecosystem that is well suited to meet the analytical needs of developers.
• It brings flexibility in data processing and allows for faster data processing.
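As an illustration of how Hadoop distributes work, the sketch below shows a word-count mapper and reducer written for Hadoop Streaming, which lets any executable act as the map or reduce step. It is a minimal sketch: the input/output paths and the streaming jar location in the usage comment are assumptions that depend on your cluster.

```python
#!/usr/bin/env python3
"""Word-count mapper/reducer usable with Hadoop Streaming.

Illustrative usage (jar location and HDFS paths are assumptions):
  hadoop jar hadoop-streaming.jar \
      -input /data/books -output /data/wordcount \
      -mapper "python3 wordcount.py map" \
      -reducer "python3 wordcount.py reduce" \
      -file wordcount.py
"""
import sys

def mapper():
    # Emit one "word<TAB>1" line per word read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word.lower()}\t1")

def reducer():
    # Hadoop sorts mapper output by key, so equal words arrive consecutively.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current:
            count += int(value)
        else:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if len(sys.argv) > 1 and sys.argv[1] == "map" else reducer()
```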
HPCC
HPCC is a big data tool developed by LexisNexis Risk Solutions. It delivers a single platform, a
single architecture and a single programming language for data processing.
Features
• It is one of the highly efficient big data tools that accomplishes big data tasks with far less
code.
• It is one of the big data processing tools which offers high redundancy and availability.
• It can be used for complex data processing on a Thor cluster. Its graphical IDE
simplifies development, testing and debugging. It automatically optimizes code for
parallel processing.
• It provides enhanced scalability and performance. ECL code compiles into optimized C++, and
it can also be extended using C++ libraries.
Apache STORM
Storm is a free, open-source big data computation system. It is one of the best big data tools, offering
a distributed, real-time, fault-tolerant processing system with real-time computation
capabilities.
Features
• It is one of the best tools in the big data tools list, benchmarked as processing one
million 100-byte messages per second per node.
• It uses parallel calculations that run across a cluster of machines.
• It will automatically restart in case a node dies; the worker will be restarted on another
node. Storm guarantees that each unit of data will be processed at least once or exactly
once.
• Once deployed, Storm is surely one of the easiest tools for big data analysis.
Qubole
Qubole is an autonomous big data management platform. It is an open-source big data tool which
is self-managed and self-optimizing, allowing the data team to focus on business outcomes.
Features
• Single platform for every use case
• It is open-source big data software with engines optimized for the cloud.
• Comprehensive Security, Governance, and Compliance
• Provides actionable Alerts, Insights, and Recommendations to optimize reliability,
performance, and costs.
• Automatically enacts policies to avoid performing repetitive manual actions
Apache Cassandra
The Apache Cassandra database is widely used today to provide effective management of large
amounts of data.
Features
• Support for replicating across multiple data centers, providing lower latency for users
• Data is automatically replicated to multiple nodes for fault tolerance
• It is one of the best big data tools, most suitable for applications that can't afford to
lose data, even when an entire data center is down
• Support contracts and services for Cassandra are available from third parties
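A minimal sketch of how an application might talk to Cassandra using the DataStax cassandra-driver package (pip install cassandra-driver); the contact point, keyspace, and table are illustrative assumptions, and the replication factor of 3 simply demonstrates the automatic multi-node replication mentioned above.

```python
from cassandra.cluster import Cluster

# Connect to a local Cassandra node; in production you would list several
# contact points so the driver can survive a node failure.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Replication factor 3 means each row is automatically copied to three nodes.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS sensors
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS sensors.readings (
        device_id text, ts timestamp, value double,
        PRIMARY KEY (device_id, ts)
    )
""")

# Insert a reading and read it back.
session.execute(
    "INSERT INTO sensors.readings (device_id, ts, value) "
    "VALUES (%s, toTimestamp(now()), %s)",
    ("device-42", 21.7),
)
for row in session.execute(
        "SELECT * FROM sensors.readings WHERE device_id = %s", ("device-42",)):
    print(row.device_id, row.ts, row.value)

cluster.shutdown()
```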
Statwing
Statwing is an easy-to-use statistical tool. It was built by and for big data analysts. Its modern
interface chooses statistical tests automatically.
Features
• It is big data software that can explore any data in seconds. Statwing helps to clean data,
explore relationships, and create charts in minutes.
• It allows creating histograms, scatterplots, heatmaps, and bar charts that export to Excel or
PowerPoint. It also translates results into plain English, so analysts unfamiliar with
statistical analysis can still understand the results.
CouchDB
CouchDB stores data in JSON documents that can be accessed over the web or queried using JavaScript. It
offers distributed scaling with fault-tolerant storage. It allows data access and synchronization through the Couch
Replication Protocol.
Features
• CouchDB is a single-node database that works like any other database.
• It is one of the big data processing tools that allows running a single logical database
server on any number of servers.
• It makes use of the ubiquitous HTTP protocol and JSON data format, offers easy replication of a
database across multiple server instances, and provides an easy interface for document insertion, update,
retrieval and deletion.
• The JSON-based document format is translatable across different languages.
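Because CouchDB exposes everything over HTTP and JSON, a plain HTTP client is enough to work with it. The sketch below uses Python's requests library; the server URL, credentials, and database name are placeholders.

```python
import requests

BASE = "http://127.0.0.1:5984"   # local CouchDB; URL is a placeholder
AUTH = ("admin", "password")     # placeholder admin credentials

# Create a database; CouchDB exposes everything as plain HTTP + JSON.
requests.put(f"{BASE}/demo", auth=AUTH)

# Insert a JSON document; CouchDB assigns an _id and a revision (_rev).
doc = {"type": "sensor_reading", "device": "d-17", "value": 21.4}
resp = requests.post(f"{BASE}/demo", json=doc, auth=AUTH).json()
print(resp)                      # e.g. {'ok': True, 'id': '...', 'rev': '1-...'}

# Fetch the document back by its id.
fetched = requests.get(f"{BASE}/demo/{resp['id']}", auth=AUTH).json()
print(fetched["value"])
```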
Pentaho
Pentaho provides big data tools to extract, prepare and blend data. It offers visualizations and
analytics that change the way to run any business. This Big data tool allows turning big data into
big insights.
Features:
• Data access and integration for effective data visualization. It is big data software that
empowers users to architect big data at the source and stream it for accurate analytics.
• Seamlessly switch or combine data processing with in-cluster execution to get maximum
processing power. It allows checking data with easy access to analytics, including charts,
visualizations, and reporting.
• Supports a wide spectrum of big data sources by offering unique capabilities.
Apache Flink
Apache Flink is one of the best open-source data analytics tools for stream processing of big data. It is
a distributed, high-performing, always-available, and accurate engine for data streaming applications.
Features:
• Provides results that are accurate, even for out-of-order or late-arriving data
• It is stateful and fault-tolerant and can recover from failures.
• It is a big data analytics software which can perform at a large scale, running on thousands
of nodes
• Has good throughput and latency characteristics
• This big data tool supports stream processing and windowing with event-time semantics.
It supports flexible windowing based on time, count, or sessions, as well as data-driven windows.
• It supports a wide range of connectors to third-party systems for data sources and sinks
Cloudera
Cloudera is one of the fastest, easiest and most secure modern big data platforms. It allows anyone to get
any data across any environment within a single, scalable platform.
Features:
• High-performance big data analytics software
• It offers provision for multi-cloud
• Deploy and manage Cloudera Enterprise across AWS, Microsoft Azure and Google Cloud
Platform. Spin up and terminate clusters, and only pay for what is needed, when it is needed.
• Developing and training data models
• Reporting, exploring, and self-servicing business intelligence
• Delivering real-time insights for monitoring and detection
• Conducting accurate model scoring and serving
OpenRefine
OpenRefine is a powerful big data tool. It is a big data analytics software that helps to work with
messy data, cleaning it and transforming it from one format into another. It also allows extending it
with web services and external data.
Features:
• The OpenRefine tool helps you explore large data sets with ease. It can be used to link and
extend your dataset with various web services, and to import data in various formats.
• Explore datasets in a matter of seconds
• Apply basic and advanced cell transformations
• Allows dealing with cells that contain multiple values
• Create instantaneous links between datasets. Use named-entity extraction on text fields to
automatically identify topics. Perform advanced data operations with the help of Refine
Expression Language
RapidMiner
RapidMiner is one of the best open-source data analytics tools. It is used for data prep, machine
learning, and model deployment. It offers a suite of products to build new data mining processes
and set up predictive analysis.
Features
• Allow multiple data management methods
• GUI or batch processing
• Integrates with in-house databases
• Interactive, shareable dashboards
• Big Data predictive analytics
• Remote analysis processing
• Data filtering, merging, joining and aggregating
• Build, train and validate predictive models
• Store streaming data to numerous databases
• Reports and triggered notifications
DataCleaner
DataCleaner is a data quality analysis application and a solution platform. It has a strong data
profiling engine. It is extensible, thereby adding data cleansing, transformation, matching, and
merging capabilities.
Features:
• Interactive and explorative data profiling
• Fuzzy duplicate record detection.
• Data transformation and standardization
• Data validation and reporting
• Use of reference data to cleanse data
• Master the data ingestion pipeline in a Hadoop data lake. Ensure that rules about the data
are correct before users spend their time on processing. Find the outliers and other
devilish details to either exclude or fix the incorrect data.
Kaggle
Kaggle is the world's largest big data community. It helps organizations and researchers to post
their data & statistics. It is the best place to analyze data seamlessly.
Features:
• The best place to discover and seamlessly analyze open data
• Search box to find open datasets.
• Contribute to the open data movement and connect with other data enthusiasts
Apache Hive
Hive is an open-source big data software tool. It allows programmers to analyze large data sets on
Hadoop, and it helps with querying and managing large datasets very fast.
Features:
• It supports an SQL-like query language for interaction and data modeling.
• It compiles the language into two main tasks: map and reduce.
• It allows defining these tasks using Java or Python.
• Hive is designed for managing and querying only structured data.
• Hive's SQL-inspired language separates the user from the complexity of MapReduce
programming.
• It offers a Java Database Connectivity (JDBC) interface.
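A minimal sketch of querying Hive from Python by connecting to HiveServer2 (typically port 10000) with the PyHive package; the host, database, and table are illustrative assumptions. The HiveQL strings look like SQL, but Hive turns them into distributed jobs behind the scenes.

```python
from pyhive import hive   # pip install "pyhive[hive]"

# HiveServer2 usually listens on port 10000; host, database and table are illustrative.
conn = hive.Connection(host="hive-server", port=10000, database="default")
cursor = conn.cursor()

# HiveQL looks like SQL, but Hive compiles it into MapReduce (or Tez/Spark) jobs.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS page_views (
        user_id STRING, url STRING, view_time TIMESTAMP
    )
    STORED AS ORC
""")
cursor.execute("""
    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10
""")
for url, views in cursor.fetchall():
    print(url, views)

conn.close()
```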
Solution
In order to handle these large data sets, companies are opting for modern techniques, such
as compression, tiering, and deduplication. Compression is used for reducing the number of
bits in the data, thus reducing its overall size. Deduplication is the process of removing
duplicate and unwanted data from a data set. Data tiering allows companies to store data in
different storage tiers, ensuring that the data resides in the most appropriate storage space.
Data tiers can be public cloud, private cloud, and flash storage, depending on the data size and
importance. Companies are also opting for Big Data tools, such as Hadoop, NoSQL and other
technologies. This leads us to the third Big Data problem.
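The sketch below is a minimal illustration, using only the Python standard library, of what deduplication and compression do to a batch of records; the sample records are made up for illustration.

```python
import gzip
import hashlib

records = [
    b'{"user": "u1", "event": "click", "page": "/home"}',
    b'{"user": "u2", "event": "click", "page": "/cart"}',
    b'{"user": "u1", "event": "click", "page": "/home"}',   # exact duplicate
]

# Deduplication: keep only one copy of identical records, tracked by content hash.
seen, unique = set(), []
for rec in records:
    digest = hashlib.sha256(rec).hexdigest()
    if digest not in seen:
        seen.add(digest)
        unique.append(rec)

# Compression: reduce the number of bits needed to store the surviving records.
raw = b"\n".join(unique)
compressed = gzip.compress(raw)
print(f"{len(records)} records -> {len(unique)} unique")
print(f"{len(raw)} bytes raw -> {len(compressed)} bytes gzipped")
```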
Companies often struggle to select the right Big Data tool and cannot find clear
answers. They end up making poor decisions and selecting an inappropriate technology.
As a result, money, time, effort and work hours are wasted.
Solution
The best way to go about it is to seek professional help. You can either hire experienced
professionals who know much more about these tools, or you can go for Big Data
consulting, where consultants will recommend the best tools based on your
company's scenario. Based on their advice, you can work out a strategy and then select the best
tool for you.
Solution
Companies are investing more money in the recruitment of skilled professionals. They also
have to offer training programs to the existing staff to get the most out of them. Another
important step taken by organizations is the purchase of data analytics solutions that are
powered by artificial intelligence/machine learning. These tools can be run by professionals
who are not data science experts but have basic knowledge. This step helps companies save
a lot of money on recruitment.
Securing data
• Securing these huge sets of data is one of the daunting challenges of Big Data. Companies
are often so busy understanding, storing and analyzing their data sets that they
push data security to later stages. This is not a smart move, as unprotected data
repositories can become breeding grounds for malicious hackers.
• Companies can lose up to $3.7 million for a stolen record or a data breach.
Solution
Companies are recruiting more cybersecurity professionals to protect their data. Other
steps taken for securing data include the following (a minimal encryption sketch follows this list):
• Data encryption
• Data segregation
• Identity and access control
• Implementation of endpoint security
• Real-time security monitoring
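The sketch below illustrates the data-encryption step using the cryptography package's Fernet recipe (pip install cryptography); the sample record is fictional and, in practice, keys would be held in a key-management service rather than generated next to the data.

```python
from cryptography.fernet import Fernet

# In practice the key lives in a key-management service, never beside the data.
key = Fernet.generate_key()
cipher = Fernet(key)

record = b'{"patient_id": "P-204", "diagnosis": "hypertension"}'
token = cipher.encrypt(record)      # this ciphertext is what gets stored
print(token[:40], b"...")

# Only holders of the key can recover the original record.
print(cipher.decrypt(token))
```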
Solution
Companies have to solve their data integration problems by purchasing the right tools. Some
of the best data integration tools are mentioned below:
IBM InfoSphere
Xplenty
Informatica PowerCenter
CloverDX
Microsoft SQL
QlikView
Oracle Data Service Integrator
In order to put Big Data to the best use, companies have to start doing things differently. This
means hiring better staff, changing the management, reviewing existing business policies and
the technologies being used. To enhance decision making, they can hire a Chief Data Officer – a
step that has been taken by many of the Fortune 500 companies.
Summary
Big data refers to massive, difficult-to-manage data quantities – both organised and
unstructured – that inundate enterprises on a daily basis. Big data may be evaluated for
insights that help people make better judgments and feel more confident about making key
business decisions.
These are the most basic Big Data applications. They assist in enhancing company
efficiency in almost every industry.
These are the big data apps of the future. They have the potential to alter businesses and boost
corporate effectiveness. Big data may be organised and analysed to uncover patterns and
insights that can be used to boost corporate performance.
These are brand-new concepts that didn't exist before. These applications have the potential to
disrupt whole industries and generate new income streams for businesses.
Apache Hadoop is a set of open-source software tools for solving issues involving large
volumes of data and processing utilising a network of many computers. It uses the
MapReduce programming concept to create a software framework for distributed storage and
processing of massive data.
Apache Cassandra is a distributed, wide-column store, NoSQL database management system
that is designed to handle massive volumes of data across many commodity servers while
maintaining high availability and avoiding single points of failure.
Cloudera, Inc. is a Santa Clara, California-based start-up that offers a subscription-based
enterprise data cloud. Cloudera's platform, which is based on open-source technology,
leverages analytics and machine learning to extract insights from data through a secure
connection.
RapidMiner is a data science software platform built by the same-named firm that offers a
unified environment for data preparation, machine learning, deep learning, text mining, and
predictive analytics.
Kaggle, a Google LLC subsidiary, is an online community of data scientists and machine
learning experts.
LexisNexis Risk Solutions created HPCC, often known as DAS, an open source data-intensive
computing system platform. The HPCC platform is based on a software architecture that runs
on commodity computing clusters and provides high-performance, data-parallel processing
for big data applications.
Keywords
Big Data: Big data refers to massive, difficult-to-manage data quantities – both organised and
unstructured – that inundate enterprises on a daily basis. But it's not simply the type or quantity of
data that matters; it's also what businesses do with it. Big data may be evaluated for insights that
help people make better judgments and feel more confident about making key business decisions.
Volume: Transactions, smart (IoT) devices, industrial equipment, videos, photos, audio, social
media, and other sources are all used to collect data. Previously, keeping all of that data would
have been too expensive; now, cheaper storage options such as data lakes, Hadoop, and the cloud
have alleviated the strain.
Velocity: Data floods into organisations at an unprecedented rate as the Internet of Things grows,
and it must be handled quickly. The need to cope with these floods of data in near-real time is
being driven by RFID tags, sensors, and smart metres.
Variety: From organised, quantitative data in traditional databases to unstructured text
documents, emails, movies, audios, stock ticker data, and financial transactions, data comes in a
variety of formats.
Variability: Data flows are unpredictable, changing often and altering substantially, in addition to
rising velocities and variety of data. It's difficult, but companies must recognise when something is
hot on social media and how to manage high data loads on a daily, seasonal, and event-triggered
basis.
Veracity: The quality of data is referred to as veracity. It is tough to link, match, cleanse,
and convert data across systems since it originates from so many diverse places. Relationships,
hierarchies, and numerous data links must all be connected and correlated by businesses. If they
don't, their data will rapidly get out of hand.
Self Assessment
Q1: What are the fundamental elements of BIG DATA?
A. HDFS
B. YARN
C. MapReduce
D. All of these
Q2: What distinguishes BIG DATA Analytics from other types of analytics?
A. Open-Source
B. Scalability
C. Data Recovery
D. All of these
A. Volume
B. Veracity
C. Both a and b
D. Vivid
A. Hadoop is an excellent platform for extracting and analyzing tiny amounts of data.
B. Hadoop uses HDFS to store data and enables data compression and decompression.
C. To solve graph and machine learning problems, the giraph framework is less useful than a
MapReduce framework.
D. None of the mentioned
A. Bare metal
B. Cross-Platform
C. Unix-Like
D. None of the mentioned
Q6: The Hadoop list includes the HBase database, the Apache Mahout ___________ System, and
matrix operations.
A. Pattern recognition
B. HPCC
C. Machine Learning
D. SPSS
Q7: The __________ element of MapReduce is in charge of processing one or more data chunks and
providing output results.
A. MapTask
B. Mapper
C. Task execution
D. All of the mentioned
Q8: Although the Hadoop framework is implemented in Java, MapReduce applications need
not be written in ____________
A. Java
B. C
C. C#
D. None of the mentioned
Q9: Input key/value pairs are mapped to a collection of intermediate key/value pairs using
_____.
A. Mapper
B. Reducer
C. Both
D. None of the mentioned
Q10: The number of maps is usually driven by the total size of ____________
A. inputs
B. outputs
C. tasks
D. None of the mentioned
Q11: The _____________ software library is a big data framework. It allows distributed
processing of large data sets across clusters of computers.
A. Apple Programming
B. R Programming
C. Apache Hadoop
D. All of above
Q12: Which big data tool was developed by LexisNexis Risk Solutions?
A. SPCC System
B. HPCC System
C. TOCC System
D. None of above
Q13: Which big data tool offers a distributed, real-time, fault-tolerant processing system with
real-time computation capabilities?
A. Storm
B. HPCC
C. Qubole
D. Cassandra
Q15: ______ stores data in JSON documents that can be accessed over the web or queried using JavaScript.
A. CouchDB
B. Storm
C. Hive
D. None of above
6. C 7. C 8. A 9. A 10. A
Review Questions
1. Explain five effective characteristics of BIG DATA.
2. Write down applications of BIG DATA.
Further Readings
Maheshwari, Anil. Big Data. McGraw-Hill Education, 2019.
Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will
Transform How We Live, Work, and Think. Houghton Mifflin Harcourt.
McKinsey Global Institute Report (2011). Big Data: The Next Frontier For Innovation,
Competition, and Productivity. Mckinsey.com
Marz, Nathan, and James Warren (2015). Big Data: Principles and Best Practices
of Scalable Realtime Data Systems. Manning Publications.
Sandy Ryza, Uri Laserson, et al. (2014). Advanced Analytics with Spark. O'Reilly.
White, Tom (2014). Mastering Hadoop. O'Reilly.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
2. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
3. Hadoop API site: http://hadoop.apache.org/docs/current/api/
4. NoSQL databases: http://nosql-database.org/
5. Apache Spark: http://spark.apache.org/docs/latest/
6. Tutorials on Big Data technologies: https://www.tutorialspoint.com/
Unit 02: Foundations for Big Data
Dr. Rajni Bhalla, Lovely Professional University
Objectives
• differentiate between file system (FS) and distributed file system (DFS)
• understand scalable computing over the internet.
• understand programming models for Big Data.
Introduction
The first storage mechanism used by computers to store data was punch cards. Each
group of related punch cards (punch cards related to the same program) was stored in
a file, and files were stored in file cabinets. This is very similar to what we do nowadays to
archive papers in government institutions that still use paperwork on a daily basis. This is
where the term "file system" (FS) comes from. Computer systems have evolved, but the
concept remains the same.
A file extension indicates the type of information stored in that file; for example, the EXE
extension refers to executable files and TXT refers to text files. A file management system is
used by the operating system to access the files and folders stored in a computer or any
external storage device.
Features of DFS
Transparency
• Structure transparency: There is no need for the client to know about the number or
locations of file servers and the storage devices. Multiple file servers should be provided
for performance, adaptability, and dependability.
• Access transparency: Both local and remote files should be accessible in the same
manner. The file system should automatically locate the accessed file and send it to
the client's side.
• Naming transparency: There should not be any hint in the name of the file to the
location of the file. Once a name is given to the file, it should not be changed during
transferring from one node to another.
• Replication transparency: If a file is copied on multiple nodes, both the copies of the
file and their locations should be hidden from one node to another.
User mobility: It will automatically bring the user’s home directory to the node where the user
logs in.
Performance: Performance is based on the average amount of time needed to service client
requests. This time covers the CPU time + time taken to access secondary storage + network access
time. It is advisable that the performance of a distributed file system be similar to that of a
centralized file system.
Simplicity and ease of use: The user interface of a file system should be simple and the
number of commands should be small.
High availability: A Distributed File System should be able to continue in case of any partial
failures like a link failure, a node failure, or a storage drive crash.
A highly reliable and adaptable distributed file system should have multiple independent file
servers controlling multiple independent storage devices.
Data replication is a good way to achieve fault tolerance and high concurrency, but it is
very hard to maintain frequent changes. Assume that someone changed a data block on
one cluster; these changes need to be updated on all data replicas of this block.
• High concurrency: make the same piece of data available for processing by multiple clients at
the same time. This is done by using the computation power of each node to process
data blocks in parallel.
On the HTC side, peer-to-peer (P2P) networks are formed for distributed file sharing and content
delivery applications.
A P2P system is built over many client machines. Peer machines are globally distributed in nature.
P2P, cloud computing, and web service platforms are more focused on HTC applications than on
HPC applications. Clustering and P2P technologies lead to the development of computational grids
or data grids.
High-Performance Computing
• For many years, HPC systems have emphasized raw speed performance. The speed of HPC
systems increased from Gflops in the early 1990s to Pflops by 2010. This
improvement was driven mainly by demands from the scientific, engineering, and
manufacturing communities. For example, the Top 500 most powerful computer systems
in the world are ranked by floating-point speed in Linpack benchmark results.
However, the number of supercomputer users is limited to less than 10% of all computer
users. Today, the majority of computer users use desktop computers or large servers
for Internet searches and market-driven computing tasks.
Degrees of Parallelism
• Fifty years ago, when hardware was bulky and expensive, most computers were designed
in a bit-serial fashion. In this scenario, bit-level parallelism (BLP) gradually converted bit-serial
processing to word-level processing. Over the years, users graduated from 4-bit
microprocessors to 8-, 16-, 32-, and 64-bit CPUs. This led to the next wave of
improvement, known as instruction-level parallelism (ILP), in which the processor
executes multiple instructions simultaneously rather than only one instruction at a time.
For the past 30 years, we have practiced ILP through pipelining, superscalar computing,
VLIW (very long instruction word) architectures, and multithreading. ILP requires branch
prediction, dynamic scheduling, speculation, and compiler support to work efficiently.
What is MapReduce?
• Map takes a set of data and converts it into another set of data, where individual elements
are broken down into tuples (key/value pairs). The reduce task takes the
output from a map as input and combines those data tuples into a smaller set of tuples.
As the sequence of the name MapReduce implies, the reduce task is always performed
after the map job.
The Algorithm
• Generally, the MapReduce paradigm is based on sending the computation to where the data
resides. A MapReduce program executes in three stages, namely the map stage, the shuffle stage,
and the reduce stage.
• Map stage − The map or mapper’s job is to process the input data. Generally, the
input data is in the form of file or directory and is stored in the Hadoop file
system (HDFS). The input file is passed to the mapper function line by line. The
mapper processes the data and creates several small chunks of data.
• Reduce stage − This stage is the combination of the Shuffle stage and
the Reduce stage. The Reducer’s job is to process the data that comes from the
mapper. After processing, it produces a new set of output, which will be stored in
the HDFS.
• During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
Figure 11 MapReduce
• The framework manages all the details of data-passing such as issuing tasks, verifying
task completion, and copying data around the cluster between the nodes.
• Most of the computing takes place on nodes with data on local disks that reduces the
network traffic. After completion of the given tasks, the cluster collects and reduces the
data to form an appropriate result, and sends it back to the Hadoop server.
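To see the three stages concretely, here is a small in-memory Python simulation of the map, shuffle, and reduce stages for a word count; it is only an illustration of the data flow, not Hadoop itself, and the sample documents are made up.

```python
from collections import defaultdict
from itertools import chain

documents = ["big data needs big tools", "data tools for big data"]

# Map stage: each input line becomes a list of (key, value) pairs.
def map_fn(line):
    return [(word, 1) for word in line.split()]

# Shuffle stage: group all intermediate values by key (Hadoop does this
# between the map and reduce stages, moving data across the cluster).
groups = defaultdict(list)
for key, value in chain.from_iterable(map_fn(doc) for doc in documents):
    groups[key].append(value)

# Reduce stage: combine the values for each key into a final result.
def reduce_fn(key, values):
    return key, sum(values)

result = dict(reduce_fn(k, v) for k, v in groups.items())
print(result)   # {'big': 3, 'data': 3, 'needs': 1, 'tools': 2, 'for': 1}
```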
Advantages of MapReduce
• It is easy to scale data processing over multiple computing nodes.
• Under the MapReduce model, the data processing primitives are called mappers and
reducers.
• Decomposing a data processing application into mappers and reducers is sometimes
nontrivial.
• But once we write an application in the MapReduce form, scaling the application to run
over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a
configuration change.
• This simple scalability is what has attracted many programmers to use the MapReduce
model.
In a directed graph, or digraph, each edge is associated with a direction from a start vertex to an
end vertex. If we traverse along the direction of the edges and find that no closed loops are
formed along any path, we say that there are no directed cycles; the graph formed is a directed
acyclic graph (DAG). A DAG can always be topologically ordered, i.e., for each edge in the graph, the start
vertex of the edge occurs earlier in the sequence than the ending vertex of the edge.
Topological sorting for a directed acyclic graph (DAG) is a linear ordering of vertices such that for
every directed edge u → v, vertex u comes before v in the ordering. Topological sorting for a graph is
not possible if the graph is not a DAG. For example, one topological sorting of the graph used in this example is
"5 4 2 3 1 0". There can be more than one topological sorting for a graph; for example, another
topological sorting of the same graph is "4 5 2 3 1 0". The first vertex in a topological sorting is
always a vertex with in-degree 0 (a vertex with no incoming edges).
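A minimal sketch of topological sorting using Kahn's algorithm; the six-vertex graph below (edges 5→2, 5→0, 4→0, 4→1, 2→3, 3→1) is assumed to be the one referred to above, for which "5 4 2 3 1 0" and "4 5 2 3 1 0" are both valid orderings. The code may print a different but equally valid order.

```python
from collections import defaultdict, deque

def topological_sort(edges, vertices):
    """Kahn's algorithm: repeatedly emit a vertex whose in-degree is 0."""
    adj = defaultdict(list)
    indegree = {v: 0 for v in vertices}
    for u, v in edges:
        adj[u].append(v)
        indegree[v] += 1

    queue = deque(v for v in vertices if indegree[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:          # removing u lowers the in-degree of its successors
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)

    if len(order) != len(indegree):
        raise ValueError("graph has a directed cycle, so it is not a DAG")
    return order

edges = [(5, 2), (5, 0), (4, 0), (4, 1), (2, 3), (3, 1)]
print(topological_sort(edges, range(6)))   # one valid topological order
```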
Application Areas
Some of the main application areas of DAG are −
• Routing in computer networks
• Job scheduling
• Data processing
• Genealogy
• Citation graphs
Message Passing
Process communication is the mechanism provided by the operating system that allows processes
to communicate with each other. This communication could involve a process letting another
process know that some event has occurred, or transferring data from one process to another.
One of the models of process communication is the message passing model. The message passing model
allows multiple processes to read and write data to a message queue without being connected to
each other. Messages are stored in the queue until their recipient retrieves them. Message queues
are quite useful for inter-process communication and are used by most operating systems.
In this model, two processes P1 and P2 can both access the message queue and store and
retrieve data. The message passing model is much easier to implement than the shared memory
model. It is also easier to build parallel hardware using the message passing model, as it is quite tolerant of
higher communication latencies.
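A minimal sketch of the message passing model using Python's multiprocessing queue: process P1 puts messages on the queue and process P2 retrieves them, with no shared memory between the two. The message contents are illustrative.

```python
from multiprocessing import Process, Queue

def producer(queue):
    # P1 writes messages to the queue; it never touches P2's memory.
    for i in range(3):
        queue.put({"event": "reading", "seq": i})
    queue.put(None)                     # sentinel: no more messages

def consumer(queue):
    # P2 blocks until a message arrives, then processes it.
    while True:
        msg = queue.get()
        if msg is None:
            break
        print("received", msg)

if __name__ == "__main__":
    q = Queue()                         # the message queue both processes use
    p1 = Process(target=producer, args=(q,))
    p2 = Process(target=consumer, args=(q,))
    p1.start(); p2.start()
    p1.join(); p2.join()
```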
Is your organization struggling to keep up with the demands of Big Data and under pressure to
prove quick results? If so, you’re not alone. According to analysts, up to 60% of Big Data projects
are failing because they can’t scale at the enterprise level. Fortunately, taking a step-by-step
approach to application workflow orchestration can help you succeed. It begins with assessing the
various technologies for supporting multiple Big Data projects that relate to these four steps:
• Ingesting data
Improve reliability
It's important to run Big Data workflows successfully to minimize service interruptions. Using
a patchwork of tools and processes makes it hard to identify issues and understand root
causes, putting SLAs at risk. If you can manage your entire Big Data workflow from A to Z,
then if something goes wrong in the process, you'll see it immediately and know where it
happened and what happened. Using the same solution to orchestrate your entire processes
and manage them from a single pane of glass simplifies managing your services and
assuring they run successfully.
Looking ahead
Taking a step-by-step approach to application workflow orchestration simplifies the
complexity of your Big Data workflows. It avoids automation silos and helps assure you meet
SLAs and deliver insights to business users on time. Discover how Control-M provides all
of the capabilities to enable your organization to follow this approach and how it easily
integrates with your existing technologies to support Big Data projects.
Summary
A file system is a programme that controls how and where data is saved, retrieved, and
managed on a storage disc, usually a hard disc drive (HDD). It's a logical disc component
that maintains a disk's internal activities as they relate to a computer while remaining
invisible to the user.
A distributed file system (DFS) or network file system is a type of file system that allows
many hosts to share files over a computer network. Multiple users on multiple machines can
share data and storage resources as a result of this.
Local and remote access techniques should be indistinguishable from one another.
Users who have access to similar communication services at multiple locations are said to be
mobile. For example, a user can use a smartphone and access his email account from any
computer to check or compose emails. The travel of a communication device with or
without a user is referred to as device portability.
Big data refers to massive, difficult-to-manage data quantities – both organised and
unstructured – that inundate enterprises on a daily basis. Big data may be evaluated for
insights that help people make better judgments and feel more confident about making key
business decisions.
These are the most basic Big Data applications. They assist in enhancing company
efficiency in almost every industry.
These are the big data apps of the future. They have the potential to alter businesses and
boost corporate effectiveness. Big data may be organised and analysed to uncover patterns
and insights that can be used to boost corporate performance.
The process of replicating a double-stranded DNA molecule into two identical DNA
molecules is known as DNA replication. Because every time a cell splits, the two new
daughter cells must have the same genetic information, or DNA, as the parent cell,
replication is required.
The capacity of a system to increase or decrease in performance and cost in response to
changes in application and system processing demands is known as scalability. When
considering hardware and software, businesses that are rapidly expanding should pay
special attention to scalability.
In a Hadoop cluster, MapReduce is a programming paradigm that permits tremendous
scalability across hundreds or thousands of computers. MapReduce, as the processing
component, lies at the heart of Apache Hadoop. The reduction job is always carried out after
the map job, as the term MapReduce implies.
Keywords
MapReduce: MapReduce is a framework that allows us to create applications that reliably process
enormous volumes of data in parallel on vast clusters of commodity hardware.
Map Stage: The map or mapper's job is to process the input data given to it. In most cases, the
input data is stored in the Hadoop file system (HDFS) as a file or directory. The input
file is supplied to the mapper function line by line. The mapper divides the data into small chunks and processes them.
Reduce Stage: This stage is the combination of the Shuffle and Reduce stages. The Reducer's job
is to take the data from the mapper and process it. After processing, it generates a new set of
outputs, which will be stored in HDFS.
Data Node: Data is supplied in advance, before any processing takes place at this node.
Directed Cyclic Graph: A directed cycle graph is a cycle graph with all edges pointing in the same
direction.
Message Passing: In computer science, message passing is a way of invoking activity (i.e., running a programme) on a
computer. The calling programme delivers a message to a process (which
could be an actor or an object), and that process and its supporting infrastructure select and run the
relevant code.
Bulk Synchronous Parallel: Bulk Synchronous Parallel (BSP) is a parallel computing programming
model and processing framework. The computation is broken down into a series of supersteps. A
group of processes running the same code executes concurrently in each superstep and generates
messages that are delivered to other processes.
Replication: The process of replicating a double-stranded DNA molecule into two identical DNA
molecules is known as DNA replication. Because every time a cell splits, the two new daughter cells
must have the same genetic information, or DNA, as the parent cell, replication is required.
Self Assessment
Q1: The EXE extension stands for _________________
A. executable files
B. extension files
C. extended files
D. None of above
Q3: Data replication is a good way to achieve ________ and high concurrency; but it’s very hard
to maintain frequent changes.
A. fault tolerance
B. detection tolerance
C. both
D. none of above
A. RFID
B. Sensor technologies
C. GPS
D. All of the above
A. Filename
B. File identifier
C. File extension
D. None of the mentioned.
Q6: What computer technology is used to describe services and applications that run on a
dispersed network using virtualized resources?
A. Distributed Computing
B. Cloud Computing
C. Soft Computing
D. Parallel Computing
Q7: Which one of the following options can be considered as the Cloud?
A. Hadoop
B. Intranet
C. Web Applications
D. All of the mentioned
Q9: Name the three stages in which a MapReduce program executes.
Q10: A directed acyclic graph (DAG) refers to a directed graph which has no ______ cycles.
A. Infinite
B. Directed
C. Direction
D. None of above
A. Map
B. Reduce
C. Both
D. None of above
A. Genealogy
B. Citation graphs
C. Job Scheduling
D. All of above
Q14: Map takes a set of data and converts it into another set of data, where individual elements
are broken down into ________
A. Tables.
B. Tuples (key/value pairs).
C. Reduce stage.
D. None of above.
6. B 7. A 8. D 9. A 10. B
Review Questions
1. Differentiate between file system and distributed file system.
2. Write down features of Distributed file system.
3. Write down popular models of BIG DATA.
4. Write down challenges of BIG DATA.
5. Write a note on the following.
a. The Age of Internet Computing
b. High Throughput Computing
6. What are the advantages and disadvantages of distributed file system?
Further Readings
Maheshwari, Anil. Big Data. McGraw-Hill Education, 2019.
Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will
Transform How We Live, Work, and Think. Houghton Mifflin Harcourt.
McKinsey Global Institute Report (2011). Big Data: The Next Frontier for Innovation,
Competition, and Productivity. Mckinsey.com
Marz, Nathan, and James Warren (2015). Big Data: Principles and Best Practices of
Scalable Realtime Data Systems. Manning Publications.
Sandy Ryza, Uri Laserson, et al. (2014). Advanced Analytics with Spark. O'Reilly.
White, Tom (2014). Mastering Hadoop. O'Reilly.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
2. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
3. Hadoop API site: http://hadoop.apache.org/docs/current/api/
4. NoSQL databases: http://nosql-database.org/
5. Apache Spark: http://spark.apache.org/docs/latest/
6. Tutorials on Big Data technologies: https://www.tutorialspoint.com/
Unit 03: Data Models
Dr. Rajni Bhalla, Lovely Professional University
Objectives
• understand what a data mart is
• understand data formats
• understand data models
• differentiate between a data warehouse and a data mart
• understand what a data stream is
• understand streaming sensor data
Introduction
A data mart is a smaller version of a data warehouse that caters to specialised data analysis
requirements. It is often derived as a subset of a larger data warehouse. The primary goal of data
marts is to do analysis that is difficult to perform in a traditional warehouse due to the varying
levels of granularity of data or the need to perform sophisticated computations.
A computer programmer typically uses a wide variety of tools to store and work with data
in the programs they build. They may use simple variables (single values), arrays (multiple
values), hashes (key-value pairs), or even custom objects built in the syntax of the
language they are using. Sometimes, however, a portable format is required.
• Another program may have to communicate with this program in a similar way, and the
programs may not even be written in the same language, as is often the case with
something like traditional client-server communications as shown in Figure 1.
• This is all perfectly standard within the confines of the software being written. However,
sometimes a more abstract, portable format is required. For instance, a non-programmer
may need to move data in and out of these programs.
• For example, many third-party user interfaces (UIs) are used to interface with public cloud
providers. This is made possible (in a simplified fashion) thanks to standard data formats.
The moral of the story is that we need a standard format to allow a diverse set of software
to communicate with each other, and for humans to interface with it.
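A small Python sketch of the idea: serializing an in-memory object to JSON, a standard, language-neutral format that any other program or UI can parse back into its own data structures; the field names and values are illustrative.

```python
import json

# A Python program serializes its in-memory object to a portable JSON string...
order = {"order_id": "O-778", "items": [{"sku": "A12", "qty": 2}], "total": 49.90}
payload = json.dumps(order)

# ...which any other program (a Java service, a JavaScript UI, a CLI tool) can
# parse, because the format is standardized and language-neutral.
received = json.loads(payload)
print(received["items"][0]["sku"])   # A12
```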
A data warehouse is a central repository that consolidates data from many sources within an organization. The collated data is used to guide business decisions
through analysis, reporting, and data mining tools.
Figure 3: Data streaming
Coming continuously from sources such as sensors and cameras, each record needs to be processed in a way that preserves its relation to other
data and its sequence in time. Log files, e-commerce purchases, weather events, utility service
usage, geo-location of people and things, server activity, and more are all examples where
real-time streaming data is created. When companies are able to analyze the streaming data they
receive, they can get real-time insights to understand exactly what is happening at any given
point in time. This enables better decision-making and allows them to provide customers with better
and more personalized services. Nearly every company uses, or can use, streaming data.
Predictive Maintenance
When companies can identify maintenance issues prior to breakdowns or system failure, they will
save time, money, and other potentially catastrophic effects on the business. Any company that has
equipment of any kind that has sensors or cameras—again, that’s most equipment these days—
will create streaming data. From monitoring the performance of trucks, and airplanes, to
predicting issues with complex manufacturing equipment, real-time data and analytics is
becoming critical to modern enterprises today.
Healthcare
Just like in a manufacturing environment, wearables, and healthcare equipment such as
glucometers, connected scales, heart rate and blood pressure monitors have sensors that monitor a
patient’s vitals and essential body functions. This equipment is also crucial for effective remote
patient monitoring that supports clinicians who don’t have the bandwidth to be everywhere all the
time. It’s literally a matter of life or death. Immediate insights can improve patient outcomes and
experiences.
Retail
Real-time data streaming from IoT sensors and video are driving a modern retail renaissance.
Brick-and-mortar retail stores can engage customers in the moment thanks to streaming data.
Location-based marketing, trend insights, and improvements to operational efficiency, such as
product movement or product freshness, are all possible with real-time insights. Understanding
what a consumer wants when they want it “in the moment” is not only valuable in retail. Any
company that is able to understand and respond immediately to what its customer wants in micro-
moments will have a better chance of being successful, whether it's to deliver something a
consumer wants to learn, discover, watch or buy.
Social media
With cries of “fake news” and instances of social media bullying continuing to rise, the need for
real-time monitoring of posts to quickly take action on offensive and “fake news” is more
important than ever. Under mounting pressure, social media platforms are creating tools to be able
to process the huge volume of data created quickly and efficiently to be able to take action as
immediately as possible, especially to prevent bullying.
Finance
On the trading floor, it's easy to see how understanding and acting on information in real time is
vital, but streaming data also helps the financial functions of any company by processing
transactional information, identifying fraudulent actions, and more. For example, MasterCard is using
data and analytics to help financial organizations quickly and easily identify fraudulent
merchants to reduce risk. Similarly, by gaining the ability to process real-time data, Rabobank is
able to detect warning signals in the extremely early stages of clients going into default.
KPIs
Leaders can make decisions based on real-time KPIs such as financial, customer, or operational
performance data. Previously, this analysis was reactive and looked back at past performance.
Today, real-time data can be compared with historical information to give leaders a perspective on
the business that informs real-time decisions. As you can see, streaming data is increasingly important
to most companies in most industries. Successful companies are integrating streaming analytics to
move their data analytics from a reactive to a more proactive, real-time approach. The best ones will
be thinking about integrating their real-time data with predictive models and scenario analysis to
gain strategic foresight. However, in order to harness fast and streaming data, organizations today
need an end-to-end data management and analytics platform that can collect, process, manage, and
analyze data in real time to drive insights and enable machine learning to implement some of the
most compelling use cases. Most importantly, they need to be able to do this with the robust
security, governance, data protection, and management capabilities that enterprises require.
Data movement
Data Lakes allow you to import any amount of data that can come in real-time. Data is collected
from multiple sources, and moved into the data lake in its original format. This process allows you
to scale to data of any size, while saving time of defining data structures, schema, and
transformations.
Machine Learning
Data Lakes will allow organizations to generate different types of insights including reporting on
historical data, and doing machine learning where models are built to forecast likely outcomes, and
suggest a range of prescribed actions to achieve the optimal result.
The distributed, wireless and battery-powered nature of sensor networks forces data
management to take sensor failure, network latency and loss into account. On the other hand, there
will be a lot of redundant (or, in statistical terms, highly correlated) data to counter these negative
features. A couple of remarks sketch the situation.
• Sensors come and sensors go. They can fail because their battery runs out, and start up again
when it is replaced. They can be disconnected, moved and connected at a different place. They
can be replaced altogether by a newer model. They can have wireless connections which do not
work all the time.
• Sensors do not produce clean data. Averages have to be taken, noise filters have to be applied,
environmental influences (e.g. echoes) have to be accounted for.
• The same sensor may be used for different purposes. Different algorithms are applied on the
raw data depending on what you want to know, e.g. using a microphone for speaker
identification, speaker positioning or estimation of the environmental noise level.
• The data rate and latency may differ greatly between sensors/algorithms, and over time: In
some cases, it may be parameterizable (i.e. a sensor or algorithm can be configured to produce
output at several rates). In some cases, the term “data rate” might not even apply at all (e.g.
RFID readers which produce a reading (or a burst of readings) whenever a tag is detected).
• They might only produce data “on demand” because of the cost associated with it. This cost
may be power, but it may also be money if the sensor belongs to another party (think of
weather or traffic sensors).
• Applications come and go. They can be turned on and off at will; they are duplicated for
each new user; they are upgraded. They are disconnected at one place and connected at
another, and might be interested in what happened in the meantime.
• They might want to know what kind of sensors are around, and adapt their information
demands to this.
• They might be totally decoupled from sensors, and just want to know e.g. which person is
at a certain desk.
• They might have (static or dynamic) requirements about the rate at which data is
delivered to them. This rate may vary greatly from application to application.
• They might demand a ‘memory’ from the environment to discover details of specific
events in the past.
• They might be interested in trends or summaries rather than in specifics.
Weather data
Many satellites stream real-time weather data in order to capture critical weather signals. This
information is used to forecast the weather.
Summary
A data mart is a structure/access pattern used to retrieve client-facing data in data warehouse
setups. It is a subset of a data warehouse that is often focused on a single business
line or team.
A data mart is a subset of a data warehouse that is focused on a certain business line,
department, or topic area. Data marts make specialised data available to a designated group
of users, allowing them to rapidly obtain key insights without having to sift through a whole
data warehouse.
A dependent data mart sources its data from a single, central data warehouse that already
exists, and so provides the benefit of centralisation. If you need one or more physical data
marts, you must set them up as dependent data marts.
Without the usage of a central data warehouse, an independent data mart is formed. This type
of Data Mart is best suited for smaller groups inside a company.
A data lake is a system or repository that stores data in its original/raw form, which is often
object blobs or files.
Data that is continually generated by many sources is referred to as streaming data. Such data
should be processed incrementally, using stream processing techniques, without requiring
access to all of the data at once.
Keywords
Predictive maintenance: Predictive maintenance is the application of data-driven, proactive
maintenance approaches to examine equipment status and anticipate when repair should be
conducted.
Dependent data marts: A dependent data mart is built from an existing enterprise data warehouse. It is a
top-down approach that starts with storing all company data in one central place and then
extracting a clearly defined portion of that data for analysis as needed.
Independent data marts: An independent data mart is a stand-alone system that concentrates on a
single topic area or business activity without the usage of a data warehouse. Data is retrieved from
internal or external data sources (or both), processed, and then deposited into a data mart
repository, where it is kept until it is required for business analytics.
Hybrid data marts: Data from an existing data warehouse and other operational source systems is
combined in a hybrid data mart. It combines the speed and end-user emphasis of a top-down
strategy with the benefits of the bottom-up method's enterprise-level integration.
Maintenance: In industrial, commercial, and residential settings, maintenance includes functional
inspection, servicing, repair, or replacement of essential devices, equipment, machinery, building
structures, and supporting utilities.
Data lake: A data lake is a storage repository that stores a large amount of raw data in its original
format until it is required for analytics applications. A data lake differs from a standard data
warehouse in that it stores data in a flat architecture, mostly in files or object storage, rather than in
hierarchical dimensions and tables.
Data warehouse: A data warehouse is a huge collection of corporate data used to aid decision-
making inside a company. The data warehouse idea has been around since the 1980s, when it was
created to aid in the transfer of data from being used to power operations to being used to feed
decision support systems that disclose business insight.
Self Assessment
Q1: Which of the following is a type of data mart?
A. Dependent datamart
B. Independent datamart
C. Hybrid datamart
D. All of the above
Q5: __________ is built by drawing data from central data warehouse that already exists.
A. Dependent datamart
B. Independent datamart
C. Hybrid datamart
D. All of the above
Q6: ___________ is built by drawing from operational or external sources of data or both.
A. Dependent datamart
B. Independent datamart
C. Hybrid datamart
D. All of the above
Q7: A ________ data mart combines input from sources apart from Data warehouse
A. Dependent datamart
B. Independent datamart
C. Hybrid datamart
D. All of the above
Q8: Big data streaming is a process in which big data is quickly processed in order to extract
_________ insights from it.
A. real-time
B. streaming data
C. both a and b
D. None of above
Q9: Dynamic data that is generated continuously from a variety of sources is considered
_____________
A. real-time
B. streaming data
C. both a and b
D. None of above
Q10: ____________ uses data and analytics to help financial organizations quickly and
easily identify fraudulent merchants to reduce risk.
A. Debit Card
B. Credit Card
C. MasterCard
D. None of the above
Q12: The increasing availability of cheap, small, low-power sensor hardware has led to the
prediction that ___________ will arise in the near future.
A. Smart environments
B. Supply side
C. both a and b
D. None of above
Q13: The ________ of a smart environment consists of a myriad of sensors that produce data at
possibly very high rates.
A. streaming data
B. supply side
C. both a and b
D. None of above
A. streaming data
B. supply side
C. PocketLab
D. None of the above
Q15: ____________________ can fail because their battery runs out, and start up again when it is
replaced
6. B  7. C  8. A  9. B  10. C
11. B  12. A  13. B  14. C  15. A
Review Questions
1. Differentiate between a data mart and a data warehouse.
2. Write down the Tips for Creating Effective Big Data Models.
3. Explain different types of data mart.
4. Write down advantages and disadvantages of data mart.
5. What do you understand by data streaming? Explain Use Cases for Real-Time and
Streaming Data.
Further Readings
Maheshwari, Anil. Big Data. McGraw-Hill Education, 2019.
Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will
Transform How We Live, Work, and Think . Houghton Mifflin Harcourt.
McKinsey Global Institute Report (2011). Big Data: The Next Frontier For Innovation,
Competition, and Productivity. Mckinsey.com
Marz, Nathan, and James Warren (2015). Big Data: Principles and Best Practice of Scalable
Realtime Data Systems. Manning Publications.
Sandy Ryza, Uri Laserson, et al. (2014). Advanced Analytics with Spark. O'Reilly.
White, Tom (2014). Mastering Hadoop. O'Reilly.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
2. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
3. Hadoop API site: http://hadoop.apache.org/docs/current/api/
4. NoSQL databases: http://nosql-database.org/
5. Apache Spark: http://spark.apache.org/docs/latest/
6. Tutorials on Big Data technologies: https://www.tutorialspoint.com/
Dr. Rajni Bhalla, Lovely Professional University Unit 04: NOSQL Data Management
Objectives
• identify key differences between NOSQL and relational databases
• appreciate the architecture and types of NOSQL databases
• describe the major types of NOSQL databases and their features
• learn distributed data models
• learn the Hadoop partitioner
Introduction
• A NOSQL database is a clever way of cost-effectively organizing large amounts of
heterogeneous data for efficient access and updates. An ideal NOSQL database is
completely aligned with the nature of the problems being solved, and is superfast in
accomplishing that task.This is achieved by relaxing many of the integrity and
redundancy constraints of storing data in relational databases. Data is thus stored in
many innovative formats closely aligned with business need. The diverse NOSQL
databases will ultimately collectively evolve into a holistic set of efficient and elegant
knowledge stored at the heart of a cosmic computer.
NOSQL databases are next-generation databases that are non-relational in their design. The name
NOSQL is meant to differentiate it from antiquated, ‘pre-relational’ databases.Today, almost every
organization that must gather customer feedback and sentiments to improve their business, uses a
NOSQL database. NOSQL is useful when an enterprise needs to access, analyze, and utilize
massive amounts of either structured or unstructured data that’s stored remotely in virtual servers
across the globe.
The constraints of a relational database are relaxed in many ways. For example, relational
databases require that any data element can be randomly accessed and its value updated in that
same physical location. However, the simple physics of storage says that it is simpler and faster to
read or write sequential blocks of data on a disk. Therefore, NOSQL database files are written once
and almost never updated in place. If a new version of a part of the data becomes available, it is
appended to the respective files, and the system has the intelligence to link the appended data to
the original data.
The major categories of NOSQL databases, with representative products, are:
Key/Value Store: Memcached, Tokyo Cabinet, Dynamo, Dynomite, Riak, Project Voldemort
Document Store: CouchDB, SimpleDB, Lotus Domino, Mnesia
Columnar or Extensible Record: Google BigTable, Cassandra, HyperTable
Graph DB: Neo4j, Infinite Graph
Columnar Databases:
These are database structures that include only the relevant columns of the dataset, along with the
key-identifying information. They are useful in speeding up some oft-sought queries over very
large data sets. Suppose there is an extremely large data warehouse of web log access data, which is
rolled up by the number of web accesses per hour. This needs to be queried, or summarized, often,
involving only some of the data fields from the database. Such a query could be sped up by
organizing the database in a columnar format. Columnar databases are useful for content
management systems, blogging platforms, maintaining counters, expiring usage, and heavy write
volumes such as log aggregation. Column-family databases work well when the query patterns have
stabilized. HBase and Cassandra are two of the more popular columnar database offerings. HBase was
developed at Yahoo, and comes as part of the Hadoop ecosystem. Cassandra was originally
developed at Facebook to serve its exponentially growing user base, which is now close to 2 billion
people. It was open sourced in 2008.
Document Database
These databases store an entire document of any size, as a single value for a key element. Suppose
one is storing a 10GB video movie file as a single object. An index could store the identifying
information about the movie, and the address of the starting block. The system could handle the
rest of the storage details. This storage format would be called a document store format. Document
databases are generally useful for content management systems, blogging platforms, web analytics,
real-time analytics, and e-commerce applications. Document databases would not be useful for systems
that need complex transactions spanning multiple operations or queries against varying aggregate
structures.
MongoDB is an open-source document database that provides high performance, high availability,
and automatic scaling. A record in MongoDB is a document, which is a data structure composed of
field and value pairs, similar to a JSON object.
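As a rough illustration of this document model, the sketch below inserts and then queries one document using the MongoDB Java driver (mongodb-driver-sync). The connection string, database name, and collection name are assumptions made for the example and are not part of the original text.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;

public class MongoDocumentSketch {
    public static void main(String[] args) {
        // Assumed local MongoDB instance on the default port.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("shop");                       // hypothetical database
            MongoCollection<Document> customers = db.getCollection("customers"); // hypothetical collection

            // A record in MongoDB is a document: a structure of field and value pairs.
            Document customer = new Document("name", "Fabio")
                    .append("city", "Bari");
            customers.insertOne(customer);

            // Query the collection by a field value.
            Document found = customers.find(eq("name", "Fabio")).first();
            System.out.println(found == null ? "not found" : found.toJson());
        }
    }
}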
Graph Database
Graph databases are very well suited for problem spaces where we have connected data, such as
social networks, spatial data, routing information, and recommendation engines. The following
graph shows an example of a social network graph. Given the people (nodes) and their
relationships (edges), you can find out who the "friends of friends" of a particular person are—for
example, the friends of Howard's friends. For example, geographic map data used in Google Maps
is stored in set of relationships or links between points. For intensive data relationship handling,
graph databases improve performance by several orders of magnitude. Tech giants like Google,
Facebook, and LinkedIn use graph databases to deliver scalable, insightful, and quick services.
Neo4j is a highly scalable and popular ACID-compliant transactional database with native
graph storage and processing. It is an open-source graph database, implemented in Java, and
accessible from software written in other languages. The first popular NOSQL database was HBase,
which is a part of the Hadoop family. The most popular NOSQL database used today is Apache
Cassandra, which was developed and owned by Facebook until it was released as open source in
2008. Other NOSQL database systems are SimpleDB, Google's BigTable, MemcacheDB, Oracle
NOSQL, Voldemort, etc.
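To make the "friends of friends" traversal concrete, here is a minimal sketch using the Neo4j Java driver (4.x API) over the Bolt protocol. The connection URI, credentials, node label, relationship type, and property names are assumptions for the example, and the Cypher statement is just one way to express the traversal.

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;

import static org.neo4j.driver.Values.parameters;

public class FriendsOfFriendsSketch {
    public static void main(String[] args) {
        // Assumed local Neo4j instance and credentials.
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {

            // Find the friends of Howard's friends, excluding Howard himself.
            Result result = session.run(
                    "MATCH (p:Person {name: $name})-[:FRIEND]->(:Person)-[:FRIEND]->(fof:Person) "
                  + "WHERE fof <> p RETURN DISTINCT fof.name AS name",
                    parameters("name", "Howard"));

            while (result.hasNext()) {
                System.out.println(result.next().get("name").asString());
            }
        }
    }
}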
Aggregates make it easier for the database to manage data storage over clusters, since the unit of
data now could reside on any machine and when retrieved from the database gets all the related
data along with it. Aggregate-oriented databases work best when most data interaction is done
with the same aggregate. For example, when there is a need to get an order and all its details, it is
better to store the order as an aggregate object; however, dealing with these aggregates to get item
details across all orders is not elegant.
We can use this scenario to model the data using a relational data store as well as NOSQL data
stores and talk about their pros and cons. For the relational model, we start with a data model shown
in this figure. As we are good relational soldiers, everything is properly normalized, so that no data
is repeated in multiple tables. We also have referential integrity.
A realistic order system would naturally be more involved than this, but this is the benefit of the
rarefied air of a book. Let's see how this model looks when we think in more aggregate-oriented
terms:
//Customer
{
  "id": 1,
  "name": "Fabio",
  "billingAddress": [
    { "city": "Bari" }
  ]
}
//Orders
{
  "id": 99,
  "customerId": 1,
  "orderItems": [
    {
      "productId": 27,
      "price": 34,
      "productName": "NoSQL Distilled"
    }
  ],
  "shippingAddress": [
    { "city": "Bari" }
  ],
  "orderPayment": [
    {
      "ccinfo": "100-432423-545-134",
      "txnId": "afdfsdfsd",
      "billingAddress": [ { "city": "Chicago" } ]
    }
  ]
}
The important thing to notice here isn't the particular way we have drawn the aggregate boundary so
much as the fact that you have to think about accessing that data, and make that part of your
thinking when developing the application data model. Indeed, we could draw the aggregate
boundaries differently, putting all the orders for a customer into the customer aggregate.
Figure 14: Embed all the objects for customer and the customer’s order
Like most things in modelling, there's no universal answer for how to draw your aggregate
boundaries. It depends entirely on how you tend to manipulate your data. If you tend to access a
customer together with all of that customer's orders at once, then you would prefer a single
aggregate. However, if you tend to focus on accessing a single order at a time, then you should
prefer having separate aggregates for each order. Naturally, this is very context-specific; some
applications will prefer one or the other, even within a single system, which is exactly why many
people prefer aggregate ignorance.
Running on a Cluster
Running on a cluster gives several advantages in computation power and data distribution.
However, it requires minimizing the number of nodes that must be queried when gathering data.
By explicitly including aggregates, we give the database an important clue about which information
should be stored together.
NOSQL databases are capable of storing and processing big data which is characterized by various
properties such as volume, variety and velocity. Such databases are used in a variety of user
applications that need large volume of data which is highly available and efficiently accessible. But
they do not enforce or require strong data consistency nor do they support transactions. For
example, social media such as Twitter and Facebook [5] generate terabytes of daily data which is
beyond the processing capabilities of relational databases. Such applications need high
performance but may not need strong consistency. Different vendors design and implement
NOSQL databases differently. Indeed, there are different types of NOSQL databases such as
document databases, key-value databases, column stores and graph databases. But their common
objective is to use data replication in order to ensure high efficiency, availability and scalability of
data.
Features of NOSQL
• Non-relational
• Never follow the relational model
• Never provide tables with flat fixed-column records
• Work with self-contained aggregates or BLOBs
• Doesn't require object-relational mapping and data normalization
• No complex features like query languages, query planners, referential integrity joins,
ACID
• Schema-free
NOSQL databases are either schema-free or have relaxed schemas. Do not require any sort of
definition of the schema of the data. Offers heterogeneous structures of data in the same domain
• Simple API
Offers easy-to-use interfaces for storing and querying data. APIs allow low-level data
manipulation and selection methods. Text-based protocols are mostly used, typically HTTP REST with
JSON. There is mostly no standards-based NOSQL query language. These are web-enabled databases
running as internet-facing services.
• Distributed
• Multiple NOSQL databases can be executed in a distributed fashion. They offer auto-
scaling and fail-over capabilities. Often the ACID concept is sacrificed for
scalability and throughput.
• There is mostly no synchronous replication between distributed nodes; instead, asynchronous
multi-master replication, peer-to-peer replication, or HDFS-style replication is used, providing
only eventual consistency.
Figure 18: NoSQL is shared nothing
Document-Oriented
In the diagram, on the left you can see rows and columns, and on the right a document
database, which has a structure similar to JSON.
For the relational database, you have to know what columns you have, and so on. However,
for a document database, you store data as JSON-like objects. You do not need to define the schema
beforehand, which makes it flexible. The document type is mostly used for CMS systems, blogging
platforms, real-time analytics and e-commerce applications. It should not be used for complex
transactions which require multiple operations or queries against varying aggregate structures.
Amazon SimpleDB, CouchDB, MongoDB, Riak, and Lotus Notes are popular document-oriented
DBMS systems.
Graph-Based
A graph-type database stores entities as well as the relations amongst those entities. The entity is
stored as a node, with the relationship as edges. An edge gives a relationship between nodes. Every
node and edge has a unique identifier. Compared to a relational database, where tables are loosely
connected, a graph database is multi-relational in nature. Traversing relationships is fast, as they
are already captured in the DB, and there is no need to calculate them. Graph-based databases are
mostly used for social networks, logistics, and spatial data.
Neo4j, Infinite Graph, OrientDB and FlockDB are some popular graph-based
databases. A graph database is a database that uses graph structures for semantic queries with
nodes, edges, and properties to represent and store data. A graph database is any storage system
that provides index-free adjacency. This means that every element contains a direct pointer to its
adjacent elements and no index lookups are necessary. General graph databases that can store any
graph are distinct from specialized graph databases such as triplestores and network databases.
Depending on your distribution model, you can get a data store that will give you the ability to
handle larger quantities of data, the ability to process a greater read or write traffic, or more
availability in the face of network slowdowns or breakages. These are often important benefits, but
they come at a cost. Running over a cluster introduces complexity—so it’s not something to do
unless the benefits are compelling.
Broadly, there are two paths to data distribution: replication and sharding. Replication takes the
same data and copies it over multiple nodes, while sharding puts different data on different nodes.
Replication itself comes in two forms: master-slave and peer-to-peer. Replication and sharding are
orthogonal techniques: you can use either or both of them.
Replication comes in two forms: master-slave and peer-to-peer. We will now discuss these
techniques, starting at the simplest and working up to the more complex: first single-server, then
sharding, then master-slave replication, and finally peer-to-peer replication.
Single Server
The first and the simplest distribution option is the one we would most often recommend—no
distribution at all. Run the database on a single machine that handles all the reads and writes to the
data store. We prefer this option because it eliminates all the complexities that the other options
introduce; it’s easy for operations people to manage and easy for application developers to reason
about.
Although a lot of NOSQL databases are designed around the idea of running on a cluster, it can
make sense to use NOSQL with a single-server distribution model if the data model of the NOSQL
store is more suited to the application. Graph databases are the obvious category here—these work
best in a single-server configuration.
Graph Database
Sharding
Often, a busy data store is busy because different people are accessing different parts of the dataset.
In these circumstances we can support horizontal scalability by putting different parts of the data
onto different servers—a technique that’s called sharding.
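The routing decision behind sharding can be as simple as mapping an aggregate's key to one of N nodes. The sketch below is a naive hash-based assignment, shown purely for illustration; real systems typically use consistent hashing or the auto-sharding built into the database, and the node names and keys here are made up.

import java.util.List;

public class NaiveShardRouter {
    private final List<String> shardNodes;

    public NaiveShardRouter(List<String> shardNodes) {
        this.shardNodes = shardNodes;
    }

    // Pick the node responsible for a given aggregate key.
    public String nodeFor(String aggregateKey) {
        // floorMod keeps the index non-negative even for negative hash codes.
        int index = Math.floorMod(aggregateKey.hashCode(), shardNodes.size());
        return shardNodes.get(index);
    }

    public static void main(String[] args) {
        NaiveShardRouter router = new NaiveShardRouter(
                List.of("node-east-1", "node-east-2", "node-west-1"));
        // Accesses for the same aggregate always land on the same shard.
        System.out.println(router.nodeFor("customer:42"));
        System.out.println(router.nodeFor("customer:42"));
        System.out.println(router.nodeFor("customer:99"));
    }
}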
How to Clump the Data up so that One User Mostly gets her Data from a Single Server?
• The first part of this question is how to clump the data up so that one user mostly gets her
data from a single server. This is where aggregate orientation comes in really handy. The
whole point of aggregates is that we design them to combine data that’s commonly
accessed together—so aggregates leap out as an obvious unit of distribution. When it
comes to arranging the data on the nodes, there are several factors that can help improve
performance. If you know that most accesses of certain aggregates are based on a physical
location, you can place the data close to where it’s being accessed. If you have orders for
someone who lives in Boston, you can place that data in your eastern US data centre.
Another factor is trying to keep the load even. This means that you should try to arrange
aggregates so they are evenly distributed across the nodes which all get equal amounts of
the load. This may vary over time, for example if some data tends to be accessed on certain
days of the week—so there may be domain-specific rules you'd like to use. In some cases,
it’s useful to put aggregates together if you think they may be read in sequence. The
Bigtable paper [Chang etc.] described keeping its rows in lexicographic order and sorting
web addresses based on reversed domain names (e.g., com.martinfowler). This way data
for multiple pages could be accessed together to improve processing efficiency.
• Historically most people have done sharding as part of application logic. You might put
all customers with surnames starting from A to D on one shard and E to G on another.
This complicates the programming model, as application code needs to ensure that queries
are distributed across the various shards.
• Furthermore, rebalancing the sharding means changing the application code and
migrating the data. Many NOSQLdatabases offer auto-sharding, where the database takes
on the responsibility of allocating data to shards and ensuring that data access goes to the
right shard. This can make it much easier to use sharding in an application.
Master-Slave Replication
With master-slave distribution, you replicate data across multiple nodes. One node is designated as
the master, or primary. This master is the authoritative source for the data and is usually
responsible for processing any updates to that data. The other nodes are slaves, or secondaries. A
replication process synchronizes the slaves with the master (see Figure 4.2).Master-slave replication
is most helpful for scaling when you have a read-intensive dataset. You can scale horizontally to
handle more read requests by adding more slave nodes and ensuring that all read requests are
routed to the slaves. You are still, however, limited by the ability of the master to process updates
and its ability to pass those updates on. Consequently, it isn’t such a good scheme for datasets with
heavy write traffic, although offloading the read traffic will help a bit with handling the write
load.A second advantage of master-slave replication is read resilience: Should the master fail, the
slaves can still handle read requests. Again, this is useful if most of your data access is reads. The
failure of the master does eliminate the ability to handle writes until either the master is restored or
a new master is appointed. However, having slaves as replicates of the master does speed up
recovery after a failure of the master since a slave can be appointed a new master very quickly
Peer-to-Peer Replication
Master-slave replication helps with read scalability but doesn’t help with scalability of writes. It
provides resilience against failure of a slave, but not of a master. Essentially, the master is still a
bottleneck and a single point of failure. Peer-to-peer replication (see Figure 4.3) attacks these
problems by not having a master. All the replicas have equal weight, they can all accept writes, and
the loss of any of them doesn’t prevent access to the data store. The prospect here looks mighty fine.
With a peer-to-peer replication cluster, you can ride over node failures without losing access to
data. Furthermore, you can easily add nodes to improve your performance. There’s much to like
here—but there are complications. The biggest complication is, again, consistency. When you can
write to two different places, you run the risk that two people will attempt to update the same
record at the same time—a write-write conflict. Inconsistencies on read lead to problems but at least
they are relatively transient. Inconsistent writes are forever
Replication and sharding are strategies that can be combined. If we use both master-slave
replication and sharding (see Figure 4.4), this means that we have multiple masters, but each data
item only has a single master. Depending on your configuration, you may choose a node to be a
master for some data and slaves for others, or you may dedicate nodes for master or slave
duties.Using peer-to-peer replication and sharding is a common strategy for column-family
databases. In a scenario like this you might have tens or hundreds of nodes in a cluster with data
sharded over them. A good starting point for peer-to-peer replication is to have a replication factor
of 3, so each shard is present on three nodes. Should a node fail, then the shards on that node will
be built on the other nodes (see Figure 4.5).
• Then, the framework sends the map output to the reduce tasks, and Reduce applies the user-defined
reduce function to the map outputs. Before the reduce phase, the map output is partitioned on the
basis of the key. Hadoop partitioning ensures that all the values for each key are grouped together
and that all the values of a single key go to the same reducer, allowing an even distribution of the
map output over the reducers. The partitioner in a MapReduce job redirects the mapper output to
the reducers by determining which reducer handles a particular key.
• The partitioner makes sure that the same key always goes to the same reducer; a minimal custom
partitioner is sketched below.
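As a minimal sketch of this idea, the custom partitioner below routes keys to reducers the same way Hadoop's default HashPartitioner does. The Text/IntWritable key and value types are assumptions chosen to match a word-count-style job.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which reducer receives each (key, value) pair emitted by the mappers.
public class WordPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative, then take the remainder:
        // every occurrence of the same key maps to the same reducer.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

// In the driver, the partitioner would be registered on the Job, for example:
//   job.setPartitionerClass(WordPartitioner.class);
//   job.setNumReduceTasks(4);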
Keywords
Relational Database: A relational database is a collection of data elements that are linked together
by pre-defined connections. These elements are laid down in a tabular format with columns and
rows. Tables store data about the things that will be represented in the database. A field keeps the
actual value of an attribute, while each column in a table carries a specific type of data.
NOSQL Database: Rather than relational tables, NOSQL databases store data as documents. As a
result, we categorise them as "not simply SQL" and divide them into several flexible data models.
Pure document databases, key-value stores, wide-column databases, and graph databases are
examples of NOSQL databases. NOSQL databases are designed from the bottom up to store and
handle large volumes of data at scale, and they are increasingly used by modern enterprises.
RDBMS: RDBMS stands for Relational DataBase Management System. It is an application that lets
us build, remove, and update relational databases. A relational database is a database system that
stores and retrieves data in the form of rows and columns in a tabular format. It is a subset of DBMS,
based on the relational model proposed in the 1970s by E. F. Codd.
Key/Value Store: A key-value store, sometimes known as a key-value database, is a simple database
that employs an associative array (think of a map or dictionary) as its basic data model, with each
key corresponding to one and only one item in a collection. A key-value pair is the name for this
type of connection.
Self Assessment
1. A NOSQL database is defined as which of the following?
A. SQLServer
B. MongoDB
C. Cassandra
D. None of the mentioned
2. NOSQL databases is used mainly for handling large volumes of ________ data.
A. Unstructured
B. Structured
C. Semi-structured
D. All of the mentioned
3. NOSQL is useful when an enterprise needs to access, analyze, and utilize massive amounts
of either structured or unstructured data
A. Access
B. Analyze
C. Utilize
D. All of the above
8. In which year did Carlo Strozzi use the term NOSQL for his lightweight, open-source relational
database?
A. 1998
B. 2000
C. 2004
D. None of the above
6. A 7. C 8. A 9. D 10. A
Review Questions
1. Explain types of NOSQL.
2. Write down features of NOSQL.
3. Write a note on distributed data models.
Further Readings
Maheshwari, Anil. Big Data. McGraw-Hill Education, 2019.
Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will
Transform How We Live, Work, and Think . Houghton Mifflin Harcourt.
McKinsey Global Institute Report (2011). Big Data: The Next Frontier For Innovation,
Competition, and Productivity. Mckinsey.com
Marz, Nathan, and James Warren (2015). Big Data: Principles and Best Practices
of Scalable Real time Data Systems. Manning Publications.
Sandy Ryza, Uri Laserson, et al. (2014). Advanced Analytics with Spark. O'Reilly.
White, Tom (2014). Mastering Hadoop. O'Reilly.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
2. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
3. Hadoop API site: http://hadoop.apache.org/docs/current/api/
4. NOSQL databases: http://nosql-database.org/
5. Apache Spark: http://spark.apache.org/docs/latest/
6. Tutorials on Big Data technologies: https://www.tutorialspoint.com/
Dr. Rajni Bhalla, Lovely Professional University Unit 05: Introduction to Hadoop
Objectives
• Learn introduction about Hadoop.
• Learn benefits of Hadoop for big data
• Learn Open-Source Software Related to Hadoop
• Learn what is big data
• Learn why big data in the cloud makes perfect sense
• Learn Big opportunities, big challenges
Introduction
Hadoop is a framework that allows us to store and process large datasets in a parallel and distributed
fashion. There are two major problems in dealing with BIG DATA:
• Storage
• Processing
The storage problem is resolved by
• HDFS
All the large amounts of data that we dump get distributed over different machines. These
machines are interconnected.
The processing problem is resolved by
• MapReduce
Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel
across nodes. In short, Hadoop is used to develop applications that can perform complete
statistical analysis on huge amounts of data. Hadoop is an Apache open-source framework written
in Java that allows distributed processing of large datasets across clusters of computers using
simple programming models. The Hadoop framework application works in an environment that
provides distributed storage and computation across clusters of computers. Hadoop is designed to
scale up from a single server to thousands of machines, each offering local computation and storage.
By using a distributed file system called an HDFS (Hadoop Distributed File System), the data is
split into chunks and saved across clusters of commodity servers. As these commodity servers are
built with simple hardware configurations, these are economical and easily scalable as the data
grows.
HDFS is the pillar of Hadoop that maintains the distributed file system. It makes it possible to store
and replicate data across multiple servers.HDFS has a NameNode and DataNode. DataNodes are
the commodity servers where the data is actually stored. The NameNode, on the other hand,
contains metadata with information on the data stored in the different nodes. The application only
interacts with the NameNode, which communicates with data nodes as required.
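To show how an application talks to HDFS (metadata through the NameNode, block data through the DataNodes), here is a minimal sketch using the Hadoop FileSystem API; the NameNode address and file path are assumptions for the example.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; normally this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/user/demo/hello.txt");

            // Write a small file: the client asks the NameNode where to place the
            // blocks, then streams the bytes to the chosen DataNodes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back through the same API.
            try (FSDataInputStream in = fs.open(path)) {
                byte[] buffer = new byte[32];
                int read = in.read(buffer);
                System.out.println(new String(buffer, 0, read, StandardCharsets.UTF_8));
            }
        }
    }
}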
• Speed: Hadoop stores and retrieves data faster.
Hadoop uses the MapReduce functional programming model to perform parallel processing across
data sets. So, when a query is sent to the database, instead of handling data sequentially, tasks are
split and concurrently run across distributed servers. Finally, the output of all tasks is collated and
sent back to the application, drastically improving the processing speed.
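A minimal sketch of this model is the classic word-count job below: the map work is spread across the cluster and the per-word totals are collated in the reducers. Input and output paths are assumed to be passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts collected for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}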
Apache Spark
It’s a platform that handles all the process-intensive tasks like batch processing, interactive or
iterative real-time processing, graph conversions, and visualization. It consumes in-memory
resources, and is hence faster than MapReduce in terms of optimization. Spark is best suited for
real-time data, whereas Hadoop is best suited for structured data or batch processing; hence both
are used in most companies interchangeably.
PIG
Pig was basically developed by Yahoo. It works on the Pig Latin language, which is a query-based
language similar to SQL. It is a platform for structuring the data flow, and for processing and analyzing
huge data sets. Pig does the work of executing commands, and in the background all the activities
of MapReduce are taken care of. After the processing, Pig stores the result in HDFS. The Pig Latin
language is specially designed for this framework, and it runs on Pig Runtime, just the way Java
runs on the JVM. Pig helps to achieve ease of programming and optimization and hence is a major
segment of the Hadoop ecosystem.
HIVE
With the help of an SQL methodology and interface, HIVE performs reading and writing of large data
sets. Its query language is called HQL (Hive Query Language). It is highly scalable, as it allows both
real-time processing and batch processing. Also, all the SQL datatypes are supported by Hive, thus
making query processing easier. Similar to the query processing frameworks, HIVE comes with two
components: JDBC drivers and the HIVE command line. JDBC, along with ODBC drivers, works on
establishing the data storage permissions and connection, whereas the HIVE command line helps in
the processing of queries.
Hbase
It’s a NOSQL database which supports all kinds of data and is thus capable of handling anything in a
Hadoop database. It provides the capabilities of Google’s BigTable, and is thus able to work on big
data sets effectively. At times when we need to search or retrieve the occurrences of something small
in a huge database, the request must be processed within a short span of time. At such times,
HBase comes in handy, as it gives us a tolerant way of storing limited data.
Mahout
Mahout allows machine learnability to a system or application. Machine learning, as the name
suggests, helps the system to develop itself based on some patterns, user/environmental interaction,
or on the basis of algorithms. It provides various libraries and functionalities such as collaborative
filtering, clustering, and classification, which are nothing but concepts of machine learning. It allows
invoking algorithms as per our need with the help of its own libraries.
Solr, Lucene
These are the two services that perform the task of searching and indexing with the help of some
Java libraries. Lucene, in particular, is based on Java and also provides a spell-check mechanism.
Solr is driven by Lucene.
Zookeeper
There was a huge issue of management of coordination and synchronization among the resources
or the components of Hadoop which resulted in inconsistency, often. Zookeeper overcame all the
problems by performing synchronization, inter-component based communication, grouping, and
maintenance.
Oozie
Oozie simply performs the task of a scheduler, scheduling jobs and binding them together as a
single unit. There are two kinds of jobs, i.e., Oozie workflow and Oozie coordinator jobs. Oozie
workflow jobs are those that need to be executed in a sequentially ordered manner, whereas Oozie
coordinator jobs are those that are triggered when some data or external stimulus is given to them.
1. Lucene is an open-source Java-based search library. It is a very popular and fast search
library, used in Java-based applications to add document search capability to any kind
of application in a very simple and efficient way. Lucene is a simple yet powerful Java-
based search library. It can be used in any application to add search capability to it.
Lucene is an open-source project. It is scalable. This high-performance library is used to
index and search virtually any kind of text. The Lucene library provides the core operations
required by any search application: indexing and searching.
How Search Application works?
Build Query
When a user requests to search for a text, the application should create a query object based on that
text, which may be used to query the index database for relevant information.
Search Query
The index database is then examined using a query object to obtain the necessary information and
content documents.
Render Results
Once the result has been obtained, the programme must select how to provide the information to
the user through the user interface. How much information should be displayed?
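A minimal sketch of these three steps (index, build a query, search and render) with Lucene's core API is shown below. The field name and on-disk index directory are assumptions for the example, and minor details vary between Lucene versions.

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory dir = FSDirectory.open(Paths.get("lucene-index")); // hypothetical index location

        // 1. Index: add a document with a searchable text field.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("content", "Lucene is a Java search library", Field.Store.YES));
            writer.addDocument(doc);
        }

        // 2. Build a query object from the user's search text.
        Query query = new QueryParser("content", analyzer).parse("search library");

        // 3. Search the index and render the results.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("content"));
            }
        }
    }
}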
2. Eclipse
Eclipse is a Java IDE that is one of the three biggest and most popular IDEs in the world. It was
written mostly in Java, but it can also be used to develop applications in other programming
languages apart from Java using plug-ins. Some of the features of Eclipse are as follows:
• PDE (Plugin Development Environment) is available in Eclipse for Java programmers that
want to create specific functionalities in their applications. Eclipse flaunts powerful tools
for the various processes in application development, such as charting, modeling,
reporting and testing, so that Java developers can develop the application as fast as
possible. Eclipse can also be used to create various mathematical documents with LaTeX
using the TeXlipse plug-in, as well as packages for the Mathematica software. Eclipse can
be used on platforms like Linux, macOS, Solaris and Windows.
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an
open-source project and is horizontally scalable.HBase is a data model that is similar to Google’s
big table designed to provide quick random access to huge amounts of structured data. It leverages
the fault tolerance provided by the Hadoop File System (HDFS).It is a part of the Hadoop
ecosystem that provides random real-time read/write access to data in the Hadoop File
System.One can store the data in HDFS either directly or through HBase. Data consumer
reads/accesses the data in HDFS randomly using HBase. HBase sits on top of the Hadoop File
System and provides read and write access.
Storage Mechanism in HBase
HBase is a column-oriented database, and the tables in it are sorted by row. The table schema
defines only column families, which are the key-value pairs. A table can have multiple column
families, and each column family can have any number of columns. Subsequent column values are
stored contiguously on the disk. Each cell value of the table has a timestamp. In short, in an HBase:
• a table is a collection of rows,
• a row is a collection of column families,
• a column family is a collection of columns, and
• a column is a collection of key-value pairs.
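To illustrate this column-family layout, here is a minimal sketch using the HBase Java client API. The table name, column family, and column qualifiers are assumptions, and the table is presumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write: row key "row1", column family "info", columns "name" and "city".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Raju"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Delhi"));
            table.put(put);

            // Read the cell back; each cell version carries a timestamp internally.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}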
Architecture of Hive
User Interface
Hive is a data warehouse infrastructure programme that allows users to interface with HDFS. Hive
provides three user interfaces: the Hive Web UI, the Hive command line, and Hive HDInsight (on
Windows Server).
MetaStore
Hive selects appropriate database servers to hold the schema or metadata of tables, databases, table
columns, data types, and the HDFS mapping.
HiveQL Process Engine
HiveQL is an SQL-like query language over the metastore structure. It is one of the replacements for
the traditional approach of writing MapReduce programs: instead of building a Java MapReduce
application, we can write a HiveQL query and have it processed as a MapReduce job.
Execution Engine
The Hive execution engine connects the HiveQL process engine to MapReduce. The query is
processed by the execution engine, which generates the same results as MapReduce. It makes use of
the flavor of MapReduce.
HDFS or HBASE
The data storage strategies used to store data in a file system are Hadoop distributed file system or
HBASE.
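Because HiveQL can be submitted through Hive's JDBC driver, a minimal sketch looks like the following; the HiveServer2 address, database, table, and empty credentials are assumptions made for the example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver class and URL scheme.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Assumed local HiveServer2 on the default port with no authentication.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL looks like SQL; behind the scenes the execution engine turns it
            // into MapReduce (or Tez/Spark) jobs over data stored in HDFS.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT category, COUNT(*) FROM products GROUP BY category")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " : " + rs.getLong(2));
                }
            }
        }
    }
}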
JSON has found wide use in Web and mobile applications, including large-scale big
data and enterprise data warehouse applications. JAQL can run in local mode on
individual systems and in cluster mode, in the latter case supporting Hadoop applications.
It automatically generates MapReduce jobs and parallel queries on Hadoop systems. JAQL
was created by workers at IBM Research Labs in 2008 and released to open source. While
it continues to be hosted as a project on Google Code, where a downloadable version is
available under an Apache 2.0 license, the major development activity around JAQL has
remained centered at IBM. The company offers the query language as part of the tools
suite associated with InfoSphere BigInsights, its Hadoop platform. Working together with
a workflow orchestrator, JAQL is used in BigInsights to exchange data between storage,
processing and analytics jobs. It also provides links to external data and services, including
relational databases and machine learning data.
This language provides various operators using which programmers can develop their
own functions for reading, writing, and processing data. To analyze data using Apache
Pig, programmers need to write scripts using the Pig Latin language. All these scripts are
internally converted to Map and Reduce tasks. Apache Pig has a component known as Pig
Engine that accepts the Pig Latin scripts as input and converts those scripts into
MapReduce jobs.
Pig Latin
Using Pig Latin, programmers can perform MapReduce tasks easily without having to
type complex codes in Java.
Multi-query Approach
Apache Pig uses a multi-query approach, thereby reducing the length of code. For
example, an operation that would require you to type 200 lines of code (LoC) in Java can
be easily done by typing as few as 10 LoC in Apache Pig. Ultimately, Apache Pig
reduces the development time by almost 16 times.
SQL-like Language
Pig Latin is SQL-like language and it is easy to learn Apache Pig when you are familiar
with SQL.
Built-in Operators
Apache Pig provides many built-in operators to support data operations like joins, filters,
ordering, etc. In addition, it also provides nested data types like tuples, bags, and maps
that are missing from MapReduce.
Features of Pig
Rich set of operators
− It provides many operators to perform operations like join, sort, filter, etc.
Ease of Programming
Pig Latin is similar to SQL and it is easy to write a Pig script if you are good at
SQL.
Optimization Opportunities
The tasks in Apache Pig optimize their execution automatically, so the
programmers need to focus only on semantics of the language.
Extensibility
Using the existing operators, users can develop their own functions to read,
process, and write data.
User-defined Functions
Pig provides the facility to create User-defined Functions in other programming
languages such as Java and invoke or embed them in Pig Scripts.
Handles all Kinds of Data
Apache Pig analyzes all kinds of data, both structured as well as unstructured. It
stores the results in HDFS.
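As a hedged sketch of how Pig Latin statements can be executed from Java, the example below embeds a few statements through the PigServer API in local mode; the input file, field names, and salary threshold are assumptions made for the example.

import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigLatinSketch {
    public static void main(String[] args) throws Exception {
        // Local mode runs against the local file system instead of a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Pig Latin statements; the Pig Engine turns them into Map and Reduce tasks.
        pig.registerQuery("employees = LOAD 'employees.txt' USING PigStorage(',') "
                + "AS (name:chararray, dept:chararray, salary:int);");
        pig.registerQuery("high_paid = FILTER employees BY salary > 50000;");
        pig.registerQuery("by_dept = GROUP high_paid BY dept;");
        pig.registerQuery("counts = FOREACH by_dept GENERATE group, COUNT(high_paid);");

        // Iterate over the tuples of the final relation.
        Iterator<Tuple> it = pig.openIterator("counts");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}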
7. ZooKeeper
Zookeeper is the easiest way for effective configuration management. It has two main
benefits. First, it can be accessed from anywhere as it is stored centrally. This also reduces
the issue with data integrity. Second, dynamic configuration management can be done as
configuration data is stored centrally. This allows adjusting the system settings without
restarting the system. Thus creating “znode” and storing configuration data is a handy
way for configuration management.
This is a simplified version of how we are going to set up Zookeeper. Zookeeper stores data in a tree
of ZNodes, similar to the Linux file system structure; a ZNode may contain other ZNodes or may
have a value. App1 and App2 share data from the / and config znodes. However, db.host,
db.username and db.password are specific to App1. Zookeeper is one of the best centralized services
for maintaining configuration, and it is widely used by many other solutions, such as Apache
Hadoop, Kafka and SolrCloud.
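A minimal sketch of the znode-based configuration idea, using the plain ZooKeeper Java client, is shown below; the ensemble address, znode paths, and configuration value are assumptions for the example.

import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigSketch {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Assumed local ZooKeeper ensemble on the default port.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Store a configuration value in a znode (creating the parent first if needed).
        if (zk.exists("/config", false) == null) {
            zk.create("/config", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        String path = "/config/db.host";
        if (zk.exists(path, false) == null) {
            zk.create(path, "db1.example.com".getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any application can read the shared configuration centrally.
        byte[] data = zk.getData(path, false, null);
        System.out.println("db.host = " + new String(data, StandardCharsets.UTF_8));
        zk.close();
    }
}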
Avro was created by Doug Cutting.
Features of AVRO
• language-neutral
• processed by many languages
• compressible and splittable.
• rich data structures
• Avro schemas defined in JSON
• self-describing file named Avro Data File
• Remote Procedure Calls (RPCs).
Avro is a language-neutral data serialization system. It can be processed by many
languages (currently C, C++, C#, Java, Python, and Ruby). Avro creates a binary structured
format that is both compressible and splittable. Hence it can be efficiently used as the
input to Hadoop MapReduce jobs. Avro provides rich data structures. For example, you
can create a record that contains an array, an enumerated type, and a sub-record. These
datatypes can be created in any language, can be processed in Hadoop, and the results can
be fed to a third language. Avro schemas, defined in JSON, facilitate implementation in the
languages that already have JSON libraries. Avro creates a self-describing file named the Avro
Data File, in which it stores data along with its schema in the metadata section. Avro is also
used in Remote Procedure Calls (RPCs). During RPC, the client and server exchange schemas
in the connection handshake.
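The sketch below shows these ideas with Avro's generic API: a schema defined in JSON, a record written to a self-describing Avro data file, and the file read back using the schema stored in its metadata. The schema fields and file name are assumptions for the example.

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;

public class AvroSketch {
    public static void main(String[] args) throws Exception {
        // Avro schemas are defined in JSON.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
              + "{\"name\":\"name\",\"type\":\"string\"},"
              + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Asha");
        user.put("age", 30);

        // The Avro data file stores the records together with their schema.
        File file = new File("users.avro");
        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
        try (DataFileWriter<GenericRecord> writer = new DataFileWriter<>(datumWriter)) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Reading back: the schema is taken from the file's metadata section.
        DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
        try (DataFileReader<GenericRecord> reader = new DataFileReader<>(file, datumReader)) {
            while (reader.hasNext()) {
                System.out.println(reader.next());
            }
        }
    }
}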
An example UIMA application might ingest plain text and identify entities, such as persons,
places, and organizations, or relations, such as works-for or located-at. UIMA enables
applications to be decomposed into components, for example "language identification" =>
"language specific segmentation" => "sentence boundary detection" => "entity detection
(person/place names etc.)". Each component implements interfaces defined by the
framework and provides self-describing metadata via XML descriptor files. The
framework manages these components and the data flow between them. Components are
written in Java or C++; the data that flows between components is designed for efficient
mapping between these languages.
Presto (or PrestoDB) is an open source, distributed SQL query engine, designed from the
ground up for fast analytic queries against data of any size. It supports both non-
relational sources, such as the Hadoop Distributed File System (HDFS), Amazon S3,
Cassandra, MongoDB, and HBase, and relational data sources such as MySQL,
PostgreSQL, Amazon Redshift, Microsoft SQL Server, and Teradata. Presto can query data
where it is stored, without needing to move data into a separate analytics system. Query
execution runs in parallel over a pure memory-based architecture, with most results
returning in seconds. You’ll find it used by many well-known companies
like Facebook, Airbnb, Netflix, Atlassian, and Nasdaq.Presto is an open source, distributed
SQL query engine designed for fast, interactive queries on data in HDFS, and others.
Unlike Hadoop/HDFS, it does not have its own storage system. Thus, Presto is
complementary to Hadoop, with organizations adopting both to solve a broader business
challenge. Presto can be installed with any implementation of Hadoop, and is packaged in
the Amazon EMR Hadoop distribution.
The Concept of Big Data and What it Encompasses can be Better Understood with Four
Vs:
• Volume
The amount of data accumulated by private companies, public agencies, and other
organizations on a daily basis is extremely large. This makes volume the defining
characteristic for big data.
• Velocity
It’s a given that data can and will pile up really fast. But what matters is the speed with
which you can process and examine this data so that it becomes useful information.
• Variety
The types of data that get collected can be very diverse. Structured data contained in
databases, and unstructured data such as tweets, emails, images, videos, and more, need
to be consumed and processed all the same.
• Veracity
Because of its scale and diversity, big data can contain a lot of noise. Veracity thus refers to
the certainty of the data and how your big data tools and analysis strategies can
separate the poor-quality data from the data that really matters to your business.
• Technology leaders also name a fifth V – value. But this one isn’t inherent within the huge
amounts of raw data. Instead, the true value of big data can only be realized when the
right information is captured and analyzed to gain actionable insights.To get a better idea
of how big big data is, let’s review some statistics:
• Over 1 billion Google searches are made and 294 billion emails are sent everyday
• Every minute, 65,972 Instagram photos are posted, 448,800 tweets are composed, and 500
hours-worth of YouTube videos are uploaded.
• By 2020, the number of smartphone users could reach 6.1 billion. And taking Internet of
Things (IoT) into account, there could be 26 billion connected devices by then. For sure,
big data is really big.
Why Should Big Data and its Exponential Growth Matter to your Business?
For one, an Accenture study (PDF) reveals that 79 percent of corporate executives surveyed believe
that ‘companies that do not embrace big data will lose their competitive position and may even face
extinction’. Furthermore, an overwhelming 83 percent have taken on big data projects with the aim
of outperforming others in their respective industries.Big data projects can impact almost any
aspect of an organization. But as this survey by New Vantage Partners (PDF) shows, where it
delivers most value to enterprises is in reducing costs (49.2%) and driving innovation (44.3%).
Big Data technologies fall into two classes: Operational Big Data and Analytical Big Data.
Lack of Proper Understanding of Big Data
Solution
Workshops and seminars on big data should be offered at firms for everyone. All staff that handle
data on a regular basis and are involved in Big Data projects should receive basic training. All
levels of the company must have a fundamental awareness of data ideas.
Data Growth Issues
Solution
Companies are using current approaches like compression, tiering, and deduplication to handle
these massive data collections. Compression reduces the number of bits in data, resulting in a
smaller total size. The process of deleting duplicate and unnecessary data from a data set is known
as deduplication.
Companies can store data in separate storage levels via data tiering. It guarantees that the data is
stored in the best possible location. Depending on the size and relevance of the data, data tiers
might include public cloud, private cloud, and flash storage.
Companies are also turning to Big Data technologies like Hadoop, NOSQL, and others.
This brings us to the third issue with Big Data.
Confusion While Selecting Big Data Tools
Solution
The best course of action is to seek expert assistance. You may engage seasoned specialists who are
far more knowledgeable about these tools, or you may hire Big Data consultants. Consultants will
provide recommendations for the appropriate tools based on the situation at your firm. You may
devise a plan and then choose the ideal tools for you based on their recommendations.
Lack of Big Data Professionals
Companies are facing a scarcity of Big Data experts. This is due to the fact that data processing tools
have advanced fast, but most experts have not. In order to close the gap, concrete efforts must be
made.
Solution
Companies are devoting greater resources to the recruitment of talented workers. They must also
provide training programmes for current employees in order to get the most out of them.
Another key move made by businesses is the procurement of artificial intelligence/machine
learning-powered data analytics solutions. These tools may be used by professionals who aren't
data scientists but have a rudimentary understanding of the subject. This stage allows businesses to
save a significant amount of money on recruitment.
Securing Data
One of the most difficult aspects of Big Data is securing these massive data collections. Companies
are frequently so preoccupied with comprehending, preserving, and analyzing their data sets that
data security is pushed to the back burner. Unprotected data stores, on the other hand, may become
breeding grounds for malevolent hackers.
A stolen record or a data breach may cost a company up to $3.7 million.
Solution
To secure their data, businesses are hiring more cybersecurity workers. Other measures taken to
protect data include:
• Encrypting data
• Separation of data
• Control of identity and access
• Endpoint security implementation
• Security monitoring in real time
• Use of Big Data security technologies such as IBM Guardium.
Summary
Apache Hadoop is a set of open-source software tools for solving issues involving large
volumes of data and processing utilising a network of many computers. It's a MapReduce
programming model-based software framework for distributed storage and processing of
massive data.
Big data refers to massive, difficult-to-manage data quantities – both organised and
unstructured – that inundate enterprises on a daily basis. Big data may be evaluated for
insights that help people make better judgments and feel more confident about making
key business decisions.
HDFS, or Hadoop Distributed File System, is a distributed file system that runs on
commodity hardware. It has a lot in common with other distributed file systems.
However, there are considerable distinctions between it and other distributed file systems.
HDFS is meant to run on low-cost hardware and is extremely fault-tolerant. HDFS is a file
system that allows high-throughput access to application data and is well-suited to
applications with huge data collections. To provide streaming access to file system data,
HDFS relaxes a few POSIX criteria.
In a Hadoop cluster, MapReduce is a programming paradigm that permits tremendous
scalability over hundreds or thousands of computers. MapReduce, as the processing
component, lies at the heart of Apache Hadoop.
Hadoop Ecosystem is a platform or a suite that offers a variety of services to address big
data issues. It consists of Apache projects as well as a variety of commercial tools and
solutions. HDFS, MapReduce, YARN, and Hadoop Common are the four core components
of Hadoop.
Apache Pig is a high-level framework for developing Hadoop-based apps. Pig Latin is the
name of the platform's language. Pig's Hadoop tasks may be run in MapReduce, Apache
Tez, or Apache Spark.
Eclipse is a robust Java programming environment. Because Hadoop and MapReduce
programming is done in Java, we should use an Integrated Development Environment
(IDE) with a lot of features.
Jaql is one of the languages used to abstract the intricacies of Hadoop's MapReduce
programming architecture. It's a functional language with a weakly typed syntax and lazy
evaluation.
Keywords
Hadoop: Hadoop is an open-source software framework for storing and processing data on
commodity hardware clusters. It has a lot of storage for any sort of data, a lot of processing power,
and it can perform almost unlimited concurrent processes or jobs.
BigData: Big Data is a massive collection of data that continues to increase dramatically over time.
It is a data set that is so huge and complicated that no typical data management technologies can
effectively store or process it. Big data is similar to regular data, except it is much larger.
HDFS: Hadoop File System was built on distributed file system architecture. It runs on standard
hardware. HDFS, unlike other distributed systems, is extremely fault-tolerant and built with low-
cost hardware in mind.
Name Node: The name node is a piece of commodity hardware that houses the GNU/Linux operating system as well as the name node software. It is software that can run on standard hardware.
Data Node: The data node is a commodity computer with the GNU/Linux operating system and
data node software installed. In a cluster, there will be a data node for each node (common
hardware/system).
HBase: HBase is a Hadoop-based open-source database with sorted map data. It's horizontally
scalable and column-oriented.
JAQL: Any software package that is used in connection with databases for searching, processing, or even generating JavaScript Object Notation (JSON)-based documents is known as JSON query language (JAQL).
Self Assessment
Q1: A parallel computer system is capable of _________.
A. Decentralized computing
B. Parallel computing
C. Centralized computing
D. All of these
A. Parallel computation
B. Parallel processes
C. Parallel programming
D. Parallel development
A. Two
B. Three
C. Four
D. Five
A. Twitter
B. Facebook
C. Google
D. Yahoo
A. Open-Source tool
B. Commercial tool
C. House tool
D. Vendor tool
A. Pig
B. HBase
C. Hive
D. All of above
A. Hive
B. Jaql
C. Pig
D. All of above
A. Search
B. Reporting
C. Both
D. None of above
Q10: _________ is a Java IDE that is one of the three biggest and most popular IDEs in the world.
A. Paint
B. Notebook
C. Eclipse
D. All of above
Q11: The concept of big data and what it encompasses can be better understood with four Vs. Those are:
A. Volume
B. Velocity
C. Veracity
D. All of above
Q12: _________ refers to the certainty of the data and how your big data tools and analysis
strategies can separate the poor-quality data from those that really matter to your business.
A. Volume
B. Velocity
C. Veracity
D. All of above
A. Terra
B. Mega
C. Giga
D. Peta
A. Structured Data
B. Unstructured Data
C. Semi-structured Data
D. All of the above
Q15: ________ is a component of helicopters, airplanes, jets, etc. It captures the voices of the flight crew, recordings of microphones and earphones, and the performance information of the aircraft.
6. B 7. D 8. D 9. A 10. C
11. D 12. A 13. D 14. D 15. B
Review Questions
1. Difference between data mart and data warehouse.
2. Write down the tips for creating effective Big Data models.
3. Explain different types of data mart.
4. Write down advantages and disadvantages of data mart.
5. What do you understand by data streaming? Explain Use Cases for Real-Time and
Streaming Data.
Further Readings
Maheshwari, Anil (2019). Big Data. McGraw-Hill Education.
Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt.
McKinsey Global Institute Report (2011). Big Data: The Next Frontier for Innovation, Competition, and Productivity. Mckinsey.com
Marz, Nathan, and James Warren (2015). Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning Publications.
Sandy Ryza, Uri Laserson et al. (2014). Advanced Analytics with Spark. O'Reilly.
White, Tom (2014). Mastering Hadoop. O'Reilly.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
2. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
3. Hadoop API site: http://hadoop.apache.org/docs/current/api/
4. NOSQL databases: http://nosql-database.org/
5. Apache Spark: http://spark.apache.org/docs/latest/
6. Tutorials on Big Data technologies: https://www.tutorialspoint.com/
Unit 06: Hadoop Administration
Dr. Rajni Bhalla, Lovely Professional University
Objectives
Learn Hadoop installation step by step
Learn HDFS
Learn about HDFS Architecture
Learn Goals of HDFS
Learn basic commands in HDFS
Introduction
Hadoop is primarily supported by the Linux operating system and its features. If you are using Windows, you can use Cloudera VMware (which comes with Hadoop preconfigured), Oracle VirtualBox, or VMware Workstation. In this chapter, we will learn how to install Hadoop on VMware Workstation 12. This will be accomplished by installing CentOS on a virtual machine.
Prerequisites
VirtualBox/VMware/Cloudera
You can use any of these to install the operating system.
Operating System:
Hadoop may be installed on Linux-based operating systems. Ubuntu and CentOS are two of the most popular among them. We will be using CentOS for this course.
Java
You must install the Java 8 package on your computer.
Hadoop
The Hadoop 2.7.3 package is required.
Step 1: Installing VMware Workstation
You can download the VMware Workstation by using the link below:
https://customerconnect.vmware.com/en/downloads/info/slug/desktop_end_user_computing/vmware_workstation_pro/15_0
Open the .exe file after it has been downloaded and change the path to the desired location.
Follow the installation instructions carefully.
Step 2: Installing CentOS
Select Create a New Virtual Machine from the drop-down menu as shown in Figure 2.
1. Browse to the location of the CentOS file you downloaded, as seen in the image above. It is
important to note that it must be a disc image file.
2. Click on Next
3. Choose the name of your machine.
4. Then, click Next
Figure 4: Options
8. You may see three options in the image above: I Finished Installing, Change Disc, and
Help. You don't have to touch any of them until your CentOS installation is complete.
9. Your system is currently being tested and prepared for installation as shown in Figure 5.
10. When the checking percentage hits 100%, you will be brought to the following screen:
11. You may select your preferred language here. English is the default language, and that is
what I have chosen. Then, click on continue.
Step 4:
The login screen will look like this:
Figure 8: Login
Download the Java 8 package. This file should be saved in your home directory.
Using the following command, extract the Java tar file:
tar -xvf jdk-8u101-linux-i586.tar.gz
Step 8: Make an entry in the sudoers file for the Hadoop user. The visudo command edits the sudoers file, which is used by the sudo command. To change which users and groups are allowed to run sudo, run visudo:
$ visudo
We want hadoop3 to be allowed to run any command anywhere.
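The exact entry shown in the original screenshot is not reproduced here. As a minimal sketch, assuming the user is named hadoop3 (as in this step), the line added in visudo below the root entry would look like this:
# allow the hadoop3 user to run any command as any user on any host (assumed username)
hadoop3 ALL=(ALL) ALL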
On the terminal, use the following command to extract the Hadoop file:
tar -xvf hadoop-3.2.2.tar.gz
Extracting hadoop file as shown in figure below:
120
Notes
Step 13: Editing and Configuring Hadoop. You must first set the path in the ~/.bashrc file; this can be done from the root user. You should check your Java configuration before editing ~/.bashrc:
update-alternatives --config java
You'll now be able to see all of the Java versions installed on the computer. Because I only have one
version of Java, which is the most recent, it is displayed below:
You can also have several versions.
The next step is to choose the version you wish to work on. In the screenshot above, you can see a path that has been marked. Copy and paste this path into a gedit file; this path will be used in the next steps.
Enter the number of the selection you have made. I have picked number one in this case. Now use the vi editor (the screen-oriented text editor in Linux) to open ~/.bashrc.
Note that you must first become a root user before editing ~/.bashrc.
When you get logged into your root user, enter the command:
vi ~/.bashrc
The command above should bring you to the vi editor, where you should see the following screen:
To get to this, hit the Insert key on your keyboard, and then start typing the following code to set a
Java path:
fi
#HADOOP VARIABLES START
export JAVA_HOME= (path you copied in the previous step)
export HADOOP_HOME=/home/(your username)/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
#HADOOP VARIABLES END
After writing the code, click on Esc on your keyboard and write the command: wq!
This will save and exit you from the vi editor. The path has been set now as it can be seen in the
image below:
Step 14: Using the vi editor, open hadoop-env.sh. To tell Hadoop which path to use, replace the JAVA_HOME path in this file with the Java path. You will be presented with the following window:
Step 15: There are multiple XML files that need to be modified now, and you must specify the property and path for each one. All configuration files are shown in the image below:
Editing core-site.xml
o Use the command below to open the core-site.xml file
vim core-site.xml
o Enter the following code in between the configuration tags as below
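The property values entered here appeared only as a screenshot in the original material and are not reproduced exactly. As an illustrative sketch for a single-node setup, a typical core-site.xml block looks like the following; the hdfs://localhost:9000 address is an assumption and should match your own NameNode host and port:
<configuration>
  <property>
    <!-- default filesystem URI; adjust host and port to your cluster -->
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>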
Editing hdfs-site.xml
o Use the command below to open hdfs-site.xml
vim hdfs-site.xml
o Enter the following code in between the configuration tags as below
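Again, the original screenshot is not reproduced. A minimal illustrative hdfs-site.xml for a single-node cluster might look like the following; the replication factor of 1 and the directory paths under the hadoop user's home are assumptions and should point at the namenode and datanode directories you create in Step 16:
<configuration>
  <property>
    <!-- number of copies of each block; 1 is typical for a single-node setup -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <!-- local directory where the NameNode stores its metadata (assumed path) -->
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop2/Hadoop_datawarehouse/namenode</value>
  </property>
  <property>
    <!-- local directory where the DataNode stores blocks (assumed path) -->
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop2/Hadoop_datawarehouse/datanode</value>
  </property>
</configuration>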
Editing mapred-site.xml
o Use the command below to open mapred-site.xml
vim mapred-site.xml
o Enter the following code in between the configuration tags as below:
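As an illustrative sketch (the actual values were shown only in a screenshot), the usual mapred-site.xml entry tells MapReduce to run on YARN:
<configuration>
  <property>
    <!-- run MapReduce jobs on the YARN resource manager -->
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>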
Editing yarn-site.xml
o Use the command below to open yarn-site.xml
vim yarn-site.xml
o Enter the required code in between the configuration tags (see the illustrative sketch below), then exit from this window by pressing Esc and writing the command: wq!
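The yarn-site.xml content was also shown only as a screenshot. A minimal illustrative configuration enables the MapReduce shuffle service on the NodeManager; property names follow the standard Hadoop documentation and should be adjusted as needed:
<configuration>
  <property>
    <!-- auxiliary service needed so reducers can fetch map outputs -->
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>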
Step 16: Create the directories namenode, datanode, and secondary using the command below:
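The command itself appeared only in the screenshot. A minimal sketch, assuming the directories live under a Hadoop_datawarehouse folder in the hadoop user's home directory (the folder name referenced in the next steps):
mkdir -p ~/Hadoop_datawarehouse/namenode
mkdir -p ~/Hadoop_datawarehouse/datanode
mkdir -p ~/Hadoop_datawarehouse/secondary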
Step 17: As we can see in the above image, permissions have been given only to root. So, next, we will use the chown command as shown in the image below to change ownership to hadoop2.
Step 18: To check the permissions of all the files that come under Hadoop_datawarehouse, the following command will be executed:
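The commands for Steps 17 and 18 were shown only as screenshots. A hedged sketch, assuming the hadoop2 user and the Hadoop_datawarehouse folder named in the text:
# Step 17: give ownership of the folder to the hadoop2 user
sudo chown -R hadoop2:hadoop2 ~/Hadoop_datawarehouse
# Step 18: list the files with their owners and permissions to verify the change
ls -l ~/Hadoop_datawarehouse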
All the files that come into this folder have had their permissions changed.
Step 19: Go to the Hadoop directory and run the command shown below to format the name node.
hadoop namenode -format
We will get a message that the namenode has been successfully formatted.
Step 20: Start all the services, i.e. the Hadoop daemons. To start the services, we will go to the sbin folder, where we can see all of them.
start-dfs.sh
6.1 HDFS
Hadoop File System was developed using a distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed using low-cost hardware. HDFS holds a very large amount of data and provides easy access. To store such huge data, the files are stored across multiple machines. These files are stored redundantly to rescue the system from possible data losses in case of failure. HDFS also makes applications available for parallel processing.
Features of HDFS
• It is suitable for distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS.
• The built-in servers of the namenode and datanode help users to easily check the status of
the cluster.
• Streaming access to file system data.
• HDFS provides file permissions and authentication.
Namenode
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives. The NameNode is a Single Point of Failure for the HDFS cluster; HDFS is not currently a High Availability system. When the NameNode goes down, the file system goes offline. There is an optional Secondary NameNode that can be hosted on a separate machine. It only creates checkpoints of the namespace by merging the edits file into the fsimage file and does not provide any real redundancy. Hadoop 0.21+ has a Backup NameNode that is part of a plan to have an HA name service, but it needs active contributions from the people who want it (i.e. you) to make it Highly Available.
The NameNode works as the Master in a Hadoop cluster. Below are the main functions performed by the NameNode:
1. Stores metadata of the actual data.
2. Manages the file system namespace.
3. Regulates client access requests for the actual file data.
4. Assigns work to the Slaves (DataNodes).
5. Executes file system namespace operations like opening/closing files and renaming files and directories.
As the NameNode keeps metadata in memory for fast retrieval, a huge amount of memory is required for its operation. It should be hosted on reliable hardware.
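Not part of the original text, but a quick way to see the NameNode's view of the cluster once HDFS is running is the standard dfsadmin report command (output varies with your setup):
hdfs dfsadmin -report
# prints configured/used capacity and the list of live and dead DataNodes as known to the NameNode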
Data node
The DataNode works as a Slave in a Hadoop cluster. Its main functions are to store the actual blocks of data, to serve read and write requests, to create, delete, and replicate blocks when asked by the NameNode, and to periodically send heartbeats and block reports to the NameNode.
Block
Generally, the user data is stored in the files of HDFS. The file in a file system will be divided into one or more segments and/or stored in individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a Block. The default block size is 64 MB in older Hadoop releases (128 MB in Hadoop 2.x and later), and it can be changed as per need in the HDFS configuration.
Goals of HDFS
• Fault detection and recovery − Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore, HDFS should have mechanisms for quick and automatic fault detection and recovery.
• Huge datasets − HDFS should have hundreds of nodes per cluster to manage applications having huge datasets.
hadoop fs -ls
Display the Contents of a Directory
• Syntax
$hadoop fs -ls directoryname
Create a Directory in HDFS
• Syntax
$hadoop fs -mkdir abc
To Check Whether a Given File Has Some Content or Is Empty
$hadoop fs -test -z destination
To print the result:
echo $?
0 means it is a zero-content (empty) file.
1 means it is a non-zero-content file.
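A short hedged example; the file name a2.txt is used elsewhere in this section and is assumed to exist under the destination directory:
hadoop fs -test -z destination/a2.txt
echo $?   # prints 0 if the file is empty, 1 if it has content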
Move a File from the Local File System to HDFS
Syntax
$hadoop fs -moveFromLocal a2.txt destination/
Verify
cat a2.txt
It will return an error because the file has already been moved to the HDFS directory named destination.
getmerge: Merge the contents of multiple files available under HDFS into one file on the local file system.
• Syntax
$hadoop fs -cat destination/xyz.txt (to check the contents of the file)
$hadoop fs -cat destination/xyz1.txt
$hadoop fs -getmerge -nl destination/xyz.txt destination/xyz1.txt ~/Desktop/mergeresult.txt
Check with the cat command
$cat Desktop/mergeresult.txt
Append to File: Append the Contents of Two Local Files to a File Available under HDFS
$gedit q1.txt
$gedit q2.txt
$hadoop fs -touchz q3.txt
$hadoop fs -appendToFile q1.txt q2.txt q3.txt
Checksum Command: To Verify the Integrity of the File, i.e. Whether the File Has Been Modified or Not
• Syntax
$hadoop fs -checksum destination/xyz.txt
Fsck: Check the Status of a Particular Location
To check for root
$hadoop fsck /
To Remove a File
$hadoop fs -rm destination/xyz.txt
To Change the Group of a File
• Syntax
$hadoop fs -ls destination/
$hadoop fs -chgrp amrit destination/a2.txt
Stat Command: To Check File Statistics
To check the file size in bytes
$hadoop fs -stat %b destination/xyz1.txt
To Check the Group Name
$hadoop fs -stat %g destination/xyz1.txt
To Check the Replication Factor
$hadoop fs -stat %r destination/xyz1.txt
To Check the User (Owner) of the File
%u
To Check When the File Was Last Modified
%y
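A hedged combined example (xyz1.txt under the destination directory as used above); the stat command accepts several format specifiers in one quoted string:
$hadoop fs -stat "%b %r %u %g %y" destination/xyz1.txt
# prints the file size in bytes, replication factor, owner, group, and modification time on one line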
Summary
Apache Hadoop is a Java-based open-source software framework for managing data
processing and storage in large data applications. Hadoop works by breaking down huge
data sets and analytical jobs into smaller workloads that can be handled in parallel across
nodes in a computing cluster. Hadoop can handle both organised and unstructured data,
and it can scale up from a single server to thousands of servers with ease.
Java is an object-oriented programming language with a high level of abstraction and as
few implementation dependencies as feasible.
Ssh-keygen is a utility that allows you to generate fresh SSH authentication key pairs.
This type of key pair is used to automate logins, provide single sign-on, and authenticate
hosts.
The GNOME desktop environment's official text editor is gedit. Gedit is a strong general-
purpose text editor that aims for simplicity and ease of use. It can produce and modify a
wide range of text files.
A bashrc file is a shell script file that Linux uses to load modules and aliases into your
profile when it boots up. In your /home directory, you'll find your bashrc file. With a text
editor like nano, you can make modifications to it. Adding parts to your bashrc file, such
as modules to load when you sign in or aliases for commonly used commands, can help
you save time in your workflows.
The Hadoop daemon receives information from the core-site.xml file about where
NameNode is located in the cluster. It provides Hadoop Core configuration parameters,
such as I/O settings shared by HDFS and MapReduce.
The configuration parameters for the HDFS daemons (the NameNode, Secondary NameNode, and DataNodes) are included in the hdfs-site.xml file. It is used to configure default block replication and permission checking on HDFS. The number of replications can also be selected when the file is generated.
YARN has a resource model that may be expanded. YARN records CPU and RAM for all
nodes, applications, and queues by default, but the resource definition may be expanded
to include any "countable" resource.
MapReduce is well suited to iterative computations with massive amounts of data that
require parallel processing. Rather than a method, it depicts a data flow. MapReduce may
be used to process a graph in parallel. The map, shuffle, and reduce stages of graph
algorithms all follow the same pattern.
An HDFS file system is built around the NameNode. It maintains the directory tree of all files in the file system and records where the file data is stored across the cluster. In response to successful queries, the NameNode returns a list of relevant DataNode servers where the data is stored.
Keywords
Hadoop: Hadoop is an open-source software framework for storing and processing data on
commodity hardware clusters. It has a lot of storage for any sort of data, a lot of processing power,
and it can perform almost unlimited concurrent processes or jobs.
Java is a platform as well as a programming language. Java is a high-level programming language
that is also robust, object-oriented, and secure.
Daemon: Daemon stands for process. Hadoop Daemons are a collection of Hadoop processes. Because Hadoop is a Java platform, all of these processes are Java processes.
NameNode is a component of the Master System. Namenode's main function is to manage all of the
MetaData. The list of files saved in HDFS is known as metadata (Hadoop Distributed File System).
In a Hadoop cluster, data is stored in the form of blocks, as we all know.
HDFS: Hadoop File System was built on a distributed file system architecture. It runs on standard
hardware. HDFS, unlike other distributed systems, is extremely fault-tolerant and built with low-
cost hardware in mind.
Data node: The data node is a commodity computer with the GNU/Linux operating system and
data node software installed. In a cluster, there will be a data node for each node (common
hardware/system).
Map-red: It is one of the most significant configuration files for Hadoop's runtime environment settings. It includes MapReduce's setup options. By setting the mapreduce.framework.name variable in this file, we may give MapReduce a name.
Data dependability refers to the completeness and accuracy of data, and it is a critical basis for
establishing data confidence within an organisation. One of the key goals of data integrity
programmes, which are also used to maintain data security, data quality, and regulatory
compliance, is to ensure data dependability.
Fault tolerance: Because it replicates data across several DataNodes, HDFS is fault-tolerant. A block
of data is duplicated on three DataNodes by default. Different DataNodes are used to hold the data
blocks. Data can still be obtained from other DataNodes if one node fails.
HBase: HBase is a Hadoop-based open-source database with sorted map data. It's horizontally
scalable and column-oriented.
Blocks: Large files are broken into little segments known as Blocks in Hadoop HDFS. The physical representation of data is called a block. Except for the final block, which might be the same size or less, all HDFS blocks are the same size.
Self Assessment
1. _________ is the main prerequisite for Hadoop.
A. Java
B. HTML
C. C#
D. None of above
6. Hadoop cluster operate in three supported modes. Those modes are __________
A. Local/Standalone mode
B. Psuedo Distributed mode
C. Fully Distributed mode
D. All of above
11. When a computer is designated as a datanode, the disc space available to it _________.
A. Can be used only for HDFS storage
B. Can be used for both HDFS and non-HDFS storage
C. Cannot be accessed by non-Hadoop commands
D. Cannot store text files.
15. When the Primary Name Node fails, the ___________ Name Node is utilized.
A. Data
B. Primary
C. Secondary
D. None of above
6. D 7. B 8. A 9. A 10. D
Review Questions
Further Readings
Maheshwari, Anil. Big Data. McGraw-Hill Education, 2019.
Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That
Will Transform How We Live, Work, and Think . Houghton Mifflin Harcourt.
McKinsey Global Institute Report (2011). Big Data: The Next Frontier For
Innovation, Competition, and Productivity. Mckinsey.com
Marz, Nathan, and James Warren (2015). Big Data: Principles and Best Practices
of Scalable Realtime Data Systems. Manning Publications.
Sandy Ryza, Uri Laserson et.al (2014). Advanced-Analytics-with-Spark. OReilley.
White, Tom (2014). Mastering Hadoop. OReilley.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
Unit 07: Hadoop Architecture
Dr. Rajni Bhalla, Lovely Professional University
Objectives
• Learn what is hadoop
• Understand the Hadoop Core components
• Learn How Hdfs Works.
• What is Hadoop Cluster
• Learn Architecture of Hadoop Cluster
• HDFS Architecture and Hadoop features
Introduction
Apache Hadoop is an open-source software framework that stores data in a distributed manner
and process that data in parallel. Hadoop provides the world’s most reliable storage layer – HDFS,
a batch processing engine – MapReduce and a resource management layer – YARN.
Hadoop core components: HDFS (Hadoop Distributed File System) and MapReduce.
• HDFS – Hadoop Distributed File System provides the storage for Hadoop. As the name suggests, it stores the data in a distributed manner. The file gets divided into a number of blocks which spread across the cluster of commodity hardware. This, however, is transparent to the user working on HDFS; to them, it seems like storing all the data onto a single machine. These smaller units are the blocks in HDFS. The size of each of these blocks is 128MB by default, and you can easily change it according to requirement. So, if you had a file of size 512MB, it would be divided into 4 blocks storing 128MB each. If, however, you had a file of size 524MB, then it would be divided into 5 blocks. 4 of these would store 128MB each, amounting to 512MB, and the 5th would store the remaining 12MB. That's right! This last block won't take up the complete 128MB on the disk.
Well, the amount of data we generally deal with in Hadoop is usually of the order of petabytes or higher. Therefore, if we created blocks of small size, we would end up with a colossal number of blocks. This would mean we would have to deal with equally large metadata regarding the location of the blocks, which would just create a lot of overhead, and we don't really want that. The file itself would also be too large to store on any single disk alone. Therefore, it is prudent to spread it across different machines in the cluster. It also enables a proper spread of the workload and prevents the choking of a single machine by taking advantage of parallelism.
HDFS operates in a master-slave architecture; this means that there is one master node and several slave nodes in the cluster. The master node is the Namenode.
• Namenode is the master node that runs on a separate node in the cluster. It manages the filesystem namespace, which is the filesystem tree or hierarchy of the files and directories. It stores information like owners of files, file permissions, etc. for all the files. It is also aware of the locations of all the blocks of a file and their sizes.
All this information is maintained persistently over the local disk in the form of two
files: Fsimage and Edit Log.
• Fsimage stores the information about the files and directories in the filesystem. For files, it
stores the replication level, modification and access times, access permissions, blocks the
file is made up of, and their sizes. For directories, it stores the modification time and
permissions.
• Edit Log on the other hand keeps track of all the write operations that the client performs.
This is regularly updated to the in-memory metadata to serve the read requests.
Whenever a client wants to write information to HDFS or read information from HDFS, it connects with the Namenode. The Namenode returns the location of the blocks to the client and the operation is carried out. Yes, that's right, the Namenode does not store the blocks. For that, we have separate nodes. Datanodes are the worker nodes. They are inexpensive commodity hardware that can be easily added to the cluster. Datanodes are responsible for storing, retrieving, replicating, and deleting blocks when asked by the Namenode. They periodically send heartbeats to the Namenode so that it is aware of their health. With that, a DataNode also sends a list of blocks that are stored on it so that the Namenode can maintain the mapping of blocks to Datanodes in its memory. But in addition to these two types of nodes in the cluster, there is also another node called the Secondary Namenode.
Suppose we need to restart the Namenode, which can happen in case of a failure. This would mean that we have to copy the Fsimage from disk to memory. Also, we would have to apply the latest copy of the Edit Log to the Fsimage to keep track of all the transactions. But if we restart the node after a long time, then the Edit Log could have grown in size. This would mean that it would take a lot of time to apply its transactions to the Fsimage. This is where the Secondary Namenode helps: it periodically merges the Edit Log into the Fsimage, creating checkpoints of the namespace so that the Edit Log stays small.
Figure 6: MapReduce
The input dataset is first split into chunks of data. In this example, the input has three lines of text
with three separate entities - “bus car train,” “ship ship train,” “bus ship car.” The dataset is then
split into three chunks, based on these entities, and processed parallelly.In the map phase, the data
is assigned a key and a value of 1. In this case, we have one bus, one car, one ship, and one
train.These key-value pairs are then shuffled and sorted together based on their keys. At the reduce
phase, the aggregation takes place, and the final output is obtained.
Hadoop YARN
Hadoop YARN stands for Yet Another Resource Negotiator. It is the resource management unit of Hadoop and is available as a component of Hadoop version 2. Hadoop YARN acts like an OS to Hadoop: it is a resource management layer that works on top of HDFS. It is responsible for managing cluster resources to make sure you don't overload one machine, and it performs job scheduling to make sure that jobs are scheduled in the right place.
Suppose a client machine wants to do a query or fetch some code for data analysis. This job request
goes to the resource manager (Hadoop Yarn), which is responsible for resource allocation and
management.In the node section, each of the nodes has its node managers. These node managers
manage the nodes and monitor the resource usage in the node. The containers contain a collection
of physical resources, which could be RAM, CPU, or hard drives. Whenever a job request comes in,
the app master requests the container from the node manager. Once the node manager gets the
resource, it goes back to the Resource Manager.
Hadoop Daemons
The Hadoop Daemons are the processes that run in the background. Four daemons must run for Hadoop to be functional. These daemons are the NameNode, the DataNode, the ResourceManager, and the NodeManager.
YARN is the resource management layer in Hadoop. It schedules the task in the Hadoop cluster
and assigns resources to the applications running in the cluster. It is responsible for providing the
computational resources needed for executing the applications.There are two YARN daemons
running in the Hadoop cluster for serving YARN core services. They are:
a. ResourceManager: It is the master daemon of YARN. It runs on the master node of the cluster to manage the resources across the cluster. The ResourceManager has two major components: the Scheduler and the ApplicationManager. The Scheduler allocates resources to the various applications running in the cluster. The ApplicationManager takes up the job submitted by the client, negotiates the container for executing the application-specific ApplicationMaster, and restarts the ApplicationMaster container on failure.
b. Node Manager: NodeManager is the slave daemon of YARN. It runs on all the slave nodes in the cluster. It is responsible for launching and managing the containers on nodes. Containers execute the application-specific processes with a constrained set of resources such as memory, CPU, and so on. When NodeManager starts, it announces itself to the ResourceManager and periodically sends a heartbeat to it. It offers resources to the cluster.
c. Application Master: The per-application ApplicationMaster negotiates containers from the Scheduler, tracks container status, and monitors container progress. A client submits an application to the ResourceManager. The ResourceManager contacts the NodeManager, which launches and monitors the compute containers on nodes in the cluster. The container executes the ApplicationMaster.
The MapReduce task and the ApplicationMaster run in containers which are scheduled by
the ResourceManager and managed by the NodeManagers.
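Not part of the original text, but once a cluster is running, the standard YARN command-line client can be used to observe these daemons at work (output depends on your cluster):
yarn node -list          # lists the NodeManagers registered with the ResourceManager
yarn application -list   # lists applications currently known to the ResourceManager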
Architecture of Hadoop
The master node is a machine with a good configuration of memory and CPU. There are two daemons running on the master: the Name Node and the Resource Manager.
i. Functions of Name Node
Manages the file system namespace; regulates access to files by clients; stores metadata of the actual data, for example the file path, number of blocks, block id, and the location of blocks; executes file system namespace operations like opening, closing, and renaming files and directories. The Name Node stores the metadata in memory for fast retrieval. Hence, we should configure it on a high-end machine.
DataNode
What is Failover?
HDFS features: Distributed Storage, Blocks, Replication, Scalability (Vertical and Horizontal Scaling), and high-throughput access to application data.
Distributed Storage
HDFS stores data in a distributed manner. It divides the data into small pieces and stores
it on different Data Nodes in the cluster. In this manner, the Hadoop Distributed File
System provides a way for MapReduce to process subsets of large data sets, broken into blocks, in parallel on several nodes. MapReduce is the heart of Hadoop, but HDFS is the one that provides it all these capabilities.
Blocks
HDFS splits huge files into small chunks known as blocks. A block is the smallest unit of data in a filesystem. We (client and admin) do not have any control over the block, such as the block location; the NameNode decides all such things. The HDFS default block size is 128 MB, and we can increase or decrease it as per our need. This is unlike the OS filesystem, where the block size is 4 KB. If the data size is less than the block size of HDFS, then the block size will be equal to the data size. For example, if the file size is 129 MB, then 2 blocks will be created for it. One block will be of the default size 128 MB, and the other will be 1 MB only, not 128 MB, as that would waste space (here the block size is equal to the data size). Hadoop is intelligent enough not to waste the remaining 127 MB, so it allocates a 1 MB block for the 1 MB of data. The major advantage of storing data in such a block size is that it saves disk seek time; another advantage is in processing, since a mapper processes one block at a time, so a single mapper processes a large amount of data at once.
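As a hedged illustration (the file path is assumed), you can see how HDFS actually split a stored file into blocks, and where those blocks live, with the fsck tool:
hdfs fsck /user/hadoop2/sample.txt -files -blocks -locations
# reports the file length, the number of blocks it was split into, and the DataNodes holding each block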
Replication
Hadoop HDFS creates duplicate copies of each block. This is known as replication. All blocks are
replicated and stored on different DataNodes across the cluster. It tries to put at least 1 replica in a
different rack.
High Availability
Replication of data blocks and storing them on multiple nodes across the cluster provides high
availability of data. As seen earlier in this Hadoop HDFS tutorial, the default replication factor is 3,
and we can change it to the required values according to the requirement by editing the
configuration files (hdfs-site.xml).
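A hedged sketch of both ways to adjust replication (the file path and factor below are illustrative): cluster-wide, set the dfs.replication property in hdfs-site.xml; for an individual file that is already stored, the setrep command changes its replication factor:
hadoop fs -setrep -w 2 /user/hadoop2/sample.txt
# -w waits until the new replication factor of 2 has actually been achieved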
Data Reliability
Fault Tolerance
HDFS provides a fault-tolerant storage layer for Hadoop and other components in the
ecosystem.HDFS works with commodity hardware (systems with average configurations) that has
high chances of getting crashed at any time. Thus, to make the entire system highly fault-tolerant,
HDFS replicates and stores data in different places.
Scalability
Scalability means expanding or contracting the cluster. We can scale Hadoop HDFS in 2 ways.
1. Vertical Scaling: We can add more disks on the nodes of the cluster. For doing this, we need to edit the configuration files and make corresponding entries for the newly added disks. This requires some downtime, though very little, so people generally prefer the second way of scaling, which is horizontal scaling.
2. Horizontal Scaling: Another option for scaling is adding more nodes to the cluster on the fly without any downtime. This is known as horizontal scaling.
We can add as many nodes as we want in the cluster on the fly in real-time without any
downtime. This is a unique feature provided by Hadoop.
Summary
Apache Hadoop is a Java-based open-source software framework for managing data
processing and storage in large data applications. Hadoop works by breaking down huge
data sets and analytical jobs into smaller workloads that can be handled in parallel across
nodes in a computing cluster.
Hadoop applications use the Hadoop Distributed File System (HDFS) as their primary data storage system. HDFS is a distributed file system that uses a NameNode and DataNode architecture to allow high-performance data access across highly scalable Hadoop clusters.
YARN is one of Apache Hadoop's main components, and it's in charge of assigning system
resources to the many applications operating in a Hadoop cluster, as well as scheduling jobs
to run on different cluster nodes.
MapReduce is well suited to iterative computations with massive amounts of data that
require parallel processing. Rather than a method, it depicts a data flow. MapReduce may be
used to process a graph in parallel. The map, shuffle, and reduce stages of graph algorithms
all follow the same pattern.
An HDFS file system is built around the NameNode. It maintains the directory tree of all files in the file system and records where the file data is stored across the cluster. In response to successful queries, the NameNode returns a list of relevant DataNode servers where the data is stored.
Hadoop YARN stands for Yet Another Resource Negotiator (YARN). There is a requirement
to manage resources at both a global and a node level in a Hadoop cluster.
In a Hadoop cluster, MapReduce is a programming paradigm that permits tremendous
scalability over hundreds or thousands of computers. MapReduce, as the processing
component, lies at the heart of Apache Hadoop.
Hadoop Ecosystem is a platform or a suite that offers a variety of services to address big
data issues. It consists of Apache projects as well as a variety of commercial tools and
solutions. HDFS, MapReduce, YARN, and Hadoop Common are the four core components
of Hadoop.
Keywords
Hadoop: Hadoop is an open-source software framework for storing and processing data on
commodity hardware clusters. It has a lot of storage for any sort of data, a lot of processing
power, and it can perform almost unlimited concurrent processes or jobs.
Failover:If the primary system fails or is taken down for maintenance, failover is a backup
operational mode that immediately switches to a standby database, server, or network.
Failover technology smoothly sends requests from a downed or failing system to a backup
system that replicates the operating system environment.
HDFS: Hadoop File System was built on a distributed file system architecture. It runs on
standard hardware. HDFS, unlike other distributed systems, is extremely fault-tolerant and
built with low-cost hardware in mind.
Name node: The name node is a piece of commodity hardware that houses the GNU/Linux
operating system as well as name node software. It's a piece of software that can run on
standard hardware.
Data node: The data node is a commodity computer with the GNU/Linux operating system
and data node software installed. In a cluster, there will be a data node for each node
(common hardware/system).
Data dependability refers to the completeness and accuracy of data, and it is a critical basis
for establishing data confidence within an organisation. One of the key goals of data integrity
programmes, which are also used to maintain data security, data quality, and regulatory
compliance, is to ensure data dependability.
Fault tolerance: Because it replicates data across several DataNodes, HDFS is fault-tolerant. A
block of data is duplicated on three DataNodes by default. Different DataNodes are used to
hold the data blocks. Data can still be obtained from other DataNodes if one node fails.
HBase: HBase is a Hadoop-based open-source database with sorted map data. It's
horizontally scalable and column-oriented.
Blocks:Large files were broken into little segments known as Blocks in Hadoop HDFS. The
physical representation of data is called a block. Except for the final block, which might be the
same size or less, all HDFS blocks are the same size. Hadoop divides files into 128 MB blocks
before storing them in the Hadoop file system.
Self Assessment
1. Filesystems that manage the storage across a network of machines are called
_________________
A. Distributed file systems
3. HDFS operates in a master-slave architecture, this means that there are one master node and
several slave nodes in the cluster. The master node is the __________.
A. Datanode
B. Namenode
C. Both
D. All of the above
4. In which of the following files information is maintained persistently over the local disk.
A. Fsimage and Edit log
B. Edit log and Fedit
C. Fsimage and Fedit
D. All of the above
10. Slave computers have two daemons operating, and they are
A. Nodemanager and edgenode
B. Edgenode and datanode
C. Factnode and datanode
D. Datanode and node manager
11. Hadoop manages the jobs by breaking them down into _____________.
A. Smaller chats.
B. Smaller chunks.
C. Sink chunks.
D. None of the above
12. Failover is a process in which the system transfers control to a secondary system in an event
of failure.
A. Graceful Failover
B. Failover
C. Automatic failover
D. All of the above
13. HDFS splits huge files into small chunks known as ________
A. File
B. Blocks
C. Both
D. None of the above
15. Each block in Hadoop HDFS is duplicated twice. This is referred to as ___________.
A. Job tracker
B. Replication
C. Both
D. None of the above
6. A 7. C 8. C 9. D 10. D
Review Questions
1. Explain architecture of Hadoop.
2. Explain all Hadoop HDFS features.
3. Write down HDFS components.
4. Difference between YARN and MapReduce.
5. Write a note on:
A. Data reliability
B. Replication
C. Fault tolerance
Further Readings
Maheshwari, Anil (2019). Big Data. McGraw-Hill Education.
Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt.
McKinsey Global Institute Report (2011). Big Data: The Next Frontier for Innovation, Competition, and Productivity. Mckinsey.com
Marz, Nathan, and James Warren (2015). Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning Publications.
Sandy Ryza, Uri Laserson et al. (2014). Advanced Analytics with Spark. O'Reilly.
White, Tom (2014). Mastering Hadoop. O'Reilly.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
2. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
3. Hadoop API site: http://hadoop.apache.org/docs/current/api/
4. NOSQL databases: http://nosql-database.org/
5. Apache Spark: http://spark.apache.org/docs/latest/
6. Tutorials on Big Data technologies: https://www.tutorialspoint.com/
Unit 08: Hadoop Master Slave Architecture
Dr. Rajni Bhalla, Lovely Professional University
Objectives
• Learn Hadoop-MapReduce
• Learn Hadoop – Streaming
• Learn setup of the Hadoop Multi-Node cluster on a distributed environment.
• Learn how to create a system user account
Introduction
Hadoop's MapReduce architecture is used to process massive amounts of data in parallel on large clusters of commodity hardware in a reliable manner. It allows an application to store data in a distributed format and process large datasets across groups of computers using simple programming models. MapReduce is a programming model for processing large amounts of data distributed across a number of clusters, using steps such as input splits, Map, Shuffle, and Reduce.
Input splits
Map
Shuffle
Reduce
Now we will see how each step works.
The input splits and Map steps are combined in the first stage. The source file is transmitted line by line through the Map phase. The input is separated into small fixed-size input splits before it is passed to the Map task. An input split is the portion of data that a single map consumes. Each split of data is sent to the mapper function in the Map stage, which processes the data and returns results. The input data for the map or mapper job is usually stored in the Hadoop file system (HDFS) as a file or directory.
The second stage combines the Shuffle and Reduce steps. The reduce function, also known as the Reducer's job, takes the data returned by the map function and processes it by reducing it to generate a new set of results, which are then stored in HDFS.
Within the cluster, a given node may run a Map job, a Reduce job, or both; the Hadoop framework routes the Map and Reduce requests to the relevant servers in the cluster. All of the responsibilities of issuing tasks, confirming job completion, getting data from HDFS, transferring data to the group of nodes, and so on are handled by the Hadoop framework. Hadoop does most of its computing on the nodes that store the data, which lowers network traffic.
As a result, the MapReduce framework complements the Hadoop framework.
Terminology
Payload: The Map and Reduce functions are implemented by PayLoad applications, which are at the heart of the work.
Mapper: The input key/value pairs are mapped to a collection of intermediate key/value pairs by the Mapper.
Namenode: The node that administers the Hadoop Distributed File System (HDFS).
DataNode: The node where data resides before any processing takes place.
JobTracker: Operates on the MasterNode, takes work requests from clients, schedules jobs, and assigns them to the TaskTracker.
TaskTracker: Keeps track of the task and updates the JobTracker on its progress.
SlaveNode: The node where the Map and Reduce programmes are executed.
Job: A job is a programme that runs a Mapper and Reducer over a dataset.
Task: A task is the execution of a Mapper or Reducer on a slice of data.
Task Attempt: A specific instance of an attempt to execute a task on a SlaveNode.
Advantages of MapReduce
Scalable: Hadoop is very scalable thanks to MapReduce, which allows big data sets to be stored in distributed form across numerous servers. Because the data is distributed over different servers, it may be processed in parallel.
Cost-effective solution: MapReduce is a very cost-effective option for organisations that need to store and process large amounts of data, which is a current business requirement.
Flexibility: Hadoop is incredibly adaptable when it comes to multiple data sources and even different types of data, such as structured and unstructured data, thanks to MapReduce. As a result, it gives you a lot of flexibility when it comes to accessing and processing structured and unstructured data.
Fast: Because Hadoop stores data in a distributed file system, which stores data on a cluster's local
disc, and MapReduce algorithms are often stored on the same servers, data processing is faster
because there is no need to retrieve data from other servers.
Parallel processing: Because Hadoop stores data in a distributed file system and runs a MapReduce
algorithm, it separates jobs into map and reduce tasks that may run in parallel. Furthermore, due of
the simultaneous execution, the overall run time is reduced.
8.2 Hadoop-Streaming
It is a feature of the Hadoop distribution that lets developers and programmers construct Map-Reduce programmes in a variety of programming languages such as Ruby, Perl, Python, C++, and others. Any language that can read from standard input (STDIN), such as keyboard input, and write to standard output (STDOUT) can be used. Although the Hadoop Framework is written entirely in Java, Hadoop applications do not have to be written in the Java programming language. Hadoop Streaming is a feature that has been available since Hadoop version 0.14.1.
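A hedged example of launching a streaming job with ordinary Unix utilities as the mapper and reducer (the jar location and the HDFS input/output paths are assumptions and depend on your installation):
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /user/hadoop2/input \
  -output /user/hadoop2/output \
  -mapper /bin/cat \
  -reducer /usr/bin/wc
Here /bin/cat simply echoes each input line as the map output, and /usr/bin/wc counts the lines, words, and characters in the reduce phase; any executable that reads STDIN and writes STDOUT could be substituted.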
On the terminal, use the following command to extract the Hadoop file:
tar -xvf hadoop-3.2.2.tar.gz
Extracting hadoop file as shown in figure below:
Step 13: Editing and Configuring Hadoop. You must first set the path in the ~/.bashrc file; this can be done from the root user. You should check your Java configuration before editing ~/.bashrc:
update-alternatives --config java
You'll now be able to see all of the Java versions installed on the computer. Because I only have one
version of Java, which is the most recent, it is displayed below:
To get to this, hit the Insert key on your keyboard, and then start typing the following code to set a
Java path:
fi
#HADOOP VARIABLES START
export JAVA_HOME= (path you copied in the previous step)
export HADOOP_HOME=/home/(your username)/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
#HADOOP VARIABLES END
After writing the code, click on Esc on your keyboard and write the command: wq!
This will save and exit you from the vi editor. The path has been set now as it can be seen in the
image below:
Step 14: Using the vi editor, open hadoop-env.sh. To tell Hadoop which path to use, replace the JAVA_HOME path in this file with the Java path. You will be presented with the following window:
Step 15: There are multiple XML files that need to be modified now, and you must specify the property and path for each one. All configuration files are shown in the image below:
Editing core-site.xml
o Use the command below to open the core-site.xml file
vim core-site.xml
o Enter the following code in between the configuration tags as below
Editing mapred-site.xml
o Use the command below to open mapred-site.xml
vim mapred-site.xml
o Enter the following code in between the configuration tags as below:
Editing yarn-site.xml
o Use the command below to open yarn-site.xml
vim yarn-site.xml
o Exit from this window by pressing Esc and then writing the command: wq!
Step 16: Create the directories namenode, datanode, and secondary using the command below:
Step 17: As we can see in the above image, permissions have been given only to root. So, next, we will use the chown command as shown in the image below to change ownership to hadoop2.
Step 18: To check the permissions of all the files that come under Hadoop_datawarehouse, the following command will be executed:
All the files that come into this folder have had their permissions changed.
Step 19: Go to the Hadoop directory and run the command shown below to format the name node.
hadoop namenode -format
We will get a message that the namenode has been successfully formatted.
Step 20: Start all the services, i.e. the Hadoop daemons. To start the services, we will go to the sbin folder, where we can see all of them.
start-dfs.sh
Summary
Hadoop MapReduce is a programming paradigm used by Apache Hadoop to provide
tremendous scalability across hundreds or thousands of Hadoop clusters running on
cheap hardware. On a Hadoop cluster, the MapReduce paradigm uses a distributed
algorithm to process huge unstructured data sets.
The fundamental components of Apache Hadoop, which are incorporated into CDH and
supported by a Cloudera Enterprise subscription, allow you to store and handle a limitless
amount of data of any sort on a single platform.
YARN is an open source resource management framework for Hadoop that allows you to
go beyond batch processing and expose your data to a variety of workloads such as
interactive SQL, sophisticated modelling, and real-time streaming.
Hadoop's shuffle phase passes map output from a Mapper to a Reducer in MapReduce.
In MapReduce, the sort phase is responsible for combining and sorting map outputs. The
mapper's data is aggregated by key, distributed across reducers, then sorted by key. All
values associated with the same key are obtained by each reducer.
The JobTracker is a Hadoop service that distributes MapReduce tasks to specified nodes in
the cluster, preferably those that hold the data or are in the same rack. Jobs are submitted
to the Job Tracker by client apps. The JobTracker sends the work to the TaskTracker nodes
that have been selected.
Hadoop streaming is a feature included in the Hadoop distribution. You may use this
programme to construct and run Map/Reduce tasks using any executable or script as the
mapper and/or reducer.
Ssh-keygen is a utility that allows you to generate fresh SSH authentication key pairs.
This type of key pair is used to automate logins, provide single sign-on, and authenticate
hosts.
Hadoop is a Java-based Apache open source platform that allows big datasets to be
processed across clusters of computers using simple programming techniques. The
Hadoop framework application runs in a clustered computing environment that allows for
distributed storage and computation.
Keywords
Hadoop: Hadoop is an open-source software framework for storing and processing data on
commodity hardware clusters. It has a lot of storage for any sort of data, a lot of processing power,
and it can perform almost unlimited concurrent processes or jobs.
Job Tracker:The job tracker is a master daemon that runs on the same node as the data nodes and
manages all of the jobs. This data will be stored on multiple data nodes, but it is the task tracker's
responsibility to keep track of it.
Daemon: Daemon stands for process. Hadoop Daemons are a collection of Hadoop processes. Because Hadoop is a Java platform, all of these processes are Java processes.
Resource Manager:The Resource Manager in YARN is basically a scheduler. In essence, it is
confined to dividing the system's available resources among competing applications. It optimises
optimal cluster utilisation (keeping all resources in use at all times) while taking into account
different limitations such as capacity guarantees, fairness, and service level agreements (SLAs). The
Resource Manager contains a pluggable scheduler that permits other algorithms, such as capacity,
to be utilised as needed to accommodate varied policy restrictions. The "yarn" user is used by the
daemon.
Application Master: The Application Master is a framework-specific library that is in charge of
negotiating resources with the Resource Manager and working with the Node Manager(s) to
execute and monitor Containers and their resource usage. It is in charge of negotiating suitable
resource Containers with the Resource Manager and keeping track of their progress. The Resource
Manager monitors the Application Master, which operates as a single Container.
Job history service:This is a daemon that keeps track of jobs that have been finished. It's best to run
it as a separate daemon. Because it maintains task history information, running this daemon uses a
lot of HDFS space. The "mapred" user runs this daemon.
Container: A Container is a resource allocation that occurs as a result of a Resource Request being
granted by the Resource Manager. A Container allows a programme to access a certain amount of
resources (memory, CPU, etc.) on a certain host. To make use of Container resources, the
Application Master must take the Container and offer it to the Node Manager in charge of the host
to which the Container has been assigned. To guarantee that Application Master(s) cannot fake
allocations in the cluster, the Container allocation is checked in secure mode.
Node Manager: The Node Manager is the per-node agent of YARN. It runs on all the slave nodes in the cluster and is responsible for launching and managing containers on its node, monitoring their resource usage (memory, CPU, etc.), and reporting this, along with periodic heartbeats, to the Resource Manager.
NameNode is a component of the Master System. Namenode's main function is to manage all of the
MetaData. The list of files saved in HDFS is known as metadata (Hadoop Distributed File System).
In a Hadoop cluster, data is stored in the form of blocks, as we all know.
HDFS: Hadoop File System was built on a distributed file system architecture. It runs on standard
hardware. HDFS, unlike other distributed systems, is extremely fault-tolerant and built with low-
cost hardware in mind.
Data node: The data node is a commodity computer with the GNU/Linux operating system and
data node software installed. In a cluster, there will be a data node for each node (common
hardware/system).
Map-red: It is one of the most significant configuration files for Hadoop's runtime environment settings. It includes MapReduce's setup options. By setting the mapreduce.framework.name variable in this file, we may give MapReduce a name.
Self Assessment
1. Which of the following are major pre-requisites for MapReduce programming.
A. The application must lend itself to parallel programming
B. The data for the applications can be expressed in key-value pairs
C. Both
D. None of above
3. Input key/value pairs are mapped to a collection of intermediate key/value pairs using
__________ .
A. Mapper
B. Reducer
C. Both Mapper and Reducer
D. None of the mentioned
4. The master is a ________, and each cluster has only one NameNode.
A. Data Node
B. NameNode
C. Data block
D. Replication
9. Commands to create a system user account on both master and slave systems
A. useradd hadoop
B. adduser hadoop
C. userid hadoop
D. addid hadoop
10. Hadoop Streaming uses standard ____ streams as the interface between Hadoop and user
program.
A. Unix
B. Linux
C. C++
D. None of above
6. A 7. B 8. D 9. A 10. A
Review Questions
1. Difference between job tracker and task tracker.
2. Write down steps to install hadoop.
3. Write down HDFS components.
4. What do you understand by resource manager.
5. What is function of. /bashrc file?
Further Readings
Maheshwari, Anil. Big Data. McGraw-Hill Education, 2019.
Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will
Transform How We Live, Work, and Think . Houghton Mifflin Harcourt.
McKinsey Global Institute Report (2011). Big Data: The Next Frontier For Innovation,
Competition, and Productivity. Mckinsey.com
Marz, Nathan, and James Warren (2015). Big Data: Principles and Best Practice of
Scalable Realtime Data Systems. Manning Publications.
Sandy Ryza, Uri Laserson et.al (2014). Advanced-Analytics-with-Spark. OReilley.
White, Tom (2014). Mastering Hadoop. OReilley.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
2. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
3. Hadoop API site: http://hadoop.apache.org/docs/current/api/
4. NoSQL databases: http://nosql-database.org/
5. Apache Spark: http://spark.apache.org/docs/latest/
6. Tutorials on Big Data technologies: https://www.tutorialspoint.com/
Dr. Rajni Bhalla, Lovely Professional University Unit 09: Hadoop Node Commands
Objectives
Starting HDFS
Creating User Account
Configuring Key Based Login
Configuring Hadoop on Master Server
Understand how to add a New Data Node in the Hadoop Cluster
Learn Adding User and Ssh Access.
Introduction
Starting HDFS
Format the configured HDFS file system and then open the namenode (HDFS server) and execute
the following command.
hadoop namenode -format
Namenode is the node in the Hadoop Distributed File System which keeps track of all the data
stored in the Datanode. Namenode has metadata related to the data stored on the Datanodes and
has information about the location of the data stored. So, when you run the hadoop namenode -format command, all of this information is deleted from the namenode, which means the system no longer knows where the data is stored, and all data in the Hadoop File System is effectively lost. Formatting the namenode deletes the information from the namenode directory. The NameNode directory is specified in the hdfs-site.xml file in the
dfs.namenode.name.dir property.
Formatting the file system means initializing the directory specified by the
dfs.name.dir variable. After you have logged in as the dedicated user for Hadoop (in my case it is hduser) that you must have created during installation, go to the installation folder of Hadoop (in my case it is /usr/local/hadoop) and run:
start-dfs.sh
Inside the directory Hadoop, there will be a folder 'sbin', where there will be several files like start-
all.sh, stop-all.sh, start-dfs.sh, stop-dfs.sh, hadoop-daemons.sh, yarn-daemons.sh, etc. Executing
these files can help us start and/or stop in various ways. start-all.sh & stop-all.sh: Used to start and
stop hadoop daemons all at once. Issuing it on the master machine will start/stop the daemons on
all the nodes of a cluster. These commands are now deprecated. start-dfs.sh, stop-dfs.sh and start-
yarn.sh, stop-yarn.sh: Same as above but start/stop HDFS and YARN daemons separately on all
the nodes from the master machine. It is advisable to use these commands instead of start-all.sh &
stop-all.sh. To start an individual daemon on a particular machine manually, go to that node and issue the corresponding command: hadoop-daemon.sh start namenode/datanode or yarn-daemon.sh start resourcemanager.
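As a sketch (assuming the scripts live in the sbin folder mentioned above), the individual daemons could be started on a node like this:
$HADOOP_HOME/sbin/hadoop-daemon.sh start namenode
$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode
$HADOOP_HOME/sbin/yarn-daemon.sh start resourcemanager
$HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager
The same scripts accept stop in place of start to bring a daemon down.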
Step 2. Use the put command to transfer and store the data file from the local system
to HDFS using the following command in the terminal:
$HADOOP_HOME/bin/hadoop fs -put /home/abc.txt /user/input
Step 3. Verify the file using ls command.
$HADOOP_HOME/bin/hadoop fs -ls /user/input
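To retrieve the stored data again, the file can be printed to the terminal or copied back to the local file system; for example (the local destination path is only illustrative):
$HADOOP_HOME/bin/hadoop fs -cat /user/input/abc.txt
$HADOOP_HOME/bin/hadoop fs -get /user/input/abc.txt /home/abc_copy.txt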
Multi-Node Cluster
A Multi Node Cluster in Hadoop contains two or more DataNodes in a distributed Hadoop
environment. This is practically used in organizations to store and analyze their Petabytes and
Exabytes of data. Here, we are taking two machines – master and slave. On both machines, a
DataNode will be running.
Installing Java
Syntax of java version command
$ java -version
• Map: It is the first phase of processing, where we specify all the complex logic code. It
takes a set of data and converts into another set of data. It breaks each individual element
into tuples (key-value pairs).
• Reduce: It is the second phase of processing. Here we specify light-weight processing like
aggregation/summation. It takes the output from the map as input. Then it combines
those tuples based on the key. Using a text editor, open the hosts file on every host in your
cluster. A Hosts file is a file that almost all computers and operating systems can use to
map a connection between an IP address and domain names. This file is an ASCII text
file. It contains IP addresses separated by a space and then a domain name.
Enter the following lines in the /etc/hosts file.
192.168.1.109 hadoop-master
192.168.1.145 hadoop-slave-1
192.168.56.1 hadoop-slave-2
The wget command is a command line utility for downloading files from the Internet. It supports
downloading multiple files, downloading in the background, resuming downloads, limiting the
bandwidth used for downloads and viewing headers
# wget http://apache.mesi.com.ar/hadoop/common/hadoop-1.2.1/hadoop-1.2.0.tar.gz
The tar command is used to create compressed archives which represent a file or collection of
files. A tar file, commonly known as a "tarball," a gzip, or a bzip file, will have an extension ending
with .tar or .tar.gz.
# tar -xzf hadoop-1.2.0.tar.gz
mv stands for move. mv is used to move one or more files or directories from one place to another
in a file system
# mv hadoop-1.2.0 hadoop
# chown -R hadoop /opt/hadoop
# cd /opt/hadoop/hadoop/
It has two distinct functions:
(i) It renames a file or folder.
(ii) It moves a group of files to a different directory.
No additional space is consumed on disk during renaming. This command normally works
silently, with no prompt for confirmation.
The chown command allows you to change the user and/or group ownership of a given file.
chown command is used to change the file Owner or group. Whenever you want to change
ownership you can use chown command.
# chown -R hadoop /opt/hadoop
# cd /opt/hadoop/hadoop/
-R, --recursive: It is used to perform operations on files and directories recursively.
cd command in linux known as change directory command. It is used to change current working
directory.
# cd /opt/hadoop/hadoop/
Configuring Hadoop
The Hadoop server must be configured in core-site.xml, which should be edited wherever required.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-master:9000/</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
The core-site.xml file informs Hadoop daemon where NameNode runs in the cluster. It
contains the configuration settings for Hadoop Core such as I/O settings that are common
to HDFS and MapReduce.
• The hdfs-site.xml file contains the configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode, and the DataNodes. Here, we can configure hdfs-site.xml to specify default block replication and permission checking on HDFS. The actual number of replications can also be specified when the file is created; the default is used if replication is not specified at create time. The Hadoop server must be configured in hdfs-site.xml, which should be edited wherever required.
<configuration>
<property>
<name>dfs.data.dir</name>
<value>/opt/hadoop/hadoop/dfs/name/data</value>
<final>true</final>
</property>
<property>
<name>dfs.name.dir</name>
<value>/opt/hadoop/hadoop/dfs/name</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
The Hadoop server must be configured in mapred-site.xml, which should be edited wherever required.
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hadoop-master:9001</value>
</property>
</configuration>
The mapred-site.xml file contains the configuration settings for MapReduce daemons; the job
tracker and the task-trackers.
hadoop-env.sh
JAVA_HOME, HADOOP_CONF_DIR, and HADOOP_OPTS should be edited as follows:
export JAVA_HOME=/opt/jdk1.7.0_17
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
export HADOOP_CONF_DIR=/opt/hadoop/hadoop/conf
Networking
Add new nodes to an existing Hadoop cluster with some appropriate network configuration.
Assume the following network configuration.
• Check ssh login from the master machine. Now check if you can ssh to the new node
without a password from the master.
ssh hadoop@192.168.1.101 or ssh hadoop@slave123
Set Hostname of New Node
You can set the hostname in the file /etc/sysconfig/network.
On the new slave123 machine:
NETWORKING=yes
HOSTNAME=slave123.in
To make the changes effective, either restart the machine or run the hostname command on the new
machine with the respective hostname (a restart is the better option).
On slave3 node machine −
hostname slave3.in
Update /etc/hosts on all machines of the cluster with the following lines –
192.168.1.100 slave2.in slave2
• Now try to ping the machine with hostnames to check whether it is resolving to IP or not.
• On new node machine –
ping master.in
Examine the output of the jps command on slave2.in. After some time has passed, you will notice
that the DataNode process has been automatically terminated.
remove it. DataNode holds the data that you wish to delete safely and without losing any
information.
The following commands may be used to stop/start the tasktracker on the fly at any moment.
$HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker
$HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker
Summary
Hadoop applications use the Hadoop Distributed File System (HDFS) as their primary
data storage system. HDFS is a distributed file system that uses a NameNode and
DataNode architecture to allow high-performance data access across highly scalable
Hadoop clusters.
Hadoop is an Apache Foundation open source framework for processing huge volumes of
heterogeneous data sets in a distributed way across clusters of commodity computers and
hardware using a simplified programming style. Hadoop is a dependable distributed
storage and analysis solution.
By duplicating data over numerous nodes, it provides extremely stable and distributed
storage that assures reliability even on commodity hardware. When data is submitted to
HDFS, it is automatically divided into numerous blocks (adjustable parameter) and
stored/replicated across many data nodes, unlike a traditional file system. As a result,
high availability and fault tolerance are ensured.
MapReduce is a data analysis system that can handle enormous datasets and conduct
sophisticated computations. This component is in charge of all calculations, and it does so
by breaking down a big complicated computation into numerous tasks and assigning
them to individual worker/slave nodes, as well as coordinating and consolidating the
results.
Namenode stores information about all other nodes in the Hadoop Cluster, files in the
cluster, file component blocks and their positions in the cluster, and other information that
is necessary for the Hadoop Cluster's functioning.
Job Tracker manages the sharing of information and outcomes by keeping track of the
specific tasks/jobs allocated to each of the nodes.
Tracker is in charge of completing the job or computation that has been assigned to it.
Datanode is in charge of storing the information.
Keywords
Hadoop: Hadoop is an open-source software framework for storing and processing data on
commodity hardware clusters. It has a lot of storage for any sort of data, a lot of processing power,
and it can perform almost unlimited concurrent processes or jobs.
Job Tracker:The job tracker is a master daemon that runs on the same node as the data nodes and
manages all of the jobs. This data will be stored on multiple data nodes, but it is the task tracker's
responsibility to keep track of it.
Failover:If the primary system fails or is taken down for maintenance, failover is a backup
operational mode that immediately switches to a standby database, server, or network. Failover
technology smoothly sends requests from a downed or failing system to a backup system that
replicates the operating system environment.
HDFS: Hadoop File System was built on a distributed file system architecture. It runs on standard
hardware. HDFS, unlike other distributed systems, is extremely fault-tolerant and built with low-
cost hardware in mind.
Name node: The name node is a piece of commodity hardware that houses the GNU/Linux
operating system as well as name node software. It's a piece of software that can run on standard
hardware.
Resource Manager:The Resource Manager in YARN is basically a scheduler. In essence, it is
confined to dividing the system's available resources among competing applications. It optimises
optimal cluster utilisation (keeping all resources in use at all times) while taking into account
different limitations such as capacity guarantees, fairness, and service level agreements (SLAs). The
Resource Manager contains a pluggable scheduler that permits other algorithms, such as capacity,
to be utilised as needed to accommodate varied policy restrictions. The "yarn" user is used by the
daemon.
Application Master: The Application Master is a framework-specific library that is in charge of
negotiating resources with the Resource Manager and working with the Node Manager(s) to
execute and monitor Containers and their resource usage. It is in charge of negotiating suitable
resource Containers with the Resource Manager and keeping track of their progress. The Resource
Manager monitors the Application Master, which operates as a single Container.
HBase: HBase is a Hadoop-based open-source database with sorted map data. It's horizontally
scalable and column-oriented.
Blocks:Large files were broken into little segments known as Blocks in Hadoop HDFS. The physical
representation of data is called a block. Except for the final block, which might be the same size or
less, all HDFS blocks are the same size. Hadoop divides files into 128 MB blocks before storing
them in the Hadoop file system.
Key/Value Store:A key-value store, sometimes known as a key-value database, is a simple database
that employs an associative array (think of a map or dictionary) as its basic data model, with each
key corresponding to one and only one item in a collection. A key-value pair is the name for this
type of connection.
Review Questions
1. Select the command to format the configured HDFS file system
A. hadoop namenode -format
B. hadoop -format namenode
C. hadoop name -format
D. hadoop node -format
2. Select the command to starts the hadoopdfs daemons, the namenode and datanodes.
A. start-mapred.sh
B. start-dfs.sh
C. hadoop-env.sh
D. start-daemons.sh
4. Select the command to starts the hadoop map/reduce daemons, the jobtracker and
tasktrackers.
A. start-mapred.sh
B. start-dfs.sh
C. start-env.sh
D. start-daemons.sh
7. The data nodes are used as ________ for blocks by all the namenodes
A. Common points
B. common storage
C. Both
D. None of above
11. Select the command to format the configured HDFS file system
A. hadoop namenode -format
B. hadoop -format namenode
C. hadoop name -format
12. Select the command to starts the hadoopdfs daemons, the namenode and datanodes.
A. start-mapred.sh
B. start-dfs.sh
C. hadoop-env.sh
D. start-daemons.sh
14. Select the command to starts the hadoop map/reduce daemons, the jobtracker and
tasktrackers.
A. start-mapred.sh
B. start-dfs.sh
C. start-env.sh
D. start-daemons.sh
16. Write down commands and explanation to insert and retrieve data into HDFS.
17. Explain HDFS operation to read and write the data.
18. Write down steps for learning adding user and ssh access.
19. Write down command to explain how to create user account.
20. Explain HDFS commands.
6. A 7. B 8. C 9. B 10. A
Further Readings
Maheshwari, Anil. Big Data. McGraw-Hill Education, 2019.
Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will
Transform How We Live, Work, and Think. Houghton Mifflin Harcourt.
McKinsey Global Institute Report (2011). Big Data: The Next Frontier for Innovation,
Competition, and Productivity. Mckinsey.com
Marz, Nathan, and James Warren (2015). Big Data: Principles and Best Practices
of Scalable Realtime Data Systems. Manning Publications.
Sandy Ryza, Uri Laserson et.al (2014). Advanced-Analytics-with-Spark. OReilley.
White, Tom (2014). Mastering Hadoop. OReilley.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
2. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
3. Hadoop API site: http://hadoop.apache.org/docs/current/api/
4. NoSQL databases: http://nosql-database.org/
5. Apache Spark: http://spark.apache.org/docs/latest/
6. Tutorials on Big Data technologies: https://www.tutorialspoint.com/
7. https://www.tutorialspoint.com/hadoop/hadoop_multi_node_cluster.htm
Dr. Rajni Bhalla, Lovely Professional University Unit 10: MapReduce Applications
Objectives
• Learn what is unit testing
• Explore concepts of MRUnit
• Learn Developing and testing MapReduce jobs with MRUnit
• Learn Anatomy of a MapReduce Job Run
• explore and learn the concepts of Shuffle and Sort
Introduction
Unit testing is a testing technique in which individual modules are checked by the developer to see
if there are any flaws. It is concerned with the functional soundness of standalone modules. The basic
goal is to isolate each component of the system in order to detect, analyse, and correct any flaws.
JUNIT
JUnit is a unit testing framework for the Java programming language. JUnit has been important in
the development of test-driven development, and is one of a family of unit testing frameworks
collectively known as xUnit that originated with JUnit. JUnit promotes the idea of "first testing then
coding", which emphasizes on setting up the test data for a piece of code that can be tested first and
then implemented. This approach is like "test a little, code a little, test a little, code a little." It
increases the productivity of the programmer and the stability of program code, which in turn
reduces the stress on the programmer and the time spent on debugging.
MapDriver: The MapDriver class is in charge of calling the Mapper's map() method.
ReducerDriver: This is the driver class that calls the Reducer's reduce() method.
MapReduceDriver: The combined MapReduce driver is in charge of first invoking the Mapper's
map() function, then performing an in-memory Shuffle phase. The Reducer's reduce() function
is called at the end of this phase.
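As an illustration, a minimal MRUnit test of a WordCount-style mapper using MapDriver might look like the following sketch (the TokenizerMapper class is defined here only so the example is self-contained; it is not taken from the text):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class TokenizerMapperTest {

    // A tiny WordCount-style mapper, included only to keep the sketch self-contained.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String word : value.toString().split("\\s+")) {
                context.write(new Text(word), ONE);
            }
        }
    }

    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // A fresh mapper and driver instance is created before every test.
        mapDriver = MapDriver.newMapDriver(new TokenizerMapper());
    }

    @Test
    public void mapEmitsOneCountPerWord() throws IOException {
        mapDriver.withInput(new LongWritable(0), new Text("big data"))
                 .withOutput(new Text("big"), new IntWritable(1))
                 .withOutput(new Text("data"), new IntWritable(1))
                 .runTest();
    }
}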
Each of the classes listed above provides methods for providing test inputs and anticipated
outcomes. The setup() method of the JUnit API is responsible for establishing fresh instances of the
Mapper, Reducer, and relevant MRUnit drivers for each test. To include MRUnit in a Hadoop
MapReduce project, add it as a test dependency in the project's POM file (assuming, of course, that
the project is a Maven project):
<dependency>
<groupId>org.apache.mrunit</groupId>
<artifactId>mrunit</artifactId>
<version>1.1.0</version>
<classifier>hadoop2</classifier>
<scope>test</scope>
</dependency>
reduceDriver.withInput(firstMapKey, firstMapValues)
.withInput(secondMapKey, secondMapValues)
.withOutput(firstMapKey, new IntWritable(2))
.withOutput(secondMapKey, new IntWritable(3))
.runTest();
}
and the overall MapReduce flow
This example only showed the testing of a mapper. MRUnit also provides a ReduceDriver
class that can be used in the same way as MapDriver for testing reducers.
Step7: Open the Counters project in Eclipse, and set up a new remote debug configuration. Create a
new breakpoint and debug.
A MapReduce job that is configured to execute in local mode runs entirely in one JVM instance.
Unlike the pseudo-distributed mode, this mode makes it possible to hook up a remote debugger to
debug a job.
Apache Pig also provides a local mode for development and testing. It uses the same
LocalJobRunner class as a local mode MapReduce job. It can be accessed by starting Pig with the
following command:
pig -x local
A MapReduce job may be started with only one line of code: JobClient.runJob(conf).
The whole process is illustrated in the figure below. At the highest level, there are four independent
entities:
The JobClient's runJob() function generates a new JobClient object and executes submitJob() on it as
a convenience method. After submitting the job, runJob() polls the job's progress every second
and, if it has changed since the last report, reports it to the console. When the job finishes, the job
counters are displayed if it was successful. Otherwise, the error that caused the job to fail is logged
to the console.
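For context, a driver using this classic API might be sketched as follows (the class name and the use of the built-in identity mapper and reducer are illustrative; a real job would supply its own classes and paths):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SimpleJobDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SimpleJobDriver.class);
        conf.setJobName("simple pass-through job");
        // Identity mapper/reducer simply copy their input to their output;
        // in a real job these would be application-specific classes.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        // Submits the job and polls its progress until it completes.
        JobClient.runJob(conf);
    }
}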
The submitJob() function of JobClient implements the following job submission process:
Requests a new job ID from the jobtracker (through getNewJobId() on JobTracker) (step 2).
Checks the job's output specification. The task is not submitted if the output directory is
not given or if it already exists, for example, and an error is issued to the MapReduce
application.
Calculates the job's input splits. The task is not submitted and an error is issued to the
MapReduce software if the splits cannot be computed, for example because the input
pathways do not exist.
Copies the resources needed to perform the task to the jobtracker's filesystem in a
directory named after the job ID, including the job JAR file, the configuration file, and the
computed input splits. The job JAR is duplicated with a high replication factor (set by the
mapred.submit.replication parameter, which defaults to 10) so that tasktrackers may
access many copies when running tasks for the job.
By invoking submitJob() on JobTracker, tells the jobtracker that the job is ready to be
executed.
Job Initialization
When the JobTracker gets a call to its submitJob() function, it places it in an internal queue, where it
will be picked up and initialised by the job scheduler. Initialization entails constructing an object to
represent the job being executed, which encapsulates its activities, as well as accounting
information to track the state and progress of the tasks (step 5).
The job scheduler retrieves the input splits computed by the JobClient from the shared filesystem
before creating the list of tasks to run (step 6). It then creates one map task for each split. The number
of reduce tasks to create is defined by the mapred.reduce.tasks property in the JobConf, which is set
by the setNumReduceTasks() method. Task IDs are assigned at this point.
Task Assignment
Tasktrackers use a simple loop to deliver heartbeat method calls to the jobtracker on a regular basis.
Heartbeats not only inform the jobtracker that a tasktracker is alive, but they also serve as a
messaging channel. A tasktracker will signal if it is ready to run a new task as part of the heartbeat,
and if it is, the jobtracker will assign it a task, which it will send to the tasktracker using the
heartbeat return value (step 7).
Before it can choose a task for the tasktracker, the jobtracker must first choose a job to take the task
from. There are various scheduling methods, as detailed later in this chapter (see "Job Scheduling"),
but the default one simply maintains a priority list of jobs. Having chosen a job, the jobtracker then
selects a task for it.
For map tasks and reduce tasks, tasktrackers have a set number of slots: for example, a tasktracker
may be able to execute two map tasks and two reduce tasks at the same time. (The exact number is
determined by the number of cores and memory available on the tasktracker; see "Memory".)
Because the default scheduler fills empty map task slots before reduce task slots, the jobtracker will
choose a map task if the tasktracker has at least one empty map task slot; otherwise, it will select a
reduce task.
Because there are no data locality constraints, the jobtracker simply chooses the next reduce task
from its list of yet-to-be-run reduce jobs. For a map job, however, it considers the tasktracker's
network position and selects a task with an input split that is as close to the tasktracker as feasible.
In the best case scenario, the job is data-local, meaning it runs on the same node as the split.
Alternatively, the job might be rack-local, meaning it's on the same rack as the split but not on the
same node. Some jobs are neither data-local nor rack-local, and their data is retrieved from a rack
other than the one on which they are operating. Looking at a job's counters will reveal the
proportion of each sort of work.
Task Execution
After the tasktracker has been given a task to complete, the next step is for it to complete the task.
First, it copies the job JAR from the common filesystem to the tasktracker's filesystem to localise it.
It also moves any files required by the programme from the distributed cache to the local disc (see
"Distributed Cache") (step 8). Second, it creates a task-specific local working directory and un-jars
the JAR's contents into it. Third, it creates a TaskRunner object to carry out the task.
Job Completion
The status of a job is changed to "successful" when the jobtracker receives news that the last task for
the job has been completed. The JobClient then learns that the work has completed successfully
when it polls for status, so it produces a message to inform the user and then exits the runJob()
function.
If it is configured to do so, the jobtracker will additionally send an HTTP job notification. Clients that
want to receive callbacks can set this up using the job.end.notification.url parameter. Finally, the
jobtracker cleans up its working state for the job, and tasktrackers are instructed to do the same (so
intermediate output is deleted, for example).
(Figure: in Hadoop 1, MapReduce itself is responsible for scheduling tasks, monitoring them, and re-executing the failed task.)
In Hadoop 2, YARN (Yet Another Resource Negotiator) was introduced. The basic idea behind YARN is to split the functionalities of resource management and job scheduling/monitoring into separate daemons: the ResourceManager, the ApplicationMaster, and the NodeManager. The ResourceManager is the master daemon that arbitrates resources among all the applications in the system. The NodeManager is the slave daemon responsible for containers, monitoring their resource usage, and reporting the same to the ResourceManager or Scheduler. The ApplicationMaster negotiates resources from the ResourceManager and works with the NodeManager in order to execute and monitor the tasks. As shown in the figure, the ResourceManager has two main components: the Scheduler and the ApplicationsManager.
Schedulers
The FIFO scheduler is not suitable for shared clusters. If a large application arrives before a shorter
one, the large application will use all the resources in the cluster, and the shorter application has to
wait for its turn; this leads to starvation. It also does not take into account the balance of resource
allocation between long applications and short applications.
Capacity Scheduler
• The CapacityScheduler allows multiple-tenants to securely share a large Hadoop cluster. It
is designed to run Hadoop applications in a shared, multi-tenant cluster while maximizing
the throughput and the utilization of the cluster. It supports hierarchical queues to reflect
the structure of organizations or groups that utilizes the cluster resources. A queue
hierarchy contains three types of queues that are root, parent, and leaf. The root queue
represents the cluster itself, parent queue represents organization/group or sub-
organization/sub-group, and the leaf accepts application submission. Also, when there is
a demand for the free resources that are available on the queue who has completed its
task, by the queues running below capacity, then these resources will be assigned to the
applications on queues running below capacity. This provides elasticity for the
organization in a cost-effective manner. Apart from it, the CapacityScheduler provides a
comprehensive set of limits to ensure that a single application/user/queue cannot use a
disproportionate amount of resources in the cluster.
(Figure: a queue hierarchy contains three types of queues: root, parent, and leaf.)
To ensure fairness and stability, it also provides limits on initialized and pending applications from a
single user and queue. It maximizes the utilization of resources and throughput in the Hadoop
cluster, provides elasticity for groups or organizations in a cost-effective manner, and gives capacity
guarantees and safeguards to the organizations utilizing the cluster. It is, however, the most complex
of the schedulers.
Failure of JobTracker
• The final case is jobtracker failure. It is the most serious failure in classic MapReduce, and
nothing much can be done in this case: the jobtracker is a single point of failure in MapReduce.
• It is therefore recommended to run it on better hardware so as to avoid this scenario as much
as possible. All jobs that were in progress need to be resubmitted once the jobtracker is brought
up again. In YARN, this situation is somewhat improved.
• Task Failure
Failure of the running task is similar to the classic case. Runtime exceptions and sudden
exits of the JVM are propagated back to the application master and the task attempt is
marked as failed. Likewise, hanging tasks are noticed by the application master through the
absence of a ping over the umbilical channel (the timeout is set by mapreduce.task.timeout), and
again the task attempt is marked as failed. The configuration properties for determining when a
task is considered to have failed are the same as in the classic case: a task is marked as failed after
four attempts (set by mapreduce.map.maxattempts for map tasks and
mapreduce.reduce.maxattempts for reduce tasks); an example configuration is sketched below.
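A sketch of how these limits could be adjusted in mapred-site.xml (the values shown are arbitrary examples; the defaults are 4 attempts and a 600 000 ms timeout):
<configuration>
<property>
<name>mapreduce.map.maxattempts</name>
<value>6</value>
</property>
<property>
<name>mapreduce.reduce.maxattempts</name>
<value>6</value>
</property>
<property>
<name>mapreduce.task.timeout</name>
<value>900000</value>
</property>
</configuration>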
An application master sends periodic heartbeats to the resource manager, and in the event of
application master failure, the resource manager will detect the failure and start a new instance of
the master running in a new container (managed by a node manager)
In the case of the MapReduce application master, it can recover the state of the tasks that had
already been run by the (failed) application so they don't have to be rerun. By default, recovery is
not enabled, so a failed application master will rerun all of its tasks, but you can turn recovery on
by setting yarn.app.mapreduce.am.job.recovery.enable to true.
The client polls the application master for progress reports, so if its application master fails the
client needs to locate the new instance.
During job initialization the client asks the resource manager for the application master’s address,
and then caches it, so it doesn't overload the resource manager with a request every time it
needs to poll the application master.
If the application master fails, however, the client will experience a timeout when it issues a status
update, at which point the client will go back to the resource manager to ask for the new
application master’s address.
10.6 Shuffling
In Hadoop, the process by which the intermediate output from mappers is transferred to
the reducer is called Shuffling. Reducer gets 1 or more keys and associated values on the basis of
reducers. Intermediated key-value generated by mapper is sorted automatically by key.
Mapper task is the first phase of processing that processes each input record (from RecordReader)
and generates an intermediate key-value pair. Hadoop Mapper store intermediate-output on the
local disk. In this section, we look at what a MapReduce Mapper is, how key-value pairs are
generated in Hadoop, what InputSplit and RecordReader are, and how the mapper works in
Hadoop.
Reducer in MapReduce
Reducer in Hadoop MapReduce reduces a set of intermediate values which share a key to a
smaller set of values. In MapReduce job execution flow, Reducer takes a set of an intermediate key-
value pairs produced by the mapper as its input. In Hadoop, the Reducer takes the output of
the Mapper (intermediate key-value pairs) and processes each of them to generate the output. The output
of the reducer is the final output, which is stored in HDFS. Usually, in the Hadoop Reducer, we do
aggregation or summation sort of computation.
Combiner
A Combiner, also known as a semi-reducer, is an optional class that
operates by accepting the inputs from the Map class and thereafter passing the output key-value
pairs to the Reducer class. The main function of a Combiner is to summarize the map output
records with the same key.
Partitioner in MapReduce
The Partitioner in MapReduce controls the partitioning of the key of the intermediate mapper
output. A hash function of the key (or a subset of the key) is used to derive the partition. The total
number of partitions is the same as the number of reduce tasks for the job.
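To make this concrete, a custom partitioner for the new MapReduce API might be sketched as follows (the Text/IntWritable types and the class name are illustrative assumptions, not taken from the text):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each intermediate key to one of numReduceTasks partitions using the hash
// of the key, which is also what Hadoop's default HashPartitioner does.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
It would be registered on a job with job.setPartitionerClass(WordPartitioner.class), in the same way that a combiner is registered with job.setCombinerClass(...).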
InputFormat in MapReduce
Hadoop can process many different types of data formats, from flat text files to databases. The
different formats available are explored in the next chapter.
Working of Shuffle and Sort
The shuffle and sort steps are at the core of every MapReduce job; every MapReduce job goes
through a shuffle and sort phase.
Map: the map tasks process the input keys and values; the map output is then sorted and
transferred to the reducers, a step known as the shuffle.
1) The input records are processed by the map function.
2) The map output is not written directly to disk but to a memory buffer.
3) The size of this buffer is set by the property io.sort.mb.
4) Its default size is 100 MB.
5) If the map produces a lot of output, it may fill the buffer; in that case the map is paused for a while until a spill empties the buffer.
6) After the spill completes, the map output may again reach the threshold.
7) In that case another spill is written, in round-robin fashion.
8) The spills are written to the directories specified by the property mapred.local.dir.
9) So there can be many spills before the last key-value pair has been written by the map task.
10) Each spill is partitioned and sorted by key, and run through a combiner if a combiner function is defined for the job.
11) Once the map has finished processing all the records, the spills are merged into a single output file, which is partitioned and sorted.
12) If more than three spills are merged together, the combiner function is run again over the final output.
13) Remember that the combiner function can run many times without changing the final result.
14) The combiner function reduces the size of the output, which is an advantage because less data has to be transferred to the reducer machines.
15) If the map output is going to be very large, it is recommended to compress it to reduce the amount of data.
16) This is done by setting the property mapred.compress.map.output to true; the compression scheme is specified by the property mapred.map.output.compression.codec (see the example configuration after this list).
17) After this comes the copy phase: many map tasks are running, and they may finish at different times.
18) As soon as they finish, they notify the jobtracker (or the application master), which asks the reducers to copy the results to their local disks, so the relevant partitions are copied by the reducers over the network.
• After this comes the sort phase: the reducer merges the map outputs, which are then fed into the
reducer to produce the final result. The mechanism in the sort phase is a little more involved. The
property that plays an important role here is the merge factor, io.sort.factor; its default value is 10,
and it specifies how many files can be merged in one go. Suppose a reducer receives 30 files from
different maps: these could be merged in batches of 10, creating intermediate merge files over
several rounds, with the result fed directly into the reducer in the final round. Note that the files
must be merged in key-sorted order as well. To increase disk I/O efficiency, the actual algorithm
behaves slightly differently: it may first merge a smaller batch (three files in this example) and then
merge batches of ten, so that the final round has to handle no more than the merge factor's worth of
files. Merging this way improves I/O efficiency.
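By way of illustration, the properties mentioned in the list above could be set in mapred-site.xml roughly as follows (a sketch with arbitrary example values; io.sort.mb and io.sort.factor are the older property names used here, later renamed mapreduce.task.io.sort.mb and mapreduce.task.io.sort.factor):
<configuration>
<property>
<name>io.sort.mb</name>
<value>200</value>
</property>
<property>
<name>io.sort.factor</name>
<value>20</value>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
</configuration>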
Shuffling in MapReduce
The process of transferring data from the mappers to reducers is known as shuffling i.e. the process
by which the system performs the sort and transfers the map output to the reducer as input. So,
MapReduce shuffle phase is necessary for the reducers, otherwise, they would not have any input
(or input from every mapper). As shuffling can start even before the map phase has finished so this
saves some time and completes the tasks in lesser time.
Sorting in MapReduce
The keys generated by the mapper are automatically sorted by MapReduce Framework, i.e. Before
starting of reducer, all intermediate key-value pairs in MapReduce that are generated by mapper
get sorted by key and not by value. Values passed to each reducer are not sorted; they can be in any
order.Sorting in Hadoop helps reducer to easily distinguish when a new reduce task should start.
This saves time for the reducer. Reducer starts a new reduce task when the next key in the sorted
input data is different than the previous. Each reduce task takes key-value pairs as input and
generates key-value pair as output.Note that shuffling and sorting in Hadoop MapReduce is not
performed at all if you specify zero reducers (setNumReduceTasks(0)). Then, the MapReduce job
stops at the map phase, and the map phase does not include any kind of sorting (so even the map
phase is faster).
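For instance, a map-only job that skips shuffle and sort entirely might be set up like the following sketch (the class name is illustrative, paths are taken from the command line, and the built-in identity Mapper stands in for a real application mapper):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only example");
        job.setJarByClass(MapOnlyJob.class);
        // The base Mapper class is an identity mapper; a real job would plug in its own mapper here.
        job.setMapperClass(Mapper.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        // Zero reducers: no shuffle or sort, map output is written straight to the output path.
        job.setNumReduceTasks(0);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}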
Summary
Hadoop's unit tests are all built to run on a single computer rather than a Hadoop cluster
(Apache Bigtop addresses testing on a running cluster). The unit tests work by spinning up
MiniDFS, MiniYARN, and MiniMR clusters as needed, all of which execute the code of the
respective services.
Hadoop has been utilising JUnit4 for some time, yet it appears that many new tests for JUnit
v3 are still being produced.
Apache MRUnit TM is a Java package for unit testing Apache Hadoop map reduce tasks.
The post's example uses the Weather dataset, and it works with the year and temperature
retrieved from it. Obviously, you can simply adapt the example to your own data.
Hadoop includes a RecordReader that transforms input splits into key-value pairs using
TextInputFormat. In the mapping process, the key-value pairs are utilised as inputs. The
only data format that a mapper can read and understand is this one.
Hadoop applications use the Hadoop Distributed File System (HDFS) as their primary data
storage system. HDFS is a distributed file system that uses a NameNode and DataNode
architecture to allow high-performance data access across highly scalable Hadoop clusters.
Hadoop is an Apache Foundation open source framework for processing huge volumes of
heterogeneous data sets in a distributed way across clusters of commodity computers and
hardware using a simplified programming style. Hadoop is a dependable distributed
storage and analysis solution.
If a task fails, Hadoop will identify the failure and reschedule replacement attempts on healthy
machines. It marks the task as failed only after four failed attempts (the default setting, which may
be changed), at which point the whole job is terminated.
Namenode stores information about all other nodes in the Hadoop Cluster, files in the
cluster, file component blocks and their positions in the cluster, and other information that is
necessary for the Hadoop Cluster's functioning.
Job Tracker manages the sharing of information and outcomes by keeping track of the
specific tasks/jobs allocated to each of the nodes.
The CapacityScheduler was created to allow huge clusters to be shared while ensuring that
each organisation has a minimum capacity guarantee. The key principle is that the Hadoop
Map-Reduce cluster's available resources are partitioned among various companies that
finance the cluster collectively based on computation demands.
Keywords
Apache MRUnit: Apache MRUnit TM is a Java package for unit testing Apache Hadoop map
reduce tasks. The post's example uses the Weather dataset, and it works with the year and
temperature retrieved from it. Obviously, you can simply adapt the example to your own data.
Hadoop: Hadoop is an open-source software framework for storing and processing data on
commodity hardware clusters. It has a lot of storage for any sort of data, a lot of processing power,
and it can perform almost unlimited concurrent processes or jobs.
Job Tracker:The job tracker is a master daemon that runs on the same node as the data nodes and
manages all of the jobs. This data will be stored on multiple data nodes, but it is the task tracker's
responsibility to keep track of it.
FIFO: As the name implies, FIFO stands for First In First Out, which means that the tasks or
applications that arrive first are served first. In Hadoop, this is the default scheduler. The jobs are
placed in a queue and completed in the order in which they were submitted.
Capacity Scheduler:The CapacityScheduler was created to allow huge clusters to be shared while
ensuring that each organisation has a minimum capacity guarantee. The key principle is that the
Hadoop Map-Reduce cluster's available resources are partitioned among various companies that
finance the cluster collectively based on computation demands.
Fair scheduling is a method of allocating resources to apps in such a way that each app receives an
equal proportion of resources over time. Hadoop NextGen can schedule a variety of resource kinds.
The Fair Scheduler's scheduling fairness judgments are based only on memory by default.
Task Failure: If a task fails, Hadoop will identify the failure and reschedule replacement attempts on
healthy machines. It marks the task as failed only after four failed attempts (the default setting,
which may be changed), at which point the whole job is terminated.
Child JVM: The parent MRAppMaster's environment is passed down to the child task. The
mapreduce.map.java.opts and mapreduce.reduce.java.opts configuration arguments in the Job can
be used to pass the child JVM extra options; -Djava.library.path=<path>, for example, can be used to
specify non-standard paths for the runtime linker to look for shared libraries. If the symbol @taskid@
is present in the mapreduce.map.java.opts or mapreduce.reduce.java.opts properties, it is
interpolated with the taskid value of the MapReduce task.
HDFS: Hadoop File System was built on a distributed file system architecture. It runs on standard
hardware. HDFS, unlike other distributed systems, is extremely fault-tolerant and built with low-
cost hardware in mind.
Self Assessment
1. Testing the entire system's end-to-end functioning is characterized as
A. Functional testing
B. Unit Testing
C. Stress Testing
D. Load Testing
2. What is testing?
A. Finding broken code.
B. Evaluating deliverable to find errors.
C. A stage of all projects.
D. All of the above.
6. _________ is a processing technique and a program model for distributed computing based
on java.
A. Composing
B. Decomposing
C. MapReduce
D. None of above
7. __________ a data processing application into mappers and reducers is sometimes nontrivial.
A. Composing
B. Decomposing
C. MapReduce
D. None of above
8. Which of the following method causes call returns only when the job gets finished, and it
returns with its success or failure status which can be used to determine that further steps
are to be run or not?
A. Waitforfinished()
B. waitForCompletion()
C. Both
D. None of the above
9. Which of the following specifies the environment variables that affect the JDK used by
Hadoop Daemon (bin/hadoop).
A. core-site.xml
B. hadoop-env.sh
C. hdfs-site.xml
D. mapred-site.xml
10. Which of the followings are important configuration files which is required for runtime
environment settings of a Hadoop cluster that also informs Hadoop daemons where the
NAMENODE runs in the cluster.
A. core-site.xml
B. hadoop-env.sh
C. hdfs-site.xml
D. mapred-site.xml
11. _________ is responsible for scheduling tasks, monitoring them, and re-executes the failed
task.
A. Hadoop MapReduce
B. Yarn
C. Hive
D. Pig
13. In Hadoop, the process by which the intermediate output from mappers is transferred to
the reducer is called _________.
A. Shuffling
B. Sorting
C. Both
D. None of above
14. Which of the following’s tasks are the first phase of processing that processes each input
record and generates an intermediate key-value pair?
A. Reducer task
B. Mapper Task
C. Compress Task
D. None of above
6. C 7. B 8. B 9. B 10. A
Review Questions
1. Explain all unit testing techniques.
2. Explain three core classes of MRUNIT.
3. Explain Developing and testing MapReduce jobs with MRUnit
4. Diagrammatically explain shuffle and sort concepts
5. Explain three kinds of failure in MapReduce.
Further Readings
Maheshwari, Anil. Big Data. McGraw-Hill Education, 2019.
Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That
Will Transform How We Live, Work, and Think . Houghton Mifflin Harcourt.
McKinsey Global Institute Report (2011). Big Data: The Next Frontier For
Innovation, Competition, and Productivity. Mckinsey.com
Marz, Nathan, and James Warren (2015). Big Data: Principles and Best Practices
of Scalable Realtime Data Systems. Manning Publications.
Sandy Ryza, Uri Laserson et.al (2014). Advanced-Analytics-with-Spark. OReilley.
White, Tom (2014). Mastering Hadoop. OReilley.
Web Links
Dr. Rajni Bhalla, Lovely Professional University Unit 11: Hadoop Ecosystem
Objectives
• explore the concepts of HIVE.
• understand architecture of HIVE.
• explore concepts and architecture of Apache pig
• understand Pig-Latin data types, applications and features of Pig
• learn operators in Apache pig.
• learn services offered by HIVE
• Learn fundamentals of Hbase
• Explore concepts of ZooKeeper
• understand IBM InfoSphere Streams
• learn a new paradigm for information processing, learn powerful, real-time analytic
processing made simple
• Explore Enterprise integration and concepts of scale-out architecture.
Introduction
Apache Hive is an open-source data warehousing solution built on the Hadoop platform. Hive is a
database that may be used to analyse and query huge datasets contained in Hadoop files, and it can
process both structured and semi-structured data. On the Big Data landscape, Apache Hive is one of
the most widely used data warehouse components, and its interface is mostly used to supplement
the Hadoop file system. Hive was created by Facebook and is currently maintained by the Apache
Software Foundation as Apache Hive; Netflix and Amazon are among the companies that use and
develop it. Hive is a Hadoop data warehouse architecture solution that allows you to handle
structured data. It resides on top of Hadoop to summarize Big Data and facilitate searching and
analysis. Initially created by Facebook, Hive was eventually taken up by the Apache Software
Foundation and maintained as an open-source project under the name Apache Hive. It is utilized
by a variety of businesses; for example, Amazon utilizes it in Amazon Elastic MapReduce.
Hive is not:
• a language for real-time queries
• a language for row-level updates
• User Interface:This is where the end user interacts with Hive in order for the data to be
processed. We offer various methods to interface with Hive, including the Web UI and the
Hive CLI, which is included with the Hive package
We may also use Thrift Client, JDBC Client, and ODBC Client. Hive also offers services
such as Hive CLI, Beeline, and others.
• Hive Query process engine: The query entered via the user interface is parsed by the
Hive compiler. It uses information contained in the metastore to verify for semantic and
syntactic accuracy. Finally, it generates an execution plan in the form of a DAG (Directed
Acyclic Graph), with each stage representing a mapreduce task as shown in Figure 4.
Execution Engine: Execution Engine is where the actual processing of the data will start.
After compiler checking the syntax, performs the optimizations of the execution. Finally,
this execution plan will be given to the Execution Engine. Several execution engines
can be used with Hive; MapReduce is one of them, but it is slower compared to the other
engines. We can change the execution engine to Tez or Spark. To
change the execution engine we can use one of the commands below:
set hive.execution.engine=spark;
set hive.execution.engine=tez;
set hive.execution.engine=mr;
Metastore:
Metastore is the central repository where the metadata about tables will be stored. This
metadata includes database names, table names, column details along with the data types of
columns, and table partition details. It also stores the Serialization and
Deserialization details of the files stored in the underlying storage system. In general, the metastore
is a relational database. The metastore provides a Thrift server to interact with it. The metastore
can be used in two modes: remote mode and embedded mode.
Figure 5 Two modes of metastore
• Remote mode: In this mode meta-store is a Thrift Service which can be used in case non-
Java applications.
• Embedded Mode: In this case client can directly interact with meta-store using JDBC.
HDFS/Hbase Storage:
Hive is unable to store data directly. Hive can only analyse data and enter it into tables; the data
itself is kept on storage systems such as HDFS, HBase, or S3. Hive will create tables that point to the
location of the data in any of the above storage systems, and the data will be accessed from there.
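To illustrate how Hive simply points at data kept in HDFS, an external table might be declared as follows (a sketch; the table name, columns, and HDFS path are assumptions):
CREATE EXTERNAL TABLE employee (
id INT,
name STRING,
salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/input/employee';
-- Dropping an external table removes only the metadata in the metastore;
-- the files under /user/hive/input/employee remain in HDFS.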
• UI (User Interface): The user interface is for users to submit queries and other operations to
the system.
• Driver: This is the component that receives the requests. This component supports the
concept of session handles and provides JDBC/ODBC-style execute and fetch APIs.
• Compiler: The component that parses the query does semantic analysis on the various query
blocks and query expressions, and then creates an execution plan using information from
the Metastore for table and partition metadata.
• Metastore: The component that holds all of the structural information for the warehouse's
different tables and partitions, including column and column type information, serializers
and de-serializers for reading and writing data, and the HDFS files where the data is kept.
• Execution Engine: The component responsible for carrying out the compiler's execution
plan. The strategy is organized as a DAG of phases. The execution engine coordinates the
dependencies between these various plan stages and performs them on the relevant system
components.
• This metadata is used to type check the expressions in the query tree as well as to prune
partitions based on query predicates. The plan generated by the compiler is a DAG of stages,
with each stage being either a map/reduce job, a metadata operation, or an operation on
HDFS. For map/reduce stages, the plan contains map operator trees (operator trees that are
executed on the mappers) and a reduce operator tree (for operations that need reducers).
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top
of Hadoop to summarize Big Data, and makes querying and analyzing easy.Initially Hive was
developed by Facebook, later the Apache Software Foundation took it up and developed it further
as an open source under the name Apache Hive. It is used by different companies. For example,
Amazon uses it in Amazon Elastic MapReduce.
In Pig, there is a language we use to analyze data in Hadoop. That is what we call Pig Latin. Also, it
is a high-level data processing language that offers a rich set of data types and operators to perform
several operations on the data.Moreover, in order to perform a particular task, programmers need
to write a Pig script using the Pig Latin language and execute them using any of the execution
mechanisms (Grunt Shell, UDFs, Embedded) using Pig. To produce the desired output, these
scripts will go through a series of transformations applied by the Pig Framework, after execution.
Further, Pig converts these scripts into a series of MapReduce jobs internally. Therefore, it makes
the programmer’s job easy. Here, is the architecture of Apache Pig.
a. Parser: At first, all the Pig Scripts are handled by the Parser. Parser basically checks the
syntax of the script, does type checking, and other miscellaneous checks. Afterwards,
Parser’s output will be a DAG (directed acyclic graph) that represents the Pig Latin
statements as well as logical operators.The logical operators of the script are represented
as the nodes and the data flows are represented as edges in DAG (the logical plan)
Figure 9 Parser
b. Optimizer: Afterwards, the logical plan (DAG) is passed to the logical optimizer. It carries
out the logical optimizations further such as projection and push down.
c. Compiler: The compiler then compiles the optimized logical plan into a series of
MapReduce jobs.
d. Execution Engine: Eventually, all the MapReduce jobs are submitted to Hadoop in a
sorted order. Ultimately, it produces the desired results while these MapReduce jobs are
executed on Hadoop.
Figure 10 Atom
a. Atom: Atom is defined as any single value in Pig Latin, irrespective of their data.
Basically, we can use it as string and number and store it as the string. Atomic values of
Pig are int, long, float, double, char array, and byte array. Moreover, a field is a piece of
data or a simple atomic value in Pig. For Example − ‘Shubham’ or ‘25’
b. Bag:An unordered set of tuples is what we call Bag. To be more specific, a Bag is a
collection of tuples (non-unique). Moreover, each tuple can have any number of fields
(flexible schema). Generally, we represent a bag by '{}'. It is similar to a table in an RDBMS, but it
is not necessary that every tuple contains the same number of fields, or that fields in the same
position (column) have the same type.
Figure 11 Bag
1. bytearray is the blob type: if you do not declare a data type when loading your data, it is
treated as bytearray by default. There is also a boolean type, which holds the values true
and false.
2. datetime holds a complete timestamp, while biginteger and bigdecimal hold values larger
than what long, float, and double can represent. Pig also has some complex types.
Table 2 Complex Types
3. A tuple is essentially a single row and is written in round brackets. A bag is written in curly
braces and contains any number of tuples inside it. So a tuple is represented using round
brackets and a bag using curly brackets.
4. A map is a collection of key-value pairs in which the key and the value are separated by a
hash (#); the hash acts as the separator between the key and the value of the map.
5. Values for all the above data types can be NULL. Apache Pig treats null values in a similar
way as SQL does. A null can be an unknown value or a non-existent value. It is used as a
placeholder for optional values. These nulls can occur naturally or can be the result of an
operation.
time-sensitive data loads, for collecting large amounts of data in the form of search logs and web
crawls, and where analytical insights are needed using sampling.
Apache Pig comes with the following features −
• Rich set of operators − It provides many operators to perform operations like join, sort, filter, etc.
• Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig script if you are good at SQL.
• Optimization opportunities − The tasks in Apache Pig optimize their execution automatically, so the programmers need to focus only on the semantics of the language.
• Extensibility − Using the existing operators, users can develop their own functions to read, process, and write data.
• UDFs − Pig provides the facility to create User-Defined Functions in other programming languages such as Java and invoke or embed them in Pig scripts.
• Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured and unstructured, and stores the results in HDFS.
Local Mode
The source data for local mode is taken from your computer's local file system. The ‘pig –x local’ command is used to start Pig in local mode.
MapReduce Mode
You'll need access to a Hadoop cluster and an HDFS installation to run Pig in MapReduce mode.
The 'pig' command may be used to specify the MapReduce mode.
Apache Pig operators belong to Pig Latin, a high-level procedural language that may be used to query big data sets utilizing Hadoop and the MapReduce platform. A Pig Latin statement is a type of operator that accepts one relation as input and produces another relation as output. These operators are the most important tools Pig Latin gives for working with data, allowing it to be transformed by sorting, grouping, joining, projecting, and filtering.
Relational Operators
Pig Latin's main tools for manipulating data are relational operators. Sorting, grouping, joining, projecting, and filtering the data are all options. The basic relational operators are described below; a small Python sketch illustrating their semantics follows the list.
LOAD
The LOAD operator is used to insert data into a Pig relation from a file system or HDFS storage.
FOREACH
Based on data columns, this operator creates data transformations. It's used to make changes to a
relation's fields. To interact with data columns, use the FOREACH-GENERATE method.
FILTER
This operator uses a condition to pick tuples from a relation.
JOIN
The JOIN operator is used to accomplish an inner, equijoin of two or more relations based on field values that are the same. An inner join is always performed by the JOIN operator. Null keys are ignored by inner joins, therefore filtering them out before the join makes logical sense.
ORDER BY
Order By is a feature that allows you to sort a relation by one or more fields. Using the ASC and
DESC keywords, you may sort in ascending or descending order.
DISTINCT
In a relation, distinct eliminates duplicate tuples. Consider the following input file, which contains
amr,crap,8 and amr,myblog,10 twice. Duplicate items are deleted when distinct is applied to the
data in this file.
STORE
The term "store" refers to the process of saving results to a file system.
GROUP
The GROUP operator joins tuples that have the same group key (key field). If the group key
includes more than one field, the key field will be a tuple; otherwise, it will be the same type as the
group key. A GROUP operation produces a relation that has one tuple per group.
CROSS
To compute the cross product (Cartesian product) of two or more relations, use the CROSS
operator.
LIMIT
To restrict the number of output tuples, use the LIMIT operator. The output will include all tuples
in the relation if the provided number of output tuples is equal to or more than the number of
tuples in the relation.
SPLIT
The SPLIT operator divides a relation's contents into two or more relations depending on an expression, in accordance with the conditions indicated in the statement.
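The following is a minimal Python analogy (not Pig Latin itself) of what some of the relational operators above do, applied to in-memory lists of tuples; the data and field names are made up for illustration.

# A rough Python analogy of FILTER, DISTINCT/ORDER BY, JOIN, GROUP and LIMIT.
users  = [("amr", 25), ("bob", 30), ("amr", 25)]
clicks = [("amr", "crap", 8), ("amr", "myblog", 10), ("bob", "news", 3)]

filtered = [c for c in clicks if c[2] > 5]                        # FILTER by a condition
distinct_users = sorted(set(users))                               # DISTINCT + ORDER BY
joined = [u + c for u in users for c in clicks if u[0] == c[0]]   # inner JOIN on the name field

grouped = {}                                                      # GROUP by key field
for name, site, hits in clicks:
    grouped.setdefault(name, []).append((name, site, hits))

limited = clicks[:2]                                              # LIMIT to 2 tuples
print(filtered, distinct_users, joined, grouped, limited, sep="\n")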
Diagnostic Operators
DUMP
The DUMP operator executes Pig Latin commands and displays the results on the screen.
DESCRIBE
To study the schema of a specific relation, use the DESCRIBE operator. When troubleshooting a
script, the DESCRIBE operator is very useful.
ILLUSTRATE
The ILLUSTRATE operator is used to see how data is changed using Pig Latin statements. When it
comes to debugging a script, the ILLUSTRATE command is your best friend. This command alone
could be enough to convince you to use Pig instead of something else.
EXPLAIN
The logical and physical plans are printed by the EXPLAIN operator.
Beeline
Beeline is the command-line interface of HiveServer2, a newer Hive server product. The Hive community introduced HiveServer2, an enhanced Hive server designed for multi-client concurrency and improved authentication, which also provides better support for clients connecting through JDBC and ODBC.
Hive Server2
HiveServer2 is HiveServer1's successor. Clients can use HiveServer2 to run queries against
Hive. It enables numerous clients to send Hive requests and retrieve the results. Its primary
purpose is to provide the greatest possible support for open API clients such as JDBC and
ODBC.The Thrift-based Hive service is at the heart of HS2 and is in charge of handling Hive
queries (e.g., from Beeline). Thrift is a cross-platform RPC framework for creating services.
Server, Transport, Protocol, and Processor are the four levels that make up its stack.
Hive Driver
The Hive driver accepts the HiveQL statements entered into the command shell by the user. It creates session handles for the query and sends it to the compiler. (Separately, "Hive Driver" is also the name of a JavaScript driver for connecting to Apache Hive via the Thrift API; it can use SASL authentication mechanisms such as LDAP, PLAIN, and Kerberos over both HTTP and TCP transport.)
Hive Compiler
The query is parsed by the Hive compiler. It uses the metadata stored in the metastore to do semantic analysis and type-checking on the various query blocks and query expressions, and then generates an execution plan. The execution plan generated by the compiler is a DAG (Directed Acyclic Graph), with each step consisting of a map/reduce job, an HDFS action, or a metadata operation. The optimizer then splits the tasks and performs transformation operations on the execution plan to increase efficiency and scalability. Finally, the compiler communicates the proposed plan for executing the query back to the driver.
Optimizer
The optimizer performs transformation operations on the execution plan and splits the tasks to improve efficiency and scalability. For example, setting hive.optimize.bucketmapjoin=true hints to Hive to perform a bucket-level join during the map-stage join; it also reduces the scan cycles needed to find a particular key, because bucketing ensures that the key is present in a specific bucket. The optimizer transforms the execution plan in order to produce an optimal Directed Acyclic Graph (DAG). To improve performance and scalability, it applies transformations such as converting a pipeline of joins into a single join, and job separation, such as applying a transformation on data before a reduce operation.
Optimization Techniques:
Let's go over each of the Hive optimization strategies for Hive Performance Tuning one by one:
b. ORC File Format: The ORC (Optimized Row Columnar) file format stores Hive data in a more optimized way than the other file formats. To be more specific, ORC reduces the size of the original data by up to 75%, so data processing speed also increases. Compared to the Text, Sequence, and RC file formats, ORC shows better performance. Basically, it stores row data in groups, called stripes, along with a file footer. Therefore, we can say that when Hive is processing data, the ORC format improves the performance.
c. Hive Partitioning: Without partitioning, Hive reads all the data in the directory and then applies the query filters on it. Since all the data has to be read, this is slow as well as expensive. Users often need to filter the data on specific column values, and they need to understand the domain of the data they are analysing in order to apply partitioning in Hive. Basically, by partitioning, the entries for the various values of the partitioning columns are segregated and stored in their respective partitions. Hence, when we write a query to fetch values from the table, only the required partitions of the table are read, which reduces the time taken by the query to yield the result. (A small sketch illustrating partition pruning and bucketing appears after this list.)
d. Bucketing in Hive: Hive Optimization Techniques, let’s suppose a scenario. At times, there
is a huge dataset available. However, after partitioning on a particular field or fields, the
partitioned file size doesn’t match with the actual expectation and remains huge.Still, we
want to manage the partition results into different parts. Thus, to solve this issue of
partitioning, Hive offers Bucketing concept. Basically, that allows the user to divide table
data sets into more manageable parts.Hence, to maintain parts that are more manageable we
can use Bucketing. Through it, the user can set the size of the manageable parts or Buckets
too.
e. Vectorization In Hive: Hive Optimization Techniques, to improve the performance of
operations we use Vectorized query execution. Here operations refer to scans, aggregations,
filters, and joins. It happens by performing them in batches of 1024 rows at once instead of
a single row at a time. This feature was introduced in Hive 0.13. It significantly improves query execution time, and is easily enabled with two parameter settings:
set hive.vectorized.execution.enabled = true
set hive.vectorized.execution.reduce.enabled = true
f. Cost-Based Optimization in Hive (CBO):Hive Optimization Techniques, before submitting
for final execution Hive optimizes each Query’s logical and physical execution plan.
Although, until now these optimizations are not based on the cost of the query.However,
CBO, performs, further optimizations based on query cost in a recent addition to Hive. That
results in potentially different decisions: how to order joins, which type of join to perform,
the degree of parallelism and others.To use CBO, set the following parameters at the
beginning of your query:
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
Then, prepare the data for CBO by running Hive’s “analyze” command to collect various
statistics on the tables for which we want to use CBO.
g. Hive Indexing: One of the best Hive optimization techniques is indexing, which will definitely help increase query performance. Basically, for the original table, indexing creates a separate reference table called an index table. As we know, a Hive table can have a large number of rows and columns. Without indexing, queries on only some columns take a large amount of time, because the query is executed over all the columns and rows of the table. The major advantage of indexing is that when we query a table that has an index, the query does not need to scan all the rows in the table: it checks the index first and then goes to the particular column and performs the operation. Hence, maintaining indexes makes it easier for a Hive query to look into the index first and then perform the needed operations in less time. Time is, after all, the factor that everyone focuses on.
h. Execution Engine: The execution engine uses Hadoop to execute the execution plan created
by the compiler in order of their dependencies following the compilation and optimization
processes.
i. MetaStore: The metastore holds Hive metadata, including serializer and deserializer information, which is essential for read/write operations, as well as the locations of the HDFS files where the data is kept. In most cases, the metastore is a relational database. For searching and altering Hive metadata, the metastore provides a Thrift interface. The metastore can be configured in one of two ways:
Remote: In remote mode, the metastore is a Thrift service, which is suitable for non-Java applications.
Embedded: In embedded mode, the client can use JDBC to interface directly with
the metastore.
j. HCatalog: HCatalog is Hadoop's table and storage management layer. It is built on top of the Hive metastore and exposes the metastore's tabular data to other data processing tools such as Pig and MapReduce, so users can read and write data on the grid more simply. Users get a relational view of data in the Hadoop Distributed File System (HDFS) thanks to HCatalog's table abstraction, which means they don't have to worry about where or how their data is stored: RCFile format, text files, SequenceFiles, or ORC files. HCatalog supports reading and writing files in any format for which a SerDe (serializer-deserializer) can be written. By default, HCatalog supports the RCFile, CSV, JSON, SequenceFile, and ORC file formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.
k. WebHCat: For HCatalog, WebHCat is the REST API. It's a Hive metadata operations HTTP
interface. It lets users perform Hadoop MapReduce (or YARN), Pig, and Hive
jobs.Developers use HTTP requests to access Hadoop MapReduce (or YARN), Pig, Hive,
and HCatalog DDL from within applications, as demonstrated in the diagram below. HDFS
stores the data and code utilised by this API. When HCatalog DDL commands are
requested, they are immediately executed. WebHCat (Templeton) servers queue
MapReduce, Pig, and Hive jobs, which may be monitored for progress or cancelled as
needed. Pig, Hive, and MapReduce results are stored in HDFS, and developers select where
they should be stored.
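To make the partition pruning and bucketing ideas above concrete, here is a rough Python analogy (not HiveQL); the table, column names, and bucket count are invented for illustration.

# A rough Python analogy of partition pruning and bucketing.
# Rows are grouped by a partition key; a query that filters on that key only
# scans the matching partition, and bucketing narrows the scan further by
# hashing the key into a fixed number of buckets.
rows = [("2021-01-01", "u1", 10), ("2021-01-01", "u2", 5), ("2021-01-02", "u3", 7)]

partitions = {}
for day, user, amount in rows:
    partitions.setdefault(day, []).append((user, amount))   # one "directory" per day

NUM_BUCKETS = 4
def bucket_of(user):
    return hash(user) % NUM_BUCKETS                          # table bucketed by user

# A query like "WHERE day = '2021-01-01' AND user = 'u2'" touches only one
# partition, and only rows whose bucket matches bucket_of('u2').
wanted = [(u, a) for (u, a) in partitions["2021-01-01"]
          if bucket_of(u) == bucket_of("u2") and u == "u2"]
print(wanted)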
11.11 HIVEQL
Hive Query Language (HiveQL) is a query language for Hive that allows you to process and
analyse structured data in a Metastore.A table's data is retrieved using the SELECT command.
The WHERE clause functions in the same way as a condition. It applies the condition to the
data and returns a finite result. The built-in operators and functions provide an expression that
meets the criteria.The SELECT query's syntax is as follows:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
JOIN clause
JOIN is a clause that is used for combining specific fields from two tables by using values common
to each one. It is used to combine records from two or more tables in the database.
Syntax:
join_table:
HBase
Since 1970, RDBMS has been the go-to option for data storage and maintenance. After the emergence of big data, companies discovered the value of processing it and began to adopt technologies such as Hadoop. Hadoop stores massive data in a distributed file system and processes it using MapReduce. Hadoop excels at storing and processing large amounts of data in a variety of formats, including arbitrary, semi-structured, and unstructured data. Tables in HBase are divided
into regions and served by region servers. Column families divide regions vertically into "Stores."
In HDFS, stores are saved as files. HBase's architecture is depicted below.
Architecture of HBase
HBase provides low-latency random reads and writes on top of HDFS. In HBase, tables are
dynamically distributed by the system whenever they become too large to handle (Auto Sharding).
The simplest and foundational unit of horizontal scalability in HBase is a Region. A continuous,
sorted set of rows that are stored together is referred to as a region (subset of table data). HBase
architecture has a single HBase master node (HMaster) and several slaves i.e. region servers. Each
region server (slave) serves a set of regions, and a region can be served only by a single region
server. Whenever a client sends a write request, HMaster receives the request and forwards it to the
corresponding region server.
HBase can be run in a multiple-master setup, wherein there is only a single active master at a time. HBase tables are partitioned into multiple regions, with every region storing multiple rows of a table.
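As a rough illustration of how a row key range maps to a region and its region server, here is a small Python sketch; the split keys and server names are invented for the example.

# A minimal sketch of locating the region (and region server) for a row key.
import bisect

split_keys = ["g", "n", "t"]                    # regions: [..,"g"), ["g","n"), ["n","t"), ["t",..)
region_servers = ["rs1", "rs2", "rs1", "rs3"]   # each region is served by one server

def region_for(row_key):
    idx = bisect.bisect_right(split_keys, row_key)
    return idx, region_servers[idx]

for key in ["apple", "kiwi", "zebra"]:
    print(key, "->", region_for(key))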
MasterServer
• Assigns regions to region servers; this process is accomplished with the help of Apache ZooKeeper.
• Handles region load balancing across region servers: it moves regions to less crowded servers after unloading the congested ones and, by negotiating load balancing, keeps the cluster in good shape.
• Is in charge of schema modifications and other metadata actions, such as table and column family creation.
Regions
• Tables are broken up and distributed across region servers to form regions.
• The regions on the region servers:
handle data-related actions and communicate with the client;
handle read and write requests for all the regions beneath them;
follow the region size thresholds to determine a region's size.
When we take a closer look at a region server, we can see that it holds regions and stores. A store contains a MemStore and HFiles. The MemStore works similarly to a cache memory: everything that is written into HBase is initially saved there. The data is then transported and saved as blocks in HFiles, and the MemStore is flushed.
HBase architecture has three important components: HMaster, Region Server, and ZooKeeper.
Region Server
HBase tables are separated into regions horizontally by row key range. Regions are the fundamental building blocks of an HBase cluster; each region holds a portion of a table's data for its column families. The Region Server runs on an HDFS DataNode in the Hadoop cluster and is responsible for a variety of tasks, including handling, administering, and performing HBase operations on its set of regions. A region's default size is 256 MB.
A Region Server runs on an HDFS DataNode and consists of the following components –
• Block Cache – The read cache is located here. The read cache stores the most frequently
read data, and when the block cache is full, recently accessed material is evicted.
• MemStore- This is the write cache, which keeps track of new data that hasn't yet been
written to disc. A MemStore exists for each column family in a region.
• Write Ahead Log (WAL) – A file that records new data that has not yet been persisted to permanent storage, so that it can be recovered after a failure.
• HFile – The storage file that contains the rows on disc as sorted key-values.
A simplified sketch of this write path appears below.
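The following is a highly simplified Python model (for illustration only) of the write path described above: a write goes to the WAL and the MemStore, and the MemStore is flushed to an immutable, sorted HFile when it grows too large. The threshold and data are made up; real HBase behaviour is far more involved.

# A toy model of the region-server write path: WAL -> MemStore -> HFile flush.
wal, memstore, hfiles = [], {}, []
FLUSH_THRESHOLD = 3

def put(row_key, value):
    wal.append((row_key, value))          # 1. append to the write-ahead log
    memstore[row_key] = value             # 2. update the in-memory write cache
    if len(memstore) >= FLUSH_THRESHOLD:  # 3. flush to a sorted HFile when full
        hfiles.append(sorted(memstore.items()))
        memstore.clear()

for i in range(5):
    put(f"row{i}", f"val{i}")
print(wal, memstore, hfiles, sep="\n")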
11.12 ZOOKEEPER
Zookeeper is an open-source project that provides services like maintaining configuration
information, naming, providing distributed synchronization, etc.Zookeeper has ephemeral nodes
representing different region servers. Master servers use these nodes to discover available
servers.In addition to availability, the nodes are also used to track server failures or network
partitions.Clients communicate with region servers via zookeeper.In pseudo and standalone
modes, HBase itself will take care of zookeeper.
In HBase, ZooKeeper acts like a coordinator. It offers features such as configuration information management, naming, distributed synchronization, and server failure notification, among others. Clients use ZooKeeper to locate and communicate with region servers. Advantages of HBase include that it can store very large data sets, the database can be shared, it is cost-effective from gigabytes to petabytes, and it provides high availability through failover and replication. Its limitations include no support for SQL, no transaction support, sorting only on the row key, and memory issues on the cluster.
As a key enabler for this new generation of analytic processing methods, IBM® InfoSphere®
Streams provides a state-of-the-art computing platform that can help companies transform
burgeoning data into actionable information and business insights. InfoSphere Streams is a critical
component of the IBM Big Data Platform and delivers a highly scalable, agile software
infrastructure to perform in-motion analytics on a wide variety of relational and non-relational data
types entering the system at unprecedented volumes and speeds—and from thousands of real-time
sources. With InfoSphere Streams, organizations can capture and act on key business data just in
time, all the time.
These data streams can come from both structured and unstructured data sources, and they can
contain a wide range of digital data, including:
• Text files, spreadsheets, images, video and audio recordings
• Email, chat and instant messages, web traffic, blogs and social networking sites
• Financial transactions, customer service records, telephone usage records, system and application logs
• Data from satellites, GPS tracking, smart devices and network traffic sensors
• InfoSphere Streams brings these disparate data kinds together on a computing platform
that allows for advanced data analysis while maintaining great speed and response times.
Getting Started also makes it simple to install, build, configure, and manage application instances
with just a few clicks. Visual programming with drag-and-drop helps to shorten the learning curve
and accelerate application development.
Scale-out architecture
InfoSphere Streams software helps organizations extend their current IT investments without a
massive infrastructure overhaul. It scales from a single server to a virtually unlimited number of
nodes to process data of any volume—from terabytes to zettabytes. InfoSphere Streams provides a
clustered runtime environment that can easily handle up to millions of events per second with
microsecond latency. Actionable results can be achieved with near-zero latency. For improved
speed, the Advanced Compiler combines parts of the application and can distribute parts of the
application across many hardware nodes.Ethernet and InfiniBand are among the high-speed
transports it supports. Existing applications can be flexibly extended with new apps that access the
same data streams, allowing current investments to be used even further.A web-based
management console makes it easy to configure and manage the runtime and applications,
including automatically placing features and deploying application components. Applications and
their individual elements can be monitored for status and performance metrics to help ensure the
company attains its service-level agreements.
An instance graph
• A visual monitoring of application health and metrics is provided by an instance graph,
which is available in both InfoSphere Streams Studio and the management console, and
allows users to immediately spot issues using customisable views.
InfoSphere Streams also supports Complex Event Processing (CEP), which uses patterns to detect composite events in streams of basic events, resulting in high performance and rich analytics. Existing applications can simply be moved to an InfoSphere Streams environment to benefit from greater scalability and the capacity to handle up to 10 times more events per second on the same hardware.
Summary
Hive is a SQL-based database that allows users to read, write, and manage petabytes of data.
Hive is based on Apache Hadoop, an open-source system for storing and processing massive
information effectively.
Pig is a high-level programming language for Apache Hadoop. Pig allows data analysts to
create complicated data transformations without needing to know Java.
The IBM InfoSphere Information Server is a prominent data integration technology that makes
understanding, cleansing, monitoring, and transforming data easier.
HDFS is a distributed file system that runs on commodity hardware and can handle massive
data collections. It is used to grow an Apache Hadoop cluster from a few nodes to hundreds (or
even thousands) of nodes. HDFS is one of Apache Hadoop's primary components, along with
MapReduce and YARN.
Hadoop includes a RecordReader that transforms input splits into key-value pairs using
TextInputFormat. In the mapping process, the key-value pairs are used as inputs; this is the only data format that a mapper can read and understand.
Hadoop applications use the Hadoop Distributed File Solution (HDFS) as their primary data
storage system. HDFS is a distributed file system that uses a NameNode and DataNode
architecture to allow high-performance data access across highly scalable Hadoop clusters.
Hadoop is an Apache Foundation open source framework for processing huge volumes of
heterogeneous data sets in a distributed way across clusters of commodity computers and
hardware using a simplified programming style. Hadoop is a dependable distributed storage
and analysis solution.
If a task fails, Hadoop will identify the failure and reschedule a replacement on a healthy computer. It only gives up on a task if it fails four times (the default setting, which may be changed), at which point it terminates the job.
Namenode stores information about all other nodes in the Hadoop Cluster, files in the cluster,
file component blocks and their positions in the cluster, and other information that is necessary
for the Hadoop Cluster's functioning.
Job Tracker manages the sharing of information and outcomes by keeping track of the specific
tasks/jobs allocated to each of the nodes.
Keywords
Apache Hive:Hive is a data warehousing and ETL solution built on top of the Hadoop Distributed
File System (HDFS). Hive makes it simple to carry out tasks such as these.Encapsulation of data,
Querying on the fly, Large-scale data analysis
Apache Pig: Pig is a high-level platform or tool for processing massive datasets. It provides a high-
level of abstraction for MapReduce computation.
Apache MRUnit: Apache MRUnit TM is a Java package for unit testing Apache Hadoop map
reduce tasks. The post's example uses the Weather dataset, and it works with the year and
temperature retrieved from it. Obviously, you can simply adapt the example to your own data.
Hadoop: Hadoop is an open-source software framework for storing and processing data on
commodity hardware clusters. It has a lot of storage for any sort of data, a lot of processing power,
and it can perform almost unlimited concurrent processes or jobs.
HBase: HBase is a data model that looks a lot like Google's Big Table. It is a Java-based open-source
distributed database created by the Apache Software Foundation. HBase is a critical component of
the Hadoop ecosystem. HDFS is the foundation for HBase.
HDFS: Hadoop File System was built on a distributed file system architecture. It runs on standard
hardware. HDFS, unlike other distributed systems, is extremely fault-tolerant and built with low-
cost hardware in mind.
Meta-Store:Apache Hive metadata is stored in Meta-store, a central repository. It uses a relational
database to store Hive table and partition metadata (such as schema and location).
Name node: The name node is a piece of commodity hardware that houses the GNU/Linux
operating system as well as name node software. It's a piece of software that can run on standard
hardware.
Shuffling and sorting: The process of transferring intermediate output from the mapper to the reducer is known as shuffle. Each reducer receives one or more keys and their associated values. The mapper's intermediate key-value pairs are automatically sorted by key.
Self Assessment
1. Which of the following is an open-source data warehouse system that has been built on top
of Hadoop?
A. Apache Hive
B. Apache Pig
C. Apache Hbase
D. None of the mentioned
4. ________ developed by Yahoo researchers executes Map Reduce jobs on extensive datasets
and provides an easy interface for developers to process the data efficiently.
A. Apache Hive
B. Apache pig
C. Both
D. None of the mentioned
6. Which of the following compiles the optimized logical plan into a series of MapReduce jobs?
A. Parser
B. Atom
C. optimizer
D. compiles
9. ________ operator is used to perform an inner, equijoin join of two or more relations based
on common field values.
A. COMBINE
B. COMBINATION
C. JOIN
D. JOINING
14. HBase Tables are separated into Regions ________ by row key range.
A. Vertically
B. Horizontally
C. Diagonally
D. None of the mentioned
15. Which of the following is not a component of the HDFS data node?
A. Block Cache
B. MemStore
C. HFile
D. None of the mentioned
Review Questions
1. Explain architecture of Pig.
2. Explain working of HIVE.
3. Elaborate classification of Apache Pig operators
4. What do you understand by Infosphere streams?
Answers for Self Assessment
6. D 7. B 8. A 9. C 10. A
Further Readings
Maheshwari, Anil. Big Data. McGraw-Hill Education, 2019.
Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will
Transform How We Live, Work, and Think . Houghton Mifflin Harcourt.
McKinsey Global Institute Report (2011). Big Data: The Next Frontier For Innovation,
Competition, and Productivity. Mckinsey.com
Marz, Nathan, and James Warren (2015). Big Data: Principles and Best Practices
of Scalable Realtime Data Systems. Manning Publications.
Sandy Ryza, Uri Laserson et.al (2014). Advanced-Analytics-with-Spark. OReilley.
White, Tom (2014). Mastering Hadoop. OReilley.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
2. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
3. Hadoop API site: http://hadoop.apache.org/docs/current/api/
4. NoSQL databases: http://nosql-database.org/
5. Apache Spark: http://spark.apache.org/docs/latest/
Dr. Rajni Bhalla, Lovely Professional University Unit 12: Predictive Analytics
Objectives
• learn simple linear regression and multiple linear regression.
• learn visual data analysis techniques.
• learn applications of business analytics.
Introduction
Simple linear regression is a statistical method that allows us to summarize and study relationships
between two continuous (quantitative) variables.
• One variable, denoted x, is regarded as the predictor, explanatory, or independent variable.
• The other variable, denoted y, is regarded as the response, outcome, or dependent variable.
• Often, the objective is to predict the value of an output variable (or response) based on the
value of an input (or predictor) variable.
A line that begins in the lower left corner of the plot and ends in the upper right corner is called
a positive relationship. In a positive linear relationship, high scores on the X variable predict high
scores on the Y variable. A line that begins in the upper left corner of the plot and ends in the lower
right corner (like the relationship shown above) is called a negative relationship as shown in Figure
1. In a negative linear relationship, high scores on the X variable predict low scores on the Y
variable.
Linear Relationship
Linear regression is a linear model, e.g., a model that assumes a linear relationship between the
input variables (x) and the single output variable (y). More specifically, that y can be calculated
from a linear combination of the input variables (x). Scores scattered randomly around a straight
line in the middle of the graph indicate no relationship between variables.Sometimes a scatter plot
will show a curvilinear relationship between two variables. If this happens, we need to use special
statistics developed for curvilinear relationships.Whereas some relationships are straightforward to
understand, explain, and detect statistically (i.e., linear relationships), curvilinear relationships are
more complex because the nature of the relationship is different at different levels of the variables.
Curvilinear relationships can occur often in communication research, given the complex, socially
and contextually dependent phenomena that are the focus of such research.
Linear regression models are not perfect. It tries to approximate the relationship between
dependent and independent variables in a straight line. Approximation leads to errors. Some errors
can be reduced.Some errors are inherent in the nature of the problem. These errors cannot be
eliminated. They are called as an irreducible error, the noise term in the true relationship that
cannot fundamentally be reduced by any model. The same equation of a line can be re-written as y = β0 + β1x + ε, where β0 and β1 are two unknown constants that represent the intercept and slope. They are the parameters, and ε is the error term.
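Example: A minimal Python sketch, on made-up data, of estimating β0 and β1 by ordinary least squares:

# Ordinary-least-squares fit of y = b0 + b1*x on invented data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
b0 = y.mean() - b1 * x.mean()                                               # intercept
residuals = y - (b0 + b1 * x)                                               # estimates of the error term

print("intercept b0 =", round(b0, 3), "slope b1 =", round(b1, 3))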
Example
• The following data are provided to him: make (make of the car), fuelType (type of fuel used by the car), nDoor (number of doors), engineSize (size of the engine of the car), and price (the price of the car).
First and foremost, Fernando wants to see if he can accurately estimate automobile prices based on
engine size. The following questions are addressed in the first set of analyses:
• Is price of car price related with engine size? How strong is the relationship? Is the
relationship linear? Can we predict/estimate car price based on engine size?
Fernando does a correlation study.
Correlation is a metric for determining how closely two variables are linked. The correlation coefficient is a statistic used to quantify it; its value ranges from -1 to 1. When the correlation coefficient is large and positive (> 0.7), it means that if one variable grows, the other tends to increase as well. A large negative value implies that when one variable rises, the other falls. He does a
correlation analysis. He plots the relationship between price and engine size. He splits the data into
training and test set. 75% of data is used for training. Remaining is used for the test.He builds a
linear regression model. He uses a statistical package to create the model. The model creates a
linear equation that expresses price of the car as a function of engine size.
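The following is a sketch of this workflow using scikit-learn, with invented engine-size and price values (the real dataset is not given in the text), so treat the numbers purely as placeholders.

# Correlation study, 75/25 train-test split, and a linear regression model.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

engine_size = np.array([1.0, 1.4, 1.6, 2.0, 2.5, 3.0, 3.5, 4.0]).reshape(-1, 1)
price = np.array([9000, 11000, 12500, 16000, 21000, 26000, 30000, 36000])

print("correlation:", np.corrcoef(engine_size.ravel(), price)[0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    engine_size, price, train_size=0.75, random_state=0)   # 75% for training

model = LinearRegression().fit(X_train, y_train)
print("price = intercept + slope * engineSize:", model.intercept_, model.coef_[0])
print("R^2 on the test data:", model.score(X_test, y_test))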
The following are some of the most common applications of the Data Visualization
technique:
• It is a strong approach for analysing data and producing presentable and understandable
findings.
• It is a fundamental stage in the pre-processing section of the data mining process.
• It aids in data cleansing by detecting inaccurate data, corrupted or missing values
• It also aids in the construction and selection of variables, which involves deciding which
variables to include and exclude from the study.
• It also plays an important part in the Data Reduction process when merging the categories.
There are three different types of analysis for data visualization: univariate analysis, bivariate analysis, and multivariate analysis.
Univariate Analysis: We will use a single characteristic to examine virtually all of its features in a
univariate analysis. Example of univariate analysis are distribution plot, box and whisker plot and
violin plot.
Bivariate Analysis: Bivariate analysis is when we compare data between exactly two characteristics. Examples of bivariate analysis are the line plot, bar plot, and scatter plot.
Multivariate Analysis: We shall compare more than two variables in the multivariate analysis. For
example, if a marketer wishes to compare the popularity of four adverts on a website, click rates for
both men and women may be monitored, and associations between factors could be investigated.
It's similar to bivariate, however there are more dependent variables. The methods for analysing
this data are determined by the objectives to be met. Regression analysis, path analysis, factor
analysis, and multivariate analysis of variance are some of the approaches (MANOVA).
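The following is a small matplotlib sketch, on synthetic data, contrasting a univariate view (a histogram of one feature) with a bivariate view (a scatter plot of two features); the variables and numbers are made up for illustration.

# Univariate vs bivariate views of synthetic data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
age = rng.normal(35, 10, 200)                     # a single characteristic
income = age * 800 + rng.normal(0, 5000, 200)     # a second, related characteristic

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(age, bins=20)                            # univariate: distribution plot
ax1.set_title("Univariate: age")
ax2.scatter(age, income, s=10)                    # bivariate: scatter plot
ax2.set_title("Bivariate: age vs income")
plt.tight_layout()
plt.show()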
Churn Prevention
Churn prevention tries to anticipate which consumers will leave our firm, when they will
leave, and why they will go.Retaining a current client is significantly less expensive than
gaining a new one. As a result, this occurrence has the potential to be extremely
expensive.Companies may create predictive models that enable preemptive action before it's
too late by using the power of large consumer data sets.
Sales Forecasting:
Sales forecasting examines past performance, seasonality, market-moving events, and other factors
to produce a reasonable estimate of demand for a product or service. It may be used to forecast over the short, medium, or long term. Predictive analytics can forecast consumer reaction and shifting sentiments by looking at all the relevant aspects. The following are some examples of variables used in sales forecasting:
• Calendar data: season, hour, holidays, and so on.
• Weather data: temperature, humidity, rainfall, and so on.
• Company data: price, promotions, and marketing efforts.
• Social data: economic and political factors that a country is experiencing.
Market Analysis
Market survey analysis aids businesses in meeting consumer needs, boosting profits, and lowering attrition rates. The following are some examples of quality enhancement factors:
• Product attributes: components, presentation, and so on.
• Customer attributes: gender, age, and so on.
• Surveys of customers' tastes and preferences.
After the firm has created the predictive model, it may use it to look for qualities that match customer preferences. For example, physicochemical tests (e.g., pH levels) can be used to predict wine quality, with the result based on sensory data (evaluations by wine experts).
Risk assessment
Risk assessment enables businesses to identify the potential for issues in their operations.Predictive
analytics attempts to provide decision-making tools that can predict which activities are lucrative
and which are not. Risk assessment is a broad phrase that may mean different things to different people; indeed, we may wish to assess the risk of a client, a firm, or some other entity. In the case of a client, the risk assessment can look at data such as socio-demographic characteristics: gender, age, education, marital status, and so on.
Financial modeling
Financial modelling is the process of converting a set of assumptions about market or agent
behaviour into numerical forecasts.These prediction models are used to assist businesses in making
investment and return decisions. Predicting the stock market trend using internal and external data is one example. Predictive analytics may be used in a variety of sectors and can help you improve your performance and predict future occurrences so you can respond accordingly. Neural Designer is a machine learning and data science tool that makes it simple to create prediction models.
Summary
Regression analysis is a collection of statistical techniques used to estimate the associations
between a dependent variable and one or more independent variables in statistical
modelling.
By fitting a linear equation to observed data, linear regression seeks to model the connection
between two variables.
It is a stand-alone variable that is unaffected by the other variables you're attempting to
measure. A person's age, for example, might be an independent variable. Other aspects,
such as what they eat, how often kids go to school, and how much television they watch,
will have no effect on their age.
In an experiment, the dependent variable is the variable that is being measured or assessed.
The dependent variable in research looking at how tutoring affects test results, for example,
would be the participants' test scores, because that's what's being measured.
Correlation is a statistical word that refers to the degree to which two variables move in
lockstep. When two variables move in the same direction, it is said that they have a positive
correlation. A negative correlation exists when they move in opposite directions.
Keywords
Linear Regression:Linear regression is the process of identifying a line that best matches the data
points on the plot so that we can use it to forecast output values for inputs that aren't included in
the data set we have, with the assumption that those outputs will fall on the line.
Independent Variable: The independent variable (IV) is a feature of a psychological experiment
that is manipulated or modified by researchers rather than by other factors.
Dependent Variable: In an experiment, the dependent variable is the variable that is being
measured or assessed. The dependent variable in a research looking at how tutoring affects test
results, for example, would be the participants' test scores, because that's what's being measured.
Correlation: Correlation is a statistical word that refers to the degree to which two variables move
in lockstep. When two variables move in the same direction, it is said that they have a positive
correlation. A negative correlation exists when they move in opposite directions.
Data Visualization:A graphical depiction of information and data is referred to as data
visualisation. Data visualisation techniques make it easy to identify and comprehend trends,
outliers, and patterns in data by employing visual components like charts, graphs, and maps.
Bivariate Analysis: The phrase "bivariate analysis" refers to the study of two variables in order to
discover their correlations. In quality of life research, bivariate analyses are frequently reported.
Multivariate Analysis:MVA stands for multivariate analysis, which is a statistical process for
analysing data including many types of measurements or observations. It might also refer to
difficulties in which more than one dependent variable is investigated at the same time as other
variables.
Predictive Analysis:Predictive analytics is a form of advanced analytics that uses historical data,
statistical modelling, data mining techniques, and machine learning to create predictions about
future events. Predictive analytics is used by businesses to uncover trends in data in order to
identify dangers and opportunities.
Market Analysis: A market study is a proactive investigation of a product or service's market
demand. Market research examines all of the market elements that drive demand for a particular
product or service. Price, location, competition, substitutes, and overall economic activity are all
factors to consider.
HDFS: Hadoop File System was built on a distributed file system architecture. It runs on standard
hardware. HDFS, unlike other distributed systems, is extremely fault-tolerant and built with low-
cost hardware in mind.
Self Assessment
1. Linear regression is a ___________ machine learning algorithm.
A. Supervised
B. Unsupervised
C. Reinforcement
D. Clustering
2. In Linear Regression, which of the following strategies do we apply to determine the best fit line
for data?
A. Least Square Error
B. Maximum Likelihood
C. Logarithmic Loss
D. Both A and B
3. Simple linear regression is a statistical method that allows us to summarize and study
relationships between two continuous (quantitative) variables
A. Categorical
B. Continuous
C. Nominal
D. Ordinal
4. _______ measures the linear link between two variables, but it doesn't reveal more complicated correlations.
A. Correlation
B. Factorization
C. Regression
D. None of the mentioned
5. Which of the following plot will show a curvilinear relationship between two variables?
A. Scatter Plot
B. Curvilinear
C. Line
D. Bar Plot
D. None of above
12. Which of the following types of analysis used for Data Visualization?
A. Univariate Analysis
B. Bivariate Analysis
C. Multivariate Analysis
D. All of the above
14. Which of the following libraries should be used to make a chart in Python?
A. Visual data
B. Data visualization
C. Matplot
D. None of the above
Answers for Self Assessment
6. A 7. D 8. B 9. C 10. D
Further Readings
Maheshwari, Anil. Big Data. McGraw-Hill Education, 2019.
Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That
Will Transform How We Live, Work, and Think . Houghton Mifflin Harcourt.
McKinsey Global Institute Report (2011). Big Data: The Next Frontier For
Innovation, Competition, and Productivity. Mckinsey.com
Marz, Nathan, and James Warren (2015). Big Data: Principles and Best Practices
of Scalable Realtime Data Systems. Manning Publications.
Sandy Ryza, Uri Laserson et.al (2014). Advanced-Analytics-with-Spark. OReilley.
White, Tom (2014). Mastering Hadoop. OReilley.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
Dr. Rajni Bhalla, Lovely Professional University Unit 13: Data Analytics with R
Objectives
• learn concepts of machine learning.
• learn four categories of machine learning.
Introduction
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to
automatically learn and improve from experience without being explicitly programmed. Machine
learning focuses on the development of computer programs that can access data and use it to
learn for themselves.Machine Learning is the most widely used method for forecasting the future or
categorizing data to assist humans in making important decisions. Machine Learning algorithms
are taught over examples or situations in which they learn from previous experiences and examine
historical data. As a result, as it trains over and over on the examples, it is able to recognise patterns and make predictions about the future. The learning process starts with observations or data, such as examples, direct experience, or instruction, so that we may look for patterns in data and make better judgments in the future based on the examples we provide. The fundamental goal is
for computers to learn on their own, without the need for human involvement, and to adapt their
behaviour accordingly.But, using the classic algorithms of machine learning, text is considered as a
sequence of keywords; instead, an approach based on semantic analysis mimics the human
ability to understand the meaning of a text.
Machine learning approaches are traditionally divided into four broad categories: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. Various algorithms and computation techniques are used in supervised machine learning processes, including neural networks, naive Bayes, linear regression, logistic regression, support vector machines, k-nearest neighbours, and random forests. These are described below, and a small sketch using one of them appears after the list.
• Neural Network:Neural networks reflect the behavior of the human brain, allowing
computer programs to recognize patterns and solve common problems in the fields of AI,
machine learning, and deep learning.
• Naïve Bayes:In statistics, naive Bayes classifiers are a family of simple "probabilistic
classifiers" based on applying Bayes' theorem with strong independence assumptions
between the features. They are among the simplest Bayesian network models, but coupled
with kernel density estimation, they can achieve higher accuracy levels
• Linear Regression: Linear regression analysis is used to predict the value of a variable
based on the value of another variable. The variable you want to predict is called the
dependent variable. The variable you are using to predict the other variable's value is
called the independent variable.
• Logistic Regression:Logistic regression is a statistical analysis method used to predict a
data value based on prior observations of a data set. A logistic regression model predicts a
dependent data variable by analyzing the relationship between one or more existing
independent variables.
• Support Vector Machine:SVMs (support vector machines) are supervised machine
learning techniques that may be used for both classification and regression. However, they
are most commonly employed in categorization issues. SVMs were initially developed in
the 1960s, but they were improved around 1990.
• KNN: The supervised machine learning method k-nearest neighbours (KNN) is a basic,
easy-to-implement technique that may be used to tackle both classification and regression
issues.
• Random Forest: A random forest is a machine learning approach for solving
classification and regression issues. It makes use of ensemble learning, which is a
technique for solving difficult problems by combining many classifiers. A random forest
method is made up of a large number of decision trees.
• Image and object recognition: When applied to different computer vision techniques and
visual analysis, supervised learning algorithms may be used to find, isolate, and categorise
items from movies or images, making them usable.
• Predictive Analytics: The creation of predictive analytics systems to give deep insights into
multiple business data points is a common use case for supervised learning models. This
enables businesses to predict certain outcomes depending on a particular output variable,
assisting business executives in justifying choices or pivoting for the organization's
advantage.
• Customer Sentiment Analysis: Organizations can extract and categorise significant bits of
information from enormous amounts of data using supervised machine learning
algorithms with very little human interaction, including context, emotion, and purpose.
This may be quite beneficial in terms of obtaining a better knowledge of consumer
interactions and improving brand engagement initiatives.
• Spam Detection: Another example of a supervised learning model is spam detection.
Organizations may train databases to identify patterns or abnormalities in fresh data using
supervised classification algorithms, allowing them to efficiently arrange spam and non-
spam correspondences.
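The following is a minimal supervised-learning sketch with made-up data, using scikit-learn's k-nearest neighbours classifier mentioned above; it simply learns from labelled examples and predicts labels for new points.

# Train a k-nearest neighbours classifier on labelled examples.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])   # two features per example
y = np.array([0, 0, 0, 1, 1, 1])                                 # known class labels

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([[0, 2], [9, 9]]))                             # expected output: [0 1]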
• Clustering: Clustering is a way of arranging items into clusters so that those with the most similarities stay in one group while those with few or no similarities fall in another. Cluster analysis identifies similarities among data items and classifies them according to the presence or absence of such commonalities (a small clustering sketch appears after this list).
• Associate rule: An association rule is an unsupervised learning approach that is used to
discover associations between variables in a big database. It identifies the group of items
that appear in the dataset together. The association rule improves the effectiveness of
marketing strategies. People who buy X (let's say a loaf of bread) are more likely to buy Y
(butter/jam). Market Basket Analysis is a good example of an association rule.
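The following is a minimal unsupervised-learning sketch with made-up points: k-means groups the data into clusters purely from their similarity, with no labels provided.

# K-means clustering on invented 2-D points.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [2, 1.2], [8, 8], [8.5, 9], [9, 8.2]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assignment for each point
print(kmeans.cluster_centers_)   # the two cluster centres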
A semi-supervised algorithm makes three assumptions about the data:
• Continuity Assumption: The method assumes that points that are closer together have a higher probability of having the same output label.
• Cluster Assumption: The data may be split into distinct clusters, with points in the same
cluster having a higher chance of having the same output label.
• Manifold Assumption:The data are roughly distributed over a manifold with a significantly
smaller size than the input space. This assumption permits distances and densities defined
on a manifold to be used.
Reinforcement learning is a machine learning method in which an agent learns, by trial and error, to determine the best feasible action or path in a given scenario. Reinforcement learning differs from supervised learning: in supervised learning, the training data includes the answer key, so the model is trained with the right answer, whereas in reinforcement learning there is no answer key and the reinforcement agent decides what to do to complete the job. It is obliged to learn from its own experience in the absence of a training dataset. Here are some important terms used in reinforcement learning.
Agent: It is an assumed entity which performs actions in an environment to gain some reward.
Environment (e): A scenario that an agent has to face.
Reward (R): An immediate return given to an agent when he or she performs specific action or
task.
• State (s): State refers to the current situation returned by the environment.
• Policy (π): A strategy applied by the agent to decide the next action based on the current state.
• Value (V): The expected long-term return with discount, as compared to the short-term reward.
• Value Function: It specifies the value of a state that is the total amount of reward. It is an
agent which should be expected beginning from that state.
• Model of the environment: This mimics the behavior of the environment. It helps you to
make inferences to be made and also determine how the environment will behave.
• Model-based methods: Methods for solving reinforcement learning problems that make use of a model of the environment.
• Q value or action value (Q): The Q value is quite similar to the value. The only difference between the two is that it takes an additional parameter, the current action.
Our agent responds by making an action transition from one state to the next. Your cat, for example, progresses from sitting to walking. An agent's reaction is an action, and a policy is a way of choosing an action given a state, in the hope of better results. The agent may receive a reward or a punishment as a result of the transition.
Two kinds of reinforcement learning methods are positive and negative reinforcement; common approaches to implementing reinforcement learning include value-based and model-based methods.
• Positive: Positive reinforcement is behaviour strengthening that happens when an event occurs because of a particular behaviour; it increases the strength and frequency of that behaviour, maximizes performance, and helps maintain change for a longer length of time. However, too much reinforcement might lead to state over-optimization, which can have an impact on the outcome.
• Negative: Negative reinforcement is defined as behaviour strengthening that happens as a result of a negative circumstance that should have been avoided or halted. It assists you in determining the minimal level of performance. The disadvantage of this technique is that it supplies just enough to fulfil the minimal behaviour requirements.
There are two important learning models in reinforcement learning: the Markov Decision Process and Q-learning.
Figure 13 Q Learning
• After that, you must assign a reward value to each door: 100 points are awarded for doors that lead directly to the goal, and there is no reward for doors that are not directly connected to the target room. Because doors are two-way, each room has two arrows, and each arrow in the picture above represents an immediate reward value.
• In this image, you can see that each room represents a state, and the agent's movement from one room to another represents an action. A state is shown as a node, while the arrows show the actions. For example, an agent traverses from room number 2 to room 5 along the transitions listed below (a small Q-learning sketch of this example follows).
• Initial state = state 2, State 2 -> state 3, State 3 -> state (2,1,4), State 4 -> state (0,5,3)
• State 1 -> state (5,3), State 0 -> state 4
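The following is a minimal Q-learning sketch of the rooms example above. The reward matrix is inferred from the transitions listed (rooms 0–5, with room 5 as the goal), so treat it as an illustrative assumption rather than the exact figure from the text.

# Minimal Q-learning for the rooms example (rooms 0-5, goal = room 5).
import numpy as np

n_states, goal, gamma = 6, 5, 0.8
# -1 = no door, 0 = door with no reward, 100 = door leading directly to the goal.
R = -np.ones((n_states, n_states))
doors = [(0, 4), (4, 0), (4, 3), (3, 4), (3, 2), (2, 3), (3, 1), (1, 3),
         (1, 5), (5, 1), (4, 5), (5, 4), (5, 5)]          # assumed adjacency
for s, a in doors:
    R[s, a] = 100 if a == goal else 0

Q = np.zeros((n_states, n_states))
rng = np.random.default_rng(0)

for _ in range(1000):                                     # training episodes
    state = rng.integers(n_states)
    while state != goal:
        actions = np.where(R[state] >= 0)[0]              # doors available from this room
        action = rng.choice(actions)
        Q[state, action] = R[state, action] + gamma * Q[action].max()
        state = action

state, path = 2, [2]                                      # greedy path from room 2
while state != goal:
    state = int(np.argmax(Q[state]))
    path.append(state)
print("path:", path)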
Reinforcement learning helps you to create training systems that provide custom instruction and materials according to the requirements of students, and it is also used for aircraft control and robot motion control.
Summary
Machine learning (ML) is the study of computer algorithms that may improve themselves
over time by gaining experience and using data. Machine learning algorithms create a model
based on training data to make predictions or judgments without having to be explicitly
programmed to do so.
The process of supplying input data as well as proper output data to the machine learning
model is known as supervised learning. A supervised learning algorithm's goal is to
discover a mapping function that will translate the input variable(x) to the output
variable(y) (y).
Unsupervised learning, also known as unsupervised machine learning, analyses and
clusters unlabeled information using machine learning techniques. Without the need for
human interaction, these algorithms uncover hidden patterns or data groupings.
A learning problem with a small number of labelled examples and a large number of unlabeled examples is known as semi-supervised learning.
Reinforcement learning (RL) is a branch of machine learning that studies how intelligent
agents should operate in a given environment to maximise the concept of cumulative
reward. Reinforcement learning, along with supervised and unsupervised learning, is one of
the three main machine learning paradigms.
In statistics, naive Bayes classifiers are a subset of "probabilistic classifiers" based on Bayes'
theorem and strong independence assumptions between features. They are one of the most
basic Bayesian network models, but when combined with kernel density estimation, they
may attain greater levels of accuracy.
Sentiment analysis is the systematic identification, extraction, quantification, and study of
emotional states and subjective information using natural language processing, text analysis,
computational linguistics, and biometrics.
Clustering is the process of splitting a population or set of data points into many groups so
that data points in the same group are more similar than data points in other groups. To put
it another way, the goal is to separate groups with similar characteristics and assign them to
clusters.
In psychology, association refers to a mental link formed by specific experiences between
concepts, events, or mental states. Behaviorism, associationism, psychoanalysis, social
psychology, and structuralism are all schools of thought in psychology that use associations.
Keywords
Machine Learning: Machine learning is a type of data analysis that automates the creation of
analytical models. It's a field of artificial intelligence based on the premise that computers can learn
from data, recognise patterns, and make judgments with little or no human input.
Linear Regression: Linear regression is the process of identifying the line that best fits the data points on a plot, so that we can use it to forecast output values for inputs not included in the data set, on the assumption that those outputs fall on the same line.
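As an illustration of the keyword above (not code from the original text), here is a minimal sketch; the data points and the choice of scikit-learn are assumptions for demonstration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented example data: hours studied (x) vs. exam score (y).
x = np.array([[1], [2], [3], [4], [5]])
y = np.array([52, 58, 65, 70, 77])

model = LinearRegression().fit(x, y)   # find the best-fitting line
print(model.coef_, model.intercept_)   # slope and intercept of that line
print(model.predict(np.array([[6]])))  # forecast for an input not in the data set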
Supervised Learning: The machine learning job of learning a function that translates an input to an
output based on example input-output pairs is known as supervised learning. It uses labelled
training data and a collection of training examples to infer a function.
Unsupervised Learning: Unsupervised learning is a type of algorithm that discovers patterns in untagged data. The idea is that the machine is pushed to build a compact internal representation of its surroundings through imitation, the fundamental way young infants learn, and can then generate inventive material as a result.
Semi-supervised Learning: Semi-supervised learning is a machine learning technique that involves training with a small quantity of labelled data and a large amount of unlabeled data. Semi-supervised learning is the middle ground between unsupervised and supervised learning. It is a special case of weak supervision.
Reinforcement Learning: Reinforcement learning is a branch of machine learning that studies how
intelligent agents should operate in a given environment to maximise the concept of cumulative
reward.
Naïve Bayes: The Bayes theorem provides the basis for the Naïve Bayes algorithm, which is used in a broad range of classification problems. The algorithm assumes strong (naïve) independence between features given the class.
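For illustration only (assuming scikit-learn is available and using made-up data), a Gaussian Naïve Bayes classifier can be sketched as follows:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up data: two numeric features per sample, binary class labels.
X = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.0], [4.1, 3.7]])
y = np.array([0, 0, 1, 1])

clf = GaussianNB().fit(X, y)                  # estimate per-class feature distributions
print(clf.predict([[1.1, 2.0], [4.0, 3.9]]))  # classify unseen samples
print(clf.predict_proba([[2.5, 3.0]]))        # class probabilities via Bayes' theorem
```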
Clustering: Clustering is the process of splitting a population or set of data points into many
groups so that data points in the same group are more similar than data points in other groups. To
put it another way, the goal is to separate groups with similar characteristics and assign them to
clusters.
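A minimal clustering sketch (illustrative points, with scikit-learn's KMeans assumed available) that splits data points into two groups of similar items:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative 2-D points forming two loose groups.
points = np.array([[1, 1], [1.5, 2], [1, 0.5],
                   [8, 8], [8.5, 9], [9, 8]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # centre of each cluster
```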
Association analysis: The task of uncovering interesting relationships in large datasets is known as association analysis. These relationships can take two forms: frequent itemsets and association rules. An association rule states that two items have a strong relationship.
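As a simplified sketch of the idea (not the full Apriori algorithm, and with invented transactions), the snippet below counts how often item pairs occur together; pairs that clear a support threshold are candidates for association rules such as "bread ⇒ butter".

```python
from itertools import combinations
from collections import Counter

# Invented market-basket transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

min_support = 2  # keep pairs appearing in at least two baskets
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)  # e.g. ('bread', 'butter') appears in 3 baskets
```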
Markov Decision Process: A Markov decision process (MDP) is a discrete-time stochastic control
process in mathematics. It gives a mathematical framework for modelling decision-making in
settings where outcomes are partially random and partly controlled by a decision maker.
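As a hypothetical illustration of the definition above, a tiny MDP can be written down as plain Python dictionaries, with each state-action pair mapping to possible (probability, next state, reward) outcomes; all states, actions, and numbers here are invented.

```python
# A toy two-state MDP (states "idle" and "busy"); all numbers are invented.
# transitions[state][action] -> list of (probability, next_state, reward)
transitions = {
    "idle": {
        "wait": [(1.0, "idle", 0)],
        "work": [(0.8, "busy", 5), (0.2, "idle", 0)],
    },
    "busy": {
        "wait": [(0.6, "idle", 1), (0.4, "busy", 2)],
        "work": [(1.0, "busy", 3)],
    },
}

def expected_reward(state: str, action: str) -> float:
    """One-step expected reward for taking `action` in `state`."""
    return sum(p * r for p, _, r in transitions[state][action])

print(expected_reward("idle", "work"))  # 0.8*5 + 0.2*0 = 4.0
```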
Q Learning: Q-learning is a model-free reinforcement learning technique for learning the value of a particular action in a given state. It does not require a model of the environment, and it can handle problems with stochastic transitions and rewards without requiring adaptations.
Predictive Analysis: Predictive analytics is a form of advanced analytics that uses historical data,
statistical modelling, data mining techniques, and machine learning to create predictions about
future events. Predictive analytics is used by businesses to uncover trends in data in order to
identify dangers and opportunities.
Market Analysis: A market study is a proactive investigation of a product or service's market
demand. Market research examines all of the market elements that drive demand for a particular
product or service. Price, location, competition, substitutes, and overall economic activity are all
factors to consider.
Self Assessment
1. Supervised machine learning algorithms can use _________ examples to apply what they've
learned in the past to fresh data and predict future events.
A. Labelled
B. Unlabelled
C. Predicted
D. Unpredictable
3. Which of the following ___________ reflect the behavior of the human brain, allowing
computers to recognize patterns and solve common problems?
A. Neural networks
B. Naïve Bayes
C. Linear Regression
D. All of the above
B. probabilistic central
C. probabilistic classifiers
D. None of the above
7. If we consider item features to understand the taste of a user, that is an example of ____________
A. Content based filtering
B. Collaborative filtering
C. Both
D. None of the above
10. _____________ uses item features to recommend other items similar to what the user likes,
based on their previous actions or explicit feedback.
A. Content-based filtering
B. Collaborative filtering
C. Both
D. None of the above
A. Functions
B. Packages
C. Domains
D. Classes
12. Advanced users can edit R objects directly using ___________ computer code.
A. C, C++
B. C++, Java
C. Java, C
D. Java
13. In the R programming language, which of the following is utilised for statistical analysis?
A. Studio
B. Heck
C. KStudio
D. RStudio
15. The R programming language resembles the __ programming language on the surface.
A. C
B. Java
C. C++
D. None of the above
6. C 7. A 8. B 9. A 10. A
Review Questions
1) What is machine learning? Why is the machine learning trend emerging so fast?
2) Explain different types of machine learning algorithms.
3) Elaborate on the difference between classification and regression.
Further Readings
Maheshwari, Anil (2019). Big Data. McGraw-Hill Education.
Mayer-Schönberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That
Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt.
McKinsey Global Institute Report (2011). Big Data: The Next Frontier for
Innovation, Competition, and Productivity. McKinsey.com
Marz, Nathan; Warren, James (2015). Big Data: Principles and Best Practices
of Scalable Realtime Data Systems. Manning Publications.
Ryza, Sandy; Laserson, Uri et al. (2014). Advanced Analytics with Spark. O'Reilly.
White, Tom (2014). Mastering Hadoop. O'Reilly.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
2. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
3. Hadoop API site: http://hadoop.apache.org/docs/current/api/
4. NoSQL databases: http://nosql-database.org/
5. Apache Spark: http://spark.apache.org/docs/latest/
Unit 14: Big Data Management using Splunk
Dr. Rajni Bhalla, Lovely Professional University
Objectives
• explore concepts of Splunk
• learn features of Splunk
• understand interfaces, data ingestion and uploading data
• understand concepts of Datameer
• learn steps to install Splunk Enterprise on Windows
Introduction
Splunk is software used to search and analyze machine data. This machine data can come from
web applications, sensors, devices, or any data created by the user. It serves the needs of IT
infrastructure by analyzing the logs generated in various processes, but it can also analyze any
structured or semi-structured data with proper data modelling. It has built-in features to recognize
data types and field separators and to optimize the search processes. It also provides data
visualization of the search results.
Prerequisites
• The reader should be familiar with a querying language such as SQL.
• General knowledge of typical operations in computer applications, such as storing and
retrieving data and reading the logs generated by computer programs, will be highly
useful.
• Splunk is software that processes and brings out insight from machine data and other
forms of big data. This machine data is generated by a CPU running a web server, IoT
devices, logs from mobile apps, and so on.
• This data does not need to be presented to end users and has no business meaning on its
own. However, it is extremely important for understanding, monitoring and optimizing
the performance of the machines.
• Splunk can read this unstructured, semi-structured or rarely structured data. After
reading the data, it allows the user to search, tag, and create reports and dashboards on it.
With the advent of big data, Splunk is now able to ingest big data from various sources,
which may or may not be machine data, and run analytics on it.
• So, from a simple tool for log analysis, Splunk has come a long way to become a general
analytical tool for unstructured machine data and various forms of big data.
Features of SPLUNK
Features of SPLUNK are shown in Figure 2.
• Data Ingestion: Splunk accepts a wide range of data types, including JSON, XML, and
unstructured machine data such as web and application logs. The user can model the
unstructured data into a data structure as desired.
• Data Indexing: Splunk indexes the imported data for quicker searching and querying
under various situations.
• Data Searching: In Splunk, searching entails utilising the indexed data to create metrics,
forecast future trends, and spot patterns.
• Using Alerts: When certain criteria are identified in the data being examined, Splunk
alerts may be used to send emails or RSS feeds.
• Dashboards: Splunk Dashboards may display search results as charts, reports, and pivot
tables, among other things.
• Data Model: Based on specialized domain knowledge, the indexed data can be modelled
into one or more data sets. This makes it easy for end users to navigate and evaluate
business cases without having to grasp the intricacies of Splunk's search processing
language.
Administrator Link
The Administrator drop down menu allows you to customise and modify the administrator's
information. Using the interface below, we may reset the admin email ID and password.
We can also go to the preferences option from the administrator link to select the time zone and
home application on which the landing page will open once you log in. It now appears on the home
page, as shown below in Figure 4.
Figure 4 Preferences
Settings Link
This is a link to a page that lists all of Splunk's key functionality. By selecting the lookup link, you
may add the lookup files and lookup definitions, for example.
• Search and Reporting Link: The Search & Reporting link brings us to the features where
we can locate the data sets that are accessible for searching, as well as the reports and
alerts that have been produced for these searches. The screenshot below clearly
demonstrates this.
Splunk's Add Data function, which is part of the search and reporting interface, is where data gets
ingested.
Figure 6 Events are stored in the index as a group of files that fall into two categories
When we click this option, we're sent to a screen where we can choose the source and format of the
data we want to send to Splunk for analysis.
Input Settings
We configure the host name from which the data is imported in this phase of the data ingestion
process. For the host name, there are several possibilities to pick from as shown in Figure 10.
Constant value
Enter the full host name of the server where the source data resides.
Regex on path
Choose this option to obtain the host name using a regular expression; in the Regular expression
field, type the regex for the host you wish to extract.
Segment in path
To extract the host name from a segment in your data source's path, enter the segment number in
the Segment number field. For example, if the source path is /var/log/ and you want the host value
to be the third segment (the host server name), enter "3".
The next step is to select the index type that will be used to search the input data. The default index
approach is chosen. The summary index is used to construct a summary of the data and establish
an index on it, whereas the history index is used to store the search history. In the image below, it is
clearly represented.
Review Settings
After clicking on the Next button, we see a summary of the settings we have chosen. We review it
and choose Next to finish the uploading of data, as shown in Figure 11.
When the upload is complete, the screen below opens, indicating that the data was successfully
ingested and outlining the next steps we may take with the data.
Splunk's inbuilt data processing unit evaluates all incoming data and classifies it into several data
kinds and categories. Splunk, for example, can distinguish a log from an Apache web server and
construct suitable fields from the data read.
Splunk's source type identification capability does this by utilising its built-in source types,
sometimes known as "pretrained" source types.
The user does not have to manually classify the data or assign any data types to the fields of the
incoming data, making analysis easy.
When we select the Search & Reporting app, we are greeted with a search box from which we can
begin our log data search (Figure 16). We input the host name in the format indicated below and then
click the search icon in the upper right corner. This returns a result that highlights the search term.
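As a hedged illustration of running a similar search programmatically (assuming a local Splunk Enterprise instance with its default management port 8089 and hypothetical admin credentials; the host value is also made up), Splunk's REST export endpoint can be called from Python:

```python
import requests

# Hypothetical credentials and host; replace with your own deployment details.
SPLUNK = "https://localhost:8089"
AUTH = ("admin", "changeme")

resp = requests.post(
    f"{SPLUNK}/services/search/jobs/export",
    auth=AUTH,
    data={
        # The SPL query must start with the "search" command.
        "search": 'search host="web_application" | head 10',
        "output_mode": "json",
    },
    verify=False,  # a default Splunk install uses a self-signed certificate
    stream=True,
)
for line in resp.iter_lines():
    if line:
        print(line.decode("utf-8"))  # one JSON result per line
```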
Field Summary
By clicking on the name of the field, you may get more specific stats about that field. It displays all
of the field's different values, as well as their counts and percentages as shown in Figure 22.
14.7 Datameer
Datameer acts as a job compiler or code generator, like Hive. This means every function, filter, or
join that the user designs in the spreadsheet will be translated into native Tez code. Tez is great for
splitting workloads into smaller pieces. To do so, Datameer compiles a job for a Hadoop cluster,
where it is sent to be executed. After the job is compiled and sent to the cluster, Datameer does not
control job execution and can only receive the telemetry metrics provided by the cluster's services.
The job runs with whatever scheduling settings apply and uses the resources granted by the
scheduler. All users working with Datameer's Excel-like User Interface (UI) are generating a Java
program for distributed computing on the cluster backend. This high level of abstraction is one of
the key features that makes Datameer such an outstanding technology. However, this approach
does mean that business users need to keep in mind the kinds of problems every programmer deals
with, i.e., data types, memory, and disk usage. This separates analytics work into two stages: first,
the design/edit time and second, the execution/runtime of a data link, import job, or workbook. The
two stages run on different parts of your distributed computing system (cluster).
DESIGN/EDIT TIME
The first stage is served on the Datameer application server, which runs the Datameer service Java
Virtual Machine (JVM), started and executed under the Datameer service account user. Depending
on your configuration, and on whether (Secure) Impersonation is configured or not, calls are made
from <datameerServiceAccountUser>@<datameerHost> or <loggedinUser>@<datameerHost>.
EXECUTION/RUN TIME
The second stage is served on random DataNodes (DN) in the cluster. The DN runs the container
JVM, started by the ApplicationMaster (AM) and executed under the YARN service account user.
Depending on the configuration, and on whether (Secure) Impersonation is configured or not, calls
are made from <yarnServiceAccountUser>@<dataNode> or <impersonatedUser>@<dataNode>.
• (Optional) Check the boxes to Launch browser with Splunk and Create Start Menu Shortcut.
Click Finish. The installation completes, and Splunk Enterprise starts and launches in a supported
browser if you checked the appropriate box.
Summary
Splunk is a tool for tracking and searching large amounts of data. It indexes and correlates
data in a searchable container and allows for the generation of alerts, reports, and
visualisations.
Splunk enterprise's goal is to help you figure out what's going on in your company and
take action swiftly.
Splunk cloud is a versatile, secure, and cost-effective data platform service that allows you
to search, analyse, visualise, and act on your data.
Splunk Light allows you to collect and correlate data from almost any source, format, or
location. Data flowing from packaged and client applications, app servers, web servers,
databases, network wire data, virtual machines, operating systems, sensors, and other
sources are just a few of the possibilities.
The process of acquiring and importing data for immediate use or storage in a database is
known as data ingestion. Ingesting something means "to take in or absorb" it. Data can be
ingested in batches or streamed in real time.
Indexing is a technique for improving database speed by reducing the number of disc
accesses necessary when a query is run. It's a data structure strategy for finding and
accessing data in a database rapidly. A few database columns are used to generate
indexes.
Panel-based displays are known as dashboards. Modules such as search boxes, fields,
charts, tables, and lists can be included in the panels. Reports are frequently linked to
dashboard panels. You may add a search visualisation or a report to a new or existing
dashboard after you build it.
The structure of your data is defined by a Splunk data model, which is a hierarchy of
datasets. Your data model should represent the data's basic structure as well as the Pivot
reports that your end users demand.
Keywords
Splunk: Splunk is a search and analysis tool for machine data. Machine data might originate from
online applications, sensors, devices, or any data that the user has developed. It supports IT
infrastructure by analysing logs created during various operations, but it may also evaluate any
organised or semi-structured data with correct data modelling.
Splunk Interface: Splunk's web interface includes all of the tools you'll need to search, report, and
analyse the data you've ingested. The same web interface allows administrators to manage users
and their responsibilities. It also includes links for data ingestion as well as Splunk's built-in
applications.
Datameer: Datameer bills itself as an all-in-one analytics solution. According to Datameer, it helps
ingest data (into Hadoop), cleanse and prepare data that has been or is being ingested, query the
data using Hive/Tez/Spark, and provide visualisation of the query results.
Data Ingestion: Splunk accepts a wide range of data types, including JSON, XML, and
unstructured machine data such as web and application logs. The user can model the unstructured
data into a data structure as desired.
Data Indexing: Splunk indexes the imported data for quicker searching and querying under
various situations.
Data Searching: In Splunk, searching entails utilising the indexed data to create metrics, forecast
future trends, and spot patterns.
Using Alerts: When certain criteria are identified in the data being examined, Splunk alerts may be
used to send emails or RSS feeds.
Dashboards: Splunk Dashboards may display search results as charts, reports, and pivot tables,
among other things.
Data Model: Based on specialized domain knowledge, the indexed data can be modelled into one or
more data sets. This makes it easy for end users to navigate and evaluate business cases without
having to grasp the intricacies of Splunk's search processing language.
Hadoop: Hadoop is an open-source software framework for storing and processing data on
commodity hardware clusters. It has a lot of storage for any sort of data, a lot of processing power,
and it can perform almost unlimited concurrent processes or jobs.
Application Master: The Application Master is a framework-specific library that is in charge of
negotiating resources with the Resource Manager and working with the Node Manager(s) to
execute and monitor Containers and their resource usage. It is in charge of negotiating suitable
resource Containers with the Resource Manager and keeping track of their progress. The Resource
Manager monitors the Application Master, which operates as a single Container.
NameNode: The NameNode is a component of the master system. Its main function is to manage
all of the metadata, that is, the list of files stored in HDFS (Hadoop Distributed File System). In a
Hadoop cluster, data is stored in the form of blocks.
Self Assessment
1. Splunk is a software used to _______________ machine data.
A. search and attention
B. search and analyze
C. surfing and analyze
D. none of the mentioned
2. The Administrator drop down menu allows you to customize and modify the
____________ information
A. Administrator's
B. Reporting
C. Customer
D. User
3. The link to _______________ brings us to the features where we can locate the data sets that
are accessible for searching, as well as the reports and alerts that have been produced for these
searches.
A. search and records
B. short and reporting
C. search and reporting
D. None of the above
4. Data ingestion in Splunk happens through the ______ feature which is part of the search and
reporting app.
A. Add data
B. Upload data
C. Ingest data
D. None of the above
8. Which of the following problems does every programmer deal with, and must business users
keep in mind?
A. Datatype
B. Memory
C. Disk usage
D. All of the above
C. Both
D. None of the above
10. Select the option(s) that can be used to install Splunk Enterprise on Windows.
A. GUI interface
B. Command Line Interface
C. Both
D. None of the above
11. Select the parameter(s) that prevent users from installing Splunk.
A. Unsupported OS
B. Windows Server 2003
C. Both
D. None of the above
12. The MAX_PATH path restriction in the Windows API is ___ characters long.
A. 250
B. 260
C. 270
D. None of the above
13. Which feature of Splunk is used to search the entire data set that is ingested?
A. Search & Reporting
B. Refining search results
C. Using fields in search
D. Sharing the search result.
14. Which of the following formats are available for exports?
A. CSV
B. XML
C. JSON
D. All of the above
15. Which of the following are the components of the SPLUNK search processing language
(SPL)?
A. Search terms
B. Commands
C. Functions
D. All of the above
6. B 7. A 8. D 9. C 10. C
Review Questions
1) Write down the steps for installing Splunk Enterprise on Windows.
2) What are data preparation and Datameer?
3) Write down the functions of the Search & Reporting app.
4) What are the different types of Splunk dashboards? Also, write down the components of
the Splunk architecture.
5) What are the benefits of feeding data into a Splunk instance through Splunk Forwarders?
Further Readings
Maheshwari, Anil (2019). Big Data. McGraw-Hill Education.
Mayer-Schönberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That
Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt.
McKinsey Global Institute Report (2011). Big Data: The Next Frontier for
Innovation, Competition, and Productivity. McKinsey.com
Marz, Nathan; Warren, James (2015). Big Data: Principles and Best Practices
of Scalable Realtime Data Systems. Manning Publications.
Ryza, Sandy; Laserson, Uri et al. (2014). Advanced Analytics with Spark. O'Reilly.
White, Tom (2014). Mastering Hadoop. O'Reilly.
Web Links
1. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
2. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
3. Hadoop API site: http://hadoop.apache.org/docs/current/api/
4. NoSQL databases: http://nosql-database.org/
5. Apache Spark: http://spark.apache.org/docs/latest/
6. Tutorials on Big Data technologies: https://www.tutorialspoint.com/