
Big Data


1 What is Big Data? List out the best practices of Big Data Analytics?

ANS-
 Big Data is a high-volume, high-velocity or high-variety information asset that requires
new forms of processing for enhanced decision making, insight discovery and process
optimization.
 Big data is a combination of structured, semi-structured and unstructured data
collected by organizations that can be mined for information and used in machine
learning projects, predictive modeling and other advanced analytics applications.
 Systems that process and store big data have become a common component of data
management architectures in organizations, combined with tools that support big data
analytics use cases. Big data is often characterized by the three V's:
 the large volume of data in many environments;
 the wide variety of data types frequently stored in big data systems; and
 the velocity at which much of the data is generated, collected and processed.
 These characteristics were first identified in 2001 by Doug Laney, then an analyst at
consulting firm Meta Group Inc.; Gartner further popularized them after it acquired
Meta Group in 2005. More recently, several other V's have been added to different
descriptions of big data, including veracity,  value and variability.
 Although big data doesn't equate to any specific volume of data, big data
deployments often involve terabytes, petabytes and even exabytes of data created and
collected over time.
 Why is big data important?
o Companies use big data in their systems to improve operations, provide better
customer service, create personalized marketing campaigns and take other
actions that, ultimately, can increase revenue and profits. Businesses that use it
effectively hold a potential competitive advantage over those that don't
because they're able to make faster and more informed business decisions.
o For example, big data provides valuable insights into customers that
companies can use to refine their marketing, advertising and promotions in
order to increase customer engagement and conversion rates. Both historical
and real-time data can be analyzed to assess the evolving preferences of
consumers or corporate buyers, enabling businesses to become more
responsive to customer wants and needs.
o Big data is also used by medical researchers to identify disease signs and risk
factors and by doctors to help diagnose illnesses and medical conditions in
patients. In addition, a combination of data from electronic health records,
social media sites, the web and other sources gives healthcare organizations
and government agencies up-to-date information on infectious disease threats
or outbreaks.
o Here are some more examples of how big data is used by organizations:
 In the energy industry, big data helps oil and gas companies identify
potential drilling locations and monitor pipeline operations; likewise,
utilities use it to track electrical grids.
 Financial services firms use big data systems for risk management
and real-time analysis of market data.
 Manufacturers and transportation companies rely on big data to
manage their supply chains and optimize delivery routes.
 Other government uses include emergency response, crime prevention
and smart city initiatives.
o Big Data solutions are ideal for analyzing not only raw structured data, but also
semi-structured and unstructured data from a wide variety of sources.
o Big Data solutions are ideal when all, or most, of the data needs to be analyzed
rather than a sample, or when a sample of the data is not nearly as effective as the
larger data set from which it is drawn.
o Big Data solutions are ideal for iterative and exploratory analysis when
business measures on data are not predetermined.
o Big Data is well suited for solving information challenges that don’t natively
fit within a traditional relational database approach for handling the problem at
hand.

 UNDERSTAND THE BUSINESS REQUIREMENTS


Analyzing and understanding the business requirements and organizational goals is
the first and foremost step, and it must be carried out even before leveraging big data
analytics in your projects. Business users must understand which projects in their
company should use big data analytics to generate maximum profit.
 DETERMINE THE COLLECTED DIGITAL ASSETS
The second practice is to identify the type of data pouring into the organization, as
well as the data generated in-house. Usually, the collected data is disorganized and in
varying formats. Moreover, some data is never even exploited (so-called dark data),
and it is essential that organizations identify this data too.
 IDENTIFY WHAT IS MISSING
The third practice is analyzing and understanding what is missing. Once you have
collected the data needed for a project, identify the additional information that might
be required for that particular project and where it can come from. For instance, if you
want to leverage big data analytics in your organization to understand your
employees' well-being, then along with information such as login/logout times,
medical reports and email reports, you also need some additional information about,
say, the employees' stress levels. This information can be provided by co-workers or
leaders.
 COMPREHEND WHICH BIG DATA ANALYTICS MUST BE LEVERAGED
After analyzing and collecting data from different sources, it's time for the
organization to understand which big data technologies, such as predictive
analytics, stream analytics, data preparation, fraud detection, sentiment analysis, and
so on, are best suited to the current business requirements. For instance, big data
analytics helps a company's HR team identify the right talent faster during the
recruitment process by combining data from social media and job portals using
predictive and sentiment analysis.
 ANALYZE DATA CONTINUOUSLY
This is the final best practice that an organization must follow when it comes to big
data. You must always be aware of what data your organization holds and what is
being done with it. Check the health of your data periodically so that you never miss
important but hidden signals in the data; a minimal sketch of such a check follows
below. Before implementing any new technology in your organization, it is vital to
have a strategy that helps you get the most out of it. With adequate and accurate data
at their disposal, companies must also follow the above-mentioned big data practices
to extract value from this data.
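
As a rough illustration of the "analyze data continuously" idea, here is a hedged Python sketch that computes a few basic health metrics (missing fields, duplicate IDs) over a batch of records. The field names and the sample batch are invented for illustration, not taken from the text.

# Minimal data-health-check sketch (field names and thresholds are assumptions).
from collections import Counter

def health_report(records, required_fields=("id", "timestamp", "value")):
    """Return simple quality metrics for a batch of dict records."""
    total = len(records)
    missing = sum(
        1 for r in records
        if any(r.get(f) in (None, "") for f in required_fields)
    )
    duplicate_ids = sum(c - 1 for c in Counter(r.get("id") for r in records).values() if c > 1)
    return {
        "total_records": total,
        "records_with_missing_fields": missing,
        "duplicate_ids": duplicate_ids,
    }

if __name__ == "__main__":
    batch = [
        {"id": 1, "timestamp": "2023-01-01T10:00", "value": 42},
        {"id": 1, "timestamp": "2023-01-01T10:05", "value": 37},   # duplicate id
        {"id": 2, "timestamp": "", "value": 99},                   # missing timestamp
    ]
    print(health_report(batch))

Running such a check on a schedule is one simple way to spot data-quality drift before it hides important signals.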

2 Write down the characteristics of Big Data Applications?


ANS-
The primary characteristics of Big Data are –
1. Volume
Volume refers to the huge amounts of data that are collected and generated
every second in large organizations. This data comes from different sources
such as IoT devices, social media, videos, financial transactions and customer
logs.
Storing and processing this huge amount of data was a problem earlier, but now
distributed systems such as Hadoop are used to organize data collected from all
these sources. The size of the data is crucial for understanding its value, and
volume is also useful in determining whether a collection of data is Big Data or
not.
Data volume can vary widely. For example, a text file is a few kilobytes whereas
a video file is a few megabytes. Facebook (Meta) alone can produce an
enormous amount of data in a single day: billions of messages, likes and posts
each day contribute to generating such huge data.

Global mobile traffic was tallied at around 6.2 exabytes (6.2 billion GB) per
month in 2016. The total amount of data stored worldwide was 800,000
petabytes in the year 2000.

That amount was anticipated to increase to 35 zettabytes by 2020. Every hour
of the year, businesses generate terabytes of data.
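
The storage units quoted above differ by factors of a thousand; this small Python sketch just makes the conversions explicit (using decimal units, so 1 EB = one billion GB).

# Decimal storage-unit conversions behind the figures above.
GB_PER_TB = 1_000
GB_PER_PB = 1_000_000
GB_PER_EB = 1_000_000_000
GB_PER_ZB = 1_000_000_000_000

print(6.2 * GB_PER_EB)        # 6.2 EB of monthly mobile traffic -> 6.2 billion GB
print(800_000 * GB_PER_PB)    # 800,000 PB stored in 2000        -> 8e11 GB
print(35 * GB_PER_ZB)         # 35 ZB projected                  -> 3.5e13 GB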

We store everything: environmental data, financial data, medical data,
surveillance data, and many other kinds of data on devices.
The St. Anthony Falls Bridge in Minneapolis has more than 200 embedded
sensors positioned at strategic points to provide a fully comprehensive
monitoring system where all sorts of detailed data is collected.

Organizations that don’t know how to manage this data are overwhelmed
by it. The amount of data available to the enterprise is on the rise, while the
percentage of data it can process, understand and analyze is on the decline,
thereby creating a blind zone.

2. Variety
Another one of the most important Big Data characteristics is its variety. It
refers to the different sources of data and their nature. The sources of data
have changed over the years. Earlier, data was only available in spreadsheets
and databases; nowadays, data is present in photos, audio files, videos,
text files and PDFs.
The variety of data is crucial for its storage and analysis.
Data can be classified into three distinct types:

1. Structured data
2. Semi-Structured data
3. Unstructured data
Data comes in various forms and formats.
Variety arises from the availability of a large number of heterogeneous
platforms in the industry.
Variety represents all types of data: a fundamental shift in analysis
requirements from traditional structured data to include raw,
semi-structured and unstructured data as part of the decision-making and
insight process.
Around 80 percent of the world's data is unstructured or semi-structured.
For example, the Twitter feed uses the JSON format, and video and picture
images aren't easily stored in a relational database.
To capitalize on the Big Data opportunity, enterprises must be able to
analyze all types of data, both relational and non-relational; the sketch below
contrasts the two main cases.
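
To make the structured/semi-structured distinction concrete, here is a small Python sketch that reads the same facts from a tabular (CSV-like) record and from a JSON document loosely resembling a tweet. The field names and values are invented for illustration.

import csv
import io
import json

# Structured data: fixed columns, fits a relational table.
csv_text = "user_id,followers\n101,2500\n102,130\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["followers"])            # -> 2500

# Semi-structured data: self-describing JSON (e.g., a tweet-like document).
tweet_json = '{"user": {"id": 101, "followers": 2500}, "text": "hello", "hashtags": ["bigdata"]}'
tweet = json.loads(tweet_json)
print(tweet["user"]["followers"])      # -> 2500 (the schema travels inside the document)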
3. Velocity
This term refers to the speed at which data is created or generated. The speed
of data production also relates to how fast the data must be processed, because
only after analysis and processing can the data meet the demands of
clients and users.
Massive amounts of data are produced continuously from sensors, social media
sites and application logs. If the data flow is not continuous, there is little point
in investing time or effort in it.
As an example, people generate more than 3.5 billion searches on Google per
day.

Velocity refers to the speed at which data is generated and at which it needs to
be handled. It typically considers how quickly data is being received, stored and
retrieved.
Organizations must be able to analyze this data in near real time if they want to
find insights in it, because more and more of the data being produced today has
a very short shelf life; a toy example of processing events as they arrive follows
below.
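
As a toy illustration of processing data as it arrives rather than in one large batch, the hedged Python sketch below consumes a simulated event stream and keeps a running count per event type. The event names and stream length are assumptions.

# Toy near-real-time processing: update state per event instead of batching everything.
from collections import Counter
import random

def event_stream(n=20):
    """Simulate a continuous feed of events (a stand-in for sensor/log/social data)."""
    for _ in range(n):
        yield random.choice(["click", "search", "purchase"])

counts = Counter()
for event in event_stream():
    counts[event] += 1          # running counts are available at any moment
print(counts)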

4. Veracity
This characteristic of Big Data is connected to the previous ones. It defines the
degree of trustworthiness of the data. As most of the data you encounter is
unstructured, it is important to filter out the unnecessary information and
use the rest for processing.
Veracity is the characteristic of big data analytics that denotes data
inconsistency as well as data uncertainty.

As an example, a huge amount of data can create much confusion; on the other
hand, too little data conveys inadequate information.
Veracity covers the quality, accuracy and trustworthiness of the data captured;
a small filtering sketch follows below.
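
Here is a hedged sketch of the filtering idea mentioned above: keep only records that pass basic trust checks (non-empty values, plausible ranges). The sensor readings and the acceptance rules are invented for illustration.

# Illustrative veracity filter: discard records that fail simple trust checks.
readings = [
    {"sensor": "A", "temp_c": 21.5},
    {"sensor": "B", "temp_c": None},     # missing value
    {"sensor": "C", "temp_c": 999.0},    # implausible outlier
]

def is_trustworthy(r):
    return r["temp_c"] is not None and -50.0 <= r["temp_c"] <= 60.0

clean = [r for r in readings if is_trustworthy(r)]
print(clean)   # only sensor A survives the checks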

Data exhibiting the four Vs (volume, velocity, variety and veracity) needs tools
for mining, pattern discovery, business intelligence, machine learning, text
analytics, descriptive and predictive analytics, and data visualization.

3 Write down the four computing resources of Big Data Storage?


4 What is HDFS?
ANS-
HDFS is an open-source distributed file system suitable for applications with high-
throughput access requirements for large amounts of data. HDFS has the concept of a
block, but it is a much larger unit: 128 MB by default. As in a filesystem for a single disk,
files in HDFS are broken into block-sized chunks, which are stored as independent units
[25]. Each HDFS block is replicated three times for fault tolerance.
However, there is a problem with using HDFS to store raster data directly: the
calculation procedure for each cell requires its adjacent cells, so if the raster data are
submitted directly to HDFS, a cell and its adjacent cells may be stored on different
nodes, which can introduce additional communication overhead.

It works on a master-slave architecture.

The name node acts as the master node.
The name node stores the metadata.
A file is divided into blocks.
The name node maps each block to a data node.
The default size of an HDFS block is 64 MB in Hadoop 1.0 and 128 MB in Hadoop 2.0.
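
To make the block idea concrete, the sketch below computes how a file would be split into fixed-size blocks and how much raw storage a replication factor of 3 would consume. This is back-of-the-envelope arithmetic in Python, not an actual HDFS API call, and the 500 MB example file is an assumption.

import math

def hdfs_block_layout(file_size_mb, block_size_mb=128, replication=3):
    """Back-of-the-envelope view of how HDFS would split and replicate a file."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)   # last block may be only partly filled
    raw_storage_mb = file_size_mb * replication            # each block stored on 3 datanodes
    return num_blocks, raw_storage_mb

# A 500 MB file with the Hadoop 2.0 default block size of 128 MB:
blocks, raw = hdfs_block_layout(500)
print(blocks, raw)   # -> 4 blocks, 1500 MB of raw storage across the cluster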

HDFS follows the master-slave architecture and it has the following elements.

Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system
and the namenode software. It is software that can run on commodity hardware. The
system having the namenode acts as the master server, and it does the following tasks:
Manages the file system namespace.
Regulates clients' access to files.
Executes file system operations such as renaming, closing, and opening files and
directories.

Datanode
The datanode is commodity hardware having the GNU/Linux operating system and
datanode software. For every node (commodity hardware/system) in a cluster, there will be a
datanode. These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file system, as per client requests.
They also perform operations such as block creation, deletion, and replication according to
the instructions of the namenode.

Block
Generally, the user data is stored in the files of HDFS. A file in the file system is divided
into one or more segments and/or stored on individual data nodes. These file segments are
called blocks. In other words, the minimum amount of data that HDFS can read or write is
called a block. The default block size is 64 MB (128 MB in Hadoop 2.0), but it can be
increased as needed by changing the HDFS configuration.
Features of HDFS
It is suitable for distributed storage and processing.
Hadoop provides a command interface to interact with HDFS; a hedged example of driving it
from Python appears below.
The built-in servers of the namenode and datanode help users easily check the status of the
cluster.
It provides streaming access to file system data.
HDFS provides file permissions and authentication.
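
Since the text mentions the command interface, here is a hedged sketch of driving a few standard "hdfs dfs" commands from Python. It assumes a running Hadoop installation with hdfs on the PATH, and the paths and file names are illustrative only.

import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' shell command and return its output (requires a Hadoop install)."""
    return subprocess.run(["hdfs", "dfs", *args], capture_output=True, text=True).stdout

# Typical interactions (paths are examples):
print(hdfs("-mkdir", "-p", "/user/demo"))        # create a directory
print(hdfs("-put", "local.txt", "/user/demo/"))  # copy a local file into HDFS
print(hdfs("-ls", "/user/demo"))                 # list the directory
print(hdfs("-cat", "/user/demo/local.txt"))      # read the file back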

5 What is Map Reduce?


ANS-
MapReduce is a framework used for building applications that
process huge volumes of data on large clusters of commodity
hardware.

Why MapReduce?
Traditional systems tend to use a centralized server for storing
and retrieving data. Such huge amounts of data cannot be
accommodated by standard database servers. Also, centralized
systems create too much of a bottleneck while processing
multiple files simultaneously.
Google came up with MapReduce to solve such bottleneck
issues. MapReduce divides a task into small parts and
processes each part independently by assigning the parts to
different systems. After all the parts are processed and
analyzed, the output of each computer is collected in one single
location and an output dataset is prepared for the given problem.

How does MapReduce work?


MapReduce is a programming paradigm or model used to
process large datasets with a parallel, distributed algorithm on a
cluster (source: Wikipedia). In Big Data analytics, MapReduce
plays a crucial role, and when it is combined with HDFS we can
use MapReduce to handle Big Data.
The basic unit of information used by MapReduce is a key-value
pair. All data, whether structured or unstructured, needs to be
translated into key-value pairs before it is passed through the
MapReduce model.
The MapReduce model, as the name suggests, has two different
functions: the Map function and the Reduce function. The order
of operation is always Map | Shuffle | Reduce.

Let us understand each phase in detail:

 Map stage: The Map stage is the crucial first step in the
MapReduce framework. The mapper gives a structure to the
unstructured data. For example, if I have to count the songs and
music files on my laptop by genre in my playlist, I will have to
analyze unstructured data. The mapper makes key-value pairs
from this dataset; in this case, the key is the genre and the
value is the music file. Once all this data is given to the mapper,
we have a whole dataset ready with the key-value pair structure.

The mapper works on one key-value pair at a time, and one
input may produce any number of outputs. Basically, the Map
function processes the data and produces several small chunks
of data.

 Reduce stage: The Shuffle stage and the Reduce stage
together are called the Reduce stage. The reducer takes the
output from the mapper as its input and produces the final
output as specified by the programmer. This new output is
saved to HDFS. The reducer takes all the key-value pairs from
the mapper and checks the association of every key with its
values. All the values associated with a single key are taken
together, and the reducer provides an output of any number of
key-value pairs.

From the Map and Reduce stages we can see that MapReduce
is a sequential computation: for any reducer to work, the mapper
must have completed its execution, and if that is not the case,
the Reduce stage won't run. Since the reducer has access to all
the values, it finds all values with the same key and performs
computations on them. Because different reducers work on
different keys, they can run simultaneously, and parallelism is
achieved.
Let us understand this by an example:
Suppose we have 4 sentences for processing:

1. Red, Green, Blue, Red, Blue
2. Green, Brown, Red, Yellow
3. Yellow, Blue, Green, Orange
4. Yellow, Orange, Red, Blue

When such an input is passed to the mapper, the mapper
divides it into two different subsets: the first is the subset of the
first two sentences, and the second is the subset of the
remaining two sentences. Now the mapper has:

 Subset 1: Red, Green, Blue, Red, Blue and Green, Brown, Red, Yellow
 Subset 2: Yellow, Blue, Green, Orange and Yellow, Orange, Red, Blue

The mapper makes key-value pairs for each subset. For our
example, the key is the colour and the value is 1 for each
occurrence, so the key-value pairs for subset 1 are (Red, 1),
(Green, 1), (Blue, 1) and so on, and similarly for subset 2.
Once this is done, the key-value pairs are given to the reducer
as input. The reducer then gives us the final count of all the
colours across our input subsets and combines the two outputs.
The reducer output will be (Red, 4), (Green, 3), (Blue, 4),
(Brown, 1), (Yellow, 3), (Orange, 2); the sketch below reproduces
this walk-through in plain Python.
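
The colour-count walk-through above can be reproduced with a tiny in-memory simulation of the Map | Shuffle | Reduce order. This is a teaching sketch in plain Python, not Hadoop code.

from collections import defaultdict

sentences = [
    "Red, Green, Blue, Red, Blue",
    "Green, Brown, Red, Yellow",
    "Yellow, Blue, Green, Orange",
    "Yellow, Orange, Red, Blue",
]

# Map: emit a (colour, 1) pair for every occurrence.
def map_phase(line):
    for colour in line.split(", "):
        yield colour, 1

# Shuffle: group all emitted values by key.
grouped = defaultdict(list)
for line in sentences:
    for key, value in map_phase(line):
        grouped[key].append(value)

# Reduce: combine the values of each key into a final count.
def reduce_phase(key, values):
    return key, sum(values)

result = dict(reduce_phase(k, v) for k, v in grouped.items())
print(result)   # {'Red': 4, 'Green': 3, 'Blue': 4, 'Brown': 1, 'Yellow': 3, 'Orange': 2}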
MapReduce also works on a master-slave architecture (one master, multiple slaves acting as
computing agents).

Map breaks individual elements of data into tuples (key/value pairs).

The tuples are sent to the reduce function (the reduce module).

The reduce module combines tuples on the basis of their keys to form a smaller set of tuples.

One master JobTracker:

Manages resources
Schedules tasks
Monitors tasks

Multiple TaskTrackers:

Execute the tasks
Provide task status

6 What is YARN?
7 What is Map Reduce Programming Model?
8 What are the characteristics of big data?
9 What is Big Data Platform?
10 What is Big Data? Give some examples related to big data?
11 Explain in detail about Types as well as sub types of Data?
12 Briefly discuss about Map Reduce and YARN.
13 Explain in detail about HDFS.
14 Write a note on: Yarn Architecture.
15 Explain Hadoop Ecosystem?
16 Write a note on: Apache Oozie, Sqoop, Apache Ambari, HBase, Apache Hive, Apache Pig.
17 Explain in detail about MAHOUT?
