Big Data
ANS-
Big Data is a high-volume, high-velocity, or high-variety information asset that requires
new forms of processing for enhanced decision making, insight discovery, and process
optimization.
Big data is a combination of structured, semi-structured, and unstructured data
collected by organizations that can be mined for information and used in machine
learning projects, predictive modeling, and other advanced analytics applications.
Systems that process and store big data have become a common component of data
management architectures in organizations, combined with tools that support big data
analytics uses. Big data is often characterized by the three V's:
the large volume of data in many environments;
the wide variety of data types frequently stored in big data systems; and
the velocity at which much of the data is generated, collected and processed.
These characteristics were first identified in 2001 by Doug Laney, then an analyst at
consulting firm Meta Group Inc.; Gartner further popularized them after it acquired
Meta Group in 2005. More recently, several other V's have been added to different
descriptions of big data, including veracity, value and variability.
Although big data doesn't equate to any specific volume of data, big data
deployments often involve terabytes, petabytes and even exabytes of data created and
collected over time.
Why is big data important?
o Companies use big data in their systems to improve operations, provide better
customer service, create personalized marketing campaigns and take other
actions that, ultimately, can increase revenue and profits. Businesses that use it
effectively hold a potential competitive advantage over those that don't
because they're able to make faster and more informed business decisions.
o For example, big data provides valuable insights into customers that
companies can use to refine their marketing, advertising and promotions in
order to increase customer engagement and conversion rates. Both historical
and real-time data can be analyzed to assess the evolving preferences of
consumers or corporate buyers, enabling businesses to become more
responsive to customer wants and needs.
o Big data is also used by medical researchers to identify disease signs and risk
factors and by doctors to help diagnose illnesses and medical conditions in
patients. In addition, a combination of data from electronic health records,
social media sites, the web and other sources gives healthcare organizations
and government agencies up-to-date information on infectious disease threats
or outbreaks.
o Here are some more examples of how big data is used by organizations:
In the energy industry, big data helps oil and gas companies identify
potential drilling locations and monitor pipeline operations; likewise,
utilities use it to track electrical grids.
Financial services firms use big data systems for risk management
and real-time analysis of market data.
Manufacturers and transportation companies rely on big data to
manage their supply chains and optimize delivery routes.
Other government uses include emergency response, crime prevention
and smart city initiatives.
o Big Data solutions are ideal for analyzing not only raw structured data, but also
semi-structured and unstructured data from a wide variety of sources.
o Big Data solutions are ideal when all, or most, of the data needs to be analyzed
rather than a sample, or when a sample is not nearly as effective as the larger
data set from which it is drawn.
o Big Data solutions are ideal for iterative and exploratory analysis when
business measures on data are not predetermined.
o Big Data is well suited for solving information challenges that don’t natively
fit within a traditional relational database approach for handling the problem at
hand.
1. Volume
Global mobile traffic was tallied at around 6.2 exabytes (6.2 billion GB) per
month in 2016, and the total amount of data stored worldwide was 800,000
petabytes in the year 2000.
Organizations that don’t know how to manage this data are overwhelmed by
it. The amount of data available to the enterprise is on the rise, while the
percentage of it that the enterprise can process, understand, and analyze is
on the decline, thereby creating a blind zone.
2. Variety
Another one of the most important Big Data characteristics is its variety. It
refers to the different sources of data and their nature. The sources of data
have changed over the years. Earlier, it was only available in spreadsheets
and databases. Nowadays, data is present in photos, audio files, videos,
text files, and PDFs.
The variety of data is crucial for its storage and analysis.
A variety of data can be classified into three distinct parts:
1. Structured data
2. Semi-Structured data
3. Unstructured data
Data consists of various forms and formats.
Variety arises from the availability of a large number of heterogeneous
platforms in the industry.
Variety represents all types of data: a fundamental shift in analysis
requirements from traditional structured data to include raw,
semi-structured, and unstructured data as part of the decision-making and
insight process.
About 80 percent of the world’s data is unstructured or semi-structured.
Examples:
A Twitter feed uses the JSON format.
Video and picture images aren’t easily stored in a relational database.
To capitalize on the Big Data opportunity, enterprises must be able to
analyze all types of data, both relational and nonrelational.
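As a small illustration of semi-structured data, a tweet-like JSON record can be parsed with Python's standard json module. The field names below are invented for illustration only and are not Twitter's actual API schema:

```python
import json

# A hypothetical tweet-like record in JSON. It is semi-structured: the
# schema travels with the data as field names, unlike a fixed relational table.
raw = '{"user": "alice", "text": "Big Data is everywhere", "retweets": 42}'

record = json.loads(raw)       # parse the JSON string into a Python dict
print(record["user"])          # -> alice
print(record["retweets"] + 1)  # numeric fields keep their type -> 43
```

Because each record carries its own field names, records with different or missing fields can sit side by side in the same feed, which is exactly what makes such data hard to force into a relational table.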
3. Velocity
This term refers to the speed at which data is created or generated. That
speed also determines how fast the data must be processed, because only
after analysis and processing can the data meet the demands of the
clients/users.
Massive amounts of data are produced from sensors, social media sites,
and application logs – and all of it is continuous. If the data flow is not
continuous, there is no point in investing time or effort on it.
As an example, per day, people generate more than 3.5 billion searches on
Google.
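A minimal sketch of velocity is processing events as they arrive rather than storing them all first. Here a Python generator stands in for a continuous feed; the event source and values are invented for illustration:

```python
def event_stream():
    """Stand-in for a continuous feed (sensor readings, log lines, etc.)."""
    for i in range(5):
        yield {"event_id": i, "value": i * 10}

# Process each event as it arrives instead of batching everything first.
running_total = 0
for event in event_stream():
    running_total += event["value"]

print(running_total)  # 0 + 10 + 20 + 30 + 40 = 100
```

A real stream-processing system applies the same pattern at scale: the aggregate is updated incrementally per event, so results are available while the stream is still flowing.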
5. Veracity
This feature of Big Data is connected to the previous one. It defines the
degree of trustworthiness of the data. As most of the data you encounter is
unstructured, it is important to filter out the unnecessary information and
use the rest for processing.
Veracity is one of the characteristics of big data analytics that denotes data
inconsistency as well as data uncertainty.
Data with the 4 V's (i.e., volume, velocity, variety, and veracity) needs tools
for mining, discovering patterns, business intelligence, machine learning, text
analytics, descriptive and predictive analytics, and data visualization.
HDFS follows the master-slave architecture and it has the following elements.
Namenode
The namenode is commodity hardware running the GNU/Linux operating system
and the namenode software, which can run on inexpensive machines. The
system hosting the namenode acts as the master server, and it performs the following tasks −
Manages the file system namespace.
Regulates client’s access to files.
It also executes file system operations such as renaming, closing, and opening files and
directories.
Datanode
The datanode is commodity hardware running the GNU/Linux operating system and the
datanode software. For every node (commodity hardware/system) in a cluster, there will be a
datanode. These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication according to
the instructions of the namenode.
Block
Generally, user data is stored in the files of HDFS. A file in the file system is divided
into one or more segments, which are stored in individual datanodes. These file segments are
called blocks. In other words, the minimum amount of data that HDFS can read or write is
called a block. The default block size is 64 MB (128 MB in Hadoop 2 and later), and it can be
increased as needed by changing the HDFS configuration.
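The arithmetic behind block splitting can be sketched as follows, using the 64 MB default stated above; the 200 MB file size is an invented example:

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default HDFS block size

def num_blocks(file_size_bytes):
    """Number of HDFS blocks a file of the given size occupies."""
    return math.ceil(file_size_bytes / BLOCK_SIZE)

# A 200 MB file needs ceil(200 / 64) = 4 blocks; the last block is only
# partially full, since blocks are a maximum size, not a fixed allocation.
print(num_blocks(200 * 1024 * 1024))  # -> 4
```

Each of those blocks is then replicated across several datanodes, which is why block size directly affects how a file is distributed over the cluster.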
Features of HDFS
It is suitable for distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of the namenode and datanode help users easily check the status of the
cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.
Why MapReduce?
Traditional systems tend to use a centralized server for storing
and retrieving data. Such huge amounts of data cannot be
accommodated by standard database servers, and
centralized systems create too much of a bottleneck when
processing multiple files simultaneously.
Google came up with MapReduce to solve such bottleneck
issues. MapReduce divides the task into small parts and
processes each part independently by assigning it to a different
system. After all the parts are processed and analyzed, the
output of each computer is collected in one single location, and
an output dataset is prepared for the given problem.
The mapper makes a key-value pair for each element of its subset. In our
example, the key is the colour and the value is the number of times it
has appeared. So we will have key-value pairs for subset 1 such as
(Red, 1), (Green, 1), (Blue, 1), and so on, and similarly for subset 2.
Once this is done, the key-value pairs are given to the reducer
as input. The reducer combines the outputs of the two subsets
and gives us the final count of each colour in our input.
The reducer output will be (Red, 4), (Green, 3), (Blue, 4), (Brown,
1), (Yellow, 3), (Orange, 2).
MapReduce also works on a master-slave architecture (one master, multiple
slaves acting as computing agents). The Reduce module combines tuples on the
basis of their key and forms a set of result tuples.
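The colour-count walkthrough above can be sketched in plain Python. This simulates the map, shuffle, and reduce phases in one process rather than on a cluster; the two subsets are invented so that their totals match the reducer output stated above:

```python
from collections import defaultdict

# Input split into two subsets, as in the walkthrough above.
subset1 = ["Red", "Green", "Blue", "Red", "Yellow", "Blue", "Orange", "Red", "Green"]
subset2 = ["Blue", "Brown", "Yellow", "Red", "Green", "Blue", "Orange", "Yellow"]

# Map phase: emit a (colour, 1) pair for every occurrence in every subset.
mapped = [(colour, 1) for subset in (subset1, subset2) for colour in subset]

# Shuffle phase: group all emitted pairs by key (the colour).
grouped = defaultdict(list)
for colour, count in mapped:
    grouped[colour].append(count)

# Reduce phase: sum the grouped counts to get the final tally per colour.
reduced = {colour: sum(counts) for colour, counts in grouped.items()}
print(reduced)  # Red: 4, Green: 3, Blue: 4, Brown: 1, Yellow: 3, Orange: 2
```

On a real Hadoop cluster the map calls run on different machines and the framework performs the shuffle over the network, but the per-phase logic is the same as in this single-process sketch.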
6 What is YARN?
7 What is Map Reduce Programming Model?
8 What are the characteristics of big data?
9 What is Big Data Platform?
10 What is Big Data? Give some examples related to big data.
11 Explain in detail about Types as well as sub types of Data?
12 Briefly discuss about Map Reduce and YARN.
13 Explain in detail about HDFS.
14 Write a note on: Yarn Architecture.
15 Explain Hadoop Ecosystem?
16 Write a note on: Apache Oozie, Sqoop, Apache Ambari, HBase, Apache Hive, Apache Pig.
17 Explain in detail about MAHOUT?