Big data

The term “Big Data” has recently been applied to datasets that grow so large that they become
awkward to work with using traditional database management systems. These are data sets whose
size is beyond the ability of commonly used software tools and storage systems to capture, store,
manage, and process within a tolerable elapsed time [12]. Big data sizes are constantly increasing,
currently ranging from a few dozen terabytes (TB) to many petabytes (PB) in a single data set.
Consequently, the difficulties related to big data include capture, storage, search, sharing,
analytics, and visualization. Today, enterprises explore large volumes of highly detailed data in
order to discover facts they did not know before [17]. Big data analytics is therefore the application
of advanced analytic techniques to big data sets. Analytics based on large data samples reveals and
drives business change. However, the larger the data set, the more difficult it becomes to manage.
Characteristics
Big data is data whose scale, distribution, diversity, and/or timeliness require the use of new
technical architectures, analytics, and tools in order to enable insights that unlock new sources of
business value. Three main features characterize big data: volume, variety, and velocity, or the
three V’s. Volume refers to the size of the data and how enormous it is. Velocity refers to the rate
at which the data changes, or how often it is created. Finally, variety covers the different formats
and types of data, as well as the different kinds of uses and ways of analyzing the data.
Some researchers and organizations have discussed the addition of a fourth V, veracity.
Veracity focuses on the quality of the data, characterizing big data quality as good, bad, or
undefined as a result of data inconsistency, incompleteness, ambiguity, latency, deception, and
approximation.

Big data analytics

Descriptive analytics answers the questions of what has happened, what is happening, and why.
In this process, visualization tools and online analytical processing (OLAP) systems are used,
supported by reporting technologies (e.g., RFID, GPS, and transaction bar codes) and real-time
information, to identify new opportunities and problems. Descriptive statistics are used to collect,
describe, and analyze the raw data of past events, turning them into something interpretable and
understandable by humans. Descriptive analytics enables organizations to learn from their past
and to understand the relationships between variables and how they can influence future outcomes.
For example, it can be used to report average revenue, stock in inventory, and annual changes in
sales. Descriptive analytics also underpins an organization’s financial, sales, operations, and
production reports.

Predictive analytics techniques are used to answer the question of what will happen, or is likely
to happen, in the future by examining past data trends with statistical, programming, and
simulation techniques. These techniques seek to discover the causes of events and phenomena, to
predict the future as accurately as possible, and to fill in information that does not yet exist.
Statistical techniques cannot predict the future with 100% accuracy. Predictive analytics is used
to anticipate customer behavior and purchasing patterns and to identify trends in sales activities.
These techniques are also used to forecast customer demand, inventory levels, and operations.

Prescriptive analytics deals with the question of what should be happening and how to influence
it. It guides decisions among alternatives by combining descriptive and predictive analytics with
simulation, mathematical optimization, or multi-criteria decision-making techniques. The
application of prescriptive analytics is relatively complex in practice, and most companies are
still unable to apply it in their daily business activities, although its correct application can lead
to optimal and efficient decision making. A number of large companies have used data analytics
to optimize production and inventory. Some of the crucial questions that prescriptive analytics
allows companies to answer include the following:
 What kind of offer should be made to each end user?
 What should the shipment strategy be for each retail location?
 Which product should be launched, and when?
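To make the contrast between descriptive and predictive analytics concrete, here is a small illustrative Java sketch on a toy monthly-sales series (the numbers are invented for the example). The descriptive step summarizes the past by computing the average; the predictive step fits a simple least-squares trend line and extrapolates one month ahead. Real big data analytics would, of course, apply far richer statistical or machine-learning models to much larger data sets.

// Illustrative sketch only: toy data and a deliberately simple model.
public class SalesAnalytics {
    public static void main(String[] args) {
        double[] monthlySales = {120, 135, 128, 150, 162, 170}; // invented figures

        // Descriptive analytics: summarize what has already happened.
        double total = 0;
        for (double s : monthlySales) {
            total += s;
        }
        double mean = total / monthlySales.length;
        System.out.printf("Average monthly sales: %.1f%n", mean);

        // Predictive analytics: fit a least-squares trend line y = slope*x + intercept
        // over the past months and extrapolate one month ahead.
        int n = monthlySales.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int x = 0; x < n; x++) {
            sumX += x;
            sumY += monthlySales[x];
            sumXY += x * monthlySales[x];
            sumXX += (double) x * x;
        }
        double slope = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double intercept = (sumY - slope * sumX) / n;
        double forecast = slope * n + intercept; // next month's predicted sales
        System.out.printf("Forecast for next month: %.1f%n", forecast);
    }
}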
Hadoop
Hadoop is a framework written in Java that utilizes a large cluster of commodity hardware to
store and process big data. Hadoop works on the MapReduce programming model that was
introduced by Google. Today many big-brand companies use Hadoop in their organizations to
deal with big data, e.g., Facebook, Yahoo, Netflix, and eBay. The Hadoop architecture mainly
consists of four components:
 MapReduce
 HDFS (Hadoop Distributed File System)
 YARN (Yet Another Resource Negotiator)
 Common utilities (Hadoop Common)
MapReduce is a programming model that runs on top of the YARN framework. Its major feature
is to perform distributed processing in parallel across a Hadoop cluster, which is what makes
Hadoop so fast; when you are dealing with big data, serial processing is no longer of any use.
MapReduce has two main tasks, divided into phases:
in the first phase the Map function is applied, and in the next phase the Reduce function is applied.

Here, we can see that the input is provided to the Map() function, its output is then used as the
input to the Reduce() function, and after that we receive our final output. Let’s understand what
Map() and Reduce() do.
The input provided to Map() is a set of data; since we are dealing with big data, it arrives as data
blocks. The Map() function breaks these data blocks into tuples, which are nothing but key-value
pairs. These key-value pairs are then sent as input to Reduce(). The Reduce() function combines
the tuples that share the same key, forms a new set of tuples, and performs operations such as
sorting or summation, which are then sent to the final output node. Finally, the output is obtained.
The data processing is always done in the reducer, depending on the business requirements of the
industry. This is how first Map() and then Reduce() are utilized, one after the other.
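As a concrete illustration of this Map-then-Reduce flow, below is a minimal sketch of the classic word-count job written against Hadoop’s Java MapReduce API (org.apache.hadoop.mapreduce). The mapper emits a (word, 1) pair for every word it reads, and the reducer sums the counts for each key; the input and output paths are passed as command-line arguments and are placeholders.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map(): split each input line into words and emit a (word, 1) pair per word.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit the key-value pair
            }
        }
    }

    // Reduce(): receive all counts for one word (key) and sum them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory (placeholder)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (placeholder)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged as a JAR, such a job would typically be submitted with a command along the lines of hadoop jar wordcount.jar WordCount /input /output, where /input and /output are example HDFS directories.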
HDFS

HDFS (Hadoop Distributed File System) is utilized for storage. It is mainly designed to work on
commodity (inexpensive) hardware devices and follows a distributed file system design. HDFS is
built on the idea of storing data in a few large blocks rather than in many small blocks.
HDFS provides fault tolerance and high availability to the storage layer and to the other devices
present in the Hadoop cluster. The data storage nodes in HDFS are:

 NameNode (master)
 DataNode (slave)
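As a small, hedged illustration of how a client interacts with these nodes, the sketch below uses Hadoop’s Java FileSystem API (org.apache.hadoop.fs) to write and then read a file on HDFS. The NameNode address hdfs://namenode:9000 and the path /user/demo/hello.txt are placeholder values chosen for the example. When writing, the client asks the NameNode where to place the blocks and streams the data to DataNodes; when reading, the NameNode returns the block locations and the client reads the data directly from a DataNode.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // NameNode address; on a real cluster this normally comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt"); // placeholder path

        // Write: blocks are streamed to DataNodes and replicated by HDFS.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: block locations come from the NameNode, data from a DataNode.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[(int) fs.getFileStatus(file).getLen()];
            in.readFully(buf);
            System.out.println(new String(buf, StandardCharsets.UTF_8));
        }

        fs.close();
    }
}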

YARN (Yet Another Resource Negotiator)

YARN is the framework on which MapReduce works. YARN performs two operations: job
scheduling and resource management. The purpose of the job scheduler is to divide a big task into
small jobs so that each job can be assigned to the various slaves in a Hadoop cluster and processing
can be maximized. The job scheduler also keeps track of which jobs are important, which jobs
have higher priority, the dependencies between jobs, and other information such as job timing.
The resource manager is used to manage all the resources that are made available for running the
Hadoop cluster.
Features of YARN

 Multi-Tenancy
 Scalability
 Cluster-Utilization
 Compatibility
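To make the scheduling, resource-management, and multi-tenancy ideas slightly more concrete, here is a hedged sketch of a few standard Hadoop configuration properties a MapReduce job can set before submission: the queue name (queues are how the YARN scheduler separates tenants) and the per-container memory the ResourceManager should allocate. The queue name "analytics" and the memory values are arbitrary examples, and the job is only configured and inspected here, not submitted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class YarnTunedJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Scheduler queue for the job (example queue name).
        conf.set("mapreduce.job.queuename", "analytics");

        // Memory (in MB) the ResourceManager should allocate per map/reduce container
        // (example values only).
        conf.setInt("mapreduce.map.memory.mb", 2048);
        conf.setInt("mapreduce.reduce.memory.mb", 4096);

        // Build the job; mapper, reducer, and input/output paths would be set
        // as in the WordCount sketch before actually submitting it.
        Job job = Job.getInstance(conf, "yarn-tuned job");
        System.out.println("Queue: " + job.getConfiguration().get("mapreduce.job.queuename"));
        System.out.println("Map container memory (MB): "
                + job.getConfiguration().getInt("mapreduce.map.memory.mb", -1));
    }
}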

Hadoop Common or Common Utilities

Hadoop Common, or the common utilities, is nothing but the set of Java libraries and files needed
by all the other components present in a Hadoop cluster. These utilities are used by HDFS, YARN,
and MapReduce to run the cluster. Hadoop Common assumes that hardware failure in a Hadoop
cluster is common, so failures need to be handled automatically in software by the Hadoop
framework.
