What Is Big Data?
Big data is data, but of enormous size. The term describes a collection of data that
is already huge and keeps growing exponentially over time. In short, such data is so
large and complex that traditional data management tools cannot store or process it
efficiently.
Examples of big data:
1. The New York Stock Exchange generates about one terabyte of new trade data
per day.
2. Statistics show that more than 500 terabytes of new data are ingested into the
databases of the social media site Facebook every day. This data comes mainly from
photo and video uploads, message exchanges, comments, and so on.
3. A single jet engine can generate more than 10 terabytes of data in 30 minutes of
flight time. With many thousands of flights per day, the data generated reaches many
petabytes.
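The jet engine figure can be turned into a rough daily estimate. The sketch below is illustrative only: the flight count, the average flight duration, and the one-engine-per-flight simplification are assumed values, not figures from the source, and it uses decimal units (1 PB = 1,000 TB).

```python
# Back-of-the-envelope estimate of daily jet engine data volume.
# Only the 10 TB per 30 minutes rate comes from the text above;
# every other input is an assumption chosen for illustration.

TB_PER_HALF_HOUR = 10        # from the text: 10+ TB per 30 minutes of flight
FLIGHTS_PER_DAY = 5_000      # assumption: "many thousands of flights per day"
AVG_FLIGHT_HOURS = 2.0       # assumption: average flight duration
ENGINES_PER_FLIGHT = 1       # assumption: count a single engine per flight


def daily_volume_tb(flights, hours_per_flight, engines=1):
    """Return the estimated data volume in terabytes for one day."""
    half_hours = hours_per_flight * 2
    return flights * engines * half_hours * TB_PER_HALF_HOUR


total_tb = daily_volume_tb(FLIGHTS_PER_DAY, AVG_FLIGHT_HOURS, ENGINES_PER_FLIGHT)
total_pb = total_tb / 1_000  # decimal units: 1 PB = 1,000 TB
print(f"{total_tb:,.0f} TB per day = {total_pb:,.0f} PB per day")
```

With these assumed inputs the estimate comes to 200 PB per day. Changing any assumption scales the result linearly.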
Data sets grow rapidly, in part because they are increasingly gathered by cheap and
numerous information-sensing Internet of Things devices such as mobile devices, aerial
(remote sensing) equipment, software logs, cameras, microphones, radio-frequency
identification (RFID) readers and wireless sensor networks.
According to an IDC report, the global data volume will grow exponentially from
4.4 zettabytes to 44 zettabytes between 2013 and 2020. By 2025, IDC predicts there
will be 163 zettabytes of data.
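To make "exponentially" concrete, the IDC figures above can be converted into an implied yearly growth rate. The sketch below assumes a constant compound rate over each period, which is a simplification made here and not something the IDC figures state.

```python
# Implied compound annual growth rate for the IDC figures quoted above.
# Assumption: growth is modelled as a constant yearly rate in each period.


def annual_growth_rate(start, end, years):
    """Return the constant yearly rate that grows `start` into `end`."""
    return (end / start) ** (1 / years) - 1


rate_2013_2020 = annual_growth_rate(4.4, 44.0, 2020 - 2013)
rate_2020_2025 = annual_growth_rate(44.0, 163.0, 2025 - 2020)

print(f"2013-2020: {rate_2013_2020:.1%} per year")
print(f"2020-2025: {rate_2020_2025:.1%} per year")
```

Under this model the first period works out to roughly 39% growth per year and the second to roughly 30% per year.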
Volume is the amount of data, and the amount matters. With big data, you will have to
process high volumes of low-density, unstructured data. This can be data of unknown
value, such as Twitter data feeds, clickstreams on a webpage or a mobile app, or
readings from sensor-enabled equipment. For some organizations, this might be tens of
terabytes of data. For others, it may be hundreds of petabytes.
Velocity is the fast rate at which data is received and (perhaps) acted on. Normally, the
highest-velocity data streams directly into memory rather than being written to disk.
Some internet-enabled smart products operate in real time or near real time and will
require real-time evaluation and action.
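One way to picture acting on a high-velocity stream in memory is a sliding window that keeps only the most recent readings and reacts as each one arrives. The sketch below is a minimal illustration and not a production design: the function name, the sensor readings, the window size, and the alert threshold are all hypothetical.

```python
from collections import deque


def rolling_alerts(readings, window_size=3, threshold=80.0):
    """Yield (reading, rolling_mean) whenever the rolling mean exceeds threshold.

    Only the last `window_size` readings are held in memory; nothing is
    written to disk.
    """
    window = deque(maxlen=window_size)  # oldest reading drops off automatically
    for value in readings:
        window.append(value)
        mean = sum(window) / len(window)
        if mean > threshold:
            yield value, mean


# Hypothetical temperature readings arriving one at a time.
stream = iter([70.0, 75.0, 82.0, 90.0, 95.0, 60.0])
alerts = list(rolling_alerts(stream))
for value, mean in alerts:
    print(f"alert: reading={value} rolling_mean={mean:.1f}")
```

Because the function is a generator, each reading is evaluated as soon as it arrives, and memory use stays bounded by the window size however long the stream runs.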
Variety refers to the many types of data that are available. Traditional data types were
structured and fit neatly in a relational database. With the rise of big data, data comes in
new unstructured data types. Unstructured and semistructured data types, such as text,
audio, and video, require additional preprocessing to derive meaning and support
metadata.
Two more Vs have emerged over the past few years: variability and value.
Variability: This refers to the inconsistency that the data can show at times, which
hampers the ability to handle and manage the data effectively.
Value: Data has intrinsic value, but it is of no use until that value is discovered.
Equally important: how truthful is your data, and how much can you rely on it?
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used
text search library. Hadoop has its origins in Apache Nutch, an open source web search
engine, itself a part of the Lucene project. Nutch was started in 2002, and a working
crawler and search system quickly emerged.
However, its creators realized that their architecture wouldn’t scale to the billions of
pages on the Web. In 2004, drawing on Google's 2003 paper describing the Google File
System (GFS), Nutch’s developers set about writing an open source implementation, the
Nutch Distributed Filesystem (NDFS). Also in 2004, Google published the paper that
introduced MapReduce to the world. Early in 2005, the Nutch developers had a working
MapReduce implementation in Nutch, and by the middle of that year all the major Nutch
algorithms had been ported to run using MapReduce and NDFS. In January 2006, Yahoo!
hired Doug Cutting, and a month later Yahoo! decided to abandon its own prototype and
adopt Hadoop. In January 2008, Hadoop was made its own top-level project at Apache,
confirming its success and its diverse, active community. By this time, Hadoop was being
used by many other companies besides Yahoo!, such as Last.fm, Facebook, and the
New York Times.
Types of big data:
Big data can be found in three forms:
1. Structured:
Any data that can be stored, accessed and processed in a fixed format is termed
structured data.
Example: rows and columns in a relational database table.
2. Unstructured:
Any data whose form or structure is unknown is classified as unstructured data.
Typical examples of unstructured data are images, videos, and Facebook messages.
3. Semi-structured:
Semi-structured data can contain both forms of data.
This type of data has some particular format but no fixed schema.
Example: XML data.
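The XML case can be illustrated with a short sketch using Python's standard xml.etree.ElementTree module. The document, the tag names, and the parse_people helper are hypothetical. The point is that both records share a recognizable format (nested tags) even though they do not carry the same set of fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML: both records use the <person> tag, but the second
# one has an <email> element and no <age>, so there is no fixed schema.
XML_DATA = """
<people>
  <person><name>Asha</name><age>31</age></person>
  <person><name>Ben</name><email>ben@example.com</email></person>
</people>
"""


def parse_people(xml_text):
    """Return one dict per <person>, keeping whatever fields are present."""
    root = ET.fromstring(xml_text.strip())
    return [
        {child.tag: child.text for child in person}
        for person in root.findall("person")
    ]


records = parse_people(XML_DATA)
for record in records:
    print(record)
```

A relational table would force both records into the same columns. Here each record simply carries the fields it has, and the reader decides how to interpret them.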
Attribute       Traditional RDBMS            MapReduce
Data size       Gigabytes                    Petabytes
Access          Interactive and batch        Batch
Updates         Read and write many times    Write once, read many times
Transactions    ACID                         None
Structure       Schema-on-write              Schema-on-read
Integrity       High                         Low
Scaling         Nonlinear                    Linear
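The batch style of MapReduce in the comparison above can be sketched with a word count in plain Python. This is a single-process illustration of the map, shuffle, and reduce phases under simplifying assumptions. It is not Hadoop's API, and the function names and input lines are invented for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over the whole input in one batch."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input, read once and never modified (write once, read many).
lines = ["big data is big", "data is growing"]
counts = word_count(lines)
print(counts)
```

Each call to map_phase and reduce_phase depends only on its own input, which is why a real framework can spread this work across many machines and scale roughly linearly with the size of the cluster.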