Big Data Tools and Applications Assignment
PGDM – A
ROLL NO. 12
BIG DATA TOOLS AND APPLICATIONS
ASSIGNMENT
Volume -
Volume refers to the unimaginable amounts of information generated every second from social media, cell phones, cars, credit cards, M2M sensors, images, video and more. To handle it, we currently use distributed systems that store data in several locations and bring it together through a software framework such as Hadoop.
Facebook alone generates billions of messages, about 4.5 billion presses of the "like" button, and over 350 million new posts each day. Such a huge amount of data can only be handled by Big Data technologies.
Variety -
As discussed before, Big Data is generated in multiple varieties. Compared to traditional data like phone numbers and addresses, the latest data arrives in the form of photos, videos, audio and much more, leaving about 80% of all data completely unstructured.
Darvesh Singh Bedi
PGDM – A
ROLL NO. 12
Value
Value is the aspect we most need to concentrate on. It is not just the amount of data we store or process that matters; it is the amount of valuable, reliable and trustworthy data that must be stored, processed and analyzed to find insights.
Velocity
Last but not least, Velocity plays a major role: there is no point in investing so much only to end up waiting for the data. So a major aspect of Big Data is to provide data on demand and at a faster pace.
Q5) Why do we use HDFS for applications having large data sets?
Ans - The Hadoop Distributed File System is better suited to a large amount of data in a single file than to a small amount of data spread across multiple files. This is because the NameNode is a very expensive, high-performance system, so it is not prudent to fill its memory with the unnecessary metadata generated by many small files. When a large amount of data sits in a single file, the NameNode occupies less space per byte stored. Hence, for optimized performance, HDFS favours large data sets over multiple small files.
The conventional wisdom is that HDFS handles small files poorly because of its large block size and the memory constraints of the NameNode.
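The NameNode cost described above can be made concrete with a back-of-the-envelope sketch. The ~150 bytes per namespace object used here is a commonly cited rule of thumb for HDFS, not an exact figure, and the file counts are illustrative assumptions:

```python
# Rough sketch of NameNode memory pressure: many small files vs one large file.
# BYTES_PER_OBJECT is a commonly cited rule of thumb, used only for illustration.
BYTES_PER_OBJECT = 150
BLOCK_SIZE = 64 * 1024 * 1024  # classic default HDFS block size

def namenode_metadata_bytes(num_files: int, file_size: int) -> int:
    """Approximate NameNode memory for num_files files of file_size bytes each."""
    blocks_per_file = max(1, -(-file_size // BLOCK_SIZE))  # ceiling division
    objects = num_files * (1 + blocks_per_file)            # 1 file entry + its blocks
    return objects * BYTES_PER_OBJECT

one_gib = 1024 ** 3
big = namenode_metadata_bytes(1, one_gib)             # 1 GiB as a single file
small = namenode_metadata_bytes(10_000, one_gib // 10_000)  # same data, 10,000 files
print(big, small)  # the many-small-files layout needs vastly more metadata
```

Under these assumptions, storing the same gigabyte as ten thousand small files multiplies the NameNode's metadata burden by three orders of magnitude, which is exactly why HDFS favours large files.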
A study of small files in HDFS examined the actual space allocated for blocks and found that, while a block is nominally 64 MB, the actual allocation is limited to the actual file size. The concern that large block sizes waste disk space on small files was not confirmed by this investigation.
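That finding can be sketched in a few lines (the block size and file sizes are illustrative assumptions): a file smaller than the block size still consumes one block entry of NameNode metadata, but only its own bytes on disk.

```python
# Sketch of the study's finding: HDFS allocates a block entry per file chunk,
# but on-disk usage is only the file's real size, not a padded 64 MB block.
BLOCK_SIZE = 64 * 1024 * 1024

def blocks_allocated(file_size: int) -> int:
    # Even an almost-empty file occupies one block entry in the namespace.
    return max(1, -(-file_size // BLOCK_SIZE))  # ceiling division

def disk_bytes_used(file_size: int) -> int:
    # Blocks are not padded: the DataNode stores only the actual bytes.
    return file_size

one_mib = 1024 * 1024
print(blocks_allocated(one_mib))  # one block of metadata...
print(disk_bytes_used(one_mib))   # ...but only 1 MiB on disk, not 64 MiB
```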
b) NameNode :- It works as the Master in a Hadoop cluster. Listed below are the main functions performed by the NameNode:
1. Stores metadata of the actual data.
2. Manages the file system namespace.
3. Regulates client access requests for the actual file data.
4. Executes file system namespace operations such as opening/closing files and renaming files and directories.
5. As the NameNode keeps metadata in memory for fast retrieval, a huge amount of memory is required for its operation. It should be hosted on reliable hardware.
The DataNode works as a Slave in a Hadoop cluster. Listed below are the main functions performed by the DataNode:
1. Actually stores the business data.
2. Handles the actual workload: reads, writes and data processing.
3. Upon instruction from the Master, performs creation/replication/deletion of data blocks.
4. As all the business data is stored on DataNodes, a huge amount of storage is required for their operation. Commodity hardware can be used for hosting DataNodes.
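The master/slave split above can be modelled with a small toy sketch. The class and method names below are invented for illustration and are not Hadoop's actual API: the NameNode holds only metadata (which blocks make up a file, and where their replicas live), while DataNodes hold the block bytes.

```python
# Toy model of the HDFS master/slave split: metadata on the NameNode,
# actual block bytes on DataNodes. Names are illustrative, not Hadoop's API.
from itertools import cycle

class DataNode:
    def __init__(self, name: str):
        self.name = name
        self.blocks: dict[int, bytes] = {}  # block id -> stored bytes

    def store(self, block_id: int, data: bytes) -> None:
        self.blocks[block_id] = data

class NameNode:
    def __init__(self, datanodes, replication: int = 2):
        self.datanodes = list(datanodes)
        self.replication = replication
        self.namespace: dict[str, list[int]] = {}  # path -> block ids
        self.locations: dict[int, list[str]] = {}  # block id -> datanode names
        self._next_block = 0
        self._rr = cycle(self.datanodes)  # round-robin block placement

    def write(self, path: str, data: bytes, block_size: int = 4) -> None:
        """Split data into blocks and tell DataNodes to store each replica."""
        self.namespace[path] = []
        for i in range(0, len(data), block_size):
            bid, self._next_block = self._next_block, self._next_block + 1
            self.namespace[path].append(bid)
            targets = [next(self._rr) for _ in range(self.replication)]
            for dn in targets:  # the master instructs the slaves to store
                dn.store(bid, data[i:i + block_size])
            self.locations[bid] = [dn.name for dn in targets]

dns = [DataNode(f"dn{i}") for i in range(3)]
nn = NameNode(dns)
nn.write("/logs/app.log", b"hello hdfs!")
print(nn.namespace["/logs/app.log"])      # NameNode: just the block ids
print(sum(len(dn.blocks) for dn in dns))  # DataNodes: the replicated copies
```

Note how the NameNode never touches file contents after handing out block placements; this is why its hardware needs memory and reliability while DataNodes only need cheap bulk storage.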
1. Standalone Mode
By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. Instead of HDFS, this mode uses the local file system. It is useful for debugging, and there is no need to configure core-site.xml, hdfs-site.xml, mapred-site.xml, masters & slaves.
2. Pseudo-Distributed Mode (single node) – Hadoop can also run on a single node in pseudo-distributed mode. In this mode each daemon runs in a separate Java process, and custom configuration is required.
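The custom configuration mentioned above typically amounts to a pair of site files. The following is a minimal sketch; the `localhost:9000` address and replication factor of 1 are the commonly used single-node values, shown here as assumptions rather than settings taken from this assignment:

```xml
<!-- core-site.xml: point the default file system at a local HDFS daemon -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single node can hold only one replica of each block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```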
3. Fully Distributed Mode :-
This is the production mode of Hadoop. In this mode, typically one machine in the cluster is designated exclusively as the NameNode and another as the Resource Manager; these are the masters. All other nodes act as DataNodes and Node Managers. This mode offers fully distributed computing capability, reliability, fault tolerance and scalability.
DBMS vs Hadoop:
- DBMS: traditional row-column based databases, basically used for data storage, manipulation and retrieval. Hadoop: open-source software used for storing data and running applications or processes concurrently.
- DBMS is best suited for an OLTP environment. Hadoop is best suited for Big Data.
- The data schema of an RDBMS is static; the data schema of Hadoop is dynamic.
Q10) What are the advantages, application areas and challenges for Big Data?
5. Media and Entertainment Sector: Media and entertainment service providers such as Netflix, Amazon Prime and Spotify analyze the data collected from their users. Data such as which videos or music users watch or listen to most, and how long users spend on the site, is collected and analyzed to set the next business strategy.