
Darvesh Singh Bedi

PGDM – A
ROLL NO. 12
BIG DATA TOOLS AND APPLICATIONS
ASSIGNMENT

Q1) What is Big Data?


Ans - Big data is a field that treats ways to analyze, systematically extract information from, or otherwise
deal with data sets that are too large or complex to be dealt with by traditional data-processing application
software. Big data can be analyzed for insights that lead to better decisions and strategic business moves.
The use of Big Data is becoming common as companies seek to outperform their peers. In most
industries, existing competitors and new entrants alike use strategies derived from analyzed
data to compete, innovate, and capture value.
Big Data helps organizations create new growth opportunities and entirely new categories of
companies that combine and analyze industry data. These companies hold ample information about
products and services, buyers and suppliers, and consumer preferences, all of which can be captured and analyzed.
Big Data is used in many sectors, for example the education sector (grading systems) and the insurance
industry (threat mapping, fraud detection).

Q2) What are the four characteristics of Big Data?


Ans – The four characteristics of Big Data are:

 Volume -
Volume refers to the enormous amounts of information generated every second from social media,
cell phones, cars, credit cards, M2M sensors, images, video, and more. Distributed systems are currently
used to store this data across several locations, brought together by a software framework like Hadoop.
Facebook alone generates billions of messages, records around 4.5 billion presses of the "like" button,
and receives over 350 million new posts each day. Such a huge amount of data can only be handled
by Big Data technologies.
 Variety -
As discussed above, Big Data is generated in multiple varieties. Compared to traditional data like
phone numbers and addresses, the latest trend is data in the form of photos, videos, audio, and many
other formats, making about 80% of the data completely unstructured.
 Value
Value is the major issue we need to concentrate on. It is not just the amount of data that we store or
process that matters; it is the amount of valuable, reliable, and trustworthy data that needs to be stored,
processed, and analyzed to find insights.
 Velocity
Last but not least, velocity plays a major role: there is no point in investing so much only to end up
waiting for the data. The major aspect of Big Data is therefore to provide data on demand and at a
faster pace.

Q3) How is analysis of Big Data useful for Organizations?


Ans – Big data analysis is useful in many kinds of organizations:

Big data in the Government sector


Along with many other areas, big data in government can have an enormous impact at the local, national,
and global levels. With so many complex issues on the table today, governments have their work cut out trying to
make sense of all the information they receive and to make vital decisions that affect millions of people.
Governments of every country face huge amounts of data on an almost daily basis, since they must keep
track of various records and databases regarding their citizens. Proper study and analysis of this data
helps governments in endless ways. A few of them are:
 Welfare schemes
 Cyber security

Big Data in the Education industry


 Customized and dynamic learning programs
 Grading Systems
 Career prediction
Big Data in the Insurance industry
The insurance industry is important not only for individuals but also for business companies. Insurance
holds a significant place because it supports people during times of adversity and uncertainty. The data
collected from sources across this industry comes in varying formats and changes at tremendous
speed.
 Gaining customer insight
Determining customer experience and making customers the center of a company's attention is of
prime importance to organizations.
 Fraud detection
Insurance fraud is a common occurrence, and big data use cases for reducing fraud are highly
effective.
 Threat mapping
When an insurance agency sells a policy, it wants to be aware of all the possibilities of things
going unfavourably for the customer that could lead to a claim.
 
Big Data in the Banking sector
Study and analysis of big data can help in detecting:
 The misuse of credit cards
 Misuse of debit cards
 Risk Mitigation
 Venture credit hazard treatment
 Business clarity
 Customer statistics alteration
 Money laundering

Q4) Why do we need Hadoop?


Ans - Hadoop is an open-source software framework for storing data and running applications on clusters
of commodity hardware. It provides massive storage for any kind of data, enormous processing power
and the ability to handle virtually limitless concurrent tasks or jobs.
 Ability to store and process huge amounts of any kind of data, quickly. With data volumes
and varieties constantly increasing, especially from social media and the Internet of Things (IoT),
that's a key consideration.
 Computing power. Hadoop's distributed computing model processes big data fast. The more
computing nodes you use, the more processing power you have.
 Fault tolerance. Data and application processing are protected against hardware failure. If a node
goes down, jobs are automatically redirected to other nodes to make sure the distributed
computing does not fail. Multiple copies of all data are stored automatically.
 Flexibility. Unlike traditional relational databases, you don’t have to preprocess data before
storing it. You can store as much data as you want and decide how to use it later. That includes
unstructured data like text, images and videos.
 Low cost. The open-source framework is free and uses commodity hardware to store large
quantities of data.
 Scalability. You can easily grow your system to handle more data simply by adding nodes. Little
administration is required.
Demand for Hadoop:
The low cost of implementing the Hadoop platform is attracting companies to adopt the technology. As
per a report by Allied Market Research, the market for Hadoop is projected to rise from
$1.5 billion in 2012 to an estimated $16.1 billion by 2020. Notably, the data management
industry has expanded from software and the web into retail, hospitals, government, etc. This creates a huge
demand for scalable and cost-effective data storage platforms like Hadoop. Let us take a look at how
Hadoop helps in providing excellent analytics services.
All business data captured and stored successfully:
Enterprises and organizations have estimated that they use and analyze only a small volume of their
information, while most of it goes to waste because they lack analytics capabilities. It is bad
practice to label data as unwanted, since any part of it can be put to good use by the organization.
So it is necessary to collect and keep all the data in a well-organized manner. With its capability of
handling large volumes of data, Hadoop has helped companies store and analyze high volumes of data successfully.
Hadoop makes data sharing easier:
Organizations use big data to improve the functionality of every business unit, including
research, design, development, marketing, advertising, sales, and customer handling, but sharing data
across different platforms is difficult. Hadoop can be used to create a data lake: a single repository
for data from various sources, both intrinsic and extrinsic.
Hadoop also supports Advanced Analytics.

Q5) Why do we use HDFS for applications having large data sets?
Ans - The Hadoop Distributed File System is more suitable for a large amount of data in a single file
than for small amounts of data spread across multiple files. This is because the NameNode is a very
expensive, high-performance system, so it is not prudent to fill its space with the
unnecessary metadata that is generated for many small files. When a large
amount of data sits in a single file, the NameNode occupies less space. Hence, for optimized
performance, HDFS favors large data sets over many small files.

The conventional wisdom is that HDFS handles many small files poorly because of its large block size
and the memory constraints of the NameNode.
 A study of small files in HDFS investigated the actual space allocated for file blocks and determined
that, while the block size is nominally 64 MB, the actual allocation is limited to the actual file size.
The concern about wasted disk space from large block sizes with small files was not confirmed by this
investigation; the real cost of small files is the NameNode metadata described above.
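To make that metadata cost concrete, here is a back-of-the-envelope Python sketch. It assumes the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per file system object (file, directory, or block) and the older 64 MB default block size; both figures vary by Hadoop version, so treat the output as illustrative only.

```python
# Rough comparison of NameNode metadata cost: one large file vs. many small files.
# ASSUMPTION: ~150 bytes of NameNode heap per filesystem object (file or block),
# a commonly cited rule of thumb rather than an exact figure.

BYTES_PER_OBJECT = 150            # assumed heap cost per file/block object
BLOCK_SIZE = 64 * 1024 * 1024     # 64 MB, the older HDFS default block size

def namenode_heap(total_bytes: int, num_files: int) -> int:
    """Approximate NameNode heap needed to track total_bytes split over num_files."""
    bytes_per_file = total_bytes // num_files
    blocks_per_file = max(1, -(-bytes_per_file // BLOCK_SIZE))  # ceiling division
    objects = num_files * (1 + blocks_per_file)                 # 1 file entry + its blocks
    return objects * BYTES_PER_OBJECT

TB = 1024 ** 4
print(namenode_heap(TB, num_files=1))          # 1 TB in one file: a few MB of heap
print(namenode_heap(TB, num_files=1_000_000))  # 1 TB in 1M files: hundreds of MB
```

The same terabyte costs orders of magnitude more NameNode memory when it arrives as a million small files, which is exactly why HDFS favors large files.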

Q6) Write short notes on:


a) Fault tolerance
b) Namenode & datanode
c) Job tracker & task tracker
d) Heartbeat and block in HDFS
Ans :
a) Fault tolerance :- It refers to the ability of a system (computer, network, cloud cluster, etc.) to
continue operating without interruption when one or more of its components fail. The objective of
creating a fault-tolerant system is to prevent disruptions arising from a single point of failure,
ensuring the high availability and business continuity of mission-critical applications or systems.
Fault-tolerant systems use backup components that automatically take the place of failed components,
ensuring no loss of service.
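As a toy illustration (not Hadoop code), the Python sketch below shows failover through redundancy: a read is served by the first healthy replica, so a single failed component does not interrupt service. The node names and the failed set are invented for the example.

```python
REPLICAS = ["node-a", "node-b", "node-c"]  # nodes assumed to hold copies of a block
FAILED = {"node-a"}                        # simulate one crashed component

def fetch_from(node: str, block_id: str) -> str:
    # Stand-in for a network read; a failed node raises instead of answering.
    if node in FAILED:
        raise ConnectionError(f"{node} is down")
    return f"data for {block_id} from {node}"

def read_block(block_id: str) -> str:
    # Try replicas in order; failover means moving on rather than failing.
    for node in REPLICAS:
        try:
            return fetch_from(node, block_id)
        except ConnectionError:
            continue
    raise RuntimeError("all replicas unavailable")

print(read_block("blk_0001"))  # served by node-b even though node-a is down
```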

b) NameNode :- It works as the Master in a Hadoop cluster. Listed below are the main functions
performed by the NameNode:
1. Stores metadata about the actual data.
2. Manages the file system namespace.
3. Regulates client access requests for the actual file data.
4. Executes file system namespace operations such as opening and closing files and renaming files and
directories.
5. As the NameNode keeps metadata in memory for fast retrieval, a huge amount of memory is
required for its operation, and it should be hosted on reliable hardware.

The DataNode works as a Slave in a Hadoop cluster. Listed below are the main functions performed by
the DataNode:
1. Actually stores the business data.
2. Handles the actual workload: reads, writes, and data processing.
3. Upon instruction from the Master, it performs creation, replication, and deletion of data blocks.
4. As all the business data is stored on DataNodes, a huge amount of storage is required for their operation.
Commodity hardware can be used for hosting DataNodes.
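This master/slave split can be seen from a client's point of view through Hadoop's WebHDFS REST API. Below is a hedged Python sketch assuming a cluster with WebHDFS enabled; the hostname, port (50070 on Hadoop 2.x, 9870 on 3.x), and paths are placeholders. Metadata requests are answered by the NameNode itself, while a read is redirected to a DataNode that holds the data.

```python
import requests  # third-party HTTP client: pip install requests

# Placeholder address of the NameNode's WebHDFS endpoint.
NAMENODE = "http://namenode.example.com:50070"

# Metadata operation: the NameNode (master) answers a directory listing itself.
listing = requests.get(f"{NAMENODE}/webhdfs/v1/user/demo?op=LISTSTATUS")
print(listing.json()["FileStatuses"]["FileStatus"])

# Data operation: OPEN is answered with an HTTP 307 redirect whose Location
# header points at a DataNode (slave) that actually stores the block data.
resp = requests.get(f"{NAMENODE}/webhdfs/v1/user/demo/file.txt?op=OPEN",
                    allow_redirects=False)
print(resp.status_code, resp.headers.get("Location"))
```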

c) Job Tracker and Task Tracker


Job Tracker and Task Tracker are two essential processes involved in MapReduce execution in MRv1 (or
Hadoop version 1). Both processes are now deprecated in MRv2 (or Hadoop version 2) and replaced by the
Resource Manager, Application Master, and Node Manager daemons.
Job Tracker -
1. The Job Tracker process runs on a separate node, not usually on a DataNode.
2. The Job Tracker is an essential daemon for MapReduce execution in MRv1. It is replaced by the Resource
Manager/Application Master in MRv2.
3. The Job Tracker receives requests for MapReduce execution from the client.
4. The Job Tracker talks to the NameNode to determine the location of the data.
5. The Job Tracker finds the best Task Tracker nodes to execute tasks, based on data locality (proximity of
the data) and the available slots to execute a task on a given node.
Task Tracker -
1. The Task Tracker runs on DataNodes, usually on every DataNode.
2. The Task Tracker is replaced by the Node Manager in MRv2.
3. Mapper and Reducer tasks are executed on DataNodes administered by Task Trackers.
4. Task Trackers are assigned Mapper and Reducer tasks to execute by the Job Tracker.
5. The Task Tracker stays in constant communication with the Job Tracker, signalling the progress of the
task in execution.
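To ground this in something runnable, here is a minimal word-count sketch for Hadoop Streaming. In MRv1 terms, the Job Tracker would schedule the map and reduce phases below as tasks on Task Tracker nodes; Streaming simply pipes records through stdin/stdout. The single-file layout and the launch command afterwards are assumptions that vary by installation.

```python
#!/usr/bin/env python3
# wordcount.py -- run as "python3 wordcount.py map" or "python3 wordcount.py reduce".
import sys

def mapper():
    # Map phase: emit "word<TAB>1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Reduce phase: the framework sorts by key, so counts for the same word
    # arrive adjacent and can be summed with one running total.
    current, total = None, 0
    for line in sys.stdin:
        word, _, count = line.rstrip("\n").partition("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

A job like this could be submitted with something along the lines of hadoop jar hadoop-streaming-*.jar -input /in -output /out -mapper "python3 wordcount.py map" -reducer "python3 wordcount.py reduce" -file wordcount.py, where the exact streaming jar path depends on the distribution.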

Q7) What are the three modes in which Hadoop can run?

Ans : Hadoop can run in three different modes:

1. Standalone Mode
By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. Instead
of HDFS, this mode uses the local file system. It is useful for debugging, and there is no need
to configure core-site.xml, hdfs-site.xml, mapred-site.xml, masters, or slaves.
2. Pseudo-Distributed Mode (Single node) – Hadoop can also run on a single node in pseudo-distributed
mode. In this mode, each daemon runs as a separate Java process, and custom configuration is required
(see the configuration sketch after this list).
3. Fully Distributed Mode :-

This is the production mode of Hadoop. In this mode, one machine in the cluster is typically designated
exclusively as the NameNode and another as the Resource Manager; these are the masters. All other nodes
act as DataNodes and Node Managers. This mode offers fully distributed computing capability, reliability,
fault tolerance, and scalability.
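As an example of that custom configuration, the Python sketch below writes the two minimal config files from the standard Apache single-node (pseudo-distributed) setup. The localhost port and the output directory are assumptions; in a real installation these files live under $HADOOP_HOME/etc/hadoop/.

```python
# Minimal pseudo-distributed configuration, per the Apache single-node guide:
# core-site.xml points the default filesystem at a local HDFS instance, and
# hdfs-site.xml drops replication to 1 since there is only one DataNode.

CORE_SITE = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
"""

HDFS_SITE = """<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
"""

for path, body in [("core-site.xml", CORE_SITE), ("hdfs-site.xml", HDFS_SITE)]:
    with open(path, "w") as f:  # written to the current directory for illustration
        f.write(body)
```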

Q8) What is the basic difference between a relational DBMS and Hadoop?


Ans –

DBMS: Traditional row-column based databases, basically used for data storage, manipulation and retrieval.
Hadoop: An open-source software framework used for storing data and running applications or processes concurrently.

DBMS: Mostly processes structured data.
Hadoop: Processes both structured and unstructured data.

DBMS: Best suited for an OLTP environment.
Hadoop: Best suited for Big Data.

DBMS: Less scalable than Hadoop.
Hadoop: Highly scalable.

DBMS: Stores transformed and aggregated data.
Hadoop: Stores huge volumes of data.

DBMS: The data schema of an RDBMS is static.
Hadoop: The data schema of Hadoop is dynamic.

DBMS: Has no latency in response.
Hadoop: Has some latency in response.


Q9) What are the sources and types of data that make up Big Data?
Ans -
The three primary sources of Big Data are:
1) Streaming data
This category includes data that reaches your IT systems from a web of connected devices, often as part of
the IoT. You can analyze this data as it arrives and decide what data to keep, what not to keep,
and what requires further analysis.
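As a toy sketch of that keep/discard/analyze decision, here is a small Python generator over a stream of hypothetical sensor readings; the field names and thresholds are invented for illustration.

```python
def triage(readings):
    # Process records as they arrive and route each one immediately.
    for r in readings:
        if r["value"] is None:
            continue                          # discard: unusable reading
        elif r["value"] > r.get("alert_at", 100):
            yield ("analyze", r)              # route for further analysis
        else:
            yield ("keep", r)                 # store as-is

events = [{"sensor": "s1", "value": 42},
          {"sensor": "s2", "value": None},
          {"sensor": "s3", "value": 250, "alert_at": 100}]

for decision, record in triage(events):
    print(decision, record["sensor"])         # keep s1 / analyze s3
```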

2) Social media data

Data on social interactions is an increasingly attractive set of information, particularly for marketing,
sales, and support functions. It is often in unstructured or semi-structured forms, so it poses a unique
challenge when it comes to consumption and analysis.

3) Publicly available sources

Massive amounts of data are available through open data sources like the US government's data.gov, the
CIA World Factbook, or the European Union Open Data Portal.

Q10) What are the advantages, application areas and challenges for Big Data?

Ans - Advantages of Big Data:


 Data accumulation from multiple sources, including the Internet, social media platforms, online
shopping sites, company databases, external third-party sources, etc.
 Real-time forecasting and monitoring of the business as well as the market.
 Identifying issues in systems and business processes in real time.
 Unlocking the true potential of data-driven marketing.
 Digging into customer data to create tailor-made products, services, offers, discounts, etc.
 Facilitating speedy delivery of products/services that meet and exceed client expectations.
 Diversifying revenue streams to boost company profits and ROI.
 Responding to customer requests, grievances, and queries in real time.

Applications of Big Data:


1. Tracking customer spending habits and shopping behavior: In big retail stores (like Amazon,
Walmart, Big Bazaar, etc.), the management team has to keep data on customers' spending habits
(which products they spend on, which brands they prefer, how frequently they spend), their shopping
behavior, and their most-liked products (so that those products can be kept in stock). Based on
which products are searched for and sold most, the production or procurement rate of those
products is fixed.
2. Recommendation: By tracking customers' spending habits and shopping behavior, big retail stores
provide recommendations to the customer. E-commerce sites like Amazon, Walmart, and Flipkart
do product recommendation: they track what products a customer searches for and, based on that
data, recommend products of that type to the customer (a toy sketch of this idea follows this list).
3. Smart traffic systems: Data about traffic conditions on different roads is collected through
cameras placed beside roads and at the entry and exit points of the city, and from GPS devices
placed in vehicles (Ola and Uber cabs, etc.). All such data is analyzed, and jam-free or less
congested, faster routes are recommended.
4. Virtual personal assistant tools: Big data analysis helps virtual personal assistant tools (like Siri
on Apple devices, Cortana on Windows, and Google Assistant on Android) answer the various
questions users ask. The tool tracks the user's location, local time, season, and other data related
to the question asked; analyzing all such data, it provides an answer.

5. Media and entertainment sector: Media and entertainment service providers like Netflix, Amazon
Prime, and Spotify analyze data collected from their users: what type of videos or music users
watch or listen to most, how long users spend on the site, etc. This data is collected and analyzed
to set the next business strategy.
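As promised in the recommendation item above, here is a toy co-occurrence sketch of that idea: products that appeared together in past shopping sessions are suggested alongside each other. The session data and product names are invented, and real recommenders are far more sophisticated.

```python
from collections import Counter
from itertools import combinations

# Hypothetical past shopping sessions (sets of products bought together).
sessions = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"laptop", "mouse"},
    {"phone", "charger"},
]

# Count how often each ordered pair of products co-occurs in a session.
co_views = Counter()
for basket in sessions:
    for a, b in combinations(sorted(basket), 2):
        co_views[(a, b)] += 1
        co_views[(b, a)] += 1

def recommend(product: str, k: int = 2) -> list[str]:
    """Return the k products most often seen together with `product`."""
    scored = [(other, n) for (p, other), n in co_views.items() if p == product]
    return [other for other, _ in sorted(scored, key=lambda t: -t[1])[:k]]

print(recommend("phone"))  # e.g. ['case', 'charger']
```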

Challenges for Big Data:


 Storage
 Complexity
 Management
 Preprocessing
 Analytics
 Utilization gap
 Lack of skilled people
 Privacy
 Security
 Real-time analysis