
CLOUD & BIG DATA

Experiential Learning

Submitted to –
Prof. Col. Ravindra Bhate Sir

Henish Kanani
Roll No 42201
Introduction to Big Data
First, we need to understand what data is and what big data is.
Generally speaking, data refers to the numbers, characters, or symbols on which a computer
performs operations, and which can be processed and transmitted as electrical signals and
recorded on mechanical, optical, or magnetic storage media.
Big Data is also data, but of enormous size: the term describes data sets that are immense in
volume and grow exponentially over time. For example, the New York Stock Exchange produces
about 1 TB of new trading data per day. Facebook generates about 5 TB of data in a single day,
in the form of photo and video uploads, message exchanges, comments, and so on. A single jet
engine produces about 10 TB of data within 30 minutes of flight time; with thousands of flights
per day, data generation reaches many petabytes.

Big data has several characteristics, which are as follows:


• Volume – Volume refers to the sheer quantity of data generated. It is the scale of the
data that determines its value and potential, and whether it can be regarded as big
data at all.
• Variety – Variety refers to the different types and sources the data belongs to;
knowing the class of the data is also a key factor for the analyst.
• Velocity – Velocity refers to the speed at which data is generated and must be
processed.
• Variability – Variability refers to inconsistency in the data, which can hamper those
trying to analyse it.
• Veracity – The quality of the data being captured can vary greatly, and the accuracy
of any analysis depends on the veracity of the source data.
• Complexity – When vast amounts of data originate from multiple sources, managing
the data becomes a complex process.

Advantages of Big Data Processing


The ability to process Big Data brings multiple benefits:

• Businesses can use data-driven insights to improve their products and customer service.
• Businesses can develop new and better strategies.
• Customer feedback systems can be improved.
• Risks to a product or service can be identified early, preventing damage.
• Operational efficiency can be improved.

Big data analytics is the often complex process of examining large and varied data sets, or
big data, to uncover information such as hidden patterns, unknown correlations, market trends,
and customer preferences that can help organizations make informed business decisions.
Applications of Big Data
Healthcare – Data analysts collect and analyse information from multiple sources to gain
insight. These sources include electronic patient records, clinical decision support systems,
medical imaging, physicians' written and prescription notes, pharmacy and laboratory data,
clinical data, and sensor data generated by machines. Integrating such data helps build a
comprehensive treatment system that can reduce costs while improving treatment quality.
Drawing on external sources, such as social media, helps to detect epidemics early and take
preventive measures.
Retail – The evolution of e-commerce, online shopping, social-network conversations, and,
more recently, location data from personal mobile devices has increased both the volume and
the quality of data available for retail customization. Major retail outlets install CCTV not
only to observe instances of theft but also to monitor customer flow. This helps track buying
trends by age group, gender, and weekday versus weekend shopping. Retailers also group their
products based on customer buying patterns using a well-known data analysis technique called
Market Basket Analysis, so that, for example, customers who buy bread and milk are likely to
buy jam as well; a small sketch of the idea follows.
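
To make the idea concrete, here is a minimal sketch of the pair co-occurrence counting that
underlies Market Basket Analysis. The transactions and item names are invented for
illustration; a real deployment would run this kind of counting at scale, for example as a
MapReduce job over transaction logs.

import java.util.*;

// Minimal sketch of pair co-occurrence counting, the first step of
// Market Basket Analysis. Transactions and items are invented examples.
public class MarketBasket {
    public static void main(String[] args) {
        List<Set<String>> transactions = List.of(
            Set.of("bread", "milk", "jam"),
            Set.of("bread", "milk"),
            Set.of("bread", "jam"),
            Set.of("milk", "eggs"));

        // Count how often each pair of items appears in the same basket.
        Map<String, Integer> pairCounts = new HashMap<>();
        for (Set<String> basket : transactions) {
            List<String> items = new ArrayList<>(basket);
            Collections.sort(items);
            for (int i = 0; i < items.size(); i++)
                for (int j = i + 1; j < items.size(); j++)
                    pairCounts.merge(items.get(i) + " & " + items.get(j), 1, Integer::sum);
        }

        // Pairs bought together often are candidates for joint placement or promotion.
        pairCounts.forEach((pair, count) ->
            System.out.println(pair + " bought together " + count + " time(s)"));
    }
}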
Banking – The investment worthiness of customers can be assessed using demographic data,
behavioural data, and financial employment data. Cross-selling can then be used to target
specific customer groups based on past purchasing behaviour, demographic information,
sentiment analysis, and CRM data.
Apache Hadoop
Hadoop is an open-source framework, written in Java, that allows distributed processing of
large data sets across clusters of computers using simple programming models.
HDFS (Hadoop Distributed File System) – HDFS is the storage layer that provides a
distributed data repository, while YARN (Yet Another Resource Negotiator) is the processing
layer that schedules parallel compute jobs. Together they hide the complexities of distributed
computing. Hadoop ships with its own distributed file system, HDFS, which is designed to store
very large files on clusters of commodity hardware with streaming data access patterns. The
HDFS block size (128 MB by default) is much larger than that of a normal file system; the
reason for this large block size is to cut down on the number of disk seeks.
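
As a small illustration of streaming access, the following sketch reads a file from HDFS
using Hadoop's Java file system API. The file path is a placeholder, and the code assumes a
core-site.xml on the classpath that points fs.defaultFS at the cluster.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: reading a file from HDFS with streaming access.
// The path below is a placeholder, not a real data set.
public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        FileSystem fs = FileSystem.get(conf);
        try (FSDataInputStream in = fs.open(new Path("/data/trades.csv"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}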

MapReduce – MapReduce is a parallel programming model, developed by Google, for writing
distributed applications that process large amounts of data (multi-terabyte data sets)
efficiently and fault-tolerantly on large clusters (thousands of nodes) of commodity
hardware. MapReduce programs run on the open-source Apache Hadoop framework.
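
The canonical example is WordCount, which counts how often each word occurs in the input
files: the map step emits a (word, 1) pair for every word it sees, and the reduce step sums
the counts per word. The sketch below follows the version in the Hadoop documentation.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: emit (word, 1) for every word in the input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}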
In addition to the two core components described above, the Hadoop framework also contains
the following two modules:

• Hadoop Common – Java libraries and utilities required by the other Hadoop modules.
• Hadoop YARN – A framework for job scheduling and cluster resource management.
Apache Sqoop – Apache Sqoop is a command-line application used for transferring data
between relational databases and Hadoop. It is part of the Hadoop ecosystem.

Apache Flume – Apache Flume is a tool/service/data-ingestion mechanism for collecting and
moving large amounts of streaming data, such as log files and events, from many different
sources to a centralized data store.
Apache Hive – Hive is a data warehouse infrastructure tool for processing structured data in
Hadoop. It sits on top of Hadoop to summarize Big Data and makes querying and analysing
easy, using an SQL-like language called HiveQL.
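
As an illustration, Hive can be queried from Java through its JDBC driver. In the sketch
below, the host, table, and column names are placeholders; it assumes a HiveServer2 instance
listening on its default port 10000.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch of querying Hive from Java over JDBC.
// Host, table, and column names are placeholders for illustration.
public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             // HiveQL looks like SQL; Hive compiles it into jobs on the cluster.
             ResultSet rs = stmt.executeQuery(
                 "SELECT product, COUNT(*) AS cnt FROM sales GROUP BY product")) {
            while (rs.next()) {
                System.out.println(rs.getString("product") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}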
