BDA Notes Unit-1
What is Data?
Data refers to the quantities, characters, or symbols on which a computer performs
operations, and which may be stored and transmitted in the form of electrical signals and
recorded on magnetic, optical, or mechanical recording media.
“Big Data” is a collection of data that is huge in volume and keeps growing exponentially
with time. Its size and complexity are so great that traditional data management tools
cannot store or process it efficiently. In short, Big Data is still data, but at a very
large scale.
“Big Data” is data whose scale, diversity, and complexity require new architecture,
techniques, algorithms, and analytics to manage it and extract value and hidden knowledge
from it…
MREM DEPARTMENT Of CSE-DATA SCIENCE (CSD) III YEAR -II SEM
E-Commerce Site: Sites like Amazon, Flipkart, and Alibaba generate huge volumes of logs from
which users' buying trends can be traced.
Weather Station: Weather stations and satellites produce very large volumes of data, which are stored and
processed to forecast the weather.
Telecom Company: Telecom giants like Airtel and Vodafone study user trends and publish their plans
accordingly; for this they store the data of their millions of users.
(i) Volume – The name Big Data itself is related to a size which is enormous. Size of
data plays a very crucial role in determining value out of data. Also, whether a
particular data can actually be considered as a Big Data or not, is dependent upon
the volume of data. Hence, ‘Volume’ is one characteristic which needs to be
considered while dealing with Big Data solutions.
(ii) Variety – Variety refers to heterogeneous sources and the nature of data, both structured and
unstructured. In earlier days, spreadsheets and databases were the only sources
of data considered by most applications. Nowadays, data in the form of emails,
photos, videos, monitoring devices, PDFs, audio, etc. is also considered in
analysis applications. This variety of unstructured data poses certain issues for storing,
mining, and analyzing data.
(iii) Velocity – The term ‘velocity’ refers to the speed at which data is generated. How fast
data is generated and processed to meet demand determines the real potential of
the data.
Big Data Velocity deals with the speed at which data flows in from sources like business
processes, application logs, networks, and social media sites, sensors, Mobile devices,
etc. The flow of data is massive and continuous.
(iv) Variability – This refers to the inconsistency which can be shown by the data at
times, thus hampering the process of being able to handle and manage the data
effectively.
(v) Value – This fifth and final characteristic can be defined as the added value or
utility that the collected data can bring to a decision-making process, business activity
or analysis. However, for data to be useful, it is necessary to convert it into knowledge.
This requires the use and combination of different technologies such as data mining,
predictive analytics, text mining, etc. This aims to achieve three major business
objectives: cost reduction, quick and effective decision-making, and the design of new
products or services.
The companies in the present market need to collect it and analyse it because:
1. Cost Savings:
Big Data tools like Apache Hadoop, Spark, etc. bring cost-saving benefits to businesses
when they have to store large amounts of data. These tools help organizations in
identifying more effective ways of doing business.
2. Time-Saving:
Real-time in-memory analytics helps companies to collect data from various sources.
Tools like Hadoop help them to analyze data immediately thus helping in making
quick decisions based on the learnings.
For example, analyzing customer purchasing behavior helps companies identify their
best-selling products and plan production accordingly. This helps companies
get ahead of their competitors.
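The purchase analysis described above can be sketched in a few lines of Python. This is only a minimal illustration: the record layout and the names `orders` and `top_products` are assumptions made up for this example, and a real system would read from logs or a database rather than an in-memory list.

```python
from collections import Counter

# Hypothetical order log: (customer_id, product, quantity)
orders = [
    ("c1", "phone", 1),
    ("c2", "laptop", 1),
    ("c3", "phone", 2),
    ("c1", "charger", 3),
    ("c4", "phone", 1),
]

def top_products(records, n=2):
    """Return the n products with the most units sold."""
    units = Counter()
    for _customer, product, quantity in records:
        units[product] += quantity
    return units.most_common(n)

print(top_products(orders))
```

The same counting idea scales to large logs when the per-product totals are computed in parallel and then merged.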
If a company does not know what its customers want, its success will suffer. The
result is a loss of clientele, which adversely affects business growth.
Big data analytics helps businesses to identify customer related trends and patterns.
Customer behavior analysis leads to a profitable business.
A number of business drivers are at the core of this success and explain why Big Data has
quickly risen to become one of the most coveted topics in the industry. Six main business
drivers can be identified:
The following is a high-level overview of these business drivers.
Each of them adds to the competitive advantage of enterprises, either by creating new revenue
streams or by reducing operational costs.
Historical Costs of Computer Memory, reprinted from McCallum and Blok, 2017
Besides the plummeting of the storage costs, a second key contributing factor to the
affordability of Big Data has been the development of open source Big Data software
frameworks. The most popular software framework (nowadays considered the standard for
Big Data) is Apache Hadoop for distributed storage and processing. Because these
software frameworks are freely available as open source, it has become increasingly
inexpensive to start Big Data projects in organizations.
Social media data provides insights into the behaviours, preferences and opinions of ‘the
public’ on a scale that has never been known before. Due to this, it is immensely valuable
to anyone who is able to derive meaning from these large quantities of data. Social media
data can be used to identify customer preferences for product development, target new
customers for future purchases, or even target potential voters in elections. Social media
data might even be considered one of the most important business drivers of Big Data.
1. Collect Data: Data collection looks different for every organization. With today’s technology,
organizations can gather both structured and unstructured data from a variety of sources —
from cloud storage to mobile applications to in-store IoT sensors and beyond. Some data will be
stored in data warehouses where business intelligence tools and solutions can access it easily. Raw
or unstructured data that is too diverse or complex for a warehouse may be assigned metadata
and stored in a data lake.
2. Process Data: Once data is collected and stored, it must be organized properly to get
accurate results on analytical queries, especially when it’s large and unstructured. Available data
is growing exponentially, making data processing a challenge for organizations. One processing
option is batch processing, which looks at large data blocks over time. Batch processing is
useful when there is a longer turnaround time between collecting and analyzing data. Stream
processing looks at small batches of data at once, shortening the delay time between collection
and analysis for quicker decision-making. Stream processing is more complex and often more
expensive.
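The difference between the two processing styles can be sketched as follows. This is a toy illustration under stated assumptions: the `readings` list, the function names, and the window size of 2 are all hypothetical, and real batch or stream engines work on distributed data rather than a Python list.

```python
from typing import Iterable, Iterator, List

# Hypothetical sensor readings arriving over time
readings = [12.0, 15.0, 11.0, 14.0, 18.0, 20.0]

def batch_average(data: List[float]) -> float:
    """Batch style: wait until all data is collected, then process it once."""
    return sum(data) / len(data)

def stream_averages(data: Iterable[float], window: int = 2) -> Iterator[float]:
    """Stream style: emit a result each time a small window fills up."""
    buffer: List[float] = []
    for value in data:
        buffer.append(value)
        if len(buffer) == window:
            yield sum(buffer) / window
            buffer.clear()

print(batch_average(readings))          # one result, after everything arrived
print(list(stream_averages(readings)))  # several results, produced along the way
```

The batch function yields a single answer only after all readings are available, while the stream function produces partial answers early, which is what shortens the delay between collection and analysis.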
3. Clean Data: Data, big or small, requires scrubbing to improve data quality and get stronger
results; all data must be formatted correctly, and any duplicative or irrelevant data must be
eliminated or accounted for. Dirty data can obscure and mislead, creating flawed insights.
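A minimal sketch of such a cleaning step is shown below. The field names (`name`, `city`) and the sample records are assumptions invented for illustration; the point is only to show formatting, dropping incomplete rows, and removing duplicates in one pass.

```python
# Hypothetical raw customer records with inconsistent formatting
raw = [
    {"name": " Alice ", "city": "delhi"},
    {"name": "alice", "city": "Delhi"},
    {"name": "Bob", "city": ""},
    {"name": "Carol", "city": "Mumbai "},
]

def clean(records):
    """Normalize formatting, drop incomplete rows, and remove duplicates."""
    seen = set()
    result = []
    for rec in records:
        name = rec["name"].strip().title()
        city = rec["city"].strip().title()
        if not name or not city:   # drop rows with missing fields
            continue
        key = (name, city)
        if key in seen:            # skip duplicates found after normalization
            continue
        seen.add(key)
        result.append({"name": name, "city": city})
    return result

print(clean(raw))
```

Note that the two "Alice" rows only become recognizable as duplicates after formatting is normalized, which is why formatting comes before de-duplication here.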
4. Analyze Data: Getting big data into a usable state takes time. Once it’s ready, advanced
analytics processes can turn big data into big insights. Some of these big data analysis methods
include:
Data mining sorts through large datasets to identify patterns and relationships by
identifying anomalies and creating data clusters.
Predictive analytics uses an organization’s historical data to make predictions about the
future, identifying upcoming risks and opportunities.
Deep learning imitates human learning patterns by using artificial intelligence and
machine learning to layer algorithms and find patterns in the most complex and abstract
data.
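As one small illustration of the anomaly identification mentioned under data mining, the sketch below flags values that lie far from the average using a z-score. This is a deliberately simple stand-in chosen for this example, not the method any particular tool uses; the `amounts` data, the function name, and the threshold of 2.0 are assumptions.

```python
from statistics import mean, pstdev

# Hypothetical daily transaction amounts; one value is unusually large
amounts = [100, 102, 98, 101, 99, 500, 100, 97]

def find_anomalies(values, threshold=2.0):
    """Flag values whose z-score (distance from the mean in standard deviations) exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # all values identical, nothing stands out
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

print(find_anomalies(amounts))
```

Real data mining tools typically use more robust techniques, since a single extreme value also inflates the mean and standard deviation it is compared against.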
Cost reduction: Big data can reduce costs by storing all business data in one
place. Tracking analytics also helps companies find ways to work more
efficiently to cut costs wherever possible.
1. Transportation:
Big Data powers the GPS smartphone applications most of us depend on to get from place to
place in the least amount of time. GPS data sources include satellite images and government
agencies.
Airplanes generate enormous volumes of data, on the order of 1,000 gigabytes for transatlantic
flights. Aviation analytics systems ingest all of this to analyze fuel efficiency, passenger and
cargo weights, and weather conditions, with a view toward optimizing safety and energy
consumption.
Route planning
Different itineraries can be compared in terms of user needs, fuel consumption, and
other factors to plan for maximum efficiency.
Traffic safety
Real-time processing and predictive analytics are used to pinpoint accident-prone areas.
Ads have always been targeted towards specific consumer segments. In the past, marketers
have employed TV and radio preferences, survey responses, and focus groups to try to
ascertain people’s likely responses to campaigns. At best, these methods amounted to
educated guesswork.
Today, advertisers buy or gather huge quantities of data to identify what consumers actually
click on, search for, and “like.” Marketing campaigns are also monitored for effectiveness using
click-through rates, views, and other precise metrics.
For example, Amazon accumulates massive data stores on the purchases, delivery methods,
and payment preferences of its millions of customers. The company then sells ad placements
that can be highly targeted to very specific segments and subgroups.
Fraud detection:
Banks monitor credit cardholders’ purchasing patterns and other activity to flag
atypical movements and anomalies that may signal fraudulent transactions.
Risk management:
Big Data analytics enable banks to monitor and report on operational processes,
KPIs, and employee activities.
Personalized marketing:
Banks use Big Data to construct rich profiles of individual customer lifestyles,
preferences, and goals, which are then utilized for micro-targeted marketing
initiatives.
4. Government:
Government agencies collect voluminous quantities of data, but many, especially at the
local level, don’t employ modern data mining and analytics techniques to extract real
value from it.
Examples of agencies that do include the IRS and the Social Security Administration,
which use data analysis to identify tax fraud and fraudulent disability claims. The FBI
and SEC apply Big Data strategies to monitor markets in their quest to detect criminal
business activities. For years now, the Federal Housing Administration has been using Big
Data analytics to forecast mortgage defaults and repayment rates.
The Centers for Disease Control tracks the spread of infectious illnesses using data
from social media, and the FDA deploys Big Data techniques across testing labs to
investigate patterns of foodborne illness. The U.S. Department of Agriculture supports
agribusiness and ranching by developing Big Data-driven technologies.
Military agencies, with expert assistance from a sizable ecosystem of defense
contractors, make sophisticated and extensive use of data-driven insights for domestic
intelligence, foreign surveillance, and cyber security.
Two conspicuous examples are Amazon Prime, which uses Big Data analytics to
recommend programming for individual users, and Spotify, which does the same to
offer personalized music suggestions.
6. Meteorology:
Weather satellites and sensors all over the world collect large amounts of data for
tracking environmental conditions. Meteorologists use Big Data to:
7. Healthcare:
Big Data is slowly but surely making a major impact on the huge healthcare industry.
Wearable devices and sensors collect patient data which is then fed in real-time to
individuals’ electronic health records. Providers and practice organizations are now
using Big Data for a number of purposes, including these:
Real-time alerting
Strategic planning
Research acceleration
Telemedicine
8. Cyber security:
While Big Data can expose businesses to a greater risk of cyber attacks, the same data
stores can be used to prevent and counteract online crime through the power of
machine learning and analytics. Historical data analysis can yield intelligence to create
more effective threat controls. And machine learning can warn businesses when
deviations from normal patterns and sequences occur, so that effective
countermeasures can be taken against threats such as ransomware attacks, malicious
insider programs, and attempts at unauthorized access.
After a company has suffered an intrusion or data theft, post-attack analysis can
uncover the methods used, and machine learning can then be deployed to devise
safeguards that will foil similar attempts in the future.
9. Education:
Administrators, faculty, and stakeholders are embracing Big Data to help improve their
curricula, attract the best talent, and optimize the student experience. Examples
include:
Customizing curricula:
Big Data enables academic programs to be tailored to the needs of
individual students, often drawing on a combination of online learning,
traditional on-site classes, and independent study.