BDA Notes Unit-1

MREM DEPARTMENT Of CSE-DATA SCIENCE (CSD) III YEAR -II SEM

What is Data?
The quantities, characters, or symbols on which operations are performed by a
computer, which may be stored and transmitted in the form of electrical signals and
recorded on magnetic, optical, or mechanical recording media.

• No single standard definition…

“Big Data” is a collection of data that is huge in volume and keeps growing exponentially
with time. Its size and complexity are so great that no traditional data management
tool can store or process it efficiently. In short, Big Data is still data, but at a
huge scale.

“Big Data” is data whose scale, diversity, and complexity require new architecture,
techniques, algorithms, and analytics to manage it and extract value and hidden knowledge
from it…

E-Commerce Sites: Sites like Amazon, Flipkart, and Alibaba generate huge volumes of logs from
which users’ buying trends can be traced.
Weather Stations: Weather stations and satellites produce very large amounts of data, which are
stored and processed to forecast the weather.
Telecom Companies: Telecom giants like Airtel and Vodafone study user trends and publish their
plans accordingly; for this they store the data of their millions of users.

Characteristics Of Big Data


Big data can be described by the following characteristics:

(i) Volume – The name Big Data itself refers to a size that is enormous. The size of
data plays a crucial role in determining the value that can be extracted from it, and
whether a particular data set can be considered Big Data at all depends on its
volume. Hence, ‘Volume’ is one characteristic that must be considered when dealing
with Big Data solutions.

(ii) Variety – The next aspect of Big Data is its variety.

Variety refers to heterogeneous sources and the nature of data, both structured and
unstructured. In earlier days, spreadsheets and databases were the only sources of
data considered by most applications. Nowadays, data in the form of emails, photos,
videos, monitoring devices, PDFs, audio, etc. is also considered in analysis
applications. This variety of unstructured data poses challenges for storing, mining,
and analyzing data.

(iii) Velocity – The term ‘velocity’ refers to the speed at which data is generated. How
fast data is generated and processed to meet demand determines the real potential of
the data.

Big Data velocity deals with the speed at which data flows in from sources like business
processes, application logs, networks, social media sites, sensors, and mobile devices.
The flow of data is massive and continuous.

(iv) Variability – This refers to the inconsistency that the data can show at times,
which hampers the ability to handle and manage the data effectively.

(v) Value – This fifth and final characteristic can be defined as the added value or
utility that the collected data can bring to a decision-making process, business activity,
or analysis. For data to be useful, however, it must be converted into knowledge.
This requires combining different technologies such as data mining, predictive
analytics, and text mining. The aim is to achieve three major business objectives: cost
reduction, quick and effective decision-making, and the design of new products or
services.

Importance of Big data:


The importance of Big Data doesn’t revolve around how much data a company has. It
lies in how the company uses the data it has gathered. Every company uses its collected
data in its own way, and the more effectively a company uses its data, the more
rapidly it grows.

Companies in the present market need to collect and analyse data because:
1. Cost Savings:
Big Data tools like Apache Hadoop, Spark, etc. bring cost-saving benefits to businesses
when they have to store large amounts of data. These tools help organizations in
identifying more effective ways of doing business.

2. Time-Saving:
Real-time in-memory analytics helps companies collect data from various sources.
Tools like Hadoop and Spark help them analyze that data quickly, so that decisions can
be made promptly based on what is learned.

3. Understand the market conditions:

Big Data analysis helps businesses get a better understanding of market situations.

For example, analysis of customer purchasing behavior helps companies identify the
best-selling products and produce those products accordingly. This helps companies
get ahead of their competitors.

4. Social Media Listening:

Companies can perform sentiment analysis using Big Data tools. This gives them
feedback about their company, that is, who is saying what about it. Companies can
also use Big Data tools to improve their online presence.

5. Boost Customer Acquisition and Retention:

Customers are a vital asset on which every business depends. No business can
succeed without building a robust customer base, but even with a solid customer
base, companies can’t ignore the competition in the market.

A company that doesn’t know what its customers want will see its success erode. The
result is a loss of clientele, which has an adverse effect on business growth.

Big data analytics helps businesses identify customer-related trends and patterns.
Customer behavior analysis leads to a more profitable business.

6. Solve Advertisers’ Problems and Offer Marketing Insights:

Big data analytics shapes all business operations. It enables companies to fulfill
customer expectations, helps them adjust their product lines, and supports more
powerful marketing campaigns.

7. The Driver of Innovations and Product Development:

Big data enables companies to innovate and redevelop their products.
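
The sentiment analysis mentioned under Social Media Listening (item 4) can be sketched with a very small lexicon-based scorer. This is only an illustration: the word lists, the function names `sentiment_score` and `label`, and the sample posts are assumptions made up for this example, and real systems usually rely on trained models over far larger data.

```python
import re

# Hypothetical, tiny sentiment lexicons (assumptions for illustration only).
POSITIVE = {"good", "great", "love", "fast", "excellent"}
NEGATIVE = {"bad", "slow", "hate", "poor", "broken"}


def sentiment_score(text: str) -> int:
    """Return (number of positive words) - (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def label(text: str) -> str:
    """Map the numeric score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up posts standing in for social media data about a company.
posts = [
    "Love the new app, checkout is fast",
    "Delivery was slow and the box arrived broken",
    "Order received",
]
labels = [label(p) for p in posts]
print(labels)  # ['positive', 'negative', 'neutral']
```

Counting labels over many posts gives a rough picture of who is saying what about the company.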

Drivers of Big Data:

A number of business drivers are at the core of this success and explain why Big Data has
quickly risen to become one of the most coveted topics in the industry. Six main business
drivers can be identified:

1. The digitization of society;
2. The plummeting of technology costs;
3. Connectivity through cloud computing;
4. Increased knowledge about data science;
5. Social media applications;
6. The upcoming Internet-of-Things (IoT).

This section gives a high-level overview of each of these business drivers. Each of them
adds to the competitive advantage of enterprises by creating new revenue streams or
by reducing operational costs.

1. The digitization of society

Big Data is largely consumer driven and consumer oriented. Most of the data in the world is
generated by consumers, who are nowadays ‘always-on’. Most people now spend 4-6
hours per day consuming and generating data through a variety of devices and (social)
applications. With every click, swipe, or message, new data is created in a database
somewhere in the world. Because so many people now carry a smartphone, the amount of
data created is enormous. Some studies estimate that 60% of data was generated within
the last two years, which is a good indication of the rate at which society has digitized.

2. The plummeting of technology costs

Technology for collecting and processing massive quantities of diverse (high variety)
data has become increasingly affordable. The costs of data storage and processors
keep declining, making it possible for small businesses and individuals to become involved
with Big Data. For storage capacity, the often-cited Moore’s Law (strictly an observation
about transistor counts on integrated circuits) is commonly invoked to say that storage
density, and therefore capacity, doubles roughly every two years. The plummeting of
technology costs is depicted in the figure below.

[Figure: Historical Costs of Computer Memory, reprinted from McCallum and Blok, 2017]

Besides plummeting storage costs, a second key factor in the affordability of Big Data
has been the development of open source Big Data software frameworks. The most
popular framework (nowadays considered the standard for Big Data) is Apache Hadoop
for distributed storage and processing. Because these frameworks are freely available
as open source, it has become increasingly inexpensive to start Big Data projects in
organizations.
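
The "doubles every two years" claim for storage capacity can be turned into simple arithmetic. The sketch below assumes an idealized, exact doubling every two years (real hardware trends are less regular); the function name `projected_capacity` and the 1 TB starting point are illustrative assumptions.

```python
def projected_capacity(initial: float, years: float, doubling_period: float = 2.0) -> float:
    """Capacity after `years`, assuming one doubling per `doubling_period` years."""
    return initial * 2 ** (years / doubling_period)


# Starting from a hypothetical 1 TB drive, ten years means five doublings.
print(projected_capacity(1.0, 10))  # 32.0
```

Under this assumption the same budget buys about 32 times the capacity after a decade, which is why storage cost per gigabyte falls so quickly.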

3. Connectivity through cloud computing


Cloud computing environments (where data is stored remotely in distributed storage
systems) have made it possible to quickly scale IT infrastructure up or down and
facilitate a pay-as-you-go model. This means that organizations that want to process
massive quantities of data (and thus have large storage and processing requirements) do
not have to invest in large quantities of IT infrastructure. Instead, they can license the
storage and processing capacity they need and pay only for what they actually use. As
a result, most Big Data solutions leverage cloud computing to deliver their solutions
to enterprises.

4. Increased knowledge about data science


In the last decade, the terms data science and data scientist have become tremendously
popular. In October 2012, Harvard Business Review called data scientist the “sexiest job of
the 21st century”, and many other publications have featured this new job role in recent
years. The demand for data scientists (and similar job titles) has increased tremendously,
and many people have become actively engaged in the domain of data science.

As a result, knowledge and education about data science have become far more
professional, and more information becomes available every day. While statistics and
data analysis previously remained mostly an academic field, they are quickly becoming
popular subjects among students and the working population.

5. Social media applications


Everyone understands the impact that social media has on daily life. In the study
of Big Data, however, social media plays a role of paramount importance, not only because
of the sheer volume of data produced every day through platforms such as Twitter,
Facebook, LinkedIn, and Instagram, but also because social media provides nearly real-time
data about human behaviour.

Social media data provides insights into the behaviours, preferences and opinions of ‘the
public’ on a scale that has never been known before. Due to this, it is immensely valuable
to anyone who is able to derive meaning from these large quantities of data. Social media
data can be used to identify customer preferences for product development, target new
customers for future purchases, or even target potential voters in elections. Social media
data might even be considered one of the most important business drivers of Big Data.

6. The upcoming internet of things (IoT)


The Internet of things (IoT) is the network of physical devices, vehicles, home appliances
and other items embedded with electronics, software, sensors, actuators, and network
connectivity which enables these objects to connect and exchange data. It is increasingly
gaining popularity as consumer goods providers start including ‘smart’ sensors in
household appliances. Whereas the average household in 2010 had around 10 devices that
connected to the internet, this number is expected to rise to 50 per household by 2020.
Examples of these devices include thermostats, smoke detectors, televisions, audio
systems and even smart refrigerators.

What is big data analytics?


Big data analytics is the process of collecting, examining, and analysing large amounts
of data to discover market trends, insights, and patterns that can help companies make
better business decisions. This information is available quickly and efficiently so that
companies can be agile in crafting plans to maintain their competitive advantage.

How big data analytics works


Big data analytics refers to collecting, processing, cleaning, and analyzing large datasets
to help organizations operationalize their big data.

1. Collect Data: Data collection looks different for every organization. With today’s technology,
organizations can gather both structured and unstructured data from a variety of sources —
from cloud storage to mobile applications to in-store IoT sensors and beyond. Some data will be
stored in data warehouses where business intelligence tools and solutions can access it easily. Raw
or unstructured data that is too diverse or complex for a warehouse may be assigned metadata
and stored in a data lake.
2. Process Data: Once data is collected and stored, it must be organized properly to get
accurate results on analytical queries, especially when it’s large and unstructured. Available data
is growing exponentially, making data processing a challenge for organizations. One processing
option is batch processing, which looks at large data blocks over time. Batch processing is
useful when there is a longer turnaround time between collecting and analyzing data. Stream
processing looks at small batches of data at once, shortening the delay time between collection
and analysis for quicker decision-making. Stream processing is more complex and often more
expensive.
3. Clean Data: Data big or small requires scrubbing to improve data quality and get stronger
results; all data must be formatted correctly, and any duplicative or irrelevant data must be
eliminated or accounted for. Dirty data can obscure and mislead, creating flawed insights.
4. Analyze Data: Getting big data into a usable state takes time. Once it’s ready, advanced
analytics processes can turn big data into big insights. Some of these big data analysis methods
include:

• Data mining sorts through large datasets to identify patterns and relationships by
identifying anomalies and creating data clusters.

• Predictive analytics uses an organization’s historical data to make predictions about the
future, identifying upcoming risks and opportunities.

• Deep learning imitates human learning patterns by using artificial intelligence and
machine learning to layer algorithms and find patterns in the most complex and abstract
data.
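
The cleaning step (3) and the difference between batch and stream processing (step 2) can be illustrated with a short sketch. It is a simplified, single-machine illustration and not how any particular Big Data engine is implemented; the function names and the sample records are assumptions made up for this example.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Record = Tuple[int, Optional[float]]  # (record id, sensor value or None)


def clean(records: List[Record]) -> List[Record]:
    """Drop records with missing values and exact duplicates, keeping order."""
    seen = set()
    cleaned = []
    for record in records:
        if record[1] is None or record in seen:
            continue
        seen.add(record)
        cleaned.append(record)
    return cleaned


def batch_average(values: List[float]) -> float:
    """Batch processing: wait for the whole data block, then compute once."""
    return sum(values) / len(values)


def stream_averages(values: Iterable[float]) -> Iterator[float]:
    """Stream processing: update the result as each record arrives."""
    count, total = 0, 0.0
    for value in values:
        count += 1
        total += value
        yield total / count  # a result is available after every record


# Hypothetical raw records: one missing value and one duplicate.
raw = [(1, 10.0), (2, None), (3, 20.0), (3, 20.0), (4, 30.0), (5, 40.0)]
values = [value for _, value in clean(raw)]

print(values)                         # [10.0, 20.0, 30.0, 40.0]
print(batch_average(values))          # 25.0
print(list(stream_averages(values)))  # [10.0, 15.0, 20.0, 25.0]
```

The batch result appears only once all data is in, while the streaming version produces an updated answer after every record, at the cost of keeping running state.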

Benefits of big data analytics

Incorporating big data analytics into a business or organisation has several
advantages. These include:

• Cost reduction: Big data can reduce the cost of storing all business data in one
place. Tracking analytics also helps companies find ways to work more
efficiently and cut costs wherever possible.

• Product development: Developing and marketing new products, services, or
brands is much easier when based on data collected about customers’ needs
and wants. Big data analytics also helps businesses understand product
viability and keep up with trends.

• Strategic business decisions: The ability to constantly analyse data helps
businesses make better and faster decisions, such as cost and supply chain
optimisation.

• Customer experience: Data-driven algorithms help marketing efforts (targeted
ads, for example) and increase customer satisfaction by delivering an
enhanced customer experience.

• Risk management: Businesses can identify risks by analysing data patterns
and developing solutions for managing those risks.

There are four categories of analytics that organizations need to consider:

• Descriptive analytics: Descriptive analytics or data mining are at the bottom of
the big data value chain, but they can be valuable for uncovering patterns that
offer insight. A simple example of descriptive analytics would be reviewing the
number of people that visited the company’s website over the past few months.
Descriptive analytics can be useful in the sales cycle, for example, in order to spot
seasonal trends and to adjust purchasing decisions accordingly.

• Diagnostic analytics: Diagnostic analytics are used for discovery or to
determine why something happened. In a social media marketing campaign, for
example, diagnostic analytics can be used to determine why certain
advertisements resulted in increased conversion rates. Diagnostic analytics
provide valuable insights for organizations because they help them understand
which decisions impact the company’s performance.

• Predictive analytics: Predictive analytics use Big Data to identify past
patterns to predict the future. From trends or patterns in existing data
sets, predictive algorithms calculate the probability that a certain event
will occur. For example, some companies are using predictive analytics
for sales lead scoring, indicating which incoming sales leads will have the
highest chance of converting into an actual customer. Properly tuned
predictive analytics can be used to support sales, marketing, or other
types of complex forecasts.

• Prescriptive analytics: Prescriptive analytics is the last and most
valuable level of analytics. While Big Data analytics in general sheds light
on a subject, prescriptive analytics gives you a laser-like focus to answer
specific questions. For example, in the health care industry, you can
better manage the patient population by using prescriptive analytics to
measure the number of patients who are clinically obese, then add filters
for factors like diabetes and LDL cholesterol levels to determine where to
focus treatment. The same prescriptive model can be applied to almost
any industry target group or problem.
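
The health care example above (count the clinically obese patients, then filter by diabetes and LDL cholesterol) can be sketched as a simple filtering pipeline. All records below are made up for illustration, and the thresholds are assumptions: BMI of 30 or more is a common definition of obesity, while the LDL cut-off is a placeholder and not clinical guidance.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Patient:
    patient_id: int
    bmi: float
    has_diabetes: bool
    ldl_mg_dl: float


# Assumed thresholds for illustration only (not clinical advice).
OBESE_BMI = 30.0
HIGH_LDL_MG_DL = 160.0

# Hypothetical patient records.
patients: List[Patient] = [
    Patient(1, 32.5, True, 170.0),
    Patient(2, 24.0, False, 110.0),
    Patient(3, 35.1, False, 165.0),
    Patient(4, 31.0, True, 120.0),
]

# Step 1: measure the number of patients who are clinically obese.
obese = [p for p in patients if p.bmi >= OBESE_BMI]

# Step 2: add filters for diabetes and LDL level to decide where to focus.
focus = [p for p in obese if p.has_diabetes and p.ldl_mg_dl >= HIGH_LDL_MG_DL]

print(len(obese))                      # 3
print([p.patient_id for p in focus])   # [1]
```

Each added filter narrows the group, which is the "laser-like focus" the text describes; the same pattern applies to customers, machines, or any other target group.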

Big Data Analytics Applications:

1. Transportation:
Big Data powers the GPS smartphone applications most of us depend on to get from place to
place in the least amount of time. GPS data sources include satellite images and government
agencies.

Airplanes generate enormous volumes of data, on the order of 1,000 gigabytes for transatlantic
flights. Aviation analytics systems ingest all of this to analyze fuel efficiency, passenger and
cargo weights, and weather conditions, with a view toward optimizing safety and energy
consumption.

Big Data simplifies and streamlines transportation through:

• Congestion management and traffic control
Thanks to Big Data analytics, Google Maps can now tell you the least traffic-prone route
to any destination.

• Route planning
Different itineraries can be compared in terms of user needs, fuel consumption, and
other factors to plan for maximum efficiency.

• Traffic safety
Real-time processing and predictive analytics are used to pinpoint accident-prone areas.
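
Finding the least traffic-prone route can be viewed as a shortest-path problem in which each road segment carries a cost such as the expected travel time including congestion. The sketch below uses Dijkstra's algorithm on a tiny made-up road network; it is a textbook illustration under that assumption and does not describe how Google Maps or any other product actually computes routes.

```python
import heapq
from typing import Dict, List, Tuple

Graph = Dict[str, Dict[str, float]]


def least_cost_route(graph: Graph, start: str, goal: str) -> Tuple[float, List[str]]:
    """Dijkstra's algorithm over non-negative edge costs (minutes of travel)."""
    queue: List[Tuple[float, str, List[str]]] = [(0.0, start, [start])]
    best: Dict[str, float] = {}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in best and best[node] <= cost:
            continue  # already reached this node at least as cheaply
        best[node] = cost
        for neighbour, minutes in graph.get(node, {}).items():
            heapq.heappush(queue, (cost + minutes, neighbour, path + [neighbour]))
    return float("inf"), []  # goal is unreachable


# Hypothetical road network: values are travel minutes including congestion.
roads: Graph = {
    "A": {"B": 5.0, "C": 2.0},
    "B": {"D": 4.0},
    "C": {"B": 1.0, "D": 9.0},
    "D": {},
}
print(least_cost_route(roads, "A", "D"))  # (7.0, ['A', 'C', 'B', 'D'])
```

When live traffic data changes the edge costs, rerunning the search yields an updated route, which is where real-time data feeds come in.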

2. Advertising and Marketing:

Ads have always been targeted towards specific consumer segments. In the past, marketers
have employed TV and radio preferences, survey responses, and focus groups to try to
ascertain people’s likely responses to campaigns. At best, these methods amounted to
educated guesswork.

Today, advertisers buy or gather huge quantities of data to identify what consumers actually
click on, search for, and “like.” Marketing campaigns are also monitored for effectiveness using
click-through rates, views, and other precise metrics.

For example, Amazon accumulates massive data stores on the purchases, delivery methods,
and payment preferences of its millions of customers. The company then sells ad placements
that can be highly targeted to very specific segments and subgroups.

3. Banking and Financial Services:


The financial industry puts Big Data and analytics to highly productive use, for:

• Fraud detection:
Banks monitor credit cardholders’ purchasing patterns and other activity to flag
atypical movements and anomalies that may signal fraudulent transactions.

• Risk management:
Big Data analytics enable banks to monitor and report on operational processes,
KPIs, and employee activities.

• Customer relationship optimization:
Financial institutions analyse data from website usage and transactions to better
understand how to convert prospects to customers and incentivize greater use of
various financial products.

• Personalized marketing:
Banks use Big Data to construct rich profiles of individual customer lifestyles,
preferences, and goals, which are then utilized for micro-targeted marketing
initiatives.
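
The fraud detection idea above, flagging atypical movements against a cardholder's usual pattern, can be sketched with a simple z-score rule. This is a minimal statistical illustration, not a description of how banks actually detect fraud; the function name `flag_atypical`, the threshold of 3 standard deviations, and the transaction amounts are assumptions made up for this example.

```python
from statistics import mean, pstdev
from typing import List


def flag_atypical(history: List[float], new_amounts: List[float],
                  threshold: float = 3.0) -> List[float]:
    """Return new amounts more than `threshold` standard deviations
    away from the mean of the cardholder's historical amounts."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # No variation in history: anything different is atypical.
        return [a for a in new_amounts if a != mu]
    return [a for a in new_amounts if abs(a - mu) / sigma > threshold]


# Hypothetical past purchase amounts for one cardholder.
history = [20.0, 25.0, 22.0, 30.0, 18.0, 27.0, 24.0, 26.0]
print(flag_atypical(history, [23.0, 950.0, 31.0]))  # [950.0]
```

Production systems combine many more signals (location, merchant, timing) and learned models, but the principle of comparing new activity with an established baseline is the same.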

4. Government:

Government agencies collect voluminous quantities of data, but many, especially at the
local level, don’t employ modern data mining and analytics techniques to extract real
value from it.

Examples of agencies that do include the IRS and the Social Security Administration,
which use data analysis to identify tax fraud and fraudulent disability claims. The FBI
and SEC apply Big Data strategies to monitor markets in their quest to detect criminal
business activities. For years now, the Federal Housing Administration has been using Big
Data analytics to forecast mortgage defaults and repayment rates.

The Centers for Disease Control tracks the spread of infectious illnesses using data
from social media, and the FDA deploys Big Data techniques across testing labs to
investigate patterns of foodborne illness. The U.S. Department of Agriculture supports
agribusiness and ranching by developing Big Data-driven technologies.

Military agencies, with expert assistance from a sizable ecosystem of defense
contractors, make sophisticated and extensive use of data-driven insights for domestic
intelligence, foreign surveillance, and cyber security.

5. Media and Entertainment:


The entertainment industry harnesses Big Data to glean insights from customer
reviews, predict audience interests and preferences, optimize programming schedules,
and target marketing campaigns.

Two conspicuous examples are Amazon Prime, which uses Big Data analytics to
recommend programming for individual users, and Spotify, which does the same to
offer personalized music suggestions.

6. Meteorology:
Weather satellites and sensors all over the world collect large amounts of data for
tracking environmental conditions. Meteorologists use Big Data to:

• Study natural disaster patterns
• Prepare weather forecasts
• Understand the impact of global warming
• Predict the availability of drinking water in various world regions
• Provide early warning of impending crises such as hurricanes and tsunamis



7. Healthcare:

Big Data is slowly but surely making a major impact on the huge healthcare industry.
Wearable devices and sensors collect patient data which is then fed in real-time to
individuals’ electronic health records. Providers and practice organizations are now
using Big Data for a number of purposes, including these:

• Prediction of epidemic outbreaks
• Early symptom detection to avoid preventable diseases
• Electronic health records
• Real-time alerting
• Enhancing patient engagement
• Prediction and prevention of serious medical conditions
• Strategic planning
• Research acceleration
• Telemedicine
• Enhanced analysis of medical images

8. Cyber security:

While Big Data can expose businesses to a greater risk of cyber attacks, the same data
stores can be used to prevent and counteract online crime through the power of
machine learning and analytics. Historical data analysis can yield intelligence to create
more effective threat controls. And machine learning can warn businesses when
deviations from normal patterns and sequences occur, so that effective
countermeasures can be taken against threats such as ransomware attacks, malicious
insider programs, and attempts at unauthorized access.

After a company has suffered an intrusion or data theft, post-attack analysis can
uncover the methods used, and machine learning can then be deployed to devise
safeguards that will foil similar attempts in the future.

9. Education:
Administrators, faculty, and stakeholders are embracing Big Data to help improve their
curricula, attract the best talent, and optimize the student experience. Examples
include:

• Customizing curricula:
Big Data enables academic programs to be tailored to the needs of
individual students, often drawing on a combination of online learning,
traditional on-site classes, and independent study.

• Reducing dropout rates:
Predictive analytics give educational institutions insights on student
results, responses to proposed programs of study, and input on how
students fare in the job market after graduation.

• Improving student outcomes:
Analysing students’ personal “data trails” can provide a better
understanding of their learning styles and behaviours, and be used to
create an optimal learning environment.

• Targeted international recruiting:
Big Data analysis helps institutions more accurately predict applicants’
likely success. Conversely, it aids international students in pinpointing the
schools best matched to their academic goals and most likely to admit
them.

Big data analytics tools


Harnessing all of that data requires tools. Thankfully, technology has advanced
so that many intuitive software systems are available for data analysts to use.

• Hadoop: An open-source framework that stores and processes big data sets.
Hadoop can handle and analyse structured and unstructured data.
• Spark: An open-source cluster computing framework for real-time processing
and data analysis.
• Data integration software: Programs that allow big data to be streamlined
across different platforms, such as MongoDB, Apache Hadoop, and Amazon
EMR.
• Stream analytics tools: Systems that filter, aggregate, and analyse data that
might be stored in different platforms and formats, such as Kafka.
• Distributed storage: Databases that can split data across multiple servers and
can identify lost or corrupt data, such as Cassandra.
• Predictive analytics hardware and software: Systems that process large
amounts of complex data, using machine learning and algorithms to predict
future outcomes, for uses such as fraud detection, marketing, and risk assessment.
• Data mining tools: Programs that allow users to search within structured and
unstructured big data.
• NoSQL databases: Non-relational data management systems ideal for dealing
with raw and unstructured data.
• Data warehouses: Storage for large amounts of data collected from many
different sources, typically using predefined schemas.
• HEAVY.AI: Interactive visual analytics platform that can process massive multi-
source datasets in milliseconds.
• Apache Kafka: Scalable messaging system that lets users publish and
consume large numbers of messages in real time by subscription.
• HBase: Column-oriented key/value data store that runs on the Hadoop
Distributed File System.
• Hive: Open source data warehouse system for analyzing data sets in Hadoop
files.
• MapReduce: Software framework for processing massive amounts of
unstructured data in parallel across a distributed cluster.
• Pig: Open source technology for parallel programming of MapReduce jobs on
Hadoop clusters.
• YARN: Cluster management technology in second-generation Hadoop.
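
To give a feel for the MapReduce model listed above, the following sketch emulates its three phases (map, shuffle, reduce) for a word count in plain Python on one machine. It is only a conceptual illustration and does not use the Hadoop API; the function names and the sample documents are assumptions made up for this example. In a real cluster, the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


# Hypothetical input documents.
documents = ["big data is big", "data is everywhere"]
mapped = (pair for doc in documents for pair in map_phase(doc))
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Because each map call looks at one document and each reduce call at one key, the work splits naturally across many machines.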
