
Unit 1


UNIT-1

Introduction to Big Data


Traditional Data vs Big Data:
Traditional Data | Big Data
Traditional data is generated at the enterprise level. | Big data is generated outside the enterprise level.
Its volume ranges from gigabytes to terabytes. | Its volume ranges from petabytes to exabytes or zettabytes.
Traditional database systems deal with structured data. | Big data systems deal with structured, semi-structured, and unstructured data.
Traditional data is generated per hour, per day, or less often. | Big data is generated far more frequently, often every second.
The traditional data source is centralized and is managed in centralized form. | The big data source is distributed and is managed in distributed form.
Data integration is relatively easy. | Data integration is comparatively difficult.
Its data sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data, etc. | Its data sources include social media, device data, sensor data, video, images, audio, etc.
Traditional database tools are sufficient to perform any database operation. | Special kinds of database tools are required to perform any database or schema-based operation.
Traditional data is stable, with known interrelationships. | Big data is not stable, and the relationships within it are often unknown.
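
To give a feel for the volume ranges in the comparison above, the following minimal sketch converts between decimal (SI) storage units, where each step up is a factor of 1000. The `UNITS` list and the `convert` helper are illustrative names assumed here; they are not part of the original notes.

```python
# Decimal (SI) storage units; each step up is a factor of 1000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def convert(value, from_unit, to_unit):
    """Convert a size between decimal storage units (hypothetical helper)."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    return value * 1000 ** steps

print(convert(1, "PB", "TB"))  # 1000
print(convert(1, "ZB", "GB"))  # 1000000000000
```

One petabyte is therefore a thousand terabytes, and one zettabyte is a trillion gigabytes, which is why tools built for gigabyte-scale data do not carry over directly.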

Big Data vs Data Warehouse:


Big Data | Data Warehouse
Big data is data in enormous volume, to which various technologies can be applied. | A data warehouse is a collection of historical data from different operations in an enterprise.
Big data is a technology to store and manage large amounts of data. | A data warehouse is an architecture used to organize data.
It takes structured, unstructured, or semi-structured data as input. | It takes only structured data as input.
Big data processing uses a distributed file system. | A data warehouse does not use a distributed file system for processing.
Big data systems do not necessarily rely on SQL queries to fetch data. | In a data warehouse, SQL queries are used to fetch data from relational databases.
Apache Hadoop can be used to handle enormous amounts of data. | A traditional data warehouse is not designed to handle data at that scale.
When new data is added, the changes are stored in the form of a file, which is represented by a table. | When new data is added, the changes do not directly impact the data warehouse.
Big data does not require management techniques as rigorous as those of a data warehouse. | A data warehouse requires more efficient management techniques, as the data is collected from different departments of the enterprise.

Big Data vs Cloud Computing:


Big Data | Cloud Computing
Big data refers to data that is huge in size and growing rapidly over time. | Cloud computing refers to the on-demand availability of computing resources over the internet.
Big data includes structured, unstructured, and semi-structured data. | Cloud computing services include Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).
Volume, Velocity, Variety, Veracity, and Value of data are considered the 5 most important characteristics of big data. | On-demand availability of IT resources, broad network access, resource pooling, elasticity, and measured service are considered the main characteristics of cloud computing.
The purpose of big data is to organize large volumes of data, extract useful information from them, and use that information to improve the business. | The purpose of cloud computing is to store and process data in the cloud, or to use remote IT services without physically installing any IT resources.
Distributed computing is used for analyzing the data and extracting useful information. | The internet is used to obtain cloud-based services from different cloud vendors.
Big data management offers a centralized platform, provision for backup and recovery, and low maintenance cost. | Cloud computing services are cost effective, scalable, and robust.
Some of the challenges of big data are variety of data, data storage and integration, data processing, and resource management. | Some of the challenges of cloud computing are availability, transformation, security concerns, and the charging model.
Big data refers to huge volumes of data, their management, and the extraction of useful information. | Cloud computing refers to remote IT resources and different internet service models.
Sources that generate big data include social media data, e-commerce data, weather station data, IoT sensor data, etc. | Vendors that provide cloud computing services include Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform, IBM Cloud Services, etc.

Big Data vs Data Analytics:


Big Data | Data Analytics
Big data refers to a large volume of data that is also increasing at a rapid speed over time. | Data analytics refers to the process of analyzing raw data and drawing conclusions from that information.
Big data includes three types of data: structured, unstructured, and semi-structured. | Descriptive, diagnostic, predictive, and prescriptive are the four basic types of data analytics.
The purpose of big data is to store huge volumes of data and to process them. | The purpose of data analytics is to analyze raw data and derive insights from it.
Parallel computing and other complex automation tools are used to handle big data. | Predictive and statistical modelling with relatively simple tools are used for data analytics.
Big data operations are handled by big data professionals. | Data analytics is performed by skilled data analysts.
Big data analysts need knowledge of programming, NoSQL databases, distributed systems, and frameworks. | Data analysts need knowledge of programming, statistics, and mathematics.
Big data is mainly found in financial services, media and entertainment, communication, banking, information technology, retail, etc. | Data analytics is mainly used in business for risk detection and management, science, travel, health care, gaming, energy management, and information technology.
It is considered the first step, since big data is first generated and then stored. | It is considered the second step, since it performs analysis on the large data sets.
Some big data tools are Apache Hadoop, Cloudera Distribution for Hadoop, Cassandra, MongoDB, etc. | Some data analytics tools are Tableau Public, Python, Apache Spark, Excel, RapidMiner, KNIME, etc.
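
The comparison above notes that parallel computing is used to handle big data. A minimal sketch of the idea, assuming a MapReduce-style word count: each partition is counted independently (the step that a framework such as Hadoop could run on separate nodes), and the partial results are then merged. The partitions and function names below are hypothetical, and the partitions are processed sequentially here for simplicity.

```python
from collections import Counter
from functools import reduce

def map_phase(chunk):
    """Count words in one partition of the data (the 'map' step)."""
    return Counter(chunk.lower().split())

def reduce_phase(left, right):
    """Merge two partial counts into one (the 'reduce' step)."""
    return left + right

# Hypothetical partitions; on a real cluster each would live on a different node.
partitions = [
    "big data needs parallel processing",
    "parallel processing splits big jobs",
    "data analytics finds insights in data",
]

# Each partition is processed independently, so this step could run in parallel.
partial_counts = [map_phase(p) for p in partitions]
totals = reduce(reduce_phase, partial_counts, Counter())

print(totals["data"])      # 3
print(totals["parallel"])  # 2
```

Because no partition depends on another during the map step, adding more machines lets more partitions be counted at the same time.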

Data Science vs Big Data:


Data Science | Big Data
Data Science is an area of study. | Big Data is a technique to collect, maintain, and process huge amounts of information.
It is about the collection, processing, analysis, and use of data in various operations. It is more conceptual. | It is about extracting vital and valuable information from a huge amount of data.
It is a field of study just like Computer Science, Applied Statistics, or Applied Mathematics. | It is a technique for tracking and discovering trends in complex data sets.
The goal is to build data-dominant products for a venture. | The goal is to make data more vital and usable, i.e. by extracting only the important information from huge data within existing traditional aspects.
Tools mainly used in Data Science include SAS, R, Python, etc. | Tools mostly used in Big Data include Hadoop, Spark, Flink, etc.
It is a superset of Big Data, as data science consists of data scraping, cleaning, visualization, statistics, and many more techniques. | It is a subset of Data Science, since its mining activities form one stage of the data science pipeline.
It is mainly used for scientific purposes. | It is mainly used for business purposes and customer satisfaction.
It broadly focuses on the science of the data. | It is more involved with the processes of handling voluminous data.

Big Data and its Importance:


Big data refers to the massive volume of structured and unstructured data
generated by businesses, users, sensors, and other sources.

The importance of big data lies in its potential to provide valuable insights and
benefits across various sectors:

1. Business Insights: Big data analytics can help businesses analyze customer
behavior, market trends, and operational patterns to make informed
decisions. It enables organizations to identify new opportunities, optimize
processes, and improve overall performance.
2. Innovation: Big data serves as a foundation for innovation in fields such as
healthcare, finance, transportation, and retail. Analyzing large datasets can
lead to the development of new products, services, and business models.
3. Personalization: With big data analytics, companies can personalize their
products and services based on individual preferences and behavior. This
personalized approach enhances customer satisfaction and loyalty.
4. Predictive Analytics: Big data analytics allows organizations to predict
future trends and outcomes by analyzing historical data patterns. This
capability is invaluable for risk management, forecasting, and strategic
planning.
5. Scientific Research: In fields like genomics, astronomy, climate science,
and particle physics, big data plays a crucial role in analyzing complex
datasets, uncovering patterns, and advancing scientific knowledge.
6. Healthcare Improvements: Big data analytics in healthcare can improve
patient outcomes, optimize resource allocation, and facilitate medical
research. It enables healthcare providers to identify trends, diagnose diseases
earlier, and personalize treatment plans.
7. Social Good: Big data can be leveraged to address social challenges such as
poverty, disease outbreaks, and environmental sustainability. By analyzing
large datasets, organizations can identify areas of need, allocate resources
efficiently, and implement targeted interventions.

Four V’s of Big Data:

Big data is characterized by its volume, velocity, variety, and veracity, often
referred to as the "4 Vs" of big data:

1. Volume: Big data involves vast amounts of data. Traditional data
management tools may not be capable of processing such large volumes
efficiently.
2. Velocity: Data streams in at unprecedented speeds. Social media updates,
sensor data, and other real-time information sources contribute to this
velocity.
3. Variety: Big data comes in various formats, including structured data (like
numbers and dates) and unstructured data (like text, images, and videos).
Managing and analyzing this diverse data is a significant challenge.
4. Veracity: Veracity refers to the quality and reliability of the data. With big
data, there's often uncertainty about the accuracy and trustworthiness of the
information.
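
As a small illustration of variety, the sketch below pulls the same kind of value (an order amount) out of structured, semi-structured, and unstructured input. The sample records, field names, and the regular expression are assumptions made up for this example.

```python
import csv
import io
import json
import re

# Structured: fixed columns, e.g. a CSV export from a relational table.
structured = io.StringIO("order_id,amount\n101,250.00\n")
row = next(csv.DictReader(structured))

# Semi-structured: self-describing but flexible, e.g. JSON with optional fields.
semi = json.loads('{"order_id": 102, "amount": 99.5, "tags": ["gift"]}')

# Unstructured: free text; the amount has to be extracted with a pattern.
text = "Customer says order 103 for 40.25 dollars arrived late."
match = re.search(r"for (\d+\.\d+) dollars", text)

amounts = [float(row["amount"]), semi["amount"], float(match.group(1))]
print(amounts)  # [250.0, 99.5, 40.25]
```

Each format needs its own parsing step before the values can be analyzed together, which is one reason variety makes big data harder to manage.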

Drivers for Big Data:

Several factors act as drivers for the proliferation and importance of big data in
contemporary society:

1. Data Growth: The sheer volume of data generated globally is growing at an
exponential rate. This growth is fueled by the increasing digitization of
processes, the rise of social media, the Internet of Things (IoT), and other
technological advancements.
2. Technological Advancements: Advances in storage, processing power, and
data analytics technologies have made it feasible to capture, store, and
analyze massive datasets in real-time. Technologies like cloud computing,
distributed computing frameworks (e.g., Hadoop, Spark), and advanced
analytics algorithms have democratized access to big data capabilities.
3. Internet of Things (IoT): The proliferation of connected devices, sensors,
and smart systems has led to the generation of vast amounts of data. IoT
devices collect data on everything from temperature and humidity to user
behavior and machine performance, contributing to the expansion of big data
ecosystems.
4. Social Media and Digital Platforms: Social media platforms, e-commerce
websites, and digital platforms generate enormous volumes of user-
generated content, interactions, and transactional data. Analyzing this data
provides valuable insights into consumer behavior, preferences, and trends.
5. Business Imperatives: In today's highly competitive business landscape,
organizations recognize the value of data-driven decision-making. Big data
analytics enables businesses to optimize operations, improve customer
experiences, drive innovation, and gain a competitive edge in their
respective industries.
6. Regulatory Requirements: Regulatory mandates and compliance standards,
such as GDPR (General Data Protection Regulation) and HIPAA (Health
Insurance Portability and Accountability Act), necessitate robust data
management practices. Organizations must collect, store, and process data in
compliance with legal and regulatory requirements, driving the adoption of
big data technologies and governance frameworks.
7. Demand for Personalization: Consumers increasingly expect personalized
products, services, and experiences. Big data analytics enables organizations
to analyze vast datasets and tailor offerings to individual preferences,
behaviors, and demographics.
8. Scientific and Research Applications: In scientific research, big data
analytics facilitates the analysis of complex datasets, simulation modeling,
and hypothesis testing. Fields such as genomics, climate science, astronomy,
and particle physics rely on big data technologies to advance knowledge and
drive discoveries.
9. Security and Fraud Detection: Big data analytics plays a crucial role in
cybersecurity and fraud detection efforts. By analyzing patterns, anomalies,
and trends across large datasets, organizations can detect security threats,
identify fraudulent activities, and mitigate risks in real-time.
10. Healthcare Innovations: In healthcare, big data analytics supports clinical
decision-making, patient outcomes research, population health management,
and drug discovery efforts. By analyzing electronic health records, medical
imaging data, and genomic information, healthcare providers can improve
diagnoses, treatment outcomes, and public health initiatives.
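
The fraud detection driver in the list above rests on spotting anomalies in large datasets. One simple way to sketch this, assuming a z-score rule (a value is flagged when it lies more than a chosen number of standard deviations from the mean), is shown below. The transaction amounts, the `flag_anomalies` name, and the threshold of 2.0 are illustrative assumptions; production systems typically use more robust methods.

```python
from statistics import mean, pstdev

def flag_anomalies(values, threshold=2.0):
    """Return values whose z-score exceeds the threshold (hypothetical helper)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts with one obvious outlier.
amounts = [20, 22, 19, 21, 23, 20, 18, 22, 21, 500]
print(flag_anomalies(amounts))  # [500]
```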

Introduction to Big Data Analytics:

Big data analytics is the process of examining large and complex datasets to
uncover hidden patterns, correlations, trends, and insights that can inform decision-
making, optimize processes, and drive innovation. It involves the use of advanced
analytical techniques, algorithms, and tools to extract meaningful information from
vast volumes of structured and unstructured data.

The key components of big data analytics include:

1. Data Collection: Big data analytics begins with the collection of diverse
datasets from various sources, including transactional systems, social media
platforms, sensors, IoT devices, and other sources. This data may include
structured data (e.g., databases, spreadsheets) and unstructured data (e.g.,
text, images, videos).
2. Data Storage: Once collected, the data is stored in scalable and distributed
storage systems that can handle the volume, velocity, and variety of big data.
Technologies like Hadoop Distributed File System (HDFS), NoSQL
databases, and cloud storage platforms are commonly used for storing big
data.
3. Data Processing: Big data processing involves the transformation, cleaning,
and preprocessing of raw data to prepare it for analysis. This may include
data integration, data cleansing, and data normalization techniques to ensure
data quality and consistency.
4. Data Analysis: The core of big data analytics involves applying various
analytical techniques and algorithms to analyze large datasets. This may
include descriptive analytics to summarize and visualize data, diagnostic
analytics to understand relationships and causality, predictive analytics to
forecast future trends and outcomes, and prescriptive analytics to
recommend actions and strategies.
5. Data Visualization: Data visualization tools and techniques are used to
represent complex datasets visually, making it easier for users to understand
and interpret the insights derived from big data analytics. Visualization
techniques include charts, graphs, heatmaps, and interactive dashboards.
6. Machine Learning and AI: Big data analytics often leverages machine
learning algorithms and artificial intelligence techniques to automate data
analysis, identify patterns, and make predictions based on historical data.
Machine learning models can be trained to classify data, detect anomalies,
and perform complex tasks without explicit programming.
7. Scalability and Performance: Big data analytics platforms are designed to
be highly scalable and performant, capable of processing and analyzing
massive datasets efficiently. Distributed computing frameworks like Apache
Spark and Hadoop enable parallel processing across clusters of nodes to
achieve high performance and scalability.
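
The collection, processing, and analysis components above can be sketched on a toy scale as follows. The daily sales figures are made up for illustration, and the least-squares line is only one simple choice of predictive model, assumed here for brevity.

```python
from statistics import mean

# Collection: hypothetical (day, amount) readings; None marks a missing value.
raw = [(1, 100.0), (2, None), (3, 120.0), (4, 130.0), (5, 150.0)]

# Processing: drop incomplete records before analysis.
clean = [(day, amount) for day, amount in raw if amount is not None]
days = [day for day, _ in clean]
sales = [amount for _, amount in clean]

# Descriptive analytics: summarize what happened.
average_sales = mean(sales)

# Predictive analytics: fit a least-squares line and extrapolate to day 6.
day_mean, sales_mean = mean(days), mean(sales)
slope = sum((d - day_mean) * (s - sales_mean) for d, s in clean) / sum(
    (d - day_mean) ** 2 for d in days
)
intercept = sales_mean - slope * day_mean
forecast_day6 = intercept + slope * 6

print(average_sales)  # 125.0
print(forecast_day6)  # 158.0
```

At big data scale the same stages remain, but each one is carried out by distributed storage and processing systems rather than in-memory lists.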

Big Data Analytics Applications:

Applications of big data analytics span various industries and domains,
including:

• Business and Finance: Customer segmentation, market basket analysis, risk
management, fraud detection, and financial forecasting.
• Healthcare: Clinical decision support, patient outcomes research, disease
surveillance, and drug discovery.
• Retail and E-commerce: Customer behavior analysis, personalized
marketing, supply chain optimization, and inventory management.
• Manufacturing and Supply Chain: Predictive maintenance, quality
control, demand forecasting, and supply chain visibility.
• Telecommunications: Network optimization, customer churn prediction,
and sentiment analysis.
• Government and Public Sector: Smart city initiatives, public safety
analytics, traffic management, and social services optimization.
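
Market basket analysis, listed above under business and finance, looks for items that are frequently bought together. A minimal sketch, assuming the common "support" measure (the fraction of baskets that contain a given pair of items), is shown below; the baskets are made-up examples.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "butter"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support of a pair = share of all baskets that contain both items.
support = {pair: count / len(baskets) for pair, count in pair_counts.items()}

print(support[("bread", "milk")])    # 0.5
print(support[("bread", "butter")])  # 0.5
```

Pairs with high support are candidates for joint promotions or shelf placement; on real retail data the same counting is spread across many machines.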
