
Module 1 and NoSQL

Uploaded by

ATHARVA THAKUR
Copyright
© © All Rights Reserved

Big Data

• Big data is an evolving term that describes any voluminous amount of
structured, semi-structured and unstructured data that has the potential to
be mined for information.

• It can’t be processed or analyzed using traditional processes or tools.

By Santosh Tamboli Sir...


1
http://www.youtube.com/@santoshtamboli
Characteristics

Volume
• Describes the size of the data relative to the processing capability.

• The scale is no longer terabytes but zettabytes or yottabytes.

Overcoming the volume issue:

• Two options exist today: Apache Hadoop based solutions and
massively parallel processing databases such as CalPont, EMC
GreenPlum, EXASOL, HP Vertica, IBM Netezza, Kognitio, ParAccel, and
Teradata Kickfire.
Velocity
• Describes the frequency at which data is generated, captured, and
shared.

• Affects the ability to parse text, detect sentiment, and identify new
patterns.

• Key technologies that address velocity include stream processing
and complex event processing.
• NoSQL databases are used when relational approaches no longer
make sense.
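
Stream processing, mentioned above, handles events as they arrive instead of storing everything first. The following is a minimal sketch, assuming events carry numeric timestamps that arrive in non-decreasing order; the class name `SlidingWindowCounter` is hypothetical and not taken from any library.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events seen within the last `window_seconds` seconds.

    Assumption for this sketch: timestamps arrive in non-decreasing order.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def _evict(self, now):
        # Drop events that have fallen out of the window ending at `now`.
        while self.timestamps and self.timestamps[0] <= now - self.window_seconds:
            self.timestamps.popleft()

    def add(self, timestamp):
        self.timestamps.append(timestamp)
        self._evict(timestamp)

    def count(self, now):
        self._evict(now)
        return len(self.timestamps)


counter = SlidingWindowCounter(window_seconds=10)
for event_time in [1, 2, 3, 12]:
    counter.add(event_time)
recent_events = counter.count(now=12)
```

Only the events still inside the window are kept in memory, which is the property that lets this style of processing keep up with fast data.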
Variety
• Data from social, machine-to-machine, and mobile sources adds new
data types to traditional transactional data.

• New types include content, geo-spatial, hardware data points, location based, log
data, machine data, metrics, mobile, physical data points, process, RFID, search,
sentiment, streaming data, social, text, and web.

• The addition of unstructured data such as speech, text, image, and video increasingly
complicates the ability to categorize data.

• Some technologies that deal with unstructured data include data mining, text
analytics, and noisy text analytics.

Value

• Today data is being produced in large volumes, but just collecting it
is of no use. Instead, we have to look for data from which business insights
can be generated, which adds “value” to the company.

• This is where big data analytics comes into the picture. Some companies
have invested in data and data storage infrastructure, but they
fail to understand that aggregating data doesn’t equal adding value.

• Data analytics helps to derive useful insights from the collected data. These
insights, in turn, add value to the decision-making process.

Validity / Veracity

• The validity and veracity of big data can be described as the assurance of quality
or credibility of the collected data.
• Since big data is vast and involves so many data sources, it is possible that
not all the collected data is accurate and of good quality.
• Hence, when processing big data sets, it is important to check the validity of the
data before proceeding with further analysis.
• Questions such as “Can you trust the data that you have collected?” and “Is the data
reliable enough?” need to be answered at this stage.
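
A validity check of the kind described above can be sketched as a filter that runs before analysis. This is an illustration only: the field names `name` and `age`, the allowed age range, and the function names are assumptions made for the example, not rules from the slides.

```python
def validate_record(record):
    """Return a list of problems found in one record (an empty list means valid)."""
    problems = []
    name = record.get("name")
    if not isinstance(name, str) or not name.strip():
        problems.append("name is missing or empty")
    age = record.get("age")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if not isinstance(age, int) or isinstance(age, bool) or not 0 <= age <= 120:
        problems.append("age is missing or out of range")
    return problems


def split_valid(records):
    """Separate records that pass validation from those that do not."""
    valid, rejected = [], []
    for record in records:
        problems = validate_record(record)
        if problems:
            rejected.append((record, problems))
        else:
            valid.append(record)
    return valid, rejected


records = [
    {"name": "Asha", "age": 30},
    {"name": "", "age": 30},
    {"name": "Ravi", "age": -5},
]
valid, rejected = split_valid(records)
```

Keeping the rejected records together with the reasons makes it possible to judge how trustworthy a source is, rather than silently dropping bad rows.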

Types of Big Data
Structured and Unstructured

Structured data
• Refers to data that has a defined length and format.

Types of Structured data
Machine generated:
• i. Sensor data: Examples include radio frequency ID (RFID) tags, smart
meters, medical devices, and Global Positioning System (GPS) data.

• ii. Web log data: When servers, applications, networks, and so on operate,
they capture all kinds of data about their activity.

• iii. Point-of-sale data: Captured when the cashier scans the bar code of any
product that you are purchasing.

• iv. Financial data: Structured fields such as the company symbol and dollar value.

Human generated data

• Generated by humans interacting with computers.

Types:
• i. Input data: data that a human might input into a computer, such as
name, age, income, non-free-form survey responses, etc.

• ii. Click-stream data: Data is generated every time you click a
link on a website.

• iii. Gaming-related data: Every move you make in a game can be
recorded.
Unstructured data
• Does not follow any predefined format.
Types:

a. Machine generated:

• i. Satellite images: includes weather data or the data that the government captures in its satellite
surveillance imagery.

• ii. Scientific data: includes seismic imagery, atmospheric data and high energy physics.

• iii. Photographs and video: includes security, surveillance, and traffic video.

• iv. Radar or sonar data: includes vehicular, meteorological, and oceanographic data.

b. Human generated:
Types:
i. Text internal to your company: All the text within documents, logs,
survey results, and e-mails.
ii. Social media data: This data is generated from the social media
platforms such as YouTube, Facebook, Twitter, LinkedIn, and Flickr.
iii. Mobile data: This includes data such as text messages and location
information.
iv. Website content: This comes from any site delivering unstructured
content, like YouTube, Flickr, or Instagram.
Traditional Vs Big data approach

Big Data challenges
Dealing with data growth
Shortage of Skilled People
Recruiting and retaining big data talent
Collecting and Integrating Massive and Diverse Datasets
Validating data
Maintaining Data Integrity, Security, and Privacy
Picking the Right NoSQL Tools
Real-time can be Complex

Applications of Big data
Education Industry
Healthcare Industry
Government Sector
Media and Entertainment Industry
Weather Patterns
Transportation Industry
Banking Sector

What is NoSQL
• NoSQL is a set of concepts that allows the rapid and efficient
processing of data sets with a focus on performance, reliability, and
agility.
• It’s more than rows in tables—NoSQL systems store and retrieve data
from many formats: key-value stores, graph databases, column-family
(Bigtable) stores, document stores, and even rows in tables.
• It’s free of joins—NoSQL systems allow you to extract your data using
simple interfaces without joins.
• It’s schema-free—NoSQL systems allow you to drag-and-drop your
data into a folder and then query it without creating an entity-
relational model.
• It works on many processors—NoSQL systems allow you to store
your database on multiple processors and maintain high-speed
performance.
• It uses shared-nothing commodity computers—Most (but not all)
NoSQL systems leverage low-cost commodity processors that have
separate RAM and disk.
• It supports linear scalability—When you add more processors, you
get a consistent increase in performance.
• It’s innovative—NoSQL offers options to a single way of storing,
retrieving, and manipulating data.
NoSQL Data Architecture Patterns
• Key-value stores
• Graph stores
• Column family stores
• Document stores

Key-value stores

• A key-value store is a simple database that, when presented with a
simple string (the key), returns an arbitrarily large BLOB of data (the
value).

• Key-value stores have no query language; they provide a way to add
key-value pairs to a database and remove them from it.

• A key-value store is like a dictionary. A dictionary has a list of words
and each word has one or more definitions.
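
The dictionary analogy above maps directly onto a small in-memory sketch. The class `KeyValueStore` and its method names `put`, `get`, and `delete` are hypothetical names chosen for this illustration; real products expose their own interfaces.

```python
class KeyValueStore:
    """In-memory key-value store: a string key maps to an opaque value."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The store does not inspect the value; it is treated as a BLOB.
        self._data[key] = value

    def get(self, key):
        # Lookup is only possible by key; there is no query language.
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("user:42:avatar", b"\x89PNG...raw image bytes...")
avatar = store.get("user:42:avatar")
store.delete("user:42:avatar")
missing = store.get("user:42:avatar")
```

Because the value is opaque, the only way to find an item is to know its key, which is what keeps the interface so simple.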

Graph stores

• Graph stores are important in applications that need to analyze
relationships between objects or visit all nodes in a graph in a
particular manner.

• Graph stores are highly optimized to efficiently store graph nodes and
links that allow you to query these graphs.

• Graph databases are useful for any business problem that has
complex relationships between objects, such as social networking,
rules-based engines, and creating mashups.

• A graph store is a system that contains a sequence of nodes and
relationships to create a graph.
• In a key-value store there are two data fields: the key and the value. In
contrast, a graph store has three data fields: nodes, relationships, and
properties.
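
The three fields named above (nodes, relationships, and properties) can be sketched in a few lines. This is a toy in-memory model under assumed names (`GraphStore`, `add_node`, `add_relationship`, `reachable`); it is not the API of any particular graph database.

```python
from collections import deque


class GraphStore:
    """Toy graph store holding nodes, relationships, and properties."""

    def __init__(self):
        self.nodes = {}          # node id -> properties dict
        self.relationships = []  # (source id, type, target id, properties dict)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_relationship(self, source, rel_type, target, **properties):
        self.relationships.append((source, rel_type, target, properties))

    def neighbors(self, node_id, rel_type):
        return [t for s, r, t, _ in self.relationships
                if s == node_id and r == rel_type]

    def reachable(self, start, rel_type):
        """Breadth-first traversal: every node reachable from `start`."""
        seen = {start}
        order = []
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for nxt in self.neighbors(current, rel_type):
                if nxt not in seen:
                    seen.add(nxt)
                    order.append(nxt)
                    queue.append(nxt)
        return order


graph = GraphStore()
graph.add_node("alice", city="Pune")
graph.add_node("bob", city="Mumbai")
graph.add_node("carol", city="Nagpur")
graph.add_relationship("alice", "FOLLOWS", "bob", since=2021)
graph.add_relationship("bob", "FOLLOWS", "carol", since=2022)
graph.add_relationship("carol", "FOLLOWS", "alice", since=2023)
followed = graph.reachable("alice", "FOLLOWS")
```

The traversal follows relationships from node to node, which is the kind of query a social network needs and that a key-value lookup cannot express.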

Column family (Bigtable) stores

• These are important NoSQL data architecture patterns because they can
scale to manage large volumes of data.

• In the MapReduce framework, the map operation has a master node that
breaks up an operation into subparts and distributes each subpart to
another node for processing. Reduce is the step where the master
node collects the results from the other nodes and combines them into the
answer to the original problem.

• Column family stores use row and column identifiers as general-purpose
keys for data lookup. They’re sometimes referred to as data stores rather
than databases.
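
The map and reduce steps described above can be illustrated with a word count that runs in a single process. This is a sketch of the idea only: each call to the hypothetical `map_phase` function stands in for work that a real framework would send to a separate node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def reduce_phase(mapped_results):
    """Reduce step: combine the partial results into one total per word."""
    totals = defaultdict(int)
    for pairs in mapped_results:
        for word, count in pairs:
            totals[word] += count
    return dict(totals)


def word_count(chunks):
    # Each chunk is a subpart of the original problem; in a cluster each
    # map_phase call would run on a different worker node.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(mapped)


counts = word_count(["big data", "Big tables", "data stores"])
```

Because each chunk is mapped independently, adding more workers lets more chunks be processed at the same time.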

• HBase, Hypertable and Cassandra are good examples of systems that
have Bigtable-like interfaces.
• MonetDB, Sybase IQ and Vertica are examples of column-store
systems.
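
The idea that a column family store looks data up by row and column identifiers can be sketched with nested dictionaries. The class `ColumnFamilyStore` and its methods are hypothetical names for this illustration and do not mirror the client API of HBase, Cassandra, or any other product.

```python
class ColumnFamilyStore:
    """Toy store where a value is addressed by (row key, family, column)."""

    def __init__(self):
        self._rows = {}  # row key -> {(family, column): value}

    def put(self, row_key, family, column, value):
        self._rows.setdefault(row_key, {})[(family, column)] = value

    def get(self, row_key, family, column):
        return self._rows.get(row_key, {}).get((family, column))

    def get_family(self, row_key, family):
        """Return all columns of one family for a row."""
        row = self._rows.get(row_key, {})
        return {col: value for (fam, col), value in row.items() if fam == family}


table = ColumnFamilyStore()
table.put("user42", "profile", "name", "Asha")
table.put("user42", "profile", "city", "Pune")
table.put("user42", "activity", "last_login", "2024-01-15")
profile = table.get_family("user42", "profile")
```

Rows do not need the same columns, so a row can gain a new column at any time without a schema change.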

Document stores
• The key-value store and Bigtable values lack a formal structure and aren’t
indexed or searchable.

• Document stores work in the opposite manner: the key may be a simple ID,
but you can get almost any item out of a document store by querying any
value or content within the document.

• A consequence of using a document store is that everything inside a document
is automatically indexed when a new document is added.

• Document stores can tell not only that your search item is in the
document but also the search item’s exact location by using the
document path as shown below:

• Document trees have a single root element. Beneath the root
element there is a sequence of branches, sub-branches and values.

• Each branch has a related path expression that shows you how to
navigate from the root of the tree to any given branch, sub-branch or
value.
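
The path expressions described above can be sketched by walking a nested document and recording the path to each value. The path syntax used here (`/branch/sub-branch` with `[index]` for list items) and the function names are assumptions for the example; real document stores define their own path languages.

```python
def iter_paths(node, path=""):
    """Yield (path, value) for every leaf value in a nested dict/list document."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from iter_paths(child, f"{path}/{key}")
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from iter_paths(child, f"{path}[{index}]")
    else:
        yield path, node


def find_value(document, wanted):
    """Return the path of every place where `wanted` occurs in the document."""
    return [path for path, value in iter_paths(document) if value == wanted]


document = {
    "person": {
        "name": "Asha",
        "address": {"city": "Pune"},
        "phones": ["111", "222"],
    }
}
city_paths = find_value(document, "Pune")
phone_paths = find_value(document, "222")
```

The search reports where the value sits in the tree, not just that the document contains it.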

CAP theorem

The three letters in CAP refer to three desirable properties of
distributed systems with replicated data:
• consistency (among replicated copies)
• availability (of the system for read and write operations)
• partition tolerance (in the face of the nodes in the system being
partitioned by a network fault).

Consistency –
Consistency means that the nodes will have the same copies of a
replicated data item visible for various transactions.
It is a guarantee that every node in a distributed cluster returns the same,
most recent, successful write.
Consistency refers to every client having the same view of the data.

Availability –
Availability means that each read or write request for a data item will
either be processed successfully or will receive a message that the
operation cannot be completed.
Every non-failing node returns a response for all the read and write
requests in a reasonable amount of time.
The key word here is “every”. In simple terms, every node must be able
to respond in a reasonable amount of time.

Partition Tolerance –
Partition tolerance means that the system can continue operating even
if the network connecting the nodes has a fault that results in two or
more partitions, where the nodes in each partition can only
communicate with each other.
That is, the system continues to function and upholds its
consistency guarantees in spite of network partitions.
Distributed systems guaranteeing partition tolerance can gracefully
recover from partitions once the partition heals.
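
The usual reading of the CAP theorem is that while a partition lasts, a system has to give up either consistency or availability. The toy model below illustrates that trade-off with two replicas; the class `TwoNodeCluster` and its `"CP"` and `"AP"` modes are assumptions made for this sketch and do not represent a real replication protocol.

```python
class TwoNodeCluster:
    """Two replicas that either stay consistent (CP) or stay available (AP)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = {"a": {}, "b": {}}
        self.partitioned = False

    def write(self, node, key, value):
        if not self.partitioned:
            # Healthy network: the write is applied to both replicas.
            for data in self.replicas.values():
                data[key] = value
            return True
        if self.mode == "CP":
            # Partitioned and consistency first: refuse the write.
            return False
        # Partitioned and availability first: accept locally, replicas diverge.
        self.replicas[node][key] = value
        return True

    def read(self, node, key):
        return self.replicas[node].get(key)


cp = TwoNodeCluster("CP")
cp.write("a", "x", 1)
cp.partitioned = True
cp_accepted = cp.write("a", "x", 2)

ap = TwoNodeCluster("AP")
ap.write("a", "x", 1)
ap.partitioned = True
ap_accepted = ap.write("a", "x", 2)
```

In the CP cluster both replicas still agree after the partition because the write was rejected; in the AP cluster the write succeeded but the two replicas now hold different values until the partition heals and they are reconciled.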
