Data Analytics Using NoSQL
DHINESHKUMAR S K
Taxonomy of NoSQL
Key-value
Graph database
Document-oriented
Column family
Typical NoSQL architecture
A hashing function maps each key (K) to a server (node)
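The key-to-server mapping above can be sketched in a few lines. This is a minimal illustration, not how any particular NoSQL product does it: the node names and the hash function (a djb2-style string hash) are assumptions chosen for the example.

```javascript
// Hypothetical node names; a real cluster would supply its own list.
const nodes = ["node-0", "node-1", "node-2"];

// Simple deterministic string hash (djb2 variant), for illustration only.
function hashKey(key) {
  let h = 5381;
  for (let i = 0; i < key.length; i++) {
    h = ((h * 33) ^ key.charCodeAt(i)) >>> 0; // keep as unsigned 32-bit
  }
  return h;
}

// The same key always lands on the same node.
function nodeFor(key) {
  return nodes[hashKey(key) % nodes.length];
}

console.log(nodeFor("user:42"));
```

Plain modulo hashing remaps most keys when the number of nodes changes, which is one reason real systems tend to use more elaborate schemes.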
CAP theorem for NoSQL
How it is interpreted:
• You must always give something up: consistency, availability or
tolerance to failure and reconfiguration
Theory of NoSQL: CAP
GIVEN:
• Many nodes
• Nodes contain replicas of partitions of the data
• Consistency (C)
  • All replicas contain the same version of data
  • Client always has the same view of the data (no matter what node)
• Availability (A)
  • System remains operational on failing nodes
  • All clients can always read and write
• Partition tolerance (P)
  • Multiple entry points
  • System remains operational on system split (communication malfunction)
  • System works well across physical network partitions
CAP Theorem: satisfying all three at the same time is impossible
Sharding of data
Geospatial features
How does NoSQL vary from RDBMS?
• Support
  • RDBMS vendors provide a high level of support to clients
    • Stellar reputation
  • NoSQL: open source projects with startups supporting them
    • Reputation not yet established
• Maturity
  • RDBMS is a mature product: means stable and dependable
    • Also means old, no longer cutting edge nor interesting
  • NoSQL systems are still implementing their basic feature set
Drawbacks of NoSQL
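• ACID guarantees of an RDBMS are relaxed: Atomicity, Consistency, Isolation, Durability
• NoSQL systems offer BASE instead:
  • Basically available
  • Soft-state (state of system may change over time)
  • Eventually consistent (asynchronous propagation)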
MongoDB
What is MongoDB?
Developed by 10gen
Founded in 2007
A document-oriented, NoSQL database
Written in C++
Supports APIs (drivers) in many computer languages
JavaScript, Python, Ruby, Perl, Java, Scala, C#, C++, Haskell, Erlang
Functionality of MongoDB
• Dynamic schema
• No DDL
• Document-based database
• Secondary indexes
• Query language via an API
• Atomic writes and fully-consistent reads
• If system configured that way
• Master-slave replication with automated failover (replica sets)
• Built-in horizontal scaling via automated range-based partitioning of data (sharding)
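Range-based partitioning can be pictured as a lookup table from shard-key ranges to shards. The sketch below is an assumption-laden illustration: the chunk boundaries, shard names, and the numeric shard key are all hypothetical, and it ignores chunk splitting and balancing.

```javascript
// Hypothetical chunk ranges on a numeric shard key: [min, max) per shard.
const chunks = [
  { min: -Infinity, max: 100, shard: "shardA" },
  { min: 100, max: 200, shard: "shardB" },
  { min: 200, max: Infinity, shard: "shardC" },
];

// Route a document to the shard whose range contains its shard-key value.
function shardFor(shardKeyValue) {
  const chunk = chunks.find(
    (c) => shardKeyValue >= c.min && shardKeyValue < c.max
  );
  return chunk.shard;
}

console.log(shardFor(5), shardFor(100), shardFor(250));
```

Because neighbouring key values fall into the same range, range queries on the shard key can be sent to a small number of shards.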
Why use MongoDB?
Simple queries
Functionality provided applicable to most web applications
Easy and fast integration of data
No ERD diagram
Not well suited for heavy and complex transaction systems
MongoDB: CAP approach
Focus on Consistency and Partition tolerance
• Consistency (C)
  • all replicas contain the same version of the data
• Availability (A)
  • system remains operational on failing nodes
• Partition tolerance (P)
  • multiple entry points
  • system remains operational on system split
CAP Theorem: satisfying all three at the same time is impossible
MongoDB: Hierarchical Objects
A MongoDB instance may have zero or more ‘databases’.
A database may have zero or more ‘collections’.
A collection may have zero or more ‘documents’.
A document may have one or more ‘fields’.
MongoDB ‘Indexes’ function much like their
RDBMS counterparts.
RDB Concepts to NoSQL
RDBMS          MongoDB
Database       Database
Table, View    Collection
Column         Field
Index          Index
Join           Embedded Document
Foreign Key    Reference
Partition      Shard
Choices made for Design of MongoDB
Document-Oriented storage
Full Index Support
Replication & High Availability
Auto-Sharding
Querying
Fast In-Place Updates
Map/Reduce functionality
Agile
Scalable
Index Functionality
• B+ tree indexes
• An index is automatically created on the _id field (the primary key)
• Users can create other indexes to improve query performance or to enforce unique values for a particular field
• Supports single-field indexes as well as compound indexes
• As in SQL, the order of the fields in a compound index matters
• If you index a field that holds an array value, MongoDB creates separate index entries for every element of the array
• The sparse property of an index ensures that the index only contains entries for documents that have the indexed field (so records that do not have the field defined are ignored)
• If an index is both unique and sparse, the system will reject records that have a duplicate key value but allow records that do not have the indexed field defined
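The unique-plus-sparse rule in the last bullet can be mimicked with a small in-memory model. This is a sketch of the behaviour only, not MongoDB's index implementation; the class name and the `email` field are hypothetical.

```javascript
// Minimal in-memory model of a unique + sparse index on one field.
class SparseUniqueIndex {
  constructor(field) {
    this.field = field;
    this.keys = new Set();
  }

  // Returns true if the document is accepted, false on a duplicate key.
  insert(doc) {
    if (!(this.field in doc)) return true; // sparse: no entry, nothing to check
    const key = doc[this.field];
    if (this.keys.has(key)) return false;  // unique: reject duplicate value
    this.keys.add(key);
    return true;
  }
}

const emailIndex = new SparseUniqueIndex("email");
console.log(emailIndex.insert({ name: "a", email: "a@example.com" })); // true
console.log(emailIndex.insert({ name: "b" }));                         // true
console.log(emailIndex.insert({ name: "c" }));                         // true
console.log(emailIndex.insert({ name: "d", email: "a@example.com" })); // false
```

Documents "b" and "c" both lack the field and are both accepted, because a sparse index holds no entry for them.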
Hands ON!!!!!
Example: Mongo Document
{
  name: 'Brad Steve',
  address: {
    street: 'Oak Terrace', city: 'Denton'
  }
}
Example: Mongo Collection
{
"_id": ObjectId("4efa8d2b7d284dad101e4bc9"),  // obligatory, and automatically generated by MongoDB
"Last Name": "DUMONT",
"First Name": "Jean",
"Date of Birth": "01-22-1963"
},
{
"_id": ObjectId("4efa8d2b7d284dad101e4bc7"),
"Last Name": "PELLERIN",
"First Name": "Franck",
"Date of Birth": "09-19-1983",
"Address": "1 chemin des Loges",
"City": "VERSAILLES"
}
Sample!
BLOG
A blog post has an author, some text, and many comments
The comments are unique per post, but one author has many posts
post = {
  id: 150,
  author: 100,
  text: 'This is a pretty awesome post.',
  comments: [100, 105, 112]
}
author = {
  id: 100,
  name: 'Michael Arrington',
  posts: [150]
}
comment = {
  id: 105,
  text: 'Whatever this is good comment'
}
Sample: Better Design
post = {
  author: 'Michael Arrington',
  text: 'This is a pretty awesome post.',
  comments: [ 'Whatever this post sux.', 'I agree, lame!' ]
}
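One way to see why the embedded design is attractive: with references, rendering a post takes an extra lookup per author and per comment, while the embedded document already holds everything. The sketch below models collections as plain Maps (an assumption for illustration, not a MongoDB API) and keeps only comment 105, the one the sample defines.

```javascript
// Referenced design: separate "collections", modeled here as Maps keyed by id.
const authors = new Map([
  [100, { id: 100, name: "Michael Arrington", posts: [150] }],
]);
const comments = new Map([
  [105, { id: 105, text: "Whatever this is good comment" }],
]);
const referencedPost = {
  id: 150,
  author: 100,
  text: "This is a pretty awesome post.",
  comments: [105],
};

// Rendering the post needs one extra lookup per reference.
function renderReferenced(p) {
  return {
    author: authors.get(p.author).name,
    text: p.text,
    comments: p.comments.map((id) => comments.get(id).text),
  };
}

// Embedded design: a single document read returns the same information.
const embeddedPost = {
  author: "Michael Arrington",
  text: "This is a pretty awesome post.",
  comments: ["Whatever this is good comment"],
};

console.log(renderReferenced(referencedPost));
```

The trade-off is duplication: the author's name is copied into every post, so a name change would have to touch each of them.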
Installation
CRUD Operations
• Create
  • db.collection.insert( <document> )
  • db.collection.save( <document> )
  • db.collection.update( <query>, <update>, { upsert: true } )
• Read
  • db.collection.find( <query>, <projection> )
  • db.collection.findOne( <query>, <projection> )
• Update
  • db.collection.update( <query>, <update>, <options> )
• Delete
  • db.collection.remove( <query>, <justOne> )
‘collection’ specifies the collection or the ‘table’ to store the document
Create Operations
Name        Description
$eq         Matches values that are equal to a specified value
$gt, $gte   Matches values that are greater than (or equal to) a specified value
• db.collection_name.insert( <document> )
  • Omit the _id field to have MongoDB generate a unique key
  • Example: db.parts.insert( { type: "screwdriver", quantity: 15 } )
  • db.parts.insert( { _id: 10, type: "hammer", quantity: 1 } )
• db.collection_name.save( <document> )
  • Updates an existing record or creates a new record
• db.collection_name.update( <query>, <update>, { upsert: true } )
  • Updates a record in the collection that satisfies the query (all matching records with { multi: true }); with upsert: true, creates a new record when none matches
• db.collection_name.findAndModify( { query: <query>, sort: <sort>, update: <update>, new: <boolean>, fields: <fields>, upsert: <boolean> } )
  • Modifies an existing record and returns the old or the new version of it
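The insert and upsert behaviour described above can be traced with a tiny in-memory stand-in for a collection. This is a sketch of the semantics only: `insert` and `update` here are hypothetical helper functions, not the MongoDB shell methods, the field merge roughly corresponds to a `$set` update, and the `"auto-N"` keys merely stand in for generated ObjectIds.

```javascript
// In-memory stand-in for the parts collection; not the MongoDB driver API.
const parts = [];
let nextId = 1; // stand-in for ObjectId generation

// Store a document, generating an _id when the caller omits one.
function insert(doc) {
  const _id = doc._id !== undefined ? doc._id : "auto-" + nextId++;
  const stored = { ...doc, _id };
  parts.push(stored);
  return stored;
}

// Update the first document whose fields equal those in `query`;
// with { upsert: true }, insert query + changes when nothing matches.
function update(query, changes, options = {}) {
  const matches = (d) => Object.keys(query).every((k) => d[k] === query[k]);
  const found = parts.find(matches);
  if (found) {
    Object.assign(found, changes);
    return "updated";
  }
  if (options.upsert) {
    insert({ ...query, ...changes });
    return "inserted";
  }
  return "no-op";
}

insert({ type: "screwdriver", quantity: 15 });
insert({ _id: 10, type: "hammer", quantity: 1 });
console.log(update({ type: "hammer" }, { quantity: 2 }));                   // "updated"
console.log(update({ type: "wrench" }, { quantity: 3 }, { upsert: true })); // "inserted"
```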
Delete Operations
• db.collection_name.remove( <query>, <justOne> )
  • Deletes all records from a collection, or only those matching a criterion
  • <justOne> specifies to delete only 1 record matching the criterion
  • Example: db.parts.remove( { type: /^h/ } ) removes all parts whose type starts with h
  • db.parts.remove( {} ) deletes all documents in the parts collection
SQL vs. Mongo DB entities
MySQL:
  START TRANSACTION;
  INSERT INTO contacts VALUES (NULL, 'joeblow');
  INSERT INTO contact_emails VALUES
    (NULL, "joe@blow.com", LAST_INSERT_ID()),
    (NULL, "joseph@blow.com", LAST_INSERT_ID());
  COMMIT;
MongoDB:
  db.contacts.save( {
    userName: "joeblow",
    emailAddresses: [ "joe@blow.com", "joseph@blow.com" ]
  } )
MongoDB separates physical structure from logical structure
Designed to deal with large & distributed
Aggregation
Aggregation Framework Operators
$project
$match
$limit
$skip
$sort
$unwind
$group
…….
$match
Filter documents
Uses existing query syntax
If using $geoNear it has to be first in pipeline
$where is not supported
Matching Field Values
Input documents:
{
  "_id" : 271421,
  "amenity" : "pub",
  "name" : "Sir Walter Tyrrell",
  "location" : {
    "type" : "Point",
    "coordinates" : [ -1.6192422, 50.9131996 ]
  }
}
{
  "_id" : 271466,
  "amenity" : "pub",
  "name" : "The Red Lion",
  "location" : {
    "type" : "Point",
    "coordinates" : [ -1.5494749, 50.7837119 ]
  }
}
Matching document:
{
  "_id" : 271466,
  "amenity" : "pub",
  "name" : "The Red Lion",
  "location" : {
    "type" : "Point",
    "coordinates" : [ -1.5494749, 50.7837119 ]
  }
}
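Conceptually, matching on field values is a filter over the incoming documents. The sketch below reproduces that on the two pub documents with a plain array filter; it handles simple equality only, and the criteria `{ name: "The Red Lion" }` is an assumption, since the stage used on the slide is not shown.

```javascript
const pubs = [
  { _id: 271421, amenity: "pub", name: "Sir Walter Tyrrell",
    location: { type: "Point", coordinates: [-1.6192422, 50.9131996] } },
  { _id: 271466, amenity: "pub", name: "The Red Lion",
    location: { type: "Point", coordinates: [-1.5494749, 50.7837119] } },
];

// Equality-only matcher: keep documents whose fields equal the criteria.
function match(docs, criteria) {
  return docs.filter((d) =>
    Object.entries(criteria).every(([field, value]) => d[field] === value)
  );
}

// Assumed criteria; in a pipeline this would be { $match: { name: "The Red Lion" } }.
const matched = match(pubs, { name: "The Red Lion" });
console.log(matched);
```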
$project
Reshape documents
Include, exclude or rename fields
Inject computed fields
Create sub-document fields
Including and Excluding Fields
Input document:
{
  "_id" : 271466,
  "amenity" : "pub",
  "name" : "The Red Lion",
  "location" : {
    "type" : "Point",
    "coordinates" : [ -1.5494749, 50.7837119 ]
  }
}
Pipeline stage:
{ "$project": { "_id": 0, "amenity": 1, "name": 1 } }
Result:
{
  "amenity" : "pub",
  "name" : "The Red Lion"
}
Reformatting Documents
Input document:
{
  "_id" : 271466,
  "amenity" : "pub",
  "name" : "The Red Lion",
  "location" : {
    "type" : "Point",
    "coordinates" : [ -1.5494749, 50.7837119 ]
  }
}
Pipeline stage:
{ "$project": { "_id": 0, "name": 1, "meta": { "type": "$amenity" } } }
Result:
{
  "name" : "The Red Lion",
  "meta" : {
    "type" : "pub"
  }
}
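To make the two $project examples concrete, here is a small in-memory imitation of the stage. It is a sketch covering only what the slides use (inclusion with 1, suppressing _id with 0, "$field" references, and building a sub-document); the `project` function is hypothetical, and dotted paths and expressions are not handled.

```javascript
const redLion = {
  _id: 271466,
  amenity: "pub",
  name: "The Red Lion",
  location: { type: "Point", coordinates: [-1.5494749, 50.7837119] },
};

function project(doc, spec) {
  const out = {};
  // _id is carried over by default unless the spec sets it to 0.
  if (spec._id !== 0 && "_id" in doc) out._id = doc._id;
  for (const [field, rule] of Object.entries(spec)) {
    if (field === "_id") continue;
    if (rule === 1) {
      if (field in doc) out[field] = doc[field];        // include as-is
    } else if (typeof rule === "string" && rule.startsWith("$")) {
      out[field] = doc[rule.slice(1)];                  // "$amenity" -> doc.amenity
    } else if (typeof rule === "object" && rule !== null) {
      out[field] = project(doc, { _id: 0, ...rule });   // build a sub-document
    }
  }
  return out;
}

console.log(project(redLion, { _id: 0, amenity: 1, name: 1 }));
console.log(project(redLion, { _id: 0, name: 1, meta: { type: "$amenity" } }));
```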
$group
• Group documents by an ID
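Grouping by an ID can be pictured as bucketing documents by a key and folding each bucket into one output document. The sketch below counts documents per amenity, which is what a stage such as `{ $group: { _id: "$amenity", count: { $sum: 1 } } }` would compute; the `groupCount` helper and the cafe entry are hypothetical additions for illustration.

```javascript
// Bucket documents by a key and count the members of each bucket.
function groupCount(docs, keyOf) {
  const counts = new Map();
  for (const d of docs) {
    const k = keyOf(d);
    counts.set(k, (counts.get(k) || 0) + 1);
  }
  // One output document per group, with the group key in _id.
  return [...counts].map(([_id, count]) => ({ _id, count }));
}

const places = [
  { amenity: "pub", name: "Sir Walter Tyrrell" },
  { amenity: "pub", name: "The Red Lion" },
  { amenity: "cafe", name: "Hypothetical Cafe" }, // invented sample entry
];

const grouped = groupCount(places, (d) => d.amenity);
console.log(grouped);
```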
Real-time
Simple yet powerful interface
Declared in JSON, executes in C++
Runs inside MongoDB on local data
− Adds load to your DB
− Limited Operators
− Data output is limited